
For Project 3, refer to the Google Apps dataset that we explored in previous videos. Below is a summary of how to set up the data for this project.

import numpy as np
import os
import pandas as pd
import matplotlib as mpl
import matplotlib.pyplot as plt

# read in the file and make a copy of the dataset
apps = pd.read_csv("http://www.jpstats.org/data/googleplaystore.csv")
dat = apps.copy()

# separate features from labels
y = dat["Installs"]
X = dat.drop("Installs", axis=1)
classnames, indices = np.unique(y, return_inverse=True)
y = indices

# first split into train and test sets
from sklearn.model_selection import train_test_split
X_train_full, X_test, y_train_full, y_test = train_test_split(
    X, y, test_size=0.2, random_state=34, stratify=y)

# now split the train_full set into train and validation
X_train, X_val, y_train, y_val = train_test_split(
    X_train_full, y_train_full, test_size=0.2, random_state=34,
    stratify=y_train_full)

# separate numeric from categorical features
X_num = X.select_dtypes(include=[np.number])
X_cat = X.select_dtypes(exclude=[np.number])

# build the preprocessing pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer

num_attribs = list(X_num)
cat_classes = np.unique(dat["Category"])
type_classes = np.unique(dat["Type"])
cont_classes = np.unique(dat["Content Rating"])
gen_classes = np.unique(dat["Genres"])

num_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy="median")),
    ('std_scaler', StandardScaler()),
])

full_pipeline = ColumnTransformer([
    ("num", num_pipeline, num_attribs),
    ("cat1", OneHotEncoder(categories=[cat_classes]), ["Category"]),
    ("cat2", OneHotEncoder(categories=[type_classes]), ["Type"]),
    ("cat3", OneHotEncoder(categories=[cont_classes]), ["Content Rating"]),
    ("cat4", OneHotEncoder(categories=[gen_classes]), ["Genres"]),
])

X_train_prep = full_pipeline.fit_transform(X_train)
X_val_prep = full_pipeline.transform(X_val)
X_test_prep = full_pipeline.transform(X_test)

In a Jupyter notebook, fit a random forest classifier to this data. Use grid search to fine-tune the model, but tune only the n_estimators and max_leaf_nodes parameters; you decide what values to try for each in the grid search. Set the random state to 34 everywhere. A minimal sketch is shown below.
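
For example, a minimal sketch of the grid search; the candidate parameter values here are placeholders, not required choices:

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# candidate values are illustrative only -- choose your own
param_grid = {
    "n_estimators": [50, 100, 200],
    "max_leaf_nodes": [50, 100, 200],
}

forest = RandomForestClassifier(random_state=34)
grid_search = GridSearchCV(forest, param_grid, cv=3, scoring="accuracy")
grid_search.fit(X_train_prep, y_train)
grid_search.best_params_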

Determine the best model and fit it to the training data (X_train_prep and y_train; don't include the validation set here). Then find the model's accuracy on the test set. Recall that we got 15.7% accuracy on the test data with a Decision Tree classifier. Is the random forest any better? Explain why or why not.
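
One way to do this, continuing from the grid_search sketch above:

from sklearn.metrics import accuracy_score

# refit the best model on the training set only
best_forest = grid_search.best_estimator_
best_forest.fit(X_train_prep, y_train)

# accuracy on the held-out test set
y_pred = best_forest.predict(X_test_prep)
accuracy_score(y_test, y_pred)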

You will now fit an MLP classification neural network to this data. For the number of hidden layers and the number of neurons per layer, we will use randomly generated values. For the number of hidden layers, use the following code:

from numpy.random import randint, seed

# number of hidden layers to use in the NN
seed(birthdate)
randint(2, 6)

Where it says birthdate, put your birthdate in the format mmdd. So if your birthday is October 4, put seed(1004). If your birthday is before October, do not include a leading zero; for example, if your birthday is Aug 4, put seed(804).

For the number of neurons in each layer, use the following code:

# number of neurons for each hidden layer
seed(birthdate)
randint(3, 8) * 100

Again, put your birthdate (mmdd) where it says birthdate.

When you set up the sequential model, there is no need to start with a Flatten layer since you are not dealing with images. So the first hidden layer in your network will look like this:

keras.layers.Dense([the number of nodes from the above code], activation="relu", input_shape=X_train_prep.shape[1:])

Use relu as the activation for all hidden layers. For the output layer, use softmax with 20 nodes (since there are 20 categories in the labels). A sketch of the full model follows.
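
Here is a minimal sketch of the model. The values of n_hidden and n_neurons below are hypothetical stand-ins; replace them with the numbers produced by your seeded random draws.

from tensorflow import keras

# hypothetical values -- replace with your own seeded random draws
n_hidden = 3
n_neurons = 500

model = keras.models.Sequential()
# first hidden layer declares the input shape
model.add(keras.layers.Dense(n_neurons, activation="relu",
                             input_shape=X_train_prep.shape[1:]))
# remaining hidden layers
for _ in range(n_hidden - 1):
    model.add(keras.layers.Dense(n_neurons, activation="relu"))
# output layer: one node per class
model.add(keras.layers.Dense(20, activation="softmax"))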

Run 200 epochs. After you finish training the model, plot the training and validation loss and accuracy for each epoch. Comment on this plot, stating whether you think more epochs could help or whether we could have stopped at a lower number of epochs. One way to compile, train, and plot is sketched below.
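
Continuing the sketch above: the loss is sparse_categorical_crossentropy because the labels are integer-encoded (y = indices in the setup code); the optimizer here is an assumption, so use whatever your course has been using.

model.compile(loss="sparse_categorical_crossentropy",
              optimizer="sgd",   # assumed optimizer -- substitute your own
              metrics=["accuracy"])

history = model.fit(X_train_prep, y_train, epochs=200,
                    validation_data=(X_val_prep, y_val))

# plot training/validation loss and accuracy per epoch
pd.DataFrame(history.history).plot(figsize=(8, 5))
plt.grid(True)
plt.xlabel("epoch")
plt.show()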

Now get the accuracy for the test set.
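
For instance, using the trained model from above:

# evaluate returns the test loss and test accuracy
model.evaluate(X_test_prep, y_test)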

Did the neural network outperform the random forest classifier? If it did not, what do you suggest we do to improve the neural network?

Show all your code and output in a Jupyter notebook. Also, comment (using markdown) on what you are doing in each step.

In the Jupyter notebook, put Project 3 in a heading at the top. Underneath that, put your first and last name as a subheading (use three #’s for the subheading). For the Random Forest part, put “Random Forest” as another heading in a markdown cell. For the MLP part, put “MLP” as a heading in a markdown cell.

Name your file [last name]_Project3.ipynb (for example Patrick_Project3.ipynb), and submit it to this assignment.
