For this question we will use the BreastCancer data set from the mlbench library. To load the data into R, load the library and then type: data("BreastCancer", package = "mlbench"). Once the data is loaded, we recommend removing the Id attribute, since it has no value for classification: BreastCancer$Id = NULL. Information about the data set can be found by typing ?BreastCancer into the console.
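A minimal loading sketch of the steps above (assuming the mlbench package is installed):

```r
# Load the data set from the mlbench package
library(mlbench)
data("BreastCancer", package = "mlbench")

# Drop the Id column, which has no value for classification
BreastCancer$Id <- NULL

str(BreastCancer)  # inspect the remaining attributes and the Class label
```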
(a) Set the seed of the random number generator to 100 (set.seed(100)), and then generate a training data set of 400 data points using the sample function (the remaining 299 data points will be the test set).
(d) Set the termination criteria to be a maximum depth of 3 for the following question (i.e. set maxdepth = 3, minsplit = 1 and cp = 0).
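Parts (a) and (d) can be sketched as follows. The variable names (train.idx, bc.train, bc.tree) are illustrative, and the sketch assumes the tree is fit with rpart, whose control parameters maxdepth, minsplit and cp match those named above:

```r
# (a) Reproducible 400/299 train/test split
set.seed(100)
train.idx <- sample(nrow(BreastCancer), 400)
bc.train  <- BreastCancer[train.idx, ]
bc.test   <- BreastCancer[-train.idx, ]

# (d) Decision tree with the required termination criteria
library(rpart)
bc.tree <- rpart(Class ~ ., data = bc.train,
                 control = rpart.control(maxdepth = 3, minsplit = 1, cp = 0))
```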
Download the file letters.csv from Canvas and read it into R. This file contains a data set in which each of the 20,000 data points corresponds to a digitised capital letter whose identity is known (given by the lettr attribute in column 1). There are also 16 independent attributes (columns 2–17) that have been computed from the digital image of each letter (e.g. onpix is a count of the number of black pixels in the image). If you are interested, further details can be found here: https://archive.ics.uci.edu/ml/datasets/Letter+Recognition.
(a) Set the seed of the random number generator to 50, and then generate a random training data set of 18000 data points using the sample function (the remaining 2000 data points will be the test set).
(b) Filter the data set to create a training set called letters.train and a test set called letters.test.
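Parts (a) and (b) can be sketched as follows (the file path and the index variable train.idx are illustrative; the set names letters.train and letters.test are those required by the question):

```r
# (a) Reproducible 18000/2000 split of the letters data
letters <- read.csv("letters.csv")
set.seed(50)
train.idx <- sample(nrow(letters), 18000)

# (b) Training and test sets
letters.train <- letters[train.idx, ]
letters.test  <- letters[-train.idx, ]
```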
(c) This question uses geom_jitter() with ggplot(); this plot is just like a scatter plot, but 'jitters' the data so that discrete values do not overlap. Plot two jitter plots for the letters.test data set:
Which of these pairs of attributes would be better for classifying the data? Explain why.
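A jitter plot of one attribute pair, coloured by the true letter, can be sketched as below. The pair onpix vs x2bar is purely illustrative; substitute the attribute pairs specified in the question:

```r
library(ggplot2)

# Jitter plot of two attributes from the test set, coloured by letter
# (onpix vs x2bar is an illustrative pair, not necessarily the one asked for)
ggplot(letters.test, aes(x = onpix, y = x2bar, colour = lettr)) +
  geom_jitter()
```

A pair is better for classification when points of the same letter form tighter, less overlapping clusters in the plot.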
(d) Using the letters.train data set create three random forests, with ntree set to 10, 100 and 1000, to predict the lettr attribute, given the other 16 attributes. Note the following:
(e) For the random forest with 1000 trees, apply the predict function to the test set.
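Parts (d) and (e) can be sketched as follows, assuming the randomForest package and the letters.train / letters.test sets from part (b); the object names rf10, rf100, rf1000 and pred are illustrative:

```r
library(randomForest)

# (d) lettr must be a factor for classification
letters.train$lettr <- as.factor(letters.train$lettr)

# Three forests with increasing numbers of trees
rf10   <- randomForest(lettr ~ ., data = letters.train, ntree = 10)
rf100  <- randomForest(lettr ~ ., data = letters.train, ntree = 100)
rf1000 <- randomForest(lettr ~ ., data = letters.train, ntree = 1000)

# (e) Out-of-sample predictions and accuracy for the 1000-tree forest
pred <- predict(rf1000, letters.test)
mean(pred == letters.test$lettr)
```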
We now wish to see whether the Naïve Bayes classifier can also be used to predict the correct letters for the same data set.
(f) Apply the Naïve Bayes method to the training data set to determine the class lettr using all the other attributes.
(g) Using the predict() function determine the in-sample and out-of-sample accuracy for this method. (R will report some warnings, but you can ignore them.)
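Parts (f) and (g) can be sketched as follows, assuming the naiveBayes implementation from the e1071 package (the question does not name a package) and the sets from part (b); the object names nb, nb.in and nb.out are illustrative:

```r
library(e1071)

# (f) Fit the Naive Bayes classifier on the training data
nb <- naiveBayes(lettr ~ ., data = letters.train)

# (g) In-sample and out-of-sample accuracy
# (predict() may emit warnings here; as noted, they can be ignored)
nb.in  <- predict(nb, letters.train)
nb.out <- predict(nb, letters.test)
mean(nb.in  == letters.train$lettr)
mean(nb.out == letters.test$lettr)
```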
(h) Show the confusion matrix for the out-of-sample predictions, above, and discuss this in comparison to the corresponding confusion matrix for the random forest.
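Part (h) can be sketched with table(), assuming letters.train / letters.test from part (b) and the e1071 and randomForest packages (both fits are repeated here so the sketch is self-contained):

```r
library(e1071)
library(randomForest)

# Refit both classifiers on the training data
nb <- naiveBayes(lettr ~ ., data = letters.train)
rf <- randomForest(as.factor(lettr) ~ ., data = letters.train, ntree = 1000)

# (h) Out-of-sample confusion matrices: rows = predicted, columns = actual
table(predicted = predict(nb, letters.test), actual = letters.test$lettr)
table(predicted = predict(rf, letters.test), actual = letters.test$lettr)
```

Comparing the two matrices letter by letter shows which letters each method confuses, not just the overall accuracy.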