COMP723 Data Mining and Knowledge Engineering
Question 1
In this question you will investigate two different methods of combining two types of classifiers. The dataset that you will be experimenting with is the Diabetes dataset that has been used in the lab class:
You will produce two optimized classifiers and then combine them using a method called meta classification. This will be done in 5 steps. Use Python throughout and provide labelled and referenced code snippets for each of the 5 steps. Randomly split the dataset into 70% training and 30% testing and keep the sets consistent throughout the experiments.
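The 70/30 split can be sketched as below. This is a minimal sketch, not a model answer: the synthetic data from make_classification is only a stand-in (an assumption) for the lab's Diabetes dataset, and the random_state value of 42 is an arbitrary choice. Fixing random_state is one way to keep the training and testing sets identical across all five steps.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Stand-in data (assumption): 768 samples, 8 features, 2 classes.
# Replace these two arrays with the features and labels of the lab's Diabetes dataset.
X, y = make_classification(n_samples=768, n_features=8, n_informative=5, random_state=0)

# A fixed random_state makes the random split reproducible, so every
# later step can reuse exactly the same training and testing sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=42
)

print("training samples:", X_train.shape[0])
print("testing samples:", X_test.shape[0])
```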
Step 1:
Use the decision tree classifier and tune the max_depth parameter. The default setting (None, which lets the tree grow until its leaves are pure) often produces poor results due to overfitting. Vary the max_depth parameter in the range [2, 20] in steps of 2 and record the classification accuracy on the test set for each value of this parameter.
Generate a two-column table (max_depth, test accuracy).
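The max_depth sweep in step 1 might look like the following sketch. The synthetic data is again an assumed stand-in for the Diabetes dataset, and the random_state values are arbitrary; only the loop over max_depth and the two-column output reflect the task itself.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Stand-in data (assumption); replace with the lab's Diabetes dataset.
X, y = make_classification(n_samples=768, n_features=8, n_informative=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=42)

depth_results = []  # list of (max_depth, test accuracy) pairs
for depth in range(2, 21, 2):  # 2, 4, ..., 20
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0)
    tree.fit(X_train, y_train)
    depth_results.append((depth, tree.score(X_test, y_test)))

# Print the two-column table.
print(f"{'max_depth':>9}  {'test accuracy':>13}")
for depth, acc in depth_results:
    print(f"{depth:>9}  {acc:>13.4f}")
```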
Step 2 (4 marks)
Now use the neural network (MLP) classifier and tune the learning rate parameter (learning_rate_init in scikit-learn's MLPClassifier). Vary the learning rate in the range [0.001, 0.01] in increments of 0.001 and record the classification accuracy on the test set for each value of this parameter.
Generate a two-column table (learning rate, test accuracy).
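The learning rate sweep in step 2 could be sketched as follows, assuming that "learning rate" refers to the learning_rate_init parameter of MLPClassifier. The data is a synthetic stand-in (an assumption), and all other MLP parameters are left at their defaults, so scikit-learn may emit convergence warnings for the smaller learning rates.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

# Stand-in data (assumption); replace with the lab's Diabetes dataset.
X, y = make_classification(n_samples=768, n_features=8, n_informative=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=42)

# Build 0.001, 0.002, ..., 0.010 from integers to avoid floating-point drift.
learning_rates = [round(0.001 * i, 3) for i in range(1, 11)]

lr_results = []  # list of (learning rate, test accuracy) pairs
for lr in learning_rates:
    mlp = MLPClassifier(learning_rate_init=lr, random_state=0)
    mlp.fit(X_train, y_train)
    lr_results.append((lr, mlp.score(X_test, y_test)))

# Print the two-column table.
print(f"{'learning rate':>13}  {'test accuracy':>13}")
for lr, acc in lr_results:
    print(f"{lr:>13.3f}  {acc:>13.4f}")
```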
Step 3
Use the best value of the max_depth parameter from step 1 and perform feature selection using scikit-learn's SelectKBest class with the decision tree classifier. Vary the K parameter in the range [2, 7] in increments of 1 and evaluate the accuracy on the test set for each value of K.
Generate a two-column table (K, test accuracy).
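One way to sketch step 3 is to chain SelectKBest and the decision tree in a pipeline, so the features are chosen from the training set only and the same selection is applied to the test set. The value best_depth = 4 is a hypothetical placeholder for the result of step 1, f_classif is simply SelectKBest's default scoring function (the task does not name one), and the data is a synthetic stand-in.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.tree import DecisionTreeClassifier

# Stand-in data (assumption); replace with the lab's Diabetes dataset.
X, y = make_classification(n_samples=768, n_features=8, n_informative=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=42)

best_depth = 4  # hypothetical placeholder: use the best max_depth found in step 1

k_results_tree = []  # list of (K, test accuracy) pairs
for k in range(2, 8):  # K = 2, 3, ..., 7
    model = make_pipeline(
        SelectKBest(score_func=f_classif, k=k),  # keep the k highest-scoring features
        DecisionTreeClassifier(max_depth=best_depth, random_state=0),
    )
    model.fit(X_train, y_train)
    k_results_tree.append((k, model.score(X_test, y_test)))

print(f"{'K':>2}  {'test accuracy':>13}")
for k, acc in k_results_tree:
    print(f"{k:>2}  {acc:>13.4f}")
```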
Step 4
Use the best value of the learning rate parameter from step 2 and perform feature selection using scikit-learn's SelectKBest class with the MLP classifier. Vary the K parameter in the range [2, 7] in increments of 1 and evaluate the accuracy on the test set for each value of K.
Generate a two-column table (K, test accuracy).
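Step 4 follows the same pattern with the MLP as the final estimator. In this sketch best_lr = 0.005 is a hypothetical placeholder for the result of step 2, the scoring function is SelectKBest's default f_classif (an assumption, as none is specified), and the data is a synthetic stand-in.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline

# Stand-in data (assumption); replace with the lab's Diabetes dataset.
X, y = make_classification(n_samples=768, n_features=8, n_informative=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=42)

best_lr = 0.005  # hypothetical placeholder: use the best learning rate found in step 2

k_results_mlp = []  # list of (K, test accuracy) pairs
for k in range(2, 8):  # K = 2, 3, ..., 7
    model = make_pipeline(
        SelectKBest(score_func=f_classif, k=k),
        MLPClassifier(learning_rate_init=best_lr, random_state=0),
    )
    model.fit(X_train, y_train)
    k_results_mlp.append((k, model.score(X_test, y_test)))

print(f"{'K':>2}  {'test accuracy':>13}")
for k, acc in k_results_mlp:
    print(f"{k:>2}  {acc:>13.4f}")
```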
Step 5
In this step we will combine the two classifiers into a single classifier. To do this, first look up the sklearn documentation on the predict_proba() method, which returns a vector of class probabilities for a given test sample. For the diabetes dataset, this will return a vector of size 2 as there are two classes.
a) For each test sample apply predict_proba() for the decision tree model produced from step 3 and select the class that gives the highest probability value.
b) Repeat this process for the MLP model produced from step 4.
The two models are combined by classifying each test instance into the class that produced the highest probability from steps 5 a) and 5 b).
Generate the average classification accuracy on the test set that is produced by combining the two classifiers.
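The combination rule described above can be sketched as follows: for each test sample, take each model's most probable class and its probability, then keep the prediction of whichever model is more confident. The tuned values (max_depth=4, k=5, learning_rate_init=0.005, k=6) are hypothetical placeholders for the results of steps 1 to 4, the data is a synthetic stand-in, and awarding ties to the decision tree is an assumption, since the task does not say how ties should be broken.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.tree import DecisionTreeClassifier

# Stand-in data (assumption); replace with the lab's Diabetes dataset.
X, y = make_classification(n_samples=768, n_features=8, n_informative=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=42)

# Hypothetical tuned settings; substitute the best values from steps 1 to 4.
tree = make_pipeline(SelectKBest(f_classif, k=5),
                     DecisionTreeClassifier(max_depth=4, random_state=0))
mlp = make_pipeline(SelectKBest(f_classif, k=6),
                    MLPClassifier(learning_rate_init=0.005, random_state=0))
tree.fit(X_train, y_train)
mlp.fit(X_train, y_train)

# Each row holds one probability per class (two columns for two classes).
p_tree = tree.predict_proba(X_test)
p_mlp = mlp.predict_proba(X_test)

# 5 a) and 5 b): most probable class and its probability, per model.
tree_pred = tree.classes_[p_tree.argmax(axis=1)]
tree_conf = p_tree.max(axis=1)
mlp_pred = mlp.classes_[p_mlp.argmax(axis=1)]
mlp_conf = p_mlp.max(axis=1)

# Keep the prediction of the more confident model (ties go to the tree).
combined = np.where(tree_conf >= mlp_conf, tree_pred, mlp_pred)
accuracy = float(np.mean(combined == y_test))
print(f"combined test accuracy: {accuracy:.4f}")
```

Wrapping each model in a pipeline lets both accept the same unreduced X_test even though they may have been trained on different feature subsets.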
Question 2
In this question you will explore different architectures for building a neural network. Once again you will use the Diabetes dataset.
1) Use sklearn.neural_network.MLPClassifier with a single hidden layer of k=20 neurons. Use default values for all parameters other than the number of iterations, which should be set to 150. Also, as is standard for an MLP classifier, we will assume a fully connected topology, that is, every neuron in a layer is connected to every neuron in the next layer.
Record the classification accuracy as it will be used as a baseline for comparison in later parts of this question.
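A baseline of this kind might be sketched as below. The synthetic data and the 70/30 split are assumptions carried over for illustration (part 1 does not restate how accuracy should be measured), and random_state is fixed only for reproducibility. With max_iter=150 scikit-learn may warn that training has not converged.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

# Stand-in data (assumption); replace with the lab's Diabetes dataset.
X, y = make_classification(n_samples=768, n_features=8, n_informative=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=42)

# One hidden layer with k = 20 neurons, 150 iterations, other parameters default.
baseline = MLPClassifier(hidden_layer_sizes=(20,), max_iter=150, random_state=0)
baseline.fit(X_train, y_train)

baseline_accuracy = baseline.score(X_test, y_test)
print(f"baseline accuracy with (20,): {baseline_accuracy:.4f}")
```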
2) We will now experiment with two hidden layers and experimentally determine the split of neurons across the two layers that gives the highest classification accuracy. In part 1 of the question we had all k neurons in a single layer. In this part we will transfer neurons from the first hidden layer to the second iteratively, in steps of 1. Thus, for example, in the first iteration the first hidden layer will have k-1 neurons whilst the second layer will have 1; in the second iteration k-2 neurons will be in the first layer with 2 in the second, and so on. Summarise your classification accuracy results in a 20 by 2 table with the first column specifying the combination of neurons used (e.g. 17, 3) and the second column specifying the classification accuracy.
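The transfer of neurons between layers can be sketched as a loop over hidden_layer_sizes tuples. This sketch enumerates the 19 two-layer splits from (19, 1) to (1, 19); whether the single-layer baseline from part 1 supplies a further row of the table is an interpretation that the text does not confirm. The data is a synthetic stand-in and the fixed random_state is an arbitrary choice.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

# Stand-in data (assumption); replace with the lab's Diabetes dataset.
X, y = make_classification(n_samples=768, n_features=8, n_informative=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=42)

k = 20  # total number of hidden neurons shared between the two layers

split_results = []  # list of ((first layer, second layer), accuracy) pairs
for second in range(1, k):  # move 1, 2, ..., k-1 neurons to the second layer
    layers = (k - second, second)
    mlp = MLPClassifier(hidden_layer_sizes=layers, max_iter=150, random_state=0)
    mlp.fit(X_train, y_train)
    split_results.append((layers, mlp.score(X_test, y_test)))

print(f"{'neurons':>8}  {'accuracy':>8}")
for (first, second), acc in split_results:
    print(f"{first:>4},{second:>3}  {acc:>8.4f}")
```

Holding random_state fixed across the loop means that differences between rows reflect the architecture rather than a different random draw, although the initial weights still depend on the layer shapes.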
3) From the table created in part 2 of this question you will observe a variation in accuracy with the split of neurons across the two layers. Give explanations for some possible reasons for this variation.