SVM Tuning & False-Positive Detection

In this task, you need to use Principal Component Analysis (PCA) to understand the characteristics of the datasets.

Use Pandas to load both the training data set and the test set (1 mark).

Show one scatter plot of one feature against another feature on the training set. You may choose which two features to plot (2 marks).

Normalise the training set and the test set using StandardScaler() (Hint: the parameters should come from the training set only) (2 marks).
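The hint above can be illustrated with a minimal sketch. It assumes scikit-learn and NumPy are available and uses synthetic arrays (X_train and X_test are hypothetical stand-ins, since the real data files are not shown here). The scaler learns its mean and standard deviation from the training set only and then reuses them on the test set.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Synthetic stand-ins for the real feature matrices (assumption).
rng = np.random.default_rng(0)
X_train = rng.normal(loc=5.0, scale=2.0, size=(100, 4))
X_test = rng.normal(loc=5.0, scale=2.0, size=(40, 4))

scaler = StandardScaler()
# fit_transform learns the mean and std from the training set only.
X_train_scaled = scaler.fit_transform(X_train)
# transform reuses the training mean and std; nothing is learned from the test set.
X_test_scaled = scaler.transform(X_test)

print("train means:", X_train_scaled.mean(axis=0).round(3))
print("test means: ", X_test_scaled.mean(axis=0).round(3))
```

The scaled test set will generally not have exactly zero mean and unit variance, which is expected: its statistics were deliberately not used.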

Perform a PCA analysis on the scaled training set and plot the scree plot to report variances captured by each principal component (2 marks).
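One possible way to produce the scree plot is sketched below, assuming scikit-learn and matplotlib. The data are synthetic (two hypothetical latent factors mixed into five features) so that the first components visibly dominate; the real scaled training set would take the place of X_train_scaled.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic correlated data standing in for the real training set (assumption).
rng = np.random.default_rng(0)
latent = rng.normal(size=(120, 2))
mixing = rng.normal(size=(2, 5))
X_train = latent @ mixing + 0.1 * rng.normal(size=(120, 5))
X_train_scaled = StandardScaler().fit_transform(X_train)

# Keep all components so the ratios sum to 1.
pca = PCA()
pca.fit(X_train_scaled)
ratios = pca.explained_variance_ratio_
components = np.arange(1, len(ratios) + 1)

fig, ax = plt.subplots()
ax.plot(components, ratios, marker="o", label="per component")
ax.plot(components, np.cumsum(ratios), marker="s", label="cumulative")
ax.set_xticks(components)
ax.set_xlabel("Principal component")
ax.set_ylabel("Explained variance ratio")
ax.set_title("Scree plot (training set)")
ax.legend()
```

Calling plt.show() or fig.savefig(...) afterwards displays or stores the figure; the cumulative curve is an optional extra that helps when reporting how many components are needed.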

Obtain projections of the test set by projecting the scaled test data on the same PCA space produced by the training set (1 mark).
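A minimal sketch of this projection step follows, assuming scikit-learn and synthetic stand-in arrays. The point it illustrates is that both the scaler and the PCA are fitted on the training set, and the test set is only passed through transform.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-ins for the real data (assumption).
rng = np.random.default_rng(1)
X_train = rng.normal(size=(100, 6))
X_test = rng.normal(size=(30, 6))

scaler = StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

# The PCA axes are learned from the scaled training set only.
pca = PCA(n_components=2).fit(X_train_scaled)
train_pcs = pca.transform(X_train_scaled)
# The test set is projected onto the same axes; the PCA is not refitted.
test_pcs = pca.transform(X_test_scaled)

print(train_pcs.shape, test_pcs.shape)  # (100, 2) (30, 2)
```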

Plot two subplots in one figure: one showing the training set projected onto the PC1 and PC2 space, with the training points coloured according to class; the other showing the test set in the same PC space, with the test points also coloured according to class (2 marks).
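The two-panel figure could be drawn along the following lines. This is a sketch under stated assumptions: the data come from make_classification rather than the real files, the class labels are assumed to be 0 and 1, and the colour choices are arbitrary.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic two-class data standing in for the real train/test files (assumption).
X, y = make_classification(n_samples=200, n_features=6, n_informative=3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

scaler = StandardScaler().fit(X_train)
pca = PCA(n_components=2).fit(scaler.transform(X_train))
train_pcs = pca.transform(scaler.transform(X_train))
test_pcs = pca.transform(scaler.transform(X_test))

# Fixed colour per class so both panels use the same coding.
colours = {0: "tab:blue", 1: "tab:red"}
panels = [("Training set", train_pcs, y_train), ("Test set", test_pcs, y_test)]

fig, axes = plt.subplots(1, 2, figsize=(10, 4), sharex=True, sharey=True)
for ax, (title, pcs, labels) in zip(axes, panels):
    for cls, colour in colours.items():
        mask = labels == cls
        ax.scatter(pcs[mask, 0], pcs[mask, 1], s=15, c=colour, label=f"class {cls}")
    ax.set_xlabel("PC1")
    ax.set_ylabel("PC2")
    ax.set_title(title)
    ax.legend()
fig.tight_layout()
```

Sharing the axes between the panels makes it easier to compare how the two sets are distributed in the same PC space.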

Write a short critical analysis (no more than 100 words) on your observations. (2 marks)

Divide the training dataset into a smaller training set (II) and a validation set using train_test_split and report the number of points in each set. Usually, 20%-30% of the data points in the whole training set are used as validation data. You may choose the exact ratio. (1 mark)
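As an illustration of the split, the sketch below holds out 25% of a synthetic training set. The 25% ratio, the random_state value and the use of stratify are all choices made for this example, not requirements of the task.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the whole training set (assumption).
rng = np.random.default_rng(0)
X_train_full = rng.normal(size=(200, 5))
y_train_full = rng.integers(0, 2, size=200)

# stratify keeps the class proportions similar in both parts.
X_train2, X_val, y_train2, y_val = train_test_split(
    X_train_full,
    y_train_full,
    test_size=0.25,
    stratify=y_train_full,
    random_state=42,
)

print(f"training set (II): {len(X_train2)} points")  # 150 points
print(f"validation set:    {len(X_val)} points")     # 50 points
```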

Normalise both the training set (II) and the validation set (Hint: the parameters should come from the training set (II) only) (1 mark).

When using the C-SVC SVM with the Gaussian radial basis kernel there are two tunable parameters, C (cost) and γ (gamma). To achieve the highest classification rate possible, it is very important to search for an optimal pair of these values.

You have been given the following combinations: [C=50, γ=0.01], [C=50, γ=10], [C=5, γ=1], [C=100, γ=0.01], [C=100, γ=10] and [C=100, γ=1].

You should train an SVM model for each of the six given combinations and then test it on the normalised validation set. Report the accuracy rate on the validation set for each combination. Finally, select the best combination of parameters and report your result.
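The search over the six given combinations can be sketched as a simple loop. The example assumes scikit-learn's SVC (which implements C-SVC) with kernel="rbf", and uses synthetic data in place of the real training set, so the accuracies it prints say nothing about the actual assignment data.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the whole training set (assumption).
X, y = make_classification(n_samples=300, n_features=8, n_informative=4, random_state=0)
X_train2, X_val, y_train2, y_val = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Scaling parameters come from training set (II) only.
scaler = StandardScaler().fit(X_train2)
X_train2_s = scaler.transform(X_train2)
X_val_s = scaler.transform(X_val)

# The six (C, gamma) pairs listed in the task.
combos = [(50, 0.01), (50, 10), (5, 1), (100, 0.01), (100, 10), (100, 1)]

results = {}
for C, gamma in combos:
    model = SVC(kernel="rbf", C=C, gamma=gamma)
    model.fit(X_train2_s, y_train2)
    results[(C, gamma)] = model.score(X_val_s, y_val)  # validation accuracy
    print(f"C={C:<4} gamma={gamma:<5} validation accuracy={results[(C, gamma)]:.3f}")

# Ties are resolved in favour of the first pair in the list.
best_C, best_gamma = max(results, key=results.get)
print(f"best: C={best_C}, gamma={best_gamma}")
```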

Basic task (2 marks)

You should now be in a position to further test your model with the selected parameters by classifying the test data. With the normalised whole training set as the input file, you will need to train an SVM model with the suitable parameter values discovered for C and γ during Task 3. When the classification model is built you will then need to use it to classify the normalised testing set, and report the accuracy rate.
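A minimal sketch of this final step is shown below, again on synthetic data. The values assigned to best_C and best_gamma are placeholders chosen for illustration only; in practice they would be whichever pair scored best on the validation set.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the real training and test files (assumption).
X, y = make_classification(n_samples=400, n_features=8, n_informative=4, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=1
)

# Refit the scaler on the whole training set, then apply it to the test set.
scaler = StandardScaler().fit(X_train)
X_train_s = scaler.transform(X_train)
X_test_s = scaler.transform(X_test)

# Placeholder values (assumption): substitute the pair selected on the validation set.
best_C, best_gamma = 100, 0.01

final_model = SVC(kernel="rbf", C=best_C, gamma=best_gamma)
final_model.fit(X_train_s, y_train)

y_pred = final_model.predict(X_test_s)
test_accuracy = accuracy_score(y_test, y_pred)
print(f"test accuracy: {test_accuracy:.3f}")
```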

Advanced task (6 marks)

Write a Python function that can locate false positives, that is, those patterns originally labeled as non-defective which are incorrectly predicted as defective, and report the results on the test set (3 marks).
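One way such a function might look is sketched here. The function name find_false_positives and the encoding of "defective" as label 1 are assumptions made for this example; the real dataset may encode its classes differently, in which case positive_label should be changed.

```python
import numpy as np


def find_false_positives(y_true, y_pred, positive_label=1):
    """Return the indices of false positives.

    A false positive is a sample whose true label is NOT positive_label
    (non-defective) but whose predicted label IS positive_label (defective).
    """
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    if y_true.shape != y_pred.shape:
        raise ValueError("y_true and y_pred must have the same shape")
    mask = (y_true != positive_label) & (y_pred == positive_label)
    return np.flatnonzero(mask)


# Tiny hand-made example: 1 = defective, 0 = non-defective (assumed encoding).
y_true = np.array([0, 0, 1, 1, 0, 1, 0])
y_pred = np.array([0, 1, 1, 0, 1, 1, 0])

fp_idx = find_false_positives(y_true, y_pred)
print("false-positive indices:", fp_idx)  # [1 4]
print("number of false positives:", fp_idx.size)  # 2
```

The returned indices can be used to pull the corresponding rows out of the test set, for example to inspect their feature values or their position in the PC1 and PC2 projection.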

Summarize your findings and write your conclusions, thinking critically. For example, can you find any reason or reasons why those instances are misclassified? (3 marks)
