Wine analysis via logistic regression.

Red Wine dataset analysis and classification using logistic regression

Task 1

In this part, you will continue to work on the dataset you used and modified in Part one, that is, the red-wine dataset with two classes: Class One includes the red wines with a quality value between 6 and 10 (inclusive), and Class Two includes the red wines with a quality value between 1 and 5 (inclusive).

Divide the data set into a training set (I) and a test set. Usually, 20%-30% of the total data points are used as the test data. The exact ratio is your choice, but you need to state it clearly in your report. You should further divide the training set (I) into a smaller training set (II) and a validation set using the same ratio.
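The two-stage split described above can be sketched as follows. This is only an illustration under stated assumptions: the data are synthetic stand-ins for the red-wine features, the 20% ratio is one permitted choice, and scikit-learn's train_test_split with stratification is one convenient way to do it, not a required one.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the red-wine data (assumption): 11 features,
# integer quality scores drawn from 3..8.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 11))
quality = rng.integers(3, 9, size=1000)

# Class 1: quality 6-10; Class 2: quality 1-5, as defined in the task.
y = np.where(quality >= 6, 1, 2)

TEST_RATIO = 0.2  # assumed value; the brief allows 20%-30%

# First split: training set (I) versus test set.
X_train1, X_test, y_train1, y_test = train_test_split(
    X, y, test_size=TEST_RATIO, stratify=y, random_state=0)

# Second split, same ratio: training set (II) versus validation set.
X_train2, X_val, y_train2, y_val = train_test_split(
    X_train1, y_train1, test_size=TEST_RATIO, stratify=y_train1, random_state=0)

print(len(X_train2), len(X_val), len(X_test))
```

Stratifying on the class label keeps the class proportions similar across the three subsets; fixing random_state makes the split reproducible for the report.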

PCA analysis on the two-class red-wine dataset

a) Perform a PCA analysis on the training data set (I)

b) Plot the training data in the PC1 and PC2 projection and label the data in the picture according to its class.

c) Report variances captured by each principal component
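Steps a) to c) might look like the sketch below. It assumes synthetic data in place of training set (I), standardises the features before PCA (a common choice, not something the brief prescribes), and uses scikit-learn and matplotlib.

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for training set (I) (assumption): 11 features, labels 1 or 2.
rng = np.random.default_rng(0)
X_train1 = rng.normal(size=(800, 11))
y_train1 = rng.integers(1, 3, size=800)

# a) Fit PCA on the standardised training data only.
scaler = StandardScaler().fit(X_train1)
pca = PCA().fit(scaler.transform(X_train1))
Z = pca.transform(scaler.transform(X_train1))

# b) Scatter plot in the PC1-PC2 plane, one marker per class.
fig, ax = plt.subplots()
for cls, marker in [(1, "o"), (2, "x")]:
    mask = y_train1 == cls
    ax.scatter(Z[mask, 0], Z[mask, 1], marker=marker, alpha=0.6, label=f"Class {cls}")
ax.set_xlabel("PC1")
ax.set_ylabel("PC2")
ax.legend()

# c) Variance captured by each principal component, absolute and as a ratio.
for i, (var, ratio) in enumerate(
        zip(pca.explained_variance_, pca.explained_variance_ratio_), start=1):
    print(f"PC{i}: variance={var:.3f}, ratio={ratio:.3f}")
```

Call fig.savefig(...) or switch to an interactive backend to view the figure.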

Do a classification using the logistic regression model

a) In your report, describe the model you have used, including (6 marks):

What is the cost function? You need to give a mathematical expression describing it.

Which optimization algorithm has been used in your code?

Did you use a regularisation term? If you used one, what is it?
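As a reference point for the three questions above, one standard form of the regularised logistic regression cost is written out below. It is a sketch, not necessarily the model in your code: it assumes the targets are recoded as t_i = 1 for Class One and t_i = 0 for Class Two, an L2 penalty with strength lambda, and N training points. If your code uses scikit-learn's LogisticRegression with default settings, the penalty is L2 (controlled by the inverse strength C) and the solver is L-BFGS, but you should check the settings you actually used.

```latex
J(\mathbf{w}, b) = -\frac{1}{N} \sum_{i=1}^{N}
  \Big[ t_i \log \sigma(z_i) + (1 - t_i) \log\big(1 - \sigma(z_i)\big) \Big]
  + \frac{\lambda}{2} \lVert \mathbf{w} \rVert_2^2,
\qquad
z_i = \mathbf{w}^{\top} \mathbf{x}_i + b,
\qquad
\sigma(z) = \frac{1}{1 + e^{-z}}
```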

b) Define your own function ([num1, num2]=misPatterns(predictions, labels)) using Python: the inputs of this function are predictions and labels; and the outputs of this function are the number (num1) of misclassified patterns whose label is 1 but prediction is 2, and the number (num2) of misclassified patterns whose label is 2 but prediction is 1.
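A minimal sketch of such a function is given below. Python has no `[num1, num2] = ...` output syntax, so the sketch assumes the two counts are returned as a tuple; it also assumes the class labels are encoded as the integers 1 and 2.

```python
def misPatterns(predictions, labels):
    """Count the two kinds of misclassification for classes 1 and 2.

    Returns (num1, num2), where
      num1 = patterns whose label is 1 but prediction is 2,
      num2 = patterns whose label is 2 but prediction is 1.
    """
    if len(predictions) != len(labels):
        raise ValueError("predictions and labels must have the same length")
    num1 = 0
    num2 = 0
    for pred, label in zip(predictions, labels):
        if label == 1 and pred == 2:
            num1 += 1
        elif label == 2 and pred == 1:
            num2 += 1
    return num1, num2


# Small hand-made example (hypothetical values).
predictions = [1, 2, 2, 1, 1, 2]
labels = [1, 1, 2, 2, 1, 1]
num1, num2 = misPatterns(predictions, labels)
print(num1, num2)  # 2 1
```

The function works on lists and on NumPy arrays alike, since it only iterates over paired elements.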

c) Train the model on the training set and report the performance on the test set including accuracy rate and results obtained using the misPatterns function you have defined in b)
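Training and test-set reporting could be sketched as below. The data are synthetic (an assumption), the features are standardised before fitting, and the two error counts are computed inline with NumPy in the same way a misPatterns function would count them.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption): the label depends on the first feature
# plus noise, so the classifier has a signal to learn.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(800, 11))
X_test = rng.normal(size=(200, 11))
y_train = np.where(X_train[:, 0] + 0.5 * rng.normal(size=800) > 0, 1, 2)
y_test = np.where(X_test[:, 0] + 0.5 * rng.normal(size=200) > 0, 1, 2)

# Standardise, then fit logistic regression (scikit-learn defaults otherwise).
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)
pred = model.predict(X_test)

accuracy = float(np.mean(pred == y_test))
num1 = int(np.sum((y_test == 1) & (pred == 2)))  # label 1, predicted 2
num2 = int(np.sum((y_test == 2) & (pred == 1)))  # label 2, predicted 1
print(f"accuracy={accuracy:.3f}, num1={num1}, num2={num2}")
```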

Investigate how the size of the training dataset affects the model performance on the test set

a) Produce a learning curve of the size of training set (II) against the accuracy rate. The accuracy rate should be measured on both the training set and the validation set

b) Report what is the best training data size you would like to use for this work and explain why you chose it

c) Report the performance on the test set obtained using the model trained from the best size.
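One way to collect the numbers for this learning curve is sketched below. It assumes synthetic data in place of training set (II) and the validation set, ten evenly spaced training sizes, and "highest validation accuracy" as the selection rule; each of these is an illustrative choice.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-ins for training set (II) and the validation set (assumption).
rng = np.random.default_rng(0)
X_train2 = rng.normal(size=(640, 11))
X_val = rng.normal(size=(160, 11))
y_train2 = np.where(X_train2[:, 0] + 0.5 * rng.normal(size=640) > 0, 1, 2)
y_val = np.where(X_val[:, 0] + 0.5 * rng.normal(size=160) > 0, 1, 2)

# Ten training sizes: 10%, 20%, ..., 100% of training set (II).
sizes = [len(X_train2) * k // 10 for k in range(1, 11)]
train_acc, val_acc = [], []

for n in sizes:
    model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
    model.fit(X_train2[:n], y_train2[:n])
    train_acc.append(model.score(X_train2[:n], y_train2[:n]))  # accuracy on the subset used
    val_acc.append(model.score(X_val, y_val))                  # accuracy on held-out data

# Selection rule (assumption): the size with the highest validation accuracy.
best_size = sizes[int(np.argmax(val_acc))]
for n, tr, va in zip(sizes, train_acc, val_acc):
    print(f"n={n:4d}  train={tr:.3f}  val={va:.3f}")
print("best size:", best_size)
```

Plotting train_acc and val_acc against sizes gives the curve for a). For c), the model would be refitted on the chosen number of samples and scored once on the test set.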

Investigate how the number of features extracted from PCA affects the model performance on the test set

a) Perform a PCA analysis on the training data set (II) and obtain the projected training set.

b) Produce a learning curve of the number of principal components against the accuracy rate. The accuracy rate should be measured on both the training set (II) and the validation set

c) Report what is the best number of principal components you would like to use for this dataset and explain why you chose it

d) Report the performance on the test set obtained using the model trained from the best number of principal components
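The sweep over the number of principal components can be sketched in the same way. The data are again synthetic (an assumption), PCA is fitted on training set (II) only and then applied to the validation set through a pipeline, and the best number of components is taken to be the one with the highest validation accuracy.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-ins for training set (II) and the validation set (assumption).
rng = np.random.default_rng(0)
X_train2 = rng.normal(size=(640, 11))
X_val = rng.normal(size=(160, 11))
signal_train = X_train2[:, 0] + X_train2[:, 1]
signal_val = X_val[:, 0] + X_val[:, 1]
y_train2 = np.where(signal_train + 0.5 * rng.normal(size=640) > 0, 1, 2)
y_val = np.where(signal_val + 0.5 * rng.normal(size=160) > 0, 1, 2)

n_features = X_train2.shape[1]
components = list(range(1, n_features + 1))
train_acc, val_acc = [], []

for k in components:
    # The pipeline fits the scaler and PCA on the training data only,
    # then reuses the same projection for the validation data.
    model = make_pipeline(
        StandardScaler(), PCA(n_components=k), LogisticRegression(max_iter=1000))
    model.fit(X_train2, y_train2)
    train_acc.append(model.score(X_train2, y_train2))
    val_acc.append(model.score(X_val, y_val))

# np.argmax returns the first maximum, so ties go to the smaller k.
best_k = components[int(np.argmax(val_acc))]
for k, tr, va in zip(components, train_acc, val_acc):
    print(f"k={k:2d}  train={tr:.3f}  val={va:.3f}")
print("best number of components:", best_k)
```

For d), the pipeline with best_k components would be refitted and evaluated once on the test set.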
