Red Wine Data: Pre-processing - PCA

Red Wine Data Analysis: Pre-processing, Probabilities, PCA and Classification

Part One

Task 1: Data pre-processing and data exploration

a. Use Pandas to load data

b. Merge all the data with “quality” labels between 6-10 into Class 1 and similarly form Class 2 from the data with “quality” labels between 1-5

c. Report the number of features and number of rows in each class

d. Choose an attribute and generate a boxplot for the two pre-defined classes.

e. Show one scatter plot, that is, one feature against another feature. It is your choice to show which two features you want to use.
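The steps in Task 1 can be sketched as follows. This is only an illustration under stated assumptions: the file name `winequality-red.csv`, the `;` separator and the column names `quality`, `pH` and `alcohol` are assumed from the commonly distributed version of the red wine dataset and may differ in your copy. The helper names `add_class_column`, `summarise` and `plot_exploration` are hypothetical.

```python
import numpy as np
import pandas as pd


def add_class_column(df):
    """Return a copy with a 'class' column: 1 for quality 6-10, 2 for quality 1-5."""
    out = df.copy()
    out["class"] = np.where(out["quality"] >= 6, 1, 2)
    return out


def summarise(df):
    """Return (number of features, rows per class), excluding 'quality' and 'class'."""
    n_features = df.shape[1] - 2
    rows_per_class = df["class"].value_counts().sort_index().to_dict()
    return n_features, rows_per_class


def plot_exploration(df, box_attr="alcohol", x="alcohol", y="pH"):
    """Draw a boxplot per class and one feature-vs-feature scatter plot.

    Uses pandas' plotting wrappers, which need matplotlib to be installed.
    """
    df.boxplot(column=box_attr, by="class")
    df.plot.scatter(x=x, y=y, c="class", colormap="coolwarm")


# With the real data one might load the file like this (assumed name/separator):
# wine = pd.read_csv("winequality-red.csv", sep=";")

# Tiny hand-made frame standing in for the real file.
demo = pd.DataFrame(
    {
        "pH": [3.2, 3.7, 3.5, 3.3, 3.8],
        "alcohol": [9.4, 10.5, 11.2, 9.0, 12.1],
        "quality": [5, 6, 7, 4, 8],
    }
)
labelled = add_class_column(demo)
n_features, rows_per_class = summarise(labelled)
print(n_features, rows_per_class)
```

Calling `plot_exploration(labelled)` would then produce the two figures; which attributes to show remains your choice.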

Task 2: Computing probabilities using Python code for the given red wine dataset

f. Prior probability:

i. What is the probability of a wine classified as Class 1 (P(Class 1))?

ii. What is the probability of a wine classified as Class 2 (P(Class 2))?

g. Conditional probability:

i. What is the probability of a wine having a pH value greater than 3.6 given it is classified as Class 1 (P(pH>3.6|Class 1))?

h. Posterior probability

i. What is the probability of a wine classified as Class 1 when it has a pH value greater than 3.6?
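The three quantities in Task 2 can be estimated as relative frequencies. The sketch below assumes a DataFrame that already has a `class` column (1 or 2) and a `pH` column; the function names and the toy data are hypothetical and only serve to show the counting.

```python
import pandas as pd


def prior(df, cls):
    """P(Class = cls): fraction of all rows that belong to the class."""
    return (df["class"] == cls).mean()


def conditional(df, cls, threshold=3.6):
    """P(pH > threshold | Class = cls): fraction of the class with high pH."""
    in_class = df[df["class"] == cls]
    return (in_class["pH"] > threshold).mean()


def posterior(df, cls, threshold=3.6):
    """P(Class = cls | pH > threshold): fraction of high-pH rows in the class."""
    high_ph = df[df["pH"] > threshold]
    return (high_ph["class"] == cls).mean()


# Toy data, not taken from the real dataset.
toy = pd.DataFrame(
    {
        "class": [1, 1, 1, 2, 2],
        "pH": [3.7, 3.5, 3.8, 3.7, 3.2],
    }
)

p_c1 = prior(toy, 1)
p_c2 = prior(toy, 2)
p_ph_given_c1 = conditional(toy, 1)
p_c1_given_ph = posterior(toy, 1)

# Bayes' rule should give the same posterior as direct counting.
p_ph = (toy["pH"] > 3.6).mean()
p_c1_given_ph_bayes = p_ph_given_c1 * p_c1 / p_ph
print(p_c1, p_c2, p_ph_given_c1, p_c1_given_ph, p_c1_given_ph_bayes)
```

Comparing the directly counted posterior with the value from Bayes' rule is a cheap consistency check on the prior and the conditional probability.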

Task 3: Writing a report to summarize what you have done. Clearly explain the figures you include in your report, and report your findings and conclusions. The report must not exceed two pages and should contain fewer than 400 words. (This report counts for 0 marks: it is a chance for you to practice writing a report and to obtain feedback from a tutor.) Please use a single-column format. The font size should be set to 11 or 12 points.

Part Two

In this part, you will continue to work on the dataset you used and modified in Part One, that is, the red wine dataset with two classes, where Class 1 includes red wines with a quality value between 6 and 10 (inclusive) and Class 2 includes red wines with a quality value between 1 and 5 (inclusive).

Task 1: Divide the data set into a training set (I) and a test set. Usually, 20%-30% of the total data points are used as the test data. The exact ratio is your choice, but you need to state it clearly in your report. You should further divide the training set (I) into a smaller training set (II) and a validation set using the same ratio.
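One possible way to do the two-stage split is sketched below with scikit-learn's `train_test_split`. The 20% ratio, the stratification by class and the fixed random seed are assumptions made for the example, not requirements of the task; `split_three_ways` is a hypothetical helper name.

```python
import numpy as np
from sklearn.model_selection import train_test_split


def split_three_ways(X, y, ratio=0.2, seed=0):
    """Split into training set (II), validation set and test set.

    The same ratio is used for both splits; stratification keeps the
    class proportions similar in every subset (an assumption of this sketch).
    """
    X_train1, X_test, y_train1, y_test = train_test_split(
        X, y, test_size=ratio, stratify=y, random_state=seed
    )
    X_train2, X_val, y_train2, y_val = train_test_split(
        X_train1, y_train1, test_size=ratio, stratify=y_train1, random_state=seed
    )
    return {
        "train1": (X_train1, y_train1),
        "train2": (X_train2, y_train2),
        "val": (X_val, y_val),
        "test": (X_test, y_test),
    }


# Synthetic stand-in: 100 rows, 2 features, two balanced classes.
X_demo = np.arange(200, dtype=float).reshape(100, 2)
y_demo = np.array([1] * 50 + [2] * 50)

parts = split_three_ways(X_demo, y_demo)
for name, (X_part, _) in parts.items():
    print(name, X_part.shape[0])
```

Fixing `random_state` makes the split reproducible, which helps when the ratio has to be reported.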

Task 2: PCA Analysis on the red-wine two classes dataset

a) Perform a PCA analysis on the training data set (I)

b) Plot the training data in the PC1 and PC2 projection and label the data in the picture according to its class.

c) Report variances captured by each principal component
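A minimal sketch of the PCA steps above, using scikit-learn. Standardising the features before PCA is an assumption of this example (the task does not say whether to scale), and the synthetic array stands in for training set (I). The helper names `fit_pca` and `plot_pc1_pc2` are hypothetical.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler


def fit_pca(X_train):
    """Standardise the training data, then fit a PCA with all components."""
    scaler = StandardScaler().fit(X_train)
    pca = PCA().fit(scaler.transform(X_train))
    return scaler, pca


def plot_pc1_pc2(scores, y):
    """Scatter the PC1/PC2 projection with one colour per class (needs matplotlib)."""
    import matplotlib.pyplot as plt

    for cls in np.unique(y):
        mask = y == cls
        plt.scatter(scores[mask, 0], scores[mask, 1], label=f"Class {cls}", s=10)
    plt.xlabel("PC1")
    plt.ylabel("PC2")
    plt.legend()
    plt.show()


# Synthetic stand-in for training set (I): 80 rows, 4 features, two correlated.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(80, 4))
X_demo[:, 1] += 2.0 * X_demo[:, 0]

scaler, pca = fit_pca(X_demo)
scores = pca.transform(scaler.transform(X_demo))

# Fraction of total variance captured by each principal component.
ratios = pca.explained_variance_ratio_
print(ratios)
```

`pca.explained_variance_` holds the variances themselves, while `explained_variance_ratio_` gives them as fractions of the total.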

Task 3: Do a classification using the logistic regression model

a) In your report, describe the model you have used, including:

What is the cost function? You need to give a mathematical expression describing it.

Which optimization algorithm has been used in your code?

Did you use a regularisation term? If you used one, what is it?

b) Define your own function ([num1, num2]=misPatterns(predictions, labels)) using Python: the inputs of this function are predictions and labels; and the outputs of this function are the number (num1) of misclassified patterns whose label is 1 but prediction is 2, and the number (num2) of misclassified patterns whose label is 2 but prediction is 1.

c) Train the model on the training set and report the performance on the test set including accuracy rate and results obtained using the misPatterns function you have defined in b).
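The function in b) and the evaluation in c) might look like the sketch below. It assumes labels are coded as 1 and 2, and uses scikit-learn's `LogisticRegression` on synthetic data as a stand-in for the wine features. In recent scikit-learn versions this estimator applies an L2 penalty and the `lbfgs` solver by default, but check the documentation of the version you use before describing the model in a report.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression


def misPatterns(predictions, labels):
    """Count the two kinds of misclassification.

    num1: label is 1 but prediction is 2.
    num2: label is 2 but prediction is 1.
    """
    predictions = np.asarray(predictions)
    labels = np.asarray(labels)
    num1 = int(np.sum((labels == 1) & (predictions == 2)))
    num2 = int(np.sum((labels == 2) & (predictions == 1)))
    return num1, num2


# Synthetic two-class data (not the wine data): class 1 is shifted by +1.5.
rng = np.random.default_rng(0)
y_all = np.tile([1, 2], 100)
X_all = rng.normal(size=(200, 2)) + 1.5 * (y_all[:, None] == 1)

X_train, y_train = X_all[:150], y_all[:150]
X_test, y_test = X_all[150:], y_all[150:]

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
predictions = model.predict(X_test)

accuracy = model.score(X_test, y_test)
num1, num2 = misPatterns(predictions, y_test)
print(accuracy, num1, num2)
```

Because there are only two classes, `num1 + num2` equals the total number of test errors, which ties the two counts back to the accuracy rate.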

Task 4: Investigate how the size of the training dataset affects the model performance on the test set

a) Produce a learning curve of the size of training set (II) against the accuracy rate. The accuracy rate should be measured on both the training set and the validation set (5 marks).

b) Report what is the best training data size you would like to use for this work and explain why you chose it

c) Report the performance on the test set obtained using the model trained from the best size
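A learning curve over training-set size can be produced by refitting the model on growing prefixes of training set (II). The sketch below assumes the rows are already shuffled so that every prefix contains both classes; the sizes, the synthetic data and the helper name `learning_curve_by_size` are illustrative only.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression


def learning_curve_by_size(X_train, y_train, X_val, y_val, sizes):
    """Return training and validation accuracy for each training-set size."""
    train_acc, val_acc = [], []
    for n in sizes:
        model = LogisticRegression(max_iter=1000).fit(X_train[:n], y_train[:n])
        train_acc.append(model.score(X_train[:n], y_train[:n]))
        val_acc.append(model.score(X_val, y_val))
    return train_acc, val_acc


# Synthetic stand-in with alternating labels, so every prefix has both classes.
rng = np.random.default_rng(0)
y_all = np.tile([1, 2], 60)
X_all = rng.normal(size=(120, 3)) + 1.5 * (y_all[:, None] == 1)

X_train2, y_train2 = X_all[:90], y_all[:90]
X_val, y_val = X_all[90:], y_all[90:]

sizes = [20, 40, 60, 90]
train_acc, val_acc = learning_curve_by_size(X_train2, y_train2, X_val, y_val, sizes)
for n, tr, va in zip(sizes, train_acc, val_acc):
    print(n, round(tr, 3), round(va, 3))

# To draw the curve (needs matplotlib):
# import matplotlib.pyplot as plt
# plt.plot(sizes, train_acc, label="training set (II)")
# plt.plot(sizes, val_acc, label="validation set")
# plt.xlabel("training set size"); plt.ylabel("accuracy"); plt.legend(); plt.show()
```

The size at which validation accuracy stops improving is one reasonable candidate for the "best" size; the test set should only be used once that choice has been made.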

Task 5: Investigate how the number of features extracted from PCA affects the model performance on the test set

a) Perform a PCA analysis on the training data set (II) and obtain the projected training set.

b) Produce a learning curve of the number of principal components against the accuracy rate. The accuracy rate should be measured on both the training set (II) and the validation set.

c) Report what is the best number of principal components you would like to use for this dataset and explain why you chose it.

d) Report the performance on the test set obtained using the model trained from the best number of principal components
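The Task 5 experiment can be sketched by fitting PCA once on training set (II) and then training one classifier per number of retained components. Standardising before PCA and picking the smallest number of components with the highest validation accuracy are assumptions of this example; `accuracy_by_components` is a hypothetical helper and the data are synthetic.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler


def accuracy_by_components(X_train, y_train, X_val, y_val):
    """Return a list of (k, training accuracy, validation accuracy)."""
    # Fit the scaler and the PCA on the training data only.
    scaler = StandardScaler().fit(X_train)
    pca = PCA().fit(scaler.transform(X_train))
    P_train = pca.transform(scaler.transform(X_train))
    P_val = pca.transform(scaler.transform(X_val))

    results = []
    for k in range(1, P_train.shape[1] + 1):
        model = LogisticRegression(max_iter=1000).fit(P_train[:, :k], y_train)
        results.append(
            (k, model.score(P_train[:, :k], y_train), model.score(P_val[:, :k], y_val))
        )
    return results


# Synthetic stand-in: 5 features, class 1 shifted by +1.0 in every feature.
rng = np.random.default_rng(0)
y_all = np.tile([1, 2], 60)
X_all = rng.normal(size=(120, 5)) + 1.0 * (y_all[:, None] == 1)

X_train2, y_train2 = X_all[:90], y_all[:90]
X_val, y_val = X_all[90:], y_all[90:]

results = accuracy_by_components(X_train2, y_train2, X_val, y_val)
for k, tr, va in results:
    print(k, round(tr, 3), round(va, 3))

# Highest validation accuracy, preferring fewer components on ties.
best_k = max(results, key=lambda r: (r[2], -r[0]))[0]
print("best number of components:", best_k)
```

The test set would be projected with the same fitted scaler and PCA before the final model is evaluated on it.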

Task 6: Writing a report

In this report, you need to summarize what you have done, which model you have used, what results you have obtained, and what your findings and conclusions are. The highest mark will be given to a report with outstanding presentation and clarity, no significant grammatical, spelling or structural errors, and an outstanding level of analysis with critical evaluation/reflection where it is required.
