Note that you should work on the red wine dataset (winequality-red.csv) only, which can be downloaded from ‘Data Folder’ in the link given above.
Task 1: Data pre-processing and data exploration (15 marks)
a. Use Pandas to load data
b. Merge all rows with a “quality” label from 6 to 10 into Class 1 and, similarly, all rows with a “quality” label from 1 to 5 into Class 2.
c. Report the number of features and number of rows in each class
d. Choose an attribute and generate a boxplot for the two pre-defined classes.
e. Show one scatter plot of one feature against another. You may choose which two features to use.
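Steps a to e of Task 1 could be sketched as below. This is a minimal sketch, not a model answer: the helper names (load_wine, add_class_label, class_summary, plot_overview) and the "class" column are hypothetical, and the default separator assumes the UCI copy of the file is semicolon-delimited, so adjust sep if your copy differs.

```python
import pandas as pd


def load_wine(path, sep=";"):
    """Load the red wine CSV (assumption: semicolon-delimited file)."""
    return pd.read_csv(path, sep=sep)


def add_class_label(df, threshold=6):
    """Return a copy with a 'class' column: 1 if quality >= threshold, else 2."""
    out = df.copy()
    out["class"] = (out["quality"] >= threshold).map({True: 1, False: 2})
    return out


def class_summary(df):
    """Return (number of features, {class: number of rows}).

    'quality' and the derived 'class' column are not counted as features.
    """
    n_features = df.drop(columns=["quality", "class"]).shape[1]
    rows_per_class = df["class"].value_counts().sort_index().to_dict()
    return n_features, rows_per_class


def plot_overview(df, box_col, x_col, y_col):
    """Draw a boxplot of box_col per class and a scatter plot of x_col vs y_col."""
    import matplotlib.pyplot as plt  # imported here so the helpers above work without it

    fig, (ax_box, ax_scatter) = plt.subplots(1, 2, figsize=(10, 4))
    df.boxplot(column=box_col, by="class", ax=ax_box)
    df.plot.scatter(x=x_col, y=y_col, c="class", colormap="coolwarm", ax=ax_scatter)
    return fig
```

A typical call sequence would be df = add_class_label(load_wine("winequality-red.csv")), then class_summary(df), then plot_overview(df, "alcohol", "pH", "alcohol"), where the chosen columns are only an illustration.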
Task 2: Computing probabilities using Python code for the given red wine dataset (5 marks)
f. Prior probability:
i. What is the probability of a wine classified as Class 1 (P(Class 1))?
ii. What is the probability of a wine classified as Class 2 (P(Class 2))?
g. Conditional probability:
i. What is the probability of a wine having a pH value greater than 3.6 given that it is classified as Class 1 (P(pH>3.6|Class 1))?
h. Posterior probability
i. What is the probability of a wine being classified as Class 1 given that it has a pH value greater than 3.6 (P(Class 1|pH>3.6))?
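The three quantities in Task 2 can be estimated as relative frequencies, with the posterior obtained from Bayes' rule: P(Class 1|pH>3.6) = P(pH>3.6|Class 1) P(Class 1) / P(pH>3.6). The sketch below assumes a DataFrame with "quality" and "pH" columns; the function name wine_probabilities and the dictionary keys are hypothetical.

```python
import pandas as pd


def wine_probabilities(df, ph_threshold=3.6):
    """Estimate prior, conditional and posterior probabilities by counting rows."""
    is_class1 = df["quality"] >= 6      # Class 1: quality 6-10, Class 2: quality 1-5
    high_ph = df["pH"] > ph_threshold

    p_c1 = is_class1.mean()                    # prior P(Class 1)
    p_c2 = 1.0 - p_c1                          # prior P(Class 2)
    p_ph_given_c1 = high_ph[is_class1].mean()  # conditional P(pH > t | Class 1)
    p_ph = high_ph.mean()                      # evidence P(pH > t)

    # Bayes' rule: P(Class 1 | pH > t) = P(pH > t | Class 1) * P(Class 1) / P(pH > t)
    p_c1_given_ph = p_ph_given_c1 * p_c1 / p_ph

    return {
        "P(Class 1)": float(p_c1),
        "P(Class 2)": float(p_c2),
        "P(pH>t|Class 1)": float(p_ph_given_c1),
        "P(Class 1|pH>t)": float(p_c1_given_ph),
    }
```

As a sanity check, the posterior should match the direct count, that is, the fraction of Class 1 rows among all rows with pH above the threshold.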
Task 3: Write a report summarising what you have done. Clearly explain the figures you include, and report your findings and conclusions. The report must be at most two pages long and contain fewer than 400 words. (This report counts for 0 marks: it is a chance for you to practise writing a report and to obtain feedback from a tutor.) Please use a single-column format with a font size of 11 or 12 points.
What to submit:
Hand in two files:
1) A .ipynb file showing your completed programming code (worth 20 marks)
2) A report of at most two pages in PDF format (worth 0 marks). The aim of this report is to give you a chance to practise writing a report; a tutor will give you feedback during the demo.
Task 1: Divide the data set into a training set (I) and a test set. Usually, 20%-30% of the total data points are used as test data. You may choose the exact ratio, but you need to state it clearly in your report. You should further divide the training set (I) into a smaller training set (II) and a validation set using the same ratio. (5 marks)
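One way to carry out the two-stage split is to call scikit-learn's train_test_split twice with the same ratio. The sketch below is only an illustration: the function name three_way_split, the 25% default ratio, the fixed seed, and the use of stratification (to keep class proportions similar in every subset) are assumptions, not requirements of the task.

```python
from sklearn.model_selection import train_test_split


def three_way_split(X, y, ratio=0.25, seed=0):
    """Split into training set (II), validation set and test set.

    Stage 1: all data         -> training set (I) + test set
    Stage 2: training set (I) -> training set (II) + validation set
    The same hold-out ratio is used at both stages.
    """
    X_train1, X_test, y_train1, y_test = train_test_split(
        X, y, test_size=ratio, stratify=y, random_state=seed
    )
    X_train2, X_val, y_train2, y_val = train_test_split(
        X_train1, y_train1, test_size=ratio, stratify=y_train1, random_state=seed
    )
    return {
        "train1": (X_train1, y_train1),
        "train2": (X_train2, y_train2),
        "val": (X_val, y_val),
        "test": (X_test, y_test),
    }
```

Fixing random_state makes the split reproducible, which helps when the ratio and subset sizes have to be reported.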
Task 2: PCA analysis on the two-class red wine dataset (5 marks)
a) Perform a PCA analysis on the training data set (I) (2 marks)
b) Plot the training data in the PC1 and PC2 projection and label the data in the picture according to its class. (2 marks)
c) Report variances captured by each principal component (1 mark)
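Parts a) to c) could be approached as in the following sketch, which fits PCA on the training set (I) only and then projects onto the first two components. Standardising the features first is an assumption made here because the wine features have very different scales; the function names fit_pca, project_2d and plot_projection are hypothetical.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler


def fit_pca(X_train):
    """Fit a scaler and a full PCA on the training data only."""
    scaler = StandardScaler().fit(X_train)
    pca = PCA().fit(scaler.transform(X_train))
    return scaler, pca


def project_2d(scaler, pca, X):
    """Project X onto the first two principal components (PC1, PC2)."""
    return pca.transform(scaler.transform(X))[:, :2]


def plot_projection(scores, y):
    """Scatter plot of PC1 vs PC2 with one colour and legend entry per class."""
    import matplotlib.pyplot as plt  # imported here so the helpers above work without it

    fig, ax = plt.subplots()
    for label in np.unique(y):
        mask = np.asarray(y) == label
        ax.scatter(scores[mask, 0], scores[mask, 1], s=10, label=f"Class {label}")
    ax.set_xlabel("PC1")
    ax.set_ylabel("PC2")
    ax.legend()
    return fig
```

After fitting, pca.explained_variance_ holds the variance captured by each component and pca.explained_variance_ratio_ holds the corresponding fractions of the total variance.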
Task 3: Do a classification using the logistic regression model (13 marks)
a) In your report, describe the model you have used, including (6 marks):
What is the cost function? You need to give a mathematical expression describing it.
Which optimization algorithm has been used in your code?
Did you use a regularisation term? If you used one, what is it?
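For reference when answering the questions above, one common cost function for logistic regression is the mean cross-entropy, optionally with an L2 penalty on the weights. The sketch below is an illustration under stated assumptions, not necessarily the cost your chosen library minimises: labels are encoded as 1 (Class 1) and 0 (Class 2), the penalty is (lam/2)·||w||², and the function name logistic_cost is hypothetical.

```python
import numpy as np


def logistic_cost(w, b, X, y, lam=0.0):
    """Mean cross-entropy of a logistic model plus an optional L2 penalty.

    With z = X @ w + b and sigma(z) = 1 / (1 + exp(-z)), the per-sample loss
    -[y*log(sigma(z)) + (1 - y)*log(1 - sigma(z))] simplifies to
    log(1 + exp(z)) - y*z, computed stably with np.logaddexp.
    """
    z = X @ w + b
    cross_entropy = np.mean(np.logaddexp(0.0, z) - y * z)
    l2_penalty = 0.5 * lam * np.dot(w, w)  # the bias b is not penalised
    return cross_entropy + l2_penalty
```

With all weights at zero the model predicts 0.5 for every sample, so the unregularised cost equals log(2), which is a convenient check on an implementation.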