Python Assignment: Pre-Processing - PCA

Python Assignment on Data Pre-Processing, PCA Analysis, and Linear Regression

Task 1: Data pre-processing and data exploration

This coursework is an individual assignment. You need to write your own Python code in a Jupyter Notebook.

a. Use Pandas to load the data and report the number of data points (rows) in the dataset.

b. Consider “quality” as class labels. Report the number of features in the dataset and the number of data points in each class.

c. Randomly permute the data using the shuffle function from sklearn.utils. You must set a value for the random_state parameter. Assign the shuffled data to a new variable named white_wine.

d. Produce one scatter plot, that is, one feature against another feature. You are free to choose which two features you want to use.
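Steps (a) to (c) can be sketched roughly as follows. This is only an illustration under stated assumptions: the brief does not name the data file, so a small synthetic DataFrame stands in for it, and the column names "alcohol" and "pH" and the value random_state=42 are hypothetical choices.

```python
import numpy as np
import pandas as pd
from sklearn.utils import shuffle

# Hypothetical stand-in for the real dataset. In the notebook this would be
# loaded with pd.read_csv(...) using the actual file path and separator.
rng = np.random.default_rng(0)
data = pd.DataFrame({
    "alcohol": rng.normal(10.5, 1.2, size=200),
    "pH": rng.normal(3.2, 0.15, size=200),
    "quality": rng.integers(3, 10, size=200),
})

# (a) number of data points (rows)
n_rows = len(data)

# (b) number of features (every column except the label) and class sizes
n_features = data.shape[1] - 1
class_counts = data["quality"].value_counts().sort_index()

# (c) random permutation; random_state makes the shuffle reproducible
white_wine = shuffle(data, random_state=42)

print(n_rows, n_features)
print(class_counts)
```

For step (d), one option is pandas' built-in plotting, for example white_wine.plot.scatter(x="alcohol", y="pH"), with whichever two features you choose.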

Task 2: PCA analysis

a. Perform a PCA analysis on the whole white_wine dataset.

b. Plot the data in the PC1 and PC2 projections and label/colour the data in the plot according to their class labels.

c. Report the variance captured by each principal component.
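A minimal sketch of the PCA steps above, using scikit-learn. It assumes the features are standardized first and that the "quality" column is kept out of the projection and used only for colouring; the brief does not specify either choice. The synthetic DataFrame and its column names are hypothetical stand-ins for white_wine.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-in for the shuffled white_wine DataFrame.
rng = np.random.default_rng(0)
white_wine = pd.DataFrame(rng.normal(size=(200, 4)),
                          columns=["f1", "f2", "f3", "f4"])
white_wine["quality"] = rng.integers(3, 10, size=200)

# Assumption: separate the features from the class label.
X = white_wine.drop(columns="quality")
y = white_wine["quality"]

# Standardize so that no feature dominates purely because of its units.
X_scaled = StandardScaler().fit_transform(X)

# Fit PCA with all components and project the data.
pca = PCA()
scores = pca.fit_transform(X_scaled)

# (b) coordinates for the PC1 vs PC2 plot
pc1, pc2 = scores[:, 0], scores[:, 1]

# (c) fraction of the total variance captured by each component
explained = pca.explained_variance_ratio_
print(explained)
```

The projection could then be drawn with matplotlib, for example plt.scatter(pc1, pc2, c=y), so that each point is coloured by its class label.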

Task 3: Splitting the data

a. Take out the first 1000 rows from white_wine and save them as the validation set.

b. Take out the last 1000 rows from white_wine and save them as the test set.

c. Save the remaining rows from white_wine as the training set.
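The three-way split above can be expressed with positional slicing. The sketch below uses a hypothetical 3000-row stand-in for white_wine; the variable names validation_set, test_set, and train_set are illustrative, not prescribed by the brief.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the shuffled white_wine DataFrame.
white_wine = pd.DataFrame({
    "x": np.arange(3000),
    "quality": np.arange(3000) % 7 + 3,
})

# iloc slices by position, so this works even though shuffling
# leaves the index labels out of order.
validation_set = white_wine.iloc[:1000]    # first 1000 rows
test_set = white_wine.iloc[-1000:]         # last 1000 rows
train_set = white_wine.iloc[1000:-1000]    # everything in between

print(len(train_set), len(validation_set), len(test_set))
```

Because the data was shuffled beforehand, taking contiguous blocks in this way still gives random subsets.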

Task 4: Linear regression

In this task, consider the last column, ‘quality’, of the white_wine dataset as a real-valued target rather than a class label. Use a linear regression model to complete tasks (a)-(c) below. Note that you should use all available features in the dataset.

a. Produce a learning curve of the size of training set against the performance measurements. The performance should be measured on both the training set and the validation set. You need to choose at least 10 different sizes for the training set. For example, the first size may be 10% of the total training set produced in Task 3.

• Remember to scale the corresponding training set and the validation set.

b. Report the training set size you consider best for this work and explain why you chose it.

c. Report the performance on the test set obtained using the model trained from the best size.

• Remember to scale the corresponding training and test sets.
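One possible shape for the learning-curve loop is sketched below. It rests on several assumptions not fixed by the brief: mean squared error as the performance measure, ten evenly spaced training fractions from 10% to 100%, and synthetic arrays standing in for the real training and validation sets. The scaler is fitted on each training subset only and then applied to the validation data.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.preprocessing import StandardScaler

# Hypothetical synthetic data standing in for the real splits.
rng = np.random.default_rng(0)
true_w = rng.normal(size=5)

def make_split(n):
    """Generate n samples from a noisy linear model (illustration only)."""
    X = rng.normal(size=(n, 5))
    y = X @ true_w + rng.normal(scale=0.5, size=n)
    return X, y

X_train, y_train = make_split(500)
X_val, y_val = make_split(200)

fractions = np.linspace(0.1, 1.0, 10)   # 10 training sizes: 10% ... 100%
sizes, train_mse, val_mse = [], [], []

for frac in fractions:
    n = int(round(frac * len(X_train)))
    X_sub, y_sub = X_train[:n], y_train[:n]

    # Fit the scaler on the current training subset only, then reuse it
    # for the validation set so no validation statistics leak in.
    scaler = StandardScaler().fit(X_sub)
    X_sub_s = scaler.transform(X_sub)
    X_val_s = scaler.transform(X_val)

    model = LinearRegression().fit(X_sub_s, y_sub)

    sizes.append(n)
    train_mse.append(mean_squared_error(y_sub, model.predict(X_sub_s)))
    val_mse.append(mean_squared_error(y_val, model.predict(X_val_s)))

for n, tr, va in zip(sizes, train_mse, val_mse):
    print(n, round(tr, 3), round(va, 3))
```

Plotting sizes against train_mse and val_mse gives the learning curve. The same fit-scaler-on-training, transform-the-other-set pattern applies when evaluating the chosen size on the test set in (c).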

Task 5: Summary and discussion

a. Summarize your findings for each task.

b. For Task 4, discuss whether there is any problem with that experimental design. If there is, what is it? How may you further improve it so that the experimental results are more reliable?
