Diabetes Dataset Analysis: PCA, Logistic Regression, and Feature Selection

Question:

The dataset includes a certain number of diagnostic measurements (features), which can be used to diagnostically predict whether or not a participant has diabetes. Note that the last column is the label information: participants who have diabetes are classified as ‘positive’; participants who do not have diabetes are classified as ‘negative’.

Preparation: To complete the tasks in this piece of work, load the data using Pandas and recode the labels: change ‘positive’ to the value 1 and ‘negative’ to the value 0.
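The preparation step can be sketched as follows. This is a minimal illustration, not the required solution: the in-memory CSV and the column names (glucose, bmi, label) are hypothetical stand-ins for the real file, which you would read with pd.read_csv on its actual path.

```python
import io

import pandas as pd

# Hypothetical stand-in for the real data file (column names are assumptions).
csv_text = """glucose,bmi,label
148,33.6,positive
85,26.6,negative
183,23.3,positive
89,28.1,negative
"""
df = pd.read_csv(io.StringIO(csv_text))

# The label is the last column; map the strings to integers.
label_col = df.columns[-1]
df[label_col] = df[label_col].map({"positive": 1, "negative": 0})

# Any label outside the mapping becomes NaN, so check before casting to int.
if df[label_col].isna().any():
    raise ValueError("Unexpected label value found")
df[label_col] = df[label_col].astype(int)

print(df)
```

Checking for unmapped values before casting guards against stray spellings or whitespace in the label column.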

Task 1:

Divide the dataset into a training set, referred to from now on as training set (I), and a test set: write Python code so that the first 500 rows of the original data form training set (I) and the remaining rows form the test set. Use Python code to check and report how many data points are labelled 0 (negative) in the training set and in the test set, respectively, and how many data points are labelled 1 (positive) in the training set and in the test set, respectively.
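One possible way to make the positional split and count the labels is shown below. The data here are synthetic; the total of 768 rows, the feature names, and the column name label are assumptions made only so the sketch runs on its own.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the real data (row count and names are assumptions).
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(768, 8)), columns=[f"f{i}" for i in range(1, 9)])
df["label"] = rng.integers(0, 2, size=768)

# First 500 rows form training set (I); the remaining rows form the test set.
train_I = df.iloc[:500]
test = df.iloc[500:]

# Count how many 0s and 1s each set contains.
train_counts = train_I["label"].value_counts().sort_index()
test_counts = test["label"].value_counts().sort_index()

print("Training set (I):", train_counts.to_dict())
print("Test set:", test_counts.to_dict())
```

Because the split is by row position rather than random, the class proportions in the two sets may differ, which is worth noting when interpreting later results.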

Task 2: PCA Analysis on the training set

a) Normalise the training set and the test set using StandardScaler() (Hint: the parameters should come from the training set only)
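The hint in a) can be illustrated with a short sketch: the scaler is fitted on the training data only, and the same fitted parameters are then applied to the test data. The arrays below are synthetic placeholders, and their shapes are assumptions.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Synthetic placeholders for the feature matrices (shapes are assumptions).
rng = np.random.default_rng(0)
X_train = rng.normal(loc=5.0, scale=2.0, size=(500, 8))
X_test = rng.normal(loc=5.0, scale=2.0, size=(268, 8))

scaler = StandardScaler()
X_train_std = scaler.fit_transform(X_train)  # mean and std learned from training data only
X_test_std = scaler.transform(X_test)        # training mean and std reused on the test data

print("Training means after scaling:", X_train_std.mean(axis=0).round(3))
print("Test means after scaling:", X_test_std.mean(axis=0).round(3))
```

The scaled test set will generally not have exactly zero mean or unit variance, which is expected because its statistics were never used.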

b) Perform a PCA analysis on the training data set (I) and plot a scree plot to report variances captured by each principal component
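A scree plot for b) might be produced along these lines. The data are synthetic and already standardised in the sketch; in the actual task the standardised training set (I) would be used instead. Plotting the cumulative curve as well is a design choice, not something the task demands.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic placeholder for the standardised training set (I).
rng = np.random.default_rng(0)
X_train_std = StandardScaler().fit_transform(rng.normal(size=(500, 8)))

pca = PCA()  # keep all principal components
pca.fit(X_train_std)
ratios = pca.explained_variance_ratio_

# Scree plot: variance ratio per component, plus the cumulative total.
components = np.arange(1, len(ratios) + 1)
fig, ax = plt.subplots()
ax.plot(components, ratios, "o-", label="per component")
ax.plot(components, np.cumsum(ratios), "s--", label="cumulative")
ax.set_xlabel("Principal component")
ax.set_ylabel("Explained variance ratio")
ax.set_title("Scree plot, training set (I)")
ax.legend()
# In a script, call plt.show() or fig.savefig(...) to view the figure.
```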

c) Plot two subplots in one figure. In the first subplot, project the training set onto the first two principal components and colour each point according to its class. In the second subplot, project the training set onto the third and fourth principal components and again colour each point according to its class.
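A sketch of the two-panel projection figure is given below, colouring the projected training points by class in both panels. The data and labels are synthetic, and the colour names are arbitrary choices.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic placeholders for the standardised training set (I) and its labels.
rng = np.random.default_rng(0)
X_train_std = StandardScaler().fit_transform(rng.normal(size=(500, 8)))
y_train = rng.integers(0, 2, size=500)

# Project onto the first four principal components.
Z = PCA(n_components=4).fit_transform(X_train_std)

fig, axes = plt.subplots(1, 2, figsize=(10, 4))
# Left panel: PC1 vs PC2; right panel: PC3 vs PC4 (zero-based column indices).
for ax, (i, j) in zip(axes, [(0, 1), (2, 3)]):
    for cls, colour in [(0, "tab:blue"), (1, "tab:red")]:
        mask = y_train == cls
        ax.scatter(Z[mask, i], Z[mask, j], c=colour, s=10, label=f"class {cls}")
    ax.set_xlabel(f"PC{i + 1}")
    ax.set_ylabel(f"PC{j + 1}")
    ax.legend()
fig.tight_layout()
# In a script, call plt.show() or fig.savefig(...) to view the figure.
```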

Task 3: Do a classification using the logistic regression model with a regularisation term

a) In your report, describe the model you have used, including:

What is the cost function? You need to give a mathematical expression describing it.

Which optimization algorithm has been used in your code?

Which regularisation term have you used?
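As an illustration of the kind of expression expected, one common way of writing the L2-regularised logistic regression cost is given below. The notation is an assumption, not prescribed by the task: m is the number of training points, w and b are the weights and bias, λ is the regularisation strength, and the labels y take values in {0, 1}.

```latex
J(\mathbf{w}, b) = -\frac{1}{m} \sum_{i=1}^{m}
  \Big[ y^{(i)} \log h^{(i)} + \big(1 - y^{(i)}\big) \log\big(1 - h^{(i)}\big) \Big]
  + \frac{\lambda}{2m} \lVert \mathbf{w} \rVert_2^2,
\qquad
h^{(i)} = \sigma\big(\mathbf{w}^\top \mathbf{x}^{(i)} + b\big)
        = \frac{1}{1 + e^{-(\mathbf{w}^\top \mathbf{x}^{(i)} + b)}}
```

If scikit-learn's LogisticRegression is used with its defaults, recent versions apply an L2 penalty and the lbfgs solver, and the regularisation is controlled by C, the inverse of the regularisation strength. Check the documentation for the version you have installed before stating this in the report.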

b) Define your own function ([num1, index1, num2, index2]=misPatterns(predictions, labels)) using Python. The inputs of this function should be the predictions and labels for the test set. The outputs should be the number (num1) of misclassified patterns whose label is 1 but which were given a prediction of 0, together with their indices (index1) in the test set, and the number (num2) of misclassified patterns whose label is 0 but which were given a prediction of 1, together with their indices (index2) in the test set.
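A minimal sketch of such a function follows. It assumes that the indices are zero-based positions within the test set and that both inputs are array-like with values 0 or 1; the example inputs are made up.

```python
import numpy as np


def misPatterns(predictions, labels):
    """Return [num1, index1, num2, index2] for the two kinds of misclassification."""
    predictions = np.asarray(predictions)
    labels = np.asarray(labels)
    if predictions.shape != labels.shape:
        raise ValueError("predictions and labels must have the same shape")

    # Label 1 but predicted 0 (false negatives).
    index1 = np.where((labels == 1) & (predictions == 0))[0]
    # Label 0 but predicted 1 (false positives).
    index2 = np.where((labels == 0) & (predictions == 1))[0]
    return [len(index1), index1, len(index2), index2]


# Small made-up example.
num1, index1, num2, index2 = misPatterns([1, 0, 0, 1, 1], [1, 1, 0, 0, 1])
print(num1, index1, num2, index2)
```

If the test set is a DataFrame slice that keeps its original row labels, the positional indices returned here differ from those row labels, so state in the report which convention is used.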

Task 4: Investigate how the number of features in the training dataset affects the model performance on the validation set

a) Divide the training dataset (I) into a smaller training set (II) and a validation set using train_test_split and report the number of points in each set. Usually, we use 20%-30% of the total data points in the whole training set as the validation data. It is your choice on how you set the exact ratio.
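The split in a) could look like the sketch below. The 25% validation share, the stratification, and the fixed random_state are choices made for illustration; the data are synthetic placeholders for training set (I).

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic placeholders for training set (I).
rng = np.random.default_rng(0)
X_train_I = rng.normal(size=(500, 8))
y_train_I = rng.integers(0, 2, size=500)

# Hold out 25% as validation data; stratify to keep class proportions similar.
X_train_II, X_val, y_train_II, y_val = train_test_split(
    X_train_I, y_train_I, test_size=0.25, stratify=y_train_I, random_state=42
)

print("Training set (II):", X_train_II.shape[0], "points")
print("Validation set:", X_val.shape[0], "points")
```

Fixing random_state makes the reported set sizes and later scores reproducible when the notebook is rerun.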

b) Use the training set (II) to train 8 logistic regression models, with 8 different feature sets. That is:

The first one is to use the 1st feature only; the second one is to use the 1st and 2nd features; the third one is to use the 1st, 2nd, and 3rd features; the fourth one is to use the first 4 features; and so on. In other words, the nth feature set should make use of the first n features.

Measure the precision score on both the training set (II) and the validation set. Report the results by plotting them in a figure: that is, a plot of the precision score against the number of features used in each model. There should be two curves in this figure: one for the training set (II) and one for the validation set.
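The loop over the eight nested feature sets might be sketched as follows. The data are synthetic and assumed to be already standardised; max_iter=1000 and zero_division=0 are defensive choices for the sketch rather than requirements of the task.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score
from sklearn.model_selection import train_test_split

# Synthetic placeholder for training set (I), with labels that depend on the features.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 8))
y = (X @ rng.normal(size=8) + rng.normal(scale=0.5, size=500) > 0).astype(int)
X_tr, X_val, y_tr, y_val = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

train_prec, val_prec = [], []
n_features = list(range(1, 9))
for n in n_features:
    # The nth model uses only the first n feature columns.
    model = LogisticRegression(max_iter=1000)
    model.fit(X_tr[:, :n], y_tr)
    train_prec.append(precision_score(y_tr, model.predict(X_tr[:, :n]), zero_division=0))
    val_prec.append(precision_score(y_val, model.predict(X_val[:, :n]), zero_division=0))

fig, ax = plt.subplots()
ax.plot(n_features, train_prec, "o-", label="training set (II)")
ax.plot(n_features, val_prec, "s--", label="validation set")
ax.set_xlabel("Number of features used")
ax.set_ylabel("Precision")
ax.legend()
# In a script, call plt.show() or fig.savefig(...) to view the figure.
```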

c) Report the number of features you consider best for this work and explain why you chose it.

d) Use the selected number of features to train the model and report the performance on the test set.
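The final evaluation in d) could be sketched as below. The value n_best = 5 is purely hypothetical, the data are synthetic, and retraining on the whole of training set (I) rather than on training set (II) is an assumption that the task leaves open.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix, precision_score

# Synthetic placeholder data with the same positional split as Task 1.
rng = np.random.default_rng(0)
X = rng.normal(size=(768, 8))
y = (X @ rng.normal(size=8) + rng.normal(scale=0.5, size=768) > 0).astype(int)
X_train_I, y_train_I = X[:500], y[:500]
X_test, y_test = X[500:], y[500:]

n_best = 5  # hypothetical value; use the number chosen in part c)

# Retrain on training set (I) using only the first n_best features (an assumption).
model = LogisticRegression(max_iter=1000)
model.fit(X_train_I[:, :n_best], y_train_I)
y_pred = model.predict(X_test[:, :n_best])

cm = confusion_matrix(y_test, y_pred, labels=[0, 1])
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred, zero_division=0)

print("Confusion matrix (rows: true label, columns: prediction):")
print(cm)
print(f"Accuracy: {accuracy:.3f}, precision: {precision:.3f}")
```

The test set is touched only once, after the number of features has been fixed, so the reported scores are not influenced by the selection step.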

Task 5: Writing a report

In this report, you need to summarize what you have done, which model you have used, what results you have obtained, and also your findings and conclusions. The highest mark will be given to reports with outstanding presentation and clarity, no significant grammatical, spelling, or structural errors, and which show an outstanding level of analysis with critical evaluation and reflection where it is required.

Hand-in date: by 12 noon on 23/06/2020 via Canvas.

What to submit:

Submit two files, each identified by your student ID number.

1) A Jupyter notebook (.ipynb file) to show your completed Python code.

2) A report in PDF format of no more than 4 pages and fewer than 1200 words. Please use a single-column format with the font size set to 11 or 12 point.
