Diabetes Hospital Re-admittance: PCA Analysis Assignment

Question

This Jupyter notebook is provided. It contains all the instructions needed to complete the assignment.

This assignment is worth 30% of the total marks available in this module.

What to submit

You must submit two files:

1. This notebook, completed, in .ipynb format.

2. This notebook, completed, in .html format.

A total of 100 marks can be obtained. 10 marks are awarded for submitting a notebook that runs without errors on an Anaconda installation such as the one installed on the UH practical machines. The remaining 90 marks are distributed as indicated in the tasks.

Only answers with correct code will be marked. Any parts that do not run because of coding errors will receive zero marks. This includes cells that do not run because previous cells contained errors. Please test your notebook before submitting. You can use "Kernel" -> "Restart & Run All" in the Jupyter notebook menu bar.

You will work on a dataset about diabetes hospital re-admittance. The dataset is available via Canvas as part of this assignment. Diabetes is a condition in which the body loses its ability to process glucose. It is often associated with overconsumption of food and with obesity, but genetic factors can also cause the disease.

The dataset is a CSV file.

It contains seven features. It also contains one dependent variable, 'Outcome'.

Task 1 - Load the data

Use pandas to load the provided CSV file.

Use the DataFrame.head method to show the first 10 rows of the data set.
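
As an illustration of the two calls named above, here is a minimal sketch. The in-memory CSV text and the column names feature_a and feature_b are hypothetical stand-ins so that the example runs on its own; in the notebook you would pass the path of the file downloaded from Canvas to pd.read_csv instead.

```python
import io

import pandas as pd

# Hypothetical stand-in for the provided file: 12 rows of made-up values.
# In the notebook, replace csv_text with the path to the CSV from Canvas.
rows = ["feature_a,feature_b,Outcome"]
rows += [f"{i},{i * 2.5},{i % 2}" for i in range(12)]
csv_text = io.StringIO("\n".join(rows))

df = pd.read_csv(csv_text)

# head(10) returns the first 10 rows; the default without an argument is 5.
first_rows = df.head(10)
print(first_rows)
```

In a notebook cell, leaving df.head(10) as the last expression renders the table without print.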

Task 2 - Normalisation and PCA

Perform a PCA on the data set. Perform the PCA on the correlation matrix, not the covariance matrix. Exclude the dependent variable from processing. The result should contain the maximum number of Principal Components possible.

Make a DataFrame that contains the result of the PCA in the first n columns, n being the number of Principal Components. Columns should be named "PC 1" to "PC n". The last column of the data frame should contain the unprocessed dependent variable. Name the column according to the dependent variable.

Use the DataFrame.head method to display the first 10 rows of this DataFrame.
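
One common way to run a PCA on the correlation matrix is to standardise each feature to zero mean and unit variance before fitting, because the covariance matrix of standardised data equals the correlation matrix of the original data up to a constant factor. The sketch below assumes scikit-learn is available and uses synthetic data with hypothetical column names (feature_1 to feature_7); it is one possible approach, not the required solution.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the assignment data (hypothetical column names).
rng = np.random.default_rng(0)
feature_names = [f"feature_{i}" for i in range(1, 8)]
df = pd.DataFrame(rng.normal(size=(100, 7)), columns=feature_names)
df["Outcome"] = rng.integers(0, 2, size=100)

target = "Outcome"
X = df.drop(columns=[target])  # exclude the dependent variable

# Standardise so that the PCA effectively works on the correlation matrix.
X_scaled = StandardScaler().fit_transform(X)

# n_components=None keeps min(n_samples, n_features) components,
# i.e. the maximum number possible.
pca = PCA(n_components=None)
scores = pca.fit_transform(X_scaled)

# Name the columns "PC 1" ... "PC n" and append the untouched target.
pc_names = [f"PC {i + 1}" for i in range(scores.shape[1])]
pca_df = pd.DataFrame(scores, columns=pc_names, index=df.index)
pca_df[target] = df[target]

print(pca_df.head(10))
```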

Task 3 - PCA Scatter plot and interpretation (15 marks)

Make a scatter plot of the first two principal components. Use small dots as markers. Color the dots according to the dependent variable. Use different colors for all discrete values of the dependent variable. Add a legend to the plot that explains which color is associated with which value of the dependent variable. Label the axes accordingly.
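
A plot of this kind could be built with matplotlib roughly as follows. The data here are random numbers standing in for the PCA result, so the picture itself carries no meaning; only the plotting pattern (one scatter call per discrete value, so that each gets its own colour and legend entry) is the point.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Random stand-in for the PCA result DataFrame from Task 2.
rng = np.random.default_rng(1)
pca_df = pd.DataFrame({
    "PC 1": rng.normal(size=200),
    "PC 2": rng.normal(size=200),
    "Outcome": rng.integers(0, 2, size=200),
})

fig, ax = plt.subplots()

# One scatter call per discrete value: matplotlib assigns the next colour
# of its default cycle to each call, and the label feeds the legend.
for value in sorted(pca_df["Outcome"].unique()):
    subset = pca_df[pca_df["Outcome"] == value]
    ax.scatter(subset["PC 1"], subset["PC 2"], s=5, label=f"Outcome = {value}")

ax.set_xlabel("PC 1")
ax.set_ylabel("PC 2")
ax.legend()
```

The s argument sets the marker area in points squared, so a small value such as 5 gives small dots.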

Interpret the findings: By visual inspection, which principal component gives a better separation of the dependent variable? Justify your answer. Three sentences maximum.

Interpretation: *Please write here.*

Task 4 - Covariance/correlation matrix and interpretation (20 marks)

Plot the covariance/correlation matrix as an image. Use a diverging colormap and center it on zero. Display a colorbar. Label each row and column according to the feature it represents, i.e. set the xticklabels and yticklabels accordingly. Rotate the column labels by 90 degrees to make sure they are legible.
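
The following sketch shows one way to draw a correlation matrix with a zero-centred diverging colormap and to rank feature pairs. It uses synthetic data with hypothetical feature names, in which feature_2 is deliberately constructed to correlate with feature_1. Ranking by absolute value is an assumption made here; check whether "highest" in the question is meant to include strong negative correlations.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Synthetic features (hypothetical names); feature_2 is built from feature_1
# so that at least one pair is strongly correlated.
rng = np.random.default_rng(2)
names = [f"feature_{i}" for i in range(1, 8)]
X = pd.DataFrame(rng.normal(size=(300, 7)), columns=names)
X["feature_2"] = X["feature_1"] + 0.1 * rng.normal(size=300)

corr = X.corr()  # Pearson correlation matrix

fig, ax = plt.subplots()
# Symmetric limits (-1, 1) centre the diverging colormap on zero.
im = ax.imshow(corr.to_numpy(), cmap="RdBu_r", vmin=-1, vmax=1)
fig.colorbar(im, ax=ax, label="Correlation")
ax.set_xticks(range(len(corr.columns)))
ax.set_xticklabels(list(corr.columns), rotation=90)
ax.set_yticks(range(len(corr.index)))
ax.set_yticklabels(list(corr.index))
fig.tight_layout()

# Keep only the upper triangle (k=1 skips the diagonal) so that each pair
# appears once, then rank the pairs by absolute correlation.
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(mask).stack().dropna()
top_pairs = pairs.sort_values(key=np.abs, ascending=False).head(3)
print(top_pairs)
```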

Interpret the result: Name the three pairs of features with the highest correlation.

Interpretation: Most correlated feature pairs:

Task 5 - Component matrix and interpretation (20 marks)

Plot the PCA component matrix as an image. Use a diverging colormap. Center it on zero. Display a colorbar. Use labels for rows and columns to indicate which PC or feature the row or column refers to. Make sure all labels are legible.
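
Assuming the PCA was fitted with scikit-learn, the component matrix is available as the components_ attribute, with one row per principal component and one column per feature. The sketch below fits a PCA on synthetic data with hypothetical feature names and plots that matrix; the colour limits are set symmetrically around zero from the largest absolute loading.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the feature columns (hypothetical names).
rng = np.random.default_rng(3)
names = [f"feature_{i}" for i in range(1, 8)]
X = pd.DataFrame(rng.normal(size=(150, 7)), columns=names)

pca = PCA().fit(StandardScaler().fit_transform(X))

# components_ has shape (n_components, n_features).
components = pca.components_
pc_names = [f"PC {i + 1}" for i in range(components.shape[0])]

# Symmetric colour limits centre the diverging colormap on zero.
limit = np.abs(components).max()

fig, ax = plt.subplots()
im = ax.imshow(components, cmap="RdBu_r", vmin=-limit, vmax=limit)
fig.colorbar(im, ax=ax, label="Component weight")
ax.set_xticks(range(len(names)))
ax.set_xticklabels(names, rotation=90)
ax.set_yticks(range(len(pc_names)))
ax.set_yticklabels(pc_names)
fig.tight_layout()
```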

Interpret the component matrix, taking your observation from the scatter plot into account. Which features are most likely to be correlated with the outcome? Justify your answer. Five sentences maximum.

Task 6 - Scree plot and interpretation (10 marks)

Plot the explained variance ratio against the number of Principal Components ("scree plot"). Use 'x' as a marker and a dashed line, both colored black. Label the axes accordingly. Set the y-axis limit such that it covers the whole range of values, starting at zero.
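
A scree plot along these lines could look as follows, again assuming a scikit-learn PCA fitted on synthetic stand-in data. The explained variance ratios come from the explained_variance_ratio_ attribute of the fitted PCA object.

```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the seven feature columns.
rng = np.random.default_rng(4)
X = rng.normal(size=(150, 7))

pca = PCA().fit(StandardScaler().fit_transform(X))
ratios = pca.explained_variance_ratio_
component_numbers = np.arange(1, len(ratios) + 1)

fig, ax = plt.subplots()
# Black 'x' markers joined by a black dashed line.
ax.plot(component_numbers, ratios, marker="x", linestyle="--", color="black")
ax.set_xlabel("Principal component")
ax.set_ylabel("Explained variance ratio")
ax.set_xticks(component_numbers)
# Start the y-axis at zero and leave a little headroom above the maximum.
ax.set_ylim(0, ratios.max() * 1.05)
```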

Interpret the findings: What fraction of the total variance do the first two PCs explain? How many PCs are required to explain 95% of the variance? Feel free to use extra code to provide exact answers.
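
Both questions can be answered from the cumulative sum of the explained variance ratios. The ratios in this sketch are made-up illustrative numbers, not results from the assignment dataset; in the notebook they would come from the fitted PCA instead.

```python
import numpy as np

# Made-up explained variance ratios for illustration only.
ratios = np.array([0.40, 0.25, 0.15, 0.10, 0.06, 0.03, 0.01])

# cumulative[k - 1] is the fraction of variance explained by the first k PCs.
cumulative = np.cumsum(ratios)

first_two = cumulative[1]

# argmax on a boolean array returns the index of the first True, i.e. the
# first position where the cumulative fraction reaches 95%; add 1 to turn
# the zero-based index into a number of components.
n_for_95 = int(np.argmax(cumulative >= 0.95)) + 1

print(f"First two PCs explain {first_two:.1%} of the variance")
print(f"{n_for_95} PCs are needed to explain at least 95%")
```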
