Data prep for heart disease & Iris datasets: exploration

Data Exploration, Correlation, and Cleaning for Heart Disease and Iris Datasets

Another dataset is the heart disease dataset (file uploaded on the learn dropbox). The dataset contains 14 features (attributes) and 303 instances. The features are of mixed types: categorical, numeric, ordinal, and binary. The target is a binary variable indicating the presence or absence of heart disease, using 1 and 0 respectively.

Answer Questions 1, 2, and 3 for both datasets (1 and 2).

Question 1: Data Exploration

1. [CM1] To begin understanding the dataset, generate a “pairs plot” (also called a scatter plot matrix; seaborn.pairplot is one method to do this) of the data. Note that the pairs plot includes the scatter plots of every dimension versus every other dimension. From the pairs plot, identify the subplots corresponding to the pairs of features where you see correlation.
   • For Iris: Make a single pairs plot of all the features and data.
   • For Heart Disease: Choose your own subset of 3-5 features for the plot which highlight some interesting pattern. You will need to explore different subsets of features, or their correlation, distribution, etc., in order to choose a set of features.
2. Justify why you chose those features.
3. [CM2] Question: Calculate and report the correlation coefficient for the pair of features. To what extent are the features correlated? Do you find any interesting or significant relationships?
4. Calculate the mean, variance, skew, and kurtosis for the datasets and explain your observations about the nature of the data and the relationships between the features of the dataset.
5. Are there any notable outliers in the data that should be removed? Provide a short justification for your answer in plots and/or words.
6. [CM4] For the Heart Disease dataset only: Group the features by their variable types and plot a histogram of the features to determine the number of present and absent heart disease cases.
7. Data Cleaning: Deal with any missing values in the data (use any of the methods discussed in class: dropping data, interpolating, replacing with approximations, ...). You can also remove any noise from the data by applying smoothing to some features. Report any changes you make and justify them. You can use your validation set to check whether any of these approaches has an impact on classification performance.
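
The exploration steps in Question 1 can be sketched roughly as follows for Iris. This is a minimal illustration, not a reference solution: it assumes pandas and scikit-learn are installed, loads Iris from scikit-learn rather than from the course files, and uses a 1.5 × IQR rule and median filling as example choices for outliers and missing values. The helper names make_pairplot and iqr_outlier_mask are hypothetical and introduced only for this sketch.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import load_iris

# Load Iris as a DataFrame; column names are the ones scikit-learn ships.
iris = load_iris(as_frame=True)
features = iris.data      # 150 rows x 4 numeric features
target = iris.target      # class labels 0, 1, 2


def make_pairplot(df, hue=None):
    """Draw a scatter plot matrix (hypothetical helper; needs seaborn)."""
    import seaborn as sns  # imported lazily so the rest runs without seaborn
    return sns.pairplot(df, hue=hue)


# Pearson correlation coefficient for every pair of features.
corr = features.corr(method="pearson")

# Per-feature mean, sample variance, skew, and excess kurtosis.
stats = pd.DataFrame({
    "mean": features.mean(),
    "variance": features.var(),
    "skew": features.skew(),
    "kurtosis": features.kurt(),
})


def iqr_outlier_mask(df, factor=1.5):
    """Flag values outside [Q1 - factor*IQR, Q3 + factor*IQR] per column."""
    q1 = df.quantile(0.25)
    q3 = df.quantile(0.75)
    iqr = q3 - q1
    return (df < q1 - factor * iqr) | (df > q3 + factor * iqr)


# Number of flagged values in each feature (one possible outlier criterion).
outlier_counts = iqr_outlier_mask(features).sum()

# Iris has no missing values, so one is injected into a copy for illustration.
damaged = features.copy()
damaged.iloc[0, 0] = np.nan
filled = damaged.fillna(damaged.median())  # replace NaN with the column median

print(corr.round(2))
print(stats.round(3))
print(outlier_counts)
```

Calling make_pairplot(features.assign(species=target), hue="species") would draw the single Iris pairs plot. For the Heart Disease data the same calls would apply to whichever 3-5 columns are chosen, after reading the provided file into a DataFrame.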

Question 2: KNN

Classify the data using a KNN classifier. You will tune the parameter of the KNN classifier using sklearn functions, plot the different validation accuracies against the values of the parameter, select the best parameter to fit the model, and report the resulting accuracy. Carry out the following activities and reporting:

Basic Model: The intent of steps 1-4 is to confirm your numerical answer, so follow the steps exactly.

1. Divide the data into train, validation, and test sets (60%, 20%, 20%). Note: set the random seed for splitting; use random_state=275 in the scikit-learn train_test_split function to get the same split every time you run the program.
2. Train the model with the classifier's default parameters. Use the train set and test the model on the test set. Store the accuracy of the model.
3. Then, you should find the best parameters for the classifier, in this case k for KNN. To find the best parameter you should:
   (a) Pick a value of the parameter. Test the following values for validation: k: {1, 5, 10, 15, 20, 25, 30, 35}
   (b) Fit the model using the train set.
   (c) Test the model with the validation set. Store the accuracy.
   (d) When you finish trying all the possible parameter values, plot a figure that shows the relationship between the validation accuracy and the parameter. Report the best k in terms of classification accuracy.
4. Now, using the best parameters found, fit the model using the training set and predict the target on the test set. Report the accuracy, AUC, and f-score of your KNN classifier.

Your Improved Model: Try to improve your classification results on any of the performance metrics we have discussed by exploring different improvements using your validation set.

5. Normalization: Normalize the data using the methods we discussed, explain what you used, and explain briefly what worked best.
6. Weighted KNN: The KNeighborsClassifier class has an option for weighted KNN, where points that are nearby to the query point are more important for the classification than others. Try using different weighting schemes (default, Manhattan, Euclidean) to see the effect. You can also define your own distance metric to try to improve performance further (testing on validation only, of course).
7. [CM7] After making these improvements, compute your new classification results on the test set and report the accuracy, AUC, and f-score.
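
The workflow in Question 2 could look roughly like the sketch below, shown on Iris loaded from scikit-learn. Several choices here are assumptions rather than part of the assignment: because Iris has three classes, AUC is computed one-vs-rest and the f-score is macro-averaged (for the binary Heart Disease target, the positive-class probability and the default binary f-score would be the natural choice); ties in validation accuracy go to the smallest k; and the "improved" model is just one candidate combining standardization with distance weighting and the Manhattan metric. In scikit-learn, the weights parameter of KNeighborsClassifier takes "uniform" or "distance", while Manhattan and Euclidean are values of the metric parameter. The helper report is hypothetical.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

SEED = 275
K_VALUES = [1, 5, 10, 15, 20, 25, 30, 35]

X, y = load_iris(return_X_y=True)

# Step 1: 60/20/20 split. Hold out 20% for test, then take 25% of the
# remaining 80% (= 20% of the total) for validation.
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.2, random_state=SEED)
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=SEED)

# Step 2: default parameters, trained on train and scored on test.
default_model = KNeighborsClassifier().fit(X_train, y_train)
default_acc = accuracy_score(y_test, default_model.predict(X_test))

# Step 3: fit on train for each k and store the validation accuracy.
val_acc = {}
for k in K_VALUES:
    model = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    val_acc[k] = accuracy_score(y_val, model.predict(X_val))
best_k = max(val_acc, key=val_acc.get)  # assumption: ties go to the smallest k


def report(model):
    """Accuracy, one-vs-rest AUC, and macro f-score on the test set."""
    pred = model.predict(X_test)
    proba = model.predict_proba(X_test)
    return {
        "accuracy": accuracy_score(y_test, pred),
        "auc": roc_auc_score(y_test, proba, multi_class="ovr"),
        "f1": f1_score(y_test, pred, average="macro"),
    }


# Step 4: refit with the best k and report test metrics.
basic = KNeighborsClassifier(n_neighbors=best_k).fit(X_train, y_train)
basic_scores = report(basic)

# Steps 5-7: one candidate improvement (an assumption, not a prescribed
# recipe): standardize features, weight neighbours by inverse distance,
# and use the Manhattan metric.
improved = make_pipeline(
    StandardScaler(),
    KNeighborsClassifier(n_neighbors=best_k, weights="distance",
                         metric="manhattan"),
).fit(X_train, y_train)
improved_scores = report(improved)

print("default accuracy:", default_acc)
print("validation accuracy by k:", val_acc)
print("best k:", best_k)
print("basic:", basic_scores)
print("improved:", improved_scores)
```

The val_acc dictionary holds the points needed for the accuracy-versus-k figure, for example via matplotlib.pyplot.plot(list(val_acc), list(val_acc.values())). In practice, candidate improvements would be compared on the validation set first, with the test set used only for the final report.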