Task:
All analyses must be performed in R using the tidyverse and glmnet packages discussed in class. Fill in all your solutions in the appropriate spaces provided in this Word document, and then upload a PDF copy of your solutions to Canvas. Only PDF copies will be graded.
Brief overview of assignment
In this assignment you will use the dataset GlobalAncestry.csv, which is available on Canvas. You will analyze genetic data from 242 humans sampled across the world from six ancestries. The first column of the dataset, labeled ancestry, takes the following values:
African: San and Yoruban individuals from sub-Saharan Africa
European: Italian and Russian individuals from Europe
EastAsian: Chinese and Japanese individuals from East Asia
Oceanian: Melanesian and Papuan individuals from Oceania
NativeAmerican: Pima and Mayan individuals from the Americas
Mexican: Mexican individuals from the Americas
Unknown1: Unknown ancestry
Unknown2: Unknown ancestry
Unknown3: Unknown ancestry
Unknown4: Unknown ancestry
Unknown5: Unknown ancestry
GlobalAncestry.csv is a large dataset with genetic data for 242 individuals at 8916 genomic locations. As we discussed in our introductory lecture for this course, each individual will have a value of 0, 1, or 2 at each of these genomic locations, indicating the “genotype” that the individual has at this location.
Training a lasso penalized multinomial regression classifier
The goal is to train a multinomial regression classifier to predict K=5 ancestries (African, European, EastAsian, Oceanian, and NativeAmerican). The training dataset will consist only of individuals with African, European, EastAsian, Oceanian, and NativeAmerican ancestries, and the best classifier will be determined by lasso-penalized multinomial regression and 10-fold cross validation. You will consider 100 tuning parameter values (λ), evenly spaced on a base-10 logarithmic scale between 0.001 and 1000, as we have highlighted several times in class. You will then choose the simplest classifier that is within 1 standard error of the best classifier.
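As a reminder of what "evenly on a base-10 logarithmic scale" means, the grid of tuning parameter values can be sketched in base R as below. The object name lambdas is only illustrative. The values are listed from largest to smallest because the glmnet documentation recommends supplying a decreasing sequence of λ values.

```r
# 100 tuning parameter values from 10^3 = 1000 down to 10^-3 = 0.001.
# The exponents are evenly spaced, so the values are evenly spaced on a
# base-10 logarithmic scale (not on the original scale).
lambdas <- 10^seq(3, -3, length.out = 100)

range(lambdas)               # smallest and largest values in the grid
diff(log10(lambdas))[1:3]    # constant step between consecutive exponents
```

A grid like this can be passed through the lambda argument of glmnet() and cv.glmnet(). The object returned by cv.glmnet() stores the largest λ whose cross-validation error is within 1 standard error of the minimum in its lambda.1se element.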
Predicting ancestry of individuals with unknown ancestry
You will then use this classifier to predict the ancestries of the five unknown individuals (Unknown1, Unknown2, Unknown3, Unknown4, and Unknown5) based on their genetics.
Predicting ancestry proportions of individuals with Mexican ancestry
You will also use predicted class probabilities to estimate the fraction of ancestry that each individual of Mexican descent has from each of the five continental ancestries used to train the classifier. You will then use violin plots to visualize the distributions of these probabilities across the set of individuals of Mexican ancestry, and hypothesize about the historical reasons for the ancestry distributions you observe.
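The reshaping step needed before drawing violin plots can be sketched as follows. This is only an illustration on simulated numbers: the probabilities are random values normalized so that each row sums to 1, not predictions from GlobalAncestry.csv, and the object names (probs, probs_long, violins) are hypothetical.

```r
library(tidyverse)

set.seed(42)
ancestries <- c("African", "European", "EastAsian", "Oceanian", "NativeAmerican")

# Simulated stand-in for a matrix of predicted class probabilities:
# 10 individuals (rows) by 5 ancestries (columns), each row summing to 1.
raw <- matrix(runif(10 * 5), nrow = 10, dimnames = list(NULL, ancestries))
probs <- raw / rowSums(raw)

# One row per (individual, ancestry) pair, the layout ggplot2 expects.
probs_long <- as_tibble(probs) %>%
  mutate(individual = row_number()) %>%
  pivot_longer(-individual, names_to = "ancestry", values_to = "probability")

# One violin per ancestry, showing the spread of probabilities across individuals.
violins <- ggplot(probs_long, aes(x = ancestry, y = probability)) +
  geom_violin()
```

Printing violins displays the plot. With real predictions, probs would be replaced by the class probabilities produced by your classifier.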
Instructions for loading GlobalAncestry dataset into your RStudio Cloud environment
Recall that to upload a file to RStudio Cloud, you first must download the GlobalAncestry.csv file to your computer. Once the file is downloaded, within the “Files” panel of the RStudio Cloud environment, click “Upload” and browse to the appropriate directory on your computer to upload the GlobalAncestry.csv file.
The GlobalAncestry.csv file can be loaded using the read_csv() function of the readr package that comes loaded with tidyverse, and assigned to an object called GlobalAncestry as follows:
library(tidyverse)
GlobalAncestry <- read_csv("GlobalAncestry.csv")
If you are having trouble loading the file, then refer back to the video lecture on Linear Regression where this was demonstrated in class.
Note about using glmnet for classification
When using glmnet, you will not need to recode classes as values 1, 2, 3, etc. We only performed this recoding in class to illustrate the connection with using linear regression applied to a response with values 0 and 1, as linear regression requires a quantitative response. Therefore, do not recode the ancestry values in the dataset, and simply use the values as is.
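To illustrate that glmnet works directly with class labels, here is a minimal sketch on a small simulated dataset. Everything in it is made up for illustration: the class names PopA, PopB, and PopC, the sample size, and the 20 simulated features are assumptions and are unrelated to GlobalAncestry.csv. Wrapping the labels in factor() keeps the text labels and does not turn them into 1, 2, 3.

```r
library(glmnet)

set.seed(1)
classes <- c("PopA", "PopB", "PopC")   # hypothetical class labels
y <- rep(classes, each = 30)           # text labels, used as is
n <- length(y)
p <- 20

# Simulated genotypes taking values 0, 1, or 2 at p locations.
x <- matrix(rbinom(n * p, size = 2, prob = 0.5), nrow = n, ncol = p)

# Make the first three locations informative: location j tends to have
# higher values for individuals in class j.
for (j in seq_along(classes)) {
  x[, j] <- rbinom(n, size = 2, prob = ifelse(y == classes[j], 0.9, 0.1))
}

# Lasso-penalized (alpha = 1) multinomial regression with text labels as the response.
fit <- glmnet(x, factor(y), family = "multinomial", alpha = 1)

# Predicted classes come back as the original labels.
pred <- predict(fit, newx = x[1:5, ], s = 0.05, type = "class")
```

Here pred contains labels drawn from PopA, PopB, and PopC rather than numeric codes. The value s = 0.05 is an arbitrary choice for this illustration.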
Questions and problems
1. [10%] Load the GlobalAncestry.csv dataset, and split it into three separate datasets: training dataset, test dataset of unknown ancestries, and test dataset of Mexican ancestry. That is, create the following three datasets:
1. Training data frame called train, which only includes observations with ancestry values African, European, EastAsian, Oceanian, and NativeAmerican.
2. Test data frame called test, which only includes observations with ancestry values Unknown1, Unknown2, Unknown3, Unknown4, and Unknown5.
3. Test data frame called testmex, which only includes observations with ancestry value Mexican.
Provide code below:
2. [20%] Apply glmnet to the training dataset train from Question 1, to train a multinomial regression classifier with a lasso penalty across 100 tuning parameter (λ) values, evenly spaced on a base-10 logarithmic scale between 0.001 and 1000. The response will be ancestry, and the input features will be the values at the set of 8916 genomic locations. Train this lasso-penalized multinomial regression model across the 100 tuning parameter values, and plot the regression coefficients for each of the K=5 classes as a function of log(λ). Based on these results, does it appear that regularization and feature selection are working? Explain your answer.