Please read the following instructions carefully before you start answering the questions:
•Please answer BOTH questions.
•Please write your code in the R code template or the R Markdown template. After you have completed all the questions, run your R code and save the R outputs (from the console or from R Markdown), in the order they are produced, by copying and pasting them into a separate Word file. If you are coding in the .R file template, you may use the Source function in the top-right corner to run all lines in the file at once.
•If you are using the Rmd template, do not leave out console outputs when a cell generates both console output and figure output.
•Plots and diagrams should be saved in PNG format and uploaded as separate files. Make sure you name your saved figures using the question number.
•Upload your file (or files) to FASER.
•There are 100 marks in total.
The “Default” data set, from library(ISLR), is a simulated data set containing information on ten thousand customers. The aim here is to predict which customers will default on their credit card debt. Default contains 10000 observations on the following variables.
•default: A factor with levels No and Yes indicating whether the customer defaulted on their debt
•student: A factor with levels No and Yes indicating whether the customer is a student
•balance: The average balance that the customer has remaining on their credit card after making their monthly payment
•income: Income of customer.
Use R to answer the following questions:
1) Show the first 10 entries in the Default data set. [5 marks]
2) Load the boot library to enable cross-validation. Conduct cross-validation on the data set using the following steps:
(i) Using the last three digits of your registration number as the random seed, split the data set into a training set and a test set. Your training set should contain 7000 randomly selected observations from the ‘Default’ data set.
(ii) Taking ‘default’ as the response variable and all other columns as features, fit a logistic regression model on the training set.
(iii) Define a cost function to compute the classification error rate for the cross-validation pipeline. Use 0.3 as the prediction threshold, i.e. classify as ‘default’ if the predicted probability is > 0.3.
(iv) Compute the cross-validation error of your model on the training set, using a threshold of 0.3.
(v) Define another cost function that uses 0.5 as the threshold instead. Compute the cross-validation error using the new cost function.
3) Using the fitted model in 2), make predictions on the test set using 0.3 and 0.5 as the threshold respectively, and compute the test classification error rate for each. Note that the default column takes the values No and Yes.
4) Using the predictions in 3) and the table() function, create the confusion matrices on the test set. You should create two matrices, one corresponding to a threshold of 0.3 and the other to a threshold of 0.5.
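As a rough illustration of steps 2) to 4), the sketch below shows one way a threshold-based cost function could be written and passed to cv.glm() from the boot package. The helper name make_cost, the seed 123 (a placeholder for the last three digits of a registration number) and the choice of K = 10 folds are assumptions made for illustration only; the question does not specify the number of folds, and cv.glm() defaults to leave-one-out. The part that uses the Default data only runs if the ISLR package is installed.

```r
library(boot)  # provides cv.glm()

# Hypothetical helper: builds a cost function for a given threshold.
# cv.glm() calls cost(observed, predicted), where observed is the 0/1
# response and predicted is the fitted probability.
make_cost <- function(threshold) {
  function(observed, prob) mean(as.numeric(prob > threshold) != observed)
}
cost_03 <- make_cost(0.3)
cost_05 <- make_cost(0.5)

# Small hand-checkable example (made-up numbers, not from the data set)
obs  <- c(0, 0, 1, 1, 1)
prob <- c(0.1, 0.4, 0.35, 0.45, 0.9)
err_03 <- cost_03(obs, prob)  # one of five misclassified
err_05 <- cost_05(obs, prob)  # two of five misclassified

if (requireNamespace("ISLR", quietly = TRUE)) {
  data("Default", package = "ISLR")
  set.seed(123)  # placeholder seed (assumption)
  train_idx <- sample(nrow(Default), 7000)
  train <- Default[train_idx, ]
  test  <- Default[-train_idx, ]

  # Logistic regression of default on all other columns
  fit <- glm(default ~ ., data = train, family = binomial)

  # K = 10 is an assumption; the first element of delta is the raw CV error
  cv_err_03 <- cv.glm(train, fit, cost = cost_03, K = 10)$delta[1]
  cv_err_05 <- cv.glm(train, fit, cost = cost_05, K = 10)$delta[1]

  # Test-set predictions, error rate and confusion matrix at threshold 0.3
  test_prob   <- predict(fit, newdata = test, type = "response")
  pred_03     <- ifelse(test_prob > 0.3, "Yes", "No")
  test_err_03 <- mean(pred_03 != test$default)
  conf_03     <- table(predicted = pred_03, actual = test$default)
}
```

The same pattern with a threshold of 0.5 would give the second test error rate and confusion matrix.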
You are given the following information. Let $X$ denote the claim from a car insurance and assume that $X$ follows a Normal distribution $N(\theta, \sigma^2)$ and $\theta$ follows a Normal prior distribution $N(\mu, 5)$. Assume that we observe $x_1, x_2, \dots, x_{12}$ claims, and $\bar{x} = 22.5$ is the average value of these 12 claims. Then the posterior distribution of $\theta$ is a Normal distribution $N(\theta_1, \sigma_1^2)$ where
$$\theta_1 = \frac{60\bar{x} + \sigma^2 \mu}{60 + \sigma^2}, \qquad \sigma_1^2 = \frac{5\sigma^2}{60 + \sigma^2}.$$
Given $\sigma^2 = 30$ and $\mu = 3$, use R to answer the following questions:
1) Set the random seed to the last three digits of your registration number, generate 50 random samples of $\theta$ from the posterior distribution $N(\theta_1, \sigma_1^2)$, and calculate the sample mean and sample variance.
2) Plot the histogram of the $\theta$ values generated in 1) with the option freq=FALSE, and add the curve of the posterior density of $\theta$ to the histogram.
3) Repeat 1)-2) with sample size 5000. Comment on the results in 1)-3).
4) Use the Monte Carlo integration method with sample size 10000, setting the random seed to the last three digits of your registration number, to estimate the expectation of $g(\theta) = \exp(-0.1\theta^2 + 25)$, where $\theta$ follows the posterior distribution $N(\theta_1, \sigma_1^2)$. Using the same 10000 samples of $\theta$, calculate the estimate of $g(E(\theta))$.
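To make the posterior formulas and the Monte Carlo step concrete, here is a minimal sketch, assuming the seed 123 as a placeholder for the last three digits of a registration number. The variable names (xbar, sigma_sq, theta1, sigma1_sq and so on) are illustrative choices, not prescribed by the question. Plugging the given values into the formulas gives a posterior mean of 16 and a posterior variance of 5/3.

```r
# Given values from the question
xbar     <- 22.5
sigma_sq <- 30
mu       <- 3

# Posterior parameters from the stated formulas
theta1    <- (60 * xbar + sigma_sq * mu) / (60 + sigma_sq)  # posterior mean
sigma1_sq <- 5 * sigma_sq / (60 + sigma_sq)                 # posterior variance

# 50 draws from the posterior; rnorm() takes a standard deviation, not a variance
set.seed(123)  # placeholder seed (assumption)
theta_small <- rnorm(50, mean = theta1, sd = sqrt(sigma1_sq))
sample_mean <- mean(theta_small)
sample_var  <- var(theta_small)

# Monte Carlo integration: average g over draws from the posterior
g <- function(theta) exp(-0.1 * theta^2 + 25)
set.seed(123)  # placeholder seed (assumption)
theta_mc    <- rnorm(10000, mean = theta1, sd = sqrt(sigma1_sq))
mc_estimate <- mean(g(theta_mc))  # estimate of E[g(theta)]
g_of_mean   <- g(mean(theta_mc))  # estimate of g(E[theta])
```

Because g is nonlinear, the two final quantities estimate different things and should not be expected to agree.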