ETF5952 Quantitative Methods for Risk Analysis
Task
Question 1 (30 points=5+5+5+5+5+5)
This question asks you to analyze the data set “Wage” in the ISLR package. Load the ISLR package and list the variables in “Wage”. Before your analysis, check what each variable means, either by reading the ISLR PDF file on Moodle (R section) or by searching for information online.
1. Create a new variable, the log of wage, and store it in the data set (or data frame) “Wage”. Report summary statistics of wage and log wage.
2. Compare the mean and median of wage, and discuss why they are different or similar (less than 30 words).
Compare the mean and median of log wage, and discuss why they are different or similar (less than 30 words).
3. Obtain and provide histograms of wage and log wage, separately.
4. Conduct a nonparametric bootstrap to obtain 95% confidence intervals for the means of wage and log wage. Set the number of bootstrap repetitions to 5000 (B=5000). Compare the lengths of the confidence intervals and discuss the uncertainty of the estimators of the means of wage and log wage (less than 30 words).
5. Obtain a box plot of wage for each education level (education). (Use wage here, not log wage.)
6. Estimate two linear regression models:
1st. Regress wage on age and education level (dummies).
2nd. Regress log(wage) on log(age) and education level (dummies).
Provide the estimation results.
(a) Interpret the effect of age on wage in the first regression (less than 30 words).
(b) Interpret the effect of postgraduate education (Advanced Degree) on wage in the first regression (less than 30 words).
(c) Interpret the effect of age on wage in the second regression (less than 30 words).
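As a rough illustration of the bootstrap step in item 4, the following R sketch computes percentile bootstrap intervals for a mean. It is only one common variant of the nonparametric bootstrap, not necessarily the one intended in class. The simulated `wage` vector is a hypothetical stand-in for `Wage$wage`, and the helper name `boot_ci_mean` is invented for this example.

```r
# Hypothetical stand-in for Wage$wage: a right-skewed, positive sample.
# Replace with Wage$wage after library(ISLR) when working on the real data.
set.seed(1)
wage <- rlnorm(3000, meanlog = 4.65, sdlog = 0.35)

# Percentile bootstrap confidence interval for the mean of x.
# B resamples of size n are drawn with replacement from x.
boot_ci_mean <- function(x, B = 5000, level = 0.95) {
  n <- length(x)
  boot_means <- replicate(B, mean(sample(x, size = n, replace = TRUE)))
  alpha <- 1 - level
  quantile(boot_means, probs = c(alpha / 2, 1 - alpha / 2), names = FALSE)
}

ci_wage    <- boot_ci_mean(wage)
ci_logwage <- boot_ci_mean(log(wage))

# Interval lengths (upper bound minus lower bound)
len_wage    <- diff(ci_wage)
len_logwage <- diff(ci_logwage)

print(rbind(wage = ci_wage, logwage = ci_logwage))
print(c(len_wage = len_wage, len_logwage = len_logwage))
```

Note that the two intervals are measured on different scales (wage units versus log units), which matters when their lengths are compared.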
Question 2 (40 points=10+10+10+10)
In this question, you analyze a data set on the number of new infections (positive cases) in Australia. Use the data set, “aus covid.csv”, which includes four variables:
- time is a sequence from 1 to 77 in order.
- date is a variable of date.
- case is the number of new positive cases.
- lag.case is case lagged by one day.
1. Obtain and present a time-series plot with time on the x-axis and case on the y-axis. Use a red line.
2. Estimate the following models:
- $\text{case} = \beta_0 + \beta_1\,\text{time} + \varepsilon$,
- $\text{case} = \beta_0 + \beta_1\,\text{time} + \beta_2\,\text{time}^2 + \varepsilon$.
(a) For each regression, plot the fitted values (blue line) together with the time series (red line).
(b) Using the two plots, explain the performance of the two models.
(c) Obtain and provide the autocorrelation function of the residuals from the two regressions. Explain the result (less than 30 words).
3. Estimate the following autoregressive models:
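- $\text{case} = \beta_0 + \alpha\,\text{lag.case} + \beta_1\,\text{time} + \varepsilon$,
- $\text{case} = \beta_0 + \alpha\,\text{lag.case} + \beta_1\,\text{time} + \beta_2\,\text{time}^2 + \varepsilon$.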
(a) Report the estimation results and explain the autoregressive parameter, α (less than 30 words).
(b) For each regression, plot and present the fitted values (blue) together with the time series (red), in separate plots. Explain the performance of the two models (less than 50 words).
(c) Obtain and provide the autocorrelation function of the residuals from the two regressions. Explain the result (less than 50 words).
4. Use information criteria (AIC, AICc and BIC) to compare the four models. Report the result and explain which model is best according to the information criteria (less than 30 words).
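The comparison in item 4 can be sketched in R as follows. This is a minimal illustration on simulated data, not on the actual case series: the data-generating process, the object names (`fits`, `ic_table`) and the helper `aicc` are all assumptions made for this example. The small-sample correction used here is the common form AICc = AIC + 2k(k+1)/(n-k-1), where k counts all estimated parameters, including the error variance.

```r
set.seed(2)
n <- 77
time <- 1:n

# Hypothetical stand-in for the case series: quadratic trend plus AR(1) noise
noise <- as.numeric(arima.sim(model = list(ar = 0.6), n = n, sd = 20))
case  <- 50 + 12 * time - 0.15 * time^2 + noise

dat <- data.frame(time = time, case = case, lag.case = c(NA, case[-n]))
dat <- na.omit(dat)  # drop the first row so all four models use the same sample

fits <- list(
  linear    = lm(case ~ time, data = dat),
  quadratic = lm(case ~ time + I(time^2), data = dat),
  ar_linear = lm(case ~ lag.case + time, data = dat),
  ar_quad   = lm(case ~ lag.case + time + I(time^2), data = dat)
)

# Small-sample corrected AIC (helper written for this sketch)
aicc <- function(fit) {
  k     <- attr(logLik(fit), "df")  # number of parameters, incl. error variance
  n_obs <- nobs(fit)
  AIC(fit) + 2 * k * (k + 1) / (n_obs - k - 1)
}

ic_table <- data.frame(
  AIC  = sapply(fits, AIC),
  AICc = sapply(fits, aicc),
  BIC  = sapply(fits, BIC)
)
print(round(ic_table, 2))
```

Dropping the row with the missing lag keeps the sample identical across models; information criteria computed on different numbers of observations are not directly comparable.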
Question 3 (30 points=6+6+6+6+6)
For this question, use the data set “spam.csv”, which was used in class. Here, the dependent variable is spam, which takes the value 1 for a spam email and 0 otherwise, and the remaining variables are potential regressors. For regression, consider the logit model (“binomial” option).
1. Apply the forward stepwise procedure to select regressors (Hint: use the function step). Report the number of variables selected by the procedure.
2. Obtain and report $R^2$ for two logit models: one with all available regressors and one with the regressors selected by the forward stepwise procedure.
3. Using cross-validation, compare the full model and the model selected by the forward stepwise procedure. Report the result and explain which model is better.
4. Using information criteria (AIC, AICc and BIC), compare the full model and the model selected by the forward stepwise procedure. Report the result and explain which model is better.
5. Use the lasso to estimate a regression model. Report the number of non-zero coefficients.
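For items 1 and 2, a forward stepwise search with `step` might look like the sketch below. It runs on simulated data standing in for “spam.csv”; the regressor names `x1` to `x6`, the coefficients used to simulate the response, and the helper `r2_dev` are all hypothetical. The deviance-based R² shown here, 1 - deviance/null deviance, is one common choice for logit models and is assumed rather than taken from the assignment.

```r
set.seed(3)
n <- 500

# Hypothetical stand-in for spam.csv: six regressors, only x1 and x2 matter
X <- as.data.frame(matrix(rnorm(n * 6), n, 6))
names(X) <- paste0("x", 1:6)
eta <- -0.5 + 1.5 * X$x1 - 1.0 * X$x2
dat <- cbind(spam = rbinom(n, 1, plogis(eta)), X)

null_fit <- glm(spam ~ 1, family = "binomial", data = dat)  # intercept only
full_fit <- glm(spam ~ ., family = "binomial", data = dat)  # all regressors

# Forward search from the null model up to the full model.
# step() uses AIC as its selection criterion by default.
fwd_fit <- step(null_fit, scope = formula(full_fit),
                direction = "forward", trace = 0)

# Number of selected regressors (intercept excluded)
n_selected <- length(coef(fwd_fit)) - 1

# Deviance-based R^2 (helper written for this sketch)
r2_dev <- function(fit) 1 - fit$deviance / fit$null.deviance

print(n_selected)
print(c(full = r2_dev(full_fit), forward = r2_dev(fwd_fit)))
```

Because the forward model is nested in the full model, its in-sample R² cannot exceed that of the full model, which is why items 3 and 4 ask for out-of-sample and penalized comparisons.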