1. Real estate dataset and variables. (20 points)
The dataset contains a collection of recent real estate listings in and around San Luis Obispo county. The dataset is provided in the file RealEstate. You may use any package for this question. The dataset contains the following variables:
• Price: the most recent listing price of the house (in dollars).
• Bedrooms: number of bedrooms.
• Bathrooms: number of bathrooms.
• Size: size of the house in square feet.
• Status: the status of the sale (a categorical variable).
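Before fitting the regression models below, the categorical Status column must be expanded into indicator columns. A minimal sketch with pandas, using toy rows and hypothetical status values standing in for the real listings:

```python
import pandas as pd

# Toy rows standing in for the real listings; column names follow the
# variable list above. The Status values here are placeholders.
df = pd.DataFrame({
    "Price": [399000, 545000, 289000],
    "Bedrooms": [3, 4, 2],
    "Bathrooms": [2, 3, 1],
    "Size": [1500, 2200, 980],
    "Status": ["Regular", "Short Sale", "Regular"],
})

# One-hot encode the categorical Status column: each distinct status
# becomes its own 0/1 indicator column, and the original column is dropped.
encoded = pd.get_dummies(df, columns=["Status"])
print(encoded.columns.tolist())
```

`pd.get_dummies` names the new columns `Status_<value>`, so the encoded frame can be fed directly to a linear model.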
(a) (10 points) Fit a Ridge regression model to predict Price from all other variables. You can use one-hot encoding to expand the categorical variable Status. Use cross-validation to select the optimal regularization parameter, and show the CV curve. Report the fitted model (i.e., the parameters) and the sum-of-squared residuals.
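The Ridge workflow above (CV curve over a grid of regularization strengths, then refit at the best value) can be sketched as follows, with a synthetic matrix standing in for the encoded real estate data:

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
# Synthetic stand-in for the (one-hot-encoded) design matrix and Price vector.
X = rng.normal(size=(120, 5))
y = X @ np.array([3.0, -2.0, 0.0, 1.5, 0.5]) + rng.normal(scale=0.5, size=120)

alphas = np.logspace(-3, 3, 30)          # search grid for the regularizer
cv_mse = []
for a in alphas:
    scores = cross_val_score(Ridge(alpha=a), X, y,
                             scoring="neg_mean_squared_error", cv=5)
    cv_mse.append(-scores.mean())        # mean CV error per alpha (the CV curve)

best_alpha = alphas[int(np.argmin(cv_mse))]
model = Ridge(alpha=best_alpha).fit(X, y)
rss = float(np.sum((y - model.predict(X)) ** 2))   # sum-of-squared residuals
print(best_alpha, model.coef_, rss)
```

Plotting `cv_mse` against `alphas` (log scale) gives the CV curve; `model.coef_` and `model.intercept_` are the reported parameters.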
(b) (10 points) Use Lasso to select variables. Use cross-validation to select the optimal regularization parameter, and show the CV curve. Report the fitted model (i.e., the variables selected and their coefficients). Show the Lasso solution path. You may use any package for this, and choose a suitable search range for the regularization parameter.
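A minimal sketch of the Lasso part, again on synthetic data: `LassoCV` performs the cross-validated choice of the regularization parameter, and `lasso_path` computes the coefficients along a grid of penalties for the solution-path plot.

```python
import numpy as np
from sklearn.linear_model import LassoCV, lasso_path

rng = np.random.default_rng(0)
# Synthetic stand-in: only features 0 and 3 carry signal, so Lasso
# should select them and zero out the rest.
X = rng.normal(size=(120, 5))
y = X @ np.array([3.0, 0.0, 0.0, 1.5, 0.0]) + rng.normal(scale=0.5, size=120)

# Cross-validated choice of the regularization parameter.
cv_model = LassoCV(cv=5, alphas=np.logspace(-3, 1, 50)).fit(X, y)
selected = np.flatnonzero(cv_model.coef_)   # indices of selected variables

# Solution path: coefficient of each variable as a function of alpha;
# plotting coefs.T against alphas gives the standard path plot.
alphas, coefs, _ = lasso_path(X, y, alphas=np.logspace(-3, 1, 50))
print(cv_model.alpha_, selected, coefs.shape)
```

`cv_model.mse_path_` holds the per-fold CV errors, which can be averaged to draw the CV curve.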
2. AdaBoost. (25 points)
Consider the following dataset, plotted in the figure below. The first two coordinates represent the values of two features, and the last coordinate is the binary label of the data. In this problem, you will run through T = 3 iterations of AdaBoost with decision stumps (as explained in the lecture) as the weak learners.
(a) (15 points) Draw the decision stump chosen at each iteration, and fill in Table 1 with the corresponding AdaBoost parameter values.
(b) (10 points) What is the training error of this AdaBoost classifier? Give a short explanation of why AdaBoost can outperform a single decision stump.
Table 1: Values of the AdaBoost parameters at each timestep.
3. Random forest and one-class SVM for an email spam classifier. (25 points)
Please download the data from the website. The collection of non-spam emails came from filed work and personal emails; hence the word 'george' and the area code '650' are indicators of non-spam. These are useful when constructing a personalized spam filter. You are free to choose any package for this homework. Note: there may be some missing values, which you can simply fill in.
(a) (5 points) Build a CART model and plot the classification tree.
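A minimal CART sketch on synthetic stand-in data (random frequency-like features and an invented labeling rule; the feature names are hypothetical, echoing the 'george'/'650' indicators mentioned above). `export_text` prints the tree; `plot_tree` from the same module draws it graphically.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

rng = np.random.default_rng(0)
# Synthetic stand-in for the spam data: 4 frequency features, 0/1 label.
X = rng.random(size=(200, 4))
y = (X[:, 0] + 0.5 * X[:, 1] > 0.8).astype(int)   # hypothetical spam rule

# CART: a single classification tree, depth-limited for readability.
cart = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)
print(export_text(cart, feature_names=["george", "650", "free", "money"]))
```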
(b) (1f1 points) Non' aiso build a random ferent model. Randomly shuttle the data and partition to for training and the remaining 2O for
classiification tree and random fastest the Itumber of try for the random forest. and plot the test error for the €jART model (which should be a stant wit h respect to the number of trees}.
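The comparison in (b) can be sketched as follows, again on synthetic stand-in data: one 80/20 shuffled split, a fixed CART baseline, and forests of increasing size evaluated on the same test block.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(1)
# Synthetic stand-in for the spam data.
X = rng.random(size=(300, 6))
y = (X[:, 0] + X[:, 1] > 1.0).astype(int)

# Randomly shuffle, then split 80% train / 20% test.
idx = rng.permutation(len(X))
cut = int(0.8 * len(X))
tr, te = idx[:cut], idx[cut:]

# CART baseline: its test error does not depend on the number of trees.
cart = DecisionTreeClassifier(random_state=0).fit(X[tr], y[tr])
cart_err = 1 - cart.score(X[te], y[te])

# Random forest test error as a function of the number of trees.
ntrees = [1, 5, 10, 25, 50, 100]
rf_err = []
for n in ntrees:
    rf = RandomForestClassifier(n_estimators=n, random_state=0).fit(X[tr], y[tr])
    rf_err.append(1 - rf.score(X[te], y[te]))

print(cart_err, rf_err)
```

Plot `rf_err` against `ntrees`, with `cart_err` drawn as a horizontal line, to get the requested figure.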
(c) (10 points) Now you will use a one-class SVM approach for spam filtering. Randomly shuffle the data and partition it to use 80% for training and the remaining 20% for testing. Extract all non-spam emails from the training block (the 80% of the data you selected) to build the one-class kernel SVM. Then apply it to the 20% testing data (thus this is a novelty detection situation), and report the total misclassification error rate on the testing data.
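The novelty-detection setup in (c) can be sketched as follows: train `OneClassSVM` on non-spam only, then score a mixed test set where spam should be flagged as outliers. The Gaussian clusters below are synthetic stand-ins, and `nu=0.05` is an illustrative choice for the outlier fraction, not a value from the assignment.

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)
# Synthetic stand-in: non-spam clustered near the origin, spam shifted away.
nonspam = rng.normal(loc=0.0, scale=1.0, size=(200, 4))
spam = rng.normal(loc=4.0, scale=1.0, size=(50, 4))

# Train on non-spam only (novelty detection); spam is never seen in training.
ocsvm = OneClassSVM(kernel="rbf", gamma="scale", nu=0.05).fit(nonspam[:160])

# Test on held-out non-spam plus spam; +1 = inlier (non-spam), -1 = outlier.
X_test = np.vstack([nonspam[160:], spam])
y_test = np.array([1] * 40 + [-1] * 50)
pred = ocsvm.predict(X_test)
err = float(np.mean(pred != y_test))   # total misclassification error rate
print(err)
```

Both kinds of mistakes, non-spam flagged as outliers and spam accepted as inliers, count toward the reported error rate.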