Stats problems: CI - var - hypo test - regression - ANOVA

Statistics Problems on Confidence Interval, Variance, Hypothesis Testing, Regression, ANOVA, and Pre

Problem 1

Suppose µ is the average height of a college male. You measure the heights (in inches) of twenty college men, getting data x1, ..., x20, with sample mean x̄ = 68.55 in. and sample variance s² = 16.24 in². Suppose that the xi are drawn from a normal distribution with unknown mean µ and unknown variance σ².

(a) Using the sample mean and variance, construct a 90% confidence interval for µ.

(b) Now suppose you are told that the height of a college male is normally distributed with standard deviation 3.27 in. Construct a 90% confidence interval for µ.

(c) In part (b), how many people in total would you need to measure to bring the width of the 90% confidence interval down to 1 inch?

(d) Consider again the case of unknown variance in (a). Based on this sample variance of 16.24 in², how many people in total should you expect to need to measure to bring the width of the 90% confidence interval down to 1 inch? Is it guaranteed that this number will be sufficient? Explain your reasoning.
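One way to set up parts (a) through (d) in R is sketched below. It assumes the usual t-based interval when the variance is unknown and the z-based interval when σ is known. The variable names are illustrative and are not part of the problem statement.

```r
n <- 20
xbar <- 68.55
s <- sqrt(16.24)          # sample standard deviation
alpha <- 1 - 0.90

# (a) unknown variance: t interval with n - 1 degrees of freedom
ci_t <- xbar + c(-1, 1) * qt(1 - alpha / 2, df = n - 1) * s / sqrt(n)

# (b) known sigma: z interval
sigma <- 3.27
z <- qnorm(1 - alpha / 2)
ci_z <- xbar + c(-1, 1) * z * sigma / sqrt(n)

# (c) smallest n with interval width 2 * z * sigma / sqrt(n) <= 1
n_z <- ceiling((2 * z * sigma / 1)^2)

# (d) smallest n whose t interval has width <= 1,
#     assuming (hypothetically) that s stays at sqrt(16.24)
ns <- 2:1000
widths <- 2 * qt(1 - alpha / 2, df = ns - 1) * s / sqrt(ns)
n_t <- ns[which(widths <= 1)[1]]

print(ci_t); print(ci_z); print(n_z); print(n_t)
```

The value in n_t is only a planning estimate, because s is itself random: a new sample of that size can produce a larger s and therefore a wider interval.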

In a study on cholesterol levels a sample of 12 men was chosen. The plasma cholesterol levels (mmol/L) of the subjects were as follows:

6.0, 6.4, 7.0, 5.8, 6.0, 5.8, 5.9, 6.7, 6.1, 6.5, 6.3, 5.8

(a) Estimate the variance of the plasma cholesterol levels with a 99% confidence interval.

(b) What assumptions did you make about the sample in order to make your estimate?
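A minimal sketch of the variance interval in R, assuming the observations are an independent sample from a normal population, so that (n - 1)s²/σ² follows a chi-square distribution with n - 1 degrees of freedom. The object names are illustrative.

```r
chol <- c(6.0, 6.4, 7.0, 5.8, 6.0, 5.8, 5.9, 6.7, 6.1, 6.5, 6.3, 5.8)
n <- length(chol)
s2 <- var(chol)           # sample variance
alpha <- 1 - 0.99

# Chi-square interval for sigma^2: the upper quantile gives the lower limit
ci_var <- c(
  lower = (n - 1) * s2 / qchisq(1 - alpha / 2, df = n - 1),
  upper = (n - 1) * s2 / qchisq(alpha / 2, df = n - 1)
)
print(s2)
print(ci_var)
```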

During the 1999 and 2000 baseball seasons, there was much speculation that the unusually large number of home runs that were hit was due at least in part to a livelier ball. One way to test the “liveliness” of a baseball is to launch the ball at a vertical surface with a known velocity VL and measure the ratio of the outgoing velocity VO of the ball to VL. The ratio R = VO/VL is called the coefficient of restitution. Following are measurements of the coefficient of restitution for 40 randomly selected baseballs. The balls were thrown from a pitching machine at an oak surface.

0.6248 0.6520 0.6226 0.6230 0.6237 0.6368 0.6280 0.6131

0.6118 0.6220 0.6096 0.6223 0.6159 0.6151 0.6100 0.6297

0.6298 0.6121 0.6307 0.6435 0.6192 0.6548 0.6392 0.5978

0.6351 0.6128 0.6134 0.6275 0.6403 0.6310 0.6261 0.6521

0.6265 0.6262 0.6049 0.6214 0.6262 0.6170 0.6141 0.6314

(a) Is there evidence to support the assumption that the coefficient of restitution is normally distributed? Use α = 0.01.

Problem 2

(b) Does the data support the claim that the mean coefficient of restitution of baseballs exceeds 0.623? Use the relevant test statistic approach to support your response, assuming α = 0.01.

(d) Compute the power of the test if the true mean coefficient of restitution is as high as 0.63.

(e) What sample size would be required to detect a true mean coefficient of restitution as high as 0.63 if we wanted the power of the test to be at least 0.75?
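The parts above could be approached in R roughly as follows. This sketch assumes a Shapiro-Wilk test for normality and a one-sided, one-sample t-test, and it uses power.t.test with the sample standard deviation as a stand-in for the unknown σ. A textbook solution based on operating characteristic curves may give slightly different power and sample-size values.

```r
cor_data <- c(
  0.6248, 0.6520, 0.6226, 0.6230, 0.6237, 0.6368, 0.6280, 0.6131,
  0.6118, 0.6220, 0.6096, 0.6223, 0.6159, 0.6151, 0.6100, 0.6297,
  0.6298, 0.6121, 0.6307, 0.6435, 0.6192, 0.6548, 0.6392, 0.5978,
  0.6351, 0.6128, 0.6134, 0.6275, 0.6403, 0.6310, 0.6261, 0.6521,
  0.6265, 0.6262, 0.6049, 0.6214, 0.6262, 0.6170, 0.6141, 0.6314
)
n <- length(cor_data)
mu0 <- 0.623
alpha <- 0.01

# (a) normality check
normality <- shapiro.test(cor_data)

# (b) H0: mu = 0.623 versus H1: mu > 0.623, test statistic approach
t0 <- (mean(cor_data) - mu0) / (sd(cor_data) / sqrt(n))
t_crit <- qt(1 - alpha, df = n - 1)
reject_h0 <- t0 > t_crit
tt <- t.test(cor_data, mu = mu0, alternative = "greater")

# (d) power when the true mean is 0.63, treating sd(cor_data) as sigma
pw <- power.t.test(n = n, delta = 0.63 - mu0, sd = sd(cor_data),
                   sig.level = alpha, type = "one.sample",
                   alternative = "one.sided")

# (e) sample size needed for power of at least 0.75
ss <- power.t.test(power = 0.75, delta = 0.63 - mu0, sd = sd(cor_data),
                   sig.level = alpha, type = "one.sample",
                   alternative = "one.sided")

print(normality$p.value); print(c(t0 = t0, t_crit = t_crit))
print(pw$power); print(ceiling(ss$n))
```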

The electric power consumed each month by a chemical plant is thought to be related to the average ambient temperature (x1), the number of days in the month (x2), the average product purity (x3), and the tons of product produced (x4). The past year’s historical data are available and are presented in the following table:

y    x1  x2  x3  x4
240  25  24  91  100
236  31  21  90  95
270  45  24  88  110
274  60  25  87  88
301  65  25  91  94
316  72  26  94  99
300  80  25  91  97
296  84  24  86  96
267  75  24  88  110
276  60  25  91  100
288  50  25  90  105
261  38  23  89  98

(a) Fit a multiple linear regression model to these data. Estimate the regression coefficients and σ².

(b) What are the standard errors of the regression coefficients?

(d) Use the t-test to assess the contribution of each regressor to the model. Using α = 0.05, what conclusions can you draw?

(e) Find 99% confidence intervals on β1, β2, β3 and β4.

(f) Predict power consumption for a month in which x1 = 75°F, x2 = 24 days, x3 = 90%, and x4 = 98 tons. Find a 90% confidence interval on the mean of Y in this case.

(g) Find a 90% prediction interval on the power consumption for the case in part (f).

(h) Calculate R² for this model. Interpret this quantity.

(i) Plot the residuals versus the fitted values. Interpret this plot.

(j) Construct a normal probability plot of the residuals, and perform a Shapiro-Wilk test on them. Then comment on the normality assumption.
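A possible starting point in R for the regression parts is sketched below, using lm and the standard extractor functions. The data frame name power_data and the other object names are assumptions made for illustration.

```r
power_data <- data.frame(
  y  = c(240, 236, 270, 274, 301, 316, 300, 296, 267, 276, 288, 261),
  x1 = c(25, 31, 45, 60, 65, 72, 80, 84, 75, 60, 50, 38),
  x2 = c(24, 21, 24, 25, 25, 26, 25, 24, 24, 25, 25, 23),
  x3 = c(91, 90, 88, 87, 91, 94, 91, 86, 88, 91, 90, 89),
  x4 = c(100, 95, 110, 88, 94, 99, 97, 96, 110, 100, 105, 98)
)

# (a), (b), (d): estimates, standard errors, t statistics, p-values
fit <- lm(y ~ x1 + x2 + x3 + x4, data = power_data)
fit_summary <- summary(fit)
coef_table <- coef(fit_summary)
sigma2_hat <- fit_summary$sigma^2     # estimate of sigma^2

# (e) 99% confidence intervals on the coefficients
ci_99 <- confint(fit, level = 0.99)

# (f), (g): 90% interval on the mean response and 90% prediction interval
new_month <- data.frame(x1 = 75, x2 = 24, x3 = 90, x4 = 98)
mean_ci <- predict(fit, new_month, interval = "confidence", level = 0.90)
pred_int <- predict(fit, new_month, interval = "prediction", level = 0.90)

# (h) coefficient of determination
r2 <- fit_summary$r.squared

# (j) Shapiro-Wilk test on the residuals
resid_normality <- shapiro.test(resid(fit))

print(coef_table); print(sigma2_hat); print(ci_99)
print(mean_ci); print(pred_int); print(r2)
```

For parts (i) and (j), plot(fitted(fit), resid(fit)) draws the residuals against the fitted values, and qqnorm(resid(fit)) followed by qqline(resid(fit)) draws the normal probability plot.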

An experiment is conducted to determine the effect of C2F6 flow rate on the uniformity of the etch on a silicon wafer used in integrated circuit manufacturing. Three flow rates are used in the experiment, and the resulting uniformity (in percent) for six replicates is shown below.

Problem 3

C2F6 Flow   Observations
            1    2    3    4    5    6
125         2.7  3.6  2.6  3.0  3.2  3.8
160         4.9  4.6  5.0  4.8  3.6  4.2
250         4.6  3.4  2.9  3.8  4.1  5.1

(a) Does C2F6 flow rate affect etch uniformity? Construct box plots to compare the factor levels and perform the analysis of variance. Use α = 0.05.

(b) Check the homogeneity of variances and normality of residuals. Do the residuals indicate any problems with the underlying assumptions?

(d) Suppose that the true means for uniformity (in percent) at the indicated C2F6 flow rates (125, 160, 250) are 3.3, 4.4, and 4.0, respectively. Moreover, assume σ² = 0.5. What is the probability of detecting the effect of C2F6 flow rate on etch uniformity in this experiment? Use α = 0.05.
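The analysis could be set up in R along the following lines. This sketch assumes a one-way fixed-effects ANOVA, Bartlett's test as one possible check of equal variances, and power.anova.test for part (d). The data frame name etch is illustrative.

```r
etch <- data.frame(
  flow = factor(rep(c(125, 160, 250), each = 6)),
  uniformity = c(2.7, 3.6, 2.6, 3.0, 3.2, 3.8,
                 4.9, 4.6, 5.0, 4.8, 3.6, 4.2,
                 4.6, 3.4, 2.9, 3.8, 4.1, 5.1)
)

# (a) one-way analysis of variance
fit <- aov(uniformity ~ flow, data = etch)
anova_table <- summary(fit)[[1]]
p_value <- anova_table[["Pr(>F)"]][1]

# (b) homogeneity of variances and normality of residuals
var_check <- bartlett.test(uniformity ~ flow, data = etch)
norm_check <- shapiro.test(resid(fit))

# (d) power for the stated true means with sigma^2 = 0.5, six replicates
true_means <- c(3.3, 4.4, 4.0)
pw <- power.anova.test(groups = 3, n = 6,
                       between.var = var(true_means),
                       within.var = 0.5, sig.level = 0.05)

print(anova_table); print(var_check$p.value)
print(norm_check$p.value); print(pw$power)
```

The box plots asked for in part (a) can be drawn with boxplot(uniformity ~ flow, data = etch).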

In many criminal justice systems around the world, inmates deemed not to be a threat to society are released from prison under the parole system prior to completing their sentence.

They are still considered to be serving their sentence while on parole, and they can be returned to prison if they violate the terms of their parole.

Parole boards are charged with identifying which inmates are good candidates for release on parole. They seek to release inmates who will not commit additional crimes after release.

In this problem, we will build and validate a model that predicts if an inmate will violate the terms of his or her parole. Such a model could be useful to a parole board when deciding to approve or deny an application for parole.

We have a dataset of parolees who served no more than 6 months in prison and whose maximum sentence for all charges did not exceed 18 months. The dataset contains all such parolees who either successfully completed their term of parole during 2004 or those who violated the terms of their parole during that year. The dataset contains the following variables:

male: 1 if the parolee is male, 0 if female
race: 1 if the parolee is white, 2 otherwise
age: the parolee’s age (in years) when he or she was released from prison
state: a code for the parolee’s state. 2 is Kentucky, 3 is Louisiana, 4 is Virginia, and 1 is any other state. The three states were selected due to having a high representation in the dataset.

time.served: the number of months the parolee served in prison (limited by the inclusion criteria to not exceed 6 months).

max.sentence: the maximum sentence length for all charges, in months (limited by the inclusion criteria to not exceed 18 months).

multiple.offenses: 1 if the parolee was incarcerated for multiple offenses, 0 otherwise.
crime: a code for the parolee’s main crime leading to incarceration. 2 is larceny, 3 is drug-related crime, 4 is driving-related crime, and 1 is any other crime.

violator: 1 if the parolee violated the parole, and 0 if the parolee completed the parole without violation.

Load the dataset "parole" (from the "parole" sheet in the data file) into a data frame called parole, and investigate it using the str() and summary() functions.

(a) Which variables in this dataset should be considered as categorical variables (factors)? Change the type of those variables to factor. (2 points)

(b) Install the "caTools" package in R. To ensure consistent training/testing set splits, run the following 5 lines of code:

set.seed(144)
library(caTools)
split = sample.split(parole$violator, SplitRatio = 0.7)
train = subset(parole, split == TRUE)
test = subset(parole, split == FALSE)

Using glm (and remembering the parameter family="binomial"), train a logistic regression model on the training set. Your dependent variable is "violator", and you should use all of the other variables as independent variables.

What variables are significant in this model? Consider α = 0.05. (3 points)

The following two properties might be useful to you when answering this question:

1) If we have a coefficient c for a variable, then that means the log odds (or Logit) are increased by c for a unit increase in the variable.

2) If we have a coefficient c for a variable, then that means the odds are multiplied by e^c for a unit increase in the variable.
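The two properties can be checked numerically with a short R sketch. The coefficient and the baseline log odds below are hypothetical values chosen only for illustration; they do not come from the parole model.

```r
c_hyp <- 0.7              # hypothetical coefficient
log_odds_before <- -1.2   # hypothetical log odds before the unit increase

# Property 1: the log odds increase by c
log_odds_after <- log_odds_before + c_hyp

# Property 2: the odds are multiplied by exp(c)
odds_before <- exp(log_odds_before)
odds_after <- exp(log_odds_after)
odds_ratio <- odds_after / odds_before

# Converting odds to a probability: p = odds / (1 + odds)
prob_after <- odds_after / (1 + odds_after)

print(c(odds_ratio = odds_ratio, exp_c = exp(c_hyp)))
print(c(prob_after = prob_after, plogis = plogis(log_odds_after)))
```

The same conversion from odds to probability applies to any fitted log odds, since plogis is the inverse of the logit function.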

(d) Consider a parolee who is male, of white race, aged 50 years at prison release, from the state of Maryland, served 3 months, had a maximum sentence of 12 months, did not commit multiple offenses, and committed a larceny. According to the model, what are the odds this individual is a violator? (3 points)

(e) In the previous part, what is the probability this individual is a violator? (2 points)

(f) Use the predict() function to obtain the model's predicted probabilities for parolees in the testing set, remembering to pass type="response". What is the maximum predicted probability of a violation? (2 points)

(g) Use a threshold of 0.5 to evaluate the model’s predictions on the test set. What are the model’s sensitivity, specificity, accuracy and G-mean? (4 points)

(h) What is the accuracy of a simple model that predicts that every parolee is a nonviolator? (2 points)

(i) Consider a parole board using the model to predict whether parolees will be violators or not. The job of a parole board is to make sure that a prisoner is ready to be released into free society, and therefore parole boards tend to be particularly concerned about releasing prisoners who will violate their parole. Should they assign a higher cost to false negatives or to false positives? How should they change the cutoff threshold in logistic regression to reflect this concern? (2 points)

(j) Using the ROCR package, what is the AUC value for the model? Describe the meaning of AUC in this context. (2 points)
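The evaluation measures in parts (g), (h), and (j) can be illustrated on a small made-up example. The vectors actual and prob below are hypothetical toy values, not the parole data, and the AUC is computed in base R with the rank (Mann-Whitney) formula rather than with ROCR, so the sketch runs without extra packages.

```r
actual <- c(0, 0, 1, 1, 0, 1, 0, 0)                        # toy labels
prob <- c(0.10, 0.40, 0.35, 0.80, 0.20, 0.60, 0.70, 0.05)  # toy probabilities

# Classify with a 0.5 threshold and count the confusion-matrix cells
pred <- as.integer(prob >= 0.5)
tp <- sum(pred == 1 & actual == 1)
tn <- sum(pred == 0 & actual == 0)
fp <- sum(pred == 1 & actual == 0)
fn <- sum(pred == 0 & actual == 1)

sensitivity <- tp / (tp + fn)
specificity <- tn / (tn + fp)
accuracy <- (tp + tn) / length(actual)
g_mean <- sqrt(sensitivity * specificity)

# Baseline that predicts "non-violator" for everyone
baseline_accuracy <- mean(actual == 0)

# AUC: probability that a random positive scores above a random negative
n_pos <- sum(actual == 1)
n_neg <- sum(actual == 0)
auc <- (sum(rank(prob)[actual == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

print(c(sensitivity = sensitivity, specificity = specificity,
        accuracy = accuracy, g_mean = g_mean,
        baseline = baseline_accuracy, auc = auc))
```

With ROCR loaded, the same AUC should be obtainable from performance(prediction(prob, actual), "auc")@y.values[[1]].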
