1. Consider the “FoodExpenditure” data built into R. It is available within the betareg package and can be brought into your workspace using the command:
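A standard way to load a data set that ships with a package is `data()` with the `package` argument. The sketch below assumes the betareg package is already installed; the `str()` call is only an added sanity check.

```r
# install.packages("betareg")  # one-time setup, if the package is missing
data("FoodExpenditure", package = "betareg")

# Quick check that the variables named in the exercise are present
str(FoodExpenditure)
```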
After applying this command, you should be able to access the data under the name “FoodExpenditure”. (The data have also been posted as a text file.) The data set contains the variables “food” (household expenditures on food) and “income” (household income), among others. Consider a model of the proportion food / income, using income and “persons” (number of people in the household) as predictors.
a. Present an appropriate normal linear regression model to answer this research question.
b. Discuss all of the assumptions of the model, including tests of the assumptions where appropriate. In particular, describe the meaning of the “normality” and “constant variance” assumptions in the context of these data.
c. Propose an appropriate model based on the apparent distribution of the response, food / income.
d. Explain why these data should not be modeled according to a logistic “events / trials” format.
e. Using your appropriate model, provide interpretations of the model coefficients, and answer the general question of whether household income or household size impacts the proportion of income spent on food.
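As a starting point for parts a through c, the two candidate models might be set up as follows. This is a minimal sketch, assuming the betareg package is installed; the names `prop`, `fit_lm`, and `fit_beta` are illustrative, and the Shapiro-Wilk test is only one of several possible checks of the normality assumption.

```r
library(betareg)
data("FoodExpenditure", package = "betareg")

# Response: share of household income spent on food (between 0 and 1)
FoodExpenditure$prop <- FoodExpenditure$food / FoodExpenditure$income
summary(FoodExpenditure$prop)

# Part a: normal linear regression on the proportion
fit_lm <- lm(prop ~ income + persons, data = FoodExpenditure)
summary(fit_lm)

# Part b: residual diagnostics
shapiro.test(residuals(fit_lm))          # normality of the residuals
plot(fitted(fit_lm), residuals(fit_lm))  # look for non-constant spread

# Part c: beta regression (logit link for the mean by default)
fit_beta <- betareg(prop ~ income + persons, data = FoodExpenditure)
summary(fit_beta)
```

With the default logit link, the mean-model coefficients of `fit_beta` are on the log-odds scale of the expected proportion, so they are read differently from the coefficients of `fit_lm`.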
2. For this exercise please use the “Ship Accidents” data, in which records have been kept about the number of health-related accidents that occur on naval ships. The data set includes the following variables: the number of accidents reported over one-month periods; an indicator of whether the ship had been operational for more than 10 years; three indicators for different “eras” of construction, with “construction1” indicating the oldest and “construction3” indicating the newest; a continuous score of exposure to aggressive contact; and the number of months the ship had been on duty prior to data collection.
a. Provide at least three descriptive statistics to investigate the distribution of the outcome of interest, number of accidents per month.
b. Based on your exploratory analyses from part a, explain why a normal linear regression model would not be appropriate.
c. Fit a traditional Poisson Count Regression model to these data. Discuss the “fit” of the model and also the assumptions associated with the Poisson regression model.
d. Discuss overdispersion with respect to this model. Evaluate whether overdispersion is a problem for your model, and explain any necessary model changes you would use.
e. Fit a Negative Binomial Count Regression model to these data. Discuss the “fit” of the model.
f. Using either your model from part c or your model from part e, provide interpretations of the estimated coefficients, and answer the general question of whether any of the predictors significantly impact the expected number of health-related accidents per month.
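The workflow in parts a through e can be sketched on simulated data. Everything about the data frame below is an assumption made for illustration: the name `ships`, the columns `accidents`, `exposure_score`, and `older_than_10`, and the simulated values stand in for the real “Ship Accidents” file, whose actual variable names may differ. MASS is a recommended package bundled with most R installations.

```r
library(MASS)  # provides glm.nb()

set.seed(1)
n <- 1000

# Simulated stand-in for the ship data; all column names are hypothetical
ships <- data.frame(
  exposure_score = rnorm(n),
  older_than_10  = rbinom(n, 1, 0.5)
)
mu <- exp(1 + 0.3 * ships$exposure_score + 0.4 * ships$older_than_10)
ships$accidents <- rnbinom(n, size = 1, mu = mu)  # overdispersed counts

# Part a: descriptive statistics for the outcome
c(mean = mean(ships$accidents),
  var = var(ships$accidents),
  prop_zero = mean(ships$accidents == 0))

# Part c: Poisson regression
fit_pois <- glm(accidents ~ exposure_score + older_than_10,
                family = poisson, data = ships)
summary(fit_pois)

# Part d: Pearson dispersion statistic; values well above 1 suggest overdispersion
dispersion <- sum(residuals(fit_pois, type = "pearson")^2) / df.residual(fit_pois)
dispersion

# Part e: negative binomial regression, then compare the two fits
fit_nb <- glm.nb(accidents ~ exposure_score + older_than_10, data = ships)
AIC(fit_pois, fit_nb)

# Exponentiated coefficients are rate ratios for expected accidents
exp(coef(fit_nb))
```

Because the counts here are generated from a negative binomial distribution, the dispersion statistic should come out well above 1 and the negative binomial fit should have the lower AIC; the real data may behave differently.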
3. Consider the “bioChemists” data built into R. It is available within the pscl package and can be brought into your workspace using the command:
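As with any packaged data set, `data()` with the `package` argument is a standard way to do this. The sketch below assumes the pscl package is already installed; the `str()` call is only an added sanity check.

```r
# install.packages("pscl")  # one-time setup, if the package is missing
data("bioChemists", package = "pscl")

# Quick check that the variables named in the exercise are present
str(bioChemists)
```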
After applying this command, you should be able to access the data under the name “bioChemists”. (The data have also been posted as a text file.) The data set contains the variables “art” (number of articles published by a graduate student in the last 3 years), “fem” (an indicator of female gender), “mar” (an indicator of married status), and “phd” (a measure of the quality of the PhD program), among others. Assume an interest in predicting “art” using the other listed variables.
a. Describe the type of data being analyzed as the response. Provide descriptive statistics to analyze the outcome on its own. How would you describe the distribution of “art”?
b. Propose an appropriate model for the response of interest.
c. Fit your proposed model from part b. Provide interpretations of the estimated coefficients, and answer the general questions of whether any of the predictors impact the likelihood of publishing and also the expected publication volume for those who do publish.
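Part c separates the likelihood of publishing from the volume among those who publish, which is the structure of a hurdle model. The following is one possible sketch, assuming the pscl package is installed; the choice of a negative binomial count part and the name `fit_hurdle` are illustrative assumptions, not the only defensible answer.

```r
library(pscl)
data("bioChemists", package = "pscl")

# Part a: the response is a count; check its spread and share of zeros
table(bioChemists$art)
c(mean = mean(bioChemists$art),
  var = var(bioChemists$art),
  prop_zero = mean(bioChemists$art == 0))

# Hurdle model: a binary part for zero vs. positive counts, and a
# zero-truncated count part for the number of articles when positive
fit_hurdle <- hurdle(art ~ fem + mar + phd, data = bioChemists,
                     dist = "negbin", zero.dist = "binomial")
summary(fit_hurdle)

# Exponentiated coefficients, reported separately for each part
exp(coef(fit_hurdle, model = "zero"))   # odds ratios for publishing at all
exp(coef(fit_hurdle, model = "count"))  # multiplicative effects in the count part
```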
4. Please respond to the following.
a. How can the interpretations of coefficients from a Beta Regression model differ from those of a Logistic Regression model? When would they be similar?
b. Describe in general the data situations in which you would apply a count regression model instead of a traditional normal regression model.
c. Explain what overdispersion is. How does it affect a count regression model? How can it be detected? What do you do if you find meaningful overdispersion?
d. Describe how the modeling philosophy and interpretations differ for “hurdle” count models as opposed to a traditional Poisson count model.