For this problem set you will need to load the candidate.RData workspace. You will be using the data frame called c and, later, the one called t. Imagine that you are a researcher with a theory that predicts a relationship between the attractiveness of political candidates in the United States and their vote percentage margin of victory. You have gathered information on a random sample of 100 candidates from survey responses and other data sources. You will complete the following problem set using the data you have collected. You have gathered the following variables:
a) attractive: Primary independent variable. Poll question that asks respondents how attractive a candidate is on a 7-point scale from “Very Unattractive” to “Very Attractive”. Designed to measure the perceived attractiveness of the candidate.
b) coverage: Percentage of election media coverage focused on the candidate.
c) million: Dummy variable indicating whether or not the candidate is a millionaire.
d) pmov: Dependent variable. Percentage margin of victory of the candidate in the election.
e) age: Age of the candidate at the time of the election.
f) friend: Survey question that asks about each candidate: On a 7-point scale from “Very Unlikely” to “Very Likely”, how likely do you think you’d get along with candidate x?
g) ideology: Location of the candidate on a one-dimensional scale of political ideology (more negative is more liberal, more positive is more conservative).
h) Regional variables:
• south: Variable indicating the candidate is from the southern United States
• north: Variable indicating the candidate is from the northern United States
• midwest: Variable indicating the candidate is from the midwestern United States
• west: Variable indicating the candidate is from the western United States
Your objective is to create a systematic walk-through of your data, your results, and an evaluation of your confidence in your results, using the following problems as your guide.
1. For all models in this problem set, report the constants, coefficients, standard errors, p-values, N, and R² values in one table in your write-up. Label them Model 1, Model 2, etc., as in the example table provided below. When you are finished, you should have five models in your table. Use the format of Table 1 as a guide for creating this table in your Word document. In the left-hand column, include your variable names. Name the variables so that a reader would know what each one is, but DO NOT USE THE R CODE NAMES. In the columns to the right of the variable names, insert the different models you will make for this problem set. Each cell should contain the coefficient and its standard error (in parentheses) for a variable included in that model. Since the first model is bivariate, the only cells that will have numbers are the primary independent variable (the coefficient with its SE in parentheses underneath), the constant (the coefficient with its SE in parentheses underneath), and the N and R².
Fill the subsequent columns as the problem set instructs.
2. Load the data; it should show you two data frames. Use c for now. View the data frame and its variables. Make sure you know what each variable is and what it measures.
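As a rough illustration of loading and inspecting a workspace, the sketch below saves two simulated data frames to a temporary .RData file and loads them back. The simulated values and the temporary file are assumptions made so the example runs without the course data; with the real file you would call load() on candidate.RData directly.

```r
# Hypothetical stand-in for candidate.RData: two small simulated data frames
# written to a temporary file so the example runs without the course data.
set.seed(1)
c <- data.frame(attractive = sample(1:7, 100, replace = TRUE),
                pmov = runif(100, 0, 40))
t <- data.frame(attractive = sample(1:7, 1000, replace = TRUE),
                pmov = runif(1000, 0, 40))
rdata_path <- tempfile(fileext = ".RData")
save(c, t, file = rdata_path)
rm(c, t)

# load() restores the saved objects and returns their names.
loaded <- load(rdata_path)
print(loaded)

# Inspect the first data frame: structure, first rows, quick summaries.
str(c)
head(c)
summary(c)
```

The data frames share their names with the base R functions c() and t(). Calls such as c(1, 2) still work, because R skips non-function objects when it looks up a function being called.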
3. Report the descriptive statistics for all of the above variables. This includes the type and level of measurement. For categorical variables, provide a properly labeled frequency graph using the freq() command, and include the frequency and percentage of each category in your write-up. For continuous variables, provide a histogram using the hist() command and report the n, median, mean, and standard deviation in your write-up. For the set of categorical variables that indicate regional differences, you will not need to make a graph, but you should treat them here as one category and report their number and frequency. Label each response as a-h.
Note: Use the options you used in Problem Set #2 to properly label every graph: this includes the main label, and the x label. Use whatever color you’d like, other than the default color.
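A minimal sketch of these summaries on simulated data follows. The data frame dat and its values are made up for illustration. Because freq() is not part of base R, the sketch substitutes table(), prop.table(), and barplot(); that substitution is an assumption, not the course's required command.

```r
set.seed(2)
# Simulated stand-in data; only the column names follow the variable list.
dat <- data.frame(
  age     = round(rnorm(100, mean = 55, sd = 10)),
  million = rbinom(100, size = 1, prob = 0.4)
)

# When run as a script, draw to a null device instead of writing a file.
if (!interactive()) pdf(NULL)

# Continuous variable: n, median, mean, standard deviation, and a histogram.
age_stats <- c(n      = sum(!is.na(dat$age)),
               median = median(dat$age),
               mean   = mean(dat$age),
               sd     = sd(dat$age))
print(round(age_stats, 2))
hist(dat$age,
     main = "Age of Candidates",
     xlab = "Age at election (years)",
     col  = "steelblue")

# Categorical variable: counts and percentages, then a labeled bar chart.
million_counts <- table(factor(dat$million, levels = c(0, 1),
                               labels = c("Not a millionaire", "Millionaire")))
million_pct <- round(100 * prop.table(million_counts), 1)
print(million_counts)
print(million_pct)
barplot(million_counts,
        main = "Millionaire Status of Candidates",
        xlab = "Millionaire status",
        ylab = "Frequency",
        col  = "darkorange")
```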
4. Your primary interest is the relationship between attractiveness and the percentage margin of victory. Create a scatterplot of these two variables using the plot() command. The syntax for that command is plot(x,y). Use the label options you used for your histograms and frequency graphs to set the main graph label (main="") and the x-axis label (xlab=""), and use ylab="" to label the y-axis. Put your scatterplot in your Word document and spend a few sentences discussing whether you can distinguish the direction of the relationship between the variables. Be sure to answer the following questions: Is a relationship apparent? What factors might explain why the scatterplot looks the way it does? Are there any concerns with outliers or high-leverage observations? Note: Look in the book to figure out how to identify outliers and leverage.
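The following sketch draws a labeled scatterplot on simulated data and then computes two common diagnostics. The data are made up. The cutoffs (twice the mean hat value, and studentized residuals beyond 2 in absolute value) are widely used rules of thumb and are assumptions here; the course textbook may define outliers and leverage differently.

```r
set.seed(3)
# Simulated stand-in data with a positive relationship built in.
dat <- data.frame(attractive = sample(1:7, 100, replace = TRUE))
dat$pmov <- 5 + 2 * dat$attractive + rnorm(100, sd = 6)

# When run as a script, draw to a null device instead of writing a file.
if (!interactive()) pdf(NULL)

plot(dat$attractive, dat$pmov,
     main = "Attractiveness and Margin of Victory",
     xlab = "Perceived attractiveness (1-7)",
     ylab = "Percentage margin of victory",
     col  = "darkgreen", pch = 19)

# Diagnostics from a simple linear fit.
fit <- lm(pmov ~ attractive, data = dat)
lev <- hatvalues(fit)                          # leverage of each observation
high_leverage <- which(lev > 2 * mean(lev))    # rule-of-thumb cutoff
large_resid   <- which(abs(rstudent(fit)) > 2) # possible outliers
print(high_leverage)
print(large_resid)
```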
5. Create a bivariate regression model of the percent margin of victory on attractiveness using the lm() command. Report the β coefficient for attractiveness, along with its standard error and p-value, in a table, and label the results Model 1. In your Word document, report the statistical and substantive significance. Explain the relationship as you might to a person who is not familiar with statistics, but in such a way that a statistician would recognize what you’ve done and would appreciate your work.
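One way to pull the reported quantities out of a fitted model is sketched below on simulated data. The data frame dat and its values are hypothetical; only lm(), summary(), coef(), and nobs() are standard R.

```r
set.seed(4)
# Simulated stand-in data.
dat <- data.frame(attractive = sample(1:7, 100, replace = TRUE))
dat$pmov <- 5 + 2 * dat$attractive + rnorm(100, sd = 6)

model1 <- lm(pmov ~ attractive, data = dat)

# coef(summary(.)) is a matrix with columns
# "Estimate", "Std. Error", "t value", and "Pr(>|t|)".
coefs <- coef(summary(model1))
print(round(coefs, 3))

b     <- coefs["attractive", "Estimate"]    # slope for attractiveness
se    <- coefs["attractive", "Std. Error"]  # its standard error
p     <- coefs["attractive", "Pr(>|t|)"]    # its p-value
n_obs <- nobs(model1)                       # N used in the fit
```

The row named "(Intercept)" in the same matrix holds the constant and its standard error.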
6. Write the R code necessary to find the R² for the bivariate model you just made using the following equation. Report the value, and interpret what this particular R² value means as you would to someone not familiar with statistics.
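The equation the problem refers to is not reproduced in this text, so the sketch below assumes the standard definition, R² = 1 - RSS/TSS, where RSS is the sum of squared residuals and TSS is the total sum of squares of the dependent variable around its mean. The data are simulated.

```r
set.seed(5)
# Simulated stand-in data.
dat <- data.frame(attractive = sample(1:7, 100, replace = TRUE))
dat$pmov <- 5 + 2 * dat$attractive + rnorm(100, sd = 6)

model1 <- lm(pmov ~ attractive, data = dat)

# Assumed definition: R^2 = 1 - RSS / TSS.
rss <- sum(residuals(model1)^2)               # unexplained variation
tss <- sum((dat$pmov - mean(dat$pmov))^2)     # total variation in pmov
r2_manual <- 1 - rss / tss

# Cross-check against the value R reports.
r2_builtin <- summary(model1)$r.squared
print(c(manual = r2_manual, builtin = r2_builtin))
```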
7. We may be missing important confounding variables. Perceived candidate attractiveness can be correlated with a variety of other variables that could affect the percent margin of victory. Therefore, we need to include the necessary variables. Run a model that includes all of the above variables, and report the estimated values in your table. In the table, label this model Model 2. Also in your write-up, interpret the results of every variable in the model, both statistically (with p-values) and substantively (with the size and direction of the coefficients).
Explain every relationship as you might to a person who is not familiar with statistics, but in such a way that a statistician would recognize what you’ve done and would appreciate your work. You should use at least 2-3 sentences to properly explain the results of each variable. Label this 7a-h.
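A model with several predictors uses the same lm() call with more terms on the right-hand side. The sketch below fits one on simulated data with only a few of the variables and formats each estimate as "coefficient (SE)", which is one possible way to prepare table cells. The simulated values and the choice of sprintf() formatting are assumptions.

```r
set.seed(6)
n <- 100
# Simulated stand-in data using a subset of the variable names.
dat <- data.frame(
  attractive = sample(1:7, n, replace = TRUE),
  coverage   = runif(n, 0, 100),
  age        = round(rnorm(n, mean = 55, sd = 10)),
  million    = rbinom(n, size = 1, prob = 0.4)
)
dat$pmov <- 2 + 1.5 * dat$attractive + 0.1 * dat$coverage + rnorm(n, sd = 6)

model2 <- lm(pmov ~ attractive + coverage + age + million, data = dat)
coefs  <- coef(summary(model2))

# One string per row of the coefficient matrix: "estimate (standard error)".
table_cells <- sprintf("%.2f (%.2f)",
                       coefs[, "Estimate"], coefs[, "Std. Error"])
names(table_cells) <- rownames(coefs)
print(table_cells)
print(round(coefs[, "Pr(>|t|)"], 3))   # p-values for the write-up
```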
8. If you followed instructions on the last part, R refused to include one of your variables. In a few sentences, identify which one and explain why it got dropped.
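To see the general mechanism by which lm() can drop a term, the sketch below uses made-up indicator variables (in_a, in_b, in_c) that are mutually exclusive and exhaustive. All names and values are hypothetical and are not taken from the course data.

```r
set.seed(7)
n <- 100
group <- sample(c("a", "b", "c"), n, replace = TRUE)  # hypothetical categories
dat <- data.frame(
  x    = rnorm(n),
  in_a = as.integer(group == "a"),
  in_b = as.integer(group == "b"),
  in_c = as.integer(group == "c")
)
dat$y <- 1 + 2 * dat$x + 3 * dat$in_a + rnorm(n)

# Every row has exactly one indicator equal to 1, so the three indicators
# sum to the constant column that lm() already includes as the intercept.
print(table(dat$in_a + dat$in_b + dat$in_c))

fit <- lm(y ~ x + in_a + in_b + in_c, data = dat)
print(coef(fit))   # a coefficient that cannot be estimated is reported as NA

dropped <- names(coef(fit))[is.na(coef(fit))]
print(dropped)
```

Leaving one indicator out of the formula yields the same fitted values, with the omitted category serving as the reference group.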
9. There are two particular control variables that may not have a linear relationship with the margin of victory. Identify the more likely of the two and create a new variable equal to that variable squared. Run a new regression model including this variable, and include it as Model 3 in your table. Explain the relationship that this variable has with the dependent variable.
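A squared term can be added as a new column and then included alongside the original variable. The sketch below uses a hypothetical predictor x and outcome y with a built-in curve; it does not indicate which control variable the problem has in mind.

```r
set.seed(8)
n <- 100
# Hypothetical predictor with an inverted-U relationship to the outcome.
dat <- data.frame(x = round(runif(n, 30, 80)))
dat$y <- -40 + 2.4 * dat$x - 0.02 * dat$x^2 + rnorm(n, sd = 4)

# Create the squared variable, then include both terms in the model.
dat$x_sq <- dat$x^2
fit_sq <- lm(y ~ x + x_sq, data = dat)
print(round(coef(summary(fit_sq)), 4))

# With both terms, the fitted curve turns at -b_x / (2 * b_x_sq).
b <- coef(fit_sq)
turning_point <- -b[["x"]] / (2 * b[["x_sq"]])
print(turning_point)
```

A negative coefficient on the squared term means the fitted effect rises and then falls; a positive one means it falls and then rises.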
10. From number 3 above, you may have identified former Rep. Henry Waxman as an outlier in our sample. Unfortunately, for reasons completely beyond his control, Mr. Waxman may have affected our data. Mr. Waxman is observation 87. Report his attractiveness (for your own research, Google his image) and his percent margin of victory scores. In a few sentences, explain what impact he might have on our results.
11. We may need to remove Mr. Waxman from our sample. We will use a variant of the syntax for calling rows of a data frame. For example, to remove the fifth observation from a data frame named x, you would write: x <- x[-5, ]. In a few sentences, provide a meaningful justification for removing Mr. Waxman.
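The negative-index syntax can be tried on a toy data frame first. The data frame dat below is hypothetical; saving the result under a new name, rather than overwriting, is a design choice that keeps the original rows available for comparison.

```r
# Toy data frame with six rows.
dat <- data.frame(id = 1:6, score = c(3, 5, 4, 6, 1, 5))

# A negative row index drops that row; the empty slot after the comma
# keeps every column.
without_fifth <- dat[-5, ]
print(without_fifth)

# The original data frame is unchanged because the result was stored
# under a new name.
print(nrow(dat))
print(nrow(without_fifth))
```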
12. Re-run the same code you used for Model 2, and include it in the table as Model 4. What impact did Mr. Waxman’s removal have on our primary variable of interest? Use examples and be specific.
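To compare a model with and without a single observation, it can help to put the two sets of estimates side by side. The sketch below plants one extreme point in simulated data; the variables x and y, the planted values, and the row chosen are all hypothetical.

```r
set.seed(9)
n <- 100
dat <- data.frame(x = sample(1:7, n, replace = TRUE))
dat$y <- 5 + 2 * dat$x + rnorm(n, sd = 3)

# Plant a hypothetical extreme observation in the last row:
# a very low x paired with a very high y.
dat$x[n] <- 1
dat$y[n] <- 60

fit_all  <- lm(y ~ x, data = dat)
fit_trim <- lm(y ~ x, data = dat[-n, ])   # same model without that row

# Slope and standard error for x, with and without the extreme point.
comparison <- rbind(
  all     = coef(summary(fit_all))["x", c("Estimate", "Std. Error")],
  trimmed = coef(summary(fit_trim))["x", c("Estimate", "Std. Error")]
)
print(round(comparison, 3))
```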
13. Our research is catching some attention. Using some recent grant money we received because of our above results, we have expanded our sample to include 1,000 observations. Using the data frame named t, re-run the same variables we did in Model 2 and report the results in your table as Model 5. Reinterpret the results for your main independent variable both statistically (with p-values) and substantively (with the size and direction of the coefficients).
Explain the relationship as you might to a person who is not familiar with statistics, but in such a way that a statistician would recognize what you’ve done and would appreciate your work. What are the specific impacts of increasing the sample size? Why do they occur? Answer both of these latter questions with at least 3-4 sentences. Label each variable interpretation as 13a-h.
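A small simulation can show how estimates behave as the sample grows. The sketch below fits the same made-up relationship at n = 100 and n = 1000; the helper simulate_fit(), the true slope, and the noise level are all assumptions for illustration.

```r
set.seed(10)

# Hypothetical helper: simulate n observations from a fixed linear
# relationship and return the slope estimate, its SE, and its p-value.
simulate_fit <- function(n) {
  x <- sample(1:7, n, replace = TRUE)
  y <- 5 + 0.5 * x + rnorm(n, sd = 6)
  coef(summary(lm(y ~ x)))["x", c("Estimate", "Std. Error", "Pr(>|t|)")]
}

small <- simulate_fit(100)
large <- simulate_fit(1000)
print(round(rbind(n_100 = small, n_1000 = large), 4))

# Standard errors shrink roughly in proportion to 1 / sqrt(n), so a
# tenfold increase in n should cut them by a factor near sqrt(10).
se_ratio <- small[["Std. Error"]] / large[["Std. Error"]]
print(se_ratio)
```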