Steps for Data Analysis: Descriptive Stats - Correlation

Steps for Bivariate Data Analysis

Data Preparation

1. Inspect the data sets you have. Choose an appropriate year (or set of years to average over) that gives you as many observations as possible for both data sets. My example data are cross-country figures for population per square km and vehicles per 1000 people. The raw vehicle density data are only available from 2002 to 2007, so I averaged the available figures across those years for each country. For comparability, I averaged population density over the same years for each country. (Alternatively, I could have used 2007 data alone for both variables, but then some countries would have had no vehicle density observation.)

2. Copy and paste the data you will use into a new sheet for each variable. It is wise to keep the country names alongside the data. In my example data I call these sheets Data 1 and Data 2.

3. Construct a bivariate (paired) data set - i.e. for each country you should have an observation for each variable. One of your variables may have observations for many more countries than the other, in which case you will have to pick out the observations from the larger set that have a corresponding observation in the smaller set. In my example data I used the VLOOKUP function to automate the data matching.

4. In your report you should note any difficulties with the data preparation and implications of dropping countries from the data sets if such was required.
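
For readers working outside a spreadsheet, the averaging and matching steps above can be sketched in Python. This is only an illustration of the idea, not the method used in the example workbook: the function names `average_available` and `pair_observations` are hypothetical, and the country names and figures are made up.

```python
from statistics import mean


def average_available(yearly_values):
    """Average the years that have data, skipping missing (None) entries."""
    present = [v for v in yearly_values if v is not None]
    return mean(present) if present else None


def pair_observations(var_a, var_b):
    """Keep only countries observed in both data sets (similar in spirit to VLOOKUP matching).

    Returns the paired rows plus the list of countries that had to be dropped,
    so the dropped countries can be discussed in the report.
    """
    common = sorted(set(var_a) & set(var_b))
    paired = [(country, var_a[country], var_b[country]) for country in common]
    dropped = sorted(set(var_a) ^ set(var_b))
    return paired, dropped


# Made-up illustrative figures, not real data.
pop_density = {"Aland": 10.0, "Bordavia": 250.0, "Cestria": 45.0, "Dunmark": 120.0}
vehicles = {
    "Aland": average_available([290.0, None, 310.0]),
    "Cestria": average_available([150.0, 150.0]),
    "Dunmark": average_available([400.0, 420.0]),
}

paired, dropped = pair_observations(pop_density, vehicles)
for row in paired:
    print(row)
print("Dropped:", dropped)
```

Returning the dropped countries alongside the paired rows makes it easier to comment on what was lost in the matching.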

An educated guess

1. You can use the "Descriptive Statistics" tool in Excel's Analysis ToolPak, and also calculate quartiles, coefficients of variation, etc.

2. Draw a histogram and boxplot of each data set.

3. You should discuss the important and interesting features of the data revealed by the above in your report. In my example data you will see that the population density data is strongly positively skewed, so much so that the boxplot is almost meaningless. My two options were to drop two or more of the largest observations, or to transform the data. I chose the latter: by taking the log of the data I end up with a data set that is almost normal. Data whose log is approximately normal is what we call "lognormal", and it is a common feature of cross-country data like this. So you should be prepared to drop observations or transform data if necessary, and explain why you did this in your report.

4. Use the correlation coefficient and a scatterplot to see the strength and direction of the relationship (if any) between the two variables.

5. In your report, discuss the above and explain if you think the relationship might be causative, spurious, or driven by a third factor.
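
The calculations in this section can also be sketched in Python using only the standard library. The sketch below is an illustration under assumptions: `describe` and `pearson_r` are hypothetical helper names, the figures are made up, and the "inclusive" quartile method is chosen on the assumption that it is the closest match to Excel's QUARTILE.INC. It shows the summary statistics, the effect of a log transform on skewed data, and the correlation coefficient.

```python
import math
from statistics import mean, stdev, quantiles


def describe(values):
    """Summary statistics: mean, quartiles, sample std dev, coefficient of variation."""
    q1, median, q3 = quantiles(values, n=4, method="inclusive")
    m, s = mean(values), stdev(values)
    return {"mean": m, "q1": q1, "median": median, "q3": q3, "stdev": s, "cv": s / m}


def pearson_r(xs, ys):
    """Pearson correlation coefficient for paired observations."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Made-up, strongly right-skewed figures (one very large observation).
density = [5.0, 20.0, 80.0, 150.0, 400.0, 6500.0]
vehicles = [40.0, 120.0, 300.0, 250.0, 500.0, 450.0]

raw = describe(density)
# A mean far above the median is a sign of positive skew.
print("raw mean vs median:", raw["mean"], raw["median"])

# Taking logs pulls in the long right tail.
log_density = [math.log(d) for d in density]
logged = describe(log_density)
print("log mean vs median:", logged["mean"], logged["median"])

print("r (log density, vehicles):", pearson_r(log_density, vehicles))
```

A mean far above the median in the raw data, and a much smaller gap after taking logs, is the numerical counterpart of what the histogram and boxplot show visually.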

Construct confidence intervals (using the separate data)

1. Now assume that the data for each variable is a random sample and construct a confidence interval for the population mean of each variable. Since you don't know the population standard deviations you should use critical values from the Student t-distribution.

2. State your confidence interval in your report, explain what it means (to a layperson), and discuss any doubts you have about the validity of the interval.
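
The interval described above is the sample mean plus or minus the t critical value times the standard error, s / sqrt(n). A minimal Python sketch follows, with some assumptions: `t_confidence_interval` is a hypothetical helper name, the sample is made up, and the critical value is passed in by hand (here 2.776, the standard two-sided 95% value for 4 degrees of freedom, as found in a t table or with Excel's T.INV.2T function).

```python
import math
from statistics import mean, stdev


def t_confidence_interval(sample, t_crit):
    """Confidence interval for the population mean when sigma is unknown.

    t_crit is the two-sided Student t critical value for n - 1 degrees
    of freedom at the chosen confidence level.
    """
    n = len(sample)
    centre = mean(sample)
    std_error = stdev(sample) / math.sqrt(n)  # sample std dev uses n - 1
    margin = t_crit * std_error
    return centre - margin, centre + margin


# Made-up sample of five observations, so df = 4.
sample = [2.0, 4.0, 4.0, 5.0, 5.0]
low, high = t_confidence_interval(sample, t_crit=2.776)
print(f"95% CI for the mean: ({low:.3f}, {high:.3f})")
```

With a larger sample the degrees of freedom change, so the critical value must be looked up again rather than reused.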
