Data Analysis Assignment: Techniques and Skills to Summarize, Visualize, and Interpret Business Data

Learning Outcomes

Knowledge and Understanding tested in this assignment:

• Recommend which visualisation and summary techniques to apply to data in a business context.

• Determine, critically evaluate and advise on which aspects of statistical methodology apply to problems occurring in a business context.

Skills and Attributes tested in this assignment:

• Design methods to collect/gather data in a way that is suitable for the associated business problems to be analysed.

• Create summaries of data appropriate to the business problem that exists.

• Communicate the nuances and ambiguities of statistical analyses they [the students] have undertaken and conclusions drawn from these.

• Evaluate output produced by software packages used to implement statistical techniques.

1.1 Initial Data Analysis

• Summarise and explore the data set that you have been provided with. You may like to use the suggested approaches below, plus any others that you believe to be relevant and useful.

• For each continuous variable you should calculate appropriate summary statistics and construct a minimum of one chart (perhaps obtained using the command “hist”) in order to investigate the location, distribution and spread of the data. If you can think of a good reason for having more than one chart per variable, you should act accordingly. You should discuss anything that you think is relevant.

• For each categorical variable you will need a minimum of one table, perhaps obtained using the command “table”. If you can think of a good reason for having more than one table per variable, you should act accordingly. Again, you should discuss anything that you think is relevant.

• For each variable, use the summary statistics, graphs and tables that you have generated to look for unusual observations, possible data entry errors and any outliers, missing values or other inconsistencies and draw appropriate conclusions.

• If you wish to remove any observations or to transform any variables, you should do this before conducting further analyses.
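The steps in 1.1 can be sketched in R as follows. This is only an illustration under stated assumptions: the data frame `weather`, its column names and its values are hypothetical stand-ins for the data set you have been given, and the boxplot is just one possible choice of second chart.

```r
# Toy data standing in for the provided data set (names and values are hypothetical).
weather <- data.frame(
  ActualTemp = c(9.5, 11.2, 7.8, 15.1, NA, 12.4, 48.0, 10.3),
  Summary    = c("Clear", "Partly Cloudy", "Clear", "Overcast",
                 "Clear", "Foggy", "Overcast", "Partly Cloudy")
)

# Continuous variable: location, spread and distribution.
summary(weather$ActualTemp)             # quartiles, mean and number of NAs
sd(weather$ActualTemp, na.rm = TRUE)    # standard deviation, ignoring NAs
hist(weather$ActualTemp, main = "Actual temperature", xlab = "Temperature")
boxplot(weather$ActualTemp, horizontal = TRUE)  # second chart: shows outliers

# Categorical variable: frequency table (useNA also counts missing values).
summary_counts <- table(weather$Summary, useNA = "ifany")
summary_counts

# Missing values per column, and points flagged by the boxplot rule
# (further than 1.5 times the hinge spread from the box).
n_missing <- colSums(is.na(weather))
outliers  <- boxplot.stats(weather$ActualTemp)$out
n_missing
outliers
```

A value flagged by the boxplot rule is only a candidate outlier. Whether it is a data entry error or a genuine extreme observation is a judgement you need to make and justify.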

1.2 Modifying the Summary Variable

• You may have noticed that the Summary variable has a large number of overlapping categories. Reducing the number of categories will help to simplify the analyses in Part 2. Suggest a new set of categories, based on the existing ones, that brings the total down to a more manageable number, and add a new column to your data set containing the appropriate (new) category for each observation. Remember to give your new column/variable a meaningful name.

• You should not remove any of the observations in order to reduce the number of categories!

• You will need to repeat your analysis of the Summary variable using the new categories.

• You should use your new Summary variable (with the new categories) for the remainder of the assignment. Do not use the original Summary variable again.
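One way to build the new column is sketched below. The labels in `Summary`, the grouping rule and the column name `SummaryGroup` are all hypothetical; the categories in your data set, and the grouping you decide to justify, may well be different.

```r
# Toy data with invented Summary labels (hypothetical, for illustration only).
weather <- data.frame(
  Summary = c("Clear", "Partly Cloudy", "Mostly Cloudy", "Overcast",
              "Foggy", "Breezy and Partly Cloudy", "Breezy and Foggy", "Clear"),
  stringsAsFactors = FALSE
)

# Hypothetical grouping rule: map each original label to one broader group.
# The conditions are checked in order, so the first matching pattern wins.
weather$SummaryGroup <- with(weather, ifelse(
  grepl("Fog", Summary), "Foggy",
  ifelse(grepl("Cloudy|Overcast", Summary), "Cloudy",
  ifelse(grepl("Clear", Summary), "Clear", "Other"))))
weather$SummaryGroup <- factor(weather$SummaryGroup)

# Check that every observation was kept and received a new category.
stopifnot(nrow(weather) == 8, !anyNA(weather$SummaryGroup))

# See how the old labels map onto the new ones, then summarise the new variable.
mapping    <- table(weather$Summary, weather$SummaryGroup)
new_counts <- table(weather$SummaryGroup)
mapping
new_counts
```

The cross-tabulation of old against new labels is a quick way to confirm that each original category ends up in exactly one new category.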

1.3 Pairwise Relationships with the Variables of Interest

• In anticipation of Part 2 of this assignment, you should look to see how the variables in the data set are related. You should investigate relationships between variables in pairs. In particular, you should investigate which variables are related to Actual Temperature and which variables are related to Apparent Temperature.

• For two continuous variables you may like to use some scatterplots obtained by using a command like plot(datasetname$variablename1, datasetname$variablename2)

• For two categorical variables you may like to construct a cross-tabulation of frequencies.

• For one continuous and one categorical variable you may like to look at (e.g.) the mean of the continuous variable for different levels of the categorical variable by using a command like tapply(datasetname$continuousvariablename, datasetname$categoricalname, mean).

• Based on what you have done, suggest what you might find when doing the correlation and regression analyses for Part 2 of the assignment. It does not matter if you are ultimately right or wrong but it is good practice to go into an analysis with some idea of what you expect to find.
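The three kinds of pairwise comparison described in 1.3 might look like this in R. The data frame `weather` and the columns `ActualTemp`, `Humidity`, `SummaryGroup` and `PrecipType` are hypothetical names with made-up values; substitute the variables in your own data set.

```r
# Toy data (all names and values are hypothetical).
weather <- data.frame(
  ActualTemp   = c(9.5, 11.2, 7.8, 15.1, 13.0, 12.4, 6.1, 10.3),
  Humidity     = c(0.89, 0.80, 0.93, 0.62, 0.70, 0.74, 0.95, 0.85),
  SummaryGroup = c("Cloudy", "Clear", "Foggy", "Clear",
                   "Clear", "Cloudy", "Foggy", "Cloudy"),
  PrecipType   = c("rain", "rain", "rain", "none",
                   "none", "rain", "rain", "none")
)

# Two continuous variables: scatterplot.
plot(weather$Humidity, weather$ActualTemp,
     xlab = "Humidity", ylab = "Actual temperature")

# Two categorical variables: cross-tabulation of frequencies.
xtab <- table(weather$SummaryGroup, weather$PrecipType)
xtab

# One continuous and one categorical variable: group means, plus a
# side-by-side boxplot to compare the whole distribution in each group.
group_means <- tapply(weather$ActualTemp, weather$SummaryGroup, mean)
group_means
boxplot(ActualTemp ~ SummaryGroup, data = weather)
```

If a variable contains missing values, `mean` returns `NA` for the affected groups; passing `na.rm = TRUE` as an extra argument to `tapply` avoids this.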

1.4 Hypothesis Testing

• Suggest two possible pairs of hypotheses that you can test using the data set provided, or an appropriate subset of the data.

• One pair of hypotheses (null and alternative) should be based on a single sample or the differences between two related samples. The other pair of hypotheses (null and alternative) should be based on two independent samples.

• You can base your hypotheses on any of the variables provided. Your hypotheses may relate to all of the observations in the data set or to a subset, or subsets, of observations.

• In addition to stating the hypotheses, you should explain in words what each pair of hypotheses is designed to test and why that is of interest.

• For each pair of hypotheses you should determine whether to use a parametric or non-parametric test and justify your decision before carrying out the appropriate test. You may need to tweak the wording of the hypotheses to reflect your choice of test, depending on whether you are testing means or medians.

• Carry out the appropriate parametric or non-parametric test for each pair of hypotheses and state your conclusions clearly.
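As a rough sketch of the workflow in 1.4, the code below runs one test on two related samples and one on two independent samples, in both parametric and non-parametric form. The data are simulated, and the column names and the hypotheses themselves are assumptions made purely for illustration; they are not the hypotheses you are expected to choose.

```r
set.seed(1)  # reproducible simulated data (hypothetical names and values)
weather <- data.frame(
  ActualTemp   = round(rnorm(40, mean = 12, sd = 4), 1),
  SummaryGroup = rep(c("Clear", "Cloudy"), each = 20)
)
weather$ApparentTemp <- weather$ActualTemp - round(runif(40, 0, 2), 1)

# Pair 1 (two related samples, measured on the same observations):
# H0: the mean difference between actual and apparent temperature is 0.
# H1: the mean difference is not 0.
diffs <- weather$ActualTemp - weather$ApparentTemp
normal_check <- shapiro.test(diffs)   # one input to the parametric decision
paired_t <- t.test(weather$ActualTemp, weather$ApparentTemp, paired = TRUE)
paired_w <- wilcox.test(weather$ActualTemp, weather$ApparentTemp,
                        paired = TRUE, exact = FALSE)  # signed-rank test

# Pair 2 (two independent samples):
# H0: mean actual temperature is the same for Clear and Cloudy observations.
# H1: the two means differ.
indep_t <- t.test(ActualTemp ~ SummaryGroup, data = weather)  # Welch t-test
indep_w <- wilcox.test(ActualTemp ~ SummaryGroup, data = weather,
                       exact = FALSE)                         # rank-sum test

paired_t
indep_t
```

Only one test per pair should be reported as your main result: the t-test if its assumptions look reasonable, otherwise the Wilcoxon alternative, with the hypotheses reworded to match. A histogram or normal Q-Q plot of the data (`qqnorm`) is a useful complement to the Shapiro-Wilk test when justifying that choice.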

If you made any changes to the data set in Part 1 (e.g. by modifying the Summary variable, by removing observations or transforming variables) you must use the amended data set for Part 2. Do not make any further changes to the data set.

2.1 Correlations

• Before applying regression analysis, you should use correlation techniques to see how the continuous variables in the data set are related. You should investigate relationships between continuous variables in pairs. In particular, you should investigate which continuous variables are related to Actual Temperature and which continuous variables are related to Apparent Temperature.

• To calculate correlation coefficients you should use a command like cor.test(datasetname$variablename1, datasetname$variablename2)

• Which, if any, of the correlation coefficients that you have calculated are statistically significant? What does “statistical significance” mean in this context?
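A minimal example of the correlation step is given below, using made-up values and hypothetical column names. `cor` gives the full matrix of pairwise coefficients, while `cor.test` adds a test of the null hypothesis that the true correlation is zero.

```r
# Toy data (all names and values are hypothetical).
weather <- data.frame(
  ActualTemp = c(9.5, 11.2, 7.8, 15.1, 13.0, 12.4, 6.1, 10.3),
  Humidity   = c(0.89, 0.80, 0.93, 0.62, 0.70, 0.74, 0.95, 0.85),
  WindSpeed  = c(14.1, 9.8, 3.9, 11.0, 17.5, 6.2, 12.3, 8.4)
)

# Matrix of Pearson correlations between every pair of continuous variables.
round(cor(weather), 2)

# Test one pair: estimate, p-value and 95% confidence interval.
ct <- cor.test(weather$ActualTemp, weather$Humidity)
ct$estimate
ct$p.value
ct$conf.int

# Compare the p-value with a chosen significance level (5% here).
significant <- ct$p.value < 0.05
significant
```

The p-value is the probability of seeing a sample correlation at least this far from zero if the true correlation were zero. A small p-value therefore indicates evidence of a linear association, but it says nothing by itself about how strong or how practically important that association is. If the relationship looks monotonic but not linear, `cor.test` also accepts `method = "spearman"`.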

2.2 Regression Models for Variables of Interest

• Use the data set provided to create a regression model with Actual Temperature as the dependent variable and using all of the other variables that you judge to be relevant as independent variables.

• Use the data set provided to create a regression model with Apparent Temperature as the dependent variable and using all of the other variables that you judge to be relevant as independent variables.
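The two models could be fitted with `lm` along the following lines. The data are simulated, and both the column names and the choice of independent variables are assumptions for the sake of the example; which variables are relevant in your data set is for you to judge and justify.

```r
set.seed(42)  # reproducible simulated data (hypothetical names and values)
n <- 60
weather <- data.frame(
  Humidity     = runif(n, 0.4, 1.0),
  WindSpeed    = runif(n, 0, 30),
  SummaryGroup = factor(rep(c("Clear", "Cloudy", "Foggy"), length.out = n))
)
weather$ActualTemp   <- 30 - 25 * weather$Humidity + rnorm(n, sd = 2)
weather$ApparentTemp <- weather$ActualTemp - 0.1 * weather$WindSpeed +
  rnorm(n, sd = 1)

# Model 1: Actual Temperature as the dependent variable.
fit_actual <- lm(ActualTemp ~ Humidity + WindSpeed + SummaryGroup,
                 data = weather)

# Model 2: Apparent Temperature as the dependent variable.
fit_apparent <- lm(ApparentTemp ~ Humidity + WindSpeed + SummaryGroup,
                   data = weather)

# Coefficients, standard errors, p-values and R-squared for each model.
summary(fit_actual)
summary(fit_apparent)

# Residuals against fitted values, as a basic check of the model assumptions.
plot(fit_actual, which = 1)
```

A factor such as `SummaryGroup` enters the model as a set of indicator variables, with the first level (here "Clear") as the baseline, so each of its coefficients is a difference from that baseline.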
