This assignment assumes you have a working understanding of R and the tidyverse.
Load the babynames tibble. Use dplyr to create a dataframe of all names that were not used prior to 2010. Use ggplot2 to produce a graph summarising how usage of this set of names changed over the period 2010-2017. For example, this could be a set of boxplots summarising the distribution of the number of uses in each year. Which of those names were used most and least during this period?
You should submit a one-page document containing your R code, your graph, and your answers to the questions.
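As a rough illustration of the filtering step described above, the following sketch uses a small invented tibble with the same column names as babynames (year, sex, name, n). The values, and the object names `toy`, `new_names`, and `totals`, are assumptions made for illustration. It also treats a name as "new" regardless of sex, which is only one possible reading of the task.

```r
library(dplyr)
library(ggplot2)

# Toy stand-in for the babynames tibble; all values are invented
toy <- tibble(
  year = c(2008, 2010, 2011, 2010, 2011, 2012, 2012),
  sex  = c("F", "F", "F", "M", "M", "M", "F"),
  name = c("Ada", "Ada", "Ada", "Zyx", "Zyx", "Zyx", "Qia"),
  n    = c(50, 60, 70, 5, 9, 12, 7)
)

# Keep only names whose earliest recorded year is 2010 or later
new_names <- toy %>%
  group_by(name) %>%
  filter(min(year) >= 2010) %>%
  ungroup()

# Total uses per name over the period, most used first
totals <- new_names %>%
  group_by(name) %>%
  summarise(total = sum(n)) %>%
  arrange(desc(total))

# One boxplot per year showing the distribution of uses
p <- ggplot(new_names, aes(x = factor(year), y = n)) +
  geom_boxplot() +
  labs(x = "Year", y = "Number of uses")
```

With the real data, the same pipeline would start from the babynames tibble instead of `toy`, and the first and last rows of `totals` would give the most and least used names.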
Find an interesting dataset to explore. It should have at least 5 variables, including at least 4 numerical variables, and at least 150 cases. (A tidy dataframe would therefore have at least 5 columns and 150 rows.) Preview a few datasets before you decide. You want one that is of reasonable complexity, is not hugely untidy, and is in a format you can easily import, such as CSV. Most data sites allow you to filter your search by data format, topic, etc. Here are some places you might look for a dataset that interests you:
- UK Government Data: https://data.gov.uk/data/search
- World Bank Data: http://datacatalog.worldbank.org/
- Data from the US FiveThirtyEight website: https://github.com/fivethirtyeight/data
- Census datasets are generally good candidates.
Import your chosen dataset into a dataframe/tibble. Make sure it is tidy; use dplyr to wrangle it into a tidy dataframe if necessary. Your goal is to perform some exploratory data analysis on your chosen dataset. You should aim to identify the following:
- major trends or patterns in the data
- any anomalies or outliers in the data (if there are any)
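The import-and-tidy step might look something like the sketch below. The CSV content, the column names, and the object names are all hypothetical, and the data is inlined only so the example is self-contained. Reshaping here uses `pivot_longer()` from tidyr (part of the tidyverse) alongside dplyr; your own dataset may need different wrangling, or none at all.

```r
library(readr)
library(dplyr)
library(tidyr)

# Hypothetical wide-format CSV, inlined for illustration;
# in practice, pass the path of your downloaded file to read_csv()
csv_text <- "region,pop_2015,pop_2016,pop_2017
North,100,110,120
South,200,190,185"

wide <- read_csv(I(csv_text), show_col_types = FALSE)

# Reshape so each row is one region-year observation
tidy <- wide %>%
  pivot_longer(
    cols = starts_with("pop_"),
    names_to = "year",
    names_prefix = "pop_",
    values_to = "population"
  ) %>%
  mutate(year = as.integer(year))
```

In the wide version the year is hidden in the column names; after reshaping it is a variable in its own right, which makes it usable in dplyr and ggplot2 calls.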
You should make use of the R tidyverse (including dplyr and ggplot2) to extract relevant data, compute descriptive statistics, and visualise variable distributions and relationships between variables using appropriate graphical techniques.
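A minimal sketch of this kind of exploration follows, using the `mpg` dataset that ships with ggplot2 as a stand-in for your own data. The choice of variables and the 1.5 x IQR rule for flagging candidate outliers are illustrative assumptions, not requirements of the assignment.

```r
library(dplyr)
library(ggplot2)

# Descriptive statistics for highway mileage, by vehicle class
stats <- mpg %>%
  group_by(class) %>%
  summarise(
    n = n(),
    mean_hwy = mean(hwy),
    median_hwy = median(hwy),
    sd_hwy = sd(hwy)
  )

# Flag candidate outliers within each class using the 1.5 * IQR rule
outliers <- mpg %>%
  group_by(class) %>%
  mutate(
    q1 = quantile(hwy, 0.25),
    q3 = quantile(hwy, 0.75),
    is_outlier = hwy < q1 - 1.5 * (q3 - q1) | hwy > q3 + 1.5 * (q3 - q1)
  ) %>%
  ungroup() %>%
  filter(is_outlier)

# Distribution of one variable across groups
p_dist <- ggplot(mpg, aes(x = class, y = hwy)) +
  geom_boxplot()

# Relationship between two numerical variables
p_rel <- ggplot(mpg, aes(x = displ, y = hwy, colour = class)) +
  geom_point()
```

Rows flagged in `outliers` are only candidates: whether they are data errors or genuinely unusual cases is something to investigate and discuss.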
Think about what your exploration of the data is telling you. Does it suggest hypotheses to you about what might be causing the observations? Does it highlight aspects of the data that are surprising or interesting?
- briefly describe the dataset: where was it sourced? How was it gathered, by whom, and for what purpose?
- present your main results using visualisations, descriptive statistics, short data extracts, and accompanying explanatory text. (You should not include R code in your report; the code should be submitted in a separate file.)
- include commentary, highlighting interesting findings and any suggested hypotheses.
- include a short concluding paragraph.
1. Your report
2. The original data file
3. Your R code. This code should reproduce the analyses in your report, including visualisations. Your code should be commented, and these comments should refer to the relevant Section numbers or Figure numbers in your report.
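One possible way to lay out a commented script so that each block maps to a part of the report is sketched below. The section and figure numbers, the object names, and the use of `mpg` as the dataset are placeholders for illustration only.

```r
library(dplyr)
library(ggplot2)

# ---- Section 1: Dataset description ----
# mpg stands in for the data you would read from your original data file
dat <- mpg

# ---- Section 2, Table 1: number of cases per class ----
table1 <- dat %>%
  count(class)

# ---- Section 2, Figure 1: distribution of highway mileage by class ----
fig1 <- ggplot(dat, aes(x = class, y = hwy)) +
  geom_boxplot() +
  labs(x = "Vehicle class", y = "Highway miles per gallon")

# ---- Section 3, Figure 2: engine displacement against highway mileage ----
fig2 <- ggplot(dat, aes(x = displ, y = hwy)) +
  geom_point()
```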
- Non-obvious: no-one else is likely to choose it.
- You have some understanding of its contents.
- Mixed data types.
- Show off your R/tidyverse skills.
- Advanced statistical modelling is not expected.
- Demonstrate the thinking behind your exploration, not plots of arbitrarily selected variables.