Central Limit Theorem, Negative Binomial Distribution, and Sampling with the Boston City Earnings Dataset
Part1) Central Limit Theorem
Initialize the city of Boston earnings dataset as shown below:
boston <- read.csv(
"http://people.bu.edu/kalathur/datasets/bostonCityEarnings.csv",
colClasses = c("character", "character", "character", "integer", "character"))
The file contains the total earnings of the employees of the city of Boston.
- Show the histogram of the employee earnings. Use breaks from 40000 to 400000 in steps of 20000 and show the corresponding tick labels on the x-axis. Compute the mean and standard deviation of this data. What do you infer from the shape of the histogram?
- Draw 5000 samples of size 10 from this data and show the histogram of the sample means. Compute the mean and the standard deviation of the sample means. Set the start seed for random numbers as the last 4 digits of your BU id.
- Draw 5000 samples of size 40 from this data and show the histogram of the sample means. Compute the mean and the standard deviation of the sample means. Set the start seed for random numbers as the last 4 digits of your BU id.
- Compare the means and standard deviations of the above three distributions.
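The steps in this part can be sketched roughly as follows. This is a minimal sketch rather than a reference solution: the vector `earnings` is synthetic, right-skewed data standing in for the `Earnings` column of the dataset, the helper name `sample_means` is hypothetical, and the seed 1234 is only a placeholder for the last 4 digits of a BU id. The task does not say whether to sample with or without replacement; the sketch uses the `sample()` default (without replacement).

```r
# Synthetic, right-skewed stand-in for the Earnings column (an assumption).
# Values are clamped so that they fall inside the requested breaks.
set.seed(1234)  # placeholder seed
earnings <- pmin(40000 + rgamma(10000, shape = 2, scale = 30000), 400000)

# Histogram with breaks from 40000 to 400000 in steps of 20000,
# with the break values drawn as tick labels on the x-axis.
breaks <- seq(40000, 400000, by = 20000)
hist(earnings, breaks = breaks, xaxt = "n",
     main = "Employee earnings", xlab = "Earnings")
axis(side = 1, at = breaks, labels = breaks, las = 2)

pop_mean <- mean(earnings)
pop_sd <- sd(earnings)

# Hypothetical helper: means of n_samples random samples of a given size.
sample_means <- function(x, n_samples, size) {
  replicate(n_samples, mean(sample(x, size = size)))
}

set.seed(1234)
xbar10 <- sample_means(earnings, 5000, 10)
set.seed(1234)
xbar40 <- sample_means(earnings, 5000, 40)

hist(xbar10, main = "Sample means, size 10", xlab = "Sample mean")
hist(xbar40, main = "Sample means, size 40", xlab = "Sample mean")

# Side-by-side comparison of the three distributions.
comparison <- data.frame(
  distribution = c("data", "sample size 10", "sample size 40"),
  mean = c(pop_mean, mean(xbar10), mean(xbar40)),
  sd = c(pop_sd, sd(xbar10), sd(xbar40))
)
print(comparison)
```

Under the Central Limit Theorem, the means should roughly agree across the three rows, while the standard deviation of the sample means should be close to the data's standard deviation divided by the square root of the sample size.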
Part2) Negative Binomial Distribution
Suppose the input data follows the negative binomial distribution with the parameters size = 3 and prob = 0.5. Set the start seed for random numbers as the last 4 digits of your BU id.
- Generate 5000 random values from this distribution. Show the barplot with the proportions of the distinct values of this distribution.
- With sample sizes of 10, 20, 30, and 40, draw 1000 samples from the 5000 values generated in the previous step. Use the sample() function with replace = FALSE. Show the histograms of the densities of the sample means. Use a 2 x 2 layout.
- Compare the means and standard deviations of the generated data with those of the four sets of sample means.
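One possible shape for the negative binomial steps is sketched below. It is only an illustration: the seed 1234 is a placeholder for the last 4 digits of a BU id, and the variable names (`x`, `means_by_size`, `summary_table`) are not prescribed by the assignment.

```r
set.seed(1234)  # placeholder seed

# 5000 random values from the negative binomial distribution.
x <- rnbinom(5000, size = 3, prob = 0.5)

# Barplot of the proportions of the distinct values.
barplot(prop.table(table(x)),
        xlab = "Value", ylab = "Proportion",
        main = "Negative binomial (size = 3, prob = 0.5)")

# Density histograms of the sample means for four sample sizes, 2 x 2 layout.
sizes <- c(10, 20, 30, 40)
old_par <- par(mfrow = c(2, 2))
means_by_size <- lapply(sizes, function(n) {
  xbar <- replicate(1000, mean(sample(x, size = n, replace = FALSE)))
  hist(xbar, freq = FALSE,
       main = paste("Sample size =", n), xlab = "Sample mean")
  xbar
})
par(old_par)  # restore the previous layout

# Means and standard deviations of the data and of the four sets of means.
summary_table <- data.frame(
  source = c("data", paste("sample size", sizes)),
  mean = c(mean(x), vapply(means_by_size, mean, numeric(1))),
  sd = c(sd(x), vapply(means_by_size, sd, numeric(1)))
)
print(summary_table)
```

For R's parameterization (number of failures before `size` successes), the theoretical mean is size * (1 - prob) / prob = 3 and the variance is size * (1 - prob) / prob^2 = 6, which gives a reference point for the first row of the table.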
Part3) Sampling
Create a subset of the dataset from Part1 containing only the top 5 departments by number of employees. The top 5 departments should be computed using R code. Then, use the %in% operator to create the required subset.
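A minimal sketch of the subsetting step, assuming the data frame has a `Department` column as the later tasks suggest. The data frame `boston_demo` and its department names are synthetic stand-ins, not the real dataset, and the seed is a placeholder.

```r
set.seed(1234)  # placeholder seed

# Synthetic stand-in for the real data (an assumption): 8 made-up departments
# of unequal size, with a right-skewed Earnings column.
boston_demo <- data.frame(
  Department = sample(paste("Dept", LETTERS[1:8]), 2000, replace = TRUE,
                      prob = c(8, 7, 6, 5, 4, 3, 2, 1)),
  Earnings = round(40000 + rgamma(2000, shape = 2, scale = 30000)),
  stringsAsFactors = FALSE
)

# Count employees per department and keep the names of the 5 largest.
dept_counts <- sort(table(boston_demo$Department), decreasing = TRUE)
top5 <- names(dept_counts)[1:5]

# Keep only the rows whose department is one of the top 5.
top_subset <- boston_demo[boston_demo$Department %in% top5, ]
print(table(top_subset$Department))
```

With the real data, the same three lines (count, take the first five names, filter with `%in%`) would be applied to the `boston` data frame instead of `boston_demo`.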
Use a sample size of 50 for each of the following.
Set the start seed for random numbers as the last 4 digits of your BU id.
- Show the sample drawn using simple random sampling without replacement. Show the frequencies for the selected departments. Show the percentages of these with respect to sample size.
- Show the sample drawn using systematic sampling. Show the frequencies for the selected departments. Show the percentages of these with respect to sample size.
- Calculate the inclusion probabilities using the Earnings variable. Using these values, show the sample drawn using systematic sampling with unequal probabilities. Show the frequencies for the selected departments. Show the percentages of these with respect to sample size.
- Order the data using the Department variable. Draw a stratified sample using proportional sizes based on the Department variable. Show the frequencies for the selected departments. Show the percentages of these with respect to sample size.
- Compare the means of the Earnings variable for these four samples against the mean for the data.
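The four sampling schemes above can be sketched in base R as shown below. This is an illustrative sketch under stated assumptions, not the expected solution: `pop` is synthetic data standing in for the top-5 subset, `freq_table` is a hypothetical helper, and the seed is a placeholder. The `sampling` package offers `srswor()`, `inclusionprobabilities()`, `UPsystematic()`, and `strata()` for the same tasks; base R is used here only to keep the example free of extra dependencies.

```r
set.seed(1234)  # placeholder seed

# Synthetic stand-in for the top-5 subset (an assumption).
pop <- data.frame(
  Department = sample(paste("Dept", LETTERS[1:5]), 1000, replace = TRUE,
                      prob = c(5, 4, 3, 2, 1)),
  Earnings = round(40000 + rgamma(1000, shape = 2, scale = 30000)),
  stringsAsFactors = FALSE
)
N <- nrow(pop)
n <- 50

# Hypothetical helper: department frequencies and percentages of a sample.
freq_table <- function(s) {
  out <- as.data.frame(table(s$Department), responseName = "count")
  out$percent <- 100 * out$count / nrow(s)
  out
}

# Simple random sampling without replacement.
srs <- pop[sample(N, n, replace = FALSE), ]

# Systematic sampling: random start r, then every k-th row.
k <- ceiling(N / n)
r <- sample(k, 1)
idx <- seq(r, by = k, length.out = n)
sys <- pop[idx[idx <= N], ]  # may hold fewer than n rows if n does not divide N

# Inclusion probabilities proportional to Earnings (they sum to n).
# Values above 1 would need capping and rescaling; not handled here.
pik <- n * pop$Earnings / sum(pop$Earnings)

# Systematic sampling with unequal probabilities: a unit is selected when
# one of the points u, u + 1, u + 2, ... falls in its cumulative interval.
u <- runif(1)
cum <- cumsum(pik)
hits <- floor(cum - u) - floor(c(0, cum[-N]) - u)
ups <- pop[hits >= 1, ]

# Stratified sampling with sizes proportional to department counts.
# Rounding means the sizes may not add up to exactly n.
pop_ord <- pop[order(pop$Department), ]
dept_tab <- table(pop_ord$Department)
strata_sizes <- round(n * as.vector(dept_tab) / N)
names(strata_sizes) <- names(dept_tab)
strat_idx <- unlist(lapply(names(strata_sizes), function(d) {
  rows <- which(pop_ord$Department == d)
  rows[sample.int(length(rows), strata_sizes[[d]])]
}))
strat <- pop_ord[strat_idx, ]

print(freq_table(srs))
print(freq_table(sys))
print(freq_table(ups))
print(freq_table(strat))

# Compare the mean Earnings of each sample with the mean of the data.
means <- c(data = mean(pop$Earnings), srs = mean(srs$Earnings),
           systematic = mean(sys$Earnings), unequal = mean(ups$Earnings),
           stratified = mean(strat$Earnings))
print(round(means))
```

Because the unequal-probability scheme favors higher earners, its unweighted sample mean tends to sit above the mean of the data, whereas the other three schemes should land near it.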