Statistical Analysis of Total Annual Income of Households

Purpose of Assignment

a)To appreciate the usefulness and applications of major statistical concepts in data analysis and data inference.

b)To be able to apply the concepts learned in class to a large real-world data set. Working with real data highlights the true nature of Statistics and allows students to draw useful and meaningful conclusions. Students will discover that insights gained from data lead to good decision-making in a specific context.

c)To enhance students’ ability to understand the what, whether and why of the context of the statistical calculations so that they can better communicate statistical ideas and strategies.

d)To recognize that many tedious statistical calculations, summaries, graphs, and charts can be implemented by appropriate software such as Microsoft Excel.

The following required tasks must be carried out at the minimum to satisfy the project’s requirements.

a)From the unzipped folder Project_Yourname, open the workbook named INCOMES.xlsx.

b)Click on the Documentation Sheet and complete the required documentation.

c)As you do your data analyses in this file, organize all your tasks in corresponding worksheets with appropriate names so that you can easily refer to them for explanations and insights when you write your final report.

d)Save your workbook often, and make sure to save it in the folder Project_Yourname.

Background: The normal distribution is the most important probability distribution in statistics because it fits many natural phenomena. As you have seen, many statistical tests rest on the assumption of normality. Deviations from normality, called non-normality, can render those statistical tests inaccurate, so it is important to know whether your data are normal or non-normal. This knowledge is also important because if the population has a normal distribution, then the sample means will automatically have a normal distribution.

Do the following:

a)Suppose the focus of our data analysis is Total Annual Income of the households. So, we must check if population total annual income data is normally distributed or not.

b)Use any TWO Excel methods of your choice to determine the normality of the Total Annual Income population data. Both of the methods must lead to the same conclusion. Note that you must take into consideration all of the given total income data (50,000+) in the population since you are checking the normality of the population (and not the sample).

c)Report your findings with clear justifications and support in the written report.
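As a cross-check outside Excel, two simple normality indicators could be sketched in Python as follows. This is only a sketch under stated assumptions: the income list is synthetic (a made-up right-skewed stand-in, not the INCOMES.xlsx data), the function names are illustrative, and the moment formulas are the population versions, so the values may differ slightly from Excel's SKEW and KURT, which apply small-sample adjustments.

```python
import random
import statistics

def skewness_and_excess_kurtosis(values):
    """Return (skewness, excess kurtosis) from population moments.
    Both are close to 0 when the data are normally distributed."""
    n = len(values)
    mean = statistics.fmean(values)
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    return m3 / m2 ** 1.5, m4 / m2 ** 2 - 3.0

def mean_median_gap(values):
    """Return (mean - median) / standard deviation; close to 0 for symmetric data."""
    spread = statistics.pstdev(values)
    return (statistics.fmean(values) - statistics.median(values)) / spread

# Synthetic, right-skewed stand-in for the income column (an assumption for illustration).
rng = random.Random(1)
incomes = [rng.lognormvariate(10.5, 0.8) for _ in range(50_000)]

skew, kurt = skewness_and_excess_kurtosis(incomes)
gap = mean_median_gap(incomes)
print(f"skewness={skew:.2f}, excess kurtosis={kurt:.2f}, (mean-median)/sd={gap:.2f}")
```

Values far from zero on both indicators point toward non-normality; a histogram or normal probability plot of the same column should tell the same story.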

Background: Even though our main focus of statistical study is the population, we normally don’t have enough resources to collect data about it. Sometimes, even if we want to, there may not be enough time or the individuals in the population may not be easily accessible.

Task 1 Organize the Excel File

Therefore, we study a part of the population called a random sample and draw conclusions about the whole population using the sample.

There are two kinds of sampling techniques: “with replacement” and “without replacement”. In “with replacement”, the selected subject is thrown back into the hat, so to speak, and therefore may be picked again in the next round as part of the sample. So, in “with replacement” there may be duplicate subjects in the sample. In “without replacement”, the selected subject is set aside and cannot be picked again.

Because of the possibility of duplicates in the sample with replacement, many people prefer sampling without replacement. To find a sample without duplicates, you will use Excel’s RAND() function that returns an evenly distributed random real number greater than or equal to 0 and less than 1. A new random real number is returned every time the worksheet is calculated.

Do the following:

a)Using Excel’s RAND() function, generate one good random sample of size n = 100.

b)Next, analyze the data of this random sample using the Five-Number Summary table. Focus on the salient points noticed in the table.

c)Investigate whether this sample data is normally distributed. From the normality aspect, how do the sample data and the corresponding population data compare? Do both exhibit normality?

d)Report your findings with clear explanations and support in the written report.
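The RAND()-based approach to sampling without replacement (attach a random number to every row, sort by it, keep the first 100 rows) and the Five-Number Summary can be sketched in Python. This is a hedged illustration, not the required Excel workflow: the population here is synthetic, the function names are made up for the example, and the "inclusive" quartile method is chosen on the assumption that it mirrors Excel's QUARTILE.INC.

```python
import random
import statistics

def sample_without_replacement(values, n, rng):
    """Give every row a random key in [0, 1), sort by the key, keep the first n rows.
    Each row appears once in the sorted list, so no row can be drawn twice."""
    keyed = [(rng.random(), i) for i in range(len(values))]
    keyed.sort()
    return [values[i] for _, i in keyed[:n]]

def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum) using inclusive quartiles."""
    q1, q2, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return min(values), q1, q2, q3, max(values)

# Synthetic stand-in for the population income column (an assumption for illustration).
rng = random.Random(3)
population = [rng.lognormvariate(10.5, 0.8) for _ in range(55_899)]

sample = sample_without_replacement(population, 100, rng)
for label, value in zip(("min", "Q1", "median", "Q3", "max"), five_number_summary(sample)):
    print(f"{label:>6}: {value:,.0f}")
```

Comparing the distance from Q1 to the median with the distance from the median to Q3, and the two whisker lengths, gives a quick read on skewness in the sample.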

Background: We cannot deny the fact that the mean is the most common measure of central tendency, since it suggests a typical or central value and serves as a “balance point” in a set of data. Therefore, it is small wonder that the mean figures prominently in many inferential statistical methods of analysis. Specifically, we will study and verify the following points in this section.

a.The sample mean is unbiased because the mean of all the possible sample means of a given sample size n, is equal to the population mean.

(Note that it says “all the possible sample means of a given sample size n”; if there are a million possible samples, then we must compute the sample means of all these million samples to prove unbiasedness!).

b.As the sample size n gets large enough, the sampling distribution of the mean is approximately normally distributed. This is true regardless of the shape of the distribution of the individual values in the population. This powerful and useful knowledge is known as the Central Limit Theorem.

Do the following:

a)It is required to take all possible random samples of size n = 100 total income data from the given population. Theoretically, how many total combinations of possible random samples of size n = 100 can be generated out of a population of N = 55899 pieces of data? (HINT: Use the Combination formula that represents how many different outcomes of size “r” items can be generated out of the given total “n” items. If you don’t know what that is, Google and find out. Even if you find and use the formula, it may not be able to help you since the result is going to be so huge that an ordinary calculator may not be able to handle it. Try it and report accordingly).

Task 2 Testing Population Normality

b)It is easy to guess that there will be a very large number of possible different random samples of size n = 100, and we cannot practically generate and study all of them. Therefore, in this study, you are asked to generate only 10 different random samples of size n = 100. Hence, your findings will not be exact, but approximate.

c)Study the 10 generated random samples of size n = 100 carefully. Do they approximately prove the unbiasedness property of the mean?

d)Do they approximately prove the Central Limit Theorem?

e)Report your findings with clear explanations and support in the written report.
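For orientation, the size of the combination count and the 10-sample experiment can be sketched in Python, which handles arbitrarily large integers. The sketch rests on assumptions: the population is synthetic (not the INCOMES.xlsx data), and the variable names are illustrative. Ten sample means can only suggest, not prove, unbiasedness or the Central Limit Theorem.

```python
import math
import random
import statistics

N, n = 55_899, 100

# Combination formula: C(N, n) = N! / (n! * (N - n)!), evaluated exactly.
count = math.comb(N, n)
digits = len(str(count))
print(f"C({N}, {n}) is an integer with {digits} digits")

# Synthetic, right-skewed stand-in for the population (an assumption for illustration).
rng = random.Random(2)
population = [rng.lognormvariate(10.5, 0.8) for _ in range(N)]
mu = statistics.fmean(population)

# Ten random samples of size n, each drawn without replacement.
sample_means = [statistics.fmean(rng.sample(population, n)) for _ in range(10)]
mean_of_means = statistics.fmean(sample_means)

print(f"population mean        : {mu:,.0f}")
print(f"mean of 10 sample means: {mean_of_means:,.0f}")
print(f"sd of the sample means : {statistics.stdev(sample_means):,.0f}")
```

If the mean of the sample means lands close to the population mean, that is consistent with unbiasedness; a histogram of many more sample means would be needed to see the bell shape that the Central Limit Theorem predicts.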

Background: The mean is the superstar of many business data analysis scenarios. The sample mean is often used to estimate the population mean for economic reasons. However, calculating just a single sample mean (called a point estimate) and using it to generalize about the population mean is too risky. Therefore, statisticians use a confidence interval estimate, which is a range of numbers, called an interval, constructed around a point estimate. The confidence interval is constructed such that the probability (or confidence) that the interval includes the population mean is set in advance (such as 95%).

Do the following:

a)Select a random sample of size n = 100 of the total income data.

b)Construct a 95% confidence interval estimate for the population mean total income. (Assume population σ is unknown in this problem and future problems).

c)Investigate if the actual population mean (that you obtained in one of the tasks above) does indeed fall within your confidence interval (and not outside of it).

d)Interpret correctly the meaning of your Confidence Interval for the Population Mean with support in the report.
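Because σ is unknown, the interval uses the t distribution: sample mean ± t × s / √n. A minimal Python sketch follows, under these assumptions: the sample is synthetic, the function name is illustrative, and the critical value 1.9842 is the two-tailed t value for 99 degrees of freedom (what Excel's =T.INV.2T(0.05, 99) should return), so it only applies when n = 100.

```python
import math
import random
import statistics

# Two-tailed 95% critical value of t with df = 99 (valid only for n = 100).
T_CRIT_95_DF99 = 1.9842

def mean_confidence_interval(sample, t_crit):
    """Return (lower, upper) for the population mean when sigma is unknown."""
    n = len(sample)
    xbar = statistics.fmean(sample)
    s = statistics.stdev(sample)          # sample standard deviation (n - 1 divisor)
    margin = t_crit * s / math.sqrt(n)    # t times the standard error of the mean
    return xbar - margin, xbar + margin

# Synthetic sample of 100 incomes (an assumption for illustration).
rng = random.Random(4)
sample = [rng.lognormvariate(10.5, 0.8) for _ in range(100)]

lower, upper = mean_confidence_interval(sample, T_CRIT_95_DF99)
print(f"95% CI for the mean: ({lower:,.0f}, {upper:,.0f})")
```

The interpretation concerns the procedure: about 95% of intervals built this way from repeated samples would contain the population mean; any single interval either contains it or does not.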

Background: The concept of a confidence interval also applies to categorical data (such as gender: male or female). With categorical data, we estimate the proportion of items in a population (e.g., the proportion of females) having a certain characteristic of interest (e.g., holding a Bachelor’s Degree or beyond).

Do the following:

a)Select a random sample of size n = 100 that includes the data from the SEX and EDUCATION fields.

b)Construct a 95% confidence interval estimate for the population proportion of females who have Bachelor’s Degrees or beyond (i.e., Education = 5 or 6). (Assume population σ is unknown in this problem and future problems).

Task 3 Random Sampling

c)Find the actual population proportion of these females using Excel. (HINT: Use the Excel formula =COUNTIF(range, criteria) to find the total number of females in the population of 55899 with criteria = “2”, and then use that number to find the population female proportion).

d)Investigate if the actual population proportion (that you obtained in part(c)) does indeed fall within your confidence interval (and not outside of it).

e)Interpret correctly the meaning of your Confidence Interval for the Population Proportion of females with support in the report.
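The interval for a proportion follows the pattern p̂ ± z × √(p̂(1 − p̂)/n). A hedged Python sketch is given here; the rows are hypothetical (SEX, EDUCATION) code pairs invented for the example, the function name is illustrative, and using all n sampled rows as the denominator is an assumption about how the proportion is defined.

```python
import math
from statistics import NormalDist

def proportion_confidence_interval(successes, n, confidence=0.95):
    """Return (lower, upper) for a population proportion using the normal approximation."""
    p_hat = successes / n
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)   # about 1.96 for 95%
    margin = z * math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - margin, p_hat + margin

# Hypothetical sample of (SEX, EDUCATION) codes: SEX 2 = female (per the hint),
# EDUCATION 5 or 6 = Bachelor's Degree or beyond.
rows = [(2, 5)] * 10 + [(2, 6)] * 8 + [(2, 3)] * 30 + [(1, 5)] * 12 + [(1, 2)] * 40

successes = sum(1 for sex, education in rows if sex == 2 and education in (5, 6))
lower, upper = proportion_confidence_interval(successes, len(rows))
print(f"p-hat = {successes / len(rows):.2f}, 95% CI = ({lower:.4f}, {upper:.4f})")
```

The normal approximation is usually considered reasonable when both the number of successes and the number of failures in the sample are at least 5.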

Background: In confidence interval estimation, you do not make any assertions or claims about the population mean in advance. You simply estimate an interval for the population parameter with a given degree of confidence. However, in the business world, people suggest theories and make claims and assertions about population parameters as if they were clairvoyants or experts. This is where hypothesis testing comes in to test these claims. You set a level of significance first, and then test whether there is enough evidence to support the claim.

Do the following:

a)Select a random sample of size n = 100 of the total income data.

b)Test the following hypotheses for the population mean of the total income data:

H0: μ ≥ 40,000 (Mean total income is greater than or equal to $ 40,000)

H1: μ < 40,000 (Mean total income is less than $ 40,000)

Using level of significance = 0.05, verify if there is enough evidence to support the claim. (Assume population σ is unknown in this problem and future problems).

c)Using hypothesis testing always involves the risk of reaching an incorrect conclusion in the form of a Type I error or a Type II error. Explore the meaning of these errors in the context of the above test.

d)Report your findings with clear explanations and support in the written report.
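The lower-tail test above compares t = (x̄ − 40,000) / (s / √n) with the critical value for α = 0.05. A minimal Python sketch is shown below under stated assumptions: the sample is synthetic, the function name is illustrative, and −1.6604 is the lower-tail critical value for 99 degrees of freedom (what Excel's =T.INV(0.05, 99) should return), so it only applies when n = 100.

```python
import math
import random
import statistics

# Lower-tail critical value of t for alpha = 0.05 and df = 99 (valid only for n = 100).
T_CRIT_LOWER_05_DF99 = -1.6604

def lower_tail_t_test(sample, mu0, t_crit):
    """Return (t statistic, reject H0?) for H0: mu >= mu0 versus H1: mu < mu0."""
    n = len(sample)
    standard_error = statistics.stdev(sample) / math.sqrt(n)
    t_stat = (statistics.fmean(sample) - mu0) / standard_error
    return t_stat, t_stat < t_crit      # reject only when t falls in the lower tail

# Synthetic sample of 100 incomes (an assumption for illustration).
rng = random.Random(6)
sample = [rng.lognormvariate(10.5, 0.8) for _ in range(100)]

t_stat, reject = lower_tail_t_test(sample, 40_000, T_CRIT_LOWER_05_DF99)
print(f"t = {t_stat:.3f}; reject H0 at the 0.05 level: {reject}")
```

In this setting, a Type I error would be concluding that the mean income is below $40,000 when it is not, and a Type II error would be failing to detect a mean that really is below $40,000.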

Background: In this section, you extend hypothesis testing to two-sample tests that compare two sample statistics from two random samples selected from two populations. These two samples can come from two entirely different mother populations (e.g., a population of LED light bulbs and a population of standard light bulbs), or the two samples can come from one and the same mother population (e.g., SAT scores for all students) but for two different characteristics of the data (for example, one subset of SAT data for the males, and a second subset of SAT data for the females). In both cases, the two samples are defined as independent samples.

Task 4 Central Limit Theorem

However, if the two samples come from one and the same mother population but are collected at different times (e.g., a group measured twice, as in a pretest-posttest situation [scores on a test before and after the lesson]), then the two samples are defined as dependent because the data in the samples are related to each other through the pre-post matchups.

In addition to normality, knowing whether the samples are dependent or independent is also very important in hypothesis testing.

Do the following:

a)First, you will be collecting two random samples of size n = 100 for total incomes: one for the males and one for the females from the same given population data of size = 55899. Follow these steps:

1.First, in Excel, sort population data first by gender = male, and then by annual total income. If you don’t know how to do that, refer to the following link:

a.Sort Two or More Columns

2.Copy all the sorted total income data for the males only onto a new worksheet. Note that this population data is sorted, and not random!

3.Using the RAND() function, generate a random sample of 100 values of the males’ total annual income.

4.Using the steps 1-3, generate a random sample of 100 female annual total income data.

5.Now, you have two sets of random samples that will help you study the difference in mean total annual incomes of the two genders.

b)Using the data from the two random samples generated above, and assuming that the population variances (or standard deviations) of the male and female total income data are equal, conduct a suitable t-test to determine whether there is evidence that the males’ mean total annual income is greater than the females’ mean total annual income. Use the road map provided in one of the past handouts to select the correct t-test formula. Here is the hypothesis test that you are performing:

H0: μ1 ≥ μ2 (Male’s mean total income is greater than or equal to female’s)

H1: μ1 < μ2 (Male’s mean total income is less than female’s)

Or, equivalently:

H0: μ1 - μ2 ≥ 0 (Difference of the two mean total incomes is greater than or equal to zero)

H1: μ1 - μ2 < 0 (Difference of the two mean total incomes is less than zero)

c)Determine the p-value in (b). How will you interpret its meaning in the context of the problem?

d)In addition to equal variances, list all other assumptions that you have used in Part (b).

e)Report your findings with clear explanations and support in the written report.
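With equal variances assumed, the pooled-variance t statistic combines the two sample variances into one estimate and has n1 + n2 − 2 degrees of freedom. The Python sketch below illustrates the calculation under stated assumptions: both samples are synthetic, the function name is illustrative, and −1.6526 is an approximate lower-tail critical value for 198 degrees of freedom at α = 0.05 (to be confirmed with Excel's =T.INV(0.05, 198)).

```python
import math
import random
import statistics

# Approximate lower-tail critical value of t for alpha = 0.05, df = 198 (assumed).
T_CRIT_LOWER_05_DF198 = -1.6526

def pooled_t_statistic(sample1, sample2):
    """Return (t statistic, degrees of freedom) for the equal-variance two-sample t-test."""
    n1, n2 = len(sample1), len(sample2)
    var1, var2 = statistics.variance(sample1), statistics.variance(sample2)
    df = n1 + n2 - 2
    pooled_var = ((n1 - 1) * var1 + (n2 - 1) * var2) / df
    standard_error = math.sqrt(pooled_var * (1 / n1 + 1 / n2))
    t_stat = (statistics.fmean(sample1) - statistics.fmean(sample2)) / standard_error
    return t_stat, df

# Synthetic male and female income samples (assumptions for illustration).
rng = random.Random(8)
males = [rng.lognormvariate(10.6, 0.8) for _ in range(100)]
females = [rng.lognormvariate(10.4, 0.8) for _ in range(100)]

t_stat, df = pooled_t_statistic(males, females)
reject = t_stat < T_CRIT_LOWER_05_DF198   # lower-tail test, H1: mu1 - mu2 < 0
print(f"t = {t_stat:.3f}, df = {df}, reject H0: {reject}")
```

In Excel, the corresponding p-value would normally come from =T.TEST(array1, array2, 1, 2), where the final 2 selects the two-sample equal-variance test; check that the sign of the t statistic matches the tail of the alternative hypothesis before relying on a one-tailed p-value.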

Background: Simple Linear Regression (SLR) is a basic and commonly used type of predictive analysis. Its purpose is to measure to what extent there is a linear relationship between two variables X and Y. If there is a linear relationship, what is the direction of the relationship: Positive or negative? What is the strength of the relationship: None, weak, moderate, strong, or very strong?

As an added bonus, once a viable linear regression model (or the least squares line) is generated, it can be used to "predict" the value of the dependent variable Y based upon the values of the independent variable X. Hence SLR is also known as predictive analysis.

In this section, we are interested in investigating whether there is a good linear relationship between Annual Total Income (Y) and the Education level (X).

Do the following:

a)Using Excel’s RAND() function, generate a random sample of size n = 100. The sample must have two fields: Annual Total Income and its associated Education level. The data for the two variables should preferably be placed in adjacent columns, with the Y variable (dependent/response) in the column immediately to the right of the X variable (independent/explanatory).

b)Draw the relevant scatter plot.

c)Use the least-squares method to compute the regression coefficients b0 and b1.

d)Interpret the meanings of b0 and b1.

e)Predict the mean total annual income for education level = 4.

f)Should you use the model to predict the mean annual total income for an individual whose education level = 10? Why or why not?

g)Determine the correlation coefficient “r” (i.e., the square root of R², carrying the sign of the slope b1), and explain its meaning in this problem.

h)Perform a residual analysis. Is there any evidence of a pattern in the residuals?

i)At the 0.05 level of significance, is there a linear relationship between annual total income and the education level? Perform a suitable hypothesis test.

j)Construct a 95% confidence interval estimate for the population slope β1 and interpret the result.

k)Write a nice report of all your findings above with good explanations and support.
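The least-squares quantities asked for above can be sketched in Python using the textbook formulas b1 = SSXY / SSX and b0 = ȳ − b1·x̄. This is an illustration under stated assumptions: the five (x, y) pairs are hypothetical toy values rather than income data, and the function name is made up for the example.

```python
import math
import statistics

def simple_linear_regression(xs, ys):
    """Return (b0, b1, r, standard error of b1) for the least-squares line y = b0 + b1 * x."""
    n = len(xs)
    x_bar, y_bar = statistics.fmean(xs), statistics.fmean(ys)
    ssx = sum((x - x_bar) ** 2 for x in xs)
    ssy = sum((y - y_bar) ** 2 for y in ys)
    ssxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b1 = ssxy / ssx                      # slope
    b0 = y_bar - b1 * x_bar              # intercept
    r = ssxy / math.sqrt(ssx * ssy)      # correlation coefficient, same sign as b1
    sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))   # sum of squared residuals
    se_b1 = math.sqrt(sse / (n - 2)) / math.sqrt(ssx)
    return b0, b1, r, se_b1

# Hypothetical toy data: x = education level, y = income in arbitrary units.
education = [1, 2, 3, 4, 5]
income = [2, 4, 5, 4, 5]

b0, b1, r, se_b1 = simple_linear_regression(education, income)
predicted_at_4 = b0 + b1 * 4
t_slope = b1 / se_b1                     # test statistic for H0: beta1 = 0, df = n - 2
print(f"b0={b0:.3f}, b1={b1:.3f}, r={r:.4f}, se(b1)={se_b1:.4f}, t={t_slope:.3f}")
print(f"predicted mean Y at X = 4: {predicted_at_4:.3f}")
```

A confidence interval for the slope then takes the form b1 ± t × se(b1) with n − 2 degrees of freedom, and predictions are only trustworthy for X values inside the range observed in the sample.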
