## Dataset 1

Prepare a report in a document file (.doc or .docx) which includes all relevant tables and figures, using the following structure:

1. Section 1: Introduction

a. Give a brief introduction to the assignment, search for a related article, and write a one-paragraph summary of it that supports your assignment. Give the full citation of the article.

b. Dataset 1: Give a short description of this dataset. Is it primary or secondary data? What types of variables are involved? Briefly explain the possible cases used in this study.

c. Dataset 2: Explain how you collected the data and discuss its limitations (e.g. whether your sample is biased). Is it primary or secondary data? What type(s) of variable(s) are involved? Describe the cases you consider for this dataset.

2. Section 2: Analysis of single variable in Dataset 1

a. To answer the research question “Which type of public transport was most used by the NSW people during 8th to 14th of August 2016?”, provide a suitable numerical summary and graphical display for the variable mode of Dataset 1. Give a detailed comment to answer the research question.

b. Now, to answer the research question “Do more than 50% of public transport users in NSW use the particular mode of transport found in Part a?”, set up appropriate hypotheses, perform the hypothesis test and answer the research question by writing the conclusion of the test.

3. Section 3: Analysis of two variables in Dataset 1

The NSW Government needs to decide whether to build an underground railway line from either Parramatta, Bankstown or Gosford to Central. To prepare a recommendation for this:

a. Give a numerical summary and an appropriate graphical display for the variable location, considering only those three stations, and for the variable count, considering the data for trains only.

b. Perform a suitable hypothesis test at the 5% level of significance to test whether there is a difference between the mean counts of taps on and taps off.

c. Use the conclusion of the test in part b and the outputs in part a to write a recommendation to the NSW Government.

4. Section 4: Collect and analyse Dataset 2

You are interested in finding whether there is a difference in preference between genders in terms of their transport mode (Bus, Train, Ferry and Light Rail). By considering an appropriate number of cases and variables, give a proper graphical display and use it to write a comment.

Section 5: Discussion & Conclusion

Write an executive summary combining all your findings in the previous sections, which must be a valuable recommendation for NSW Transport. Give a suggestion for further research.

Dataset 1

The aim of this assignment is to test skills in collecting and analyzing data to answer a specific business problem. It also presents an opportunity to apply the theory learnt during the course, such as finding numerical summaries, displaying data with appropriate graphs and using statistical inference to solve business problems, including constructing hypotheses, testing them and interpreting the findings (Ryabko, Stognienko, & Shokin, 2004).

We are presented with data on the NSW transport system in order to produce evidence-based recommendations aimed at improving the public transport system. The project poses a series of research questions to be answered using the knowledge gained during the course.

b. Dataset 1:

The first dataset (Dataset 1) is secondary data provided by the NSW transport system. The data has a total of 1000 observations with six variables. The description of the variables is given below:

Table 1: Description of the variables

| Variable | Description | Values | Variable Type |
|---|---|---|---|
| mode | Type of the public transport | Bus, Train, Ferry and Light Rail | Nominal variable (qualitative) |
| date | Date on which the tap on/off took place | Date/month/year | Nominal variable (qualitative) |
| tap | Whether it is a tap on or a tap off | On and Off | Nominal variable (qualitative) |
| loc | Location of the stop: postcode for buses, station name for the other modes | Postcodes and names of the stations | Nominal variable (qualitative) |
| count | Total number of taps on or off at the given location on the given date | Number | Scale variable (quantitative) |

The cases used in this study are the 1000 observations in the dataset.

c. Dataset 2:

The second dataset (Dataset 2) is primary data collected by the researcher. A random sample of 50 individuals was selected and each person was interviewed about their gender, age and the mode of transport they prefer to use most. The data has a total of 50 observations (cases) with three variables.

For Dataset 2, random sampling was employed to collect the data from individuals in order to understand the mode of transport they use most frequently. This is primary data since it was collected directly from the subjects. The main limitation of this data is the small sample size of only 50 cases. The description of the variables is given below:

Table 2: Description of the variables

| Variable | Description | Values | Variable Type |
|---|---|---|---|
| Mode | Type of the public transport | Bus, Train, Ferry and Light Rail | Nominal variable (qualitative) |
| Age | Age of the respondent | Number | Scale variable (quantitative) |
| Gender | Gender of the respondent | Male and female | Nominal variable (qualitative) |

2. Section 2: Analysis of single variable in Dataset 1

In this section we use Dataset 1 to answer the research questions posed.

a. Which type of public transport was most used by the NSW people during 8th to 14th of August 2016?

To answer this research question, we produced a frequency distribution of the variable mode. Table 3 below gives the results.

Table 3: Frequency table for the mode of transport used

| Mode | Count of mode | Percent |
|---|---|---|
| Bus | 467 | 46.7% |
| Ferry | 25 | 2.5% |
| Light-rail | 24 | 2.4% |
| Train | 484 | 48.4% |
| Grand Total | 1000 | 100.0% |

As can be seen, the two most used modes were bus and train. Train was the most frequently used mode, accounting for 48.4% (n = 484) of the observations for the week. Bus was the second most commonly used mode with 46.7% (n = 467). Ferry and light rail were the least used, accounting for only 2.5% (n = 25) and 2.4% (n = 24) of the observations respectively.
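The percentages in Table 3 follow directly from the mode counts. A minimal Python sketch of that calculation, using only the counts reported in the table (the variable names are illustrative and are not part of the original analysis):

```python
# Counts of records per transport mode, taken from Table 3.
mode_counts = {"Bus": 467, "Ferry": 25, "Light-rail": 24, "Train": 484}

total = sum(mode_counts.values())

# Percentage share of each mode, rounded to one decimal place.
mode_percent = {mode: round(100 * n / total, 1) for mode, n in mode_counts.items()}

# The most used mode is the one with the largest count.
most_used = max(mode_counts, key=mode_counts.get)

for mode, n in mode_counts.items():
    print(f"{mode:<10} {n:>5} {mode_percent[mode]:>5.1f}%")
print("Most used mode:", most_used)
```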


Figure 1: Bar chart on mode of transport used

b. To answer the research question “Is the proportion of public transport users who use the train greater than 50%?”, the following hypotheses were tested.

H0: The proportion of transport users who use the train is equal to 50%.

HA: The proportion of transport users who use the train is different from 50%.

A one-sample t-test was used at the 5% level of significance. The results are given below:

Table 4: One-Sample Statistics

| | N | Mean | Std. Deviation | Std. Error Mean |
|---|---|---|---|---|
| Train | 1000 | .4840 | .49999 | .01581 |

Table 5: One-Sample Test

Test Value = 0.5

| | t | df | Sig. (2-tailed) | Mean Difference | 95% CI of the Difference (Lower) | 95% CI of the Difference (Upper) |
|---|---|---|---|---|---|---|
| Train | -1.012 | 999 | .312 | -.01600 | -.0470 | .0150 |

A one-sample t-test was run to determine whether the proportion of NSW transport users who rely on the train as their mode of transport is more than 50%. The sample proportion (mean = 0.484, SD = 0.500) was not significantly different from 50%, t(999) = -1.012, p = .312 (95% CI of the difference, -0.047 to 0.015). We therefore fail to reject the null hypothesis: there is no evidence that more than 50% of public transport users in NSW use the train.
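The test statistic can be reproduced from the summary figures alone. The sketch below is illustrative only: it recomputes the one-sample t statistic on the 0/1 train indicator, and then, as an alternative that is not part of the original analysis, a one-sided z-test for a proportion, which matches the "greater than 50%" wording of the research question more directly.

```python
import math

n = 1000        # number of records in Dataset 1
p_hat = 0.484   # sample proportion of train records (Table 4)
p0 = 0.5        # hypothesised proportion

# One-sample t statistic on the 0/1 indicator, as reported in Tables 4 and 5.
sd = math.sqrt(p_hat * (1 - p_hat) * n / (n - 1))  # sample standard deviation
se = sd / math.sqrt(n)                             # standard error of the mean
t_stat = (p_hat - p0) / se

# Alternative (an assumption, not the original method): one-sided z-test
# with H0: p <= 0.5 and HA: p > 0.5, using the null standard error.
z_stat = (p_hat - p0) / math.sqrt(p0 * (1 - p0) / n)
# Upper-tail p-value P(Z > z) from the standard normal distribution.
p_one_sided = 0.5 * math.erfc(z_stat / math.sqrt(2))

print(f"t = {t_stat:.3f}, z = {z_stat:.3f}, one-sided p = {p_one_sided:.3f}")
```

Because the sample proportion lies below 0.5, the one-sided p-value is large, which agrees with the conclusion that there is no evidence of a majority using the train.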

3. Section 3: Analysis of two variables in Dataset 1

The NSW Government needs to decide whether to build an underground railway line from either Parramatta, Bankstown or Gosford to Central. To prepare a recommendation for this:

a. Give a numerical summary and an appropriate graphical display for the variable location, considering only those three stations, and for the variable count, considering the data for trains only.

In this section we first consider the number of times the train left each of the three locations. This information is given in the table below:

Table 6: Frequency of train from the three locations

| Station | Count | Percent |
|---|---|---|
| Parramatta Station | 7 | 53.8% |
| Gosford Station | 2 | 15.4% |
| Bankstown Station | 4 | 30.8% |

Figure 2: Bar chart for the count of times the train leaves the stations

For the variable count, the average was 103.38 with a standard deviation of 226.14. Table 7 reports Count = 1000, so these statistics are based on all records in Dataset 1 rather than on the train records only.

Table 7: Descriptive statistics for the variable count

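| Statistic | count |
|---|---|
| Mean | 103.379 |
| Standard Error | 7.151282 |
| Median | 53 |
| Mode | 18 |
| Standard Deviation | 226.1434 |
| Sample Variance | 51140.84 |
| Kurtosis | 238.9731 |
| Skewness | 13.04214 |
| Range | 4955 |
| Minimum | 18 |
| Maximum | 4973 |
| Sum | 103379 |
| Count | 1000 |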

The mode of the counts was 18 and the median was 53. The skewness value (13.04) indicates that the data is heavily skewed. This is also evident from the fact that the minimum count was 18 while the maximum count was 4973. This very large range suggests the presence of outliers in the dataset, which would account for the skewness observed.
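Several entries in Table 7 are linked by simple arithmetic, which offers a quick consistency check on the output. The following sketch uses only values from the table; the variable names are illustrative.

```python
import math

# Values reported in Table 7.
n = 1000          # Count
total = 103379    # Sum
sd = 226.1434     # Standard Deviation

mean = total / n                # should match the reported Mean (103.379)
std_error = sd / math.sqrt(n)   # should match the reported Standard Error
variance = sd ** 2              # should match the reported Sample Variance

print(f"mean = {mean:.3f}, SE = {std_error:.4f}, variance = {variance:.2f}")
```

The mean (103.38) sits well above the median (53), which is the usual pattern for right-skewed data.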

The histogram presented below further shows that the data is skewed. The shape of the histogram indicates that the data is skewed to the right (longer tail to the right).

Figure 3: Histogram of the variable count

b. Perform a suitable hypothesis test at the 5% level of significance to test whether there is a difference between the mean counts of taps on and taps off.

To answer this, the following hypotheses were tested at the 5% level of significance.

H0: There is no difference in the mean counts of taps on and taps off.

HA: There is a difference in the mean counts of taps on and taps off.

To test this, an independent samples t-test was used. The results are given below:


Table 8: Group Statistics

| | Tap | N | Mean | Std. Deviation | Std. Error Mean |
|---|---|---|---|---|---|
| count | On | 481 | 106.65 | 269.081 | 12.269 |
| count | Off | 519 | 100.35 | 177.530 | 7.793 |

Table 9: Independent Samples Test

| count | Levene's Test F | Levene's Test Sig. | t | df | Sig. (2-tailed) | Mean Difference | Std. Error Difference | 95% CI of the Difference (Lower) | 95% CI of the Difference (Upper) |
|---|---|---|---|---|---|---|---|---|---|
| Equal variances assumed | .083 | .774 | .440 | 998 | .660 | 6.296 | 14.319 | -21.802 | 34.394 |
| Equal variances not assumed | | | .433 | 821.5 | .665 | 6.296 | 14.535 | -22.233 | 34.825 |

We performed an independent samples t-test to compare the average counts for taps on and taps off. The results showed that the average count for taps on (M = 106.65, SD = 269.08, N = 481) did not differ significantly from the average count for taps off (M = 100.35, SD = 177.53, N = 519), t(998) = 0.440, p = .660, two-tailed. The observed mean difference of 6.30 was not significant at the 5% level of significance. In other words, the average count does not depend on whether the record is a tap on or a tap off.
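The t statistics in Table 9 can be recovered from the group summaries in Table 8. The sketch below is a hedged illustration using the standard pooled-variance and Welch formulas; it works from the rounded summary values, so small differences in the last digit are possible, and it does not compute exact p-values.

```python
import math

# Group summaries from Table 8 (taps on vs taps off).
n_on, mean_on, sd_on = 481, 106.65, 269.081
n_off, mean_off, sd_off = 519, 100.35, 177.530

diff = mean_on - mean_off

# Pooled-variance t statistic (equal variances assumed).
df_pooled = n_on + n_off - 2
pooled_var = ((n_on - 1) * sd_on**2 + (n_off - 1) * sd_off**2) / df_pooled
se_pooled = math.sqrt(pooled_var * (1 / n_on + 1 / n_off))
t_pooled = diff / se_pooled

# Welch t statistic (equal variances not assumed) with
# Welch-Satterthwaite degrees of freedom.
v_on, v_off = sd_on**2 / n_on, sd_off**2 / n_off
se_welch = math.sqrt(v_on + v_off)
t_welch = diff / se_welch
df_welch = (v_on + v_off) ** 2 / (v_on**2 / (n_on - 1) + v_off**2 / (n_off - 1))

print(f"pooled: t = {t_pooled:.3f}, df = {df_pooled}")
print(f"Welch:  t = {t_welch:.3f}, df = {df_welch:.1f}")
```

With roughly 1000 degrees of freedom the two-sided 5% critical value is close to 1.96, so both statistics fall well inside the non-rejection region.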

c. Use the conclusion of the test in part b and the outputs in part a to write a recommendation to the NSW Government.

We concluded that there is no significant difference in the average counts for taps on and taps off. The three chosen stations also did not show much traffic. On this evidence, the government’s plan to build an underground railway line from Parramatta, Bankstown or Gosford to Central is not as well justified as would be required.

4. Section 4: Collect and analyse Dataset 2

You are interested in finding whether there is a difference in preference between genders in terms of their transport mode (Bus, Train, Ferry and Light Rail). By considering an appropriate number of cases and variables, give a proper graphical display and use it to write a comment.

The results for this section are presented below:

| Mode of transport | Female | Male | Grand Total |
|---|---|---|---|
| Bus | 16.7% | 42.3% | 30.0% |
| Ferry | 20.8% | 7.7% | 14.0% |
| Light Rail | 8.3% | 11.5% | 10.0% |
| Train | 54.2% | 38.5% | 46.0% |
| Grand Total | 100.00% | 100.00% | 100.00% |

As can be seen, the largest share of male commuters (42.3%, n = 11) reported using the bus, while the largest share of female commuters (54.2%, n = 13) reported using the train.

Chi-Square test

A Chi-square test was performed to determine whether there is a significant association between gender and the preferred mode of transport (Bagdonavicius & Nikulin, 2011). The hypotheses tested are given below:

H0: There is no association between gender and preferred mode of transport.

HA: There is an association between gender and preferred mode of transport.

This was tested at the 5% level of significance and the results are given below:

Table 10: Chi-Square Tests

| | Value | df | Asymp. Sig. (2-sided) |
|---|---|---|---|
| Pearson Chi-Square | 5.072 (a) | 3 | .167 |
| Likelihood Ratio | 5.239 | 3 | .155 |
| N of Valid Cases | 50 | | |

(a) 4 cells (50.0%) have expected count less than 5. The minimum expected count is 2.40.

The p-value for the test is 0.167, which is greater than the 5% level of significance. We therefore fail to reject the null hypothesis and conclude that there is no evidence of a significant association between gender and preferred mode of transport. Because 4 of the 8 cells have an expected count below 5, the chi-square approximation should be interpreted with some caution.
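The Pearson chi-square statistic can be recomputed by hand. The sketch below assumes observed counts reconstructed from the column percentages reported above (24 female and 26 male respondents); the raw counts are not given in the report, so this reconstruction is an assumption. The closed-form p-value used here holds only for 3 degrees of freedom.

```python
import math

# Observed counts as (female, male), reconstructed from the reported
# percentages; this reconstruction is an assumption, not reported data.
observed = {
    "Bus": (4, 11),
    "Ferry": (5, 2),
    "Light Rail": (2, 3),
    "Train": (13, 10),
}

col_totals = [sum(row[j] for row in observed.values()) for j in range(2)]
n = sum(col_totals)

# Pearson chi-square: sum over cells of (observed - expected)^2 / expected.
chi2 = 0.0
min_expected = float("inf")
for row in observed.values():
    row_total = sum(row)
    for j, obs in enumerate(row):
        expected = row_total * col_totals[j] / n
        min_expected = min(min_expected, expected)
        chi2 += (obs - expected) ** 2 / expected

df = (len(observed) - 1) * (len(col_totals) - 1)

# Upper-tail probability of the chi-square distribution.
# This closed form is valid for df = 3 only.
p_value = math.erfc(math.sqrt(chi2 / 2)) + math.sqrt(2 * chi2 / math.pi) * math.exp(-chi2 / 2)

print(f"chi2 = {chi2:.3f}, df = {df}, p = {p_value:.3f}, min expected = {min_expected:.2f}")
```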

Section 5: Discussion & Conclusion

The main purpose of this study was to present an analysis of the NSW transport system. We were provided with a secondary dataset (Dataset 1) comprising 1000 cases with six variables. In addition to this secondary data, we gathered survey data from 50 individuals. We sought to find out the most commonly used mode of transport. The results showed that the most commonly used mode was the train, followed by the bus; ferry and light rail were also used, but their usage was minimal compared with bus and train. When comparing the mode of transport between males and females using Dataset 2, we noted that the majority of female respondents preferred the train, while the largest share of male commuters preferred the bus. Based on these findings we would like to make the following recommendation to the NSW Government:

• The train is very widely used among commuters; it would therefore be prudent to improve this particular mode of transport to make it more effective. Building an underground railway line from either Parramatta, Bankstown or Gosford to Central would indeed be a blessing to commuters.

Future research should be broad enough to understand the motivation behind the preference for the various modes of transport. This would help management and the government to fully understand the needs and desires of the people.

References

Bagdonavicius, V., & Nikulin, M. S. (2011). Chi-squared goodness-of-fit test for right censored data. The International Journal of Applied Mathematics and Statistics, 30–50.

Ryabko, B. Y., Stognienko, V. S., & Shokin, Y. I. (2004). A new test for randomness and its application to some cryptographic problems. Journal of Statistical Planning and Inference, 123, 365–376. doi:10.1016/s0378-3758(03)00149-6

