Suppose you are working for an agency that analyses NSW transport system data in order to recommend improvements to the public transport system. You will be given a series of research questions. Use the knowledge you have gained from this course to answer these questions by displaying appropriate outputs from Excel, StatKey or Wolfram Alpha. Use these answers to write an executive summary that could serve as a valuable recommendation to Transport for NSW.
There are two datasets involved in this assignment: Dataset 1 and Dataset 2, detailed below.
Dataset 1: You will receive an email that contains a dataset specifically allocated to you. This dataset is a subset of the “Opal Tap On and Tap Off Location - 8th to 14th August 2016” individual sample file provided by Transport for NSW Open Data, and has been edited to include only a subset of the cases and variables. The original dataset can be obtained from https://opendata.transport.nsw.gov.au/dataset/opal-tap-on-and-tap-off and is licensed under Creative Commons Attribution 3.0 Australia. The data dictionary of the edited dataset is given in the following table.
Variable | Description | Values
mode | Type of public transport | Bus, Train, Ferry and Light Rail
date | Date on which the tap on/off took place | Day/month/year
tap | Whether the record is a tap on or a tap off | On and Off
loc | Location of the stop: postcode for buses, station name for other modes | Postcodes and station names
Dataset 2: Collect data (e.g. via a survey) that will answer the research questions given below:
- Give a numerical summary and an appropriate graphical display for the variable location, considering only those three stations, and for the variable count, considering only the train data.
- Perform a suitable hypothesis test at the 5% level of significance to test whether there is a difference between the mean counts of taps on and taps off.
- Use the conclusion of the test in part b and the outputs in part a to write a recommendation to the NSW government.
Over the past couple of years, there has been a rapid rise in the number of passengers across the Australian public transport system. This rise has been attributed to the development of infrastructure as well as economic factors (Cosgrove, 2011). Cosgrove (2011) argues that, to properly assess the changes that may occur in the near or distant future, there is a need to analyse the trends in the modern transport system over time.
The purpose of this paper is to apply appropriate statistical techniques to collect both secondary and primary data, and to analyse and interpret them so that the results can inform the NSW government about the usage of the various modes of transport and provide recommendations for improving those modes within the country.
To achieve the objectives of this task, two datasets, labelled Dataset 1 and Dataset 2, are used. Dataset 1 is obtained from the Transport for NSW open data website and is a subset of the data “Opal Tap On and Tap Off Location - 8th to 14th August 2016” provided by Transport for NSW Open Data (Opendata.transport.nsw.gov.au, 2016). Dataset 2 is collected from a survey.
Dataset 1 is obtained from an existing source and can therefore be described as secondary data (Lock et al., 2013). The advantage of this type of data is that it is readily available, which saves the time and expense that would otherwise go into collection. However, it has the limitation that the researcher has no control over how it was collected (Valcheva, 2017). The dataset has five variables: mode, tap, loc, count and date. Mode is a categorical variable with four categories (bus, train, ferry and light rail) that indicate the type of public transport used. Tap is a categorical variable with two categories, on and off, indicating whether the record is a tap on or a tap off. Loc is a categorical variable whose categories are train station names and postcodes. Count is a numerical variable indicating the number of taps recorded. Lastly, date is a quantitative continuous variable indicating when the tap took place (Rumsey, 2007).
Dataset 2 is primary data collected by approaching individuals and interviewing them about their favourite mode of transport (Lock et al., 2013). This survey method is called face-to-face interviewing. It has the advantage that interviewees can give their honest opinion directly, which supports the collection of reliable data. However, the collection of primary data is expensive and time-consuming (Valcheva, 2017). This dataset has three variables. Date is a quantitative continuous variable indicating the date when the interviewee was approached (Rumsey, 2007). Gender is a categorical variable with two categories, male or female, indicating the sex of the interviewee. Mode is a categorical variable with the same categories as in Dataset 1, representing the most preferred mode of transport.
Task Description
The mode of transport that was used the most between the 8th and 14th of August 2016 is determined using two summary statistics, namely sum and proportion (Bruce, 2015). The mode of transport with the highest sum of count and the highest proportion is the one that was used the most. The table of summary statistics for the various modes of transport is shown below:
Table 1: Summary Statistics
It is evident from the table that the most used mode of transport was the bus, followed by train, then ferry and lastly light rail.
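The sum and proportion per mode can be sketched in a few lines of Python. This is only an illustration: the rows below are made-up values, not the contents of Dataset 1, and the function name `summarise_by_mode` is hypothetical.

```python
from collections import defaultdict

# Hypothetical (mode, count) rows for illustration only; NOT values from Dataset 1.
rows = [
    ("bus", 120), ("train", 90), ("bus", 60),
    ("ferry", 20), ("light rail", 10),
]

def summarise_by_mode(records):
    """Return {mode: (sum of count, proportion of the grand total)}."""
    totals = defaultdict(int)
    for mode, count in records:
        totals[mode] += count
    grand_total = sum(totals.values())
    return {mode: (total, total / grand_total) for mode, total in totals.items()}

summary = summarise_by_mode(rows)

# Print modes from most used to least used.
for mode, (total, share) in sorted(summary.items(), key=lambda kv: -kv[1][0]):
    print(f"{mode:<11} sum={total:>4} proportion={share:.2f}")
```

The mode that comes first in the printed ranking is the one with the highest sum of count and, equivalently, the highest proportion.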
The above summary statistics can be visualized in two ways: with a bar chart or with a pie chart. A bar chart represents data using rectangular bars, with the length of each bar proportional to the value being represented (Wesley, 2018). Here it shows the most used mode of transport using bars whose lengths are proportional to the sum of count. The longest bar indicates the mode with the highest sum of count and hence the most used; the shortest bar indicates the mode with the lowest sum of count and hence the least used. The bar chart is shown below:
Fig 1: Bar Chart
It is evident that buses are the most used, followed by train, ferry and lastly light rail.
A pie chart is a circle divided into sectors, each of which represents a proportion of the data (Wesley, 2018). It visualizes the summary statistics as proportions expressed as percentages. The most used mode (buses) occupies the largest sector since it has the largest proportion, followed by train and ferry, while light rail occupies the smallest sector since it has the lowest proportion. The pie chart is shown below:
Fig 2: Pie Chart
To determine whether more than 50% of the population preferred the mode with the highest proportion (buses), we set up and test a hypothesis whose result will help us decide whether the claim is true or false.
The mode of transport with the highest proportion, i.e. buses, has a proportion of 0.48. Our sample size from Dataset 1 is 100. Setting up the hypothesis test begins by stating the null and alternative hypotheses as below:
Datasets and Description
Next we check whether the conditions for a hypothesis test for a proportion are met. The conditions are met when the product of the sample size and the proportion is at least 10 and the product of the sample size and one minus the proportion is also at least 10. This is as below:
Since the conditions for our hypothesis test are satisfied, we determine the z-test statistic, which will be used to find the p-value later. The z-test statistic is given by:
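A standard form of the one-proportion z statistic is sketched below. The symbols are assumptions introduced for illustration: \hat{p} is the sample proportion (0.48), p_0 is the null value (taken here as 0.50 from the 50% claim) and n is the sample size. The substitution uses n = 1000, which is also an assumption: it reproduces a z value of about -1.26 and agrees with the 998 denominator degrees of freedom reported for the ANOVA later on, whereas n = 100 would give z = -0.40.

```latex
z = \frac{\hat{p} - p_0}{\sqrt{\dfrac{p_0 (1 - p_0)}{n}}}
  = \frac{0.48 - 0.50}{\sqrt{\dfrac{0.50 \times 0.50}{1000}}}
  \approx -1.26
```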
We use the default significance level of 0.05. Our decision rule is to reject the null hypothesis when the z-test statistic calculated above falls outside the range -1.96 to 1.96, that is, when the corresponding p-value is below the significance level (Schenkelberg, 2017). In our case z = -1.26 lies within this range, with P(Z < -1.26) = 0.104, so we do not reject the null hypothesis, and the claim that more than 50% of the population use buses the most cannot be rejected at the 5% level.
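The condition check, tail probability and decision rule can be sketched in Python with the standard library only. The helper names are hypothetical, the null value of 0.50 is taken from the 50% claim, and n = 1000 is an assumption chosen because it reproduces the z value of -1.26 used in the text.

```python
import math

def normal_cdf(z):
    """Standard normal cumulative probability P(Z <= z)."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def one_proportion_z(p_hat, p0, n):
    """z statistic for a sample proportion against a null value p0."""
    standard_error = math.sqrt(p0 * (1.0 - p0) / n)
    return (p_hat - p0) / standard_error

p_hat = 0.48   # sample proportion for buses, from the text
p0 = 0.50      # null value, from the 50% claim
n = 1000       # ASSUMPTION: inferred from the reported z, not stated in the text

# Large-sample conditions: both expected counts should be at least 10.
conditions_ok = n * p0 >= 10 and n * (1.0 - p0) >= 10

z = one_proportion_z(p_hat, p0, n)
lower_tail = normal_cdf(z)      # P(Z < z)
reject = abs(z) > 1.96          # decision rule with the +/-1.96 range

print(f"conditions met: {conditions_ok}")
print(f"z = {z:.2f}, P(Z < z) = {lower_tail:.3f}, reject H0: {reject}")
```

The tail probability comes out close to the 0.104 quoted above; the small difference arises because the text looks up z rounded to -1.26.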
In this section, we carry out a statistical analysis to support a recommendation to the NSW government on whether to construct an underground rail line to Central from Parramatta, Bankstown or Gosford railway station. Dataset 1 is filtered to keep only the rows where the mode of transport is train and the location is one of the three stations under consideration, together with the count column. It is filtered using the Excel filter function (Linoff, 2008). A sample of the filtered data is shown below:
Table 2: Filtered Data
Using the StatKey software available online, summary statistics are prepared to help determine which station provides the most service and hence most deserves the underground railway. The summary statistics are shown in the table below:
Table 3: Summary Statistic for Filtered Data
The summary statistics above indicate that Parramatta has the highest mean count of the three stations. This suggests that it provides the most service and hence deserves the underground railway.
To visualize the above data, we use a box plot. It represents the data in terms of the first quartile (25th percentile), the median (50th percentile) and the third quartile (75th percentile) (Krzywinski and Altman, 2014). It also indicates the skewness of the data for the three stations. The box plot is shown below:
Fig 3: Box Plot
From the box plot, the data for Parramatta are skewed to the right: the first quartile is 133, the median is 287 and the third quartile is 577. The maximum is 1425 and the minimum is 91 counts. These are the largest values among the three stations, meaning that Parramatta provides the most service; if the government were to construct the underground railway to Central, it should be built from Parramatta station.
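The filtering step and the five-number summary behind a box plot can be sketched as follows. The rows, the exact station strings and the function name `five_number_summary` are hypothetical, and the quartile convention used here (`method="inclusive"`) is an assumption that may differ slightly from what StatKey or Excel report.

```python
import statistics

# Hypothetical (mode, loc, count) rows for illustration only; NOT Dataset 1.
rows = (
    [("train", "Parramatta Station", c) for c in (100, 200, 300, 400, 500)]
    + [("train", "Bankstown Station", c) for c in (10, 20, 30, 40, 50)]
    + [("train", "Gosford Station", c) for c in (5, 15, 25, 35, 45)]
    + [("bus", "2150", 999)]  # non-train row that the filter should drop
)
stations = ("Parramatta Station", "Bankstown Station", "Gosford Station")

def five_number_summary(values):
    """Return (minimum, first quartile, median, third quartile, maximum)."""
    ordered = sorted(values)
    q1, median, q3 = statistics.quantiles(ordered, n=4, method="inclusive")
    return ordered[0], q1, median, q3, ordered[-1]

# Keep train rows for the three stations, then summarise count per station.
summaries = {}
for station in stations:
    counts = [c for mode, loc, c in rows if mode == "train" and loc == station]
    summaries[station] = five_number_summary(counts)
    print(station, summaries[station])
```

Comparing the tuples across stations mirrors reading the boxes side by side: the station with the larger quartiles and maximum handles the larger counts.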
Analysis of Single Variable in Dataset 1
To determine whether there is a difference in mean counts between tap on and tap off, we carry out a hypothesis test at the 5% level of significance. Unlike above, here the StatKey software is used to aid the process. First, the data for tap and count are prepared as in the sample below:
Table 4: Sample Data for Taps and count
In this case, the null hypothesis is that there is no difference in means, while the alternative hypothesis is that there is a difference in means. The prepared data are loaded into StatKey and, using the “ANOVA for Difference in Means” option, the following output is obtained, which gives the sample sizes and their means (Lock5stat.com, 2018).
Table 5: Sample size and Mean
It is evident that the sample size for both tap on and tap off is greater than 30 and that neither standard deviation is more than twice the other; therefore the assumptions for the hypothesis test are satisfied. An ANOVA table is created to determine the numerator and denominator degrees of freedom of the F distribution, which is used to produce the p-value needed to reach a conclusion. The ANOVA table is shown below:
Table 6: ANOVA table
From the ANOVA table above, the numerator degrees of freedom are 1 and the denominator degrees of freedom are 998. From these, the F distribution is plotted on a graph that also gives the p-value. The graph is shown below:
Fig 4: F Distribution Graph
The p-value is 0.025. Since it is less than the significance level of 0.05, we reject the null hypothesis and conclude that there is a difference in the mean counts for tap on and tap off.
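How the F statistic and its degrees of freedom arise can be sketched in Python. The two small samples and the function name `one_way_anova` are hypothetical and only illustrate the arithmetic; the p-value itself would still be read from the F distribution, as done with StatKey above.

```python
from statistics import mean

def one_way_anova(groups):
    """Return (F statistic, numerator df, denominator df) for a list of samples."""
    all_values = [x for g in groups for x in g]
    grand_mean = mean(all_values)
    group_means = [mean(g) for g in groups]
    # Variation of the group means around the grand mean.
    ss_between = sum(
        len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, group_means)
    )
    # Variation of the observations around their own group mean.
    ss_within = sum(
        (x - m) ** 2 for g, m in zip(groups, group_means) for x in g
    )
    df_num = len(groups) - 1
    df_den = len(all_values) - len(groups)
    f_stat = (ss_between / df_num) / (ss_within / df_den)
    return f_stat, df_num, df_den

# Hypothetical counts for illustration only; NOT values from Dataset 1.
tap_on = [1, 2, 3]
tap_off = [2, 3, 4]

f_stat, df_num, df_den = one_way_anova([tap_on, tap_off])
print(f"F = {f_stat:.2f} with ({df_num}, {df_den}) degrees of freedom")
```

With two groups the numerator degrees of freedom are always 1, and the denominator degrees of freedom are the total number of rows minus 2, so 998 corresponds to 1000 rows in total.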
Dataset 2 is collected from a face-to-face survey and is meant to determine the mode of transport preferred by each gender. Since it is raw data collected from interviews, it is subject to inaccuracy and inefficiency. To limit this problem, the sample size is made as large as possible (Nayak, 2010). The sample comprises a total of one hundred and one male and female individuals who report their most preferred mode of transport. A sample of the data is shown in the table below:
Table 7: Survey Data
To find out which mode of transport is preferred by each gender, we use counts as the summary statistic. The mode of transport with the highest count for a gender is the one most preferred by that gender. The summary statistics table is shown below:
Table 8: Summary Statistic for Mode and Gender
From the summary statistics above, bus is the most preferred mode for both males and females in equal numbers. Train follows, with more males than females preferring it. Ferry is the third most preferred, again with more males than females. Light rail is last, and again more males than females prefer it.
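A two-way count of mode by gender, as summarised above, can be sketched with a `Counter`. The responses listed here are made up for illustration and are not the survey data.

```python
from collections import Counter

# Hypothetical (gender, preferred mode) responses; NOT the actual survey data.
responses = [
    ("male", "bus"), ("female", "bus"), ("male", "train"),
    ("female", "bus"), ("male", "ferry"), ("male", "bus"),
]
modes = ["bus", "train", "ferry", "light rail"]

# Count each (gender, mode) pair; pairs that never occur count as 0.
counts = Counter(responses)

print(f"{'mode':<11} {'male':>5} {'female':>7}")
for mode in modes:
    print(f"{mode:<11} {counts[('male', mode)]:>5} {counts[('female', mode)]:>7}")
```

Each printed row corresponds to one bar of a stacked bar chart, with the male and female counts as its two segments.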
The above summary statistics are visualized using a stacked bar chart. A stacked bar chart is similar to a bar chart, the difference being that each bar is divided into segments so that two categorical variables can be shown together (Bruce, 2015). It is shown below:
Fig 5: Stacked Bar Chart
Conclusion
The above analysis of the secondary and primary data indicates that buses are the most preferred mode of transport, with the claim that more than 50% of the population prefer them not rejected by the hypothesis test. This can be attributed to the ease of access of buses, the convenient services offered by the bus companies and/or the low price of bus travel. The second most used mode is train, followed by ferry and lastly light rail. The NSW government can therefore focus on improving the services offered by the other modes of transport so that they can attract as many travellers as the bus. On the other hand, if the government needs to develop the train infrastructure, it can consider constructing the underground rail line to Central from Parramatta station, since that station provides the most service. Future research should be conducted to determine what factors travellers look for when choosing their mode of transport so that the government can include these factors in its future plans and models for development.
References
Bruce, P. (2015). Introductory statistics and analytics. 2nd ed. New Jersey: Wiley.
Cosgrove, D. (2011). Long-term patterns of Australian public transport use. In: Australasian Transport Research Forum 2011. [online] The University of Western Australia. Available at: https://atrf.info/papers/2011/2011_Cosgrove.pdf [Accessed 18 Sep. 2018].
Rumsey, D. (2007). Intermediate statistics for dummies. 1st ed. Hoboken, N.J.: Wiley.
Krzywinski, M. and Altman, N. (2014). Points of Significance: Visualizing samples with boxplots. [online] Nature Methods. Available at: https://www.nature.com/articles/nmeth.2813 [Accessed 18 Sep. 2018].
Linoff, G. (2008). Data analysis using SQL and Excel. 2nd ed. Indianapolis, Ind.: Wiley Pub.
Lock, R., Lock, P., Morgan, K., Lock, E. and Lock, D. (2013). Statistics: Unlocking the power of data. 1st ed. Hoboken, N.J.: Wiley.
Lock5stat.com. (2018). Theoretical distribution. [online] Available at: https://www.lock5stat.com/StatKey/theoretical_distribution/theoretical_distribution.html#normal [Accessed 18 Sep. 2018].
Nayak, B. (2010). Understanding the relevance of sample size calculation. [online] NCBI. Available at: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2993974/ [Accessed 18 Sep. 2018].
Opendata.transport.nsw.gov.au. (2016). Opal tap on and tap off | TfNSW Open Data Hub and Developer Portal. [online] Available at: https://opendata.transport.nsw.gov.au/dataset/opal-tap-on-and-tap-off [Accessed 18 Sep. 2018].
Schenkelberg, F. (2017). Hypothesis tests for Proportion - Accendo Reliability. [online] Accendo Reliability. Available at: https://accendoreliability.com/hypothesis-tests-for-proportion/ [Accessed 17 Sep. 2018].
Valcheva, S. (2017). Primary data vs secondary Data: Definition, sources, advantages. [online] Business Intelligence, Data science, and Management. Available at: https://intellspot.com/primary-data-vs-secondary-data/ [Accessed 18 Sep. 2018].
Wesley, S. (2018). Top 5 best data visualization techniques for 2018. [online] Big Data Made Simple-One source. Many perspectives. Available at: https://bigdata-madesimple.com/top-5-best-data-visualization-techniques-for-2018/ [Accessed 17 Sep. 2018].