Authorisation and Purpose
The study analyses the vital statistics of the East Asia and Pacific region from 2001 to 2015, using a dataset collected from the World Bank. The findings are intended for government planners working to improve health in that region.
The primary constraint of this study is that the research and analysis are limited to the East Asia and Pacific region. A further limitation is that the data is collected from the World Bank and is therefore secondary in nature.
The dataset consists of 26 attributes holding information about the health of the above-mentioned region, and it covers a period of 15 years. The analysis can be performed using statistical methods and by interpreting graphs. However, the data has many missing observations.
The analysis proceeds through one-variable and two-variable analyses. Next, the data is clustered using the k-means clustering technique, and finally linear regression lines are fitted between pairs of attributes.
The data has been obtained from the World Bank. The dataset is quantitative in nature and contains health information for the period 2001 to 2015.
The data is loaded into the “R” program before the analysis. Running the first line of the code opens a pop-up window in which the data file (in CSV format) is selected from its location. The same line of code also specifies which entries are to be treated as missing values.
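The loading step can be illustrated with a short sketch. The report's own code is in R and is not reproduced in this document, so the Python below is only an illustration of the same idea: read a CSV file and convert missing-value markers into explicit missing values. The column names, the sample rows, and the set of markers (including `..`) are assumptions made for this example.

```python
import csv
import io

# Hypothetical sample in an assumed layout; all values are made up.
SAMPLE_CSV = """Country Code,Year,GNI per capita,Unemployment rate
AAA,2014,5200,4.1
BBB,2014,..,
CCC,2014,1830,..
"""

# Assumed markers for a missing observation.
MISSING_MARKERS = {"", "..", "NA"}


def load_rows(handle, numeric_columns):
    """Read CSV rows; numeric columns become float, missing entries become None."""
    rows = []
    for record in csv.DictReader(handle):
        for column in numeric_columns:
            raw = record[column].strip()
            record[column] = None if raw in MISSING_MARKERS else float(raw)
        rows.append(record)
    return rows


rows = load_rows(io.StringIO(SAMPLE_CSV), ["GNI per capita", "Unemployment rate"])
missing_gni = sum(1 for row in rows if row["GNI per capita"] is None)
print(len(rows), "rows read;", missing_gni, "missing GNI value(s)")
```

Converting the markers at load time means later steps can simply skip `None` entries instead of failing on non-numeric text.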
In the second step, the necessary libraries are loaded into the “R” program to perform the required statistical analyses and to display the graphical presentations.
Exploratory Data Analysis
One variable analysis
One Variable Analysis – 1
The per capita gross national income (GNI) is analysed first in the one-variable study. The average GNI per capita in the region is 11522.45 and the standard deviation is 15406. The minimum and maximum GNI per capita are 310 and 76300 respectively. The boxplot shows that the data has many outliers.
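The summary figures and the boxplot outliers can be illustrated with a minimal sketch. It is written in Python for illustration only and is not the report's R code. It assumes the common boxplot convention that a point is an outlier when it lies more than 1.5 times the interquartile range beyond the quartiles. Apart from the reported minimum (310) and maximum (76300), the sample values are hypothetical.

```python
import statistics


def summarise(values):
    """Return mean, standard deviation, range and 1.5*IQR outliers, skipping None."""
    clean = [v for v in values if v is not None]
    q1, _median, q3 = statistics.quantiles(clean, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return {
        "mean": statistics.mean(clean),
        "sd": statistics.stdev(clean),  # sample standard deviation
        "min": min(clean),
        "max": max(clean),
        "outliers": [v for v in clean if v < lower or v > upper],
    }


# Hypothetical GNI per capita values; None stands for a missing observation.
gni = [310, 900, 1500, 2400, None, 3800, 5200, 9100, 76300]
summary = summarise(gni)
print(summary)
```

A standard deviation larger than the mean, as reported for GNI per capita, is typical of data in which a few very large values sit far above the rest.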
One Variable Analysis – 2
The second variable is the percentage of tertiary school enrolment, for which the mean is 37.81 and the standard deviation is 22.19. The minimum percentage of school enrolment is 3.1 and the maximum is 83.6. The boxplot shows that the distribution of the tertiary school enrolment percentage is negatively skewed.
One Variable Analysis – 3
The distribution of the total unemployment rate is the third variable analysed in the one-variable study. The distribution is represented by a histogram, which shows that it is positively skewed.
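The direction of skew read off a boxplot or histogram can also be checked numerically. The sketch below, in Python for illustration only, computes the moment coefficient of skewness: a positive value indicates a longer right tail (positive skew) and a negative value a longer left tail (negative skew). The sample lists are hypothetical and are not taken from the dataset.

```python
import statistics


def skewness(values):
    """Moment coefficient of skewness: third central moment over sd cubed."""
    n = len(values)
    mean = statistics.fmean(values)
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


right_tailed = [1, 2, 2, 3, 3, 3, 4, 10]      # one large value stretches the right tail
left_tailed = [-v for v in right_tailed]      # mirror image, long left tail
symmetric = [1, 2, 3, 4, 5]

for name, data in [("right", right_tailed), ("left", left_tailed), ("symmetric", symmetric)]:
    print(name, round(skewness(data), 3))
```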
Two-variable analysis 1
In the first two-variable analysis, the gross national income is analysed country by country for the region and is represented graphically by side-by-side boxplots. The graph shows considerable variation in GNI during the period 2001 to 2015. The maximum gross national income is observed for Macao SAR, China.
Two-variable analysis 2
The distribution of total unemployment is analysed here with respect to its change within each country over the 15-year period. Side-by-side boxplots are used to represent the variation in the unemployment rate for the countries. There are outliers for the countries with country codes ‘KIR’, ‘PLW’ and ‘SLB’. The boxplot with the longest whiskers belongs to the country with country code ‘MAC’, which indicates that the spread of the unemployment rate is widest for this country.
Brief explanation of k-means and clustering
Clustering means dividing the entire dataset into smaller groups whose members have similar characteristics. K-means clustering is a non-hierarchical clustering technique that uses distances from centroids to form the groups (Oleiwi 2016). A set of centroids is selected initially, and each data point is assigned to the centroid nearest to it. Each centroid is then recomputed as the mean of the points assigned to it, and the assignment and update steps are repeated until the assignments no longer change (Cohen et al. 2015).
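The procedure can be sketched in a few lines. The following is a minimal Python illustration of the standard k-means iteration, not the report's R code: each point is assigned to its nearest centroid, each centroid is moved to the mean of its points, and the two steps repeat until the assignments stop changing. The function name, the sample points, and the choice of initial centroids are assumptions for this example.

```python
import math


def kmeans(points, initial_centroids, max_iterations=100):
    """Cluster points around the given starting centroids; return labels and centroids."""
    centroids = list(initial_centroids)
    k = len(centroids)
    assignments = None
    for _ in range(max_iterations):
        # Assignment step: label each point with the index of its nearest centroid.
        new_assignments = [
            min(range(k), key=lambda c: math.dist(point, centroids[c]))
            for point in points
        ]
        if new_assignments == assignments:
            break  # converged: no point changed cluster
        assignments = new_assignments
        # Update step: move each centroid to the mean of its assigned points.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:
                centroids[c] = tuple(sum(axis) / len(members) for axis in zip(*members))
    return assignments, centroids


# Hypothetical two-dimensional points forming three well-separated groups.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10), (20, 0), (20, 1), (21, 0)]
assignments, centroids = kmeans(points, [points[0], points[3], points[6]])
print(assignments)
```

The result of k-means depends on the starting centroids, which is why statistical packages usually try several random starts and keep the best one.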
The per capita gross national income and the total unemployment rate for the year 2014 were used to perform the k-means clustering analysis. Three optimal clusters were found after scaling. The graphical analysis shows three groups:
- Low GNI and High rate of total unemployment
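Scaling matters here because GNI per capita is measured in thousands while the unemployment rate is a small percentage, so unscaled distances would be dominated by GNI. The report does not state which scaling was used. The Python sketch below, for illustration only, assumes the common z-score standardisation (subtract the mean, divide by the sample standard deviation), and the sample values are hypothetical.

```python
import statistics


def standardise(column):
    """Rescale a column to mean 0 and sample standard deviation 1 (z-scores)."""
    mean = statistics.fmean(column)
    sd = statistics.stdev(column)
    return [(value - mean) / sd for value in column]


# Hypothetical values for four countries in one year.
gni_per_capita = [310, 1500, 5200, 76300]
unemployment_rate = [9.5, 4.0, 3.2, 1.8]

scaled_gni = standardise(gni_per_capita)
scaled_unemployment = standardise(unemployment_rate)

# After scaling, both attributes contribute on a comparable scale to distances.
scaled_points = list(zip(scaled_gni, scaled_unemployment))
print(scaled_points)
```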
Brief definition of linear regression
Linear regression analysis models the linear relationship between the explained (dependent) variable and one or more explanatory (independent) variables (Theobald and Freeman 2014).
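For a single explanatory variable, the fitted line can be computed directly by ordinary least squares. The Python sketch below is an illustration under that assumption and is not the report's R code. The slope is the sum of cross-deviations divided by the sum of squared deviations of x, and the intercept makes the line pass through the point of means. The data are hypothetical.

```python
def fit_line(x, y):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Hypothetical data lying exactly on the line y = 1 + 2x.
x = [1, 2, 3, 4]
y = [3, 5, 7, 9]
intercept, slope = fit_line(x, y)
print(intercept, slope)
```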
Linear Regression 1
The dependent variable in this case is the total unemployment rate and the independent variable is tertiary school enrolment. The total unemployment rate is predicted by the following regression equation:
Total unemployment rate = 3.769558 + 0.009769 * Tertiary school enrolment
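To show how the equation is read, the short Python sketch below evaluates it at two enrolment levels. The coefficients are those reported above; the enrolment values of 20 and 60 percent are arbitrary inputs chosen for illustration.

```python
INTERCEPT = 3.769558   # reported intercept
SLOPE = 0.009769       # reported coefficient of tertiary school enrolment


def predicted_unemployment(enrolment_percent):
    """Predicted total unemployment rate for a given tertiary enrolment percentage."""
    return INTERCEPT + SLOPE * enrolment_percent


low = predicted_unemployment(20)
high = predicted_unemployment(60)
print(f"enrolment 20%: {low:.4f}, enrolment 60%: {high:.4f}, difference: {high - low:.4f}")
```

Because the slope is small, a 40-point difference in enrolment changes the predicted unemployment rate by less than half a percentage point.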
Linear Regression 2
The relation between the total unemployment rate (independent variable) and the GNI per capita (dependent variable) is shown in the following graph. The fitted regression equation is:
GNI = 12974.4 + 395.2 * Total unemployment rate
The slope is positive, which indicates that the predicted per capita GNI increases by about 395.2 for each one-unit increase in the total unemployment rate (Darlington and Hayes 2016).
This report analyses health and population statistics for the East Asia and Pacific region. From the analysis, it can be concluded that the country with country code MAC has a high level of GNI per capita. There are also outliers in the distribution of gross national income when it is analysed country by country. In addition, clustering on GNI per capita and the total unemployment rate yields three optimal clusters. The fitted regressions indicate that higher tertiary school enrolment is associated with a higher total unemployment rate, and that a higher total unemployment rate is associated with higher GNI per capita.
The analysis covered different attributes over different time periods. The study shows variation in the total unemployment rate and in GNI per capita across the East Asia and Pacific region.
Cohen, M.B., Elder, S., Musco, C., Musco, C. and Persu, M., 2015, June. Dimensionality reduction for k-means clustering and low rank approximation. In Proceedings of the forty-seventh annual ACM symposium on Theory of computing (pp. 163-172). ACM.
Darlington, R.B. and Hayes, A.F., 2016. Regression analysis and linear models: Concepts, applications, and implementation. Guilford Publications.
Oleiwi, W.K., 2016. Using the Fuzzy Logic to Find Optimal Centers of Clusters of K-means. International Journal of Electrical and Computer Engineering, 6(6), p.3068.
Theobald, R. and Freeman, S., 2014. Is it the intervention or the students? Using linear regression to control for student characteristics in undergraduate STEM education research. CBE-Life Sciences Education, 13(1), pp.41-48.