Research on a subject has objectives to fulfil; in statistical research the major objectives are to describe the data using summary statistics, and the data commonly include dependent as well as independent variables. Business and market studies in particular tend to be multivariate, containing many dependent and independent variables, so it becomes necessary to choose which of the independent variables are suitable for the analysis. Our topic here is multicollinearity in data: why it emerges and how it can be controlled. The discussion follows the article by Jeeshim and Kucc (2003), “Multicollinearity in Regression Models” (sites.stat.psu.edu, 2003), and all points below are made with reference to that article.
Review of the Article
Multicollinearity is a problem in regression and must be checked before final prediction. The article gives a complete account of multicollinearity among independent variables, together with a detailed, data-driven process for checking for multicollinearity between them. Several sets of results are used as examples for proper explanation.
From the correlation matrix it can often be observed that there is a strong linear association between two independent variables, such as the area of the plot and the area of the house built on it. These two variables represent essentially the same thing: one can be significantly predicted from the other. This is when the problem of multicollinearity arises, and one of the variables can then simply stand in for the other.
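As a minimal sketch of this kind of check (the article works in SAS; this is Python with invented plot-area and house-area data, so all names and numbers here are hypothetical):

```python
import numpy as np

# Hypothetical example: two predictors that measure nearly the same thing.
rng = np.random.default_rng(0)
plot_area = rng.uniform(100, 500, size=50)            # plot area of the house
house_area = 0.6 * plot_area + rng.normal(0, 5, 50)   # house area tracks plot area closely

# Pearson correlation between the two independent variables
r = np.corrcoef(plot_area, house_area)[0, 1]
print(round(r, 3))  # a value close to 1 signals potential multicollinearity
```

A correlation this strong suggests that either variable could stand in for the other in the regression.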
Analysis and Discussion
Multicollinearity at a very low level is not a major issue, but variables whose correlations are very strong can cause problems in the predictions of the regression equation. The variances or standard errors of the affected coefficients can be much larger than usual; the p-values can turn out insignificant; and, as stated earlier, there will inevitably be large correlation coefficients between the variables. Moreover, if the data are modified even slightly, the resulting coefficients can change considerably. If any of these problems is evident in the data, multicollinearity may be the cause and must be checked beforehand; otherwise the regression will provide spurious estimates (Fekedulegn, 2002).
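The instability of individual coefficients can be sketched with invented data (again in Python rather than SAS): two nearly identical predictors make the split between their two slopes essentially arbitrary, while their sum remains well determined.

```python
import numpy as np

# Illustrative sketch (not the article's data): x2 is almost identical to x1,
# so the least-squares fit cannot apportion the effect between them.
rng = np.random.default_rng(1)
n = 40
x1 = rng.normal(0, 1, n)
x2 = x1 + rng.normal(0, 0.01, n)        # nearly collinear with x1
y = 2 * x1 + rng.normal(0, 0.5, n)      # true effect belongs to x1 alone

X = np.column_stack([np.ones(n), x1, x2])
beta = np.linalg.lstsq(X, y, rcond=None)[0]

# Perturb the response very slightly and refit
beta2 = np.linalg.lstsq(X, y + rng.normal(0, 0.01, n), rcond=None)[0]

# The individual slopes are unstable; only their sum is pinned down near 2.
print(beta[1], beta[2], beta[1] + beta[2])
print(beta2[1], beta2[2], beta2[1] + beta2[2])
```

This is exactly the symptom described above: large standard errors on the individual coefficients, and coefficients that shift when the data are slightly modified.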
The indications above only hint at multicollinearity: even when two independent variables are highly correlated we cannot say for sure that multicollinearity is present, nor can we confirm it from the significance level, standard error, or coefficients of the independent variables alone. There is no fixed threshold from which its occurrence can be inferred with certainty, but measures such as the tolerance value and the VIF can be calculated alongside the regression to support the inference. The tolerance of an independent variable is 1 - R², where R² measures how well that variable can be predicted from the other independent variables; a low tolerance is a matter of concern. The VIF is the reciprocal of the tolerance, 1/(1 - R²); a large VIF is a matter of concern, though the exact cutoff is not standardized.
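The tolerance/VIF computation just described can be sketched directly (Python with invented age/income data; the article computes the same quantities in SAS):

```python
import numpy as np

# Tolerance and VIF: regress each independent variable on the remaining ones,
# take that R-squared, then tolerance = 1 - R^2 and VIF = 1 / tolerance.
def vif(X):
    """Return the VIF of each column of X (columns = independent variables)."""
    n, k = X.shape
    out = []
    for j in range(k):
        y = X[:, j]
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        yhat = others @ np.linalg.lstsq(others, y, rcond=None)[0]
        r2 = 1 - np.sum((y - yhat) ** 2) / np.sum((y - y.mean()) ** 2)
        out.append(1.0 / (1.0 - r2))
    return np.array(out)

rng = np.random.default_rng(2)
income = rng.uniform(1, 10, 100)
inc_sq = income ** 2                    # nearly collinear with income
age = rng.uniform(20, 60, 100)          # unrelated predictor
print(vif(np.column_stack([age, income, inc_sq])))  # large VIFs for the collinear pair
```

Here the VIFs for income and inc_sq come out large while age's stays near 1, mirroring the pattern reported in the article.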
In the article the analysis is run in SAS, where three measures are used to assess multicollinearity: the tolerance value, the VIF, and the Collin diagnostics. The dependent variable is expenditure, with independent variables age, rent, income, and inc_sq; the regression equation therefore predicts expenditure from age, rent, income, and inc_sq. From the ANOVA table the regression equation is a good fit, as the significance value of .0008 is much less than the desired significance level. The value of R² is .2436. Age and inc_sq show negative association with expenditure, while rent and income show positive association. The standard errors are very large. The tolerance values show that income and inc_sq have very low tolerances of .061 and .065, and hence very high variance inflation factors of 16.33 and 15.21, indicating that the variability of both coefficients is greater than usual. These two variables therefore exhibit multicollinearity.
The collinearity diagnostics carried out in SAS also check the association between the variables through the eigenvalues and condition indices. Very small eigenvalues indicate more collinearity. The condition index is the square root of the largest eigenvalue divided by the eigenvalue in question; large values indicate a collinearity problem. From the table in the article, the eigenvalues associated with income and income squared are very close to zero, so these variables are collinear. In the condition-index column both variables again show high values, with income squared above 20. SAS also generates a proportion-of-variation table; a variable accounting for a high proportion of variation at a small eigenvalue is considered to have multicollinearity (Neeleman, 1973).
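The eigenvalue and condition-index computation can be sketched as follows (Python with invented data; the article's SAS Collin output is the model here, and an intercept column is included as SAS does):

```python
import numpy as np

# Condition indices: scale the design matrix columns to unit length, take the
# eigenvalues of X'X, and compute sqrt(lambda_max / lambda_i) for each one.
# Hypothetical data; large condition indices are the usual warning sign.
rng = np.random.default_rng(3)
income = rng.uniform(1, 10, 100)
age = rng.uniform(20, 60, 100)
X = np.column_stack([np.ones(100), income, income ** 2, age])
X = X / np.linalg.norm(X, axis=0)       # scale each column to unit length
lam = np.linalg.eigvalsh(X.T @ X)
cond_index = np.sqrt(lam.max() / lam)
print(np.sort(cond_index))              # the largest index flags collinearity
```

With income and its square in the design, one eigenvalue is close to zero and the corresponding condition index is large, matching the pattern in the article's table.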
Thus it has been verified from all aspects that income and income squared show multicollinearity. The major problem caused by multicollinearity is that it reduces the rank of the correlation matrix, and a matrix without full rank yields unstable solutions, so results and interpretations will be in vain. Apart from factor analysis, principal component analysis can be used to remove unwanted variables, but it must first be confirmed that there is scope for data reduction, as in this analysis where income and inc_sq were verified to show multicollinearity. In principal component analysis the original correlation matrix of dimension n is decomposed into n eigenvectors and n eigenvalues, the eigenvalues forming a diagonal matrix whose entries sum to n, the trace of the correlation matrix. The eigenvectors and eigenvalues are useful ways to describe the variance of the variables (Jolliffe, 1986); to every eigenvector there corresponds an eigenvalue, and the principal components are determined from both. Before calculating from the new matrix, the variables showing multicollinearity are identified from the earlier regression results and the VIF values, as was done in the article. A transformed matrix is then formed by multiplying the old matrix by the eigenvectors, and the final regression is carried out on the transformed variables. The dimension is reduced for the variables having the smallest eigenvalues and the highest condition indices; as is evident from the data in the analysis, income and income squared account for the problematic variation.
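The principal-component step described above can be sketched in a few lines (Python with invented data in place of the article's SAS run):

```python
import numpy as np

# Decompose the correlation matrix of the predictors into eigenvalues and
# eigenvectors, then form the transformed (component) variables by multiplying
# the standardized data by the eigenvectors. Hypothetical data.
rng = np.random.default_rng(4)
income = rng.uniform(1, 10, 100)
X = np.column_stack([rng.uniform(20, 60, 100), income, income ** 2])
Z = (X - X.mean(axis=0)) / X.std(axis=0)   # standardize each variable

R = np.corrcoef(Z, rowvar=False)           # correlation matrix
lam, V = np.linalg.eigh(R)                 # eigenvalues and eigenvectors
components = Z @ V                         # transformed, uncorrelated variables

# Each component's variance equals its eigenvalue; a near-zero eigenvalue
# marks the collinear direction (here income vs income squared).
print(np.round(lam, 3))
```

The eigenvalues sum to the number of variables (the trace of the correlation matrix), and the near-zero one identifies the direction that can be dropped.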
However, this creates confusion over which variable should be removed from the data to obtain proper predictions.
For this reason a correlation matrix is created to check the association in the data. As expected, the correlation between income and income squared is very strong, at .963. To clarify which of the two should be dropped, two scatter plots are drawn: age versus income, and income versus income squared. The plot of income against inc_sq confirms their strong collinearity, but income must be considered an important variable because of its joint effects with other variables: it not only contributes to the prediction itself but also plays a major role in combination with variables such as age. In regression it is not only the individual effects of the variables but also their combined effects that support proper prediction. Income is therefore an important variable that can in no case be removed from the prediction. Income squared represents almost the same thing as income, and repeating a variable of the same use twice adds nothing to the prediction; being the square of income, it also adds unnecessary confusion and weight to the data. The income squared variable was therefore chosen for dimension reduction (Neeleman, 1973).
This idea of dimension reduction is the core of principal component analysis: retaining only the factors or variables that account for the maximum variance in the data, as measured through the eigenvalues. Principal component analysis is thus an important tool for removing unwanted variables, keeping only those needed for prediction because they make the data differ in distinct respects, and excluding those that play no part and act as extra baggage; intuitively, the excluded variables are often those that represent the same thing as other variables, and such variables should be removed beforehand. There are some conditions for conducting a principal component analysis: only numerical variables can be included, and uncorrelated variables cannot usefully be part of it. Proper data or sample collection must also be in place, otherwise the analysis will be in vain. Before computing the analysis it should be checked through other calculations that some variables in the data do show multicollinearity. PCA may also fail to be meaningful when there is a serious problem with outliers.
After the variable income squared was removed from the data, a regression was carried out with the remaining variables to check whether the model was still a good fit. In the first regression the significance level was .0008, while this time it is .001; the p-value has increased, suggesting the equation was a better predictor with income squared included. Since the equation is still a significant predictor at the 5% level, it can nevertheless be used for prediction. However, the value of R² has fallen from .245 to .198, so less of the dependent variable can be explained by the independent variables after the reduction. This suggests that removing the variable did not benefit the analysis. The coefficients, standard errors, and t-values of the variables were calculated, and to check for any remaining multicollinearity the VIF values and condition indices were computed again; from these it may be concluded that some risk of multicollinearity remains.
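The refit-and-compare step can be sketched as follows (invented data standing in for the article's expenditure example; for nested least-squares models with an intercept, R² can only decrease when a term is removed):

```python
import numpy as np

# Drop the squared term, rerun the regression, and compare R-squared.
def fit_r2(X, y):
    """Least-squares fit with intercept; returns R-squared."""
    A = np.column_stack([np.ones(len(y)), X])
    yhat = A @ np.linalg.lstsq(A, y, rcond=None)[0]
    return 1 - np.sum((y - yhat) ** 2) / np.sum((y - y.mean()) ** 2)

rng = np.random.default_rng(5)
income = rng.uniform(1, 10, 100)
age = rng.uniform(20, 60, 100)
spend = 3 * income + 0.1 * age + rng.normal(0, 4, 100)  # expenditure proxy

full = fit_r2(np.column_stack([age, income, income ** 2]), spend)
reduced = fit_r2(np.column_stack([age, income]), spend)
print(round(full, 3), round(reduced, 3))
```

When the dropped variable is almost redundant, the drop in R² is small; the larger drop reported in the article (.245 to .198) is what motivates its doubt about the reduction.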
Principal component analysis is a well-known procedure for data reduction and for selecting the relevant variables in a dataset, but with these data the results do not prove completely satisfactory. Other approaches could therefore be tried, such as adopting new variables that are strongly related to expenditure. Sometimes transformed variables show relevant significance, though that is not the case for income and income squared, since both represent almost the same feature. A transformed variable can be returned to its original form by running the same transformation steps in reverse, which makes transformation very useful in many cases and is also how principal component analysis maps the transformed variables back to their original form. Including more variables is thus a possibility, and their significance can then be judged by factor analysis, an important analysis for obtaining a good prediction equation built from variables that are genuinely important with respect to the dependent variable.
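The back-transformation mentioned above can be demonstrated concretely for PCA: because the eigenvector matrix is orthogonal, multiplying the component scores by its transpose recovers the original standardized variables exactly (a generic sketch, not the article's data).

```python
import numpy as np

# PCA forward and reverse transforms on arbitrary standardized data.
rng = np.random.default_rng(6)
Z = rng.normal(0, 1, (50, 3))
Z = (Z - Z.mean(axis=0)) / Z.std(axis=0)

lam, V = np.linalg.eigh(np.corrcoef(Z, rowvar=False))
scores = Z @ V                 # forward transform to principal components
Z_back = scores @ V.T          # reverse transform: V is orthogonal, V V^T = I
print(np.allclose(Z, Z_back))  # True
```

Nothing is lost in the transformation itself; information is discarded only if some components are dropped before transforming back.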
References
Fekedulegn, B. (2002). Coping with multicollinearity. Newtown Square, PA: U.S. Dept. of Agriculture, Forest Service, Northeastern Research Station.
Jolliffe, I. (1986). Principal component analysis. New York: Springer-Verlag.
Neeleman, D. (1973). Multicollinearity in linear economic models. [Tilburg]: Tilburg University Press.
sites.stat.psu.edu (2003). Multicollinearity in Regression Models. [online] Available at: https://sites.stat.psu.edu/~ajw13/SpecialTopics/multicollinearity.pdf [Accessed 16 Jan. 2015].