Discuss the Organisation in the Decision-Making Process.
This research aims to support the decision-making process of the national veterans' organisation so that it can target donors more effectively. If the organisation can identify the most likely donors, it can target only that group and save the money and effort of contacting the whole group. According to the information provided, the organisation has a database of more than 3.5 million individuals. In this research, probable donors are identified on the basis of their previous behaviour. The analysis targets a specific group of individuals who have donated recently (12 and 24 months ago).
- As shown in the table below, the class level threshold is kept at 2. This ensures that binary variables are treated as categorical variables.
- Similarly, the table shows that the rejection level count threshold is set to 100, so any variable with more than 100 distinct values is ignored.
In this case the variable of interest is “Median Income Region (DemMedIncome)”. As shown in the figure above, this variable has a very skewed distribution, and its mode is zero.
As discussed in the previous section, the most frequent value of the median income variable is zero. However, a regional median income of zero is implausible, because the median is the middle value of the incomes in the region. Since the data relate to probable donors, their incomes are expected to be higher than the average income of the population. The zeros therefore suggest that there may have been issues during data collection (Kara 2013).
First, a sanity check should be carried out to establish whether the data were collected properly. If the zeros are a code for certain individuals, this should be stated clearly so that it does not confuse the analyst.
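A sanity check of this kind can be sketched in a few lines of Python. This is only an illustration: the income values are made up, and treating zero as a placeholder for "not recorded" is an assumption that the data provider would have to confirm.

```python
from collections import Counter

def zero_share(values):
    """Return the fraction of entries that are exactly zero."""
    return sum(1 for v in values if v == 0) / len(values)

def recode_zeros_as_missing(values):
    """Replace zeros with None so they are not treated as real incomes.

    Assumption: zero is a placeholder code, not a genuine median income.
    """
    return [None if v == 0 else v for v in values]

# Hypothetical regional median incomes; not taken from the donor data.
incomes = [0, 0, 0, 42000, 38500, 0, 51000, 61000]

mode_value, mode_count = Counter(incomes).most_common(1)[0]
share = zero_share(incomes)
cleaned = recode_zeros_as_missing(incomes)

print(f"mode = {mode_value}, share of zeros = {share:.0%}")
```

If the share of zeros is large and zero is also the mode, as in this toy list, the zeros are more likely a code than real values, and recoding them as missing keeps them out of later averages.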
After the income variable was rectified, the next step was data partitioning. As specified, the data set was partitioned into training and test data, with each set allotted 50% of the original data.
As shown in the table above, the data were partitioned as planned. The data set contained 4844 observations in total, of which 2420 were assigned to the training data.
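A 50/50 partition can be sketched as a simple random split of the row indices. The function name, the seed, and the use of simple random sampling are assumptions made for illustration. A plain random split of 4844 rows gives 2422 rows per set, so the 2420 training rows reported above presumably come from the tool's own sampling method, which the source does not describe.

```python
import random

def partition(n_obs, train_fraction=0.5, seed=12345):
    """Randomly split row indices 0..n_obs-1 into two disjoint sets."""
    indices = list(range(n_obs))
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    rng.shuffle(indices)
    cut = int(n_obs * train_fraction)
    return indices[:cut], indices[cut:]

# 4844 is the number of observations stated in the text.
train_idx, test_idx = partition(4844)
print(len(train_idx), len(test_idx))
```

Fixing the seed means the same rows land in the training set on every run, which keeps later model comparisons repeatable.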
Results from gradient boosting are shown in the figure below. Gradient boosting is one of the most common machine learning techniques for classification and regression. It builds the prediction model from an ensemble of weak learners, usually decision trees.
Since the dependent variable is binary, a logistic regression model was used; the results of the regression analysis are shown in the table below (McCarty & Hastak 2007).
In this section the results from all three models are compared. The Model Comparison node was used for this purpose. The ROC curves from the three models are shown below. All three models are compared against a baseline model, which is calculated using only the average values.
The model whose lift curve lies furthest above the baseline is the better model. From the graph it can be observed that gradient boosting has the highest lift in both the training and the validation data (Trebuna et al. 2014).
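The lift used in this comparison can be illustrated with a short sketch. Cumulative lift at a given depth is the response rate among the top-scored cases divided by the overall response rate, so the baseline model has a lift of 1. The scores and outcomes below are invented for illustration and are not output from any of the three models.

```python
def cumulative_lift(scores, outcomes, depth=0.1):
    """Lift of the top `depth` share of cases, ranked by predicted score.

    Assumes at least one positive outcome, otherwise the base rate is zero.
    """
    ranked = sorted(zip(scores, outcomes), key=lambda p: p[0], reverse=True)
    n_top = max(1, int(len(ranked) * depth))
    top_rate = sum(y for _, y in ranked[:n_top]) / n_top
    base_rate = sum(outcomes) / len(outcomes)
    return top_rate / base_rate

# Hypothetical predicted probabilities and actual responses (1 = donated).
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]
outcomes = [1, 1, 0, 1, 0, 0, 0, 0, 0, 0]

lift_top20 = cumulative_lift(scores, outcomes, depth=0.2)
print(round(lift_top20, 2))
```

Here the two highest-scored cases both responded, against an overall response rate of 30%, so contacting only the top 20% finds responders at more than three times the baseline rate.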
After the summary statistics, the next step was to check for missing values in the data set. The results show that there are 2407 missing values. After these were removed from the data set, the number of observations is 7279.
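Removing missing values in this way amounts to dropping every record with at least one empty field. The sketch below assumes that is what was done; the field names and rows are hypothetical and do not come from the data set.

```python
def drop_incomplete(rows):
    """Keep only the rows in which every field has a value."""
    return [r for r in rows if all(v is not None for v in r.values())]

# Hypothetical records; None marks a missing value.
rows = [
    {"age": 54, "income": 42000},
    {"age": None, "income": 38500},
    {"age": 61, "income": None},
    {"age": 47, "income": 51000},
]

complete = drop_incomplete(rows)
n_removed = len(rows) - len(complete)
print(len(complete), n_removed)
```

Reporting both the number of rows kept and the number removed makes it easy to check that the two add up to the original row count.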
Market Basket Analysis and Association Rules
The lift value shows the importance of the rule. In this case the lift value is greater than 1, which suggests that the rule body and the rule head occur together more often than would be expected if they were independent. In other words, the occurrence of the rule body has a positive effect on the occurrence of the rule head.
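The lift of a rule is the support of body and head together divided by the product of their individual supports. The sketch below shows the calculation on a small set of invented baskets. The item names echo the products discussed in this section, but the baskets and the resulting value are illustrative only.

```python
def support(transactions, items):
    """Share of transactions that contain every item in `items`."""
    items = set(items)
    return sum(1 for t in transactions if items <= set(t)) / len(transactions)

def rule_lift(transactions, body, head):
    """Lift of the rule body -> head; values above 1 mean positive association."""
    joint = support(transactions, set(body) | set(head))
    return joint / (support(transactions, body) * support(transactions, head))

# Hypothetical baskets, not the transactions analysed in this report.
baskets = [
    {"magazine", "candy bar", "greeting cards"},
    {"magazine", "candy bar", "greeting cards"},
    {"magazine", "pencils"},
    {"candy bar", "pencils"},
    {"greeting cards"},
    {"photo processing", "magazine"},
]

lift = rule_lift(baskets, {"magazine", "candy bar"}, {"greeting cards"})
print(lift)
```

In these toy baskets, greeting cards appear in half of all transactions but in every transaction that contains both a magazine and a candy bar, so the rule has a lift of 2.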
People who buy “Magazine & Candy Bar” together are also likely to buy Greeting Cards. The company should therefore target those customers to increase sales of these products.
Similarly, people are likely to buy Candy Bars with Pencils. This is probably because children are the main buyers of pencils and tend to pick up candy bars as well. A promotional strategy could be to offer candy bars with the purchase of a larger number of pencils.
People who buy Photo Processing also buy Magazines. This makes sense, as magazines contain photographs that may interest these customers and give them ideas about photography. A promotional strategy could be to bundle the two goods and cross-sell other magazines as well.
Fokin, D. & Hagrot, J., 2016. Constructing decision trees for user behavior prediction in the online consumer market. KTH Royal Institute of Technology.
Kara, H., 2013. Collecting primary data: A time-saving guide, Policy Press.
McCarty, J.A. & Hastak, M., 2007. Segmentation approaches in data-mining: A comparison of RFM, CHAID, and logistic regression. Journal of Business Research, 60(6), pp.656–662.
Trebuna, P., Halcinova, J. & Fil’o, M., 2014. The importance of normalization and standardization in the process of clustering. IEEE, 12, p.381.