Question 1
Using techniques covered in class, which of the variables look like good candidates to help separate defaulters from non-defaulters? For the two you consider good choices, include brief output supporting this (e.g., one chart per variable), with a brief explanation of why each variable may be a good choice.
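Outside JMP, the same screening idea can be sketched in code: a variable is a promising separator when its per-class distributions differ noticeably. The sketch below uses made-up data and hypothetical column names (`default`, `balance`, `income`), so it is an illustration of the idea, not the course dataset.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the default dataset (all values synthetic).
rng = np.random.default_rng(0)
n = 1000
is_yes = rng.random(n) < 0.1  # roughly 10% defaulters, as in many default datasets

# 'balance' is constructed to separate the classes; 'income' is not.
balance = np.where(is_yes, rng.normal(1700, 300, n), rng.normal(800, 300, n))
income = rng.normal(40000, 10000, n)

df = pd.DataFrame({"default": np.where(is_yes, "yes", "no"),
                   "balance": balance, "income": income})

# A quick screen: compare per-class means (in JMP you would look at
# comparative histograms or box plots instead).
summary = df.groupby("default")[["balance", "income"]].mean()
print(summary)
```

Here `balance` shows a large gap between the class means relative to its spread, while `income` does not, which is the visual pattern the per-variable charts should reveal.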
Question 2
Using the original unbalanced dataset, build a Naïve Bayes classifier. (In JMP Pro, it is one of the options for the Partition tool.) Use between 70% and 90% of the data (your choice) for your training set, and the remainder for the test set.
Include in your answer the confusion matrices (training and validation). On your matrices, calculate the correct classification rates, and interpret each (both the percentage of true yeses and nos that were classified properly, and the probability that a yes/no classification is correct). Also include the overall misclassification rates, and the ROC and lift curves. (Because we are interested in predicting defaults, we are more interested in ‘yes’ than ‘no’ responses. Thus, be sure your ROC and lift curves show the plots for ‘yes’ rather than ‘no’. One way to do this is, before (re)doing your analysis, to right-click on the ‘default’ column in the original data table and choose Column Properties > Value Ordering. Because ‘yes’ is the most interesting, move it to the top of the list in the ‘Value Ordering’ box. Then (re)run your analysis.)
Interpret your results. Is this classifier doing a good job? Be careful to think of the nature of the data and the business problem.
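As a reminder of what each rate measures, the arithmetic can be sketched directly from a confusion matrix. The counts below are made up purely for illustration; they are not results from the assignment's dataset.

```python
# Hypothetical confusion matrix (rows = actual class, columns = predicted class).
tp, fn = 30, 70     # actual 'yes': predicted yes / predicted no
fp, tn = 50, 850    # actual 'no':  predicted yes / predicted no

# Rates by actual class: the percentage of true yeses/nos classified properly.
sensitivity = tp / (tp + fn)   # true-yes rate
specificity = tn / (tn + fp)   # true-no rate

# Rates by predicted class: the probability that a yes/no classification is correct.
ppv = tp / (tp + fp)           # P(actually yes | classified yes)
npv = tn / (tn + fn)           # P(actually no  | classified no)

misclassification = (fp + fn) / (tp + fn + fp + tn)

print(f"sensitivity={sensitivity:.3f}  specificity={specificity:.3f}")
print(f"PPV={ppv:.3f}  NPV={npv:.3f}  misclassification={misclassification:.3f}")
```

Note how, with rare defaults, a low overall misclassification rate can coexist with a poor true-yes rate, which is exactly the tension the interpretation question is probing.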
Question 3
Now, build a decision tree from the same dataset. In this case, you will need to balance the data. Balance it using weighting, according to the rules that we have discussed. Show how you determined what size weights you used to balance your set. Use between 70% and 80% of the data (your choice) for your training set, 10% for the validation set, and the remainder for the test set.
Include in your answer the tree diagram, the ‘split history’ diagram showing how the optimum size was obtained, and the confusion matrices (training, test, and validation). For only the test set confusion matrix:
• Correct for the weighting process, to determine the performance for the original, unbalanced distribution. Show your calculations. (If you need help with this, it is covered in the exercises document for the textbook.)
• After doing so, calculate the correct classification rates, and interpret each (both the percentage of true yeses and nos that were classified properly, and the probability that a yes/no classification is correct).
Interpret your results. Is this classifier doing a good job? How does it compare to the Naïve Bayes classifier? Be careful to think of the nature of the data and the business problem.
Consider your decision tree. Can you make general statements about who is likely to default, based on the rules in your tree?
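One way to think about the weighting correction: each weighted cell of the confusion matrix is the original count multiplied by its class weight, so dividing each row by that class's weight recovers the counts for the original, unbalanced distribution. The sketch below uses invented weights and counts (a 10%-yes dataset balanced by giving ‘yes’ rows weight 9); your actual weights and matrix will differ.

```python
# Hypothetical weights used to balance a 10% yes / 90% no dataset.
weights = {"yes": 9.0, "no": 1.0}

# Hypothetical WEIGHTED test-set confusion matrix (rows = actual class).
weighted = {"yes": {"yes": 720, "no": 180},   # actual yes: pred yes / pred no
            "no":  {"yes": 100, "no": 800}}   # actual no:  pred yes / pred no

# Undo the weighting: divide each row by its class weight to get the
# counts that would appear on the original, unbalanced distribution.
corrected = {actual: {pred: count / weights[actual]
                      for pred, count in row.items()}
             for actual, row in weighted.items()}

tp = corrected["yes"]["yes"]; fn = corrected["yes"]["no"]
fp = corrected["no"]["yes"];  tn = corrected["no"]["no"]

# Rates on the corrected (unweighted) matrix.
print(f"true-yes rate = {tp/(tp+fn):.3f}")   # % of actual yeses caught
print(f"P(yes correct) = {tp/(tp+fp):.3f}")  # probability a 'yes' call is right
```

Notice that the true-yes rate is unchanged by the correction (the whole ‘yes’ row is scaled by the same factor), but the probability that a ‘yes’ classification is correct usually drops, because the rare class shrinks back to its true size.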