Customer risk assessment with decision tree.

Implementing a Decision Tree for Assessing Customer RISK Levels

1. Goal

A decision tree is a data structure that a classifier traverses to make a prediction. Each internal node carries a ‘testing attribute’, and its branches are the possible outcomes of that test. Each leaf node carries a class label. An object whose class label is being predicted starts at the root and is guided at each node by the test outcome along a specific path to a leaf node, where it is assigned that leaf's class label. The goal of this project is to implement a decision tree for the following application: customers are assessed for their RISK level, high or low, based on five attributes: age, credit_history, income, race, health. The following sections contain details that you will need to complete your project.

2. How to represent a decision tree

A crucial step in building a decision tree is selecting the testing attribute at each node. This has been discussed in detail in class. An issue that has not been discussed in class, however, is exactly how to represent a decision tree as a data structure in a programming environment. This is an implementation matter and depends on the language you use. In general, you need to represent the following information for each node: the testing attribute, the values in the domain of the testing attribute, and, for each branch, the identity of the child it leads to. In Python, you can use a sequence type such as a list for this purpose. For each internal node, a list of the following form may be used:
[A, {v1: B1, …, vp: Bp}]
where A is the testing attribute at the node. The key-value pairs in the dictionary represent the branches, where vi is a value in Domain(A) and Bi is a reference to a child. For a leaf node, simply use [L], where L is the class label attached to the node. For example, consider the following decision tree:

The above decision tree contains a total of eight nodes. They are represented as follows:

A: ['age', {'young': B1, 'adult': B2, 'senior': B3}]
B1: ['salary', {'high': C1, 'low': C2}]
B2: ['YES']
B3: ['occupation', {'worker': C3, 'politician': C4}]
C1: ['YES']
C2: ['NO']
C3: ['NO']
C4: ['YES']
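
As a minimal sketch, the eight nodes listed above can be written directly as Python lists and dictionaries. The leaves are created first so that each internal node can refer to children that already exist; the variable names simply mirror the node labels above.

```python
# Leaf nodes: a one-element list holding the class label.
C1 = ['YES']
C2 = ['NO']
C3 = ['NO']
C4 = ['YES']
B2 = ['YES']

# Internal nodes: [testing attribute, {attribute value: child node}].
B1 = ['salary', {'high': C1, 'low': C2}]
B3 = ['occupation', {'worker': C3, 'politician': C4}]

# The root, which identifies the whole tree.
A = ['age', {'young': B1, 'adult': B2, 'senior': B3}]
```

Because the dictionary values are references to the child lists, following a branch is a plain dictionary lookup, for example A[1]['young'] is the node B1.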

In almost all cases, the root is used to identify the entire decision tree. The above representation is easy to work with. For example, suppose each person is represented by a dictionary of the form {'age': v1, 'salary': v2, 'occupation': v3}. To predict the class label for an individual p, we can simply use the following loop:

    node ← A
    while node is an internal node (a list of length 2)
        v ← p[node[0]]
        node ← node[1][v]
    return node[0]

Note that a leaf [L] is also a list, so the loop must stop when the node has no branch dictionary. It is equally easy to predict the class labels for a group of instances, such as those represented in a matrix. (Detail omitted.)
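
The prediction loop can be sketched in Python as follows. This is one possible rendering, not a prescribed solution: the function names predict and predict_all are illustrative, and the sketch tells internal nodes from leaves by length, since a leaf [L] is itself a list. The tree literal repeats the example tree from the previous section in nested form.

```python
def predict(root, person):
    """Return the class label that the tree rooted at `root` assigns to `person`."""
    node = root
    # An internal node has two elements (attribute, branches); a leaf has one.
    while len(node) == 2:
        attribute, branches = node
        # Follow the branch matching this person's value for the attribute.
        node = branches[person[attribute]]
    return node[0]


def predict_all(root, people):
    """Predict a label for each person in a list of dictionaries."""
    return [predict(root, p) for p in people]


# The example tree, written in nested form.
tree = ['age', {
    'young': ['salary', {'high': ['YES'], 'low': ['NO']}],
    'adult': ['YES'],
    'senior': ['occupation', {'worker': ['NO'], 'politician': ['YES']}],
}]

p = {'age': 'young', 'salary': 'low', 'occupation': 'worker'}
print(predict(tree, p))  # prints NO
```

A value that has no matching branch raises a KeyError in this sketch; how to handle such cases is left to the implementation.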

3. Encoding the values for nominal attributes

In many cases, you need to test the accuracy of the decision tree on a testing set. The testing set is usually represented as an m×n matrix where the rows are attributes and the columns are objects. In numpy, the most commonly used input function that operates on a matrix is loadtxt(*). By default, however, this function expects the matrix to contain numerical values, while in most cases the values of a nominal attribute such as age are symbols.
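
One way to bridge this gap, shown here only as a hedged sketch, is to replace each symbol with an integer code before the matrix is written to a numeric text file. The helper names build_codes and encode_matrix, the sample values, and the choice of numbering symbols in order of first appearance are all assumptions made for illustration; the project may prescribe a different scheme.

```python
def build_codes(values):
    """Map each distinct symbol to an integer, in order of first appearance."""
    codes = {}
    for v in values:
        if v not in codes:
            codes[v] = len(codes)
    return codes


def encode_matrix(rows):
    """Encode a matrix whose rows are attributes and whose columns are objects.

    Returns the integer matrix and one code table per attribute row.
    """
    tables = [build_codes(row) for row in rows]
    encoded = [[table[v] for v in row] for row, table in zip(rows, tables)]
    return encoded, tables


# Hypothetical sample: two attributes (rows) observed on four objects (columns).
rows = [
    ['young', 'adult', 'senior', 'young'],  # age
    ['high', 'low', 'low', 'high'],         # salary
]
encoded, tables = encode_matrix(rows)

# Whitespace-separated numeric lines, suitable for saving to a text file.
lines = [' '.join(str(code) for code in row) for row in encoded]
print('\n'.join(lines))
```

The code tables must be kept alongside the encoded matrix, because the tree's branch keys have to use the same codes (or the codes have to be mapped back to symbols) before prediction.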