The purpose of this project is to implement a decision-tree classifier that predicts whether a person’s income is more than $50K. The project has several components: preprocessing the data, implementing the classification model, assessing the performance of the classifier under different stopping criteria, and analyzing some of the resulting models. Python is the recommended implementation language. If you wish to use another language, you must ask for permission first. This project is designed for teams of 2 or 3 students.
Dataset
The dataset is derived from the Census income data. There is a training set (adult.train) and a test set (adult.test). You need to train the classifiers on the training set and evaluate your model on the test set. The provided files do not have a header line; the attribute values are comma-separated. The columns/attributes included in the data files are as follows:
1. age : The age of the individual.
2. type_employer : The type of the employer that the individual has.
3. fnlwgt : The number of people the census takers believe that observation represents.
4. education : The highest level of education achieved for that individual.
5. education_num : Highest level of education in numerical form.
6. marital : Marital status of the individual.
7. occupation : The occupation of the individual.
8. relationship : The family relationship of the individual.
9. race : The descriptions of the individual's race.
10. sex : Biological Sex.
11. capital_gain : Capital gains recorded.
12. capital_loss : Capital Losses recorded.
13. hr_per_week : Hours worked per week.
14. country : Country of origin for person.
15. income : Whether the person makes more than $50,000 per year. This is the class label.
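Because the files carry no header line, the column names have to be supplied when reading them. The following is a minimal sketch, not a required implementation: the names `COLUMNS` and `read_adult` are illustrative, and it assumes that values may be followed by spaces after the commas and that blank or malformed lines can be skipped.

```python
import csv
import io

# Column names taken from the attribute list above; the data files have no header.
COLUMNS = [
    "age", "type_employer", "fnlwgt", "education", "education_num",
    "marital", "occupation", "relationship", "race", "sex",
    "capital_gain", "capital_loss", "hr_per_week", "country", "income",
]

def read_adult(stream):
    """Yield one dict per data row, keyed by COLUMNS, with values stripped."""
    reader = csv.reader(stream, skipinitialspace=True)
    for row in reader:
        if len(row) != len(COLUMNS):
            continue  # assumption: blank or malformed lines are skipped
        yield dict(zip(COLUMNS, (value.strip() for value in row)))

# Usage with an in-memory sample; for a real file, pass open("adult.train", newline="").
sample = io.StringIO(
    "39, State-gov, 77516, Bachelors, 13, Never-married, Adm-clerical, "
    "Not-in-family, White, Male, 2174, 0, 40, United-States, <=50K\n"
    "\n"
)
rows = list(read_adult(sample))
print(rows[0]["age"], rows[0]["income"])
```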
You need to develop an algorithm to train a binary decision tree classifier and use the confusion matrix to evaluate the performance of your classifier. For training the decision tree, you should use Gini Index to select the splitting feature and best split (if the attribute has more than two attribute values) at each node. In any subtree, you should not use the same attribute twice (i.e., the path from the root to any leaf node should not repeat an attribute as the splitting attribute). Splitting stops when one of the following happens:
• the node is pure (i.e., the labels of the data points in that node belong to the same class),
• the number of data points in the node is less than a threshold <minfreq>,
• all the features are used.
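The split selection and the stopping rules above can be sketched as follows. This is one possible approach under stated assumptions, not the required design: rows are assumed to be dicts with the class label under the key `income`, every two-way partition of an attribute's values is tried exhaustively (affordable here because each attribute has only a few values after preprocessing), and all function names are illustrative.

```python
from collections import Counter
from itertools import combinations

def gini(labels):
    """Gini index 1 - sum(p_c^2) of a list of class labels (0.0 if empty)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def weighted_gini(left, right):
    """Gini index after a split: size-weighted average of the two children."""
    n = len(left) + len(right)
    return len(left) / n * gini(left) + len(right) / n * gini(right)

def best_binary_split(rows, attribute, label="income"):
    """Return (gini, left_values, right_values) for the best two-way
    partition of the attribute's values, or None if it cannot be split."""
    values = sorted({row[attribute] for row in rows})
    if len(values) < 2:
        return None
    # Keeping values[0] on the left avoids testing mirror-image partitions twice.
    first, rest = values[0], values[1:]
    best = None
    for size in range(len(rest)):  # the right side is never empty
        for extra in combinations(rest, size):
            left_values = {first, *extra}
            left = [r[label] for r in rows if r[attribute] in left_values]
            right = [r[label] for r in rows if r[attribute] not in left_values]
            score = weighted_gini(left, right)
            if best is None or score < best[0]:
                best = (score, sorted(left_values),
                        sorted(set(values) - left_values))
    return best

def should_stop(labels, used_attributes, all_attributes, minfreq):
    """The three stopping rules: pure node, too few points, no attributes left."""
    return (len(set(labels)) <= 1
            or len(labels) < minfreq
            or len(used_attributes) == len(all_attributes))

# Tiny made-up example (not real census data).
toy = [{"age": a, "income": y} for a, y in [
    ("young", "<=50K"), ("young", "<=50K"), ("adult", ">50K"),
    ("adult", "<=50K"), ("senior", ">50K"), ("senior", ">50K"),
]]
before = gini([r["income"] for r in toy])
best = best_binary_split(toy, "age")
print(before, best)
```

In a full tree builder, `best_binary_split` would be called for every attribute not yet used on the path from the root, and the attribute with the lowest weighted Gini index would be chosen.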
You should set the label of a leaf node to the majority label among the data points in that node.
You need to develop four programs for pre-processing, training, testing, and evaluation, respectively. The command line for each program should be:
1) Pre-processing
preprocessing.py <datafile> <datareadyfile>
The <datafile> is the name of any file that has the format described earlier. You need to read the file and perform the following actions:
1. remove instances that have any missing values
2. remove attributes: fnlwgt, education-num, relationship
3. binarize the following attributes:
• capital-gain (yes: >0, no: =0)
• capital-loss (yes: >0, no: =0)
• native country (United-States, other)
4. discretize continuous attributes:
• age: divide values of age to 4 levels: young (<=25), adult ([26,45]), senior ([46,65]), and old ([66,90]).
• hours-per-week: divide the values into 3 levels: part-time (<40), full-time (=40), over-time (>40)
5. merge attribute values together/reassign attribute values:
• workclass: create the following 4 values: gov (Federal-gov, Local-gov, State-gov), Not-working (Without-pay, Never-worked), Private, Self-employed (Self-emp-inc, Self-emp-not-inc)
• education: create the following 5 values: BeforeHS (Preschool, 1st-4th, 5th-6th, 7th-8th, 9th, 10th, 11th, 12th), HS-grad, AfterHS (Prof-school, Assoc-acdm, Assoc-voc, Some-college), UGrD, GrD (Masters, Doctorate).
• marital-status: create the following 3 values: Married (Married-AF-spouse, Married-civ-spouse), Never-married, Not-married (Married-spouse-absent, Separated, Divorced, Widowed)
• occupation: create the following 5 values: Exec-managerial, Prof-specialty, Other (Tech-support, Adm-clerical, Priv-house-serv, Protective-serv, Armed-Forces, Other-service), ManualWork (Craft-repair, Farming-fishing, Handlers-cleaners, Machine-op-inspct, Transport-moving), Sales.
6. Write the data instances back to the <datareadyfile> in the same format as before (comma-separated attribute values, with the class label as the last value). Make sure to include a header line with the attribute names.
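The per-value transformations in the steps above can be written as small functions and lookup tables. The sketch below covers the missing-value check, the binarization and discretization rules, and the workclass merge; the other merges (education, marital-status, occupation) would follow the same dictionary pattern. All names are illustrative. It assumes that missing values are marked with "?", which should be checked against the provided files, and it maps any age above 65 to "old".

```python
# Merge table for workclass, copied from the grouping listed above.
WORKCLASS = {
    "Federal-gov": "gov", "Local-gov": "gov", "State-gov": "gov",
    "Without-pay": "Not-working", "Never-worked": "Not-working",
    "Private": "Private",
    "Self-emp-inc": "Self-employed", "Self-emp-not-inc": "Self-employed",
}

def has_missing(values, marker="?"):
    """True if any field is empty or equals the missing-value marker.
    Assumption: the marker is "?"; verify this against your data files."""
    return any(value.strip() in ("", marker) for value in values)

def discretize_age(age):
    """young (<=25), adult (26-45), senior (46-65), old (66 and above)."""
    age = int(age)
    if age <= 25:
        return "young"
    if age <= 45:
        return "adult"
    if age <= 65:
        return "senior"
    return "old"

def discretize_hours(hours):
    """part-time (<40), full-time (=40), over-time (>40)."""
    hours = int(hours)
    if hours < 40:
        return "part-time"
    if hours == 40:
        return "full-time"
    return "over-time"

def binarize_positive(amount):
    """yes if the capital gain or loss is greater than 0, otherwise no."""
    return "yes" if int(amount) > 0 else "no"

def binarize_country(country):
    """Keep United-States; map every other country to 'other'."""
    return "United-States" if country == "United-States" else "other"

print(discretize_age("39"), discretize_hours("40"),
      binarize_positive("2174"), binarize_country("Mexico"),
      WORKCLASS["State-gov"])
```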
2) Training
dtbuild.py <trainfile> <modelfile> <minfreq>
The training program will read in the training set and output the trained model. It will take three parameters as input:
• <trainfile> is the name of the training file (after the pre-processing step).
• <modelfile> is the name of the file where you will write the learned decision tree. The format of the model file should have one line per node, and it should have the following for the internal and leaf nodes, respectively:
<nodeID>:<parentID>:<splitting_attribute>:<left_attr_values>:<left_nodeID>:<right_attr_values>:<right_nodeID>:<attr_used>
<nodeID>:<parentID>:leaf:<class_label>
o <right/left_attr_values> are the attribute values for the right and left child nodes, respectively. If there is more than one value, the different values are comma-separated.
o <right/left_nodeID> are the node IDs that correspond to the right and left children of the current node.
o <attr_used> is a series of binary indicators showing which attributes have been used so far, including the splitting attribute of this node. For example, if the attribute age is used at the root, attr_used = 10000000000: one “1” in the first position, because the 1st attribute was used to split the node, followed by ten “0”s.
o <parentID> is the nodeID of the parent node. If the node is the root, then parentID = NULL.
Example of acceptable format:
Line1: n1:NULL:age:young:n2:adult,senior,old:n3
Line2: n2:n1:leaf:>50k
Line3: n3:n1:education:UGrD,GrD:n4:BeforeHS,HS-grad,AfterHS:n5
• <minfreq> is the minimum number of data points in a node such that the node will keep splitting. Experiment with the following values: {5, 10, 40}.
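The model-file lines described above can be produced with a few string helpers. This is a sketch under assumptions: the `ATTRIBUTES` list reflects the eleven non-class attributes that remain after preprocessing, in file order, with names chosen here for illustration; the leaf line separates the class label with a colon, following the example lines; and the function names are hypothetical.

```python
# Assumed order and names of the 11 non-class attributes after preprocessing.
ATTRIBUTES = [
    "age", "workclass", "education", "marital-status", "occupation", "race",
    "sex", "capital-gain", "capital-loss", "hours-per-week", "native-country",
]

def attr_used_bits(attributes, used):
    """One character per attribute, in order: '1' if already used on the path."""
    return "".join("1" if name in used else "0" for name in attributes)

def internal_line(node_id, parent_id, attribute,
                  left_values, left_id, right_values, right_id, bits):
    """Format an internal node; the root has parent_id None, written as NULL."""
    parent = "NULL" if parent_id is None else parent_id
    return ":".join([
        node_id, parent, attribute,
        ",".join(left_values), left_id,
        ",".join(right_values), right_id,
        bits,
    ])

def leaf_line(node_id, parent_id, class_label):
    """Format a leaf node."""
    parent = "NULL" if parent_id is None else parent_id
    return f"{node_id}:{parent}:leaf:{class_label}"

bits = attr_used_bits(ATTRIBUTES, {"age"})
root_line = internal_line("n1", None, "age", ["young"], "n2",
                          ["adult", "senior", "old"], "n3", bits)
print(root_line)
print(leaf_line("n2", "n1", ">50k"))
```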
3) Testing
dtclassify.py <modelfile> <testfile> <predictions>
The testing program will read in the model that was previously produced by the training program and classify the test set. It will take three parameters as input:
• <modelfile> is the produced model file.
• <testfile> is the name of the test file (after the pre-processing step).
• <predictions> is the output file with the produced predictions. The predictions file should have two columns: the first column contains the true labels of the data points, and the second column contains the corresponding predicted labels. The order of the predictions must match the order of the data instances in the initial data file. In other words, the true and the predicted labels in the 10th line need to correspond to the 10th person in the testfile.
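Classification amounts to parsing the model file into a node table and walking from the root to a leaf for each test row. The sketch below is one way to do it, under these assumptions: model lines follow the colon-separated layout with `leaf:<class_label>`, rows are dicts keyed by attribute name with the true label under `income`, a value not listed on the left side is sent to the right child, and the two prediction columns are comma-separated (the separator is not specified above). Function names are illustrative.

```python
def parse_model(lines):
    """Return (root_id, nodes), where nodes maps a node ID to its description."""
    root, nodes = None, {}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        fields = line.split(":")
        node_id, parent_id = fields[0], fields[1]
        if fields[2] == "leaf":
            nodes[node_id] = {"leaf": True, "label": fields[3]}
        else:
            nodes[node_id] = {
                "leaf": False,
                "attribute": fields[2],
                "left_values": set(fields[3].split(",")),
                "left": fields[4],
                "right": fields[6],
            }
        if parent_id == "NULL":
            root = node_id
    return root, nodes

def classify(root, nodes, row):
    """Follow the splits from the root until a leaf is reached."""
    node = nodes[root]
    while not node["leaf"]:
        value = row[node["attribute"]]
        # Assumption: any value not listed on the left goes to the right child.
        child = node["left"] if value in node["left_values"] else node["right"]
        node = nodes[child]
    return node["label"]

def prediction_lines(root, nodes, rows, label="income"):
    """One 'true,predicted' line per row, in the same order as the input rows."""
    return [f"{row[label]},{classify(root, nodes, row)}" for row in rows]

model = [
    "n1:NULL:age:young:n2:adult,senior,old:n3:10000000000",
    "n2:n1:leaf:<=50K",
    "n3:n1:leaf:>50K",
]
root, nodes = parse_model(model)
test_rows = [{"age": "young", "income": "<=50K"},
             {"age": "senior", "income": "<=50K"}]
lines = prediction_lines(root, nodes, test_rows)
print(lines)
```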
4) Evaluation
dtevaluate.py <predictions>
This program takes the predictions as input, computes the full confusion matrix (including the number of positives, negatives, true, false, and N), and calculates the error rate, accuracy, recall, and F1-score. The results should be printed to standard output.
Note: You should try to make your code as efficient as possible.
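The evaluation step can be sketched as counting the four confusion-matrix cells and deriving the metrics from them. The sketch assumes that ">50K" is the positive class and that each prediction is available as a (true, predicted) pair; precision is computed only because the F1-score needs it. The function names are illustrative, and ratios with a zero denominator are reported as 0.0 by choice.

```python
def confusion_counts(pairs, positive=">50K"):
    """Count (tp, fp, tn, fn) from an iterable of (true, predicted) labels."""
    tp = fp = tn = fn = 0
    for true, predicted in pairs:
        if predicted == positive:
            if true == positive:
                tp += 1
            else:
                fp += 1
        elif true == positive:
            fn += 1
        else:
            tn += 1
    return tp, fp, tn, fn

def metrics(tp, fp, tn, fn):
    """Error rate, accuracy, precision, recall, and F1 from the four counts."""
    n = tp + fp + tn + fn
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {
        "N": n,
        "accuracy": (tp + tn) / n if n else 0.0,
        "error_rate": (fp + fn) / n if n else 0.0,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }

# Made-up predictions, for illustration only.
pairs = [(">50K", ">50K"), (">50K", "<=50K"), ("<=50K", "<=50K"),
         ("<=50K", ">50K"), ("<=50K", "<=50K")]
counts = confusion_counts(pairs)
result = metrics(*counts)
print(counts, result)
```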
Deliverables:
- The four code files, named as indicated above. (total: 60%) (Due Nov. 24th)
- Initial report (total: 15%) (Due Nov. 14th)
Q1) For each attribute (except the class label) remaining in the processed dataset (the <datareadyfile> produced by preprocessing.py), create a bar plot that shows the distribution of the positive (>50K) and negative (<=50K) class labels over the different attribute values.
Q2) Run your code and build a model that has only one internal node (the root) and two children, i.e., split only the root node and stop.
a. What is the Gini index before and after the split?
b. What is the output of the dtevaluate.py for the test set?
c. Submit the model file and the file with the predictions of the test file.
Q3) Describe the implementation of the decision tree classifier, how you treated attributes that had more than two values, and how you represented the tree in your code. What information do you keep for each node?
Q4) Run your code and report the output of dtevaluate.py on the test set for three cases: <minfreq> = 5, 10, and 40.
Q5) Submit the predictions file for each case in Q4, along with the processed input files for both the training and test data.