Applying decision tree classification and association rule evaluation to the Titanic dataset
You are given a training dataset, “trainDataset.csv”, and a testing dataset, “testDataset.csv”, which will be provided in electronic form. The data are extracted and pre-processed from the original Titanic dataset. The attributes of each object (a passenger in this case) are defined as follows:
- Survived: represents whether the passenger survived (1) or did not survive (0);
- PC (Passenger Class): the class of the passenger on the ship;
- Sex: indicates the passenger’s sex;
- Age: indicates the passenger’s age group at the time of the ship’s departure;
- SS (Sibling Spouse): indicates the number of siblings/spouses that the passenger has on the ship.
You are required to apply the decision tree classification technique and association rule evaluation to the above case appropriately. Specifically, you are required to:
- Using the training dataset, apply the basic Hunt’s Algorithm to train a fully-grown decision tree model, where the selection of attributes should follow the sequence: If the attribute has multiple attribute values, please use a multiway split (do not use a binary split). Leaf nodes should be declared as a single class label (do not use a probability/fraction).
- Using the training dataset, apply the Greedy strategy combined with the Gini impurity measure to rebuild a fully-grown decision tree. If the attribute has multiple attribute values, please use a multiway split (do not use a binary split). Leaf nodes should be declared as a single class label (do not use a probability/fraction). Sample calculations and explanations should be provided to demonstrate how the Greedy strategy and the Gini impurity measure are applied.
- Use the test dataset to test the two fully-grown decision tree models, and discuss the results.
- Perform post-pruning on the two fully-grown decision trees by applying the following rules: (i) prune any sub-tree if its leaf nodes have the same class label, and (ii) prune any sub-tree if the number of objects (passengers) at each leaf node is not more than one. After pruning, please test the two pruned decision trees using the test dataset. Discuss the results.
- From the two pruned decision trees, extract the association rule for each leaf node based on the information on the path from the root node to that leaf node. Evaluate the support, confidence, and lift of the identified association rules using the training dataset. Discuss the results.
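As an illustration of the measures named in the tasks above, the following is a minimal Python sketch of (a) the weighted Gini impurity of a multiway split, used to make one greedy attribute choice, and (b) the support, confidence, and lift of a single rule of the form "path conditions -> Survived value". It is not a solution to the assignment. The function names (gini, gini_multiway_split, rule_metrics) and the six toy rows are hypothetical and are not taken from trainDataset.csv. Lift is assumed here to be the rule's confidence divided by the support of the consequent.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels: 1 - sum of squared class shares."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def gini_multiway_split(rows, attribute, target="Survived"):
    """Weighted Gini impurity after a multiway split (one branch per attribute value)."""
    groups = {}
    for row in rows:
        groups.setdefault(row[attribute], []).append(row[target])
    n = len(rows)
    return sum(len(labels) / n * gini(labels) for labels in groups.values())


def rule_metrics(rows, antecedent, consequent):
    """Support, confidence and lift of the rule antecedent -> consequent.

    Both arguments are dicts of {attribute: value} conditions.
    Lift is taken as confidence / support(consequent) (an assumption of this sketch).
    """
    def matches(row, conditions):
        return all(row[key] == value for key, value in conditions.items())

    n = len(rows)
    n_a = sum(matches(r, antecedent) for r in rows)
    n_c = sum(matches(r, consequent) for r in rows)
    n_ac = sum(matches(r, antecedent) and matches(r, consequent) for r in rows)
    support = n_ac / n
    confidence = n_ac / n_a if n_a else 0.0
    lift = confidence / (n_c / n) if n_c else 0.0
    return support, confidence, lift


# Hypothetical toy rows for illustration only, NOT from trainDataset.csv.
toy_rows = [
    {"Sex": "female", "PC": 1, "Survived": 1},
    {"Sex": "female", "PC": 3, "Survived": 1},
    {"Sex": "female", "PC": 3, "Survived": 0},
    {"Sex": "male", "PC": 1, "Survived": 1},
    {"Sex": "male", "PC": 3, "Survived": 0},
    {"Sex": "male", "PC": 3, "Survived": 0},
]

# One greedy step: choose the attribute with the lowest weighted Gini impurity.
candidates = ["Sex", "PC"]
scores = {a: gini_multiway_split(toy_rows, a) for a in candidates}
best_attribute = min(scores, key=scores.get)

# Metrics for the toy rule {PC = 1} -> {Survived = 1}.
support, confidence, lift = rule_metrics(toy_rows, {"PC": 1}, {"Survived": 1})
```

On these toy rows, splitting on Sex leaves a weighted impurity of 4/9 while splitting on PC leaves 0.25, so the greedy step would select PC. In a full tree the same comparison would be repeated at every node on the rows that reach that node.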
As the majority of the tasks in this assignment are problem-solving based, the word count will be treated as flexible: if all the required tasks have been appropriately addressed, you will not be penalised for having a word count below the word limit, which should be treated as an upper limit.
Assessment Criteria:
The assessment criteria will generally follow the marking guidelines provided in the Management School Student Handbook. Specific assessment criteria are highlighted below:
- Demonstrate understanding and knowledge of the relevant concepts, theories and techniques in data mining and machine learning;
- Demonstrate ability to apply relevant techniques and tools of data mining and machine learning to solve the given problem and tasks;
- Critically and analytically discuss results in a structured and logical manner;
- Demonstrate ability to support your arguments with evidence and references;
- Appropriate structure, presentation, use of English and use of the Harvard referencing style; for example, figures and tables should be legible at 100% zoom in full-screen mode.