Task 1: Implementing Random Forest
Learning outcomes assessed
The learning outcomes assessed are:
- develop, validate, evaluate, and use effectively machine learning models;
- apply methods and techniques such as ensemble methods and clustering;
- extract value and insight from data;
- implement machine-learning algorithms.
Task 1: Implementing Random Forest (50 marks)
You are asked to implement random forest for a regression problem. Random forest was discussed in Teaching Week 6. You are not allowed to use any existing implementations of decision trees or random forests in R, Python or any other language; you must code random forest from first principles.
You should apply your random forest program to the Boston dataset to predict medv. In other words, medv is the label, and the other 13 variables in the dataset are the attributes.
Split the dataset randomly into two equal parts, which will serve as the training set and the test set. Use your birthday (in the format MMDD) as the seed for the pseudorandom number generator. The same training set and test set should be used throughout this assignment. You need to complete the following parts:
(a) Generate B = 100 bootstrapped training sets (BTS) from the training set.
(b) Use each BTS to train a decision tree of height h = 3. Remember that you are implementing random forest, so at each internal node you do not consider all attributes, but only a random sample of them.
(c) Find the training MSE and the test MSE. Include them in your report.
(d) Repeat the above parts using different values of B and h. In your report, plot the training MSE and test MSE as functions of B and/or h, and discuss your observations.
In the code file, you should leave comments to clearly indicate which of your code snippets deals with which part (among (a), (b), (c) or (d) above). This helps the graders to understand your code more easily. Feel free to include in your report anything else that you find interesting.
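One possible shape of the workflow in parts (a)-(c) is sketched below. This is a hedged illustration, not the required solution: it is written in Python (the brief allows R, Python or any other language), it uses synthetic data in place of the Boston dataset, and the names (grow_tree, m_try, etc.) and the squared-error split criterion are assumptions of this sketch rather than requirements of the assignment.

```python
import numpy as np

rng = np.random.default_rng(704)  # e.g. birthday 0704 in MMDD format as the seed

def grow_tree(X, y, depth, m_try, rng):
    """Recursively grow a regression tree of limited height.
    At each internal node only m_try randomly chosen attributes are considered."""
    if depth == 0 or len(y) <= 1 or np.all(y == y[0]):
        return {"leaf": True, "value": y.mean()}
    best = None
    for j in rng.choice(X.shape[1], size=m_try, replace=False):
        for t in np.unique(X[:, j])[:-1]:        # candidate thresholds
            left = X[:, j] <= t
            sse = ((y[left] - y[left].mean()) ** 2).sum() + \
                  ((y[~left] - y[~left].mean()) ** 2).sum()
            if best is None or sse < best[0]:
                best = (sse, j, t)
    if best is None:
        return {"leaf": True, "value": y.mean()}
    _, j, t = best
    left = X[:, j] <= t
    return {"leaf": False, "attr": j, "thresh": t,
            "left": grow_tree(X[left], y[left], depth - 1, m_try, rng),
            "right": grow_tree(X[~left], y[~left], depth - 1, m_try, rng)}

def predict_tree(node, x):
    while not node["leaf"]:
        node = node["left"] if x[node["attr"]] <= node["thresh"] else node["right"]
    return node["value"]

def random_forest(X, y, B, h, m_try, rng):
    forest = []
    n = len(y)
    for _ in range(B):                       # part (a): B bootstrapped training sets
        idx = rng.integers(0, n, size=n)     # sample n rows with replacement
        forest.append(grow_tree(X[idx], y[idx], h, m_try, rng))   # part (b)
    return forest

def predict_forest(forest, X):
    # a forest prediction is the average of the individual tree predictions
    return np.array([np.mean([predict_tree(t, x) for t in forest]) for x in X])

def mse(y, yhat):                            # part (c)
    return float(np.mean((y - yhat) ** 2))
```

The exhaustive threshold scan above is the simplest correct choice for a first-principles implementation; for the Boston dataset (253 training rows after a 50/50 split) it is fast enough without further optimisation.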
Task 2: Implementing Agglomerative Hierarchical Clustering (50 marks)
You are asked to implement agglomerative hierarchical clustering (AHC). AHC was discussed in Teaching Week 7. You are not allowed to use any existing implementations of AHC in R or any other language; you must code AHC from first principles.
You should apply your AHC program to the NCI microarray dataset which can be downloaded from the module’s Moodle page. This dataset has 64 columns and 6830 rows, where each column is an observation (a cell line), and each row represents a feature (a gene). Therefore, the dataset is represented via its transposed data matrix. You can load the dataset using the following code snippet in R.
ncidata <- read.table("ncidata.txt")  # rows are genes (features), columns are cell lines
ncidata <- t(ncidata)                 # transpose so that each row is an observation
After executing the above code snippet, ncidata has 64 rows, and each row is an observation. Each observation is a vector in R^6830.
You need to complete the following parts:
(a) Implement AHC with the following linkage functions: single linkage, complete linkage, average linkage and centroid linkage. Your output should be a data structure that represents a dendrogram. We will discuss how to design a class for the nodes in a dendrogram below.
(b) Implement a function getClusters that takes a dendrogram and a positive integer K as arguments, and its output is the K clusters obtained by cutting the dendrogram at an appropriate height.
(c) In your report, use the getClusters function to discuss the performance of AHC with the four different linkage functions when applied to the NCI microarray dataset.
In the code file, you should leave comments to clearly indicate which of your code snippets deals with which part (among (a), (b) or (c) above). This helps the graders to understand your code more easily. Feel free to include in your report anything else that you find interesting.
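One possible design for the dendrogram node class and the getClusters function from parts (a)-(b) is sketched below. This is an assumption-laden illustration, not the required solution: it is in Python rather than R, it shows only single linkage (the other three linkage functions differ only in how the inter-cluster distance is computed), and the class name DendroNode and its fields are choices of this sketch, not of the brief.

```python
import numpy as np

class DendroNode:
    """A node of a dendrogram: a leaf holds one observation index; an
    internal node records the height at which its two children merged."""
    def __init__(self, members, height=0.0, left=None, right=None):
        self.members = members   # observation indices in this subtree
        self.height = height     # merge height (0 for leaves)
        self.left = left
        self.right = right

def ahc_single_linkage(X):
    """Agglomerative clustering with single linkage; returns the root node."""
    nodes = [DendroNode([i]) for i in range(len(X))]
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)  # pairwise distances
    while len(nodes) > 1:
        # find the closest pair of current clusters (single linkage:
        # minimum pairwise distance between members of the two clusters)
        best = None
        for a in range(len(nodes)):
            for b in range(a + 1, len(nodes)):
                d = min(D[i, j] for i in nodes[a].members for j in nodes[b].members)
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merged = DendroNode(nodes[a].members + nodes[b].members,
                            height=d, left=nodes[a], right=nodes[b])
        nodes = [n for k, n in enumerate(nodes) if k not in (a, b)] + [merged]
    return nodes[0]

def getClusters(root, K):
    """Cut the dendrogram into K clusters by repeatedly splitting the
    cluster whose merge height is largest (assumes K <= no. of observations)."""
    clusters = [root]
    while len(clusters) < K:
        top = max((n for n in clusters if n.left is not None),
                  key=lambda n: n.height)
        clusters.remove(top)
        clusters += [top.left, top.right]
    return [c.members for c in clusters]
```

Splitting the highest-merging cluster first is equivalent to cutting the dendrogram just below its (K-1)-th largest merge height, which is the "appropriate height" part (b) refers to. The quadratic pairwise scan is adequate for the 64 observations of the NCI dataset.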