If you are unable to complete your open assessment by the submission date indicated above because of Exceptional Circumstances, you can apply for an extension. If unforeseeable and exceptional circumstances do occur, you must seek support and provide evidence as soon as possible after they arise.
Suppose the Mars 2020 rover will be visiting a particular crater on the surface of Mars to collect some samples. The crater was previously visited by the Opportunity rover, and Opportunity analysed some rock samples there, so we already have some data about the types of rock that are present in that crater. Opportunity analysed 185 rock samples and made some simple measurements of each one.
The file rocks-assessment3.arff (included with this assessment brief) contains four measurements of each of the 185 samples analysed by the Opportunity rover. The four measurements are: reflective-red, reflective-blue, hard, and porous. Each one is normalised to be on a 0-10 scale.
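If you want to sanity-check the data outside WEKA, the sketch below shows one way to read a simple numeric ARFF file in Python. It is only an illustration: the relation name and the two data rows are made up for the example and are not taken from rocks-assessment3.arff, and the parser assumes unquoted attribute names, comma-separated numeric values and no missing entries, which may not hold for the real file.

```python
# Illustrative ARFF text. The attribute names come from the brief; the
# relation name and the data rows are HYPOTHETICAL placeholders.
EXAMPLE_ARFF = """\
% comment lines start with a percent sign
@relation rocks-example
@attribute reflective-red numeric
@attribute reflective-blue numeric
@attribute hard numeric
@attribute porous numeric
@data
1.0,2.0,3.0,4.0
5.5,6.5,7.5,8.5
"""


def parse_arff(text):
    """Parse a purely numeric ARFF document into (attribute_names, rows)."""
    attributes, rows, in_data = [], [], False
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("%"):
            continue  # skip blank lines and comments
        if in_data:
            rows.append([float(value) for value in line.split(",")])
        elif line.lower().startswith("@attribute"):
            attributes.append(line.split()[1])  # second token is the name
        elif line.lower().startswith("@data"):
            in_data = True  # everything after this line is data
    return attributes, rows


attributes, rows = parse_arff(EXAMPLE_ARFF)
print(attributes)
print(len(rows), "rows")
```

For the real file you would read its contents with open(...).read() and pass the text to parse_arff; a quick check is that you get four attribute names and 185 rows, with every value between 0 and 10.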
Your assignment is to work out how many sample tubes will be needed by the Mars 2020 rover using the data collected by Opportunity. To maximise the scientific value of the samples, we want the rover to collect exactly one sample of each distinct type of rock. For example, if there were three types of rock, we would collect one sample of each and use three tubes.
You should use k-means clustering for this task. The theory behind k-means clustering is described in Unit 6.5 of the online module (including 6.5.1).
Instructions for downloading and installing WEKA are contained in Unit 6.6 of the online module. The same page also has instructions for using WEKA that you may find useful for this assessment.
Load the rocks data set into WEKA in order to perform k-means clustering on it. As part of this task, you will need to discover how many clusters to use (i.e. a suitable value of k). To do this, run the clustering algorithm with several values of k and plot a measure of variance, or distance, within the clusters against k. Then analyse the plot and determine a suitable value of k for this data set. WEKA reports the within-cluster sum of squared errors as a measure of variance within the clusters. In Unit 6.5 of the module, I used the sum of the Euclidean distances instead. Both are sensible measures of the quality of a clustering, and either one can be used with the elbow point method to determine a suitable value of k.
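To make the two measures concrete, here is a minimal Python sketch of k-means that reports both the within-cluster sum of squared errors and the sum of Euclidean distances for several values of k. It is not WEKA and not a substitute for it. The data are synthetic (three artificial blobs on a 0-10 scale, generated only so that the example runs) and say nothing about the rocks data set. The function names are my own, and the farthest-first choice of starting centroids is an assumption made so that the output is reproducible; WEKA's own initialisation and any preprocessing it applies may give different numbers.

```python
import math
import random


def make_synthetic(seed=1):
    """SYNTHETIC 4-D points in three blobs; not the rocks data."""
    rng = random.Random(seed)
    centres = [(2, 2, 2, 2), (5, 8, 5, 8), (8, 3, 8, 3)]
    return [
        tuple(min(10.0, max(0.0, rng.gauss(m, 0.3))) for m in centre)
        for centre in centres
        for _ in range(30)
    ]


def farthest_first(points, k):
    """Deterministic start (an assumption): take the first point, then
    repeatedly add the point farthest from the centroids chosen so far."""
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(
            max(points, key=lambda p: min(math.dist(p, c) for c in centroids))
        )
    return [list(c) for c in centroids]


def kmeans(points, k, max_iter=100):
    """Lloyd's algorithm; returns (centroids, assignments)."""
    centroids = farthest_first(points, k)
    assignments = None
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest centroid.
        new = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        if new == assignments:
            break  # nothing changed, so the clustering has converged
        assignments = new
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:  # an empty cluster keeps its old centroid
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return centroids, assignments


def elbow_table(points, ks):
    """Map each k to (sum of squared errors, sum of Euclidean distances)."""
    table = {}
    for k in ks:
        centroids, assignments = kmeans(points, k)
        dists = [math.dist(p, centroids[a]) for p, a in zip(points, assignments)]
        table[k] = (sum(d * d for d in dists), sum(dists))
    return table


results = elbow_table(make_synthetic(), range(1, 7))
for k, (sse, total_dist) in results.items():
    print(f"k={k}  SSE={sse:10.2f}  sum of distances={total_dist:8.2f}")
```

Plotting either column against k gives the curve used by the elbow point method: the value drops sharply while each extra cluster separates genuinely different groups, then flattens once extra clusters only split existing groups.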