K-means for Mars 2020 sample tubes.

Using k-means clustering to determine the number of sample tubes required for the Mars 2020 rover

Important information

If you are unable to complete your open assessment by the submission date indicated above because of Exceptional Circumstances you can apply for an extension. If unforeseeable and exceptional circumstances do occur, you must seek support and provide evidence as soon as possible at the time of the occurrence.

Suppose the Mars 2020 rover will be visiting a particular crater on the surface of Mars to collect some samples. The crater was previously visited by the Opportunity rover, and Opportunity analysed some rock samples there, so we already have some data about the types of rock that are present in that crater. Opportunity analysed 185 rock samples and made some simple measurements of each one.

The file rocks-assessment3.arff (included with this assessment brief) contains four measurements of each of the 185 samples analysed by the Opportunity rover. The four measurements are: reflective-red, reflective-blue, hard, and porous. Each one is normalised to be on a 0-10 scale.

Your assignment is to work out how many sample tubes will be needed by the Mars 2020 rover using the data collected by Opportunity. To maximise the scientific value of the samples, we want the rover to collect exactly one sample of each distinct type of rock. For example, if there were three types of rock, we would collect one sample of each and use three tubes.

You should use k-means clustering for this task. The theory behind k-means clustering is described in Unit 6.5 of the online module (including 6.5.1).

Instructions for downloading and installing WEKA are contained in Unit 6.6 of the online module. The same page also has instructions for using WEKA that you may find useful for this assessment.

Load the rocks data set into WEKA in order to perform k-means clustering on it. As part of this task, you will need to discover how many clusters to use (i.e. a suitable value of k). To discover a suitable value of k, you will need to run the clustering algorithm with several values of k, and plot a measure of variance, or distance, within the clusters. Then you will need to analyse the plot and determine a suitable value of k for this data set. WEKA reports the within cluster sum of squared errors as a measure of variance within the clusters. In Unit 6.5 of the module, I used the sum of the Euclidean distances instead. Both are sensible measures of the quality of a clustering, and either one can be used with the elbow point method to determine a suitable value of k.
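
For readers who want to see the elbow point procedure outside WEKA, the following is a minimal Python sketch. It is not WEKA's k-means implementation: the `kmeans` helper is a plain Lloyd's-algorithm sketch written for illustration, `synthetic_points` generates made-up 4-D points on a 0-10 scale as a stand-in for the rocks file, and the range of k and the number of restarts are arbitrary assumptions. The values it prints will not match WEKA's output for the rocks data.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def assign(points, centroids):
    """Group points by their nearest centroid."""
    clusters = [[] for _ in centroids]
    for p in points:
        nearest = min(range(len(centroids)), key=lambda i: sq_dist(p, centroids[i]))
        clusters[nearest].append(p)
    return clusters


def kmeans(points, k, seed=0, max_iter=100):
    """Illustrative Lloyd's algorithm; returns (centroids, clusters, wcss)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # k distinct data points as starting centroids
    for _ in range(max_iter):
        clusters = assign(points, centroids)
        # Move each centroid to the mean of its cluster (keep it if the cluster is empty).
        new_centroids = [
            tuple(sum(col) / len(c) for col in zip(*c)) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
        if new_centroids == centroids:
            break
        centroids = new_centroids
    clusters = assign(points, centroids)
    wcss = sum(sq_dist(p, centroids[i]) for i, c in enumerate(clusters) for p in c)
    return centroids, clusters, wcss


def synthetic_points(seed=1):
    """Made-up data: three groups of 60 points in 4-D (NOT the rocks data)."""
    rng = random.Random(seed)
    centres = [(2, 2, 2, 2), (5, 8, 5, 8), (8, 3, 8, 3)]
    return [tuple(rng.gauss(m, 0.5) for m in c) for c in centres for _ in range(60)]


points = synthetic_points()
# For each k, keep the lowest within-cluster sum of squared errors over a few seeds.
wcss_by_k = {
    k: min(kmeans(points, k, seed=s)[2] for s in range(10)) for k in range(1, 9)
}
for k, wcss in wcss_by_k.items():
    print(k, round(wcss, 2))
```

Plotting the recorded values against k and looking for the point where the curve stops falling steeply is the same analysis you are asked to carry out with the figures reported by WEKA.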

Perform k-means clustering for a set of suitable values of k. For each value of k, record (at least) the within cluster sum of squared errors and include it in your report.
Plot the within cluster sum of squared errors against k as a line plot. Include the plot in your report. You may use any suitable tool to make the plot. Spreadsheets such as Microsoft Excel, LibreOffice, and Google Sheets are able to make line plots.
Determine (using your plot) a suitable value of k for this dataset. Describe in your report how you determined this value, and why it is a suitable value.
Using your chosen value of k, read (from the output of WEKA) the number of examples allocated to each cluster, and include this in your report. Are the examples approximately equally split among the clusters?
Once again using your chosen value of k, test the k-means implementation in WEKA to see if it consistently produces the same set of cluster centroids. To do this, you will need to use several different values of seed (the seed of the random number generator). The value of seed can be changed using the same configuration window that is used to set the value of k. Report your results, and comment on whether the implementation of k-means produces consistent results on this data set.
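
When comparing runs with different seeds, note that the same clusters may be listed in a different order from one run to the next. The sketch below shows one possible way to compare two sets of centroids regardless of order, and to express cluster sizes as percentages. The function names, the tolerance, and all the numbers are hypothetical placeholders chosen for illustration; none of them come from the rocks data or from WEKA.

```python
import math


def centroids_match(a, b, tol=0.05):
    """True if each centroid in `a` has a distinct partner in `b` within `tol`.

    Uses greedy nearest-neighbour pairing, which is adequate when the
    centroids are well separated compared with `tol`.
    """
    if len(a) != len(b):
        return False
    unmatched = list(b)
    for c in a:
        nearest = min(unmatched, key=lambda d: math.dist(c, d))
        if math.dist(c, nearest) > tol:
            return False
        unmatched.remove(nearest)
    return True


def cluster_shares(sizes):
    """Percentage of all examples that falls in each cluster."""
    total = sum(sizes)
    return [round(100 * n / total, 1) for n in sizes]


# Placeholder centroids for illustration only; NOT results from the rocks data.
run_a = [(2.0, 2.1, 1.9, 2.0), (5.0, 8.0, 5.1, 7.9)]
run_b = [(5.0, 8.0, 5.1, 7.9), (2.0, 2.1, 1.9, 2.0)]  # same centroids, other order
run_c = [(3.5, 5.0, 3.4, 5.1), (6.0, 6.0, 6.0, 6.0)]  # a different solution

print(centroids_match(run_a, run_b))
print(centroids_match(run_a, run_c))
print(cluster_shares([50, 30, 20]))
```

In practice you would copy the centroid values and cluster sizes from WEKA's output for each seed into lists like these before comparing them.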