The data set is a slightly modified version of a real-world plant data set. The data concerns the classification of plant species from different measurements taken from photographs of these plants. Each record consists of several attribute columns (input) and one class column (output) corresponding to the information about the type of plant.
The attributes are markers that have been determined by assessment of different features of each plant, and the class variable is a provisional labelling of the type of plant. The entire data set consists of over 700 instances (plants studied). Some of the variables contain missing values, which are indicated by empty entries.
Part 1 carries 35 marks, while Parts 2 and 3 carry 20 marks each, so the report is worth 75 marks in total. Marks will only be awarded for the first twenty pages of the main body of your report.
The remaining 25 marks will be awarded for the quality of the code for Parts 1 and 2.
The main assessment criteria for the report are:
• Correctness: that is, do you apply techniques correctly; do you make correct assumptions; do you interpret the results in an appropriate manner; etc.?
• Completeness: that is, do you apply a technique only to small subsets of the data; do you apply only one technique, when there are multiple alternatives; do you consider all options; etc.?
• Originality: that is, do you combine techniques in new and interesting ways; do you make any new and/or interesting findings with the data?
• Argumentation: that is, do you explain and justify all of your choices? The main assessment criteria for the code are:
• Correctness: is the code working as it is supposed to? does it solve the questions in the coursework? do you use the correct functions?
• Completeness: is your code doing everything it is supposed to? are you applying it to the correct datasets?
• Organisation: is your code well organised? is it easy to follow? is it consistent (i.e. consistent names for variables, functions, etc.)?
• Style: what is the quality of your code? are you using informative names for variables and functions? are you taking advantage of R’s functionality (e.g. using apply() or aggregate() instead of nested loops, etc.)?
As you should know, plagiarism and collusion are completely unacceptable and will be dealt with according to the University’s standard policies. Having said this, we do encourage students to have general discussions regarding the coursework with each other in order to promote the generation of new ideas and to enhance the learning experience. Please be very careful not to cross the boundary into plagiarism. The important part is that when you sit down to actually do the data analysis/mining and write about it, you do it individually. If you do this, and you truly understand what you have written, you will not be guilty of plagiarism. Do NOT, under any circumstances, share code or share figures, graphs or charts, etc. As examples, saying to someone,
“I used a Pivot Table in Excel to do the cross tabulations” is completely fine; whereas copying and pasting the actual Pivot Table itself would be plagiarism.
1. Explore the data [5]
i. Provide a table for all the attributes of the dataset, including measures of centrality and dispersion and the number of missing values for each attribute.
ii. Produce histograms for each attribute and characterise all the distributions. Provide details on how you created the histograms and comment on the distribution of data. You may also use descriptive statistics to help you characterise the shape of the distribution.
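A summary table of this kind can be built with the apply family rather than loops. The sketch below is only an illustration, not a model answer: it uses the built-in iris data (with one missing value injected) as a stand-in for the coursework dataset, which you would instead load with read.csv(), where empty entries are imported as NA.

```r
# iris stands in for the coursework data; df and its columns are placeholders
df <- iris[, 1:4]              # numeric attributes only
df[2, 1] <- NA                 # inject a missing value for illustration

# one row per attribute: centrality, dispersion, and missing-value count
summary_table <- data.frame(
  mean    = sapply(df, mean,   na.rm = TRUE),
  median  = sapply(df, median, na.rm = TRUE),
  sd      = sapply(df, sd,     na.rm = TRUE),
  IQR     = sapply(df, IQR,    na.rm = TRUE),
  missing = sapply(df, function(x) sum(is.na(x)))
)

# one histogram per attribute
for (v in names(df)) hist(df[[v]], main = v, xlab = v)
```

Note the na.rm = TRUE arguments: without them, any attribute containing missing values would report NA for its statistics.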
2. Explore the relationships between the attributes, and between the class and the attributes [6]
i. Calculate the correlations and produce scatterplots for the variables orientation 4 and orientation 7. What does this correlation tell you about the relationship between these variables?
ii. Produce scatterplots between the class variable and orientation 4, orientation 6 and area variables.
What do these tell you about the relationships between these three variables and the class?
iii. Produce boxplots for all of the appropriate attributes in the dataset. Group each variable according to the class attribute.
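The three sub-tasks above use base-R plotting. A minimal sketch follows, again with iris standing in for the coursework data (Species plays the role of the class variable; substitute orientation 4, orientation 7, etc. from the actual dataset):

```r
x <- iris$Petal.Length
y <- iris$Petal.Width

# use = "complete.obs" discards pairs with missing values
r <- cor(x, y, use = "complete.obs")
plot(x, y, xlab = "Petal.Length", ylab = "Petal.Width",
     main = sprintf("r = %.2f", r))

# plotting a numeric attribute against a factor class
plot(iris$Species, iris$Petal.Length)

# boxplots of every numeric attribute, grouped by the class attribute
for (v in names(iris)[1:4]) {
  boxplot(iris[[v]] ~ iris$Species, main = v, ylab = v)
}
```

With a real dataset containing NAs, the use argument to cor() matters: the default returns NA as soon as either variable has a missing value.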
3. General Conclusions
Take into consideration all the descriptive statistics, visualisations and correlations you produced, together with the missing values, and comment on the importance of the attributes. Which of the attributes seem to hold significant information, and which can you regard as insignificant? Provide an explanation for your choices.
4. Dealing with missing values in R [5]
i. Replace missing values in the dataset using three strategies: replacement with 0, with the mean, and with the median.
ii. Define, compare and contrast these approaches and their effects on the data.
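One way to sketch the three replacement strategies is a small helper applied per attribute; the vector below is a toy example, and the commented lapply() line shows how the same idea extends column-wise to a full data frame:

```r
# replace every NA in x with a single given value
replace_na <- function(x, value) { x[is.na(x)] <- value; x }

x <- c(4, NA, 6, 8, NA)
x_zero   <- replace_na(x, 0)
x_mean   <- replace_na(x, mean(x, na.rm = TRUE))
x_median <- replace_na(x, median(x, na.rm = TRUE))

# applied to a whole data frame, e.g. mean imputation:
# df[] <- lapply(df, function(col)
#   if (is.numeric(col)) replace_na(col, mean(col, na.rm = TRUE)) else col)
```

Note that the statistics are computed with na.rm = TRUE first, i.e. from the observed values only, before being substituted into the gaps.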
5. Attribute transformation [6]
Using the three datasets generated in 1.4, explore the use of three transformation techniques (mean centering, normalisation and standardisation) to scale the attributes. Define, compare and contrast these approaches and their effects on the data.
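The three scalings can be sketched on a single numeric vector; scale() is the idiomatic tool for centering and standardising, while min-max normalisation is written by hand:

```r
x <- c(2, 4, 6, 8, 10)

centered     <- scale(x, center = TRUE, scale = FALSE)  # mean 0
standardised <- scale(x, center = TRUE, scale = TRUE)   # mean 0, sd 1
normalised   <- (x - min(x)) / (max(x) - min(x))        # range [0, 1]
```

Applied to a data frame, scale() operates on every column at once; the min-max formula can be wrapped in a function and applied with sapply(). Be aware that normalisation computed from data containing NAs needs min() and max() called with na.rm = TRUE.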
6. Attribute / instance selection [8]
i. Starting again from the raw data, consider attribute and instance deletion strategies to deal with missing values. Choose a number of missing values per instance or per attribute and delete instances or attributes accordingly. Explain your choices and their effects on the dataset.
ii. Starting again from the raw data, use correlations between attributes to reduce the number of attributes. Try to reduce the dataset so that it contains only uncorrelated attributes and no missing values. Explain your choices and their effects on the dataset.
iii. Starting from an appropriate version of the dataset, use Principal Component Analysis to create a data set with eight attributes. Explain the process and the result obtained.
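The three sub-tasks above can be sketched as follows. The built-in airquality data (which contains genuine NAs) stands in for the coursework dataset; the thresholds (5 missing values, |r| < 0.9) and the number of components kept are arbitrary illustrations, not suggested coursework values.

```r
df <- airquality

# (i) deletion: drop attributes with many NAs, then incomplete instances
col_na <- colSums(is.na(df))
df1 <- df[, col_na <= 5]            # drop attributes with > 5 missing values
df1 <- df1[complete.cases(df1), ]   # drop any remaining incomplete instances

# (ii) correlation filter: drop attributes strongly correlated with an
# earlier attribute, leaving only weakly correlated ones
cmat <- abs(cor(df1))
cmat[upper.tri(cmat, diag = TRUE)] <- 0   # consider each pair once
df2 <- df1[, apply(cmat, 1, max) < 0.9]

# (iii) PCA on the complete, standardised data; two components are kept
# here, whereas the coursework asks for eight on its own dataset
pca <- prcomp(df1, center = TRUE, scale. = TRUE)
summary(pca)                 # proportion of variance per component
reduced <- pca$x[, 1:2]      # the scores form the new 2-attribute dataset
```

Note that prcomp() cannot handle missing values, which is why an already-complete version of the data must be used, and that scale. = TRUE standardises the attributes so that no single large-valued attribute dominates the components.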
This part of the coursework carries 20 marks. You must use Weka to perform the classification, but you may use R to present results. Use Weka classification techniques to create models that predict the given class from the input attributes.
1. Choose an appropriate dataset to obtain predictions using the following classifiers: ZeroR, OneR, NaïveBayes, IBk (k-NN) and J48 (C4.5). Which evaluation protocol did you use? Which dataset did you use? Which algorithm produces the best results? Use a combination of metrics to justify your reasoning [10]
2. Choose one of the above classification algorithms and use 5-fold cross-validation. Optimise the classifier of your choice with at least two parameters. Describe each parameter and show the results of your experimentation [5]
3. Use J48 and the datasets below. Explain the performance obtained on each dataset using a combination of metrics.
i. A reduced data set using 10 Principal Components.
ii. The dataset after deletion of instances and attributes.
iii. The three datasets after you replaced missing values with the three techniques.
iv. Which of the datasets had a positive impact on the predictive ability of the algorithm? Provide explanations using the results for each of the alternative datasets.