Data Mining Coursework: Tasks, Deliverables, and Learning Outcomes

Learning Outcomes

This coursework assesses the following learning outcomes:

1. Discuss, compare and contrast the advantages and disadvantages of applying a specific data mining technique to a given learning task.

2. Use a toolkit to develop a data mining application tailored to a given learning task and evaluate the results obtained.

3. Effectively interpret the results of learning through an understanding of the strengths and limitations of data mining technology and the selection of an appropriate evaluation technique.

4. Demonstrate knowledge of the state-of-the-art in data mining and an awareness of current areas of research.

5. Apply and, where necessary, adapt an appropriate data mining technique to a given problem.

There are 3 deliverables:

1. An Rmd file with the code, suitably labelled and commented.

2. An mp4 file containing the presentation for task 8. Time limit: 4 minutes.

3. A Word or PDF file containing:

• Code.
• Results and plots.
• Descriptions/justification of choices and discussions.
• Critical discussion of results.
• Explanation of any data preparation task achieved by means other than R.

Your answers to all the tasks (excluding the presentation task) should be included in this file.

PAGE LIMIT: 30 pages.

WORD LIMIT: individual word limits are included in the description of each task. Word limits exclude code and results.

You are required to conduct the following tasks using R where coding is required. If, for a specific task, you are unable to undertake data preparation using R but you are able to prepare your data by other means, explain how you have achieved this and complete the rest of the task in R.

Note that data preparation undertaken without using R will not attract credit, but credit can be gained for completing the rest of the task in R.

The word limits apply to the comments/discussion sections only. They do not include code, results or text in plots.

The time limit applies to the presentation task only.

Where the seed needs setting, set it to 123.
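
As a minimal illustration of why the seed matters, the sketch below shows that resetting the seed to 123 before a random operation reproduces the same draw. The variable names are illustrative only.

```r
# Setting the seed immediately before a random operation makes it repeatable.
set.seed(123)
first_draw <- sample(1:100, 5)

set.seed(123)
second_draw <- sample(1:100, 5)

# TRUE: both draws start from the same generator state
identical(first_draw, second_draw)
```

Any step that samples, partitions or initialises at random (subsetting, cross-validation folds, cluster initialisation) benefits from the same treatment.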

1. Use univariate statistics to explore the ride4U data. Discuss the results obtained, highlighting any result which you consider particularly useful. Use at most 3 visualisations. Also use bivariate statistics to determine which attributes may affect the level of complaints. [Word limit for discussion/comments: 200].
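
The kind of exploration described in task 1 might be sketched as follows in base R. The data frame here is a hypothetical stand-in: only the attribute names outlook, temperature and complaints appear in this brief, and their types and levels below are assumptions, not the real ride4U data.

```r
set.seed(123)

# Hypothetical stand-in for ride4U (types and levels are assumed).
n <- 200
ride <- data.frame(
  outlook     = factor(sample(c("sunny", "overcast", "rainy"), n, replace = TRUE)),
  temperature = round(runif(n, min = 0, max = 30), 1),
  complaints  = factor(sample(c("low", "high"), n, replace = TRUE))
)

# Univariate: numeric summary and class balance
temp_summary  <- summary(ride$temperature)
class_balance <- prop.table(table(ride$complaints))

# Bivariate, categorical attribute vs class: contingency table and chi-squared test
tab <- table(ride$outlook, ride$complaints)
chi <- chisq.test(tab)

# Bivariate, numeric attribute vs class: mean temperature per complaint level
group_means <- aggregate(temperature ~ complaints, data = ride, FUN = mean)

chi$p.value
group_means
```

A boxplot of temperature by complaints would be one natural candidate for one of the three permitted visualisations.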

2. In later tasks you will undertake classification and clustering tasks. For the classification tasks, the class is complaints. Prepare the ride4U dataset for classification. Note: you may have to further pre-process the data later on, depending on the task and/or algorithm. You may also need to pre-process the ride4UT dataset later on.

3. Use ride4U to obtain three further datasets as follows:

a. ride4U40: contains 40% of the data in ride4U.

b. ride4U20: contains 50% of the data in ride4U40 (i.e. 20% of the data in ride4U).

c. dodgyRide4U: the data in ride4U but with noise introduced in 15% of the instances, in the attributes outlook and temperature.
Note that instances affected by noise contain incorrect values for both outlook and temperature.

Discuss the advantages/disadvantages of using the reduced dataset (with 40% of the data) to obtain an even smaller one (with 20% of the data) over using the main dataset, and any criteria used to obtain the datasets. [Word limit: 100 words].

Discuss what insight can be obtained from using the dataset with noise. [Word limit: 100 words].
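
One possible way to derive the three datasets is sketched below using simple random sampling in base R. The data frame is a hypothetical stand-in (attribute types and levels are assumed), and the noise scheme shown (cyclically shifting the outlook level and offsetting the temperature) is only one of many ways to guarantee incorrect values; stratifying the samples by class would be an equally reasonable criterion.

```r
set.seed(123)

# Hypothetical stand-in for ride4U (types and levels are assumed).
n <- 200
ride4U <- data.frame(
  outlook     = factor(sample(c("sunny", "overcast", "rainy"), n, replace = TRUE)),
  temperature = round(runif(n, min = 0, max = 30), 1),
  complaints  = factor(sample(c("low", "high"), n, replace = TRUE))
)

# 40% of ride4U, then 50% of that subset (20% of the original).
ride4U40 <- ride4U[sample(nrow(ride4U), round(0.40 * nrow(ride4U))), ]
ride4U20 <- ride4U40[sample(nrow(ride4U40), round(0.50 * nrow(ride4U40))), ]

# Noise in 15% of the instances; the same rows are corrupted in both attributes.
dodgyRide4U <- ride4U
noisy_rows  <- sample(n, round(0.15 * n))
lv          <- levels(ride4U$outlook)

# Move each affected outlook value to the next level (wrapping around),
# so the new value is guaranteed to differ from the original.
dodgyRide4U$outlook[noisy_rows] <-
  lv[as.integer(ride4U$outlook[noisy_rows]) %% length(lv) + 1]

# Shift each affected temperature by a non-zero offset.
dodgyRide4U$temperature[noisy_rows] <-
  ride4U$temperature[noisy_rows] + sample(c(-10, 10), length(noisy_rows), replace = TRUE)

c(nrow(ride4U40), nrow(ride4U20), length(noisy_rows))
```

Keeping the indices of the noisy rows makes it possible to verify afterwards that exactly 15% of the instances were altered.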

4. Design and run an experiment which tests whether a reduction in dataset size leads to a reduction in performance for a tree classifier using datasets ride4U, ride4U40 and ride4U20. The model-training control must include cross-validation. The tree classifier you choose must have been covered in the lab sessions for CMM510 and must not be an ensemble classifier. Evaluate the results, ensuring that you comment on the measure(s) which you are using to assess the classifier’s performance and the quality of the experiment. If there is any difference in results, comment on whether the difference is statistically significant. State which model is best, justifying your choice. [Word limit: 200].
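
The shape of such an experiment might look like the sketch below, which runs 10-fold cross-validation by hand for a single (non-ensemble) tree on three dataset sizes. It rests on several assumptions: rpart stands in for whichever tree classifier was covered in the labs, the data are synthetic, accuracy is the only measure shown, and the Welch t-test on per-fold accuracies is only an approximate significance check because folds are not fully independent.

```r
library(rpart)  # recommended package shipped with standard R distributions

set.seed(123)

# Synthetic stand-in data: the class depends on temperature, with label noise.
n <- 500
toy <- data.frame(
  outlook     = factor(sample(c("sunny", "overcast", "rainy"), n, replace = TRUE)),
  temperature = runif(n, 0, 30)
)
toy$complaints <- factor(ifelse(toy$temperature + rnorm(n, sd = 4) > 15, "high", "low"))

toy40 <- toy[sample(n, round(0.4 * n)), ]
toy20 <- toy40[sample(nrow(toy40), round(0.5 * nrow(toy40))), ]

# Per-fold accuracy of an rpart classification tree under k-fold cross-validation.
cv_accuracy <- function(data, k = 10) {
  folds <- sample(rep(seq_len(k), length.out = nrow(data)))
  sapply(seq_len(k), function(f) {
    fit  <- rpart(complaints ~ ., data = data[folds != f, ], method = "class")
    pred <- predict(fit, newdata = data[folds == f, ], type = "class")
    mean(pred == data$complaints[folds == f])
  })
}

acc <- list(full = cv_accuracy(toy), p40 = cv_accuracy(toy40), p20 = cv_accuracy(toy20))

# Mean accuracy per dataset size
sapply(acc, mean)

# Approximate significance check between the largest and smallest datasets
tt <- t.test(acc$full, acc$p20)
tt$p.value
```

Accuracy alone can mislead when the classes are imbalanced, so measures such as kappa, sensitivity or specificity may also be worth reporting.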

5. Design and run an experiment which tests whether the introduction of noise in the dataset leads to a reduction in performance for the tree classifier you chose in task 4 and an instance-based classifier using datasets ride4U and dodgyRide4U. Use an instance-based classifier which has been covered in the lab sessions for CMM510, using at most 13 neighbours. Evaluate the results, ensuring that you comment on the quality of the evaluation and the measure(s) which you are using to assess the classifiers’ performance. If there is any difference in results, comment on whether the difference is statistically significant. State which model is best, justifying your choice. [Word limit: 200].

6. Validate the models obtained in task 5 by testing them with the ride4UT dataset after pre-processing this dataset. Discuss the results obtained, ensuring you comment on measures used and state whether any performance difference is statistically significant. [Word limit: 150].

7. Cluster the ride4U dataset using ONE clustering algorithm covered in the CMM510 lab sessions, undertaking any pre-processing required. Test for the optimal number of clusters between 2 and 12. Discuss the ideal number of clusters. Justify your choice of clustering algorithm and discuss whether the resulting clusters correlate with any attribute. [Word limit: 150].
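
Testing candidate cluster counts between 2 and 12 could be sketched as below. This assumes k-means is among the algorithms covered in the labs and uses the average silhouette width (from the cluster package) as the selection criterion; both are illustrative choices, and the data are synthetic numeric points rather than ride4U.

```r
library(cluster)  # recommended package; provides silhouette()

set.seed(123)

# Synthetic numeric data with three well-separated groups.
X <- rbind(
  matrix(rnorm(100, mean = 0, sd = 0.5), ncol = 2),
  matrix(rnorm(100, mean = 4, sd = 0.5), ncol = 2),
  cbind(rnorm(50, mean = 0, sd = 0.5), rnorm(50, mean = 4, sd = 0.5))
)

# k-means is distance based, so scale the attributes first.
X <- scale(X)
d <- dist(X)

# Average silhouette width for each candidate number of clusters.
ks <- 2:12
avg_sil <- sapply(ks, function(k) {
  km <- kmeans(X, centers = k, nstart = 25)
  mean(silhouette(km$cluster, d)[, "sil_width"])
})
names(avg_sil) <- ks

best_k <- ks[which.max(avg_sil)]
best_k
```

Cross-tabulating the final cluster assignment against an attribute, for example with table(), is one simple way to examine whether the clusters correlate with it.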

8. This task requires no code. Produce a 4-minute presentation in which you propose at least one new data mining task to be applied to any of the datasets that you have been given or have produced in the above tasks. The task(s) should involve only techniques which have been covered in the CMM510 labs. Ensure you justify your choice of task(s) and, for each task, include sufficient detail about its aim, how the task can be conducted and how the outcome can be evaluated. The task(s) must have been covered in the CMM510 labs.
