CMI3507 Data Mining
Requesting a Late Submission
You are reminded to back up your work, as late submission requests will not be granted for lost work, including work lost due to hardware or software failure.
Late submission requests will only be approved if you can demonstrate genuine, unexpected circumstances, supported by independent evidence (e.g. a medical certificate), that prevent you from submitting an assessment on time.
Submit your request for Late Submission via the University website within 2 working days of the due date.
Late submission requests of up to a maximum of 10 working days (typically 1-5 working days) will be considered, provided that there is appropriate evidence which clearly indicates the reasons for the request.
You will have 5 working days after submitting a request to provide the evidence. Failure to submit evidence will result in the request being rejected and your work being marked as a late submission (see below).
If you are unable to submit work within the maximum late submission period of 10 days, contact the School’s Guidance Team, as you may need to submit a claim for Extenuating Circumstances (ECs).
Extenuating Circumstances (ECs)
An EC claim is appropriate in exceptional circumstances, when an extension is not sufficient due to the nature of the request, or when it concerns an examination or In-Class Test (ICT).
You can submit an EC claim on the Registry website, where you can also find out more about the process.
You will need to submit independent, verifiable evidence for your claim to be considered.
Once your EC claim has been reviewed you will get an EC outcome email from Registry. If you are unsure what it means or what you need to do next, please speak to the
An approved EC will extend the submission date to the next assessment period (e.g. the July resit period).
Late Submission (No ECs approved)
Late submission of up to 5 working days after the assessment submission deadline will result in your grade being capped at the pass mark.
Data mining is a collection of tools, methods and statistical techniques for exploring and extracting meaningful information from large data sets. It is a rapidly growing field due to the increasing quantity of data gathered by organisations. This module looks at different data mining techniques and gives students the chance to use appropriate data-mining tools in order to evaluate the quality of the discovered knowledge.
This assessment consists of a contribution to an evaluative report (worth 90% of your total marks).
Learners will be able to justify and critically discuss the key concepts of data mining (including legal implications such as GDPR) and the breadth of areas of application.
The learner will be able to make appropriate modifications to large datasets to prepare the data for analysis and exploration; select appropriate data mining techniques to enable exploration of large data sets; and interpret and evaluate the results of the analysis to draw conclusions and make informed decisions.
Learning outcomes covered in this coursework are as follows:
1. Knowledge of the underlying principles of data analytic modelling, the study of relationships in data, and visualisation, and potentially of related domains such as statistics and machine learning.
2. Ability to select and apply appropriate data analytics techniques for problem solving.
3. Ability to explain and describe clearly all aspects of the reasoning.
Individual piece of work. Demonstrate comprehensive knowledge and critical understanding of the use of data analysis to create a solution to a given problem, including the use of data mining to interpret the data and the conclusions that can be drawn from it.
Evaluative report. You must choose one of the following tasks:
1) Students performance dataset (available on Brightspace and UCI repository).
Q1. Study the dataset: find its size and the number of variables, and describe their types. Check whether any data is missing (if so, apply an appropriate cleaning technique). Perform a descriptive statistical analysis of the dataset: choose a range of variables of interest and find their frequencies and dependencies through bar plots, grouped bar plots, pie charts, etc. Draw conclusions.
Advanced: Perform a factor analysis. Comment on your findings.
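The basic checks asked for in Q1 can be sketched in base R as follows. This is only a minimal illustration on a made-up toy data frame: apart from G3, the column names and all values are assumptions, and median imputation is just one possible cleaning technique, not a prescribed one.

```r
# Toy data frame standing in for the real dataset (values are made up).
students <- data.frame(
  school = c("GP", "GP", "MS", "MS", "GP", "MS"),
  sex    = c("F", "M", "F", "M", "F", "F"),
  age    = c(17, 16, NA, 18, 17, 16),
  G3     = c(10, 14, 8, 15, 12, 9),
  stringsAsFactors = TRUE
)

# Size of the dataset and type of each variable.
dim(students)
str(students)

# Number of missing values in each column.
missing_per_column <- colSums(is.na(students))
print(missing_per_column)

# One simple cleaning option: replace missing ages by the median age.
students$age[is.na(students$age)] <- median(students$age, na.rm = TRUE)

# Frequencies of one variable and a two-way table of two variables.
school_counts <- table(students$school)
school_by_sex <- table(students$sex, students$school)

# Bar plot, grouped bar plot and pie chart of the frequencies.
barplot(school_counts, main = "Students per school", ylab = "Count")
barplot(school_by_sex, beside = TRUE, legend.text = TRUE,
        main = "Students per school by sex", ylab = "Count")
pie(school_counts, main = "Share of students per school")

# Descriptive statistics for every column.
summary(students)
```

The same pattern applies to the other two datasets once the file has been read into a data frame.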
Q2. Split the dataset into training and testing parts. Build a Random Forest Regression model (using the randomForest R library) to predict the final year grade (G3). Evaluate your model using the test dataset.
Plot an importance graph. Estimate accuracy. Comment on your results.
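A possible shape for the regression part of Q2 is sketched below. It runs on synthetic data, so the predictor names, the 70/30 split ratio and the number of trees are all assumptions rather than requirements; only G3 and the randomForest library come from the task. For a regression model, accuracy is commonly reported as an error measure such as the root mean squared error (RMSE), which is what the sketch uses.

```r
set.seed(42)  # make the random split reproducible

# Synthetic stand-in for the student data; only G3 is named in the task.
n <- 200
students <- data.frame(
  studytime = sample(1:4, n, replace = TRUE),
  absences  = sample(0:30, n, replace = TRUE),
  G1        = sample(0:20, n, replace = TRUE)
)
students$G3 <- pmin(20, pmax(0,
  round(0.8 * students$G1 + students$studytime + rnorm(n))))

# 70/30 split into training and testing parts (the ratio is an assumption).
train_idx <- sample(seq_len(n), size = round(0.7 * n))
train <- students[train_idx, ]
test  <- students[-train_idx, ]

# Root mean squared error: one common accuracy measure for regression.
rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))

# Fit and evaluate the forest only if the randomForest package is installed.
if (requireNamespace("randomForest", quietly = TRUE)) {
  rf <- randomForest::randomForest(G3 ~ ., data = train,
                                   ntree = 500, importance = TRUE)
  pred <- predict(rf, newdata = test)
  cat("Test RMSE:", rmse(test$G3, pred), "\n")
  randomForest::varImpPlot(rf, main = "Variable importance")
}
```

Fixing the seed before the split keeps the training and testing parts identical between runs, which makes the reported error reproducible.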
Advanced: Divide the students into 3 categories based on the final grade: poor achieving, average achieving and well achieving students. Build a classification Random Forest model. Evaluate your model using the test dataset. Print the confusion matrix. Draw conclusions.
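The grouping and the confusion matrix in the advanced part can be sketched in base R as below. The cut-offs (0-9, 10-14, 15-20) are hypothetical and must be chosen and justified by the student, and the predictions are made-up values standing in for the output of a fitted model. When the response is a factor, randomForest fits a classification forest rather than a regression forest.

```r
# Hypothetical cut-offs on the 0-20 grade scale; choose and justify your own.
grade_category <- function(g3) {
  cut(g3, breaks = c(-Inf, 9, 14, Inf),
      labels = c("poor", "average", "well"))
}

# Actual categories derived from some example final grades.
actual <- grade_category(c(5, 9, 10, 14, 15, 20, 12, 8))

# Made-up predictions standing in for the predictions of a fitted model.
predicted <- factor(
  c("poor", "poor", "average", "well", "well", "well", "average", "average"),
  levels = levels(actual)
)

# Confusion matrix: rows are predicted classes, columns are actual classes.
confusion <- table(Predicted = predicted, Actual = actual)
print(confusion)

# Overall accuracy: share of cases on the diagonal.
accuracy <- sum(diag(confusion)) / sum(confusion)
cat("Accuracy:", accuracy, "\n")
```

Giving both factors the same levels keeps the confusion matrix square even when a class is never predicted.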
Recommended for reading: Breiman, L. (2001). Random Forests. Machine Learning, 45(1), 5–32. Available from: doi: 10.1023/A:1010933404324.
2) Heart failure clinical records Data Set (available on Brightspace and UCI repository).
Q1. Study the dataset: find its size and the number of variables, and describe their types. Check whether any data is missing (if so, apply an appropriate cleaning technique). Perform a descriptive statistical analysis of the dataset: choose a range of variables of interest and find their frequencies and dependencies through bar plots, grouped bar plots, pie charts, etc. Draw conclusions.
Advanced: Perform a factor analysis. Comment on your findings.
Q2. Split the dataset into training and testing parts. Build a neural network using the neuralnet R library (start with two hidden layers of sizes 5 and 3) to predict the risk of death (died/alive binary outcome). Evaluate the model using the test dataset. Print the confusion matrix. Draw conclusions.
Advanced: Experiment with parameters of the neuralnet function. For example, use a different number/different size of hidden layers or different activation functions. Compare your results with the original model. Draw conclusions.
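One way the neural network task might be approached is sketched below on synthetic data. The column names, the min-max scaling step, the 70/30 split and the 0.5 decision threshold are assumptions made for illustration; only the neuralnet library and the starting hidden layer sizes of 5 and 3 come from the task.

```r
set.seed(1)

# Synthetic stand-in for the clinical records; column names are illustrative.
n <- 120
patients <- data.frame(
  age               = runif(n, 40, 95),
  ejection_fraction = runif(n, 15, 80)
)
patients$died <- as.integer(
  patients$age / 95 - patients$ejection_fraction / 80 + rnorm(n, sd = 0.2) > 0)

# Min-max scaling of the predictors to [0, 1] (an assumed preprocessing step).
min_max <- function(x) (x - min(x)) / (max(x) - min(x))
predictors <- c("age", "ejection_fraction")
scaled <- patients
scaled[predictors] <- lapply(patients[predictors], min_max)

# 70/30 split into training and testing parts (the ratio is an assumption).
train_idx <- sample(seq_len(n), size = round(0.7 * n))
train <- scaled[train_idx, ]
test  <- scaled[-train_idx, ]

# Fit and evaluate the network only if the neuralnet package is installed.
if (requireNamespace("neuralnet", quietly = TRUE)) {
  nn <- neuralnet::neuralnet(died ~ age + ejection_fraction, data = train,
                             hidden = c(5, 3), linear.output = FALSE)
  # Training may stop without converging; in that case no weights are stored.
  if (!is.null(nn$weights)) {
    prob <- predict(nn, newdata = test)[, 1]
    predicted <- as.integer(prob > 0.5)
    print(table(Predicted = predicted, Actual = test$died))
  }
}
```

Changing the `hidden` vector (for example to `c(8, 4)`) or the `act.fct` argument and refitting gives the alternative models to compare in the advanced part.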
Recommended for reading: Riedmiller, M. and Braun, H. (1993) A Direct Adaptive Method for Faster Backpropagation Learning: The RPROP Algorithm. Proceedings of the IEEE International Conference on Neural Networks, San Francisco, 28 March-1 April 1993, 586-591. Available from: doi: 10.1109/ICNN.1993.298623
3) On-line retail dataset (available on Brightspace and UCI repository).
Q1. Study the dataset: find its size and the number of variables, and describe their types. Check whether any data is missing (if so, apply an appropriate cleaning technique). Perform a simple statistical analysis, for example: what countries do customers come from? What is the range of recorded times? What is the average spending? Is there any difference in spending for customers from different countries? Are there any preferences in meals? Draw conclusions.
Advanced: Perform a pattern mining analysis: either frequent patterns or association rules. Comment on the patterns.
Q2. Perform clustering on customers and customer baskets. Note that you need to reorganise the dataset so that each row of your data frame contains the products purchased by a single customer. You can include or exclude the information about countries. Comment on the choice of distance and the results.
Advanced: Experiment with the number of clusters. Study the indices which evaluate the clustering (such as the silhouette index, the elbow method or the Dunn index). You may wish to look at the libraries NbClust or clValid.
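The reorganisation and clustering steps of Q2 can be sketched as follows on a toy transaction table. The column names CustomerID and Description, the Jaccard distance, average-linkage hierarchical clustering and k = 2 are all assumptions chosen for illustration; other distances and clustering methods are equally valid if justified.

```r
# Toy transactions standing in for the retail data (one row per item bought).
transactions <- data.frame(
  CustomerID  = c(1, 1, 2, 2, 3, 3, 4, 4, 4),
  Description = c("mug", "plate", "mug", "plate", "lamp", "clock",
                  "lamp", "clock", "mug")
)

# Reorganise: one row per customer, one 0/1 column per product.
counts  <- unclass(table(transactions$CustomerID, transactions$Description))
baskets <- (counts > 0) * 1
print(baskets)

# dist(method = "binary") gives the Jaccard distance between 0/1 baskets.
d <- dist(baskets, method = "binary")

# Hierarchical clustering cut into k groups; k = 2 is an arbitrary start.
hc <- hclust(d, method = "average")
clusters <- cutree(hc, k = 2)
print(clusters)

# Average silhouette width, if the 'cluster' package is installed.
if (requireNamespace("cluster", quietly = TRUE)) {
  sil <- cluster::silhouette(clusters, d)
  cat("Average silhouette width:", mean(sil[, "sil_width"]), "\n")
}
```

Repeating the last two steps for several values of k and comparing the average silhouette widths is one simple way to experiment with the number of clusters.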
Structure of the evaluation report:
1. Title with student’s name, name of the chosen dataset and the corresponding Data Mining method.
2. Introduction which contains a short description of the chosen method.
3. Answers to the stated questions and conclusions.
4. A literature review which should include a reference to the original method, its extensions and improvements (if applicable) and a few recent applications of the method. You must use APA 6th style for referencing.
5. Appendix which must include the R commands you have used in your analysis. All plots, figures and graphs must be numbered and have clear labels.
Portfolio. This is a written report on the requested analysis carried out on a provided data set (see the set of problems in section 2). In this part, learning outcomes 3 and 4 are assessed, which include:
1) Presentational skills (the portfolio is well structured; text and diagrams are neat, legible and free from errors).
2) Knowledge of subject (understanding of data mining methodologies; application of data mining methodologies to real-life datasets; use of appropriate statistical software; interpretation of outputs; literature review).
3) Programming skills (R code is neat and free from errors).