ML with Scikit-Learn: Wine Recognition.

Classical Machine Learning solutions using Scikit-Learn - Wine Recognition Dataset

Overview

This assignment counts for 30% of the overall mark for this module. The task is to implement Classical Machine Learning solutions in Python using the Scikit-Learn library and the other libraries introduced in class. Specifically, both clustering and classification methods should be applied to the Wine Recognition dataset.

You may work in groups of up to four students, subject to discussion and agreement with the tutor.

Deliverables and Submission

The coursework must be submitted by 23:59 on Friday 29th January 2021. Follow the submission guidelines in Canvas. For each submission, ensure that you include:

· A zip file containing all runnable programs for the first three parts (see Project Parts below), with code written in Python.

o For each part, a single Python script (.py) should be used to execute all relevant code. All three scripts should be placed in a single folder. Any results presented should be directly reproducible from the code without any modification.

o A short README text file should be included in the folder, giving:

§ The names and k-numbers of all students in the group.

§ The contents of the folder.

§ Explicit installation instructions for any library required beyond those used in class.

§ Any special instructions on how to run the code.

· A report (3000-5000 words, excluding references and appendices), in Word or PDF format, with the names and k-numbers of all students in the group on the front page.

Rules

· You are encouraged to look in the literature and identify methods that have already been applied to this particular problem. If you use them, you must CLEARLY reference the relevant sources (e.g. scientific article, book, webpage).

· Any third-party source code must be CLEARLY highlighted and referenced by appropriate annotation in the report and/or by comments in the code.

· Use of any third-party libraries that have not been used in class must be approved by the Lecturer beforehand.

· Copies of the code in the Appendix must be in text format, not screenshots.

· If the above rules are not followed, the submission may be treated as plagiarism and penalised according to the University regulations.

Project Parts

PART I – Application: Load and overview data related to your theme

The application should be able to load the data and identify its key aspects (number of dimensions/features, number and names of classes, number of samples per class, etc.).
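As a rough illustration of the overview Part I asks for, the sketch below loads the data and prints its key aspects. It assumes the copy of the Wine Recognition data bundled with Scikit-Learn (`load_wine`); if your data comes from another source, the loading step will differ. The variable names are illustrative only.

```python
from collections import Counter

from sklearn.datasets import load_wine

# Assumption: the Wine Recognition data shipped with scikit-learn is used.
wine = load_wine()
X, y = wine.data, wine.target

# Number of samples and number of dimensions/features.
n_samples, n_features = X.shape

# Number and names of classes.
class_names = [str(name) for name in wine.target_names]

# Number of samples per class, keyed by class name.
label_counts = Counter(y.tolist())
samples_per_class = {class_names[label]: label_counts[label] for label in sorted(label_counts)}

print(f"Samples: {n_samples}, features: {n_features}")
print("Feature names:", list(wine.feature_names))
print(f"Classes ({len(class_names)}):", class_names)
print("Samples per class:", samples_per_class)
```

Everything printed here is derived by the code itself, which is what the Data section of the report expects.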

PART II – Application: Clustering

a) You should use at least two clustering methods to partition the dataset.

b) Evaluate the clustering methods using appropriate metrics such as the Adjusted Rand index, Homogeneity, Completeness and V-Measure, using the ground truth.

c) Consider and implement any configuration of the parameters of your clustering methods that could further improve the results.
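One possible shape for steps a) to c) is sketched below. The choice of K-Means and Ward agglomerative clustering, the fixed `n_clusters=3`, and feature standardisation as the "configuration" being compared are all assumptions made for illustration, not requirements of the brief. The ground-truth labels are used only for evaluation, never for fitting.

```python
from sklearn.cluster import AgglomerativeClustering, KMeans
from sklearn.datasets import load_wine
from sklearn.metrics import (
    adjusted_rand_score,
    completeness_score,
    homogeneity_score,
    v_measure_score,
)
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)

# Illustrative configuration change: standardise features so that
# large-valued features do not dominate the distance computations.
X_scaled = StandardScaler().fit_transform(X)

# Two example clustering methods (an assumption; any two may be used).
models = {
    "KMeans": KMeans(n_clusters=3, n_init=10, random_state=0),
    "Agglomerative (Ward)": AgglomerativeClustering(n_clusters=3, linkage="ward"),
}


def evaluate(model, data, truth):
    """Cluster `data` and score the partition against the ground truth."""
    labels = model.fit_predict(data)
    return {
        "ARI": adjusted_rand_score(truth, labels),
        "Homogeneity": homogeneity_score(truth, labels),
        "Completeness": completeness_score(truth, labels),
        "V-measure": v_measure_score(truth, labels),
    }


results = {}
for name, model in models.items():
    for variant, data in (("raw", X), ("scaled", X_scaled)):
        results[(name, variant)] = evaluate(model, data, y)

for (name, variant), scores in results.items():
    formatted = ", ".join(f"{metric}={value:.3f}" for metric, value in scores.items())
    print(f"{name} [{variant}]: {formatted}")
```

Collecting the scores in one dictionary makes it straightforward to build the comparative table the report asks for; other parameters (number of clusters, linkage, initialisation) can be varied in the same loop.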

PART III – Application: Classification: Training and Testing

a) You should use at least two classification methods to distinguish between the classes. Both of the following training/testing protocols should be used:

· Split the data into training (70%) and testing (30%).

· K-fold cross-validation with K=10.

b) For both protocols, evaluate the classification approaches using appropriate metrics such as Balanced Accuracy, F1-Score and ROC AUC, and draw ROC curves and appropriate confusion matrices. Ideally, all ROC curves should be drawn on a single graph to allow easy comparison between methods.

c) Consider and implement any configuration of the parameters of your classification methods that could further improve the results.
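The two protocols and the metrics in b) could be wired together roughly as follows. This is a minimal sketch under several assumptions: logistic regression and an RBF SVM stand in for "at least two classification methods", both splits are stratified, predictions are taken as the arg-max of the predicted probabilities, the multi-class ROC AUC uses the one-vs-rest macro average, and the ROC curve is the micro-average over classes. For cross-validation the scores are computed on pooled out-of-fold predictions, which is not the same as averaging per-fold scores.

```python
from sklearn.datasets import load_wine
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (
    balanced_accuracy_score,
    confusion_matrix,
    f1_score,
    roc_auc_score,
    roc_curve,
)
from sklearn.model_selection import StratifiedKFold, cross_val_predict, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler, label_binarize
from sklearn.svm import SVC

X, y = load_wine(return_X_y=True)
classes = sorted(set(y.tolist()))  # [0, 1, 2]

# Two example classifiers (an assumption); scaling is part of each pipeline
# so it is re-fitted on the training portion of every split.
classifiers = {
    "Logistic regression": make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
    "SVM (RBF)": make_pipeline(StandardScaler(), SVC(probability=True, random_state=0)),
}


def summarise(y_true, proba):
    """Compute the metrics from class-probability estimates."""
    y_pred = proba.argmax(axis=1)  # column i corresponds to class label i here
    y_bin = label_binarize(y_true, classes=classes)
    fpr, tpr, _ = roc_curve(y_bin.ravel(), proba.ravel())  # micro-average ROC
    return {
        "balanced_accuracy": balanced_accuracy_score(y_true, y_pred),
        "f1_macro": f1_score(y_true, y_pred, average="macro"),
        "roc_auc_ovr": roc_auc_score(y_true, proba, multi_class="ovr"),
        "confusion_matrix": confusion_matrix(y_true, y_pred),
        "roc_micro": (fpr, tpr),
    }


# Protocol 1: 70% training / 30% testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)
# Protocol 2: K-fold cross-validation with K=10.
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)

results = {}
for name, clf in classifiers.items():
    clf.fit(X_train, y_train)
    results[(name, "70/30 split")] = summarise(y_test, clf.predict_proba(X_test))
    proba_cv = cross_val_predict(clf, X, y, cv=cv, method="predict_proba")
    results[(name, "10-fold CV")] = summarise(y, proba_cv)

for (name, protocol), m in results.items():
    print(
        f"{name} [{protocol}]: balanced accuracy={m['balanced_accuracy']:.3f}, "
        f"F1 (macro)={m['f1_macro']:.3f}, ROC AUC (OvR)={m['roc_auc_ovr']:.3f}"
    )
    print(m["confusion_matrix"])
```

The `(fpr, tpr)` pairs stored under `roc_micro` can be passed to a plotting library on shared axes to draw all ROC curves in a single graph.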

PART IV – Report

The Project Report should be structured as follows:

· Data: Description of the data, including the information derived in Part I, as produced by your code. There is no need to describe the general problem. All information and figures should be derived by the code, not taken from other sources.

· Clustering:

o Outline of the clustering methods used in Part II. There is no need to describe the theory behind the methods, only to explain any different configurations you may have used.

o Comparative analysis of all clustering methods used, including any improvements attempted. Ensure that any results/figures reported are produced by your code.

· Classification:

o Outline of the classification methods used in Part III. There is no need to describe the theory behind the methods, only to explain any different configurations you may have used.

o Comparative analysis of all classification methods used, considering both training/testing protocols and including any improvements attempted. Ensure that any results/figures reported are produced by your code.

· Discussion and Conclusion:

o Critical Discussion of any challenges posed by the specific dataset and by the pipelines for clustering and classification.

o Critical Discussion of clustering results.

o Critical Discussion of classification results.

o Conclusion

· References

· Appendix: Include copies of all the code produced. Copies of the code must be in text format, not screenshots. Ensure that there is sufficient annotation/comments to indicate where your code has been taken/adapted from.

Learning outcomes being assessed

· Select and specify suitable methods and algorithms relevant for a particular data analysis process;

· Build machine learning and artificial intelligence systems using software packages and/or specialised libraries;

· Articulate and demonstrate the specific problems associated with different phases or tasks of a machine learning or artificial intelligence pipeline;

· Assess and evaluate machine learning methods using datasets and appropriate criteria.
