ML algorithms for real-world regression - classification & data modeling.

Machine Learning Algorithms for Solving Real-World Problems in Regression, Classification, and Unstr

Module Learning Outcomes Assessed:

On completion of this module the student should be able to:
1. Apply supervised and unsupervised learning applications using Gaussian process emulators.
2. Apply Dirichlet processes for unsupervised learning applications.
3. Develop the knowledge and skills necessary to design, implement and apply graphical models to solve real-world applications.
4. Evaluate the applications of fuzzy systems and their usage in hybrid intelligent systems, in combination with evolutionary computing and other machine learning methods.
5. Apply evolutionary computing methods to develop solutions for real-world optimisation problems and appraise their advantages and limitations.

Task and Mark distribution:
This coursework consists of two tasks. You should attempt both and submit one Word or PDF file (or similar) for each task. Each task is worth 50 marks, and the marks breakdown is provided with each task. This coursework contributes 100% to your overall module mark.

This document is for Coventry University students for their own use in completing their assessed work for this module and should not be passed to third parties or posted on any website. Any infringements of this rule should be reported.

Task 1: The Machine learning algorithms for solving real-world problems in Regression,

During this module, you learned about different advanced machine learning techniques, associated concepts and applications. We explored the Gaussian process model, which is a computationally efficient method for regression, classification, optimization, etc. We have also covered Bayesian networks as promising tools for modelling data with a complex dependency structure. Finally, you have learned how to use latent Dirichlet allocation for unsupervised learning applications, particularly text mining.

In this assignment, you will have to select an application related to a regression, classification, unstructured data modelling, or text mining problem, and explore how best to apply the machine learning algorithms to solve it. The selected application for each of the methods mentioned above should have the following features:

1. Gaussian process regression and classification: The application selected for either of these two methods must consist of at least four input variables and a single output variable. You must also implement Gaussian process classification by first defining an appropriate threshold (or thresholds) on the output variable to create binary or multiple classes, and then applying Gaussian process classification to the categorized output.
2. Bayesian network: If you are choosing an application for this method, the application must consist of at least eight random variables. The random variables could be all discrete, all continuous, or a mix of both (hybrid).
3. There is no restriction on selecting the application to apply the Latent Dirichlet allocation model for topic modelling. There are some potential projects listed below, which could be studied to get some ideas. However, I strongly recommend that you come up with your own idea(s) by reviewing these projects and some other relevant and recent articles.
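
The thresholding step described in item 1 above can be sketched as follows. This is only a minimal illustration using the Python standard library: the helper names `binarise` and `multiclass`, the median split, and the sample values in `y` are all assumptions made for the example, not requirements of the coursework. You should justify your own choice of threshold for your application.

```python
from statistics import median

def binarise(outputs, threshold=None):
    """Turn a continuous output variable into 0/1 class labels.

    If no threshold is given, the median is used (an assumed default
    that gives roughly balanced classes).
    """
    if threshold is None:
        threshold = median(outputs)
    labels = [1 if y > threshold else 0 for y in outputs]
    return labels, threshold

def multiclass(outputs, cut_points):
    """Assign each value a class index equal to the number of cut
    points it exceeds, so k cut points give k + 1 ordered classes."""
    cuts = sorted(cut_points)
    return [sum(y > c for c in cuts) for y in outputs]

# Hypothetical continuous outputs, for illustration only.
y = [2.1, 3.7, 5.0, 1.2, 4.4, 3.0]

labels, thr = binarise(y)                # two classes, median split
classes = multiclass(y, [2.0, 4.0])      # three classes: low / medium / high
```

The resulting labels, rather than the original continuous output, would then be passed to the Gaussian process classifier.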

1. This dataset from the UCI repository is quite interesting. The task is to predict the depth in the body (effectively, the depth along the spine) given the properties of a two-dimensional "slice" of the body. The hard part about this problem is that it is actually the output causing the input rather than the other way around. I have not had luck designing a good regression method for this data. Can you do this?
2. Find a Bayesian interpretation of elastic net regularization, and compare this method for regression against "standard" Bayesian regression (with a Gaussian prior) on a dataset of your choosing.
3. Probabilistic PCA using Gaussian Process is a Bayesian interpretation of the classical PCA algorithm for dimensionality reduction. Implement Gaussian Process based PPCA in Python, R or Matlab, and compare its performance with other methods (such as "standard" PCA) on a dataset of your choosing.

4. Bayesian optimization is a very important topic with a wide range of applications. It was not fully studied during lectures, but it can be easily implemented using Gaussian processes. The Python code and some examples can be found here!
5. The squared exponential covariance is widely used for Gaussian process regression. It is probably used in 90+% of all GP publications. That said, it is widely believed to be "too smooth" for many real-world regression tasks. Compare the squared exponential covariance versus the Matérn covariance on several datasets via Bayesian model selection. How often is the squared exponential the right choice?

6. Latent Dirichlet allocation (LDA) is a Bayesian method for creating "topic models" of text documents. There are plenty of interesting text datasets available (e.g., DBpedia could be a good resource!). One idea would be to compare the behavior of LDA with other techniques, such as latent semantic analysis. You may be able to get relevant dataset and ideas by visiting the following sites:

This competition site contains some relevant data, and relevant ideas could be developed by analyzing this data. Also check the datasets in Kaggle competitions. This website has a fantastic compilation of 100 interesting, relevant datasets from all sorts of application areas.

The creators of libSVM have also compiled a great list of datasets, all in a standardized format. The libSVM codebase also includes libsvmread for reading these in MATLAB. The UCI Machine Learning Repository is a mainstay in machine-learning research. There is a wide range of datasets there from many different application areas and with many different properties (large, small, high-dimensional, low-dimensional, classification, regression, etc.).
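
For project idea 5 above, the comparison between the two covariance functions can be sketched as follows. This is a rough, standard-library-only illustration, not a recommended implementation. The Matérn kernel is taken with smoothness 5/2 (an assumption, since other smoothness values are equally valid), the hyperparameters are fixed rather than optimised or integrated out, and the data in `xs` and `ys` are made up. A proper Bayesian model selection study would tune or marginalise the hyperparameters for each kernel on real datasets.

```python
import math

def sq_exp(r, length=1.0, var=1.0):
    """Squared exponential covariance as a function of distance r."""
    return var * math.exp(-0.5 * (r / length) ** 2)

def matern52(r, length=1.0, var=1.0):
    """Matern covariance with smoothness 5/2 (assumed for this sketch)."""
    s = math.sqrt(5.0) * abs(r) / length
    return var * (1.0 + s + s * s / 3.0) * math.exp(-s)

def cholesky(a):
    """Lower-triangular Cholesky factor of a positive definite matrix."""
    n = len(a)
    low = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(low[i][k] * low[j][k] for k in range(j))
            if i == j:
                low[i][j] = math.sqrt(a[i][i] - s)
            else:
                low[i][j] = (a[i][j] - s) / low[j][j]
    return low

def log_marginal_likelihood(xs, ys, kernel, noise=0.1):
    """Log evidence of a zero-mean GP with Gaussian noise (1-D inputs)."""
    n = len(xs)
    k = [[kernel(xs[i] - xs[j]) + (noise ** 2 if i == j else 0.0)
          for j in range(n)] for i in range(n)]
    low = cholesky(k)
    # Forward substitution: solve L z = y, so that y^T K^{-1} y = z^T z.
    z = []
    for i in range(n):
        z.append((ys[i] - sum(low[i][j] * z[j] for j in range(i))) / low[i][i])
    quad = sum(v * v for v in z)
    log_det = 2.0 * sum(math.log(low[i][i]) for i in range(n))
    return -0.5 * quad - 0.5 * log_det - 0.5 * n * math.log(2.0 * math.pi)

# Made-up 1-D data, for illustration only.
xs = [0.0, 0.5, 1.0, 1.5, 2.0]
ys = [0.0, 0.48, 0.84, 1.0, 0.91]

lml_se = log_marginal_likelihood(xs, ys, sq_exp)
lml_m52 = log_marginal_likelihood(xs, ys, matern52)
```

The kernel with the larger log marginal likelihood is the one the data favour under these fixed settings. Repeating the comparison across several datasets, with hyperparameters handled properly, is what the project idea asks for.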

Please note, the following guidelines are good practice and should lead to better results, but you have the freedom to pick whatever is suitable for your style. Working in groups of at most 2 or 3, you have to select a challenging real-world problem and one (or more) appropriate data set(s) as suggested above. You could also use the following links, which have numerous problems and data sets.

Notes:
1. You will not get full marks for this section if you submit your proposal late.
2. If the final submission of your CW is different from what you proposed in your proposal, you will not get any marks for parts 2 & 3.
2) Technical quality
1. Rigour and extent of the experiments.
2. Correct application of the selected algorithms and suitability of the methods.
3. Data preparation - technical quality.
4. Extent of evidence of running the experiments provided in appendices.

3) Evaluation
1. Evaluation and discussion of the results. Why are the results important? How would the results be useful to other researchers or practitioners?
2. Is this a “real” problem or a small “toy” problem? How does the paper advance the state of the art?

5) Clarity of the writing:
1. Is there sufficient information for the reader to reproduce the results? Is the language used in the paper good?
2. References and general presentation: are results clearly presented, with appropriate visualisations?
