Eval ML models on news headlines dataset.

Implementing and Evaluating Multiple Machine Learning Models on News Headlines Dataset

Assessment Description:

In this assessment element, you should implement and evaluate multiple machine learning models and vectorization techniques on a given dataset (Implementation), provide a reflective written report based on your experiments and critical analysis (Reflective Report), and present your work orally in a short presentation session (Presentation).

This assessment aims to evaluate your practical skills in implementing an end-to-end machine learning pipeline based on the provided dataset of News Headlines (Headlines.csv). This dataset consists of more than 200,000 news headlines that are categorized by topic into the 41 categories below:

'CRIME', 'ENTERTAINMENT', 'WORLD NEWS', 'IMPACT', 'POLITICS', 'WEIRD NEWS', 'BLACK VOICES', 'WOMEN', 'COMEDY', 'QUEER VOICES', 'SPORTS', 'BUSINESS', 'TRAVEL', 'MEDIA', 'TECH', 'RELIGION', 'SCIENCE', 'LATINO VOICES', 'EDUCATION', 'COLLEGE', 'PARENTS', 'ARTS & CULTURE', 'STYLE', 'GREEN', 'TASTE', 'HEALTHY LIVING', 'THE WORLDPOST', 'GOOD NEWS', 'WORLDPOST', 'FIFTY', 'ARTS', 'WELLNESS', 'PARENTING', 'HOME & LIVING', 'STYLE & BEAUTY', 'DIVORCE', 'WEDDINGS', 'FOOD & DRINK', 'MONEY', 'ENVIRONMENT', 'CULTURE & ARTS'

Besides headlines and categories, this dataset also includes other attributes (author, link, short description, and date), which can be utilized in your proposed solution.

Import and warehouse the given dataset into your workspace using efficient and suitable data structures.
Conduct a comprehensive pre-processing stage that may involve operations such as data cleaning, reshaping, resizing, dealing with missing values, and other commonly practised industry-standard data wrangling operations.
Besides the general-purpose data wrangling operations above, you must conduct comprehensive NLP-specific data preparation operations such as cleaning, case normalisation, stop-word removal, lemmatising, stemming, etc.
Prepare your training, validation and testing sets and their corresponding ground truth labels.
Perform comprehensive statistical analysis of the given dataset. This may include simple statistical analysis, histogram analysis, outlier analysis, correlation analysis, etc.
Develop at least 3 different natural language vectorization (feature extraction) techniques.
Develop a minimum of 3 different supervised machine learning classification models and train, validate, and test these models using feature vectors produced as a result of vectorization operations.
Express your results both quantitatively and qualitatively. Use commonly practised industry-standard evaluation metrics, including accuracy, F1 score, precision, recall, and the confusion matrix, to express your results.
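
As an illustration of the NLP-specific preparation step listed above, the following is a minimal sketch of case normalisation, cleaning, and stop-word removal using only the Python standard library. The function name `normalise_headline` and the tiny `STOP_WORDS` set are assumptions made for this example and are not part of the brief. Lemmatising and stemming are not shown; they would normally come from an NLP library of your choice.

```python
import re
from typing import List

# Tiny illustrative stop-word set (an assumption for this sketch);
# a real solution would use a much fuller list.
STOP_WORDS = {"a", "an", "the", "of", "to", "in", "on", "for", "and", "is", "are"}


def normalise_headline(text: str) -> List[str]:
    """Lower-case a headline, strip non-letter characters, and drop stop words."""
    text = text.lower()                      # case normalisation
    text = re.sub(r"[^a-z\s]", " ", text)    # cleaning: keep letters and whitespace only
    tokens = text.split()                    # simple whitespace tokenisation
    return [t for t in tokens if t not in STOP_WORDS]


print(normalise_headline("The 10 Best Places to Travel in 2018!"))
# prints ['best', 'places', 'travel']
```

Note that this sketch discards digits and non-ASCII letters, which may or may not be appropriate for your analysis; that choice should be justified in your report.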
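
The vectorization, modelling, and evaluation tasks above can be sketched as a small experiment grid. The example below assumes scikit-learn is available and pairs three vectorizers (bag of words, TF-IDF, feature hashing) with three classifiers (multinomial naive Bayes, logistic regression, linear SVM). These particular choices, the dictionary names, and the made-up toy headlines are assumptions for illustration only; they are not prescribed by the brief and do not come from Headlines.csv. For brevity the sketch uses a single train/test split and omits the validation set that the tasks require.

```python
from sklearn.feature_extraction.text import (
    CountVectorizer,
    HashingVectorizer,
    TfidfVectorizer,
)
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    precision_recall_fscore_support,
)
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Made-up toy headlines standing in for Headlines.csv (NOT real data).
# The examples are repeated, so the resulting scores are meaningless.
texts = [
    "team wins championship final", "striker scores late goal",
    "coach praises team after match", "player injured before final match",
    "senate passes budget bill", "president signs new law",
    "governor vetoes tax bill", "congress debates election law",
] * 5
labels = (["SPORTS"] * 4 + ["POLITICS"] * 4) * 5

# Stratified split keeps the class proportions equal in both sets.
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, stratify=labels, random_state=0
)

# Factories so that every experiment gets a fresh, unfitted object.
vectorizers = {
    "bow": lambda: CountVectorizer(),
    "tfidf": lambda: TfidfVectorizer(ngram_range=(1, 2)),
    # alternate_sign=False keeps features non-negative for naive Bayes.
    "hashing": lambda: HashingVectorizer(n_features=2**12, alternate_sign=False),
}
models = {
    "naive_bayes": lambda: MultinomialNB(),
    "log_reg": lambda: LogisticRegression(max_iter=1000),
    "linear_svm": lambda: LinearSVC(),
}

results = {}
for v_name, make_vec in vectorizers.items():
    for m_name, make_model in models.items():
        clf = make_pipeline(make_vec(), make_model())
        clf.fit(X_train, y_train)
        y_pred = clf.predict(X_test)
        # Macro averaging weights every category equally.
        precision, recall, f1, _ = precision_recall_fscore_support(
            y_test, y_pred, average="macro", zero_division=0
        )
        results[(v_name, m_name)] = {
            "accuracy": accuracy_score(y_test, y_pred),
            "precision": precision,
            "recall": recall,
            "f1": f1,
            "confusion": confusion_matrix(
                y_test, y_pred, labels=["POLITICS", "SPORTS"]
            ),
        }

for key, scores in results.items():
    print(key, round(scores["accuracy"], 3), round(scores["f1"], 3))
```

Macro averaging is used here because the real dataset has 41 categories of very different sizes; whether macro, micro, or weighted averaging is most appropriate is a design choice worth discussing in the reflective report.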