Topic clustering with TFIDF and NLP.

COMP 814 Text mining

Task Requirements

To achieve the objectives of the project, you will first need to read in the data, extract the metadata, and segment it into the required demographics. You will then need to design strategies to extract and cluster topics.

To be consistent within the class, let us use the same definition of a topic: a topic is the mention of an OBJECT or a THING. So, for instance, you could simply take the THING that is mentioned the highest number of times as the most popular topic, and the THING with the next-highest count as the second most popular topic. Once you have the two most dominant “things”, expand each topic to include the 2 verbs/nouns before and the 2 verbs/nouns after it. Output them as “what has been said about the dominant thing” in terms of the 4 surrounding nouns/verbs.

Repeat what you have done using frequency above, but this time use TF-IDF. For this, consider all the blogs from one person as a document. Compare the results from the two modes of counting and comment on which one is more accurate in your opinion, with justifications.

Note that you will need to use various techniques such as stemming, lemmatization, PCA, and stop word removal, inter alia, to get results that are as accurate as possible. The results will need to be evaluated manually, and the strategy for evaluation should be described in your write-up.

Write up

1. You need to document the research project as a scientific paper using the LaTeX double-column IEEE conference format. The LaTeX template can be downloaded from Blackboard.
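
The two counting modes and the 2-before/2-after context window described in the task requirements can be sketched roughly as follows. This is only an illustrative sketch, not a reference solution. It assumes the blogs have already been tokenized and part-of-speech tagged with Penn Treebank style tags (NN*, VB*), and that each author's blogs have been concatenated into one tagged token list. The function names, the toy data, the unsmoothed idf, and the choice to sum TF-IDF scores across authors are all assumptions made for illustration.

```python
import math
from collections import Counter

NOUN_TAGS = {"NN", "NNS", "NNP", "NNPS"}   # "things" are approximated by nouns
CONTENT_PREFIXES = ("NN", "VB")            # nouns and verbs form the context


def noun_counts(tagged):
    """Count lower-cased nouns in one tagged document."""
    return Counter(w.lower() for w, t in tagged if t in NOUN_TAGS)


def top_topics_by_frequency(docs, k=2):
    """Rank nouns by raw count over all documents."""
    total = Counter()
    for tagged in docs.values():
        total.update(noun_counts(tagged))
    return [w for w, _ in total.most_common(k)]


def top_topics_by_tfidf(docs, k=2):
    """Rank nouns by TF-IDF, one document per author.

    Assumption: scores are summed across authors, and idf is the plain
    log(N / df), so a noun used by every author scores zero.
    """
    n_docs = len(docs)
    per_doc = {author: noun_counts(t) for author, t in docs.items()}
    df = Counter()
    for counts in per_doc.values():
        df.update(counts.keys())           # document frequency of each noun
    scores = Counter()
    for counts in per_doc.values():
        length = sum(counts.values()) or 1
        for word, count in counts.items():
            scores[word] += (count / length) * math.log(n_docs / df[word])
    return [w for w, _ in scores.most_common(k)]


def context(tagged, topic, width=2):
    """Return (before, after) noun/verb windows around each topic mention."""
    content = [(w.lower(), t) for w, t in tagged
               if t.startswith(CONTENT_PREFIXES)]
    windows = []
    for i, (word, _) in enumerate(content):
        if word == topic:
            before = [w for w, _ in content[max(0, i - width):i]]
            after = [w for w, _ in content[i + 1:i + 1 + width]]
            windows.append((before, after))
    return windows


# Hypothetical toy data: three authors, tokens already tagged.
toy_docs = {
    "author_a": [("bought", "VBD"), ("a", "DT"), ("dog", "NN"),
                 ("walked", "VBD"), ("the", "DT"), ("dog", "NN"),
                 ("every", "DT"), ("day", "NN"), ("fed", "VBD"),
                 ("dog", "NN"), ("each", "DT"), ("day", "NN")],
    "author_b": [("day", "NN"), ("watched", "VBD"), ("movie", "NN"),
                 ("all", "DT"), ("day", "NN")],
    "author_c": [("day", "NN"), ("played", "VBD"), ("game", "NN")],
}

freq_topics = top_topics_by_frequency(toy_docs)
tfidf_topics = top_topics_by_tfidf(toy_docs)
dog_context = context(toy_docs["author_a"], "dog")
```

In this toy data, "day" is used by every author, so it dominates the raw frequency ranking but receives a TF-IDF score of zero. That is the kind of difference between the two counting modes that the comparison in the write-up would need to discuss. Stemming, lemmatization, and stop word removal are left out of the sketch and would normally be applied before counting.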
