Option 1: Explore contemporary trending topics.
If you decide to do Option 2, skip this option.
1) In Option 1, you will work on Twitter data about "GameStop".
An existing dataset, "reddit.csv", is provided. It contains data gathered from reddit.com under r/wallstreetbets, where the GameStop controversy originated. You will gather your own set of data from Twitter about GameStop and compare your data with the provided data.
2) Gather your own data
Perform a bit of research on your own to find out about the controversy around GameStop stock. From that research, come up with three keywords that will be used in your tweet collection.
A tutorial file, "Tutorial on obtaining tweets.docx", is provided that explains step by step how to use the Twitter data gathering tool. The tool uses keywords to filter real-time tweets and download them into a data file. Once the data is downloaded, you can use "twitterparser.py" to parse your Twitter data into a more structured dataset.
In your report, please explain in detail how your data is gathered, with discussions on:
1) A brief summary of the result of your research and the keywords you chose to collect the data
2) The dates and times of when you collected your data
3) The size of your data.
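The collection dates and the dataset size can usually be read directly from the parsed file. The following is a minimal sketch, assuming the parsed data is a CSV file with a `created_at` column holding ISO 8601 timestamps. The column name and the sample rows are hypothetical, so adjust them to whatever "twitterparser.py" actually produces.

```python
import csv
import io
from datetime import datetime

# Hypothetical sample standing in for your parsed tweet file.
SAMPLE = """created_at,text
2021-02-10T14:05:00,GameStop is up again
2021-02-10T16:30:00,Holding my shares
2021-02-11T09:15:00,What happened to the short squeeze
"""

def summarize_collection(csv_file, time_column="created_at"):
    """Return (row count, earliest timestamp, latest timestamp)."""
    rows = list(csv.DictReader(csv_file))
    times = [datetime.fromisoformat(r[time_column]) for r in rows]
    return len(rows), min(times), max(times)

count, first, last = summarize_collection(io.StringIO(SAMPLE))
print(f"{count} tweets collected between {first} and {last}")
```

To run this on a real file, pass `open("your_file.csv", newline="", encoding="utf-8")` instead of the `io.StringIO` sample.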
3) Explore Provided Data and Your Gathered Data
A dataset we gathered previously is also provided; the datasets are under "Assignments->Project 1" on Blackboard. Now that you have the Twitter dataset, look at how the dataset is structured. Try to understand what each pre-defined feature represents and see if anything requires preprocessing. Also look into the unstructured part of the data (e.g. the actual tweets) and think about any features you need to extract. Discuss the following aspects of the data in your project report:
1. Which features have missing data? How much data is missing? How do you deal with missing data?
2. Which features are potentially noisy? How ambiguous is the data? How do you deal with such noise?
3. Which features might need normalization? Which features might need discretization?
4. Are there any features that need to be created from unstructured data?
5. Are there any other preprocessing steps needed for the data?
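For the first question in the list above, it helps to count missing values per feature before deciding how to handle them. The following is a minimal sketch, assuming missing values appear as empty cells in a CSV file. The column names and rows are hypothetical and are not taken from the provided datasets.

```python
import csv
import io

# Hypothetical sample; empty cells stand in for missing values.
SAMPLE = """user,location,text
alice,,GME to the moon
bob,Texas,
carol,,Selling today
"""

def missing_counts(csv_file):
    """Count empty or whitespace-only values per column."""
    reader = csv.DictReader(csv_file)
    counts = {name: 0 for name in reader.fieldnames}
    total = 0
    for row in reader:
        total += 1
        for name in reader.fieldnames:
            if not (row[name] or "").strip():
                counts[name] += 1
    return counts, total

counts, total = missing_counts(io.StringIO(SAMPLE))
for name, n in counts.items():
    print(f"{name}: {n}/{total} missing ({n / total:.0%})")
```

Real data may also encode missing values as placeholders such as "N/A" or "None", so check a few rows by hand before trusting the counts.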
4) Hypotheses on What Is Different Between Twitter and Reddit
Once you are familiar with the structure and content of the data, state in your project three aspects of the data that you think might be different or the same between the Reddit data and your data. For example, you can hypothesize that the Reddit data is more favorable to GameStop stock than the Twitter data. Think about at least three such hypotheses.
Write one paragraph for each hypothesis you decided to explore. State what your hypothesis is and explain why you think this hypothesis is important in each paragraph.
5) Preprocessing the Data
This is the most important part of the project. You will perform at least 5 preprocessing techniques on both your dataset and the historical dataset. Do not randomly choose any techniques to perform. Keep your end goal of visualizing your hypothesis in mind. Your preprocessing should help you to visualize and discuss your hypothesis later.
For each preprocessing technique, discuss the following in your project. Each point should have at least one paragraph of discussion:
1. Explain what technique is used and how the technique works.
2. Explain in detail how the technique is applied. If the technique requires parameters, explain what the parameters are and how they are chosen.
3. Show and discuss the result of the technique. Do not copy and paste raw results or screen captures. Summarize your results into figures and explain what the figures say about your preprocessing results.
4. Explain how this technique helps you discuss your hypothesis later.
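As an illustration of a technique with a parameter you would need to justify, the following is a minimal sketch of equal-width discretization. The number of bins, `n_bins`, is the parameter, and the values are hypothetical counts rather than data from either dataset. It is one possible technique, not a required one.

```python
def equal_width_bins(values, n_bins):
    """Assign each value a bin index from 0 to n_bins - 1."""
    low, high = min(values), max(values)
    width = (high - low) / n_bins
    if width == 0:
        # All values are identical, so everything falls into one bin.
        return [0 for _ in values]
    # The maximum value would land in bin n_bins, so clamp it into the last bin.
    return [min(int((v - low) / width), n_bins - 1) for v in values]

scores = [0, 3, 12, 45, 50, 98, 100]  # hypothetical upvote or retweet counts
print(equal_width_bins(scores, n_bins=4))  # [0, 0, 0, 1, 2, 3, 3]
```

Equal-width bins are easy to explain but can leave most rows in a single bin when the data is skewed, which is the kind of trade-off the discussion of parameters should cover.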
6) Data Visualization
After the data is preprocessed, you should try to have three different types of visualization about your three hypotheses. You are looking for patterns that confirm or disprove your hypotheses. Patterns should concern spatial, temporal, or demographic aspects of the data. Visualization using a tweet text summary is also OK, but advanced text mining is not required. For example, for the aforementioned hypothesis, you can simply count favorable words such as "strong", "good", "up", and "buy" and see if there is a difference in how often these words are used between the Reddit and Twitter data.
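The word-counting idea above can be sketched in a few lines. This is a minimal example, assuming each dataset has already been reduced to a list of text strings. The posts shown are hypothetical, and the rate is reported per 100 words so that datasets of different sizes remain comparable.

```python
import re
from collections import Counter

FAVORABLE = {"strong", "good", "up", "buy"}  # the example words from the text

def favorable_rate(texts, vocabulary=FAVORABLE):
    """Return (favorable word counts, favorable words per 100 words)."""
    counts = Counter()
    total_words = 0
    for text in texts:
        words = re.findall(r"[a-z']+", text.lower())
        total_words += len(words)
        counts.update(w for w in words if w in vocabulary)
    rate = 100 * sum(counts.values()) / total_words if total_words else 0.0
    return counts, rate

# Hypothetical posts standing in for the two datasets.
reddit_posts = ["Buy and hold, GME is strong", "Still going up, buy more"]
tweets = ["Not sure this is a good idea", "The stock fell hard today"]

reddit_counts, reddit_rate = favorable_rate(reddit_posts)
twitter_counts, twitter_rate = favorable_rate(tweets)
print(f"reddit:  {reddit_rate:.1f} per 100 words {dict(reddit_counts)}")
print(f"twitter: {twitter_rate:.1f} per 100 words {dict(twitter_counts)}")
```

A plain word count ignores negation ("not good") and sarcasm, which is worth acknowledging when you discuss what the resulting chart shows.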
Any visualization tool is OK, such as Excel or Tableau Public. The project is not graded on the tools used but on the discussion. Make sure you explain what each visualization shows in detail. Discuss the following aspects for each visualization:
1. Why is this visualization created?
2. How is it created?
3. What does each element of the visualization mean?
4. What pattern does this visualization show?
5. What does this pattern say about your hypothesis?
6. Is your hypothesis correct or not based on this visualization?
Option 2: Explore change in topics through time
1) Your topic for this option is "Coronavirus". You will explore the change in tweets between this year and last year.
2) Gather your own data besides the provided corona.csv
corona.csv: A historical tweet dataset collected using the keywords "coronavirus" and "covid-19".
A tutorial file, "Tutorial on obtaining tweets.docx", is provided that explains step by step how to use the Twitter data gathering tool. The tool uses keywords to filter real-time tweets and download them into a data file. Once the data is downloaded, you can use "twitterparser.py" to parse your Twitter data into a more structured dataset. Since we are comparing current data with historical data, your analysis might be easier if you use the same keywords as the historical data. However, if you feel the provided keywords are not adequate, you can collect data based on your own keywords. If you do this, please explain what keywords you used and why you selected them. Also, later in your discussion, always remember that your dataset was collected differently from the provided dataset. This might impact some of your analysis and discussion.
In your report, please explain in detail how your data is gathered, with discussions on:
1) The dates and times of when you collected your data
2) The keywords you used for data collection
3) The size of your data.
3) Explore Provided Data and Your Gathered Data
A dataset we gathered previously is also provided; the datasets are under "Assignments->Project 1" on Blackboard. Now that you have the Twitter dataset, look at how the dataset is structured. Try to understand what each pre-defined feature represents and see if anything requires preprocessing. Also look into the unstructured part of the data (e.g. the actual tweets) and think about any features you need to extract. Discuss the following aspects of the data in your project report:
1. Which features have missing data? How much data is missing? How do you deal with missing data?
2. Which features are potentially noisy? How ambiguous is the data? How do you deal with such noise?
3. Which features might need normalization? Which features might need discretization?
4. Are there any features that need to be created from unstructured data?
5. Are there any other preprocessing steps needed for the data?
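For the fourth question in the list above, simple structured features can often be pulled out of the raw tweet text with regular expressions. The following is a minimal sketch. The feature names and the example tweet are hypothetical, and the patterns are deliberately simple rather than a complete tweet grammar.

```python
import re

HASHTAG = re.compile(r"#(\w+)")
MENTION = re.compile(r"@(\w+)")
URL = re.compile(r"https?://\S+")

def extract_features(tweet):
    """Derive simple structured features from raw tweet text."""
    return {
        "hashtags": [h.lower() for h in HASHTAG.findall(tweet)],
        "mentions": MENTION.findall(tweet),
        "has_url": bool(URL.search(tweet)),
        # Count words after removing URLs so links do not inflate the length.
        "word_count": len(URL.sub("", tweet).split()),
    }

# Hypothetical tweet text.
example = "Got my #COVID19 #vaccine today, thanks @CityClinic https://example.com/info"
features = extract_features(example)
print(features)
```

Lowercasing hashtags merges variants such as #Vaccine and #vaccine, which is itself a small preprocessing decision you could discuss in the report.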
4) Hypotheses on What Has Changed Through Time
Once you are familiar with the structure and content of the data, state in your project three aspects of the data that you think might have changed between last year and now. For example, for coronavirus you can hypothesize that the fight against COVID is now focused on the vaccine push, versus last year's focus on lockdown enforcement. The hypotheses don't have to be correct, since you will confirm or disprove them with your own data later in the project.
Write one paragraph for each hypothesis of change you decided to explore. State what your hypothesis is and explain why you think this hypothesis is important in each paragraph.
5) Preprocessing the Data
This is the most important part of the project. You will perform at least 5 preprocessing techniques on both your dataset and the historical dataset. Do not randomly choose any techniques to perform. Keep your end goal of visualizing your hypothesis in mind. Your preprocessing should help you to visualize and discuss your hypothesis later.
For each preprocessing technique, discuss the following in your project. Each point should have at least one paragraph of discussion:
1. Explain what technique is used and how the technique works.
2. Explain in detail how the technique is applied. If the technique requires parameters, explain what the parameters are and how they are chosen.
3. Show and discuss the result of the technique. Do not copy and paste raw results or screen captures. Summarize your results into figures and explain what the figures say about your preprocessing results.
4. Explain how this technique helps you discuss your hypothesis later.
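As one example of how a technique works and where its parameters come from, the following is a minimal sketch of min-max normalization. The minimum and maximum are taken from the data itself, and the retweet counts are hypothetical. It illustrates one option and is not a required technique.

```python
def min_max_normalize(values):
    """Rescale values linearly so the minimum maps to 0.0 and the maximum to 1.0."""
    low, high = min(values), max(values)
    if high == low:
        # No spread in the data, so every value maps to 0.0.
        return [0.0 for _ in values]
    return [(v - low) / (high - low) for v in values]

retweets = [2, 10, 6, 18]  # hypothetical retweet counts
print(min_max_normalize(retweets))  # [0.0, 0.5, 0.25, 1.0]
```

When you apply this to both datasets, decide whether the minimum and maximum are computed per dataset or across both combined, and explain the choice, since it changes how the two can be compared.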
6) Data Visualization
After the data is preprocessed, you should try to have three different types of visualization about patterns discovered in your topic. Patterns should concern spatial, temporal, or demographic aspects of the data. Visualization using a tweet text summary is also OK, but advanced text mining is not required. Any visualization tool is OK, such as Excel or Tableau Public. The project is not graded on the tools used but on the discussion. Make sure you explain what each visualization shows in detail. Discuss the following aspects for each visualization:
1. Why is this visualization created?
2. How is it created?
3. What does each element of the visualization mean?
4. What pattern does this visualization show?
5. What does this pattern say about the changes between current data and historical data?
6. Is your hypothesis correct or not based on this visualization?
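Before charting a change between the historical and current data, it can help to reduce each dataset to a comparable number. The following is a minimal sketch, assuming each dataset is a list of tweet texts. It computes the share of tweets mentioning a keyword, using the vaccine and lockdown example. The tweets are hypothetical, and shares are used instead of raw counts because the two datasets are unlikely to be the same size.

```python
def keyword_share(texts, keyword):
    """Fraction of texts that contain the keyword (case-insensitive substring match)."""
    if not texts:
        return 0.0
    hits = sum(1 for t in texts if keyword.lower() in t.lower())
    return hits / len(texts)

# Hypothetical tweets standing in for corona.csv and your own collection.
historical = ["Lockdown extended again", "Stay home during lockdown", "Cases rising"]
current = ["Booked my vaccine slot", "Vaccine side effects?",
           "Lockdown lifted", "Second vaccine dose done"]

for keyword in ("lockdown", "vaccine"):
    old = keyword_share(historical, keyword)
    new = keyword_share(current, keyword)
    print(f"{keyword:<9} historical {old:.0%}  current {new:.0%}")
```

The resulting shares can be pasted into Excel or Tableau Public as a grouped bar chart, with one group per keyword and one bar per dataset.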