Main Objectives of the Assessment
The purpose of this assessment is to enable you to demonstrate your achievement of learning outcomes (K1, C1, C2, and S1) of the module. Specifically:
1.Select appropriate statistical analysis methods and demonstrate the ability to effectively interpret and clearly communicate the results of such methods.
2.Describe the key concepts associated with a relational database model and demonstrate (using, for example, SQL) how typical query and maintenance tasks are performed.
During this assessment you will be aiming to:
This is the only opportunity you will have to demonstrate your achievement of learning outcomes K1, C1, C2 and S1, as the examination only assesses learning outcomes K2 and K3. If you fail this assessment, you will be required to take the re-sit examination next semester, at which point you will be assessed on all learning outcomes. It is therefore critical that you attempt this coursework to the best of your ability and submit it on time.
Description of the Assessment
The assessment consists of the following tasks:
1.Data table preparation: You will download a set of Twitter data (Tweets) from Moodle. These tweets cover several topics. Note that you will be assigned only one topic. All SQL operations should be carried out using the SQLite tool as described and used in the lab sessions.
2.Threshold questions: These mandatory questions assess your achievement of the basic module learning outcomes. There is a threshold section for each of the learning outcomes. You must pass these sections to achieve a basic pass grade (D-). The Data Checklist in the Submission Form below specifies the criteria that your data files must meet and what should be included in your submission.
3.Higher questions: These optional questions are more challenging and allow you to raise your grade above a bare pass (D- to A*). These questions require you to go beyond what you have learnt in the lectures and lab tutorials and, in some cases, do some additional research. There is a higher section for each learning outcome.
Marking Criteria
To pass the module at threshold level (D- grade) you must have:
1.Answered all the threshold questions for learning outcome C1/C2/S1 correctly and met all related criteria in the Data Checklist.
2.Answered the threshold questions for learning outcome K1/S1 correctly and met all related criteria in the Data Checklist.
If you fail either one of these criteria, you will be awarded an E grade. If you fail both criteria, you will be awarded an F grade.
To be eligible for a higher grade, you must have achieved a threshold pass as described above. Your final grade will then be determined by your performance on the higher questions. For each of these questions, you will be awarded an additional mark for a correct answer. If your answer is wrong, you may be awarded a partial mark (e.g. 0.5) where your described method (working) has convinced the marker that you were close to finding the correct answer. It is therefore important that you report how you worked out your answer (e.g. the SPSS syntax or SQL query) in the field provided, even if you are not sure that your answer is correct.
Note that if you have met the criteria for a threshold pass you are assured a minimum grade of D-, even if you score zero marks on the higher questions.
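The threshold rules above can be summarised in a short sketch. This is only an illustration of the stated rules, not an official marking tool: the function name `threshold_outcome` and its parameter names are hypothetical, and the mapping from higher-question marks to grades above D- is not specified in this brief, so it is not modelled.

```python
def threshold_outcome(passed_c1_c2_s1: bool, passed_k1_s1: bool) -> str:
    """Return the grade band implied by the two threshold criteria."""
    criteria_passed = sum([passed_c1_c2_s1, passed_k1_s1])
    if criteria_passed == 2:
        # Threshold pass: D- is guaranteed, higher questions may raise it.
        return "D- or higher"
    if criteria_passed == 1:
        # Exactly one criterion failed.
        return "E"
    # Both criteria failed.
    return "F"


print(threshold_outcome(True, True))
print(threshold_outcome(True, False))
print(threshold_outcome(False, False))
```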
Format of the Assessment
The Submission Form is now released (starting on page 4). You should complete the Data Checklist and enter your answers for the questions in the spaces provided.
Submission Instructions
Coursework must be submitted via Moodle (upload your coursework to Moodle; check with your tutor). The required file format for this report is a Zip archive. Your student LBIC ID number must be used as the file name.
Within this archive you should include all of your files specified in the Data Checklist i.e.:
1.Completed Submission form (when sending your form do not include pages 1-3)
2.Tweets.sav – the SPSS version of your tweets data, including metadata and derived variables.
3.Tweets.sqlite – the dBase version of your data table, as derived (saved) from SPSS.
Avoiding Plagiarism
Please ensure that you understand the meaning of plagiarism and the seriousness of the offence. Information on plagiarism and how to avoid it is provided in the LBIC Student Handbook. For instance, in this assignment, submission of a Tweets file that was collected by another student would be considered an act of plagiarism.
Threshold Questions (Statistics)
1.List two qualitative variables in your dataset.
2.Suppose you wanted to assign a numerical coding scheme in your SPSS data table to arbitrarily group cases based on their time zone value, e.g. London = 1, Tokyo = 2, Washington = 3, etc. What SPSS function would you use to create this new variable?
3.What would be the level of measurement of the variable you created in Q2?
4.What is the mode of the positive sentiment score across all tweets in the tweets dataset?
5.Which “screenname” has the second-fewest followers?
6.You predict that followers and friends will correlate positively. What would be the best type of graph to visualise the correlation between these two variables?
Higher Questions
1.Compute a Pearson correlation test for all numeric variables. Which two variables have the strongest coefficient? Based on the strongest coefficient, compute the proportion of shared variance.
2.You want to be able to identify the distribution of tweets across the hours of the day and the days of the week. From DateTime compute two variables: one called hour (You’ll find the relevant function group, called Time Duration Extraction, within the Compute - Function Group list) and another called wkDay (the name of the function will be similar but resides in a different function group). For wkDay, make them more readable by adding appropriate value labels. Make sure you use the correct/advised coding scheme in each case (e.g. for Day does the series start from Sunday or Monday?).
3.Considering only tweets tagged by the London location, test whether there is a significant correlation between negative sentiment (sentNeg) and time of day (hour).
4.Generate a bar graph that shows the total number of tweets posted on each week day. Which day had the most activity? Paste the command and an image of your final graph below.
5.You want to answer the following research question: do users who have more than 400 followers tend to have less negative sentiment than those with 400 or fewer followers (taking p = 0.05 as the significance level)? Choose an appropriate test to answer the question and explain what you found in your result.
6.Considering only the day with the least activity (refer to the answer of Q4), determine if there is a significant correlation between ‘Followers’ and ‘Friends’.
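Question 1 in the list above refers to the proportion of shared variance, which is the square of the Pearson coefficient (r squared). The following is a minimal sketch of that arithmetic only. The helper `pearson_r` and the `followers` and `friends` values are hypothetical illustration data, not taken from the Tweets dataset, and the assessed work should still be done in SPSS.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Compute the Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of cross-products of deviations from the means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Sums of squared deviations for each variable.
    ss_x = sum((x - mean_x) ** 2 for x in xs)
    ss_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(ss_x * ss_y)


# Hypothetical example values, for illustration only.
followers = [10, 20, 30, 40, 50]
friends = [12, 25, 29, 48, 51]

r = pearson_r(followers, friends)
shared_variance = r ** 2  # proportion of variance the two variables share

print(f"r = {r:.3f}, shared variance = {shared_variance:.3f}")
```

Multiplying the shared variance by 100 expresses it as a percentage.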
Threshold Questions (SQL and Databases)
1.Imagine a database system comprising your Tweets table along with another table called Users. This table comprises attributes that describe the profile of each distinct user that appears within the Tweets table (e.g. TimeZone, description etc.). Which existing attribute within Tweets would best serve as the foreign key?
2.Which attribute forms the primary key in your Tweets dataset? Name an alternate key.
3.Write a query to retrieve all records where the first name is Sean.
4.Write a query that would remove all rows from your Tweets table that were posted by users with fewer than 100 or more than 2000 followers. Test this only on a duplicate table. How many records do you now have in your duplicate table?
Number of records in the duplicate table:
5.Write a query that shows all unique URLs coming from the Instagram domain. The number of rows should equal the number of unique URLs (in other words, no duplicates).
6.Modify your database table so that sentNeg scores run along the same scale as sentPos.
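The advice in question 4 to test destructive queries on a duplicate table can be sketched as follows. This is a hedged illustration using Python's built-in `sqlite3` module with an in-memory database. The table name `TweetsCopy`, the three-column schema and the sample rows are all hypothetical, and your real Tweets table will have different columns and data.

```python
import sqlite3

# In-memory database so nothing on disk is touched.
conn = sqlite3.connect(":memory:")

# Hypothetical, simplified stand-in for the Tweets table.
conn.execute(
    "CREATE TABLE Tweets (id INTEGER PRIMARY KEY, Screenname TEXT, Followers INTEGER)"
)
conn.executemany(
    "INSERT INTO Tweets VALUES (?, ?, ?)",
    [(1, "a", 50), (2, "b", 100), (3, "c", 1500), (4, "d", 2000), (5, "e", 2500)],
)

# Work on a duplicate so the original table is left intact.
conn.execute("CREATE TABLE TweetsCopy AS SELECT * FROM Tweets")

# Remove rows outside the 100 to 2000 follower range (boundaries are kept).
conn.execute("DELETE FROM TweetsCopy WHERE Followers < 100 OR Followers > 2000")

remaining = conn.execute("SELECT COUNT(*) FROM TweetsCopy").fetchone()[0]
original = conn.execute("SELECT COUNT(*) FROM Tweets").fetchone()[0]
print(remaining, original)
```

Note that `CREATE TABLE ... AS SELECT` copies the data but not constraints such as the primary key, which is acceptable for a throwaway test copy.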
Higher Questions
1.Which date during your collection period had the greatest number of matching tweets?
2.Write the query you used to generate the UserStat table as specified in the checklist.
3.Who (Screenname) were the top two contributors of original tweets in your dataset?
4.For those tweets with a sentPos of 3, which location has the most followers?
5.Write a query that shows only Location where average positive sentiment is greater than 3.
6.Create a new table called NotRT (as mentioned in the checklist) that contains only original tweets (no retweets) that had a retweet count of at least five at the time the tweet was retrieved. This table should contain all columns from the Tweets table. After doing this, check the number of rows in this table. [1.5 marks]
7.For those tweets with a sentPos of 2, which location has the most followers?
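Several of the questions above involve filtering on an aggregate rather than on individual rows, which in SQL is done with `GROUP BY` and `HAVING` rather than `WHERE`. The sketch below shows that general pattern with Python's built-in `sqlite3` module. The two-column schema and the sample rows are hypothetical, so treat it as an illustration of the technique and not as a model answer for your own data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical, simplified stand-in for the Tweets table.
conn.execute("CREATE TABLE Tweets (Location TEXT, sentPos INTEGER)")
conn.executemany(
    "INSERT INTO Tweets VALUES (?, ?)",
    [
        ("London", 4), ("London", 5),
        ("Tokyo", 2), ("Tokyo", 3),
        ("Paris", 3), ("Paris", 3),
    ],
)

# WHERE filters rows before grouping; HAVING filters groups after aggregation.
query = """
    SELECT Location
    FROM Tweets
    GROUP BY Location
    HAVING AVG(sentPos) > 3
    ORDER BY Location
"""
locations = [row[0] for row in conn.execute(query)]
print(locations)
```

With this sample data the Paris group averages exactly 3, so the strict `>` comparison excludes it.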