Introduce the principles and uses of Big Data, both from a technical and business perspective.
Equip students with a range of techniques for collecting, structuring, processing and analysing Big Data.
Visualise data when dealing with the volume and variety of Big Data and maintaining the velocity of access.
Expose the legal and ethical requirements of data security.
For the Big Data assessment, you need to select ONE dataset from those listed below and perform the required task, following a Big Data architecture framework and the stages of the Big Data project development process:
Descriptive analysis: You are expected to fully describe your data, as we did in the tutorials. Remember that you are answering the “what happened?” question. Please provide visualisations appropriate to the data you are describing. You must include, as a minimum, a histogram and a scatter or density plot.
Diagnostic analysis: Here you should try to answer the “why?” question. For this point, a regression analysis (based on a scatter plot, for example) would be sufficient. As before, you must support your analysis with adequate plots.
Action: You need to provide a sensible recommendation based only on the assigned task and your analysis.
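As an illustration of the regression mentioned under the diagnostic analysis, the following is a minimal sketch of an ordinary least-squares line fit using only the Python standard library. The function name fit_line and the sample values are assumptions made for this example; they are not taken from the tutorials or from any of the datasets.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept; also returns R^2."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equally long samples with at least 2 points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0:
        raise ValueError("x values are all identical")
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # R^2 is the share of the variance in y explained by the fitted line.
    r_squared = 1.0 if syy == 0 else (sxy * sxy) / (sxx * syy)
    return slope, intercept, r_squared


# Made-up illustrative values, not taken from any of the datasets.
price = [10.0, 20.0, 30.0, 40.0]
quantity = [8.0, 6.0, 4.0, 2.0]
slope, intercept, r_squared = fit_line(price, quantity)
print(f"quantity = {slope:.2f} * price + {intercept:.2f} (R^2 = {r_squared:.2f})")
```

The fitted line can be drawn over the scatter plot, and the slope and R^2 value can support the written explanation of the relationship.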
Please be aware that each plot should be fully described in your assessment. However, the plot itself must be clear enough to tell a story on its own.
For each dataset you must include a proper citation. This is usually provided at the source website.
Remember that when dealing with text data, you must trim leading and trailing blanks before conducting any analysis.
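Trimming can be done with Python's built-in str.strip(), which removes leading and trailing whitespace. The sketch below applies it to every text field of a record; the helper name trim_text_fields and the sample record are assumptions made for illustration.

```python
def trim_text_fields(row):
    """Return a copy of a record with whitespace stripped from every string value."""
    return {
        key: value.strip() if isinstance(value, str) else value
        for key, value in row.items()
    }


# Hypothetical record; non-text values such as the numeric id are left untouched.
raw = {"id": 7, "title": "  Big Data in practice \n", "comments": "\t12 pages"}
clean = trim_text_fields(raw)
print(clean)
```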
In the following list you will find four points for each dataset: name of file, source of file, the required task, and the suggested action.
Task: Find where in London hosts have the most positive and the most negative reviews. For this you need to do a basic sentiment analysis based, for example, on the files positive.txt and negative.txt. You can take Tutorial 09 and the ranking approach as a starting point.
Action: What would you recommend to reduce the number of negative comments: improving the description of the property or changing the location of the property? For example, if properties with a high number of negative comments are in a particular area, you can recommend renting somewhere else.
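One possible shape for the basic sentiment analysis is sketched below. It is not the Tutorial 09 code. It assumes that positive.txt and negative.txt contain one word per line, that each review is available as an (area, comment) pair, and that a comment counts as positive or negative according to the sign of its word-hit score. All function names, the area names and the sample comments are made up for this example.

```python
import re
from collections import defaultdict


def load_words(path):
    """Read a lexicon file such as positive.txt (assumed: one word per line)."""
    with open(path, encoding="utf-8") as handle:
        return {line.strip().lower() for line in handle if line.strip()}


def sentiment_score(text, positive, negative):
    """Positive-word hits minus negative-word hits in one comment."""
    words = re.findall(r"[a-z']+", text.strip().lower())
    return sum(w in positive for w in words) - sum(w in negative for w in words)


def count_by_area(reviews, positive, negative):
    """Count positive and negative comments per area from (area, comment) pairs."""
    counts = defaultdict(lambda: {"positive": 0, "negative": 0})
    for area, comment in reviews:
        entry = counts[area.strip()]
        score = sentiment_score(comment, positive, negative)
        if score > 0:
            entry["positive"] += 1
        elif score < 0:
            entry["negative"] += 1
    return dict(counts)


def rank_areas(counts, label):
    """Areas sorted from most to fewest comments carrying the given label."""
    return sorted(counts, key=lambda area: counts[area][label], reverse=True)


# Small in-memory lexicons and reviews (made up) stand in for the real files.
positive = {"great", "clean", "lovely"}
negative = {"dirty", "noisy"}
reviews = [
    ("Camden", "Great clean flat"),
    ("Camden", "Dirty and noisy room"),
    ("Hackney", "  Lovely host  "),
    ("Camden", "Noisy street"),
]
counts = count_by_area(reviews, positive, negative)
print(rank_areas(counts, "negative"))
```

With the real data, the two lexicons would come from load_words("positive.txt") and load_words("negative.txt"). Raw counts favour areas with many listings, so a rate per listing may be a fairer basis for the recommendation.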
Task: Create a new dataset in csv format with only records containing comments and without the word COVID in the title. This new dataset will have only 5 columns: id, title, comments, journal-ref and categories. Using this new dataset, report if there is a relation between journal-ref and categories; construct a ranking for all the words in the titles.
Action: Based on the ranking of words, pick the top ten and recommend a journal for each word. For example, if the top word is “human”, you should rank journals by the number of articles with “human” in the title and then pick the one at the top.
Task: Find which video categories attract more negative comments and report the differences between GB and US.
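The filtering, word ranking and journal lookup described for this dataset could be sketched as follows. The sketch assumes the records have already been loaded as a list of dictionaries keyed by column name, treats the COVID check as case-insensitive, and compares journal-ref values as exact strings. These are simplifying assumptions, and all function names and sample records are hypothetical.

```python
import csv
import io
import re
from collections import Counter

COLUMNS = ["id", "title", "comments", "journal-ref", "categories"]


def filter_records(records):
    """Keep records that have comments and no 'COVID' in the title."""
    kept = []
    for record in records:
        title = (record.get("title") or "").strip()
        comments = (record.get("comments") or "").strip()
        if comments and "covid" not in title.lower():
            row = {column: record.get(column) for column in COLUMNS}
            row["title"], row["comments"] = title, comments
            kept.append(row)
    return kept


def write_csv(rows, handle):
    """Write the five-column dataset to an open text file or buffer."""
    writer = csv.DictWriter(handle, fieldnames=COLUMNS)
    writer.writeheader()
    writer.writerows(rows)


def title_words(title):
    """Lower-case alphabetic words of a title."""
    return re.findall(r"[a-z]+", title.lower())


def rank_title_words(rows):
    """(word, count) pairs over all titles, most frequent first."""
    counts = Counter()
    for row in rows:
        counts.update(title_words(row["title"]))
    return counts.most_common()


def top_journal_for_word(rows, word):
    """Journal with the most articles whose title contains the word."""
    journals = Counter(
        row["journal-ref"].strip()
        for row in rows
        if row["journal-ref"] and word in title_words(row["title"])
    )
    return journals.most_common(1)[0][0] if journals else None


# Made-up records for illustration only.
records = [
    {"id": "1", "title": " Human vision models ", "comments": "5 pages",
     "journal-ref": "J. A", "categories": "cs.CV", "abstract": "x"},
    {"id": "2", "title": "COVID spread", "comments": "3 pages",
     "journal-ref": "J. B", "categories": "q-bio"},
    {"id": "3", "title": "Human language", "comments": None,
     "journal-ref": "J. A", "categories": "cs.CL"},
    {"id": "4", "title": "Human motion", "comments": "10 pages",
     "journal-ref": None, "categories": "cs.CV"},
]
rows = filter_records(records)
buffer = io.StringIO(newline="")
write_csv(rows, buffer)
ranking = rank_title_words(rows)
print(ranking[:10], top_journal_for_word(rows, ranking[0][0]))
```

In practice the ranking would probably need common words such as “the” and “of” removed before picking the top ten, and journal-ref strings that differ only in volume or page numbers would need normalising before they are counted.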
Action: For each category, recommend where NOT to publish a video, GB or US. This is based on the number of positive comments.
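Once the positive comments have been counted per category and country, the recommendation step can be reduced to a comparison. The sketch below assumes the counts are held in a dictionary keyed by category. The function name, category names and numbers are made up for illustration.

```python
def where_not_to_publish(positive_counts):
    """For each category, return the country with fewer positive comments."""
    result = {}
    for category, by_country in positive_counts.items():
        gb = by_country.get("GB", 0)
        us = by_country.get("US", 0)
        if gb == us:
            result[category] = None  # tie: the counts give no basis for a choice
        else:
            result[category] = "GB" if gb < us else "US"
    return result


# Made-up counts, not derived from the real data.
positive_counts = {
    "Music": {"GB": 120, "US": 340},
    "Sports": {"GB": 95, "US": 60},
    "News": {"GB": 10, "US": 10},
}
advice = where_not_to_publish(positive_counts)
print(advice)
```

Because the two countries may have very different numbers of videos and comments, comparing the share of positive comments instead of the raw count may give a fairer picture.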
Task: Explain whether there is a relation among type of product, price and quantity. Explain the differences between years in terms of sales, considering not only total quantity but also the different products, quantities and dates.
Action: What would you recommend for improving earnings with only the existing products: increasing the number of sales per product or increasing the quantity sold per product? You must state which products the retailer should target.
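To compare years and to separate the number of sales from the quantity sold, the records can be aggregated by year and product. The sketch below assumes each record has date (ISO format), product, quantity and price fields. These field names, the function name and the sample rows are assumptions and may not match the columns of the actual file.

```python
from collections import defaultdict
from datetime import date


def summarise_sales(rows):
    """Number of sales, total quantity and revenue per (year, product)."""
    summary = defaultdict(lambda: {"sales": 0, "quantity": 0, "revenue": 0.0})
    for row in rows:
        year = date.fromisoformat(row["date"]).year
        entry = summary[(year, row["product"].strip())]
        entry["sales"] += 1
        entry["quantity"] += row["quantity"]
        entry["revenue"] += row["quantity"] * row["price"]
    return dict(summary)


# Made-up rows for illustration only.
rows = [
    {"date": "2010-12-01", "product": "mug ", "quantity": 6, "price": 2.5},
    {"date": "2010-12-03", "product": "mug", "quantity": 2, "price": 2.5},
    {"date": "2011-01-10", "product": "mug", "quantity": 12, "price": 2.0},
]
summary = summarise_sales(rows)
for (year, product), totals in sorted(summary.items()):
    per_sale = totals["quantity"] / totals["sales"]
    print(year, product, totals, f"average quantity per sale: {per_sale:.1f}")
```

Comparing the number of sales with the average quantity per sale for each product shows which of the two levers has more room to grow.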