Purpose
The purpose of this course is to design, develop and implement an operational big data store and link it to a big data platform through an API.
Skills
Technical documentation, research and investigation, experimentation, big data analysis and processing, building and visualizing big data models, use of big data platforms, teamwork, organisation, future big data technologies, forward-looking and innovative thinking, planning, analysis, assessment, integration, abstraction and high-level modelling.
Forming A Team
A team of two to three students will work together as data scientists, with the team leader acting as lead data scientist, to prepare a technical report based on real application data by applying big data models and NoSQL. The project background and resources are given later in this document. Team members must conduct an extensive investigation of the dataset provided, covering pre-processing, descriptive analysis, experimentation on a data analytics platform, generation and analysis of results, visualization of results (graphs and tables) and recommendations.
Marking:
See the project grade rubric document on Canvas.
Feedback:
Feedback will be provided on the progress document during class time. In addition, written feedback will be provided if the team submits a draft to the lecturer before the deadline. The issues raised in this written feedback must be addressed in the final submission.
Project Background
This project involves processing and manipulating an unstructured dataset that contains stories and comments from Hacker News from its launch in 2006. Each story contains a story ID, the author who made the post, when it was written, and the number of points the story received. Hacker News is a social news website focusing on computer science and entrepreneurship. It is run by Paul Graham's investment fund and start-up incubator, Y Combinator. In general, content that can be submitted is defined as "anything that gratifies one's intellectual curiosity".
You can use the BigQuery Python client library to query tables in this dataset in Kernels, or use CQL in Cassandra or another open NoSQL platform. Note that the methods available in Kernels are limited to querying data. Tables are at bigquery-public-data.hacker_news.[TABLENAME]. Fork this kernel to get started.
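As a rough illustration of the BigQuery route described above, the following Python sketch builds a query against a table in bigquery-public-data.hacker_news and, optionally, runs it with the BigQuery Python client library. The table name is passed in by the caller, and the column names (id, by, score, time) are assumptions about the schema that should be checked against the actual table before use. The helper names build_top_stories_query and run_query are hypothetical.

```python
import re

# Dataset path as given in the project background.
DATASET = "bigquery-public-data.hacker_news"


def build_top_stories_query(table_name: str, limit: int = 10) -> str:
    """Build a Standard SQL query for the highest-scoring stories.

    The column names (id, by, score, time) are assumptions about the schema.
    """
    # Identifiers cannot be sent as query parameters, so validate them here.
    if not re.fullmatch(r"[A-Za-z_][A-Za-z0-9_]*", table_name):
        raise ValueError(f"invalid table name: {table_name!r}")
    if limit <= 0:
        raise ValueError("limit must be positive")
    return (
        "SELECT id, `by` AS author, score, `time`\n"
        f"FROM `{DATASET}.{table_name}`\n"
        "WHERE score IS NOT NULL\n"
        "ORDER BY score DESC\n"
        f"LIMIT {limit}"
    )


def run_query(sql: str) -> list:
    """Run the query and return each row as a plain dict.

    Needs the google-cloud-bigquery package and configured credentials.
    """
    # Imported lazily so that the query builder can be used offline.
    from google.cloud import bigquery

    client = bigquery.Client()
    return [dict(row.items()) for row in client.query(sql).result()]
```

A call such as run_query(build_top_stories_query("stories", 5)) would return the five highest-scoring rows, assuming a table called stories exists in the dataset.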
Project Aim
The aim of this project is to use NoSQL technology or Python BigQuery with distributed big data models to handle the storage, cleansing, processing and retrieval of unstructured datasets. Students are expected to understand the variables in the given dataset, develop a data model based on Cassandra NoSQL or any other distributed big data model, implement the database model using Data Definition Language (DDL), and then perform data loading/batch processing and other necessary Data Manipulation Language (DML) operations, including the use of non-tabular models (document/column/map) and collections. Finally, students perform read and write operations using CQL or BigQuery via an API.
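The DDL, DML and API steps above can be sketched for the Cassandra route as follows. This is only one possible model, not a prescribed one: the keyspace name hn_project, the table stories_by_author, the column names and the comment_ids collection are all hypothetical choices made for illustration, and the replication settings suit a single-node test cluster only. The statements are executed through the Python cassandra-driver package, which is imported lazily so the mapping helper can be tried without a running cluster.

```python
from datetime import datetime, timezone

# DDL: keyspace and a query-first table partitioned by author (hypothetical names).
CREATE_KEYSPACE = """
CREATE KEYSPACE IF NOT EXISTS hn_project
WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1}
"""

CREATE_TABLE = """
CREATE TABLE IF NOT EXISTS hn_project.stories_by_author (
    author text,
    written_at timestamp,
    story_id bigint,
    points int,
    comment_ids set<bigint>,
    PRIMARY KEY ((author), written_at, story_id)
) WITH CLUSTERING ORDER BY (written_at DESC, story_id ASC)
"""

# DML: write one story, then read all stories for one author.
INSERT_STORY = """
INSERT INTO hn_project.stories_by_author
    (author, written_at, story_id, points, comment_ids)
VALUES (%s, %s, %s, %s, %s)
"""

SELECT_BY_AUTHOR = """
SELECT story_id, written_at, points
FROM hn_project.stories_by_author
WHERE author = %s
"""


def story_to_params(story: dict) -> tuple:
    """Map a raw story record onto the INSERT parameter order.

    Assumes 'time' holds Unix seconds and that a missing score means 0 points.
    """
    return (
        story["author"],
        datetime.fromtimestamp(story["time"], tz=timezone.utc),
        int(story["id"]),
        int(story.get("score") or 0),
        set(story.get("comment_ids", [])),
    )


def load_and_read(stories, author, contact_points=("127.0.0.1",)):
    """Create the schema, load the stories and read back one author's rows."""
    from cassandra.cluster import Cluster  # needs the cassandra-driver package

    cluster = Cluster(list(contact_points))
    try:
        session = cluster.connect()
        session.execute(CREATE_KEYSPACE)
        session.execute(CREATE_TABLE)
        for story in stories:
            session.execute(INSERT_STORY, story_to_params(story))
        return list(session.execute(SELECT_BY_AUTHOR, (author,)))
    finally:
        cluster.shutdown()
```

Partitioning by author is a design choice that serves the query "all stories by a given author, newest first"; a different ad hoc query would normally call for a different table.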
Project Task
Develop a technical report based on a full investigation of the Hacker News dataset provided on Canvas using big data technology (experimental). Your investigation should cover the following in depth:
1) Data understanding
2) Data pre-processing, if any
3) NoSQL database modelling
4) Develop the NoSQL database model in Cassandra or another distributed big data technology
5) Load instances and perform read and write operations using clusters
6) Design a number of queries to answer ad hoc questions
7) Link your database model with an API and execute the queries
8) Communicate the results
9) Use visualization methods to present some of the useful patterns and results discovered, and communicate the importance of results visualization.
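To make the pre-processing, ad hoc query and visualization steps in the list above more concrete, here is a minimal Python sketch. It assumes stories arrive as dictionaries with id, author and score keys (hypothetical field names), treats a missing score as 0, and uses matplotlib for the chart only if it is installed. The cleaning rules shown are examples; the rules a team actually applies should follow from its own data understanding step.

```python
from collections import defaultdict


def clean_stories(raw_rows):
    """Drop rows without an id or author, remove duplicate ids, coerce scores."""
    seen = set()
    cleaned = []
    for row in raw_rows:
        story_id = row.get("id")
        author = (row.get("author") or "").strip()
        if story_id is None or not author or story_id in seen:
            continue
        seen.add(story_id)
        cleaned.append(
            {"id": story_id, "author": author, "score": int(row.get("score") or 0)}
        )
    return cleaned


def top_authors(stories, n=3):
    """Ad hoc query: the n authors with the most total points."""
    totals = defaultdict(int)
    for story in stories:
        totals[story["author"]] += story["score"]
    # Sort by points descending, then by name so ties are deterministic.
    return sorted(totals.items(), key=lambda kv: (-kv[1], kv[0]))[:n]


def plot_top_authors(ranking, path="top_authors.png"):
    """Save a bar chart of the ranking; needs the matplotlib package."""
    import matplotlib

    matplotlib.use("Agg")  # render to a file without needing a display
    import matplotlib.pyplot as plt

    names = [name for name, _ in ranking]
    points = [total for _, total in ranking]
    fig, ax = plt.subplots()
    ax.bar(names, points)
    ax.set_xlabel("Author")
    ax.set_ylabel("Total points")
    ax.set_title("Top authors by total points")
    fig.savefig(path)
    plt.close(fig)
```

The same ranking can be reported as a table in the technical report and as a chart in the presentation.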
Possible Data Analytics and Big Data Tools
Cassandra/MongoDB/other NoSQL database technology
R/WEKA/Python
Hadoop/Spark
Any other open-source big data tools
Resources
Raw Dataset
Technical Report Template
Grade Rubric
Presentation
There will be a 10-15 minute presentation in the final week of the course. The weight of the presentation is 10% of the total assessment weight.
Assessment Weight
The project weight is 70% of the total course grade. The project is divided into two deliverables:
Report 45%
Presentation 15%