1 Overview
The aim of this coursework is to develop a simple, data-intensive application in Python 3. This is a pair project, and you will have to submit your own, original solution to this coursework specification, consisting of a report, the source code and an executable. The learning objective of this coursework is for students to develop proficiency in advanced programming concepts, stemming from both object-oriented and functional programming paradigms, and to apply these programming skills to a concrete application of moderate size. Design choices regarding the languages, tools, and libraries used for the implementation need to be justified in the accompanying report.
This coursework will develop personal abilities in using modern scripting languages as a “glueware” to build, configure and maintain a moderately complex application and deepen the understanding of integrating components on a Linux system. In a dedicated section, the report needs to critically reflect on the software used for implementing this application, and discuss advantages and disadvantages of this choice. The report should also contain a discussion, contrasting software development on Windows and Linux systems and comparing software development in scripting vs. systems languages (based on the experience from the two pieces of coursework).
2 Lab Environment
Software environment: You should use Python 3 as installed on the Linux lab machines (EM 2.50) or on the Linux MACS VM for the implementation. This installation also provides the pandas, tkinter, and matplotlib libraries. These Linux lab machines are available remotely using the x2go client for remote desktops, running on jove (from there, use ssh to log into the lab machines). For technical HOWTOs about accessing software of relevance for this course, see this technical HOWTOs page for the course or the resources section on the Vision page. If you want to develop the software on your own laptop, you need to install the above software. Both Python and the libraries . For each of the chosen technologies, the report should discuss why it is the most appropriate choice for this application, and possible alternatives should be mentioned.
3 Data Analysis of a Document Tracker
In this assignment, you are required to develop a simple Python-based application that analyses and displays document tracking data from a major web site for publishing documents. The site is widely used by many on-line publishers and currently hosts about 15 million documents. It tracks usage of the site and makes the resulting, anonymised data available to a wider audience. For example, it records who views a certain document, the browser used for viewing it, how the user arrived at this page, etc. In this exercise, we use one of these data sets to perform data processing and analysis in Python. The data format uses JSON and is described on this local page with the data spec. Note that the data files below contain a sequence of entries in JSON format, rather than one huge JSON construct, in order to aid scalability.
The application must provide the following functionality:
1. Python: The core logic of the application should be implemented in Python 3.
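Because the data files hold a sequence of JSON entries rather than one huge JSON construct, the input can be processed entry by entry without loading the whole file. The following is a minimal sketch, assuming one JSON object per line (check this against the data spec); the helper name `read_entries` and the field names in the sample are illustrative, not part of the specification.

```python
import io
import json


def read_entries(lines):
    """Yield one dict per non-empty line of JSON text.

    `lines` can be any iterable of strings, e.g. an open file object.
    Assumption: each entry occupies exactly one line.
    """
    for line in lines:
        line = line.strip()
        if line:  # skip blank lines
            yield json.loads(line)


# Small in-memory stand-in for a data file (field names are assumptions).
sample = io.StringIO(
    '{"visitor_uuid": "v1", "visitor_country": "GB"}\n'
    '\n'
    '{"visitor_uuid": "v2", "visitor_country": "DE"}\n'
)
entries = list(read_entries(sample))
print(len(entries), "entries read")
```

With a real data file, the same generator could be fed an open file object, e.g. `with open(path) as f: for entry in read_entries(f): ...`, so that only one entry is held in memory at a time.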
2. Views by country/continent: We want to analyse, for a given document, from which countries and continents the document has been viewed. The data should be displayed as a histogram of countries, i.e. counting the number of occurrences for each country in the input file.
(a) The application should take a string as input, which uniquely specifies a document (a document UUID), and return a histogram of countries of the viewers. The histogram can be displayed using matplotlib.
(b) Use the data you have collected in the previous task, group the countries by continent, and generate a histogram of the continents of the viewers. The histogram can be displayed using matplotlib.
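One possible way to approach tasks 2(a) and 2(b) is sketched below. It assumes that each entry is a dict with the keys `subject_doc_id` (the document UUID) and `visitor_country`; these key names are assumptions and should be checked against the data spec. The `CONTINENT_OF` table is a hypothetical, deliberately incomplete lookup for illustration only.

```python
from collections import Counter

# Hypothetical, incomplete country-code -> continent-code lookup.
CONTINENT_OF = {"GB": "EU", "DE": "EU", "US": "NA", "BR": "SA"}


def views_by_country(entries, doc_uuid):
    """Count views of one document per country."""
    return Counter(
        e["visitor_country"]
        for e in entries
        if e.get("subject_doc_id") == doc_uuid and "visitor_country" in e
    )


def views_by_continent(country_counts):
    """Group per-country counts by continent, reusing the result of 2(a)."""
    continents = Counter()
    for country, n in country_counts.items():
        continents[CONTINENT_OF.get(country, "unknown")] += n
    return continents


def show_histogram(counts, title):
    """Display counts as a bar chart (needs matplotlib and a display)."""
    import matplotlib.pyplot as plt

    plt.bar(list(counts.keys()), list(counts.values()))
    plt.title(title)
    plt.show()


# Made-up sample entries for illustration.
entries = [
    {"subject_doc_id": "doc-a", "visitor_country": "GB"},
    {"subject_doc_id": "doc-a", "visitor_country": "DE"},
    {"subject_doc_id": "doc-a", "visitor_country": "GB"},
    {"subject_doc_id": "doc-b", "visitor_country": "US"},
]
countries = views_by_country(entries, "doc-a")
continents = views_by_continent(countries)
print(dict(countries), dict(continents))
```

`show_histogram` is defined but not called here, so the sketch runs without a graphical display; calling `show_histogram(countries, "Views by country")` would open the plot.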
3. Views by browser: In this task we want to identify the most popular browser. To this end, the application has to examine the visitor useragent field and count the number of occurrences for each value in the input file.
(a) The application should return and display a histogram of all browser identifiers of the viewers.
(b) In the previous task, you will see that the browser strings are very verbose, distinguishing browsers by, e.g., version and OS used. Process the input of the above task, so that only the main browser name is used to distinguish them (e.g. Mozilla), and again display the result as a histogram.
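The two browser tasks can be sketched as follows. This is one simple interpretation, assuming the user agent is stored under the key `visitor_useragent` (an assumed spelling) and that the "main browser name" is the text before the first `/` in the string, which yields `Mozilla` for typical user-agent strings. Other interpretations, such as extracting the actual browser product, are equally defensible.

```python
from collections import Counter


def browser_name(useragent):
    """Return the leading product name, e.g. 'Mozilla' from 'Mozilla/5.0 ...'."""
    return useragent.split("/", 1)[0].strip()


def views_by_browser(entries, simplify=False):
    """Count user-agent strings; with simplify=True, count main names only."""
    agents = (e["visitor_useragent"] for e in entries if "visitor_useragent" in e)
    if simplify:
        agents = (browser_name(a) for a in agents)
    return Counter(agents)


# Made-up sample entries for illustration.
entries = [
    {"visitor_useragent": "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36"},
    {"visitor_useragent": "Mozilla/5.0 (X11; Linux x86_64) Gecko/20100101 Firefox/30.0"},
    {"visitor_useragent": "Opera/9.80 (Windows NT 6.1) Presto/2.12"},
]
full = views_by_browser(entries)
short = views_by_browser(entries, simplify=True)
print(len(full), "distinct strings;", dict(short))
```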
4. Reader profiles: In order to develop a readership profile for the site, we want to identify the most avid readers. We want to determine, for each user, the total time spent reading documents. The top 10 readers, based on this analysis, should be printed.
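A reader profile of this kind amounts to summing a per-event reading time for each visitor and taking the ten largest totals. The sketch below assumes that reading time is stored under `event_readtime` and the reader under `visitor_uuid`; both key names are assumptions to be checked against the data spec, and entries lacking either field are skipped.

```python
from collections import Counter


def top_readers(entries, n=10):
    """Return up to n (visitor_uuid, total_readtime) pairs, largest first."""
    totals = Counter()
    for e in entries:
        if "visitor_uuid" in e and "event_readtime" in e:
            totals[e["visitor_uuid"]] += e["event_readtime"]
    return totals.most_common(n)


# Made-up sample entries for illustration.
entries = [
    {"visitor_uuid": "v1", "event_readtime": 1200},
    {"visitor_uuid": "v2", "event_readtime": 800},
    {"visitor_uuid": "v1", "event_readtime": 300},
    {"visitor_uuid": "v3"},  # no reading time recorded, ignored
]
for visitor, total in top_readers(entries):
    print(visitor, total)
```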
5. “Also likes” functionality: Popular document-hosting web sites, such as Amazon, provide information about related documents based on document tracking information. One such feature is the “also likes” functionality: for a given document, identify which other documents have been read by this document’s readers. The idea is that, without examining the detail of either document, the information that both documents have been read by the same reader relates two documents with each other.
Figure 1 gives an example of this functionality. In this task, you should write a function that generates such an “other readers of this document also like” list, which is parametrised over the function to determine the order in the list of documents. Display the top 10 documents, which are “liked” by other readers.
(a) Implement a function that takes a document UUID and returns all visitor UUIDs of readers of that document.
(b) Implement a function that takes a visitor UUID and returns all document UUIDs that have been read by this visitor.
(c) Using the two functions above, implement a function to implement the “also like” functionality, which takes as parameters the above document UUID and (optionally) visitor UUID, and additionally a sorting function on documents. The function should return a list of “liked” documents, sorted by the sorting function parameter. Note: the implementation of this function must not fix how documents are sorted, and must use the sorting function parameter instead.
(d) Use this function to produce an “also like” list of documents, using a sorting function based on the number of readers of the same document. Provide a document UUID and visitor UUID as input and produce a list of top 10 document UUIDs as a result.
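Steps (a) to (d) can be sketched as below. This is an illustration under several assumptions, not a reference solution: the keys `subject_doc_id` and `visitor_uuid` are assumed field names, the sorting function is assumed to receive a mapping from document UUID to reader count and to return an ordered list, and the optional visitor is assumed to be excluded from the set of readers considered. The function names are hypothetical.

```python
from collections import Counter


def readers_of(entries, doc_uuid):
    """(a) All visitor UUIDs that have read the given document."""
    return {
        e["visitor_uuid"]
        for e in entries
        if e.get("subject_doc_id") == doc_uuid and "visitor_uuid" in e
    }


def documents_read_by(entries, visitor_uuid):
    """(b) All document UUIDs read by the given visitor."""
    return {
        e["subject_doc_id"]
        for e in entries
        if e.get("visitor_uuid") == visitor_uuid and "subject_doc_id" in e
    }


def also_likes(entries, doc_uuid, sort_fn, visitor_uuid=None):
    """(c) Documents read by other readers of doc_uuid, ordered by sort_fn."""
    counts = Counter()
    for reader in readers_of(entries, doc_uuid):
        if reader == visitor_uuid:
            continue  # assumption: the input visitor is not counted
        for doc in documents_read_by(entries, reader):
            if doc != doc_uuid:
                counts[doc] += 1
    return sort_fn(counts)


def by_reader_count(counts):
    """(d) Most shared readers first; ties broken by document UUID."""
    return sorted(counts, key=lambda d: (-counts[d], d))


# Made-up sample entries for illustration.
entries = [
    {"visitor_uuid": "v1", "subject_doc_id": "doc-a"},
    {"visitor_uuid": "v1", "subject_doc_id": "doc-b"},
    {"visitor_uuid": "v2", "subject_doc_id": "doc-a"},
    {"visitor_uuid": "v2", "subject_doc_id": "doc-c"},
    {"visitor_uuid": "v2", "subject_doc_id": "doc-d"},
    {"visitor_uuid": "v3", "subject_doc_id": "doc-a"},
    {"visitor_uuid": "v3", "subject_doc_id": "doc-c"},
    {"visitor_uuid": "v4", "subject_doc_id": "doc-e"},
]
top10 = also_likes(entries, "doc-a", by_reader_count, visitor_uuid="v1")[:10]
print(top10)
```

Because the ordering lives entirely in `sort_fn`, a different ranking (for example by total reading time) can be swapped in without touching `also_likes`, which is the point of the note in (c).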
4 Submission
You must submit the complete project files, containing the source code, a stand-alone executable, and the report (in .pdf format) as one .zip file no later than the submission deadline. Additionally, a screencast or video of running the application, with an explanation as voice-over audio, needs to be submitted to Canvas. This is mandatory, and without the screencast or video the submission is incomplete and may be marked as 0 points.
The standard penalty of 30% of the maximum available mark applies to late submissions. No submissions will be accepted after 5 working days beyond the submission deadline.
The main function driving the application should be called cw2, as discussed in “Command-line Usage” above. Submission must be through Canvas, submitting all of the above files in one .zip file.
This coursework is worth 50% of the module’s mark. You will be marked on your application, the structure, code and comments used, your testing, your report, and the screencast/video demonstrating your running application. The marking scheme for this project is attached.