
Building a Vertical Search Engine and Document Clustering System


Task 1: Vertical Search Engine

Module Learning Outcomes 1-5:

1. Demonstrate a sound knowledge of information retrieval principles

2. Apply the main data structures used in index construction in Python or a similar high-level language

3. Implement a typical web crawler and query processor, in Python or a similar high-level language

4. Acquire knowledge and skills to apply common machine learning methods for text classification and document clustering

Build the outline of a minimum viable vertical search engine for text retrieval

There are two tasks in this coursework. You can use any general-purpose programming language of your choice to perform these tasks; however, Python is recommended. The tasks are specified next.

1. Develop a vertical search engine similar to Google Scholar, but specialised to retrieve only papers/books published by a member of the School of Life Sciences (SLS) at Coventry University:

https://pureportal.coventry.ac.uk/en/organisations/school-of-life-sciences

That is, at least one of the co-authors is a member of SLS.

2. Your system crawls the relevant web pages and retrieves information about all available publications in a Breadth-First Search (BFS) manner. For each publication, it extracts available data (such as authors, publication year, and title) and the links to both the publication page and the author’s profile (also called “pureportal” profile) page.

3. Make sure that your crawler is polite, i.e. it obeys the robots.txt rules and does not hit the servers unnecessarily or too fast.

4. Because this information changes infrequently, your crawler may be scheduled to look for new information, say, once per week, but it should ideally be able to do so automatically, as a scheduled task. Every time it runs, it should update the index with the new data.

5. Make sure you apply the required pre-processing tasks to both the crawled data and the users’ queries.

6. From the user’s point of view, your system has an interface similar to the Google Scholar main page, where the user can type in queries/keywords about the resources they want to find. Your system then displays the results, sorted by relevance, in a similar way to Google Scholar. However, the search results are restricted to publications by SLS members only.

7. NOTE: You must show in your report and viva that your system is accurate by trying various queries. For example, you must use both short and long queries, both with and without stop words, queries with various keywords and more challenging queries to prove the robustness of your system.
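The BFS and politeness requirements in points 2 and 3 can be sketched as follows. This is a minimal illustration, not a reference solution: the function name `bfs_crawl`, the `sls-crawler` user agent, the `fetch_links` callback and the toy `SITE` dictionary are all assumptions made for the example. A real crawler would download pages over HTTP, load robots.txt from the portal, and use a non-zero delay.

```python
from collections import deque
from urllib.parse import urljoin, urlparse
from urllib.robotparser import RobotFileParser
import time


def bfs_crawl(seed, fetch_links, robots, delay=0.0, max_pages=100,
              agent="sls-crawler"):
    """Visit pages level by level, skipping URLs that robots.txt disallows.

    fetch_links(url) is a caller-supplied function returning the hrefs found
    on a page, so the traversal logic stays independent of the HTTP library.
    """
    host = urlparse(seed).netloc
    queue = deque([seed])          # FIFO frontier gives breadth-first order
    seen = {seed}
    order = []
    while queue and len(order) < max_pages:
        url = queue.popleft()
        if not robots.can_fetch(agent, url):
            continue               # disallowed by robots.txt: never requested
        order.append(url)
        for href in fetch_links(url):
            link = urljoin(url, href)
            # Stay on the seed's host and enqueue each URL only once.
            if urlparse(link).netloc == host and link not in seen:
                seen.add(link)
                queue.append(link)
        time.sleep(delay)          # politeness pause between requests
    return order


# Toy site standing in for real pages (hypothetical URLs).
SITE = {
    "https://example.org/": ["/a", "/b"],
    "https://example.org/a": ["/c", "/private/x"],
    "https://example.org/b": ["/c", "https://other.org/"],
    "https://example.org/c": [],
    "https://example.org/private/x": [],
}

robots = RobotFileParser()
robots.parse(["User-agent: *", "Disallow: /private/"])

visited = bfs_crawl("https://example.org/", lambda u: SITE.get(u, []), robots)
print(visited)
```

The printed order lists the seed first, then its direct links, then pages two hops away, which is the kind of evidence of BFS behaviour the report asks for. The disallowed `/private/x` page and the off-site link never appear.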

Task 2: Document Clustering

Develop a document clustering system.

First, collect a number of documents that belong to different categories, namely sport, business and science. Each document should be at least one sentence long (longer is usually better). The total number of documents is up to you but should be at least 100 (more is usually better). You may collect these documents from publicly available websites such as BBC News, but make sure you respect their copyright and terms of use and clearly cite them in your work. You may simply copy and paste such texts manually; writing an RSS feed reader/crawler to do it automatically is NOT mandatory.

Once you have collected sufficient documents, cluster them using a standard clustering method (e.g. K-means).

Finally, use the created model to assign a new document to one of the existing clusters. That is, the user enters a document (e.g. a sentence) and your system outputs the right cluster.
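The two steps above (cluster the collection, then place a new document) might look like the sketch below. It is a small pure-Python illustration of K-means (Lloyd's algorithm) over TF-IDF vectors. The six toy documents, the helper names and the fixed `seeds` are assumptions chosen to keep the example deterministic. A real submission would use at least 100 documents, and could equally rely on a library implementation with random or k-means++ initialisation.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    return [w for w in text.lower().split() if w.isalpha()]


def fit_idf(docs):
    """Inverse document frequency; the +1 keeps every term's weight above zero."""
    df = Counter(t for d in docs for t in set(tokenize(d)))
    return {t: math.log(len(docs) / n) + 1.0 for t, n in df.items()}


def vectorize(text, idf):
    """Unit-length TF-IDF vector as a sparse dict; unknown words are dropped."""
    counts = Counter(t for t in tokenize(text) if t in idf)
    vec = {t: c * idf[t] for t, c in counts.items()}
    norm = math.sqrt(sum(v * v for v in vec.values())) or 1.0
    return {t: v / norm for t, v in vec.items()}


def sq_dist(a, b):
    return sum((a.get(t, 0.0) - b.get(t, 0.0)) ** 2 for t in set(a) | set(b))


def nearest(vec, centroids):
    return min(range(len(centroids)), key=lambda i: sq_dist(vec, centroids[i]))


def mean(vectors):
    total = defaultdict(float)
    for vec in vectors:
        for t, v in vec.items():
            total[t] += v
    return {t: v / len(vectors) for t, v in total.items()}


def kmeans(vectors, seeds, max_iter=20):
    """Lloyd's algorithm; `seeds` are indices of the initial centroids."""
    centroids = [dict(vectors[i]) for i in seeds]
    labels = None
    for _ in range(max_iter):
        new_labels = [nearest(v, centroids) for v in vectors]
        if new_labels == labels:       # assignments stable: converged
            break
        labels = new_labels
        for c in range(len(centroids)):
            members = [v for v, l in zip(vectors, labels) if l == c]
            if members:                # keep the old centroid if a cluster empties
                centroids[c] = mean(members)
    return centroids, labels


# Toy corpus (made up for illustration): two documents per category.
DOCS = [
    "football match goal striker",
    "football league goal win",
    "stock market shares profit",
    "market shares bank profit",
    "physics experiment particle laboratory",
    "experiment particle telescope laboratory",
]

idf = fit_idf(DOCS)
vectors = [vectorize(d, idf) for d in DOCS]
centroids, labels = kmeans(vectors, seeds=[0, 2, 4])
print(labels)

new_doc = "the striker scored a late goal"
print(nearest(vectorize(new_doc, idf), centroids))
```

Cluster numbers are arbitrary identifiers. Mapping them to names such as sport, business or science requires inspecting the members of each cluster.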

NOTE: You must show in your report and viva that your system suggests the right cluster for a variety of inputs, e.g. short and long inputs, those with and without stop words, inputs on different topics, as well as more challenging inputs, to show the system is robust enough.

Requirements and Markings

Construction of inverted index (Task 1), 15 marks: The index must be built on appropriate data structures studied in the module, as opposed to naive database tables. The index should be updated incrementally whenever new data are received from the crawler component. No marks are awarded for this component if Elastic Search is used.

Fully working crawler component (Task 1), 25 marks: Must completely crawl the required web pages in a BFS manner and find all the publications by all the SLS members. It should be scheduled to re-crawl automatically to extract new data. Must be polite by obeying the robots.txt rules and not hitting the servers too fast. It must extract enough data about each publication (at least author(s), title, publication year and the links to the publication and the SLS author pages). Should apply appropriate pre-processing tasks before passing the data to the indexer.

Fully working query processor component (Task 1), 25 marks: Must display results relevant to the given queries. The accuracy and robustness of the system must be proved in the report (via screenshots) and the viva by trying the system on numerous and varied input queries. For example, appropriate inputs should be used to prove that the system properly performs the pre-processing tasks (such as stop-word removal and stemming) and ranked retrieval.

Fully working document clustering component (Task 2), 25 marks: A standard clustering algorithm such as K-means must be used. In addition, the learned model must be used to identify the right cluster (e.g. sport, business or science) for a given input. Various inputs must be used both in the report (via screenshots) and the viva to show that the system is accurate and robust.

Overall usability (Tasks 1-2), 10 marks: Acceptable response time, a clear interface, and anything else that might affect the usability of the systems.

1. Based on the above marking scheme, a typical candidate for a mark of 40 is a working search engine which accepts users’ queries/keywords and displays some partially correct results, without a proper index and with no document clustering (Task 2) component. Alternatively, Task 2 may be properly accomplished while the search engine is not fully working because the query processor and indexer are inadequate, although the crawler works reasonably well.

2. The expected candidate for 70 or more is a fully working search engine with reasonable accuracy and speed. This means that the system contains fully working crawler and query processor components. In addition, it must have at least one, and preferably both, of the other two components, i.e. the inverted index (without using Elastic Search) and the document clustering component, fully working. If Task 2 is missing, then the rest must be perfect. For example, the inverted index must be fully implemented (without using Elastic Search) and be updated incrementally. Alternatively, if Task 2 is perfect, then the inverted index may be implemented using Elastic Search, in which case the output of Elastic Search must be reformatted to give more readable results for the user.


Other marks are possible based on the above marking scheme table.

To show that your system meets each of the above-mentioned requirements, your report must provide sufficient evidence, including a clear description, complete source code, and complete screenshots where applicable. Your viva must also demonstrate the fully working systems by trying numerous and varied inputs. See Appendix 1 for items to cover in your report and viva.

Notes:

1. You are expected to use the University APA style for referencing. For support and advice on this, students can contact the Centre for Academic Writing (CAW).

2. Please notify your registry course support team and module leader for disability support.

3. Any student requiring an extension or deferral should follow the university's process for extensions and deferrals.

4. The University cannot take responsibility for any coursework lost or corrupted on disks, laptops or personal computers. Students should therefore regularly back up any work and are advised to save it on the University system.

5. If there are technical or performance issues that prevent students submitting coursework through the online coursework submission system on the day of a coursework deadline, an appropriate extension to the coursework submission deadline will be agreed. This extension will normally be 24 hours or the next working day if the deadline falls on a Friday or over the weekend period. This will be communicated via your Module Leader.

6. You are encouraged to check the originality of your work by using the draft Turnitin links on Aula.

7. Collusion between students (where sections of your work are similar to the work submitted by other students in this or previous module cohorts) is taken extremely seriously and will be reported to the academic conduct panel. This applies to both coursework and exam answers.

8. A marked difference between your writing style, knowledge and skill level demonstrated in class discussion, any test conditions and that demonstrated in a coursework assignment may result in you having to undertake a Viva Voce in order to prove the coursework assignment is entirely your own work.

9. If you make use of the services of a proof reader in your work you must keep your original version and make it available as a demonstration of your written efforts.

10. You must not submit work for assessment that you have already submitted (partially or in full), either for your current course or for another qualification of this university, with the exception of resits, where, for the coursework, you may be asked to rework and improve a previous attempt. This requirement will be specifically detailed in your assignment brief or specific course or module information. Where earlier work by you is citable, i.e. it has already been published/submitted, you must reference it clearly. Identical pieces of work submitted concurrently may also be considered to be self-plagiarism.

Appendix 1. Items to cover in your report and Video

Part 1 – Search engine

Crawler

1.1. Number of staff whose publications are crawled (approximately) and the maximum number of publications per staff member

1.2. Information collected about each publication (e.g. links, title, year, author or any additional part)

1.3. Which pre-processing tasks are performed before passing data to Indexer/Elastic Search

1.4. When the crawler operates, e.g. scheduled or run manually

1.5. Whether a BFS strategy was used. If yes, provide evidence.

1.6. Brief explanation of how it works
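As an illustration of the pre-processing mentioned in item 1.3, the sketch below lowercases text, tokenises it, removes stop words and reduces words to a crude stem. The stop-word list is a tiny hand-picked sample, and `naive_stem` is a deliberately simple suffix stripper standing in for a real stemmer such as Porter's. Both are assumptions for the example. Whatever pipeline is chosen, it should be applied to crawled data and to user queries alike, so that both map to the same index terms.

```python
import re

# Tiny illustrative stop-word list; a real system would use a fuller one.
STOP_WORDS = {"a", "an", "and", "are", "for", "in", "is", "of", "on", "the",
              "to", "with"}

# Longest suffixes first so "ing" is tried before "s".
SUFFIXES = ("ing", "ed", "es", "s")


def naive_stem(word):
    """Strip one common suffix, keeping at least three leading characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(text):
    """Lowercase, tokenise, drop stop words, then stem what remains."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [naive_stem(t) for t in tokens if t not in STOP_WORDS]


print(preprocess("The effects of ageing on muscles"))
print(preprocess("Crawling crawled crawls"))
```

The second call shows three inflected forms collapsing to a single term, which is the behaviour a query such as "crawling" relies on to match a document containing "crawled".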

Indexer

2.1. Whether you implemented the index or used Elastic Search (note that if Elastic Search is used you will lose the 15 marks for index construction, but the project becomes easier).

2.2. If you implemented it, which data structure is used (for example, incidence matrix or inverted index)

2.3. If you implemented it, whether it is incremental, i.e. it grows and gets updated over the time, or it is constructed from scratch every time your crawler is run

2.4. If you implemented it, show some part of its content (e.g. the constructed dictionary).

2.5. Brief explanation of how it works
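One possible shape for an incrementally updated inverted index (items 2.2 and 2.3) is sketched below. The class name `InvertedIndex`, its methods and the `pub/1` style document ids are hypothetical. The tokens are assumed to be already pre-processed. Persisting the index to disk between crawler runs is left out.

```python
from collections import defaultdict


class InvertedIndex:
    def __init__(self):
        self.postings = defaultdict(dict)   # term -> {doc_id: term frequency}
        self.doc_terms = {}                 # doc_id -> set of terms it contains

    def add(self, doc_id, tokens):
        """Insert or refresh one document without rebuilding the whole index."""
        if doc_id in self.doc_terms:
            self.remove(doc_id)             # re-crawled page: drop stale postings
        self.doc_terms[doc_id] = set(tokens)
        for term in tokens:
            self.postings[term][doc_id] = self.postings[term].get(doc_id, 0) + 1

    def remove(self, doc_id):
        for term in self.doc_terms.pop(doc_id):
            del self.postings[term][doc_id]
            if not self.postings[term]:     # no documents left for this term
                del self.postings[term]

    def lookup(self, term):
        """Sorted list of ids of the documents containing the term."""
        return sorted(self.postings.get(term, {}))


index = InvertedIndex()

# First crawler run.
index.add("pub/1", ["muscl", "age", "exercis"])
index.add("pub/2", ["age", "diet"])
first_run = index.lookup("age")

# A later run finds one new publication and one changed record.
index.add("pub/3", ["diet", "cancer"])
index.add("pub/1", ["muscl", "recoveri"])

print(first_run)
print(index.lookup("age"), index.lookup("diet"))
```

Keeping the set of terms per document is what makes an update cheap: only the postings lists that mention the changed document are touched.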

Query processor

3.1. Which pre-processing tasks are applied to a given query

3.2. Do you only support Boolean queries (using AND, OR, NOT, etc.) or accept keywords like Google does (without any need for AND, OR, NOT etc.)

3.3. If Elastic Search is used, how you convert a user query to an appropriate query for Elastic Search

3.4. If Elastic Search is NOT used, whether or not you perform ranked retrieval; if yes, specify whether or not you used vector space and the method used to calculate the ranks

3.5. Demonstration of the running system (use screenshots in your report and run your software in your viva). You must run your system on numerous and varied input queries to prove the accuracy and robustness of your system. For example, you must use appropriate queries to prove your system performs stop-word removal, stemming and ranked retrieval.

3.6. Brief explanation of how it works
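For item 3.4, one common way to do ranked retrieval in the vector space model is to score documents by the cosine similarity of TF-IDF vectors. The sketch below is one such scheme under stated assumptions: logarithmic term frequency, idf = ln(N/df), hypothetical function names, and toy documents whose tokens are already pre-processed. It is not the only acceptable weighting.

```python
import math
from collections import Counter, defaultdict


def build_postings(docs):
    """docs maps doc_id -> list of already pre-processed tokens."""
    postings = defaultdict(dict)          # term -> {doc_id: term frequency}
    for doc_id, tokens in docs.items():
        for term, tf in Counter(tokens).items():
            postings[term][doc_id] = tf
    return postings


def rank(query_tokens, postings, n_docs):
    """Return (doc_id, score) pairs, best match first."""
    def weight(tf, term):
        # Log-scaled term frequency times inverse document frequency.
        return (1 + math.log(tf)) * math.log(n_docs / len(postings[term]))

    norms = defaultdict(float)            # squared length of each document vector
    for term, plist in postings.items():
        for doc_id, tf in plist.items():
            norms[doc_id] += weight(tf, term) ** 2

    scores = defaultdict(float)
    for term, qtf in Counter(query_tokens).items():
        if term not in postings:          # unseen query terms contribute nothing
            continue
        for doc_id, tf in postings[term].items():
            scores[doc_id] += weight(qtf, term) * weight(tf, term)

    # The query's own length is the same for every document, so dividing by it
    # would not change the order and is skipped.
    ranked = [(d, s / math.sqrt(norms[d])) for d, s in scores.items() if s > 0]
    return sorted(ranked, key=lambda pair: pair[1], reverse=True)


# Toy collection (made up for illustration).
DOCS = {
    "d1": ["muscl", "age", "exercis", "muscl"],
    "d2": ["age", "diet", "health"],
    "d3": ["cancer", "cell", "therapi"],
}

postings = build_postings(DOCS)
results = rank(["muscl", "age"], postings, len(DOCS))
print([doc_id for doc_id, _ in results])
```

Here d1 matches both query terms, including the rarer one, so it ranks above d2. The document d3 shares no term with the query and is not returned at all.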

Any other important points you may want to mention, including any restrictions, extras or issues

Part 2 – Document clustering

How the input documents are collected, and how many

Which document clustering method (e.g. K-means with appropriate K value) has been used and how its performance is measured

Which type of clustering is used (hierarchical/flat and hard/soft)

Screenshot and demonstration of its accuracy and robustness for numerous and various inputs

(Optional) any other important point you may want to mention
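Since the categories of the collected documents are known in advance, one simple way to measure clustering performance is purity: the fraction of documents that fall in a cluster whose majority category matches their own. The brief does not prescribe a metric, so the choice of purity, the function name and the sample data below are all assumptions for illustration.

```python
from collections import Counter, defaultdict


def purity(cluster_ids, true_labels):
    """Share of documents belonging to the majority category of their cluster."""
    groups = defaultdict(list)
    for cluster, label in zip(cluster_ids, true_labels):
        groups[cluster].append(label)
    majority_total = sum(Counter(labels).most_common(1)[0][1]
                         for labels in groups.values())
    return majority_total / len(true_labels)


# Made-up assignments for eight documents.
clusters = [0, 0, 0, 1, 1, 2, 2, 2]
labels = ["sport", "sport", "business", "business", "business",
          "science", "science", "sport"]

print(purity(clusters, labels))
```

Purity rises trivially as the number of clusters grows, so it is most informative when K is fixed to the number of known categories, three in this task.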

Present the following as an appendix in your report and show them in your video

Complete source code

The documents used in Part 2

Note: Make sure you provide enough evidence (screenshots in your report and demonstration in your viva) for your fully-working systems using various non-trivial inputs.
