CS2040S Data Structures and Algorithms
Question:
Data Algorithms
Background
In this assignment, your task is to implement a simple search engine using a simplified version of the well-known Weighted PageRank algorithm. You should start by reading the Wikipedia article on PageRank (read up to the section Damping Factor). Later, I will release a video lecture introducing this assignment and discussing the required topics. The main focus of this assignment is to build a graph structure from a set of web pages, calculate Weighted PageRanks, and rank pages. To make things easier for you, you don't need to spend time crawling, collecting and parsing web pages for this assignment. Instead, you will be provided with a collection of mock web pages in the form of plain text files. Each web page has two sections:
• Section 1 contains URLs representing outgoing links. The URLs are separated by whitespace, and may be spread across multiple lines.
• Section 2 contains the actual content of the web page, and consists of one or more words. Words are separated by whitespace, and may be spread across multiple lines.
For example, Section 2 of a page might contain the following text:

Mars has long been the subject of human interest. Early telescopic observations revealed color changes on the surface that were attributed to seasonal vegetation, and apparent linear features were ascribed to intelligent design.
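Putting the two sections together, a complete page file could look like the sketch below. The section delimiter lines and URL names here are illustrative assumptions, not a guaranteed format; check the provided test files for the exact markers your parser must recognise.

    #start Section-1
    url2 url34 url1 url26
    url52 url21
    #end Section-1

    #start Section-2
    Mars has long been the subject of human interest. Early telescopic
    observations revealed color changes on the surface ...
    #end Section-2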
Each test directory contains the following files:
• collection.txt: a file that contains a list of relevant URLs (for Part 1)
• pagerankList.exp: a file that contains the expected result of Part 1 for this collection of URLs
• invertedIndex.txt: a file containing the inverted index for this collection of URLs (for Part 2)
• log.txt: a file that contains information that may be useful for debugging Part 1
First, compile the original version of the files using the make command. You'll see that it produces three executables: pagerank, searchPagerank and scaledFootrule. It also copies these executables to the ex1, ex2 and ex3 directories, because when you test your program, you will need to change into one of these directories first. Note that in this assignment, you are permitted to create as many supporting files as you like. This allows you to compartmentalise your solution into different files. For example, you could implement a Graph ADT in Graph.c and Graph.h (a possible interface is sketched below). To ensure that these files are actually included in the compilation, you will need to edit the Makefile to include these supporting files; the provided Makefile contains instructions on how to do this.
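For instance, a minimal directed-graph interface might look like the following sketch. Every name here (Graph, GraphNew, GraphInsertEdge, and so on) is an illustrative assumption rather than a required API, and the underlying representation (adjacency matrix or adjacency list) is left entirely up to you.

    // Graph.h - minimal directed-graph interface (illustrative sketch)
    #ifndef GRAPH_H
    #define GRAPH_H

    #include <stdbool.h>

    typedef struct graph *Graph;

    Graph GraphNew(int nVertices);                // vertices are 0 .. nVertices-1
    void  GraphFree(Graph g);                     // release all memory held by g
    void  GraphInsertEdge(Graph g, int v, int w); // add a directed edge v -> w
    bool  GraphIsAdjacent(Graph g, int v, int w); // true if the edge v -> w exists
    int   GraphOutDegree(Graph g, int v);         // number of outgoing links from v
    int   GraphInDegree(Graph g, int v);          // number of incoming links to v

    #endif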
To get the name of the file that contains the page associated with some URL, simply append .txt to the URL. For example, the page associated with the URL url24 is contained in url24.txt. For each URL in the collection file, you need to read its corresponding page and build a graph structure using a graph representation of your choice. Then, you must use the algorithm below to calculate the Weighted PageRank for each page. Your program in pagerank.c must take three command-line arguments: d (damping factor), diffPR (sum of PageRank differences), and maxIterations (maximum number of iterations), and, using the algorithm described in this section, calculate the Weighted PageRank for every web page in the collection. Each of the expected output files provided was generated using the above command.
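As a rough picture of the control flow, the skeleton below parses the three arguments and iterates until either the sum of PageRank differences drops below diffPR or maxIterations is reached. This is a sketch under stated assumptions: weightedSum() is a stand-in that treats every page as contributing equally (which reduces to ordinary PageRank) and must be replaced with the weighted update defined in this section, and nPages would come from the collection file rather than a constant.

    // pagerank.c - skeleton of the PageRank iteration (illustrative sketch)
    #include <math.h>
    #include <stdio.h>
    #include <stdlib.h>

    // Stand-in weight term: every page contributes equally, which reduces to
    // ordinary PageRank. Replace with the weighted formula from the spec.
    static double weightedSum(double *pr, int u, int nPages) {
        (void)u;
        double sum = 0.0;
        for (int v = 0; v < nPages; v++) sum += pr[v] / nPages;
        return sum;
    }

    int main(int argc, char *argv[]) {
        if (argc != 4) {
            fprintf(stderr, "usage: %s d diffPR maxIterations\n", argv[0]);
            return EXIT_FAILURE;
        }
        double d = atof(argv[1]);          // damping factor
        double diffPR = atof(argv[2]);     // convergence threshold
        int maxIterations = atoi(argv[3]); // iteration cap

        int nPages = 5; // stand-in; really the number of URLs in the collection
        double *pr = malloc(nPages * sizeof(double));
        double *next = malloc(nPages * sizeof(double));
        for (int i = 0; i < nPages; i++) pr[i] = 1.0 / nPages; // uniform start

        double diff = diffPR; // guarantees at least one iteration
        for (int iter = 0; iter < maxIterations && diff >= diffPR; iter++) {
            diff = 0.0;
            for (int u = 0; u < nPages; u++) {
                next[u] = (1.0 - d) / nPages + d * weightedSum(pr, u, nPages);
                diff += fabs(next[u] - pr[u]); // sum of PageRank differences
            }
            for (int u = 0; u < nPages; u++) pr[u] = next[u];
        }

        for (int u = 0; u < nPages; u++) printf("url%d, %.7lf\n", u, pr[u]);
        free(pr);
        free(next);
        return EXIT_SUCCESS;
    }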
Your program should output a list of URLs, one per line, to a file named pagerankList.txt, along with some information about each URL. Each line should begin with a URL, followed by its outdegree and Weighted PageRank (using the format string "%.7lf"). The values should be comma-separated, with a space after each comma. Lines should be sorted in descending order of Weighted PageRank. URLs with the same Weighted PageRank should be ordered alphabetically (ascending) by URL. Here is an example pagerankList.txt. Note that this was not generated from any of the provided test files; it is just an example to demonstrate the output format.
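One way to produce this ordering is to collect a record per page and sort with a qsort comparator before writing the file, as in the sketch below. The PageInfo struct and its field names are assumptions, and the sample values in main() are made up purely to exercise the tie-breaking rule.

    // Writing pagerankList.txt (illustrative sketch)
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    typedef struct {       // assumed per-page record
        char url[100];
        int outdegree;
        double pagerank;
    } PageInfo;

    // Descending by Weighted PageRank; ties broken alphabetically by URL.
    static int cmpPage(const void *a, const void *b) {
        const PageInfo *p = a, *q = b;
        if (p->pagerank > q->pagerank) return -1;
        if (p->pagerank < q->pagerank) return 1;
        return strcmp(p->url, q->url);
    }

    int main(void) {
        PageInfo pages[] = {
            {"url31", 3, 0.2623546},
            {"url21", 1, 0.0956542},
            {"url11", 2, 0.0956542},   // ties with url21, so url11 sorts first
        };
        int nPages = 3;

        qsort(pages, nPages, sizeof(PageInfo), cmpPage);

        FILE *out = fopen("pagerankList.txt", "w");
        for (int i = 0; i < nPages; i++) {
            fprintf(out, "%s, %d, %.7lf\n",
                    pages[i].url, pages[i].outdegree, pages[i].pagerank);
        }
        fclose(out);
        return 0;
    }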