IR Phase 1 Project Requirements

The overall project for this semester is to simulate a search engine over a collection (“corpus”) of documents. This project will be divided into three phases. The requirements described here are for Phase 1.

Phase 1 is broken down into tasks whose results will also be used in subsequent phases. Each task must be uploaded to the Blackboard system by the syllabus deadline, in the corresponding assignment content folder for Phase 1:

IR.P1.Task#1) You will need to maintain your own corpus of documents for the semester. To do so, come up with 10 neutral (i.e., non-controversial) queries (for example: Who was the 16th U.S. President?) that you will submit to a search engine of your choice. Upload these ten queries to the Blackboard system for IR.P1.Task#1.

IR.P1.Task#2) Then download the first 20 (non-controversial) webpage (HTML) results that the search engine returns for each of the 10 queries. This is done manually, one page at a time; fortunately, you only have to do it once. There will be a total of 200 HTML files. (We will discuss shortly in class how to process these using the Java Regex package. You may NOT use 3rd party code. You MUST write your own. You do not necessarily need regex, but it does provide much more concise code.) Place all 200 HTML files in a directory named Corpus and compress/zip the entire directory. Upload this to the Blackboard system for IR.P1.Task#2. YOU SHOULD ASSUME (IN GENERAL) THAT NO CREDIT WILL BE GIVEN FOR SHARED FILES OR LINKS TO FILES. HOWEVER, FOR THIS TASK, if the Blackboard system limits do not allow you to upload this compressed file, then you can store it in cloud storage and upload a secure link to that file to the Blackboard system.
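As a rough illustration of the regex-based processing mentioned above, the following sketch strips markup from an HTML string and splits the remaining text into lower-cased words using only java.util.regex. The class name HtmlTokenizerSketch, the method name tokenize, and the definition of a "word" as a run of ASCII letters or digits are all assumptions made for this example, not requirements of the project.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper; the assignment does not prescribe class or method names.
class HtmlTokenizerSketch {

    // <script> and <style> elements, including their contents (case-insensitive, multi-line).
    private static final Pattern SCRIPT_OR_STYLE =
            Pattern.compile("(?is)<(script|style)\\b.*?</\\1\\s*>");

    // HTML comments and any remaining tags.
    private static final Pattern TAG = Pattern.compile("(?s)<!--.*?-->|<[^>]*>");

    // Assumption: a word is a run of ASCII letters or digits.
    private static final Pattern WORD = Pattern.compile("[A-Za-z0-9]+");

    // Returns the words of the visible text in document order, lower-cased.
    static List<String> tokenize(String html) {
        String text = SCRIPT_OR_STYLE.matcher(html).replaceAll(" ");
        text = TAG.matcher(text).replaceAll(" ");
        List<String> words = new ArrayList<>();
        Matcher matcher = WORD.matcher(text);
        while (matcher.find()) {
            words.add(matcher.group().toLowerCase(Locale.ROOT));
        }
        return words;
    }
}
```

This sketch does not decode HTML entities, so an input such as &amp; would yield the token amp; a fuller solution would need to handle such cases.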

IR.P1.Task#3) Identify a Stoplist (either download one or compute it yourself in separate code) and store it in a hash structure. Your program code must store the stopword list in a hash structure and be able to write that hash structure to an output text file. Upload the .java files necessary to accomplish this to the Blackboard system for IR.P1.Task#3.
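One possible shape for this task is sketched below, assuming the stoplist is a plain text file with one stopword per line. The class name StoplistSketch, the method names load and dump, and the choice of a HashSet as the hash structure are illustrative assumptions.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical class; names and file layout are assumptions for this sketch.
class StoplistSketch {

    // Reads one stopword per line into a hash-based set (lower-cased, blank lines skipped).
    static Set<String> load(Path stoplistFile) throws IOException {
        Set<String> stopwords = new HashSet<>();
        for (String line : Files.readAllLines(stoplistFile, StandardCharsets.UTF_8)) {
            String word = line.trim().toLowerCase(Locale.ROOT);
            if (!word.isEmpty()) {
                stopwords.add(word);
            }
        }
        return stopwords;
    }

    // Writes the set to a text file, sorted so that the output is reproducible.
    static void dump(Set<String> stopwords, Path outputFile) throws IOException {
        List<String> sorted = new ArrayList<>(stopwords);
        Collections.sort(sorted);
        Files.write(outputFile, sorted, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        // args[0]: stoplist file, args[1]: output file (placeholder positions, not project flags)
        Set<String> stopwords = load(Paths.get(args[0]));
        dump(stopwords, Paths.get(args[1]));
    }
}
```

A HashSet gives constant-time membership tests on average, which is the operation the indexing step needs most often.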

IR.P1.Task#4) In Java code, compute an Inverted Index that collectively stores information for the files that are part of the corpus. See the following links for an explanation of what an Inverted Index is (and what it is not, such as a forward index):

You are to use either Java hashmaps or hashtables for storing the inverted index of your corpus. (A separate email will provide tutorial links for hashtables.) What information should you store in the inverted index for each significant (i.e., non-stopword) word found in one of your documents? a) the word; b) the name of the document it was found in; c) a vector specifying, for each occurrence of the word in that document, how many words from the beginning of the document it was found (for this count, include even the stopwords). You need to do this for every word in every document that is not a stopword. Upload this Java code to the Blackboard system for IR.P1.Task#4.
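The three pieces of information listed above can be held in nested hash maps, as in the minimal sketch below. The class name InvertedIndexSketch, the method names, and the use of zero-based positions (the first word of a document is 0 words from its beginning) are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical class; the nesting word -> document -> positions is one possible layout.
class InvertedIndexSketch {

    // word -> (document name -> positions of the word, counted in words from the start)
    private final Map<String, Map<String, List<Integer>>> index = new HashMap<>();

    // Adds one document. Positions count every token, stopwords included,
    // but only non-stopwords receive an entry in the index.
    void addDocument(String docName, List<String> tokens, Set<String> stopwords) {
        for (int position = 0; position < tokens.size(); position++) {
            String word = tokens.get(position);
            if (stopwords.contains(word)) {
                continue;
            }
            index.computeIfAbsent(word, w -> new HashMap<>())
                 .computeIfAbsent(docName, d -> new ArrayList<>())
                 .add(position);
        }
    }

    // Returns document name -> positions for the word, or an empty map if it is not indexed.
    Map<String, List<Integer>> postings(String word) {
        return index.getOrDefault(word, Collections.emptyMap());
    }
}
```

For example, indexing the tokens the, president, was, the, president with the and was as stopwords records president at positions 1 and 4, while the and was get no entry.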

IR.P1.Task#5) The code for each phase has to be compiled using javac (the JDK compiler) and executed using the java command (the JDK runtime environment). Important names of files etc. will be provided on the command line of the java command using “flags.” Details about the usage of flags for this phase will be emailed to you separately and discussed below. Further code issues will be explained in a separate email on Command Line Parsing. Please note that ALL phases of this project will be run from the command line only. Upload the Java code that processes the project command line and its flags to the Blackboard system for IR.P1.Task#5.
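A minimal sketch of flag parsing follows. It assumes every argument has the shape -NAME=value, which matches the flag examples given in these requirements (such as -output=OutputFileName); the actual flag details are to be provided separately, so the class name FlagParserSketch and the strict rejection of other argument shapes are assumptions.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical parser; the real flag set is defined by the course, not by this sketch.
class FlagParserSketch {

    // Parses arguments of the form -NAME=value into a map keyed by NAME (without the dash).
    // Any argument that does not have this shape is rejected.
    static Map<String, String> parse(String[] args) {
        Map<String, String> flags = new HashMap<>();
        for (String arg : args) {
            int equalsAt = arg.indexOf('=');
            // equalsAt < 2 covers both a missing '=' and an empty flag name ("-=value").
            if (!arg.startsWith("-") || equalsAt < 2) {
                throw new IllegalArgumentException("Expected -NAME=value but got: " + arg);
            }
            flags.put(arg.substring(1, equalsAt), arg.substring(equalsAt + 1));
        }
        return flags;
    }
}
```

Under these assumptions, parse(new String[] {"-SEARCH=WORD", "-output=out.txt"}) yields a map in which SEARCH maps to WORD and output maps to out.txt.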

IR.P1.Task#6) You will need to demonstrate the ability to “query” your inverted index for such information as: a) does a specific word appear in any document? b) in how many documents (and which) does a given word appear? c) how many times (frequency) does a word appear in a given document? The project will need to create and utilize a -SEARCH flag in conjunction with an output flag indicating which file the output should go to: -output=OutputFileName

-SEARCH=WORD -- would search the Inverted Index for the given WORD and return the documents in which the word appears and how many times it appears in each of them.

-SEARCH=DOC   -- would search the Inverted Index for the given document and return all words found in that DOC, with how many times each appears in that document.

NOTE: For these commands, you may also need to pass other parameters via the command line using appropriately named flags.

Upload the Java code files that implement these functions to the Blackboard system for IR.P1.Task#6.
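The two kinds of search can be sketched as lookups over a nested map of the form word -> document name -> positions. That layout, the class name IndexQuerySketch, and the method names searchWord and searchDoc are assumptions for this example; the frequency of a word in a document is taken to be the number of recorded positions.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical query helpers over an index shaped as word -> document -> positions.
class IndexQuerySketch {

    // Word search: document name -> number of occurrences of the word in that document.
    // An empty result means the word appears in no document.
    static Map<String, Integer> searchWord(
            Map<String, Map<String, List<Integer>>> index, String word) {
        Map<String, Integer> counts = new TreeMap<>();
        Map<String, List<Integer>> postings = index.get(word);
        if (postings != null) {
            for (Map.Entry<String, List<Integer>> entry : postings.entrySet()) {
                counts.put(entry.getKey(), entry.getValue().size());
            }
        }
        return counts;
    }

    // Document search: word -> number of occurrences in the given document.
    // This scans every word in the index, since the index is keyed by word.
    static Map<String, Integer> searchDoc(
            Map<String, Map<String, List<Integer>>> index, String docName) {
        Map<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<String, Map<String, List<Integer>>> entry : index.entrySet()) {
            List<Integer> positions = entry.getValue().get(docName);
            if (positions != null) {
                counts.put(entry.getKey(), positions.size());
            }
        }
        return counts;
    }
}
```

The size of the map returned by searchWord answers "in how many documents", and its keys answer "which". The results are TreeMaps only so that output written to a file comes out in a stable, sorted order.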

IR.P1.Task#7) This task is predicated on Task#6 being completed. Demonstrate one example of a word search and one example of a doc search. You will upload three files to the Blackboard system for IR.P1.Task#7 (either individually or in one compressed/zipped file, but all three must be uploaded). The first file is Searches.txt, which describes these searches and gives the actual commands the user needs on the command line to run your project and perform them. In addition, upload the two output files corresponding to the two searches. (Use a different name for each output file.)

IR.P1.Task#8) The system should be able to print out the inverted index or other relevant information pertaining to a given document. The project will need to create and utilize a -PRINT flag in conjunction with an output flag indicating which file the output should go to: -output=OutputFileName

-PRINT_INDEX=WORD -- would print all the information contained in the Inverted Index for the given WORD into the output file. The exact format is left up to you, but it must contain all of the information.

-PRINT_INDEX=DOC   -- would print all the information contained in the Inverted Index for the given DOC into the output file. The exact format is left up to you, but it must contain all of the information.
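Since the output format is left open, the sketch below shows just one possibility for the WORD case: one tab-separated line per document, giving the word, the document name, the occurrence count, and the list of positions. The class name IndexPrinterSketch, the method names, and the index layout word -> document name -> positions are assumptions for this example.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical printer; the line format is one choice among many acceptable ones.
class IndexPrinterSketch {

    // Builds one line per document: word <TAB> document <TAB> count <TAB> positions.
    static List<String> formatWord(
            Map<String, Map<String, List<Integer>>> index, String word) {
        List<String> lines = new ArrayList<>();
        Map<String, List<Integer>> postings = index.get(word);
        if (postings == null) {
            lines.add(word + "\t(not in index)");
            return lines;
        }
        // Copy into a TreeMap so documents are listed in sorted order.
        for (Map.Entry<String, List<Integer>> entry : new TreeMap<>(postings).entrySet()) {
            List<Integer> positions = entry.getValue();
            lines.add(word + "\t" + entry.getKey() + "\t" + positions.size() + "\t" + positions);
        }
        return lines;
    }

    // Writes the lines to the file named by the output flag's value.
    static void writeTo(String outputFileName, List<String> lines) throws IOException {
        Files.write(Paths.get(outputFileName), lines, StandardCharsets.UTF_8);
    }
}
```

The DOC case could reuse the same line format by scanning all words and keeping only the entries for the requested document.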

You will need to upload the .java files that implement this feature, together with the demonstrated output files, to IR.P1.Task#8 on the Blackboard system. Also include a Print.txt file that describes which WORD and which DOC you chose to demonstrate this feature and gives the actual commands the user needs on the command line to run your project and produce these printed outputs.

Place all of these files into a temporary directory on your computer system, compress/zip the directory as one file, and upload it to the Blackboard system for Task#8. YOU SHOULD ASSUME (IN GENERAL) THAT NO CREDIT WILL BE GIVEN FOR SHARED FILES OR LINKS TO FILES. HOWEVER, FOR THIS TASK, if the Blackboard system limits do not allow you to upload this compressed file, then you can store it in cloud storage and upload a secure link to that file.

NOTE: For these commands, you may also need to pass other parameters via the command line using appropriately named flags.

IR.P1.Task#9) Collect all the .java files necessary for your system into a single temporary directory on your computer system. Compress/zip the directory as one file and upload it to the Blackboard system for IR.P1.Task#9. If you did this correctly and included just the .java files, with no extraneous project or data files, then the compressed file should not be particularly large and you will be able to upload it as is (and NOT as a shared or cloud file or via a file link).

IR.P1.Task#10) Comment your code appropriately and use meaningful names for classes, methods, variables and constants. You are expected to report the number of lines of code as counted by cloc. This command must be run on the command line (cmd.exe). Once you have collected all .java files into the same temporary directory (see IR.P1.Task#9 above), the following command will report on the contents of each .java file. The report generated by the cloc program gives the number of actual Java code lines, blank lines, and comment lines for each of the .java files that is part of your project. The command to obtain the report data is as follows (assuming you are using cloc version 1.92):

cloc-1.92.exe --by-file *.java

This report should be copied into a text (.txt) file, which you will upload to the Blackboard system for IR.P1.Task#10.
