Plagiarism Detection System using Shingling Method

Purpose

Purpose: To develop a plagiarism detection system.

Learning Objectives: To practice the principles of modular design; to solve a problem and design a program using top-down design and functional decomposition; to practice file input and output; to practice using strings and string operations; to practice using lists; to gain an understanding of how computers detect similarity in documents.

Problem statement: Write a program that takes two files as input and determines their similarity.

Problem details: Computers nowadays perform many tasks that seem to require human intelligence. For example, they compare student essays for plagiarism, detect similar web pages, and detect spam. Computers still cannot understand the meaning of our documents, so how do they do it? Here we will examine one way of answering that question. Although the method we use is somewhat simplified, this assignment will give you a general idea of how computers perform these tasks.

The method we use is called shingling. Shingling first transforms the two documents we want to compare into two sets of strings (called shingles) and then computes how similar these sets are. To convert a document into a set of shingles, we take an integer k and move a window of length k over the document, one character at a time. We record every string we see in the window, but only once. The resulting set is called the set of k-shingles.

Here is an example with k = 3. For the document that contains the sentence:

I_am_I_am_Sam_the_cat_I_am.

the 3-shingles would be:

['I_a', '_am', 'am_', 'm_I', '_I_', 'm_S', '_Sa', 'Sam', 'm_t', '_th', 'the', 'he_', 'e_c', '_ca', 'cat', 'at_', 't_I', 'am.']

Notice that there is only one copy of the shingle 'I_a' (and only one copy of the other repeated shingles), and that I used _ instead of the space character to make the content of the shingles more obvious. Real documents would contain space characters, so the first shingle would be "I a".
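The sliding-window procedure described above can be sketched in Python as follows. This is one possible approach under stated assumptions, not a reference solution: the function name k_shingles and the decision to raise ValueError for a non-positive k are illustrative choices that do not come from the assignment. The shingles are stored in a list rather than a Python set, in line with the detailed requirements.

```python
def k_shingles(text, k):
    """Return the distinct k-character shingles of text, in first-seen order."""
    if k <= 0:
        # Assumption: a window length below 1 is treated as a caller error.
        raise ValueError("k must be a positive integer")
    shingles = []  # a list used as a set: no duplicates are ever appended
    # Slide a window of length k across the text, one character at a time.
    for start in range(len(text) - k + 1):
        window = text[start:start + k]
        if window not in shingles:  # record each shingle only once
            shingles.append(window)
    return shingles


example = "I am I am Sam the cat I am."
print(k_shingles(example, 3))
```

For the example sentence this yields the same 18 shingles listed above, with real spaces where the listing shows _. A text shorter than k produces an empty list, because the window never fits.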
Note that punctuation characters and spaces are included in the shingles.

If you transform your first document into a set of shingles A, and your second document into a set of shingles B in the same way, you can use the Jaccard similarity to measure how similar the documents are. The Jaccard similarity of two sets is defined by:

J(A, B) = |A ∩ B| / |A ∪ B|

where the vertical bars denote the cardinality (number of elements) of a set. Note that if A and B are identical sets, the Jaccard similarity will be 1, and if they don't have a single element in common, the Jaccard similarity will be 0. The more similar the documents are, the bigger their Jaccard similarity is.

Detailed Requirements: The user will be asked to provide the names (and paths) of two documents and the integer k for the length of the shingles. Your program will compute the Jaccard similarity of the sets of k-shingles for the documents. The sets of shingles should be implemented as lists. Your program should be well designed and flexible.
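Because the shingle sets are to be implemented as lists, the similarity can be computed by counting shared elements directly. The sketch below is a hedged illustration, not the required design: the name jaccard_similarity is hypothetical, the function assumes neither list contains duplicates, and returning 0.0 when both lists are empty is an assumption, since the ratio of intersection size to union size is undefined in that case.

```python
def jaccard_similarity(shingles_a, shingles_b):
    """Return |A ∩ B| / |A ∪ B| for two duplicate-free lists of shingles."""
    # The intersection holds the elements of A that also appear in B.
    common = [s for s in shingles_a if s in shingles_b]
    # |A ∪ B| = |A| + |B| - |A ∩ B|, valid because the lists have no duplicates.
    union_size = len(shingles_a) + len(shingles_b) - len(common)
    if union_size == 0:
        # Assumption: two empty shingle lists are given similarity 0.0.
        return 0.0
    return len(common) / union_size


a = ["I a", " am", "am."]
b = ["I a", " am", "am!"]
print(jaccard_similarity(a, a))  # identical lists
print(jaccard_similarity(a, b))  # two of four distinct shingles are shared
```

Identical lists give 1.0 and lists with nothing in common give 0.0, matching the two boundary cases described above. In the example, a and b share 2 of 4 distinct shingles, so the result is 0.5.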
