
6.0001/6.00 Spring 2019

Problem Set 3: Document Distance  


 
Handed out: Thursday, February 21st, 2019
Due: Friday, March 1, 2019 at 4:59PM
Checkoff due: March 13, 2019 at 9PM
 
This problem set will introduce you to the topic of dictionaries. You can read more about 
dictionaries in the textbook and in Lecture 6 on MITx. You will implement a program to 
detect similarity between two text documents. Don’t be intimidated by this problem - 
we’ll give you all the information you need to complete this p-set! We will guide you 
through the creation of helper functions before you implement the complete program. 
Although this handout is long, most of the information is there to provide you with 
context, useful examples, and hints, so be sure to read carefully.
 
File Set-Up 
● Download the file 1_ps3.zip and extract all files to the same directory.
● The files included are: document_distance.py, test_ps3_student.py, test1a.txt,
test1b.txt, test2a.txt, test2b.txt, test3a.txt, test3b.txt, hello_world.txt,
hello_friends.txt.
● You will edit ONLY document_distance.py.
 
Collaboration 
● Students may work together, but each student should write up and hand in their 
assignment separately. Students may not submit the exact same code. 
● Students are not permitted to look at or copy each other’s code or code structure. 
● Include the names of your collaborators in a comment at the start of each file. 
● Please refer to the collaboration policy in the Course Information for more
details.
 
Document Distance Overview 
Given two documents, you will calculate a score between 0 and 1 that will tell you how 
similar they are. If the documents are the same, they will get a score of 1. If the 
documents are completely different, they will get a score of 0. You will calculate the 
score in two different ways, and observe whether one works better than the other. The 
first way will use single word frequencies in the two texts. The second way will use 
bigram (sequence of two adjacent words) frequencies in the two texts. Finally, you will 
write a function that returns the text that most closely matches a query.   
 
Note that you do NOT need to worry about case sensitivity throughout this pset. All inputs will
be lower case.
   

Problem 0: Prep Data 
 
The first step in any data analysis problem is prepping your data.  
We have provided a function called load_text to read a text file and output all the text in
the file into a string. This function takes in a variable called filename, which is a string of
the filename you want to load, including the extension. It removes all punctuation and
saves the text as a string. Do not modify this function.
 
Here’s an example usage: 
 
>> text = load_text("hello_world.txt")
>> text
'hello world hello'
 
You will further prepare the text by taking the string and transforming it into a list 
representation of the text. Given the example from above, here is what we expect:  
 
>> prep_data('hello world hello')
['hello', 'world', 'hello']
 
Implement prep_data in document_distance.py as per the given instructions and
docstring.
 
Note: You can assume that the text documents we provide will not have extra 
whitespaces (i.e. there are no tabs or newlines or extra spaces between words). 
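
As a reference, here is a minimal sketch of one possible prep_data (assuming, per the
note above, that words are separated by single spaces):

def prep_data(text):
    # str.split() with no arguments splits on any run of whitespace, so it
    # handles the simple single-space case guaranteed above.
    return text.split()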
 
 
 
Problem 1: Find Bigrams 
 
Now, instead of looking at single words, we will make a list of bigrams from the input
text. A bigram is a sequence of two adjacent words in a text. For example, if the text is
"problem set number three", the bigrams would be "problem set", "set number", and "number 
three". You will be implementing a series of functions that allow you to scan through a text. 
You'll be using a two-word window that moves from the beginning to the end of the text.  
 
You will extract bigrams from the input text list. Implement find_bigrams in
document_distance.py as per the docstring.
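
A minimal sketch of one way to slide that window, assuming the input is the word list
produced by prep_data:

def find_bigrams(words):
    # Join each word with its right-hand neighbor, so a text with n words
    # yields n - 1 bigrams (and a one-word text yields none).
    return [words[i] + " " + words[i + 1] for i in range(len(words) - 1)]

For example, find_bigrams(['hello', 'world', 'hello']) would return
['hello world', 'world hello'].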
 
 
   

Problem 2: Word Frequency 
 
Now let’s start calculating the word frequency of each unique word (or bigram) in the 
input. The goal is to return a dictionary with a distinct word in the text as the key, and 
how many times the word occurs in the text as the value.  
 
Consider the following examples: 
Example 1: 
>> get_frequencies(['hello', 'world', 'hello'])
{'hello': 2, 'world': 1}
 
Example 2: 
>> get_frequencies(['hello world', 'world hello'])
{'hello world': 1, 'world hello': 1}

Implement get_frequencies in document_distance.py using the above instructions
and the docstring provided.
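
A minimal sketch of one possible approach, which works for words and bigrams alike:

def get_frequencies(items):
    # dict.get(item, 0) starts each unseen key at zero, so no separate
    # membership check is needed.
    freqs = {}
    for item in items:
        freqs[item] = freqs.get(item, 0) + 1
    return freqs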
 
 
   

Problem 3: Similarity 
 
Now it’s time to calculate similarity! Complete the function calculate_similarity with the
following conditions. This function can be used with the word frequency dictionary or the
bigram frequency dictionary created by the function get_frequencies.
 
The similarity will be calculated by taking the difference in text frequencies divided by the
total frequencies. The procedure below is for the word frequency dictionary; the same
procedure applies to the bigram frequency dictionary. Assume you have two frequency
dictionaries, dict1 and dict2, one for each of the texts.
 
1. The difference in text frequencies = DIFF is the sum of the values from each of the
following three scenarios:
a. If a word occurs in both dict1 and dict2, take the absolute value of the difference
in frequencies.
b. If a word occurs only in dict1, take the frequency from dict1.
c. If a word occurs only in dict2, take the frequency from dict2.
2. The total frequencies = ALL is calculated by summing all frequencies in both dict1 and
dict2.
3. Return 1 - DIFF/ALL rounded to 2 decimal places.
 
Here is an example demonstrating what we expect for the similarity using the word frequency
dictionary and bigram frequency dictionary for hello_world.txt and hello_friends.txt.
 
>> calculate_similarity(world_word_freq, friends_word_freq)
0.33
>> calculate_similarity(world_bigram_freq, friends_bigram_freq)
0.0
 
Implement the function calculate_similarity in document_distance.py with the given
instructions and docstring.
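
A minimal sketch of one way to translate the procedure above into code (assuming the
two dictionaries are not both empty):

def calculate_similarity(dict1, dict2):
    # Iterating over the union of keys and defaulting missing words to 0
    # covers all three DIFF scenarios at once: a word absent from one
    # dictionary contributes its full frequency from the other.
    diff = sum(abs(dict1.get(w, 0) - dict2.get(w, 0))
               for w in set(dict1) | set(dict2))
    total = sum(dict1.values()) + sum(dict2.values())
    return round(1 - diff / total, 2)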
 
   

Problem 4: Most Frequent Word(s) 
 
Next, you will find out which word(s) occur most frequently across the two dictionaries.
You'll count how many times every word occurs, combined across both texts, and return a
list of the most frequent word(s). The most frequent word does not need to appear in both
dictionaries. If multiple words are tied (i.e. have the same highest frequency), return an
alphabetically ordered list of all these words.
 
Implement the function get_most_frequent_words in document_distance.py as per the
given instructions and docstring.
 
For example, consider the following usage:  
 
>> freq1 = {"hello":5, "world":1}
>> freq2 = {"hello":1, "world":5}
>> get_most_frequent_words(freq1, freq2)
["hello", "world"]
 
   

Problem 5: Finding closest matching document 
 
In this part, you will find out which document most closely matches a query string. You’ll be
given a list of filenames, a string query, and an optional boolean parameter bigrams. The
bigrams parameter will determine if words or bigrams should be used for the analysis. You
will need to load each of the files, prep the data, generate the frequency dictionaries for the
data in the files and the string query, and determine the closest document match using the
similarity function you defined in Problem 3.
 
Here are some general notes about this problem:  
● If more than one document is tied as the closest match, add all matching documents to
the list, sorted in alphabetical order.
● If the similarity score for all files is 0, return an empty list.  
● If the parameter bigrams is True, you must use bigrams in your computations of 
similarity.  
  
Implement the function find_closest_match in document_distance.py based on the
instructions and docstring.
 
Here are some example usages:

Example 1:
>> filenames = ["hello_world.txt", "hello_friends.txt"]
>> query = "hello"
>> find_closest_match(filenames, query, bigrams=False)
["hello_friends.txt", "hello_world.txt"]

Example 2:
>> filenames = ["hello_world.txt", "hello_friends.txt"]
>> query = "hello apples"
>> find_closest_match(filenames, query, bigrams=True)
[]
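
A minimal sketch of one way to wire the earlier helpers together (using load_text and
your solutions to Problems 0 through 3):

def find_closest_match(filenames, query, bigrams=False):
    def to_freqs(text):
        # Shared pipeline: prep the data, optionally convert to bigrams,
        # then count frequencies.
        data = prep_data(text)
        if bigrams:
            data = find_bigrams(data)
        return get_frequencies(data)

    query_freq = to_freqs(query)
    scores = {name: calculate_similarity(to_freqs(load_text(name)), query_freq)
              for name in filenames}
    best = max(scores.values())
    if best == 0:
        return []
    # Ties are returned in alphabetical order.
    return sorted(name for name, score in scores.items() if score == best)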
 
When you are done, make sure you run the tester file test_ps3_student.py to check your code
against our test cases.
 
   

Hand-in Procedure  
 
1. Naming Files 
Save your solutions with the original file name: document_distance.py. Do not ignore
this step or save your files with a different name!
 
2. Time and Collaboration Info 
At the start of your file, in a comment, write down the number of hours (roughly) you spent on 
the problems in that part, and the names of the people with whom you collaborated. 
 
For example: 
# Problem Set 3
# Name: Jane Lee
# Collaborators: John Doe
# Time Spent: 3:30
# Late Days Used: 1 (only if you are using any)
# … your code goes here …
 
3. Submit 
To submit document_distance.py, upload it to the problem set website linked from
Stellar. You may upload new versions of each file until the 4:59 PM deadline, but 
anything uploaded after that time will be counted towards your late days, if you have 
any remaining. If you have no remaining late days, you will receive no credit for a late 
submission. 
 
After you submit, please be sure to view your submitted file and double-check you 
submitted the right thing. 
 
   

Supplemental Reading about Document Similarity 
 
This pset is a greatly simplified version of a very pertinent problem in Information 
Retrieval. Applications of document similarity range from retrieving search engine 
results to comparing genes and proteins to improving machine translation.  
 
You may have noticed that Problem 5 did not always return intuitive results when trying 
to find the most closely matching document to a query. This is because the similarity 
metric we used to determine closeness was very primitive - it did not take into account 
how words are related to each other nor did it remember word order (beyond the 
immediately preceding one when using bigrams).  
 
More advanced techniques for calculating document distance include transforming the
text into a vector space and computing the cosine similarity, Jaccard index, or some
other metric on the vectors.
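
As an illustration (not part of this pset), here is a minimal sketch of cosine similarity
computed directly on frequency dictionaries like the ones get_frequencies produces:

import math

def cosine_similarity(dict1, dict2):
    # Treat each dictionary as a sparse vector of word counts. The dot
    # product only needs the words the two vectors share.
    dot = sum(dict1[w] * dict2[w] for w in dict1 if w in dict2)
    norm1 = math.sqrt(sum(v * v for v in dict1.values()))
    norm2 = math.sqrt(sum(v * v for v in dict2.values()))
    if norm1 == 0 or norm2 == 0:
        return 0.0
    # 1.0 means the texts have identical word proportions; 0.0 means
    # they share no words at all.
    return dot / (norm1 * norm2)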
 
