
6.0001/6.00 Spring 2019

Problem Set 3: Document Distance  


 
Handed out: Thursday, February 21st, 2019
Due: Friday, March 1, 2019 at 4:59PM
Checkoff due: March 13, 2019 at 9PM
 
This problem set will introduce you to the topic of dictionaries. You can read more about 
dictionaries in the textbook and in Lecture 6 on MITx. You will implement a program to 
detect similarity between two text documents. Don’t be intimidated by this problem - 
we’ll give you all the information you need to complete this p-set! We will guide you 
through the creation of helper functions before you implement the complete program. 
Although this handout is long, most of the information is there to provide you with 
context, useful examples, and hints, so be sure to read carefully.
 
File Set-Up 
● Download the file 1_ps3.zip and extract all files to the same directory.
● The files included are: document_distance.py, test_ps3_student.py, test1a.txt,
test1b.txt, test2a.txt, test2b.txt, test3a.txt, test3b.txt, hello_world.txt,
hello_friends.txt.
● You will edit ONLY document_distance.py.
 
Collaboration 
● Students may work together, but each student should write up and hand in their 
assignment separately. Students may not submit the exact same code. 
● Students are not permitted to look at or copy each other’s code or code structure. 
● Include the names of your collaborators in a comment at the start of each file. 
● Please refer to the collaboration policy in the Course Information for more
details.
 
Document Distance Overview 
Given two documents, you will calculate a score between 0 and 1 that will tell you how 
similar they are. If the documents are the same, they will get a score of 1. If the 
documents are completely different, they will get a score of 0. You will calculate the 
score in two different ways, and observe whether one works better than the other. The 
first way will use single word frequencies in the two texts. The second way will use 
bigram (sequence of two adjacent words) frequencies in the two texts. Finally, you will 
write a function that returns the text that most closely matches a query.   
 
Note that you do NOT need to worry about case sensitivity throughout this pset. All inputs will
be lower case.
   

Problem 0: Prep Data 
 
The first step in any data analysis problem is prepping your data.  
We have provided a function called load_text to read a text file and output all the text in
the file into a string. This function takes in a variable called filename, which is a string of
the filename you want to load, including the extension. It removes all punctuation and
saves the text as a string. Do not modify this function.
 
Here’s an example usage: 
 
>> text = load_text("hello_world.txt")
>> text
'hello world hello'
 
You will further prepare the text by taking the string and transforming it into a list 
representation of the text. Given the example from above, here is what we expect:  
 
>> prep_data('hello world hello')
['hello', 'world', 'hello']
 
Implement prep_data in document_distance.py as per the given instructions and
docstring.
 
Note: You can assume that the text documents we provide will not have extra 
whitespaces (i.e. there are no tabs or newlines or extra spaces between words). 
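
As a reference, here is a minimal sketch of one possible prep_data (assuming, per the
note above, that words are separated by single spaces):

def prep_data(text):
    # str.split() with no arguments splits on any run of whitespace, so it
    # handles the simple single-space case guaranteed above.
    return text.split()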
 
 
 
Problem 1: Find Bigrams 
 
Now, instead of looking at single words, we will make a list of bigrams from the input
text. A bigram is a sequence of two adjacent words in a text. For example, if the text is
"problem set number three", the bigrams would be "problem set", "set number", and "number 
three". You will be implementing a series of functions that allow you to scan through a text. 
You'll be using a two-word window that moves from the beginning to the end of the text.  
 
You will extract bigrams from the input text list. Implement find_bigrams in
document_distance.py as per the docstring.
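
A minimal sketch of one way to slide that window, assuming the input is the word list
produced by prep_data:

def find_bigrams(words):
    # Join each word with its right-hand neighbor, so a text with n words
    # yields n - 1 bigrams (and a one-word text yields none).
    return [words[i] + " " + words[i + 1] for i in range(len(words) - 1)]

For example, find_bigrams(['hello', 'world', 'hello']) would return
['hello world', 'world hello'].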
 
 
   

Problem 2: Word Frequency 
 
Now let’s start calculating the word frequency of each unique word (or bigram) in the 
input. The goal is to return a dictionary with a distinct word in the text as the key, and 
how many times the word occurs in the text as the value.  
 
Consider the following examples: 
Example 1: 
>> get_frequencies(['hello', 'world', 'hello'])
{'hello': 2, 'world': 1}
 
Example 2: 
>> get_frequencies(['hello world', 'world hello'])
{'hello world': 1, 'world hello': 1}

Implement get_frequencies in document_distance.py using the above instructions
and the docstring provided.
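
A minimal sketch of one possible approach, which works for words and bigrams alike:

def get_frequencies(items):
    # dict.get(item, 0) starts each unseen key at zero, so no separate
    # membership check is needed.
    freqs = {}
    for item in items:
        freqs[item] = freqs.get(item, 0) + 1
    return freqs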
 
 
   

Problem 3: Similarity 
 
Now it’s time to calculate similarity! Complete the function calculate_similarity with the
following conditions. This function can be used with the word frequency dictionary or the
bigram frequency dictionary created by the function get_frequencies.
 
The similarity will be calculated by taking the difference in text frequencies divided by the
total frequencies. The procedure below is for the word frequency dictionary; the same
procedure applies to the bigram frequency dictionary. Assume you have two frequency
dictionaries, dict1 and dict2, one for each of the texts.
 
1. The difference in text frequencies = DIFF is the sum of the values from each of the
following three scenarios:
a. If a word occurs in both dict1 and dict2, take the absolute value of the difference
in frequencies.
b. If a word occurs only in dict1, take the frequency from dict1.
c. If a word occurs only in dict2, take the frequency from dict2.
2. The total frequencies = ALL is calculated by summing all frequencies in both dict1 and
dict2.
3. Return 1 - DIFF/ALL rounded to 2 decimal places.
 
Here is an example demonstrating what we expect for the similarity using the word frequency
dictionary and bigram frequency dictionary for hello_world.txt and hello_friends.txt.
 
>> calculate_similarity(world_word_freq, friends_word_freq)
0.33
>> calculate_similarity(world_bigram_freq, friends_bigram_freq)
0.0
 
Implement the function calculate_similarity in document_distance.py with the given
instructions and docstring.
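
A minimal sketch of one way to translate the procedure above into code (assuming the
two dictionaries are not both empty):

def calculate_similarity(dict1, dict2):
    # Iterating over the union of keys and defaulting missing words to 0
    # covers all three DIFF scenarios at once: a word absent from one
    # dictionary contributes its full frequency from the other.
    diff = sum(abs(dict1.get(w, 0) - dict2.get(w, 0))
               for w in set(dict1) | set(dict2))
    total = sum(dict1.values()) + sum(dict2.values())
    return round(1 - diff / total, 2)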
 
   

Problem 4: Most Frequent Word(s) 
 
Next, you will find out which word(s) occur most frequently across the two dictionaries.
You'll count how many times every word occurs, combined across both texts, and return a
list of the most frequent word(s). The most frequent word does not need to appear in both
dictionaries. If multiple words are tied (i.e. have the same highest frequency), return an
alphabetically ordered list of all these words.
 
Implement the function get_most_frequent_words in document_distance.py as per the
given instructions and docstring.
 
For example, consider the following usage:  
 
>> freq1 = {"hello":5, "world":1}
>> freq2 = {"hello":1, "world":5}
>> get_most_frequent_words(freq1, freq2)
["hello", "world"]
 
   

Problem 5: Finding closest matching document 
 
In this part, you will find out which document most closely matches a query string. You’ll be
given a list of filenames, a string query, and an optional boolean parameter bigrams. The
bigrams parameter will determine if words or bigrams should be used for the analysis. You
will need to load each of the files, prep the data, generate the frequency dictionaries for the
data in the files and the string query, and determine the closest document match using the
similarity function you defined in Problem 3.
 
Here are some general notes about this problem:  
● If more than one document is tied as the closest match, add all matching documents to
the list, sorted in alphabetical order.
● If the similarity score for all files is 0, return an empty list.  
● If the parameter bigrams is True, you must use bigrams in your computations of 
similarity.  
  
Implement the function find_closest_match in document_distance.py based on the
instructions and docstring.
 
Here are some example usages:

Example 1:
>> filenames = ["hello_world.txt", "hello_friends.txt"]
>> query = "hello"
>> find_closest_match(filenames, query, bigrams=False)
["hello_friends.txt", "hello_world.txt"]

Example 2:
>> filenames = ["hello_world.txt", "hello_friends.txt"]
>> query = "hello apples"
>> find_closest_match(filenames, query, bigrams=True)
[]
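
A minimal sketch of one way to wire the earlier helpers together (using load_text and
your solutions to Problems 0 through 3):

def find_closest_match(filenames, query, bigrams=False):
    def to_freqs(text):
        # Shared pipeline: prep the data, optionally convert to bigrams,
        # then count frequencies.
        data = prep_data(text)
        if bigrams:
            data = find_bigrams(data)
        return get_frequencies(data)

    query_freq = to_freqs(query)
    scores = {name: calculate_similarity(to_freqs(load_text(name)), query_freq)
              for name in filenames}
    best = max(scores.values())
    if best == 0:
        return []
    # Ties are returned in alphabetical order.
    return sorted(name for name, score in scores.items() if score == best)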
 
When you are done, make sure you run the tester file test_ps3_student.py to check your code
against our test cases.
 
   

Hand-in Procedure  
 
1. Naming Files 
Save your solutions with the original file name: document_distance.py. Do not ignore
this step or save your files with a different name!
 
2. Time and Collaboration Info 
At the start of your file, in a comment, write down the number of hours (roughly) you spent on 
the problems in that part, and the names of the people with whom you collaborated. 
 
For example: 
# Problem Set 3
# Name: Jane Lee
# Collaborators: John Doe
# Time Spent: 3:30
# Late Days Used: 1 (only if you are using any)
# … your code goes here …
 
3. Submit 
To submit document_distance.py, upload it to the problem set website linked from
Stellar. You may upload new versions of each file until the 4:59 PM deadline, but 
anything uploaded after that time will be counted towards your late days, if you have 
any remaining. If you have no remaining late days, you will receive no credit for a late 
submission. 
 
After you submit, please be sure to view your submitted file and double-check you 
submitted the right thing. 
 
   

Supplemental Reading about Document Similarity 
 
This pset is a greatly simplified version of a very pertinent problem in Information 
Retrieval. Applications of document similarity range from retrieving search engine 
results to comparing genes and proteins to improving machine translation.  
 
You may have noticed that Problem 5 did not always return intuitive results when trying 
to find the most closely matching document to a query. This is because the similarity 
metric we used to determine closeness was very primitive - it did not take into account 
how words are related to each other nor did it remember word order (beyond the 
immediately preceding one when using bigrams).  
 
More advanced techniques for calculating document distance include transforming the
text into a vector space and computing the cosine similarity, Jaccard index, or some
other metric on the vectors.
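
As an illustration (not part of this pset), here is a minimal sketch of cosine similarity
computed directly on frequency dictionaries like the ones get_frequencies produces:

import math

def cosine_similarity(dict1, dict2):
    # Treat each dictionary as a sparse vector of word counts. The dot
    # product only needs the words the two vectors share.
    dot = sum(dict1[w] * dict2[w] for w in dict1 if w in dict2)
    norm1 = math.sqrt(sum(v * v for v in dict1.values()))
    norm2 = math.sqrt(sum(v * v for v in dict2.values()))
    if norm1 == 0 or norm2 == 0:
        return 0.0
    # 1.0 means the texts have identical word proportions; 0.0 means
    # they share no words at all.
    return dot / (norm1 * norm2)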
 
