00 Spring 2019
1 of 8
Problem 0: Prep Data
The first step in any data analysis problem is prepping your data.
We have provided a function called load_text to read a text file and output all the text in
the file into a string. This function takes in a variable called filename, which is a string of
the filename you want to load, including the extension. It removes all punctuation and
returns the text as a string. Do not modify this function.
Here’s an example usage:
>> text = load_text("hello_world.txt")
>> text
'hello world hello'
You will further prepare the text by taking the string and transforming it into a list
representation of the text. Given the example from above, here is what we expect:
>> prep_data('hello world hello')
['hello', 'world', 'hello']
Implement prep_data in document_distance.py as per the given instruction and
docstring.
Note: You can assume that the text documents we provide will not have extra
whitespace (i.e., there are no tabs, newlines, or extra spaces between words).
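As a hedged starting point, here is a minimal sketch of one way prep_data could work, assuming the input string has already been cleaned by load_text:

```python
def prep_data(text):
    """Split a cleaned text string into a list of its words."""
    # str.split() with no arguments splits on any run of whitespace,
    # so it is robust even if the no-extra-whitespace guarantee changes.
    return text.split()
```

With the example above, prep_data('hello world hello') returns ['hello', 'world', 'hello'].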
Problem 1: Find Bigrams
Now, instead of looking at single words, we will build a list of bigrams from the input
text. A bigram is a sequence of two adjacent words in a text. For example, if the text is
"problem set number three", the bigrams would be "problem set", "set number", and "number
three". You will scan the input word list with a two-word window that moves from the
beginning to the end of the text, extracting each bigram along the way. Implement
find_bigrams in document_distance.py as per the docstring.
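The two-word-window idea can be sketched as follows; this is one possible implementation, not necessarily the one the staff solution uses:

```python
def find_bigrams(words):
    """Return the list of bigrams (pairs of adjacent words) in a word list."""
    # Slide a two-word window across the list. For a list with fewer than
    # two words, range(len(words) - 1) is empty, so no bigrams are produced.
    return [words[i] + ' ' + words[i + 1] for i in range(len(words) - 1)]
```

For example, find_bigrams(['problem', 'set', 'number', 'three']) returns ['problem set', 'set number', 'number three'].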
Problem 2: Word Frequency
Now let’s start calculating the word frequency of each unique word (or bigram) in the
input. The goal is to return a dictionary with a distinct word in the text as the key, and
how many times the word occurs in the text as the value.
Consider the following examples:
Example 1:
>> get_frequencies(['hello', 'world', 'hello'])
{'hello': 2, 'world': 1}
Example 2:
>> get_frequencies(['hello world', 'world hello'])
{'hello world': 1, 'world hello': 1}
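A minimal sketch of the counting step, written with a plain dictionary (the standard library's collections.Counter would also work):

```python
def get_frequencies(items):
    """Map each distinct word (or bigram) to how many times it occurs."""
    freqs = {}
    for item in items:
        # dict.get with a default of 0 handles the first occurrence cleanly.
        freqs[item] = freqs.get(item, 0) + 1
    return freqs
```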
Problem 3: Similarity
Now it’s time to calculate similarity! Complete the function calculate_similarity
according to the conditions below. This function can be used with either the word
frequency dictionary or the bigram frequency dictionary created by get_frequencies.
The similarity is calculated by taking the difference in text frequencies divided by the
total frequencies. The procedure below is written for the word frequency dictionary; the
same procedure applies to the bigram frequency dictionary. Assume you have two
frequency dictionaries, dict1 and dict2, one for each of the texts.
1. The difference in text frequencies = DIFF is the sum of the values from each of the
following three scenarios:
a. If a word occurs in dict1 and dict2 then get the absolute value of the difference
in frequencies
b. If a word occurs only in dict1 then take the frequency from dict1
c. If a word occurs only in dict2 then take the frequency from dict2
2. The total frequencies = ALL is calculated by summing all frequencies in both dict1 and
dict2.
3. Return 1 - DIFF/ALL, rounded to 2 decimal places.
Here is an example demonstrating what we expect for the similarity using the word frequency
dictionary and bigram frequency dictionary for hello_world.txt and hello_friends.txt.
>> calculate_similarity(world_word_freq, friends_word_freq)
0.33
>> calculate_similarity(world_bigram_freq, friends_bigram_freq)
0.0
Implement the function calculate_similarity in document_distance.py as per the given
instructions and docstring.
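The three scenarios in step 1 can be collapsed into a single expression, since a word missing from one dictionary contributes a frequency of 0 there. A minimal sketch of the formula:

```python
def calculate_similarity(dict1, dict2):
    """Return 1 - DIFF/ALL, rounded to 2 decimal places."""
    diff = 0
    for word in set(dict1) | set(dict2):
        # .get(word, 0) covers all three scenarios at once: when the word is
        # missing from one side, the absolute difference is simply the
        # frequency from the other side.
        diff += abs(dict1.get(word, 0) - dict2.get(word, 0))
    total = sum(dict1.values()) + sum(dict2.values())
    return round(1 - diff / total, 2)
```

For instance, with {'hello': 5, 'world': 1} and {'hello': 1, 'world': 5}, DIFF is 8 and ALL is 12, giving 0.33.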
Problem 4: Most Frequent Word(s)
Next, you will find out which word(s) occur most frequently across two dictionaries. You'll
count how many times each word occurs, combined across both texts, and return a list of
the most frequent word(s). The most frequent word does not need to appear in both
dictionaries. If multiple words are tied (i.e., they share the same highest frequency), return
an alphabetically ordered list of all of these words.
Implement the function get_most_frequent_words in document_distance.py as per the
given instructions and docstring.
For example, consider the following usage:
>> freq1 = {"hello":5, "world":1}
>> freq2 = {"hello":1, "world":5}
>> get_most_frequent_words(freq1, freq2)
["hello", "world"]
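One possible sketch: merge the two dictionaries into a combined count, then collect and sort every word that reaches the maximum:

```python
def get_most_frequent_words(freq1, freq2):
    """Return an alphabetized list of the word(s) with the highest combined count."""
    combined = dict(freq1)
    for word, count in freq2.items():
        combined[word] = combined.get(word, 0) + count
    best = max(combined.values())
    # sorted() gives the required alphabetical order when there are ties.
    return sorted(word for word, count in combined.items() if count == best)
```

In the example above, both words reach a combined count of 6, so both are returned in alphabetical order.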
Problem 5: Finding closest matching document
In this part, you will find out which document most closely matches a query string. You’ll be
given a list of filenames, a string query, and an optional boolean parameter bigrams. The
bigrams parameter will determine if words or bigrams should be used for the analysis. You
will need to load each of the files, prep the data, generate the frequency dictionaries for the
data in the files and the string query, and determine the closest document match using the
similarity function you defined in Problem 3.
Here are some general notes about this problem:
● If more than one document is tied as the closest match, add all matching documents to
the list, sorting by alphabetical order.
● If the similarity score for all files is 0, return an empty list.
● If the parameter bigrams is True, you must use bigrams in your computations of
similarity.
Implement the function find_closest_match in document_distance.py based on the
instruction and docstring.
For example, here are example usages:
Example 1:
>> filenames = ["hello_world.txt", "hello_friends.txt"]
>> query = "hello"
>> bigrams = False
>> find_closest_match(filenames, query, bigrams=False)
["hello_friends.txt", "hello_world.txt"]
Example 2:
>> filenames = ["hello_world.txt", "hello_friends.txt"]
>> query = "hello apples"
>> bigrams = True
>> find_closest_match(filenames, query, bigrams=True)
[]
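Putting the pieces together, here is a hedged sketch of the overall flow. The helper definitions below are condensed stand-ins for the provided load_text and for your own Problems 0-3 solutions; in document_distance.py you would call your real implementations instead:

```python
import string

# Condensed stand-ins for load_text and the Problems 0-3 helpers;
# use your own pset implementations in document_distance.py.
def load_text(filename):
    with open(filename) as f:
        text = f.read()
    return text.translate(str.maketrans('', '', string.punctuation)).strip()

def prep_data(text):
    return text.split()

def find_bigrams(words):
    return [words[i] + ' ' + words[i + 1] for i in range(len(words) - 1)]

def get_frequencies(items):
    freqs = {}
    for item in items:
        freqs[item] = freqs.get(item, 0) + 1
    return freqs

def calculate_similarity(dict1, dict2):
    diff = sum(abs(dict1.get(w, 0) - dict2.get(w, 0))
               for w in set(dict1) | set(dict2))
    total = sum(dict1.values()) + sum(dict2.values())
    return round(1 - diff / total, 2)

def find_closest_match(filenames, query, bigrams=False):
    """Return the filename(s) whose text is most similar to the query."""
    query_words = prep_data(query)
    query_freq = get_frequencies(find_bigrams(query_words) if bigrams
                                 else query_words)
    scores = {}
    for filename in filenames:
        words = prep_data(load_text(filename))
        file_freq = get_frequencies(find_bigrams(words) if bigrams else words)
        scores[filename] = calculate_similarity(file_freq, query_freq)
    best = max(scores.values())
    if best == 0:
        return []  # every file scored 0: no match at all
    return sorted(f for f, score in scores.items() if score == best)
```

Note the two edge cases from the bullet list above: ties are broken by sorting alphabetically, and an all-zero score yields an empty list.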
When you are done, make sure you run the tester file test_ps3_student.py to check your code
against our test cases.
Hand-in Procedure
1. Naming Files
Save your solutions with the original file name: document_distance.py. Do not ignore
this step or save your files with a different name!
2. Time and Collaboration Info
At the start of your file, in a comment, write down the number of hours (roughly) you spent on
the problems in that part, and the names of the people with whom you collaborated.
For example:
# Problem Set 3
# Name: Jane Lee
# Collaborators: John Doe
# Time Spent: 3:30
# Late Days Used: 1 (only if you are using any)
# … your code goes here …
3. Submit
To submit document_distance.py, upload it to the problem set website linked from
Stellar. You may upload new versions of each file until the 4:59 PM deadline, but
anything uploaded after that time will be counted towards your late days, if you have
any remaining. If you have no remaining late days, you will receive no credit for a late
submission.
After you submit, please be sure to view your submitted file and double-check you
submitted the right thing.
Supplemental Reading about Document Similarity
This pset is a greatly simplified version of a very pertinent problem in Information
Retrieval. Applications of document similarity range from retrieving search engine
results to comparing genes and proteins to improving machine translation.
You may have noticed that Problem 5 did not always return intuitive results when trying
to find the most closely matching document to a query. This is because the similarity
metric we used to determine closeness was very primitive: it did not take into account
how words are related to each other, nor did it remember word order (beyond the
immediately preceding word when using bigrams).
More advanced techniques for calculating document distance include transforming the
text into a vector space and computing the cosine similarity, Jaccard index, or some
other metric on the vectors.
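To illustrate the vector-space idea, here is a small sketch of cosine similarity computed directly on two frequency dictionaries, treating each distinct word as one dimension of the vector. This is for intuition only; it is not part of the pset:

```python
import math

def cosine_similarity(freq1, freq2):
    """Cosine of the angle between two word-frequency vectors."""
    # Only words present in both dictionaries contribute to the dot product;
    # a word missing from either side contributes 0.
    dot = sum(freq1[w] * freq2.get(w, 0) for w in freq1)
    norm1 = math.sqrt(sum(v * v for v in freq1.values()))
    norm2 = math.sqrt(sum(v * v for v in freq2.values()))
    return dot / (norm1 * norm2)
```

Identical texts score 1.0, texts with no words in common score 0.0, and unlike the DIFF/ALL metric above, cosine similarity is insensitive to overall document length.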