Instructor
Marti Hearst
Mentor
Aditi Muralidharan
1. Introduction
There are many search engines available today. When users type a query to
search for something on the internet, they are often overwhelmed by the large
number of results, and it becomes difficult to browse through this list and fetch the
most relevant items. We implemented an algorithm that improves user search
using NLP techniques and provides a mechanism to categorize search results into
relevant categories, thereby making it easy for the end user to navigate to the category
of interest and look for results within that category. We decided to implement the Findex
algorithm in this project. Findex is a text categorization algorithm that provides an
overview of search results as categories, where the categories are made up of the most
frequent words and phrases in the resulting document set. The algorithm is based on
the assumption that the most frequently used words and phrases in a set of documents
capture its major topics very well. We used StackOverflow data to implement the
algorithm.
2. Project Goals
2.1 Original intent
We originally intended to work on clustering similar sentences using a monothetic clustering
algorithm such as DisCover [1]. Monothetic clustering is a clustering technique wherein each
cluster is formed using only one feature, and that single feature is present across all the samples,
which in our case are documents. The DisCover algorithm is one such monothetic
clustering algorithm. We also intended to use the WordSeer [2] project and the associated
humanities corpora in order to implement the algorithm on them. However, upon deeper analysis we
realized some hurdles to this approach, as discussed below.
2.2 Algorithm
The DisCover algorithm aims for full coverage; however, search results do not necessarily have to
fall under clusters. Moreover, understanding and implementing the algorithm appeared to be
complex and fell outside the scope of a class project. In addition, no good measure or method of
evaluating the DisCover algorithm had been established, so we decided to use another algorithm,
Findex [3], in order to categorize search results. The Findex algorithm is discussed
in detail in an upcoming section.
2.3 Data
The developer of WordSeer, Aditi M, had set up two text corpora to work with WordSeer.
However, the text seemed more appropriate for testing purposes than for actual use.
2.4 WordSeer
WordSeer had also already implemented a version of the Findex algorithm for clustering
search results, so there wasn't much left for us to do there, and we really wanted to learn more
about Findex and its implementation.
Thus, we decided to implement a search result clustering algorithm (Findex) over StackOverflow
data using the StackOverflow API. We built a search interface for StackOverflow users to type in
their questions and see the results in a neat manner.
2.5 Accomplishments
Data
Data processing was a much harder task than we had originally expected. Having used the
ready-made database that Aditi had uploaded onto the WordSeer platform, we had not expected the
data cleaning and loading tasks to be so arduous. The StackOverflow API provided us with data in
huge XML and HTML files, and a lot of our time was consumed in extracting the data, cleaning it,
parsing it, and loading it into the database.
Algorithm
We decided to implement Findex as our algorithm of choice for categorizing the search results.
Details of the implementation are discussed in an upcoming section. Our original intent was
to create only one layer of categories for the search result topics, but we ended up creating a
second layer of sub-categories by recursively applying the Findex algorithm over each of the
parent categories. From an NLP standpoint, we ended up implementing a range of concepts, from
word tokenization, lemmatization, and n-gram creation to phrase frequency distribution.
Search User Interface
Initially, our vision was only to create the categories over the search results and display them
on the console. However, we decided to go a step further and create a web interface as well. We
realized that visualizing the categories and sub-categories along with the content of the
questions and responses from StackOverflow would be a good way to drive home the point
about search result categorization.
Future Work
We intend to plug the system that we developed into the StackOverflow interface to
allow users to browse quickly to the questions and responses of interest through
the categorized topics. Currently, the StackOverflow user interface offers only tags as a means of
classification. The issue with tags, however, is that they are user-generated and
may not necessarily be relevant to the question.
2.6 Results
The following figures represent the results of our project. Each figure shows a different
query term and the categories corresponding to it. Notice how some query terms have
sub-categories while others do not: Findex only displays a category or sub-category
if a substantial number of questions falls under it.
Detailed descriptions of these result pages are provided in the next section.
3. Data
We downloaded the Stack Exchange Creative Commons Data Dump, which has all the
public data from websites like Stack Overflow, Server Fault, Stack Apps, etc., up to
September 2011. The data files were in XML format, with each question and answer
being an entry with the <row> tag.

Because the relevant Stack Overflow data was over 4GB in size, we first built our
database using the data from english.stackexchange.com, which was about 21MB in size.
This allowed us to make progress in parallel on the database front and the algorithm
front. To save time, we initially set everything up in a SQLite database. We processed
the XML file using the Python built-in xml.dom library and stored the answers and
questions in separate tables. We cleaned the answer text by removing all the HTML tags
and dropping the content of certain tags like <CODE> using BeautifulSoup, since we did
not want to pass the code snippets to our Findex engine. After cleaning the answer text,
we merged the answer and question tables and got rid of all the extra data, such as
unanswered questions and irrelevant answers. Using this temporary database, we started
building our Findex engine and the user interface.

As the next step, we started building our database with the full Stack Overflow data and
switched from SQLite to MySQL. The xml.dom library did not work well for parsing a large
data set, as it reads the entire data file into memory before processing it, so we
switched to xml.etree.cElementTree, which is a C-based library, and had to make some
changes to the SQL statements to import the Stack Overflow data into the MySQL
database. This gave us over 1 million questions with cleaned answer text. For our final
demonstration and user interface, we used a subset of ten thousand questions in a
SQLite database to improve the response time of the system.
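The streaming parse and answer-text cleanup described above can be sketched as follows. This is a simplified, standard-library-only illustration (our actual pipeline used xml.etree.cElementTree, BeautifulSoup, and MySQL); the sample data and attribute names are illustrative stand-ins for the dump format, not excerpts from it.

```python
import re
import xml.etree.ElementTree as ET
from io import BytesIO

# Tiny sample mimicking the dump's one-<row>-per-post layout (illustrative only).
SAMPLE = b"""<posts>
  <row Id="1" PostTypeId="1" Title="How does malloc work?" Body="&lt;p&gt;Question text&lt;/p&gt;"/>
  <row Id="2" PostTypeId="2" ParentId="1"
       Body="&lt;p&gt;It allocates memory.&lt;/p&gt;&lt;code&gt;p = malloc(8);&lt;/code&gt;"/>
</posts>"""

def clean_body(html):
    """Drop <code>...</code> content entirely, then strip the remaining HTML tags."""
    no_code = re.sub(r"<code>.*?</code>", " ", html, flags=re.DOTALL)
    return re.sub(r"<[^>]+>", " ", no_code).strip()

def parse_posts(stream):
    """Stream <row> elements without loading the whole file into memory."""
    questions, answers = {}, []
    for _, row in ET.iterparse(stream):
        if row.tag != "row":
            continue
        attrs = row.attrib
        if attrs.get("PostTypeId") == "1":          # question
            questions[attrs["Id"]] = attrs.get("Title", "")
        elif attrs.get("PostTypeId") == "2":        # answer
            answers.append((attrs["ParentId"], clean_body(attrs["Body"])))
        row.clear()  # free memory as we go
    return questions, answers

questions, answers = parse_posts(BytesIO(SAMPLE))
```

Iterating with `iterparse` and clearing each element is what lets the full multi-gigabyte dump be processed without exhausting memory.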
4. Algorithms
We implemented a modified version of the Findex algorithm. The major steps of the
algorithm we used are described below.
4.1 Text Mining
This was one of the most important steps. Before doing the frequency calculation, the data
had to be transformed into a particular format and stored in the SQLite database. The clean
data had to be tokenized, lemmatized, and converted to n-grams.

Tokenization - The answer text was tokenized using the nltk tokenizer. Tokenization splits
a string into a list of its constituent words.

Stop word removal - In order to get more relevant categories after applying a frequency
distribution to the tokenized words, it was important to exclude stop words. Without
excluding them, stop words would appear in the most frequent words list and would
result in meaningless categories.

N-gram creation - We decided to limit phrases to at most three words, since these yield
more relevant categories than phrases consisting of more than 3 words. For every
answer, we created a list of unigrams, bigrams, and trigrams and stored them in an ngrams
table that we created. We store the original unigrams, bigrams, and trigrams along with their
lemmatized versions as shown below.
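The tokenization, stop word removal, and n-gram steps can be sketched as below. This is a minimal standard-library sketch: the real pipeline used the nltk tokenizer and lemmatizer and stored the results in the ngrams table, and the stop word list here is a tiny illustrative stand-in, not the one we actually used.

```python
import re

# Illustrative stop word list (the real pipeline used a full list).
STOP_WORDS = {"the", "a", "an", "in", "is", "of", "to", "and", "it", "you", "can"}

def tokenize(text):
    """Lowercase and split on non-alphanumeric runs (nltk's tokenizer in the real pipeline)."""
    return [t for t in re.split(r"[^a-z0-9]+", text.lower()) if t]

def ngrams(tokens, n):
    """All contiguous n-token phrases."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

answer = "You can free the allocated memory with free()."
tokens = [t for t in tokenize(answer) if t not in STOP_WORDS]

# One row per answer in the ngrams table: unigrams, bigrams, trigrams.
rows = {n: ngrams(tokens, n) for n in (1, 2, 3)}
```

Each answer's phrase lists are then inserted into the ngrams table so the frequency calculation can run as SQL queries rather than re-tokenizing text at query time.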
Our program reads the query phrase entered by the user and executes the steps below:

Tokenize query terms - The search term "pointers in memory" is tokenized to
[pointers, in, memory].

Upon getting the list of the top 20 phrases (by frequency), we also fetch the corresponding
question and answer ids. These are used later to display the search results.
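These query-time steps can be sketched as follows, using an in-memory stand-in for the ngrams table (the real data lives in a SQLite/MySQL database). The matching rule, sample data, and function names here are illustrative, not our exact implementation.

```python
from collections import Counter, defaultdict

# Toy ngrams "table": answer id -> phrases extracted from that answer (illustrative).
NGRAMS = {
    101: ["pointers", "memory", "memory leak", "pointers in memory"],
    102: ["memory", "memory leak", "virtual memory"],
    103: ["memory", "virtual memory", "memory layout"],
}

def top_phrases(query, table, k=20):
    """Rank phrases from matching answers by frequency, keeping answer ids for display."""
    q_tokens = query.lower().split()
    counts, sources = Counter(), defaultdict(set)
    for answer_id, phrases in table.items():
        # An answer matches if any query token appears among its phrases.
        words = set(w for p in phrases for w in p.split())
        if not any(tok in words for tok in q_tokens):
            continue
        for p in phrases:
            counts[p] += 1
            sources[p].add(answer_id)
    # Top-k phrases with the ids needed to render the result list.
    return [(p, c, sorted(sources[p])) for p, c in counts.most_common(k)]

results = top_phrases("pointers in memory", NGRAMS)
```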
4.4 Hierarchy
In this project we went beyond the Findex algorithm and introduced hierarchy
into the categories. The unigrams in the resulting phrases were considered the
top-level categories. Bigrams containing a unigram were considered the second
level of the hierarchy, and similarly, trigrams containing a bigram were considered the
third level. While displaying the search results, we showed all three levels (where
applicable).

Below is a screenshot of the categories returned with their parent-child relationships.
Parent represents the Level 1 category (unigrams), Children represent the Level 2 category
(bigrams), and Grandchildren represent the Level 3 category (trigrams).
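The three-level nesting can be sketched as follows. This is a simplified version that tests containment by word membership rather than exact sub-phrase position, and the phrase list is illustrative.

```python
def build_hierarchy(phrases):
    """Nest bigrams under the unigrams they contain, and trigrams under those bigrams."""
    unigrams = [p for p in phrases if len(p.split()) == 1]
    bigrams = [p for p in phrases if len(p.split()) == 2]
    trigrams = [p for p in phrases if len(p.split()) == 3]
    tree = {}
    for u in unigrams:
        children = {}
        for b in (b for b in bigrams if u in b.split()):
            # Grandchildren: trigrams containing every word of the bigram.
            children[b] = [t for t in trigrams
                           if all(w in t.split() for w in b.split())]
        tree[u] = children
    return tree

phrases = ["memory", "memory leak", "virtual memory", "fix memory leak", "pointers"]
tree = build_hierarchy(phrases)
```

A category with no matching bigrams simply has no children, which is why some query terms in the screenshots show no sub-categories.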
The user can now select a particular category of interest. On selecting a category, the
list of associated questions is displayed in the right pane, and the answers can be
viewed by clicking on a question. This interface makes it easy for users to
browse through the list of questions corresponding to the selected category. In the
example below, the user is only interested in looking at the categories memory stream,
pointers, memory layout, and memory leak. There are 10 questions corresponding to
this selection.
5. Further analysis
We did further analysis of the categories using part-of-speech (POS) tagging. We found
some interesting results with this exercise, which could be used to classify the categories
further. From the POS-tagged categories we found patterns like Adjective-Noun, Verb-Noun,
and Noun-Noun.

The Adjective-Noun and Noun-Noun combinations depicted the types of a category. We can see that
transactional, variable, virtual, actual, and table are all types of memory.
Types
[('transactional', 'JJ'), ('memory', 'NN')]
[('variable', 'JJ'), ('memory', 'NN')]
[('virtual', 'JJ'), ('memory', 'NN')]
[('actual', 'JJ'), ('memory', 'NN')]
[('table', 'JJ'), ('memory', 'NN')]
The Verb-Noun combination depicts various actions on, or usages of, the noun in the category.
The actions that you can perform on memory include writing, freeing, allocating, loading,
sharing, etc. This pattern was clearly visible by identifying the verbs in the categories.
Actions
[('writing', 'VBG'), ('memory', 'NN')]
[('freeing', 'VBG'), ('memory', 'NN')]
[('allocated', 'VBD'), ('memory', 'NN')]
[('related', 'VBD'), ('memory', 'NN')]
[('cached', 'VBD'), ('memory', 'NN')]
[('loaded', 'VBD'), ('memory', 'NN')]
[('shared', 'VBD'), ('memory', 'NN')]
[('written', 'VBN'), ('memory', 'NN'), ('tested', 'VBN')]
[('written', 'VBN'), ('memory', 'NN')]
[('string', 'VBG'), ('memory', 'NN')]
From the above analysis we could clearly see patterns in the categories returned. Just by looking
at the parts of speech, it was easy to categorize the list further.
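This grouping can be sketched as below. The tagged tuples mirror the examples above (produced by nltk.pos_tag in our pipeline); the grouping function and its bucket names are illustrative.

```python
# Tagged categories as produced by a POS tagger (nltk.pos_tag in our pipeline).
tagged = [
    [("transactional", "JJ"), ("memory", "NN")],
    [("virtual", "JJ"), ("memory", "NN")],
    [("writing", "VBG"), ("memory", "NN")],
    [("shared", "VBD"), ("memory", "NN")],
]

def group_by_pattern(categories):
    """Bucket categories by coarse tag pattern: JJ-NN -> types, VB*-NN -> actions."""
    groups = {"types": [], "actions": [], "other": []}
    for cat in categories:
        # Collapse VBG/VBD/VBN etc. to their two-letter prefix.
        tags = [tag[:2] for _, tag in cat]
        phrase = " ".join(word for word, _ in cat)
        if tags == ["JJ", "NN"]:
            groups["types"].append(phrase)
        elif tags == ["VB", "NN"]:
            groups["actions"].append(phrase)
        else:
            groups["other"].append(phrase)
    return groups

groups = group_by_pattern(tagged)
```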
6. Work Division

Task                                                      Shubham   Priya   Sonali
Initial loading of questions and answers into database       100%      0%      0%
Entire dataset: tokenization, stop word removal,
  and loading ngrams                                          90%      5%      5%
Frequency calculation                                          0%     10%     90%
Hierarchical categorization                                   20%     20%     60%
POS tagging                                                    0%     90%     10%
Front end                                                      5%     80%     15%
Documentation                                                 15%      5%     80%
Code Cleanup                                                  33%     33%     33%
7. Code
We wrote the code for the Findex algorithm and the front end from scratch, along with the code to
extend Findex to do hierarchical categorization and part-of-speech tagging.
You can find our code repository here: https://github.com/sonalisharma/nlp_sentence_clustering
8. Bibliography
[1] Kummamuru, Krishna, et al. "A hierarchical monothetic document clustering algorithm for
summarization and browsing search results." Proceedings of the 13th international conference on
World Wide Web. ACM, 2004.
[2] Muralidharan, Aditi, and Marti Hearst. "Wordseer: Exploring language use in literary text."
Fifth Workshop on Human-Computer Interaction and Information Retrieval. 2011.
[3] Käki, Mika, and Anne Aula. "Findex: improving search result use through automatic filtering
categories." Interacting with Computers 17.2 (2005): 187-206.