
Stack Search

StackOverflow Search Clustering


By
Sonali Sharma, Shubham Goel, Priya Iyer
Project Report
Report presented towards the completion of the class project for INFO 256
Applied Natural Language Processing
Date: 12/16/2013

Instructor
Marti Hearst

Mentor
Aditi Muralidharan

1. Introduction
There are a number of search engines available today. When users type in a query to
search for something on the internet, they are often overwhelmed by the large number
of results. It often becomes difficult to browse through this list and fetch the most
relevant items. We implemented an algorithm that improves user search using NLP
techniques and provides a mechanism to categorize search results into relevant
categories, thereby making it easy for the end user to navigate to the category of
interest and look for results within that category. We decided to implement the Findex
algorithm in this project. Findex is a text categorization algorithm that provides an
overview of search results as categories, where the categories are made up of the most
frequent words and phrases in the resulting document set. The algorithm is based on
the assumption that the most frequently used words and phrases in a set of documents
capture its major topics very well. We used StackOverflow data to implement the
algorithm.

2. Project Goals
2.1 Original intent
We originally intended to work on clustering similar sentences using a monothetic clustering
algorithm such as DisCover [1]. Monothetic clustering is a clustering technique wherein each
cluster is formed using only one feature, and that single feature is present across all the samples,
which in our case are documents. The DisCover algorithm is one such monothetic clustering
algorithm. We also intended to use the WordSeer [2] project and implement the algorithm on its
associated humanities corpora. However, upon deeper analysis we realized there were some
hurdles to this approach, as discussed below.
2.2 Algorithm
The DisCover algorithm aims for full coverage; however, search results do not necessarily have
to fall under clusters. Moreover, understanding and implementing the algorithm appeared to be
complex and fell outside the scope of a class project. In addition, a good measure and method of
evaluation for the DisCover algorithm had not been established, so we decided to use another
algorithm, Findex [3], to categorize search results. The Findex algorithm is discussed in detail in
an upcoming section.
2.3 Data
The developer of WordSeer, Aditi Muralidharan, had set up two text corpora to work with
WordSeer. However, the texts seemed more appropriate for testing purposes than for actual
end-user use.

2.4 WordSeer
In addition, WordSeer had already implemented a version of the Findex algorithm for clustering
search results. There wasn't much left for us to do there, and we really wanted to learn more
about Findex and its implementation.
Thus, we decided to implement a search result clustering algorithm (Findex) over StackOverflow
data using the StackOverflow API. We built a search interface that lets StackOverflow users type
in their questions and see the results presented in an organized manner.
2.5 Accomplishments
Data
Data processing was a much harder task than we had originally expected. Having used the
ready-made database that Aditi had uploaded to the WordSeer platform, we had not expected the
data cleaning and loading tasks to be so arduous. The StackOverflow data came as huge XML
and HTML files. A lot of our time was consumed in extracting the data, cleaning it, parsing it,
and loading it into the database.
Algorithm
We decided to implement Findex as our algorithm of choice for categorizing the search results.
Details of the implementation are discussed in an upcoming section. Our original intent was to
create only one layer of categories for the search result topics, but we ended up creating a
second layer of sub-categories by recursively applying the Findex algorithm over each of the
parent categories. From an NLP standpoint, we ended up implementing a range of concepts,
from word tokenization, lemmatization, and n-gram creation to phrase frequency distribution.
Search User Interface
Initially, our vision was only to create the categories over the search results and display them
on the console. However, we decided to go a step further and create a web interface as well. We
realized that visualizing the categories and sub-categories along with the content of the
questions and responses from StackOverflow would be a good way to drive home the point
about search result categorization.
Future Work
We intend to plug the system we developed into the StackOverflow interface, allowing users to
quickly browse to the questions and responses of interest through the categorized topics.
Currently, the StackOverflow user interface only has tags as a means of classification. The issue
with tags is that they are user-generated and may not necessarily be relevant to the question.
2.6 Results

The following figures present the results of our project. Each figure shows a different query
term and the categories corresponding to it. Notice how some query terms have sub-categories
while others do not. This is because Findex only displays a category or sub-category if a
substantial number of questions falls under it. Detailed descriptions of these result pages are
provided in the next section.

Fig. 1 Query word: python

Fig. 2 Query word: databases in python

Fig. 3 Query word: Java Interview Questions

Fig. 4 Part-of-speech tagging for query word: memory management

3. Data
We downloaded the Stack Exchange Creative Commons Data Dump, which contains all
the public data from websites like Stack Overflow, Server Fault, Stack Apps, etc., up to
September 2011. The data files were in XML format, with each question and answer
being an entry under a <row> tag. Because the relevant Stack Overflow data was over
4 GB in size, we first built our database using the data from english.stackexchange.com,
which was about 21 MB in size. This allowed us to make progress in parallel on the
database front and the algorithm front. To save time, we initially set everything up in a
SQLite database. We processed the XML file using Python's built-in xml.dom library
and stored the answers and questions in different tables. We cleaned the answer text
by removing all the HTML tags and dropping the content of certain tags like <code>
using BeautifulSoup, since we did not want to pass code snippets to our Findex engine.
After cleaning the answer text, we merged the answer and question tables and got rid
of all the extra data, such as unanswered questions and irrelevant answers. Using this
temporary database we started building our Findex engine and the user interface. As
the next step, we started building our database with the Stack Overflow data and
switched from SQLite to MySQL. The xml.dom library did not work well for parsing a
large data set, as it reads the entire data file into memory before processing it. We
switched to xml.etree.cElementTree, which is a C-based library, and had to make some
changes to the SQL statements to import the Stack Overflow data into the MySQL
database. This gave us over 1 million questions with cleaned answer text. For our final
demonstration and user interface, we used a subset of ten thousand questions in a
SQLite database to improve the response time of the system.
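
To make this concrete, below is a minimal sketch of the streaming parse-and-clean step. The
file and database names are hypothetical, and the row attributes (Id, PostTypeId, Body) are
assumptions based on the public dump's posts schema; the code uses the standard
xml.etree.ElementTree interface, which cElementTree also exposes.

import sqlite3
import xml.etree.ElementTree as ET  # the project used the C-accelerated cElementTree variant

from bs4 import BeautifulSoup

def clean_body(html):
    """Strip HTML tags and drop the contents of <code> tags entirely."""
    soup = BeautifulSoup(html, "html.parser")
    for code in soup.find_all("code"):
        code.decompose()  # code snippets are not passed to the Findex engine
    return soup.get_text(" ", strip=True)

conn = sqlite3.connect("stackoverflow.db")  # hypothetical database name
conn.execute("CREATE TABLE IF NOT EXISTS answers (id INTEGER, body TEXT)")

# iterparse streams the XML instead of reading the whole 4 GB file into memory
for event, row in ET.iterparse("posts.xml", events=("end",)):
    if row.tag == "row" and row.get("PostTypeId") == "2":  # PostTypeId 2 = answer
        conn.execute("INSERT INTO answers VALUES (?, ?)",
                     (int(row.get("Id")), clean_body(row.get("Body", ""))))
    row.clear()  # release parsed elements to keep memory bounded
conn.commit()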

4. Algorithms
We implemented a modified version of the Findex algorithm. Below are the major steps
of the algorithm as we used it.
4.1 Text Mining
This was one of the most important steps. Before calculating frequencies, the data had
to be transformed into a particular format and stored in the SQLite database. The clean
data had to be tokenized, lemmatized, and converted to n-grams (up to trigrams).
Tokenization - The answer text was tokenized using the NLTK tokenizer. Tokenization
splits a string into a list of its constituent words.
Stop word removal - In order to get more relevant categories after applying a frequency
distribution to the tokenized words, it was important to exclude stop words. Without
excluding them, stop words would appear in the most-frequent-words list and would
result in meaningless categories. We decided to use the stop word list from the
linguistic tools resources of the Information Retrieval group at the University of
Glasgow. This was an exhaustive list of stop words and worked well on our dataset.
The list of stop words can be viewed here.
Lemmatization - After removing stop words, the next step was to lemmatize the words.
Lemmatization was important: without it, simple inflections of words such as debug and
debugging, list and listing, or car and cars would appear as separate categories. To
prevent this, we used the WordNet lemmatizer to lemmatize the tokens.
N-grams - The next step was to create unigrams, bigrams, and trigrams. In formulating
categories we decided to go up to trigrams, as phrases of up to three words made
more relevant categories than longer phrases. For every answer, we created a list of
unigrams, bigrams, and trigrams and stored them in an ngrams table. We store the
original unigrams, bigrams, and trigrams along with their lemmatized versions, as
shown below.

Fig. 5 Table storing ngrams for tokenized answers
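
A minimal sketch of this preprocessing pipeline is shown below, assuming NLTK's punkt
and WordNet resources are installed; the STOPWORDS set stands in for the Glasgow stop
word list.

from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize
from nltk.util import ngrams

# one-time model downloads: nltk.download("punkt"); nltk.download("wordnet")

STOPWORDS = set()  # stands in for the Glasgow stop word list
lemmatizer = WordNetLemmatizer()

def preprocess(answer_text):
    """Tokenize, drop stop words, lemmatize, and emit uni-, bi-, and trigrams."""
    tokens = [t.lower() for t in word_tokenize(answer_text) if t.isalpha()]
    tokens = [t for t in tokens if t not in STOPWORDS]
    lemmas = [lemmatizer.lemmatize(t) for t in tokens]
    phrases = []
    for n in (1, 2, 3):
        phrases += [" ".join(gram) for gram in ngrams(lemmas, n)]
    return phrases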

4.2 User Query


The previous step was a one-time activity to load the dataset into the database. At
search time, the user is asked to enter a query in the search interface.

Fig. 6 User Interface to input query term

Our program reads the query phrase entered by the user and executes the steps below
(a sketch of these steps follows the list):
Tokenize query terms - The search term pointers in memory is tokenized to
[pointers, in, memory].
Lemmatize terms - The tokens are then lemmatized to [pointer, in, memory].
Remove stop words - After removing stop words, we are left with [pointer, memory].
Find relevant questions - For each remaining term, we check the ngrams table to fetch
the list of phrases containing the term. Upon fetching the phrases, we calculate their
frequencies and arrange them in descending order.
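
The sketch below continues the names from the preprocessing sketch; the ngrams table
and its lemmatized_phrase and answer_id columns are assumed names, not our exact
schema.

import sqlite3

def process_query(query):
    """Tokenize, lemmatize, and strip stop words from the user's query."""
    tokens = [t.lower() for t in word_tokenize(query)]
    lemmas = [lemmatizer.lemmatize(t) for t in tokens]
    return [t for t in lemmas if t not in STOPWORDS]

def matching_phrases(conn, terms):
    """Fetch every stored phrase (with its answer id) containing a query term."""
    rows = []
    for term in terms:
        cur = conn.execute(
            "SELECT lemmatized_phrase, answer_id FROM ngrams "
            "WHERE lemmatized_phrase LIKE ?", ("%" + term + "%",))
        rows.extend(cur.fetchall())
    return rows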
4.3 Frequency distribution
After fetching all phrases containing the query terms, we calculate the frequency
distribution of the phrases and filter the results down to the top 20 phrases. The figure
below shows the phrases, along with their frequencies, for the query pointers in
memory. Each of these phrases is considered a separate category.

Fig. 7 Most frequent phrases

Upon getting the list of the top 20 phrases (by frequency), we also fetch the
corresponding question and answer ids. These are used later to display the search results.
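
For illustration, this filtering step can be written with collections.Counter; the sketch
below continues the assumed names from the previous sketches.

from collections import Counter

rows = matching_phrases(conn, process_query("pointers in memory"))
freq = Counter(phrase for phrase, _ in rows)
top_phrases = [phrase for phrase, _ in freq.most_common(20)]  # one category per phrase
# answer ids per category, used later to display the search results
answer_ids = {p: [aid for phrase, aid in rows if phrase == p] for p in top_phrases}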
4.4 Hierarchy
In this project we went beyond the Findex algorithm and introduced a hierarchy over
the categories. The unigrams in the resulting phrases were considered the top-level
categories. Bigrams containing a unigram were considered the second level of the
hierarchy, and similarly, trigrams containing a bigram were considered the third level.
While displaying the search results, we showed all three levels (where applicable).
Below is a screenshot of the categories returned, with parent-child relationships. Parent
represents the Level 1 category (unigrams), Children represent Level 2 categories
(bigrams), and Grandchildren represent Level 3 categories (trigrams).

Fig. 8 Creating hierarchy of phrases
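
A hedged sketch of this nesting step, using the top_phrases list from the previous sketch;
the containment tests are simple string and token checks, not our exact matching rules.

def build_hierarchy(top_phrases):
    """Nest phrases: unigram -> bigrams containing it -> trigrams containing the bigram."""
    by_len = {n: [p for p in top_phrases if len(p.split()) == n] for n in (1, 2, 3)}
    tree = {}
    for uni in by_len[1]:
        children = {}
        for bi in by_len[2]:
            if uni in bi.split():  # the bigram contains the unigram as a word
                children[bi] = [tri for tri in by_len[3] if bi in tri]
        tree[uni] = children
    return tree

# e.g. build_hierarchy(["memory", "memory leak", "fix memory leak"])
# -> {"memory": {"memory leak": ["fix memory leak"]}}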

4.5 Displaying results


Front end
The front end to display the results was built using the Flask framework. We connected
to the SQLite3 database to fetch results, and the web page was built using Flask's
Jinja2 templates. The search interface was simple, and the user could provide any type
of search query. As described in the section above, the search query is parsed first,
and the results are calculated on the fly and returned in hierarchical order. As shown in
the figure below, the results are arranged by the frequency of the phrases. The
hierarchy of the categories also facilitates search for the end user.

Fig. 9 Search results for query pointers in memory
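
A minimal sketch of the Flask route behind this page, reusing the helper sketches above;
the route path, template name, and database file are illustrative assumptions.

import sqlite3
from collections import Counter
from flask import Flask, render_template, request

app = Flask(__name__)

@app.route("/search")
def search():
    query = request.args.get("q", "")
    conn = sqlite3.connect("stackoverflow.db")  # hypothetical database name
    terms = process_query(query)                # tokenize, lemmatize, drop stop words
    rows = matching_phrases(conn, terms)        # (phrase, answer_id) pairs from the ngrams table
    freq = Counter(phrase for phrase, _ in rows)
    top = [phrase for phrase, _ in freq.most_common(20)]
    tree = build_hierarchy(top)                 # unigram -> bigram -> trigram nesting
    return render_template("results.html", query=query, categories=tree)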

The user can now select particular categories of interest. On selecting a category, the
list of associated questions is displayed in the right pane. The answers can be viewed
by clicking on a question. This interface makes it easy for users to browse through the
list of questions corresponding to the selected categories. In the example below, the
user is only interested in the categories memory stream, pointers, memory layout, and
memory leak. There are 10 questions corresponding to this selection.

Fig. 10 Selecting categories to view relevant search results

5. Further Analysis
We did further analysis of the categories using part-of-speech (POS) tagging. This exercise
produced some interesting results that could be used to classify the categories further. Among
the POS-tagged categories we found patterns like Adjective-Noun, Verb-Noun, and Noun-Noun.
The Adjective-Noun and Noun-Noun combinations indicate types of the category's head noun.
We can see that transactional, variable, virtual, actual, and table are all types of memory.
Types
[('transactional', 'JJ'), ('memory', 'NN')]
[('variable', 'JJ'), ('memory', 'NN')]
[('virtual', 'JJ'), ('memory', 'NN')]
[('actual', 'JJ'), ('memory', 'NN')]
[('table', 'JJ'), ('memory', 'NN')]

The Verb-Noun combination captures actions on, or usages of, the noun in the category. The
actions that one can perform on memory include writing, freeing, allocating, loading, sharing,
etc. This pattern became clearly visible once we identified the verbs in the categories.
Actions
[('writing', 'VBG'), ('memory', 'NN')]
[('freeing', 'VBG'), ('memory', 'NN')]
[('allocated', 'VBD'), ('memory', 'NN')]
[('related', 'VBD'), ('memory', 'NN')]
[('cached', 'VBD'), ('memory', 'NN')]
[('loaded', 'VBD'), ('memory', 'NN')]
[('shared', 'VBD'), ('memory', 'NN')]
[('written', 'VBN'), ('memory', 'NN'), ('tested', 'VBN')]
[('written', 'VBN'), ('memory', 'NN')]
[('string', 'VBG'), ('memory', 'NN')]

From the above analysis we could clearly see patterns in the categories returned. Just by looking
at the parts of speech, it was easy to classify the list further.
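
A hedged sketch of how these patterns can be matched with NLTK's POS tagger; the bucket
names "type" and "action" are our own illustrative labels.

import nltk  # requires the averaged_perceptron_tagger model

def classify_category(phrase):
    """Bucket a category phrase by its POS pattern."""
    tags = [tag for _, tag in nltk.pos_tag(phrase.split())]
    if len(tags) >= 2 and tags[0] in ("JJ", "NN") and tags[1] == "NN":
        return "type"    # e.g. ('virtual', 'JJ') ('memory', 'NN')
    if len(tags) >= 2 and tags[0].startswith("VB") and tags[1] == "NN":
        return "action"  # e.g. ('freeing', 'VBG') ('memory', 'NN')
    return "other"

print(classify_category("virtual memory"))   # -> type
print(classify_category("freeing memory"))   # -> action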

6. Contributions of Each Team Member


Tasks | Shubham | Priya | Sonali
Initial parsing of Stack Overflow data | 100% | 0% | 0%
Initial loading of questions and answers into database | 90% | 5% | 5%
Entire dataset: tokenization, stop word removal, and loading ngrams | 0% | 10% | 90%
Frequency calculation | 20% | 20% | 60%
Hierarchical categorization | 0% | 90% | 10%
POS tagging | 5% | 80% | 15%
Front end | 15% | 5% | 80%
Documentation | 33% | 33% | 33%
Code cleanup | 33% | 33% | 33%

7. Code
We wrote the code for the Findex algorithm and the front end from scratch, along with the code
that extends Findex to do hierarchical categorization and part-of-speech tagging.
You can find our code repository here: https://github.com/sonalisharma/nlp_sentence_clustering

8. Bibliography
[1] Kummamuru, Krishna, et al. "A hierarchical monothetic document clustering algorithm for
summarization and browsing search results." Proceedings of the 13th international conference on
World Wide Web. ACM, 2004.
[2] Muralidharan, Aditi, and Marti Hearst. "Wordseer: Exploring language use in literary text."
Fifth Workshop on Human-Computer Interaction and Information Retrieval. 2011.
[3] Käki, Mika, and Anne Aula. "Findex: improving search result use through automatic filtering
categories." Interacting with Computers 17.2 (2005): 187-206.
