
Chapter 21

Text Mining
Introduction
Data mining is the process of finding and exploiting useful
patterns in data.
Text Mining for Derived Columns
Perhaps the most common use of text mining is to add new derived
columns into a model set.
Extracting derived variables is usually a matter of looking for
specific patterns in the text.
An address can be used to identify whether someone lives in an apartment
by looking for an apartment number.
If the address contains any of the following, it probably includes an apartment number:
Apt. in any case
#
An address line beginning with Apt. or Apartment
Unit
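The checks above can be sketched with a few regular expressions. The patterns below are illustrative only, not a complete rule set for real addresses.

```python
import re

def looks_like_apartment(address):
    """Heuristic check for the apartment indicators listed above.
    Illustrative, not exhaustive."""
    # "Apt." or "Apartment" in any case, anywhere in the address
    if re.search(r"\bapt\.?\b|\bapartment\b", address, re.IGNORECASE):
        return True
    # A "#" followed by a unit number
    if re.search(r"#\s*\d+", address):
        return True
    # The word "Unit"
    if re.search(r"\bunit\b", address, re.IGNORECASE):
        return True
    return False

print(looks_like_apartment("123 Main St, Apt. 4B"))  # True
print(looks_like_apartment("123 Main St # 12"))      # True
print(looks_like_apartment("456 Oak Avenue"))        # False
```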
Sources of Text
E-mails sent by customers
Notes entered by customer service reps, doctors, nurses, garage
mechanics, and so on
Transcriptions (voice-to-text translation) of customer service calls
Comments on websites
Newspaper and magazine articles
Professional reports
Basic Approaches to Representing Documents
There is a continuum of approaches for understanding documents.
At one end is the bag of words approach, where documents are
considered merely a collection of their words.
At the other end is the understanding approach, where an
attempt is made to actually understand the document and what
each word specifically means.
4
Representing Documents in Practice
Stop Words
Stop words are words with little meaning, that is, little ability to
differentiate between documents.
For instance, virtually all documents in English contain the word the, and
this word carries essentially no meaning for typical text mining applications
such as classification, deriving variables, and navigation.
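A minimal sketch of stop-word removal, using a tiny hand-picked stop list; real lists run to hundreds of words.

```python
# A tiny illustrative stop list; real lists are much longer.
STOP_WORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "is", "was"}

def remove_stop_words(tokens):
    """Drop tokens that appear on the stop list (case-insensitive)."""
    return [t for t in tokens if t.lower() not in STOP_WORDS]

tokens = "the delivery of the package was late".split()
print(remove_stop_words(tokens))  # ['delivery', 'package', 'late']
```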
Stemming
The process of reducing words to their stem: the base word or almost-word
that provides the meaning without additional grammatical information.
For instance, the word stemming would be transformed into the word
stem, as would stems and stemmed.
The purpose of stemming is to better capture the content of a document.
One customer complaint might refer to late delivery and another might
say not delivered on time, two phrases that have no words in common.
Using stemming, they would have the word deliver in common.
Word pairs and phrases
Identifying word pairs and phrases is important for understanding text.
The rock group The Who is a famous example of what can happen with automated
text processing.
Most stop word lists would include both the and who on the list, so the phrase would
disappear entirely from the document.
This could be very problematic.
There are two solutions to this problem.
The easy solution is to keep capitalized stop words, or at least capitalized stop words
that are not at the beginning of the sentence.
A more sophisticated solution is to search for common word pairs and phrases, and to
be sure to keep these.
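The second solution can be sketched by counting adjacent word pairs across a corpus and keeping the frequent ones. The minimum-count threshold below is arbitrary, and the documents are invented.

```python
from collections import Counter

def common_word_pairs(documents, min_count=2):
    """Count adjacent word pairs across documents and keep the frequent
    ones, so phrases like 'The Who' can be protected from stop lists."""
    pairs = Counter()
    for doc in documents:
        words = doc.split()
        pairs.update(zip(words, words[1:]))
    return {pair: n for pair, n in pairs.items() if n >= min_count}

docs = [
    "The Who played last night",
    "tickets for The Who sold out",
]
print(common_word_pairs(docs))  # {('The', 'Who'): 2}
```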
Using a Lexicon
A lexicon is a list of words that are important.
It might also include synonyms, so that several different words, including
misspellings, can be combined into a single idea.
For instance, flight, fl, and flt might all represent flight in airline comments.
From Text to Numbers
Techniques that use the bag-of-words approach transform the
bag of words into a giant table of numbers, the term-document
matrix.
The term-document matrix is a simple array, where each row
represents a single document and each column represents a
particular word.
Typically, the number of words in a document is reduced through
several steps:
Fixing misspellings
Removing common words and words with little meaning (stop words)
Stemming
Replacing words with synonyms
The result is a vocabulary or lexicon, typically of several hundred
to several thousand words that describe each document.
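As a sketch, a binary term-document matrix can be built directly from tokenized documents; the example documents are invented.

```python
def term_document_matrix(documents):
    """Build a binary term-document matrix: one row per document,
    one column per vocabulary word."""
    vocabulary = sorted({word for doc in documents for word in doc.split()})
    matrix = [[1 if word in doc.split() else 0 for word in vocabulary]
              for doc in documents]
    return vocabulary, matrix

docs = ["deliver late", "refund pay", "deliver refund"]
vocab, matrix = term_document_matrix(docs)
print(vocab)   # ['deliver', 'late', 'pay', 'refund']
print(matrix)  # [[1, 1, 0, 0], [0, 0, 1, 1], [1, 0, 0, 1]]
```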
The cells in the matrix contain zero if the word is not in
the document.
Words that are in the document could simply contain the
value one, indicating the presence of the word.
Another possibility is the count of words in the document.
More commonly, though, the value is the inverse
document frequency, or equivalently minus the log of the
document frequency.
The inverse document frequency is one divided by the fraction of
documents containing the term.
Words that appear in many documents have low values; words that appear
in few documents have higher values.
Minus the log of the document frequency behaves in a similar way.
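A sketch of this weighting, computing log(N / document count) for each term so that a term appearing in every document gets weight zero; the documents are invented.

```python
import math

def inverse_document_frequency(documents):
    """For each term, compute log(N / number of documents containing it),
    so rare terms get higher weights and ubiquitous terms get zero."""
    n_docs = len(documents)
    doc_sets = [set(doc.split()) for doc in documents]
    vocabulary = set().union(*doc_sets)
    idf = {}
    for term in vocabulary:
        df = sum(1 for s in doc_sets if term in s)
        idf[term] = math.log(n_docs / df)
    return idf

docs = ["deliver late", "deliver refund", "deliver pay refund"]
idf = inverse_document_frequency(docs)
# "deliver" appears in all 3 documents -> weight log(1) = 0
print(idf["deliver"])  # 0.0
```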
Each document can be thought of as a point in a giant
term space.
A corpus can contain thousands of possible terms.
These terms form a space, where each term is along an
axis.
There are thousands of dimensions.
The data is quite sparse, meaning that most documents
do not contain most terms.
High dimensional sparse data is a big challenge in data
mining.
The solution is to use the singular value decomposition (SVD)
to reduce the dimensionality, in the same way that principal
components are used.
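A minimal sketch of this reduction using NumPy's SVD routine, projecting each document onto the top-k singular vectors; the matrix and the choice of k are illustrative.

```python
import numpy as np

def reduce_dimensions(term_doc_matrix, k=2):
    """Project documents onto the top-k singular vectors,
    analogous to principal components."""
    X = np.asarray(term_doc_matrix, dtype=float)
    U, S, Vt = np.linalg.svd(X, full_matrices=False)
    # Each document is now described by k numbers instead of
    # one number per vocabulary word.
    return U[:, :k] * S[:k]

matrix = [[1, 1, 0, 0],
          [0, 0, 1, 1],
          [1, 0, 1, 1]]
reduced = reduce_dimensions(matrix, k=2)
print(reduced.shape)  # (3, 2)
```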
Parsing
In general, parsing works by replacing punctuation
with spaces, and then taking all terms between
spaces.
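This parsing step can be sketched in a few lines, using Python's standard punctuation table.

```python
import string

def parse(text):
    """Replace punctuation with spaces, then take the terms
    between spaces."""
    table = str.maketrans(string.punctuation, " " * len(string.punctuation))
    return text.translate(table).lower().split()

print(parse("Customer paid too much on last bill; money refunded."))
# ['customer', 'paid', 'too', 'much', 'on', 'last', 'bill', 'money', 'refunded']
```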
Fixing Misspellings
The automated task is simply a matter of constructing
a valid dictionary and choosing the closest valid term for each word.
This is an iterative task, where you start with a
dictionary and find the closest word to each word not
in the dictionary.
Some words are quite close (real misspellings) and
some are quite far away (suggestions for new words
to add into the dictionary).
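One way to sketch the "closest term" step is with the standard library's difflib; the dictionary and similarity cutoff below are illustrative assumptions.

```python
import difflib

# Illustrative dictionary; in practice this is built iteratively.
DICTIONARY = ["flight", "refund", "delivery", "customer", "payment"]

def correct(word, dictionary=DICTIONARY, cutoff=0.7):
    """Return the closest dictionary word, or the word itself if nothing
    is close enough (a candidate to add to the dictionary)."""
    matches = difflib.get_close_matches(word, dictionary, n=1, cutoff=cutoff)
    return matches[0] if matches else word

print(correct("refnud"))    # 'refund' -- a real misspelling, quite close
print(correct("showtime"))  # 'showtime' -- far from everything; a new word
```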
Stemming
Stemming transforms words into their root forms.
For example, one comment might contain Customer paid too much
on last bill; money refunded. Another might say Refunding
overpayment.
These two comments have no words in common, yet they are saying
essentially the same thing.
The stemming algorithm recognizes that overpayment and paid
both have the root of pay.
Similarly, refunded and refund both have the root of refund.
Stemming turns the two comments into Customer pay too much on
last bill; money refund and Refund pay.
After stemming, the comments are not grammatically correct.
On the other hand, comments similar to each other are much more
likely to contain similar terms.
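General-purpose stemmers such as the Porter algorithm work by suffix rules; mapping overpayment all the way down to pay also requires prefix handling, so the sketch below just uses a small hand-built root table to illustrate the effect.

```python
# Hand-built root table for illustration; a real stemmer derives
# roots from suffix rules rather than a lookup table.
ROOTS = {
    "paid": "pay", "overpayment": "pay", "pays": "pay",
    "refunded": "refund", "refunding": "refund",
    "stemming": "stem", "stems": "stem", "stemmed": "stem",
}

def stem(tokens):
    """Replace each token with its root, lowercasing along the way."""
    return [ROOTS.get(t.lower(), t.lower()) for t in tokens]

print(" ".join(stem("Customer paid too much on last bill money refunded".split())))
# customer pay too much on last bill money refund
print(" ".join(stem("Refunding overpayment".split())))
# refund pay
```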
Applying Synonym Lists
Synonym lists are words and phrases that are
recognized in the text and replaced by a common
synonym.
These serve several purposes, including fixing
misspellings and finding word phrases.
For example, Change address and Change phone
number both turn into Change account info.
The lists can also be used to fix misspellings.
For example, these might all be synonyms for Showtime:
Showtime / Show time / Show-time / ST / Showt / Shwotme
The synonym lists turn these into a single word (in this case
Showtime).
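A sketch of synonym replacement, applying multi-word phrases first and then single-token variants; the lists combine the Showtime variants above with the airline example.

```python
# Multi-word phrases are replaced first, then single-token variants.
SYNONYM_PHRASES = {"show time": "Showtime", "show-time": "Showtime"}
SYNONYM_WORDS = {"st": "Showtime", "showt": "Showtime", "shwotme": "Showtime",
                 "fl": "flight", "flt": "flight"}

def apply_synonyms(text):
    """Replace known variants, including misspellings, with a single
    canonical term."""
    text = text.lower()
    for phrase, canonical in SYNONYM_PHRASES.items():
        text = text.replace(phrase, canonical)
    return " ".join(SYNONYM_WORDS.get(tok, tok) for tok in text.split())

print(apply_synonyms("Shwotme listings for flt 12"))
# Showtime listings for flight 12
```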
Using a Stop List
The stop list contains words that have minimal meaning.
Stop words can also be meaningful terms that simply do not distinguish
between comments.
The purpose of the stop word list is to remove words that do not
distinguish between different comments, even when these words might
seem meaningful.
Converting Text to Numbers
Singular value decomposition (SVD) is used to transform documents into
numbers.
Clustering
The resulting document vectors can be clustered with Gaussian mixture
models (GMM), also known as expectation-maximization clustering.
Several other methods can also be used, notably k-means.
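As a sketch of the simpler alternative, a minimal k-means on document vectors in a reduced two-dimensional term space; the points are invented, and GMM/EM generalizes this with soft, probabilistic assignments.

```python
def k_means(points, k, iterations=20):
    """Minimal k-means sketch on reduced document vectors."""
    # For simplicity, use the first k points as initial centers.
    centers = [tuple(p) for p in points[:k]]
    clusters = [[] for _ in range(k)]
    for _ in range(iterations):
        # Assignment step: each point joins its nearest center.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k),
                          key=lambda i: sum((a - b) ** 2
                                            for a, b in zip(p, centers[i])))
            clusters[nearest].append(p)
        # Update step: move each center to the mean of its cluster.
        for i, members in enumerate(clusters):
            if members:
                centers[i] = tuple(sum(dim) / len(members)
                                   for dim in zip(*members))
    return centers, clusters

# Documents as points in a reduced 2-D term space (illustrative values).
points = [(0.1, 0.2), (0.0, 0.1), (5.0, 5.1), (5.2, 4.9)]
centers, clusters = k_means(points, k=2)
print(sorted(len(c) for c in clusters))  # [2, 2]
```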
RapidMiner Practice
To see:
Training Videos\04 - Neil McGuigan - VancouverData\
Text Mining 1 - Loading Text Into RapidMiner
Text Mining 2 - Processing Text In RapidMiner
Text Mining 3 - Text Association Rules in RapidMiner
Text Mining 4 - Document Similarity and Clustering in
RapidMiner
Text Mining 5 - Automatic Classification of Documents
using RapidMiner
Text Mining 6 - Applying Model To New Documents
To practice:
Do the exercises presented in the movies using the file
TextMiningData.xls.
