Olivier Grisel
http://twitter.com/ogrisel
PyCon 2011
Outline
Why text classification?
What is text classification?
How?
scikit-learn
NLTK
Google Prediction API
Some results
Applications of Text Classification
[Table: example tasks and their predicted outcomes]
[Diagram: a new text document (or image, sound) is turned into a features vector and fed to a predictive model, which outputs the expected label]
Bags of Words
Tokenize document: list of uni-grams
['the', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog']
Binary occurrences / counts:
{'the': True, 'quick': True...}
Frequencies:
{'the': 0.22, 'quick': 0.11, 'brown': 0.11, 'fox': 0.11}
TF-IDF
{'the': 0.001, 'quick': 0.05, 'brown': 0.06, 'fox': 0.24}
Better than raw frequencies: TF-IDF
Term Frequency × Inverse Document Frequency
WordNGramAnalyzer(min_n=1, max_n=2).analyze(text)
[u'ai', u'mange', u'du', u'kangourou', u'ce', u'midi', u'etait',
 u'pas', u'tres', u'bon', u'ai mange', u'mange du', u'du kangourou',
 u'kangourou ce', u'ce midi', u'midi etait', u'etait pas',
 u'pas tres', u'tres bon']
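The output above suggests the analyzer also lowercases and strips accents; the n-gram windowing part alone can be approximated in plain Python (the `word_ngrams` name is made up for this sketch):

```python
def word_ngrams(text, min_n=1, max_n=2):
    # naive whitespace tokenization; the real analyzer also strips
    # accents and punctuation
    tokens = text.lower().split()
    ngrams = []
    for n in range(min_n, max_n + 1):
        for i in range(len(tokens) - n + 1):
            ngrams.append(' '.join(tokens[i:i + n]))
    return ngrams

word_ngrams("ai mange du kangourou")
# ['ai', 'mange', 'du', 'kangourou', 'ai mange', 'mange du', 'du kangourou']
```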
Feature Extraction in scikit-learn
import pickle
from scikits.learn.features.text import CharNGramAnalyzer, Vectorizer

analyzer = CharNGramAnalyzer(min_n=1, max_n=3)
vec = Vectorizer(analyzer=analyzer)
features = vec.fit_transform(list_of_documents)

# clf is a classifier previously fitted on (features, labels);
# it survives a pickle round-trip
clf2 = pickle.loads(pickle.dumps(clf))
predicted_labels = clf2.predict(features_of_new_docs)
[Diagram: supervised text classification pipeline]
Training: text documents (images, sounds...) with labels
  -> vec.transform(docs) -> features vectors
  -> clf.fit(X, y) -> machine learning algorithm
Prediction: new text document (image, sound)
  -> vec.transform(docs_new) -> features vector
  -> clf.predict(X_new) -> expected label
NLTK
Code: ASL 2.0 & Book: CC-BY-NC-ND
Tokenizers, Stemmers, Parsers, Classifiers,
Clusterers, Corpus Readers
NLTK Corpus Downloader
>>> import nltk
>>> nltk.download()
Using an NLTK corpus
>>> from nltk.corpus import movie_reviews as reviews
>>> pos_ids = reviews.fileids('pos')
>>> reviews.words(pos_ids[0])
['films', 'adapted', 'from', 'comic', 'books', 'have', ...]
Common data cleanup operations
Lower case & remove accentuated chars:
import unicodedata
s = ''.join(c for c in unicodedata.normalize('NFD', s.lower())
            if unicodedata.category(c) != 'Mn')
Extract only word tokens of at least 2 chars
Using NLTK tokenizers & stemmers
Using a simple regexp:
re.compile(r"\b\w\w+\b", re.U).findall(s)
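Both cleanup steps can be folded into one helper (a sketch; the `clean_tokens` name is made up):

```python
import re
import unicodedata

WORD_RE = re.compile(r"\b\w\w+\b", re.U)

def clean_tokens(s):
    # lowercase, then drop the combining marks left by NFD decomposition
    s = ''.join(c for c in unicodedata.normalize('NFD', s.lower())
                if unicodedata.category(c) != 'Mn')
    # keep only word tokens of at least 2 chars
    return WORD_RE.findall(s)

clean_tokens(u"J'ai mang\xe9 du kangourou ce midi")
# ['ai', 'mange', 'du', 'kangourou', 'ce', 'midi']
```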
Feature Extraction with NLTK
Unigram features
def word_features(words):
    return dict((word, True) for word in words)
Feature Extraction with NLTK
Bigram Collocations
from nltk.collocations import BigramCollocationFinder
from nltk.metrics import BigramAssocMeasures as BAM
from itertools import chain
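These imports are typically combined into a feature extractor that keeps the best-scoring bigrams next to the unigrams (the `bigram_features` helper, the `chi_sq` measure, and the cutoff of 200 are illustrative choices, not prescribed by the slide):

```python
from itertools import chain

from nltk.collocations import BigramCollocationFinder
from nltk.metrics import BigramAssocMeasures as BAM

def bigram_features(words, score_fn=BAM.chi_sq, n=200):
    # score every bigram in the document and keep the n best
    finder = BigramCollocationFinder.from_words(words)
    top_bigrams = finder.nbest(score_fn, n)
    # unigrams and the selected bigram tuples together form the feature dict
    return dict((ngram, True) for ngram in chain(words, top_bigrams))

feats = bigram_features(['the', 'quick', 'brown', 'fox'])
# feats has 'quick', 'fox', ..., plus ('quick', 'brown'), ('brown', 'fox'), ...
```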
from nltk.classify import NaiveBayesClassifier
classifier = NaiveBayesClassifier.train(train_set)
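train_set pairs a feature dict with its label. A toy end-to-end run (made-up mini-reviews standing in for the movie_reviews corpus):

```python
from nltk.classify import NaiveBayesClassifier

def word_features(words):
    return dict((word, True) for word in words)

# toy stand-in for the (review words, label) pairs of a labeled corpus
labeled_reviews = [
    (['magnificent', 'outstanding', 'film'], 'pos'),
    (['astounding', 'fascination'], 'pos'),
    (['ludicrous', 'insulting', 'plot'], 'neg'),
    (['idiotic', 'uninvolving', 'mess'], 'neg'),
]
train_set = [(word_features(words), label)
             for words, label in labeled_reviews]
classifier = NaiveBayesClassifier.train(train_set)
classifier.classify(word_features(['outstanding', 'film']))  # 'pos'
```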
Most informative features
>>> classifier.show_most_informative_features()
magnificent = True pos : neg = 15.0 : 1.0
outstanding = True pos : neg = 13.6 : 1.0
insulting = True neg : pos = 13.0 : 1.0
vulnerable = True pos : neg = 12.3 : 1.0
ludicrous = True neg : pos = 11.8 : 1.0
avoids = True pos : neg = 11.7 : 1.0
uninvolving = True neg : pos = 11.7 : 1.0
astounding = True pos : neg = 10.3 : 1.0
fascination = True pos : neg = 10.3 : 1.0
idiotic = True neg : pos = 9.8 : 1.0
Training NLTK classifiers
Try nltk-trainer
Google Prediction API
{
    "probability": {
        "neg": 0.36647424288117808,
        "pos": 0.63352575711882186
    },
    "label": "pos"
}
Typical performance results:
movie reviews
nltk:
unigram occurrences
Naïve Bayesian Classifier ~ 70%
Google Prediction API ~ 83%
scikit-learn:
TF-IDF unigram features
LinearSVC ~ 87%
nltk:
Collocation feature selection
Naïve Bayesian Classifier ~ 97%
Typical results:
newsgroups topics classification
20 newsgroups dataset
~ 19K short text documents
20 categories
By date train / test split
Typical results:
Language Identification
15 Wikipedia articles
[p.text_content() for p in html_tree.findall('//p')]
CharNGramAnalyzer(min_n=1, max_n=3)
TF-IDF
LinearSVC
Scaling to many possible outcomes
Example: possible outcomes are all the
categories of Wikipedia (565,108)
From Document Categorization
to Information Retrieval
Fulltext index for TF-IDF similarity queries
Smart way to find the top 30 search keywords
Use Apache Lucene / Solr MoreLikeThisQuery
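The underlying idea, ranking indexed documents by TF-IDF cosine similarity to a query document, fits in a few lines of Python (helper names invented for this sketch; Lucene's MoreLikeThis actually extracts the top terms and rewrites them as a boolean query):

```python
import math

def tfidf_vectors(docs):
    # docs: list of token lists -> one {term: tf-idf weight} dict per doc
    n_docs = len(docs)
    df = {}
    for doc in docs:
        for term in set(doc):
            df[term] = df.get(term, 0) + 1
    vectors = []
    for doc in docs:
        vec = {}
        for term in set(doc):
            tf = doc.count(term) / float(len(doc))
            vec[term] = tf * math.log(float(n_docs) / df[term])
        vectors.append(vec)
    return vectors

def cosine(a, b):
    # cosine similarity between two sparse {term: weight} vectors
    dot = sum(w * b.get(term, 0.0) for term, w in a.items())
    norm = lambda v: math.sqrt(sum(w * w for w in v.values())) or 1.0
    return dot / (norm(a) * norm(b))

docs = [
    "quick brown fox".split(),
    "lazy brown fox".split(),
    "cat sat mat".split(),
]
vectors = tfidf_vectors(docs)
# docs[0] is closer to docs[1] (shared 'brown fox') than to docs[2]
```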
Some pointers
http://scikit-learn.sf.net doc & examples
http://github.com/scikit-learn code
http://www.nltk.org code & doc & PDF book
http://streamhacker.com/
Jacob Perkins' blog on NLTK & APIs
https://github.com/japerk/nltk-trainer
http://www.slideshare.net/ogrisel these slides
http://twitter.com/ogrisel / http://github.com/ogrisel
Questions?