MILL Identifies Languages (Logically):
Language Identification with Trigram, Letter, and Word Frequency

Dan Stuart, 12/15/2010
Abstract

For my final project in SI 650: Information Retrieval, I built a Python script and API package that implements a language identifier using several data types and classification methods. Because no good open-source project can do without a good recursive acronym, the project is called MILL (MILL Identifies Languages (Logically)). The goal of MILL is to read in a document and determine what language it is in, using one or more of three data types and two classification schemes.
Data Types

By default, MILL looks at three data types: the whole-word frequency, letter frequency, and trigram frequency of the input documents. The first two are self-explanatory; the third is the frequency of consecutive groups of three letters that occur in the document, excluding most punctuation and ignoring the case of the letters. So, for example, the trigrams of the first sentence of this paragraph are 'by_', 'y_d', '_de', 'def', etc., where the underscore stands for a single space. All three of these data types are constructed as normalized vectors, with the magnitude in each dimension representing the (normalized) frequency of the particular feature of the document.

These are simply the default data types, and while they are generally hard-coded in parts of the API further upstream, the letter- and trigram-frequency vectors are implemented as special cases of the Ngram-vector class, so it would be straightforward to extend the classifier to look at digrams, quadgrams, etc. However, the sheer number of possible quadgrams (approximately 600,000) makes the computational trade-off of using N > 3 not worth the possible gains in accuracy.
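MILL's own Ngram-vector class is not reproduced here, but a minimal sketch of how such a normalized n-gram vector might be built follows; unit-length normalization is an assumption, and ngram_vector is a hypothetical name rather than MILL's API. With n=1 the same function yields the letter-frequency vector.

    from collections import Counter
    from math import sqrt

    def ngram_vector(text, n=3):
        # Build a normalized n-gram frequency vector from pre-processed text.
        # Spaces are written as '_' so the printed n-grams match the examples above.
        text = text.replace(' ', '_')
        counts = Counter(text[i:i + n] for i in range(len(text) - n + 1))
        length = sqrt(sum(c * c for c in counts.values())) or 1.0  # Euclidean norm
        return {gram: c / length for gram, c in counts.items()}

    # The first few trigrams of "by default mill looks at three data types":
    vec = ngram_vector('by default mill looks at three data types')
    print(sorted(vec)[:4])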
The Corpus

For training and testing purposes, I used the Wikipedia XML corpus¹, which is a large corpus of XML documents from several different Wikipedias. The available corpora are in Dutch, English, French, German, Spanish, Arabic, Chinese, and Japanese; because the last three are not romanized, I used only the first five, but MILL will accept any number of languages for its training sets.

Each language corpus contains between 70,000 and 100,000 XML documents, so there were more than enough documents to build disjoint training and testing sets. My testing set contained 100 documents from each language, and I trained the classifiers on sets of 25, 50, 75, 100, 150, and 200 documents per language. The training sets themselves were not disjoint; the documents in each corpus were sorted ASCII-betically by default, and to build the training sets I pulled documents from the top of the list, while testing documents came from the bottom of the list.

To pre-process the corpus before training, I took only the body text (between the <body> tags), which was the actual text of each article, and stripped out the XML tags along with any weird formatting characters. Due to technical limitations, this included special accented characters (such as é) as well, but surprisingly this turned out not to be a crippling problem. This processing leaves only the characters a-z, the apostrophe, and the space in the document, a total of 28 characters.
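A minimal sketch of this kind of pre-processing step, assuming the body text has already been pulled out of the XML; preprocess is a hypothetical name, not MILL's API:

    import re

    def preprocess(body_text):
        # Reduce article text to the 28-character alphabet: a-z, apostrophe, space.
        text = body_text.lower()
        text = re.sub(r'<[^>]+>', ' ', text)      # drop any remaining XML tags
        text = re.sub(r"[^a-z' ]+", ' ', text)    # drop digits, punctuation, accented characters
        return re.sub(r'\s+', ' ', text).strip()  # collapse runs of whitespace

    print(preprocess("<p>The caf\u00e9's menu, \u00e9tage 2!</p>"))
    # -> "the caf 's menu tage"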
Classification Methods

The MILL classifiers can use two classification methods: cosine distance and perceptron. The cosine-distance classifier finds the centroid of the appropriate data vectors for the given corpus; then, when given a document, it returns the cosine of the angle between the centroid and the document's data vector. The cosine of the angle is an intuitive statistic to use for this purpose: it takes values from 0 (for perpendicular vectors) to 1 (for parallel vectors), and because the vector components are all non-negative, the cosine can never be negative. See Figure 1.

Figure 1: Cosine Classification
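A minimal sketch of this centroid-plus-cosine scheme, assuming sparse dict feature vectors as in the earlier sketch; the function names are illustrative, not MILL's API:

    from math import sqrt

    def centroid(vectors):
        # Component-wise average of a list of sparse feature vectors (dicts).
        totals = {}
        for vec in vectors:
            for feature, value in vec.items():
                totals[feature] = totals.get(feature, 0.0) + value
        return {feature: value / len(vectors) for feature, value in totals.items()}

    def cosine(u, v):
        # Cosine of the angle between two sparse vectors.
        dot = sum(value * v.get(feature, 0.0) for feature, value in u.items())
        norm = sqrt(sum(x * x for x in u.values())) * sqrt(sum(x * x for x in v.values()))
        return dot / norm if norm else 0.0

    def cosine_classify(doc_vector, centroids):
        # Pick the language whose centroid makes the smallest angle with the document.
        return max(centroids, key=lambda lang: cosine(doc_vector, centroids[lang]))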
The perceptron classifier is generally more powerful, but more computationally demanding. This classifier attempts to separate documents into "yes" and "no" classes for each language it is given. See Figure 2.

Figure 2: Perceptron Classification
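A bare-bones one-vs-rest perceptron of the kind described here might look like the following sketch; it is illustrative only and omits MILL's feature selection and any refinements.

    def train_perceptron(examples, features, max_steps=100):
        # Train one "yes"/"no" perceptron for a single language.
        # `examples` is a list of (feature_vector, label) pairs, with label +1 for
        # documents in the language and -1 otherwise; `features` is the list of
        # selected features used as dimensions.
        weights = {f: 0.0 for f in features}
        bias = 0.0
        for _ in range(max_steps):
            errors = 0
            for vec, label in examples:
                score = bias + sum(weights[f] * vec.get(f, 0.0) for f in features)
                if label * score <= 0:            # misclassified: nudge the weights
                    for f in features:
                        weights[f] += label * vec.get(f, 0.0)
                    bias += label
                    errors += 1
            if errors == 0:                       # separated the training set
                break
        return weights, bias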
Significant Features

MILL uses χ² feature selection before training the perceptron classifier in order to determine which features are the most significant to classify on. Significant features are those which are significantly more frequent in one language than in the others, or more frequent in all the other languages than in the one we are interested in classifying. Because languages tend to have different sets of frequent features, in practice the significant features are those which are frequent in the classified language. Because there are only 28 character features, feature selection was not necessary for the letters, but looking at the significant trigrams and words can be illuminating.
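The selection step itself can be sketched with the standard χ² score over document counts; this is not MILL's code, and the name and arguments are illustrative:

    def chi_squared(n11, n10, n01, n00):
        # Chi-squared score for one (feature, language) pair, from document counts:
        #   n11: in-language documents containing the feature
        #   n10: other-language documents containing the feature
        #   n01: in-language documents lacking the feature
        #   n00: other-language documents lacking the feature
        n = n11 + n10 + n01 + n00
        denom = (n11 + n01) * (n11 + n10) * (n10 + n00) * (n01 + n00)
        return n * (n11 * n00 - n10 * n01) ** 2 / denom if denom else 0.0

    # Score every candidate feature against each language and keep the top k
    # (e.g. 20, 50, or 100) highest-scoring features per language.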
The five most significant trigrams and words for each language are shown in Figure 3.

Figure 3: Significant Features

Language   Trigrams                             Words
Dutch      '_ee', 'voo', 'zij', 'ijn', 'aar'    'een', 'het', 'van', 'zijn', 'voor'
English    'ly_', 'ted', 'by_', '_by', 'f_t'    'and', 'to', 'the', 'by', 'with'
French     "_l'", 'pou', "_d'", 'du_', 'ait'    'dans', 'le', 'et', 'du', 'il'
German     '_ei', 'von', 'zu_', 'eic', 'ung'    'der', 'und', 'von', 'eine', 'ist'
Spanish    'do_', 'ado', 'ego', '_ms', 'gor'    'del', 'el', 'ms', 'una', 'con'

The significant trigram features seem to be exploiting the grammatical properties of each language, while in general the significant word features are exactly the "stop words" of each language. Interestingly, this might make a word-frequency classifier difficult to implement on top of a search engine's index, because these are the very words that are ignored by any good indexer.

Results

Surprisingly, the cosine classifiers in general had much better performance than the perceptron classifiers, despite their lower computational cost. For all sizes of training sets, the trigram/cosine classifier exceeded 99% correct classification, as did the word/cosine classifier for all but the two smallest sets. The letter/cosine classifier has accuracy in the 92-94% range, which, while not spectacular, is in general better than most of the perceptron classifiers. See Figure 4.

Figure 4: Cosine Matching Results

Training documents per language   Trigram   Letter   Word
25                                99.6 %    92 %     97.8 %
50                                99.6 %    92.6 %   98.8 %
75                                99.6 %    93.2 %   99.8 %
100                               99.6 %    93.2 %   99.8 %
150                               99.4 %    93.4 %   99.6 %
200                               99.6 %    94 %     99.8 %

The performance of the perceptron classifiers was for the most part decidedly worse. The trigram/perceptron classifier achieved only 60-90% accuracy across all sizes of training sets, all numbers of features (I tried 20, 50, and 100 features), and all maximum numbers of training steps (1000, 2500, and 5000). The letter/perceptron classifier did better, with 80-90% accuracy, while the word/perceptron classifier achieved 95-99% accuracy, though computing χ² for the word features constituted a significant portion of training time.

It is worth wondering why the cosine classifiers did so much better, in every case, than the perceptron classifiers. All of the documents used in this project are Wikipedia articles, so they are most likely from a few hundred to a few thousand words long; the data-vector centroids from any set of such long documents are likely to converge to much the same values, so the large number of words per document is probably what made the cosine method so much more accurate. I did not test the classifier on shorter documents, such as queries, but it is likely that for these types of documents the cosine classifiers would be much less accurate. The word classifiers would probably remain useful, but such documents would likely be much harder to classify.
Democratic Classification

Finally, I also implemented and briefly tested a system in which the six classifiers produce a single result by "voting" for their respective results, with the votes weighted by the classifiers' accuracy on the first testing set. I built a second, disjoint testing set and ran two sets of classifiers on it. The classifiers trained with 200 documents per class, 20 features, and 100 maximum steps had accuracies/weights as shown in Figure 5. This set of classifiers achieved 100% accuracy on the second testing set using democratic classification. While this is impressive, another set of classifiers with significantly better scores achieved 99.8% accuracy, which is the same as the best classifier in the set. While it bears further investigation, it seems likely that the democratic classifier is a good way to compile the results of multiple classifiers, and it would probably give equal or better results than any individual classifier.

Figure 5: Democratic Classification Weights

Classifier           Weight
Cosine/Trigram       99.6 %
Cosine/Letter        94 %
Cosine/Word          99.8 %
Perceptron/Trigram   69 %
Perceptron/Letter    88 %
Perceptron/Word      94.6 %
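A minimal sketch of this kind of accuracy-weighted voting; democratic_classify and its arguments are illustrative, not MILL's API:

    def democratic_classify(doc_vector, classifiers, weights):
        # Combine several classifiers by accuracy-weighted voting.
        # `classifiers` maps a name such as 'cosine/trigram' to a function that
        # returns a language for a document vector; `weights` maps the same names
        # to accuracies measured on the first testing set (e.g. 0.996 for
        # cosine/trigram, 0.69 for perceptron/trigram, as in Figure 5).
        votes = {}
        for name, classify in classifiers.items():
            language = classify(doc_vector)
            votes[language] = votes.get(language, 0.0) + weights[name]
        return max(votes, key=votes.get)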


Conclusion

The cosine classification method worked surprisingly well, even for a small number of training documents. Classifying with the trigram feature is slightly less accurate than with word frequencies, but much less computationally expensive. Democratic classification may lead to slight improvements in accuracy, but must be weighed against the additional secondary training time. MILL v0.5 is available from:

http://www-personal.umich.edu/~dstu/MILL.html