Abstract
This paper examines a method for matching texts (in a broad sense) according to their stylistic similarity, using standard clustering techniques, and considers for that purpose various means of quantifying the dissimilarity between texts. A comparison of those criteria, derived from well-known measures such as weighted Euclidean distances or the Bray-Curtis dissimilarity, is then carried out, and an analysis of the results the aforementioned method yields is attempted.
Acknowledgements
I would like to thank Prof. S. Y. Kung, without whom my notions of machine learning would have been far blurrier, and express my gratitude to my adviser, David Mimno, without whose help, advice and suggestions this paper would have been nothing but “a tale full of sound and fury, signifying nothing”. Yet much faster to write.
† This work was conducted at Princeton University, both as part of a Senior Independent Work (COS498) and within the context of the course Digital Neurocomputing (ELE571).
Contents
1 Introduction
1.1 Overview
2 Applications
3 Approach
3.1 Summary: protocol
3.2 Clustering algorithm: K-Means
3.3 Features
3.4 Corpus
3.5 Presentation of the metrics
3.6 How to compare clusterings
3.6.1 Variation of information
3.6.2 Similarity matrices
3.7 Classification of new samples
4 Results
4.1 Clustering comparison: Variation of information
4.2 Text matchings
5 Future work
6 Conclusion
7 Appendix
7.1 Complete list of features
7.2 Metrics summary
7.2.1 Description
7.2.2 Converting similarity measures in pseudo-semi-distances
7.2.3 About the Extended Jaccard index
1 Introduction
1.1 Overview
This project aims at finding a way, given a set of texts (that is, novels, short stories, plays, and articles), to detect intertextuality (or, in a broader sense, similarities) between them, using machine learning techniques. Before going further, let us state precisely which meaning of “similarity” is considered here. Two texts can indeed be related in many ways:
- Common (thematic) structure (narrative patterns, such as in tales or myths...)
- Variations upon a plot: for example, Antigone (by Sophocles, Anouilh, Brecht, Cocteau...)
- Concepts (when two articles or books share so many concepts that they should be related): e.g., philosophy books about ethics
- Style: either between books by the same author 1 or in the case of pastiches (texts written à la manière de).
- Shared vocabulary/chunks of sentences: for example, in the case of plagiarism, a lot of words or collocations are common to both texts
Within the framework of this project, we shall focus on style, that is, the specificities in grammar, syntax, punctuation, rhythm of sentences, use of vocabulary... constituting the “signature” of an author, and making their work recognizable.
2 Applications
We develop here an automated method to find out which books are similar from a stylistic point of view; it will not (and, for that matter, is not meant to) replace human judgment in such comparisons, but will hopefully be an auxiliary and complementary tool for that purpose, allowing one to “prune” the possibilities before a human eye investigates further, focusing on the most likely matches.
Besides, this technique could also be applied in two other areas: the first one is plagiarism detection, where a positive match between two texts’ styles would point to possible copying, and be a preliminary test before further, in-depth, inquiries. The second would be authorship controversies, where, given a work of unknown authorship, one tries to find out who is the likeliest author.
3 Approach
3.1 Summary: protocol
As outlined above, this work focuses on clustering texts (in a broad sense) according to their style. The first task is thus to pick a set of features, extracted from each text, which could accurately characterize it in that respect.
Before that, let us specify that by text is meant any written English work, whether in verse or prose, but not scientific formulas nor source code. In other terms, the work must be either literary or, at least, proper written English (so that any consideration of written style can actually apply).
1 Even if it is not explicit, as in the case of authors using pseudonyms, like Romain Gary/Émile Ajar.
Protocol: Here is a summary of the different steps involved in this approach, and a brief description of each of them:
(i) Downloading all the Project Gutenberg (PG)2 texts, and preprocessing them by removing the extra PG header and footer (added by the PG volunteers in every work they digitize), and discarding books whose length was less than 16750 bytes (this threshold, arbitrarily chosen, corresponds to works whose length does not exceed approximately 4-6 pages; indeed, the risk was that with such short texts, the residual “noise” due to remaining tags or words added by the PG might have skewed the results). A sketch of this preprocessing step is given after this list.
(ii) Processing them, extracting the features (cf. Section 3.3 below) from each
text (using, amongst others, parser and named entity recognizer (NER)
from the Stanford NLP Tools)3 .
(iii) Shuffling the corpus to avoid (when sampling the corpus for a training set)
bias due to consecutive, too similar, texts (e.g., if works were grouped by
author in the original PG directories)
(iv) Running with different metrics (see Section 3.5 for a description), on a training set of 500 samples4, with the same algorithm (K-Means, cf. Section 3.2) and parameters (max. 100 iterations per run, Kmax = 99, random initialization)
(v) Analyzing the clusterings obtained with K-Means, comparing the rele-
vance of each metric, and finding a way to combine the most relevant
criteria.
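To make step (i) concrete, here is a minimal sketch of such a preprocessing pass. It is purely illustrative: it assumes the usual “*** START/END OF TH(E|IS) PROJECT GUTENBERG ...” markers found in most PG files, and reuses the 16750-byte threshold mentioned above; the actual Python scripts of the study are not reproduced here.

```python
import re

# Assumed markers: most PG files delimit the actual text this way.
START = re.compile(r"\*\*\*\s*START OF TH(E|IS) PROJECT GUTENBERG", re.IGNORECASE)
END = re.compile(r"\*\*\*\s*END OF TH(E|IS) PROJECT GUTENBERG", re.IGNORECASE)

def strip_pg_boilerplate(raw):
    """Keep only the text between the PG start/end markers, if present."""
    lines = raw.splitlines()
    start = next((i + 1 for i, line in enumerate(lines) if START.search(line)), 0)
    end = next((i for i, line in enumerate(lines) if END.search(line)), len(lines))
    return "\n".join(lines[start:end])

def preprocess(raw, min_bytes=16750):
    """Return the stripped text, or None if it falls below the length threshold."""
    body = strip_pg_boilerplate(raw)
    return body if len(body.encode("utf-8")) >= min_bytes else None
```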
2 Project Gutenberg is a volunteer project whose goal is to make freely available digitized texts.
3 The Stanford NLP Group, a team of “research scientists, postdocs, programmers and students who work together on algorithms that allow computers to process and understand human languages”: http://nlp.stanford.edu
4 Mainly because of time constraints: using more samples was discarded as it would have required too much computation time.
3.2 Clustering algorithm: K-Means
Given a maximum number of iterations Imax, K-Means, at each iteration, assigns each vector x to the cluster whose centroid is the closest; at the end of each iteration, each centroid µk is updated to be the barycenter of all the vectors contained in the kth cluster [2].
The initial centroid locations are generally picked randomly; K-Means, when it converges within the maximum number of iterations allowed, outputs a clustering whose value E(X) is a local minimum of E; note that this local minimum depends on the initialization (two different initial configurations of the centroids may result in different clusterings, both local minima of the cost function). Furthermore, K-Means, like all clustering algorithms, is heavily dependent on the metric used, a fact this study is based upon.
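As an illustration of this dependence, here is a minimal sketch of K-Means with a pluggable dissimilarity measure (a toy version; the clustering code used in this study was written in C++ and is not reproduced here). The Canberra measure shown at the end is one of those listed in Section 3.5.

```python
import numpy as np

def kmeans(X, k, dissimilarity, max_iter=100, seed=None):
    """Toy K-Means: X is an (n, m) array, dissimilarity(u, v) -> float.
    Centroids are initialized randomly and updated as barycenters."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    centroids = X[rng.choice(n, size=k, replace=False)].astype(float)
    labels = np.full(n, -1)
    for _ in range(max_iter):
        # Assignment step: each sample goes to the closest centroid.
        new_labels = np.array(
            [np.argmin([dissimilarity(x, mu) for mu in centroids]) for x in X]
        )
        if np.array_equal(new_labels, labels):
            break  # converged: no sample changed cluster
        labels = new_labels
        # Update step: each non-empty cluster's centroid becomes its barycenter.
        for j in range(k):
            if np.any(labels == j):
                centroids[j] = X[labels == j].mean(axis=0)
    return labels, centroids

def canberra(u, v):
    denom = np.abs(u) + np.abs(v)
    mask = denom > 0  # skip coordinates where both entries are zero
    return np.sum(np.abs(u - v)[mask] / denom[mask])

labels, centroids = kmeans(np.random.rand(60, 8), k=5, dissimilarity=canberra, seed=0)
```

With this structure, swapping `canberra` for any of the other pseudo-semi-distances of Appendix 7.2 changes the resulting clustering, which is precisely the effect studied here.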
3.3 Features
From each text (sample), a set of characteristics was extracted, mapping each of them to a vector in RM. In order to capture as much stylistic information as possible, several aspects were considered (see Appendix 7.1 for an exhaustive list and description of the features):
word statistics (ws): for each word from a reference dictionary, indicators are computed to quantify the density of this word in the text, both relative to the text and relative to what would be expected from its usual frequency.5
named entities (ne): the overall frequency of named entities, such as names,
cities, institutions (3 values)
sentence splitting (ss): statistics about the length of sentences (minimum,
maximum, average and standard deviation) (4 values)
chapter splitting (cs): statistics about the length of chapters (4 values)
punctuation (pn): frequency of each punctuation sign, and overall punctua-
tion frequency compared to the text character count. . . (17 values)
lexical categories (lc): frequency of each category (adjectives, nouns, verbs,
adverbs. . . ) in the text, compared to the number of words (7 values)
All these values have then been normalized, to prevent any scale effect (due to different, non-comparable ranges of values for two different features):
\[
v_i \leftarrow \frac{v_i - \bar{v}_i}{\sigma(v_i)}
\]
where $\sigma(v_i)$ and $\bar{v}_i$ are respectively the empirical standard deviation and mean of the ith feature, over all samples.
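For illustration only, here is a sketch of two of the feature families above (sentence-length statistics and punctuation frequencies) followed by the column-wise z-scoring just described; the real extraction relied on the Stanford NLP tools and a much larger feature set.

```python
import re
import string
import numpy as np

def extract_features(text):
    """Toy features: sentence-length statistics and punctuation frequencies."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    lengths = np.array([len(s.split()) for s in sentences], dtype=float)
    if lengths.size == 0:
        lengths = np.zeros(1)
    n_chars = max(len(text), 1)
    punct = [text.count(p) / n_chars for p in string.punctuation]
    return np.array([lengths.min(), lengths.max(), lengths.mean(), lengths.std()] + punct)

def normalize(V):
    """Column-wise z-scoring: v_i <- (v_i - mean_i) / std_i over all samples."""
    mu, sigma = V.mean(axis=0), V.std(axis=0)
    sigma[sigma == 0] = 1.0  # leave constant features unscaled
    return (V - mu) / sigma

texts = ["A short sample. Another sentence, with a comma!",
         "Yet another sample text? Indeed. Three sentences, here."]
V = normalize(np.vstack([extract_features(t) for t in texts]))
```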
These features altogether form, for each sample/text, a collection of real values, which can be seen as a vector u ∈ RM (where M is the number of features, that is, approximately 140,000, mostly word features). That way, each sample, in itself non-vectorial, is represented by a point in a vector space, to which standard mathematical operations and machine learning techniques, such as metrics, distances, and clustering algorithms, can be applied. An example of the values forming such vectors is visible in Fig. 1, where the first 40 components of 3 samples are displayed as a histogram, and in Table 1, where the first 15 components of these 3 samples are listed.
5 As explained in the appendix, however, only one indicator per word has been considered
in this study.
Table 1: First 15 components of the feature vectors of 3 samples.
x1           x2           x3
-0.789898    -0.405664    0.0595365
-1.00907     1.60941      -0.811952
-0.994053    -0.373620    -0.558598
-0.374360    -1.18555     -0.671482
-0.0447545   -0.0447481   -0.0447520
-0.814749    -0.628798    -0.158060
-0.502863    -0.512023    -0.411324
-1.11112     0.419349     0.419349
-0.795607    -0.837574    -0.168150
4.10806      -0.187883    -0.342737
2.19531      0.418397     -0.596197
-0.580942    -0.713383    1.41148
1.32911      0.207751     0.210466
-2.86644     -0.792608    -1.07873
-1.82875     0.190791     -0.337891
...          ...          ...
3.4 Corpus
As mentioned before, the corpus used in this study is sampled from Project Gutenberg, and consists of books, short stories, periodicals, poetry and plays of all eras and genres. After preprocessing, and discarding the samples whose length did not exceed the threshold, 14,304 texts remained9. From the whole pool of texts, 500 were randomly picked to constitute a training set, each sample being attributed the same probability of being chosen.
6 The files can be retrieved at http://ngrams.googlelabs.com/datasets
7 The merging consisted in summing the number of occurrences of each of the terms; however, as for the number of pages and volumes in which each term appears, adding the numbers was problematic. The choice made was to take the maximum of the different values, for these two indicators.
8 Amongst the “pruned” words, 132,371 occurred in less than 0.2% of the books, 655 in more than 80.0%, 14,034 had a possessive mark, and 3,384 contained invalid characters such as sharp or semicolon.
Figure 1: Feature values for 3 samples (only the first 40 features are displayed).
3.5 Presentation of the metrics
A distance (or metric) on a set X is a function d : X × X → R satisfying, for all x, y, z ∈ X:
\[
\begin{aligned}
&d(x, x) = 0 &&\text{(reflexivity)}\\
&d(x, y) \ge 0 &&\text{(non-negativity)}\\
&d(x, y) = 0 \Leftrightarrow x = y &&\text{(identity of indiscernibles)}\\
&d(x, y) = d(y, x) &&\text{(symmetry)}\\
&d(x, z) \le d(x, y) + d(y, z) &&\text{(triangular inequality)}
\end{aligned}
\]
9 which the texts were written. However, given the proportion of English literature amongst the PG texts, the assumption was that almost all the texts considered were written in English, an assumption confirmed by the actual training set, which proved to be formed of English texts only.
10 A similarity measure is a function s : X × X → R accounting for the “resemblance” between two elements: the greater s(x, y) is, the more x and y are considered alike.
In the rest of this paper, we will loosely refer to these metrics as pseudo-semi-distances, pseudo-semi-metrics, dissimilarity measures or even, simply, metrics.
Here is a list of the different measures the K-Means algorithm was run with:
Canberra distance first introduced (and then slightly modified) by Lance & Williams ([3], [4]), this is a metric well-suited to values centered on the origin: indeed, in the form due to Adkins, described in [4], the result “becomes unity when the variables are of opposite sign.”12 It is thus a good indicator when one is interested in dividing data according to a threshold value.
11 consequently given a smaller weight, since the chapter splitting tended to be somehow... fuzzy in some texts of PG.
12 Cf. Jan Schulz, http://www.code10.info/index.php?option=com_content&view=article&id=49:article_canberra-distance&catid=38:cat_coding_algorithms_data-similarity&Itemid=57
Cosine dissimilarity adapted from the Cosine similarity measure, which quantifies the “angle” between two vectors in RM (the smaller the angle is, the more the two vectors are alike).
For a more in-depth description of those measures, and of how similarity measures were converted into pseudo-semi-distances, refer to Appendix 7.2. Note that other measures could have been used, such as Mahalanobis distances, the Pearson correlation measure, or the Dice similarity coefficient. The two former were discarded as too computationally intensive, while the latter is equivalent to the extended Jaccard coefficient13.
3.6 How to compare clusterings
3.6.1 Variation of information
The variation of information between two clusterings C and C' (with K and K' clusters respectively, over the same n samples) is
\[
VI(C, C') = H(C) + H(C') - 2\, I(C, C')
\]
where the entropy H, the mutual information I and the cluster probabilities P, P' are given by
\[
H(C) = -\sum_{k=1}^{K} P(k) \log P(k) \;\ge\; 0, \qquad
I(C, C') = \sum_{k=1}^{K} \sum_{\ell=1}^{K'} P(k, \ell) \log \frac{P(k, \ell)}{P(k)\,P'(\ell)} \;\ge\; 0
\]
and
\[
P(k, \ell) = \frac{|C_k \cap C'_\ell|}{n}, \qquad
P(k) = \frac{|C_k|}{n}, \qquad
P'(\ell) = \frac{|C'_\ell|}{n}
\]
In this case, this criterion was applied to the clusterings obtained by running K-Means on the training data set, with the 29 different metrics (see Section 4.1 for the results and their analysis).
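A compact sketch of this criterion, computing H, I and the resulting variation of information from two hard clusterings given as label vectors (illustrative; the actual analysis of the study was done in Matlab):

```python
import numpy as np

def variation_of_information(labels_a, labels_b):
    """VI(C, C') = H(C) + H(C') - 2 I(C, C') for two hard clusterings."""
    labels_a, labels_b = np.asarray(labels_a), np.asarray(labels_b)
    n = len(labels_a)
    ka, kb = np.unique(labels_a), np.unique(labels_b)
    # Joint distribution P(k, l) = |C_k ∩ C'_l| / n
    P = np.array([[np.sum((labels_a == a) & (labels_b == b)) for b in kb]
                  for a in ka], dtype=float) / n
    Pa, Pb = P.sum(axis=1), P.sum(axis=0)  # marginals P(k) and P'(l)

    def entropy(p):
        p = p[p > 0]
        return -np.sum(p * np.log(p))

    nz = P > 0
    mutual_info = np.sum(P[nz] * np.log(P[nz] / np.outer(Pa, Pb)[nz]))
    return entropy(Pa) + entropy(Pb) - 2 * mutual_info

# Two clusterings of the same 6 samples:
print(variation_of_information([0, 0, 1, 1, 2, 2], [0, 0, 0, 1, 1, 1]))
```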
3.6.2 Similarity matrices
Each clustering can also be represented by a membership matrix C = (ci,j), where ci,j = 1 if the jth sample belongs to the ith cluster14.
From this matrix can be derived, for each clustering, a 500 × 500 symmetric similarity matrix S = (sij), with
\[
s_{ij} =
\begin{cases}
1 & \text{if the } i^{\text{th}} \text{ and } j^{\text{th}} \text{ samples belong to the same cluster}\\
0 & \text{otherwise}
\end{cases}
\]
From these 29 similarity matrices could then be computed the average similarity matrix S̄ and the standard deviation similarity matrix Σ, that is, respectively, the matrices of the mean and standard deviation of the pairwise similarities. As we will see in Section 4.2, these two matrices constitute a good starting point to find the likeliest matches between texts; indeed, two samples with a very high average similarity and a very low standard deviation are, with high probability, stylistically related in some way.
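A sketch of how S̄ and Σ can be obtained from the label vectors produced by the different runs, together with the kind of thresholding used later in Section 4.2 (the thresholds below are fixed toy values; the study derived its own from S̄ and Σ):

```python
import numpy as np

def comembership(labels):
    """S_ij = 1 if samples i and j fall in the same cluster, else 0."""
    labels = np.asarray(labels)
    return (labels[:, None] == labels[None, :]).astype(float)

# One label vector per metric (toy clusterings of 5 samples):
clusterings = [[0, 0, 1, 1, 2],
               [0, 0, 1, 2, 2],
               [1, 1, 0, 0, 2]]
S = np.stack([comembership(c) for c in clusterings])
S_bar, Sigma = S.mean(axis=0), S.std(axis=0)  # mean and std of pairwise similarities

# "Most obvious" matches: mean similarity close to 1, standard deviation close to 0.
eps, eps_prime = 0.1, 0.1
n = S_bar.shape[0]
matches = [(i, j) for i in range(n) for j in range(n)
           if i != j and S_bar[i, j] > 1 - eps and Sigma[i, j] < eps_prime]
# -> [(0, 1), (1, 0)]: samples 0 and 1 are clustered together by every metric.
```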
3.7 Classification of new samples
Once a clustering has been obtained, a new sample x can be assigned to the nearest non-empty cluster:
\[
k^{\ast} = \operatorname*{argmin}_{k\,:\,C_k \neq \emptyset} d(x, \mu_k)
\]
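A small sketch of this assignment rule, reusing the (hypothetical) centroids, labels and dissimilarity function from the K-Means sketch of Section 3.2:

```python
import numpy as np

def classify(x, centroids, labels, dissimilarity):
    """Assign a new sample x to the nearest non-empty cluster."""
    non_empty = [k for k in range(len(centroids)) if np.any(labels == k)]
    return min(non_empty, key=lambda k: dissimilarity(x, centroids[k]))
```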
4 Results
Convergence of the clusterings Interestingly enough, when running K-Means,
the required time was hugely dependent on the dissimilarity measures consid-
ered. While most of the measures ensured a rather quick convergence of the
algorithm (between 8 and 29 iterations for the weighted Euclidean metrics,
between 21 and 32 for the Bray-Curtis ones, between 31 and 40 for the five
Canberra (one of the slowest to converge), 5 or 6 iterations for the Cosine and
extended Jaccard, and 10 for Chi-square, the maximum authorized number being 100), some other metrics failed to converge in the allowed time. In particular, the weighted Manhattan metrics either converged very fast (in 4, 4 and 2 iterations, respectively, for the first, second and fourth sets of weights) or failed to converge at all, resulting in trivial (only one non-empty cluster) or almost-trivial clusterings, that is, a cluster of 499 points and one of only one sample. This was the case for the third and fifth sets of weights, again with the Manhattan measure; in other terms, the two sets of weights putting a lot of emphasis on punctuation.
14 K-Means being a hard clustering technique, each sample belongs to one and only one cluster, and ci,j ∈ {0, 1}. For other types of clusterings, the values would have been non-negative and fractional, each column summing up to 1 (soft clusterings).
As for the weighted Chi-square measure, it never converged in less than 100 iterations. However, the resulting clusterings were not trivial (or almost trivial) ones, and consisted of more than 50 non-empty clusters (of between 1 and 20 samples each).
The analysis of the clusterings furthermore shows that the Bray-Curtis dissimilarity measures gave very poor results, with trivial clusterings for both the unweighted version and all sets of weights, except for the third one (the one focusing on punctuation and length of sentences: cf. Fig. 2 for a visualization of its distance to other clusterings). This strongly advocates for dropping the Bray-Curtis measure in such tasks: as outlined below, the results it gives are very similar to those of other metrics, and, unless punctuation is to be the main criterion, the computational cost involved by adding a metric to the pool is not worth it.
4.1 Clustering comparison: Variation of information
Figure: (a) w.r.t. clustering 10, (b) w.r.t. clustering 13.
where the dij are the (Euclidean) distances between the points in the target space, and f(d∗ij) are the best-fitting distances corresponding to the input dissimilarities (found by monotonic regression, in order to minimize the stress)17.
15 Note that the clusterings 1, 2, 4, 5, 6, 25, 26 and 28 were trivial ones.
16 For that purpose, we had to remove the identical clusterings in order to get a nonsingular VI matrix; indeed, 8 metrics resulted in the same, trivial clustering (all data points in the same cluster): they were replaced by only one instance of that trivial clustering.
The different markers used for the points in the 2D visualization correspond
to the 4 groups derived from the 3D visualization. Indeed, while the 2D scaling,
even if it placed the clusterings in several clearly separable areas (in particular
17 The better fitting in 3 dimensions is also revealed by the stress criterion: 0.09520 in 2D,
mation of the original dissimilarities] as well as the distances [resulting distances in the d-
dimensional target space] in a Shepard plot. This provides a check on how well the distances
recreate the disparities, as well as how nonlinear the monotonic transformation from dissimi-
larities to disparities is.”, Matlab help, http://www.mathworks.com/
Number  Measure                    Number  Measure
1       Bray-Curtis (Weights 1)    7       Canberra (Weights 1)
3       Bray-Curtis (Weights 3)    8       Canberra (Weights 2)
17      Chi-Square (Unweighted)    9       Canberra (Weights 3)
27      Manhattan (Weights 3)      11      Canberra (Weights 5)
29      Manhattan (Weights 5)      12      Chi-Square (Weights 1)
                                   13      Chi-Square (Weights 2)
10      Canberra (Weights 4)       14      Chi-Square (Weights 3)
15      Chi-Square (Weights 4)     16      Chi-Square (Weights 5)
23      Euclidean (Weights 4)      20      Euclidean (Weights 1)
                                   21      Euclidean (Weights 2)
18      Cosine                     22      Euclidean (Weights 3)
19      Extended Jaccard           24      Euclidean (Weights 5)
4.2 Text matchings
of its closeness to the other samples. Such a visualization can be found in Fig. 6: note that, while the sorting was based only on the mean similarities, the high similarity values on the left correspond to low standard deviations, suggesting that for samples truly alike, most metrics yield the same classification.
Looking then for very low standard deviations between the pairwise matchings20, by determining the entries of Σ below an arbitrary threshold, we were able to find the “most obvious matches”, that is, the good matches (entry close to 1 in S̄) between samples on which almost every clustering agreed (entry close to 0 in Σ). In other words, the selected set of matching couples was
\[
\mathcal{M} = \left\{ (i, j) \;:\; i \neq j,\; \bar{S}_{i,j} > 1 - \varepsilon,\; \Sigma_{i,j} < \varepsilon' \right\}
\]
where ε ∈ [0, 1) and ε' > 0 are arbitrary thresholds, chosen so that the resulting set is not empty (but still relatively sparse: no more than a fraction α of the N(N−1)/2 pairs). For this study, the thresholds were set at
\[
\varepsilon = \alpha \left( 1 - \max_i \operatorname{mean}_j \bar{S}_{i,j} \right), \qquad
\varepsilon' = \beta \min_i \operatorname{mean}_j \Sigma_{i,j}
\]
where α = 0.5 and β = 0.75 were tuned manually. As shown in Fig. 7, amongst others21, the group of samples {147, 163, 179, 188, 244, 377, 455} is considered “similar” by most of the clusterings, with reference to both samples 188 and 24422. Another group outputted was {152, 184, 273, 405}23 (the intra-similarities for this group, with reference to the sample 184, are displayed in Fig. 8).
20 Not taking into account the perfect average matching of a sample with itself.
21 Those two particular groups were chosen for further analysis as they include some samples perfectly matched together (i.e., with S̄i,j = 1 and Σi,j = 0).
22 That is, the output showed that the pairs (147, 188), (163, 188), (179, 188), (244, 188), (377, 188), (455, 188), (147, 244), . . . , (455, 244) all belonged to M.
23 The pairs (184, 152), (273, 152); (152, 184), (273, 184), (405, 184); (152, 273), (184, 273), (405, 273); (184, 405), (273, 405)
Figure 7: Example of detected matchings.
These two groups, (I) and (II), small enough to be further analyzed, were a priori unrelated, and it is only by looking at the samples themselves that the matching could be explained or discarded:
Set (I) After looking up what exactly the texts of this group were (especially the ones with perfect matching, i.e. S̄i,j = 1, Σi,j = 0: clustered together in every single one of the 29 clusterings), it appeared that 4 texts of the first set were all by the same author, William Wymark Jacobs, an author of short stories and novels from the beginning of the twentieth century (mostly known for his tale “The Monkey’s Paw”): two of them were disjoint parts of the same book, while the two remaining were different novels (the complete list of the samples in (I) and (II) can be found in Table 3). This seems to suggest that in the case of a characteristic tone and written style, the set of features considered as well as the clustering method employed give sound results. Note also that the most discriminant criterion appears to be the standard deviation: high values of mean similarity prove to be strongly correlated with low standard deviations.
Figure 8: The samples 152, 184, 273 and 405 are matched together as well.
Note that within the standard deviation threshold allowed, other books were classified as belonging to this set: one of them by Laura Lee Hope24, one by Sophie May25 and the other by H. Irving Hancock, all three authors of children’s literature from the early twentieth century and, in that fashion, related indeed to William Wymark Jacobs’ works, both by the era and the type of literature.
Since most of the metrics, even those putting very little or no weight on word distribution, agreed on this classification, one can safely assert that the matchings were the result of more than simple vocabulary sharing.
Looking further at the two samples at indices #163 and #179 in Fig. 7 (respectively “Bunny Brown and His Sister Sue Giving a Show” (L. L. Hope), and “The Grammar School Boys in Summer Athletics” (H. I. Hancock)), it appears they were classified as similar to W. W. Jacobs’ works by 17 non-trivial metrics out of 22, and it might be worth noticing that once again, the Cosine and extended Jaccard resulted in the same prediction (“#179 more alike to the Jacobs works than #163, but both not that similar to them”), while the Chi-Square (Weights 4) and Euclidean (Weights 4) also had similar outputs (“#163 slightly more likely to be like the Jacobs works than #179, and both quite
24 Pseudonym used by the Stratemeyer Syndicate for the Bobbsey Twins and several other children’s fiction series.
Table 3: Samples in sets (I) and (II).
Number  Title                                             Author or date
147     Jimmy, Lucy, and All (Little Prudy’s Children)    Sophie May
163     The Grammar School Boys in Summer Athletics       H. I. Hancock
179     Bunny Brown And His Sister Sue Giving A Show      L. L. Hope
188     At Sunwich Port, Part 3                           W. W. Jacobs
244     At Sunwich Port, Part 4                           W. W. Jacobs
377     Odd Craft                                         W. W. Jacobs
455     Ship’s Company                                    W. W. Jacobs
152     Punch, or The London Charivari. Vol. 153.         August 1, 1917
184     Punch, or The London Charivari. Vol. 152.         January 17, 1917
273     Punch, or The London Charivari. Vol. 153.         July 11, 1917
405     Punch, or The London Charivari. Vol. 153.         November 7, 1917
amount of time (the measures discarded were the numbers 1, 3, 17, 29 and 10, that is, the Bray-Curtis (Weights 1 and 3), the unweighted Chi-Square, the Manhattan (Weights 5) and the Canberra (Weights 4): some of those resulting in the greatest running times). To proceed with an even smaller number of metrics, one can try to select the best subset of metrics, as we did, by successive trials, favouring the combinations in which the most expensive metrics are dropped; or to change the thresholds ε and ε', to improve the precision and recall of such combinations.
5 Future work
Here are discussed several possible ways of following up on this project, and extensions or improvements which could be of some interest.
Metrics and features One might want to consider more dissimilarity measures, and see how they behave and in which clusterings they result. Another option would be to enlarge the set of features (e.g., either by taking into account the full set of indices computed for every word, as suggested in Appendix 7.1, or by broadening the reference dictionary, or by defining completely new indicators to characterize each text).
Clustering algorithm Since the choice of the clustering algorithm (i.e., K-Means) was arbitrary, and made mainly because of its simplicity and speed, it would also be interesting to try other clustering methods, such as Expectation-Maximization (EM) clustering or other types of soft clustering. Again regarding K-Means, which happens to be dependent on the initialization, it would also be worth studying the dependency of the result on the said initialization (or trying to remove that dependency with techniques such as simulated annealing). In the few attempts we had the time to pursue on that matter, it appeared that the behaviour of K-Means, given different initialization patterns for the centroids, was heavily dependent on the pseudo-semi-metric used.
or, in other terms, each text would be characterized by the following values:
26 If no marked prevalence of topic is found, then the prevalent topic would be set to a null
• p-tuple of prevalent overall topics
• set of k-tuples, for different values of k, showing the k-succession of preva-
lent topics in the text
This would yield another mapping from the text samples to RP, and from that other clusterings, based on a completely different notion of similarity, taking into account, this time, the concepts, themes, and structure of the documents.
6 Conclusion
This paper suggested a new method to automatically classify and cluster unknown text samples based on their style, by defining a set of features to be extracted from each of them and applying machine learning techniques to the resulting data. It also attempted to analyze the behaviour of different measures one could consider for that task, and to determine their respective suitability. Even though many possibilities and future improvements remain to be considered, this study led to the following conclusions:
The choice of features is relevant, and leads to sound results.
Surprisingly enough, quite different metrics yield very close clusterings.
Comparison of these metrics shows which of them are most helpful for our purpose, and which of them are to be discarded.
A priori unknown text samples were found to be related, based solely on the few features extracted.
7 Appendix
7.1 Complete list of features
The idea of using logarithms is inspired by the definition of the Inverse Document Frequency: since the frequencies computed are likely to be very small for most of the words (according to Zipf’s law), taking the natural logarithm of the inverse results in indices in an “acceptable range”, easy to compare from a human point of view, and without any risk of mistake due to loss of precision during calculations (a risk which would be more likely to occur with quantities of the order of 10−10).
• named entities27
– overall frequency: ln(#words / #named entities)
– internal frequency: ln(#named entities / #(different named entities))
– ln(#characters / #signs)
• lexical categories
– ln(#(words in text) / #(adjectives in text))
– ln(#(words in text) / #(nouns in text))
– ln(#(words in text) / #(interjections in text))
• Words
for each word from the reference dictionary29, except named entities30:
– ln(#(words in text) / #(occurrences in text)), to quantify the average density of the word in the text
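A toy sketch of this last indicator (the natural logarithm of the inverse relative frequency of each dictionary word in the text); the small dictionary below is only a stand-in for the ~140,000-word reference dictionary of the study:

```python
import math
import re

def word_log_densities(text, dictionary):
    """For each dictionary word occurring in the text: ln(#words / #occurrences)."""
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = {}
    for t in tokens:
        counts[t] = counts.get(t, 0) + 1
    n_words = len(tokens)
    return {w: math.log(n_words / counts[w]) for w in dictionary if w in counts}

print(word_log_densities("the cat sat on the mat, and the cat purred",
                         dictionary={"the", "cat", "dog"}))
```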
7.2 Metrics summary
7.2.1 Description
For u, v ∈ RM,
\[
d(u, v) = \sqrt{\sum_{i=1}^{M} \omega_i (u_i - v_i)^2} \qquad \text{(Weighted Euclidean)}
\]
\[
d(u, v) = \sum_{i} \omega_i \, |u_i - v_i| \qquad \text{(Weighted Manhattan)}
\]
\[
d(u, v) = \sum_{i} \frac{(u_i - v_i)^2}{|u_i + v_i|} \qquad \text{(Chi-square distance}^{32}\text{)}
\]
\[
d(u, v) = \sum_{i} \omega_i \, \frac{(u_i - v_i)^2}{|u_i + v_i|} \qquad \text{(Weighted Chi-square)}
\]
\[
d(u, v) = \sum_{i} \frac{|u_i - v_i|}{|u_i| + |v_i|} \qquad \text{(Canberra distance)}
\]
\[
d(u, v) = \frac{\sum_i |u_i - v_i|}{\sum_i \left(|u_i| + |v_i|\right)} \qquad \text{(Bray-Curtis dissimilarity)}
\]
\[
\sigma(u, v) = \frac{u^T v}{\|u\|_2 \cdot \|v\|_2} = \frac{\sum_i u_i v_i}{\sqrt{\sum_i u_i^2}\,\sqrt{\sum_i v_i^2}} \qquad \text{(Cosine similarity)}
\]
\[
\sigma(u, v) = \frac{u^T v}{\|u\|_2^2 + \|v\|_2^2 - u^T v} = \frac{\sum_i u_i v_i}{\sum_i u_i^2 + \sum_i v_i^2 - \sum_i u_i v_i} \qquad \text{(Extended Jaccard coefficient)}
\]
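A sketch of these measures in code (the handling of zero denominators and the weight vector ω shown here are conventions of this sketch, not necessarily those of the study):

```python
import numpy as np

def weighted_euclidean(u, v, w):
    return np.sqrt(np.sum(w * (u - v) ** 2))

def weighted_manhattan(u, v, w):
    return np.sum(w * np.abs(u - v))

def chi_square(u, v, w=None):
    w = np.ones_like(u) if w is None else w
    denom = np.abs(u + v)
    m = denom > 0                       # convention: skip 0/0 terms
    return np.sum(w[m] * (u - v)[m] ** 2 / denom[m])

def canberra(u, v):
    denom = np.abs(u) + np.abs(v)
    m = denom > 0
    return np.sum(np.abs(u - v)[m] / denom[m])

def bray_curtis(u, v):
    return np.sum(np.abs(u - v)) / np.sum(np.abs(u) + np.abs(v))

def cosine_similarity(u, v):
    return (u @ v) / (np.linalg.norm(u) * np.linalg.norm(v))

def extended_jaccard(u, v):
    uv = u @ v
    return uv / (u @ u + v @ v - uv)

u, v = np.array([1.0, -0.5, 2.0]), np.array([0.5, 0.5, 1.0])
w = np.ones_like(u)
print(weighted_euclidean(u, v, w), canberra(u, v), extended_jaccard(u, v))
```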
7.2.2 Converting similarity measures into pseudo-semi-distances
Denoting by σ̄ the similarity measure renormalized to take values in [0, 1], the following conversions were considered:
\[
\delta(u, v) = 1 - \bar{\sigma}(u, v), \quad \text{values in } [0, 1]
\]
\[
\delta(u, v) = -\ln \bar{\sigma}(u, v), \quad \text{values in } [0, \infty)
\]
\[
\delta(u, v) = \arccos \bar{\sigma}(u, v), \quad \text{values in } [0, \tfrac{\pi}{2}]
\]
\[
\delta(u, v) = \frac{1}{\bar{\sigma}(u, v)} - 1, \quad \text{values in } [0, \infty)
\]
Each of them has its own particularities: for example, the second penalizes low similarities by mapping them to huge distances (since f′(x) = −1/x, δ increases really fast when σ̄ → 0); the fourth is even more marked in that respect, as f′(x) = −1/x². The third derivation acts in the opposite way: slight decreases of similarity around 1 induce great variations in distance33. It has, besides, the property of preserving the triangular inequality.
(Figure: the four conversion functions, (a) f : x ↦ 1 − x, (b) f : x ↦ − ln x, (c) f : x ↦ arccos x, (d) f : x ↦ 1/x − 1.)
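For instance, these conversions (together with the renormalization of a similarity to [0, 1]) could be written as follows; the bounds passed to `renormalize` are those established for each similarity measure (e.g. [−1, 1] for the cosine, [−1/3, 1] for the extended Jaccard, cf. Appendix 7.2.3):

```python
import numpy as np

def renormalize(sigma, lo, hi):
    """Map a similarity taking values in [lo, hi] to [0, 1]."""
    return (sigma - lo) / (hi - lo)

# The four conversion functions f applied to a renormalized similarity s in [0, 1]:
conversions = {
    "one_minus": lambda s: 1.0 - s,
    "neg_log":   lambda s: -np.log(s),      # penalizes low similarities
    "arccos":    lambda s: np.arccos(s),    # sensitive near s = 1; keeps the triangle inequality
    "inverse":   lambda s: 1.0 / s - 1.0,   # penalizes low similarities even more
}

s = renormalize(0.6, -1.0, 1.0)             # e.g. a cosine similarity of 0.6
for name, f in conversions.items():
    print(name, f(s))
```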
7.2.3 About the Extended Jaccard index
Here can be found the proof that the Extended Jaccard coefficient (Tanimoto coefficient) takes values in [−1/3, 1]. This result was used when renormalizing it, to derive a dissimilarity measure (cf. Appendix 7.2.2).
For x, y ∈ Rn arbitrary, not both zero34, recalling that $x^T y = \tfrac{1}{4}\left(\|x + y\|^2 - \|x - y\|^2\right)$,
\[
\sigma(x, y) = \frac{x^T y}{\|x\|^2 + \|y\|^2 - x^T y} = \frac{x^T y}{\|x - y\|^2 + x^T y}
= \frac{\tfrac{1}{4}\left(\|x + y\|^2 - \|x - y\|^2\right)}{\|x - y\|^2 + \tfrac{1}{4}\left(\|x + y\|^2 - \|x - y\|^2\right)}
= \frac{\|x + y\|^2 - \|x - y\|^2}{\|x + y\|^2 + 3\,\|x - y\|^2} = \frac{a - b}{a + 3b}
\]
where $a = \|x + y\|^2$ and $b = \|x - y\|^2$, both non-negative.
(i) Since x = y if, and only if, b = 0, x = y ⇒ σ(x, y) = 1. Reciprocally, σ(x, y) = 1 ⇔ a − b = a + 3b ⇔ b = 0 ⇔ x = y, and finally
\[
x = y \Leftrightarrow \sigma(x, y) = 1
\]
(ii) If x ≠ y, then b > 0 and σ(x, y) = (a/b − 1)/(a/b + 3) is an increasing function of a/b ∈ [0, ∞); that is, σ(x, y) ∈ [−1/3, 1) for x ≠ y, and the minimum is attained for (and only for) a/b = 0, i.e. a = 0 or, in other terms, x = −y.
Conclusion:
\[
\forall x, y \in \mathbb{R}^n, \quad \sigma(x, y) \in \left[-\tfrac{1}{3}, 1\right]
\]
and σ(x, y) = 1 ⇔ x = y, while σ(x, y) = −1/3 ⇔ x = −y.
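As a quick numerical check of these bounds (a small sketch reusing the extended Jaccard formula of Appendix 7.2.1):

```python
import numpy as np

def extended_jaccard(x, y):
    xy = x @ y
    return xy / (x @ x + y @ y - xy)

x = np.array([1.0, -2.0, 0.5])
print(extended_jaccard(x, x))    # 1.0:  attained when x = y
print(extended_jaccard(x, -x))   # -1/3: attained when x = -y
```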
Miscellaneous
The programs and scripts used for this project were developed in Python (preprocessing of the text samples), Java (for the feature extraction part, integrating the Stanford NLP Tools), C++ (clustering) and Matlab (result analysis).
34 If x = y = 0 ∈ Rn, by convention, the Extended Jaccard coefficient is equal to 1 (both the numerator and the denominator are zero).
References
[1] V. R. Khapli and A. S. Bhalchandra. Comparison of Similarity Metrics for Thumbnail Based Image Retrieval. Journal of Computer Science and Engineering, 5(1):15–20, 2011.