10/1/2017 · text mining - Properties of Levenshtein, N-Gram, cosine and Jaccard distance coefficients - in sentence matching - Cross Validated

Properties of Levenshtein, N-Gram, cosine and Jaccard distance coefficients - in sentence matching

Let's say I have two strings:

string A: 'I went to the cafeteria and bought a sandwich.'


string B: 'I heard the cafeteria is serving roast-beef sandwiches today'.

Formulas:

Levenshtein distance: the minimum number of insertions, deletions, or substitutions necessary to convert string A into string B.
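As a rough sketch (not part of the original post), word-level Levenshtein can be computed with the standard dynamic-programming recurrence, assuming simple whitespace tokenization with punctuation stripped:

```python
def levenshtein(a, b):
    # Classic DP edit distance over sequences of tokens (one-row variant).
    m, n = len(a), len(b)
    dp = list(range(n + 1))  # dp[j] = distance between a[:0] and b[:j]
    for i in range(1, m + 1):
        prev, dp[0] = dp[0], i  # prev holds dp[i-1][j-1]
        for j in range(1, n + 1):
            cur = dp[j]
            dp[j] = min(dp[j] + 1,                          # deletion
                        dp[j - 1] + 1,                      # insertion
                        prev + (a[i - 1] != b[j - 1]))      # substitution
            prev = cur
    return dp[n]

# Example sentences from the question, punctuation stripped for simplicity
A = "I went to the cafeteria and bought a sandwich".split()
B = "I heard the cafeteria is serving roast-beef sandwiches today".split()
print(levenshtein(A, B))  # 7, matching the value quoted below
```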

N-gram distance: sum of absolute differences of occurrences of n-gram vectors between two strings. As an example, the first 3 elements of
the bi-gram vectors for strings A and B would be (1, 1, 1) and (0, 0, 0), respectively.
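A minimal sketch of this definition at the word level (again assuming whitespace tokenization, not code from the original post):

```python
from collections import Counter

def ngram_distance(a, b, n=2):
    # Sum of absolute differences between the two n-gram count vectors.
    # Counter returns 0 for absent n-grams, so unshared grams each contribute 1.
    grams_a = Counter(tuple(a[i:i + n]) for i in range(len(a) - n + 1))
    grams_b = Counter(tuple(b[i:i + n]) for i in range(len(b) - n + 1))
    return sum(abs(grams_a[g] - grams_b[g]) for g in grams_a.keys() | grams_b.keys())

A = "I went to the cafeteria and bought a sandwich".split()
B = "I heard the cafeteria is serving roast-beef sandwiches today".split()
print(ngram_distance(A, B))  # 14: only ('the', 'cafeteria') is shared
```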

Cosine similarity:

$$\cos(a, b) = \frac{a \cdot b}{\sqrt{a \cdot a}\,\sqrt{b \cdot b}}$$

Jaccard similarity:

$$J(a, b) = \frac{|a \cap b|}{|a \cup b|}$$
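Assuming a binary (set-based) bag-of-words interpretation, both measures can be sketched as follows (this code is illustrative, not from the original post):

```python
import math

def cosine(a, b):
    # Binary bag-of-words cosine: |A ∩ B| / (sqrt(|A|) * sqrt(|B|)),
    # i.e. the dot product of 0/1 vectors over their Euclidean norms.
    sa, sb = set(a), set(b)
    return len(sa & sb) / (math.sqrt(len(sa)) * math.sqrt(len(sb)))

def jaccard(a, b):
    # Intersection over union of the word sets.
    sa, sb = set(a), set(b)
    return len(sa & sb) / len(sa | sb)

A = "I went to the cafeteria and bought a sandwich".split()
B = "I heard the cafeteria is serving roast-beef sandwiches today".split()
print(round(cosine(A, B), 2))   # 0.33  (3 shared words, 9 words each)
print(jaccard(A, B))            # 0.2   (3 shared / 15 in the union)
```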

Metrics at word granularity for the example sentences:

- Levenshtein distance = 7 (if you consider "sandwich" and "sandwiches" to be different words)
- Bigram distance = 14
- Cosine similarity = 0.33
- Jaccard similarity = 0.2

I would like to understand the pros and cons of using each of these (dis)similarity measures. If possible, it would be nice to understand these pros/cons in the example sentences, but if you have an example that better illustrates the differences, please let me know. Also, I realize that I can scale Levenshtein distance by the number of words in the text, but that wouldn't work for the bigram distance, since it would be greater than 1.

To start, it seems that cosine and Jaccard provide similar results. Jaccard is actually much less computationally intensive and is also (a little
bit) easier to explain to a layman.

text-mining distance-functions similarities

asked Jun 16 '16 at 18:38 by matsuo_basho (edited Jun 20 '16 at 20:51)

A good way to start to understand the differences is to dig up their formulas, all expressed in a single a-b-c-d
"binary data form", such as used in this answer, for example. ttnphns Jun 20 '16 at 18:10

@ttnphns, I understand the differences between the algorithms, just not clear on a situation where, for example,
cosine similarity would be superior to Jaccard similarity. matsuo_basho Jun 20 '16 at 18:14

They are not "algorithms". They are alternative proximity measures. ttnphns Jun 20 '16 at 18:15

If you know the formulas of them all, why not show them in your question; and then ask "how are their properties
different given these formulas? I expect this (something) but I don't understand that (something)". That would make
your question specific and show your efforts. So far, the question is too broad. ttnphns Jun 20 '16 at 18:19

@ttnphns, you're right, added the formulas now. matsuo_basho Jun 20 '16 at 20:53

Levenshtein is a specific form of "alignment" distance, and it compares sequences of elements, i.e. both the content of
elements and their order. Cosine and Jaccard compare only content (an element is, say, a letter). Bi-gram distance
compares the content of elements, but an element is defined specifically as a 2-letter chunk. ttnphns Jun 21 '16 at 9:52

https://stats.stackexchange.com/questions/219243/properties-of-levenshtein-n-gram-cosine-and-jaccard-distance-coecients-in