10/1/2017 · text mining - Properties of Levenshtein, N-Gram, cosine and Jaccard distance coefficients - in sentence matching - Cross Validated

Properties of Levenshtein, N-Gram, cosine and Jaccard distance coefficients - in sentence matching

Let's say I have two strings:

string A: 'I went to the cafeteria and bought a sandwich.'


string B: 'I heard the cafeteria is serving roast-beef sandwiches today'.

Formulas:

Levenshtein distance: the minimum number of insertions, deletions, or substitutions necessary to convert string A into string B.
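As a rough sketch (not part of the original post), word-level Levenshtein can be computed with the standard dynamic-programming recurrence, assuming simple whitespace tokenization with punctuation stripped:

```python
def levenshtein(a, b):
    # Classic DP edit distance over sequences of tokens (one-row variant).
    m, n = len(a), len(b)
    dp = list(range(n + 1))  # dp[j] = distance between a[:0] and b[:j]
    for i in range(1, m + 1):
        prev, dp[0] = dp[0], i  # prev holds dp[i-1][j-1]
        for j in range(1, n + 1):
            cur = dp[j]
            dp[j] = min(dp[j] + 1,                          # deletion
                        dp[j - 1] + 1,                      # insertion
                        prev + (a[i - 1] != b[j - 1]))      # substitution
            prev = cur
    return dp[n]

# Example sentences from the question, punctuation stripped for simplicity
A = "I went to the cafeteria and bought a sandwich".split()
B = "I heard the cafeteria is serving roast-beef sandwiches today".split()
print(levenshtein(A, B))  # 7, matching the value quoted below
```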

N-gram distance: sum of absolute differences of occurrences of n-gram vectors between two strings. As an example, the first 3 elements of
the bi-gram vectors for strings A and B would be (1, 1, 1) and (0, 0, 0), respectively.
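A minimal sketch of this definition at the word level (again assuming whitespace tokenization, not code from the original post):

```python
from collections import Counter

def ngram_distance(a, b, n=2):
    # Sum of absolute differences between the two n-gram count vectors.
    # Counter returns 0 for absent n-grams, so unshared grams each contribute 1.
    grams_a = Counter(tuple(a[i:i + n]) for i in range(len(a) - n + 1))
    grams_b = Counter(tuple(b[i:i + n]) for i in range(len(b) - n + 1))
    return sum(abs(grams_a[g] - grams_b[g]) for g in grams_a.keys() | grams_b.keys())

A = "I went to the cafeteria and bought a sandwich".split()
B = "I heard the cafeteria is serving roast-beef sandwiches today".split()
print(ngram_distance(A, B))  # 14: only ('the', 'cafeteria') is shared
```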

Cosine similarity:

$$\cos(a, b) = \frac{a \cdot b}{\sqrt{a \cdot a}\,\sqrt{b \cdot b}}$$

Jaccard similarity:

$$J(a, b) = \frac{|a \cap b|}{|a \cup b|}$$
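Assuming a binary (set-based) bag-of-words interpretation, both measures can be sketched as follows (this code is illustrative, not from the original post):

```python
import math

def cosine(a, b):
    # Binary bag-of-words cosine: |A ∩ B| / (sqrt(|A|) * sqrt(|B|)),
    # i.e. the dot product of 0/1 vectors over their Euclidean norms.
    sa, sb = set(a), set(b)
    return len(sa & sb) / (math.sqrt(len(sa)) * math.sqrt(len(sb)))

def jaccard(a, b):
    # Intersection over union of the word sets.
    sa, sb = set(a), set(b)
    return len(sa & sb) / len(sa | sb)

A = "I went to the cafeteria and bought a sandwich".split()
B = "I heard the cafeteria is serving roast-beef sandwiches today".split()
print(round(cosine(A, B), 2))   # 0.33  (3 shared words, 9 words each)
print(jaccard(A, B))            # 0.2   (3 shared / 15 in the union)
```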

Metrics at word granularity for the example sentences:

- Levenshtein distance = 7 (if you consider "sandwich" and "sandwiches" to be different words)
- Bigram distance = 14
- Cosine similarity = 0.33
- Jaccard similarity = 0.2

I would like to understand the pros and cons of using each of these (dis)similarity measures. If possible, it would be nice to understand these pros/cons in the example sentences, but if you have an example that better illustrates the differences, please let me know. Also, I realize that I can scale Levenshtein distance by the number of words in the text, but that wouldn't work for the bigram distance, since it would be greater than 1.

To start, it seems that cosine and Jaccard provide similar results. Jaccard is actually much less computationally intensive and is also (a little
bit) easier to explain to a layman.

text-mining distance-functions similarities

asked Jun 16 '16 at 18:38 by matsuo_basho (edited Jun 20 '16 at 20:51)

A good way to start to understand the differences is to dig up their formulas, all expressed in a single a-b-c-d
"binary data form", such as used in this answer, for example. ttnphns Jun 20 '16 at 18:10

@ttnphns, I understand the differences between the algorithms, just not clear on a situation where, for example,
cosine similarity would be superior to Jaccard similarity. matsuo_basho Jun 20 '16 at 18:14

They are not "algorithms". They are alternative proximity measures. ttnphns Jun 20 '16 at 18:15

If you know the formulas of them all, why not show them in your question; and then ask "how are their properties
different given these formulas? I expect this (something) but I don't understand that (something)". That would make
your question specific and show your efforts. So far, the question is too broad. ttnphns Jun 20 '16 at 18:19

@ttnphns, you're right, added the formulas now. matsuo_basho Jun 20 '16 at 20:53

Levenshtein is a specific form of "alignment" distance, and it compares sequences of elements, i.e. both the content of
elements and their order. Cosine and Jaccard compare only content (an element is, say, a letter). Bi-gram distance
compares the content of elements, but an element is defined specifically as a 2-letter chunk. ttnphns Jun 21 '16 at 9:52

https://stats.stackexchange.com/questions/219243/properties-of-levenshtein-n-gram-cosine-and-jaccard-distance-coecients-in