Properties of Levenshtein, N-Gram, Cosine and Jaccard Distance Coefficients - In Sentence Matching - Cross Validated

string B: 'I heard the cafeteria is serving roast-beef sandwiches today'.

Formulas:

Levenshtein distance: Minimum number of insertions, deletions or substitutions necessary to convert string a into string b

N-gram distance: sum of absolute differences between the n-gram occurrence vectors of the two strings. As an example, the first 3 elements of the bi-gram vectors for strings A and B would be (1, 1, 1) and (0, 0, 0), respectively.

Cosine similarity: $\frac{a \cdot b}{\sqrt{a \cdot a}\,\sqrt{b \cdot b}}$

Jaccard similarity: $\frac{\mathrm{length}(a \cap b)}{\mathrm{length}(a \cup b)}$

word)

Bigram distance = 14

Cosine similarity = 0.33

Jaccard similarity = 0.2
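The n-gram, cosine and Jaccard formulas above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: tokens are taken to be whitespace-separated words, cosine is computed on word-count vectors, Jaccard on sets of distinct words, and the two example token lists `x` and `y` are hypothetical, not the strings from the question, so the outputs are not expected to reproduce the figures listed above.

```python
from collections import Counter
from math import sqrt


def ngrams(tokens, n=2):
    """Return the n-grams (as tuples) of a token sequence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def ngram_distance(a, b, n=2):
    """Sum of absolute differences between the two n-gram count vectors."""
    ca, cb = Counter(ngrams(a, n)), Counter(ngrams(b, n))
    return sum(abs(ca[g] - cb[g]) for g in ca.keys() | cb.keys())


def cosine_similarity(a, b):
    """a.b / (sqrt(a.a) * sqrt(b.b)) on token count vectors."""
    ca, cb = Counter(a), Counter(b)
    dot = sum(ca[t] * cb[t] for t in ca.keys() & cb.keys())
    norm = sqrt(sum(v * v for v in ca.values())) * sqrt(sum(v * v for v in cb.values()))
    return dot / norm if norm else 0.0


def jaccard_similarity(a, b):
    """Size of the intersection over size of the union of distinct tokens."""
    sa, sb = set(a), set(b)
    union = sa | sb
    return len(sa & sb) / len(union) if union else 0.0


# Hypothetical example sentences (assumption: word-level tokens).
x = "the cat sat on the mat".split()
y = "the cat lay on the rug".split()

print(ngram_distance(x, y))      # 6
print(cosine_similarity(x, y))   # approximately 0.75
print(jaccard_similarity(x, y))  # 3/7, approximately 0.43
```

Because the functions accept any sequence, passing the raw strings instead of word lists would give the character-level variants.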

I would like to understand the pros and cons of using each of these (dis)similarity measures. If possible, it would be nice to understand these pros/cons in the example sentence, but if you have an example that better illustrates the differences, please let me know. Also, I realize that I can scale Levenshtein distance by the number of words in the text, but that wouldn't work for the bigram distance, since the result could be greater than 1.
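As a rough sketch of the scaling idea, the snippet below computes a word-level Levenshtein distance and divides it by the length of the longer sequence, which keeps it in [0, 1]. It also shows one possible scaling for the n-gram distance, dividing by the total number of n-grams in both strings; that choice is an assumption made here for illustration, not something proposed in the question. The token lists `x` and `y` are hypothetical.

```python
from collections import Counter


def levenshtein(a, b):
    """Minimum number of insertions, deletions or substitutions turning a into b."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, item_a in enumerate(a, start=1):
        curr = [i]
        for j, item_b in enumerate(b, start=1):
            cost = 0 if item_a == item_b else 1
            curr.append(min(prev[j] + 1,          # delete item_a
                            curr[j - 1] + 1,      # insert item_b
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def normalized_levenshtein(a, b):
    """Levenshtein distance divided by the longer length, so it lies in [0, 1]."""
    longest = max(len(a), len(b))
    return levenshtein(a, b) / longest if longest else 0.0


def ngram_counts(tokens, n=2):
    """Count vector of the n-grams of a token sequence."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def normalized_ngram_distance(a, b, n=2):
    """N-gram distance divided by the total n-gram count of both inputs (assumed scaling)."""
    ca, cb = ngram_counts(a, n), ngram_counts(b, n)
    total = sum(ca.values()) + sum(cb.values())
    diff = sum(abs(ca[g] - cb[g]) for g in ca.keys() | cb.keys())
    return diff / total if total else 0.0


# Hypothetical example sentences (assumption: word-level tokens).
x = "the cat sat on the mat".split()
y = "the cat lay on the rug".split()

print(levenshtein(x, y))                # 2
print(normalized_levenshtein(x, y))     # 2/6
print(normalized_ngram_distance(x, y))  # 6/10
```

The second scaling stays at or below 1 because each absolute difference of counts is at most the sum of the two counts.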

To start, it seems that cosine and Jaccard provide similar results. Jaccard is actually much less computationally intensive and is also (a little bit) easier to explain to a layman.
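One case where the two can diverge is repeated words. The sketch below assumes cosine is computed on word-count vectors and Jaccard on sets of distinct words, which is one reading of the formulas; the token lists `p` and `q` are made up for illustration.

```python
from collections import Counter
from math import sqrt


def cosine_similarity(a, b):
    """a.b / (sqrt(a.a) * sqrt(b.b)) on token count vectors."""
    ca, cb = Counter(a), Counter(b)
    dot = sum(ca[t] * cb[t] for t in ca.keys() & cb.keys())
    norm = sqrt(sum(v * v for v in ca.values())) * sqrt(sum(v * v for v in cb.values()))
    return dot / norm if norm else 0.0


def jaccard_similarity(a, b):
    """Size of the intersection over size of the union of distinct tokens."""
    sa, sb = set(a), set(b)
    union = sa | sb
    return len(sa & sb) / len(union) if union else 0.0


# Hypothetical inputs: same vocabulary, different word frequencies.
p = "no no no yes".split()
q = "no yes yes yes".split()

print(jaccard_similarity(p, q))  # 1.0, the sets of words are identical
print(cosine_similarity(p, q))   # approximately 0.6, the counts differ
```

Under these assumptions Jaccard ignores how often a word occurs, while cosine does not. If cosine were computed on binary presence vectors instead, the two inputs would also score 1.0.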

matsuo_basho


A good way to start to understand the differences is to dig up their formulas, all expressed in a single a-b-c-d "binary data form", such as used in this answer, for example. ttnphns Jun 20 '16 at 18:10

@ttnphns, I understand the differences between the algorithms, I'm just not clear on a situation where, for example, cosine similarity would be superior to Jaccard similarity. matsuo_basho Jun 20 '16 at 18:14

They are not "algorithms". They are alternative proximity measures. ttnphns Jun 20 '16 at 18:15

If you know the formulas of them all, why not show them in your question, and then ask "how are their properties different given these formulas? I expect this (something) but I don't understand that (something)". That would make your question specific and show your efforts. So far, the question is too broad. ttnphns Jun 20 '16 at 18:19

@ttnphns, you're right, added the formulas now. matsuo_basho Jun 20 '16 at 20:53

Levenshtein is a specific form of "alignment" distance, and it compares sequences of elements, i.e. both the content of elements and their order. Cosine and Jaccard compare only content (an element is, say, a letter). Bi-gram distance compares content of elements, but an element is defined specifically as a 2-letter chunk. ttnphns Jun 21 '16 at 9:52
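The order-versus-content point in the comment above can be checked with a small sketch. It assumes word-level elements rather than letters, and the two token lists are hypothetical: they contain the same words in a different order.

```python
def levenshtein(a, b):
    """Minimum number of insertions, deletions or substitutions turning a into b."""
    prev = list(range(len(b) + 1))
    for i, item_a in enumerate(a, start=1):
        curr = [i]
        for j, item_b in enumerate(b, start=1):
            cost = 0 if item_a == item_b else 1
            curr.append(min(prev[j] + 1,          # delete item_a
                            curr[j - 1] + 1,      # insert item_b
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def jaccard_similarity(a, b):
    """Size of the intersection over size of the union of distinct elements."""
    sa, sb = set(a), set(b)
    union = sa | sb
    return len(sa & sb) / len(union) if union else 0.0


# Hypothetical inputs: identical content, different order.
s = "dog bites man".split()
t = "man bites dog".split()

print(levenshtein(s, t))         # 2, the order differs in two positions
print(jaccard_similarity(s, t))  # 1.0, the content is identical
```

Jaccard reports a perfect match because it only sees which elements occur, whereas Levenshtein registers the reordering as two substitutions.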

https://stats.stackexchange.com/questions/219243/properties-of-levenshtein-n-gram-cosine-and-jaccard-distance-coecients-in
