
Elasticsearch-3

Dealing with Human Language


Mohammad Aminul Islam
11103812
Cologne University of Applied Sciences

6/16/15

Topics to Be Covered
Getting started with languages
Analyzers
Stemming
Language identification

Identifying words
Normalizing tokens
Reducing words to their root form
Stemming issues
Lemmatization
Types of stemmers

Stopwords: Performance vs. Precision

Synonyms, Typos and Misspellings

Goals
What is happening behind the search engine


"I know all the words, but that sentence makes no sense to me."
Matt Groening


Getting started with languages: Analyzers

Elasticsearch ships with a collection of language analyzers
These analyzers perform four steps
Tokenize text into individual words
The quick brown foxes > The, quick, brown, foxes

Lowercase the tokens
The > the

Remove common stopwords
quick, brown, foxes

Reduce tokens to their root form
foxes > fox

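A minimal sketch of these four steps, assuming a local Elasticsearch node on localhost:9200 and the Python requests package (both illustrative assumptions), runs the example sentence through the english analyzer via the _analyze API:

```python
# Send the example sentence through the built-in `english` analyzer.
# Assumes Elasticsearch is reachable at localhost:9200 (illustrative address).
import requests

resp = requests.post(
    "http://localhost:9200/_analyze",
    json={"analyzer": "english", "text": "The quick brown foxes"},
)
print([t["token"] for t in resp.json()["tokens"]])
# Expected output along the lines of: ['quick', 'brown', 'fox']
```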

Analyzer
The english analyzer removes the possessive 's
John's > John

The french analyzer removes elisions such as l' and qu'
The german analyzer normalizes characters, for example ä > a and ß > ss


Question?

I am not happy with HTC mobile > I, am, happy, HTC, mobile
The stopword not has been removed, which reverses the meaning
What can we do now?


Analyzer: Configuration
Language analyzers can be used without any configuration, but they also allow some behaviour to be controlled
Stem-word exclusion
For example: organ and organization are stemmed to the same root, even though their meanings differ

Custom stopwords
For example: not and no may be considered important words and kept out of the stopword list
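As a sketch of such configuration, assuming the same local node and requests package, the settings below (index name my_blog is illustrative) give the english analyzer a stem-word exclusion list and a custom stopword list that keeps no and not searchable:

```python
# Create an index whose `english` analyzer leaves `organization` unstemmed
# and uses a custom stopword list without `no`/`not`.
import requests

settings = {
    "settings": {
        "analysis": {
            "analyzer": {
                "my_english": {
                    "type": "english",
                    "stem_exclusion": ["organization", "organizations"],
                    "stopwords": ["a", "an", "and", "the"],  # `no`/`not` deliberately omitted
                }
            }
        }
    }
}
print(requests.put("http://localhost:9200/my_blog", json=settings).json())
```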

Incorrect stemming
Stemming rules are different for different languages
We cannot use one set of stemming rules for all languages
The stemmed root sometimes changes the meaning of the actual word
For example: ebay.co.uk (English) and ebay.de (German) need different stemming rules


Identify Language
We know the language of our own documents
External documents may contain different languages
We can use a language detector to identify the language
For example: the Compact Language Detector from Google
It can detect 160+ languages
It can detect multiple languages within a single piece of text

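The slides name Google's Compact Language Detector; as a stand-in to illustrate the same idea, the sketch below uses the langdetect Python package (an assumption, not the tool named above):

```python
# Detect the language of short text snippets with `langdetect`
# (pip install langdetect); used here only to illustrate language detection.
from langdetect import detect, detect_langs

print(detect("Elasticsearch makes full-text search easy."))  # e.g. 'en'
print(detect_langs("Köln ist eine Stadt am Rhein."))          # e.g. [de:0.99...]
```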

Identifying Words


Words
Words are separated by whitespace or punctuation
In English some words are tricky to tokenize: o'clock, cooperate, eyewitness
German and Dutch have compound words
Some Asian languages have no whitespace between words
Elasticsearch provides dedicated analyzers for many languages


Standard Analyzer
By default the standard analyzer is used
We can rebuild the standard analyzer as a custom analyzer, as shown below

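A minimal sketch of that idea, assuming the same local node and an illustrative index name my_index, rebuilds the standard analyzer as a custom analyzer (standard tokenizer plus lowercase filter, with a stop filter added here so stopword removal can be customised):

```python
# Define a custom analyzer equivalent in spirit to the standard analyzer.
import requests

body = {
    "settings": {
        "analysis": {
            "analyzer": {
                "rebuilt_standard": {
                    "type": "custom",
                    "tokenizer": "standard",
                    "filter": ["lowercase", "stop"],
                }
            }
        }
    }
}
print(requests.put("http://localhost:9200/my_index", json=body).json())
```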

Standard Tokenizer
Takes a string as input, processes it, and breaks it into individual terms
The whitespace tokenizer simply breaks on whitespace
You are the 1st runner home! > You, are, the, 1st, runner, home!

The letter tokenizer breaks on any character that is not a letter
You, are, the, st, runner, home
The standard tokenizer uses the Unicode text segmentation algorithm, which allows it to handle text containing a mixture of languages

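The tokenizers can be compared directly with the _analyze API; the sketch below (same local node assumed) runs the slide's example sentence through the whitespace, letter, and standard tokenizers:

```python
# Compare three built-in tokenizers on the same input.
import requests

text = "You are the 1st runner home!"
for tokenizer in ("whitespace", "letter", "standard"):
    resp = requests.post(
        "http://localhost:9200/_analyze",
        json={"tokenizer": tokenizer, "text": text},
    )
    print(tokenizer, [t["token"] for t in resp.json()["tokens"]])
```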

Reducing Words to Their Root Form


Words can change their form

Number: fox, foxes
Tense: pay, paid, paying
Gender: waiter, waitress
Person: hear, hears

Stemming tries to remove the differences between an inflected word and its root form
For example: foxes > fox
Problem: the stemmed root does not always carry the same meaning


Two issues of Stemming

Understemming:
Failing to reduce words with the same meaning to the same root
For example: jumped and jumps > jump,
but jumping is reduced to jumpi
Relevant documents will not be returned

Overstemming:
Failing to keep two words with distinct meanings separate
For example: general and generate > gener
Irrelevant documents will be returned


Lemmatization
The lemma is the canonical form of a set of related words
For example: the lemma of paying, paid, pays is pay
Lemmatization can group words by their word sense
For example: wake and wake up have different senses
Lemmatization is a much more complicated and expensive process than stemming


Types of stemmer
Algorithmic stemmers:
Based on an algorithm
Easy to use
Fast, use little memory
Good for regular words

Dictionary stemmers:
Based on a dictionary
Use more memory
Have to load all words

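As a sketch of the algorithmic variety, the settings below wire an english stemmer token filter into a custom analyzer; a dictionary stemmer would use the hunspell token filter instead, which additionally needs Hunspell dictionary files installed on the node. The index name stemming_demo is illustrative:

```python
# Algorithmic stemming: the `stemmer` token filter with the `english` language.
import requests

body = {
    "settings": {
        "analysis": {
            "filter": {
                "english_stemmer": {"type": "stemmer", "language": "english"}
            },
            "analyzer": {
                "stemmed_english": {
                    "type": "custom",
                    "tokenizer": "standard",
                    "filter": ["lowercase", "english_stemmer"],
                }
            }
        }
    }
}
print(requests.put("http://localhost:9200/stemming_demo", json=body).json())
```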

Question?
Which stemmer should we use?


Stopwords
Performance vs. Precision


For search purposes some words are more important than others. For better indexing we need to identify the valuable words.
Low-frequency terms:
Words that rarely appear in the document collection have a higher value or weight
High-frequency terms:
Common words that appear in many documents, such as the or an, have a lower value or weight


Default stopwords
The default English stopwords used in
Elasticsearch are as follows:
a, an, and, are, as, at, be, but, by, for, if, in,
into, is, it, no, not, of, on, or, such, that, the,
their, then, there, these, they, this, to, was,
will, with

These stopwords are filtered out before indexing, with little negative impact on retrieval. But is it a good idea?

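A sketch of both extremes, assuming the same local node and an illustrative index name stopword_demo: one analyzer uses the default English list via _english_, the other disables stopwords entirely with _none_:

```python
# Two standard analyzers: one using the default English stopword list shown
# above (`_english_`), one keeping every word (`_none_`).
import requests

body = {
    "settings": {
        "analysis": {
            "analyzer": {
                "with_stopwords": {"type": "standard", "stopwords": "_english_"},
                "no_stopwords": {"type": "standard", "stopwords": "_none_"},
            }
        }
    }
}
print(requests.put("http://localhost:9200/stopword_demo", json=body).json())
```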

Pros & Cons of Stopwords


Cons:
Distinguishing happy from not happy
Finding Shakespeare's quotation "To be, or not to be"
Using the country code for Norway: no

Pros:
Performance
Searching for fox instead of the fox


Synonyms
Synonyms are listed as comma-separated values
With the => syntax, it is possible to map one or more terms on the left-hand side to the replacement terms on the right-hand side, as in the sketch below

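A sketch of a synonym token filter using both notations, assuming the same local node; the index name synonym_demo and the synonym entries are illustrative:

```python
# A synonym filter combining comma-separated equivalents and `=>` mappings.
import requests

body = {
    "settings": {
        "analysis": {
            "filter": {
                "my_synonyms": {
                    "type": "synonym",
                    "synonyms": [
                        "jump,leap,hop",               # treated as equivalent terms
                        "u s a,united states => usa",  # left-hand terms replaced by `usa`
                    ],
                }
            },
            "analyzer": {
                "synonym_analyzer": {
                    "type": "custom",
                    "tokenizer": "standard",
                    "filter": ["lowercase", "my_synonyms"],
                }
            }
        }
    }
}
print(requests.put("http://localhost:9200/synonym_demo", json=body).json())
```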

Typos & Misspellings

80% of human misspellings have an edit distance of 1,
i.e. they could be corrected with a single edit
The allowed edit distance is specified by the fuzziness parameter, up to a maximum of 2
With fuzziness set to AUTO:
0 edits for strings of one or two characters
1 edit for strings of three, four, or five characters
2 edits for strings of more than five characters

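A sketch of a match query with fuzziness set to AUTO, so that a typo such as quikc still matches quick; the index name my_index and the field title are illustrative:

```python
# A match query that tolerates typos via the `fuzziness` parameter.
import requests

query = {
    "query": {
        "match": {
            "title": {"query": "quikc brown fox", "fuzziness": "AUTO"}
        }
    }
}
resp = requests.post("http://localhost:9200/my_index/_search", json=query)
print(resp.json()["hits"]["total"])
```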

Phonetic Matching
Words may sound similar even though their spelling differs
Algorithms that convert words to a phonetic representation include:
Soundex, the granddaddy of them all
Metaphone and Double Metaphone for English
Caverphone for matching names in New Zealand
Kölner Phonetik for better handling of German words

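A sketch of phonetic analysis: the settings below define a double_metaphone token filter, which requires the analysis-phonetic plugin to be installed on the node; the index name phonetic_demo is illustrative:

```python
# A custom analyzer that folds tokens to their Double Metaphone encoding.
# Requires the analysis-phonetic plugin on the Elasticsearch node.
import requests

body = {
    "settings": {
        "analysis": {
            "filter": {
                "dbl_metaphone": {"type": "phonetic", "encoder": "double_metaphone"}
            },
            "analyzer": {
                "phonetic_analyzer": {
                    "type": "custom",
                    "tokenizer": "standard",
                    "filter": ["lowercase", "dbl_metaphone"],
                }
            }
        }
    }
}
print(requests.put("http://localhost:9200/phonetic_demo", json=body).json())
```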

Q&A


Summary


Analyzers
Stemming
Identifying Words
Normalizing Tokens
Types of Stemmers
Stopwords
Synonyms, Misspellings

Thank you

