Source: CS276: Information Retrieval and Web Search, Pandu Nayak and Prabhakar Raghavan. Term Vocabulary & Postings Lists (Tokenization)
Ch. 1
Previous lecture:
Structure of the inverted index:
Dictionary (vocabulary) & inverted lists (postings)
[Figure: indexing pipeline. Token stream (Friends, Romans, Countrymen) → Linguistic module → Modified tokens (friend, roman, countryman) → Indexer → Inverted index (postings lists for friend, roman, countryman)]
Parsing a document
First determine the document's format:
pdf/word/excel/html?
Sec. 2.1
Complications: Format/Language
Documents to be indexed may be written in several languages.
A single index may contain terms from several languages, because a single document can itself mix languages.
Example: an email in English whose attachment is a document written in German.
Sec. 2.2.1
Tokenization
Input: "Friends, Romans, Countrymen"
Output: Tokens
Friends
Romans
Countrymen
A token is a sequence of characters in a document.
Each token is a candidate for an index entry, after further preprocessing.
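As a minimal sketch of the input/output pair above, the following Python function treats a token as a maximal run of ASCII letters or digits. That definition is an assumption made for illustration; the slides do not prescribe a particular tokenizer, and the name `tokenize` is hypothetical.

```python
import re

def tokenize(text):
    """Split text into tokens, assuming a token is a maximal run of
    ASCII letters or digits (a simplification for illustration)."""
    return re.findall(r"[A-Za-z0-9]+", text)

tokens = tokenize("Friends, Romans, Countrymen")
print(tokens)  # ['Friends', 'Romans', 'Countrymen']
```

Punctuation and whitespace act only as separators here, so every token still needs the preprocessing steps discussed in the following sections before it becomes an index term.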
Sec. 2.2.1
Tokenization
Some issues in tokenization:
Finland's capital → Finland? Finlands? Finland's?
Hewlett-Packard → Hewlett and Packard as two tokens, or one?
state-of-the-art: break up the hyphenated sequence?
co-education
lowercase, lower-case, lower case?
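The hyphen question can be made concrete with two alternative tokenizers. This is only a sketch: both regular expressions and both function names are assumptions for illustration, and neither is presented as the right answer.

```python
import re

def tokenize_keep_hyphens(text):
    """Keep hyphenated sequences such as 'Hewlett-Packard' as one token."""
    return re.findall(r"[A-Za-z0-9]+(?:-[A-Za-z0-9]+)*", text)

def tokenize_split_hyphens(text):
    """Break hyphenated sequences into their parts."""
    return re.findall(r"[A-Za-z0-9]+", text)

text = "Hewlett-Packard state-of-the-art"
print(tokenize_keep_hyphens(text))   # ['Hewlett-Packard', 'state-of-the-art']
print(tokenize_split_hyphens(text))  # ['Hewlett', 'Packard', 'state', 'of', 'the', 'art']
```

Whichever policy is chosen, it has to be applied to queries as well as to documents, otherwise a query for "Hewlett-Packard" cannot match an index built from split tokens.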
Sec. 2.2.1
Numbers
3/12/91, Mar. 12, 1991, 12/3/91
No. B-52
Code: 324a3df234cb23e
Phone: (0651) 234-2333
Numbers often contain embedded spaces.
Older IR systems do not index numbers.
But numbers matter: imagine using an IR system to look up the line of an error code in a program, or to search for a particular number.
One solution is to use n-grams.
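The n-gram idea can be sketched as follows: instead of indexing a code or number as a single opaque token, index its overlapping character n-grams so that partial matches remain possible. The function name `char_ngrams` and the choice n = 3 are assumptions for illustration; the slides do not fix a value of n.

```python
def char_ngrams(token, n=3):
    """Return all overlapping character n-grams of a token.
    Tokens shorter than n are returned unchanged as a single gram."""
    if len(token) < n:
        return [token]
    return [token[i:i + n] for i in range(len(token) - n + 1)]

print(char_ngrams("B-52"))  # ['B-5', '-52']
```

A 15-character code such as 324a3df234cb23e produces 13 trigrams, so the price of this flexibility is a larger index.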
Sec. 2.2.1
[Figure: Japanese text example with the scripts labeled: Katakana, Hiragana, Kanji, Romaji]
Sec. 2.2.2
Stop words
With a stop list, words that occur very frequently (but carry little importance) can be excluded from the index:
Semantically they matter little: the, a, and, to, be
There are many of them: ~30% of all words in the corpus
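A stop list can be applied as a simple filter over the token stream. The sketch below uses only the five example words from the slide as its stop list; a real list would be longer, and the name `remove_stop_words` is hypothetical.

```python
# Stop list limited to the five example words given above.
STOP_WORDS = {"the", "a", "and", "to", "be"}

def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Drop tokens whose lowercase form appears in the stop list."""
    return [t for t in tokens if t.lower() not in stop_words]

print(remove_stop_words(["to", "be", "or", "not", "to", "be"]))  # ['or', 'not']
```

The example also shows the risk: a phrase made up largely of stop words loses most of its content once they are removed.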
Sec. 2.2.3
Result is terms: a term is a (normalized) word type, which is an entry in our IR system dictionary.
We most commonly implicitly define equivalence classes of terms by, e.g., deleting periods to form a term:
U.S.A., USA → USA
Sec. 2.2.3
Even in languages that standardly have accents, users often may not type them.
Often best to normalize to a de-accented term:
Tuebingen, Tübingen, Tubingen → Tubingen
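Both normalizations mentioned so far, deleting periods and removing accents, can be sketched with the Python standard library. The function names are hypothetical, and accent removal here relies on Unicode NFD decomposition followed by dropping combining marks, which is one possible approach rather than the method the slides prescribe.

```python
import unicodedata

def strip_periods(token):
    """Delete periods so that 'U.S.A.' and 'USA' fall into one class."""
    return token.replace(".", "")

def strip_accents(token):
    """Decompose characters (NFD), then drop combining marks (category Mn)."""
    decomposed = unicodedata.normalize("NFD", token)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")

print(strip_periods("U.S.A."))    # USA
print(strip_accents("Tübingen"))  # Tubingen
```

Note that this sketch does not map the transliterated spelling Tuebingen to Tubingen; that would need an additional language-specific rule.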
Sec. 2.2.3
Tokenization and normalization may depend on the language, and so are intertwined with language detection
Morgen will ich in MIT
Crucial: Need to normalize indexed text as well as query terms into the same form
Sec. 2.2.3
Case folding
Reduce all letters to lower case
exception: upper case in mid-sentence?
e.g., General Motors
Fed vs. fed
SAIL vs. sail
Often best to lower case everything, since users will use lowercase regardless of correct capitalization
Google example:
Query C.A.T.: the #1 result was for "cat" (well, Lolcats), not Caterpillar Inc.
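A minimal sketch of case folding, assuming the simplest policy of lowercasing every token. The function name `case_fold` is hypothetical.

```python
def case_fold(tokens):
    """Reduce every token to lower case."""
    return [t.lower() for t in tokens]

print(case_fold(["General", "Motors", "Fed", "SAIL"]))
# ['general', 'motors', 'fed', 'sail']
```

After folding, "Fed" and "fed" are the same term, which is exactly the loss of distinction the exception above warns about.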
Sec. 2.2.3
Normalization to terms
An alternative to equivalence classing is to do asymmetric expansion.
An example of where this may be useful:
Enter: window → Search: window, windows
Enter: windows → Search: Windows, windows, window
Enter: Windows → Search: Windows
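Asymmetric expansion can be sketched as a lookup table applied to the query term only, leaving the index unnormalized. The table below contains just the three entries from the example; the names `EXPANSIONS` and `expand_query` are hypothetical.

```python
# Query-time expansion table taken from the three example rows.
EXPANSIONS = {
    "window": ["window", "windows"],
    "windows": ["Windows", "windows", "window"],
    "Windows": ["Windows"],
}

def expand_query(term):
    """Return the index terms to search for a typed query term.
    Terms without an entry are searched as typed."""
    return EXPANSIONS.get(term, [term])

print(expand_query("windows"))  # ['Windows', 'windows', 'window']
print(expand_query("Windows"))  # ['Windows']
```

The mapping is asymmetric because the capitalized query stays narrow while the lowercase queries fan out to several index terms.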
Sec. 2.2.4
Lemmatization
Reduce inflectional/variant forms to base form
E.g.,
am, are, is → be
car, cars, car's, cars' → car
the boy's cars are different colors → the boy car be different color
Lemmatization implies doing proper reduction to dictionary headword form
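The example sentence can be reproduced with a tiny lookup table that stands in for a real dictionary of headwords. The table `LEMMAS` is an assumption built only from the forms shown above; a real lemmatizer needs a full lexicon and morphological analysis.

```python
# Toy headword table covering only the forms in the example above.
LEMMAS = {
    "am": "be", "are": "be", "is": "be",
    "cars": "car", "car's": "car", "cars'": "car",
    "boy's": "boy", "colors": "color",
}

def lemmatize(tokens):
    """Map each token to its headword; unknown tokens pass through."""
    return [LEMMAS.get(t, t) for t in tokens]

sentence = "the boy's cars are different colors"
print(" ".join(lemmatize(sentence.split())))
# the boy car be different color
```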
Sec. 2.2.4
Stemming
Reduce terms to their roots before indexing
Stemming suggests crude affix chopping
language dependent
e.g., automate(s), automatic, automation all reduced to automat.
For example, compressed and compression are both accepted as equivalent to compress.
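To illustrate what crude affix chopping means, here is a toy suffix stripper. It is not Porter's algorithm or any published stemmer: the suffix list, the minimum stem length, and the name `crude_stem` are all assumptions chosen so that the examples above come out as stated.

```python
# Toy suffix list, checked in this order (hypothetical, English only).
SUFFIXES = ("ion", "ic", "ed", "es", "e", "s")

def crude_stem(word, min_stem=3):
    """Chop the first matching suffix, keeping at least min_stem characters."""
    for suffix in SUFFIXES:
        if suffix == "s" and word.endswith("ss"):
            continue  # do not turn 'compress' into 'compres'
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: -len(suffix)]
    return word

for w in ["automate", "automates", "automatic", "automation"]:
    print(w, "->", crude_stem(w))  # each line ends in 'automat'
print(crude_stem("compressed"), crude_stem("compression"))  # compress compress
```

Even this small rule set needs a special case for words ending in "ss", which hints at why practical stemmers use ordered rule phases with conditions on the remaining stem.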
Sec. 2.2.4
Porter's algorithm
Commonest algorithm for stemming English
Results suggest it's at least as good as other stemming options
Sec. 2.2.4
Other stemmers
Other stemmers exist, e.g., Lovins stemmer
http://www.comp.lancs.ac.uk/computing/research/stemming/general/lovins.htm
Full morphological analysis: at most modest benefits for retrieval
Do stemming and other normalizations help?
English: very mixed results. Helps recall but harms precision
operative (dentistry) → oper
operational (research) → oper
operating (systems) → oper
Sec. 2.3
[Figure: merging two postings lists]
Brutus: 2 → 4 → 8 → 41 → 48 → 64 → 128
Caesar: 1 → 2 → 3 → 8 → 11 → 17 → 21 → 31
Intersection: 2 → 8
If the list lengths are m and n, the merge takes O(m+n) operations.
Can we do better? Yes (if the index isn't changing too fast).
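The linear-time merge can be sketched as a two-pointer walk over two postings lists sorted by document ID. The function name `intersect` is hypothetical; the example lists are the Brutus and Caesar postings from the figure.

```python
def intersect(p1, p2):
    """Intersect two postings lists sorted by docID.
    Each step advances at least one pointer, so the walk
    takes O(m + n) operations for lists of length m and n."""
    answer = []
    i = j = 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])  # docID present in both lists
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1  # advance the list with the smaller docID
        else:
            j += 1
    return answer

brutus = [2, 4, 8, 41, 48, 64, 128]
caesar = [1, 2, 3, 8, 11, 17, 21, 31]
print(intersect(brutus, caesar))  # [2, 8]
```

The sketch depends on both lists being sorted; with unsorted postings the pointer comparison would skip matches.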