Vector Space Model

and Features

Carl Staelin
Motivation
 Text is “messy” and hard to manage
 Variable length
 Contains an arbitrary string of words
 We need a mathematical representation of a document in order to produce “similarity scores”

Vector Space Model
Represent a document by a vector
 di = {wik}, where k runs over terms
 Entries in the vector represent features
 Queries are a form of document and
can be represented in the same form

Similarity
There are various similarity measures
• Inner product: di · dj = Σk dik djk
• Cosine of angle: cos θij = Σk dik djk / (√(Σk dik²) · √(Σk djk²))  (a short code sketch follows this list)

• Euclidean distance
• …
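A minimal sketch of these three measures over sparse term-weight vectors represented as Python dicts; the example weights and function names are illustrative, not from the slides.

```python
import math

def inner_product(d1, d2):
    # Sum of products over terms that appear in both vectors.
    return sum(w * d2[t] for t, w in d1.items() if t in d2)

def norm(d):
    # Euclidean (L2) length of a term-weight vector.
    return math.sqrt(sum(w * w for w in d.values()))

def cosine(d1, d2):
    # Inner product divided by the product of the vector lengths.
    denom = norm(d1) * norm(d2)
    return inner_product(d1, d2) / denom if denom else 0.0

def euclidean(d1, d2):
    # Distance over the union of terms; missing terms count as weight 0.
    terms = set(d1) | set(d2)
    return math.sqrt(sum((d1.get(t, 0.0) - d2.get(t, 0.0)) ** 2 for t in terms))

# Toy example (weights are made up):
di = {"vector": 0.8, "space": 0.5, "model": 0.3}
dj = {"vector": 0.4, "model": 0.9, "query": 0.2}
print(inner_product(di, dj), cosine(di, dj), euclidean(di, dj))
```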

Motivation
Lots of tools to analyze/manipulate vectors
• Clustering

• Machine learning

• Matrix math, such as eigenvectors

• ???

Question 1
How to convert a document into a vector?
• Documents contain variable numbers of words
• Assign each word in the language a unique index
• Create a vector with #words entries (one per word in the vocabulary)
• Vector entries:
  • Boolean: true iff the word occurs in the document
  • Real: various measures of word frequency
  • Probability: probability-based weights (see probabilistic retrieval below)
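A minimal sketch of this conversion, assuming the vocabulary is built from the corpus itself and that tokenization is just lowercased whitespace splitting (the next slide asks what a “word” really is).

```python
def build_vocabulary(documents):
    # Assign each word a unique index: the vector has one entry per word.
    vocab = {}
    for doc in documents:
        for word in doc.lower().split():
            vocab.setdefault(word, len(vocab))
    return vocab

def boolean_vector(doc, vocab):
    # Entry is True iff the word occurs in the document.
    v = [False] * len(vocab)
    for word in doc.lower().split():
        if word in vocab:
            v[vocab[word]] = True
    return v

def tf_vector(doc, vocab):
    # Entry is a raw term frequency (one possible real-valued measure).
    v = [0] * len(vocab)
    for word in doc.lower().split():
        if word in vocab:
            v[vocab[word]] += 1
    return v

docs = ["the cat sat on the mat", "the dog sat"]
vocab = build_vocabulary(docs)
print(boolean_vector(docs[1], vocab))
print(tf_vector(docs[0], vocab))
```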


Question 2
What is a word?
• A unique sequence of characters that does not contain any spaces or punctuation?
• A single meaning?
  • Goose, geese
  • Run, ran, running
  • Compute, computer, computers, computing
• A phrase?
  • President Clinton, …

Features
“Any prominent or distinctive aspect, quality, or
characteristic”
 Unique words or meanings are typically features
 Other aspects may also be features:
 Date/time
 Is this email from a known sender?
 Each feature has a defined location on the feature vector
 The standard IR name for features is “Terms”

Converting words to features
 Most words in most languages have
multiple forms
 Different plurality, tense, gender, …
 Many words have multiple meanings
 Different words have the same meaning

Real World Problems
 Misspelled words
 Variant spellings and hyphenation
 Data interpreted as text
 Uuencoded data in email
 Assembly language symbols from shared
libraries

Stemming
 Word stemming is the process of identifying the
root word
 In Hebrew this is generally “easy”
 In English this is generally difficult
 Reduces the dimensionality of the space
 May introduce confusion between word senses
 “The computer algorithm complexity…”
 “He computes the complex algorithm…”
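A deliberately naive suffix-stripping sketch, just to illustrate the idea and how stemming can conflate word senses; the suffix list is made up, and a real system would use a proper algorithm such as Porter's stemmer.

```python
# Hypothetical toy stemmer: strips a few common English suffixes.
SUFFIXES = ("ations", "ation", "ing", "ers", "er", "es", "ed", "s")

def naive_stem(word):
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

# Both sentences collapse onto the same stem "comput", losing the distinction
# between the noun sense and the verb sense.
for text in ("The computer algorithm complexity",
             "He computes the complex algorithm"):
    print([naive_stem(w) for w in text.split()])
```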

Thesaurus
“A book of synonyms and antonyms”
 Contains word equivalence classes
 May be used to further reduce vector space dimensionality
 May be used to “expand” a search query
 want -> {want, desire, wish, fancy, lust, …}
 May also introduce confusion
 “I fancy an ice cream right now.”
 “That fancy ice cream parlor is too expensive.”
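A small sketch of thesaurus-based query expansion; the synonym table here is a made-up toy, not a real thesaurus.

```python
# Hypothetical, tiny thesaurus: each entry maps a word to its equivalence class.
THESAURUS = {
    "want": {"want", "desire", "wish", "fancy"},
    "cheap": {"cheap", "inexpensive", "affordable"},
}

def expand_query(terms):
    # Replace each query term by its equivalence class (or itself if unknown).
    expanded = set()
    for term in terms:
        expanded |= THESAURUS.get(term, {term})
    return expanded

print(sorted(expand_query(["want", "ice", "cream"])))
# e.g. ['cream', 'desire', 'fancy', 'ice', 'want', 'wish']
```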

Dictionary
“A book listing a comprehensive or restricted
selection of the words of a language”
 A dictionary might have, for each term:
 Feature index (location on the feature vector)
 Document frequency
 …

Zipf’s Law
Not all words are created equal…
 rankk • fk ≈ C
 For English text, C ≈ N/10, where N is the number of terms in the corpus
 Implications:
 Most occurrences are of common (stop) words
 Most words appear infrequently
 Roughly half of all terms appear once
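A short sketch for checking Zipf's law on any text: if the law holds, rank · frequency stays roughly constant, around N/10 for English.

```python
from collections import Counter

def zipf_table(text, top=10):
    counts = Counter(text.lower().split())
    n = sum(counts.values())                      # N: total term occurrences
    ranked = counts.most_common()
    for rank, (word, freq) in enumerate(ranked[:top], start=1):
        print(f"{rank:>4} {word:<15} f={freq:<6} rank*f={rank * freq}")
    print("N/10 =", n / 10)

# Run on any reasonably large text, e.g.:
# zipf_table(open("corpus.txt").read())   # "corpus.txt" is a placeholder path
```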

Insights and Observations
 Vectors are sparse
 Some words will appear in nearly all
documents
 Many words will appear in only one document
 Words which appear in only a few documents
tend to be most “important”
 Vector space representations do not directly
capture thesaurus-like word similarities

Boolean Weights
 This is the simplest model
 wik = 1 if term k appears in document i, 0 otherwise

TFIDF Weights
TFIDF definitions:
fik: #occurrences of term tk in document Di
tfik: fik / maxl(fil) normalized term frequency
dfk: #documents which contain tk
idfk: log(d / dfk) where d is the total number of documents
wik: tfik · idfk (term weight)
Similarity: TFIDFij = wi • wj = Σk tfik tfjk idfk²
Intuition: rare words get more weight, common words get less weight
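A minimal sketch of these definitions over a toy corpus of pre-tokenized documents; the corpus and the helper names are illustrative.

```python
import math
from collections import Counter

def tfidf_weights(documents):
    # documents: list of token lists. Returns per-document {term: w_ik} maps.
    d = len(documents)
    df = Counter()                                   # df_k: # docs containing t_k
    for doc in documents:
        df.update(set(doc))
    weights = []
    for doc in documents:
        f = Counter(doc)                             # f_ik: occurrences of t_k in D_i
        max_f = max(f.values())
        w = {}
        for term, freq in f.items():
            tf = freq / max_f                        # tf_ik = f_ik / max_l f_il
            idf = math.log(d / df[term])             # idf_k = log(d / df_k)
            w[term] = tf * idf                       # w_ik = tf_ik * idf_k
        weights.append(w)
    return weights

def tfidf_similarity(wi, wj):
    # TFIDF_ij = sum_k w_ik * w_jk (= sum_k tf_ik tf_jk idf_k^2)
    return sum(w * wj[t] for t, w in wi.items() if t in wj)

docs = [["vector", "space", "model"],
        ["probabilistic", "model"],
        ["vector", "model", "query"]]
w = tfidf_weights(docs)
print(tfidf_similarity(w[0], w[2]))
```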

TFIDF Variants
There are lots of variants of TFIDF
fik: #occurrences of term tk in document Di
Standard Variant
wik: (0.5 + 0.5tfik) idfk
Mehran Sahami’s Variant
wik: (fik / Σl fil)½

Singhal’s TFIDF Variant
This method does not treat the query and
document symmetrically.
wik: query term weights from TFIDF
dj: # of unique terms in document Dj
atfj: the average term frequency in Dj
djk = (1 + log(tfjk)) / (1 + log(atfj))
p: pivot point, at which the probability of retrieval equals the probability of relevance; p is a document length
s: slope of the correction curve
Similarity(Qi, Dj) = Σk wik djk / ((1 − s) p + s dj)
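A sketch of this scoring under the assumptions that tfjk are raw within-document counts, that dj is the number of unique terms in Dj (per the definition above), and that the query weights wik come from TFIDF; p and s are tuning parameters.

```python
import math
from collections import Counter

def pivoted_score(query_weights, doc_tokens, pivot, slope):
    # query_weights: {term: w_ik} for query Q_i (e.g. TFIDF weights)
    # doc_tokens: token list for document D_j
    tf = Counter(doc_tokens)
    unique_terms = len(tf)                               # d_j: # of unique terms
    atf = sum(tf.values()) / unique_terms                # average term frequency in D_j
    score = 0.0
    for term, w_ik in query_weights.items():
        if term in tf:
            d_jk = (1 + math.log(tf[term])) / (1 + math.log(atf))
            score += w_ik * d_jk
    # Pivoted length normalization: (1 - s) * p + s * d_j
    return score / ((1 - slope) * pivot + slope * unique_terms)

doc = ["vector", "space", "model", "model", "retrieval"]
print(pivoted_score({"model": 1.2, "query": 0.7}, doc, pivot=10.0, slope=0.2))
```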

Probabilistic Retrieval
Documents are ranked by an estimate of the probability of their relevance to the query
• Advantage: probabilities are familiar and there are a number of useful tools
• Disadvantage: sometimes costly to compute

Entries in document vectors are probabilities

Independence Assumptions
Many probabilistic retrieval schemes make
these assumptions:
• The distribution of terms in relevant documents is independent, and their distribution in all documents is independent
• The distribution of terms in relevant documents is independent, and their distribution in non-relevant documents is independent

Ordering Principles
Some sample principles which may underlie
probabilistic weighting schemes:
• Probable relevance is based only on the presence of search terms in the documents
• Probable relevance is based on both the presence of search terms in documents and their absence from documents

Binary Independence Retrieval
• Attempt to identify the relevant documents R
• Irrelevant documents: R̄
• Probability dj is relevant: P(R|dj)
• Probability dj is irrelevant: P(R̄|dj)
• Sim(dj, q) = P(dj|R) / P(dj|R̄)
•            = ∑i wij wiq (log(P(ki|R)/(1 − P(ki|R))) + log((1 − P(ki|R̄))/P(ki|R̄)))
Binary Independence Retrieval
• How to compute P(ki|R) and P(ki|R̄)?
• ni is the number of documents containing ki
• Initial estimates: P0(ki|R) = 0.5, P0(ki|R̄) = ni/N
• V = # of documents ranked above threshold r
• Vi = # of those documents containing ki
• P1(ki|R) = (Vi + ni/N) / (V + 1)
• P1(ki|R̄) = (ni − Vi + ni/N) / (N − V + 1)
• Can now iterate!
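A compact sketch of these estimates and one re-estimation step over a toy corpus, treating the V top-ranked documents as relevant (the usual heuristic); smoothing beyond the ni/N correction on the slide is omitted.

```python
import math

def bir_rank(docs, query_terms, top_v=None):
    # docs: list of term sets; query_terms: set of terms.
    N = len(docs)
    n = {k: sum(1 for d in docs if k in d) for k in query_terms}   # n_i per term

    def score(doc, p_rel, p_irr):
        s = 0.0
        for k in query_terms:
            if k in doc:
                s += math.log(p_rel[k] / (1 - p_rel[k]))
                s += math.log((1 - p_irr[k]) / p_irr[k])
        return s

    # Initial estimates: P0(k|R) = 0.5, P0(k|R-bar) = n_k / N
    p_rel = {k: 0.5 for k in query_terms}
    p_irr = {k: n[k] / N for k in query_terms}
    ranked = sorted(range(N), key=lambda j: score(docs[j], p_rel, p_irr), reverse=True)

    # Re-estimate from the V top-ranked documents (treated as relevant), then re-rank.
    V = top_v or max(1, N // 4)
    top = [docs[j] for j in ranked[:V]]
    for k in query_terms:
        Vk = sum(1 for d in top if k in d)
        p_rel[k] = (Vk + n[k] / N) / (V + 1)
        p_irr[k] = (n[k] - Vk + n[k] / N) / (N - V + 1)
    return sorted(range(N), key=lambda j: score(docs[j], p_rel, p_irr), reverse=True)

docs = [{"vector", "model"}, {"probabilistic", "model"}, {"query", "model"}, {"cooking"}]
print(bir_rank(docs, {"model", "query"}))
```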

Binary Independence Model
• In the initial probabilistic model, the
weights were binary, similar to the boolean
model.
• However, it is possible to enhance the
probabilistic model by using various
weighting schemes
• ntfik = fik / max(fi1, fi2, .., fin)

Croft & Harper's Model
• Similar probabilistic model
• wik = 1 iff term k appears in document i, 0 otherwise
• N = # of documents in corpus
• nk = # of documents containing term k
• Sim = C ∑k wik wjk + ∑k wik wjk log((N-nk)/nk)
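A short sketch of this score with binary weights, where the query and document are represented as term sets; C is left as a tuning constant, as on the slide.

```python
import math

def croft_harper_score(query_terms, doc_terms, docs, C=1.0):
    # Binary weights: a term contributes iff it appears in both query and document.
    # (Terms appearing in every document would need smoothing; omitted in this sketch.)
    N = len(docs)
    score = 0.0
    for k in query_terms & doc_terms:
        n_k = sum(1 for d in docs if k in d)          # # of documents containing term k
        score += C + math.log((N - n_k) / n_k)
    return score

docs = [{"vector", "model"}, {"probabilistic", "model"}, {"query", "retrieval"}]
print(croft_harper_score({"model", "query"}, {"vector", "model"}, docs))
```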

Non-Binary Independence Model
• Basic idea: try to compute probability that a term
appears in relevant/irrelevant documents with
frequency f
• Requires a lot more work than simply computing
the probability that a term appears in
relevant/irrelevant documents
• In real systems, usually use normalized frequency
intervals, e.g. (0..0.01], (0.01..0.2], …
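A tiny sketch of mapping a normalized term frequency onto such intervals; the boundaries beyond those shown on the slide are made up for illustration.

```python
import bisect

# Upper bounds of the normalized-frequency intervals (first two from the slide,
# the rest illustrative).
BIN_UPPER_BOUNDS = [0.01, 0.2, 0.5, 1.0]

def frequency_bin(normalized_tf):
    # Returns the index of the interval (0..0.01], (0.01..0.2], ... containing the value.
    return bisect.bisect_left(BIN_UPPER_BOUNDS, normalized_tf)

print(frequency_bin(0.005), frequency_bin(0.15), frequency_bin(0.75))
# -> 0 1 3
```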

Recommended Texts
• Modern Information Retrieval, by Ricardo Baeza-Yates and Berthier Ribeiro-Neto, Addison-Wesley, 1999.
• Information Retrieval Algorithms and Heuristics, by David A. Grossman and Ophir Frieder, Kluwer Academic Publishers, 1998.
• Python Essential Reference, by David M. Beazley, New Riders, 2000.
