Vector Space Model

and Features

Carl Staelin
Motivation
 Text is “messy” and hard to manage
 Variable length
 Contains an arbitrary string of words
 We need a mathematical representation of a document in order to produce “similarity scores”

Vector Space Model
Represent a document by a vector
 di = {wik}, where k runs over terms
 Entries in the vector represent features
 Queries are a form of document and
can be represented in the same form

Similarity
There are various similarity measures
• Inner product: di · dj = Σk dik djk
• Cosine of angle: cos θij = Σk dik djk / (√(Σk dik²) · √(Σk djk²))  (a short code sketch follows this list)

• Euclidean distance
• …
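A minimal sketch of these three measures over sparse term-weight vectors represented as Python dicts; the example weights and function names are illustrative, not from the slides.

```python
import math

def inner_product(d1, d2):
    # Sum of products over terms that appear in both vectors.
    return sum(w * d2[t] for t, w in d1.items() if t in d2)

def norm(d):
    # Euclidean (L2) length of a term-weight vector.
    return math.sqrt(sum(w * w for w in d.values()))

def cosine(d1, d2):
    # Inner product divided by the product of the vector lengths.
    denom = norm(d1) * norm(d2)
    return inner_product(d1, d2) / denom if denom else 0.0

def euclidean(d1, d2):
    # Distance over the union of terms; missing terms count as weight 0.
    terms = set(d1) | set(d2)
    return math.sqrt(sum((d1.get(t, 0.0) - d2.get(t, 0.0)) ** 2 for t in terms))

# Toy example (weights are made up):
di = {"vector": 0.8, "space": 0.5, "model": 0.3}
dj = {"vector": 0.4, "model": 0.9, "query": 0.2}
print(inner_product(di, dj), cosine(di, dj), euclidean(di, dj))
```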

Motivation
Lots of tools to analyze/manipulate vectors
• Clustering

• Machine learning

• Matrix math, such as eigenvectors

• ???

Question 1
How to convert a document into a vector?
• Documents contain variable numbers of words
• Assign each word in the language a unique index
• Create a vector with #words entries (one per word in the vocabulary)
• Vector entries:
  • Boolean: true iff the word occurs in the document
  • Real: various measures of word frequency
  • Probability: probability-based weights (see probabilistic retrieval below)
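A minimal sketch of this conversion, assuming the vocabulary is built from the corpus itself and that tokenization is just lowercased whitespace splitting (the next slide asks what a “word” really is).

```python
def build_vocabulary(documents):
    # Assign each word a unique index: the vector has one entry per word.
    vocab = {}
    for doc in documents:
        for word in doc.lower().split():
            vocab.setdefault(word, len(vocab))
    return vocab

def boolean_vector(doc, vocab):
    # Entry is True iff the word occurs in the document.
    v = [False] * len(vocab)
    for word in doc.lower().split():
        if word in vocab:
            v[vocab[word]] = True
    return v

def tf_vector(doc, vocab):
    # Entry is a raw term frequency (one possible real-valued measure).
    v = [0] * len(vocab)
    for word in doc.lower().split():
        if word in vocab:
            v[vocab[word]] += 1
    return v

docs = ["the cat sat on the mat", "the dog sat"]
vocab = build_vocabulary(docs)
print(boolean_vector(docs[1], vocab))
print(tf_vector(docs[0], vocab))
```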


Question 2
What is a word?
• A unique sequence of characters that does not contain any spaces or punctuation?
• A single meaning?
  • Goose, geese
  • Run, ran, running
  • Compute, computer, computers, computing
• A phrase?
  • President Clinton, …

Features
“Any prominent or distinctive aspect, quality, or
characteristic”
 Unique words or meanings are typically features
 Other aspects may also be features:
 Date/time
 Is this email from a known sender?
 Each feature has a defined location on the feature vector
 The standard IR name for features is “Terms”

Converting words to features
 Most words in most languages have
multiple forms
 Different plurality, tense, gender, …
 Many words have multiple meanings
 Different words have the same meaning

Real World Problems
 Misspelled words
 Variant spellings and hyphenation
 Data interpreted as text
 Uuencoded data in email
 Assembly language symbols from shared
libraries

Stemming
 Word stemming is the process of identifying the
root word
 In Hebrew this is generally “easy”
 In English this is generally difficult
 Reduces the dimensionality of the space
 May introduce confusion between word senses
 “The computer algorithm complexity…”
 “He computes the complex algorithm…”
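A deliberately naive suffix-stripping sketch, just to illustrate the idea and how stemming can conflate word senses; the suffix list is made up, and a real system would use a proper algorithm such as Porter's stemmer.

```python
# Hypothetical toy stemmer: strips a few common English suffixes.
SUFFIXES = ("ations", "ation", "ing", "ers", "er", "es", "ed", "s")

def naive_stem(word):
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

# Both sentences collapse onto the same stem "comput", losing the distinction
# between the noun sense and the verb sense.
for text in ("The computer algorithm complexity",
             "He computes the complex algorithm"):
    print([naive_stem(w) for w in text.split()])
```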

Thesaurus
“A book of synonyms and antonyms”
 Contains word equivalence classes
 May be used to further reduce vector space dimensionality
 May be used to “expand” a search query
 want -> {want, desire, wish, fancy, lust, …}
 May also introduce confusion
 “I fancy an ice cream right now.”
 “That fancy ice cream parlor is too expensive.”
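A small sketch of thesaurus-based query expansion; the synonym table here is a made-up toy, not a real thesaurus.

```python
# Hypothetical, tiny thesaurus: each entry maps a word to its equivalence class.
THESAURUS = {
    "want": {"want", "desire", "wish", "fancy"},
    "cheap": {"cheap", "inexpensive", "affordable"},
}

def expand_query(terms):
    # Replace each query term by its equivalence class (or itself if unknown).
    expanded = set()
    for term in terms:
        expanded |= THESAURUS.get(term, {term})
    return expanded

print(sorted(expand_query(["want", "ice", "cream"])))
# e.g. ['cream', 'desire', 'fancy', 'ice', 'want', 'wish']
```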

Dictionary
“A book listing a comprehensive or restricted
selection of the words of a language”
 A dictionary might have, for each term:
 Feature index (location on the feature vector)
 Document frequency
 …

Zipf’s Law
Not all words are created equal…
 rankk • fk ≈ C
 For English text, C ≈ N/10, where N is the number of terms in the corpus
 Implications:
 Most occurrences are of common (stop) words
 Most words appear infrequently
 Roughly half of all terms appear once
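A short sketch for checking Zipf's law on any text: if the law holds, rank · frequency stays roughly constant, around N/10 for English.

```python
from collections import Counter

def zipf_table(text, top=10):
    counts = Counter(text.lower().split())
    n = sum(counts.values())                      # N: total term occurrences
    ranked = counts.most_common()
    for rank, (word, freq) in enumerate(ranked[:top], start=1):
        print(f"{rank:>4} {word:<15} f={freq:<6} rank*f={rank * freq}")
    print("N/10 =", n / 10)

# Run on any reasonably large text, e.g.:
# zipf_table(open("corpus.txt").read())   # "corpus.txt" is a placeholder path
```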

Insights and Observations
 Vectors are sparse
 Some words will appear in nearly all
documents
 Many words will appear in only one document
 Words which appear in only a few documents
tend to be most “important”
 Vector space representations do not directly
capture thesaurus-like word similarities

Boolean Weights
 This is the simplest model
 wik = 1 if term k appears in document i, 0 otherwise

TFIDF Weights
TFIDF definitions:
fik: #occurrences of term tk in document Di
tfik: fik / maxl(fil) normalized term frequency
dfk: #documents which contain tk
idfk: log(d / dfk) where d is the total number of documents
wik: tfik · idfk (term weight)
Similarity: TFIDFij = wi • wj = Σk tfik tfjk idfk²
Intuition: rare words get more weight, common words get less weight
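A minimal sketch of these definitions over a toy corpus of pre-tokenized documents; the corpus and the helper names are illustrative.

```python
import math
from collections import Counter

def tfidf_weights(documents):
    # documents: list of token lists. Returns per-document {term: w_ik} maps.
    d = len(documents)
    df = Counter()                                   # df_k: # docs containing t_k
    for doc in documents:
        df.update(set(doc))
    weights = []
    for doc in documents:
        f = Counter(doc)                             # f_ik: occurrences of t_k in D_i
        max_f = max(f.values())
        w = {}
        for term, freq in f.items():
            tf = freq / max_f                        # tf_ik = f_ik / max_l f_il
            idf = math.log(d / df[term])             # idf_k = log(d / df_k)
            w[term] = tf * idf                       # w_ik = tf_ik * idf_k
        weights.append(w)
    return weights

def tfidf_similarity(wi, wj):
    # TFIDF_ij = sum_k w_ik * w_jk (= sum_k tf_ik tf_jk idf_k^2)
    return sum(w * wj[t] for t, w in wi.items() if t in wj)

docs = [["vector", "space", "model"],
        ["probabilistic", "model"],
        ["vector", "model", "query"]]
w = tfidf_weights(docs)
print(tfidf_similarity(w[0], w[2]))
```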

TFIDF Variants
There are lots of variants of TFIDF
fik: #occurrences of term tk in document Di
Standard Variant
wik: (0.5 + 0.5tfik) idfk
Mehran Sahami’s Variant
wik: (fik / Σl fil)½

Singhal’s TFIDF Variant
This method does not treat the query and
document symmetrically.
wik: query term weights from TFIDF
dj: # of unique terms in document Dj
atfj: the average term frequency in Dj
djk = (1 + log(tfjk)) / (1 + log(atfj))
p: pivot point, at which the probability of retrieval equals the probability of relevance; p is a document length
s: slope of the correction curve
Similarity(Qi, Dj) = Σk wik djk / ((1 − s) p + s dj)
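A sketch of this scoring under the assumptions that tfjk are raw within-document counts, that dj is the number of unique terms in Dj (per the definition above), and that the query weights wik come from TFIDF; p and s are tuning parameters.

```python
import math
from collections import Counter

def pivoted_score(query_weights, doc_tokens, pivot, slope):
    # query_weights: {term: w_ik} for query Q_i (e.g. TFIDF weights)
    # doc_tokens: token list for document D_j
    tf = Counter(doc_tokens)
    unique_terms = len(tf)                               # d_j: # of unique terms
    atf = sum(tf.values()) / unique_terms                # average term frequency in D_j
    score = 0.0
    for term, w_ik in query_weights.items():
        if term in tf:
            d_jk = (1 + math.log(tf[term])) / (1 + math.log(atf))
            score += w_ik * d_jk
    # Pivoted length normalization: (1 - s) * p + s * d_j
    return score / ((1 - slope) * pivot + slope * unique_terms)

doc = ["vector", "space", "model", "model", "retrieval"]
print(pivoted_score({"model": 1.2, "query": 0.7}, doc, pivot=10.0, slope=0.2))
```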

Probabilistic Retrieval
Documents are ranked by an estimate of the probability of their relevance to the query
• Advantage: probabilities are familiar and there are a number of useful tools
• Disadvantage: sometimes costly to compute

Entries in document vectors are probabilities

Independence Assumptions
Many probabilistic retrieval schemes make
these assumptions:
• The distribution of terms in relevant documents is independent, and their distribution in all documents is independent
• The distribution of terms in relevant documents is independent, and their distribution in non-relevant documents is independent

Ordering Principles
Some sample principles which may underlie
probabilistic weighting schemes:
• Probable relevance is based only on the presence of search terms in the documents
• Probable relevance is based on both the presence of search terms in documents and their absence from documents

Binary Independence Retrieval
• Attempt to identify the relevant documents R
• Irrelevant documents: R̄
• Probability dj is relevant: P(R|dj)
• Probability dj is irrelevant: P(R̄|dj)
• Sim(dj, q) = P(dj|R) / P(dj|R̄)
•            = ∑i wij wiq (log(P(ki|R)/(1 − P(ki|R))) + log((1 − P(ki|R̄))/P(ki|R̄)))
Binary Independence Retrieval
• How to compute P(ki|R) and P(ki|R̄)?
• ni is the number of documents containing ki
• Initial estimates: P0(ki|R) = 0.5, P0(ki|R̄) = ni/N
• V = # of documents ranked above threshold r
• Vi = # of those documents containing ki
• P1(ki|R) = (Vi + ni/N) / (V + 1)
• P1(ki|R̄) = (ni − Vi + ni/N) / (N − V + 1)
• Can now iterate!
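A compact sketch of these estimates and one re-estimation step over a toy corpus, treating the V top-ranked documents as relevant (the usual heuristic); smoothing beyond the ni/N correction on the slide is omitted.

```python
import math

def bir_rank(docs, query_terms, top_v=None):
    # docs: list of term sets; query_terms: set of terms.
    N = len(docs)
    n = {k: sum(1 for d in docs if k in d) for k in query_terms}   # n_i per term

    def score(doc, p_rel, p_irr):
        s = 0.0
        for k in query_terms:
            if k in doc:
                s += math.log(p_rel[k] / (1 - p_rel[k]))
                s += math.log((1 - p_irr[k]) / p_irr[k])
        return s

    # Initial estimates: P0(k|R) = 0.5, P0(k|R-bar) = n_k / N
    p_rel = {k: 0.5 for k in query_terms}
    p_irr = {k: n[k] / N for k in query_terms}
    ranked = sorted(range(N), key=lambda j: score(docs[j], p_rel, p_irr), reverse=True)

    # Re-estimate from the V top-ranked documents (treated as relevant), then re-rank.
    V = top_v or max(1, N // 4)
    top = [docs[j] for j in ranked[:V]]
    for k in query_terms:
        Vk = sum(1 for d in top if k in d)
        p_rel[k] = (Vk + n[k] / N) / (V + 1)
        p_irr[k] = (n[k] - Vk + n[k] / N) / (N - V + 1)
    return sorted(range(N), key=lambda j: score(docs[j], p_rel, p_irr), reverse=True)

docs = [{"vector", "model"}, {"probabilistic", "model"}, {"query", "model"}, {"cooking"}]
print(bir_rank(docs, {"model", "query"}))
```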

Binary Independence Model
• In the initial probabilistic model, the
weights were binary, similar to the boolean
model.
• However, it is possible to enhance the
probabilistic model by using various
weighting schemes
• ntfik = fik / max(fi1, fi2, .., fin)

Croft & Harper's Model
• Similar probabilistic model
• wik = 1 iff term k appears in document i, 0 otherwise
• N = # of documents in corpus
• nk = # of documents containing term k
• Sim = C ∑k wik wjk + ∑k wik wjk log((N-nk)/nk)
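A short sketch of this score with binary weights, where the query and document are represented as term sets; C is left as a tuning constant, as on the slide.

```python
import math

def croft_harper_score(query_terms, doc_terms, docs, C=1.0):
    # Binary weights: a term contributes iff it appears in both query and document.
    # (Terms appearing in every document would need smoothing; omitted in this sketch.)
    N = len(docs)
    score = 0.0
    for k in query_terms & doc_terms:
        n_k = sum(1 for d in docs if k in d)          # # of documents containing term k
        score += C + math.log((N - n_k) / n_k)
    return score

docs = [{"vector", "model"}, {"probabilistic", "model"}, {"query", "retrieval"}]
print(croft_harper_score({"model", "query"}, {"vector", "model"}, docs))
```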

Non-Binary Independence Model
• Basic idea: try to compute probability that a term
appears in relevant/irrelevant documents with
frequency f
• Requires a lot more work than simply computing
the probability that a term appears in
relevant/irrelevant documents
• In real systems, usually use normalized frequency
intervals, e.g. (0..0.01], (0.01..0.2], …
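A tiny sketch of mapping a normalized term frequency onto such intervals; the boundaries beyond those shown on the slide are made up for illustration.

```python
import bisect

# Upper bounds of the normalized-frequency intervals (first two from the slide,
# the rest illustrative).
BIN_UPPER_BOUNDS = [0.01, 0.2, 0.5, 1.0]

def frequency_bin(normalized_tf):
    # Returns the index of the interval (0..0.01], (0.01..0.2], ... containing the value.
    return bisect.bisect_left(BIN_UPPER_BOUNDS, normalized_tf)

print(frequency_bin(0.005), frequency_bin(0.15), frequency_bin(0.75))
# -> 0 1 3
```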

Recommended Texts
• Modern Information Retrieval, by Ricardo Baeza-Yates and Berthier Ribeiro-Neto, Addison-Wesley, 1999.
• Information Retrieval Algorithms and Heuristics, by David A. Grossman and Ophir Frieder, Kluwer Academic Publishers, 1998.
• Python Essential Reference, by David M. Beazley, New Riders, 2000.
