and Features
Carl Staelin
Motivation
Text is “messy” and is hard to manage.
Variable length
Contains a string of words
We need some mathematical representation of a document in order to produce “similarity scores”
• Cosine of Angle: $\cos\theta_{ij} = \dfrac{\sum_k d_{ik}\, d_{jk}}{\sqrt{\sum_k d_{ik}^2}\,\sqrt{\sum_k d_{jk}^2}}$
• Euclidean distance
• …
• Machine learning
• ???
• Vector entries
Date/time
Is this email from a known sender?
A defined location on the feature vector
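As a rough illustration of non-text features occupying fixed positions, the sketch below encodes the two email features mentioned above. The index layout, the function name `email_features`, and the scaling of the hour to the range 0 to 1 are assumptions made for this example, not something the slides specify.

```python
from datetime import datetime

# Hypothetical layout: each feature owns one fixed index in the vector.
FEATURE_INDEX = {"hour_of_day": 0, "known_sender": 1}


def email_features(sent_at, sender, known_senders):
    """Encode an email's metadata as a fixed-length feature vector."""
    vec = [0.0] * len(FEATURE_INDEX)
    # Date/time feature: hour of day scaled to [0, 1] (an assumed encoding).
    vec[FEATURE_INDEX["hour_of_day"]] = sent_at.hour / 23.0
    # Binary feature: is this email from a known sender?
    vec[FEATURE_INDEX["known_sender"]] = 1.0 if sender in known_senders else 0.0
    return vec
```

Because every feature always lands at the same index, vectors built from different emails can be compared entry by entry.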
The standard IR name for features is “Terms”
dimensionality
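A minimal sketch of how terms could become vector entries and feed the cosine and Euclidean measures listed earlier. It assumes whitespace tokenization, lowercasing, and raw term counts as the entries; the helper names (`build_vocabulary`, `to_vector`, `cosine`, `euclidean`) are invented for this illustration.

```python
import math
from collections import Counter


def build_vocabulary(docs):
    """Assign each distinct term a fixed index; the vocabulary size is the dimensionality."""
    terms = sorted({t for d in docs for t in d.lower().split()})
    return {t: i for i, t in enumerate(terms)}


def to_vector(doc, vocab):
    """Map a variable-length document to a fixed-length vector of term counts."""
    vec = [0] * len(vocab)
    for term, n in Counter(doc.lower().split()).items():
        if term in vocab:
            vec[vocab[term]] = n
    return vec


def cosine(a, b):
    """Cosine of the angle between vectors a and b (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def euclidean(a, b):
    """Euclidean distance between vectors a and b."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
```

For the toy documents "the cat sat" and "the cat ran", two of the three terms in each document are shared, so the cosine is 2/3, while a document sharing no terms with either scores 0.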
May be used to “expand” search query
want -> {want, desire, wish, fancy, lust, …}
May also introduce confusion
“I fancy an ice cream right now.”
“That fancy ice cream parlor is too expensive.”
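The expansion step and the confusion it can cause might be sketched as follows. The synonym table holds only the terms named above (the elided entries are left out), and `expand_query` and `matches` are hypothetical helpers that assume a simple bag-of-words match with basic punctuation stripping.

```python
# Toy synonym table using only the terms listed above; a real system
# would draw on a thesaurus (assumption for illustration).
SYNONYMS = {"want": {"desire", "wish", "fancy", "lust"}}


def expand_query(terms, synonyms=SYNONYMS):
    """Return the query terms together with any known synonyms."""
    expanded = set()
    for term in terms:
        expanded.add(term)
        expanded |= synonyms.get(term, set())
    return expanded


def matches(doc, query_terms):
    """True if the document shares at least one word with the query terms."""
    words = {w.strip(".,!?").lower() for w in doc.split()}
    return bool(words & query_terms)


query = expand_query(["want"])
verb_use = "I fancy an ice cream right now."
adjective_use = "That fancy ice cream parlor is too expensive."
# Both sentences match the expanded query, although only the first
# uses "fancy" in the sense of "want".
both_match = matches(verb_use, query) and matches(adjective_use, query)
```

Neither sentence matches the unexpanded query, so expansion helps find the first sentence and also pulls in the second, where "fancy" is an adjective.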