Adapted from Lectures by Berthier Ribeiro-Neto (Brazil), Prabhakar Raghavan (Yahoo and Stanford) and Christopher Manning (Stanford)
Prasad
L1IntroIR
Structured data typically allows numerical range and exact-match (for text) queries, e.g., Salary < 60000 AND Manager = Smith.
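A minimal sketch of the example query above, assuming a small in-memory table. The records, the field names `salary` and `manager`, and the helper `structured_query` are hypothetical and only illustrate a numerical range test combined with an exact text match.

```python
# Hypothetical employee records; field names mirror the slide's example query.
employees = [
    {"name": "Ana", "salary": 52000, "manager": "Smith"},
    {"name": "Ben", "salary": 75000, "manager": "Smith"},
    {"name": "Caro", "salary": 48000, "manager": "Jones"},
]


def structured_query(rows, max_salary, manager):
    """Return rows where salary < max_salary AND manager matches exactly."""
    return [
        row for row in rows
        if row["salary"] < max_salary and row["manager"] == manager
    ]


# Salary < 60000 AND Manager = Smith
matches = structured_query(employees, 60000, "Smith")
matched_names = [row["name"] for row in matches]
print(matched_names)
```

Every record either satisfies the predicate or does not, so there is no notion of ranking here.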
Unstructured data
Typically refers to free text
- Data which does not have clear, semantically overt, easy-for-a-computer structure
Allows
- Keyword-based queries including operators
- More sophisticated concept queries, e.g., find all web pages dealing with drug abuse
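Keyword-based queries with operators can be sketched as a direct scan over free text. This is an illustration under stated assumptions, not how a production engine evaluates queries: the three documents and the helper names `terms` and `matches` are hypothetical.

```python
import re

# Hypothetical free-text documents.
docs = {
    1: "Drug abuse among teenagers is rising",
    2: "New drug approved for heart disease",
    3: "Substance abuse treatment programs",
}


def terms(text):
    """Lowercase a text and return its set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def matches(text, all_of=(), any_of=(), none_of=()):
    """AND over all_of, OR over any_of (when given), NOT over none_of."""
    found = terms(text)
    return (
        all(word in found for word in all_of)
        and (not any_of or any(word in found for word in any_of))
        and not any(word in found for word in none_of)
    )


# drug AND abuse
and_hits = [d for d, text in docs.items() if matches(text, all_of=["drug", "abuse"])]
# drug OR abuse
or_hits = [d for d, text in docs.items() if matches(text, any_of=["drug", "abuse"])]
# drug AND NOT abuse
not_hits = [d for d, text in docs.items() if matches(text, all_of=["drug"], none_of=["abuse"])]
print(and_hits, or_hits, not_hits)
```

Document 3 arguably deals with drug abuse but never contains the word "drug", so the AND query misses it. That gap is what the concept queries in the last bullet are meant to address.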
Semi-structured data
In fact, almost no data is "unstructured"
- E.g., this slide has distinctly identified zones such as the Title and Bullets
What is IR?
Representation
Keywords/Phrases, Structure/Fonts, Counts, etc.
Ultimate Focus of IR
Satisfying user information need
- Emphasis is on retrieval of information (not data)
Predicting which documents are relevant, and then linearly ranking them.
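One common way to predict relevance and produce a linear ranking is tf-idf scoring. The sketch below is a swapped-in illustration, not a method stated on the slide; the documents, the weighting `tf * log(N / df)`, and the names `score` and `rank` are assumptions.

```python
import math
import re
from collections import Counter

# Hypothetical document collection.
docs = {
    "d1": "printer ink cartridge for HP multifunction printer",
    "d2": "HP laptop battery replacement",
    "d3": "history of the printing press",
}


def tokenize(text):
    """Lowercase a text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


# Term frequencies per document and document frequencies per term.
tf = {doc_id: Counter(tokenize(text)) for doc_id, text in docs.items()}
n_docs = len(docs)
df = Counter(term for counts in tf.values() for term in counts)


def score(query, doc_id):
    """Sum tf * idf over the query terms, with idf = log(N / df)."""
    total = 0.0
    for term in tokenize(query):
        if df[term]:
            total += tf[doc_id][term] * math.log(n_docs / df[term])
    return total


def rank(query):
    """Return ids of documents with a positive score, best first."""
    scored = [(score(query, doc_id), doc_id) for doc_id in docs]
    ordered = sorted(scored, key=lambda pair: (-pair[0], pair[1]))
    return [doc_id for value, doc_id in ordered if value > 0]


ranking = rank("HP printer")
print(ranking)
```

The output is a single ordered list: documents are compared only through their scores, which is what makes the ranking linear.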
DIKW Hierarchy
Data: Symbolic units
- E.g., Records of customers.
- E.g., Bytes from sensors.
DIKW Hierarchy
Knowledge: Information organized with theoretical concepts or abstract ideas (How?)
- E.g., How many customers have cancelled their accounts in the current fiscal year?
- E.g., Analysis of temperature variation over the years and its causes.
[Figure: DIKW diagram. Levels: data, information, knowledge, wisdom. Other labels: context, understanding, experience, novelty, past, future, gathering of parts.]
You see things; and you say "Why?" But I dream things that never were; and I say "Why not?" George Bernard Shaw
QUERY :
User Task
[Figure: a user reaching a database through retrieval or browsing.]
- Retrieval: purposeful, e.g., HP multifunction printer information
- Browsing: casual and hyperlink-based, e.g., Big Bang, CBR, element genesis, supernova, ...
- Filtering by agents: push, e.g., podcasts from the B.B.C.'s Naked Science
[Figure: text operations applied to a document: accents and spacing, stopwords, noun groups, stemming, manual indexing.]
Abstraction (essentials)
- Structure, fonts, proximity, repetitions, etc.
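Some of the text operations named above (accent and spacing normalization, stopword removal, stemming) can be sketched as a small pipeline that reduces a document to index terms. Everything here is an assumption for illustration: the stopword list is tiny, `crude_stem` is a naive suffix stripper rather than a real stemming algorithm, and noun-group detection and manual indexing are left out.

```python
import re
import unicodedata

# Tiny illustrative stopword list; real lists are much longer.
STOPWORDS = {"the", "a", "an", "of", "and", "in", "to", "is"}


def strip_accents(text):
    """Decompose accented characters and drop the combining marks."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def crude_stem(word):
    """Naive suffix stripping for illustration only."""
    for suffix in ("ing", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def logical_view(text):
    """Reduce raw text to a list of normalized index terms."""
    text = strip_accents(text).lower()
    tokens = re.findall(r"[a-z0-9]+", text)  # also collapses irregular spacing
    tokens = [t for t in tokens if t not in STOPWORDS]
    return [crude_stem(t) for t in tokens]


index_terms = logical_view("Searching the  cafés and documents of São Paulo")
print(index_terms)
```

Each step discards detail from the full text, which is the abstraction the slide refers to: the index terms keep the essentials and lose word forms, accents, and function words.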
[Figure: retrieval process diagram. Labels: user need, text operations, logical view, query operations, user feedback, query, index (inverted file), searching, ranking, text database.]
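The inverted file in the diagram maps each term to the documents that contain it, so searching does not have to scan the text database. The following is a minimal sketch under assumptions: the three documents and the function names `build_inverted_file` and `intersect` are hypothetical, and postings hold only document ids.

```python
import re
from collections import defaultdict

# Hypothetical text database.
docs = {
    1: "information retrieval ranks documents",
    2: "database retrieval returns exact matches",
    3: "ranking documents by relevance",
}


def tokenize(text):
    """Lowercase a text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_inverted_file(collection):
    """Map each term to a sorted list of ids of documents containing it."""
    postings = defaultdict(set)
    for doc_id, text in collection.items():
        for term in tokenize(text):
            postings[term].add(doc_id)
    return {term: sorted(ids) for term, ids in postings.items()}


def intersect(p1, p2):
    """Merge two sorted postings lists, keeping ids present in both."""
    i = j = 0
    out = []
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            out.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return out


inverted = build_inverted_file(docs)
hits = intersect(inverted.get("retrieval", []), inverted.get("documents", []))
print(inverted["retrieval"], inverted["documents"], hits)
```

Keeping postings sorted lets the merge run in time proportional to the combined length of the two lists. The ranking stage would then order the hits.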
IR Basics
- Models and retrieval evaluation
- Query languages and operations
  Improve inferring query context (query expansion, relevance feedback)
- Text operations
  Improve gleaning of document semantics (stemming keywords)
How do search engines work? And how can we make them better?
Issues:
- How do you process "about"?
- How do you rank results?