cs707 010312

Information Retrieval
Adapted from Lectures by Berthier Ribeiro-Neto (Brazil), Prabhakar Raghavan (Yahoo and Stanford) and Christopher Manning (Stanford)
Prasad
L1IntroIR
Unstructured (text) vs. structured (database) data in 1996

[Figure: bar chart of Data volume and Market Cap for Unstructured vs. Structured data; vertical axis 0 to 160]
Unstructured (text) vs. structured (database) data in 2006

[Figure: bar chart of Data volume and Market Cap for Unstructured vs. Structured data; vertical axis 0 to 160]
Structured vs unstructured data

Structured data : information in tables
Employee  Manager  Salary
Smith     Jones    50000
Chang     Smith    60000
Ivy       Smith    50000
Typically allows numerical range and exact match (for text) queries, e.g., Salary < 60000 AND Manager = Smith.
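
The example query above can be sketched in a few lines of Python. This is only an illustration of a numerical range test combined with an exact text match; the `Employee` class and the `select` function are hypothetical names introduced here, and a real database would express the same thing in its own query language.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Employee:
    name: str
    manager: str
    salary: int


# Rows taken from the table on this slide.
EMPLOYEES = [
    Employee("Smith", "Jones", 50000),
    Employee("Chang", "Smith", 60000),
    Employee("Ivy", "Smith", 50000),
]


def select(rows, salary_below, manager):
    """Return rows with salary < salary_below AND manager equal to the given name."""
    return [r for r in rows if r.salary < salary_below and r.manager == manager]


# Salary < 60000 AND Manager = Smith
matches = select(EMPLOYEES, 60000, "Smith")
print([r.name for r in matches])  # ['Ivy']
```

Chang is excluded because 60000 is not strictly below 60000, and Smith is excluded because his manager is Jones.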
Unstructured data
Typically refers to free text
- Data which does not have clear, semantically overt, easy-for-a-computer structure
Allows
- Keyword-based queries including operators
- More sophisticated concept queries, e.g.,
find all web pages dealing with drug abuse
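
A keyword-based query with operators can be sketched as set operations over the documents that contain each keyword. The three documents below are made up for illustration, and reading "operators" as AND, OR and NOT is an assumption; the slide does not name them.

```python
import re

# Hypothetical mini-collection; the texts are invented for this sketch.
DOCS = {
    1: "Drug abuse among teenagers is rising",
    2: "New drug approved for heart disease",
    3: "Alcohol abuse and its treatment",
}


def terms(text):
    """Lowercase the text and split it into a set of word tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def docs_with(word):
    """IDs of the documents that contain the keyword."""
    return {doc_id for doc_id, text in DOCS.items() if word in terms(text)}


print(sorted(docs_with("drug") & docs_with("abuse")))  # drug AND abuse     -> [1]
print(sorted(docs_with("drug") | docs_with("abuse")))  # drug OR abuse      -> [1, 2, 3]
print(sorted(docs_with("drug") - docs_with("abuse")))  # drug AND NOT abuse -> [2]
```

A concept query such as "pages dealing with drug abuse" is harder: document 3 matches the keyword "abuse" but may or may not be what the user wants, which is why keyword matching alone does not settle relevance.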
Semi-structured data
In fact almost no data is unstructured
- E.g., this slide has distinctly identified zones such as the Title and Bullets, to say nothing of linguistic structure
Facilitates semi-structured search such as
- Title contains data AND Bullets contain search
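
The zone query on this slide can be sketched by storing each slide as a record with a title zone and a bullets zone. The slide records and the `zone_match` helper are hypothetical, introduced only to illustrate the idea.

```python
import re

# Hypothetical slides, each split into two zones; contents are illustrative.
SLIDES = [
    {"title": "Semi-structured data", "bullets": ["Facilitates semi-structured search"]},
    {"title": "Unstructured data", "bullets": ["Typically refers to free text"]},
    {"title": "What is IR?", "bullets": ["Search over data collections"]},
]


def tokens(text):
    """Lowercase the text and split it into a set of word tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def zone_match(slide, title_word, bullet_word):
    """True if the title zone contains title_word AND some bullet contains bullet_word."""
    in_title = title_word in tokens(slide["title"])
    in_bullets = any(bullet_word in tokens(b) for b in slide["bullets"])
    return in_title and in_bullets


# Title contains data AND Bullets contain search
hits = [s["title"] for s in SLIDES if zone_match(s, "data", "search")]
print(hits)  # ['Semi-structured data']
```

The third slide mentions both words, but "data" appears only in its bullets, so the zone restriction filters it out.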
What is IR?
Representation
- Keywords/Phrases, Structure/Fonts, Counts, etc.
Organization and Storage
- Inverted File Index, Compressed, etc.
- Hardware Architecture and Memory Hierarchy
Access to information items
- Interface: Spell-checker to tree-structured display
- Visualization: Labeled Clusters, Timelines, Spring graphs, etc.
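
The inverted file index named under Organization and Storage can be sketched as a map from each term to the sorted list of documents containing it. The documents and function names below are assumptions made for illustration; compression and memory-hierarchy concerns are left out.

```python
import re
from collections import defaultdict

# Hypothetical documents, invented for this sketch.
DOCS = {
    1: "information retrieval and web search",
    2: "database retrieval of structured data",
    3: "web search engines rank documents",
}


def build_inverted_index(docs):
    """Map each term to the sorted list of IDs of the documents that contain it."""
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        for term in re.findall(r"[a-z0-9]+", text.lower()):
            postings[term].add(doc_id)
    return {term: sorted(ids) for term, ids in postings.items()}


def intersect(p1, p2):
    """Merge two sorted postings lists, keeping the IDs present in both."""
    i = j = 0
    out = []
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            out.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return out


index = build_inverted_index(DOCS)
print(index["retrieval"])                        # [1, 2]
print(intersect(index["web"], index["search"]))  # [1, 3]
```

Keeping postings sorted is a design choice that lets two lists be intersected in a single linear pass instead of comparing every pair of IDs.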
Ultimate Focus of IR
Satisfying user information need
- Emphasis is on retrieval of information (not data)
User information need : Examples
- Printer reviews; Printer prices and availability
- Words in which all vowels appear
- Anagram/Permutations of art
- Flight numbers; UPS/FedEx/USPS Tracking code
Predicting which documents are relevant, and then linearly ranking them.
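
Two of the example needs above (words containing all vowels, and anagrams or permutations of "art") are pattern lookups rather than topical searches, and can be sketched directly. The word list is a hypothetical stand-in for a dictionary file.

```python
from itertools import permutations

# Hypothetical word list standing in for a dictionary file.
WORDS = ["education", "art", "rat", "tar", "sequoia", "printer", "facetious"]


def has_all_vowels(word):
    """True if each of a, e, i, o, u occurs at least once in the word."""
    return set("aeiou") <= set(word.lower())


def anagrams_in(word, vocabulary):
    """Vocabulary words that use exactly the same letters as word."""
    key = sorted(word.lower())
    return [w for w in vocabulary if sorted(w.lower()) == key]


print([w for w in WORDS if has_all_vowels(w)])  # ['education', 'sequoia', 'facetious']
print(anagrams_in("art", WORDS))                # ['art', 'rat', 'tar']

# All orderings of the letters, whether or not they are words.
print(sorted("".join(p) for p in permutations("art")))
# ['art', 'atr', 'rat', 'rta', 'tar', 'tra']
```

The difference between the last two results shows why the same short query can stand for different information needs: one user wants dictionary words, another wants every letter arrangement.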
Information Need : Query, Relevancy

An information need is the topic about which the user desires to know more, and is differentiated from a query, which is what the user conveys to the computer in an attempt to communicate the information need. A document is relevant if it is one that the user perceives as containing information of value with respect to their personal information need.
DIKW Hierarchy
Data: Symbolic units
- E.g., Records of customers.
- E.g., Bytes from sensors.
Information : Data with an interpretation (Who?, What?, When?, Where?).
- E.g., Records of current/new customers grouped by their ages.
- E.g., Variation in temperature readings.
DIKW Hierarchy
Knowledge : Information organized with theoretical concepts or abstract ideas (How?)
- E.g., How many customers have cancelled their accounts in the current fiscal year?
- E.g., Analysis of temperature variation over the years and its causes.
Wisdom : Understanding of fundamental principles + Human Judgement
- E.g., What strategies can be employed to retain customers in the face of cheaper alternatives?
- E.g., Global warming issues and the future of Earth.
DIKW hierarchy: Clark 2004
[Figure: DIKW diagram after Clark (2004). Axes: Context and Understanding. Levels: Data, Information, Knowledge, Wisdom. Labels: Gathering of parts, Connection of parts, Formation of a whole, Joining of wholes; Past, Future; Experience, Novelty; Researching, Absorbing, Doing, Interacting, Reflecting.]
You see things; and you say "Why?" But I dream things that never were; and I say "Why not?" George Bernard Shaw
Information vs Data Retrieval

                    Information retrieval                   Data retrieval
DATA:               Unstructured : open to interpretation   Structured with well-defined semantics
QUERY:              Usually incomplete or ambiguous         Well-defined semantics
                    (w.r.t. information need)
QUALITY OF RESULTS: Partial match allowed,                  Exact match required - no or many results
                    relevance-based ranking
FOUNDATIONS:        Probabilistic underpinnings             Algebra/Logic
APPLICATION:        Library                                 Accounting
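
The contrast between exact match and partial match with relevance-based ranking can be sketched as follows. The documents are invented, and scoring by the number of distinct query terms matched is a deliberately simple assumption, not the probabilistic model the slide alludes to.

```python
import re

# Hypothetical documents, invented for this sketch.
DOCS = {
    "d1": "hp multifunction printer review",
    "d2": "printer prices and availability",
    "d3": "flight numbers and tracking codes",
}


def tokens(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def exact_match(query, docs):
    """Data-retrieval style: keep only documents containing every query term."""
    q = set(tokens(query))
    return [d for d, text in docs.items() if q <= set(tokens(text))]


def ranked_match(query, docs):
    """IR style: score by distinct query terms matched, best first, ties by ID."""
    q = set(tokens(query))
    scored = [(len(q & set(tokens(text))), d) for d, text in docs.items()]
    ordered = sorted(scored, key=lambda pair: (-pair[0], pair[1]))
    return [d for score, d in ordered if score > 0]


print(exact_match("hp printer prices", DOCS))   # []
print(ranked_match("hp printer prices", DOCS))  # ['d1', 'd2']
```

The exact-match query returns nothing because no single document contains all three terms, while the ranked version still surfaces the two partially matching documents.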
User Task
[Figure: Retrieval and Browsing as user tasks over a Database]
- Retrieval
  Purposeful, e.g., HP Multifunction Printer information
- Browsing
  Casual, hyperlink-based, e.g., Big Bang, CBR, Element Genesis, Supernova, ...
- Filtering by Agents
  Push, e.g., podcasts from the B.B.C.'s Naked Science
Logical View of Documents

[Figure: documents pass through a sequence of text operations (accents and spacing, stopwords, noun groups, stemming, manual indexing); the logical view moves from structure plus full text to index terms]
Abstraction (essentials)
- Structure, fonts, proximity, repetitions, etc.
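
The move from full text to index terms can be sketched as a small pipeline covering accent removal, stopword removal and stemming. The stopword list and the `crude_stem` rule are simplifying assumptions; a real system would use a fuller list and an established stemmer, and noun-group detection and manual indexing are not modeled here.

```python
import re
import unicodedata

# Tiny illustrative stopword list; real systems use much longer ones.
STOPWORDS = {"the", "a", "an", "of", "and", "in", "to", "is", "are"}


def strip_accents(text):
    """Decompose accented characters and drop the combining marks."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def crude_stem(word):
    """Very rough suffix stripping; a stand-in for a real stemmer."""
    for suffix in ("ing", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def index_terms(text):
    """Full text -> normalized tokens -> stopwords removed -> stemmed index terms."""
    normalized = strip_accents(text).lower()
    words = re.findall(r"[a-z0-9]+", normalized)
    return [crude_stem(w) for w in words if w not in STOPWORDS]


print(index_terms("The café is indexing résumés and documents"))
# ['cafe', 'index', 'resum', 'document']
```

Each step discards detail (accents, function words, inflection), which shrinks the index at the cost of some precision: "resum" no longer distinguishes "résumés" from "resumes".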
The Retrieval Process

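[Figure: the retrieval process. The user need enters through the User Interface; Text Operations produce the logical view of the text; Query Operations turn it into a query, refined by user feedback; Searching uses the Index to return retrieved docs; Ranking returns ranked docs to the user. Indexing builds the inverted file from the logical view of the Text Database, which is accessed through the DB Manager Module.]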
IR Basics
Models and retrieval evaluation
Query languages and operations
- Improve inferring query context (query expansion, relevance feedback)
Text operations
- Improve gleaning of document semantics (stemming keywords)
Efficient Access: Index and Search
- Visualization, Multimedia, Applications
Clustering and classification

Clustering: given a set of docs, group them into clusters based on their content.
Classification: given a set of topics, plus a new doc D, decide which topic(s) D belongs to.
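
The classification half of this slide can be sketched by comparing a new document with a short description of each topic. The topic descriptions are made up, and using cosine similarity over word counts is an assumption chosen for simplicity; this sketch returns only the single best topic rather than several.

```python
import math
import re
from collections import Counter

# Hypothetical topics, each described by a few example words.
TOPICS = {
    "printers": "printer ink toner print paper multifunction",
    "astronomy": "supernova star galaxy big bang element",
}


def vector(text):
    """Bag-of-words term-frequency vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(u, v):
    """Cosine of the angle between two term-frequency vectors."""
    dot = sum(u[t] * v[t] for t in u if t in v)
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def classify(doc, topics):
    """Assign doc to the topic whose description is most similar to it."""
    d = vector(doc)
    return max(topics, key=lambda name: cosine(d, vector(topics[name])))


print(classify("Which multifunction printer uses the least ink?", TOPICS))  # printers
print(classify("Elements heavier than iron form in a supernova", TOPICS))   # astronomy
```

Clustering would reuse the same similarity measure but without predefined topics, grouping documents that are close to one another.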
The web and its challenges

Unusual and diverse documents
Unusual and diverse users, queries, information needs
Beyond terms, exploit ideas from social networks
- link analysis, clickstreams, ...
How do search engines work? And how can we make them better?
More sophisticated semistructured search

Title is about Object Oriented Programming AND Author something like stro*rup
- where * is the wild-card operator
Issues:
- how do you process "about"?
- how do you rank results?
The focus of XML search.
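
The wild-card part of the query above can be sketched with Python's standard `fnmatch` module, in which `*` matches any run of characters. The author vocabulary is hypothetical, and the sketch does not address the harder issues of processing "about" or ranking results.

```python
import fnmatch

# Hypothetical author vocabulary, as might be extracted from an Author field.
AUTHORS = ["stroustrup", "strostrup", "stroup", "strachey", "sussman"]


def wildcard_match(pattern, vocabulary):
    """Terms matching a pattern in which * stands for any run of characters."""
    return [term for term in vocabulary if fnmatch.fnmatchcase(term, pattern)]


print(wildcard_match("stro*rup", AUTHORS))  # ['stroustrup', 'strostrup']
```

Note that `fnmatch` also gives special meaning to `?` and `[...]`, so a pattern taken from user input may need those characters escaped if only `*` is meant to be a wild card.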

More sophisticated information retrieval

Cross-language information retrieval
Question answering
Summarization
Text mining
Future Progress: Factors/Trends

Large, uncontrolled publishing media
- Quality issues
Cheap, fast and wide access
- Ease of use (query formulation)
Variety and flexibility
- Navigational and Visualization aids
- Directory-based (Table of contents) vs Keyword-based (Inverted File Index)
Index terms (automatic/human-created) vs Full-text
Privacy, Security, Copyright
