
Machine Learning for Information

Extraction: An Overview
Kamal Nigam
Google Pittsburgh



With input, slides and suggestions from William Cohen, Andrew McCallum and Ion Muslea
Example: A Problem
[Screenshot of keyword job-search results for "baker": a genomics job;
Mt. Baker, the school district; Baker Hostetler, the company; Baker, a job opening]
Example: A Solution
[Screenshot of faceted job search:
Job Openings: Category = Food Services, Keyword = Baker, Location = Continental U.S.]
Extracting Job Openings from the Web
[Extracted record: Title: Ice Cream Guru; Description: "If you dream of cold creamy ...";
Contact: susan@foodscience.com; Category: Travel/Hospitality; Function: Food Services]

Potential Enabler of Faceted Search

Lots of Structured Information in Text

IE from Research Papers
What is Information Extraction?
Recovering structured data from formatted text
Identifying fields (e.g. named entity recognition)
Understanding relations between fields (e.g. record association)
Normalization and deduplication

Today, focus mostly on field identification &
a little on record association

IE Posed as a Machine Learning Task
Training data: documents marked up with ground truth
In contrast to text classification, local features are crucial. Features of:
Contents
Text just before item
Text just after item
Begin/end boundaries

… 00 : pm Place :   Wean Hall Rm 5409   Speaker : Sebastian Thrun …
        prefix            contents              suffix
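To make these local features concrete, here is a minimal Python sketch; the function name, window size, and feature templates are illustrative assumptions, not the features of any particular system from the slides.

def span_features(tokens, start, end, window=2):
    feats = {}
    for w in tokens[start:end]:                                   # contents
        feats["contents=" + w.lower()] = 1
    for w in tokens[max(0, start - window):start]:                # text just before item
        feats["prefix=" + w.lower()] = 1
    for w in tokens[end:end + window]:                            # text just after item
        feats["suffix=" + w.lower()] = 1
    feats["first=" + tokens[start].lower()] = 1                   # begin boundary
    feats["last=" + tokens[end - 1].lower()] = 1                  # end boundary
    return feats

tokens = "00 : pm Place : Wean Hall Rm 5409 Speaker : Sebastian Thrun".split()
print(span_features(tokens, 5, 9))    # candidate span "Wean Hall Rm 5409"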


Good Features for Information Extraction
Example word features:
identity of word
is in all caps
ends in -ski
is part of a noun phrase
is in a list of city names
is under node X in WordNet or Cyc
is in bold font
is in hyperlink anchor
features of past & future
last person name was female
next two words are "and Associates"
begins-with-number
begins-with-ordinal
begins-with-punctuation
begins-with-question-word
begins-with-subject
blank
contains-alphanum
contains-bracketed-number
contains-http
contains-non-space
contains-number
contains-pipe
contains-question-mark
contains-question-word
ends-with-question-mark
first-alpha-is-capitalized
indented
indented-1-to-4
indented-5-to-10
more-than-one-third-space
only-punctuation
prev-is-blank
prev-begins-with-ordinal
shorter-than-30
Creativity and Domain Knowledge Required!
Is Capitalized
Is Mixed Caps
Is All Caps
Initial Cap
Contains Digit
All lowercase
Is Initial
Punctuation
Period
Comma
Apostrophe
Dash
Preceded by HTML tag



Character n-gram classifier
says string is a person
name (80% accurate)
In stopword list
(the, of, their, etc)
In honorific list
(Mr, Mrs, Dr, Sen, etc)
In person suffix list
(Jr, Sr, PhD, etc)
In name particle list
(de, la, van, der, etc)
In Census lastname list;
segmented by P(name)
In Census firstname list;
segmented by P(name)
In locations lists
(states, cities, countries)
In company name list
(J. C. Penney)
In list of company suffixes
(Inc, & Associates,
Foundation)
Word Features
Lists of job titles
Lists of prefixes
Lists of suffixes
350 informative phrases
HTML/Formatting Features
{begin, end, in} x
{<b>, <i>, <a>, <hN>} x
{lengths 1, 2, 3, 4, or longer}
{begin, end} of line
Creativity and Domain Knowledge Required!
Good Features for Information Extraction
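The HTML/formatting features above can be generated mechanically as a cross product of small templates. A minimal sketch, with illustrative template names rather than the original feature generator:

from itertools import product

# Illustrative cross product in the spirit of
# {begin, end, in} x {<b>, <i>, <a>, <hN>} x {length 1, 2, 3, 4, or longer}.
POSITIONS = ["begin", "end", "in"]
TAGS = ["b", "i", "a", "hN"]
LENGTHS = ["1", "2", "3", "4", "5+"]

def formatting_feature_names():
    return ["%s_<%s>_len%s" % (pos, tag, ln)
            for pos, tag, ln in product(POSITIONS, TAGS, LENGTHS)]

print(len(formatting_feature_names()))   # 3 * 4 * 5 = 60 feature templates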
IE History
Pre-Web
Mostly news articles
De Jong's FRUMP [1982]
Hand-built system to fill Schank-style scripts from news wire
Message Understanding Conference (MUC) DARPA [87-95],
TIPSTER [92-96]
Most early work dominated by hand-built models
E.g. SRI's FASTUS, hand-built FSMs.
But by 1990s, some machine learning: Lehnert, Cardie, Grishman and
then HMMs: Elkan [Leek 97], BBN [Bikel et al 98]
Web
AAAI 94 Spring Symposium on Software Agents
Much discussion of ML applied to Web. Maes, Mitchell, Etzioni.
Tom Mitchell's WebKB, '96
Build KBs from the Web.
Wrapper Induction
Initially hand-built, then ML: [Soderland 96], [Kushmerick 97], …
Landscape of ML Techniques for IE:

Any of these models can be used to capture words, formatting or both.
Classify Candidates
Abraham Lincoln was born in Kentucky.
Classifier
which class?
Sliding Window
Abraham Lincoln was born in Kentucky.
Classifier
which class?
Try alternate
window sizes:
Boundary Models
Abraham Lincoln was born in Kentucky.
Classifier
which class?
[BEGIN and END boundary markers placed between tokens]
Finite State Machines
Abraham Lincoln was born in Kentucky.
Most likely state sequence?
Wrapper Induction
<b><i>Abraham Lincoln</i></b> was born in Kentucky.
Learn and apply pattern for a website
Learned pattern: <b><i> PersonName </i></b>
Sliding Windows & Boundary Detection
Information Extraction by Sliding Windows
GRAND CHALLENGES FOR MACHINE LEARNING

Jaime Carbonell
School of Computer Science
Carnegie Mellon University

3:30 pm
7500 Wean Hall

Machine learning has evolved from obscurity
in the 1970s into a vibrant and popular
discipline in artificial intelligence
during the 1980s and 1990s. As a result
of its success and growth, machine learning
is evolving into a collection of related
disciplines: inductive concept acquisition,
analytic learning in problem solving (e.g.
analogy, explanation-based learning),
learning theory (e.g. PAC learning),
genetic algorithms, connectionist learning,
hybrid systems, and so on.
CMU UseNet Seminar Announcement
E.g. looking for the seminar location
Information Extraction with Sliding Windows
[Freitag 97, 98; Soderland 97; Califf 98]

00 : pm Place :   Wean Hall Rm 5409   Speaker : Sebastian Thrun
w_{t-m} … w_{t-1}   [ w_t … w_{t+n} ]   w_{t+n+1} … w_{t+n+m}
      prefix              contents                suffix


Standard supervised learning setting
Positive instances: Candidates with real label
Negative instances: All other candidates
Features based on candidate, prefix and suffix (see the sketch below)
Special-purpose rule learning systems work well
courseNumber(X) :-
  tokenLength(X,=,2),
  every(X, inTitle, false),
  some(X, A, <previousToken>, inTitle, true),
  some(X, B, <>, tripleton, true)
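The same supervised setting also works with a generic feature-based classifier in place of a special-purpose rule learner. A minimal sketch, assuming scikit-learn and a toy feature set; the candidate generator, feature templates, and single training document are illustrative assumptions:

from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression

MAX_LEN = 4   # assumed maximum span length for candidate windows

def candidates(tokens):
    # every span of 1..MAX_LEN tokens is a candidate window
    for start in range(len(tokens)):
        for end in range(start + 1, min(start + MAX_LEN, len(tokens)) + 1):
            yield start, end

def features(tokens, start, end):
    return {
        "first=" + tokens[start].lower(): 1,
        "last=" + tokens[end - 1].lower(): 1,
        "prefix=" + (tokens[start - 1].lower() if start > 0 else "<s>"): 1,
        "suffix=" + (tokens[end].lower() if end < len(tokens) else "</s>"): 1,
        "length=" + str(end - start): 1,
    }

# One toy labeled document; the location field is tokens 5..9 ("Wean Hall Rm 5409").
tokens = "00 : pm Place : Wean Hall Rm 5409 Speaker : Sebastian Thrun".split()
gold = (5, 9)

spans = list(candidates(tokens))
X = [features(tokens, s, e) for s, e in spans]      # positive: the span with the real label
y = [1 if (s, e) == gold else 0 for s, e in spans]  # negatives: all other candidates

vec = DictVectorizer()
clf = LogisticRegression(max_iter=1000).fit(vec.fit_transform(X), y)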
Rule-learning approaches to sliding-window classification: Summary
Representations for classifiers allow restriction of the relationships between tokens, etc.
Representations are carefully chosen subsets of even more powerful representations based on logic programming (ILP and Prolog)
Use of these heavyweight representations is complicated, but seems to pay off in results
IE by Boundary Detection
GRAND CHALLENGES FOR MACHINE LEARNING

Jaime Carbonell
School of Computer Science
Carnegie Mellon University

3:30 pm
7500 Wean Hall

Machine learning has evolved from obscurity
in the 1970s into a vibrant and popular
discipline in artificial intelligence
during the 1980s and 1990s. As a result
of its success and growth, machine learning
is evolving into a collection of related
disciplines: inductive concept acquisition,
analytic learning in problem solving (e.g.
analogy, explanation-based learning),
learning theory (e.g. PAC learning),
genetic algorithms, connectionist learning,
hybrid systems, and so on.
CMU UseNet Seminar Announcement
E.g. looking for the seminar location
BWI: Learning to detect boundaries
Another formulation: learn three probabilistic classifiers:
START(i) = Prob( position i starts a field)
END(j) = Prob( position j ends a field)
LEN(k) = Prob( an extracted field has length k)
Then score a possible extraction (i,j) by
START(i) * END(j) * LEN(j-i)

LEN(k) is estimated from a histogram

[Freitag & Kushmerick, AAAI 2000]
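A minimal sketch of this scoring rule, with hand-set numbers standing in for the trained START/END detectors and the length histogram (here j is the inclusive last token, so the field length is j - i + 1):

import numpy as np

tokens = "Place : Wean Hall Rm 5409 Speaker".split()
n = len(tokens)

start_p = np.full(n, 0.05); start_p[2] = 0.9          # assumed START detector outputs
end_p   = np.full(n, 0.05); end_p[5]   = 0.8          # assumed END detector outputs
len_p   = {1: 0.1, 2: 0.2, 3: 0.3, 4: 0.3, 5: 0.1}    # assumed length histogram

score, i, j = max(((start_p[i] * end_p[j] * len_p.get(j - i + 1, 0.0), i, j)
                   for i in range(n) for j in range(i, n)), key=lambda t: t[0])
print(tokens[i:j + 1], score)   # best span is 'Wean Hall Rm 5409' with score 0.9 * 0.8 * 0.3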
BWI: Learning to detect boundaries
BWI uses boosting to find detectors for START and END
Each weak detector has a BEFORE and AFTER pattern (on tokens before/after position i).
Each pattern is a sequence of tokens and/or wildcards like: anyAlphabeticToken, anyToken, anyUpperCaseLetter, anyNumber, …
Weak learner for patterns uses greedy search (+ lookahead) to repeatedly extend a pair of empty BEFORE, AFTER patterns
BWI: Learning to detect boundaries
Field          F1
Person Name    30%
Location       61%
Start Time     98%
Problems with Sliding Windows and Boundary Finders
Decisions in neighboring parts of the input are made independently from each other.

Naïve Bayes Sliding Window may predict a seminar end time before the seminar start time.
It is possible for two overlapping windows to both be above threshold.

In a Boundary-Finding system, left boundaries are laid down independently from right boundaries, and their pairing happens as a separate step.
Finite State Machines
Hidden Markov Models
Graphical model (finite state model):
[Figure: chain of states … S_{t-1} → S_t → S_{t+1} …, each state emitting an observation O_{t-1}, O_t, O_{t+1}]

Parameters: for all states S = {s_1, s_2, …}
Start state probabilities: P(s_t)
Transition probabilities: P(s_t | s_{t-1})
Observation (emission) probabilities: P(o_t | s_t)

Training:
Maximize probability of training observations (w/ prior)

  P(s, o) = \prod_{t=1}^{|o|} P(s_t | s_{t-1}) P(o_t | s_t)

HMMs are the standard sequence modeling tool in genomics, music, speech, NLP, …

[Figure: the finite-state model unrolled: transitions generate a state sequence, and each state emits an observation, producing an observation sequence o_1 o_2 … o_8]
Emissions are usually a multinomial over an atomic, fixed alphabet.
IE with Hidden Markov Models
Given a sequence of observations:
  Yesterday Lawrence Saul spoke this example sentence.
and a trained HMM, find the most likely state sequence (Viterbi):
  \arg\max_s P(s, o)
Any words said to be generated by the designated "person name" state are extracted as a person name:
  Person name: Lawrence Saul
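A minimal Viterbi decoding sketch for this setup, with toy hand-set parameters rather than a trained model; state 0 is "background" and state 1 is "person name":

import numpy as np

def viterbi(obs, start_p, trans_p, emit_p):
    """Return the most likely state sequence for a list of observation indices."""
    n_states, T = len(start_p), len(obs)
    score = np.zeros((T, n_states))
    back = np.zeros((T, n_states), dtype=int)
    score[0] = np.log(start_p) + np.log(emit_p[:, obs[0]])
    for t in range(1, T):
        for s in range(n_states):
            cand = score[t - 1] + np.log(trans_p[:, s])
            back[t, s] = np.argmax(cand)
            score[t, s] = cand[back[t, s]] + np.log(emit_p[s, obs[t]])
    states = [int(np.argmax(score[-1]))]
    for t in range(T - 1, 0, -1):
        states.append(int(back[t, states[-1]]))
    return states[::-1]

words = "yesterday lawrence saul spoke this example sentence .".split()
vocab = {w: i for i, w in enumerate(words)}
obs = [vocab[w] for w in words]
start_p = np.array([0.9, 0.1])
trans_p = np.array([[0.8, 0.2], [0.4, 0.6]])
emit_p = np.array([
    [0.20, 0.01, 0.01, 0.20, 0.20, 0.18, 0.10, 0.10],   # background state
    [0.01, 0.45, 0.45, 0.02, 0.02, 0.02, 0.02, 0.01],   # person-name state
])
print(viterbi(obs, start_p, trans_p, emit_p))   # expect [0, 1, 1, 0, 0, 0, 0, 0]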

Generative Extraction with HMMs
Parameters: { P(s_t | s_{t-1}), P(o_t | s_t) } for all states s_t and words o_t
Parameters define the generative model:

  P(s, o) = \prod_{t=1}^{|o|} P(s_t | s_{t-1}) P(o_t | s_t)

[McCallum, Nigam, Seymore & Rennie 00]
HMM Example: Nymble   [Bikel et al 97]
Task: Named Entity Extraction
Train on 450k words of news wire text.

[Figure: fully connected HMM over name-class states: Person, Org, (five other name classes), Other, plus start-of-sentence and end-of-sentence states]

Transition probabilities: P(s_t | s_{t-1}, o_{t-1})
  Back-off to: P(s_t | s_{t-1}), then P(s_t)
Observation probabilities: P(o_t | s_t, s_{t-1}) or P(o_t | s_t, o_{t-1})
  Back-off to: P(o_t | s_t), then P(o_t)

Results:
Case    Language    F1
Mixed   English     93%
Upper   English     91%
Mixed   Spanish     90%

Other examples of HMMs in IE: [Leek 97; Freitag & McCallum 99; Seymore et al. 99]
Regrets from Atomic View of Tokens
Would like richer representation of text:
multiple overlapping features, whole chunks of text.
line, sentence, or paragraph features:
length
is centered in page
percent of non-alphabetics
white-space aligns with next line
containing sentence has two verbs
grammatically contains a question
contains links to authoritative pages
emissions that are uncountable
features at multiple levels of granularity
Example word features:
identity of word
is in all caps
ends in -ski
is part of a noun phrase
is in a list of city names
is under node X in WordNet or Cyc
is in bold font
is in hyperlink anchor
features of past & future
last person name was female
next two words are and Associates
Problems with Richer Representation and a Generative Model
These arbitrary features are not independent:
Overlapping and long-distance dependencies
Multiple levels of granularity (words, characters)
Multiple modalities (words, formatting, layout)
Observations from past and future
HMMs are generative models of the text:  P(s, o)
Generative models do not easily handle these non-independent features. Two choices:
Model the dependencies. Each state would have its own Bayes Net. But we are already starved for training data!
Ignore the dependencies. This causes over-counting of evidence (à la naïve Bayes). Big problem when combining evidence, as in Viterbi!
Conditional Sequence Models
We would prefer a conditional model:
P(s|o) instead of P(s,o):
Can examine features, but not responsible for generating them.
Don't have to explicitly model their dependencies.
Don't waste modeling effort trying to generate what we are given at test time anyway.
If successful, this answers the challenge of integrating the ability to handle many arbitrary features with the full power of finite state automata.
Conditional Markov Models
Generative (traditional HMM):
[Figure: directed chain … S_{t-1} → S_t → S_{t+1} …, with each state generating its observation O_{t-1}, O_t, O_{t+1}]

  P(s, o) = \prod_{t=1}^{|o|} P(s_t | s_{t-1}) P(o_t | s_t)

Conditional:
[Figure: the same chain of states, but each observation O_t conditions the transition into S_t]

  P(s | o) = \prod_{t=1}^{|o|} P(s_t | s_{t-1}, o_t)

Standard belief propagation: forward-backward procedure.
Viterbi and Baum-Welch follow naturally.
Maximum Entropy Markov Models [McCallum, Freitag & Pereira, 2000]
MaxEnt POS Tagger [Ratnaparkhi, 1996]
SNoW-based Markov Model [Punyakanok & Roth, 2000]
Exponential Form for Next State Function

  P(s_t | s_{t-1}, o_t) = P_{s_{t-1}}(s_t | o_t) = (1 / Z(o_t, s_{t-1})) exp( \sum_k \lambda_k f_k(o_t, s_t) )

where each \lambda_k is a weight and each f_k is a feature.

Capture the dependency on s_{t-1} with |S| independent functions, P_{s_{t-1}}(s_t | o_t).
Each state contains a "next-state classifier" that, given the next observation, produces a probability of the next state, P_{s_{t-1}}(s_t | o_t).

Recipe:
- Labeled data is assigned to transitions.
- Train each state's exponential model by maximum entropy.
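A minimal sketch of this recipe, assuming scikit-learn's logistic regression as the per-state maximum entropy model and a handful of made-up training transitions:

from collections import defaultdict
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy transitions: (previous state, observation features, next state).
transitions = [
    ("other",  {"word=dr": 1, "is_title": 1},  "person"),
    ("other",  {"word=the": 1},                "other"),
    ("person", {"word=smith": 1, "cap": 1},    "person"),
    ("person", {"word=spoke": 1},              "other"),
    ("other",  {"word=mrs": 1, "is_title": 1}, "person"),
    ("person", {"word=said": 1},               "other"),
]

by_prev = defaultdict(lambda: ([], []))
for prev, obs, nxt in transitions:
    by_prev[prev][0].append(obs)
    by_prev[prev][1].append(nxt)

# One exponential model per source state: P_{s_{t-1}}(s_t | o_t).
next_state_model = {
    prev: make_pipeline(DictVectorizer(), LogisticRegression(max_iter=1000)).fit(X, y)
    for prev, (X, y) in by_prev.items()
}

print(next_state_model["other"].predict([{"word=dr": 1, "is_title": 1}]))  # expect ['person']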
Label Bias Problem
[Figure: an MEMM with two paths from state 0 to state 3: 0→1→2→3 reading r-o-b, and 0→4→5→3 reading r-i-b]
Consider this MEMM, and enough training data to perfectly model it:
Pr(0123|rob) = Pr(1|0,r)/Z1 * Pr(2|1,o)/Z2 * Pr(3|2,b)/Z3 = 0.5 * 1 * 1
Pr(0453|rib) = Pr(4|0,r)/Z1 * Pr(5|4,i)/Z2 * Pr(3|5,b)/Z3 = 0.5 * 1 * 1
Pr(0123|rib) = 1
Pr(0453|rob) = 1
From HMMs to MEMMs to CRFs

  s = s_1, s_2, …, s_n        o = o_1, o_2, …, o_n

[Figure: three linear-chain graphical models over S_{t-1}, S_t, S_{t+1} and O_{t-1}, O_t, O_{t+1}: a directed HMM (states generate observations), a directed MEMM (observations feed into state transitions), and an undirected CRF]

HMM (a special case of MEMMs and CRFs):
  P(s, o) = \prod_{t=1}^{|o|} P(s_t | s_{t-1}) P(o_t | s_t)

MEMM:
  P(s | o) = \prod_{t=1}^{|o|} P(s_t | s_{t-1}, o_t)
           = \prod_{t=1}^{|o|} (1 / Z(o_t, s_{t-1})) exp( \sum_j \lambda_j f_j(s_t, s_{t-1}) + \sum_k \mu_k g_k(s_t, o_t) )

CRF [Lafferty, McCallum, Pereira 2001]:
  P(s | o) = (1 / Z_o) \prod_{t=1}^{|o|} exp( \sum_j \lambda_j f_j(s_t, s_{t-1}) + \sum_k \mu_k g_k(s_t, o_t) )
Conditional Random Fields (CRFs)
[Figure: states S_t, S_{t+1}, S_{t+2}, S_{t+3}, S_{t+4} linked in an undirected chain, all conditioned on the whole observation sequence O = O_t, O_{t+1}, O_{t+2}, O_{t+3}, O_{t+4}]

Markov on s, conditional dependency on o:

  P(s | o) = (1 / Z_o) \prod_{t=1}^{|o|} exp( \sum_k \lambda_k f_k(s_t, s_{t-1}, o, t) )

The Hammersley-Clifford-Besag theorem stipulates that the CRF has this form: an exponential function of the cliques in the graph.
Assuming that the dependency structure of the states is tree-shaped (a linear chain is a trivial tree), inference can be done by dynamic programming in time O(|o| |S|^2), just like HMMs.
[Lafferty, McCallum, Pereira 2001]
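A minimal sketch of linear-chain CRF scoring with the forward algorithm, using random stand-in potentials in place of learned feature weights; unary[t, s] and trans[s_prev, s] play the role of \sum_k \lambda_k f_k(s_t, s_{t-1}, o, t) split into observation and transition parts:

import numpy as np
from scipy.special import logsumexp

def crf_log_prob(labels, unary, trans):
    """log P(s | o) = score(labels) - log Z_o, with Z_o from the forward algorithm."""
    T = len(labels)
    score = unary[0, labels[0]] + sum(
        trans[labels[t - 1], labels[t]] + unary[t, labels[t]] for t in range(1, T))
    alpha = unary[0]                                    # forward pass: O(|o| * |S|^2)
    for t in range(1, T):
        alpha = logsumexp(alpha[:, None] + trans, axis=0) + unary[t]
    return score - logsumexp(alpha)

rng = np.random.default_rng(0)
unary = rng.normal(size=(5, 3))        # 5 tokens, 3 labels: assumed feature scores
trans = rng.normal(size=(3, 3))
print(np.exp(crf_log_prob([0, 1, 1, 2, 0], unary, trans)))   # probability of this labeling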
Training CRFs

Maximize log-likelihood of parameters \Lambda = {\lambda_k} given training data { <o, s>^(i) }

Log-likelihood gradient:

  \partial L / \partial \lambda_k  =  \sum_i C_k(s^(i), o^(i))                          (feature count using correct labels)
                                    - \sum_i \sum_s P_\Lambda(s | o^(i)) C_k(s, o^(i))  (feature count using labels assigned by current parameters)
                                    - \lambda_k / \sigma^2                              (smoothing penalty)

  where C_k(s, o) = \sum_t f_k(s_t, s_{t-1}, o, t)

Methods:
iterative scaling (quite slow)
conjugate gradient (much faster)
conjugate gradient with preconditioning (super fast)
limited-memory quasi-Newton methods (also super fast)

Complexity comparable to standard Baum-Welch
[Sha & Pereira 2002]
& [Malouf 2002]
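A minimal sketch of this gradient for a single training pair, using brute-force enumeration over labelings (rather than forward-backward) purely to show the "empirical counts minus expected counts minus smoothing penalty" structure; the feature function f is an illustrative assumption:

import numpy as np
from itertools import product

def f(s_t, s_prev, obs, t):            # assumed toy feature vector f_k(s_t, s_{t-1}, o, t)
    return np.array([float(s_t == obs[t]), float(s_prev == s_t)])

def counts(states, obs):               # C(s, o) = sum_t f(s_t, s_{t-1}, o, t)
    return sum(f(states[t], states[t - 1] if t else None, obs, t) for t in range(len(obs)))

def gradient(lmbda, gold, obs, labels, sigma2=10.0):
    scores = {s: lmbda @ counts(s, obs) for s in product(labels, repeat=len(obs))}
    logZ = np.logaddexp.reduce(list(scores.values()))
    expected = sum(np.exp(v - logZ) * counts(s, obs) for s, v in scores.items())
    return counts(gold, obs) - expected - lmbda / sigma2

print(gradient(np.zeros(2), gold=(0, 1, 1), obs=(0, 1, 1), labels=(0, 1)))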
Sample IE Applications of CRFs
Noun phrase segmentation [Sha & Pereira, 03]
Named entity recognition [McCallum & Li 03]
Protein names in bio abstracts [Settles 05]
Addresses in web pages [Culotta et al. 05]
Semantic roles in text [Roth & Yih 05]
RNA structural alignment [Sato & Sakakibara 05]
Examples of Recent CRF Research
Semi-Markov CRFs [Sarawagi & Cohen 05]
Awkwardness of token level decisions for segments
Segment sequence model alleviates this
Two-level model with sequences of segments,
which are sequences of tokens

Stochastic Meta-Descent [Vishwanathan 06]
Stochastic gradient optimization for training
Take gradient step with small batches of examples
Order of magnitude faster than L-BFGS
Same resulting accuracies for extraction

Further Reading about CRFs
Charles Sutton and Andrew McCallum. An
Introduction to Conditional Random Fields for
Relational Learning. In Introduction to Statistical
Relational Learning. Edited by Lise Getoor and
Ben Taskar. MIT Press. 2006.

http://www.cs.umass.edu/~mccallum/papers/crf-tutorial.pdf
