
Imitation Learning

Modeling & learning sequences of decisions

Guillaume Wisniewski
guillaume.wisniewski@limsi.fr
March 2018
Université Paris Sud & LIMSI

Natural Language Understanding

They eat the pizza with anchovies.
(attachment ambiguity : does with anchovies modify eat or the pizza ?)

Dependency parsing

Why is NLU difficult ?

1. words form a discrete space ⇒ no way to ‘program’ the fact that
   cat and cats are the same ‘concept’ (and it is not always true)
2. Zipf’s law : many rare events
3. language is highly structured (recursive structure + generative
   power)
4. language is ambiguous
5. language is evolving

Zipf’s Law

[Figure : rank versus frequency for the first 10 million words in 30 Wikipedias]
Structure Prediction
Dependency parsing

A dependency tree (Tesnière, 1959) is :

• a directed graph : nodes = words, arcs =
  relations between two words
• a tree (i.e. each word has exactly one
  parent, except the root + no cycle)
• a set of relations between words : each
  word is ‘attached’ to the word it qualifies
  (e.g. un → chat ← noir)
• an edge between a parent and its child =
  relation between a head (the parent) and its
  dependent (the child)
• labels on edges
⇒ represents the syntax of a sentence
Our new goal

• given a sentence x = x_1, ..., x_n, predict its dependency tree...
• ...knowing a set of annotated sentences (i.e. sentences with their
  gold dependency tree)
⇒ structure prediction

Structure prediction

The task
• input : sequences
• output : sequences, trees, ...
• both input and output can be divided into sub-parts ⊕
relations/dependencies between these parts

Understanding structure prediction
A new application

Starting with a simpler problem

fixed-size image ⇒ 1 letter among 26

→ multi-class classification
A first hand-written recognition system

The simplest hand-written recognition system :

• segment the picture to extract/isolate each letter (‘simple’ image
  processing)
• recognize each letter independently

→ sub-optimal
A better way...

B C A C A
P R O R E
G T E T O

• take into account the other letters when making a decision
• dependencies between labels
• labels are chosen simultaneously ⇒ joint inference
Why is it easy ?

• we just have to consider all letters and all labels simultaneously
• x = (x_i)_{i=1}^l : sequence of observations
• y = (y_i)_{i=1}^l : sequence of labels
• f(x, y) : joint representation of the whole sequences of
  observations and labels
• features can depend on / represent several labels :
  • f_{α,i}(x, y) = 1 if y_i = A and y_{i+1} = B, 0 otherwise
  • f_{β,i}(x, y) = 1 if there are three consecutive ‘z’, 0 otherwise
  • ...
• the problem boils down to a simple multi-class classification
  problem :

    y* = arg max_{y∈Y} F(x, y; w)
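
To make the reduction concrete, here is a minimal sketch over a toy label set (all names are illustrative, not from the slides) : a joint feature map f(x, y) and an exhaustive arg max over Y, feasible only for very short sequences :

    import itertools

    LABELS = "ABZ"  # toy label set

    def joint_features(x, y):
        """f(x, y): indicator features over the whole sequence pair."""
        feats = {}
        for i in range(len(y) - 1):
            feats[f"trans:{y[i]}>{y[i + 1]}"] = 1.0   # e.g. y_i = A and y_{i+1} = B
        for obs, lab in zip(x, y):
            feats[f"emit:{obs}={lab}"] = 1.0
        if "ZZZ" in "".join(y):
            feats["three_consecutive_z"] = 1.0
        return feats

    def F(x, y, w):
        """F(x, y; w) = w . f(x, y)"""
        return sum(w.get(k, 0.0) * v for k, v in joint_features(x, y).items())

    def argmax_joint(x, w):
        """Exhaustive arg max over Y: enumerates all m**l label sequences."""
        return max(itertools.product(LABELS, repeat=len(x)),
                   key=lambda y: F(x, y, w))
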
Why is it hard ?

• m possible labels for each letter, l letters in the word
• independent decisions : m × l scores to compute
• joint inference : m^l scores to compute
⇒ combinatorial optimization problem
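
Concretely, for a 10-letter word over a 26-letter alphabet :

    m, l = 26, 10
    print(m * l)   # independent decisions: 260 scores
    print(m ** l)  # joint inference: 141 167 095 653 376 scores
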
Transition-Based Dependency
Parsing
General Principle

• sequence of actions to build the output tree incrementally
• state/configuration = describes the partial structure
  ⊕ what remains to be done
• greedy decoding from an initial state (empty tree) to
  a final state (all words have a head)
• in essence : which arc and label should we add next ?
The Arc-Eager Parser [Nivre 2003]

• Configuration : ⟨S, B, A⟩ where S = stack, B = buffer, A = set of arcs that
  have been built
• Initial configuration : ⟨[], [0, 1, ..., n], {}⟩
• Terminal configuration : ⟨S, [], A⟩
• Actions :
  • Shift : ⟨S, i|B, A⟩ → ⟨S|i, B, A⟩
  • Reduce : ⟨S|i, B, A⟩ → ⟨S, B, A⟩ iff h(i, A)
  • Right-Arc(k) : ⟨S|i, j|B, A⟩ → ⟨S|i|j, B, A ∪ {(i, j, k)}⟩
  • Left-Arc(k) : ⟨S|i, j|B, A⟩ → ⟨S, j|B, A ∪ {(j, i, k)}⟩ iff
    ¬h(i, A) ∧ i ≠ ROOT

Notations :

• S|i = stack with top i and remainder S
• j|B = buffer with head j and remainder B
• h(i, A) = i has a head in A
• (i, j, k) = arc with head i, dependent j and label k
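
A minimal executable sketch of these four transitions (the configuration encoding and function names are mine, not from the slides) ; arcs are encoded as (head, dependent, label) triples :

    ROOT = 0

    def has_head(i, arcs):
        """h(i, A): i already has a head in A."""
        return any(dep == i for _, dep, _ in arcs)

    def shift(stack, buffer, arcs):
        return stack + [buffer[0]], buffer[1:], arcs

    def reduce_(stack, buffer, arcs):
        assert has_head(stack[-1], arcs)
        return stack[:-1], buffer, arcs

    def right_arc(stack, buffer, arcs, k):
        i, j = stack[-1], buffer[0]
        return stack + [j], buffer[1:], arcs | {(i, j, k)}

    def left_arc(stack, buffer, arcs, k):
        i, j = stack[-1], buffer[0]
        assert not has_head(i, arcs) and i != ROOT
        return stack[:-1], buffer, arcs | {(j, i, k)}
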
Example

Root Economic news had little effect on financial markets .

Stack : []
Buffer : [Root, Economic, news, had, little, effect, on, financial, markets, .]
Example

Root Economic news had little effect on financial markets .

Stack : [Root]
Buffer : [Economic, news, had, little, effect, on, financial, markets, .]

S-S
Example

Root Economic news had little effect on financial markets .

Stack : [Root, Economic]
Buffer : [news, had, little, effect, on, financial, markets, .]

S-S-L(amod)
Example

(arcs drawn : amod)

Root Economic news had little effect on financial markets .

Stack : [Root]
Buffer : [news, had, little, effect, on, financial, markets, .]

S-S-L(amod)-S
Example

(arcs drawn : amod)

Root Economic news had little effect on financial markets .

Stack : [Root, news]
Buffer : [had, little, effect, on, financial, markets, .]

S-S-L(amod)-S-L(nsubj)
Example

(arcs drawn : amod, nsubj)

Root Economic news had little effect on financial markets .

Stack : [Root]
Buffer : [had, little, effect, on, financial, markets, .]

S-S-L(amod)-S-L(nsubj)-R(root)
Example

(arcs drawn : amod, nsubj, root)

Root Economic news had little effect on financial markets .

Stack : [Root, had]
Buffer : [little, effect, on, financial, markets, .]

S-S-L(amod)-S-L(nsubj)-R(root)-S
Example

(arcs drawn : amod, nsubj, root, amod)

Root Economic news had little effect on financial markets .

Stack : [Root, had, little]
Buffer : [effect, on, financial, markets, .]

S-S-L(amod)-S-L(nsubj)-R(root)-S-L(amod)
Exercise

What is the oracle sequence of actions for the sentence :

She sent him a letter.
Non-Determinism

1. SH-RA-LA-SH-RA-SH-LA-RE-RA-RE-RA
2. SH-RA-LA-SH-RA-RE-SH-LA-RA-RE-RA

A dependency tree can be generated by several sequences of actions
Open question

How do we go from an oracle to a classifier ?

A multi-class classification problem

• sequence of decisions
• among all legal actions : pick the ‘best’ one
⇒ multi-class problem

Formally :

    a* = arg max_{a∈A} w · φ(a, c)     (1)

where :

• A = set of legal actions
• φ = joint representation of configuration & action
In practice : greedy inference

Algorithm 1 : Greedy decoding for a transition-based parser

    input : a sentence x
    c ← Initial(x)
    while ¬Final(c) do
        a* = arg max_{a∈A} w · φ(a, c)
        c ← c ∘ a*

In practice : Final(c) ⇔ empty buffer
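
Algorithm 1 as a Python sketch, reusing the transition functions above ; initial, legal_actions, apply_action and phi are assumed helpers, and w is a feature-indexed dict :

    def dot(w, feats):
        return sum(w.get(k, 0.0) * v for k, v in feats.items())

    def greedy_parse(x, w):
        """Greedy decoding: always apply the highest-scoring legal action."""
        stack, buffer, arcs = initial(x)            # <[], [0, 1, ..., n], {}>
        while buffer:                               # final = empty buffer
            config = (stack, buffer, arcs)
            a_star = max(legal_actions(config),
                         key=lambda a: dot(w, phi(config, a)))
            stack, buffer, arcs = apply_action(config, a_star)
        return arcs
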
What kind of features ?

• features over tokens relative to S and B
  • pos(S2) = ROOT, pos(S1) = verb, pos(S0) = noun
  • pos(B2) = prep, pos(B1) = adj, pos(B0) = noun
  • word(S2) = ROOT, word(S1) = had, word(S0) = effect
  • word(B2) = on, word(B1) = financial, word(B0) = markets
• features over the (partial) dependency graph defined by A
  • dep(S1) = root, dep(lc(S1)) = nsubj, dep(rc(S1)) = dobj
  • dep(S0) = dobj, dep(lc(S0)) = amod, dep(rc(S0)) = None
• features over the (partial) transition sequence
  • t_{i−1} = Right-Arc(dobj), t_{i−2} = Left-Arc(amod), t_{i−3} = Shift
  • t_{i−4} = Right-Arc(root), t_{i−5} = Left-Arc(nsubj), t_{i−6} = Shift
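
A sketch of such a feature extractor (the string encoding is mine) : each template becomes one key, conjoined with the candidate action :

    def extract_features(config, action, words, pos, history):
        stack, buffer, arcs = config
        feats = {}
        # tokens relative to S and B
        for d in range(3):
            if d < len(stack):
                feats[f"pos(S{d})={pos[stack[-1 - d]]}"] = 1.0
                feats[f"word(S{d})={words[stack[-1 - d]]}"] = 1.0
            if d < len(buffer):
                feats[f"pos(B{d})={pos[buffer[d]]}"] = 1.0
                feats[f"word(B{d})={words[buffer[d]]}"] = 1.0
        # partial dependency graph defined by A: label of the arc reaching S0
        for head, dep, label in arcs:
            if stack and dep == stack[-1]:
                feats[f"dep(S0)={label}"] = 1.0
        # partial transition sequence (last three actions)
        for d, t in enumerate(reversed(history[-3:]), start=1):
            feats[f"t(i-{d})={t}"] = 1.0
        # conjoin everything with the candidate action
        return {f"{k}&a={action}": v for k, v in feats.items()}
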
Training a dependency parser
First idea

• perceptron-like training
• decode
• as soon as an error is made → correct it
• go to the next example / continue decoding
Training of an arc-eager parser

Algorithm 2 : Training with a static oracle

    for t ∈ ⟦1, ..., T⟧ do
        x, y ← Sample(dataset)
        c ← Initial(x)
        while ¬Final(c) do
            a* = arg max_{a∈Legal(c)} φ(c, a) · w
            â = Correct(c)
            if a* ≠ â then
                w ← w + φ(c, â) − φ(c, a*)
            c ← c ∘ â
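
Algorithm 2 in Python (a sketch ; Correct is the static oracle defined on the next slide, the other helpers come from the sketches above) :

    import random

    def train_static(dataset, T, correct):
        """Perceptron training along the gold path."""
        w = {}
        for _ in range(T):
            x, gold_tree = random.choice(dataset)
            config = initial(x)
            while config[1]:                        # while the buffer is not empty
                a_star = max(legal_actions(config),
                             key=lambda a: dot(w, phi(config, a)))
                a_hat = correct(config, gold_tree)  # the single oracle action
                if a_star != a_hat:
                    for k, v in phi(config, a_hat).items():
                        w[k] = w.get(k, 0.0) + v
                    for k, v in phi(config, a_star).items():
                        w[k] = w.get(k, 0.0) - v
                config = apply_action(config, a_hat)  # always follow the oracle
        return w
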
Static oracle

Oracle = expert policy = defines the set of correct actions in a given
state

Given a reference tree T :

    Correct(c) = LeftArc   if top(S_c) ← first(B_c) in T
                 RightArc  if top(S_c) → first(B_c) in T
                 Reduce    if ∃v < top(S_c) : v ↔ first(B_c) in T
                 Shift     otherwise                              (2)

For the moment : the oracle is only defined in ‘gold’ configurations
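
Equation (2) as code (a sketch ; gold arcs are (head, dependent, label) triples as before) :

    def static_oracle(config, gold_arcs):
        """Correct(c) of Eq. (2), defined only on the gold path."""
        stack, buffer, _ = config
        s, b = stack[-1], buffer[0]
        head_of = {dep: head for head, dep, _ in gold_arcs}
        if head_of.get(s) == b:
            return "LeftArc"    # top(S_c) <- first(B_c) in T
        if head_of.get(b) == s:
            return "RightArc"   # top(S_c) -> first(B_c) in T
        if any(head_of.get(v) == b or head_of.get(b) == v for v in range(s)):
            return "Reduce"     # some v < top(S_c) is linked to first(B_c)
        return "Shift"
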
Expert policy

Static oracle

• a single static canonical sequence of actions from the initial to
  the terminal state
• in case of multiple correct transitions : arbitrarily choose one
  (e.g. prioritize shifts over other actions)

Problem with static oracles

• indirectly labels alternative ‘correct’ transitions as incorrect
• the static policy is not well defined in states that are not part of the
  gold transition sequence

⇒ error propagation
Building on the shoulders of giants...

Lessons learned from imitation learning

• the configurations seen during training must be similar to
  the ones seen during testing
• decoding & training must be closely intertwined
• learn to recover from errors

What we need : a dynamic oracle

• complete : can be computed from any configuration
• non-deterministic : more than one transition can be optimal
Dynamic oracle [Goldberg & Nivre, 2013]

New definition

• oracle ≠ generating the reference tree
• an action is correct iff it does not prevent the creation of an
  arc of the reference tree
• notion of reachability

Why is it important ?

• the oracle can be computed in any configuration : we are no longer
  stuck on the ‘gold path’
• the oracle is non-deterministic : we can stay closer to the paths that
  will be seen at test time
In practice...

Training of an arc-eager parser

Algorithm 3 : Training with a dynamic oracle

    for t ∈ ⟦1, ..., T⟧ do
        x, y ← Sample(dataset)
        c ← Initial(x)
        while ¬Final(c) do
            a* = arg max_{a∈Legal(c)} φ(c, a) · w
            â = arg max_{a∈Correct(c)} φ(c, a) · w
            if a* ∉ Correct(c) then
                w ← w + φ(c, â) − φ(c, a*)
            c ← c ∘ a*
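
Algorithm 3 in the same sketch style, reusing the helpers from the previous sketches ; the only changes w.r.t. the static version are that Correct now returns a set and that we keep following the model's own prediction a* :

    def train_dynamic(dataset, T, correct_set):
        w = {}
        for _ in range(T):
            x, gold_tree = random.choice(dataset)
            config = initial(x)
            while config[1]:
                a_star = max(legal_actions(config),
                             key=lambda a: dot(w, phi(config, a)))
                ok = correct_set(config, gold_tree)    # dynamic oracle: a set
                a_hat = max(ok, key=lambda a: dot(w, phi(config, a)))
                if a_star not in ok:
                    for k, v in phi(config, a_hat).items():
                        w[k] = w.get(k, 0.0) + v
                    for k, v in phi(config, a_star).items():
                        w[k] = w.get(k, 0.0) - v
                config = apply_action(config, a_star)  # explore: follow a*
        return w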
