
Imitation Learning

Modeling & learning sequences of decisions

Guillaume Wisniewski
guillaume.wisniewski@limsi.fr
March 2018
Université Paris Sud & LIMSI

Natural Language Understanding

They eat the pizza with anchovies.
(attachment ambiguity : does with anchovies modify eat or the pizza ?)

Dependency parsing

Why is NLU difficult ?

1. words form a discrete space ⇒ no way to ‘program’ the fact that
   cat and cats are the same ‘concept’ (and it is not always true)
2. Zipf’s law : many rare events
3. language is highly structured (recursive structure + generative
   power)
4. language is ambiguous
5. language is evolving

Zipf’s Law

[Figure : rank versus frequency for the first 10 million words in 30 Wikipedias]
Structure Prediction
Dependency parsing

A dependency tree (Tesnière, 1959) is :

• a directed graph : nodes = words, arcs =
  relations between two words
• a tree (i.e. each word has exactly one
  parent, except the root + no cycle)
• a set of relations between words : each
  word is ‘attached’ to the word it qualifies
  (e.g. un → chat ← noir)
• an edge between a parent and its child =
  relation between a head (the parent) and its
  dependent (the child)
• labels on edges
⇒ represents the syntax of a sentence
Our new goal

• given a sentence x = x_1, ..., x_n, predict its dependency tree...
• ...knowing a set of annotated sentences (i.e. sentences with their
  gold dependency tree)
⇒ structure prediction

Structure prediction

The task
• input : sequences
• output : sequences, trees, ...
• both input and output can be divided into sub-parts ⊕
relations/dependencies between these parts

Understanding structure prediction
A new application

Starting with a simpler problem

fixed-size image ⇒ 1 letter among 26

→ multi-class classification
A first hand-written recognition system

The simplest hand-written recognition system :

• segment the picture to extract/isolate each letter (‘simple’ image
  processing)
• recognize each letter independently

→ sub-optimal
A better way...

B C A C A
P R O R E
G T E T O

• take into account the other letters when making a decision
• dependencies between labels
• labels are chosen simultaneously ⇒ joint inference
Why is it easy ?

• we just have to consider all letters and all labels simultaneously
• x = (x_i)_{i=1}^l : sequence of observations
• y = (y_i)_{i=1}^l : sequence of labels
• f(x, y) : joint representation of the whole sequences of
  observations and labels
• features can depend on / represent several labels :
  • f_{α,i}(x, y) = 1 if y_i = A and y_{i+1} = B, 0 otherwise
  • f_{β,i}(x, y) = 1 if there are three consecutive ‘z’, 0 otherwise
  • ...
• the problem boils down to a simple multi-class classification
  problem :

    y* = arg max_{y∈Y} F(x, y; w)
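
To make the reduction concrete, here is a minimal sketch over a toy label set (all names are illustrative, not from the slides) : a joint feature map f(x, y) and an exhaustive arg max over Y, feasible only for very short sequences :

    import itertools

    LABELS = "ABZ"  # toy label set

    def joint_features(x, y):
        """f(x, y): indicator features over the whole sequence pair."""
        feats = {}
        for i in range(len(y) - 1):
            feats[f"trans:{y[i]}>{y[i + 1]}"] = 1.0   # e.g. y_i = A and y_{i+1} = B
        for obs, lab in zip(x, y):
            feats[f"emit:{obs}={lab}"] = 1.0
        if "ZZZ" in "".join(y):
            feats["three_consecutive_z"] = 1.0
        return feats

    def F(x, y, w):
        """F(x, y; w) = w . f(x, y)"""
        return sum(w.get(k, 0.0) * v for k, v in joint_features(x, y).items())

    def argmax_joint(x, w):
        """Exhaustive arg max over Y: enumerates all m**l label sequences."""
        return max(itertools.product(LABELS, repeat=len(x)),
                   key=lambda y: F(x, y, w))
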
Why is it hard ?

• m possible labels for each letter, l letters in the word
• independent decisions : m × l scores to compute
• joint inference : m^l scores to compute
⇒ combinatorial optimization problem
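
Concretely, for a 10-letter word over a 26-letter alphabet :

    m, l = 26, 10
    print(m * l)   # independent decisions: 260 scores
    print(m ** l)  # joint inference: 141 167 095 653 376 scores
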
Transition-Based Dependency
Parsing
General Principle

• sequence of actions to build the output tree incrementally
• state/configuration = describes the partial structure
  ⊕ what remains to be done
• greedy decoding from an initial state (empty tree) to
  a final state (all words have a head)
• in essence : which arc and label should we add next ?
The Arc-Eager Parser [Nivre 2003]

• Configuration : ⟨S, B, A⟩ where S = stack, B = buffer, A = set of arcs that
  have been built
• Initial configuration : ⟨[], [0, 1, ..., n], {}⟩
• Terminal configuration : ⟨S, [], A⟩
• Actions :
  • Shift : ⟨S, i|B, A⟩ → ⟨S|i, B, A⟩
  • Reduce : ⟨S|i, B, A⟩ → ⟨S, B, A⟩ iff h(i, A)
  • Right-Arc(k) : ⟨S|i, j|B, A⟩ → ⟨S|i|j, B, A ∪ {(i, j, k)}⟩
  • Left-Arc(k) : ⟨S|i, j|B, A⟩ → ⟨S, j|B, A ∪ {(j, i, k)}⟩ iff
    ¬h(i, A) ∧ i ≠ ROOT

Notations :

• S|i = stack with top i and remainder S
• j|B = buffer with head j and remainder B
• h(i, A) = i has a head in A
• (i, j, k) = arc with head i, dependent j and label k
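
A minimal executable sketch of these four transitions (the configuration encoding and function names are mine, not from the slides) ; arcs are encoded as (head, dependent, label) triples :

    ROOT = 0

    def has_head(i, arcs):
        """h(i, A): i already has a head in A."""
        return any(dep == i for _, dep, _ in arcs)

    def shift(stack, buffer, arcs):
        return stack + [buffer[0]], buffer[1:], arcs

    def reduce_(stack, buffer, arcs):
        assert has_head(stack[-1], arcs)
        return stack[:-1], buffer, arcs

    def right_arc(stack, buffer, arcs, k):
        i, j = stack[-1], buffer[0]
        return stack + [j], buffer[1:], arcs | {(i, j, k)}

    def left_arc(stack, buffer, arcs, k):
        i, j = stack[-1], buffer[0]
        assert not has_head(i, arcs) and i != ROOT
        return stack[:-1], buffer, arcs | {(j, i, k)}
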
Example

Root Economic news had little effect on financial markets .

Stack : []
Buffer : [Root, Economic, news, had, little, effect, on, financial, markets, .]
Example

Root Economic news had little effect on financial markets .

Stack : [Root]
Buffer : [Economic, news, had, little, effect, on, financial, markets, .]

S-S
Example

Root Economic news had little effect on financial markets .

Stack : [Root, Economic]
Buffer : [news, had, little, effect, on, financial, markets, .]

S-S-L(amod)
Example

(arcs drawn : amod)

Root Economic news had little effect on financial markets .

Stack : [Root]
Buffer : [news, had, little, effect, on, financial, markets, .]

S-S-L(amod)-S
Example

(arcs drawn : amod)

Root Economic news had little effect on financial markets .

Stack : [Root, news]
Buffer : [had, little, effect, on, financial, markets, .]

S-S-L(amod)-S-L(nsubj)
Example

(arcs drawn : amod, nsubj)

Root Economic news had little effect on financial markets .

Stack : [Root]
Buffer : [had, little, effect, on, financial, markets, .]

S-S-L(amod)-S-L(nsubj)-R(root)
Example

(arcs drawn : amod, nsubj, root)

Root Economic news had little effect on financial markets .

Stack : [Root, had]
Buffer : [little, effect, on, financial, markets, .]

S-S-L(amod)-S-L(nsubj)-R(root)-S
Example

(arcs drawn : amod, nsubj, root, amod)

Root Economic news had little effect on financial markets .

Stack : [Root, had, little]
Buffer : [effect, on, financial, markets, .]

S-S-L(amod)-S-L(nsubj)-R(root)-S-L(amod)
Exercise

What is the oracle sequence of actions for the sentence :

She sent him a letter.
Non-Determinism

1. SH-RA-LA-SH-RA-SH-LA-RE-RA-RE-RA
2. SH-RA-LA-SH-RA-RE-SH-LA-RA-RE-RA

A dependency tree can be generated by several sequences of actions
Open question

How do we go from an oracle to a classifier ?

A multi-class classification problem

• sequence of decisions
• among all legal actions : pick the ‘best’ one
⇒ multi-class problem

Formally :

    a* = arg max_{a∈A} w · φ(a, c)     (1)

where :

• A = set of legal actions
• φ = joint representation of configuration & action
In practice : greedy inference

Algorithm 1 : Greedy decoding for a transition-based parser

    input : a sentence x
    c ← Initial(x)
    while ¬Final(c) do
        a* = arg max_{a∈A} w · φ(a, c)
        c ← c ∘ a*

In practice : Final(c) ⇔ empty buffer
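
Algorithm 1 as a Python sketch, reusing the transition functions above ; initial, legal_actions, apply_action and phi are assumed helpers, and w is a feature-indexed dict :

    def dot(w, feats):
        return sum(w.get(k, 0.0) * v for k, v in feats.items())

    def greedy_parse(x, w):
        """Greedy decoding: always apply the highest-scoring legal action."""
        stack, buffer, arcs = initial(x)            # <[], [0, 1, ..., n], {}>
        while buffer:                               # final = empty buffer
            config = (stack, buffer, arcs)
            a_star = max(legal_actions(config),
                         key=lambda a: dot(w, phi(config, a)))
            stack, buffer, arcs = apply_action(config, a_star)
        return arcs
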
What kind of features ?

• features over tokens relative to S and B
  • pos(S2) = ROOT, pos(S1) = verb, pos(S0) = noun
  • pos(B2) = prep, pos(B1) = adj, pos(B0) = noun
  • word(S2) = ROOT, word(S1) = had, word(S0) = effect
  • word(B2) = on, word(B1) = financial, word(B0) = markets
• features over the (partial) dependency graph defined by A
  • dep(S1) = root, dep(lc(S1)) = nsubj, dep(rc(S1)) = dobj
  • dep(S0) = dobj, dep(lc(S0)) = amod, dep(rc(S0)) = None
• features over the (partial) transition sequence
  • t_{i−1} = Right-Arc(dobj), t_{i−2} = Left-Arc(amod), t_{i−3} = Shift
  • t_{i−4} = Right-Arc(root), t_{i−5} = Left-Arc(nsubj), t_{i−6} = Shift
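
A sketch of such a feature extractor (the string encoding is mine) : each template becomes one key, conjoined with the candidate action :

    def extract_features(config, action, words, pos, history):
        stack, buffer, arcs = config
        feats = {}
        # tokens relative to S and B
        for d in range(3):
            if d < len(stack):
                feats[f"pos(S{d})={pos[stack[-1 - d]]}"] = 1.0
                feats[f"word(S{d})={words[stack[-1 - d]]}"] = 1.0
            if d < len(buffer):
                feats[f"pos(B{d})={pos[buffer[d]]}"] = 1.0
                feats[f"word(B{d})={words[buffer[d]]}"] = 1.0
        # partial dependency graph defined by A: label of the arc reaching S0
        for head, dep, label in arcs:
            if stack and dep == stack[-1]:
                feats[f"dep(S0)={label}"] = 1.0
        # partial transition sequence (last three actions)
        for d, t in enumerate(reversed(history[-3:]), start=1):
            feats[f"t(i-{d})={t}"] = 1.0
        # conjoin everything with the candidate action
        return {f"{k}&a={action}": v for k, v in feats.items()}
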
Training a dependency parser
First idea

• perceptron-like training
• decode
• as soon as an error is made → correct it
• go to the next example / continue decoding
Training of an arc-eager parser

Algorithm 2 : Training with a static oracle

    for t ∈ ⟦1, ..., T⟧ do
        x, y ← Sample(dataset)
        c ← Initial(x)
        while ¬Final(c) do
            a* = arg max_{a∈Legal(c)} φ(c, a) · w
            â = Correct(c)
            if a* ≠ â then
                w ← w + φ(c, â) − φ(c, a*)
            c ← c ∘ â
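
Algorithm 2 in Python (a sketch ; Correct is the static oracle defined on the next slide, the other helpers come from the sketches above) :

    import random

    def train_static(dataset, T, correct):
        """Perceptron training along the gold path."""
        w = {}
        for _ in range(T):
            x, gold_tree = random.choice(dataset)
            config = initial(x)
            while config[1]:                        # while the buffer is not empty
                a_star = max(legal_actions(config),
                             key=lambda a: dot(w, phi(config, a)))
                a_hat = correct(config, gold_tree)  # the single oracle action
                if a_star != a_hat:
                    for k, v in phi(config, a_hat).items():
                        w[k] = w.get(k, 0.0) + v
                    for k, v in phi(config, a_star).items():
                        w[k] = w.get(k, 0.0) - v
                config = apply_action(config, a_hat)  # always follow the oracle
        return w
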
Static oracle

Oracle = expert policy = defines the set of correct actions in a given
state

Given a reference tree T :

    Correct(c) = LeftArc   if top(S_c) ← first(B_c) in T
                 RightArc  if top(S_c) → first(B_c) in T
                 Reduce    if ∃v < top(S_c) : v ↔ first(B_c) in T
                 Shift     otherwise                              (2)

For the moment : the oracle is only defined in ‘gold’ configurations
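
Equation (2) as code (a sketch ; gold arcs are (head, dependent, label) triples as before) :

    def static_oracle(config, gold_arcs):
        """Correct(c) of Eq. (2), defined only on the gold path."""
        stack, buffer, _ = config
        s, b = stack[-1], buffer[0]
        head_of = {dep: head for head, dep, _ in gold_arcs}
        if head_of.get(s) == b:
            return "LeftArc"    # top(S_c) <- first(B_c) in T
        if head_of.get(b) == s:
            return "RightArc"   # top(S_c) -> first(B_c) in T
        if any(head_of.get(v) == b or head_of.get(b) == v for v in range(s)):
            return "Reduce"     # some v < top(S_c) is linked to first(B_c)
        return "Shift"
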
Expert policy

Static oracle

• a single static canonical sequence of actions from the initial to
  the terminal state
• in case of multiple correct transitions : arbitrarily choose one
  (e.g. prioritize shifts over other actions)

Problem with static oracles

• indirectly labels alternative ‘correct’ transitions as incorrect
• the static policy is not well defined in states that are not part of the
  gold transition sequence

⇒ error propagation
Building on the shoulders of giants...

Lessons learned from imitation learning

• the configurations seen during training must be similar to
  the ones seen during testing
• decoding & training must be closely intertwined
• learn to recover from errors

What we need : a dynamic oracle

• complete : can be computed from any configuration
• non-deterministic : more than one transition can be optimal
Dynamic oracle [Goldberg & Nivre, 2013]

New definition

• oracle ≠ generating the reference tree
• an action is correct iff it does not prevent the creation of an
  arc of the reference tree
• notion of reachability

Why is it important ?

• the oracle can be computed in any configuration : we are no longer
  stuck on the ‘gold path’
• the oracle is non-deterministic : we can stay closer to the paths that
  will be seen at test time
In practice...

Training of an arc-eager parser

Algorithm 3 : Training with a dynamic oracle

    for t ∈ ⟦1, ..., T⟧ do
        x, y ← Sample(dataset)
        c ← Initial(x)
        while ¬Final(c) do
            a* = arg max_{a∈Legal(c)} φ(c, a) · w
            â = arg max_{a∈Correct(c)} φ(c, a) · w
            if a* ∉ Correct(c) then
                w ← w + φ(c, â) − φ(c, a*)
            c ← c ∘ a*
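
Algorithm 3 in the same sketch style, reusing the helpers from the previous sketches ; the only changes w.r.t. the static version are that Correct now returns a set and that we keep following the model's own prediction a* :

    def train_dynamic(dataset, T, correct_set):
        w = {}
        for _ in range(T):
            x, gold_tree = random.choice(dataset)
            config = initial(x)
            while config[1]:
                a_star = max(legal_actions(config),
                             key=lambda a: dot(w, phi(config, a)))
                ok = correct_set(config, gold_tree)    # dynamic oracle: a set
                a_hat = max(ok, key=lambda a: dot(w, phi(config, a)))
                if a_star not in ok:
                    for k, v in phi(config, a_hat).items():
                        w[k] = w.get(k, 0.0) + v
                    for k, v in phi(config, a_star).items():
                        w[k] = w.get(k, 0.0) - v
                config = apply_action(config, a_star)  # explore: follow a*
        return w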
