
Data Recombination for Neural Semantic Parsing

Robin Jia Percy Liang


Computer Science Department, Stanford University
robinjia@stanford.edu    pliang@cs.stanford.edu

Abstract

Modeling crisp logical regularities is crucial in semantic parsing, making it difficult for neural models with no task-specific prior knowledge to achieve good results. In this paper, we introduce data recombination, a novel framework for injecting such prior knowledge into a model. From the training data, we induce a high-precision synchronous context-free grammar, which captures important conditional independence properties commonly found in semantic parsing. We then train a sequence-to-sequence recurrent network (RNN) model with a novel attention-based copying mechanism on datapoints sampled from this grammar, thereby teaching the model about these structural properties. Data recombination improves the accuracy of our RNN model on three semantic parsing datasets, leading to new state-of-the-art performance on the standard GeoQuery dataset for models with comparable supervision.

Original Examples: "what are the major cities in utah ?", "what states border maine ?"
  → Induce Grammar → Synchronous CFG → Sample New Examples →
Recombinant Examples: "what are the major cities in [states border [maine]] ?", "what are the major cities in [states border [utah]] ?", "what states border [states border [maine]] ?", "what states border [states border [utah]] ?"
  → Train Model → Sequence-to-sequence RNN

Figure 1: An overview of our system. Given a dataset, we induce a high-precision synchronous context-free grammar. We then sample from this grammar to generate new "recombinant" examples, which we use to train a sequence-to-sequence RNN.

1 Introduction

Semantic parsing—the precise translation of natural language utterances into logical forms—has many applications, including question answering (Zelle and Mooney, 1996; Zettlemoyer and Collins, 2005; Zettlemoyer and Collins, 2007; Liang et al., 2011; Berant et al., 2013), instruction following (Artzi and Zettlemoyer, 2013b), and regular expression generation (Kushman and Barzilay, 2013). Modern semantic parsers (Artzi and Zettlemoyer, 2013a; Berant et al., 2013) are complex pieces of software, requiring hand-crafted features, lexicons, and grammars.

Meanwhile, recurrent neural networks (RNNs) have made swift inroads into many structured prediction tasks in NLP, including machine translation (Sutskever et al., 2014; Bahdanau et al., 2014) and syntactic parsing (Vinyals et al., 2015b; Dyer et al., 2015). Because RNNs make very few domain-specific assumptions, they have the potential to succeed at a wide variety of tasks with minimal feature engineering. However, this flexibility also puts RNNs at a disadvantage compared to standard semantic parsers, which can generalize naturally by leveraging their built-in awareness of logical compositionality.

In this paper, we introduce data recombination, a generic framework for declaratively injecting prior knowledge into a domain-general structured prediction model.
In data recombination, prior knowledge about a task is used to build a high-precision generative model that expands the empirical distribution by allowing fragments of different examples to be combined in particular ways. Samples from this generative model are then used to train a domain-general model. In the case of semantic parsing, we construct a generative model by inducing a synchronous context-free grammar (SCFG), creating new examples such as those shown in Figure 1; our domain-general model is a sequence-to-sequence RNN with a novel attention-based copying mechanism. Data recombination boosts the accuracy of our RNN model on three semantic parsing datasets. On the GEO dataset, data recombination improves test accuracy by 4.3 percentage points over our baseline RNN, leading to new state-of-the-art results for models that do not use a seed lexicon for predicates.

GEO
  x: "what is the population of iowa ?"
  y: _answer ( NV , ( _population ( NV , V1 ) , _const ( V0 , _stateid ( iowa ) ) ) )
ATIS
  x: "can you list all flights from chicago to milwaukee"
  y: ( _lambda $0 e ( _and ( _flight $0 ) ( _from $0 chicago : _ci ) ( _to $0 milwaukee : _ci ) ) )
Overnight
  x: "when is the weekly standup"
  y: ( call listValue ( call getProperty meeting.weekly_standup ( string start_time ) ) )

Figure 2: One example from each of our domains. We tokenize logical forms as shown, thereby casting semantic parsing as a sequence-to-sequence task.
2 Problem statement

We cast semantic parsing as a sequence-to-sequence task. The input utterance x is a sequence of words x_1, . . . , x_m ∈ V^(in), the input vocabulary; similarly, the output logical form y is a sequence of tokens y_1, . . . , y_n ∈ V^(out), the output vocabulary. A linear sequence of tokens might appear to lose the hierarchical structure of a logical form, but there is precedent for this choice: Vinyals et al. (2015b) showed that an RNN can reliably predict tree-structured outputs in a linear fashion.

We evaluate our system on three existing semantic parsing datasets. Figure 2 shows sample input-output pairs from each of these datasets.

• GeoQuery (GEO) contains natural language questions about US geography paired with corresponding Prolog database queries. We use the standard split of 600 training examples and 280 test examples introduced by Zettlemoyer and Collins (2005). We preprocess the logical forms to De Bruijn index notation to standardize variable naming.

• ATIS (ATIS) contains natural language queries for a flights database paired with corresponding database queries written in lambda calculus. We train on 4473 examples and evaluate on the 448 test examples used by Zettlemoyer and Collins (2007).

• Overnight (OVERNIGHT) contains logical forms paired with natural language paraphrases across eight varied subdomains. Wang et al. (2015) constructed the dataset by generating all possible logical forms up to some depth threshold, then getting multiple natural language paraphrases for each logical form from workers on Amazon Mechanical Turk. We evaluate on the same train/test splits as Wang et al. (2015).

In this paper, we only explore learning from logical forms. In the last few years, there has been an emergence of semantic parsers learned from denotations (Clarke et al., 2010; Liang et al., 2011; Berant et al., 2013; Artzi and Zettlemoyer, 2013b). While our system cannot directly learn from denotations, it could be used to rerank candidate derivations generated by one of these other systems.
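To make the sequence-to-sequence framing concrete, the sketch below shows one way a GEO example could be represented as a pair of token sequences. The whitespace tokenizer and the example pair are illustrative assumptions for this sketch, not the exact preprocessing used in our system.

```python
# Illustrative sketch: casting a semantic parsing example as a pair of
# token sequences (assumed preprocessing, not the authors' exact pipeline).

def tokenize(text):
    # The logical forms in Figure 2 are already written with spaces around
    # parentheses and commas, so whitespace splitting suffices here.
    return text.split()

x = "what is the population of iowa ?"
y = "_answer ( NV , ( _population ( NV , V1 ) , _const ( V0 , _stateid ( iowa ) ) ) )"

x_tokens = tokenize(x)   # sequence over the input vocabulary V^(in)
y_tokens = tokenize(y)   # sequence over the output vocabulary V^(out)

print(x_tokens)
print(y_tokens)
```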
3 Sequence-to-sequence RNN Model

Our sequence-to-sequence RNN model is based on existing attention-based neural machine translation models (Bahdanau et al., 2014; Luong et al., 2015a), but also includes a novel attention-based copying mechanism. Similar copying mechanisms have been explored in parallel by Gu et al. (2016) and Gulcehre et al. (2016).

3.1 Basic Model

Encoder. The encoder converts the input sequence x_1, . . . , x_m into a sequence of context-sensitive embeddings b_1, . . . , b_m using a bidirectional RNN (Bahdanau et al., 2014). First, a word embedding function φ^(in) maps each word x_i to a fixed-dimensional vector. These vectors are fed as input to two RNNs: a forward RNN and a backward RNN. The forward RNN starts with an initial hidden state h^F_0, and generates a sequence of hidden states h^F_1, . . . , h^F_m by repeatedly applying the recurrence

    h^F_i = LSTM(φ^(in)(x_i), h^F_{i−1}).                        (1)

The recurrence takes the form of an LSTM (Hochreiter and Schmidhuber, 1997). The backward RNN similarly generates hidden states h^B_m, . . . , h^B_1 by processing the input sequence in reverse order. Finally, for each input position i, we define the context-sensitive embedding b_i to be the concatenation of h^F_i and h^B_i.

Decoder. The decoder is an attention-based model (Bahdanau et al., 2014; Luong et al., 2015a) that generates the output sequence y_1, . . . , y_n one token at a time. At each time step j, it writes y_j based on the current hidden state s_j, then updates the hidden state to s_{j+1} based on s_j and y_j. Formally, the decoder is defined by the following equations:

    s_1 = tanh(W^(s) [h^F_m, h^B_1]).                            (2)
    e_{ji} = s_j^⊤ W^(a) b_i.                                    (3)
    α_{ji} = exp(e_{ji}) / Σ_{i′=1}^{m} exp(e_{ji′}).            (4)
    c_j = Σ_{i=1}^{m} α_{ji} b_i.                                (5)
    P(y_j = w | x, y_{1:j−1}) ∝ exp(U_w [s_j, c_j]).             (6)
    s_{j+1} = LSTM([φ^(out)(y_j), c_j], s_j).                    (7)

When not specified, i ranges over {1, . . . , m} and j ranges over {1, . . . , n}. Intuitively, the α_{ji}'s define a probability distribution over the input words, describing what words in the input the decoder is focusing on at time j. They are computed from the unnormalized attention scores e_{ji}. The matrices W^(s), W^(a), and U, as well as the embedding function φ^(out), are parameters of the model.
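The following NumPy sketch spells out equations (3)–(6) for a single decoder time step. The parameter shapes, the softmax helper, and the toy dimensions are assumptions made for illustration; this is not the authors' Theano implementation.

```python
import numpy as np

def softmax(v):
    v = v - np.max(v)
    e = np.exp(v)
    return e / e.sum()

def decoder_step(s_j, B, W_a, U):
    """One attention step, following Eqs. (3)-(6).

    s_j : (d_s,)              current decoder hidden state
    B   : (m, d_b)            context-sensitive embeddings b_1..b_m from the encoder
    W_a : (d_s, d_b)          attention bilinear form W^(a)
    U   : (vocab, d_s + d_b)  output projection
    Returns attention weights, context vector, and P(y_j = w | x, y_{1:j-1}).
    """
    e_j = B @ (W_a.T @ s_j)            # Eq. (3): e_ji = s_j^T W^(a) b_i, for all i
    alpha_j = softmax(e_j)             # Eq. (4): normalized attention weights
    c_j = alpha_j @ B                  # Eq. (5): context vector, convex combination of b_i
    logits = U @ np.concatenate([s_j, c_j])
    p_yj = softmax(logits)             # Eq. (6): distribution over the output vocabulary
    return alpha_j, c_j, p_yj

# Toy shapes for illustration only.
rng = np.random.default_rng(0)
m, d_s, d_b, vocab = 5, 8, 6, 10
alpha, c, p = decoder_step(rng.normal(size=d_s),
                           rng.normal(size=(m, d_b)),
                           rng.normal(size=(d_s, d_b)),
                           rng.normal(size=(vocab, d_s + d_b)))
```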
3.2 Attention-based Copying

In the basic model of the previous section, the next output word y_j is chosen via a simple softmax over all words in the output vocabulary. However, this model has difficulty generalizing to the long tail of entity names commonly found in semantic parsing datasets. Conveniently, entity names in the input often correspond directly to tokens in the output (e.g., "iowa" becomes iowa in Figure 2).¹

¹ On GEO and ATIS, we make a point not to rely on orthography for non-entities such as "state" to _state, since this leverages information not available to previous models (Zettlemoyer and Collins, 2005) and is much less language-independent.

To capture this intuition, we introduce a new attention-based copying mechanism. At each time step j, the decoder generates one of two types of actions. As before, it can write any word in the output vocabulary. In addition, it can copy any input word x_i directly to the output, where the probability with which we copy x_i is determined by the attention score on x_i. Formally, we define a latent action a_j that is either Write[w] for some w ∈ V^(out) or Copy[i] for some i ∈ {1, . . . , m}. We then have

    P(a_j = Write[w] | x, y_{1:j−1}) ∝ exp(U_w [s_j, c_j]),      (8)
    P(a_j = Copy[i] | x, y_{1:j−1}) ∝ exp(e_{ji}).               (9)

The decoder chooses a_j with a softmax over all these possible actions; y_j is then a deterministic function of a_j and x. During training, we maximize the log-likelihood of y, marginalizing out a.

Attention-based copying can be seen as a combination of a standard softmax output layer of an attention-based model (Bahdanau et al., 2014) and a Pointer Network (Vinyals et al., 2015a); in a Pointer Network, the only way to generate output is to copy a symbol from the input.
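A minimal sketch of the combined Write/Copy distribution of equations (8)–(9): the write scores and the raw attention scores are pooled into a single softmax over actions. The shapes and variable names are assumptions for illustration only.

```python
import numpy as np

def action_distribution(s_j, c_j, e_j, U):
    """Softmax over Write[w] and Copy[i] actions, following Eqs. (8)-(9).

    s_j : (d_s,)              decoder hidden state
    c_j : (d_b,)              attention context vector
    e_j : (m,)                unnormalized attention scores on the m input words
    U   : (vocab, d_s + d_b)  output projection
    Returns (p_write, p_copy): probabilities over vocabulary words and input positions.
    """
    write_scores = U @ np.concatenate([s_j, c_j])   # Eq. (8), up to normalization
    scores = np.concatenate([write_scores, e_j])    # Eq. (9): the copy score for i is just e_ji
    scores -= scores.max()
    probs = np.exp(scores) / np.exp(scores).sum()   # one softmax over all actions
    return probs[:len(write_scores)], probs[len(write_scores):]
```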
Examples
  ("what states border texas ?",
   answer(NV, (state(V0), next_to(V0, NV), const(V0, stateid(texas)))))
  ("what is the highest mountain in ohio ?",
   answer(NV, highest(V0, (mountain(V0), loc(V0, NV), const(V0, stateid(ohio))))))

Rules created by ABSENTITIES
  ROOT → ⟨"what states border STATEID ?",
          answer(NV, (state(V0), next_to(V0, NV), const(V0, stateid(STATEID))))⟩
  STATEID → ⟨"texas", texas⟩
  ROOT → ⟨"what is the highest mountain in STATEID ?",
          answer(NV, highest(V0, (mountain(V0), loc(V0, NV), const(V0, stateid(STATEID)))))⟩
  STATEID → ⟨"ohio", ohio⟩

Rules created by ABSWHOLEPHRASES
  ROOT → ⟨"what states border STATE ?", answer(NV, (state(V0), next_to(V0, NV), STATE))⟩
  STATE → ⟨"states border texas", state(V0), next_to(V0, NV), const(V0, stateid(texas))⟩
  ROOT → ⟨"what is the highest mountain in STATE ?",
          answer(NV, highest(V0, (mountain(V0), loc(V0, NV), STATE)))⟩

Rules created by CONCAT-2
  ROOT → ⟨SENT1 </s> SENT2, SENT1 </s> SENT2⟩
  SENT → ⟨"what states border texas ?",
          answer(NV, (state(V0), next_to(V0, NV), const(V0, stateid(texas))))⟩
  SENT → ⟨"what is the highest mountain in ohio ?",
          answer(NV, highest(V0, (mountain(V0), loc(V0, NV), const(V0, stateid(ohio)))))⟩

Figure 3: Various grammar induction strategies illustrated on GEO. Each strategy converts the rules of an input grammar into rules of an output grammar. This figure shows the base case where the input grammar has rules ROOT → ⟨x, y⟩ for each (x, y) pair in the training dataset.

4 Data Recombination

4.1 Motivation

The main contribution of this paper is a novel data recombination framework that injects important prior knowledge into our otherwise oblivious sequence-to-sequence RNN. In this framework, we induce a high-precision generative model from the training data, then sample from it to generate new training examples. The process of inducing this generative model can leverage any available prior knowledge, which is transmitted through the generated examples to the RNN model. A key advantage of our two-stage approach is that it allows us to declare desired properties of the task which might be hard to capture in the model architecture.

Our approach generalizes data augmentation, which is commonly employed to inject prior knowledge into a model. Data augmentation techniques focus on modeling invariances—transformations like translating an image or adding noise that alter the inputs x, but do not change the output y. These techniques have proven effective in areas like computer vision (Krizhevsky et al., 2012) and speech recognition (Jaitly and Hinton, 2013).

In semantic parsing, however, we would like to capture more than just invariance properties. Consider an example with the utterance "what states border texas ?". Given this example, it should be easy to generalize to questions where "texas" is replaced by the name of any other state: simply replace the mention of Texas in the logical form with the name of the new state. Underlying this phenomenon is a strong conditional independence principle: the meaning of the rest of the sentence is independent of the name of the state in question. Standard data augmentation is not sufficient to model such phenomena: instead of holding y fixed, we would like to apply simultaneous transformations to x and y such that the new x still maps to the new y. Data recombination addresses this need.

4.2 General Setting

In the general setting of data recombination, we start with a training set D of (x, y) pairs, which defines the empirical distribution p̂(x, y). We then fit a generative model p̃(x, y) to p̂ which generalizes beyond the support of p̂, for example by splicing together fragments of different examples. We refer to examples in the support of p̃ as recombinant examples. Finally, to train our actual model p_θ(y | x), we maximize the expected value of log p_θ(y | x), where (x, y) is drawn from p̃.

4.3 SCFGs for Semantic Parsing

For semantic parsing, we induce a synchronous context-free grammar (SCFG) to serve as the backbone of our generative model p̃. An SCFG consists of a set of production rules X → ⟨α, β⟩, where X is a category (non-terminal), and α and β are sequences of terminal and non-terminal symbols. Any non-terminal symbols in α must be aligned to the same non-terminal symbol in β, and vice versa. Therefore, an SCFG defines a set of joint derivations of aligned pairs of strings.
In our case, we use an SCFG to represent joint derivations of utterances x and logical forms y (which for us are just sequences of tokens). After we induce an SCFG G from D, the corresponding generative model p̃(x, y) is the distribution over pairs (x, y) defined by sampling from G, where we choose production rules to apply uniformly at random.

It is instructive to compare our SCFG-based data recombination with WASP (Wong and Mooney, 2006; Wong and Mooney, 2007), which uses an SCFG as the actual semantic parsing model. The grammar induced by WASP must have good coverage in order to generalize to new inputs at test time. WASP also requires the implementation of an efficient algorithm for computing the conditional probability p(y | x). In contrast, our SCFG is only used to convey prior knowledge about conditional independence structure, so it only needs to have high precision; our RNN model is responsible for boosting recall over the entire input space. We also only need to forward sample from the SCFG, which is considerably easier to implement than conditional inference.

Below, we examine various strategies for inducing a grammar G from a dataset D. We first encode D as an initial grammar with rules ROOT → ⟨x, y⟩ for each (x, y) ∈ D. Next, we will define each grammar induction strategy as a mapping from an input grammar G_in to a new grammar G_out. This formulation allows us to compose grammar induction strategies (Section 4.3.4).

4.3.1 Abstracting Entities

Our first grammar induction strategy, ABSENTITIES, simply abstracts entities with their types. We assume that each entity e (e.g., texas) has a corresponding type e.t (e.g., state), which we infer based on the presence of certain predicates in the logical form (e.g., stateid). For each grammar rule X → ⟨α, β⟩ in G_in, where α contains a token (e.g., "texas") that string matches an entity (e.g., texas) in β, we add two rules to G_out: (i) a rule where both occurrences are replaced with the type of the entity (e.g., state), and (ii) a new rule that maps the type to the entity (e.g., STATEID → ⟨"texas", texas⟩; we reserve the category name STATE for the next section). Thus, G_out generates recombinant examples that fuse most of one example with an entity found in a second example. A concrete example from the GEO domain is given in Figure 3.

4.3.2 Abstracting Whole Phrases

Our second grammar induction strategy, ABSWHOLEPHRASES, abstracts both entities and whole phrases with their types. For each grammar rule X → ⟨α, β⟩ in G_in, we add up to two rules to G_out. First, if α contains tokens that string match to an entity in β, we replace both occurrences with the type of the entity, similarly to rule (i) from ABSENTITIES. Second, if we can infer that the entire expression β evaluates to a set of a particular type (e.g., state), we create a rule that maps the type to ⟨α, β⟩. In practice, we also use some simple rules to strip question identifiers from α, so that the resulting examples are more natural. Again, refer to Figure 3 for a concrete example.

This strategy works because of a more general conditional independence property: the meaning of any semantically coherent phrase is conditionally independent of the rest of the sentence, the cornerstone of compositional semantics. Note that this assumption is not always correct in general: for example, phenomena like anaphora that involve long-range context dependence violate this assumption. However, this property holds in most existing semantic parsing datasets.

4.3.3 Concatenation

The final grammar induction strategy is a surprisingly simple approach we tried that turns out to work. For any k ≥ 2, we define the CONCAT-k strategy, which creates two types of rules. First, we create a single rule that has ROOT going to a sequence of k SENT's. Then, for each root-level rule ROOT → ⟨α, β⟩ in G_in, we add the rule SENT → ⟨α, β⟩ to G_out. See Figure 3 for an example.

Unlike ABSENTITIES and ABSWHOLEPHRASES, concatenation is very general, and can be applied to any sequence transduction problem. Of course, it also does not introduce additional information about compositionality or independence properties present in semantic parsing. However, it does generate harder examples for the attention-based RNN, since the model must learn to attend to the correct parts of the now-longer input sequence. Related work has shown that training a model on more difficult examples can improve generalization, the most canonical case being dropout (Hinton et al., 2012; Wager et al., 2013).
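To make the rule manipulations above concrete, the sketch below uses an assumed minimal representation: each rule is a (category, alpha, beta) triple of token lists, with aligned non-terminals written as CATEGORY#k on both sides. It implements simplified versions of ABSENTITIES and CONCAT-k over root rules and forward-samples recombinant pairs uniformly at random. The entity table, category names, and toy grammar are illustrative assumptions, not the system's actual code.

```python
import random

# logical-form entity token -> (surface phrase, category); the real system
# infers the type from predicates such as stateid in the logical form.
ENTITIES = {"texas": ("texas", "STATEID"), "ohio": ("ohio", "STATEID")}

def abs_entities(rules_in):
    """Simplified ABSENTITIES: abstract matching entity mentions with their category."""
    rules_out = []
    for cat, alpha, beta in rules_in:
        for entity, (phrase, ent_cat) in ENTITIES.items():
            if phrase in alpha and entity in beta:
                nt = ent_cat + "#1"
                rules_out.append((cat,
                                  [nt if t == phrase else t for t in alpha],
                                  [nt if t == entity else t for t in beta]))
                rules_out.append((ent_cat, [phrase], [entity]))
    return rules_out

def concat_k(rules_in, k=2, sep="</s>"):
    """Simplified CONCAT-k: ROOT -> k aligned SENT's; each old root rule becomes a SENT rule."""
    seq = []
    for i in range(1, k + 1):
        if i > 1:
            seq.append(sep)
        seq.append(f"SENT#{i}")
    rules_out = [("ROOT", list(seq), list(seq))]
    rules_out += [("SENT", a, b) for cat, a, b in rules_in if cat == "ROOT"]
    return rules_out

def sample(rules, category="ROOT"):
    """Forward-sample an (utterance, logical form) token pair, choosing among a
    category's rules uniformly at random; aligned non-terminals share one expansion."""
    by_cat = {}
    for cat, a, b in rules:
        by_cat.setdefault(cat, []).append((a, b))
    def expand(cat):
        alpha, beta = random.choice(by_cat[cat])
        subs = {t: expand(t.split("#")[0]) for t in set(alpha)
                if "#" in t and t.split("#")[0] in by_cat}
        def realize(side, idx):
            out = []
            for t in side:
                out.extend(subs[t][idx] if t in subs else [t])
            return out
        return realize(alpha, 0), realize(beta, 1)
    return expand(category)

# Base grammar: one ROOT rule per training example (here, a single toy example).
root = ("ROOT",
        "what states border texas ?".split(),
        "answer ( NV , ( state ( V0 ) , next_to ( V0 , NV ) , const ( V0 , stateid ( texas ) ) ) )".split())

x, y = sample([root] + abs_entities([root]))       # ABSENTITIES-augmented grammar
print(" ".join(x)); print(" ".join(y))
cx, cy = sample(concat_k([root]))                   # CONCAT-2 over the base grammar
print(" ".join(cx)); print(" ".join(cy))
```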
function TRAIN(dataset D, number of epochs T, number of examples to sample n)
    Induce grammar G from D
    Initialize RNN parameters θ randomly
    for each iteration t = 1, . . . , T do
        Compute current learning rate η_t
        Initialize current dataset D_t to D
        for i = 1, . . . , n do
            Sample new example (x′, y′) from G
            Add (x′, y′) to D_t
        end for
        Shuffle D_t
        for each example (x, y) in D_t do
            θ ← θ + η_t ∇ log p_θ(y | x)
        end for
    end for
end function

Figure 4: The training procedure with data recombination. We first induce an SCFG, then sample new recombinant examples from it at each epoch.

4.3.4 Composition

We note that grammar induction strategies can be composed, yielding more complex grammars. Given any two grammar induction strategies f_1 and f_2, the composition f_1 ◦ f_2 is the grammar induction strategy that takes in G_in and returns f_1(f_2(G_in)). For the strategies we have defined, we can perform this operation symbolically on the grammar rules, without having to sample from the intermediate grammar f_2(G_in).
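Composition of strategies is just function composition on grammars; a short sketch under the same assumed rule representation as above:

```python
def compose(f1, f2):
    """f1 composed with f2: apply strategy f2 to the input grammar, then f1 to its output."""
    return lambda rules_in: f1(f2(rules_in))

# For example, assuming abs_whole_phrases is implemented analogously to abs_entities,
# an AWP + AE configuration could be written as
#   strategy = compose(abs_entities, abs_whole_phrases)
# and applied to the initial ROOT rules in a single symbolic pass.
```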
5 Experiments

We evaluate our system on three domains: GEO, ATIS, and OVERNIGHT. For ATIS, we report logical form exact match accuracy. For GEO and OVERNIGHT, we determine correctness based on denotation match, as in Liang et al. (2011) and Wang et al. (2015), respectively.

5.1 Choice of Grammar Induction Strategy

We note that not all grammar induction strategies make sense for all domains. In particular, we only apply ABSWHOLEPHRASES to GEO and OVERNIGHT. We do not apply ABSWHOLEPHRASES to ATIS, as the dataset has little nesting structure.

5.2 Implementation Details

We tokenize logical forms in a domain-specific manner, based on the syntax of the formal language being used. On GEO and ATIS, we disallow copying of predicate names to ensure a fair comparison to previous work, as string matching between input words and predicate names is not commonly used. We prevent copying by prepending underscores to predicate tokens; see Figure 2 for examples.

On ATIS alone, when doing attention-based copying and data recombination, we leverage an external lexicon that maps natural language phrases (e.g., "kennedy airport") to entities (e.g., jfk:ap). When we copy a word that is part of a phrase in the lexicon, we write the entity associated with that lexicon entry. When performing data recombination, we identify entity alignments based on matching phrases and entities from the lexicon.

We run all experiments with 200 hidden units and 100-dimensional word vectors. We initialize all parameters uniformly at random within the interval [−0.1, 0.1]. We maximize the log-likelihood of the correct logical form using stochastic gradient descent. We train the model for a total of 30 epochs with an initial learning rate of 0.1, and halve the learning rate every 5 epochs, starting after epoch 15. We replace word vectors for words that occur only once in the training set with a universal <unk> word vector. Our model is implemented in Theano (Bergstra et al., 2010).

When performing data recombination, we sample a new round of recombinant examples from our grammar at each epoch. We add these examples to the original training dataset, randomly shuffle all examples, and train the model for the epoch. Figure 4 gives pseudocode for this training procedure. One important hyperparameter is how many examples to sample at each epoch: we found that a good rule of thumb is to sample as many recombinant examples as there are examples in the training dataset, so that half of the examples the model sees at each epoch are recombinant.
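Putting the schedule and the per-epoch resampling together, here is a compact sketch of the training loop that mirrors Figure 4. The functions sample_from_grammar and grad_log_likelihood, and the dict-of-arrays parameter container, are assumed placeholders rather than parts of the actual implementation.

```python
import random

def train(train_data, grammar, params, sample_from_grammar, grad_log_likelihood,
          epochs=30, lr0=0.1):
    """Sketch of the procedure in Figure 4 with the schedule from Section 5.2:
    30 epochs of SGD, learning rate halved every 5 epochs starting after epoch 15,
    and one freshly sampled recombinant example per original training example."""
    for epoch in range(1, epochs + 1):
        halvings = max(0, (epoch - 16) // 5 + 1)   # 0 halvings through epoch 15
        lr = lr0 * (0.5 ** halvings)
        recombinant = [sample_from_grammar(grammar) for _ in train_data]
        epoch_data = list(train_data) + recombinant   # half original, half recombinant
        random.shuffle(epoch_data)
        for x, y in epoch_data:
            for name, g in grad_log_likelihood(params, x, y).items():
                params[name] += lr * g                # ascend log p_theta(y | x)
    return params
```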
At test time, we use beam search with beam size 5. We automatically balance missing right parentheses by adding them at the end. On GEO and OVERNIGHT, we then pick the highest-scoring logical form that does not yield an executor error when the corresponding denotation is computed. On ATIS, we just pick the top prediction on the beam.
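The parenthesis repair is simple to state precisely; a sketch of one way to do it is to append however many ")" tokens are needed to balance the prediction:

```python
def balance_parens(tokens):
    """Append missing right parentheses at the end of a predicted token sequence."""
    depth = 0
    for tok in tokens:
        if tok == "(":
            depth += 1
        elif tok == ")":
            depth = max(0, depth - 1)   # ignore spurious closers
    return tokens + [")"] * depth

print(balance_parens("( _lambda $0 e ( _and ( _flight $0 )".split()))
```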
5.3 Impact of the Copying Mechanism

First, we measure the contribution of the attention-based copying mechanism to the model's overall performance. On each task, we train and evaluate two models: one with the copying mechanism, and one without. Training is done without data recombination. The results are shown in Table 1.

                  GEO    ATIS   OVERNIGHT
  No Copying      74.6   69.9   76.7
  With Copying    85.0   76.3   75.8

Table 1: Test accuracy on GEO, ATIS, and OVERNIGHT, both with and without copying. On OVERNIGHT, we average across all eight domains.

On GEO and ATIS, the copying mechanism helps significantly: it improves test accuracy by 10.4 percentage points on GEO and 6.4 points on ATIS. However, on OVERNIGHT, adding the copying mechanism actually makes our model perform slightly worse. This result is somewhat expected, as the OVERNIGHT dataset contains a very small number of distinct entities. It is also notable that both systems surpass the previous best system on OVERNIGHT by a wide margin.

We choose to use the copying mechanism in all subsequent experiments, as it has a large advantage in realistic settings where there are many distinct entities in the world. The concurrent work of Gu et al. (2016) and Gulcehre et al. (2016), both of whom propose similar copying mechanisms, provides additional evidence for the utility of copying on a wide range of NLP tasks.

5.4 Main Results

For our main results, we train our model with a variety of data recombination strategies on all three datasets. These results are summarized in Tables 2 and 3. We compare our system to the baseline of not using any data recombination, as well as to state-of-the-art systems on all three datasets.

                                    GEO    ATIS
  Previous Work
    Zettlemoyer and Collins (2007)    –    84.6
    Kwiatkowski et al. (2010)       88.9     –
    Liang et al. (2011)²            91.1     –
    Kwiatkowski et al. (2011)       88.6   82.8
    Poon (2013)                       –    83.5
    Zhao and Huang (2015)           88.9   84.2
  Our Model
    No Recombination                85.0   76.3
    ABSENTITIES                     85.4   79.9
    ABSWHOLEPHRASES                 87.5     –
    CONCAT-2                        84.6   79.0
    CONCAT-3                          –    77.5
    AWP + AE                        88.9     –
    AE + C2                           –    78.8
    AWP + AE + C2                   89.3     –
    AE + C3                           –    83.3

Table 2: Test accuracy using different data recombination strategies on GEO and ATIS. AE is ABSENTITIES, AWP is ABSWHOLEPHRASES, C2 is CONCAT-2, and C3 is CONCAT-3.

We find that data recombination consistently improves accuracy across the three domains we evaluated on, and that the strongest results come from composing multiple strategies. Combining ABSWHOLEPHRASES, ABSENTITIES, and CONCAT-2 yields a 4.3 percentage point improvement over the baseline without data recombination on GEO, and an average of 1.7 percentage points on OVERNIGHT. In fact, on GEO, we achieve test accuracy of 89.3%, which surpasses the previous state-of-the-art, excluding Liang et al. (2011), which used a seed lexicon for predicates. On ATIS, we experiment with concatenating more than 2 examples, to make up for the fact that we cannot apply ABSWHOLEPHRASES, which generates longer examples. We obtain a test accuracy of 83.3% with ABSENTITIES composed with CONCAT-3, which beats the baseline by 7 percentage points and is competitive with the state-of-the-art.

Data recombination without copying. For completeness, we also investigated the effects of data recombination on the model without attention-based copying. We found that recombination helped significantly on GEO and ATIS, but hurt the model slightly on OVERNIGHT. On GEO, the best data recombination strategy yielded test accuracy of 82.9%, for a gain of 8.3 percentage points over the baseline with no copying and no recombination; on ATIS, data recombination gives test accuracies as high as 74.6%, a 4.7 point gain over the same baseline. However, no data recombination strategy improved average test accuracy on OVERNIGHT; the best one resulted in a 0.3 percentage point decrease in test accuracy. We hypothesize that data recombination helps less on OVERNIGHT in general because the space of possible logical forms is very limited, making it more like a large multiclass classification task. Therefore, it is less important for the model to learn good compositional representations that generalize to new logical forms at test time.

² The method of Liang et al. (2011) is not comparable to ours, as they used a seed lexicon mapping words to predicates. We explicitly avoid using such prior knowledge in our system.
                       BASKETBALL  BLOCKS  CALENDAR  HOUSING  PUBLICATIONS  RECIPES  RESTAURANTS  SOCIAL  Avg.
  Previous Work
    Wang et al. (2015)     46.3     41.9     74.4      54.0       59.0        70.8       75.9      48.2   58.8
  Our Model
    No Recombination       85.2     58.1     78.0      71.4       76.4        79.6       76.2      81.4   75.8
    ABSENTITIES            86.7     60.2     78.0      65.6       73.9        77.3       79.5      81.3   75.3
    ABSWHOLEPHRASES        86.7     55.9     79.2      69.8       76.4        77.8       80.7      80.9   75.9
    CONCAT-2               84.7     60.7     75.6      69.8       74.5        80.1       79.5      80.8   75.7
    AWP + AE               85.2     54.1     78.6      67.2       73.9        79.6       81.9      82.1   75.3
    AWP + AE + C2          87.5     60.2     81.0      72.5       78.3        81.0       79.5      79.6   77.5

Table 3: Test accuracy using different data recombination strategies on the OVERNIGHT tasks.

Depth-2 (same length)
  x: "rel:12 of rel:17 of ent:14"
  y: ( _rel:12 ( _rel:17 _ent:14 ) )
Depth-4 (longer)
  x: "rel:23 of rel:36 of rel:38 of rel:10 of ent:05"
  y: ( _rel:23 ( _rel:36 ( _rel:38 ( _rel:10 _ent:05 ) ) ) )

Figure 5: A sample of our artificial data.

[Plot omitted: test accuracy (%) versus number of additional examples (0–500), with four curves: same length/independent, longer/independent, same length/recombinant, longer/recombinant.]

Figure 6: The results of our artificial data experiments. We see that the model learns more from longer examples than from same-length examples.

5.5 Effect of Longer Examples

Interestingly, strategies like ABSWHOLEPHRASES and CONCAT-2 help the model even though the resulting recombinant examples are generally not in the support of the test distribution. In particular, these recombinant examples are on average longer than those in the actual dataset, which makes them harder for the attention-based model. Indeed, for every domain, our best accuracy numbers involved some form of concatenation, and often involved ABSWHOLEPHRASES as well. In comparison, applying ABSENTITIES alone, which generates examples of the same length as those in the original dataset, was generally less effective.

We conducted additional experiments on artificial data to investigate the importance of adding longer, harder examples. We experimented with adding new examples via data recombination, as well as adding new independent examples (e.g., to simulate the acquisition of more training data). We constructed a simple world containing a set of entities and a set of binary relations. For any n, we can generate a set of depth-n examples, which involve the composition of n relations applied to a single entity. Example data points are shown in Figure 5.
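A sketch of how such depth-n examples could be generated is shown below. The entity and relation inventories and the exact surface/logical templates are assumptions chosen to match the format of Figure 5, not a description of our actual data generator.

```python
import random

def make_depth_n_example(n, num_entities=20, num_relations=50, rng=random):
    """Generate one depth-n example in the style of Figure 5: n relations
    composed and applied to a single entity."""
    rels = [f"rel:{rng.randrange(num_relations):02d}" for _ in range(n)]
    ent = f"ent:{rng.randrange(num_entities):02d}"
    x = " of ".join(rels + [ent])
    y = ""
    for r in rels:
        y += f"( _{r} "
    y += f"_{ent}" + " )" * len(rels)
    return x, y

random.seed(0)
print(make_depth_n_example(2))   # a depth-2 (x, y) pair, like the first example in Figure 5
print(make_depth_n_example(4))   # a longer, depth-4 pair
```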
We train our model on various datasets, then test it on a set of 500 randomly chosen depth-2 examples. The model always has access to a small seed training set of 100 depth-2 examples. We then add one of four types of examples to the training set:

• Same length, independent: New randomly chosen depth-2 examples.³

• Longer, independent: Randomly chosen depth-4 examples.

• Same length, recombinant: Depth-2 examples sampled from the grammar induced by applying ABSENTITIES to the seed dataset.

• Longer, recombinant: Depth-4 examples sampled from the grammar induced by applying ABSWHOLEPHRASES followed by ABSENTITIES to the seed dataset.

³ Technically, these are not completely independent, as we sample these new examples without replacement. The same applies to the longer "independent" examples.

To maintain consistency between the independent and recombinant experiments, we fix the recombinant examples across all epochs, instead of resampling at every epoch.
In Figure 6, we plot accuracy on the test set versus the number of additional examples added of each of these four types. As expected, independent examples are more helpful than the recombinant ones, but both help the model improve considerably. In addition, we see that even though the test dataset only has short examples, adding longer examples helps the model more than adding shorter ones, in both the independent and recombinant cases. These results underscore the importance of training on longer, harder examples.

6 Discussion

In this paper, we have presented a novel framework we term data recombination, in which we generate new training examples from a high-precision generative model induced from the original training dataset. We have demonstrated its effectiveness in improving the accuracy of a sequence-to-sequence RNN model on three semantic parsing datasets, using a synchronous context-free grammar as our generative model.

There has been growing interest in applying neural networks to semantic parsing and related tasks. Dong and Lapata (2016) concurrently developed an attention-based RNN model for semantic parsing, although they did not use data recombination. Grefenstette et al. (2014) proposed a non-recurrent neural model for semantic parsing, though they did not run experiments. Mei et al. (2016) use an RNN model to perform the related task of instruction following.

Our proposed attention-based copying mechanism bears a strong resemblance to two models that were developed independently by other groups. Gu et al. (2016) apply a very similar copying mechanism to text summarization and single-turn dialogue generation. Gulcehre et al. (2016) propose a model that decides at each step whether to write from a "shortlist" vocabulary or copy from the input, and report improvements on machine translation and text summarization. Another piece of related work is Luong et al. (2015b), who train a neural machine translation system to copy rare words, relying on an external system to generate alignments.

Prior work has explored using paraphrasing for data augmentation on NLP tasks. Zhang et al. (2015) augment their data by swapping out words for synonyms from WordNet. Wang and Yang (2015) use a similar strategy, but identify similar words and phrases based on cosine distance between vector space embeddings. Unlike our data recombination strategies, these techniques only change inputs x, while keeping the labels y fixed. Additionally, these paraphrasing-based transformations can be described in terms of grammar induction, so they can be incorporated into our framework.

In data recombination, data generated by a high-precision generative model is used to train a second, domain-general model. Generative oversampling (Liu et al., 2007) learns a generative model in a multiclass classification setting, then uses it to generate additional examples from rare classes in order to combat label imbalance. Uptraining (Petrov et al., 2010) uses data labeled by an accurate but slow model to train a computationally cheaper second model. Vinyals et al. (2015b) generate a large dataset of constituency parse trees by taking sentences that multiple existing systems parse in the same way, and train a neural model on this dataset.

Some of our induced grammars generate examples that are not in the test distribution, but nonetheless aid in generalization. Related work has also explored the idea of training on altered or out-of-domain data, often interpreting it as a form of regularization. Dropout training has been shown to be a form of adaptive regularization (Hinton et al., 2012; Wager et al., 2013). Guu et al. (2015) showed that encouraging a knowledge base completion model to handle longer path queries acts as a form of structural regularization.

Language is a blend of crisp regularities and soft relationships. Our work takes RNNs, which excel at modeling soft phenomena, and uses a highly structured tool—synchronous context-free grammars—to infuse them with an understanding of crisp structure. We believe this paradigm for simultaneously modeling the soft and hard aspects of language should have broader applicability beyond semantic parsing.

Acknowledgments. This work was supported by the NSF Graduate Research Fellowship under Grant No. DGE-114747, and the DARPA Communicating with Computers (CwC) program under ARO prime contract no. W911NF-15-1-0462.

Reproducibility. All code, data, and experiments for this paper are available on the CodaLab platform at https://worksheets.codalab.org/worksheets/0x50757a37779b485f89012e4ba03b6f4f/.
References

[Artzi and Zettlemoyer2013a] Y. Artzi and L. Zettlemoyer. 2013a. UW SPF: The University of Washington semantic parsing framework. arXiv preprint arXiv:1311.3011.

[Artzi and Zettlemoyer2013b] Y. Artzi and L. Zettlemoyer. 2013b. Weakly supervised learning of semantic parsers for mapping instructions to actions. Transactions of the Association for Computational Linguistics (TACL), 1:49–62.

[Bahdanau et al.2014] D. Bahdanau, K. Cho, and Y. Bengio. 2014. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473.

[Berant et al.2013] J. Berant, A. Chou, R. Frostig, and P. Liang. 2013. Semantic parsing on Freebase from question-answer pairs. In Empirical Methods in Natural Language Processing (EMNLP).

[Bergstra et al.2010] J. Bergstra, O. Breuleux, F. Bastien, P. Lamblin, R. Pascanu, G. Desjardins, J. Turian, D. Warde-Farley, and Y. Bengio. 2010. Theano: a CPU and GPU math expression compiler. In Python for Scientific Computing Conference.

[Clarke et al.2010] J. Clarke, D. Goldwasser, M. Chang, and D. Roth. 2010. Driving semantic parsing from the world's response. In Computational Natural Language Learning (CoNLL), pages 18–27.

[Dong and Lapata2016] L. Dong and M. Lapata. 2016. Language to logical form with neural attention. In Association for Computational Linguistics (ACL).

[Dyer et al.2015] C. Dyer, M. Ballesteros, W. Ling, A. Matthews, and N. A. Smith. 2015. Transition-based dependency parsing with stack long short-term memory. In Association for Computational Linguistics (ACL).

[Grefenstette et al.2014] E. Grefenstette, P. Blunsom, N. de Freitas, and K. M. Hermann. 2014. A deep architecture for semantic parsing. In ACL Workshop on Semantic Parsing, pages 22–27.

[Gu et al.2016] J. Gu, Z. Lu, H. Li, and V. O. Li. 2016. Incorporating copying mechanism in sequence-to-sequence learning. In Association for Computational Linguistics (ACL).

[Gulcehre et al.2016] C. Gulcehre, S. Ahn, R. Nallapati, B. Zhou, and Y. Bengio. 2016. Pointing the unknown words. In Association for Computational Linguistics (ACL).

[Guu et al.2015] K. Guu, J. Miller, and P. Liang. 2015. Traversing knowledge graphs in vector space. In Empirical Methods in Natural Language Processing (EMNLP).

[Hinton et al.2012] G. E. Hinton, N. Srivastava, A. Krizhevsky, I. Sutskever, and R. R. Salakhutdinov. 2012. Improving neural networks by preventing co-adaptation of feature detectors. arXiv preprint arXiv:1207.0580.

[Hochreiter and Schmidhuber1997] S. Hochreiter and J. Schmidhuber. 1997. Long short-term memory. Neural Computation, 9(8):1735–1780.

[Jaitly and Hinton2013] N. Jaitly and G. E. Hinton. 2013. Vocal tract length perturbation (VTLP) improves speech recognition. In International Conference on Machine Learning (ICML).

[Krizhevsky et al.2012] A. Krizhevsky, I. Sutskever, and G. E. Hinton. 2012. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems (NIPS), pages 1097–1105.

[Kushman and Barzilay2013] N. Kushman and R. Barzilay. 2013. Using semantic unification to generate regular expressions from natural language. In Human Language Technology and North American Association for Computational Linguistics (HLT/NAACL), pages 826–836.

[Kwiatkowski et al.2010] T. Kwiatkowski, L. Zettlemoyer, S. Goldwater, and M. Steedman. 2010. Inducing probabilistic CCG grammars from logical form with higher-order unification. In Empirical Methods in Natural Language Processing (EMNLP), pages 1223–1233.

[Kwiatkowski et al.2011] T. Kwiatkowski, L. Zettlemoyer, S. Goldwater, and M. Steedman. 2011. Lexical generalization in CCG grammar induction for semantic parsing. In Empirical Methods in Natural Language Processing (EMNLP), pages 1512–1523.

[Liang et al.2011] P. Liang, M. I. Jordan, and D. Klein. 2011. Learning dependency-based compositional semantics. In Association for Computational Linguistics (ACL), pages 590–599.

[Liu et al.2007] A. Liu, J. Ghosh, and C. Martin. 2007. Generative oversampling for mining imbalanced datasets. In International Conference on Data Mining (DMIN).

[Luong et al.2015a] M. Luong, H. Pham, and C. D. Manning. 2015a. Effective approaches to attention-based neural machine translation. In Empirical Methods in Natural Language Processing (EMNLP), pages 1412–1421.

[Luong et al.2015b] M. Luong, I. Sutskever, Q. V. Le, O. Vinyals, and W. Zaremba. 2015b. Addressing the rare word problem in neural machine translation. In Association for Computational Linguistics (ACL), pages 11–19.

[Mei et al.2016] H. Mei, M. Bansal, and M. R. Walter. 2016. Listen, attend, and walk: Neural mapping of navigational instructions to action sequences. In Association for the Advancement of Artificial Intelligence (AAAI).

[Petrov et al.2010] S. Petrov, P. Chang, M. Ringgaard, and H. Alshawi. 2010. Uptraining for accurate deterministic question parsing. In Empirical Methods in Natural Language Processing (EMNLP).

[Poon2013] H. Poon. 2013. Grounded unsupervised semantic parsing. In Association for Computational Linguistics (ACL).

[Sutskever et al.2014] I. Sutskever, O. Vinyals, and Q. V. Le. 2014. Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems (NIPS), pages 3104–3112.

[Vinyals et al.2015a] O. Vinyals, M. Fortunato, and N. Jaitly. 2015a. Pointer networks. In Advances in Neural Information Processing Systems (NIPS), pages 2674–2682.

[Vinyals et al.2015b] O. Vinyals, L. Kaiser, T. Koo, S. Petrov, I. Sutskever, and G. Hinton. 2015b. Grammar as a foreign language. In Advances in Neural Information Processing Systems (NIPS), pages 2755–2763.

[Wager et al.2013] S. Wager, S. I. Wang, and P. Liang. 2013. Dropout training as adaptive regularization. In Advances in Neural Information Processing Systems (NIPS).

[Wang and Yang2015] W. Y. Wang and D. Yang. 2015. That's so annoying!!!: A lexical and frame-semantic embedding based data augmentation approach to automatic categorization of annoying behaviors using #petpeeve tweets. In Empirical Methods in Natural Language Processing (EMNLP).

[Wang et al.2015] Y. Wang, J. Berant, and P. Liang. 2015. Building a semantic parser overnight. In Association for Computational Linguistics (ACL).

[Wong and Mooney2006] Y. W. Wong and R. J. Mooney. 2006. Learning for semantic parsing with statistical machine translation. In North American Association for Computational Linguistics (NAACL), pages 439–446.

[Wong and Mooney2007] Y. W. Wong and R. J. Mooney. 2007. Learning synchronous grammars for semantic parsing with lambda calculus. In Association for Computational Linguistics (ACL), pages 960–967.

[Zelle and Mooney1996] M. Zelle and R. J. Mooney. 1996. Learning to parse database queries using inductive logic programming. In Association for the Advancement of Artificial Intelligence (AAAI), pages 1050–1055.

[Zettlemoyer and Collins2005] L. S. Zettlemoyer and M. Collins. 2005. Learning to map sentences to logical form: Structured classification with probabilistic categorial grammars. In Uncertainty in Artificial Intelligence (UAI), pages 658–666.

[Zettlemoyer and Collins2007] L. S. Zettlemoyer and M. Collins. 2007. Online learning of relaxed CCG grammars for parsing to logical form. In Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP/CoNLL), pages 678–687.

[Zhang et al.2015] X. Zhang, J. Zhao, and Y. LeCun. 2015. Character-level convolutional networks for text classification. In Advances in Neural Information Processing Systems (NIPS).

[Zhao and Huang2015] K. Zhao and L. Huang. 2015. Type-driven incremental semantic parsing with polymorphism. In North American Association for Computational Linguistics (NAACL).
