Abstract

Modeling crisp logical regularities is crucial in semantic parsing, making it difficult for neural models with no task-specific prior knowledge to achieve good results. In this paper, we introduce data recombination, a novel framework for injecting such prior knowledge into a model. From the training data, we induce a high-precision synchronous context-free grammar, which captures important conditional independence properties commonly found in semantic parsing. We then train a sequence-to-sequence recurrent network (RNN) model with a novel attention-based copying mechanism on datapoints sampled from this grammar, thereby teaching the model about these structural properties. Data recombination improves the accuracy of our RNN model on three semantic parsing datasets, leading to new state-of-the-art performance on the standard GeoQuery dataset for models with comparable supervision.

Figure 1: An overview of our system. Given a dataset, we induce a high-precision synchronous context-free grammar. We then sample from this grammar to generate new “recombinant” examples, which we use to train a sequence-to-sequence RNN. [The figure shows the pipeline Induce Grammar → Synchronous CFG → Sample New Examples → Recombinant Examples → Train Model → Sequence-to-sequence RNN, with recombinant examples such as:
  what are the major cities in [states border [maine]] ?
  what are the major cities in [states border [utah]] ?
  what states border [states border [maine]] ?
  what states border [states border [utah]] ?]
1 Introduction

Semantic parsing—the precise translation of natural language utterances into logical forms—has many applications, including question answering (Zelle and Mooney, 1996; Zettlemoyer and Collins, 2005; Zettlemoyer and Collins, 2007; Liang et al., 2011; Berant et al., 2013), instruction following (Artzi and Zettlemoyer, 2013b), and regular expression generation (Kushman and Barzilay, 2013). Modern semantic parsers (Artzi and Zettlemoyer, 2013a; Berant et al., 2013) are complex pieces of software, requiring hand-crafted features, lexicons, and grammars. Meanwhile, recurrent neural networks (RNNs) have made swift inroads into many structured prediction tasks in NLP, including machine translation (Sutskever et al., 2014; Bahdanau et al., 2014) and syntactic parsing (Vinyals et al., 2015b; Dyer et al., 2015). Because RNNs make very few domain-specific assumptions, they have the potential to succeed at a wide variety of tasks with minimal feature engineering. However, this flexibility also puts RNNs at a disadvantage compared to standard semantic parsers, which can generalize naturally by leveraging their built-in awareness of logical compositionality.

In this paper, we introduce data recombination, a generic framework for declaratively injecting prior knowledge into a domain-general structured prediction model. In data recombination, prior knowledge about a task is used to build a high-precision generative model that expands the empirical distribution by allowing fragments of different examples to be combined in particular ways. Samples from this generative model are then used to train a domain-general model. In the case of semantic parsing, we construct a generative model by inducing a synchronous context-free grammar (SCFG), creating new examples such as those shown in Figure 1; our domain-general model is a sequence-to-sequence RNN with a novel attention-based copying mechanism. Data recombination boosts the accuracy of our RNN model on three semantic parsing datasets. On the GEO dataset, data recombination improves test accuracy by 4.3 percentage points over our baseline RNN, leading to new state-of-the-art results for models that do not use a seed lexicon for predicates.
2 Problem statement

We cast semantic parsing as a sequence-to-sequence task. The input utterance x is a sequence of words x_1, ..., x_m ∈ V^(in), the input vocabulary; similarly, the output logical form y is a sequence of tokens y_1, ..., y_n ∈ V^(out), the output vocabulary. A linear sequence of tokens might appear to lose the hierarchical structure of a logical form, but there is precedent for this choice: Vinyals et al. (2015b) showed that an RNN can reliably predict tree-structured outputs in a linear fashion.

We evaluate our system on three existing semantic parsing datasets. Figure 2 shows sample input-output pairs from each of these datasets.

• GeoQuery (GEO) contains natural language questions about US geography paired with corresponding Prolog database queries. We use the standard split of 600 training examples and 280 test examples introduced by Zettlemoyer and Collins (2005). We preprocess the logical forms to De Bruijn index notation to standardize variable naming.

• ATIS (ATIS) contains natural language queries for a flights database paired with corresponding database queries written in lambda calculus. We train on 4473 examples and evaluate on the 448 test examples used by Zettlemoyer and Collins (2007).

• Overnight (OVERNIGHT) contains logical forms paired with natural language paraphrases across eight varied subdomains. Wang et al. (2015) constructed the dataset by generating all possible logical forms up to some depth threshold, then getting multiple natural language paraphrases for each logical form from workers on Amazon Mechanical Turk. We evaluate on the same train/test splits as Wang et al. (2015).

In this paper, we only explore learning from logical forms. In the last few years, there has been an emergence of semantic parsers learned from denotations (Clarke et al., 2010; Liang et al., 2011; Berant et al., 2013; Artzi and Zettlemoyer, 2013b). While our system cannot directly learn from denotations, it could be used to rerank candidate derivations generated by one of these other systems.

GEO
  x: “what is the population of iowa ?”
  y: _answer ( NV , ( _population ( NV , V1 ) , _const ( V0 , _stateid ( iowa ) ) ) )
ATIS
  x: “can you list all flights from chicago to milwaukee”
  y: ( _lambda $0 e ( _and ( _flight $0 ) ( _from $0 chicago : _ci ) ( _to $0 milwaukee : _ci ) ) )
Overnight
  x: “when is the weekly standup”
  y: ( call listValue ( call getProperty meeting.weekly_standup ( string start_time ) ) )

Figure 2: One example from each of our domains. We tokenize logical forms as shown, thereby casting semantic parsing as a sequence-to-sequence task.
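To make this sequence-to-sequence framing concrete, the following minimal Python sketch tokenizes a GEO-style logical form into the space-separated form shown in Figure 2. The tokenize_logical_form helper and its splitting rule are illustrative assumptions, not the preprocessing code used by the authors.

    import re

    def tokenize_logical_form(lf):
        """Split a logical form into tokens by isolating parentheses and commas.
        Hypothetical helper; the authors' actual preprocessing may differ."""
        return re.sub(r"([(),])", r" \1 ", lf).split()

    print(tokenize_logical_form(
        "_answer(NV,(_population(NV,V1),_const(V0,_stateid(iowa))))"))
    # ['_answer', '(', 'NV', ',', '(', '_population', '(', 'NV', ',', 'V1', ')', ...]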
3 Sequence-to-sequence RNN Model

Our sequence-to-sequence RNN model is based on existing attention-based neural machine translation models (Bahdanau et al., 2014; Luong et al., 2015a), but also includes a novel attention-based copying mechanism. Similar copying mechanisms have been explored in parallel by Gu et al. (2016) and Gulcehre et al. (2016).
3.1 Basic Model

Encoder. The encoder converts the input sequence x_1, ..., x_m into a sequence of context-sensitive embeddings b_1, ..., b_m using a bidirectional RNN (Bahdanau et al., 2014).
First, a word embedding function φ^(in) maps each word x_i to a fixed-dimensional vector. These vectors are fed as input to two RNNs: a forward RNN and a backward RNN. The forward RNN starts with an initial hidden state h^F_0, and generates a sequence of hidden states h^F_1, ..., h^F_m by repeatedly applying the recurrence

  h^F_i = LSTM(φ^(in)(x_i), h^F_{i-1}).   (1)

The recurrence takes the form of an LSTM (Hochreiter and Schmidhuber, 1997). The backward RNN similarly generates hidden states h^B_m, ..., h^B_1 by processing the input sequence in reverse order. Finally, for each input position i, we define the context-sensitive embedding b_i to be the concatenation of h^F_i and h^B_i.
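As a concrete reference, here is a minimal sketch of such a bidirectional encoder in PyTorch. The framework choice, layer sizes, and class interface are assumptions for illustration; the paper does not prescribe a particular implementation.

    import torch.nn as nn

    class Encoder(nn.Module):
        """Bidirectional LSTM encoder producing b_i = [h_i^F; h_i^B] (Eq. 1).
        Illustrative sketch; hyperparameters are arbitrary assumptions."""
        def __init__(self, vocab_size, emb_dim=100, hidden_dim=200):
            super().__init__()
            self.embed = nn.Embedding(vocab_size, emb_dim)   # phi^(in)
            self.bilstm = nn.LSTM(emb_dim, hidden_dim,
                                  bidirectional=True, batch_first=True)

        def forward(self, x):
            # x: (batch, m) tensor of word indices.
            emb = self.embed(x)        # (batch, m, emb_dim)
            b, _ = self.bilstm(emb)    # (batch, m, 2 * hidden_dim)
            # b[:, i] concatenates the forward and backward hidden states at
            # position i, i.e. the context-sensitive embedding b_i.
            return b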
Decoder. The decoder is an attention-based model (Bahdanau et al., 2014; Luong et al., 2015a) that generates the output sequence y_1, ..., y_n one token at a time. At each time step j, it writes y_j based on the current hidden state s_j, then updates the hidden state to s_{j+1} based on s_j and y_j. Formally, the decoder is defined by the following equations:

  s_1 = tanh(W^(s) [h^F_m, h^B_1]).                      (2)
  e_{ji} = s_j^T W^(a) b_i.                              (3)
  α_{ji} = exp(e_{ji}) / Σ_{i'=1}^{m} exp(e_{ji'}).      (4)
  c_j = Σ_{i=1}^{m} α_{ji} b_i.                          (5)
  P(y_j = w | x, y_{1:j-1}) ∝ exp(U_w [s_j, c_j]).       (6)
  s_{j+1} = LSTM([φ^(out)(y_j), c_j], s_j).              (7)

When not specified, i ranges over {1, ..., m} and j ranges over {1, ..., n}. Intuitively, the α_{ji}'s define a probability distribution over the input words, describing what words in the input the decoder is focusing on at time j. They are computed from the unnormalized attention scores e_{ji}. The matrices W^(s), W^(a), and U, as well as the embedding function φ^(out), are parameters of the model.
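The attention computation of Equations 3–6 for a single decoder step can be sketched in a few lines of NumPy. The function name, variable names, and dimensions below are illustrative assumptions, not the authors' code.

    import numpy as np

    def softmax(v):
        v = v - v.max()
        e = np.exp(v)
        return e / e.sum()

    def decoder_step(s_j, B, W_a, U):
        """One attention step (Eqs. 3-6): scores, weights, context, write dist.
        s_j: (d_s,) decoder state; B: (m, d_b) encoder embeddings b_1..b_m;
        W_a: (d_s, d_b); U: (|V_out|, d_s + d_b)."""
        e_j = B @ (W_a.T @ s_j)                             # Eq. (3): e_{ji}
        alpha_j = softmax(e_j)                              # Eq. (4)
        c_j = alpha_j @ B                                   # Eq. (5)
        p_write = softmax(U @ np.concatenate([s_j, c_j]))   # Eq. (6)
        return e_j, alpha_j, c_j, p_write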
3.2 Attention-based Copying

In the basic model of the previous section, the next output word y_j is chosen via a simple softmax over all words in the output vocabulary. However, this model has difficulty generalizing to the long tail of entity names commonly found in semantic parsing datasets. Conveniently, entity names in the input often correspond directly to tokens in the output (e.g., “iowa” becomes iowa in Figure 2).¹

To capture this intuition, we introduce a new attention-based copying mechanism. At each time step j, the decoder generates one of two types of actions. As before, it can write any word in the output vocabulary. In addition, it can copy any input word x_i directly to the output, where the probability with which we copy x_i is determined by the attention score on x_i. Formally, we define a latent action a_j that is either Write[w] for some w ∈ V^(out) or Copy[i] for some i ∈ {1, ..., m}. We then have

  P(a_j = Write[w] | x, y_{1:j-1}) ∝ exp(U_w [s_j, c_j]),   (8)
  P(a_j = Copy[i] | x, y_{1:j-1}) ∝ exp(e_{ji}).            (9)

The decoder chooses a_j with a softmax over all these possible actions; y_j is then a deterministic function of a_j and x. During training, we maximize the log-likelihood of y, marginalizing out a.

Attention-based copying can be seen as a combination of a standard softmax output layer of an attention-based model (Bahdanau et al., 2014) and a Pointer Network (Vinyals et al., 2015a); in a Pointer Network, the only way to generate output is to copy a symbol from the input.

¹ On GEO and ATIS, we make a point not to rely on orthography for non-entities such as “state” to _state, since this leverages information not available to previous models (Zettlemoyer and Collins, 2005) and is much less language-independent.
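The following NumPy sketch combines the write scores of Equation 8 with the copy scores of Equation 9 in a single softmax, and computes the marginal probability of a token by summing over all actions that emit it, as in the training objective above. Function and variable names are illustrative assumptions.

    import numpy as np

    def action_distribution(write_scores, copy_scores):
        """Joint softmax over Write[w] and Copy[i] actions (Eqs. 8-9).
        write_scores: (|V_out|,) values U_w [s_j, c_j]; copy_scores: (m,) values e_{ji}."""
        scores = np.concatenate([write_scores, copy_scores])
        p = np.exp(scores - scores.max())
        p = p / p.sum()
        return p[:len(write_scores)], p[len(write_scores):]

    def token_probability(w, x_tokens, vocab_index, p_write, p_copy):
        """P(y_j = w | x, y_{1:j-1}): marginalize over the latent action a_j,
        i.e. Write[w] plus Copy[i] for every input position with x_i == w."""
        prob = p_write[vocab_index[w]] if w in vocab_index else 0.0
        prob += sum(p_copy[i] for i, tok in enumerate(x_tokens) if tok == w)
        return prob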
4 Data Recombination

4.1 Motivation

The main contribution of this paper is a novel data recombination framework that injects important prior knowledge into our oblivious sequence-to-sequence RNN. In this framework, we induce a high-precision generative model from the training data, then sample from it to generate new training examples. The process of inducing this generative model can leverage any available prior knowledge, which is transmitted through the generated examples to the RNN model. A key advantage of our two-stage approach is that it allows us to declare desired properties of the task which might be hard to capture in the model architecture.
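At a high level, the two-stage approach reduces to the short loop sketched below. Here induce_grammar, sample_example, and train_rnn are placeholders for the grammar induction strategy, the SCFG sampler, and the RNN trainer; their interfaces are assumptions rather than the authors' API.

    def data_recombination(train_data, induce_grammar, sample_example,
                           train_rnn, num_samples):
        """Two-stage pipeline of Section 4.1 (illustrative sketch)."""
        grammar = induce_grammar(train_data)   # high-precision generative model
        recombinant = [sample_example(grammar) for _ in range(num_samples)]
        # Train the domain-general model on original plus recombinant examples.
        return train_rnn(train_data + recombinant)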
Examples
  (“what states border texas ?”,
   answer(NV, (state(V0), next_to(V0, NV), const(V0, stateid(texas)))))
  (“what is the highest mountain in ohio ?”,
   answer(NV, highest(V0, (mountain(V0), loc(V0, NV), const(V0, stateid(ohio))))))

Rules created by ABSENTITIES
  ROOT → ⟨“what states border STATEID ?”,
          answer(NV, (state(V0), next_to(V0, NV), const(V0, stateid(STATEID))))⟩
  STATEID → ⟨“texas”, texas⟩
  ROOT → ⟨“what is the highest mountain in STATEID ?”,
          answer(NV, highest(V0, (mountain(V0), loc(V0, NV), const(V0, stateid(STATEID)))))⟩
  STATEID → ⟨“ohio”, ohio⟩

Rules created by ABSWHOLEPHRASES
  ROOT → ⟨“what states border STATE ?”, answer(NV, (state(V0), next_to(V0, NV), STATE))⟩
  STATE → ⟨“states border texas”, state(V0), next_to(V0, NV), const(V0, stateid(texas))⟩
  ROOT → ⟨“what is the highest mountain in STATE ?”,
          answer(NV, highest(V0, (mountain(V0), loc(V0, NV), STATE)))⟩

Rules created by CONCAT-2
  ROOT → ⟨SENT1 </s> SENT2, SENT1 </s> SENT2⟩
  SENT → ⟨“what states border texas ?”,
          answer(NV, (state(V0), next_to(V0, NV), const(V0, stateid(texas))))⟩
  SENT → ⟨“what is the highest mountain in ohio ?”,
          answer(NV, highest(V0, (mountain(V0), loc(V0, NV), const(V0, stateid(ohio)))))⟩

Figure 3: Various grammar induction strategies illustrated on GEO. Each strategy converts the rules of an input grammar into rules of an output grammar. This figure shows the base case where the input grammar has rules ROOT → ⟨x, y⟩ for each (x, y) pair in the training dataset.
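To illustrate how the ABSENTITIES strategy in Figure 3 gives rise to recombinant examples, here is a small, self-contained Python sketch. The hard-coded entity list, the string-replacement rule format, and the sampling procedure are simplifying assumptions for illustration; the actual system induces typed SCFG rules rather than manipulating raw strings.

    import random

    # Toy entity lexicon; in the real system entities are identified from the
    # training data, so this hard-coded list is purely illustrative.
    STATE_ENTITIES = {"texas": "texas", "ohio": "ohio", "maine": "maine", "utah": "utah"}

    def abs_entities(examples):
        """ABSENTITIES-style rule creation (cf. Figure 3): abstract an entity
        that appears in both the utterance and the logical form to STATEID."""
        root_rules, entity_rules = [], set()
        for x, y in examples:
            for surface, lf_token in STATE_ENTITIES.items():
                if surface in x.split() and lf_token in y:
                    root_rules.append((x.replace(surface, "STATEID"),
                                       y.replace(lf_token, "STATEID")))
                    entity_rules.add((surface, lf_token))
        return root_rules, sorted(entity_rules)

    def sample_recombinant(root_rules, entity_rules):
        """Sample a new example by expanding STATEID with a random entity rule."""
        x_tmpl, y_tmpl = random.choice(root_rules)
        surface, lf_token = random.choice(entity_rules)
        return x_tmpl.replace("STATEID", surface), y_tmpl.replace("STATEID", lf_token)

    examples = [
        ("what states border texas ?",
         "answer(NV,(state(V0),next_to(V0,NV),const(V0,stateid(texas))))"),
        ("what is the highest mountain in ohio ?",
         "answer(NV,highest(V0,(mountain(V0),loc(V0,NV),const(V0,stateid(ohio)))))"),
    ]
    roots, ents = abs_entities(examples)
    print(sample_recombinant(roots, ents))
    # e.g. ('what is the highest mountain in texas ?',
    #       'answer(NV,highest(V0,(mountain(V0),loc(V0,NV),const(V0,stateid(texas)))))')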
Table 3: Test accuracy using different data recombination strategies on the OVERNIGHT tasks.
[Gu et al.2016] J. Gu, Z. Lu, H. Li, and V. O. Li. 2016. Incorporating copying mechanism in sequence-to-sequence learning. In Association for Computational Linguistics (ACL).

[Gulcehre et al.2016] C. Gulcehre, S. Ahn, R. Nallapati, B. Zhou, and Y. Bengio. 2016. Pointing the unknown words. In Association for Computational Linguistics (ACL).

[Guu et al.2015] K. Guu, J. Miller, and P. Liang. 2015. Traversing knowledge graphs in vector space. In Empirical Methods in Natural Language Processing (EMNLP).

[Luong et al.2015a] M. Luong, H. Pham, and C. D. Manning. 2015a. Effective approaches to attention-based neural machine translation. In Empirical Methods in Natural Language Processing (EMNLP), pages 1412–1421.

[Luong et al.2015b] M. Luong, I. Sutskever, Q. V. Le, O. Vinyals, and W. Zaremba. 2015b. Addressing the rare word problem in neural machine translation. In Association for Computational Linguistics (ACL), pages 11–19.

[Mei et al.2016] H. Mei, M. Bansal, and M. R. Walter. 2016. Listen, attend, and walk: Neural mapping of navigational instructions to action sequences. In Association for the Advancement of Artificial Intelligence (AAAI).

[Petrov et al.2010] S. Petrov, P. Chang, M. Ringgaard, and H. Alshawi. 2010. Uptraining for accurate deterministic question parsing. In Empirical Methods in Natural Language Processing (EMNLP).

[Poon2013] H. Poon. 2013. Grounded unsupervised semantic parsing. In Association for Computational Linguistics (ACL).

[Sutskever et al.2014] I. Sutskever, O. Vinyals, and Q. V. Le. 2014. Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems (NIPS), pages 3104–3112.

[Zettlemoyer and Collins2007] L. S. Zettlemoyer and M. Collins. 2007. Online learning of relaxed CCG grammars for parsing to logical form. In Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP/CoNLL), pages 678–687.

[Zhang et al.2015] X. Zhang, J. Zhao, and Y. LeCun. 2015. Character-level convolutional networks for text classification. In Advances in Neural Information Processing Systems (NIPS).

[Zhao and Huang2015] K. Zhao and L. Huang. 2015. Type-driven incremental semantic parsing with polymorphism. In North American Association for Computational Linguistics (NAACL).