
Knowledge as a Teacher:
Knowledge-Guided Structural Attention Networks

Yun-Nung Chen*    Dilek Hakkani-Tür†    Gokhan Tur†
Asli Celikyilmaz‡    Jianfeng Gao‡    Li Deng‡

*National Taiwan University, Taipei, Taiwan
†Google Research, Mountain View, CA
‡Microsoft Research, Redmond, WA

{y.v.chen, dilek, gokhan}@ieee.org    {aslicel, jfgao, deng}@microsoft.com

arXiv:1609.03286v1 [cs.AI] 12 Sep 2016

Abstract

Natural language understanding (NLU) is a core component of a spoken dialogue system. Recently, recurrent neural networks (RNN) have obtained strong results on NLU due to their superior ability to preserve sequential information over time. Traditionally, the NLU module tags semantic slots for utterances considering only their flat structures, as the underlying RNN structure is a linear chain. However, natural language exhibits linguistic properties that provide rich, structured information for better understanding. This paper introduces a novel model, knowledge-guided structural attention networks (K-SAN), a generalization of RNN that additionally incorporates non-flat network topologies guided by prior knowledge. The model has two characteristics: 1) important substructures can be captured from small amounts of training data, allowing the model to generalize to previously unseen test data; 2) the model automatically identifies the salient substructures that are essential for predicting the semantic tags of the given sentences, so that the understanding performance can be improved. Experiments on the benchmark Air Travel Information System (ATIS) data show that the proposed K-SAN architecture can effectively extract salient knowledge from substructures with an attention mechanism, and outperforms state-of-the-art neural network based frameworks.

1 Introduction

In the past decade, goal-oriented spoken dialogue systems (SDS), such as the virtual personal assistants Microsoft's Cortana and Apple's Siri, have been incorporated into various devices, allowing users to speak to systems freely in order to finish tasks more efficiently. A key component of these conversational systems is the natural language understanding (NLU) module, which refers to the targeted understanding of human speech directed at machines (Tur and De Mori, 2011). The goal of such "targeted" understanding is to convert the recognized user speech, at each turn, into a task-specific semantic representation of the user's intention that aligns with the back-end knowledge and action sources for task completion. The dialogue manager then interprets the semantics of the user's request and the associated back-end results, and decides the most appropriate system action by exploiting semantic context and user-specific meta-information, such as geo-location and personal preferences (McTear, 2004; Rudnicky and Xu, 1999).

A typical NLU pipeline includes domain classification, intent determination, and slot filling (Tur and De Mori, 2011). NLU first decides the domain of the user's request given the input utterance and, based on the domain, predicts the intent and fills the associated slots corresponding to a domain-specific semantic template. For example, Figure 1 shows a user utterance, "show me the flights from seattle to san francisco", and its semantic frame, find_flight(origin="seattle", dest="san francisco"). The relationship between the origin city and the destination city is easy to see in this example, even though the two do not appear next to each other. Traditionally, domain detection and intent prediction are framed as utterance classification problems, where several classifiers such as support vector machines and maximum entropy models have been employed (Haffner et al., 2003; Chelba et al., 2003; Chen et al., 2014). Slot filling is then framed as a word sequence tagging task, where the IOB (in-out-begin) format is used to represent slot tags as illustrated in Figure 1, and hidden Markov models (HMM) or conditional random fields (CRF) have been employed for slot tagging (Pieraccini et al., 1992; Wang et al., 2005).

    show me the flights from seattle  to san    francisco
    O    O  O   O       O    B-origin O  B-dest I-dest

Figure 1: An example utterance annotated with its semantic slots in the IOB format (S).
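For concreteness, the annotation in Figure 1 corresponds to one training example of aligned word–tag pairs; a minimal sketch of how such an IOB-labeled sequence can be represented is shown below (the variable names are illustrative, not part of the ATIS distribution).

```python
# One ATIS-style training example: each word is paired with an IOB slot tag.
# "B-" marks the beginning of a slot value, "I-" its continuation, "O" no slot.
example = [
    ("show", "O"), ("me", "O"), ("the", "O"), ("flights", "O"),
    ("from", "O"), ("seattle", "B-origin"), ("to", "O"),
    ("san", "B-dest"), ("francisco", "I-dest"),
]

words = [w for w, _ in example]   # model input  ~s = w_1, ..., w_T
tags  = [t for _, t in example]   # gold output  ~y = y_1, ..., y_T
```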
With the advances in deep learning, deep belief networks (DBNs) with deep neural networks (DNNs) have been applied to domain and intent classification tasks (Sarikaya et al., 2011; Tur et al., 2012; Sarikaya et al., 2014). Recently, Ravuri and Stolcke (2015) proposed an RNN architecture for intent determination. For slot filling, deep learning has been viewed as a feature generator whose neural architecture can be merged with CRFs (Xu and Sarikaya, 2013). Yao et al. (2013) and Mesnil et al. (2015) later employed RNNs for sequence labeling in order to perform slot filling. However, the above studies benefit from large training data without leveraging any existing knowledge. When tagging sequences, RNNs treat them as flat structures, and their underlying linear-chain structure potentially ignores the structured information typical of natural language sequences.

Hierarchical structures and semantic relationships capture linguistic characteristics of the input word sequences forming sentences, and such information may help interpret their meaning. Furthermore, prior knowledge would help in the tagging of sequences, especially when dealing with previously unseen sequences (Tur et al., 2010; Deoras and Sarikaya, 2013). Prior work exploited external web-scale knowledge graphs such as Freebase and Wikipedia for improving NLU (Heck et al., 2013; Ma et al., 2015b; Chen et al., 2014). Liu et al. (2013) and Chen et al. (2015) proposed approaches that leverage linguistic knowledge encoded in parse trees for language understanding, where the extracted syntactic structural features and semantic dependency features enhance inference model learning, and the models achieve better language understanding performance in various domains.

Even with the emerging paradigm of integrating deep learning and linguistic knowledge for different NLP tasks (Socher et al., 2014), most previous work utilized such linguistic knowledge and knowledge bases as additional features fed into neural networks, and then learned the models for tagging sequences. These feature-enrichment approaches have two possible limitations: 1) poor generalization and 2) error propagation. Poor generalization comes from the mismatch between knowledge bases and the input data, and features extracted incorrectly due to errors in previous processing steps propagate those errors to the neural models. In order to address these issues and better learn sequence tagging models, this paper proposes knowledge-guided structural attention networks (K-SAN), a generalization of RNNs that automatically learns attention guided by external or prior knowledge and generates sentence-based representations specifically for sequence tagging. The main difference between K-SAN and previous approaches is that knowledge plays the role of a teacher that guides the network where, and how much, to focus its attention while considering the whole linguistic structure simultaneously. Our main contributions are three-fold:

• End-to-end learning. To our knowledge, this is the first neural network approach that utilizes general knowledge as guidance in an end-to-end fashion, where the model automatically learns important substructures with an attention mechanism.

• Generalization for different knowledge. There is no required schema for the knowledge, and different types of parsing results, such as dependency relations, knowledge graph-specific relations, and the parsing output of hand-crafted grammars, can serve as the knowledge guidance in this model.

• Efficiency and parallelizability. Because the substructures from the input utterance are modeled separately, the modeling time may not increase linearly with respect to the number of words in the input sentence.

In the following sections, we empirically show the benefit of K-SAN on the targeted NLU task.
2 Related Work

Knowledge-Based Representations. There is an emerging trend of learning representations at different levels, such as word embeddings (Mikolov et al., 2013), character embeddings (Ling et al., 2015), and sentence embeddings (Le and Mikolov, 2014; Huang et al., 2013). In addition to fully unsupervised embedding learning, knowledge bases have been widely utilized to learn entity embeddings with specific functions or relations (Celikyilmaz and Hakkani-Tur, 2015; Yang et al., 2014). Different from prior work, this paper focuses on learning composable substructure embeddings that are informative for understanding.

Recently, linguistic structures have been taken into account in deep learning frameworks. Ma et al. (2015a) and Tai et al. (2015) both proposed dependency-based approaches that combine deep learning and linguistic structures, where the models use tree-based n-grams instead of surface ones to capture knowledge-guided relations for sentence modeling and classification. Roth and Lapata (2016) utilized lexicalized dependency paths to learn embedding representations for semantic role labeling. However, the performance of these approaches highly depends on the quality of parsing the whole sentence, and there is no control over the degree of attention paid to different substructures. Learning robust representations that incorporate whole structures therefore remains unsolved. In this paper, we address this limitation by proposing K-SAN to learn robust representations of whole sentences, where the whole representation is composed of the salient substructures in order to avoid error propagation.

Neural Attention and Memory Models. One of the earliest works applying a memory component to language processing is memory networks (Weston et al., 2015; Sukhbaatar et al., 2015), which encode facts into vectors and store them in a memory for question answering (QA). Following their success, Xiong et al. (2016) proposed dynamic memory networks (DMN) to additionally capture the position and temporality of transitive reasoning steps for different QA tasks. The idea is to encode important knowledge and store it in memory for future use with attention mechanisms, which allow neural network models to selectively attend to specific parts of the input; various tasks have demonstrated the effectiveness of attention mechanisms.

However, most previous work focused on classification or prediction tasks (predicting a single word given a question), and there are few studies on NLU tasks (slot tagging). Based on the observation that linguistic or knowledge-based substructures can be treated as prior knowledge that benefits language understanding, this work borrows the idea from memory models to improve NLU. Unlike prior NLU work that utilized representations learned from knowledge bases to enrich the features of the current sentence, this paper directly learns a sentence representation that incorporates the memorized substructures with an automatically decided attention mechanism in an end-to-end manner.

3 Knowledge-Guided Structural Attention Networks (K-SAN)

For the NLU task, given an utterance with a sequence of words/tokens ~s = w_1, ..., w_T, our model predicts the corresponding semantic tags ~y = y_1, ..., y_T for each word/token by incorporating knowledge-guided structures. The proposed model is illustrated in Figure 2. The knowledge encoding module first leverages external knowledge to generate a linguistic structure for the utterance, where a discrete set of knowledge-guided substructures {x_i} is encoded into a set of vector representations (§3.1). The model learns the representation of the whole sentence by paying different attention to the substructures (§3.2). The learned vector encoding the knowledge-guided structure is then used to improve the semantic tagger (§4).
[Figure 2: The illustration of knowledge-guided structural attention networks (K-SAN) for NLU. The knowledge encoding module (left) encodes the knowledge-guided substructures {x_i} and the input sentence s, derives a knowledge attention distribution from their inner products, and produces an encoded knowledge representation that is fed into the RNN tagger (right) for slot tagging.]

3.1 Knowledge Encoding Module

The prior knowledge obtained from external resources, such as dependency relations, knowledge bases, etc., provides richer information that helps decide the semantic tags given an input utterance. This paper takes dependency relations as an example of knowledge encoding; other structured relations can be applied in the same way. The input utterance is parsed by a dependency parser, and the substructures are built according to the paths from the root to all leaves (Chen and Manning, 2014). For example, the dependency parse of the utterance "show me the flights from seattle to san francisco" is shown in Figure 3, where the associated substructures are obtained from the parse tree for knowledge encoding. Here we do not utilize the dependency relation labels in the experiments for better generalization, because such labels may not always be available for different knowledge resources. Note that the number of substructures may be smaller than the number of words in the utterance, because non-leaf nodes do not have corresponding substructures, which reduces duplicated information in the model. The top-left component of Figure 2 illustrates the module for modeling knowledge-guided substructures.

[Figure 3: The knowledge-guided substructures of dependency parsing, x_i, on an example sentence s. For s = "show me the flights from seattle to san francisco", the extracted substructures are: 1. "show me"; 2. "show flights the"; 3. "show flights from seattle"; 4. "show flights to francisco san".]
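To make the substructure construction concrete, the sketch below extracts one substructure per root-to-leaf path from a dependency parse given as head indices; the parse encoding, the toy head indices, and the function name are illustrative and not the authors' implementation.

```python
from collections import defaultdict

def root_to_leaf_substructures(words, heads):
    """Build knowledge-guided substructures from a dependency parse.

    words: tokens of the sentence.
    heads: heads[i] is the index of the head of words[i], or -1 for the root.
    Each substructure is the word sequence along one root-to-leaf path.
    """
    children = defaultdict(list)
    for i, h in enumerate(heads):
        children[h].append(i)

    leaves = [i for i in range(len(words)) if i not in children]
    substructures = []
    for leaf in leaves:
        path, node = [], leaf
        while node != -1:                # climb from the leaf up to the root
            path.append(node)
            node = heads[node]
        substructures.append([words[i] for i in reversed(path)])
    return substructures

# Toy parse of "show me the flights from seattle" (head indices are illustrative).
words = ["show", "me", "the", "flights", "from", "seattle"]
heads = [-1, 0, 3, 0, 5, 3]              # e.g. "seattle" attaches to "flights"
print(root_to_leaf_substructures(words, heads))
# [['show', 'me'], ['show', 'flights', 'the'], ['show', 'flights', 'seattle', 'from']]
```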
3.2 Model Architecture

The model embeds all knowledge-guided substructures into a continuous space and stores the embeddings of all x_i in the knowledge memory. The representation of the input utterance is then compared with the encoded knowledge representations in order to integrate the structure carried by the knowledge via an attention mechanism. The knowledge-guided representation of the sentence is then taken together with the word sequence for estimating the semantic tags. The four main procedures are described below.
Encoded Knowledge Representation. To store the knowledge-guided structure, we convert each substructure (e.g., a path from the root to a leaf in the dependency tree), x_i, into a structure vector m_i with dimension d by embedding the substructure in a continuous space through the knowledge encoding model M_kg. The input utterance s is also embedded into a vector u with the same dimension through the model M_in:

    m_i = M_kg(x_i),    (1)
    u = M_in(s).    (2)

We apply three types of knowledge encoding models for M_kg and M_in in order to map the multiple words of a substructure x_i or of an input sentence s into a single vector representation: 1) fully-connected neural networks (NN) with linear activation, 2) recurrent neural networks (RNN), and 3) convolutional neural networks (CNN) with a window size of 3 and a max-pooling operation. For example, one of the substructures shown in Figure 3, "show flights seattle from", is encoded into a vector embedding. In the experiments, the weights of M_kg and M_in are tied together, based on their consistent ability to encode sequences.
Knowledge Attention Distribution. In the embedding space, we compute the match between the current utterance vector u and each substructure vector m_i by taking their inner product followed by a softmax:

    p_i = softmax(u^T m_i),    (3)

where softmax(z_i) = e^{z_i} / Σ_j e^{z_j}, and p_i can be viewed as an attention distribution over the substructures from external knowledge, modeling how important each substructure is for understanding the current utterance.

Sentence Representation. In order to encode the knowledge-guided structure, a vector h is computed as the sum of the encoded knowledge embeddings weighted by the attention distribution:

    h = Σ_i p_i m_i,    (4)

which indicates that the sentence pays different amounts of attention to different substructures guided by the external knowledge. Because the function from input to output is smooth, we can easily compute gradients and back-propagate through it. The sum of the substructure vector h and the current input embedding u is then passed through a neural network model M_out to generate the output knowledge-guided representation o:

    o = M_out(h + u),    (5)

where we employ a fully-connected dense network for M_out.
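Equations (3)–(5) amount to a single soft attention read over the substructure memory. A minimal NumPy sketch, continuing the placeholder encoder above and approximating M_out by one dense layer (the activation choice is ours), is:

```python
def softmax(z):
    z = z - z.max()                      # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

def knowledge_guided_representation(u, memory, W_out):
    """Eqs. (3)-(5): attend over substructure vectors and produce o."""
    scores = np.array([u @ m_i for m_i in memory])    # inner products u^T m_i
    p = softmax(scores)                               # knowledge attention distribution, Eq. (3)
    h = (p[:, None] * np.stack(memory)).sum(axis=0)   # attention-weighted sum, Eq. (4)
    return np.tanh(W_out @ (h + u))                   # M_out as a single dense layer, Eq. (5)

memory = [cnn_encode(ids) for ids in ([10, 42], [10, 7, 3], [10, 7, 99, 100])]
W_out = rng.normal(scale=0.1, size=(d, d))
o = knowledge_guided_representation(u, memory, W_out)
```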
to adaptively capture dependencies of different time
Sequence Tagging To estimate the tag sequence ~y scales (Cho et al., 2014; Chung et al., 2014), because
corresponding to an input word sequence ~s, we use RNN-GRU can yield comparable performance as
an RNN module for training a slot tagger, where the RNN-LSTM with need of fewer parameters and less
knowledge-guided representation o is fed into the in- data for generalization (Chung et al., 2014)
put of the model in order to incorporate the structure A GRU has two gates, a reset gate r, and an up-
information. date gate z (Cho et al., 2014; Chung et al., 2014).
~y = RNN(o, ~s) (6) The reset gate determines the combination between
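A bare-bones forward pass for the Elman tagger of Eqs. (7)–(8), reusing the embedding table and softmax helper from the sketches above with freshly initialized placeholder weights, might look as follows; the GRU cell described next replaces the single tanh update in practice.

```python
def rnn_tagger_forward(token_ids, W, U, V):
    """Elman RNN tagger: Eqs. (7)-(8), returning one tag distribution per word."""
    h = np.zeros(W.shape[0])
    outputs = []
    for t in token_ids:
        h = np.tanh(W @ E[t] + U @ h)        # Eq. (7)
        outputs.append(softmax(V @ h))       # Eq. (8)
    return outputs

n_hidden, n_tags = 100, 127                  # layer sizes are placeholders
W = rng.normal(scale=0.1, size=(n_hidden, d))
U = rng.normal(scale=0.1, size=(n_hidden, n_hidden))
V = rng.normal(scale=0.1, size=(n_tags, n_hidden))
tag_distributions = rnn_tagger_forward([10, 3, 5, 42, 8, 7, 11, 99, 100], W, U, V)
```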
To overcome the vanishing gradient problem that frequently arises when modeling long-term dependencies, gated RNNs were designed to use a more sophisticated activation function than the usual affine transformation followed by a simple element-wise nonlinearity, by employing gating units (Chung et al., 2014), such as the long short-term memory (LSTM) and the gated recurrent unit (GRU) (Hochreiter and Schmidhuber, 1997; Cho et al., 2014). RNNs employing either of these recurrent units have been shown to perform well in tasks that require capturing long-term dependencies (Mesnil et al., 2015; Yao et al., 2014; Graves et al., 2013; Sutskever et al., 2014). In this paper, we use an RNN with GRU cells to allow each recurrent unit to adaptively capture dependencies at different time scales (Cho et al., 2014; Chung et al., 2014), because RNN-GRU can yield performance comparable to RNN-LSTM while needing fewer parameters and less data to generalize (Chung et al., 2014).

A GRU has two gates, a reset gate r and an update gate z (Cho et al., 2014; Chung et al., 2014). The reset gate determines how to combine the new input with the previous memory, and the update gate decides how much the unit updates its activation, or content:

    r = σ(W^r w_t + U^r h_{t-1}),    (10)
    z = σ(W^z w_t + U^z h_{t-1}),    (11)

where σ is the logistic sigmoid function. The final activation of the GRU at time t, h_t, is then a linear interpolation between the previous activation h_{t-1} and the candidate activation h̃_t:

    h_t = (1 − z) ⊙ h̃_t + z ⊙ h_{t-1},    (12)
    h̃_t = φ(W^h w_t + U^h (h_{t-1} ⊙ r)),    (13)

where ⊙ denotes element-wise multiplication. When the reset gate is off, the unit effectively acts as if it is reading the first symbol of an input sequence, allowing it to forget the previously computed state. Then ŷ_t can be computed by (8).
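Put together, one GRU step of Eqs. (10)–(13) can be transcribed directly; the sketch below (placeholder weight matrices, reusing the helpers above) is a literal rendering of the equations rather than the authors' code.

```python
def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x_t, h_prev, P):
    """One GRU update, Eqs. (10)-(13). P is a dict of weight matrices."""
    r = sigmoid(P["Wr"] @ x_t + P["Ur"] @ h_prev)              # reset gate, Eq. (10)
    z = sigmoid(P["Wz"] @ x_t + P["Uz"] @ h_prev)              # update gate, Eq. (11)
    h_tilde = np.tanh(P["Wh"] @ x_t + P["Uh"] @ (h_prev * r))  # candidate activation, Eq. (13)
    return (1.0 - z) * h_tilde + z * h_prev                    # interpolation, Eq. (12)

P = {k: rng.normal(scale=0.1, size=(n_hidden, d if k.startswith("W") else n_hidden))
     for k in ["Wr", "Wz", "Wh", "Ur", "Uz", "Uh"]}
h_t = gru_step(E[10], np.zeros(n_hidden), P)
```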
4.2 Knowledge-Guided RNN Tagger

In order to model the encoded knowledge from previous turns, at each time step t the knowledge-guided sentence representation o from (5) is fed into the RNN model together with the word w_t. For the plain RNN, the hidden layer can be formulated as

    h_t = φ(M o + W w_t + U h_{t-1}),    (14)

which replaces (7), as illustrated in the right block of Figure 2. RNN-GRU can incorporate the encoded knowledge in a similar way, where M o can be added into the gating mechanisms for modeling the contextual knowledge.
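For the plain-RNN case, the change relative to the chain-based tagger is a single additional term; a hedged sketch of Eq. (14), continuing the placeholder setup above (with M newly introduced here), is:

```python
M = rng.normal(scale=0.1, size=(n_hidden, d))   # projects o into the hidden space

def knowledge_guided_tagger_forward(token_ids, o, M, W, U, V):
    """Knowledge-guided RNN tagger: Eq. (14) in place of Eq. (7)."""
    h = np.zeros(W.shape[0])
    outputs = []
    for t in token_ids:
        h = np.tanh(M @ o + W @ E[t] + U @ h)    # Eq. (14)
        outputs.append(softmax(V @ h))
    return outputs

tag_distributions = knowledge_guided_tagger_forward(
    [10, 3, 5, 42, 8, 7, 11, 99, 100], o, M, W, U, V)
```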
4.3 Joint RNN Tagger

Because the chain-based tagger and the knowledge-guided tagger carry different information, the joint RNN tagger is proposed to balance the information between the two model architectures. Figure 4 presents the architecture of the joint RNN tagger.

[Figure 4: The joint tagging model that incorporates a chain-based RNN tagger (upper block) and a knowledge-guided RNN tagger (lower block).]

    h^1_t = φ(W^1 w_t + U^1 h^1_{t-1}),    (15)
    h^2_t = φ(M o + W^2 w_t + U^2 h^2_{t-1}),    (16)
    ŷ_t = softmax(V (α h^1_t + (1 − α) h^2_t)),    (17)

where α is a weight that balances the chain-based and knowledge-guided information. By jointly considering the chain-based information (h^1_t) and the knowledge-guided information (h^2_t), the joint RNN tagger is expected to achieve better generalization, and its performance may be less sensitive to poor structures from the external knowledge. In the experiments, α is set to 0.5 to balance the two sides. The objective of the proposed model is to maximize the sequence probability p(~y | ~s) in (9), and the model can be trained in an end-to-end manner, where the error is back-propagated through the whole architecture.
5 Experiments

5.1 Experimental Setup

The dataset for the experiments is the benchmark ATIS corpus, which is extensively used by the NLU community (Mesnil et al., 2015). There are 4,978 training utterances selected from Class A (context independent) of ATIS-2 and ATIS-3, and 893 test utterances selected from the ATIS-3 Nov93 and Dec94 sets. In the experiments, we only use lexical features. In order to show robustness to data scarcity, we conduct the experiments with three different sizes of training data (Small, Medium, and Large), where Small is 1/40 of the original set, Medium is 1/10 of the original set, and Large is the full set. The evaluation metric for NLU is the F-measure on the predicted slots.¹

For the experiments with K-SAN, we parse all data with the Stanford dependency parser (Chen and Manning, 2014), where the parser is pre-trained on the PTB, and represent words by embeddings trained on the in-domain data. The loss function is the cross-entropy, and the optimizer is Adam with the default settings (Kingma and Ba, 2014): learning rate λ = 0.001, β1 = 0.9, β2 = 0.999, and ε = 1e−08. The maximum number of training iterations for our K-SAN models is set to 300. The dimensionality of the input word embeddings is 100, the hidden layer sizes are in {50, 100, 150}, and the dropout rates are in {0.25, 0.50}. All reported results are from the joint RNN tagger, and the hyperparameters are tuned on the dev set for all experiments.

¹The evaluation script used is conlleval.
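For reference, the stated training configuration can be summarized as a small search space; the dictionary below only restates the hyperparameters listed above (the key names are ours).

```python
# Training configuration reported in Section 5.1 (key names are illustrative).
ksan_config = {
    "optimizer": "adam",
    "learning_rate": 1e-3, "beta1": 0.9, "beta2": 0.999, "epsilon": 1e-8,
    "max_iterations": 300,
    "word_embedding_dim": 100,
    "hidden_size_candidates": [50, 100, 150],   # tuned on the dev set
    "dropout_candidates": [0.25, 0.50],         # tuned on the dev set
    "loss": "cross_entropy",
    "alpha_joint_tagger": 0.5,                  # fixed, from Section 4.3
}
```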
5.2 Baseline

To validate the effectiveness of the proposed model, we compare its performance with the following baselines.

• Baseline:
  – CRF Tagger (Tur et al., 2010): predicts a semantic slot for each word with a context window (size = 5).
  – RNN Tagger (Mesnil et al., 2015): predicts a semantic slot for each word.
  – CNN Encoder-Tagger (Kim, 2014): tags semantic slots with consideration of sentence embeddings learned by a convolutional model.

• Structural: NLU models that utilize linguistic information when tagging slots, where DCNN and Tree-RNN are the state-of-the-art approaches for embedding sentences with linguistic structures.
  – CRF Tagger (Tur et al., 2010): predicts slots based on lexical (5-word window) and syntactic (dependency head in the parse tree) features.
  – DCNN (Ma et al., 2015a): predicts slots by incorporating sentence embeddings learned by a convolutional model with consideration of dependency tree structures.
  – Tree-RNN (Tai et al., 2015): predicts slots with sentence embeddings learned by an RNN model based on the tree structures of sentences.

             Model                                          Dataset
             Encoder (M_kg/M_in)   Knowledge   Tagger   Small   Medium   Large
Baseline     -                     ✗           CRF      58.94   78.74    89.73
             -                     ✗           RNN      68.58   84.55    92.97
             CNN                   ✗           RNN      73.57   85.52    93.88
Structural   -                     ✓           CRF      59.55   78.71    90.13
             DCNN                  ✓           RNN      70.24   83.80    93.25
             Tree-RNN              ✓           RNN      73.50   83.92    92.28
Proposed     K-SAN (NN)            ✓           RNN      74.11†  85.97    93.98†
             K-SAN (RNN)           ✓           RNN      73.13   86.85†   94.97†
             K-SAN (CNN)           ✓           RNN      74.60†  87.99†   94.86†

Table 1: The F1 scores of predicted slots on different sizes of ATIS training data, where K-SAN utilizes the dependency relations parsed by the Stanford parser. Small: 1/40 of the set; Medium: 1/10 of the set; Large: the original set. († indicates that the performance is significantly better than all baseline models with p < 0.05 in the t-test.)

5.3 Slot Filling Results

Table 1 shows the slot filling performance for the different sizes of training data, where the three datasets (Small, Medium, and Large) use 1/40, 1/10, and all of the training data, respectively. Among the baselines (models without knowledge features), the CNN Encoder-Tagger achieves the best performance on all datasets. Among the structural models (models with knowledge encoding), the Tree-RNN Encoder-Tagger performs better on the Small data but slightly worse than the DCNN Encoder-Tagger on the Large data.

CNN (Kim, 2014) performs better than DCNN (Ma et al., 2015a) and Tree-RNN (Tai et al., 2015), even though CNN does not leverage external knowledge when encoding sentences. When comparing the NLU performance of the baselines with the other state-of-the-art structural models, there is no significant difference. This suggests that encoding sentence information without distinguishing substructures may not capture the salient semantics needed to improve understanding performance.
Among the proposed K-SAN models, the CNN encoder performs best on Small (74.60% F1) and Medium (87.99% F1), and the RNN encoder performs best on the Large set (94.97% F1). Also, most of the proposed models outperform all baselines, and the improvement on the small dataset is more significant. This suggests that the proposed models generalize better and are less sensitive to unseen data. For example, given the utterance "which flights leave on monday from montreal and arrive in chicago in the morning", "morning" is correctly tagged with the semantic tag B-arrive_time.period_of_day by K-SAN, but is incorrectly tagged as B-depart_time.period_of_day by the baselines, because the knowledge guides the model to pay correct attention to salient substructures. The proposed model also presents state-of-the-art performance on the large dataset (compared with RNN-BLSTM baselines), showing the effectiveness of leveraging knowledge-guided structures for learning task-specific embeddings as well as robustness to data scarcity and mismatch.

5.4 Attention Analysis

In order to show that the performance boost comes from learning correct attention even from much smaller training data, we visualize the attention over both words and relations decoded by K-SAN with the CNN encoder in Figure 5, where darker blocks and lines indicate higher attention weights for words and relations, respectively. The words and relations with higher attention are the parts most crucial for predicting the correct slots, e.g., origin, destination, and time. Furthermore, the difference between the attention distributions learned from the three training set sizes is not significant; this suggests that our proposed model is able to pay correct attention to important substructures guided by the external knowledge even when the training data is scarce.

[Figure 5: The visualization of the decoded knowledge-guided structural attention for both relations and words learned from different sizes of training data (Small, Medium, Large), on the utterance "find nonstop flights from salt lake city to new york on saturday april ninth" with reference slots flight_stop, fromloc.city_name, toloc.city_name, depart_date.day_name, depart_date.month_name, and depart_date.day_number. Relations and words with darker color indicate higher attention weights generated by the proposed K-SAN with CNN. Note that the dependency relations are incorrectly parsed by the Stanford parser in this example, but our model is still able to benefit from the structural information.]

5.5 Knowledge Generalization

In order to show the capacity to generalize to different knowledge resources, we run the K-SAN model with different knowledge bases. Below we compare two types of knowledge formats: the dependency tree and the Abstract Meaning Representation (AMR). AMR is a semantic formalism in which the meaning of a sentence is encoded as a rooted, directed, acyclic graph (Banarescu et al., 2013), where nodes represent concepts and labeled directed edges represent the relations between two concepts. The formalism is based on propositional logic and neo-Davidsonian event representations (Parsons, 1990; Davidson, 1967). The semantic concepts in AMR have been leveraged to benefit multiple NLP tasks (Liu et al., 2015). Unlike the syntactic information from dependency trees, the AMR graph contains semantic information, which may offer more specific conceptual relations. Figure 6 compares the dependency tree and the AMR graph associated with the same example utterance and shows how the knowledge-guided substructures are constructed.

[Figure 6: The procedure for constructing knowledge-guided substructures, x_i, on an example sentence s. (a) Syntax: the dependency tree of "show me the flights from seattle to san francisco" yields the substructures "show me", "show flights the", "show flights from seattle", and "show flights to francisco san". (b) Semantics: the AMR graph of the same sentence yields the substructures "show you", "show flight seattle", "show flight san francisco", and "show i".]
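The same root-to-leaf procedure used for dependency trees applies to an AMR graph by walking from the root concept to each terminal concept; the sketch below reuses that idea on a hand-written toy graph (the graph encoding and concept names are illustrative, not the output of an AMR parser).

```python
def root_to_leaf_concept_paths(graph, root):
    """Linearize an AMR-like rooted acyclic graph into root-to-leaf concept sequences."""
    children = graph.get(root, [])
    if not children:
        return [[root]]
    paths = []
    for child in children:
        for sub in root_to_leaf_concept_paths(graph, child):
            paths.append([root] + sub)
    return paths

# Toy AMR-like graph for "show me the flights from seattle to san francisco".
amr_graph = {
    "show": ["you", "flight", "i"],
    "flight": ["seattle", "san francisco"],
}
print(root_to_leaf_concept_paths(amr_graph, "show"))
# [['show', 'you'], ['show', 'flight', 'seattle'],
#  ['show', 'flight', 'san francisco'], ['show', 'i']]
```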
Table 2 presents the performance of the CRF and K-SAN (CNN) taggers that utilize dependency relations and AMR edges as knowledge guidance on the same datasets, where the CRF takes the head words from either the dependency trees or the AMR graphs as additional features, and K-SAN incorporates the knowledge-guided substructures as illustrated in Figure 6. The dependency trees are obtained from the Stanford dependency parser or the SyntaxNet parser², and the AMR graphs are generated by a rule-based AMR parser or JAMR³.

Approach      Knowledge        Parser       Max #Substructure   Small   Medium   Large
CRF           Dependency Tree  Stanford     -                   59.55   78.71    90.13
              Dependency Tree  SyntaxNet    -                   61.09   78.87    90.92
              AMR Graph        Rule-Based   -                   59.55   79.15    89.97
              AMR Graph        JAMR         -                   61.12   78.64    90.25
K-SAN (CNN)   Dependency Tree  Stanford     53                  74.60   87.99    94.86
              Dependency Tree  SyntaxNet    25                  74.35   88.40    95.00
              AMR Graph        Rule-Based   19                  74.32   88.14    94.85
              AMR Graph        JAMR         8                   74.27   88.27    94.89

Table 2: The F1 scores of predicted slots with knowledge from different resources.

Among the four knowledge resources (two knowledge types, each obtained from two different parsers), all results show similar performance across the three sizes of datasets. The maximum number of substructures for the dependency trees is larger than that for the AMR graphs (53 and 25 vs. 19 and 8), because syntax is more general and may provide richer cues for guiding attention, while semantics is more specific and may offer stronger guidance. In sum, the models applying the four different resources achieve similar performance, and all significantly outperform the state-of-the-art NLU tagger, showing the effectiveness, generalization, and robustness of the proposed K-SAN model.

²https://github.com/tensorflow/models/tree/master/syntaxnet
³https://github.com/jflanigan/jamr

6 Conclusion

This paper proposes a novel model, knowledge-guided structural attention networks (K-SAN), which leverages prior knowledge as guidance to incorporate non-flat topologies and to learn suitable attention for the substructures that are salient for a specific task. The structured information can be captured from small amounts of training data, so the model has better generalization and robustness. The experiments show the benefits and effectiveness of the proposed model on the language understanding task, where the knowledge-guided substructures captured from different resources all help the tagging performance, and state-of-the-art performance is achieved on the ATIS benchmark dataset.
References

Laura Banarescu, Claire Bonial, Shu Cai, Madalina Georgescu, Kira Griffitt, Ulf Hermjakob, Kevin Knight, Philipp Koehn, Martha Palmer, and Nathan Schneider. 2013. Abstract meaning representation for sembanking. In Proceedings of the Linguistic Annotation Workshop and Interoperability with Discourse.

Asli Celikyilmaz and Dilek Hakkani-Tur. 2015. Convolutional neural network based semantic tagging with entity embeddings. In NIPS Workshop on Machine Learning for SLU and Interaction.

Ciprian Chelba, Monika Mahajan, and Alex Acero. 2003. Speech utterance classification. In 2003 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), volume 1, pages I–280. IEEE.

Danqi Chen and Christopher D. Manning. 2014. A fast and accurate dependency parser using neural networks. In EMNLP, pages 740–750.

Yun-Nung Chen, Dilek Hakkani-Tur, and Gokhan Tur. 2014. Deriving local relational surface forms from dependency-based entity embeddings for unsupervised spoken language understanding. In 2014 IEEE Spoken Language Technology Workshop (SLT), pages 242–247. IEEE.

Yun-Nung Chen, William Yang Wang, Anatole Gershman, and Alexander I. Rudnicky. 2015. Matrix factorization with knowledge graph propagation for unsupervised spoken language understanding. In Proceedings of ACL-IJCNLP.

Kyunghyun Cho, Bart van Merriënboer, Dzmitry Bahdanau, and Yoshua Bengio. 2014. On the properties of neural machine translation: Encoder-decoder approaches. arXiv preprint arXiv:1409.1259.

Junyoung Chung, Caglar Gulcehre, KyungHyun Cho, and Yoshua Bengio. 2014. Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv preprint arXiv:1412.3555.

Donald Davidson. 1967. The logical form of action sentences.

Anoop Deoras and Ruhi Sarikaya. 2013. Deep belief network based semantic taggers for spoken language understanding. In INTERSPEECH, pages 2713–2717.

Jeffrey L. Elman. 1990. Finding structure in time. Cognitive Science, 14(2):179–211.

Alan Graves, Abdel-rahman Mohamed, and Geoffrey Hinton. 2013. Speech recognition with deep recurrent neural networks. In 2013 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 6645–6649. IEEE.

Patrick Haffner, Gokhan Tur, and Jerry H. Wright. 2003. Optimizing SVMs for complex call classification. In 2003 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), volume 1, pages I–632. IEEE.

Larry P. Heck, Dilek Hakkani-Tür, and Gokhan Tur. 2013. Leveraging knowledge graphs for web-scale unsupervised semantic parsing. In INTERSPEECH, pages 1594–1598.

Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation, 9(8):1735–1780.

Po-Sen Huang, Xiaodong He, Jianfeng Gao, Li Deng, Alex Acero, and Larry Heck. 2013. Learning deep structured semantic models for web search using clickthrough data. In Proceedings of the 22nd ACM International Conference on Information & Knowledge Management, pages 2333–2338. ACM.

Yoon Kim. 2014. Convolutional neural networks for sentence classification. arXiv preprint arXiv:1408.5882.

Diederik Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.

Quoc V. Le and Tomas Mikolov. 2014. Distributed representations of sentences and documents. arXiv preprint arXiv:1405.4053.

Wang Ling, Tiago Luís, Luís Marujo, Ramón Fernandez Astudillo, Silvio Amir, Chris Dyer, Alan W. Black, and Isabel Trancoso. 2015. Finding function in form: Compositional character models for open vocabulary word representation. arXiv preprint arXiv:1508.02096.

Jingjing Liu, Panupong Pasupat, Yining Wang, Scott Cyphers, and James Glass. 2013. Query understanding enhanced by hierarchical parsing structures. In 2013 IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU), pages 72–77. IEEE.

Fei Liu, Jeffrey Flanigan, Sam Thomson, Norman Sadeh, and Noah A. Smith. 2015. Toward abstractive summarization using semantic representations. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1077–1086.

Mingbo Ma, Liang Huang, Bing Xiang, and Bowen Zhou. 2015a. Dependency-based convolutional neural networks for sentence embedding. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing, pages 174–179.

Yi Ma, Paul A. Crook, Ruhi Sarikaya, and Eric Fosler-Lussier. 2015b. Knowledge graph inference for spoken dialog systems. In 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 5346–5350. IEEE.

Michael F. McTear. 2004. Spoken Dialogue Technology: Toward the Conversational User Interface. Springer Science & Business Media.

Grégoire Mesnil, Yann Dauphin, Kaisheng Yao, Yoshua Bengio, Li Deng, Dilek Hakkani-Tur, Xiaodong He, Larry Heck, Gokhan Tur, Dong Yu, et al. 2015. Using recurrent neural networks for slot filling in spoken language understanding. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 23(3):530–539.

Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeff Dean. 2013. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems, pages 3111–3119.

Terence Parsons. 1990. Events in the Semantics of English: A Study in Subatomic Semantics.

Roberto Pieraccini, Evelyne Tzoukermann, Zakhar Gorelov, Jean-Luc Gauvain, Esther Levin, Chin-Hui Lee, and Jay G. Wilpon. 1992. A speech understanding system based on statistical representation of semantics. In 1992 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), volume 1, pages 193–196. IEEE.

Suman Ravuri and Andreas Stolcke. 2015. Recurrent neural network and LSTM models for lexical utterance classification. In Sixteenth Annual Conference of the International Speech Communication Association.

Michael Roth and Mirella Lapata. 2016. Neural semantic role labeling with dependency path embeddings. arXiv preprint arXiv:1605.07515.

Alexander Rudnicky and Wei Xu. 1999. An agenda-based dialog management architecture for spoken language systems. In IEEE Automatic Speech Recognition and Understanding Workshop, volume 13, page 17.

Ruhi Sarikaya, Geoffrey E. Hinton, and Bhuvana Ramabhadran. 2011. Deep belief nets for natural language call-routing. In 2011 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 5680–5683. IEEE.

Ruhi Sarikaya, Geoffrey E. Hinton, and Anoop Deoras. 2014. Application of deep belief networks for natural language understanding. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 22(4):778–784.

Richard Socher, Andrej Karpathy, Quoc V. Le, Christopher D. Manning, and Andrew Y. Ng. 2014. Grounded compositional semantics for finding and describing images with sentences. Transactions of the Association for Computational Linguistics, 2:207–218.

Sainbayar Sukhbaatar, Jason Weston, Rob Fergus, et al. 2015. End-to-end memory networks. In Advances in Neural Information Processing Systems, pages 2431–2439.

Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. 2014. Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems, pages 3104–3112.

Kai Sheng Tai, Richard Socher, and Christopher D. Manning. 2015. Improved semantic representations from tree-structured long short-term memory networks. arXiv preprint arXiv:1503.00075.

Gokhan Tur and Renato De Mori. 2011. Spoken Language Understanding: Systems for Extracting Semantic Information from Speech. John Wiley & Sons.

Gokhan Tur, Dilek Hakkani-Tür, and Larry Heck. 2010. What is left to be understood in ATIS? In 2010 IEEE Spoken Language Technology Workshop (SLT), pages 19–24. IEEE.

Gokhan Tur, Li Deng, Dilek Hakkani-Tür, and Xiaodong He. 2012. Towards deeper understanding: Deep convex networks for semantic utterance classification. In 2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 5045–5048. IEEE.

Ye-Yi Wang, Li Deng, and Alex Acero. 2005. Spoken language understanding. IEEE Signal Processing Magazine, 22(5):16–31.

Jason Weston, Sumit Chopra, and Antoine Bordes. 2015. Memory networks. In International Conference on Learning Representations (ICLR).

Caiming Xiong, Stephen Merity, and Richard Socher. 2016. Dynamic memory networks for visual and textual question answering. arXiv preprint arXiv:1603.01417.

Puyang Xu and Ruhi Sarikaya. 2013. Convolutional neural network based triangular CRF for joint intent detection and slot filling. In 2013 IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU), pages 78–83. IEEE.

Bishan Yang, Wen-tau Yih, Xiaodong He, Jianfeng Gao, and Li Deng. 2014. Embedding entities and relations for learning and inference in knowledge bases. arXiv preprint arXiv:1412.6575.

Kaisheng Yao, Geoffrey Zweig, Mei-Yuh Hwang, Yangyang Shi, and Dong Yu. 2013. Recurrent neural networks for language understanding. In INTERSPEECH, pages 2524–2528.

Kaisheng Yao, Baolin Peng, Yu Zhang, Dong Yu, Geoffrey Zweig, and Yangyang Shi. 2014. Spoken language understanding using long short-term memory neural networks. In 2014 IEEE Spoken Language Technology Workshop (SLT), pages 189–194. IEEE.
