Figure 2: The illustration of knowledge-guided structural attention networks (K-SAN) for NLU.
can be applied in the same way. The input utterance is parsed by a dependency parser, and the substructures are built according to the paths from the root to all leaves (Chen and Manning, 2014). For example, the dependency parse of the utterance “show me the flights from seattle to san francisco” is shown in Figure 3, where the associated substructures are obtained from the parse tree for knowledge encoding. Here we do not utilize the dependency relation labels in the experiments for better generalization, because the labels may not always be available for different knowledge resources. Note that the number of substructures may be less than the number of words in the utterance, because non-leaf nodes do not have corresponding substructures, which reduces duplicated information in the model. The top-left component of Figure 2 illustrates the module for modeling knowledge-guided substructures.

[Figure 3 content: the dependency tree (rooted at “show”) of the sentence s, “show me the flights from seattle to san francisco”, and its knowledge-guided substructures x_i: 1. show me; 2. show flights the; 3. show flights from seattle; 4. show flights to francisco san.]
Figure 3: The knowledge-guided substructures of dependency parsing, x_i, on an example sentence s.

3.2 Model Architecture

The model embeds all knowledge-guided substructures into a continuous space and stores the embeddings of all x's in the knowledge memory. The representation of the input utterance is then compared with the encoded knowledge representations to integrate the carried structure guided by knowledge via an attention mechanism. Then the knowledge-guided representation of the sentence is taken together with the word sequence for estimating the semantic tags. Four main procedures are described below.

Encoded Knowledge Representation To store the knowledge-guided structure, we convert each substructure x_i (e.g. a path from the root to a leaf in the dependency tree) into a structure vector m_i with dimension d by embedding the substructure in a continuous space through the knowledge encoding model M_{kg}. The input utterance s is also embedded into a vector u with the same dimension through the model M_{in}.

m_i = M_{kg}(x_i),    (1)
u = M_{in}(s).    (2)

We apply three types of knowledge encoding models, M_{kg} and M_{in}, in order to model multiple words from a substructure x_i or an input sentence s into a vector representation: 1) fully-connected neural networks (NN) with linear activation, 2) recurrent neural networks (RNN), and 3) convolutional neural networks (CNN) with a window size of 3 and a max-pooling operation. For example, one of the substructures shown in Figure 3, “show flights from seattle”, is encoded into a vector embedding. In the experiments, the weights of M_{kg} and M_{in} are tied together based on their consistent ability of sequence encoding.

Knowledge Attention Distribution In the embedding space, we compute the match between the current utterance vector u and each substructure vector m_i by taking their inner product followed by a softmax.

p_i = \mathrm{softmax}(u^T m_i),    (3)

where \mathrm{softmax}(z_i) = e^{z_i} / \sum_j e^{z_j}, and p_i can be viewed as the attention distribution for modeling important substructures from external knowledge in order to understand the current utterance.

Sentence Representation To encode the knowledge-guided structure, the vector h is computed as a sum over the encoded knowledge embeddings weighted by the attention distribution.

h = \sum_i p_i m_i,    (4)

which indicates that the sentence pays different attention to different substructures guided by external knowledge. Because the function from input to output is smooth, we can easily compute gradients and back-propagate through it. The sum of the substructure vector h and the current input embedding u is then passed through a neural network model M_{out} to generate an output knowledge-guided representation o.

o = M_{out}(h + u),    (5)

where we employ a fully-connected dense network for M_{out}.

Sequence Tagging To estimate the tag sequence \vec{y} corresponding to an input word sequence \vec{s}, we use an RNN module for training a slot tagger, where the knowledge-guided representation o is fed into the input of the model in order to incorporate the structure information.

\vec{y} = \mathrm{RNN}(o, \vec{s})    (6)

4 Recurrent Neural Network Tagger

4.1 Chain-Based RNN Tagger

Given \vec{s} = w_1, ..., w_T, the model is to predict \vec{y} = y_1, ..., y_T, where the tag y_i is aligned with the word w_i. We use the Elman RNN architecture, consisting of an input layer, a hidden layer, and an output layer (Elman, 1990). The input, hidden and output layers consist of a set of neurons representing the input, hidden, and output at each time step t, w_t, h_t, and y_t, respectively.

h_t = \phi(W w_t + U h_{t-1}),    (7)
\hat{y}_t = \mathrm{softmax}(V h_t),    (8)

where \phi is a smooth bounded function such as tanh, and \hat{y}_t is the probability distribution over semantic tags given the current hidden state h_t. The sequence probability can be formulated as

p(\vec{y} | \vec{s}) = p(\vec{y} | w_1, ..., w_T) = \prod_i p(y_i | w_1, ..., w_i).    (9)

The model can be trained using backpropagation to maximize the conditional likelihood of the training set labels.

To overcome the vanishing-gradient problem that frequently arises when modeling long-term dependencies, gated RNNs use gating units to form a more sophisticated activation than the usual one, which consists of an affine transformation followed by a simple element-wise nonlinearity (Chung et al., 2014). Examples are long short-term memory (LSTM) and the gated recurrent unit (GRU) (Hochreiter and Schmidhuber, 1997; Cho et al., 2014). RNNs employing either of these recurrent units have been shown to perform well in tasks that require capturing long-term dependencies (Mesnil et al., 2015; Yao et al., 2014; Graves et al., 2013; Sutskever et al., 2014). In this paper, we use an RNN with GRU cells to allow each recurrent unit to adaptively capture dependencies of different time scales (Cho et al., 2014; Chung et al., 2014), because RNN-GRU can yield performance comparable to RNN-LSTM while needing fewer parameters and less data for generalization (Chung et al., 2014).

A GRU has two gates, a reset gate r and an update gate z (Cho et al., 2014; Chung et al., 2014). The reset gate determines the combination between
the new input and the previous memory, and the update gate decides how much the unit updates its activation, or content.

r = \sigma(W^r w_t + U^r h_{t-1}),    (10)
z = \sigma(W^z w_t + U^z h_{t-1}),    (11)

where \sigma is a logistic sigmoid function. Then the final activation of the GRU at time t, h_t, is a linear interpolation between the previous activation h_{t-1} and the candidate activation \tilde{h}_t:

Figure 4: The joint tagging model that incorporates a chain-based RNN tagger (upper block) and a knowledge-guided RNN tagger (lower block).

4.3 Joint RNN Tagger

Because the chain-based tagger and the knowledge-guided tagger carry different information, the joint RNN tagger is proposed to balance the information between the two model architectures. Figure 4 presents the architecture of the joint RNN tagger.

h_t^1 = \phi(W^1 w_t + U^1 h_{t-1}),    (15)
h_t^2 = \phi(M o + W^2 w_t + U^2 h_{t-1}),    (16)
\hat{y}_t = \mathrm{softmax}(V(\alpha h_t^1 + (1 - \alpha) h_t^2)),    (17)

where \alpha is the weight for balancing chain-based and knowledge-guided information. By jointly considering chain-based information (h_t^1) and knowledge-guided information (h_t^2), the joint RNN tagger is expected to achieve better generalization, and the performance may be less sensitive to poor structures from external knowledge. In the experiments, \alpha is set to 0.5 to balance the two sides. The objective of the proposed model is to maximize the sequence probability p(\vec{y} | \vec{s}) in (9), and the model can be trained in an end-to-end manner, where the error is back-propagated through the whole architecture.
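The attention steps of Section 3.2 (Eqs. 3 to 5) and the joint tagger of Eqs. (15) to (17) can be sketched in a few lines of NumPy. This is only an illustrative sketch under several assumptions that are not from the paper: the weights are random rather than trained, M_out is reduced to a single linear layer, \phi is tanh in a plain Elman-style cell instead of the GRU cells used in the experiments, and each chain is assumed to recur on its own previous hidden state as drawn in Figure 4. All function names, dictionary keys, and dimensions are hypothetical.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(z - np.max(z))
    return e / e.sum()

def knowledge_guided_representation(u, m, W_out):
    """Eqs. (3)-(5): attend over substructure vectors and build o.

    u:     (d,)     utterance vector
    m:     (n, d)   substructure vectors, one row per substructure x_i
    W_out: (d_o, d) a single linear layer standing in for M_out (assumption)
    """
    p = softmax(m @ u)       # Eq. (3): inner products followed by a softmax
    h = p @ m                # Eq. (4): attention-weighted sum of the m_i
    o = W_out @ (h + u)      # Eq. (5): o = M_out(h + u)
    return p, o

def joint_tagger(words, o, params, alpha=0.5):
    """Eqs. (15)-(17): run both taggers in parallel and mix their states."""
    h1 = np.zeros(params["U1"].shape[0])   # chain-based hidden state
    h2 = np.zeros(params["U2"].shape[0])   # knowledge-guided hidden state
    y_hat = []
    for w_t in words:
        # Eq. (15): chain-based recurrence (phi = tanh, an assumption)
        h1 = np.tanh(params["W1"] @ w_t + params["U1"] @ h1)
        # Eq. (16): the same recurrence with o injected at every step
        h2 = np.tanh(params["M"] @ o + params["W2"] @ w_t + params["U2"] @ h2)
        # Eq. (17): tag distribution from the alpha-weighted mixture
        y_hat.append(softmax(params["V"] @ (alpha * h1 + (1 - alpha) * h2)))
    return np.stack(y_hat)

# Toy dimensions: 9 words and 4 substructures, as in the Figure 3 example.
rng = np.random.default_rng(0)
d, d_o, n_sub, hid, n_tags, T = 8, 8, 4, 6, 5, 9
u = rng.normal(size=d)
m = rng.normal(size=(n_sub, d))
p, o = knowledge_guided_representation(u, m, rng.normal(size=(d_o, d)))
params = {
    "W1": rng.normal(size=(hid, d)), "U1": rng.normal(size=(hid, hid)),
    "W2": rng.normal(size=(hid, d)), "U2": rng.normal(size=(hid, hid)),
    "M": rng.normal(size=(hid, d_o)), "V": rng.normal(size=(n_tags, hid)),
}
words = rng.normal(size=(T, d))   # stand-ins for word embeddings w_1..w_T
y_hat = joint_tagger(words, o, params)
print(p.round(3), y_hat.shape)
```

Setting alpha to 1 reduces the sketch to the chain-based tagger alone, and setting it to 0 leaves only the knowledge-guided tagger, which makes the balancing role of \alpha easy to inspect.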
Table 1: The F1 scores of predicted slots for different sizes of ATIS training data, where K-SAN utilizes the dependency relations parsed by the Stanford parser. Small: 1/40 set; Medium: 1/10 set; Large: original set. († indicates that the performance is significantly better than all baseline models with p < 0.05 in the t-test.)
cross-entropy, and the optimizer we use is Adam with the default setting (Kingma and Ba, 2014), where the learning rate λ = 0.001, β1 = 0.9, β2 = 0.999, and ε = 1e−08. The maximum number of training iterations for our K-SAN models is set to 300. The dimensionality of input word embeddings is 100, and the hidden layer sizes are in {50, 100, 150}. The dropout rates are set as {0.25, 0.50}. All reported results are from the joint RNN tagger, and the hyperparameters are tuned on the dev set for all experiments.

5.2 Baseline

To validate the effectiveness of the proposed model, we compare the performance with the following baselines.

• Baseline:
  – CRF Tagger (Tur et al., 2010): predicts a semantic slot for each word with a context window (size = 5).
  – RNN Tagger (Mesnil et al., 2015): predicts a semantic slot for each word.
  – CNN Encoder-Tagger (Kim, 2014): tags semantic slots with consideration of sentence embeddings learned by a convolutional model.
• Structural: The NLU models utilize linguistic information when tagging slots, where DCNN and Tree-RNN are the state-of-the-art approaches for embedding sentences with linguistic structures.
  – CRF Tagger (Tur et al., 2010): predicts slots based on the lexical (5-word window) and syntactic (dependent head in the parsing tree) features.
  – DCNN (Ma et al., 2015a): predicts slots by incorporating sentence embeddings learned by a convolutional model with consideration of dependency tree structures.
  – Tree-RNN (Tai et al., 2015): predicts slots with sentence embeddings learned by an RNN model based on the tree structures of sentences.

5.3 Slot Filling Results

Table 1 shows the performance of slot filling on different sizes of training data, where there are three datasets (Small, Medium, and Large use 1/40, 1/10, and the whole training data). For baselines (models without knowledge features), CNN Encoder-Tagger achieves the best performance on all datasets.

Among structural models (models with knowledge encoding), Tree-RNN Encoder-Tagger performs better for Small data but slightly worse than the DCNN Encoder-Tagger.

CNN (Kim, 2014) performs better compared to DCNN (Ma et al., 2015a) and Tree-RNN (Tai et al., 2015), even though CNN does not leverage external knowledge when encoding sentences. When comparing the NLU performance between baselines and other state-of-the-art structural models, there is no significant difference. This suggests that encoding sentence information without distinguishing substructures may not capture the salient semantics needed to improve understanding performance.
[Figure 5 content: the utterance “find nonstop flights from salt lake city to new york on saturday april ninth” shown three times, once per training size (Small, Medium, Large), with the slot tags flight_stop, fromloc.city_name, toloc.city_name, depart_date.day_name, depart_date.month_name, and depart_date.day_number.]
Figure 5: The visualization of the decoded knowledge-guided structural attention for both relations and words learned from different sizes of training data. Relations and words with darker color indicate higher attention weights generated by the proposed K-SAN with CNN. The slot tags are shown in the figure for reference. Note that the dependency relations are incorrectly parsed by the Stanford parser in this example, but our model is still able to benefit from the structural information.

Among the proposed K-SAN models, CNN for encoding performs best on Small (75% on F1) and Medium (88% on F1), and RNN for encoding performs best on the Large set (95% on F1). Also, most of the proposed models outperform all baselines, where the improvement for the small dataset is more significant. This suggests that the proposed models generalize better and are less sensitive to unseen data. For example, given an utterance “which flights leave on monday from montreal and arrive in chicago in the morning”, “morning” can be correctly tagged with a semantic tag B-arrive_time.period_of_day by K-SAN, but it is incorrectly tagged with B-depart_time.period_of_day by baselines, because knowledge guides the model to pay correct attention to salient substructures. The proposed model achieves state-of-the-art performance on the large dataset (RNN-BLSTM in baselines), showing the effectiveness of leveraging knowledge-guided structures for learning embeddings that can be used for specific tasks, as well as the robustness to data scarcity and mismatch.

5.4 Attention Analysis

In order to show the effectiveness of boosting performance by learning correct attention from much smaller training data through the proposed model, we present the visualization of the attention for both words and relations decoded by K-SAN with CNN in Figure 5. The darker color of blocks and lines indicates higher attention for words and relations, respectively. From the figure, the words and the relations with higher attention are the most crucial parts for predicting correct slots, e.g. origin, destination, and time. Furthermore, the difference in attention distribution between the three datasets is not significant; this suggests that our proposed model is able to pay correct attention to important substructures guided by the external knowledge even when the training data is scarce.

5.5 Knowledge Generalization

In order to show the capacity of generalization to different knowledge resources, we apply the K-SAN model with different knowledge bases. Below we compare two types of knowledge formats: dependency tree and Abstract Meaning Representation (AMR). AMR is a semantic formalism in which the meaning of a sentence is encoded as a rooted, directed, acyclic graph (Banarescu et al., 2013), where nodes represent concepts, and labeled directed edges represent the relations between two concepts. The formalism is based on propositional logic and neo-Davidsonian event representations (Parsons, 1990; Davidson, 1967). The semantic concepts in AMR were leveraged to benefit multiple NLP tasks (Liu et al., 2015). Unlike syntactic information from dependency trees, the AMR graph contains semantic information, which may offer more specific conceptual relations. Figure 6 shows the comparison of a dependency tree and an AMR graph associated with the same example utterance and how the knowledge-guided substructures are constructed.

Table 2 presents the performance of CRF and K-SAN with CNN taggers that utilize dependency relations and AMR edges as knowledge guidance on the same datasets, where CRF takes the
head words from either dependency trees or AMR graphs as additional features and K-SAN incorporates knowledge-guided substructures as illustrated in Figure 6. The dependency trees are obtained from the Stanford dependency parser or the SyntaxNet parser², and AMR graphs are generated by a rule-based AMR parser or JAMR³.

[Figure 6 content: for the sentence s, “show me the flights from seattle to san francisco”, panel (a) Syntax: the dependency tree, with substructures 1. show me; 2. show flights the; 3. show flights from seattle; 4. show flights to francisco san; panel (b) Semantics: the AMR graph, with substructures 1. show you; 2. show flight seattle; 3. show flight san francisco; 4. show i.]
Figure 6: The constructing procedure of knowledge-guided substructures, x_i, on an example sentence s.

Among the four knowledge resources (different types, obtained from different parsers), all results show similar performance for the three sizes of datasets. The maximum number of substructures for the dependency tree is larger than the number in the AMR graph (53 and 25 vs. 19 and 8), because syntax is more general and may provide richer cues for guiding more attention, while semantics is more specific and may offer stronger guidance. In sum, the models applying the four different resources achieve similar performance, and all significantly outperform the state-of-the-art NLU tagger, showing the effectiveness, generalization, and robustness of the proposed K-SAN model.

6 Conclusion

This paper proposes a novel model, knowledge-guided structural attention networks (K-SAN), that leverages prior knowledge as guidance to incorporate non-flat topologies and learn suitable attention for different substructures that are salient for specific tasks. The structured information can be captured from small training data, so the model has better generalization and robustness. The experiments show the benefits and effectiveness of the proposed model on the language understanding task, where all knowledge-guided substructures captured by different resources help tagging performance, and state-of-the-art performance is achieved on the ATIS benchmark dataset.

² https://github.com/tensorflow/models/tree/master/syntaxnet
³ https://github.com/jflanigan/jamr
References

Laura Banarescu, Claire Bonial, Shu Cai, Madalina Georgescu, Kira Griffitt, Ulf Hermjakob, Kevin Knight, Philipp Koehn, Martha Palmer, and Nathan Schneider. 2013. Abstract meaning representation for sembanking. In Proceedings of the Linguistic Annotation Workshop and Interoperability with Discourse.

Asli Celikyilmaz and Dilek Hakkani-Tur. 2015. Convolutional neural network based semantic tagging with entity embeddings. In NIPS Workshop on Machine Learning for SLU and Interaction.

Ciprian Chelba, Monika Mahajan, and Alex Acero. 2003. Speech utterance classification. In 2003 IEEE International Conference on Acoustics, Speech, and Signal Processing, 2003. Proceedings.(ICASSP), volume 1, pages I–280. IEEE.

Danqi Chen and Christopher D Manning. 2014. A fast and accurate dependency parser using neural networks. In EMNLP, pages 740–750.

Yun-Nung Chen, Dilek Hakkani-Tur, and Gokhan Tur. 2014. Deriving local relational surface forms from dependency-based entity embeddings for unsupervised spoken language understanding. In 2014 IEEE Spoken Language Technology Workshop (SLT), pages 242–247. IEEE.

Yun-Nung Chen, William Yang Wang, Anatole Gershman, and Alexander I Rudnicky. 2015. Matrix factorization with knowledge graph propagation for unsupervised spoken language understanding. Proceedings of ACL-IJCNLP.

Kyunghyun Cho, Bart van Merriënboer, Dzmitry Bahdanau, and Yoshua Bengio. 2014. On the properties of neural machine translation: Encoder-decoder approaches. arXiv preprint arXiv:1409.1259.

Junyoung Chung, Caglar Gulcehre, KyungHyun Cho, and Yoshua Bengio. 2014. Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv preprint arXiv:1412.3555.

Donald Davidson. 1967. The logical form of action sentences.

Anoop Deoras and Ruhi Sarikaya. 2013. Deep belief network based semantic taggers for spoken language understanding. In INTERSPEECH, pages 2713–2717.

Jeffrey L Elman. 1990. Finding structure in time. Cognitive science, 14(2):179–211.

Alan Graves, Abdel-rahman Mohamed, and Geoffrey Hinton. 2013. Speech recognition with deep recurrent neural networks. In 2013 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 6645–6649. IEEE.

Patrick Haffner, Gokhan Tur, and Jerry H Wright. 2003. Optimizing SVMs for complex call classification. In 2003 IEEE International Conference on Acoustics, Speech, and Signal Processing, 2003. Proceedings.(ICASSP), volume 1, pages I–632. IEEE.

Larry P Heck, Dilek Hakkani-Tür, and Gokhan Tur. 2013. Leveraging knowledge graphs for web-scale unsupervised semantic parsing. In INTERSPEECH, pages 1594–1598.

Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural computation, 9(8):1735–1780.

Po-Sen Huang, Xiaodong He, Jianfeng Gao, Li Deng, Alex Acero, and Larry Heck. 2013. Learning deep structured semantic models for web search using clickthrough data. In Proceedings of the 22nd ACM international conference on information & knowledge management, pages 2333–2338. ACM.

Yoon Kim. 2014. Convolutional neural networks for sentence classification. arXiv preprint arXiv:1408.5882.

Diederik Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.

Quoc V Le and Tomas Mikolov. 2014. Distributed representations of sentences and documents. arXiv preprint arXiv:1405.4053.

Wang Ling, Tiago Luís, Luís Marujo, Ramón Fernandez Astudillo, Silvio Amir, Chris Dyer, Alan W Black, and Isabel Trancoso. 2015. Finding function in form: Compositional character models for open vocabulary word representation. arXiv preprint arXiv:1508.02096.

Jingjing Liu, Panupong Pasupat, Yining Wang, Scott Cyphers, and James Glass. 2013. Query understanding enhanced by hierarchical parsing structures. In Automatic Speech Recognition and Understanding (ASRU), 2013 IEEE Workshop on, pages 72–77. IEEE.

Fei Liu, Jeffrey Flanigan, Sam Thomson, Norman Sadeh, and Noah A Smith. 2015. Toward abstractive summarization using semantic representations. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1077–1086.

Mingbo Ma, Liang Huang, Bing Xiang, and Bowen Zhou. 2015a. Dependency-based convolutional neural networks for sentence embedding. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing, pages 174–179.

Yi Ma, Paul A Crook, Ruhi Sarikaya, and Eric Fosler-Lussier. 2015b. Knowledge graph inference for spoken dialog systems. In 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 5346–5350. IEEE.

Michael F McTear. 2004. Spoken dialogue technology: toward the conversational user interface. Springer Science & Business Media.

Grégoire Mesnil, Yann Dauphin, Kaisheng Yao, Yoshua Bengio, Li Deng, Dilek Hakkani-Tur, Xiaodong He, Larry Heck, Gokhan Tur, Dong Yu, et al. 2015. Using recurrent neural networks for slot filling in spoken language understanding. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 23(3):530–539.

Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. 2013. Distributed representations of words and phrases and their compositionality. In Advances in neural information processing systems, pages 3111–3119.

Terence Parsons. 1990. Events in the semantics of english: A study in subatomic semantics.

Roberto Pieraccini, Evelyne Tzoukermann, Zakhar Gorelov, Jean-Luc Gauvain, Esther Levin, Chin-Hui Lee, and Jay G Wilpon. 1992. A speech understanding system based on statistical representation of semantics. In 1992 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), volume 1, pages 193–196. IEEE.

Suman Ravuri and Andreas Stolcke. 2015. Recurrent neural network and LSTM models for lexical utterance classification. In Sixteenth Annual Conference of the International Speech Communication Association.

Michael Roth and Mirella Lapata. 2016. Neural semantic role labeling with dependency path embeddings. arXiv preprint arXiv:1605.07515.

Alexander Rudnicky and Wei Xu. 1999. An agenda-based dialog management architecture for spoken language systems. In IEEE Automatic Speech Recognition and Understanding Workshop, volume 13, page 17.

Ruhi Sarikaya, Geoffrey E Hinton, and Bhuvana Ramabhadran. 2011. Deep belief nets for natural language call-routing. In 2011 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 5680–5683. IEEE.

Ruhi Sarikaya, Geoffrey E Hinton, and Anoop Deoras. 2014. Application of deep belief networks for natural language understanding. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 22(4):778–784.

Richard Socher, Andrej Karpathy, Quoc V Le, Christopher D Manning, and Andrew Y Ng. 2014. Grounded compositional semantics for finding and describing images with sentences. Transactions of the Association for Computational Linguistics, 2:207–218.

Sainbayar Sukhbaatar, Jason Weston, Rob Fergus, et al. 2015. End-to-end memory networks. In Advances in Neural Information Processing Systems, pages 2431–2439.

Ilya Sutskever, Oriol Vinyals, and Quoc V Le. 2014. Sequence to sequence learning with neural networks. In Advances in neural information processing systems, pages 3104–3112.

Kai Sheng Tai, Richard Socher, and Christopher D Manning. 2015. Improved semantic representations from tree-structured long short-term memory networks. arXiv preprint arXiv:1503.00075.

Gokhan Tur and Renato De Mori. 2011. Spoken language understanding: Systems for extracting semantic information from speech. John Wiley & Sons.

Gokhan Tur, Dilek Hakkani-Tür, and Larry Heck. 2010. What is left to be understood in ATIS? In Spoken Language Technology Workshop (SLT), 2010 IEEE, pages 19–24. IEEE.

Gokhan Tur, Li Deng, Dilek Hakkani-Tür, and Xiaodong He. 2012. Towards deeper understanding: Deep convex networks for semantic utterance classification. In 2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 5045–5048. IEEE.

Ye-Yi Wang, Li Deng, and Alex Acero. 2005. Spoken language understanding. IEEE Signal Processing Magazine, 22(5):16–31.

Jason Weston, Sumit Chopra, and Antoine Bordes. 2015. Memory networks. In International Conference on Learning Representations (ICLR).

Caiming Xiong, Stephen Merity, and Richard Socher. 2016. Dynamic memory networks for visual and textual question answering. arXiv preprint arXiv:1603.01417.

Puyang Xu and Ruhi Sarikaya. 2013. Convolutional neural network based triangular CRF for joint intent detection and slot filling. In 2013 IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU), pages 78–83. IEEE.

Bishan Yang, Wen-tau Yih, Xiaodong He, Jianfeng Gao, and Li Deng. 2014. Embedding entities and relations for learning and inference in knowledge bases. arXiv preprint arXiv:1412.6575.

Kaisheng Yao, Geoffrey Zweig, Mei-Yuh Hwang, Yangyang Shi, and Dong Yu. 2013. Recurrent neural networks for language understanding. In INTERSPEECH, pages 2524–2528.

Kaisheng Yao, Baolin Peng, Yu Zhang, Dong Yu, Geoffrey Zweig, and Yangyang Shi. 2014. Spoken language understanding using long short-term memory neural networks. In 2014 IEEE Spoken Language Technology Workshop (SLT), pages 189–194. IEEE.