… mechanism which learns to identify the meaningful portions of a question. We also incorporate sequence-level representations of the first wh-word and its immediately following word in a question as an additional source of question type information.

Figure 1: Architecture of the proposed model. Hidden unit representations of the BiLSTMs, B and E, are shown to illustrate the answer pointers. Blue and red arrows represent the start and end answer pointers respectively.

Sequence-level Encoding

We apply sequence-level encoding to incorporate contextual information. Let e_t^p and e_t^q be the t-th embedding vectors of the passage and the question respectively. The embedding vectors are fed to a bi-directional LSTM (BiLSTM) (Hochreiter and Schmidhuber 1997). Considering that the outputs of the BiLSTMs are unfolded across time, we represent the outputs as P ∈ R^{T×H} and Q ∈ R^{U×H} for the passage and the question respectively, where H is the number of hidden units of the BiLSTMs. At every time step, the hidden unit representation of a BiLSTM is obtained by concatenating the hidden unit representations of the corresponding forward and backward LSTMs. For the passage, at time step t, the forward and backward LSTM hidden unit representations can be written as:

$\overrightarrow{h}^p_t = \overrightarrow{\mathrm{LSTM}}(\overrightarrow{h}^p_{t-1}, e^p_t)$
$\overleftarrow{h}^p_t = \overleftarrow{\mathrm{LSTM}}(\overleftarrow{h}^p_{t+1}, e^p_t)$    (1)

The t-th row of P is represented as $p_t = \overrightarrow{h}^p_t \,||\, \overleftarrow{h}^p_t$, where || represents the concatenation of two vectors. Similarly, the sequence-level encoding for the question is $q_t = \overrightarrow{h}^q_t \,||\, \overleftarrow{h}^q_t$, where q_t is the t-th row of Q.
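To make this encoding step concrete, the sketch below computes P and Q with a single-layer BiLSTM in PyTorch. It is a minimal illustration written for this description, not the authors' implementation; the toy tensor sizes, the assumption that H is split evenly between the two directions, and the choice to share one encoder between passage and question are ours.

    import torch
    import torch.nn as nn

    class SequenceEncoder(nn.Module):
        """Sketch of sequence-level encoding with a single-layer BiLSTM."""
        def __init__(self, emb_dim: int, hidden_size: int):
            super().__init__()
            # Assumption: H is even and split across the two directions (H/2 each),
            # so the concatenated per-step output has dimension H.
            assert hidden_size % 2 == 0
            self.bilstm = nn.LSTM(emb_dim, hidden_size // 2,
                                  batch_first=True, bidirectional=True)

        def forward(self, emb):              # emb: (batch, seq_len, emb_dim)
            out, _ = self.bilstm(emb)        # out: (batch, seq_len, H) = forward || backward
            return out

    # Usage with toy embeddings: passage of length T = 50, question of length U = 10.
    encoder = SequenceEncoder(emb_dim=300, hidden_size=300)   # sizes assumed, not from the paper
    e_p = torch.randn(1, 50, 300)
    e_q = torch.randn(1, 10, 300)
    P = encoder(e_p)                          # (1, T, H)
    Q = encoder(e_q)                          # (1, U, H)

Whether the passage and question encoders share parameters is an implementation choice not specified in this excerpt.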
Figure 2: Multi-factor attention weights (darker regions signify higher weights).

Table 2: Results on the NewsQA dataset. † denotes the models of Weissenborn, Wiese, and Seiffe (2017).
The probability distributions for the beginning pointer b and the ending pointer e for a given passage P and a question Q can be given as:

Pr(b | P, Q) = softmax(s^b)
Pr(e | P, Q, b) = softmax(s^e)    (13)

The joint probability distribution for obtaining the answer a is given as:

Pr(a | P, Q) = Pr(b | P, Q) Pr(e | P, Q, b)    (14)

To train our model, we minimize the cross-entropy loss

loss = − Σ log Pr(a | P, Q)    (15)

summing over all training instances. During prediction, we select the locations in the passage for which the product of Pr(b) and Pr(e) is maximum, keeping the constraint 1 ≤ b ≤ e ≤ T.
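As a concrete illustration of the prediction rule above and the constraint 1 ≤ b ≤ e ≤ T, the following sketch decodes the best span from two pointer distributions. It is a generic NumPy implementation written for this description, not the authors' code; positions are 0-indexed, the optional maximum span length is our addition, and for simplicity the end distribution is treated as fixed rather than recomputed given b as in Pr(e | P, Q, b).

    import numpy as np

    def decode_span(p_begin, p_end, max_span_len=None):
        """Pick (b, e) maximizing p_begin[b] * p_end[e] subject to b <= e."""
        T = len(p_begin)
        best_b, best_e, best_score = 0, 0, -1.0
        for b in range(T):
            last = T if max_span_len is None else min(T, b + max_span_len)
            for e in range(b, last):
                score = p_begin[b] * p_end[e]
                if score > best_score:
                    best_b, best_e, best_score = b, e, score
        return best_b, best_e, best_score

    # Toy example with T = 5 passage positions.
    p_b = np.array([0.1, 0.6, 0.1, 0.1, 0.1])
    p_e = np.array([0.05, 0.1, 0.7, 0.1, 0.05])
    print(decode_span(p_b, p_e))   # -> (1, 2, 0.42): best span runs from position 1 to 2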
Visualization

To understand how the proposed model works, for the example given in Table 1, we visualize the normalized multi-factor attention weights F̃ and the attention weights k which are used for max-attentional question aggregation.

In Figure 2, a small portion of F̃ is shown, in which the answer words Robert and Park are both assigned higher weights when paired with the context word Korean-American. Due to the use of multi-factor attention, the answer segment pays more attention to the important keyword although it is quite far away in the context passage, and thus effectively infers the correct answer through deeper understanding. In Figure 3, it is clear that the important question word name receives a higher weight than the other question words. This helps to infer the fine-grained answer type during prediction, i.e., a person's name in this example.

Experiments

We evaluated AMANDA on three challenging QA datasets: NewsQA, TriviaQA, and SearchQA. Using the NewsQA development set as a benchmark, we perform a rigorous analysis to better understand how our proposed model works.

Datasets

The NewsQA dataset (Trischler et al. 2016) consists of around 100K answerable questions in total. Similar to Trischler et al. (2016) and Weissenborn, Wiese, and Seiffe (2017), we do not consider the unanswerable questions in our experiments. NewsQA is more challenging than previously released datasets, as a significant proportion of its questions requires reasoning beyond simple word and context matching. This is because the questions in NewsQA were formulated based only on article summaries, without access to the main text of the articles. Moreover, NewsQA passages are significantly longer (616 words on average) and cover a wider range of topics.

TriviaQA (Joshi et al. 2017) consists of question-answer pairs authored by trivia enthusiasts and independently gathered evidence documents.
Model         Domain      Distant Supervision                      Verified
                          Dev (EM / F1)     Test (EM / F1)         Dev (EM / F1)     Test (EM / F1)
Random ‡      Wiki        12.72 / 22.91     12.74 / 22.35          14.81 / 23.31     15.41 / 25.44
Classifier ‡  Wiki        23.42 / 27.68     22.45 / 26.52          24.91 / 29.43     27.23 / 31.37
BiDAF ‡       Wiki        40.26 / 45.74     40.32 / 45.91          47.47 / 53.70     44.86 / 50.71
MEMEN ⋆       Wiki        43.16 / 46.90        -  /  -             49.28 / 55.83        -  /  -
AMANDA        Wiki        46.95 / 52.51     46.67 / 52.22          52.86 / 58.74     50.51 / 55.93
Classifier ‡  Web         24.64 / 29.08     24.00 / 28.38          27.38 / 31.91     30.17 / 34.67
BiDAF ‡       Web         41.08 / 47.40     40.74 / 47.05          51.38 / 55.47     49.54 / 55.80
MEMEN ⋆       Web         44.25 / 48.34        -  /  -             53.27 / 57.64        -  /  -
AMANDA        Web         46.68 / 53.27     46.58 / 53.13          60.31 / 64.90     55.14 / 62.88

Table 3: Results on the TriviaQA dataset. ‡ (Joshi et al. 2017), ⋆ (Pan et al. 2017).
Figure 4: (a) Results for different question types. (b) Results for different predicted answer lengths.

Answer ambiguity (42%)
Q: What happens to the power supply?
... customers possible.” The outages were due mostly to power lines downed by Saturday’s hurricane-force winds, which knocked over trees and utility poles. At ...

Context mismatch (22%)
Q: Who was Kandi Burrus’s fiance?
Kandi Burruss, the newest cast member of the reality show “The Real Housewives of Atlanta” ... fiance, who died ... fiance, 34-year-old Ashley “A.J.” Jewell, also...

Complex inference (10%)
Q: When did the Delta Queen first serve?
... the Delta Queen steamboat, a floating National ... scheduled voyage this week ... The Delta Queen will go ... Supporters of the boat, which has roamed the nation’s waterways since 1927 and helped the Navy ...

Paraphrasing issues (6%)
Q: What was Ralph Lauren’s first job?
Ralph Lauren has ... Japan. For four ... than the former tie salesman from the Bronx. “Those ties ... Lauren originally named his company Polo because ...

Table 9: Examples of different error types and their percentages. Ground truth answers are bold-faced and predicted answers are underlined.

Quantitative Error Analysis

We analyzed the performance of AMANDA across different question types and different predicted answer lengths. Figure 4(a) shows that it performs poorly on why and other questions, whose answers are usually longer. Figure 4(b) supports this fact as well: when the predicted answer length increases, both F1 and EM start to degrade, and the gap between F1 and EM also increases for longer answers. This is because, for longer answers, the model is not able to decide the exact boundaries (low EM score) but manages to predict some correct words which partially overlap with the reference answer (relatively higher F1 score).
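The gap between EM and F1 for longer answers can be made concrete with the standard extractive-QA metrics. The snippet below implements generic exact-match and token-level F1 between a predicted span and a reference answer; it is a simplified illustration (no normalization of articles or punctuation), not the official evaluation scripts used for these datasets.

    from collections import Counter

    def exact_match(prediction: str, reference: str) -> float:
        """EM is 1 only if the predicted span matches the reference exactly."""
        return float(prediction.strip().lower() == reference.strip().lower())

    def token_f1(prediction: str, reference: str) -> float:
        """Token-level F1 rewards partial overlap between predicted and reference spans."""
        pred_tokens = prediction.lower().split()
        ref_tokens = reference.lower().split()
        overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
        if overlap == 0:
            return 0.0
        precision = overlap / len(pred_tokens)
        recall = overlap / len(ref_tokens)
        return 2 * precision * recall / (precision + recall)

    # A long prediction that partially overlaps the reference: EM = 0, but F1 > 0.
    print(exact_match("the Delta Queen steamboat", "Delta Queen"))  # 0.0
    print(token_f1("the Delta Queen steamboat", "Delta Queen"))     # ~0.67

This is why over-long predicted spans depress EM much faster than F1.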
Qualitative Error Analysis

On the NewsQA development set, AMANDA predicted completely wrong answers on 25.1% of the questions. We randomly picked 50 such questions for analysis. The observed types of errors are given in Table 9 with examples. 42% of the errors are due to answer ambiguities, i.e., no unique answer is present. 22% of the errors are due to context mismatch.

Related Work

Recently, several neural network-based models have been proposed for QA. Models based on the idea of chunking and ranking include Yu et al. (2016) and Lee et al. (2016). Yang et al. (2017) used a fine-grained gating mechanism to capture the correlation between a passage and a question. Wang and Jiang (2017) used a Match-LSTM to encode the question and passage together, and a boundary model determined the beginning and ending boundaries of an answer. Trischler et al. (2016) reimplemented Match-LSTM for the NewsQA dataset and proposed a faster version of it. Xiong, Zhong, and Socher (2017) used a co-attentive encoder followed by a dynamic decoder for iteratively estimating the boundary pointers. Seo et al. (2017) proposed a bi-directional attention flow approach to capture the interactions between passages and questions. Weissenborn, Wiese, and Seiffe (2017) proposed a simple context matching-based neural encoder and incorporated word overlap and term frequency features to estimate the start and end pointers. Wang et al. (2017) proposed a gated self-matching approach which encodes the passage and question together using a self-matching attention mechanism. Pan et al. (2017) proposed a memory network-based multi-layer embedding model and reported results on the TriviaQA dataset.

Different from all prior approaches, our proposed multi-factor attentive encoding helps to aggregate relevant evidence by using a tensor-based multi-factor attention mechanism. This in turn helps to infer the answer by synthesizing information from multiple sentences. AMANDA also learns to focus on the important question words to encode the aggregated question vector for predicting the answer with a suitable answer type.

Conclusion

In this paper, we have proposed a question-focused multi-factor attention network (AMANDA), which learns to aggregate meaningful evidence from multiple sentences with deeper understanding and to focus on the important words in a question for extracting an answer span from the passage with a suitable answer type. AMANDA achieves state-of-the-art performance on the NewsQA, TriviaQA, and SearchQA datasets, outperforming all prior published models by significant margins. Ablation results show the importance of the proposed components.

Acknowledgement

This research was supported by research grant R-252-000-634-592.
References

Chen, D.; Bolton, J.; and Manning, C. D. 2016. A thorough examination of the CNN / Daily Mail reading comprehension task. In Proceedings of ACL.
Cui, Y.; Chen, Z.; Wei, S.; Wang, S.; Liu, T.; and Hu, G. 2017. Attention-over-attention neural networks for reading comprehension. In Proceedings of ACL.
Dhingra, B.; Liu, H.; Yang, Z.; Cohen, W.; and Salakhutdinov, R. 2017. Gated-attention readers for text comprehension. In Proceedings of ACL.
Dunn, M.; Sagun, L.; Higgins, M.; Güney, V. U.; Cirik, V.; and Cho, K. 2017. SearchQA: A new Q&A dataset augmented with context from a search engine. arXiv preprint arXiv:1704.05179.
Golub, D.; Huang, P.-S.; He, X.; and Deng, L. 2017. Two-stage synthesis networks for transfer learning in machine comprehension. arXiv preprint arXiv:1706.09789.
Hermann, K. M.; Kociský, T.; Grefenstette, E.; Espeholt, L.; Kay, W.; Suleyman, M.; and Blunsom, P. 2015. Teaching machines to read and comprehend. In Advances in NIPS.
Hochreiter, S., and Schmidhuber, J. 1997. Long short-term memory. Neural Computation 9(8):1735–1780.
Joshi, M.; Choi, E.; Weld, D.; and Zettlemoyer, L. 2017. TriviaQA: A large scale distantly supervised challenge dataset for reading comprehension. In Proceedings of ACL.
Kadlec, R.; Schmid, M.; Bajgar, O.; and Kleindienst, J. 2016. Text understanding with the attention sum reader network. In Proceedings of ACL.
Kim, Y. 2014. Convolutional neural networks for sentence classification. In Proceedings of EMNLP.
Kingma, D. P., and Ba, J. L. 2015. Adam: A method for stochastic optimization. In Proceedings of ICLR.
Kobayashi, S.; Tian, R.; Okazaki, N.; and Inui, K. 2016. Dynamic entity representation with max-pooling improves machine reading. In Proceedings of NAACL.
Lee, K.; Kwiatkowski, T.; Parikh, A.; and Das, D. 2016. Learning recurrent span representations for extractive question answering. arXiv preprint arXiv:1611.01436.
Li, Q.; Li, T.; and Chang, B. 2016. Discourse parsing with attention-based hierarchical neural networks. In Proceedings of EMNLP.
Pan, B.; Li, H.; Zhao, Z.; Cao, B.; Cai, D.; and He, X. 2017. MEMEN: Multi-layer embedding with memory networks for machine comprehension. arXiv preprint arXiv:1707.09098.
Pei, W.; Ge, T.; and Chang, B. 2014. Max-margin tensor neural network for Chinese word segmentation. In Proceedings of ACL.
Pennington, J.; Socher, R.; and Manning, C. D. 2014. GloVe: Global vectors for word representation. In Proceedings of EMNLP.
Rajpurkar, P.; Zhang, J.; Lopyrev, K.; and Liang, P. 2016. SQuAD: 100,000+ questions for machine comprehension of text. In Proceedings of EMNLP.
Richardson, M.; Burges, C. J. C.; and Renshaw, E. 2013. MCTest: A challenge dataset for the open-domain machine comprehension of text. In Proceedings of EMNLP.
Seo, M.; Kembhavi, A.; Farhadi, A.; and Hajishirzi, H. 2017. Bidirectional attention flow for machine comprehension. In Proceedings of ICLR.
Srivastava, N.; Hinton, G.; Krizhevsky, A.; Sutskever, I.; and Salakhutdinov, R. 2014. Dropout: A simple way to prevent neural networks from overfitting. JMLR 15(1):1929–1958.
Trischler, A.; Wang, T.; Yuan, X.; Harris, J.; Sordoni, A.; Bachman, P.; and Suleman, K. 2016. NewsQA: A machine comprehension dataset. arXiv preprint arXiv:1611.09830.
Vinyals, O.; Fortunato, M.; and Jaitly, N. 2015. Pointer networks. In Advances in NIPS.
Wang, S., and Jiang, J. 2017. Machine comprehension using Match-LSTM and answer pointer. In Proceedings of ICLR.
Wang, Z.; Mi, H.; Hamza, W.; and Florian, R. 2016. Multi-perspective context matching for machine comprehension. arXiv preprint arXiv:1612.04211.
Wang, W.; Yang, N.; Wei, F.; Chang, B.; and Zhou, M. 2017. Gated self-matching networks for reading comprehension and question answering. In Proceedings of ACL.
Weissenborn, D.; Wiese, G.; and Seiffe, L. 2017. Making neural QA as simple as possible but not simpler. In Proceedings of CoNLL.
Weissenborn, D. 2017. Reading twice for natural language understanding. arXiv preprint arXiv:1706.02596.
Xiong, C.; Zhong, V.; and Socher, R. 2017. Dynamic coattention networks for question answering. In Proceedings of ICLR.
Yang, Z.; Dhingra, B.; Yuan, Y.; Hu, J.; Cohen, W. W.; and Salakhutdinov, R. 2017. Words or characters? Fine-grained gating for reading comprehension. In Proceedings of ICLR.
Yu, Y.; Zhang, W.; Hasan, K.; Yu, M.; Xiang, B.; and Zhou, B. 2016. End-to-end answer chunk extraction and ranking for reading comprehension. arXiv preprint arXiv:1610.09996.