
Read + Verify: Machine Reading Comprehension with Unanswerable Questions

Minghao Hu†∗, Furu Wei§ , Yuxing Peng† , Zhen Huang† , Nan Yang§ , Ming Zhou§

†College of Computer, National University of Defense Technology
§Microsoft Research Asia
{huminghao09,pengyuxing,huangzhen}@nudt.edu.cn
{fuwei,nanya,mingzhou}@microsoft.com

Abstract

Machine reading comprehension with unanswerable questions aims to abstain from answering when no answer can be inferred. In addition to extracting answers, previous works usually predict an additional “no-answer” probability to detect unanswerable cases. However, they fail to validate the answerability of the question by verifying the legitimacy of the predicted answer. To address this problem, we propose a novel read-then-verify system, which not only utilizes a neural reader to extract candidate answers and produce no-answer probabilities, but also leverages an answer verifier to decide whether the predicted answer is entailed by the input snippets. Moreover, we introduce two auxiliary losses to help the reader better handle answer extraction as well as no-answer detection, and investigate three different architectures for the answer verifier. Our experiments on the SQuAD 2.0 dataset show that our system achieves a score of 74.2 F1 on the test set, outperforming all previous approaches at the time of submission (Aug. 23rd, 2018).

[Figure 1: An overview of our approach. The reader first extracts a candidate answer and produces a no-answer probability (NA Prob). The answer verifier then checks whether the extracted answer is legitimate or not. Finally, the system aggregates previous results and outputs the final prediction. In the illustrated example, the question “What is France a region of?” is asked about the passage “The Normans were the people who in the 10th and 11th centuries gave their name to Normandy, a region in France. They were ...”; the reader extracts “Normandy” with NA Prob 0.4, the verifier compares the answer sentence “The Normans … in France.” with the question and outputs NA Prob 0.9, and the averaged final probability of 0.65 leads the system to predict “No answer”.]
1 Introduction

The ability to comprehend text and answer questions is crucial for natural language processing. Due to the creation of various large-scale datasets (Hermann et al., 2015; Nguyen et al., 2016; Joshi et al., 2017; Kočiskỳ et al., 2018), remarkable advancements have been made in the task of machine reading comprehension.

Nevertheless, one important hypothesis behind current approaches is that there always exists a correct answer in the context passage. Therefore, the models only need to choose a most plausible text span based on the question, instead of checking if there exists an answer in the first place. Recently, a new version of the Stanford Question Answering Dataset (SQuAD), namely SQuAD 2.0 (Rajpurkar et al., 2018), has been proposed to test the ability of answering answerable questions as well as detecting unanswerable cases. To deal with unanswerable cases, systems must learn to identify a wide range of linguistic phenomena such as negation, antonymy and entity changes between the passage and the question.

Previous works (Levy et al., 2017; Clark and Gardner, 2018) all apply a shared-normalization operation between a “no-answer” score and answer span scores, so as to produce a probability that a question is unanswerable and output a candidate answer simultaneously. However, they have not considered further validating the answerability of the question by verifying the legitimacy of the predicted answer.
Here, answerability denotes whether the question has an answer, and legitimacy means whether the extracted text is supported by the passage and the question. Humans, on the contrary, tend to first find a plausible answer given a question, and then check if there exists any contradictory semantics.

To address the above issue, we propose a novel read-then-verify system that aims to be robust to unanswerable questions in this paper. As shown in Figure 1, our system consists of two components: (1) a no-answer reader for extracting candidate answers and detecting unanswerable questions, and (2) an answer verifier for deciding whether or not the extracted candidate is legitimate. The key contributions of our work are three-fold.

First, we augment existing readers with two auxiliary losses, to better handle answer extraction and no-answer detection respectively. Since the downstream verifying stage always requires a candidate answer, the reader must be able to extract plausible answers for all questions. However, previous approaches are not trained to find potential candidates for unanswerable questions. We solve this problem by introducing an independent span loss that aims to concentrate on the answer extraction task regardless of the answerability of the question. In order not to conflict with no-answer detection, we leverage a multi-head pointer network to generate two pairs of span scores, where one pair is normalized with the no-answer score and the other is used for our auxiliary loss. Besides, we present another independent no-answer loss to further alleviate the conflict, by focusing on the no-answer detection task without considering the shared normalization of answer extraction.

Second, in addition to the standard reading phase, we introduce an additional answer verifying phase, which aims at finding local entailment that supports the answer by comparing the answer sentence with the question. This is based on the observation that the core phenomenon of unanswerable questions usually occurs between a few passage words and question words. Take Figure 1 for example: after comparing the passage snippet “Normandy, a region in France” with the question, we can easily determine that no answer exists, since the question asks for an impossible condition. We investigate three architectures for the answer verifying model. The first one is a sequential model that takes the two sentences as a long sequence, while the second one attempts to model inferences between the two sentences interactively. The last one is a hybrid model that combines the above two models to test if the performance can be further improved.

Lastly, we evaluate our system on the SQuAD 2.0 dataset (Rajpurkar et al., 2018), a reading comprehension benchmark augmented with unanswerable questions. Our best reader achieves an F1 score of 73.7 and 69.1 on the development set, with or without ELMo embeddings (Peters et al., 2018). When combined with the answer verifier, the whole system improves to 74.8 F1 and 72.3 F1 respectively. Moreover, the best system achieves a score of 74.2 F1 on the test set, outperforming all previous approaches at the time of submission (Aug. 23rd, 2018).

2 Background

Existing reading comprehension models focus on answering questions where a correct answer is guaranteed to exist. However, they are not able to identify unanswerable questions and tend to return an unreliable text span. Consequently, we first give a brief introduction to the unanswerable reading comprehension task, and then investigate current solutions.

2.1 Task Description

Given a context passage and a question, the machine needs to not only find answers to answerable questions but also detect unanswerable cases. The passage and the question are described as sequences of word tokens, denoted as P = {x_i^p}_{i=1}^{l_p} and Q = {x_j^q}_{j=1}^{l_q} respectively, where l_p is the passage length and l_q is the question length. Our goal is to predict an answer A, which is constrained to be a segment of text in the passage, A = {x_i^p}_{i=l_a}^{l_b}, or to return an empty string if there is no answer, where l_a and l_b indicate the answer boundary.

2.2 No-Answer Reader

To predict an answer span, current approaches first embed and encode both the passage and the question into two series of fixed-size vectors. Then they leverage various attention mechanisms, such as bi-attention (Seo et al., 2017) or reattention (Hu et al., 2018a), to build interdependent representations for the passage and the question, which are denoted as U = {u_i}_{i=1}^{l_p} and V = {v_j}_{j=1}^{l_q} respectively.
Finally, they summarize the question representation into a dense vector t, and utilize the pointer network (Vinyals et al., 2015) to produce two scores over passage words that indicate the answer boundary (Wang et al., 2017):

o_j = w_v^T v_j,   t = Σ_{j=1}^{l_q} ( e^{o_j} / Σ_{k=1}^{l_q} e^{o_k} ) v_j

α, β = pointer network(U, t)

where α and β are the span scores for the answer start and end bounds.

In order to additionally detect whether the question is unanswerable, previous approaches (Levy et al., 2017; Clark and Gardner, 2018) attempt to predict a special no-answer score z in addition to the distribution over answer spans. Concretely, a shared softmax function can be applied to normalize both the no-answer score and the span scores, yielding a joint no-answer objective defined as:

L_joint = −log [ ((1 − δ) e^z + δ e^{α_a} e^{β_b}) / (e^z + Σ_{i=1}^{l_p} Σ_{j=1}^{l_p} e^{α_i} e^{β_j}) ]

where a and b are the ground-truth start and end positions, and δ is 1 if the question is answerable and 0 otherwise. At test time, a question is detected as being unanswerable once the normalized no-answer score exceeds some threshold.
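The shared-normalization objective above can be made concrete with a small sketch. The following NumPy code is our own illustration (not the authors' released implementation); it computes L_joint for a single example and the normalized no-answer probability that is later compared against the threshold, with illustrative variable names.

```python
import numpy as np

def log_sum_exp(x):
    m = x.max()
    return m + np.log(np.exp(x - m).sum())

def joint_no_answer_loss(alpha, beta, z, answerable, a=None, b=None):
    """L_joint: a shared softmax over the no-answer score z and all (start, end) spans.

    alpha, beta: 1-D arrays of start/end span scores over the passage (length l_p).
    z: scalar no-answer score; answerable: whether the gold answer is the span (a, b).
    """
    span_log_scores = np.add.outer(alpha, beta)                 # alpha_i + beta_j in log space
    log_denominator = np.logaddexp(z, log_sum_exp(span_log_scores))
    log_numerator = (alpha[a] + beta[b]) if answerable else z   # delta = 1 or delta = 0
    return -(log_numerator - log_denominator)

def no_answer_probability(alpha, beta, z):
    """Normalized no-answer probability used for thresholding at test time."""
    log_denominator = np.logaddexp(z, log_sum_exp(np.add.outer(alpha, beta)))
    return float(np.exp(z - log_denominator))

# Toy usage with random scores for a 3-word passage.
rng = np.random.default_rng(0)
alpha, beta, z = rng.standard_normal(3), rng.standard_normal(3), 0.5
print(joint_no_answer_loss(alpha, beta, z, answerable=True, a=1, b=2))
print(no_answer_probability(alpha, beta, z))
```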
3 Approach

In this section we describe our proposed read-then-verify system. The system first leverages a neural reader to extract a candidate answer and detect if the question is unanswerable. It then utilizes an answer verifier to further check the legitimacy of the predicted answer. We enhance the reader with two novel auxiliary losses, and investigate three different architectures for the answer verifier.

3.1 Reader with Auxiliary Losses

Although previous no-answer readers are capable of jointly learning answer extraction and no-answer detection, there exist two problems for each individual task. For answer extraction, previous readers are not trained to find candidate answers for unanswerable questions. In our system, however, the reader is required to extract a plausible answer that is fed to the downstream verifying stage for all questions. As for no-answer detection, a conflict could be triggered due to the shared normalization between the span scores and the no-answer score. Since the sum of these normalized scores is always 1, an over-confident span probability would cause an unconfident no-answer probability, and vice versa. Therefore, inaccurate confidences on answer extraction, which have been observed by Clark and Gardner (2018), could lead to imprecise predictions on no-answer detection. To address the above issues, we propose two auxiliary losses to optimize and enhance each task independently without interfering with each other.

Independent Span Loss: This loss is designed to concentrate on answer extraction. In this task, the model is asked to extract a candidate answer for all possible questions. Therefore, besides answerable questions, we also include unanswerable cases as positive examples, and consider the plausible answer as the gold answer.¹ In order not to conflict with no-answer detection, we propose to use a multi-head pointer network to additionally produce another pair of span scores α̃ and β̃:

õ_j = w̃_v^T v_j,   t̃ = Σ_{j=1}^{l_q} ( e^{õ_j} / Σ_{k=1}^{l_q} e^{õ_k} ) v_j

α̃, β̃ = pointer network(U, t̃)

where the multiple heads share the same network architecture but with different parameters. Then, we define an independent span loss as:

L_indep-I = −log [ e^{α̃_ã} e^{β̃_b̃} / (Σ_{i=1}^{l_p} Σ_{j=1}^{l_p} e^{α̃_i} e^{β̃_j}) ]

where ã and b̃ are the augmented ground-truth answer boundaries. The final span probability is obtained using a simple mean pooling over the two pairs of softmax-normalized span scores.

¹ In SQuAD 2.0, the plausible answer is annotated by humans for every unanswerable question. A pre-trained reader can also be used to extract plausible answers if no annotation is provided.

Independent No-Answer Loss: Despite a multi-head pointer network being used to prevent the conflict problem, no-answer detection can still be weakened since the no-answer score z is normalized with the span scores. Therefore, we consider exclusively encouraging the prediction on no-answer detection. This is achieved by introducing an independent no-answer loss:

L_indep-II = −(1 − δ) log σ(z) − δ log(1 − σ(z))

where σ is the sigmoid activation function. Through this loss, we expect the model to produce a more confident prediction of the no-answer score z without considering the shared-normalization operation.

Finally, we combine the above losses as follows:

L = L_joint + γ L_indep-I + λ L_indep-II

where γ and λ are two hyper-parameters that control the weight of the two auxiliary losses.
[Figure 2: An overview of answer verifiers. (a) Input structures for running the three different models. (b) Finetuned Transformer model proposed by Radford et al. (2018). (c) Our proposed interactive model.]
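To complement the equations in Section 3.1, here is a minimal NumPy sketch of the two auxiliary losses and of the combined reader objective; the function names and the way scores are passed around are our own simplifications, not the paper's code.

```python
import numpy as np

def log_sum_exp(x):
    m = x.max()
    return m + np.log(np.exp(x - m).sum())

def independent_span_loss(alpha_aux, beta_aux, a_aug, b_aug):
    """L_indep-I over the auxiliary pointer head's scores; for unanswerable
    questions (a_aug, b_aug) point at the annotated plausible answer."""
    span_log_scores = np.add.outer(alpha_aux, beta_aux)
    return -(alpha_aux[a_aug] + beta_aux[b_aug] - log_sum_exp(span_log_scores))

def independent_no_answer_loss(z, answerable):
    """L_indep-II: a sigmoid cross-entropy on z alone, free of shared normalization."""
    sigma_z = 1.0 / (1.0 + np.exp(-z))
    return -np.log(1.0 - sigma_z) if answerable else -np.log(sigma_z)

def reader_loss(l_joint, l_indep_1, l_indep_2, gamma=0.3, lam=1.0):
    """L = L_joint + gamma * L_indep-I + lambda * L_indep-II
    (gamma = 0.3 and lambda = 1 are the values reported in Section 4.3)."""
    return l_joint + gamma * l_indep_1 + lam * l_indep_2

# Toy usage: combine the three terms for one unanswerable example.
rng = np.random.default_rng(0)
alpha_aux, beta_aux, z = rng.standard_normal(3), rng.standard_normal(3), 1.2
l1 = independent_span_loss(alpha_aux, beta_aux, a_aug=0, b_aug=1)
l2 = independent_no_answer_loss(z, answerable=False)
print(reader_loss(l_joint=2.0, l_indep_1=l1, l_indep_2=l2))
```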

3.2 Answer Verifier

After the answer is extracted, an answer verifier is used to compare the answer sentence with the question, so as to recognize local textual entailment that supports the answer. Here, we define the answer sentence as the context sentence that contains either gold answers or plausible answers. We explore three different architectures for the verifying task (shown in Figure 2): (1) a sequential model that takes the inputs as a long sequence, (2) an interactive model that encodes the two sentences interdependently, and (3) a hybrid model that takes both of the two approaches into account.

3.2.1 Model-I: Sequential Architecture

In Model-I, we convert the answer sentence and the question along with the extracted answer into an ordered input sequence. Then we adapt the recently proposed Finetuned Transformer model (Radford et al., 2018) to perform the task. The model is a multi-layer Transformer decoder (Liu et al., 2018a), which is first trained with a language modeling objective on a large unlabeled text corpus and then finetuned on the specific target task.

Specifically, given an answer sentence S, a question Q and an extracted answer A, we concatenate the two sentences with the answer while adding a delimiter token in between to get [S; Q; $; A]. The sequence, which can also be denoted as a series of tokens X = {x_i}_{i=1}^{l_m}, is then encoded by a multi-head self-attention operation followed by position-wise feed-forward layers as follows:

h_0 = W_e[X] + W_p

h_i = transformer_block(h_{i−1}), ∀i ∈ [1, n]

where X denotes the sequence's indexes in the vocabulary, W_e is the token embedding matrix, W_p is the position embedding matrix, and n is the number of layers.

The last token's activation h_n^{l_m} is then fed into a linear projection layer followed by a softmax function to output the no-answer probability y:

p(y|X) = softmax(h_n^{l_m} W_y)

A standard cross-entropy objective is used to minimize the negative log-likelihood:

L(θ) = −Σ_{(X,y)} log p(y|X)
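The sequential verifier can be summarized with a short sketch. The code below only mimics the data flow of Model-I — building the ordered sequence [S; Q; $; A] and classifying from the last token's activation — with a random stand-in for the pre-trained Transformer decoder; the token ids and dimensions are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def build_input(sentence_ids, question_ids, answer_ids, delimiter_id):
    """Concatenate answer sentence, question, delimiter and answer into one sequence."""
    return sentence_ids + question_ids + [delimiter_id] + answer_ids

def transformer_decoder(token_ids, n_layers=12, d_model=768):
    """Stand-in for the pre-trained Transformer decoder: one hidden vector per token.
    A real model would apply masked multi-head self-attention and feed-forward
    blocks n_layers times on top of token + position embeddings."""
    return rng.standard_normal((len(token_ids), d_model))

def no_answer_probability(token_ids, w_out):
    hidden = transformer_decoder(token_ids)
    logits = hidden[-1] @ w_out          # last token's activation -> linear projection
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()                 # softmax over {answerable, no-answer}
    return probs[1]                      # index 1 taken as the "no answer" class here

# Toy usage with hypothetical token ids.
sequence = build_input([11, 42, 7], [23, 99], [42], delimiter_id=1)
w_out = rng.standard_normal((768, 2))
print(no_answer_probability(sequence, w_out))
```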
3.2.2 Model-II: Interactive Architecture

Since the answer verifier needs to model inferences between two sentences, we also consider an interaction-based approach that has the following layers:

Encoding: We embed words using the GloVe embeddings (Pennington et al., 2014), and also embed the characters of each word with trainable vectors. We run a bidirectional LSTM (BiLSTM) (Hochreiter and Schmidhuber, 1997) to encode the characters and concatenate the two last hidden states to get character-level embeddings. In addition, we use a binary feature to indicate if a word is part of the answer. All embeddings along with the feature are then concatenated and encoded by a weight-shared BiLSTM, yielding two series of contextual representations:

s_i = BiLSTM([word_i^s; char_i^s; fea_i^s]), ∀i ∈ [1, l_s]

q_j = BiLSTM([word_j^q; char_j^q; fea_j^q]), ∀j ∈ [1, l_q]

where l_s is the length of the answer sentence, and [·; ·] denotes concatenation.

Inference Modeling: An inference modeling layer is used to capture the interactions between the two sentences and produce two inference-aware sentence representations. We first compute the dot products of all tuples <s_i, q_j> as attention weights, and then normalize these weights so as to obtain attended vectors as follows:

a_ij = s_i^T q_j, ∀i ∈ [1, l_s], ∀j ∈ [1, l_q]

b_i = Σ_{j=1}^{l_q} ( e^{a_ij} / Σ_{k=1}^{l_q} e^{a_ik} ) q_j,   c_j = Σ_{i=1}^{l_s} ( e^{a_ij} / Σ_{k=1}^{l_s} e^{a_kj} ) s_i

We separately fuse local inference information between the aligned pairs {(s_i, b_i)}_{i=1}^{l_s} and {(q_j, c_j)}_{j=1}^{l_q} using a weight-shared function F:

s̃_i = F(s_i, b_i),   q̃_j = F(q_j, c_j)

A heuristic fusion function o = F(x, y) proposed by Hu et al. (2018a) is used:

r = gelu(W_r [x; y; x ◦ y; x − y])
g = σ(W_g [x; y; x ◦ y; x − y])
o = g ◦ r + (1 − g) ◦ x

where gelu is the Gaussian Error Linear Unit (Hendrycks and Gimpel, 2016), ◦ is element-wise multiplication, and the bias term is omitted.
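The alignment and fusion steps of the inference modeling layer can be sketched as follows. This NumPy version uses the tanh approximation of gelu and random projection matrices; it is meant to illustrate the computation, not to reproduce the trained verifier.

```python
import numpy as np

def gelu(x):
    # tanh approximation of the Gaussian Error Linear Unit
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x ** 3)))

def softmax(x, axis):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def align(S, Q):
    """Dot-product alignment: b_i attends over question words, c_j over sentence words."""
    A = S @ Q.T                      # a_ij = s_i . q_j, shape (l_s, l_q)
    B = softmax(A, axis=1) @ Q       # attended question vectors, one per sentence word
    C = softmax(A, axis=0).T @ S     # attended sentence vectors, one per question word
    return B, C

def fuse(x, y, W_r, W_g):
    """Heuristic fusion o = g * r + (1 - g) * x of Hu et al. (2018a), bias omitted."""
    features = np.concatenate([x, y, x * y, x - y], axis=-1)
    r = gelu(features @ W_r)
    g = 1.0 / (1.0 + np.exp(-(features @ W_g)))
    return g * r + (1.0 - g) * x

# Toy usage: a 3-word answer sentence and a 5-word question with d = 4.
rng = np.random.default_rng(0)
S, Q = rng.standard_normal((3, 4)), rng.standard_normal((5, 4))
W_r, W_g = rng.standard_normal((16, 4)), rng.standard_normal((16, 4))
B, C = align(S, Q)
print(fuse(S, B, W_r, W_g).shape)    # fused sentence representation, shape (3, 4)
```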
Intra-Sentence Modeling: Next we apply an intra-sentence modeling layer to capture self correlations inside each sentence. The input is first passed through another BiLSTM layer for encoding. We then use the same attention mechanism described above, only now between the sentence and itself, and we set a_ij = −inf if i = j to ensure that a word is not aligned with itself. Another fusion function is used to produce ŝ_i and q̂_j respectively.

Prediction: Before the final prediction, we apply a concatenated residual connection and model the sentences with a BiLSTM:

s̄_i = BiLSTM([s̃_i; ŝ_i]),   q̄_j = BiLSTM([q̃_j; q̂_j])

A mean-max pooling operation is then applied to summarize the representation of the two sentences. All summarized vectors are then concatenated and fed into a feed-forward classifier that consists of a projection sublayer with gelu activation and a softmax output sublayer, yielding the no-answer probability. As before, we optimize the negative log-likelihood objective function.

3.2.3 Model-III: Hybrid Architecture

To explore how the features extracted by Model-I and Model-II can be integrated to yield better representation capacities, we investigate the combination of the above two models, namely Model-III. We merge the output vectors of the two models into a single joint representation. A unified feed-forward classifier is then applied to output the no-answer probability. Such a design allows us to test whether the performance can benefit from the integration of two different architectures. In practice we use a simple concatenation to merge the two sources of information.

4 Experimental Setup

4.1 Dataset

We evaluate our approach on the SQuAD 2.0 dataset (Rajpurkar et al., 2018). SQuAD 2.0 is a new machine reading comprehension benchmark that aims to test whether models have truly understood the questions by knowing what they don't know. It combines answerable questions from the previous SQuAD 1.1 dataset (Rajpurkar et al., 2016) with 53,775 unanswerable questions about the same passages. Crowdsourcing workers craft these questions with a plausible answer in mind, and make sure that they are relevant to the corresponding passages.
4.2 Training and Inference

Our no-answer reader is trained on context passages, while the answer verifier is trained on oracle answer sentences. Model-I follows a procedure of unsupervised pre-training and supervised fine-tuning. That is, the model is first optimized with a language modeling objective on a large unlabeled text corpus to initialize its parameters. Then it adapts the parameters to the answer verifying task with our supervised objective. For Model-II, we directly train it with the supervised loss. Model-III, however, consists of two different architectures that require different training procedures. Therefore, we initialize Model-III with the pre-trained parameters from both Model-I and Model-II, and then fine-tune the whole model until convergence.

At test time, the reader first predicts a candidate answer as well as a passage-level no-answer probability. The answer verifier then validates the extracted answer along with its sentence and outputs a sentence-level probability. Following the official evaluation setting, a question is detected to be unanswerable once the joint no-answer probability, which is computed as the mean of the above two probabilities, exceeds some threshold. We tune this threshold to maximize the F1 score on the development set, and report both EM (Exact Match) and F1 metrics. We also evaluate the performance on no-answer detection with an accuracy metric (ACC), whose threshold is set to 0.5 by default.
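The inference-time decision rule described above is simple enough to state in a few lines. The sketch below averages the reader's passage-level and the verifier's sentence-level no-answer probabilities and compares the mean against a threshold tuned on the development set; the helper names and the dummy metric are ours, not part of the official evaluation code.

```python
def is_unanswerable(reader_na_prob, verifier_na_prob, threshold):
    """A question is rejected once the joint no-answer probability exceeds the threshold."""
    joint_na_prob = 0.5 * (reader_na_prob + verifier_na_prob)
    return joint_na_prob > threshold

def tune_threshold(dev_predictions, f1_metric, candidates=None):
    """Pick the threshold that maximizes dev F1, mirroring the setting in Section 4.2.

    dev_predictions: list of (reader_na_prob, verifier_na_prob, answer_text, gold) tuples.
    f1_metric: callable that scores a list of (prediction, gold) pairs.
    """
    candidates = candidates or [i / 100.0 for i in range(101)]
    best_threshold, best_score = 0.5, -1.0
    for t in candidates:
        preds = [("" if is_unanswerable(r, v, t) else answer, gold)
                 for r, v, answer, gold in dev_predictions]
        score = f1_metric(preds)
        if score > best_score:
            best_threshold, best_score = t, score
    return best_threshold

# Toy usage with a dummy metric: fraction of exact matches.
dummy_dev = [(0.9, 0.8, "Normandy", ""), (0.2, 0.3, "Normandy", "Normandy")]
exact_match = lambda pairs: sum(p == g for p, g in pairs) / len(pairs)
print(tune_threshold(dummy_dev, exact_match))
```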
4.3 Implementation

We use the Reinforced Mnemonic Reader (RMR) (Hu et al., 2018a), one of the state-of-the-art reading comprehension models on the SQuAD 1.1 dataset, as our base reader. The reader is configured with its default setting, and trained with the no-answer objective plus our auxiliary losses. ELMo (Embeddings from Language Models) (Peters et al., 2018) is listed separately in our experimental configurations. The hyper-parameter γ is set to 0.3, and λ is 1. As for the answer verifiers, we use the original configuration from Radford et al. (2018) for Model-I. For Model-II, the Adam optimizer (Kingma and Ba, 2014) with a learning rate of 0.0008 is used, the hidden size is set to 300, and a dropout (Srivastava et al., 2014) of 0.3 is applied to prevent overfitting. The batch size is 48 for the reader, 64 for Model-II, and 32 for Model-I and Model-III. We use the GloVe (Pennington et al., 2014) 100D embeddings for the reader, and 300D embeddings for Model-II and Model-III. We utilize the nltk tokenizer² to preprocess passages and questions, as well as to split sentences. The passages and the sentences are truncated to not exceed 300 words and 150 words respectively.

Model                    Dev EM   Dev F1   Test EM   Test F1
BNA¹                     59.8     62.6     59.2      62.1
DocQA²                   61.9     64.8     59.3      62.3
DocQA + ELMo             65.1     67.6     63.4      66.3
VS³-Net†                 -        -        68.4      71.3
SAN³                     -        -        68.6      71.4
FusionNet++ (ensemble)⁴  -        -        70.3      72.6
RMR + ELMo + Verifier    72.3     74.8     71.7      74.2
Human                    86.3     89.0     86.9      89.5

Table 1: Comparison of different approaches on the SQuAD 2.0 test set, extracted on Aug 23, 2018: Levy et al. (2017)¹, Clark and Gardner (2018)², Liu et al. (2018b)³ and Huang et al. (2018)⁴. † indicates unpublished works.

5 Evaluation

5.1 Main Results

We first submit our approach to the hidden test set of SQuAD 2.0 for evaluation³, which is shown in Table 1. We use Model-III as the default answer verifier, and only report the best result. As we can see, our system achieves an EM score of 71.7 and an F1 score of 74.2, outperforming all previous approaches at the time of submission.

5.2 Ablation Study

Next, we perform an ablation study on the SQuAD 2.0 development set to show the effects of our proposed methods for each individual component. Table 2 first shows the ablation results of different auxiliary losses on the reader. Removing the independent span loss (indep-I) results in a performance drop for all answerable questions (HasAns), indicating that this loss helps the model in better identifying the answer boundary. Ablating the independent no-answer loss (indep-II), on the other hand, causes little influence on HasAns, but leads to a severe decline in no-answer accuracy (NoAns ACC). This suggests that a conflict between answer extraction and no-answer detection indeed happens. Finally, deleting both losses causes a degradation of more than 1.5 points on the overall performance in terms of F1, with or without ELMo embeddings.

² https://www.nltk.org/
³ https://rajpurkar.github.io/SQuAD-explorer/
Configuration     HasAns EM   HasAns F1   All EM   All F1   NoAns ACC
RMR               72.6        81.6        66.9     69.1     73.1
 - indep-I        71.3        80.4        66.0     68.6     72.8
 - indep-II       72.4        81.4        64.0     66.1     69.8
 - both           71.9        80.9        65.2     67.5     71.4
RMR + ELMo        79.4        86.8        71.4     73.7     77.0
 - indep-I        78.9        86.5        71.2     73.5     76.7
 - indep-II       79.5        86.6        69.4     71.4     75.1
 - both           78.7        86.2        70.0     71.9     75.3

Table 2: Comparison of readers with different auxiliary losses.

Answer Verifier     NoAns ACC
Model-I             74.5
Model-II            74.6
Model-II + ELMo     75.3
Model-III           76.2
Model-III + ELMo    76.1

Table 3: Comparison of different architectures for the answer verifier.

Configuration         All EM   All F1   NoAns ACC
RMR                   66.9     69.1     73.1
 + Model-I            68.3     71.1     76.2
 + Model-II           68.1     70.8     75.6
 + Model-II + ELMo    68.2     70.9     75.9
 + Model-III          68.5     71.5     77.1
 + Model-III + ELMo   68.5     71.2     76.5
RMR + ELMo            71.4     73.7     77.0
 + Model-I            71.8     74.4     77.3
 + Model-II           71.8     74.2     78.1
 + Model-II + ELMo    72.0     74.3     78.2
 + Model-III          72.3     74.8     78.6
 + Model-III + ELMo   71.8     74.3     78.3

Table 4: Comparison of readers with different answer verifiers.

Table 3 details the results of various architectures for the answer verifier. Model-III outperforms all other competitors, achieving a no-answer accuracy of 76.2. This illustrates that the combination of two different architectures can bring further improvement. Adding ELMo embeddings, however, does not boost the performance. We hypothesize that the byte-pair encoding (Sennrich et al., 2016) from Model-I and the word/character embeddings from Model-II already provide enough representation capacity.

After doing separate ablations on each component, we then compare the performance of the whole system, as shown in Table 4. We notice that the combination of the base reader with any answer verifier always results in considerable performance gains, and combining the reader with Model-III obtains the best result. We conjecture that the improvement comes from the significant increase in no-answer accuracy: the metric rises by 4 points when using RMR as the base reader. A similar observation can be made when ELMo embeddings are used, demonstrating that the improvements are consistent and stable.

Finally, we plot the precision-recall curves of F1 score on the development set in Figure 3. We observe that RMR + ELMo + Verifier achieves the best precision when the recall is less than 80. After the recall exceeds 80, the precision of RMR + ELMo becomes slightly better. Ablating the two auxiliary losses, however, leads to an overall degradation of the curve, but it still outperforms the baseline by a large margin.

5.3 Error Analysis

To perform error analysis, we first categorize all examples on the development set into 5 classes:

• Case1: the question is answerable, the no-answer probability is less than the threshold, and the answer is correct.

• Case2: the question is unanswerable, and the no-answer probability is larger than the threshold.

• Case3: almost the same as Case1, except that the predicted answer is wrong.

• Case4: the question is unanswerable, but the no-answer probability is less than the threshold.

• Case5: the question is answerable, but the no-answer probability is larger than the threshold.

We then show the percentage of each category in Table 5. As we can see, the base reader trained with auxiliary losses is notably better at Case2 and Case4 compared to the baseline, implying that our proposed losses help the model mainly improve upon unanswerable cases. After adding the answer verifier, we observe that although the system's performance on unanswerable cases slightly decreases, the results on Case1 and Case5 have been improved. This demonstrates that the answer verifier does well on detecting answerable questions rather than unanswerable ones. However, there are still more than 20% of examples that are misclassified even with our best system. Therefore we argue that the main performance bottleneck lies in no-answer detection instead of answer extraction.
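For reference, the five-way categorization used in this error analysis can be written as a small helper; this is our own sketch of the bookkeeping, with boolean inputs mirroring the definitions above.

```python
def categorize(answerable, na_prob, threshold, answer_correct):
    """Map one development example to Case1-Case5 as defined in Section 5.3."""
    predicted_no_answer = na_prob > threshold
    if answerable and not predicted_no_answer:
        return "Case1" if answer_correct else "Case3"
    if answerable and predicted_no_answer:
        return "Case5"
    if not answerable and predicted_no_answer:
        return "Case2"
    return "Case4"   # unanswerable question, but the no-answer probability is too low

# Toy usage: an unanswerable question that the system wrongly tries to answer.
print(categorize(answerable=False, na_prob=0.3, threshold=0.5, answer_correct=False))
```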
Configuration            Case1 ✓   Case2 ✓   Case3 ✗   Case4 ✗   Case5 ✗
RMR - both               27.8      37.3      6.5       12.7      15.7
RMR                      27        39.9      5.9       10.2      17
RMR + Verifier           30.3      38.2      8.4       11.8      11.3
RMR + ELMo - both        31.5      38.3      5.6       11.8      12.8
RMR + ELMo               31.2      40.2      5.5       9.9       13.2
RMR + ELMo + Verifier    32.5      39.8      6.5       10.3      10.9

Table 5: Percentage of five categories. Correct predictions are denoted with ✓, while wrong cases are marked with ✗.

[Figure 3: Precision-Recall curves of F1 score on the development set, comparing DocQA + ELMo, RMR + ELMo - both, RMR + ELMo, and RMR + ELMo + Verifier.]

Phenomenon             All (%)   Error (%)
Negation               9         0
Antonym                20        8
Entity Swap            21        24
Mutual Exclusion       15        16
Impossible Condition   4         14
Other Neutral          24        32
Answerable             7         6

Table 6: Linguistic phenomena exhibited by all negative examples (statistics from Rajpurkar et al. (2018)) and sampled error cases of RMR + ELMo + Verifier.

Next, to understand the challenges our approach faces, we manually investigate 50 incorrectly predicted unanswerable examples (based on F1) that are randomly sampled from the development set. Following the types of negative examples defined by Rajpurkar et al. (2018), we categorize the sampled examples and show them in Table 6. As we can see, our system is good at recognizing negation and antonyms. The frequency of negation decreases from 9 to 0 and only 4 antonym examples are predicted wrongly. We think that this is because these two types are relatively easier to identify: both negation and antonym cases only require detecting a single word in the question, such as “never” or “not” for negation and “increase” versus “decrease” for antonyms. However, impossible condition and other neutral types account for 44% of the error set, indicating that our system performs less effectively on these more difficult phenomena.

6 Related Work

Reading Comprehension Datasets. Various large-scale reading comprehension datasets, such as the cloze-style test (Hermann et al., 2015), answer extraction benchmarks (Rajpurkar et al., 2016; Joshi et al., 2017) and answer generation benchmarks (Nguyen et al., 2016; Kočiskỳ et al., 2018), have been proposed. However, these datasets still guarantee that the given context must contain an answer. Recently, some works construct negative examples by retrieving passages for existing questions based on Lucene (Tan et al., 2018) and TF-IDF (Clark and Gardner, 2018), or by using crowdworkers to craft unanswerable questions (Rajpurkar et al., 2018). Compared to automatically retrieved negative examples, human-annotated examples are more difficult to detect for two reasons: (1) the questions are relevant to the passage and (2) the passage contains a plausible answer to the question. Therefore, we choose to work on the SQuAD 2.0 dataset in this paper.
Neural Networks for Reading Comprehension. Neural reading models typically leverage various attention mechanisms to build interdependent representations of the passage and the question, and sequentially predict the answer boundary (Seo et al., 2017; Hu et al., 2018a; Wang et al., 2017; Yu et al., 2018; Hu et al., 2018b). However, these approaches are not designed to handle no-answer cases. To address this problem, previous works (Levy et al., 2017; Clark and Gardner, 2018) predict a no-answer probability in addition to the distribution over answer spans, so as to jointly learn no-answer detection as well as answer extraction. Our no-answer reader extends existing approaches by introducing two auxiliary losses that enhance these two tasks independently.

Recognizing Textual Entailment. Recognizing textual entailment (RTE) (Dagan et al., 2010; Marelli et al., 2014), also known as natural language inference (NLI) (Bowman et al., 2015), requires systems to understand entailment, contradiction or semantic neutrality between two sentences. This task is strongly related to no-answer detection, where the machine needs to understand whether the passage and the question support the answer. To recognize entailment, various branches of work have been proposed, including encoding-based approaches (Bowman et al., 2016), interaction-based approaches (Parikh et al., 2016) and sequence-based approaches (Radford et al., 2018). In this paper we investigate the last two branches and further propose a hybrid architecture that combines both of them properly.

Answer Validation. The early answer validation task (Magnini et al., 2002) aims at ranking multiple candidate answers to return the most reliable one. Later, the answer validation exercise (Rodrigo et al., 2008) was proposed to decide whether an answer is correct or not according to a given supporting text and a question, but the dataset is too small for neural network-based approaches. Recently, Tan et al. (2018) proposed an answer-validation-based method for reading comprehension tasks, which validates the answer by comparing the question with the passage. Our answer verifier, on the contrary, denoises the passage by comparing questions with answer sentences, so as to focus on finding local entailment that supports the answer.

7 Conclusion

We proposed a read-then-verify system that is able to abstain from answering when a question has no answer given the passage. We first introduce two auxiliary losses to help the reader concentrate on answer extraction and no-answer detection respectively, and then utilize an answer verifier to validate the legitimacy of the predicted answer, for which three different architectures are investigated. Our system has achieved state-of-the-art results on the SQuAD 2.0 dataset, outperforming all previous approaches at the time of submission (Aug. 23rd, 2018). Looking forward, we plan to design new network structures for the answer verifying model to handle questions that require more complicated inferences.

Acknowledgments

We would like to thank Pranav Rajpurkar and Robin Jia for their help with SQuAD 2.0 submissions.

References

Samuel R. Bowman, Gabor Angeli, Christopher Potts, and Christopher D. Manning. 2015. A large annotated corpus for learning natural language inference. In Proceedings of EMNLP.

Samuel R. Bowman, Jon Gauthier, Abhinav Rastogi, Raghav Gupta, Christopher D. Manning, and Christopher Potts. 2016. A fast unified model for parsing and sentence understanding. arXiv preprint arXiv:1603.06021.

Christopher Clark and Matt Gardner. 2018. Simple and effective multi-paragraph reading comprehension. In Proceedings of ACL.

Ido Dagan, Bill Dolan, Bernardo Magnini, and Dan Roth. 2010. Recognizing textual entailment: rational, evaluation and approaches. Natural Language Engineering, 16(1):105–105.

Dan Hendrycks and Kevin Gimpel. 2016. Bridging nonlinearities and stochastic regularizers with gaussian error linear units. arXiv preprint arXiv:1606.08415.

Karl Moritz Hermann, Tomas Kocisky, Edward Grefenstette, Lasse Espeholt, Will Kay, Mustafa Suleyman, and Phil Blunsom. 2015. Teaching machines to read and comprehend. In Proceedings of NIPS.

Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation, 9(8):1735–1780.
Minghao Hu, Yuxing Peng, Zhen Huang, Xipeng Qiu, Furu Wei, and Ming Zhou. 2018a. Reinforced mnemonic reader for machine reading comprehension. In Proceedings of IJCAI.

Minghao Hu, Yuxing Peng, Furu Wei, Zhen Huang, Dongsheng Li, Nan Yang, and Ming Zhou. 2018b. Attention-guided answer distillation for machine reading comprehension. In Proceedings of EMNLP.

Hsin-Yuan Huang, Chenguang Zhu, Yelong Shen, and Weizhu Chen. 2018. Fusionnet: fusing via fully-aware attention with application to machine comprehension. In Proceedings of ICLR.

Mandar Joshi, Eunsol Choi, Daniel S. Weld, and Luke Zettlemoyer. 2017. Triviaqa: a large scale distantly supervised challenge dataset for reading comprehension. In Proceedings of ACL.

Diederik P. Kingma and Lei Jimmy Ba. 2014. Adam: a method for stochastic optimization. CoRR, abs/1412.6980.

Tomáš Kočiskỳ, Jonathan Schwarz, Phil Blunsom, Chris Dyer, Karl Moritz Hermann, Gábor Melis, and Edward Grefenstette. 2018. The narrativeqa reading comprehension challenge. Transactions of ACL, 6:317–328.

Omer Levy, Minjoon Seo, Eunsol Choi, and Luke Zettlemoyer. 2017. Zero-shot relation extraction via reading comprehension. arXiv preprint arXiv:1706.04115.

Peter J. Liu, Mohammad Saleh, Etienne Pot, Ben Goodrich, Ryan Sepassi, Lukasz Kaiser, and Noam Shazeer. 2018a. Generating wikipedia by summarizing long sequences. arXiv preprint arXiv:1801.10198.

Xiaodong Liu, Yelong Shen, Kevin Duh, and Jianfeng Gao. 2018b. Stochastic answer networks for machine reading comprehension. In Proceedings of ACL.

Bernardo Magnini, Matteo Negri, Roberto Prevete, and Hristo Tanev. 2002. Is it the right answer? exploiting web redundancy for answer validation. In Proceedings of ACL.

Marco Marelli, Stefano Menini, Marco Baroni, Luisa Bentivogli, Raffaella Bernardi, Roberto Zamparelli, et al. 2014. A sick cure for the evaluation of compositional distributional semantic models. In LREC, pages 216–223.

Tri Nguyen, Mir Rosenberg, Xia Song, Jianfeng Gao, Saurabh Tiwary, Rangan Majumder, and Li Deng. 2016. Ms marco: a human generated machine reading comprehension dataset. arXiv preprint arXiv:1611.09268.

Ankur P. Parikh, Oscar Täckström, Dipanjan Das, and Jakob Uszkoreit. 2016. A decomposable attention model for natural language inference. arXiv preprint arXiv:1606.01933.

Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. Glove: global vectors for word representation. In Proceedings of EMNLP.

Matthew E. Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. In Proceedings of NAACL.

Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. 2018. Improving language understanding by generative pre-training.

Pranav Rajpurkar, Robin Jia, and Percy Liang. 2018. Know what you don't know: unanswerable questions for squad. In Proceedings of ACL.

Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. 2016. Squad: 100,000+ questions for machine comprehension of text. In Proceedings of EMNLP.

Álvaro Rodrigo, Anselmo Peñas, and Felisa Verdejo. 2008. Overview of the answer validation exercise 2008. In Workshop of CLEF, pages 296–313. Springer.

Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016. Neural machine translation of rare words with subword units. In Proceedings of ACL.

Minjoon Seo, Aniruddha Kembhavi, Ali Farhadi, and Hannaneh Hajishirzi. 2017. Bidirectional attention flow for machine comprehension. In Proceedings of ICLR.

Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. 2014. Dropout: a simple way to prevent neural networks from overfitting. JMLR, pages 1929–1958.

Chuanqi Tan, Furu Wei, Qingyu Zhou, Nan Yang, Weifeng Lv, and Ming Zhou. 2018. I know there is no answer: modeling answer validation for machine reading comprehension. In Proceedings of NLPCC, pages 85–97. Springer.

Oriol Vinyals, Meire Fortunato, and Navdeep Jaitly. 2015. Pointer networks. In Proceedings of NIPS.

Wenhui Wang, Nan Yang, Furu Wei, Baobao Chang, and Ming Zhou. 2017. Gated self-matching networks for reading comprehension and question answering. In Proceedings of ACL.

Adams Wei Yu, David Dohan, Minh-Thang Luong, Rui Zhao, Kai Chen, Mohammad Norouzi, and Quoc V. Le. 2018. Qanet: combining local convolution with global self-attention for reading comprehension. In Proceedings of ICLR.
