Read + Verify: Machine Reading Comprehension with Unanswerable Questions

Minghao Hu†∗, Furu Wei§, Yuxing Peng†, Zhen Huang†, Nan Yang§, Ming Zhou§
†College of Computer, National University of Defense Technology
§Microsoft Research Asia
{huminghao09,pengyuxing,huangzhen}@nudt.edu.cn
{fuwei,nanya,mingzhou}@microsoft.com
Abstract: Machine reading comprehension with unanswerable questions aims to abstain from answering when no answer can be inferred. …

[Figure 1: Example passage (truncated): "The Normans were the people who in the 10th and 11th centuries gave their name to …"]
Figure 2: An overview of answer verifiers. (a) Input structures for running three different models. (b) Finetuned Transformer model proposed by Radford et al. (2018). (c) Our proposed interactive model.
where σ is the sigmoid activation function. Through this loss, we expect the model to produce a more confident prediction on the no-answer score z without considering the shared-normalization operation.

Finally, we combine the above losses as follows:

$$\mathcal{L} = \mathcal{L}_{\text{joint}} + \gamma \mathcal{L}_{\text{indep-I}} + \lambda \mathcal{L}_{\text{indep-II}}$$

where γ and λ are two hyper-parameters that control the weight of the two auxiliary losses.
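Written out, the combination is a plain weighted sum; a minimal PyTorch-style sketch (the function name is ours, and the default weights follow the γ = 0.3, λ = 1 setting reported in Section 4.3):

```python
import torch

def combined_loss(l_joint: torch.Tensor,
                  l_indep_i: torch.Tensor,
                  l_indep_ii: torch.Tensor,
                  gamma: float = 0.3,
                  lam: float = 1.0) -> torch.Tensor:
    # L = L_joint + gamma * L_indep-I + lambda * L_indep-II
    return l_joint + gamma * l_indep_i + lam * l_indep_ii
```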
3.2 Answer Verifier

After the answer is extracted, an answer verifier is used to compare the answer sentence with the question, so as to recognize the local textual entailment that supports the answer. Here, we define the answer sentence as the context sentence that contains either gold answers or plausible answers. We explore three different architectures for the verifying task (shown in Figure 2): (1) a sequential model that takes the inputs as one long sequence, (2) an interactive model that encodes the two sentences interdependently, and (3) a hybrid model that takes both approaches into account.

3.2.1 Model-I: Sequential Architecture

In Model-I, we convert the answer sentence and the question along with the extracted answer into an ordered input sequence. Then we adapt the recently proposed Finetuned Transformer model (Radford et al., 2018) to perform the task. The model is a multi-layer Transformer decoder (Liu et al., 2018a), which is first trained with a language modeling objective on a large unlabeled text corpus and then finetuned on the specific target task.

Specifically, given an answer sentence S, a question Q, and an extracted answer A, we concatenate the two sentences with the answer, adding a delimiter token in between, to get [S; Q; $; A]. The sequence, which can also be denoted as a series of tokens $X = \{x_i\}_{i=1}^{l_m}$, is then encoded by a multi-head self-attention operation followed by position-wise feed-forward layers as follows:

$$h_0 = W_e[X] + W_p$$
$$h_i = \mathrm{transformer\_block}(h_{i-1}), \quad \forall i \in [1, n]$$

where X denotes the sequence's token indexes in the vocabulary, $W_e$ is the token embedding matrix, $W_p$ is the position embedding matrix, and n is the number of layers.

The last token's activation $h_{l_m}^{n}$ is then fed into a linear projection layer followed by a softmax function to output the no-answer probability y:

$$p(y|X) = \mathrm{softmax}(h_{l_m}^{n} W_y)$$

A standard cross-entropy objective is used to minimize the negative log-likelihood:

$$\mathcal{L}(\theta) = -\sum_{(X, y)} \log p(y|X)$$
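To make the sequential verifier concrete, here is a minimal PyTorch sketch. It emulates the multi-layer Transformer decoder with nn.TransformerEncoder plus a causal mask; the class name, dimensions, and the assumption that the sequence is unpadded so the answer ends at the last position are ours, not the released code of Radford et al. (2018):

```python
import torch
import torch.nn as nn

class SequentialVerifier(nn.Module):
    """Model-I sketch: encode [S; Q; $; A] with masked self-attention and
    read the no-answer probability off the last token's activation."""

    def __init__(self, vocab_size: int, max_len: int,
                 d_model: int = 768, n_layers: int = 12, n_heads: int = 12):
        super().__init__()
        self.tok_embed = nn.Embedding(vocab_size, d_model)  # W_e
        self.pos_embed = nn.Embedding(max_len, d_model)     # W_p
        block = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.blocks = nn.TransformerEncoder(block, n_layers)
        self.w_y = nn.Linear(d_model, 2)                    # W_y: {answerable, no-answer}

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len) token ids of [S; Q; $; A], assumed unpadded
        pos = torch.arange(x.size(1), device=x.device)
        h = self.tok_embed(x) + self.pos_embed(pos)         # h_0 = W_e[X] + W_p
        causal = nn.Transformer.generate_square_subsequent_mask(x.size(1)).to(x.device)
        h = self.blocks(h, mask=causal)                     # h_i = transformer_block(h_{i-1})
        return torch.softmax(self.w_y(h[:, -1]), dim=-1)    # p(y|X) = softmax(h_{l_m}^n W_y)
```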
3.2.2 Model-II: Interactive Architecture

Since the answer verifier needs to model inferences between two sentences, we also consider an interaction-based approach that has the following layers:

Encoding: We embed words using GloVe embeddings (Pennington et al., 2014), and also embed the characters of each word with trainable vectors. We run a bidirectional LSTM (BiLSTM) (Hochreiter and Schmidhuber, 1997) to encode the characters and concatenate the two last hidden states to get character-level embeddings. In addition, we use a binary feature to indicate whether a word is part of the answer. All embeddings along with the feature are then concatenated and encoded by a weight-shared BiLSTM, yielding two series of contextual representations:

$$s_i = \mathrm{BiLSTM}([\mathrm{word}^s_i; \mathrm{char}^s_i; \mathrm{fea}^s_i]), \quad \forall i \in [1, l_s]$$
$$q_j = \mathrm{BiLSTM}([\mathrm{word}^q_j; \mathrm{char}^q_j; \mathrm{fea}^q_j]), \quad \forall j \in [1, l_q]$$

where $l_s$ is the length of the answer sentence, $l_q$ is the length of the question, and $[\cdot;\cdot]$ denotes concatenation.
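A sketch of this encoding layer in PyTorch; dimensions and names are illustrative, and weight sharing comes from applying the same module to both the answer sentence and the question:

```python
import torch
import torch.nn as nn

class EncodingLayer(nn.Module):
    """Sketch of Model-II's encoding: GloVe word vectors, a character-level
    BiLSTM, and a binary answer-position feature, merged by a shared BiLSTM."""

    def __init__(self, word_vectors: torch.Tensor, n_chars: int,
                 char_dim: int = 50, hidden: int = 300):
        super().__init__()
        self.word_embed = nn.Embedding.from_pretrained(word_vectors, freeze=True)
        self.char_embed = nn.Embedding(n_chars, char_dim)
        self.char_lstm = nn.LSTM(char_dim, char_dim, bidirectional=True, batch_first=True)
        word_dim = word_vectors.size(1)
        # +1 for the binary "is part of the answer" feature
        self.ctx_lstm = nn.LSTM(word_dim + 2 * char_dim + 1, hidden,
                                bidirectional=True, batch_first=True)

    def forward(self, words, chars, ans_feature):
        # words: (B, L); chars: (B, L, C); ans_feature: (B, L)
        B, L, C = chars.shape
        ch = self.char_embed(chars.reshape(B * L, C))
        _, (h_n, _) = self.char_lstm(ch)                          # h_n: (2, B*L, char_dim)
        ch = torch.cat([h_n[0], h_n[1]], dim=-1).reshape(B, L, -1)  # two last hidden states
        x = torch.cat([self.word_embed(words), ch,
                       ans_feature.unsqueeze(-1).float()], dim=-1)
        out, _ = self.ctx_lstm(x)                                 # contextual s_i / q_j
        return out
```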
Inference Modeling: An inference modeling layer is used to capture the interactions between the two sentences and produce two inference-aware sentence representations. We first compute the dot products of all tuples $\langle s_i, q_j \rangle$ as attention weights, and then normalize these weights so as to obtain attended vectors as follows:

$$a_{ij} = s_i^{\mathrm{T}} q_j, \quad \forall i \in [1, l_s],\ \forall j \in [1, l_q]$$
$$b_i = \sum_{j=1}^{l_q} \frac{e^{a_{ij}}}{\sum_{k=1}^{l_q} e^{a_{ik}}}\, q_j, \quad c_j = \sum_{i=1}^{l_s} \frac{e^{a_{ij}}}{\sum_{k=1}^{l_s} e^{a_{kj}}}\, s_i$$
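The two softmax normalizations can be written compactly with batched matrix products; a small sketch (the function name is ours):

```python
import torch
import torch.nn.functional as F

def align(s: torch.Tensor, q: torch.Tensor):
    """Inference-modeling attention sketch: s is (B, l_s, d), q is (B, l_q, d).
    Returns b (question aligned to each sentence word) and c (vice versa)."""
    a = torch.bmm(s, q.transpose(1, 2))        # a_ij = s_i^T q_j, shape (B, l_s, l_q)
    b = torch.bmm(F.softmax(a, dim=2), q)      # normalize over j -> attended question vectors
    c = torch.bmm(F.softmax(a, dim=1).transpose(1, 2), s)  # normalize over i -> attended sentence vectors
    # For the intra-sentence layer described below, first mask the diagonal
    # of `a` with -inf so that a word is not aligned with itself.
    return b, c
```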
We separately fuse the local inference information between the aligned pairs $\{(s_i, b_i)\}_{i=1}^{l_s}$ and $\{(q_j, c_j)\}_{j=1}^{l_q}$ using a weight-shared function F:

$$\tilde{s}_i = F(s_i, b_i), \quad \tilde{q}_j = F(q_j, c_j)$$

A heuristic fusion function $o = F(x, y)$ proposed by Hu et al. (2018a) is used:

$$r = \mathrm{gelu}(W_r[x; y; x \circ y; x - y])$$
$$g = \sigma(W_g[x; y; x \circ y; x - y])$$
$$o = g \circ r + (1 - g) \circ x$$

where gelu is the Gaussian Error Linear Unit (Hendrycks and Gimpel, 2016), $\circ$ is element-wise multiplication, and the bias term is omitted.
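This fusion function is straightforward to express in PyTorch; a minimal sketch, where a single instance is applied to both directions so that F stays weight-shared:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Fusion(nn.Module):
    """Sketch of the heuristic fusion o = F(x, y) of Hu et al. (2018a):
    a gelu-activated composition r and a sigmoid gate g over [x; y; x*y; x-y]."""

    def __init__(self, dim: int):
        super().__init__()
        self.w_r = nn.Linear(4 * dim, dim, bias=False)  # W_r (bias omitted, as in the paper)
        self.w_g = nn.Linear(4 * dim, dim, bias=False)  # W_g

    def forward(self, x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
        z = torch.cat([x, y, x * y, x - y], dim=-1)
        r = F.gelu(self.w_r(z))                # r = gelu(W_r z)
        g = torch.sigmoid(self.w_g(z))         # g = sigma(W_g z)
        return g * r + (1.0 - g) * x           # o = g o r + (1 - g) o x
```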
Intra-Sentence Modeling: Next, we apply an intra-sentence modeling layer to capture the self-correlations inside each sentence. The input is first passed through another BiLSTM layer for encoding. We then use the same attention mechanism described above, only now between the sentence and itself, and we set $a_{ij} = -\infty$ if $i = j$ to ensure that a word is not aligned with itself. Another fusion function is used to produce $\hat{s}_i$ and $\hat{q}_j$ respectively.

Prediction: Before the final prediction, we apply a concatenated residual connection and model the sentences with a BiLSTM:

$$\bar{s}_i = \mathrm{BiLSTM}([\tilde{s}_i; \hat{s}_i]), \quad \bar{q}_j = \mathrm{BiLSTM}([\tilde{q}_j; \hat{q}_j])$$

A mean-max pooling operation is then applied to summarize the representations of the two sentences. All summarized vectors are then concatenated and fed into a feed-forward classifier that consists of a projection sublayer with gelu activation and a softmax output sublayer, yielding the no-answer probability. As before, we optimize the negative log-likelihood objective.
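A sketch of the mean-max pooling summarization (the function name is ours); it is applied to both $\bar{s}$ and $\bar{q}$, and the resulting summaries are concatenated before the classifier:

```python
import torch

def mean_max_summarize(h: torch.Tensor) -> torch.Tensor:
    """h is (B, L, d); returns (B, 2d) by concatenating the mean- and
    max-pooled representations over the time dimension."""
    return torch.cat([h.mean(dim=1), h.max(dim=1).values], dim=-1)
```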
3.2.3 Model-III: Hybrid Architecture

To explore how the features extracted by Model-I and Model-II can be integrated to yield better representation capacities, we investigate the combination of the above two models, namely Model-III. We merge the output vectors of the two models into a single joint representation. A unified feed-forward classifier is then applied to output the no-answer probability. Such a design allows us to test whether the performance can benefit from the integration of two different architectures. In practice, we use a simple concatenation to merge the two sources of information.
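A sketch of this hybrid combination, assuming each sub-model exposes its pre-classifier feature vector; the class name, the projection width of 300, and the input dimensions are ours:

```python
import torch
import torch.nn as nn

class HybridVerifier(nn.Module):
    """Sketch of Model-III: concatenate the output vectors of the two
    verifiers and apply one unified feed-forward classifier."""

    def __init__(self, model_i: nn.Module, model_ii: nn.Module, d_i: int, d_ii: int):
        super().__init__()
        self.model_i, self.model_ii = model_i, model_ii
        self.classifier = nn.Sequential(
            nn.Linear(d_i + d_ii, 300), nn.GELU(), nn.Linear(300, 2))

    def forward(self, seq_inputs, pair_inputs):
        joint = torch.cat([self.model_i(seq_inputs),
                           self.model_ii(pair_inputs)], dim=-1)  # joint representation
        return torch.log_softmax(self.classifier(joint), dim=-1)  # no-answer log-probability
```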
4 Experimental Setup

4.1 Dataset

We evaluate our approach on the SQuAD 2.0 dataset (Rajpurkar et al., 2018). SQuAD 2.0 is a new machine reading comprehension benchmark that aims to test whether models have truly understood the questions by knowing what they don't know. It combines answerable questions from the previous SQuAD 1.1 dataset (Rajpurkar et al., 2016) with 53,775 unanswerable questions about the same passages. Crowdsourcing workers craft these questions with a plausible answer in mind, and make sure that they are relevant to the corresponding passages.
4.2 Training and Inference

Our no-answer reader is trained on context passages, while the answer verifier is trained on oracle answer sentences. Model-I follows a procedure of unsupervised pre-training and supervised fine-tuning. That is, the model is first optimized with a language modeling objective on a large unlabeled text corpus to initialize its parameters. Then it adapts the parameters to the answer-verifying task with our supervised objective. For Model-II, we directly train it with the supervised loss. Model-III, however, consists of two different architectures that require different training procedures. Therefore, we initialize Model-III with the pre-trained parameters from both Model-I and Model-II, and then fine-tune the whole model until convergence.

At test time, the reader first predicts a candidate answer as well as a passage-level no-answer probability. The answer verifier then validates the extracted answer along with its sentence and outputs a sentence-level probability. Following the official evaluation setting, a question is detected to be unanswerable once the joint no-answer probability, computed as the mean of the above two probabilities, exceeds some threshold. We tune this threshold to maximize the F1 score on the development set, and report both EM (Exact Match) and F1 metrics. We also evaluate the performance on no-answer detection with an accuracy metric (ACC), whose threshold is set as 0.5 by default.
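The decision rule itself is a one-liner; a sketch (names are ours), where the threshold is the value tuned on development F1, or 0.5 for the ACC metric:

```python
def is_unanswerable(p_read: float, p_verify: float, threshold: float) -> bool:
    """Joint no-answer probability is the mean of the reader's passage-level
    and the verifier's sentence-level probabilities; the question is
    rejected as unanswerable once it exceeds the threshold."""
    return 0.5 * (p_read + p_verify) > threshold
```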
4.3 Implementation

We use the Reinforced Mnemonic Reader (RMR) (Hu et al., 2018a), one of the state-of-the-art reading comprehension models on the SQuAD 1.1 dataset, as our base reader. The reader is configured with its default setting, and trained with the no-answer objective and our auxiliary losses. ELMo (Embeddings from Language Models) (Peters et al., 2018) is listed separately in our experimental configurations. The hyper-parameter γ is set as 0.3, and λ as 1. As for the answer verifiers, we use the original configuration from Radford et al. (2018) for Model-I. For Model-II, the Adam optimizer (Kingma and Ba, 2014) with a learning rate of 0.0008 is used, the hidden size is set as 300, and a dropout (Srivastava et al., 2014) of 0.3 is applied to prevent overfitting. The batch size is 48 for the reader, 64 for Model-II, and 32 for Model-I and Model-III. We use the GloVe (Pennington et al., 2014) 100D embeddings for the reader, and 300D embeddings for Model-II and Model-III. We utilize the nltk tokenizer (https://www.nltk.org/) to preprocess passages and questions, as well as to split sentences. The passages and the sentences are truncated to not exceed 300 words and 150 words respectively.
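For reference, the Model-II hyper-parameters above can be collected in one place; a sketch, where the dataclass and field names are ours:

```python
from dataclasses import dataclass

@dataclass
class VerifierConfig:
    """Model-II settings reported in Section 4.3."""
    learning_rate: float = 0.0008   # Adam
    hidden_size: int = 300
    dropout: float = 0.3
    batch_size: int = 64
    embedding_dim: int = 300        # GloVe 300D for Model-II / Model-III
    max_passage_words: int = 300
    max_sentence_words: int = 150
```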
5 Evaluation

5.1 Main Results

We first submit our approach for evaluation on the hidden test set of SQuAD 2.0 (https://rajpurkar.github.io/SQuAD-explorer/); the results are shown in Table 1. We use Model-III as the default answer verifier, and only report the best result. As we can see, our system achieves an EM score of 71.7 and an F1 score of 74.2, outperforming all previous approaches at the time of submission.

Model                   | Dev EM | Dev F1 | Test EM | Test F1
BNA¹                    | 59.8   | 62.6   | 59.2    | 62.1
DocQA²                  | 61.9   | 64.8   | 59.3    | 62.3
DocQA + ELMo            | 65.1   | 67.6   | 63.4    | 66.3
VS³-Net†                | -      | -      | 68.4    | 71.3
SAN³                    | -      | -      | 68.6    | 71.4
FusionNet++ (ensemble)⁴ | -      | -      | 70.3    | 72.6
RMR + ELMo + Verifier   | 72.3   | 74.8   | 71.7    | 74.2
Human                   | 86.3   | 89.0   | 86.9    | 89.5

Table 1: Comparison of different approaches on the SQuAD 2.0 test set, extracted on Aug 23, 2018: Levy et al. (2017)¹, Clark and Gardner (2018)², Liu et al. (2018b)³ and Huang et al. (2018)⁴. † indicates unpublished works.

5.2 Ablation Study

Next, we conduct an ablation study on the SQuAD 2.0 development set to show the effects of our proposed methods on each individual component. Table 2 first shows the ablation results of different auxiliary losses on the reader. Removing the independent span loss (indep-I) results in a performance drop on all answerable questions (HasAns), indicating that this loss helps the model better identify the answer boundary. Ablating the independent no-answer loss (indep-II), on the other hand, has little influence on HasAns, but leads to a severe decline in no-answer accuracy (NoAns ACC). This suggests that a conflict between answer extraction and no-answer detection indeed happens. Finally, deleting both losses causes a degradation of more than 1.5 points on the overall performance in terms of F1, with or without ELMo embeddings.
Configuration | HasAns EM | HasAns F1 | All EM | All F1 | NoAns ACC
RMR           | 72.6      | 81.6      | 66.9   | 69.1   | 73.1
- indep-I     | 71.3      | 80.4      | 66.0   | 68.6   | 72.8
- indep-II    | 72.4      | 81.4      | 64.0   | 66.1   | 69.8
- both        | 71.9      | 80.9      | 65.2   | 67.5   | 71.4
RMR + ELMo    | 79.4      | 86.8      | 71.4   | 73.7   | 77.0
- indep-I     | 78.9      | 86.5      | 71.2   | 73.5   | 76.7
- indep-II    | 79.5      | 86.6      | 69.4   | 71.4   | 75.1
- both        | 78.7      | 86.2      | 70.0   | 71.9   | 75.3

Table 2: Comparison of readers with different auxiliary losses.

Table 3 details the results of various architectures for the answer verifier. Model-III outperforms all other competitors, achieving a no-answer accuracy of 76.2. This illustrates that the combination of two different architectures can bring further improvement. Adding ELMo embeddings, however, does not boost the performance. We hypothesize that the byte-pair encoding (Sennrich et al., 2016) from Model-I and the word/character embeddings from Model-II have already provided enough representation capacity.

Answer Verifier  | NoAns ACC
Model-I          | 74.5
Model-II         | 74.6
Model-II + ELMo  | 75.3
Model-III        | 76.2
Model-III + ELMo | 76.1

Table 3: Comparison of different architectures for the answer verifier.

After doing separate ablations on each component, we then compare the performance of the whole system, as shown in Table 4. We notice that combining the base reader with any answer verifier always results in considerable performance gains, and combining the reader with Model-III obtains the best result. We conjecture that the improvement comes from the significant increase in no-answer accuracy: the metric rises by 4 points when using RMR as the base reader. A similar observation can be made when ELMo embeddings are used, demonstrating that the improvements are consistent and stable.

Configuration      | All EM | All F1 | NoAns ACC
RMR                | 66.9   | 69.1   | 73.1
+ Model-I          | 68.3   | 71.1   | 76.2
+ Model-II         | 68.1   | 70.8   | 75.6
+ Model-II + ELMo  | 68.2   | 70.9   | 75.9
+ Model-III        | 68.5   | 71.5   | 77.1
+ Model-III + ELMo | 68.5   | 71.2   | 76.5
RMR + ELMo         | 71.4   | 73.7   | 77.0
+ Model-I          | 71.8   | 74.4   | 77.3
+ Model-II         | 71.8   | 74.2   | 78.1
+ Model-II + ELMo  | 72.0   | 74.3   | 78.2
+ Model-III        | 72.3   | 74.8   | 78.6
+ Model-III + ELMo | 71.8   | 74.3   | 78.3

Table 4: Comparison of readers with different answer verifiers.

Finally, we plot the precision-recall curves of the F1 score on the development set in Figure 3. We observe that RMR + ELMo + Verifier achieves the best precision when the recall is less than 80. After the recall exceeds 80, the precision of RMR + ELMo becomes slightly better. Ablating the two auxiliary losses, however, leads to an overall degradation of the curve, but it still outperforms the baseline by a large margin.
5.3 Error Analysis

To perform error analysis, we first categorize all examples on the development set into 5 classes (see the sketch below):

• Case1: the question is answerable, the no-answer probability is less than the threshold, and the answer is correct.
• Case2: the question is unanswerable, and the no-answer probability is larger than the threshold.
• Case3: almost the same as Case1, except that the predicted answer is wrong.
• Case4: the question is unanswerable, but the no-answer probability is less than the threshold.
• Case5: the question is answerable, but the no-answer probability is larger than the threshold.

We then show the percentage of each category in Table 5. As we can see, the base reader trained with auxiliary losses is notably better at …
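A sketch of this five-way categorization as a function; the name, signature, and the use of >= for "larger than" are ours:

```python
def categorize(answerable: bool, p_noans: float, threshold: float,
               answer_correct: bool) -> int:
    """Map one prediction to its error-analysis case (1-5)."""
    if answerable:
        if p_noans >= threshold:
            return 5                           # Case5: wrongly abstained
        return 1 if answer_correct else 3      # Case1 / Case3
    return 2 if p_noans >= threshold else 4    # Case2 / Case4
```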
Configuration         | Case1 ✓ | Case2 ✓ | Case3 ✗ | Case4 ✗ | Case5 ✗
RMR - both            | 27.8    | 37.3    | 6.5     | 12.7    | 15.7
RMR                   | 27.0    | 39.9    | 5.9     | 10.2    | 17.0
RMR + Verifier        | 30.3    | 38.2    | 8.4     | 11.8    | 11.3
RMR + ELMo - both     | 31.5    | 38.3    | 5.6     | 11.8    | 12.8
RMR + ELMo            | 31.2    | 40.2    | 5.5     | 9.9     | 13.2
RMR + ELMo + Verifier | 32.5    | 39.8    | 6.5     | 10.3    | 10.9

Table 5: Percentage of five categories. Correct predictions are denoted with ✓, while wrong cases are marked with ✗.
[Figure 3: Precision-recall curves of F1 score on the development set. Curves: DocQA + ELMo, RMR + ELMo - both, RMR + ELMo, and RMR + ELMo + Verifier; x-axis: Recall, y-axis: Precision.]

Phenomenon           | All (%) | Error (%)
Negation             | 9       | 0
Antonym              | 20      | 8
Entity Swap          | 21      | 24
Mutual Exclusion     | 15      | 16
Impossible Condition | 4       | 14
Other Neutral        | 24      | 32
Answerable           | 7       | 6

Table 6: Linguistic phenomena exhibited by all negative examples (statistics from Rajpurkar et al. (2018)) and sampled error cases of RMR + ELMo + Verifier.