Eric Curea
Research Institute for Artificial Intelligence "Mihai Draganescu", Romanian Academy
eric@racai.ro
of terms combined with the large scale of the corpus, we ran our experiments on a smaller subset of MeSH terms, composed of only the 154 most frequent items.

In the classification process we took into account as much information as we could access about each document in the large-scale corpus. The title of a document usually holds key information about its content. The journal in which it was published is likely to carry weight in the label-assigning process, as only specific types of documents can be published in certain types of journals. The year in which the document was published tells the system whether the information retrieved from the document has a chance of not being up to date, or whether it might be completely outdated and superseded by more recent research, in which case the system should at least try to see whether newer publications hold better results or more important supplementary information. The abstract text is the place where the system can spend most of its processing time and apply as many tests, approximations and refinements as needed, because this is the place where most articles condense the largest amount of relevant information about the content of the document. Of course, finding possibly relevant information in the abstract text is only part of the equation. The more important part is determining relevant relations between the different relevant lexical tokens: the location of the information segments, the distance between the different relevant lexical tokens inside the abstract, the number of occurrences, and the similarity to the information determined in the question (W2V, cosine similarity (Steinbach et al., 2000)).

All the input features were treated in a bag-of-words manner, from which we removed any feature (word) with an occurrence rate lower than 100. This threshold of 100 was selected after testing different limits, which yielded either too few features left to test with or too low an occurrence rate for the feature to be relevant.
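The bag-of-words construction and frequency pruning described above can be sketched as follows; the whitespace tokenization and function names are illustrative assumptions, with `min_count=100` mirroring the threshold from the text:

```python
from collections import Counter

def build_vocabulary(documents, min_count=100):
    """Count how often each feature (word) occurs across the corpus
    and keep only the features at or above the occurrence threshold."""
    counts = Counter()
    for doc in documents:
        counts.update(doc.split())
    return {word for word, count in counts.items() if count >= min_count}

def to_bag_of_words(document, vocabulary):
    """Represent a document as a bag of words restricted to the pruned vocabulary."""
    return Counter(word for word in document.split() if word in vocabulary)
```

In this scheme the pruning is a single pass over the corpus, after which every document is reduced to counts over the surviving features.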
Initially, our training data contained 7,466,119 unique features, and the pruning process reduced this number to only 123,255.

For the classification task we employed an ensemble of linear classifiers. Each possible output MeSH label was associated with a classifier, trained in a 1-vs-all style to predict whether the system should assign that label, based on the input features. The output of the linear model ranged from -1 (do not assign the label) to 1 (assign the label) and was computed using Equation 1, with w updated using the delta rule (Equation 2):

    y = Σ_{i=1}^{n} w_i x_i        (1)

    Δw_k = η (t - y) x_k           (2)

where
    y    is the output of the classifier
    t    is the desired output of the classifier (-1 or 1)
    x_i  is the i-th input feature
    w_i  is the weight of the i-th input feature
    η    is the learning rate (set to 10^-3)

When we trained our ensemble of classifiers we divided our training data into 9/10 for training and 1/10 for development, while trying to preserve as closely as possible the initial distribution of each label in both sets. Training was done iteratively (compute a new value for w using the training set, then measure accuracy on the development set) and the stopping condition was no improvement on the development set for more than 20 iterations. At the end of the training process we kept the w that achieved the highest accuracy on the development set.

Table 3: Labeling results

    System       MiP     MiR     Acc.
    Sequencer    0.0920  0.0964  0.0494
    Default MTI  0.6148  0.6286  0.4594
    Our System   0.7681  0.1472  0.1381
    DeepMeSH4    0.6671  0.6289  0.4839
    MZ1          0.6495  0.3985  0.3299
    DeepMeSH3    0.6898  0.6170  0.4877
    DeepMeSH2    0.6895  0.6432  0.5059
    DeepMeSH1    0.7025  0.6282  0.5025
    DeepMeSH5    0.7198  0.6122  0.5024

Table 3 shows the accuracy (Acc), Micro Precision (MiP) and Micro Recall (MiR) of our system, measured on one of the datasets. It also offers a comparative view between our methodology and the other systems present in the competition. We must mention that the overall performance figures are measured using all the available MeSH labels, not the pruned subset.

4 Result ranking

For this we take each lexical component of the key set of data extracted from the question and try to find whether the classified documents from the corpus approximate possible synonyms of that lexical component. For each lexical component of the key set of data extracted from the question, we calculated a list of lexical elements that can be considered similar in meaning, using cosine similarity computed over distributed word representations (Mikolov et al., 2013). The vectors (100-dimensional) were computed using the word2vec tool (https://github.com/dav/word2vec, accessed 2017-04-05) on a specific subset of Wikipedia combined with additional raw text resources provided as part of the BioASQ challenge. In order to compile the subset from Wikipedia we followed a simple bootstrapping procedure:

1. We downloaded the latest Wikipedia XML dump available at that date from the official website, on which we ran a version of WikipediaExtractor (https://github.com/bwbaugh/wikipedia-extractor, accessed 2017-01-28) that was modified to preserve categories;

2. We seeded a list of categories, using the first level of categories on the Wikipedia site for the Biomedical main category;

3. We iterated 3 times through the entire corpus and consolidated our category list by adding categories that were associated with our initial category list, each time updating our seeded list;

4. We kept all documents that had at least one category from our final category list.

Given a question, our IR process is: (a) we extract a list of keywords from the query by removing function words, using a predefined dictionary; (b) we use the keywords to retrieve the top 1M documents from the initial corpus; (c) we re-rank our results and obtain a list with the top-10 most relevant documents.
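The per-label training of Equations 1 and 2 earlier is a perceptron-style delta-rule learner. A minimal sketch, assuming dense feature vectors and a fixed epoch count; the function names and data shapes are illustrative, and the development-set early stopping described above is omitted for brevity:

```python
def train_delta_rule(samples, n_features, lr=1e-3, epochs=50):
    """Train one 1-vs-all linear classifier with the delta rule:
    y = sum_i w_i * x_i (Eq. 1); w_k += lr * (t - y) * x_k (Eq. 2),
    where the target t is -1 (do not assign the label) or 1 (assign it)."""
    w = [0.0] * n_features
    for _ in range(epochs):
        for x, t in samples:
            y = sum(wi * xi for wi, xi in zip(w, x))   # Eq. 1
            for k in range(n_features):
                w[k] += lr * (t - y) * x[k]            # Eq. 2
    return w

def predict(w, x):
    """A positive score suggests assigning the MeSH label."""
    return sum(wi * xi for wi, xi in zip(w, x))
```

With the learning rate set to 10^-3 as in the text, repeated passes move each weight toward the target output for its label; one such model would be trained per MeSH label.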
Table 4: Test-set results

    System Name       Mean precision  Recall  F-measure  MAP     GMAP
    Top 100 Baseline  0.2460          0.2845  0.1333     0.1606  0.0028
    Top 50 Baseline   0.2470          0.2591  0.1920     0.1503  0.0024
    fdu 5b            0.1865          0.2228  0.1791     0.1300  0.0084
    Our System        0.4000          0.2222  0.2857     0.1238  0.1238
    MCTeamMM          0.2266          0.1481  0.1249     0.0892  0.0005
    MCTeamMM10        0.0326          0.1481  0.0436     0.0892  0.0005
    Wishart-S1        0.0465          0.0484  0.0350     0.0237  0.0001
Document ranking is based on a relevance score S_d computed over the query and document word embeddings, where
    S_d  is the relevance of document d
    k    is the number of keywords in the query
    m    is the number of words in the document
    t_i  is the word embedding for term i in the query
    d_j  is the word embedding for term j in the document
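One way to combine these quantities into a relevance score S_d is to match each query-term embedding t_i against its most similar document-term embedding d_j by cosine similarity and average over the k query terms. This aggregation is an assumption made for illustration, not necessarily the exact equation used by the system:

```python
import math

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def relevance(query_vecs, doc_vecs):
    """S_d: average over the k query-term embeddings t_i of the best
    cosine similarity against the m document-term embeddings d_j
    (illustrative aggregation, not the paper's verbatim formula)."""
    if not query_vecs or not doc_vecs:
        return 0.0
    k = len(query_vecs)
    return sum(max(cosine(t, d) for d in doc_vecs) for t in query_vecs) / k
```

Under this reading, re-ranking step (c) above would simply sort the 1M retrieved documents by `relevance` and keep the top 10.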
5 Snippets

Usually not all the text in the retrieved abstract is part of a good answer to a given question, so the next step was finding the most relevant, shortest part of the abstract.

To approximate the shortest span of text in each abstract that represents the best response to the question, we selected a list of all the lexical tokens in the abstract text that correspond to, or might have generated, the relevant label. At first glance, the snippet would start at the beginning of the first sentence that contains a token from the list and finish at the end of the last sentence that contains a token from the list. Of course, this list has a high probability of containing duplicates, and these duplicates have no value for detecting the shortest relevant text. So we calculate, from the current abstract, the shortest span of text that still contains all of the lexical tokens while ignoring any duplicates in the list.

To help explain the previous statement we will use the following example:

    "document": .... [token_1]....[token_1]...[token_2].....[token_1]....[token_3].....[token_4].....[token_5]....[token_1] ...

It can easily be seen in the example that the first occurrence of [token_1] holds no value for the purpose of finding the shortest relevant span of text, and neither does the second occurrence, even though it is positioned in closer proximity to another token from the list. The list is not ordered in any way, so the placement of the second token ([token_2]) in front of the first token ([token_1]) is irrelevant. The existence of a different token in front of the current token ([token_2] before [token_1]) only means that this occurrence of [token_1] is a viable candidate for the shortest relevant span of text. Finally, the last occurrence of [token_1] has no other tokens placed after it, so we considered this occurrence to hold less value for a snippet. No other token had a duplicate in this example, so in this case the shortest, most relevant span of text was:

    "snippet": [token_2].....[token_1]....[token_3].....[token_4].....[token_5]

It is worth noting that there were, of course, cases when the system would present the snippet as being the same as the entirety of the abstract text.

6 Conclusions and future work

In this article we presented a biomedically oriented system that automatically assigns MeSH labels to documents in a large-scale corpus. Our approach is based on a linear classifier, trained in a 1-vs-all style for each possible MeSH label.

The system then retrieves answers from said corpus for questions relevant to the medical field. Each question yields a number of n best-ranked documents that relate to the question. We achieve this by first selecting the relevant lexical tokens from the question. Then we use Word2Vec with 100-dimensional vectors in order to calculate cosine similarity and approximate the x closest lexical concepts for each of the tokens from the question. Our system also provides a corresponding list of n snippets from the best-ranked documents: the shortest spans of text that contain the information from the abstract most relevant to the current question. This is done by discarding any sentence from the abstract text that does not contain any token from a determined list, or that only contains low-relevance duplicates of tokens from said list.

Currently we do not deal with determining and extracting lexical dependencies between words and we focus only on relevant-document retrieval. However, our future development plans include extending our system to be able to answer yes/no, factoid and item-list questions. Additionally, we plan to include multilingual data from various sources and investigate cross-lingual techniques for document retrieval and machine translation for delivering the cross-lingual results in the user's native language.

References

L Douglas Baker and Andrew Kachites McCallum. 1998. Distributional clustering of words for text classification. In Proceedings of the 21st annual international ACM SIGIR conference on Research and development in information retrieval, pages 96-103. ACM.

Ronan Collobert and Jason Weston. 2008. A unified architecture for natural language processing: Deep neural networks with multitask learning. In Proceedings of the 25th international conference on Machine learning, pages 160-167. ACM.

Alaa M El-Halees. 2015. Arabic text classification using maximum entropy. IUG Journal of Natural Studies, 15(1).

Eui-Hong Sam Han, George Karypis, and Vipin Kumar. 2001. Text categorization using weight adjusted k-nearest neighbor classification. In Pacific-Asia conference on knowledge discovery and data mining, pages 53-65. Springer.

Thorsten Joachims. 1998. Text categorization with support vector machines: Learning with many relevant features. Machine Learning: ECML-98, pages 137-142.

Siwei Lai, Liheng Xu, Kang Liu, and Jun Zhao. 2015. Recurrent convolutional neural networks for text classification. In AAAI, volume 333, pages 2267-2273.

Bjornar Larsen and Chinatsu Aone. 1999. Fast and effective text mining using linear-time document clustering. In Proceedings of the fifth ACM SIGKDD international conference on Knowledge discovery and data mining, pages 16-22. ACM.

David D Lewis and Marc Ringuette. 1994. A comparison of two learning algorithms for text categorization. In Third annual symposium on document analysis and information retrieval, volume 33, pages 81-93.

David D Lewis, Robert E Schapire, James P Callan, and Ron Papka. 1996. Training algorithms for linear text classifiers. In Proceedings of the 19th annual international ACM SIGIR conference on Research and development in information retrieval, pages 298-306. ACM.

Larry M Manevitz and Malik Yousef. 2001. One-class SVMs for document classification. Journal of Machine Learning Research, 2(Dec):139-154.

A McCallum and K Nigam. 1998. A comparison of event models for naive Bayes text classification. Available at: citeseer.nj.nec.com/mccallum98comparison.html.

Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. 2013. Distributed representations of words and phrases and their compositionality. In Advances in neural information processing systems, pages 3111-3119.

Adwait Ratnaparkhi. 1998. Maximum entropy models for natural language ambiguity resolution. Ph.D. thesis, University of Pennsylvania.

FB Rogers. 1963. Medical subject headings. Bulletin of the Medical Library Association, 51:114-116.

Michael Steinbach, George Karypis, Vipin Kumar, et al. 2000. A comparison of document clustering techniques. In KDD workshop on text mining, volume 400, pages 525-526. Boston.

George Tsatsaronis, Michael Schroeder, Georgios Paliouras, Yannis Almirantis, Ion Androutsopoulos, Eric Gaussier, Patrick Gallinari, Thierry Artieres, Michael R Alvers, Matthias Zschunke, et al. 2012. BioASQ: A challenge on large-scale biomedical semantic indexing and question answering. In AAAI fall symposium: Information retrieval and knowledge discovery in biomedical text.

Min-Ling Zhang and Zhi-Hua Zhou. 2006. Multilabel neural networks with applications to functional genomics and text categorization. IEEE Transactions on Knowledge and Data Engineering, 18(10):1338-1351.