A Thesis Presented
By
Ta Hoang Thang

Title:
Author: Ta Hoang Thang
Program:
Advisor:
Academic Year: 2015
Acknowledgments
I would like to acknowledge and express my gratitude to the following for the completion of this study:

Associate Professor Dr. Chutiporn Anutariya and Dr. Aekavute Sujarae, for their guidance and valuable comments. Their encouragement and insights have led me to the completion of my study, and I have gained a great deal of knowledge and precious experience from their kind support. I believe that the completion of this study will be a great motivation for widening both my research ability and my professional career in the future;

The thesis committee, for their patience and insightful comments, which enriched and refined the focus of my study;

The librarians, for providing materials while I was doing my research;

All my classmates, for being there for me whenever I needed help;

All my friends, for their encouragement and moral support;

All my family members, for having confidence in me, for their encouragement, for loving me as I am and for their support throughout my study.
Ta Hoang Thang
Abstract

Title:
Author: Ta Hoang Thang
Program:
Academic Year: 2015

Keywords: Multilingual Wikis, Wikidata
Table of Contents

Acknowledgments
Abstract
Table of Contents
List of Figures
List of Tables
Chapter 1 Introduction
  1.1 Background
  1.2 Problem Statement
  1.3 Objectives
  1.4 Scope and Research Objects
  1.5 Thesis Organization
Chapter 2 Literature Review & Foundation
  2.1 Wikipedia
  2.2 Wikipedia Architecture
  2.3 DBPedia
  2.4 SPARQL
  2.5 Interwiki Links at Wikipedia and Wikidata
  2.6 Wikipedia Categories
  2.7 Wikipedia Infoboxes
  2.8 Multilingual Approaches
  2.9 Summary
Chapter 3 Proposed Model for Multilingual Wikis
  3.1 Introduction
  3.2 General Model
    Align Infobox Parameters with Wikidata Properties
    Detect, Connect Missing Interwiki Links and Synthesize Semantic Relations
Chapter 4 Experiments and Obtained Results
Chapter 5 Conclusions and Recommendations
  5.1 Conclusions
References
Appendices
  Appendix A Converter 1.1.6
  Appendix B
  Appendix C AutoWikiBrowser
Biography
List of Figures

Figure 2.1
Figure 2.2
Figure 2.3
Figure 2.4
Figure 2.5
Figure 2.6
Figure 2.7
Figure 2.8
Figure 2.9
Figure 2.10
Figure 2.11
Figure 2.12
Figure 3.1
Figure 3.2
Figure 4.1
Figure 4.2
List of Tables

Table 2.1 MediaWiki General Architecture
Table 2.2 Wikipedia Namespaces
Table 2.3
Table 2.4 The Structure of Mathematics Article (Q395)
Table 2.5 Some NLP Patterns which can Describe Category Names
Table 3.1
Table 3.2
Table 3.3
Table 3.4
Table 3.5
Table 3.6
Table 3.7
Table 3.8
Table 3.9
Table 3.10
Table 3.11
Table 4.1
Table 4.2
Table 4.3
Chapter 1
Introduction
1.1 Background
Wikipedia is an encyclopedia that allows the public to develop content voluntarily in numerous languages (Anderson, 2011, pp. 10-11; O'Sullivan, 2012, p. 85; Bieberstein, 2008). Its content differs across editions because of differences in language structure and editor contributions. Wikipedia must face several difficulties, such as content management, anti-vandalism (Kittur, Suh, Pendleton & Chi, 2007), verification of numeric data values and content synchronization between its projects. Many researchers have retrieved semantic relations from Wikipedia content in order to widen semantic databases and to improve Wikipedia based on the discovered outcomes. Among the projects that extract Wikipedia data, DBPedia is one of the most deeply examined by the research community (Hellmann et al., 2014; Gurevych, Kim & Calzolari, 2013). DBPedia also includes the relationships among entities (for example, articles, categories and templates), which are linked into a large multilingual knowledge base. Other researchers concentrate on specific languages to retrieve common semantic relations and then add the missing content to a language as needed (Sorg & Cimiano, 2008). Although many Wikipedia entities contain interwiki links, there are still many entities, and relationships between them, that remain unlinked to other language projects and need to be researched. From this perspective, research on multilingual wikis also opens enormous potential for future work.
1.2 Problem Statement

When editors create new entities in Wikipedia, mostly articles, categories and templates, they need to arrange the content by following a defined format. [1] This format helps not only readers to find the needed information easily, but also the management staff and bots to manage the content effectively. Because of the differences in language structures and editor communities, each Wikipedia language gradually diverges in its data compared with the others. We call this situation the heterogeneity.

[1] https://en.wikipedia.org/wiki/Help:Wiki_markup

In some Wikipedias with high collaborative quality, such as English Wikipedia, information is plentiful and the content structure is organized logically. But in other Wikipedias, especially those that lack contributors, content is still poor and limited. To contribute information to these Wikipedias, we cannot depend only on the local editors. According to Wikipedia statistics, the number of editors with more than 5 edits and with more than 100 edits has slightly decreased in recent years. [2] Therefore, the pivotal point is that we need a model that can retrieve data in a semantic aspect from several high-quality Wikipedias and then contribute the needed entity data to the other ones semi-automatically. In this way, we can improve and enrich the content of many languages without much human effort.
We can utilize DBPedia to form a new model because it contains many semantic datasets extracted from various Wikipedia languages. However, DBPedia depends on entities that have interwiki links, mainly links to the English entities. This leads to one of the drawbacks of DBPedia: the relations of non-interlinked entities may be described incompletely and imprecisely. With its low update frequency (Kittur, Suh, Pendleton & Chi, 2007) and its restrictions on broad contributions from the public community, in some cases DBPedia cannot offer enough semantic relations to enrich content effectively for all languages. Another DBPedia project, Live Extraction, can solve the problem of low update frequency, but it only supports English content and depends primarily on update threads. On the other hand, the Wikidata project allows editors to contribute semantic relations to its data openly (Vrandečić & Krötzsch, 2014). From this point, we can also use Wikidata to enrich Wikipedia content. However, this project is still developing its semantic data and may not yet have enough semantic rules to help much in content enrichment.

In this thesis, we propose a new model which solves some of the disadvantages of DBPedia by using the infobox and Wikidata property alignment (Ta & Anutariya, 2014). This alignment produces aligned structures of infoboxes whose content can be updated openly on demand and can be reused by later researchers. Thus, the semantic relations stay up to date whenever we update the aligned structures. In addition, these structures also help to boost the unification of infobox properties in Wikidata. We can consider this model an intermediate step that Wikidata needs in matching and unifying the Wikipedia infoboxes of all languages. For articles that lack interwiki links, we can connect these links through semantic comparisons based on the aligned structures. Generally, we can enrich the article content and synchronize the common understanding between languages.

[2] http://stats.wikimedia.org/EN/TablesWikipediaZZ.htm
1.3 Objectives

The objectives of this thesis are to create a general model for extracting semantic relations from different entities based on the datasets of multilingual Wikipedias; to contribute the gathered results to some Wikipedia languages, in a way that suits smaller-scale Wikipedias; and to enrich some basic data types of articles for different language editions.
1.4 Scope and Research Objects

Previous researchers focused on how to extract semantic relations from infobox properties in some languages (Nguyen, Moreira, Nguyen, Nguyen & Freire, 2011; Tacchini, Schultz & Bizer, 2009; Rinser, Lange & Naumann, 2013; Adar, Skinner & Weld, 2009). They defined methods to match the infobox properties of various languages that have interwiki links. They used infobox extraction algorithms to detect infobox structures, form ontologies (Auer & Lehmann, 2007) and store the outcomes in external databases, such as DBPedia, YAGO, etc. The comparison of these ontologies in different languages was the main key to enriching Wikipedia language editions.

This thesis takes a different approach. Our purpose is to form a model that can enrich all languages. We reuse a general model, from our research paper, to enrich Wikipedia content. This model extracts semantic relations from the infobox and Wikidata property alignment (Ta & Anutariya, 2014). Wikidata is used as a central server to align infobox properties and translate terms between languages. For each infobox property, we try to match it with a Wikidata property, which includes labels in many languages. We will store the alignment results in Wikipedia templates once we gain the agreement of the language communities, because we do not own the Wikipedia projects. Next, we identify the correlation of properties in different languages, then enrich the missing properties for Wikipedia's language editions. The more infobox properties of languages we can align with Wikidata, the more data we can enrich in the content of these languages. We also enrich other datasets such as external links, images, geo-coordinates, categories, bottom templates, etc.

We concentrate mainly on the Wikipedias of several Latin-based languages, in particular Vietnamese Wikipedia and English Wikipedia. The accuracy of article contents and errors made by Wikipedia editors are beyond the scope of this thesis.
1.5 Thesis Organization

Besides this chapter, this thesis includes four other chapters:

Chapter 2, Literature Review & Foundation: reviews related works, which clarify the development pace of current research, and lists several related technologies and areas of knowledge.

Chapter 3, Proposed Model for Multilingual Wikis: describes in detail the general model that is used to enrich Wikipedia contents.

Chapter 4, Experiments and Obtained Results: points out how to execute the model of Chapter 3 in the implementation processes.

Chapter 5, Conclusions and Recommendations: concludes the thesis and discusses future work.
Chapter 2
Literature Review
2.1 Wikipedia
Wikipedia's founders, Jimmy Wales and Larry Sanger, first launched the website on January 15, 2001 (Anderson, 2011, p. 42). As a free encyclopedia, Wikipedia allows everyone to access and edit its article content. Wikipedia is now one of the most popular websites and the largest reference work.

English was the only initial language of Wikipedia. Wikipedia then opened other languages and gradually became a multilingual site. Currently, there are 287 language editions, all established on the same technical framework, but with different content and editing practices. English Wikipedia is the biggest project, with over 4.68 million articles and a depth (collaborative quality) of 887. Wikipedia gained 18 billion page views and approximately 500 million unique visitors each month as of February 2014. As of May 2014, Wikipedia had 22 million accounts, with over 73,000 active editors globally. There are many sites that extract semantic relations from Wikipedia data, such as YAGO (Suchanek, Kasneci & Weikum, 2007), FreeBase, DBPedia and Cycorp.
2.2 Wikipedia Architecture

2.2.1 MediaWiki general architecture. Wikipedia's architecture is based on MediaWiki, an open-source wiki engine written in PHP. MediaWiki has been developed by the Wikimedia Foundation and MediaWiki volunteers, and it is used by the Wikimedia Foundation and other websites. The latest version is MediaWiki 1.25 alpha. [4] In general, the MediaWiki architecture contains 4 layers, called the User layer, Network layer, Logic layer and Data layer. In the Network layer, Squid is a high-performance proxy server that performs caching.

[3] https://meta.wikimedia.org/wiki/List_of_Wikipedias

Table 2.1
MediaWiki General Architecture. [5]

User layer:    web browser
Network layer: Squid
Logic layer:   Apache webserver, MediaWiki's PHP scripts, PHP
Data layer:    File system, MySQL Database (program and content)

2.2.2 Wikipedia content organization. MediaWiki organizes its content through Templates, Namespaces, Pages/Articles and Access levels.
2.2.2.1 Namespace. MediaWiki defines namespaces for each wiki to manage content systematically. Namespaces are prefixes before an article name. For example, the link http://localhost/mediawiki-1.22.2/index.php/User:Thang will be redirected to http://localhost/mediawiki-1.22.2/index.php/Thành_viên:Thang in the Vietnamese version.

[5] https://www.mediawiki.org/wiki/Manual:MediaWiki_architecture

The interwiki link prefixes do not determine namespaces; instead, they link to pages in other MediaWiki language projects: for example, en: for the English version, fr: for the French version, th: for the Thai version and vi: for the Vietnamese version. The link en:Mathematics connects to the Mathematics article in English Wikipedia. MediaWiki uses _ (underscore) between two words to identify a blank space in URLs, so the namespace User_talk: is the same as User talk:.
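The underscore convention can be sketched in a few lines of Python (a hypothetical helper for illustration, not part of MediaWiki itself):

```python
def title_to_url(title):
    """Convert a page title to its URL form: blanks become underscores."""
    return title.strip().replace(" ", "_")

def url_to_title(url_part):
    """Convert the URL form of a title back to its display form."""
    return url_part.replace("_", " ")
```

So, for example, title_to_url("User talk:Thang") yields "User_talk:Thang", and both forms name the same page.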
Table 2.2 shows all the namespaces and their identity numbers (IDs) as defined in Wikipedia. Wikipedia contains two virtual namespaces, Special (-1) and Media (-2).
Table 2.2
Wikipedia Namespaces [6][7]

Subject namespaces              Talk namespaces
  0  (Main/Articles)              1  Talk
  2  User                         3  User talk
  4  Wikipedia                    5  Wikipedia talk
  6  File                         7  File talk
  8  MediaWiki                    9  MediaWiki talk
 10  Template                    11  Template talk
 12  Help                        13  Help talk
 14  Category                    15  Category talk
100  Portal                     101  Portal talk
108  Book                       109  Book talk
118  Draft                      119  Draft talk
446  Education Program          447  Education Program talk
710  TimedText                  711  TimedText talk
828  Module                     829  Module talk

Virtual namespaces
 -1  Special
 -2  Media

[6] https://en.wikipedia.org/wiki/Wikipedia:Namespace
[7] https://en.wikipedia.org/wiki/Template:Namespaces
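The pairing in Table 2.2 follows a simple rule: every non-virtual subject namespace with ID N has a talk namespace with ID N + 1. A small Python sketch (the dictionary covers only the first rows of the table):

```python
SUBJECT_NAMESPACES = {
    0: "(Main/Articles)", 2: "User", 4: "Wikipedia",
    6: "File", 8: "MediaWiki", 10: "Template",
    12: "Help", 14: "Category",
}

def talk_namespace_id(subject_id):
    """Talk namespace ID for a subject namespace (subject ID + 1)."""
    if subject_id < 0:
        # Virtual namespaces (Special -1, Media -2) have no talk pages.
        raise ValueError("virtual namespaces have no talk namespace")
    return subject_id + 1

def talk_namespace_name(subject_id):
    """Talk namespace name; the main namespace's talk page is just 'Talk'."""
    name = SUBJECT_NAMESPACES[subject_id]
    return "Talk" if subject_id == 0 else name + " talk"
```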
2.2.2.2 Page/Article. Articles or pages are the most important content of MediaWiki; they include templates and are classified into various categories. An article has a history log that shows user contributions in chronological order. It also has restriction levels that determine, by group rights, which groups can interact with the article. To format articles, Wikipedia defines Wiki markup (wikitext or wikicode), which includes syntax and keywords. Wiki markup also supports some HTML elements. The main components of the article on Graphium stratiotes, a butterfly species, are shown in Figure 2.2.
'''''Graphium stratiotes''''' is a butterfly found in ... that belongs to the [[Swallowtail butterfly|Swallowtail]] family.

==Subspecies==
* G. s. stratiotes
* G. s. sukirmani

==References==
*Collins, N.M., Morris, M.G., 1985. ''Threatened Swallowtail Butterflies of the World: The IUCN Red Data Book''. IUCN, 1985. [http://ia600501.us.archive.org/4/items/threatenedswallo85coll/threatenedswallo85coll.pdf pdf]

{{Papilionidae-stub}}
[[Category:Graphium (butterfly)]]
[[Category:Animals described in 1887]]
In Figure 2.2, the article name is Graphium stratiotes, in bold and italic text. The main content is part 4. This article uses two templates, Taxobox ({{Taxobox ... }}) and Papilionidae-stub ({{Papilionidae-stub}}) (parts 2 and 3). The two categories of this article are Graphium (butterfly) ([[Category:Graphium (butterfly)]]) and Animals described in 1887 ([[Category:Animals described in 1887]]).
2.2.2.3 Category. Set categories are named after a class (usually in the plural in English Wikipedia). [8] For example, Category:Cities in Thailand contains articles whose subjects are cities in Thailand.

[8] https://en.wikipedia.org/wiki/Wikipedia:Categorization#Administration_category

In Figure 2.3, the Ancient Roman scientists category includes a child category (Ancient Roman astronomers) and an article (Lucilius Junior). It has four parent categories: Roman science, Ancient scientists, Scientists by nationality and Ancient Romans by occupation.
{{Infobox University
| name        = Shinawatra University
| image_name  = SIU_logo.jpg
| established = 1999
| Founder     = Dr. Thaksin Shinawatra
| President   = Prof. Dr. Voradej Chandarasorn
| city        = [[Bangkok]]
| country     = [[Thailand]]
| campus      = [[Pathumthani]]
| website     = [http://www.siu.ac.th Shinawatra University]
}}

[9] https://en.wikipedia.org/wiki/Wikipedia:User_access_levels
… articles may contain inaccuracies such as misspellings, ideological biases and inappropriate text.
2.3 DBPedia

DBPedia is a project that extracts structured information from Wikipedia and makes the data available to everyone, especially the research community. To extract different types of Wikipedia content, DBPedia uses 19 extractors, covering labels, abstracts, interlanguage links, images, redirects, disambiguations, etc. (Auer et al., 2007). DBPedia organizes its data into many datasets that mainly incorporate RDF triples. Currently, the latest version of DBPedia is 3.9. [10] According to DBPedia, its English edition can identify 4.0 million things, and DBPedia has data for 119 languages with 24.9 million defined things.

The DBPedia knowledge base and its datasets can support computational linguistics tasks (Mendes, Jakob & Bizer, 2012; Cabrio, Cojan, Gandon & Hallili, 2013). DBPedia datasets are helpful for researching the semantic relations of Wikipedia. DBPedia is also a multilingual dataset over which researchers can develop question answering (Hahn et al., 2010). DBPedia datasets can serve as format standards that we can refer to when developing our own datasets. These datasets also help us understand the relationships of data types in Wikipedia and how to organize dataset structures.

However, two essential limitations of DBPedia are its obsolete datasets, which are not updated frequently (Morsey & Lehmann, 2011), and its lack of effective support for non-English languages (de Melo & Weikum, 2010). Furthermore, DBpedia Live can extract Wikipedia data in near real time, but it only supports the English edition.

[10] http://wiki.dbpedia.org/Changelog
2.4 SPARQL

[11] http://www.w3.org/TR/rdf-sparql-query/

… formats such as XML, JSON, CSV, TSV and RDF (Prud'hommeaux & Seaborne, 2013).
This is an example of how to use a SELECT query in SPARQL:

Data

@prefix foaf: <http://xmlns.com/foaf/0.1/> .
_:a  foaf:name   "Thang" .
_:a  foaf:email  <mailto:thang@gmail.com> .
_:b  foaf:name   "Alex M" .
_:b  foaf:email  <mailto:alex@yahoo.com> .
_:c  foaf:email  <mailto:carol@love.net> .

SELECT query

PREFIX foaf: <http://xmlns.com/foaf/0.1/>
SELECT ?name ?email
WHERE {
  ?x foaf:name  ?name .
  ?x foaf:email ?email .
}

Query Result

Name       Email
"Thang"    <mailto:thang@gmail.com>
"Alex M"   <mailto:alex@yahoo.com>

First, PREFIX binds the prefix foaf: to the vocabulary http://xmlns.com/foaf/0.1/ to make querying easier. The query gets all people who have both a name and an email.
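Conceptually, the SELECT query performs a join on ?x: only subjects bound by both triple patterns appear in the result. The pure-Python sketch below mimics that join on the example data; it illustrates the query semantics and is not a real SPARQL engine:

```python
# Toy triple store: each entry is a (subject, predicate, object) triple.
TRIPLES = [
    ("_:a", "foaf:name", "Thang"),
    ("_:a", "foaf:email", "mailto:thang@gmail.com"),
    ("_:b", "foaf:name", "Alex M"),
    ("_:b", "foaf:email", "mailto:alex@yahoo.com"),
    ("_:c", "foaf:email", "mailto:carol@love.net"),
]

def select_name_email(triples):
    """Return (name, email) pairs for subjects having BOTH properties."""
    names = {s: o for s, p, o in triples if p == "foaf:name"}
    emails = {s: o for s, p, o in triples if p == "foaf:email"}
    # Keep only subjects bound in both patterns: the implicit join on ?x.
    return [(names[s], emails[s]) for s in names if s in emails]
```

As in the query result above, _:c is excluded because it has an email but no name.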
2.5 Interwiki Links at Wikipedia and Wikidata

An article can have many different language editions. To link these editions together, Wikipedia allowed editors to use language prefixes such as [[en: (English Wikipedia) and [[fr: (French Wikipedia), as mentioned in Section 2.2.2.1 of this chapter. However, this method is obsolete because of its complexity. If an article appears in 4 language editions, the article content of each language needs 3 language prefixes to link the editions together; in total, we need 12 language prefixes (language links). In addition, the editors of each language edition must constantly maintain these links. To simplify the links and counteract vandalism, Wikipedia deployed the Wikidata project in 2012, which stores multilingual structured data on its own server (Erxleben, Günther, Krötzsch, Mendez & Vrandečić, 2014). [12][13] Wikidata serves queries for supplying common views of the data, scaling down the maintenance tasks of Wikipedia immensely (Nguyen, 2013, p. 60).
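The saving described above is easy to quantify: with pairwise prefixes, n editions need n * (n - 1) language links in total, while the central-hub scheme needs only one sitelink per edition. A quick sketch:

```python
def pairwise_links(n_editions):
    """Old scheme: each edition stores a prefix link to every other edition."""
    return n_editions * (n_editions - 1)

def hub_links(n_editions):
    """Wikidata scheme: each edition is registered once at the central item."""
    return n_editions
```

For the example in the text, pairwise_links(4) gives 12, while hub_links(4) gives 4; the gap grows quadratically as more editions are added.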
When a user creates a new article (or template, or category) that has interwiki links, the user must manually add the new one to the common database at Wikidata. In Figure 2.10, Risk Management is an article in the English edition. When a user translates it into Vietnamese under the name Quản lý rủi ro dự án, he/she must add it to Wikidata by specifying the language edition and the article title in the form shown in Figure 2.10.

[12] https://www.wikidata.org/w/index.php?title=Wikidata:Introduction&oldid=42879390
[13] https://meta.wikimedia.org/wiki/Wikidata/Technical_proposal
Table 2.3 shows the interwiki links of the Mathematics article as stored at Wikidata:

Language     Code    Article name
Deutsch      Dewiki  Mathematik
English      Enwiki  Mathematics
Français     Frwiki  Mathématiques
ไทย           Thwiki  คณิตศาสตร์
Tiếng Việt   Viwiki  Toán học

Besides the interwiki links, each entity includes its abstract, statements, alternative names and properties (Erxleben, Günther, Krötzsch, Mendez & Vrandečić, 2014). Table 2.4 shows these elements of the mathematics article.
Table 2.4
The Structure of Mathematics Article (Q395).

mathematics (Q395)
Abstract: study of numbers, quantity, structure, relationships, etc.
Alternative names: math, maths

In other languages:
  Tiếng Việt: toán học
  français:   Mathématiques, science des nombres
  Nederlands: Wiskunde
  Deutsch:    Mathematik

Statements:
  instance of:      formal science
  part of:          branch of science
  commons category: Mathematics
[14] https://www.mediawiki.org/wiki/Content_translation
[15] https://en.wikipedia.org/wiki/Wikipedia:Category_names
In earlier research, some NLP patterns extracted from the category taxonomy are member_of, directed_by, located_in, attribute_of and R (Nastase & Strube, 2008). The outputs were evaluated against ResearchCyc and a subset of human judges, showing that many copious patterns can be induced from the category taxonomy. These patterns can be applied not only to English but to other languages as well, without many changes to the algorithms and methods. For example, the pattern X Y refers to a category such as Information Technology in English, with X = Information and Y = Technology. In Vietnamese, the pattern is denoted as Y X, with Y = Technology = Công nghệ and X = Information = Thông tin, so the Vietnamese category name is Công nghệ Thông tin. In Indonesian (Bahasa Indonesia), this pattern is also Y X, like Vietnamese (Teknologi informasi, X = Informasi, Y = Teknologi). The creation of new categories in small and medium-scale languages can rely on these patterns to proceed automatically.

In Table 2.5, some basic NLP patterns point out the feasibility of translating category names automatically from English to Vietnamese with a simple tool.
Table 2.5
Some NLP Patterns which can Describe Category Names.

English                     Vietnamese
[X] in [Y]                  [X] [Y]
  Cities in France            Thành phố Pháp
  X = Cities                  X = Thành phố, Những thành phố (plural) *
  Y = France                  Y = Pháp

X Y                         Y X
  Information Technology      Công nghệ thông tin
  X = Information             X = thông tin
  Y = Technology              Y = Công nghệ

X by Y                      X theo Y
  Birds by country            Chim theo quốc gia
  X = Birds                   X = Chim, Những con chim (plural) *
  Y = country                 Y = quốc gia

* In Vietnamese Wikipedia, the editors prefer not to use plurals for category names in general [23].
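A minimal pattern-based translator of the kind Table 2.5 suggests can be sketched in Python. The word list and function name below are hypothetical illustrations; a real tool would draw the word translations from interwiki links or Wikidata labels:

```python
import re

# Hypothetical English-to-Vietnamese word list, for illustration only.
EN_VI = {
    "Cities": "Thành phố", "France": "Pháp",
    "Birds": "Chim", "country": "quốc gia",
}

# (English pattern, Vietnamese template) pairs taken from Table 2.5.
PATTERNS = [
    (re.compile(r"^(?P<X>.+) in (?P<Y>.+)$"), "{X} {Y}"),    # [X] in [Y] -> [X] [Y]
    (re.compile(r"^(?P<X>.+) by (?P<Y>.+)$"), "{X} theo {Y}"),  # X by Y -> X theo Y
]

def translate_category(name):
    """Translate an English category name via the first matching pattern."""
    for pattern, template in PATTERNS:
        m = pattern.match(name)
        if m:
            x = EN_VI.get(m.group("X"))
            y = EN_VI.get(m.group("Y"))
            if x and y:
                return template.format(X=x, Y=y)
    return None  # no pattern matched, or a word is missing from the list
```

For example, translate_category("Cities in France") produces "Thành phố Pháp", matching the first row of the table.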
In 2006, Chernov and co-authors discussed the semantic relations between Wikipedia categories: if a large number of pages from Category A contain links to Category B, we can conclude that Category A has a semantic relation with Category B. Their experiment compared the relationship between the category Country and other categories. In 2010, the NER (named-entity recognition) task was used to measure the relevance score between a given category and the Software category (Xu, Takeda, Hamasaki & Wu, 2010). The estimation method divides this relationship into three parts, S-Category, Parent Edge and Ancestor Path, and then uses an algorithm to calculate the final score. The outcome can be 80% correct. However, the research scope is limited to the English software categories. In short, the correlation between two Wikipedia categories cannot be compared easily and still holds much promise to uncover.
2.7 Wikipedia Infoboxes

An infobox, a type of template, contains semi-structured content from which semantic relations can be retrieved. The DBPedia Infobox Datasets include two types, the Raw Infobox Dataset and the Mapping-based Dataset (Li & Sima, 2015). The Raw Infobox Dataset cannot deal with reuses of the same attribute (property) name. The latter uses a new extraction method to overcome this drawback, but it does not sufficiently support all infoboxes and their properties. At the moment, DBPedia still has inadequate support for non-English languages. Thus, for non-English articles without interwiki links, DBPedia cannot produce the infobox datasets.

The heterogeneity is what prevents the fusion of infobox datasets across Wikipedia languages. Infobox templates regularly contain three main components: parameters (or properties), parameter labels and their data values. When editors import infoboxes from English Wikipedia into other languages, they can translate everything, or just keep the English parameters and translate their labels and data values. Some editors who dislike English parameters may create those parameters in their own languages. In many cases, two or more infoboxes of a non-English Wikipedia may interlink to a single English infobox. Consequently, the heterogeneity of infobox data keeps growing. Aprosio and colleagues presented a solution that automatically mapped Wikipedia infobox attributes to DBPedia properties in 14 different languages (Aprosio, Giuliano & Lavelli, 2013). Many researchers stored their outputs in external databases that others cannot easily reuse and follow up on. Generally, infobox alignment for all languages is still a challenge.
The second phase (Phase 2) of the Wikidata plan is to collect the infobox structures from different languages and store them at Wikidata. This phase reduces the complexity and the duplication of infobox data across languages. The first version of Wikidata infoboxes was deployed, but it is still not available in practice. To support the alignment process in this phase, as shown in Figure 2.12, we suggest using bilingual parameter couples (property couples) in English and a non-English language. English Wikipedia is the biggest Wikipedia, with millions of articles and infoboxes. Because of this, each bilingual parameter couple needs at least one parameter in English, to utilize its copious data. Then, we update the infobox structure to accept both parameters through a semi-automated mechanism. In this step, we also translate the parameter labels into the non-English language if applicable. We call this step the unification of the Wikipedia infobox structure.

Next, we can convert the data values of the infobox parameters that have interwiki links and, lastly, compare and update the articles that use these infoboxes in some languages. The update process may contain two choices, and the editors must choose one of them:

If the data cannot be converted to the non-English language, editors still update the data to the articles.

(Figure 2.12 sketch: parameter A (English, en) and parameter B (Vietnamese, vi) each has a value; the label converts the data to vi using Wikidata.)
Step 1: Prepare the infoboxes with bilingual parameter couples and translate parameter labels.

English Wikipedia
{{Infobox company
| label1  = [[Country]] | data1  = {{{country|}}}
| label30 = [[Website]] | data30 = {{{homepage|}}}
}}

Vietnamese Wikipedia
{{Infobox company
| label1  = [[Quốc gia]] | data1  = {{{country|{{{quốc gia|}}}}}}
| label30 = [[Website]]  | data30 = {{{homepage|{{{trang chủ|}}}}}}
}}

We translate the [[Country]] label to the [[Quốc gia]] label using interwiki links. The [[Website]] label in Vietnamese is the same as in English. We can add more code to the country and homepage parameters so that both bilingual parameters can be used. The code fragment {{{country|{{{quốc gia|}}}}}} means we can use either the parameter country or quốc gia, which refers to a nation.
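The fallback order of the wikitext (the English name first, then the local alias) can be mirrored in Python. The alias table and function below are a hypothetical sketch covering just the two parameters shown above:

```python
# Bilingual parameter couples: each canonical (English) parameter may also
# be supplied under a local-language alias, as in {{{country|{{{quốc gia|}}}}}}.
ALIASES = {
    "country": ["quốc gia"],
    "homepage": ["trang chủ"],
}

def resolve_parameters(supplied):
    """Map whatever parameter names an editor used onto the English ones."""
    resolved = {}
    for english, local_names in ALIASES.items():
        # English name wins if both are present, matching the wikitext order.
        for name in [english] + local_names:
            if name in supplied:
                resolved[english] = supplied[name]
                break
    return resolved
```

With this helper, the three editor variants in Step 2 below all resolve to the same canonical {"country": ..., "homepage": ...} mapping.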
Step 2: Use the infobox.

An editor can use code such as:

{{Infobox company
| country  = USA
| homepage = http://www.usa.us
}}

or

{{Infobox company
| quốc gia  = USA
| trang chủ = http://www.usa.us
}}

or mixed parameters:

{{Infobox company
| country   = USA
| trang chủ = http://www.usa.us
}}

If the data values of these parameters are in [[ ]], they are internal links which lead to articles and may have interwiki links. We can use a tool [Appendix A] to translate these data values. For example, [[England]] in brackets can be converted to Vietnamese as [[Anh]] or to Thai as [[อังกฤษ]].
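Such a data-value translator can be sketched as follows, assuming the English-to-Vietnamese title mapping comes from Wikidata sitelinks (the small dictionary here is a hypothetical stand-in for a real lookup):

```python
import re

# Hypothetical interwiki table; in practice these come from Wikidata sitelinks.
SITELINKS_EN_TO_VI = {
    "England": "Anh",
    "Mathematics": "Toán học",
}

# Matches [[Target]] and [[Target|display label]].
LINK = re.compile(r"\[\[([^\]|]+)(\|[^\]]*)?\]\]")

def translate_links(wikitext):
    """Replace [[Target]] links with their Vietnamese titles when known.

    Unknown targets are left untouched; display labels after | are dropped
    in this simplified sketch.
    """
    def repl(match):
        target = match.group(1)
        translated = SITELINKS_EN_TO_VI.get(target)
        return "[[%s]]" % translated if translated else match.group(0)
    return LINK.sub(repl, wikitext)
```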
In some Wikipedias, the community editors prefer their own language parameters to English parameters. We can use Wikidata properties to align with the existing parameters; for example, the homepage parameter can map to Property:P856 (official website) on Wikidata. [18] For the data values of these properties, we can use parsers similar to the DBpedia parsers. [16]

In related work, a synchronization platform was built to convert infobox data from English to Korean. [17] That research emphasized how to synchronize infoboxes between Korean Wikipedia and English Wikipedia, and it was successful in updating Korean infoboxes. However, new translation mechanisms have to be built when applying it to other languages.
2.8 Multilingual Approaches

The technical core of DBPedia (Lehmann et al., 2012) was used to form a new extraction framework (Morsey, Lehmann, Auer, Stadler & Hellmann, 2012). From that, we could inherit all the valuable mechanisms of DBPedia and add custom methods to enhance the productivity of extracting semantic relations from Wikipedia. The difficulty is that this requires a very deep understanding of the DBPedia architecture, as well as robust servers to run the extraction rapidly and smoothly. We prefer to find a simpler solution for extracting semantic relations.

[16] http://wiki.dbpedia.org/Internationalization/Guide#h152-4
[18] https://www.wikidata.org/wiki/Property_talk:P856
In 2011, Nguyen and co-authors deployed WikiMatch, a new approach for aligning infoboxes in different languages, with a case study aligning infoboxes in Vietnamese, Portuguese and English. WikiMatch can be good for high cross-language heterogeneity and few data instances. A fixed-point-based matching strategy remains to be investigated to improve its effectiveness.

Using the interwiki link system, Tyers and Pienaar developed a bilingual dictionary as a simple, computationally inexpensive means of retrieving word lists (Tyers & Pienaar, 2008). Future work must further improve the precision with human evaluators. In 2010, de Melo and Weikum introduced MENTA, a multilingual lexical knowledge base built by integrating multilingual information. Heuristic linking functions connect Wikipedia articles, categories, infoboxes and WordNet synsets from multiple languages; this research extracted semantic relations directly from Wikipedia in different languages. The same authors also developed techniques to detect imprecise or wrong interlinks (de Melo & Weikum, 2010). Adar and co-authors introduced Ziggurat for enriching Wikipedia infoboxes by applying self-supervised learning. This automated system can align and create Wikipedia infoboxes, enrich missing information, and detect differences between parallel articles. Their experiments indicated the method's effectiveness, even in the absence of dictionaries. These studies deployed interwiki link detection on the obsolete mechanism, which Wikipedia has since removed and replaced with Wikidata (Vrandečić & Krötzsch, 2014). This thesis uses Wikidata as a central server to align infobox properties and translate terms among languages. Therefore, our approach is different from the previous ones.
2.9 Summary

This chapter pointed out the significant tendencies in the study of enriching Wikipedia content between languages. There are also many areas of Wikipedia whose performance we can enhance by applying a framework for multilingual wikis. Besides, with the rapid development of the new Wikipedia project Wikidata, many opportunities open up for new research and applications in the future (Vrandečić, 2013).
Chapter 3
Proposed Model for Multilingual Wikis
3.1 Introduction
This chapter proposes a general model for multilingual wikis. This model may be applied to different Wikipedia languages. We set English Wikipedia as the origin language for extracting and matching semantic relations. The semantic exploitation focuses on the data that include the most valuable information, such as infoboxes and navigation templates (optional). Other types of data can also be retrieved, such as disambiguations, images, geographic coordinates, etc.
3.2 General Model
[Flowchart: Start → Alignment Process with Wikidata → decision A (no → Finish;
yes → Store aligned structure) → Comparison Process → decision "Had interlinks?"
and decision B (no → Finish; yes → Connect missing interwiki links) → Synthesize
and assess semantic relations → Enrichment Process → Finish.]
A: Can retrieve infobox properties and make alignment for infobox properties and
Wikidata properties (items)?
B: Are missing interwiki links found?
Figure 3.1 General model for multilingual wikis. Adapted from A Model for
Enriching Multilingual Wikis, by T. H. Ta, 2014, p. 338. Copyright 2015 by
Springer.
3.2.1 Alignment process. This process will make the property alignment
between infoboxes and Wikidata. If it gathers inadequate results, DBPedia may be
used in place of Wikidata. However, DBPedia may produce an alignment that differs
considerably from Wikidata's. Therefore, to comply with Phase 2 of Wikidata, we
should use Wikidata. The aligned structure is then stored hidden inside infoboxes
to avoid affecting their usage. Editors can modify this structure appropriately,
so it will publicly support further research. Moreover, we can make an alignment
between navigation templates and Wikidata to extend the aligned structure. The
significant advantage of this process is that it creates aligned structures which
support retrieving semantic relations easily, with an uncomplicated mechanism,
among Wikipedias, and which also support the Phase 2 development of Wikidata. The
outcome directly affects the Comparison process, which cannot operate if no
aligned structures are established.
3.2.2 Comparison process. Comparisons of semantic relations and assessments of
their correlation will be executed to detect missing interwiki links. Depending
on the assessments or matching algorithms, we can conclude the interwiki links
between articles of different languages and then update the sitelinks at
Wikidata. For articles that already have interwiki links, this step is
unnecessary. Next, we synthesize all semantic relations and other optional
relations (categories, images, geographic coordinates, etc.) to prepare for the
next process.
3.2.3 Enrichment process. This last process will enrich article content and
Wikidata statements after comparing the semantic relations gathered in the
Comparison process. We can also crosscheck the data (semantic relations which
come mainly from infobox properties) among Wikipedias during enrichment, so that
vandalism may be detected and prevented. We expect to enrich more data for
articles which gained new interwiki links in the Comparison process.
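The three processes described above can be summarized as a minimal pipeline sketch. All function names and data shapes below are our own illustrative simplifications, not the thesis tool's actual interfaces:

```python
# Minimal sketch of the three-stage model: alignment, comparison, enrichment.
# All names and data shapes are illustrative assumptions, not the real tool.

def alignment_process(infobox_params, wikidata_props):
    """Map each infobox parameter to a Wikidata property id (or None)."""
    label_index = {label.lower(): pid for pid, label in wikidata_props.items()}
    return {p: label_index.get(p.lower()) for p in infobox_params}

def comparison_process(aligned, relations_a, relations_b):
    """Count aligned properties whose values match in two parallel articles."""
    shared = [p for p in aligned if p in relations_a and p in relations_b]
    matched = sum(1 for p in shared if relations_a[p] == relations_b[p])
    return matched, len(shared)

def enrichment_process(relations_src, relations_dst):
    """Collect values present in the source article but missing in the target."""
    return {p: v for p, v in relations_src.items() if p not in relations_dst}

aligned = alignment_process(["image", "country"],
                            {"P18": "image", "P17": "country"})
matched, total = comparison_process(aligned,
                                    {"image": "x.jpg", "country": "Q869"},
                                    {"image": "x.jpg", "country": "Q869"})
missing = enrichment_process({"image": "x.jpg", "country": "Q869"},
                             {"image": "x.jpg"})
```

A full match (here 2 of 2) would let the Comparison process propose an interwiki link, after which the Enrichment process copies over any missing values.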
3.3 Align Infobox Parameters with Wikidata Properties
A semi-automated tool will be created to support searching for and aligning the
semantic equivalence between Wikidata properties (or items, when no relevant
property exists) and infobox properties. First of all, we choose infoboxes of
non-English Wikipedias which have interwiki links with English Wikipedia. We do
this because such infoboxes tend to have more similar properties, even in
different languages. Then, we get all the properties from these infoboxes. Next,
we search Wikidata for each property to find the corresponding property or item.
If we cannot find anything on Wikidata, we skip this property and mark it with
the specific label unknown.
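The search-and-align loop just described can be sketched as follows; search_results stands in for the responses a real Wikidata search (e.g. the wbsearchentities API module) would return, and in the actual tool a human editor makes the final choice:

```python
# Hedged sketch of the semi-automated alignment step: for each infobox
# property we "search" Wikidata and either record the best hit or mark the
# property as unknown. search_results is mock data standing in for real
# Wikidata search responses; the structure is an assumption.

def align_properties(infobox_props, search_results):
    alignment = {}
    for prop in infobox_props:
        hits = search_results.get(prop.lower(), [])
        # a human editor would pick the best hit; here we take the first one
        alignment[prop] = hits[0] if hits else "unknown"
    return alignment

search_results = {
    "image": ["Property:P18"],
    "country": ["Property:P17"],
    "motto": [],            # nothing found on Wikidata
}
result = align_properties(["image", "country", "motto"], search_results)
```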
We check the alignment between Template:Infobox school in English and
Wikidata which is shown in Table 3.1.
Table 3.1
The Alignment between Template:Infobox School and Wikidata.

Property of Template:Infobox school    Wikidata
Image                                  Property:P18
Name                                   Unknown
Location                               Unknown
Country                                Property:P17
Coordinates                            Property:P625
We can add more information to the alignment, such as redirects and related
templates, which help detect missing interwiki links more effectively, as shown
in Table 3.2.
Table 3.2
The Alignment between Infoboxes (Bản mẫu:Trường học in Vietnamese and
Template:Infobox School in English) and Wikidata.

                 Vietnamese                   English                           Wikidata
Template name    Trường học                   Infobox school                    Q5618975
Properties       Hình                         Image                             Property:P18
                 Tên                          Name                              NA
                 Nước                         Country                           Property:P17
                 Coor                         Coordinates                       Property:P625
                 Hiệu trưởng                  Principal                         Q1056391
Redirects        trường                       Template:School,                  NA
                                              Template:Infobox HighSchool,
                                              Template:Infobox OtherEducation,
                                              Template:Infobox Private School
Related          Bản mẫu:Đại học,
templates        Bản mẫu:Infobox University,
                 Bản mẫu:Infobox university

* A tool helps to search for similar properties on Wikidata, and human decisions
assign the best Wikidata property to every property of the infobox.
with dictionaries, WordNet, translation, NLP algorithms, and assessments by
linguistic experts.
If there is no alignment between Wikidata properties and infobox properties,
we can use DBPedia as an optional source to make the alignment. Nevertheless,
this is not our recommendation, because DBPedia will produce an aligned
structure that does not match the Wikidata metadata or Phase 2 of the Wikidata
plan.19
Aligned results can be stored in XML format in infoboxes between noinclude
tags, for example <noinclude>alignedresults</noinclude>. This does not
affect the infoboxes, which are embedded in Wikipedia articles. As mentioned
above in the Alignment process, these XML fragments can be reused by later
research and help the infobox alignment of Wikidata at Phase 2.
Here is the aligned structure of Template:Infobox school in XML format:
Template:Infobox school
...<noinclude><!--
<infobox lang="en" name="Template:Infobox school" synonyms=""
redirects="Template:School, Template:Infobox HighSchool, ..."
wikidata="Q5618975" relationship="">
  <properties>
    <property name="image" synonyms="portrait, illustration, picture"
      wikidata="Property:P18" description="a relevant illustration"
      datatype="Commons media file">
    </property>
    ...
  </properties>
</infobox>
--></noinclude>
19
https://www.wikidata.org/w/index.php?title=Wikidata:Introduction&oldid=42871496
3.4 Detect, Connect Missing Interwiki Links and Synthesize Semantic Relations
We define two types of semantic relations. The first type is retrieved from the
redirects, categories, external links, internal links, images, videos, and audio
of articles. These semantic relations can be represented as RDF triples, which
are not always found in DBPedia because of its insufficient support and low
update frequency. When there are no semantic relations on DBPedia, we will
create them ourselves. The simple solution is to use a bot that gets the
semantic relations from article content through the Wikipedia APIs.
The second type is retrieved from infoboxes or any templates that have
structured metadata. RDF triples will be built from these semantic relations.
Since infoboxes regularly summarize the information of articles, these semantic
relations are helpful for detecting interwiki links.
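A bot building the first type of relations might turn the metadata fetched through the MediaWiki API (for example its query modules for redirects, categories, and links) into RDF-like triples. The sketch below assumes the metadata has already been fetched; the response shape is our simplification:

```python
# Sketch of turning article metadata into RDF-like triples. api_response is
# mock data standing in for an already-fetched, simplified MediaWiki API
# response; the predicate names follow the relations used in this chapter.

def to_triples(article, api_response):
    triples = []
    for r in api_response.get("redirects", []):
        triples.append((article, "has-redirect", r))
    for c in api_response.get("categories", []):
        triples.append((article, "has-category", c))
    for l in api_response.get("links", []):
        triples.append((article, "links-to", l))
    return triples

api_response = {"redirects": ["Con chó"],
                "categories": ["Thể loại:Chó"],
                "links": ["Canis"]}
triples = to_triples("Chó", api_response)
```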
The sample articles will be classified into two groups: non-Latin-based
alphabet Wikipedias and Latin-based alphabet Wikipedias. We prefer to focus on
the latter. As stated in the introduction, English Wikipedia has a high
collaborative quality. It may be a valuable source for identifying interwiki
links with other Wikipedias. Likewise, any Wikipedia with high collaborative
quality, such as German Wikipedia, French Wikipedia, and Spanish Wikipedia, can
be considered a source for finding interlinks. In this thesis, we want to
compare articles of all Wikipedias with English articles to search for
interwiki links.
Suppose we want to detect an interwiki link for an unlinked article in
Vietnamese. First, we should look at the article. We must understand the article
content and search for relevant articles in English using some defined keywords.
If we find the needed article, we connect it to the Vietnamese article. This
task requires an understanding of English, of Vietnamese, and of the article's
subject. However, we try to make this task simple enough for a machine to
perform by excluding human, translation, and NLP approaches to finding
similarities among articles in various languages. Instead, we use the article
name and its redirects. There is a strong tendency to use the same or nearly the
same article name across Latin-based alphabet Wikipedias. This may only hold for
articles about cities, people, biological species, proper nouns, acronyms, etc.
Additionally, many articles are translated from English into non-English
languages. This reduces users' efforts in building and developing articles from
scratch. Therefore, it is easier to identify whether a certain article name of a
Latin-based alphabet Wikipedia has an English counterpart. In contrast, it is
very difficult to recognize whether an article of a non-Latin-based alphabet
Wikipedia has an English version, because of the different alphabets.
3.4.1 Compared list. The most difficult task is to determine which article in
language B an article A in language A should interlink with. To do that, we have
to create a comparison list (candidate articles) of language B against which
article A will be compared. Supposing that language B has 4 million articles, it
is not feasible to execute a linear algorithm to match A against all 4 million
articles of B.
Because of this difficulty, we must reduce the size of the comparison list.
Normally, when we search for an object, we use its name as the first criterion.
In this case, the article name and its redirects can be used to define the
comparison list.
In Table 3.3, we have an article about Barack Obama, the current American
president. This article's name in Vietnamese and French is Barack Obama, the
same as in English. However, in Thai and Chinese the article name is quite
different: it appears in the native script and may confuse editors who cannot
read these non-Latin languages.
Table 3.3
Article Titles about Barack Obama in some Languages.

Latin Wikipedias
  French:      Barack Obama
  Vietnamese:  Barack Obama
Non-Latin Wikipedias
  Thai:        บารัก โอบามา
  Chinese:     贝拉克·奥巴马
Table 3.4
Article Titles about Dog in Vietnamese and English.

Vietnamese                              English
Page name: Chó                          Page name: Dog
Redirects: Con chó,                     Redirects: Canis lupus
Canis lupus familiaris, ...             familiaris, ...
In Table 3.4, with the article name Chó in Vietnamese, we can never find an
article with the same name in English because of the language difference.
However, if we search by redirects, we realize that the Chó article in
Vietnamese may be related to the Dog article in English, because both contain
the redirect Canis lupus familiaris, which is the dog's scientific name.
Creating a comparison list by searching by name and redirects works well for
Latin-based alphabet Wikipedias, which show many resemblances in how article
names and redirects are used. This method typically provides one article in the
comparison list. It reduces the number of comparisons, but may affect the
outcome when no matching results are found or the comparison list is empty.
Thus, in future research, we will apply methods which detect and compare
similarities using images, videos, categories, internal links, and the semantic
structure of articles.
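The comparison-list construction of this section can be sketched as follows, with target_index standing in for a pre-built title-and-redirect index of the target language (an illustrative data structure, not the tool's real one):

```python
# Sketch of Section 3.4.1: the comparison list for a source article contains
# every target-language page whose title or redirects overlap the source
# article's title or redirects. target_index is mock data.

def comparison_list(names_src, target_index):
    names = {n.lower() for n in names_src}
    candidates = []
    for page, redirects in target_index.items():
        target_names = {page.lower()} | {r.lower() for r in redirects}
        if names & target_names:       # any shared title or redirect
            candidates.append(page)
    return candidates

target_index = {
    "Dog": ["Canis lupus familiaris", "Domestic dog"],
    "Cat": ["Felis catus"],
}
candidates = comparison_list(["Chó", "Canis lupus familiaris"], target_index)
```

Because names rarely collide, the list usually contains a single candidate, which matches the observation above that this method typically yields one article.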
3.4.2 Semantic relations based on article structure. Besides the article name,
a semantic structure must be built for each article so that machines can
understand it when they automatically execute matching processes. The simplest
structure organizes an article by its relationships to categories, images,
terms, templates, and others. For example, in Vietnamese Wikipedia, the Alcina
article does not have an interwiki link. By reading its source code (wiki
markup), we form its structure.
Here is the Alcina article's source code at Vietnamese Wikipedia:
[[Tập tin:George Frideric Handel by Balthasar Denner.jpg|
250px|nhỏ|George Frideric Handel]]
'''Alcina''' là vở [[opera]] 3 màn của [[nhà soạn nhạc]]
[[người Anh gốc Đức]] [[George Frideric Handel]]. Tác
phẩm được trình lần đầu tiên tại [[London]], [[Anh]] vào
năm [[1735]]<ref>Từ điển tác giả, tác phẩm âm nhạc phổ
thông, Vũ Tự Lân, xuất bản năm 2007</ref>.
==Chú thích==
{{tham khảo}}
[[Thể loại:Opera]]
In Table 3.5, we use the sitelinks of Wikidata to translate the terms from
Vietnamese into English.
Table 3.5
The Semantic Structure of the Alcina Article at Vietnamese Wikipedia in English.

Alcina (Vietnamese Wikipedia)
Term (links-to):         opera, composer, London, England, 1735,
                         George Frideric Handel, Ludovico Ariosto,
                         Riccardo Broschi, ...
Category (has-category): Opera
Template (has-template): {{Reflist}}
Image (has-image):       George Frideric Handel by Balthasar Denner.jpg

* Note: terms in bold do not have their own articles or interlinks
Then, from Table 3.5, we form the graph in Figure 3.2. This graph removes all
terms which could not be translated into English.
[Figure 3.2 shows the Alcina node connected by links-to edges to opera,
composer, human, London, England, 1735, George Frideric Handel, Ludovico
Ariosto, and Riccardo Broschi; by a has-category edge to Opera; by a
has-template edge to Reflist; and by a has-image edge to the George Frideric
Handel image file.]
Figure 3.2 The graph of the semantic structure of the Alcina article at
Vietnamese Wikipedia in English.
The semantic relations in Figure 3.2 can be seen as weak relations, because
Wikipedia article content depends on user contributions. Different articles in
different languages may therefore form different semantic structures. That is
the crucial reason why we cannot rely on these structures alone for detecting
interlinks, although our first idea was to compare these structures and
conclude that interwiki links may exist among articles. However, we can use
this structure to support the assessment when detecting missing interwiki links
in the next section.
3.4.3 Semantic relations from infobox and navigation templates. In this
section, we primarily retrieve semantic relations from the infoboxes and
navigation templates embedded in articles. Other templates may be used if they
provide good semantic relations. All articles will be scanned in order to
choose the ones that contain infoboxes with an aligned structure from the
Alignment process. For articles that already have interwiki links with English
Wikipedia, we just retrieve the semantic relations from the infobox properties.
For the others, we detect the missing interwiki links, connect them, and then
synthesize the semantic relations as well. For example, the Chó article in
Vietnamese does not have an interwiki link with English Wikipedia. After
searching by its name, we can find the candidate Dog article in English
(Table 3.4). Then, a bot reads the content of the two articles and collects
semantic relations from Template:Taxobox, which was aligned as in Section 3.3.
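Collecting semantic relations from an infobox can be sketched with a naive "| name = value" line parser; a production bot would use a real wikitext parser instead, since nested templates break this simplification:

```python
# Rough sketch of collecting semantic relations from an infobox: a naive
# parser for "| name = value" lines. This is an illustrative simplification;
# a real bot would use a proper wikitext parser.

def infobox_relations(wikitext):
    relations = {}
    for line in wikitext.splitlines():
        line = line.strip()
        if line.startswith("|") and "=" in line:
            name, _, value = line[1:].partition("=")
            relations[name.strip()] = value.strip()
    return relations

taxobox = """{{Taxobox
| regnum = Animalia
| ordo = Carnivora
| genus = Canis
}}"""
rels = infobox_relations(taxobox)
```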
Table 3.6
Semantic Relations of the Chó (vi) and Dog (en) Articles.

Vietnamese                             English
Language: vi                           Language: en
Page_name: Chó                         Page_name: Dog
Redirects: Con chó, Canis familiaris,  Redirects: Canis lupus familiaris,
Cún, Chó nhà, Cẩu, ...                 Canis Canis, Domestic dog, ...
Name: Chó                              Name: Dog
Type: species                          Type: species
Regnum: Động vật                       Regnum: Animalia
Ordo: Bộ Ăn thịt                       Ordo: Carnivora
Familia: Họ Chó                        Familia: Canidae
Genus: Chi Chó                         Genus: Canis
Species: Sói xám                       Species: Gray Wolf
Binomial:                              Binomial:
Binomial_authority:                    Binomial_authority:
Synonyms:                              Synonyms:
Table 3.7
The Comparison Result between Dog Articles in Vietnamese and English.

PAGE      vi:Chó  <>  en:Dog
Type:     species
RESULT    Score: 5/5    Percentage: 100%
DETAIL
Species (OK)    vi:Sói xám     <>  en:Gray Wolf
Genus (OK)      vi:Chi Chó     <>  en:Canis
Ordo (OK)       vi:Bộ Ăn thịt  <>  en:Carnivora
Familia (OK)    vi:Họ Chó      <>  en:Canidae
Regnum (OK)     vi:Động vật    <>  en:Animalia
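The comparison behind Table 3.7 can be sketched as follows; the small vi-to-en map here stands in for the term translation that our model obtains from Wikidata sitelinks:

```python
# Sketch of the Taxobox comparison: aligned properties of the two articles
# are translated to a common vocabulary (a toy vi-to-en map standing in for
# Wikidata sitelinks) and then matched. Names are illustrative.

def match_score(rel_vi, rel_en, vi_to_en):
    compared = [p for p in rel_vi if p in rel_en]
    ok = sum(1 for p in compared if vi_to_en.get(rel_vi[p]) == rel_en[p])
    return ok, len(compared)

vi_to_en = {"Động vật": "Animalia",
            "Bộ Ăn thịt": "Carnivora",
            "Chi Chó": "Canis"}
ok, total = match_score(
    {"regnum": "Động vật", "ordo": "Bộ Ăn thịt", "genus": "Chi Chó"},
    {"regnum": "Animalia", "ordo": "Carnivora", "genus": "Canis"},
    vi_to_en,
)
```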
If all these semantic relations match, we can hypothesize that the two articles
should share an interwiki link. A bot will then automatically connect them by
adding sitelinks on Wikidata, or set an alert template which notifies editors
and lets them make the final decision. We do not use one fixed assessment for
all articles: articles are identified by a short abstract (which may refer to a
type), different articles can have different types, and therefore they receive
different assessments based on the gathered semantic relations and on how we
apply the proper assessment. For example,
Table 3.8
The Alignment Table.

Language A     Language B     Wikidata
Infobox AI     Infobox BI     Q123
AI_P1          BI_P1          P51 (string)
AI_P2          BI_P2          NA*
AI_P3          BI_P3          P7 (time)
AI_P4          BI_P4          P34 (quantity)

* Not available
Table 3.9

Vietnamese   English          Wikidata           Articles              Vietnamese   English
property     property                                                  value        value
             population       P1082 (quantity)   Population of Hanoi   7.067.000    7,067,000
quốc gia     Country          P17 (item)         Thailand              Thái Lan     Thailand
mã số GND    GND identifier   P227 (string)      GND of Qatar          4029924-7    4029924-7
ngày sinh    date of birth    P569 (time)                              29/11/2000   29 November, 2000
                                                                                    (11/29/2000)
Table 3.9 shows that data values can be represented differently in English and
Vietnamese. For the quantity data type, English uses a comma to separate groups
of three digits, but Vietnamese uses a full stop. Item data values (of an entity
or a class) can be seen as existing articles at the Wikipedia projects; for
example, the Thailand article in English has an interwiki link with the Thái Lan
article in Vietnamese. The plurality of data values is also very complex, as the
datetime case shows. We prioritize enriching infobox properties whose data
values are strings, items, and quantities. For other data types, the enrichment
depends greatly on translation or conversion. We try to perform this step as
thoroughly as possible.

20 https://www.wikidata.org/wiki/Wikidata:Glossary#Qualifier
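The locale normalization needed before comparing or enriching quantity and datetime values can be sketched as:

```python
# Sketch of normalizing data values between the two locales: quantities use
# "." as the thousands separator in Vietnamese and "," in English; the dates
# in Table 3.9 flip between MM/DD/YYYY and DD/MM/YYYY.

def en_quantity_to_vi(value):
    return value.replace(",", ".")

def en_date_to_vi(value):
    # "11/29/2000" (MM/DD/YYYY) -> "29/11/2000" (DD/MM/YYYY)
    month, day, year = value.split("/")
    return f"{day}/{month}/{year}"

q = en_quantity_to_vi("7,067,000")
d = en_date_to_vi("11/29/2000")
```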
3.5.2 Wikidata statements enrichment. For now, we do not focus on enriching
Wikidata statements. However, we recommend that this process be carried out
when we enrich Wikipedia. For example, in Table 3.6, suppose the
Binomial_authority property has the value Carl Linnaeus in English; we can then
fill in the Binomial_authority property value in Vietnamese if it does not
exist. Then, if the Wikidata item Q144 lacks the statement taxon author, we can
insert it with the value Carl Linnaeus.
3.5.3 Category enrichment. Categories can be chosen as an enrichment source
for Wikipedia content. We use Wikidata to translate categories from English to
other languages. Then, we add these categories to the articles of the other
language editions. The enrichment must comply with Wikipedia policies such as
Categorization 21 and Naming Conventions 22. We also created the Category tool
to do this step (Appendix B).
For example, Asparagus persicus is a flowering plant. We have two versions of
this species article, in English and Vietnamese. The English version has 9
categories; the Vietnamese version has 2. All these categories are shown in
Table 3.10.
Table 3.10
The Categories of Asparagus Persicus in English and Vietnamese.

Asparagus persicus (Wikidata: Q4807699)
English category list               Vietnamese category list
Category:Asparagus                  Thể loại:Chi Măng tây
Category:Flora of Turkey            Thể loại:Thực vật được mô tả năm 1875
Category:Flora of Iran
Category:Flora of Afghanistan
Category:Flora of Uzbekistan
Category:Flora of Tajikistan
Category:Flora of Kazakhstan
Category:Flora of China
Category:Flora of Russia

21 https://en.wikipedia.org/wiki/Wikipedia:Category_names
22 https://en.wikipedia.org/wiki/Wikipedia:Categorization
From Table 3.10, we translate the categories from English to Vietnamese. For
each English category (EC), we check it at Wikidata and get the corresponding
Vietnamese category (VC). Next, if VC does not exist in the Vietnamese category
list and does not have any parent-child relationship with the categories of
this list, we add VC to the Enrichment list.
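This translate-then-filter step can be sketched as follows; sitelinks stands in for the category correspondences obtained from Wikidata, and the parent-child check is omitted for brevity:

```python
# Sketch of the category-enrichment step: translate each English category
# through a sitelink table (mock data standing in for Wikidata), then keep
# only translations not already on the Vietnamese article. The parent-child
# relationship check described in the text is omitted here.

def enrichment_list(en_categories, sitelinks, vi_categories):
    enrich = []
    for ec in en_categories:
        vc = sitelinks.get(ec)   # corresponding Vietnamese category, if any
        if vc and vc not in vi_categories and vc not in enrich:
            enrich.append(vc)
    return enrich

sitelinks = {
    "Category:Asparagus": "Thể loại:Chi Măng tây",
    "Category:Flora of Iran": "Thể loại:Thực vật Iran",
}
result = enrichment_list(
    ["Category:Asparagus", "Category:Flora of Iran",
     "Category:Flora of Uzbekistan"],
    sitelinks,
    ["Thể loại:Chi Măng tây"],
)
```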
Table 3.11
The Translation List and Enrichment List of Asparagus Persicus.

Asparagus persicus (Wikidata: Q4807699)
English category list            Translation list
Category:Asparagus               Thể loại:Chi Măng tây
Category:Flora of Turkey         Thể loại:Thực vật Thổ Nhĩ Kỳ
Category:Flora of Iran           Thể loại:Thực vật Iran
Category:Flora of Afghanistan    Thể loại:Thực vật Afghanistan
Category:Flora of Uzbekistan     NA
Category:Flora of Tajikistan     NA
Category:Flora of Kazakhstan     Thể loại:Thực vật Kazakhstan
Category:Flora of China          Thể loại:Thực vật Trung Quốc
Category:Flora of Russia         Thể loại:Thực vật Nga

Vietnamese category list: Thể loại:Chi Măng tây; Thể loại:Thực vật được mô tả
năm 1875
Enrichment list: Thể loại:Thực vật Thổ Nhĩ Kỳ; Thể loại:Thực vật Iran;
Thể loại:Thực vật Afghanistan; Thể loại:Thực vật Kazakhstan;
Thể loại:Thực vật Trung Quốc; Thể loại:Thực vật Nga
consensus. External links offer readers more reference sources in case they
cannot find enough information in the articles. For bottom templates, we can
use Wikidata to translate the template names from English to other languages
and add them to the article content. Similarly, the gallery section provides
visual content, so we can add this part to the article content as well.
Chapter 4
Experiments and Obtained Results
4.1 Preparation Steps
In Chapter 3, the proposed model can work with articles that already have
interwiki links. However, we prefer to use articles which lack interwiki links,
because with them we can demonstrate the process of detecting and connecting
interwiki links. As mentioned in Chapter 1, we focus on the alignment between
Vietnamese Wikipedia and English Wikipedia. We choose the infoboxes which are
used in the most articles.23 For Vietnamese Wikipedia, this list is shown in
Figure 4.1.
23 https://vi.wikipedia.org/wiki/Special:MostTranscludedPages
4.2 Biological Domain
Vietnamese Wikipedia has passed 1 million articles, with many thousands of
biological articles mainly created by bots. These stub articles lack interlinks
because the bots generated them automatically from external databases.
Furthermore, many local editors have paid no attention to enriching these
unattractive articles. Therefore, we need a solution to this problem. One
feasible solution is to enrich these articles from other Wikipedias, for
example English Wikipedia. To do so, we first need to connect these articles to
their English counterparts. In Figure 4.1, Taxobox (in bold) is embedded in
791,888 pages of Vietnamese Wikipedia. Therefore, applying this infobox to our
model may be a valuable point. We decided to choose biological articles as our
input: articles which contain Template:Taxobox and have no interwiki links24 to
English Wikipedia. In Figure 4.2, our tool allows the user to press the No
Interlinks button to get articles without interwiki links, and then press the
Detect button to choose the articles which embed Taxobox and to search by
article name in English Wikipedia, building the comparison list as in
Section 3.4.1.
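As an aside, Vietnamese articles without an English interwiki link could also be listed with a SPARQL query (Section 2.4) against the Wikidata Query Service. The query below is a possible sketch of that idea, not necessarily the mechanism our tool uses:

```python
# A possible SPARQL query (an assumption, not the tool's actual mechanism)
# listing titles of Vietnamese Wikipedia articles whose Wikidata item has
# no English Wikipedia sitelink.
QUERY = """
SELECT ?viTitle WHERE {
  ?vi schema:about ?item ;
      schema:isPartOf <https://vi.wikipedia.org/> ;
      schema:name ?viTitle .
  FILTER NOT EXISTS {
    ?en schema:about ?item ;
        schema:isPartOf <https://en.wikipedia.org/> .
  }
}
LIMIT 100
"""

# The query could be submitted with the standard library, for example:
# import urllib.parse, urllib.request, json
# url = ("https://query.wikidata.org/sparql?format=json&query="
#        + urllib.parse.quote(QUERY))
```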
wiki_wiki&limit=500&offset=0
25 http://iczn.org/iczn/index.jsp
26 http://www.iapt-taxon.org/nomen/main.php
Template:Taxobox has relationships with Template:Automatic taxobox and
Template:Speciesbox. The alignment between the Vietnamese and English Taxobox
templates and Wikidata is shown in Table 4.1.

Table 4.1
The Alignment between Bảng phân loại (Vietnamese), Taxobox (English) and
Wikidata.

                 Vietnamese                 English                Wikidata
Template name    Bảng phân loại             Taxobox                Q52496
Properties       status_system              status_system          Property:P141
                 image, hình                image                  Property:P18
                 range_map                  range_map              Property:P181
                 binomial                   binomial               Property:P225
                 species, loài              species                Q7432
                 genus, chi                 genus                  Q34740
                 familia, họ                familia                Q35409
                 ordo, bộ                   ordo                   Q10861678
                 class, lớp                 class                  Q37517
                 regnum                     regnum                 Q36732
                 domain                     domain                 Q146481
Redirects        Bản mẫu:Phân loại          Wikipedia:TX,          NA
                 khoa học,                  Wikipedia:TAXOBOX
                 Bản mẫu:Taxobox
Related                                     Template:Automatic
templates                                   taxobox,
                                            Template:Speciesbox
Here is the XML of Template:Taxobox after making the alignment.
<infobox lang="en" name="Template:Taxobox" synonyms=""
redirects="Wikipedia:TX, Wikipedia:TAXOBOX, Template:Infobox virus,
Template:Infobox Taxobox" wikidata="Q52496" parent="" children=""
relationship="Template:Automatic taxobox, Template:Speciesbox">
<parameters>
<parameter name="status_system" alternativename="" synonyms="IUCN
conservation status" wikidata="Property:P141" description="conservation status
assigned by the International Union for Conservation of Nature"
datatype="Item"></parameter>
<parameter name="image" alternativename="" synonyms="portrait,
illustration, picture" wikidata="Property:P18" description="a relevant illustration; more
specific properties should be used when more description is required"
datatype="Commons media file"></parameter>
<parameter name="range_map" alternativename="" synonyms="range map
image" wikidata="Property:P181" description="range map of a taxon"
datatype="Commons media file"></parameter>
<parameter name="binomial" alternativename="" synonyms="taxon name"
wikidata="Property:P225" description="the scientific name of a taxon (in biology)"
datatype="String"></parameter>
<parameter name="binomial_authority" alternativename="" synonyms="taxon
author" wikidata="Property:P405" description="the author(s) that (optionally) may be
cited with the scientific name" datatype="Item"></parameter>
<parameter name="domain" alternativename="" synonyms=""
wikidata="Q146481" description="taxonomic rank" datatype="String"></parameter>
<parameter name="regnum" alternativename="" synonyms=""
wikidata="Q36732" description="taxonomic rank" datatype="String"></parameter>
<parameter name="phylum" alternativename="" synonyms=""
wikidata="Q38348" description="taxonomic rank" datatype="String"></parameter>
<parameter name="class" alternativename="" synonyms=""
wikidata="Q37517" description="taxonomic rank" datatype="String"></parameter>
<parameter name="ordo" alternativename="" synonyms=""
wikidata="Q10861678" description="taxonomic rank"
datatype="String"></parameter>
<parameter name="familia" alternativename="" synonyms="family"
wikidata="Q35409" description="taxonomic rank" datatype="String"></parameter>
<parameter name="genus" alternativename="" synonyms=""
wikidata="Q34740" description="taxonomic rank" datatype="String"></parameter>
<parameter name="species" alternativename="" synonyms=""
wikidata="Q7432" description="taxonomic rank" datatype="String"></parameter>
</parameters>
</infobox>
Template:Automatic taxobox
<parameter name="image" alternativename="" synonyms="portrait,
illustration, picture" wikidata="Property:P18" description="a relevant illustration; more
specific properties should be used when more description is required"
datatype="Commons media file"></parameter>
<parameter name="range_map" alternativename="" synonyms="range map
image" wikidata="Property:P181" description="range map of a taxon"
datatype="Commons media file"></parameter>
<parameter name="binomial" alternativename="" synonyms="taxon name"
wikidata="Property:P225" description="the scientific name of a taxon (in biology)"
datatype="String"></parameter>
<parameter name="binomial_authority" alternativename="" synonyms="taxon
author" wikidata="Property:P405" description="the author(s) that (optionally) may be
cited with the scientific name" datatype="Item"></parameter>
<parameter name="taxon" alternativename="" synonyms="latin name,
scientific name" wikidata="Property:P225" description="the scientific name of a taxon
(in biology)" datatype="String"></parameter>
<parameter name="authority" alternativename="" synonyms="taxon author"
wikidata="Property:P405" description="the author(s) that (optionally) may be cited
with the scientific name" datatype="Item"></parameter>
<!-- Q items -->
<parameter name="genus" alternativename="" synonyms=""
wikidata="Q34740" description="taxonomic rank" datatype="String"></parameter>
<parameter name="species" alternativename="" synonyms=""
wikidata="Q7432" description="taxonomic rank" datatype="String"></parameter>
</parameters>
</infobox>
Template:Speciesbox
<infobox lang="en" name="Template:Speciesbox" synonyms="" redirects=""
wikidata="Q14449650" parent="" children="" relationship="Template:Taxobox">
<parameters>
<!-- Properties -->
<parameter name="taxon" alternativename="" synonyms="latin name,
scientific name" wikidata="Property:P225" description="the scientific name of a taxon
(in biology)" datatype="String"></parameter>
<parameter name="authority" alternativename="" synonyms="taxon author"
wikidata="Property:P405" description="the author(s) that (optionally) may be cited
with the scientific name" datatype="Item"></parameter>
<!-- Q items -->
<parameter name="genus" alternativename="" synonyms=""
wikidata="Q34740" description="taxonomic rank" datatype="String"></parameter>
<parameter name="species" alternativename="" synonyms=""
wikidata="Q7432" description="taxonomic rank" datatype="String"></parameter>
</parameters>
</infobox>
4.3 Results of Aligning Biological Species
In Table 4.2, we executed the comparisons four times with 100 random couples,
four times with 200 random couples, and once with 1000 random couples. The rate
of matches at 80% or higher is not much different from that of the manual
method. The matching percentage can be slightly higher because we removed the
articles related to monospecificity. We realized that a large number of couples
need to be merged, which could stem from mistakes by bots and editors; merging
helps reduce the repetition of articles. The new interlinks found in this case
study are around 30%-40%, which shows that many biology articles still lack
interlinks. To connect the interwiki links of these articles, we will place a
suggestion template into them and may leave the final judgment to the editor
community.
Table 4.2
Results of Comparing Article Couples in Vietnamese and English.

No.    Random     Manual      >=80%       =100%       Merge      New
       Couples    Matching    Matching    Matching    needed     interlinks
1      100        80          77          64          37         40
2      100        84          83          67          40         43
3      100        77          76          67          31         45
4      100        78          76          58          32         44
Mean              79.75%      78%         64%         35%        43%
5      200        165         163         120         98         65
6      200        164         156         118         89         67
7      200        155         149         119         81         68
8      200        160         158         130         85         73
Mean              80.5%       78.25%      60.88%      44.13%     34.13%
9      1000       819         788         575         463        325
                  (81.9%)     (78.8%)     (57.5%)     (46.3%)    (32.5%)
However, a bot can automatically connect interwiki links for the articles with
a match of 80% or higher. The next step is to retrieve as many semantic
relations as possible to help enrich the article content. In this case study,
the machine can easily detect the missing interwiki links among articles
because of the similarities in infobox format and in article names, as well as
the redirects of Latin-based alphabet Wikipedias.
In Table 4.3, we randomly chose 50 couples with new interwiki links; after
connecting the interwiki links, we enriched the Vietnamese articles with
categories, external links, and bottom templates. We did not enrich the Taxobox
properties, because most articles already have a full set of Taxobox
properties: the bots created these articles following a clear, universal
format.
Table 4.3
Result of Enriching Vietnamese Articles by Categories, External Links and
Bottom Templates.

No.     Bytes added    Article size          Percentage
                       (after enriching)
1       0              1421                  0.00%
2       253            2261                  12.60%
3       0              1371                  0.00%
4       0              1635                  0.00%
5       132            1656                  8.66%
6       170            1760                  10.69%
7       240            1497                  19.09%
8       140            1339                  11.68%
9       281            1010                  38.55%
10      394            1733                  29.42%
...
45      581            1755                  49.49%
46      364            1154                  46.08%
47      453            1615                  38.98%
48      489            1679                  41.09%
49      441            1626                  37.22%
50      551            1678                  48.89%
Mean    182.68         1478.43               14.10%
In some cases, we could not enrich any content because both the Vietnamese and
English articles are stubs, with a mean size of 1478.43 bytes; there is not
much data to exploit. On average, we enriched +182.68 bytes, or +14.10%, per
article. With 791,888 pages containing Taxobox, we believe that this alignment
can contribute significant content to Vietnamese Wikipedia, at least in the
biological domain.
Chapter 5
Conclusions and Recommendations
5.1 Conclusions
Our proposed model is a new approach, based on the property alignment between
Wikipedia infoboxes and Wikidata, to enriching the articles of all Wikipedias,
especially Latin-based alphabet Wikipedias, and Wikidata statements.
The aligned structures of infoboxes are valuable sources for stakeholders who
retrieve the semantic relations or reuse them in their work, openly and
independently. These structures can be updated by everyone, so the semantic
relations retrieved for the Enrichment process may be the latest ones.
Therefore, our model can overcome the low update frequency of semantic
relations that was a problem of DBPedia. Furthermore, these structures support
the infobox unification of Wikidata, since bot accounts are able to map the
infobox properties directly to Wikidata without needing any translation tasks,
semantic algorithms, or human effort.
The comparison list is created by matching article titles across languages. Titles that are proper names, place names, scientific names, and so on are easy to detect in Latin-alphabet languages because of their similar naming conventions. Our model is therefore especially beneficial in domains such as biological species (scientific names), places (cities, towns), persons, chemical compounds, symbols, years, numbers, and asteroids.
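A minimal sketch of this title-matching step, assuming we already have the title lists of two language editions and the set of titles that are interlinked; all names are illustrative:

```python
def find_missing_interwiki(titles_a, titles_b, linked):
    """Titles that appear verbatim in both language editions (common for
    scientific names in Latin-alphabet Wikipedias) but are not yet linked."""
    shared = set(titles_a) & set(titles_b)
    return sorted(shared - set(linked))

en = ["Panthera leo", "Felis catus", "History of art"]
vi = ["Panthera leo", "Felis catus", "Hà Nội"]
already_linked = ["Felis catus"]
print(find_missing_interwiki(en, vi, already_linked))  # → ['Panthera leo']
```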
Our model is well suited to enriching articles that are stubs or that lack editor attention in small- and medium-scale Wikipedias, where the enrichment gain is greatest. In this thesis, we successfully enriched article content for biological species in several datasets. We also proposed the possibility of enriching Wikidata statements; however, we did not include this in our implementation.
Finally, we believe this thesis will open up many studies on the correlation between Wikidata, Wikipedia, and DBPedia.
5.2 Recommendations and Future Works
According to Phase 2 of the Wikidata plan, we believe these aligned structures may help Wikidata developers unify the infoboxes of all languages. In this model, we can harness community power for property alignment in a way that DBPedia inhibited. Nevertheless, our model is still in development and may not yet support the content Enrichment process completely.
In the Alignment process, translation tools and parsers should be used to improve property alignment. Furthermore, we need additional algorithms to evaluate the correlation of properties more exactly, and we should build on previous research to widen the property-alignment database; we will continue this work in the future. Creating a comparison list by searching article names is still not the best way to detect missing interwiki links, so we will also compare semantic structures and other article data for this task. In the case study, we observed that our model works well for biological articles in Latin-alphabet Wikipedias; to apply it effectively to other domains, however, considerable effort is needed to improve the model. For that reason, we will build further assessments for different article domains in the Comparison process.
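As one illustrative direction for this future work (not the thesis implementation), candidate property pairs could be pre-filtered by normalized string similarity before any translation or parsing step; the function name and threshold are assumptions:

```python
from difflib import SequenceMatcher

def align_properties(params_a, params_b, threshold=0.8):
    """Score candidate (infobox parameter, Wikidata property label) pairs by
    normalized string similarity and keep only the likely matches."""
    pairs = []
    for a in params_a:
        for b in params_b:
            score = SequenceMatcher(None, a.lower(), b.lower()).ratio()
            if score >= threshold:
                pairs.append((a, b, round(score, 2)))
    return pairs

print(align_properties(["image", "birth_date"], ["Image", "date of birth"]))
# → [('image', 'Image', 1.0)]
```

Such a filter cannot replace translation-based alignment for dissimilar labels (e.g. "birth_date" vs. "date of birth" scores well below the threshold), which is exactly why the additional tools discussed above are needed.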
The Enrichment process depends on the gathered semantic relations. To improve content enrichment, we therefore have to use more datasets, such as geo-coordinates, person data, disambiguation pages, and images, to earn greater enrichment benefits. We will also continue to deploy enrichment of Wikidata statements in selected domains.
References
Adar, E., Skinner, M., & Weld, D. S. (2009). Information arbitrage across multi-lingual Wikipedia. In Proceedings of the Second ACM International Conference on Web Search and Data Mining (pp. 94-103). Retrieved from http://dl.acm.org/citation.cfm?id=1498813
Anderson, J. J. (2011). Wikipedia: The company and its founders. (pp. 10-11, 42).
North Mankato, Minnesota: ABDO.
Aprosio, A. P., Giuliano, C., & Lavelli, A. (2013). Towards an automatic creation of localized versions of DBpedia. The Semantic Web – ISWC 2013 (pp. 494-509). Berlin: Springer.
Auer, S., Bizer, C., Kobilarov, G., Lehmann, J., Cyganiak, R., & Ives, Z. (2007).
Dbpedia: A nucleus for a web of open data. Berlin: Springer.
Auer, S., & Lehmann, J. (2007). What have Innsbruck and Leipzig in common? Extracting semantics from wiki content. The Semantic Web: Research and Applications (pp. 503-517). Berlin: Springer.
Baldwin, R., Cave, M., & Lodge, M. (2010). The Oxford handbook of regulation. Oxford Handbooks Online. Retrieved from http://www.oxfordhandbooks.com/view/10.1093/oxfordhb/9780199560219.001.0001/oxfordhb-9780199560219
Bieberstein, N. (2008). Executing SOA: A Practical Guide for the Service-Oriented
Architect. Upper Saddle River, N.J.: IBM Press.
Bühmann, L., & Lehmann, J. (2011). LOD2 Deliverable D3.3.1: Release of Knowledge Base Enrichment Algorithms. Retrieved from http://jens-lehmann.org/files/2011/lod2_deliverable_3.3.1.pdf
Cabrio, E., Cojan, J., Gandon, F., & Hallili, A. (2013). Querying multilingual dbpedia
with qakis. The Semantic Web: ESWC 2013 Satellite Events (pp. 194-198).
Berlin: Springer.
Chernov, S., Iofciu, T., Nejdl, W. & Zhou, X. (2006). Extracting Semantics
Relationships between Wikipedia Categories. In Proceedings of the First
Workshop on Semantic Wikis -- From Wiki To Semantics. Budva: Springer.
Dandala, B., Mihalcea, R., & Bunescu, R. (2012, June). Towards building a
multilingual semantic network: Identifying interlingual links in wikipedia. In
Proceedings of the First Joint Conference on Lexical and Computational
Semantics-Volume 1: Proceedings of the main conference and the shared task,
and Volume 2: Proceedings of the Sixth International Workshop on Semantic
Evaluation (pp. 30-37). Association for Computational Linguistics.
de Melo, G., & Weikum, G. (2010, July). Untangling the cross-lingual link structure
of Wikipedia. In Proceedings of the 48th Annual Meeting of the Association
for Computational Linguistics (pp. 844-853). Association for Computational
Linguistics.
de Melo, G., & Weikum, G. (2010, October). MENTA: inducing multilingual
taxonomies from wikipedia. In Proceedings of the 19th ACM International
Conference on Information and Knowledge Management (pp. 1099-1108).
Retrieved from http://dl.acm.org/citation.cfm?id=1871577
Erxleben, F., Günther, M., Krötzsch, M., Mendez, J., & Vrandečić, D. (2014). Introducing Wikidata to the Linked Data Web. The Semantic Web – ISWC 2014 (pp. 50-65). Trentino: Springer International.
Gurevych, I., Kim, J., & Calzolari, N. (2013). The People's Web meets NLP: Collaboratively constructed language resources. Berlin: Springer Science & Business Media.
Hahn, R., Bizer, C., Sahnwaldt, C., Herta, C., Robinson, S., Bürgle, M., ... & Scheel, U. (2010). Faceted Wikipedia search. In Business Information Systems (pp. 1-11). Berlin: Springer.
Hellmann, S., Bryl, V., Bühmann, L., Dojchinovski, M., Kontokostas, D., Lehmann, J., ... & Zamazal, O. (2014). Knowledge Base Creation, Enrichment and Repair. Linked Open Data – Creating Knowledge Out of Interlinked Data (pp. 45-69). Springer International.
Kim, E. K., Weidl, M., & Choi, K. S. (2010, April). Metadata Synchronization
between Bilingual Resources: Case Study in Wikipedia. In MSW (pp. 35-38).
Kittur, A., Suh, B., Pendleton, B. A., & Chi, E. H. (2007, April). He says, she says:
Conflict and coordination in Wikipedia. In Proceedings of the SIGCHI
Conference on Human Factors in Computing Systems (pp. 453-462).
Retrieved from http://dl.acm.org/citation.cfm?id=1240698
Lehmann, J., Isele, R., Jakob, M., Jentzsch, A., Kontokostas, D., Mendes, P. N., ... & Bizer, C. (2015). DBpedia – A large-scale, multilingual knowledge base extracted from Wikipedia. Semantic Web, 6(2), 167-195.
Li, H., & Sima, Q. (2015). Parallel mining of OWL 2 EL ontology from large linked datasets. Knowledge-Based Systems. Retrieved from https://dx.doi.org/10.1016/j.knosys.2015.03.023
Mendes, P. N., Jakob, M., & Bizer, C. (2012, May). DBpedia: A multilingual cross-domain knowledge base. LREC. Istanbul: ELRA.
Morsey, M., Lehmann, J., Auer, S., Stadler, C., & Hellmann, S. (2012). DBpedia and the live extraction of structured data from Wikipedia. Program, 46(2), 157-181.
Morsey, M., & Lehmann, J. (2011). LOD2 Deliverable 3.2.2: DBpedia-Live Extraction. Retrieved from http://jens-lehmann.org/files/2011/lod2_deliverable_3.2.2.pdf
Nastase, V., & Strube, M. (2008, July). Decoding wikipedia categories for
knowledge acquisition. AAAI. Chicago: AAAI Press.
Nguyen, T. H. (2013). Integrating structured data on the web. (Doctoral
dissertation). The University of Utah, Utah.
Nguyen, T., Moreira, V., Nguyen, H., Nguyen, H., & Freire, J. (2011). Multilingual
schema matching for wikipedia infoboxes. In Proceedings of the VLDB
Endowment, 5(2), 133-144.
O'Sullivan, D. (2012). Wikipedia: A new community of practice?. Farnham,
England; Burlington, VT: Ashgate.
Ponzetto, S. P., & Navigli, R. (2009, July). Large-scale taxonomy mapping for
restructuring and integrating wikipedia. IJCAI, 9, 2083-2088.
Ponzetto, S. P., & Strube, M. (2007, July). Deriving a large scale taxonomy from
Wikipedia. AAAI, 7, 1440-1445. Vancouver: AAAI Press.
Prud'hommeaux, E., & Seaborne, A. (2008). SPARQL query language for RDF. W3C. Retrieved from http://www.w3.org/TR/rdf-sparql-query/
Rinser, D., Lange, D., & Naumann, F. (2013). Cross-lingual entity matching and
infobox alignment in Wikipedia. Information Systems, 38(6), 887-907.
Sorg, P., & Cimiano, P. (2008, June). Enriching the crosslingual link structure of Wikipedia – a classification-based approach. In Proceedings of the AAAI 2008 Workshop on Wikipedia and Artificial Intelligence (pp. 49-54). Chicago: Springer Science & Business Media.
Suchanek, F. M., Kasneci, G., & Weikum, G. (2007, May). Yago: A core of semantic knowledge. In Proceedings of the 16th International Conference on World Wide Web (pp. 697-706). Retrieved from http://dl.acm.org/citation.cfm?id=1242667
Syed, Z. S., & Finin, T. (2010). Approaches for Automatically Enriching Wikipedia. AAAI Workshops. Retrieved from https://www.aaai.org/ocs/index.php/WS/AAAIW10/paper/view/2036/2493
Ta, T. H., & Anutariya, C. (2014). A model for enriching multilingual Wikipedias
using infobox and Wikidata property alignment. In Semantic Technology (pp.
335-350). ChiangMai: Springer International.
Tacchini, E., Schultz, A., & Bizer, C. (2009). Experiments with Wikipedia cross-language data fusion. In 5th Workshop on Scripting and Development for the Semantic Web (SFSW2009). Tokyo: Springer.
Toma, I., Hangl, S., Caminero, F. J., & Date, C. D. (2010). Diversity-aware extensions to collaborative systems. Retrieved from http://render-project.eu/wp-content/uploads/2013/04/D4.1.2_2.0.pdf
Tyers, F. M., & Pienaar, J. A. (2008). Extracting bilingual word pairs from Wikipedia. In LREC 2008, SALTMIL Workshop. Marrakech, Morocco.
Vrandečić, D. (2013). The rise of Wikidata. IEEE Intelligent Systems, 28(4), 90-95.
Vrandečić, D., & Krötzsch, M. (2014). Wikidata: A free collaborative knowledgebase. Communications of the ACM, 57(10), 78-85.
Xu, D., Cheng, G., & Qu, Y. (2014). Preferences in Wikipedia abstracts: Empirical
findings and implications for automatic entity summarization. Information
Processing & Management, 50(2), 284-296.
Xu, L., Takeda, H., Hamasaki, M., & Wu, H. (2010). Typing software articles with Wikipedia category structure. NII Technical Reports.
Appendix A
Converter 1.1.6
This tool translates terms in double brackets [[ ]] (internal links) from any Wikipedia to any other Wikipedia through Wikidata sitelinks. It can also translate the data values of infobox parameters into the specific language to which we want to contribute information.
To use the tool, paste the text to be converted and press the Convert button to obtain the result.
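The conversion step can be sketched as follows, with a pre-fetched title mapping standing in for live Wikidata sitelink lookups; the function and mapping names are hypothetical, and piped links such as [[A|b]] are left out of this sketch:

```python
import re

def convert_links(wikitext: str, sitelinks: dict) -> str:
    """Replace [[Title]] internal links with their equivalents in another
    language edition, using a title -> translated-title mapping (in the real
    tool this mapping would come from Wikidata sitelinks)."""
    def repl(match):
        title = match.group(1)
        return "[[%s]]" % sitelinks.get(title, title)  # keep unmapped links
    return re.sub(r"\[\[([^\]|]+)\]\]", repl, wikitext)

mapping = {"Cat": "Mèo", "Dog": "Chó"}
print(convert_links("A [[Cat]] chases a [[Dog]].", mapping))
# → A [[Mèo]] chases a [[Chó]].
```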
Appendix B
Category 1.0.8
This tool makes the category taxonomy more fine-grained by copying classifications from the English Wikipedia. It checks all categories that have interwiki links to the English edition and collects the English category classifications as RDF triples. AWB (AutoWikiBrowser) can then be used to import the triples into other languages.
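The triple-building step can be sketched as below, assuming skos:broader as the hierarchy predicate (the actual vocabulary used by the tool is not specified in this thesis, and the function name is illustrative):

```python
def category_triples(category: str, parents: list) -> list:
    """Represent a category's parent classifications as simple
    (subject, predicate, object) RDF-style triples."""
    return [("Category:%s" % category, "skos:broader", "Category:%s" % p)
            for p in parents]

triples = category_triples("Lions", ["Panthera", "Mammals of Africa"])
for s, p, o in triples:
    print(s, p, o)
```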
Appendix C
AutoWikiBrowser
AutoWikiBrowser (AWB) is a semi-automated MediaWiki editor that runs on the Windows operating system. AWB makes repetitive editing tasks faster and more convenient.
Biography
Name:
Ta Hoang Thang
Date of Birth:
02 November, 1985
Place of Birth:
Institutions Attended:
2003–2008
2012–2014
Home Address:
Email:
tahoangthang@gmail.com
thangth@dlu.edu.vn