FIT 2016 Paper 363

Motivation and Problem Statement
Digital libraries (DLs) such as DBLP, CiteSeer, arXiv, MAS, Google Scholar and Scopus store
bibliographic citations and provide several services: retrieving items associated with a particular author,
multiple kinds of search, personalized browsing, and building communities around particular educational
fields [1]. A study conducted by Lee et al. [2] enumerated many different sources of error in these DLs.
Among these sources of error, a great deal of attention is paid to ambiguous author names [1], [3].
The task of resolving name ambiguity in digital citations is known as author name disambiguation (AND).
Author name ambiguity arises when many authors share the same name, or when many variations of one
author's name exist in DLs. In both scenarios, it is very hard to judge the accuracy of citation records.
For example, the abbreviated name "C. Chen" can denote "Chang Chen" or "Chao Chen", two entirely
different authors who are both referred to as "C. Chen". This citation fusion, in which two different
people have the same name (e.g., "C. Chen"), is known as "mixed citations" [4], [5]. The converse case
occurs when a single author, such as "Chang Chen", appears under several name variations, so that his or
her citations are divided among those variations. This problem is known as "split citations" [4], [5].
Table 1 shows examples of both namesakes and synonyms, which, as mentioned above, are the two
sub-problems of the name ambiguity problem. Author names r1 and r10 are an example of the namesake
(polyseme) case: r1 refers to Ijaz Hussain from COMSATS Islamabad, Pakistan, and r10 refers to Inam
Hussain from PIEAS, Pakistan. Author names r3 and r7 are an example of the synonym case: both refer to
Ajab Gul from Quaid-e-Azam University, Pakistan.
We now define the name disambiguation task as follows. Let S = {s1, s2, ..., sk} be a set of citation
records. Each citation record si has a list of attributes, and each attribute is associated with a
particular value that may have several components. In the author names attribute, each component
corresponds to the name of a single author and is a reference rj to a real author.
Table 1 Synthetic citations of the ambiguous author Ijaz Hussain
Citation Id   Citations
C1            (r1) I. Hussain, (r2) I. Hassan, (r3) A. Gul, (r4) A. Saif. Data mining in crop fields. CIKM, 2016.
C2            (r5) Ijaz Hussain, (r6) S. Adnan, (r7) Ajab Gul. A new mechanism of industrial control. KDD, 2009.
C3            (r8) Inam Hussain, (r9) Abdullah Gul, (r10) I. Hussain, (r11) Adnam Saif. Monte-Carlo simulation for name disambiguation. HSCC, 2010.
The disambiguation method should partition the set of m ambiguous citation references {r1, ..., rm}
into n sets of distinct authors {a1, ..., an}, in which each partition ai comprises all references to the
same author, as illustrated in Figure 1.
[Figure: the mixed author references (r1, ..., rm) are passed through the disambiguation function, which groups them into distinct authors (a1, ..., an).]
Figure 1 Objective of author name disambiguation function
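The partition objective can be sketched in a few lines of Python. This is a minimal sketch rather than the proposed method: the function names and the reference-to-author assignment are hypothetical, chosen only to show that a valid result places every reference in exactly one author set.

```python
from collections import defaultdict


def group_by_author(assignment):
    """Group reference ids by the author label assigned to them."""
    clusters = defaultdict(set)
    for reference, author in assignment.items():
        clusters[author].add(reference)
    return dict(clusters)


def is_partition(references, clusters):
    """Check that the clusters are non-empty, disjoint, and cover all references."""
    seen = set()
    for members in clusters.values():
        if not members or seen & members:
            return False
        seen |= members
    return seen == set(references)


# Hypothetical assignment (an assumption for illustration only).
assignment = {"r1": "a1", "r5": "a1", "r3": "a2", "r7": "a2", "r10": "a3"}
clusters = group_by_author(assignment)
valid = is_partition(assignment.keys(), clusters)
```

A result in which one reference appears in two author sets, or in none, would fail the check.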

Literature Review
Author name disambiguation techniques can be divided into machine learning techniques and non-machine
learning techniques. These can be further subdivided into three and two subcategories respectively, as
shown in Figure 2.
[Figure: author name disambiguation techniques divide into machine learning techniques (supervised, unsupervised, semi-supervised) and non-machine learning techniques (graph, heuristic).]
Figure 2 Classification of author name disambiguation methods

Supervised Techniques
Supervised techniques require manually labeled training data as input to the classifier. The data
consist of pairs of the form <Ai, Bi>, where Ai is an input feature vector and Bi is the correctly labeled
output class. The objective of the learning function in supervised learning is to map input attributes to
the correct output class value.
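As an illustration of learning from <Ai, Bi> pairs, the following sketch uses a one-nearest-neighbour rule. This is a stand-in technique chosen for brevity, not one of the methods reviewed in this section; the two-dimensional feature vectors and the class labels are invented for the example.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def predict_1nn(training_pairs, query):
    """Return the label Bi of the training vector Ai closest to the query."""
    _, label = min(training_pairs, key=lambda pair: euclidean(pair[0], query))
    return label


# Hypothetical training pairs <Ai, Bi>; each feature could be, for example,
# a similarity score between two citation records (an assumption).
training_pairs = [
    ((0.9, 0.8), "same_author"),
    ((0.8, 0.9), "same_author"),
    ((0.1, 0.2), "different_author"),
    ((0.2, 0.1), "different_author"),
]

prediction = predict_1nn(training_pairs, (0.85, 0.7))
```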
A four-step boosted-tree classification method for name disambiguation was proposed by [7]. This method
solved only the problem of polysemes. Two extreme learning machine based algorithms for AND were proposed
by [8]: the first was one classifier for each name (OCEN) and the second was one classifier for all names
(OCAN). This strategy does not handle missing data, and it fails to distinguish authors when a single
author or two namesakes share the same or a similar title. A deep neural network (DNN) based approach was
proposed by [9] to automatically learn features and disambiguate the authors from any dataset. However, it
requires retraining of the model if some parameters are changed, and it needs a separate model for each
ambiguous author, which is not scalable.
Unsupervised Techniques
In [10], the authors proposed algorithms for author disambiguation that use Dempster-Shafer theory in
combination with Shannon entropy. This method has lower accuracy than supervised techniques. A two-step
algorithm that not only disambiguates author names but also reconstructs the h-index of the authors was
proposed by [11]. The limitation of this method is that it applies only to experienced authors and excludes
new authors. INDi, a solution for existing cleaned DLs, was published by [12]. The solution utilizes
similarity among bibliographic records: it groups new records with authors that have similar citation
records, or creates new author clusters when the similarity evidence is not strong enough. However, this
method suffers from fragmented clusters. An enhanced version of heuristic hierarchical clustering (HHC) was
recently proposed by [13], which uses a network-based similarity measure along with a syntax-based
similarity measure for better clustering.
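The incremental idea described for [12] (attach a new record to a sufficiently similar cluster, otherwise open a new one) can be sketched as follows. This is not the INDi algorithm itself: the Jaccard similarity over co-author sets, the threshold value and the function names are all assumptions made for illustration.

```python
def jaccard(a, b):
    """Jaccard similarity between two collections, treated as sets."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def assign_incrementally(clusters, new_record, threshold=0.3):
    """Add new_record (a set of co-author names) to the most similar cluster,
    or start a new cluster when no similarity reaches the threshold.
    Returns the index of the cluster that received the record."""
    best_index, best_score = None, 0.0
    for index, cluster in enumerate(clusters):
        score = max(jaccard(new_record, member) for member in cluster)
        if score > best_score:
            best_index, best_score = index, score
    if best_index is not None and best_score >= threshold:
        clusters[best_index].append(new_record)
        return best_index
    clusters.append([new_record])
    return len(clusters) - 1


# Hypothetical records: each is the co-author set of one citation.
clusters = [[{"S. Adnan", "A. Gul"}]]
first = assign_incrementally(clusters, {"A. Gul", "A. Saif"})
second = assign_incrementally(clusters, {"K. Ahmed"})
```

A high threshold of this kind is one way fragmented clusters can arise: records of the same author with little co-author overlap each open their own cluster.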
Semi-supervised Techniques
A hybrid name disambiguation framework that used not only traditional information (co-authors) but also
web page genre information was proposed by [14]. This framework consisted of two main steps: web page
genre identification and a re-clustering model. This is not a scalable solution, as it requires a lot of
web access. A semi-supervised approach for AND that utilizes real data from Microsoft Academic Search was
proposed by [15]. The authors pre-processed the data set and found many useful features; they then
constructed a co-author based bibliographic network and applied a community detection algorithm. Recently,
an ethnicity-sensitive method comprising three main parts was presented by [16]. Self-training associative
name disambiguation, which comprises two stages after pre-processing of the data set, was proposed by [17].
All these methods are slow and require user feedback.
Heuristic-based Techniques
When exact solutions are not possible, or are too slow to obtain, heuristic techniques are used to give a
solution quickly. It is worth noting that these solutions may not be optimal, but they are near to optimal.
A name matching framework for author name disambiguation on the Microsoft Academic Search dataset was
proposed by [18], which realized two implementations. A ranking-based four-step name matching algorithm and
a system called RankMatch were proposed by [19]. A technique proposed by [20] used a novel five-step
co-author inclusion recursive algorithm that solved the problem of homonyms. It fails on sole and very
ambiguous author cases.
Graph-based Techniques
The majority of these techniques form a graph and then find the similarity between vertices. They cluster
the results using diverse ML techniques to disambiguate the authors; nevertheless, they are classified here
under the umbrella of graph-based techniques.
Two multi-level scalable algorithms for AND were proposed by [21]. The first was a multi-level graph
partitioning (MGP) algorithm, and the second was a multi-level graph partitioning and merging (MGPM)
algorithm on the basis of some attribute similarity value. This method assumes that the number of unknown
authors is given in advance. A novel method called GrapHical framewOrk for name diSambiguaTion (GHOST) was
proposed by [22]. It first modeled the relationships among publications using undirected graphs. It then
solved the homonym problem by iteratively finding valid paths, computing similarities, clustering with the
affinity propagation algorithm and, finally, using user feedback as a complementary tool to enhance the
performance. The collaboration network of authors, along with syntactic similarity between authors, was
used by [23] to disambiguate author names in three different subsets of DL datasets. They assumed that two
syntactically similar authors were the same if there was a close relationship and a small distance between
them. A context graph based method to solve the synonym problem in AND was proposed by [24]. They built a
context graph for every ambiguous reference by finding contextual information from the citation attributes
of authors and their co-authors. These graphs were then overlaid on the already constructed whole graph of
all citations. They defined a graph-based similarity function that utilizes common quasi-cliques between
the complete graph and the entity graph. Another graph-based method, ADANA, was proposed by [25]. In this
method, they modeled a pairwise factor graph (PFG) that can be used to integrate several types of features
along with user feedback into a unified model. A graph framework for author disambiguation (GFAD) that
exploited only author, co-author, and paper title information was presented by [4]. They solved homonyms by
splitting graph nodes, synonyms by merging graph nodes, and outliers by merging them into the syntactically
closest graph node. In [26], latent Dirichlet allocation was used to find the resultant vector of all of an
author's publications, and a target paper is then compared with the vectors of all the authors. It is
assigned to the author who is closest to the paper's concepts. This method fails in the case of two
homonyms if both work in a similar field.
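The assumption attributed to [23], that two syntactically similar names refer to the same author when they are close in the collaboration network, can be sketched as below. This is an illustrative reading, not the cited implementation: the thresholds, the use of `difflib.SequenceMatcher` as the syntactic measure, and the sample citations are all assumptions.

```python
from collections import deque
from difflib import SequenceMatcher


def build_coauthor_graph(citations):
    """Undirected graph with an edge between every pair of co-listed names."""
    graph = {}
    for authors in citations:
        for name in authors:
            graph.setdefault(name, set()).update(a for a in authors if a != name)
    return graph


def distance(graph, source, target):
    """Shortest-path length between two names (breadth-first search),
    or None when no path exists."""
    if source == target:
        return 0
    seen, queue = {source}, deque([(source, 0)])
    while queue:
        node, depth = queue.popleft()
        for neighbor in graph.get(node, ()):
            if neighbor == target:
                return depth + 1
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append((neighbor, depth + 1))
    return None


def likely_same_author(graph, a, b, min_similarity=0.6, max_distance=2):
    """Hypothetical rule: similar spelling and a short collaboration path."""
    similarity = SequenceMatcher(None, a.lower(), b.lower()).ratio()
    hops = distance(graph, a, b)
    return similarity >= min_similarity and hops is not None and hops <= max_distance


# Hypothetical author lists, one per citation.
citations = [["A. Gul", "I. Hussain"], ["Ajab Gul", "I. Hussain"], ["A. Gull", "S. Khan"]]
graph = build_coauthor_graph(citations)
linked = likely_same_author(graph, "A. Gul", "Ajab Gul")
unlinked = likely_same_author(graph, "A. Gul", "A. Gull")
```

Here "A. Gul" and "Ajab Gul" share a co-author and are accepted, while "A. Gull" is spelled similarly but has no path to "A. Gul" and is rejected.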
Statement of the Problem
The issue of author name ambiguity has been addressed by many author name disambiguation (AND) methods
[4], [5], [11], [16], [26]-[33]. These studies report some good results in terms of recall, accuracy,
precision, and F-measure. However, most of these methods neither address nor provide satisfactory
solutions to the following three problems:
Problem 1. In many studies [1], [16], [34], [35], the first phase of AND is called blocking, in which
similar names are grouped together if their similarity is higher than some predefined threshold value. The
majority of these studies use predefined similarity measures such as cosine [4], [5], [7], [10], [12],
[15]-[17], [25], [36], [37], Jaccard [9], [10], [24], [35], longest common subsequence [4], [10] and
Jaro-Winkler [9], [16], [35] for blocking. These similarity measures produce many false positives.
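Threshold-based blocking can be sketched as follows, using Jaccard similarity over character bigrams. The greedy grouping strategy, the bigram representation and the threshold of 0.4 are assumptions made for this example and are not taken from any of the cited studies.

```python
def bigrams(name):
    """Set of character bigrams of a lower-cased name."""
    text = name.lower()
    return {text[i:i + 2] for i in range(len(text) - 1)}


def jaccard(a, b):
    """Jaccard similarity between the bigram sets of two names."""
    x, y = bigrams(a), bigrams(b)
    union = x | y
    return len(x & y) / len(union) if union else 0.0


def block_names(names, threshold=0.4):
    """Greedily place each name in the first block whose first member is
    similar enough; otherwise start a new block."""
    blocks = []
    for name in names:
        for block in blocks:
            if jaccard(name, block[0]) >= threshold:
                block.append(name)
                break
        else:
            blocks.append([name])
    return blocks


blocks = block_names(["I. Hussain", "I. Hassan", "A. Gul"])
```

With these settings, "I. Hussain" and "I. Hassan" (similarity 5/12) fall into the same block although they are co-authors of citation C1 in Table 1 and therefore different people, which is the kind of false positive described above.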
Problem 2. Many AND studies assume that an ambiguous author can be disambiguated by knowing his or her
co-authors [4], [22]-[24], [38]; that is, the author and co-author social circle characterizes an author's
identity. However, these studies [21], [38] do not even address the case of sole authors [6], whose share
was reported to be above 10% in Thomson Reuters in the year 2012 by [41]. To the best of our knowledge,
only [4] proposed an algorithm to handle this type of author by using title similarity. However, titles
contain a limited number of words and cannot exactly portray the research fields of authors.
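The limitation of title similarity can be seen in a small bag-of-words cosine computation. This is a generic sketch, not the title measure used in [4]; the tokenization rule is an assumption, and the titles are the synthetic ones from Table 1.

```python
import math
import re
from collections import Counter


def title_vector(title):
    """Bag-of-words vector of a title (lower-cased alphanumeric tokens)."""
    return Counter(re.findall(r"[a-z0-9]+", title.lower()))


def cosine(u, v):
    """Cosine similarity between two bag-of-words vectors."""
    dot = sum(u[token] * v[token] for token in u.keys() & v.keys())
    norm = math.sqrt(sum(c * c for c in u.values())) * math.sqrt(sum(c * c for c in v.values()))
    return dot / norm if norm else 0.0


t1 = title_vector("Data mining in crop fields")
t2 = title_vector("Monte-Carlo simulation for name disambiguation")
similarity = cosine(t1, t2)
```

Because each title holds only a handful of words, two titles can share no tokens at all and score zero, whatever the relationship between their authors.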
Problem 3. Supervised methods train a different model for each ambiguous author, so they are not
scalable. Similarly, graph-based methods enumerate cycles or paths between vertices, so they are also not
scalable [4], [7], [19], [37], [42], [43]. Schloss Dagstuhl Leibniz-Zentrum für Informatik GmbH announced a
project titled "Scalable Author Disambiguation for Bibliographic Databases" (2015-2018), in which
scalability is the main focus.
The above three problems all degrade the overall performance of AND techniques. There is a need to address
these issues so that the resulting system is scalable, effective, and able to handle outliers.
Research Methodology
This section provides an overview of the proposed conceptual AND framework and the main components that
are likely to be included. It is important to note that the research problems identified in this synopsis
can be solved with several techniques, but the proposed framework uses a graph-based technique. The
literature review and an initial implementation of some techniques indicate that solving AND using graphs
has several advantages over other methodologies: it requires no labeled training data and no complicated
clustering, it is a natural way of representation, and it requires fewer attributes. The proposed GRAND
method is theoretically scalable and robust, and offers a complete end-to-end solution to name ambiguity.
The conceptual AND framework consists of the following main stages: a preparation stage, an automatic
graph construction algorithm, a namesakes resolver algorithm, a synonyms resolver algorithm and the
proposed sole author resolver algorithm, as shown in Figure 3.
Figure 3 Conceptual Author Name Disambiguation Framework

All these stages are further divided into different modules, which are given at an abstract level in the
following GRAND algorithm.
GRAND Algorithm
1: Input: Dataset of ambiguous authors' citations (a1, a2, ..., an)
2: Output: Set of distinct authors (d1, d2, ..., dm)
3: Procedure dataset pre-processing (a1, a2, ..., an)
   a. Preprocessing procedure
   b. Automatic graph construction algorithm
   c. Namesakes resolver algorithm
   d. Synonyms resolver algorithm
   e. Sole authors resolver algorithm
4: end procedure
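The staged structure of the algorithm above can be sketched as a simple pipeline. The stage bodies are placeholders, since the synopsis gives the stages only at an abstract level; the function names and the dictionary-based state are assumptions made for illustration.

```python
def make_stage(name):
    """Build a placeholder stage that only records that it ran.
    A real stage would transform the citations or the graph."""
    def stage(state):
        return {**state, "trace": state["trace"] + [name]}
    return stage


# Stage order follows steps a to e of the algorithm listing.
STAGE_NAMES = (
    "preprocessing",
    "graph_construction",
    "namesakes_resolver",
    "synonyms_resolver",
    "sole_authors_resolver",
)
STAGES = [make_stage(name) for name in STAGE_NAMES]


def run_pipeline(citations, stages=STAGES):
    """Pass the citation records through each stage in order."""
    state = {"citations": list(citations), "trace": []}
    for stage in stages:
        state = stage(state)
    return state


result = run_pipeline(["C1", "C2", "C3"])
```

Keeping each stage as a function from state to state would allow a resolver to be replaced or reordered without touching the others.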
Bibliography
[1] A. A. Ferreira, M. A. Gonçalves, and A. H. F. Laender, "A brief survey of automatic methods for author name disambiguation," ACM SIGMOD Rec., vol. 41, no. 2, p. 15, 2012.
[2] B. D. Lee, J. Kang, P. Mitra, and C. L. Giles, "CLEAN?," vol. 50, no. 12, pp. 33-38, 2007.
[3] S. Elliott, "Survey of Author Name Disambiguation: 2004 to 2010," Libr. Philos. Pract., vol. 473, 2010.
[4] D. Shin, T. Kim, J. Choi, and J. Kim, "Author name disambiguation using a graph model with node splitting and merging based on bibliographic information," Scientometrics, vol. 100, no. 1, pp. 15-50, 2014.
[5] J. Tang, A. C. M. Fong, B. Wang, and J. Zhang, "A unified probabilistic framework for name disambiguation in digital library," IEEE Trans. Knowl. Data Eng., vol. 24, no. 6, pp. 975-987, 2012.
[6] A. A. Ferreira, M. A. Gonçalves, and A. H. F. Laender, "Automatic Methods for Disambiguating Author Names in Bibliographic Data Repositories," 2015.
[7] J. Wang, K. Berzins, D. Hicks, J. Melkers, F. Xiao, and D. Pinheiro, "A boosted-trees method for name disambiguation," Scientometrics, vol. 93, no. 2, pp. 391-411, 2012.
[8] D. Han, S. Liu, Y. Hu, B. Wang, and Y. Sun, "ELM-based name disambiguation in bibliography," World Wide Web, vol. 18, no. 2, pp. 253-263, 2013.
[9] H. N. Tran, T. Huynh, and T. Do, "Author Name Disambiguation by Using Deep Neural Network," arXiv1502.08030 [cs], vol. 8397, no. 2, pp. 123-132, 2014.
[10] H. Wu, B. Li, Y. Pei, and J. He, "Unsupervised author disambiguation using Dempster-Shafer theory," Scientometrics, vol. 101, no. 3, pp. 1955-1972, 2014.
[11] C. Schulz, A. Mazloumian, A. M. Petersen, O. Penner, and D. Helbing, "Exploiting citation networks for large-scale author name disambiguation," EPJ Data Sci., vol. 3, no. 1, pp. 1-14, 2014.
[12] A. P. de Carvalho, A. A. Ferreira, A. H. F. Laender, and M. A. Gonçalves, "Incremental Unsupervised Name Disambiguation in Cleaned Digital Libraries," J. Inf. Data Manag., vol. 2, no. 573871, p. 289, 2011.
[13] M. Nadimi-shahraki, "A more Accurate Clustering Method by using Co-author Social Networks for Author Name Disambiguation," vol. 1, no. 4, pp. 307-317, 2014.
[14] Y. Zhu and Q. Li, "Enhancing object distinction utilizing probabilistic topic model," in Proceedings - 2013 International Conference on Cloud Computing and Big Data, CLOUDCOM-ASIA 2013, 2013, pp. 177-182.
[15] J. Zhao, P. Wang, and K. Huang, "A semi-supervised approach for author disambiguation in KDD CUP 2013," Proc. 2013 KDD Cup 2013 Work. - KDD Cup '13, pp. 1-8, 2013.
[16] G. Louppe, H. Al-Natsheh, M. Susik, and E. Maguire, "Ethnicity sensitive author disambiguation using semi-supervised learning," pp. 1-14, 2015.
[17] A. A. Ferreira, A. Veloso, M. A. Gonçalves, and A. H. F. Laender, "Effective self-training author name disambiguation in scholarly digital libraries," Proc. 10th Annu. Jt. Conf. Digit. Libr. - JCDL '10, p. 39, 2010.
[18] W.-S. Chin, W.-C. Chang, K.-H. Huang, T.-M. Kuo, S.-W. Lin, Y.-S. Lin, Y.-C. Lu, Y.-C. Su, C.-K. Wei, T.-C. Yin, C.-L. Li, Y.-C. Juan, T.-W. Lin, C.-H. Tsai, S.-D. Lin, H.-T. Lin, C.-J. Lin, Y. Zhuang, F. Wu, H.-Y. Tung, T. Yu, J.-P. Wang, C.-X. Chang, and C.-P. Yang, "Effective string processing and matching for author disambiguation," Proc. 2013 KDD Cup 2013 Work. - KDD Cup '13, vol. 15, pp. 1-9, 2013.
[19] J. Liu, K. H. Lei, J. Y. Liu, C. Wang, and J. Han, "Ranking-based name matching for author disambiguation in bibliographic data," Proc. 2013 KDD Cup 2013 Work. - KDD Cup '13, pp. 1-8, 2013.
[20] S. Wooding, K. Wilcox-Jay, G. Lewison, and J. Grant, "Co-author inclusion: A novel recursive algorithmic method for dealing with homonyms in bibliometric analysis," Scientometrics, vol. 66, no. 1, pp. 11-21, 2005.
[21] B. W. On, I. Lee, and D. Lee, "Scalable clustering methods for the name disambiguation problem," Knowl. Inf. Syst., vol. 31, no. 1, pp. 129-151, 2012.
[22] X. Fan, J. Wang, X. Pu, L. Zhou, and B. Lv, "On Graph-Based Name Disambiguation," J. Data Inf. Qual., vol. 2, no. 2, pp. 1-23, 2011.
[23] F. H. Levin and C. A. Heuser, "Evaluating the Use of Social Networks in Author Name Disambiguation in Digital Libraries," J. Inf. Data Manag., vol. 1, no. 2, p. 183, 2010.
[24] B. On, E. Elmacioglu, and D. Lee, "Improving Grouped-Entity Resolution using Quasi-Cliques," 2006.
[25] X. Wang, J. Tang, H. Cheng, and P. S. Yu, "ADANA: Active name disambiguation," Proc. - IEEE Int. Conf. Data Mining, ICDM, pp. 794-803, 2011.
[26] M. Katsurai, I. Ohmukai, and H. Takeda, "Topic Representation of Researchers' Interests in a Large-Scale," no. 4, pp. 1010-1018, 2016.
[27] G. J. Soumyajit, G. Manish, V. Varma, and V. Pudi, "Author2Vec: Learning Author Representations by Combining Content and Link Information," 2016.
[28] A. F. Santana, M. A. Gonçalves, A. H. F. Laender, . Anderson, A. Ferreira, B. Marcos, A. Gonçalves, and A. A. Ferreira, "On the combination of domain-specific heuristics for author name disambiguation: the nearest cluster method," Int J Digit Libr, vol. 16, pp. 229-246, 2015.
[29] T. Arif, "Exploring The Use Of Hybrid Similarity Measure For Author Name Disambiguation," vol. 4, no. 12, pp. 171-175, 2015.
[30] Y. Liu and Y. Tang, "Network based Framework for Author Name Disambiguation Applications," vol. 8, no. 9, pp. 75-82, 2015.
[31] J. Zhu, Y. Yang, Q. Xie, L. Wang, and S. U. Hassan, "Robust hybrid name disambiguation framework for large databases," Scientometrics, vol. 98, no. 3, pp. 2255-2274, 2014.
[32] H. N. Tran, T. Huynh, and T. Do, "Author Name Disambiguation by Using Deep Neural Network," Aciids, 2014.
[33] J. Wu and X. H. Ding, "Author name disambiguation in scientific collaboration and mobility cases," Scientometrics, vol. 96, no. 3, pp. 683-697, 2013.
[34] B.-W. On, J. Kang, D. Lee, and P. Mitra, "Comparative study of name disambiguation problem using a scalable blocking-based framework," Proc. 5th ACM/IEEE-CS Jt. Conf. Digit. Libr. (JCDL '05), p. 344, 2005.
[35] P. Treeratpituk and C. L. Giles, "Disambiguating authors in academic publications using random forests," Proc. 2009 Jt. Int. Conf. Digit. Libr. - JCDL '09, pp. 39-48, 2009.
[36] H. Han, W. Xu, H. Zha, and C. L. Giles, "A hierarchical naive Bayes mixture model for name disambiguation in author citations," Proc. 2005 ACM Symp. Appl. Comput. - SAC '05, p. 1065, 2005.
