1 Institute of Computing Technology, Chinese Academy of Sciences, Kexueyuan South Road 6
2 Graduate School of Chinese Academy of Sciences, Beijing, China
http://www.intsci.ac.cn/en/index.html
http://www.gscas.ac.cn/gscasenglish/index.aspx
{luojw, shizzlinf, wangmg}@ics.ict.ac.cn
1 Introduction
Distributed Data Mining (DDM) aims at extracting useful patterns from distributed databases in order to compose them within a distributed knowledge base and use them for decision making. Mining information and knowledge from distributed data sources, such as weather databases and financial data portals, has been recognized by industrial companies as an important income source. However, most data mining techniques are developed for centralized data sources and often cannot be directly applied in a distributed data environment. How to conveniently apply the centralized methods to distributed applications has become a great challenge [1, 2, 3].
Traditionally, the most widely used approach to DDM in business applications is to apply sequential data mining techniques to data which have been retrieved from different sources and stored in a central data warehouse. Despite its commercial success, such a solution may be impractical or even impossible in some business environments because the data may be inherently distributed and cannot be localized on any one host for a variety of reasons, including security and fault tolerance [6].
Fortunately, the advent of multi-agent systems has brought us opportunities for developing distributed systems as an infrastructure for dealing with this kind of problem. During the past decades, several models and approaches have been proposed that use the multi-agent paradigm to construct distributed data mining applications. However, most of these approaches focus on a specific combination technique, such as meta-learning for integrating homogeneous models or collective data mining for combining heterogeneous schemas. Few general tools have been proposed to conveniently implement the various agent based DDM applications. In this paper, we describe VAStudio, an extensible toolkit which combines multi-agent technology, parallel and distributed algorithms, and a rich GUI to provide an integrated development environment.

* This project is supported by High-Tech Program 863 (2001AA113121) and the National Natural Science Foundation of China (90104021, 60073019, 60173017).
The rest of the paper is organized as follows. In Section 2, we briefly review the existing agent based DDM systems. In Section 3, we describe the hierarchical structure of VAStudio. After that, collaborative learning is discussed. In the end, we demonstrate the GUI of the toolkit and compare it with the most representative agent based DDM systems to evaluate its functions.
2 Related Work
Despite its relative infancy compared with centralized data mining, agent based distributed data mining has already achieved important research results in the past years. In this section, we briefly review the most representative agent based DDM systems: BODHI, JAM, Papyrus and PADMA.
BODHI is an agent-based distributed data mining system that offers an environment capable of handling heterogeneous distributed data mining. It has been designed according to a framework for collective data mining on heterogeneous data sites, supporting tasks such as supervised inductive distributed function learning and regression [7, 11].
JAM is an agent-based meta-learning system for DDM. It is implemented as a collection of distributed learning and classification programs linked together through a network of data sites. Each local agent builds a classification model, and different agents build classifiers using different techniques. After local data mining, JAM provides a set of meta-learning agents for combining the multiple models learnt at different sites into a meta-classifier that in many cases improves the overall predictive accuracy [8].
Papyrus uses Java aglets to support moving data, models, results, or mixed strategies, and it supports different task and predictive model strategies. It is a specialized system for clusters, meta-clusters, and super-clusters. Each cluster has one distinguished node which acts as its cluster access and control point for the agents. Coordination of the overall clustering task is either done by a central root site or distributed to the (peer-to-peer) network of cluster access points [9].
PADMA is an agent based architecture for parallel and distributed data mining which deals with DDM problems over homogeneous data sites. Partial data cluster models are first computed by stationary agents locally at the distributed sites. All local models are then collected at a central site, which performs a second-level clustering algorithm to generate the global cluster model [10].
Common to all these approaches is that they aim at integrating the knowledge discovered from data at different geographically distributed network sites, and every system focuses on a specific, representative technique for combining the local models.
3 Hierarchical Structure of VAStudio
This section introduces the hierarchical architecture of VAStudio in detail. From a software engineering perspective, reusable classes are the main components of new software, and programmers prefer to integrate existing modules for rapid development. In line with this view, we adopt a hierarchical structure in VAStudio. Figure 1 shows the structure, which is composed of four layers: the algorithms library layer, the behavior layer, the agent layer and the society layer. The lower layers supply the fundamental materials for the upper ones to construct more abstract and complex applications.
Fig. 1. The hierarchical structure of VAStudio: the data mining algorithms library layer (data mining algorithms and integration algorithms), the behaviour layer, the agent layer, and the agent society layer.
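As a rough sketch of this layering (all class and method names below are our own illustration, not the actual VAStudio API), behaviors wrap algorithms from the library, agents aggregate behaviors, and the society layer coordinates agents:

```java
import java.util.*;

// Illustrative sketch of the four-layer structure (hypothetical names,
// not the real VAStudio classes).
class MiningBehavior {                  // behavior layer: wraps one algorithm
    private final String algorithm;    // an entry from the algorithms library, e.g. "C4.5"
    MiningBehavior(String algorithm) { this.algorithm = algorithm; }
    String run() { return "model(" + algorithm + ")"; }
}

class MiningAgent {                    // agent layer: owns a set of behaviors
    private final List<MiningBehavior> behaviors = new ArrayList<>();
    void addBehavior(MiningBehavior b) { behaviors.add(b); }
    List<String> runAll() {
        List<String> models = new ArrayList<>();
        for (MiningBehavior b : behaviors) models.add(b.run());
        return models;
    }
}

class AgentSociety {                   // society layer: coordinates agents
    private final List<MiningAgent> agents = new ArrayList<>();
    void add(MiningAgent a) { agents.add(a); }
    List<String> collect() {           // gather all local models for combination
        List<String> all = new ArrayList<>();
        for (MiningAgent a : agents) all.addAll(a.runAll());
        return all;
    }
}
```

In this reading, an upper layer never reaches below its immediate neighbor: the society only talks to agents, and agents only talk to their behaviors.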
4 Collaborative Learning
In distributed data mining, there is a fundamental trade-off between the accuracy and
the cost of computation. At one extreme, we can move all distributed data to a central
site for data mining to produce the most accurate result. At the other extreme, we can
process all the data locally to obtain local models, and combine them for the final
result [9].
Although the centralized technique produces more accurate results, it may be impractical in some enterprise applications because of data privacy and network transfer constraints. So, how to improve the accuracy in the distributed environment is the problem we care about. In this section, we apply a collaborative learning approach [14, 15] which aims to improve the accuracy of the local models. Since the global model is combined from the local models, the accuracy of the local models considerably affects the final result. In the following, we first use agent goal relations to formalize the collaborative learning process.
Definition 1 (Basic Denotations)
(Action_i): agent i takes an action.
(Achieve_i : G): G denotes a goal of agent i; the agent takes actions to achieve G.
F = S_1 ∪ S_2 ∪ … ∪ S_{n-1} ∪ S_n: S_i denotes the data stored at host i, and F is the whole set of data distributed over the n hosts. We assume Schema(S_i) = Schema(S_j) here.
Z = {L_1, L_2, L_3, …, L_n}: L_i is the data mining algorithm used at host i, and Z denotes the set of algorithms selected in one distributed data mining process. Without loss of generality, we do not require the algorithms to be the same across the DDM application.
k = {C_1, C_2, C_3, …, C_n}: C_i denotes the local model at host i, and k is the global result of the whole process over the n geographically distributed data sites.
Definition 2 (Data Mining Goal)
S = {g_1, g_2, g_3, …, g_n} denotes the set of data mining goals of the n distributed hosts. In the application, the whole goal of the agent society is to compute and acquire the global model, and g_i is the local data mining goal of Agent_i.
Definition 3 (Goal Relation)
R denotes the relations between goals. R : = { Before, Serial} ,
R ( gi , g j ) = Before iff (achievei : gi ) (achieve j : g j )
(action j ) U (actioni )
This formula means gi must be finished before g j . As the definition of Before,
Serial means gi and g j can be finished at the same time, in other words, there is no
time sequence between them.
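Operationally, the two relations can be read as scheduling constraints: Before forces g_i to complete before g_j starts, while Serial places no ordering constraint between the two goals. A minimal sketch of this reading (our own illustration of the definitions, not VAStudio code):

```java
import java.util.*;

// Sketch of the two goal relations in Definition 3 (hypothetical API):
// Before runs g_i strictly before g_j; Serial imposes no ordering.
public class GoalScheduler {
    public enum Relation { BEFORE, SERIAL }

    // Executes two data mining goals according to their relation and
    // returns the order in which they completed.
    public static List<String> run(Relation r, String gi, String gj) {
        List<String> finished = Collections.synchronizedList(new ArrayList<>());
        Runnable ri = () -> finished.add(gi);
        Runnable rj = () -> finished.add(gj);
        if (r == Relation.BEFORE) {
            ri.run();                     // g_i must be finished first,
            rj.run();                     // only then may g_j proceed
        } else {
            Thread ti = new Thread(ri), tj = new Thread(rj);
            ti.start(); tj.start();       // SERIAL: goals run concurrently
            try { ti.join(); tj.join(); }
            catch (InterruptedException e) { Thread.currentThread().interrupt(); }
        }
        return finished;
    }
}
```

Under Before the completion order is always (g_i, g_j); under Serial both goals complete, but in no guaranteed order.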
Fig. 2. The collaborative learning process: each site S_i applies algorithm L_i to produce a local model C_i, and the local models are then combined. The left diagram demonstrates the Before relation and the right one shows the Serial relation.
5 Graphical User Interface
In this section, we demonstrate the GUI of VAStudio, which provides a visual development environment for agent based DDM applications. Considering the heterogeneous environments in distributed processing, VAStudio is developed in the Java language, which has superior cross-platform and database support.
In fact, in former versions we did not emphasize the data mining functions but aimed to provide a general multi-agent development platform. Accordingly, the toolkit learns many experiences and lessons from several mature multi-agent systems and building platforms, such as JADE [17], ZEUS [18] and Aglets [19]. However, different from these systems, we adopt a plug-in technique to make the toolkit extensible, like MATLAB [21], so that it can be continually updated with new functions according to practical requirements. In the recent version, we specially built the DDM algorithms library, aiming to make VAStudio a powerful tool for constructing agent based DDM applications.
As Figure 3 shows, VAStudio is composed of five main parts: the visual programming environment, the algorithms library, the debug frame, the hierarchical Behavior-Agent-Society frame, and the menu and tool bar. The visual programming environment offers an ideal window for Java-oriented programmers to implement agent based applications. Considering that some programmers are not familiar with agent theory, we provide Behavior, Agent and Society building Wizards, like most business software, to help them quickly develop DDM systems.
In the left big frame, we add the ID3, C4.5, Cart and Ripper algorithms to build the relative sub-behavior classes. Then, through the Agent Wizard, we construct mobile agents such as MobileAgent_ID3 and MobileAgent_C4.5. In this experiment, we adopt a weighted voting algorithm to generate MobileAgent_Integration. The process is shown in the right big frame.
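A weighted voting integration step of the kind MobileAgent_Integration performs can be sketched as follows (a hypothetical illustration; the weights and the actual algorithm used in VAStudio may differ):

```java
import java.util.*;

// Illustrative weighted-voting integration of local classifier outputs
// (hypothetical sketch, not the actual MobileAgent_Integration code).
public class WeightedVoting {

    // labels[i] is the prediction of local classifier i, weights[i] its
    // weight (e.g. its validation accuracy). Returns the label that
    // accumulates the largest total weight.
    public static int integrate(int[] labels, double[] weights) {
        Map<Integer, Double> score = new HashMap<>();
        for (int i = 0; i < labels.length; i++) {
            score.merge(labels[i], weights[i], Double::sum);
        }
        return Collections.max(score.entrySet(),
                Map.Entry.comparingByValue()).getKey();
    }
}
```

With equal weights this reduces to plain majority voting as compared with meta-learning in [12]; unequal weights let a more accurate local model outvote several weaker ones.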
In order to support multi-agent collaboration, we provide a GUI for programmers to rapidly implement interaction ontologies. The small frame shows the ontology building Wizard.
Fig. 3. The graphical user interface of VAStudio: the small frame is the Ontology wizard, the left big frame shows the building of the five Behaviors, and the right big one shows the construction process of three agents.
6 Evaluations
In this paper, we first reviewed the most representative agent based DDM systems and then described VAStudio, an extensible, user-friendly toolkit for developing various agent based DDM applications. We then formalized the goal relation theory to facilitate modeling multi-agent interactions in complex DDM processes. Finally, a detailed comparison between VAStudio and the former systems was reported from different aspects. The contribution of this paper is that we first implement an integrated platform for constructing agent based DDM which effectively connects theory and practice. The goal relation formalization provides a perspective for users to deeply understand multi-agent collaboration and simplifies the modeling process for complex DDM applications. Moreover, the detailed comparison between the most representative agent based DDM systems may provide a short tutorial for readers who are not familiar with this field.
In the future, we plan to enrich the data mining algorithms library with both machine learning and statistical approaches. One core problem we have to deal with is providing support for heterogeneous databases. The final goal is for VAStudio to provide development support, from data mining algorithms to integration techniques, not only for distributed data sites but also for the expansive Internet.
References
1. B. Park and H. Kargupta, Distributed data mining: Algorithms, systems, and applications, in Handbook of Data Mining, N. Ye, Ed. Hillsdale, NJ: Lawrence Erlbaum, 2003, pp. 341-361.
2. M. Zaki and Y. Pan, Introduction: Recent developments in parallel and distributed data mining, J. Distributed Parallel Databases, vol. 11, no. 2, pp. 123-127, 2002.
3. H. Kargupta, C. Kamath, and P. Chan, Distributed and parallel data mining: Emergence, growth and future directions, in Advances in Distributed Data Mining, H. Kargupta and P. Chan, Eds. Menlo Park, CA: AAAI, 1999, pp. 407-416.
4. V. Kumar, S. Ranka, and V. Singh, High performance data mining, J. Parallel Distrib. Computing, vol. 61, no. 3, pp. 281-284, 2001.
5. M. Zaki, Parallel and distributed data mining: An introduction, in Large-Scale Parallel Data Mining (Lecture Notes in Artificial Intelligence 1759), M. Zaki and C.-T. Ho, Eds. Berlin: Springer-Verlag, 2000, pp. 1-23.
6. M. Klusch, S. Lodi and G. Moro, Agent-based distributed data mining: The KDEC scheme, LNAI 2586, pp. 104-122, 2003.
7. H. Kargupta, B. Park, and D. Hershberger, Collective data mining: A new perspective toward distributed data mining, in Advances in Distributed and Parallel Knowledge Discovery, AAAI/MIT Press, 2000, pp. 131-178.
8. S. J. Stolfo, A. L. Prodromidis, S. Tselepis, W. Lee, D. Fan, and P. K. Chan, JAM: Java agents for meta-learning over distributed databases, in Proc. 3rd Int. Conf. Knowledge Discovery and Data Mining (KDD), D. Heckerman, H. Mannila, D. Pregibon, and R. Uthurusamy, Eds., Newport Beach, CA, 1997, pp. 74-81.
9. R. Grossman, S. Bailey, A. Ramu, B. Malhi, H. Sivakumar, and A. Turinsky, Papyrus: A system for data mining over local and wide area clusters and super-clusters, in Proceedings of Supercomputing, 1999.
10. H. Kargupta, I. Hamzaoglu, and B. Stafford, Scalable, distributed data mining using an agent based architecture, in Proceedings of the Third International Conference on Knowledge Discovery and Data Mining, pp. 211-214, 1997.
11. H. Kargupta, B. Park, E. Johnson, E. Sanseverino, L. D. Silvestre, and D. Hershberger, Collective data mining from distributed vertically partitioned feature space, in Workshop on Distributed Data Mining, International Conference on Knowledge Discovery and Data Mining, 1998.
12. P. Chan and S. Stolfo, A comparative evaluation of voting and meta-learning on partitioned data, in Proceedings of the Twelfth International Conference on Machine Learning, Morgan Kaufmann, San Francisco, CA, pp. 90-98, 1995.
13. T. G. Dietterich, Machine learning research: Four current directions, AI Magazine, vol. 18, pp. 97-136, 1997.
14. F. Provost and D. Hennessy, Scaling up: Distributed machine learning with cooperation, in Proceedings of the Thirteenth National Conference on Artificial Intelligence, Menlo Park, CA, pp. 74-79, 1996. AAAI Press.
15. P. J. Modi and W.-M. Shen, Collaborative multi-agent learning for classification tasks, in Proceedings of the Fifth International Conference on Autonomous Agents, pp. 37-38, Montreal, 2001. ACM Press.
16. W. Lee and Naser S., JavaDot: An extensible visualization environment, Technical Report CUCS, Department of Computer Science, Columbia University, NY, 1997.
17. http://jade.tilab.com/
18. http://www.btexact.com/projects/agents/zeus/
19. http://www.trl.ibm.com/aglets/
20. http://www.cs.waikato.ac.nz/ml/weka/
21. http://www.mathworks.com/
22. http://www.intsci.ac.cn/