
An Extensible Toolkit for Agent Based Distributed Data Mining

Luo Jiewen1,2, Shi Zhongzhi1, Lin Fen1,2, Wang Maoguang1,2



1 Institute of Computing Technology, Chinese Academy of Sciences, Kexueyuan South Road 6
2 Graduate School of Chinese Academy of Sciences, Beijing, China

http://www.intsci.ac.cn/en/index.html
http://www.gscas.ac.cn/gscasenglish/index.aspx
{luojw, shizz, linf, wangmg}@ics.ict.ac.cn

Abstract. The advent of multi-agent systems has brought us opportunities for the development of complex software that will serve as the infrastructure for advanced distributed applications. During the past decades, several models and systems have been proposed that apply agent technology to construct parallel and distributed data mining. However, almost every approach focuses on a specific technique that may be appropriate in one aspect but not so effective in another. In this paper, we implement an extensible toolkit named VAStudio1, which integrates multiple strategies and provides a user-friendly development environment for building various distributed data mining applications.

1 Introduction

Distributed Data Mining (DDM) aims at extracting useful patterns from distributed databases in order to compose them within a distributed knowledge base and use them for decision making. Mining information and knowledge from distributed data sources, such as weather databases and financial data portals, has been recognized by industrial companies as an important income source. However, most data mining techniques are developed for centralized data sources and often cannot be applied directly in distributed data environments. How to conveniently apply the centralized methods to distributed applications has become a great challenge [1, 2, 3].
Traditionally, the most widely used approach to DDM in business applications is to apply sequential data mining techniques to data that have been retrieved from different sources and stored in a central data warehouse. Despite its commercial success, such a solution may be impractical or even impossible for some business environments because data may be inherently distributed and cannot be localized on any one host for a variety of reasons, including security and fault tolerance [6]. Fortunately, the advent of multi-agent systems has brought us opportunities for the development of distributed systems as the infrastructure for dealing with this kind of problem.

1 This project is supported by High-Tech Program 863 (2001AA113121) and the National Natural Science Foundation of China (90104021, 60073019, 60173017).
During the past decades, several models and approaches have been proposed that use the multi-agent paradigm to construct distributed data mining applications. However, most of these approaches focus on a specific combination technique, such as meta-learning for integrating homogeneous models or collective data mining for combining heterogeneous schemas. Few general tools have been proposed to conveniently implement the various agent based DDM applications. In this paper, we describe VAStudio, an extensible toolkit which combines multi-agent technology, parallel and distributed algorithms, and a rich GUI to provide an integrated development environment.
The rest of the paper is organized as follows. In Section 2, we briefly review existing agent based DDM systems. In Section 3, we describe the hierarchical structure of VAStudio. After that, collaborative learning is discussed. In the end, we demonstrate the GUI of the toolkit and compare it with the most representative agent based DDM systems to evaluate its functions.

2 Related Work

Despite its relative infancy compared with centralized data mining, agent based distributed data mining has already achieved important research results in the past years. In this section, we briefly review the most representative agent based DDM systems: BODHI, JAM, Papyrus and PADMA.
BODHI is an agent-based distributed data mining system that offers an environ-
ment capable of handling heterogeneous distributed data mining. It has been designed
according to a framework for collective data mining on heterogeneous data sites, supporting tasks such as supervised inductive distributed function learning and regression [7, 11].
JAM is an agent-based meta-learning system for DDM. It is implemented as a col-
lection of distributed learning and classification programs linked together through a
network of data sites. Each local agent builds a classification model and different
agents build classifiers using different techniques. After local data mining, JAM pro-
vides a set of meta-learning agents for combining multiple models learnt at different
sites into a meta-classifier that in many cases improves the overall predictive accu-
racy [8].
Papyrus uses Java aglets to support moving data, models, results, or mixed strategies. It supports different task and predictive model strategies. It is a specialized
system for clusters, meta-clusters, and super-clusters. Each cluster has one distin-
guished node which acts as its cluster access and control point for the agents. Coordi-
nation of the overall clustering task is either done by a central root site or distributed
to the (peer-to-peer) network of cluster access points [9].
PADMA is an agent based architecture for parallel and distributed data mining which deals with DDM problems on homogeneous data sites. Partial data
cluster models are first computed by stationary agents locally at distributed sites. All
local models are collected to a central site that performs a second-level clustering
algorithm to generate the global cluster model [10].
Common to all these approaches is that they aim at integrating the knowledge discovered from data at different geographically distributed network sites, and each system focuses on a specific, representative technique for combining the local models.

3 Hierarchical Structure of VAStudio

This section introduces the hierarchical architecture of VAStudio in detail. From a software engineering perspective, reusable classes are the main components of new software, and programmers prefer to integrate existing modules for rapid development. In line with this view, we adopt a hierarchical structure in VAStudio. Figure 1 shows the structure, which is composed of four layers: the algorithm library layer, the behavior layer, the agent layer and the society layer. The lower layers supply the fundamental material for the upper ones to construct more abstract and complex applications.

(Figure: four layers stacked bottom-up — the data mining and integration algorithm libraries, the behaviour layer, the agent layer, and the agent society layer.)

Fig.1. Hierarchical Structure of VAStudio

In the algorithm library layer, we focus on classification, clustering and association algorithms, because many such algorithms have been successfully developed in the past decades [19]. In particular, the field of machine learning has made substantial progress and provides many sophisticated techniques for distributed data mining. Accordingly, we mainly integrate this kind of algorithm in VAStudio, although other techniques such as ANN, Fuzzy Logic and Rough Set are also effective in practical data mining processes.
Besides local data mining algorithms, another challenge is how to integrate the local results into a global one. The integration is not simply a matter of putting together results from all sites, because a pattern that is interesting in a local database may not be interesting globally. In VAStudio, we employ ideas from meta-learning [8] and ensemble learning [13]. All integration algorithms are implemented in Java and stored in the integration algorithms library, which can be reused in the behavior layer.
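To make the role of the integration algorithms library concrete, the following is a minimal Java sketch of a weighted-voting combiner in the spirit of the ensemble-learning ideas mentioned above. The interface and class names (LocalModel, WeightedVotingCombiner) and their methods are illustrative assumptions, not the actual VAStudio API.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical view of a local classifier produced at one distributed site. */
interface LocalModel {
    String classify(double[] instance);   // predicted class label for one record
    double weight();                      // e.g. validation accuracy measured at the local site
}

/** One possible entry of the integration algorithms library: weighted voting over local models. */
class WeightedVotingCombiner {
    /** Combine the predictions of all local models for a single instance. */
    String classify(List<LocalModel> models, double[] instance) {
        Map<String, Double> votes = new HashMap<>();
        for (LocalModel m : models) {
            // Each local model votes for its predicted label, weighted by its own reliability.
            votes.merge(m.classify(instance), m.weight(), Double::sum);
        }
        // The label with the largest accumulated weight becomes the global prediction.
        return votes.entrySet().stream()
                .max(Map.Entry.comparingByValue())
                .map(Map.Entry::getKey)
                .orElseThrow(() -> new IllegalArgumentException("no local models supplied"));
    }
}
```

A meta-learning combiner would replace the voting step with a second-level classifier trained on the local models' outputs, but the interface it exposes to the behavior layer can stay the same.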
According to the hierarchical structure, behaviors are the basic components used to build agents. In VAStudio, every behavior has one or more actions. We provide the definition of a base Behavior class, and every child Behavior (e.g. ID3Behavior, C4.5Behavior) is generated as a subclass of it; the base class implements the common interfaces that its subclasses inherit.
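As an illustration of this arrangement, a minimal sketch is given below: a base Behavior class exposing a common action interface and two child behaviors wrapping decision-tree learners. The method signatures and members are assumptions for illustration, not VAStudio's actual definitions; C4.5Behavior is spelled C45Behavior here only because a dot is not legal in a Java identifier.

```java
/** Sketch of the base Behavior class: every behavior exposes one or more actions. */
abstract class Behavior {
    /** Common interface inherited by every child behavior: run the action(s) on local data. */
    abstract void action(double[][] localData, String[] labels);

    /** Shared helper available to all subclasses. */
    protected void log(String message) {
        System.out.println(getClass().getSimpleName() + ": " + message);
    }
}

/** Child behavior wrapping the ID3 algorithm from the algorithm library. */
class ID3Behavior extends Behavior {
    @Override
    void action(double[][] localData, String[] labels) {
        log("building an ID3 decision tree on " + localData.length + " local records");
        // ... delegate to the ID3 implementation stored in the algorithm library ...
    }
}

/** Child behavior wrapping C4.5 (named C45Behavior for a legal identifier). */
class C45Behavior extends Behavior {
    @Override
    void action(double[][] localData, String[] labels) {
        log("building a C4.5 decision tree on " + localData.length + " local records");
        // ... delegate to the C4.5 implementation stored in the algorithm library ...
    }
}
```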
Although behaviors are components with concrete actions, they cannot be dispatched to remote hosts and executed as independent entities. They are used by autonomous, intelligent and mobile agents. In VAStudio, agents which wrap concrete behaviors can be moved to distributed data sites to run independently, and can communicate and collaborate with their peers in complex DDM applications. As with Behavior, we have implemented a base Agent class for child agents to inherit.
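A corresponding sketch of the agent layer is shown below, reusing the hypothetical Behavior class from the previous listing. The dispatchTo stub only prints a message: in the real toolkit, migration is handled by the mobile-agent platform, and the method names here are assumptions rather than VAStudio's actual interface.

```java
import java.util.ArrayList;
import java.util.List;

/** Sketch of the base Agent class: an agent wraps concrete behaviors and can migrate to a data site. */
abstract class Agent {
    private final List<Behavior> behaviors = new ArrayList<>();

    void addBehavior(Behavior behavior) {
        behaviors.add(behavior);
    }

    /** Placeholder for migration; a real mobile-agent platform would serialize and ship the agent. */
    void dispatchTo(String hostAddress) {
        System.out.println(getClass().getSimpleName() + " dispatched to " + hostAddress);
    }

    /** Execute every wrapped behavior against the data available at the current site. */
    void run(double[][] siteData, String[] labels) {
        for (Behavior behavior : behaviors) {
            behavior.action(siteData, labels);
        }
    }
}

/** Example child agent carrying a single learning behavior to a remote site. */
class MobileMiningAgent extends Agent {
    MobileMiningAgent(Behavior learningBehavior) {
        addBehavior(learningBehavior);
    }
}
```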
The highest layer of the structure is the agent society, which provides a platform for simulating multi-agent interaction in distributed data mining. In this layer, we can monitor message transmission, the collaboration process and the migration routes of mobile agents.

4 Collaborative Learning

In distributed data mining, there is a fundamental trade-off between accuracy and the cost of computation. At one extreme, we can move all distributed data to a central site and mine it there to produce the most accurate result. At the other extreme, we can process all the data locally to obtain local models and combine them for the final result [9].
Although the centralized technique produces more accurate results, it may be impractical in some enterprise applications because of data privacy and network transfer constraints. So how to improve accuracy in the distributed environment is the problem we care about. In this section, we apply a collaborative learning approach [14, 15] which aims to improve the accuracy of the local models. Since the global model is combined from the local models, the accuracy of the local models considerably affects the final result. In the following, we first use agent goal relations to formalize the collaborative learning process.
Definition 1 (Basic Notations)
(Action_i): agent i takes an action.
(Achieve_i : G): G denotes a goal of agent i; the agent takes actions to achieve the goal.
F = S_1 ∪ S_2 ∪ S_3 ∪ ... ∪ S_{n-1} ∪ S_n: S_i denotes the data stored on host i, and F is the whole set of data distributed over the N hosts. We assume Schema(S_i) = Schema(S_j) here.
Z = {L_1, L_2, L_3, ..., L_n}: L_i is the data mining algorithm used on host i, and Z denotes the set of algorithms selected in one distributed data mining process. Without loss of generality, we do not require the algorithms to be the same in a DDM application.
k = {C_1, C_2, C_3, ..., C_n}: C_i denotes the local model on host i, and k is the global result of the whole process over the N geographically distributed data sites.
Definition 2 (Data Mining Goal)
S = {g_1, g_2, g_3, ..., g_n} denotes the set of data mining goals of the N distributed hosts. In an application, the overall goal of the agent society is to compute and acquire the global model, and g_i is the local data mining goal of Agent_i.
Definition 3 (Goal Relation)
R denotes the relation between goals, R := {Before, Serial}.
R(g_i, g_j) = Before iff (achieve_i : g_i) precedes (achieve_j : g_j), i.e. (action_j) cannot start until (action_i) has finished.
This means g_i must be finished before g_j. By analogy with the definition of Before, Serial means g_i and g_j can be achieved at the same time; in other words, there is no time sequence between them.
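To show how these definitions can be operationalized, the sketch below encodes the two goal relations and a check that an execution order of local goals respects every Before constraint. The types are hypothetical (and use Java records, i.e. Java 16+); VAStudio's internal representation may differ.

```java
import java.util.List;

/** The two goal relations of Definition 3. */
enum GoalRelation { BEFORE, SERIAL }

/** A local data mining goal g_i owned by agent i (Definition 2). */
record Goal(int agentId, String description) { }

/** A pairwise constraint R(g_i, g_j) between two goals. */
record GoalConstraint(Goal gi, Goal gj, GoalRelation relation) { }

class GoalScheduleChecker {
    /**
     * Returns true when an execution order (earliest goal first, all goals present)
     * respects every BEFORE constraint; SERIAL constraints impose no ordering.
     */
    static boolean respects(List<Goal> executionOrder, List<GoalConstraint> constraints) {
        for (GoalConstraint c : constraints) {
            if (c.relation() == GoalRelation.BEFORE
                    && executionOrder.indexOf(c.gi()) > executionOrder.indexOf(c.gj())) {
                return false; // g_i is scheduled after g_j although R(g_i, g_j) = Before
            }
        }
        return true;
    }
}
```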

(Figure: two sub-figures; in each, the whole data set is split into S_1, S_2, ..., S_n, each S_i is mined locally by algorithm L_i to produce a local model C_i, and the local models are finally combined; agents link the local learners.)

Fig.2. Collaborative learning process. The left sub-figure demonstrates the Before relation and the right one the Serial relation.

When the N hosts are processed sequentially, it is possible to take advantage of knowledge mined earlier to guide mining at the next host. This is the motivation for applying collaborative learning in VAStudio. In Figure 2, S_1, S_2, S_3, ..., S_n are randomly selected data tables from a large database. We assume they have a homogeneous schema and that the N hosts are connected by a high performance computer network. Z = {L_1, L_2, L_3, ..., L_n} denotes the set of machine learning algorithms wrapped by mobile agents that can be dispatched to the N data sites for local learning. In the left sub-figure, the local model C_i is taken as input to the next data mining program and is used in building C_{i+1}. According to the goal relation definition, we can conveniently describe this process as (Achieve_i L_i : g_i) ∧ (R(g_i, g_{i+1}) = Before). For the right sub-figure, rather than assuming that results are available before a stage begins, the distributed algorithms collaborate by sharing results as they become available. Since many mining algorithms operate naturally as anytime algorithms, producing some results very quickly and then more as time progresses, early in the DDM process there will likely already be results that can act similarly to those passed from stage to stage in the sequential mode. We denote this as (Achieve_i L_i : g_i) ∧ (Achieve_j L_j : g_j) ∧ (R(g_i, g_j) = Serial). Comparing these two techniques, the first is comparatively easy to control and implement, but it loses some accuracy because only C_{i+1} can benefit from C_i. In the latter, mobile agents share knowledge in a parallel way and learn from each other, which can effectively improve the accuracy of the local models. Generally speaking, we adopt the first technique if we emphasize efficiency and the latter if we care more about accuracy. From the description above, the goal relation formalization facilitates modeling multi-agent interactions for complex DDM applications.
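The two collaboration modes can be sketched as follows. The sketch is an assumption-laden simplification: it runs local learners as threads in one JVM instead of mobile agents on remote hosts, and the SiteLearner interface and class names are illustrative only. In the Before mode each site receives all previously built models; in the Serial mode all sites learn concurrently and each learner is handed whatever partial models have already been published when it starts (a fuller implementation would let learners poll the shared pool repeatedly while they run).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

/** Hypothetical local learner: builds a model C_i from local data S_i, optionally guided by prior models. */
interface SiteLearner {
    Object learn(double[][] localData, List<Object> priorModels);
}

class CollaborationModes {

    /** Before mode: sites are processed in sequence and each local model guides the next site. */
    static List<Object> sequential(List<double[][]> sites, SiteLearner learner) {
        List<Object> models = new ArrayList<>();
        for (double[][] data : sites) {
            models.add(learner.learn(data, List.copyOf(models))); // C_1..C_i feed the building of C_{i+1}
        }
        return models;
    }

    /** Serial mode: all sites learn concurrently and publish partial results to a shared pool. */
    static List<Object> parallel(List<double[][]> sites, SiteLearner learner) throws InterruptedException {
        ConcurrentLinkedQueue<Object> shared = new ConcurrentLinkedQueue<>(); // models published so far
        ExecutorService pool = Executors.newFixedThreadPool(sites.size());
        for (double[][] data : sites) {
            // Each task snapshots the models already published and adds its own result when done.
            pool.submit(() -> shared.add(learner.learn(data, List.copyOf(shared))));
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.HOURS);
        return List.copyOf(shared);
    }
}
```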

5 Graphical User Interface

In this section, we demonstrate the GUI of VAStudio, which provides a visual development environment for agent based DDM applications. Considering the heterogeneous environments involved in distributed processing, VAStudio is developed in Java, which offers superior cross-platform and database support.
In earlier versions we did not emphasize the data mining functionality; the aim was to provide a general multi-agent development platform. Accordingly, the toolkit borrows many experiences and lessons from several mature multi-agent systems and building platforms, such as JADE [17], ZEUS [18] and Aglets [19]. However, different from these systems, we adopt a plug-in technique to make the toolkit extensible, like MATLAB [21], so that it can be continually updated with new functions according to practical requirements. In the current version, we specifically build the DDM algorithms library and aim to make it a powerful tool for constructing agent based DDM.
As Figure 3 shows, VAStudio is composed of five main parts: the visual programming environment, the algorithms library, the debug frame, the hierarchical Behavior-Agent-Society frame, and the menu and tool bar. The visual programming environment offers an ideal window for Java-oriented programmers to implement agent based applications. Considering that some programmers are not familiar with agent theory, we provide Behavior, Agent and Society building wizards, like most business software, to help them quickly develop DDM systems.
In the left big frame, we add the ID3, C4.5, CART and Ripper algorithms to build the corresponding sub-behavior classes. Then, through the Agent Wizard, we construct agents such as MobileAgent_ID3 and MobileAgent_C4.5. In this experiment, we adopt the weighted voting algorithm to generate MobileAgent_Integration. The process is shown in the right big frame.
In order to support multi-agent collaboration, we provide a GUI for programmers to rapidly implement interaction ontologies. The small frame shows the ontology building wizard.

Fig.3. Graphical user interface of VAStudio. The small frame is the Ontology wizard, the left big frame shows the building of five Behaviors, and the right big frame shows the construction process of three agents.

6 Evaluations

In Section 2, we reviewed four representative agent based DDM systems. In order to evaluate VAStudio, we compare it with them from different aspects.
The biggest difference between VAStudio and the above four systems is the development goal and application scope. VAStudio aims to provide a general development platform for building agent based DDM applications. From this view, it focuses on a powerful algorithms library, extensible interfaces and a user-friendly GUI, since these attributes are the most important ones for developers implementing their own DDM applications. The four systems in Section 2, in contrast, focus on learning algorithms, integration techniques and some specific areas, which inevitably confines their application scope.
Concretely, from the GUI aspect, JAM provides graph drawing tools, which employ major components from JavaDot [16], to help users understand the learned knowledge. PADMA provides a web-based user interface for visual interaction with the system [10]. To the best of our knowledge, BODHI and Papyrus do not show a GUI in their published papers. As shown in Section 5, VAStudio has a user-friendly, extensible GUI which provides complete support for users to rapidly develop agent based DDM systems.
From the database support aspect, JAM mainly focused on homogeneous databases at first and later supported heterogeneous data schemas through a bridging method. Both PADMA and BODHI were proposed by Kargupta et al.; the former aims to support homogeneous databases and the latter heterogeneous databases, using collective data mining techniques. According to its authors, Papyrus supports both schemas. In VAStudio, the integration algorithms library only provides support for homogeneous databases in the current version.
From the representative technique aspect, JAM is the most distinguished meta-learning system for distributed data mining. Papyrus first proposed the idea of supporting moving data, models, results or mixed strategies in the DDM process. PADMA applies a facilitator to support local model integration and provides a web interface for human-computer interaction. BODHI first proposed using orthonormal basis functions to support heterogeneous databases. Different from them, VAStudio is an integrated DDM development platform which can be extended according to requirements. Its other representative characteristic is applying collaborative learning to improve the accuracy of the local models.

Table 1. Comparison of VAStudio with JAM, PADMA, BODHI and Papyrus.

System   | Development environment | GUI                       | Databases     | Technique
JAM      | No                      | Good                      | Both          | Meta-learning
PADMA    | No                      | Mid                       | Homogeneous   | Facilitator
BODHI    | No                      | Not shown                 | Heterogeneous | Collective DM
Papyrus  | No                      | Not shown                 | Both          | Multi-strategy
VAStudio | Yes                     | High (visual development) | Homogeneous   | Extensible & collaborative learning

7 Conclusions and Future Work

In this paper, we first reviewed the most representative agent based DDM systems and then described VAStudio, an extensible, user-friendly toolkit for developing various agent based DDM applications. We then formalized the goal relation theory to facilitate modeling multi-agent interactions in complex DDM processes. Finally, a detailed comparison between VAStudio and the earlier systems was reported from different aspects. The contribution of this paper is that we are the first to implement an integrated platform for constructing agent based DDM, which effectively connects theory and practice. The goal relation formalization provides a perspective that helps users understand multi-agent collaboration in depth and simplifies the modeling process for complex DDM applications. Moreover, the detailed comparison of the most representative agent based DDM systems may provide a short tutorial for readers who are not familiar with this field.
In the future, we plan to enrich the data mining algorithms library, drawing both on machine learning theory and on statistical approaches. One core problem we have to deal with is providing support for heterogeneous databases. The final goal is for VAStudio to provide development support, from data mining algorithms to integration techniques, not only for distributed data sites but also for the expansive Internet.

References

1. B. Park and H. Kargupta, Distributed data mining: Algorithms, systems, and applications, in Handbook of Data Mining, N. Ye, Ed. Hillsdale, NJ: Lawrence Erlbaum, 2003, pp. 341-361.
2. M. Zaki and Y. Pan, Introduction: Recent developments in parallel and distributed data mining, J. Distributed Parallel Databases, vol. 11, no. 2, pp. 123-127, 2002.
3. H. Kargupta, C. Kamath, and P. Chan, Distributed and parallel data mining: Emergence, growth and future directions, in Advances in Distributed Data Mining, H. Kargupta and P. Chan, Eds. Menlo Park, CA: AAAI, 1999, pp. 407-416.
4. V. Kumar, S. Ranka, and V. Singh, High performance data mining, J. Parallel Distrib. Computing, vol. 61, no. 3, pp. 281-284, 2001.
5. M. Zaki, Parallel and distributed data mining: An introduction, in Large-Scale Parallel Data Mining (Lecture Notes in Artificial Intelligence 1759), M. Zaki and C.-T. Ho, Eds. Berlin: Springer-Verlag, 2000, pp. 1-23.
6. M. Klusch, S. Lodi and G. Moro, Agent-based distributed data mining: The KDEC scheme, LNAI 2586, pp. 104-122, 2003.
7. H. Kargupta, B. Park and D. Hershberger, Collective data mining: A new perspective toward distributed data mining, in Advances in Distributed and Parallel Knowledge Discovery, AAAI/MIT Press, 2000, pp. 131-178.
8. S. J. Stolfo, A. L. Prodromidis, L. Tselepis, W. Lee, D. Fan, and P. K. Chan, JAM: Java agents for meta-learning over distributed databases, in Proc. 3rd Int. Conf. Knowledge Discovery and Data Mining (KDD), D. Heckerman, H. Mannila, D. Pregibon, and R. Uthurusamy, Eds., Newport Beach, CA, 1997, pp. 74-81.
9. R. Grossman, S. Bailey, A. Ramu, B. Malhi, H. Sivakumar, and A. Turinsky, Papyrus: A system for data mining over local and wide area clusters and super-clusters, in Proceedings of Supercomputing, 1999.
10. H. Kargupta, I. Hamzaoglu, and B. Stafford, Scalable, distributed data mining using an agent based architecture, in Proceedings of the Third International Conference on Knowledge Discovery and Data Mining, 1997, pp. 211-214.
11. H. Kargupta, B. Park, E. Johnson, E. Sanseverino, L. D. Silvestre, and D. Hershberger, Collective data mining from distributed vertically partitioned feature space, in Workshop on Distributed Data Mining, International Conference on Knowledge Discovery and Data Mining, 1998.
12. P. Chan and S. Stolfo, A comparative evaluation of voting and meta-learning on partitioned data, in Proceedings of the Twelfth International Conference on Machine Learning, San Francisco, CA: Morgan Kaufmann, 1995, pp. 90-98.
13. T. G. Dietterich, Machine learning research: Four current directions, AI Magazine, vol. 18, pp. 97-136, 1997.
14. F. Provost and D. Hennessy, Scaling up: Distributed machine learning with cooperation, in Proceedings of the Thirteenth National Conference on Artificial Intelligence, Menlo Park, CA: AAAI Press, 1996, pp. 74-79.
15. P. J. Modi and W.-M. Shen, Collaborative multi-agent learning for classification tasks, in Proceedings of the Fifth International Conference on Autonomous Agents, Montreal, 2001, pp. 37-38. ACM Press.
16. W. Lee and Naser S., JavaDot: An extensible visualization environment, Technical Report, Department of Computer Science, Columbia University, NY, 1997.
17. http://jade.tilab.com/
18. http://www.btexact.com/projects/agents/zeus/
19. http://www.trl.ibm.com/aglets/
20. http://www.cs.waikato.ac.nz/ml/weka/
21. www.mathworks.com/
22. www.intsci.ac.cn
