Weka4WS A WSRF-Enabled Weka Toolkit

Weka4WS
Introduction
Weka4WS is a framework developed at the University of Calabria that extends the widely used Weka toolkit to
support distributed data mining in Grid environments.
Weka provides a large collection of machine learning algorithms written in Java for data pre-processing,
classification, clustering, association rules, and visualization, which can be invoked through a common graphical
user interface. In Weka, the overall data mining process takes place on a single machine, since the algorithms
can be executed only locally.
The goal of Weka4WS is to extend Weka to support remote execution of the data mining algorithms through
WSRF Web Services. In such a way, distributed data mining tasks can be concurrently executed on
decentralized Grid nodes by exploiting data distribution and improving application performance. In Weka4WS,
the data mining algorithms for classification, clustering and association rules can also be executed on remote
Grid resources. To enable remote invocation, all the data mining algorithms provided by the Weka library are
exposed as a Web Service, which can be easily deployed on the available Grid nodes. Thus, Weka4WS also
extends the Weka GUI to enable the invocation of the data mining algorithms that are exposed as Web Services
on remote Grid nodes.
To achieve integration and interoperability with standard Grid environments, Weka4WS has been designed by
using the Web Services Resource Framework (WSRF) as enabling technology. In particular, Weka4WS has been
developed by using the WSRF Java library provided by Globus Toolkit 4.0.x (GT4).
In the Weka4WS framework all nodes use the GT4 services for standard Grid functionalities, such as security
and data management. These nodes fall into two categories:
1. user nodes, which are the local machines of the users providing the Weka4WS client software;
2. computing nodes, which provide the Weka4WS Web Services allowing the execution of remote data mining
tasks.
Weka4WS is therefore distributed in two separate packages:
1. Weka4WS-client, which contains the client software (including the extended Weka GUI) to be installed on the
user nodes;
2. Weka4WS-service, which contains the WSRF-compliant Web Services to be installed on the computing nodes.
Installation
Software prerequisites
Weka4WS requires Globus Toolkit 4.0.x (full installation) on the computing nodes and only the Java WS Core (a
subset of Globus Toolkit) on the user nodes. Note that this is not a minimum requirement but a specific
requirement: Globus Toolkit 4.2.x and later versions include updates to the Web Services specifications and
to some of their other services that make them incompatible with Weka4WS.
Since the full version of Globus Toolkit 4.0.x runs only on Unix platforms (Linux included), Weka4WS-service can
currently be installed only on those systems, while Weka4WS-client can be installed on both Unix and
Windows platforms.
The following links are useful for installing Globus Toolkit 4.0.x:
Globus Toolkit 4.0.x Download
GT 4.0.x Quickstart Guide
Installing GT 4.0.x (System Administrator's Guide)
Security prerequisites
Weka4WS runs in a security context and uses gridmap authorization: only users that are listed in the
service gridmap may invoke it. For Weka4WS to run properly, the following prerequisites must be
satisfied:
1. the Weka4WS user must hold a valid proxy certificate (in the X.509 format) with a given Distinguished Name
(DN);
Weka4WS: a WSRF-enabled Weka Toolkit http://grid.deis.unical.it/weka4ws/main.html#download
1 of 8 12/11/2012 7:49 AM
2. the file '/etc/grid-security/grid-mapfile' on the computing nodes must contain an entry that maps the Weka4WS
client user to a local user at the computing node. An example entry follows:
"O=KGrid/OU=University of Calabria/CN=John Doe" john
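To check that a given DN is actually mapped on a computing node, the grid-mapfile can be searched with a small
helper. The following is only a sketch: the function name lookup_gridmap is hypothetical (it is not part of
Weka4WS or Globus), and it assumes each entry has the quoted-DN-then-local-user layout shown above.

```shell
#!/bin/sh
# lookup_gridmap FILE DN   (hypothetical helper, not shipped with Weka4WS)
# Prints the local user mapped to DN in a grid-mapfile-style FILE.
# Returns non-zero when no entry matches.
lookup_gridmap() {
    # Split each line on double quotes: field 2 is the DN, field 3 the local user.
    awk -F'"' -v dn="$2" '
        $0 !~ /^[ \t]*#/ && $2 == dn {
            user = $3
            sub(/^[ \t]+/, "", user)    # trim leading blanks
            sub(/[ \t]+$/, "", user)    # trim trailing blanks
            print user
            found = 1
        }
        END { exit (found ? 0 : 1) }
    ' "$1"
}
```

For example, lookup_gridmap /etc/grid-security/grid-mapfile "O=KGrid/OU=University of Calabria/CN=John Doe"
would print the mapped local user if the example entry is present.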
Computing nodes
As 'root' user, perform the following step:
1. add the following line to the file /etc/sudoers:
globus ALL= NOPASSWD: /bin/ls, /bin/cp, /bin/mkdir, /bin/chown, /bin/gzip
As 'globus' user (or alternatively as the user who runs the Globus container), download the Weka4WS-service
package into a directory (for example, that user's home directory), and perform the following steps:
1. extract the Weka4WS-service package:
tar xzvf weka4ws-service-2.1.tgz
2. enter the newly created directory:
cd ./weka4ws-service-2.1
3. generate the Weka4WS GAR file by running the command:
./build.sh
4. deploy the Weka4WS service by running the command:
./deploy.sh
User node
Download the Weka4WS-client package and extract it in a directory of your choice.
Configuration
The only configuration Weka4WS requires is the editing of the 'machines' file, placed in the 'etc' subfolder of the
client package. This file contains information regarding the computing nodes, formatted with the syntax
explained below.
The 'etc/machines' file syntax
Every line beginning with the '#' character is ignored. Every other line describes one computing node and must
contain its hostname, its Globus container port, its GridFTP port and its logging option.
The logging option can only be 1 or 0, standing for "enabled" and "disabled" respectively: when logging
is enabled, detailed log output is written to the terminal where the container was started and
is running. An example machines file is shown below:
# ==================== computing node ==========================
# hostname | container port | gridFTP port | logging
pluto.deis.unical.it 8443 2811 1
saturn.deis.unical.it 8443 2811 0
cosmos.cs.icar.cnr.it 8443 2811 1
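Before starting the client, the syntax rules above can be checked mechanically. The following is a minimal sketch,
not part of Weka4WS: the function name check_machines is hypothetical, and skipping blank lines is an assumption
(the text only states that '#' lines are ignored).

```shell
#!/bin/sh
# check_machines FILE   (hypothetical helper, not shipped with Weka4WS)
# Prints every well-formed entry as "host container_port gridftp_port logging",
# reports malformed lines on stderr and returns non-zero if any were found.
check_machines() {
    awk '
        /^#/    { next }    # comment lines are ignored
        NF == 0 { next }    # assumption: blank lines are skipped as well
        NF != 4 || $2 !~ /^[0-9]+$/ || $3 !~ /^[0-9]+$/ || ($4 != "0" && $4 != "1") {
            printf "line %d: malformed entry: %s\n", NR, $0 > "/dev/stderr"
            bad = 1
            next
        }
        { print $1, $2, $3, $4 }
        END { exit (bad ? 1 : 0) }
    ' "$1"
}
```

Running check_machines etc/machines from the client directory lists the nodes that would be used and flags any
line whose ports are not numeric or whose logging option is not 0 or 1.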
Execution
Computing nodes
As 'root' user, perform the following step:
1. start the GridFTP server with:
$GLOBUS_LOCATION/sbin/globus-gridftp-server -p <port>
(where <port> is the desired port; if not specified the default 2811 port will be used)
As 'globus' user (or alternatively as the user who runs the Globus container), perform the following step:
1. start the globus container with:
$GLOBUS_LOCATION/bin/globus-start-container -p <port>
(where <port> is the desired port; if not specified the default 8443 port will be used)
User node
Enter the directory where you extracted the client package:
cd <path>/weka4ws-client-2.1
and run "weka4ws.sh" (or "weka4ws.bat" if you are running it on a Windows machine).
Run the client on Windows
1. download the Globus "Java WS Core Binary Installer" from here;
2. extract its content to a directory of your choice (e.g. "C:\ws-core-4.0.7");
3. place the user certificate (usercert.pem and userkey.pem) in
C:\Documents and Settings\[your username]\.globus
(Explorer does not allow creating a directory whose name starts with a dot, so you will have to create the
.globus directory by running "mkdir .globus" in the command prompt)
4. place the Certification Authority files (a pair of files with names such as 'abc123.0' and 'abc123.signing_policy') in
C:\Documents and Settings\[your username]\.globus\certificates
5. set the environment variable:
* go to Control Panel / System / Advanced / Environment Variables;
* press "New" under "System Variables" and set the following:
GLOBUS_LOCATION=C:\ws-core-4.0.7
6. double-click "weka4ws.bat" to run the application.
Troubleshooting
* OutOfMemoryError (at client side): most Java virtual machines allocate only a limited maximum
amount of memory to Java programs, usually much less than the amount of RAM in your computer.
With Weka4WS you can extend the memory available to the virtual machine by running the 'weka4ws.sh' (or
'weka4ws.bat' on Windows) script and passing as first argument the amount of RAM (in MB) you wish to use. For
example, running:
./weka4ws.sh 2048
will run Weka4WS setting the maximum Java heap size to 2048MB.
* OutOfMemoryError (at server side): it is recommended to increase the maximum heap size of the JVM when
running the container. By default, Sun JVMs use a 64MB maximum heap size. The maximum heap size
can be set with the -Xmx JVM option. For example, to set a 512MB maximum heap size, run:
setenv GLOBUS_OPTIONS -Xmx512M
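The 'setenv' line above uses csh syntax. As a minimal sketch for Bourne-style shells (sh, bash), the same setting
can be exported as shown below. The heap_flag helper is hypothetical: it only illustrates how a size in MB maps to
a -Xmx option, as described for the client script, and is not taken from weka4ws.sh, whose internals are not
shown here.

```shell
#!/bin/sh
# Bourne-shell (sh/bash) equivalent of the csh "setenv" line above.
GLOBUS_OPTIONS="-Xmx512M"
export GLOBUS_OPTIONS

# heap_flag MEGABYTES   (hypothetical helper, for illustration only)
# Turns a size in MB into a JVM maximum-heap option, e.g. 2048 -> -Xmx2048m.
heap_flag() {
    case $1 in
        ''|*[!0-9]*)
            # Reject empty or non-numeric input.
            echo "usage: heap_flag <megabytes>" >&2
            return 1
            ;;
    esac
    echo "-Xmx${1}m"
}
```

The variable must be set in the shell that subsequently starts the container for it to take effect.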
Screenshots
The GUI Chooser (left side), used to launch Weka's four graphical environments. The hosts list checking window
(top right side), automatically loaded at startup to check whether on every host:
* the Globus Container is running and accessible;
* the GridFTP Server is running and accessible;
* the requesting user has an account on the host;
* the Weka4WS service is deployed and accessible;
* the Weka4WS client and service versions are the same.
The Grid Proxy Initialization window (middle right side), automatically loaded at startup if the user credentials
are not available or have expired.
~
The Weka4WS Explorer component with the modified parts highlighted. A drop-down menu (in blue) selects the
remote host on which the data mining task is to be computed; the Reload hosts button (in red) brings up the
hosts list checking window (described above); the Proxy button (in green) brings up the Grid Proxy
Initialization window (described above).
~
The Weka4WS Explorer showing multiple tasks executed concurrently on some remote hosts. The number of running
tasks is displayed in the lower right corner. The name of the host where the task is being computed is displayed
at the top of the output panel. At any time it is possible to stop a remote task by selecting it from the
'Result list' (in the lower left corner) and pressing the 'Stop' button.
~
With detailed logging enabled, it is possible to follow the remote computations step by step and to see
their execution times.
~
The KnowledgeFlow component with the modified parts highlighted. Three buttons (in the upper right corner) are
used, from top to bottom, to start all the tasks, to stop them, and to bring up the hosts list checking window
(described earlier). During the computation, the label below each algorithm node displays the address of the
location where the computation is taking place. The Proxy button (in the lower left corner) brings up the Grid
Proxy Initialization window (described earlier).
~
The location where a given algorithm runs is chosen in the configuration panel of that algorithm, accessible by
right-clicking the algorithm icon and choosing Configure: a drop-down menu selects the remote host on which the
selected data mining task is to be computed.
~
For complex workflows, the KnowledgeFlow feature for grouping algorithms into sub-flows makes it quick and easy
to set their computing locations, either by setting all the computing locations of the algorithms belonging to
the sub-flow to Auto, or by choosing the specific location of each algorithm through its configuration entry
listed in the menu.

Changelog
Version 2.1:
* added the possibility to enable/disable detailed logging at the remote host;
* improved the performance of concurrent dataset transfers: a dataset to be concurrently transferred to various
remote hosts is compressed by one thread only; a dataset to be analyzed by different data mining algorithms on
the same remote host is transferred only once by one thread only;
* updated xstream (the Java library used to serialize objects to XML and back again) to version 1.3;
* bugfix: tasks in the "runningTasks" list at the computing node were added twice for the same data mining
task when the dataset wasn't already available at the first service invocation;
* bugfix: tasks in the "runningTasks" list at the computing node side weren't removed from the list after their
termination;
* bugfix: resources at the computing node weren't destroyed when exceptions arose at the user or computing
node;
* bugfix: compressed datasets appeared corrupted after their transfer to the destination;
* bugfix: method "isEmpty" of the String class (introduced in Java 6), used in the HostCheckThread class of the
user node, broke compatibility with Java 1.4.
Version 2.0 (visit the web page of Weka4WS 2.0):
* Knowledge Flow front-end also extended to support remote data mining;
* code updated to the 3.4.12 book version of Weka (12th of December 2007);
* the client side of the application can also run on Windows machines;
* proxy credentials may be created inside the client application: a dedicated window may be called at any time
both in Explorer and Knowledge Flow and will automatically pop up at startup if the credentials are not available
or have expired;
* added a pull-style message delivery mechanism for clients to whom notifications cannot be delivered (e.g.
because they are behind a firewall): the client now starts by default in pull-mode (that is, it checks for result
availability every 10 seconds) and asks the server to send a notification: if it subsequently receives a
notification, the client switches to push-mode (that is, it waits for a result availability notification);
otherwise it stays in pull-mode;
* improved hosts checking, which now includes, besides the Globus Container and GridFTP availability checks, a
user permissions check, a Weka4WSService deployment check and a version compatibility check between client and
service;
* added possibility to extend the memory available for the virtual machine by running the 'weka4ws.sh' (or
'weka4ws.bat' on Windows) script and passing as first argument the amount of RAM (in MB) to be used. For
example running
weka4ws.sh 2048
will run Weka4WS setting the maximum Java heap size to 2048MB.
Version 1.0 (visit the web page of Weka4WS 1.0):
* code updated to the 3.4.11 book version of Weka (1st of June 2007);
* added detailed client and service logging;
* added full support for data preprocessing;
* added full support for data visualization;
* added full support for "classifier evaluation options";
* added possibility to set a supplied test set;
* added dataset compression to improve transfer speed;
* added reporting of server-side exceptions to the client;
* added JDBC support;
* added possibility to concurrently run multiple remote tasks;
* added possibility to stop remote task execution.
Download
The current Weka4WS packages (2.1, dated 2nd of July 2008) can be downloaded here:
Weka4WS client
TAR.GZ (3.6MB)
ZIP (4.1MB)
Weka4WS service
TAR.GZ (2.8MB)
ZIP (2.9MB)
Copyright (C) 2005-2008 University of Calabria - Dept. of Electronics, Computer Science and Systems
This program is free software; you can redistribute it and/or modify it under the terms of the GNU General
Public License as published by the Free Software Foundation; either version 2 of the License, or (at your option)
any later version.
Blog
Since June 2007 Weka4WS has a blog of its own. You can find it at the following address:
http://weka4ws.wordpress.com/
With the blog you can:
* stay updated on the ongoing work and new releases of Weka4WS through RSS feeds;
* request new features or suggest modifications to existing features;
* report suspected bugs;
* read the frequently asked questions;
* get the printable Weka4WS user guide.
How to cite
Domenico Talia, Paolo Trunfio, Oreste Verta, "The Weka4WS framework for distributed data mining in
service-oriented Grids". Concurrency and Computation: Practice and Experience, vol. 20, n. 16, pp. 1933-1951, Wiley
InterScience, November 2008.
About us
Grid Lab is located at the University of Calabria, Rende (CS), Italy, Cubo 41C 3rd floor.
For comments and suggestions please contact us:
Domenico Talia
Paolo Trunfio
Marco Lackovic
Università della Calabria | Grid Computing Lab