
Maintenance Release

Java™ Data Mining (JDM)

Version 1.1

Java™ Specification Request 73:
Java™ Data Mining (JDM)

JSR-73 Expert Group


Specification Lead:
Mark Hornick, Oracle Corporation
mark.hornick@oracle.com

Technical comments:
jsr-73-comments@jcp.org

Version 1.1
Maintenance Release Specification
June 22, 2005


Copyright
Copyright (c) 2005 Oracle Corporation. All rights reserved.
Copyright (c) 2005 Sun Microsystems, Inc. All rights reserved.
This product or document is protected by copyright and distributed under licenses restricting its use, copying, distribution, and decompilation. No part of this product or documentation may be reproduced in any form by any means without prior written authorization of
the copyright holders, or any of the licensors, if any. Any unauthorized use may be a violation of domestic or international law. RESTRICTED RIGHTS LEGEND: Use, duplication, or disclosure by the U.S. Government and its agents is subject to the restrictions of
FAR 52.227-14(g)(2)(6/87) and FAR 52.227-19(6/87), or DFAR 252.227-7015(b)(6/95)
and DFAR 227.7202-3(a).

Disclaimer
This document and its contents are furnished as is for informational purposes only, and
are subject to change without notice. Oracle Corporation (Oracle) does not represent or
warrant that any product or business plans expressed or implied will be fulfilled in any
way. Any actions taken by the user of this document in response to the document or its
contents shall be solely at the risk of the user.
ORACLE MAKES NO WARRANTIES, EXPRESSED OR IMPLIED, WITH RESPECT
TO THIS DOCUMENT OR ITS CONTENTS, AND HEREBY EXPRESSLY DISCLAIMS ANY AND ALL IMPLIED WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR USE OR NON-INFRINGEMENT. IN NO EVENT SHALL
ORACLE BE HELD LIABLE FOR ANY DIRECT, INDIRECT, SPECIAL OR CONSEQUENTIAL DAMAGES IN CONNECTION WITH OR ARISING FROM THE USE
OF ANY PORTION OF THE INFORMATION.

Trademarks
Sun, Sun Microsystems, Java, JavaBeans, and Enterprise JavaBeans are trademarks, registered trademarks, or servicemarks of Sun Microsystems, Inc. in the U.S. and other countries.
OMG, Object Management Group, CORBA, Unified Modeling Language, UML, are registered trademarks or trademarks of the Object Management Group, Inc.
All other product or company names mentioned are for identification purposes only, and
may be trademarks of their respective owners.


1. Overview ..... 1
    1.1 Introduction ..... 1
        1.1.1 Benefits ..... 1
        1.1.2 Target audience ..... 2
        1.1.3 Data analytics JSRs ..... 2
        1.1.4 Exclusions ..... 2
    1.2 Architectural components ..... 3
    1.3 Dependencies and relationships ..... 4
    1.4 Organization ..... 4
    1.5 Expert group members ..... 5
    1.6 Acknowledgements ..... 5

2. Use cases ..... 6
    2.1 Application use cases ..... 6
        2.1.1 Mining GUI ..... 6
        2.1.2 Web specialty retailer ..... 7
        2.1.3 Campaign management ..... 7
        2.1.4 Minimal top level specification ..... 7
        2.1.5 Selecting the best model ..... 8
        2.1.6 Comparing vendor implementations ..... 8
        2.1.7 Incremental learning ..... 8
        2.1.8 Deferred task execution ..... 9
        2.1.9 Explaining model behavior ..... 9
        2.1.10 Manually enhancing a model ..... 9
        2.1.11 OLAP schema refinement ..... 10
        2.1.12 Web services ..... 10
    2.2 Vendor use cases ..... 11
        2.2.1 Broad support of JDM ..... 11
        2.2.2 Narrow support of JDM ..... 12

3. Concepts ..... 13
    3.1 Data mining functions ..... 13
        3.1.1 Classification ..... 13
        3.1.2 Regression ..... 13
        3.1.3 Attribute Importance ..... 14
        3.1.4 Clustering ..... 14
        3.1.5 Association ..... 14
    3.2 Data mining tasks ..... 15
        3.2.1 Building a model ..... 15
        3.2.2 Testing a model ..... 16
        3.2.3 Applying a model ..... 17
        3.2.4 Object import and export ..... 18
        3.2.5 Computing statistics on data ..... 19
        3.2.6 Verifying task correctness ..... 19
    3.3 Principal objects ..... 20
        3.3.1 Connection ..... 20
        3.3.2 Task ..... 20
        3.3.3 Execution handle and status ..... 20
        3.3.4 Physical data set ..... 21
        3.3.5 Physical data record ..... 21
        3.3.6 Build settings ..... 21
        3.3.7 Algorithm ..... 22
        3.3.8 Algorithm settings ..... 22
        3.3.9 Model ..... 22
        3.3.10 Model signature ..... 22
        3.3.11 Model detail ..... 23
        3.3.12 Logical attribute ..... 23
        3.3.13 Logical data ..... 23
        3.3.14 Attribute statistics set ..... 23
        3.3.15 Apply settings ..... 24
        3.3.16 Confusion matrix ..... 24
        3.3.17 Lift ..... 24
        3.3.18 Cost matrix ..... 25
        3.3.19 Prior probabilities ..... 25
        3.3.20 Category sets ..... 26
        3.3.21 Taxonomy ..... 26
        3.3.22 Rules ..... 27
        3.3.23 Verification report ..... 27
    3.4 Physical data representations ..... 27
        3.4.1 Individual record ..... 27
        3.4.2 Single record case table ..... 28
        3.4.3 Multi-record case table ..... 28
        3.4.4 Data preparation ..... 29
    3.5 Attribute mapping ..... 29
        3.5.1 Direct mapping ..... 29
        3.5.2 Pivot mapping ..... 30
    3.6 Creating physical data objects ..... 30
    3.7 Persistence ..... 30
    3.8 Object references ..... 31
    3.9 Reflection / introspection ..... 32

4. Packages ..... 34
    4.1 Design overview ..... 34
    4.2 Notation ..... 34
    4.3 Package structure ..... 36
    4.4 Package javax.datamining ..... 38
    4.5 Package javax.datamining.base ..... 40
    4.6 Package javax.datamining.resource ..... 43
    4.7 Package javax.datamining.data ..... 44
    4.8 Package javax.datamining.task ..... 47
        4.8.1 Package task.apply ..... 49
    4.9 Package javax.datamining.supervised ..... 50
        4.9.1 Package supervised.classification ..... 51
        4.9.2 Package supervised.regression ..... 54
        4.9.3 Package attributeimportance ..... 55
    4.10 Package javax.datamining.association ..... 56
    4.11 Package javax.datamining.clustering ..... 58
    4.12 Package javax.datamining.rule ..... 61
    4.13 Package javax.datamining.statistics ..... 62
    4.14 Package javax.datamining.algorithm ..... 63
        4.14.1 Package algorithm.tree ..... 63
        4.14.2 Package algorithm.naivebayes ..... 64
        4.14.3 Package algorithm.feedforwardneuralnet ..... 65
        4.14.4 Package algorithm.svm ..... 66
        4.14.5 Package algorithm.kmeans ..... 67
    4.15 Package javax.datamining.modeldetail ..... 68
        4.15.1 Package modeldetail.tree ..... 68
        4.15.2 Package modeldetail.feedforwardneuralnet ..... 69
        4.15.3 Package modeldetail.naivebayes ..... 69
        4.15.4 Package modeldetail.svm ..... 70

5. Code examples ..... 71
    5.1 Building a clustering model ..... 71
    5.2 Applying a clustering model to data ..... 73
    5.3 Applying a clustering model to a record ..... 74
    5.4 Building a classification model ..... 75
    5.5 Testing a classification model ..... 76
    5.6 Building and extracting rules from a tree model ..... 77
    5.7 Extracting rules from an association model ..... 79
        5.7.1 Get rules with minimum support ..... 79
        5.7.2 Get rules with minimum support and confidence ..... 79
        5.7.3 Get rules containing certain items ..... 80
        5.7.4 Get rules that do not contain certain items ..... 81
    5.8 Importing and exporting a model ..... 81
        5.8.1 Import an object using a URI ..... 82
        5.8.2 Export a model ..... 83
        5.8.3 Export an object to a destination ..... 83
    5.9 Using reflection ..... 84
    5.10 Establishing a connection ..... 85
    5.11 Uniform resource identifiers ..... 85

6. Conformance statement ..... 87
    6.1 Required and optional features ..... 87
    6.2 Vendor extensions ..... 88
    6.3 Compliance points ..... 88
    6.4 Determining conformance ..... 89
        6.4.1 Function level conformance ..... 89
        6.4.2 Algorithm level conformance ..... 90
        6.4.3 Model apply engine conformance ..... 91
    6.5 Claiming conformance ..... 91

7. Summary ..... 93

Appendix A. Glossary ..... 94

Appendix B. Requirements ..... 102
    B.1. Domain requirements ..... 102
    B.2. Foundation technologies ..... 103
    B.3. Data mining standards ..... 103
    B.4. System behavior ..... 103
    B.5. Exclusions for version 1 ..... 104
        B.5.1. Domain exclusions ..... 104
        B.5.2. System exclusions ..... 104

Appendix C. Optional Methods ..... 106

Appendix D. Exceptions ..... 107

Appendix E. Web services ..... 110
    E.1. Introduction ..... 110
    E.2. Methods ..... 111
        E.2.1. WSDL Document Structure ..... 111
        E.2.2. Listing DME Contents ..... 112
        E.2.3. Introspection / Reflection ..... 114
        E.2.4. Saving objects ..... 115
        E.2.5. Retrieving objects ..... 116
        E.2.6. Removing objects ..... 117
        E.2.7. Renaming objects ..... 118
        E.2.8. Retrieving Object Components ..... 119
        E.2.9. Verify Object ..... 120
        E.2.10. Executing tasks ..... 121
        E.2.11. Getting execution status ..... 123
        E.2.12. Terminating Tasks ..... 123
    E.3. Java methods supporting XML ..... 124
    E.4. XML Schema Definition ..... 125
        E.4.1. JDM Document ..... 125
        E.4.2. Task ..... 125
        E.4.3. Task.Apply ..... 128
        E.4.4. Data ..... 129
        E.4.5. Supervised ..... 132
        E.4.6. Supervised.Classification ..... 133
        E.4.7. Supervised.Regression ..... 135
        E.4.8. Clustering ..... 136
        E.4.9. Association ..... 138
        E.4.10. AttributeImportance ..... 138
        E.4.11. Statistics ..... 139
        E.4.12. Algorithm ..... 140
        E.4.13. Base ..... 143
        E.4.14. Root ..... 145
        E.4.15. Enumeration extension ..... 146

Appendix F. References ..... 148


TABLE 1. An example of a single record case table ..... 28
TABLE 2. An example of a multi-record case table ..... 28
TABLE 3. Named and composite object referencing summary ..... 31
TABLE 4. Function-level model behavior ..... 102
TABLE 5. JDM optional methods for models and model details ..... 106
TABLE 6. JDMException codes and messages ..... 108
TABLE 7. JDM runtime exceptions, codes, and messages ..... 109


FIGURE 1.1 Architecture configuration options ..... 3
FIGURE 1.2 Example of attribute mapping for apply ..... 29
FIGURE 4.2 Top level package structure ..... 37
FIGURE 4.3 Common top level interfaces ..... 38
FIGURE 4.4 Exception classes ..... 38
FIGURE 4.5 Top level enumerations ..... 39
FIGURE 4.6 Execution Handle ..... 39
FIGURE 4.7 Package javax.datamining.base - Named Objects ..... 40
FIGURE 4.8 Package javax.datamining.base - Build Settings, Model, and Task ..... 40
FIGURE 4.9 Package javax.datamining.base - BuildSettings ..... 41
FIGURE 4.10 Package javax.datamining.base - Model ..... 42
FIGURE 4.11 Package javax.datamining.resource ..... 43
FIGURE 4.12 Package javax.datamining.data - PhysicalData ..... 44
FIGURE 4.13 Package javax.datamining.data - LogicalData ..... 45
FIGURE 4.14 Package javax.datamining.data - ModelSignature ..... 45
FIGURE 4.15 Package javax.datamining.data - Taxonomy ..... 46
FIGURE 4.16 Package javax.datamining.data - CategoryMatrix ..... 46
FIGURE 4.17 Package javax.datamining.data - CategorySet and Interval ..... 46
FIGURE 4.18 Package javax.datamining.task - Build ..... 47
FIGURE 4.19 Package javax.datamining.task - Import and Export ..... 48
FIGURE 4.20 Package javax.datamining.task - ComputeStatistics ..... 48
FIGURE 4.21 Package task.apply - ApplyTask and ApplySettings ..... 49
FIGURE 4.22 Package javax.datamining.supervised - Settings and Model ..... 50
FIGURE 4.23 Package javax.datamining.supervised - TestTask and TestMetrics ..... 50
FIGURE 4.24 Package supervised.classification - Settings and Model ..... 51
FIGURE 4.25 Package supervised.classification - TestTask and TestMetrics ..... 52
FIGURE 4.26 Package supervised.classification - ClassificationTestMetricsTask ..... 52
FIGURE 4.27 Package supervised.classification - ApplySettings ..... 53
FIGURE 4.28 Package supervised.classification - Confusion Matrix and Cost Matrix ..... 53
FIGURE 4.29 Package supervised.regression - Settings and Model ..... 54
FIGURE 4.30 Package supervised.regression - TestTask and ApplySettings ..... 54
FIGURE 4.31 Package supervised.regression - RegressionTestMetricsTask ..... 55
FIGURE 4.32 Package javax.datamining.attributeimportance - Settings and Model ..... 55
FIGURE 4.33 Package javax.datamining.associationrules - Settings and Model ..... 56
FIGURE 4.34 Package javax.datamining.associationrules - Rule Selection ..... 57
FIGURE 4.35 Package javax.datamining.clustering - Model ..... 58
FIGURE 4.36 Package javax.datamining.clustering - Settings ..... 59
FIGURE 4.37 Package javax.datamining.clustering - ApplySettings ..... 60
FIGURE 4.38 Package javax.datamining.clustering - Similarity Matrix ..... 60
FIGURE 4.39 Package javax.datamining.rule - Rule and Predicate ..... 61
FIGURE 4.40 Package javax.datamining.statistics - AttributeStatistics ..... 62
FIGURE 4.41 Package algorithm.tree - TreeSettings ..... 63
FIGURE 4.42 Package algorithm.naivebayes - NaiveBayesSettings ..... 64
FIGURE 4.43 Package algorithm.feedforwardneuralnet - FeedForwardNeuralNetSettings ..... 65
FIGURE 4.44 Package algorithm.svm.classification - SVMClassificationSettings ..... 66
FIGURE 4.45 Package algorithm.svm.regression - SVMRegressionSettings ..... 66
FIGURE 4.46 Package algorithm.kmeans - KMeansSettings ..... 67
FIGURE 4.47 Package modeldetail.tree - TreeModelDetail ..... 68
FIGURE 4.48 Package modeldetail.feedforwardneuralnet - NeuralNetworkModelDetail ..... 69
FIGURE 4.49 Package modeldetail.naivebayes - NaiveBayesModelDetail ..... 69
FIGURE 4.50 Package modeldetail.svm - SVMModelDetail ..... 70


1. Overview
1.1 Introduction
The Java Data Mining (JDM) specification addresses the need for a pure Java API to facilitate development of data mining-enabled applications. JDM supports common data mining operations, as well as the creation, persistence, access, and maintenance of metadata
supporting mining activities.
Currently, no existing Java platform specification provides a standard API for data mining
systems. Existing APIs are vendor-proprietary. By using JDM, implementers of data mining applications can expose a single, standard API that will be understood by a wide variety of developers writing client applications and components running on the Java 2
Platform. Similarly, data mining clients can be coded against a single API that is independent of the underlying data mining system. JDM is targeted for the Java 2 Platform,
Enterprise Edition (J2EE) and Standard Edition (J2SE).
In JDM, data mining [Mitchell1997, BL1997] includes the functional areas of classification, regression, attribute importance¹, clustering, and association. These are supported by
such supervised and unsupervised learning algorithms as decision trees, neural networks,
Naive Bayes, Support Vector Machine, K-Means, and Apriori, on structured data. Common operations include model build, test, and apply (score). A particular implementation
of this specification may not necessarily support all interfaces and services defined by
JDM. However, JDM provides a mechanism for client discovery of supported interfaces
and capabilities.
JDM is based on a generalized, object-oriented, data mining conceptual model leveraging
emerging data mining standards such as the Object Management Group's Common Warehouse Metamodel (CWM), ISO's SQL/MM for Data Mining, and the Data Mining Group's
Predictive Model Markup Language (PMML), as appropriate.
Implementation details of JDM are delegated to each vendor. A vendor may decide to
implement JDM as a native API of its data mining product. Others may opt to develop a
driver/adapter that mediates between a core JDM layer and multiple vendor products. The
JDM specification does not prescribe a particular implementation strategy, nor does it prescribe performance or accuracy of a given capability or algorithm.
To ensure J2EE compatibility and eliminate duplication of effort, JDM leverages existing specifications. In particular, JDM leverages the J2EE Connector Architecture [JSR16]
to provide communication and resource management between applications and the services that implement the JDM API. JDM also reflects aspects of the Java Metadata Interface
[JSR40] for the interface specification.

1.1.1 Benefits
The availability of a J2EE-compliant data mining API provides benefit to both vendors
and users of tools and applications in the areas of business intelligence, business analytics,
data mining systems, data warehousing, and life sciences / bioinformatics.
Historically, application developers coded homegrown data mining algorithms into applications, or used sophisticated end-user GUIs. These GUIs packaged a suite of algorithms
complete with support for data transformation, model building, testing, and scoring. However, it was difficult, if not impossible, to embed data mining end-to-end in applications
using commercial data mining products due to inadequate APIs. If a vendor had an API, it
was proprietary, making the development of a product using that API risky. If a different

1. Attribute importance is also referred to as feature selection or key fields analysis.

vendor's solution was required, rewriting that product was also potentially costly.
The ability to leverage data mining functionality via a standard API greatly reduces risk
and potential cost. With a standard API, customers can use multiple products for solving
business problems by applying the most appropriate algorithm implementation without
investing resources to learn each vendor's proprietary API. Moreover, a standard API
makes data mining more accessible to developers while making developer skills more
transferable. Vendors can now differentiate themselves on price, performance, accuracy,
and features. Java Data Mining (JDM) addresses this need for Java.

1.1.2 Target audience


The target audiences for the JDM specification can be categorized into the following
groups:

data mining vendors - companies that intend to implement this API for their respective products, thereby providing the API to end users

application developers - Java programmers who wish to use a data mining API for
building GUIs or other applications that benefit from data mining technology

data mining experts - individuals with advanced degrees in statistics, machine learning, or data mining; or with significant practical data mining experience

data mining novices - Java-knowledgeable developers who have a basic understanding of the problems that data mining can solve, and who can minimally leverage the function level of data mining tasks

1.1.3 Data analytics JSRs


The complement to data mining in data analytics is online analytical processing (OLAP).
To distinguish between OLAP and Data Mining, consider that OLAP follows a deductive
(query-oriented) strategy of analyzing data. Users formulate hypotheses, and execute queries to gain understanding of the underlying data. Data Mining follows an inductive strategy of analyzing data where users apply machine learning algorithms to extract nonobvious knowledge from the data.
JOLAP (JSR-69) specifies a Java API for OLAP and shares a common basis with JDM in the OMG
CWM meta-model. The JDM expert group is working with the JOLAP expert group to
minimize overlap and leverage common modeling techniques and infrastructure where
applicable.

1.1.4 Exclusions
The domain of data mining is quite large. The JDM expert group made decisions early
to exclude certain features from JDM to make it more manageable. As such, functionality
such as data transformations, visualization, mining unstructured data (e.g., text), wrappers
and ensembles, and sensitivity analysis have been omitted from this first version of the
API. Note that with respect to visualization, JDM does provide many of the key data
objects necessary to support visualization, e.g., confusion matrix, lift results, decision tree
representation, and neural network architecture.
From a systems perspective, JDM does not specify behavior for transactions, scheduling,
or security. These are left to vendors to determine what best suits their respective products
and customer base.


1.2 Architectural components


JDM has three logical components that may be implemented as one executable or in a distributed environment:

application programming interface (API) - The API is the end-user-visible component of a JDM implementation that allows access to services provided by the data mining engine (DME). An application developer using JDM requires knowledge only of
the API library, not of other supporting components.

data mining engine (DME) - A DME provides the infrastructure that offers a set of
data mining services to its API clients. When implemented as a server of a client-server architecture, it is referred to as a data mining server (DMS), which is a specific
instantiation of the more general Enterprise Information System (EIS) as specified in
the Connector Architecture (JSR-16).

mining object repository (MOR) - The DME uses a mining object repository which
serves to persist data mining objects. This repository can be based on, e.g., the CWM
framework, specifically leveraging the CWM Data Mining metamodel, or implemented using a vendor-proprietary representation. The MOR may exist in a file-based
environment, or in a relational / object database. Section 3.7 discusses JDM persistence options.
Figure 1.1 depicts three possible architectures for a JDM implementation. In (a), each
component resides in a separate physical location or separate executable. We view this as
a three-tier architecture with the data stored in a separate repository, such as a database. In
(b), the DME contains the MOR and results in a classic client-server architecture. This
scenario is possible, e.g., where the database contains both the DME and MOR, or the
DME uses the local file system for persistent storage. In (c), the system is monolithic,
i.e., API, DME, and MOR reside in, or are managed by, a single executable.

[Figure: panels (a), (b), and (c), each showing the API, DME, and MOR components]

FIGURE 1.1 Architecture configuration options
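
The separation of roles among the three components can be illustrated with a minimal sketch. The types below (SketchRepository, InMemoryRepository, SketchEngine) are hypothetical names invented for this illustration and are not part of the javax.datamining API; the "model" is simply the mean of the input data. The sketch assumes the engine receives its repository through its constructor, so the same engine code could sit in front of an in-process map, as in configuration (c), or a remote store, as in configuration (a).

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical types for illustration only; NOT the javax.datamining API.

// MOR role: persists mining objects by name.
interface SketchRepository {
    void save(String name, Object miningObject);
    Object load(String name);
}

// One possible MOR: a map held in the same process as the engine.
final class InMemoryRepository implements SketchRepository {
    private final Map<String, Object> objects = new HashMap<>();

    public void save(String name, Object miningObject) {
        objects.put(name, miningObject);
    }

    public Object load(String name) {
        return objects.get(name);
    }
}

// DME role: performs mining operations and delegates persistence to the MOR.
final class SketchEngine {
    private final SketchRepository repository;

    SketchEngine(SketchRepository repository) {
        this.repository = repository;
    }

    // Stand-in "model build": stores the mean of the data under the given name.
    void buildModel(String modelName, double[] data) {
        double sum = 0.0;
        for (double x : data) {
            sum += x;
        }
        repository.save(modelName, Double.valueOf(sum / data.length));
    }
}
```

Swapping InMemoryRepository for another SketchRepository implementation would change where objects are persisted without touching the engine code.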


A vendor may choose to provide additional utilities and management interfaces to the
DME and MOR; however, these are not defined as part of JDM and may be proprietary. The
JDM specification does not place any requirements on the DME and MOR design or
implementation except to support functionality as required by the JDM interface.
Vendors may implement a subset of the complete JDM specification as noted in the section on conformance. This a la carte approach provides a common framework for all data
mining functionality, while allowing vendors to support only vendor-relevant portions of
it.

1.3 Dependencies and relationships


JDM leverages aspects of the CWM Data Mining metamodel and the Java Metadata Interface (JSR-40). CWM Data Mining facilitates the construction and deployment of data
warehousing and business intelligence applications, tools, and platforms based on OMG
open standards for metadata and system specification (i.e., MOF, UML, XMI, CWM). The
Java Metadata Interface provides a common naming convention for methods.
The following specifications serve as design references for JDM:

DMG PMML 2.0 [PMML] provides an XML-based representation for mining models and facilitates interchange among vendors for model results.

ISO SQL/MM Part 6: Data Mining [SQL/MM-DM] provides a standard interface to
RDBMSs for performing data mining. Concepts from this approach are leveraged in
the overall JDM design.

Common Warehouse Metamodel [CWM] and CWM Specification, Volume 1, Chapter
15, Data Mining [CWM-DM] provide a sense of the overall structure of the metadata
JDM supports.

1.4 Organization
This document focuses on JDM requirements, concepts, use cases, code examples, packages supporting the API, and vendor conformance.
In Section 2, we present use cases to help the reader appreciate how this API can be used
under various circumstances, both by end users and vendors conforming to the standard.
In Section 3, we present the synthesis of data mining concepts that form the basis of the
JDM model. These concepts result from analyzing the requirements of many different data
mining functions and algorithms. These concepts are key to providing a unified data mining framework.
In Section 4, we present the JDM packages and class diagrams to illustrate the relationship between the various interfaces and classes. Details of each class are provided in the
companion Javadoc-generated documentation.
In Section 5, we provide and explain code examples using the JDM API. These examples
represent working with the API as a non-data mining expert, relying on convenience routines to automate much of the specification, as well as exposing detailed specification for
data mining experts.
In Section 6, we present the requirements for vendor conformance to the API.
In Section 7, we summarize our JDM experience and where the standard is likely to go in
subsequent versions.
In Appendix A, we provide a glossary of terms used in this document.
In Appendix B, we review the data mining domain requirements and foundation technologies driving the API. We explore related data mining standards and common system
behavior.
In Appendix C, we list optional methods for models and model detail a vendor may choose
to implement.
In Appendix D, we provide JDM error codes for JDMException.
In Appendix E, we define Web services based on the JDM model. There has been significant interest expressed within the expert group and from external comments for defining a
JDM Web services interface.
In Appendix F, we provide a list of references.

1.5 Expert group members


Sarabjot Anand            Corporate Intellect
Robert Brunner            California Institute of Technology
Robert Chu                SAS Institute
Werner Dubitzky*          University of Ulster, N. Ireland
Kim Horn                  Sun Microsystems, Inc.
Mark Hornick              Oracle Corporation
Bill Hosken*              SPSS, Inc.
Ronny Kohavi*             Blue Martini Software
Achim Kraiss              SAP AG
Marwane Jay Lamimi        KXEN
Christoph Lingenfelder    IBM Germany
Erik Marcade              KXEN
Somesh Marepalli          Computer Associates International, Inc.
Waddys Martinez*          Magnify
Cindy McMullen            BEA Systems
Chuck Mosher              Sun Microsystems, Inc.
John Poole                Hyperion Solutions
Michal Prussak            Fair Isaac
Alex Russakovskii         Hyperion Solutions
Mike Smith                Strategic Analytics
Qian (Cherry) Yang        Computer Associates International, Inc.
Sunil Venkayala           Oracle Corporation
Andrew Walaszek           SPSS, Inc.
Hankil Yoon               Oracle Corporation

* former member

1.6 Acknowledgements
The expert group recognizes and thanks Dipankar Roy and Shiby Thomas for reviewing
previous drafts. We also recognize and thank Marcos Campos, Gary Drescher, Boriana
Milenova, Joe Yarmus, and Yan Zhuang for their contributions to the JDM effort.


2. Use cases
The use cases presented in this section provide a context in which to understand the possible uses of JDM. We have divided use cases into two categories: those relevant to applications and those relevant to vendors implementing JDM-conforming products. Readers
already familiar with data mining may want only to browse this section.
Several JDM concepts are introduced briefly below to assist in understanding the use
cases. These are described in more detail in Section 3. The reader is expected to be familiar with common data mining terminology.
Mining Function - A major subdomain of data mining that shares common high level
characteristics. Functions include classification, regression, attribute importance, association, and clustering.
Task - A container within which to specify arguments to data mining operations to be performed by the data mining engine. Tasks include model building, testing, applying (scoring), computing statistics, and object import and export. Tasks may execute synchronously
or asynchronously.
Settings - A collection of parameters specifying the input for building a data mining
model or applying a model to data (i.e., scoring). Build settings may be high level, specified for mining functions, or detailed, specified for mining algorithms. Apply settings
specify the content of the scoring result, and in some cases, affect the type of content provided. For example, a cost matrix may be specified for classification at apply time.
Model - An algorithm often produces a compressed representation of input data called a
model. This model contains the essential knowledge extracted from the data as determined
by the algorithm. A model can be descriptive or predictive. A descriptive model helps in
understanding the underlying data or model behavior. For example, an association rules
model on market basket data can be used to describe consumer behavior. A predictive
model can be an equation or set of rules that makes it possible to predict an unseen or
unknown value (the dependent variable or target) from other, known values (independent
variables or predictors).
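
How these four concepts fit together can be sketched in a few lines of self-contained Java. The types below (SketchFunction, SketchSettings, SketchModel, SketchBuildTask) are hypothetical stand-ins invented for this illustration; they are not the javax.datamining interfaces, and the "algorithm" is a trivial majority-class predictor chosen only to keep the example short.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-ins for illustration only; NOT the javax.datamining API.
enum SketchFunction { CLASSIFICATION, REGRESSION, ATTRIBUTE_IMPORTANCE, ASSOCIATION, CLUSTERING }

// Settings: parameters describing what to build (function-level only here).
final class SketchSettings {
    final SketchFunction function;
    final String targetAttribute;

    SketchSettings(SketchFunction function, String targetAttribute) {
        this.function = function;
        this.targetAttribute = targetAttribute;
    }
}

// Model: a compressed representation of the build data (here, a single value).
final class SketchModel {
    final String predictedValue;

    SketchModel(String predictedValue) {
        this.predictedValue = predictedValue;
    }

    // Predictive use: this toy model returns the same value for every record.
    String apply(Map<String, String> record) {
        return predictedValue;
    }
}

// Task: a container for the arguments of one mining operation.
final class SketchBuildTask {
    private final SketchSettings settings;
    private final List<Map<String, String>> buildData;

    SketchBuildTask(SketchSettings settings, List<Map<String, String>> buildData) {
        this.settings = settings;
        this.buildData = buildData;
    }

    // Synchronous execution: counts target values and keeps the most frequent one.
    SketchModel execute() {
        Map<String, Integer> counts = new HashMap<>();
        String best = null;
        for (Map<String, String> row : buildData) {
            String value = row.get(settings.targetAttribute);
            int n = counts.merge(value, 1, Integer::sum);
            if (best == null || n > counts.get(best)) {
                best = value;
            }
        }
        return new SketchModel(best);
    }
}
```

In this sketch the settings say what to build, the task bundles the settings with the data and performs the operation, and the resulting model can then be applied to new records.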

2.1 Application use cases


In this section, we present several end-user use cases involving application developers that
explore a wide variety of situations in which JDM can be used.

2.1.1 Mining GUI


A team of developers is tasked with producing a GUI for visualizing data mining objects.
They use JDM to develop a tool for exposing objects for building models such as build
settings, and viewing model representations or contents. The models themselves include
decision trees, neural networks, and mining results such as confusion matrices and lift.
Decision trees can be traversed and graphically displayed in a tree representation; neural
networks can be traversed and graphically displayed to show hidden layers and weights on
connections. The GUI also supports scoring data, testing models, computing lift and
graphically displaying lift charts.
In this use case, a JDM implementation provides the enabling data mining functionality. If
only standard JDM features are leveraged, this GUI could be portable across vendor JDM
implementations.

2.1.2 Web specialty retailer


A specialty retailer sells from a website, catalogs, and stores. The website has a recommendation feature that is supported by data mining. Customer data are collected from each
of the company's points of sale into its data warehouse. Sales data are combined with demographic data such as age, gender, and income. Demographic data together with product
categories are regularly mined for customer 'clusters' using a clustering algorithm. Product
sales data are then partitioned by customer cluster and each cluster is mined for product
associations using association rules algorithms. The website uses the resulting association
rules to make online product recommendations with each addition to the customer's virtual shopping cart.
In this use case, multiple JDM mining functions are leveraged: clustering, association, and
the ability to score individual records to support online product recommendations.
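
The recommendation step in this use case can be sketched as follows. This is a minimal illustration under stated assumptions, not a description of how any JDM implementation scores records: the SketchRule and SketchRecommender types are hypothetical, each rule is assumed to have a set of antecedent items, a single consequent item, and a confidence, and recommendations are simply the consequents of rules whose antecedents are all in the cart, ordered by descending confidence.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical rule representation for illustration; not a JDM type.
final class SketchRule {
    final Set<String> antecedent;
    final String consequent;
    final double confidence;

    SketchRule(Set<String> antecedent, String consequent, double confidence) {
        this.antecedent = antecedent;
        this.consequent = consequent;
        this.confidence = confidence;
    }
}

final class SketchRecommender {
    // Returns products to recommend for the cart, highest-confidence rule first.
    static List<String> recommend(List<SketchRule> rules, Set<String> cart) {
        List<SketchRule> matching = new ArrayList<>();
        for (SketchRule rule : rules) {
            // A rule applies when the cart holds every antecedent item
            // and does not already contain the recommended product.
            if (cart.containsAll(rule.antecedent) && !cart.contains(rule.consequent)) {
                matching.add(rule);
            }
        }
        matching.sort((x, y) -> Double.compare(y.confidence, x.confidence));

        // LinkedHashSet removes duplicate products while keeping the ranking order.
        Set<String> ordered = new LinkedHashSet<>();
        for (SketchRule rule : matching) {
            ordered.add(rule.consequent);
        }
        return new ArrayList<>(ordered);
    }
}
```

In the retailer scenario, one such rule list would exist per customer cluster, and the list for the shopper's cluster would be consulted each time an item is added to the cart.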

2.1.3 Campaign management


A campaign management application provides automated support for identifying customers to receive a marketing campaign. The application has access to data collected on customer demographics and responsiveness to such mailing campaigns. This application
leverages database vendor-specific transformations to prepare data for mining.
Using the mining function attribute importance (also referred to as feature selection), the
application determines which attributes are most relevant for model building. By using a
smaller set of attributes, model build time can be reduced, model predictive accuracy can
increase, and the attributes most valuable to collect from customers can be highlighted.
The application uses a decision tree algorithm to produce rules that can be understood by
the marketing manager, possibly for developing more targeted mailings to customers of a
given set of demographics. Once the model is built, the application tests the model and
sends the test and lift results to the campaign manager, who can assess model quality and
expected results.
Unless directed otherwise, the application uses this model to score new customers eligible
for a new mailing campaign. Those customers who have a probability greater than 75% to
respond to the mailing will be selected for the mailing.
In this use case, data preprocessing may occur outside JDM using proprietary or ad hoc
techniques. Multiple JDM mining functions and operations are leveraged through task
specification. To communicate models and results to other users, these objects can be
exported, perhaps using an XML representation. JDM's flexible apply settings allow the
application to specify the score, probability, customer id, and possibly other input data to
be part of the apply result table. Finally, JDM's rule representation and the ability of certain algorithms to produce rules are leveraged to explain model behavior. Note that JDM
defines predicate-based rules from the decision tree algorithm for either classification or
regression mining functions, and the clustering mining function for the K-Means algorithm.
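
The final selection step of this use case, choosing customers whose response probability exceeds 75%, can be sketched as a filter over apply results. The ScoredCustomer and CampaignSelector types are hypothetical and stand in for rows of an apply result table; the field names are assumptions made for this illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical apply-result row; field names are illustrative, not JDM's.
final class ScoredCustomer {
    final String customerId;
    final double responseProbability;

    ScoredCustomer(String customerId, double responseProbability) {
        this.customerId = customerId;
        this.responseProbability = responseProbability;
    }
}

final class CampaignSelector {
    // Returns the ids of customers whose probability strictly exceeds the threshold.
    static List<String> select(List<ScoredCustomer> scored, double threshold) {
        List<String> ids = new ArrayList<>();
        for (ScoredCustomer customer : scored) {
            if (customer.responseProbability > threshold) {
                ids.add(customer.customerId);
            }
        }
        return ids;
    }
}
```

Calling CampaignSelector.select(results, 0.75) keeps only customers strictly above the cutoff, matching the "greater than 75%" wording of the use case.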

2.1.4 Minimal top level specification


A college student learned about the potential of data mining to solve many problems. For
her senior biology thesis, she wants to cluster the data she's collected over the past year on
wild grasses of the African plains to help her categorize those grasses.


Although an avid Java programmer, she is unfamiliar with the details of data mining. Having read about JDM and having access to a commercial implementation through her
school, she leverages all the automated aspects of JDM, specifying only the data and
accepting all default settings for the Clustering build settings. In this way, no algorithm
selection is necessary, nor any algorithm-specific settings.
She uses the API for the clustering model for inspecting the identified clusters.
In this use case, JDM allows novice users to extract benefit from data mining technology
by eliding algorithm details. Vendor implementations may vary in the degree of automation and the quality of models that automation produces.

2.1.5 Selecting the best model


An e-tailer builds models on projected customer revenue from which to base providing
customer discounts. The data analyst for the e-tailer builds multiple regression models
drawing on several algorithms: neural networks, decision trees, and Naive Bayes. After
building several models of each type, the models are tested against held-aside test data and
lift is computed. An initial criterion for selecting the best model is the one with the least
r-squared error.
In this use case, the data analyst leverages a JDM implementation's ability to reuse a single regression build settings object, supplying different algorithm settings. In addition,
each model can be tested by defining test tasks, and coding an outer loop to iterate over the
test results to identify the best model.
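
The outer loop over test results can be sketched as below. This is an illustration only: BestModelSelector is a hypothetical helper, and the test results are assumed to have been reduced to one error value per model name, where a lower value is better.

```java
import java.util.Map;

// Hypothetical helper for illustration; not part of the JDM API.
final class BestModelSelector {
    // testErrors maps a model name to its test error (lower is better).
    static String selectBest(Map<String, Double> testErrors) {
        String bestName = null;
        double bestError = Double.POSITIVE_INFINITY;
        for (Map.Entry<String, Double> entry : testErrors.entrySet()) {
            if (entry.getValue() < bestError) {
                bestError = entry.getValue();
                bestName = entry.getKey();
            }
        }
        return bestName; // null when no results were supplied
    }
}
```

A different selection criterion would change only the comparison inside the loop.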

2.1.6 Comparing vendor implementations


Data Mining Laboratories (DML) performs independent analysis on data mining software
to measure performance, ease of use, and model portability. DML compares the effectiveness of several vendors' regression decision tree implementations in building models for
economic forecasting. Economic forecasts are used in corporate planning to align corporate strategy with the expected economic climate. Using JDM, the DML developers code a
test application that builds one decision tree model per vendor implementation. After
testing each model, the investigators rank order models according to forecast accuracy,
learning time and the ratio of these two. To ensure fairness in assessing model performance and conformance for model portability, a separate scoring engine is used that
accepts PMML standard XML models and generates scores for the test data.
In this use case, the developers are able to code a single program and execute on multiple
vendor implementations, modifying only login information. By exporting models in
PMML format, models can be objectively assessed in a common scoring engine.

2.1.7 Incremental learning


A machine tool manufacturer collects data on the machine settings, materials, and defect
rates for the tools manufactured. These data are provided to a neural network algorithm to
predict the probability of defective components in a given batch of product. Because data
are collected over time, and the architecture of the neural network and specific learning
algorithm chosen is compute-intensive, the manufacturer needs to apply incremental learning on the neural network as new data become available from each production run.
In this use case, JDM provides an interface that enables incremental learning, i.e., the ability to continue building a model with the original build data or new data. To support this, a
user specifies an existing JDM model as input to a build task, along with other required
inputs. On execution of the task, the DME uses this model as a seed from which to continue building the model. This optional specification can be used for any type of algorithm
that can leverage a seed model.
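
The idea of a seed model can be illustrated with a deliberately simple stand-in. The sketch below does not use a neural network; it substitutes a running mean, whose state (count and mean) is small enough to show how a build can continue from an existing model instead of starting over. MeanModel and IncrementalBuilder are hypothetical names, not JDM types.

```java
// Hypothetical model for illustration: the running mean of all data seen so far.
final class MeanModel {
    final long count;
    final double mean;

    MeanModel(long count, double mean) {
        this.count = count;
        this.mean = mean;
    }
}

final class IncrementalBuilder {
    // Builds from scratch when seed is null; otherwise continues from the seed.
    static MeanModel build(MeanModel seed, double[] newData) {
        long n = (seed == null) ? 0 : seed.count;
        double mean = (seed == null) ? 0.0 : seed.mean;
        for (double x : newData) {
            n++;
            mean += (x - mean) / n; // incremental update; earlier data not needed
        }
        return new MeanModel(n, mean);
    }
}
```

Building on the first production run and then continuing with the second yields the same mean as a single build over both runs, without revisiting the first run's data.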

2.1.8 Deferred task execution


A cancer researcher, who has limited access to hardware for building and testing models,
needs to define and verify a series of mining tasks and store them in the mining object
repository. The researcher may even build trial models on very small datasets as part of
verifying the task. Using an external scheduling mechanism, such as UNIX cron jobs, the
researcher schedules execution of these tasks overnight, when computing resources are
more available.
In this use case, the researcher uses JDM's task specification and ability to store objects in
the mining object repository. These can later be retrieved for execution. The verify method
gives the researcher greater confidence that the tasks will execute to completion. The
verify method typically checks if the logical and physical data map properly and if the
combination of settings specified are compatible.
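
The two kinds of checks described above can be sketched as follows. This is an illustration under stated assumptions, not the behavior of any particular JDM verify implementation: TaskVerifier is a hypothetical helper, data mapping is reduced to matching attribute names against column names, and the single compatibility rule shown (k-means is valid only for clustering) is an example chosen for this sketch.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Hypothetical verifier for illustration; not part of the JDM API.
final class TaskVerifier {
    // Returns a list of problems; an empty list means verification passed.
    static List<String> verify(List<String> logicalAttributes,
                               Set<String> physicalColumns,
                               String function,
                               String algorithm) {
        List<String> problems = new ArrayList<>();

        // Check 1: every logical attribute must map to a physical column.
        for (String attribute : logicalAttributes) {
            if (!physicalColumns.contains(attribute)) {
                problems.add("no physical column for logical attribute: " + attribute);
            }
        }

        // Check 2: example compatibility rule (an assumption of this sketch).
        if ("kmeans".equals(algorithm) && !"clustering".equals(function)) {
            problems.add("algorithm " + algorithm
                    + " is incompatible with function " + function);
        }
        return problems;
    }
}
```

Running such checks when a task is defined lets the researcher catch specification errors before the task is scheduled for overnight execution.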

2.1.9 Explaining model behavior


A bank leverages data mining to predict credit risk for customers seeking home equity
loans. To comply with government regulations to not discriminate based on gender or
race, the bank must be able to prove that the rules they apply to determine credit risk exclude
such criteria.
The bank's data analyst is required to produce a set of human-understandable rules, ideally
in English-like format, that can be presented to government auditors as needed. Bank management also reviews these rules to target certain customer segments for special promotions.
In this use case, the analyst uses the JDM tree settings to request a decision tree representation for a classification model, predicting credit worthiness as low, medium, or high. The
analyst then uses JDM's interface to generate rule objects from the decision tree model
and translate these rules to a particular format. A given vendor may have an English format
implemented for rules.
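
How rules can be derived from a decision tree may be sketched as a traversal that collects the predicates along each root-to-leaf path. The SketchTreeNode and RuleExtractor types below are hypothetical and far simpler than JDM's tree and rule objects; the predicates are plain strings, and the output format is an assumption made for this illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical tree node for illustration; not a JDM type.
final class SketchTreeNode {
    final String predicate;   // condition leading into this node; null at the root
    final String prediction;  // set on leaves only
    final List<SketchTreeNode> children = new ArrayList<>();

    SketchTreeNode(String predicate, String prediction) {
        this.predicate = predicate;
        this.prediction = prediction;
    }

    SketchTreeNode add(SketchTreeNode child) {
        children.add(child);
        return this;
    }
}

final class RuleExtractor {
    // Produces one English-like rule per leaf of the tree.
    static List<String> rules(SketchTreeNode root) {
        List<String> out = new ArrayList<>();
        walk(root, new ArrayList<String>(), out);
        return out;
    }

    private static void walk(SketchTreeNode node, List<String> path, List<String> out) {
        if (node.predicate != null) {
            path.add(node.predicate);
        }
        if (node.children.isEmpty()) {
            String condition = path.isEmpty() ? "TRUE" : String.join(" AND ", path);
            out.add("IF " + condition + " THEN creditworthiness = " + node.prediction);
        } else {
            for (SketchTreeNode child : node.children) {
                walk(child, path, out);
            }
        }
        if (node.predicate != null) {
            path.remove(path.size() - 1); // backtrack before visiting siblings
        }
    }
}
```

Because every rule lists all attributes tested on its path, an auditor can check directly whether excluded criteria such as gender or race appear in any rule.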

2.1.10 Manually enhancing a model


A private security agency builds decision tree models to profile suspicious individuals and
identify individuals at airports for further screening. However, in their experience they
have found manually enhancing a model can improve its performance and accuracy. Their
data mining analyst builds a model, generates an English-like representation of the rules,
removes certain irrelevant rules and possibly even adjusts some of the rule predicates.
Importing this modified model to the data mining system, the analyst sets up an application to enable profiling by leveraging single record scoring of individuals, accessing information stored in government databases and information obtained from travelers at the
airport.
In this use case, the analyst also uses the JDM tree settings to build a classification model.
The rules are generated from the decision tree model and analyzed. However, since JDM
does not enable direct model modification via the API, the analyst can export the model,
perhaps in PMML, to ensure model integrity. The analyst modifies the model and attempts
to import the model. Validation of the manually modified model occurs at import. JDM's
support for single record scoring enables the analyst to produce an application that joins
information stored in a database about individuals with that dynamically acquired by airport personnel, perhaps at the ticket counter.

2.1.11 OLAP schema refinement


An OLAP vendor creates cube schemas from fact tables stored in a relational database. A
particular fact table contains millions of records representing sales and customer information of a beverage retail company. The OLAP vendor needs to create a schema for the
OLAP cube to enable analyzing and reporting the retailer's sales data.
A cube schema is a set of dimensions each having a particular hierarchy of attributes.
Dimensions usually correspond to several columns in the fact table; however, not all columns should necessarily produce a dimension. A dimension normally represents an
attribute that is orthogonal to other dimensions in the fact table. In addition, some of the
columns, identified in advance, represent measures in the model.
Choosing the right set of dimensions is key to OLAP providers. If the number of dimensions is too large, efficient processing of the cube becomes practically impossible. On the
other hand, dropping important attributes makes data analysis deficient. Poor cube design
is one of the factors that inhibit OLAP productivity. Therefore it is important to choose the
right schema.
The optimization of a cube structure can be seen from two different perspectives.
Starting from a fact table with hundreds of columns, OLAP vendors are interested
in either:

identifying truly independent columns, or

identifying the important columns to be kept in the optimized cube structure.
Attribute importance can be used to select the most important independent columns to better see a given measure. For example, an internal mechanism can build an analytic data
set with columns describing both customer characteristics and product characteristics with
the sales amount as a target. Then this system trains an attribute importance model on this
data set. It returns the columns (either describing some aspects of the customer or the
product) that best explain the spread of the average sales figure. Some advanced
systems can return not only the important columns but also the drilling hierarchies
that can be associated with these columns (segments for continuous variables and groups
of categories for discrete variables). These important columns will be used to create an
optimized (possibly ad hoc) cube structure that the end user will use to better understand the average sales figure and to build segments that combine the most explanatory customer or product characteristics.
Such schema refinements are intractable in large cubes without data mining.
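
As a rough illustration of ranking columns by how well they explain a measure, the sketch below scores a categorical column by the between-group variance of the measure, i.e., how far the per-category averages spread around the overall average. This is a simple proxy chosen for this example and is not the method of any JDM attribute importance algorithm; ColumnImportance is a hypothetical helper.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper for illustration; a simple proxy, not a JDM algorithm.
final class ColumnImportance {
    // Weighted variance of per-category means of the measure around the overall mean.
    static double betweenGroupVariance(String[] column, double[] measure) {
        Map<String, double[]> groups = new HashMap<>(); // category -> {sum, count}
        double total = 0.0;
        for (int i = 0; i < column.length; i++) {
            double[] g = groups.computeIfAbsent(column[i], k -> new double[2]);
            g[0] += measure[i];
            g[1] += 1.0;
            total += measure[i];
        }
        double overallMean = total / column.length;

        double between = 0.0;
        for (double[] g : groups.values()) {
            double diff = g[0] / g[1] - overallMean;
            between += g[1] * diff * diff;
        }
        return between / column.length;
    }
}
```

A column whose categories have clearly different average sales scores high and is a candidate dimension; a column whose categories all share the same average scores zero.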

2.1.12 Web services


List Inc. offers a comprehensive list management service that includes data warehousing,
grooming, merging and predictive modeling. All their services are available as Web services allowing customers to integrate List Inc.'s software seamlessly into their own enterprise systems using the Internet.


The Data Web service allows customers to connect to a managed warehouse and store
their transaction, customer and sales data using a secure Web service interface. List Inc.
manages the customer data in its data warehouse, cleans and grooms the data, and provides a range of preprocessing and transformation facilities. They maintain a comprehensive repository of high quality background data including income, census, and
demographic and geographic data. List Inc. has relationships with many data vendors and
can call upon their services when required. This background data is merged with the customer data using their proprietary merge technology.
List Inc. offers a complete model training and testing facility that guarantees optimal
results. The customer data is used to build predictive models to determine the best
responders, to build cross-sell and up-sell models, and to investigate return on investment (ROI). List
Inc. has a comprehensive testing facility that can choose the algorithm and product
combination that delivers the optimal ROI. The customer does not have to worry about
data mining tool integration, training and testing.
The customer decides only on the schedule for updating models and the ROI they require.
List Inc. owns two supercomputers to provide the fastest modeling facilities available
today.
JDM is critical to List Inc.'s services. The Prediction Web service wraps JDM to allow the
customer to apply models. The Training Web service wraps JDM to allow the customer to
build models and set parameters. JDM is used internally to connect to different vendor
data mining tools and algorithms in their building and testing processes.
The Training Web service can be used by both novice customers and experienced data
analysts. Mining-savvy data analysts can tailor the training process by choosing particular
algorithms and their settings. In addition, they can choose the attributes from their data
they wish to include in models.
The Prediction Web service provides access to the resultant models across the net. The
Prediction Web service interface is called with new prospect data and the score outcome
returned. The service allows customers to enhance their software systems and their own
web sites with predicted outcomes as if they owned the data mining tools themselves.

2.2 Vendor use cases


In this section, we present several use cases that explore how vendors can leverage JDM in
commercial JDM implementations.

2.2.1 Broad support of JDM


A data mining vendor has a wide range of algorithms that address each of the JDM mining functions. The vendor's objective is to simplify mining for unsophisticated users. As
such, the vendor provides automated selection of algorithms without requiring (or allowing) the user to select specific algorithms or provide algorithm-specific controls, e.g., maximum tree depth in a decision tree.
In this use case, the vendor must implement all packages of the API except Algorithm subclasses and model detail subclasses. Users of the vendor's data mining product will specify build settings only, obtain models, and be able to view and use those models as
appropriate. Note that the end-user can see only the function-specific model representations, not their underlying algorithm-specific model representations.

2.2.2 Narrow support of JDM


A data mining vendor, Neural Networks, Inc. (NNI), supports various neural network algorithms, both published and proprietary, in its data mining tool. NNI supports both classification and regression. The vendor chooses to be JDM compliant to gain acceptance in
the marketplace.
In this use case, JDM, as an a la carte standard, allows a vendor to implement a narrow
portion of the standard to reflect its specific domain, or subset of mining functions supported. The JDM packages to support this include the core foundation packages and a
select few specific to neural networks including algorithm settings and model detail.
For the vendor's proprietary algorithms, an additional Java package,
nni.feedforwardneuralnetwork, is provided, which includes the specific proprietary algorithm settings and model representations.

3. Concepts
In this section, we introduce JDM concepts: mining function, task, principal objects, physical data representations, attribute mapping, physical data storage, object references, and
reflection and introspection.

3.1 Data mining functions


In general, data mining functions can be classified into two categories: supervised and
unsupervised. Supervised functions are typically used to predict a value and require the
specification of a known outcome or target for each case to be used during model building. Examples of targets include binary attributes indicating buy/no-buy, churn/no-churn, or
success/failure, and multi-class attributes indicating preferred color choice from among
the primary colors or likely salary range binned in $20,000 increments. The target allows the
algorithm to determine how well it is predicting target values. An example of a supervised
learning algorithm is Naive Bayes for classification.
Unsupervised functions do not use a target, and are typically used to find the intrinsic
structure, relations, or affinities in a body of data. Examples of unsupervised learning
algorithms include k-means clustering and Apriori association. Clustering may be used to
identify naturally occurring groups of proteins among hundreds of cases, or retail customer segmentation. The itemset rules returned from an association model can be used to
identify products to cross-sell to retail customers.
Another view of mining involves whether data mining is descriptive or predictive.
Descriptive data mining describes a dataset in a concise and summary manner, and presents interesting general properties of the data. Algorithms supporting descriptive data
mining include k-means clustering, Apriori association, and even decision tree classification. Predictive data mining constructs one or a set of models, performs inference on the
available dataset, and attempts to predict outcomes for new data sets. Algorithms supporting predictive data mining include neural networks, SVM, decision tree classification and
regression, and even k-means clustering when used to assign new records to clusters.
Different algorithms serve different purposes, each algorithm offering its own advantages
and disadvantages. JDM specifies the following mining functions: classification, regression, attribute importance, clustering, and association. Some algorithms can be used
across multiple data mining functions.

3.1.1 Classification
Classification has been used in customer segmentation, business modeling, and credit
analysis. As a type of supervised learning, an algorithm supporting classification builds a
model from a set of predictors that are used to predict a target. A set of predictors may
include demographic data such as age, income, number of children, and zip code, to predict the binary target buy/no-buy a minivan. The input or build data for a supervised learning algorithm requires the presence of attributes for both predictors and target in each
case. Given a pre-determined set of classes in the target attribute, classification analyzes
the build data to create a model that can predict to which class a given case belongs.

3.1.2 Regression
Regression has been used in financial forecasting, time series prediction, biomedical and
drug response modelling, and environmental modelling. Also a type of supervised learning, regression involves predicting a continuous, numerical valued target attribute given a
set of predictors. A regression problem may use the same predictors as a classification
problem, but specifies a target such as the predicted lifetime value of a customer.

3.1.3 Attribute Importance


Attribute importance is used to determine which attributes are most relevant for building a
model. Attribute importance can be used for both supervised and unsupervised learning.
Attribute importance enables users to reduce model build time, and in some algorithms,
reduce data scoring time by including only the most important attributes from the build
data. Eliminating noise attributes from data can also improve accuracy or model quality.
Attribute importance serves a purpose similar to feature selection. It produces a model that
ranks attributes according to how each attribute contributes to the quality of a model built.
From the ranking of attributes, users may select the attributes to be used in building models. The user can specify a number or percentage of attributes to use; alternatively a user
can specify a cutoff point. Note that the ranking of attributes is usually interpretable only in a
relative sense. JDM specifies no precise interpretation of attribute rank values other than
that attributes with a greater numeric value are relatively more important.
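The two selection styles mentioned above (a fixed number of attributes, or a cutoff point) can be sketched as follows. This is a minimal illustration that assumes the importance values have already been read out of an attribute importance model into a map; the class `AttributeSelector` and its methods are hypothetical helpers and are not part of the JDM API.

```java
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

/** Illustrative helper only; not a JDM interface. */
final class AttributeSelector {

    /** Returns the names of the n highest-valued attributes, most important first. */
    static List<String> topN(Map<String, Double> importance, int n) {
        return importance.entrySet().stream()
                .sorted(Map.Entry.<String, Double>comparingByValue(Comparator.reverseOrder()))
                .limit(n)
                .map(Map.Entry::getKey)
                .collect(Collectors.toList());
    }

    /** Returns the attributes whose value is at least the cutoff, most important first. */
    static List<String> aboveCutoff(Map<String, Double> importance, double cutoff) {
        return importance.entrySet().stream()
                .filter(e -> e.getValue() >= cutoff)
                .sorted(Map.Entry.<String, Double>comparingByValue(Comparator.reverseOrder()))
                .map(Map.Entry::getKey)
                .collect(Collectors.toList());
    }
}
```

Because the values are meaningful only relative to one another, a cutoff chosen for one model should not be assumed to carry over to a model from a different vendor or algorithm.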

3.1.4 Clustering
Clustering has been used in customer segmentation, gene and protein analysis, product
grouping, finding numerical taxonomies, and text mining. Clustering analysis identifies
clusters embedded in the data, where a cluster is a collection of data objects that are similar to one another. A good clustering method produces high quality clusters to ensure that
the inter-cluster similarity is low and the intra-cluster similarity is high. The similarity of
two values of an attribute can be expressed as a distance function. For numeric data, this
can be as simple as the Euclidean distance between points. For categorical data, similarity
can be expressed to make "married" and "cohabiting" closer to one another, as well as "separated" and "divorced".
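As a rough illustration of the two kinds of distance described above, the sketch below computes a Euclidean distance for numeric points and looks up a hand-assigned distance for marital status categories. The class name and the specific distance values (0.2 for related categories, 1.0 otherwise) are assumptions made for this example; JDM does not prescribe them.

```java
import java.util.Map;

/** Illustrative distance functions; the category distances are made-up values. */
final class Similarity {

    /** Euclidean distance between two points of equal dimension. */
    static double euclidean(double[] a, double[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("dimension mismatch");
        }
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return Math.sqrt(sum);
    }

    // Hypothetical distances that place related categories close together.
    private static final Map<String, Double> RELATED = Map.of(
            "married|cohabiting", 0.2,
            "separated|divorced", 0.2);

    /** Distance between two marital status values: 0 if equal, small if related, 1 otherwise. */
    static double maritalDistance(String a, String b) {
        if (a.equals(b)) {
            return 0.0;
        }
        Double d = RELATED.get(a + "|" + b);
        if (d == null) {
            d = RELATED.get(b + "|" + a);   // the table is symmetric
        }
        return d != null ? d : 1.0;
    }
}
```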

3.1.5 Association
Association has been used in market basket analysis and the analysis of consumer behavior for the discovery of relationships or correlations among a set of items, e.g., the presence of one pattern implies the presence of another pattern. Association rules help to identify the
attribute value conditions that occur frequently together in a given set of data. Association
analysis is widely used in transaction data analysis for directed marketing, catalog design,
and other business decision-making processes. Traditionally, association is used for market
basket data analysis, producing rules such as "90% of the people who buy milk also buy bread."
Support and confidence metrics are used as a quality measure of a rule within an association model. These are available in JDM as part of the Association model for each rule produced. Note that the rules returned from an association model are different from the
predicate-based rules produced from clustering models or decision tree models. Here, the
rules consist of a set of items. These items typically occur together in a single transaction,
such as the items purchased at an online retail checkout.
The support of a rule is used to ensure that the items associated in the rule occur
together frequently enough to be considered significant. Using probability notation,
support(A → B) = P(A, B)

The confidence of a rule is the conditional probability of B given A:
confidence(A → B) = P(B | A) = P(A, B) / P(A)
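The two definitions can be checked on a small set of transactions. The sketch below estimates the probabilities as relative frequencies over the transactions; the class `RuleMetrics` is a hypothetical helper written for this example and is not how a JDM association model exposes these values.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Illustrative computation of rule support and confidence from raw transactions. */
final class RuleMetrics {

    /** Fraction of transactions that contain every item in the given set. */
    static double frequency(List<Set<String>> transactions, Set<String> items) {
        long hits = transactions.stream().filter(t -> t.containsAll(items)).count();
        return (double) hits / transactions.size();
    }

    /** support(A -> B) = P(A, B): both sides of the rule occur together. */
    static double support(List<Set<String>> transactions, Set<String> a, Set<String> b) {
        Set<String> both = new HashSet<>(a);
        both.addAll(b);
        return frequency(transactions, both);
    }

    /** confidence(A -> B) = P(A, B) / P(A). */
    static double confidence(List<Set<String>> transactions, Set<String> a, Set<String> b) {
        return support(transactions, a, b) / frequency(transactions, a);
    }
}
```

For four transactions {milk, bread}, {milk, bread, eggs}, {milk}, and {bread}, the rule milk → bread has support 2/4 and confidence (2/4) / (3/4) = 2/3.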

3.2 Data mining tasks


Data mining revolves around a few common tasks: building a model, testing a model,
applying a model to data, computing statistics, and importing and exporting mining
objects. Each of these is discussed below.

3.2.1 Building a model


JDM enables users to build models in the functional areas: classification,
regression, attribute importance, clustering, and association. The model serves as a typically concise or compact representation of the information contained in the data, relative
to the algorithm that produced it. To build models, users define tasks, which minimally
require the input parameters: model name, mining data and mining settings. Settings contain parameters that describe the type of model to be built, as well as directions to the specific algorithm used to build the model.
There are two levels of settings: function and algorithm. Recall that the mining function
addresses the type of problem to be solved, e.g., classification or clustering, and the mining algorithm addresses the specific technique to be applied to solve that problem, e.g.,
decision tree or k-means. When a user does not specify algorithm settings in a build settings object, the Data Mining Engine (DME) may choose an appropriate algorithm for the task,
either dynamically or statically, providing defaults for the relevant parameters. Model
building at the function level hides many of the technical details of data mining from
the user. The quality of models will be determined by the sophistication of the vendor's
implementation and the quality of the data.
Build data, i.e., the data used as input to build a model, can be in different forms. The
attributes of the build data to be used in model building may be specified in the logical
data associated with the build settings. JDM supports flexible assignment of build data to
the logical data. If logical attributes do not map directly to physical attributes with name-based equivalence, an explicit mapping may be provided using the task object.
A typical scenario for model building is as follows:
1. Create a physical data object (by identifying existing data in a database table or file)
2. Create a build settings object
3. Create a logical data instance based on the physical data and associate it with the build
settings (optional)
4. Create an algorithm settings object and associate it with the build settings (optional)
5. Create a build task and set the physical data and build settings
6. Map the physical attributes to logical attributes (if necessary)
7. Invoke the execute method using the task
After a model is built by the DME, it can be persisted in the MOR. See section 3.7 for
details on JDM persistence options.
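The shape of the scenario above, where a task is first specified and only later executed, can be sketched with a few toy classes. Everything in this block is a hypothetical stand-in written for illustration: the class names, fields, and the in-memory repository are assumptions and are not the javax.datamining interfaces, which are obtained through factories on a connection.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Toy stand-ins for the workflow above; these are NOT the javax.datamining interfaces. */
final class BuildWorkflowSketch {

    /** Steps 2 to 4: function-level settings with an optional algorithm choice. */
    static final class BuildSettings {
        final String function;         // e.g. "classification"
        final String targetAttribute;  // logical name of the target
        String algorithm;              // stays null when step 4 is skipped

        BuildSettings(String function, String targetAttribute) {
            this.function = function;
            this.targetAttribute = targetAttribute;
        }
    }

    /** Step 5: the task only records what to do; nothing runs until execute. */
    static final class BuildTask {
        final String dataUri;          // step 1: reference to existing data
        final BuildSettings settings;
        final String modelName;
        final Map<String, String> physicalToLogical = new LinkedHashMap<>(); // step 6

        BuildTask(String dataUri, BuildSettings settings, String modelName) {
            this.dataUri = dataUri;
            this.settings = settings;
            this.modelName = modelName;
        }
    }

    /** Pretend engine with an in-memory repository keyed by object name. */
    static final class ToyEngine {
        final Map<String, String> repository = new LinkedHashMap<>();

        /** Step 7: "builds" a model description and stores it under the model name. */
        void execute(BuildTask task) {
            String algorithm = task.settings.algorithm != null
                    ? task.settings.algorithm : "engine-chosen default";
            repository.put(task.modelName, task.settings.function + " model for "
                    + task.settings.targetAttribute + " built with " + algorithm);
        }
    }
}
```

The point of the sketch is the ordering: settings and mappings are attached to the task first, and the engine falls back to its own algorithm choice when none was given.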
The result of a build is a model. Especially for predictive models, the number of logical
attributes used by the model may be a subset of those provided in the logical data. As
such, the model has a signature specifying the possible input attributes to the model for
apply. These are not required attributes as a subset may be specified where NULL values
can be handled. Some algorithms perform automatic attribute selection, e.g., with a decision tree model, 100 attributes may have been used to train the model, but only 25 were
used in the final rule set and are necessary for scoring. These 25 constitute the model signature.

3.2.1.1 Incremental learning


Some applications have a nearly continuous stream of data available for model building. A
typical approach is to collect a certain amount of data, build a model from it, use the
model to score new data for some period, and then build a new model from scratch, possibly using all the data accumulated to date, or using a fixed amount of data, but using the
most recent data.
This approach can be unnecessarily costly, especially for algorithms such as Naive Bayes
or Association Rules where summary frequency counts are maintained. The frequency
counts of the existing data do not change; only the new data added needs to be counted and
the results merged with the previous counts. This produces refreshed models in much less
time.
Algorithms such as neural networks can also leverage incremental learning. Here, a previously trained neural network can be provided as a seed model. The model is further trained
using new data, but starting from an already good model.
JDM provides support for incremental learning by allowing a seed model to be specified
as input to the build task. Not all functions or algorithms are expected to handle the specification of a seed model for incremental learning. The function and algorithm capabilities
list indicates if this feature is supported, which is vendor-specific.
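The count-merging idea behind incremental learning for frequency-based algorithms can be illustrated as follows. This is a simplified sketch: the class `FrequencyCounts` is hypothetical, it tracks a single categorical attribute, and it stands in for whatever summary structure a vendor's seed model actually keeps.

```java
import java.util.HashMap;
import java.util.Map;

/** Illustrative frequency summary that can absorb new data without rescanning old data. */
final class FrequencyCounts {
    private final Map<String, Long> counts = new HashMap<>();
    private long total;

    /** Counts one batch of categorical values. */
    void add(Iterable<String> values) {
        for (String v : values) {
            counts.merge(v, 1L, Long::sum);
            total++;
        }
    }

    /** Folds the counts of a newer batch into this summary. */
    void merge(FrequencyCounts other) {
        other.counts.forEach((value, count) -> counts.merge(value, count, Long::sum));
        total += other.total;
    }

    /** Relative frequency of a value over all data seen so far. */
    double probability(String value) {
        return total == 0 ? 0.0 : (double) counts.getOrDefault(value, 0L) / total;
    }
}
```

Only the new batch is scanned; the merge itself touches one entry per distinct value, which is why the refresh is much cheaper than a rebuild.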

3.2.1.2 Model evaluation


Some algorithms, such as neural networks or decision trees, use a portion of the build data
to iteratively determine how well the model is learning patterns from the data. These algorithms will split the build data into a train and evaluation dataset according to some internal percentage, e.g., 50%-50%, or 70%-30%. Some users, however, wish to control more
carefully the data that is used for training versus that used for evaluation.
JDM provides support for specifying the evaluation data explicitly in the build task to be
used during model building. Although some vendors may provide proprietary algorithm
settings to allow specifying the percentage of data to be used for evaluation, JDM provides
the more explicit option of providing the actual data.

3.2.2 Testing a model


Model testing gives an estimate of the accuracy a model has in predicting the target of a
supervised model. Testing follows model building to compute the accuracy of a model's
predictions when the model is applied to a previously unseen dataset, separate from the
build dataset. This provides an honest estimate of the accuracy.
The test task accepts a model and data for testing the model. Test results are stored in a
TestMetrics object as specified in the task. Physical attribute to logical attribute mapping
may be specified if the names of physical and logical attributes do not match. The test data
must be compatible with the model signature.

Test data must be preprocessed in the same way as the build data. The user is responsible
for ensuring this compatibility. However, some DMEs may choose to use information
present in the LogicalData stored with the model to flag incompatibilities.
Test metrics content depends on the type of model. For example, classification models
produce a confusion matrix, whereas regression models provide error estimates. In addition to obtaining a confusion matrix, model testing includes the option to compute lift and
receiver operating characteristics (ROC). A user may specify to compute lift or ROC in a
test task.
Lift is a measure of the effectiveness of a predictive model calculated as the ratio between
the results obtained with and without the predictive model. The cumulative gains and lift
charts are often used as visual aids for measuring model performance. A positive target
value v and the number of quantiles q are two common parameters to computing lift.
Suppose that there exist n records in the input data, p of which are known to have the positive target v, thus yielding a total gain p/n. These input records are applied to the predictive model to get the predicted target value and its likelihood. Then the records are
rearranged in the order of the likelihood of the positive prediction and divided into q equal
segments. The cumulative gain c_i of a quantile i is the ratio of the cumulative number of
positive targets to the total number of records n. The lift value l_i of quantile i is computed
as the ratio of the cumulative gain c_i to the total gain p/n.
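The computation can be sketched as below. One interpretation is assumed and should be read as such: the cumulative gain for quantile i is taken as the positives found in quantiles 1 through i divided by the number of records in those quantiles, which is the usual convention for cumulative lift charts. The class and method names are hypothetical, and the sketch assumes at least one record per quantile and at least one positive record.

```java
import java.util.Arrays;
import java.util.Comparator;

/** Illustrative cumulative lift computation; not a JDM interface. */
final class LiftSketch {

    /**
     * @param probability predicted likelihood of the positive target for each record
     * @param positive    whether each record actually has the positive target
     * @param q           number of quantiles
     * @return cumulative lift for quantiles 1..q
     */
    static double[] cumulativeLift(double[] probability, boolean[] positive, int q) {
        int n = probability.length;
        int p = 0;
        for (boolean b : positive) {
            if (b) p++;
        }
        if (q < 1 || n < q || p == 0) {
            throw new IllegalArgumentException("need n >= q >= 1 and at least one positive");
        }
        // Order record indexes so the most likely positives come first.
        Integer[] order = new Integer[n];
        for (int i = 0; i < n; i++) order[i] = i;
        Arrays.sort(order, Comparator.comparingDouble((Integer i) -> -probability[i]));

        double totalGain = (double) p / n;
        double[] lift = new double[q];
        int seen = 0;
        int cumulativePositives = 0;
        for (int k = 1; k <= q; k++) {
            int end = (int) Math.round((double) k * n / q);   // records in quantiles 1..k
            while (seen < end) {
                if (positive[order[seen]]) cumulativePositives++;
                seen++;
            }
            // Assumption: gain is measured against the records examined so far.
            double cumulativeGain = (double) cumulativePositives / seen;
            lift[k - 1] = cumulativeGain / totalGain;
        }
        return lift;
    }
}
```

Under this reading the lift of the last quantile is always 1, since by then every record has been examined and the cumulative gain equals the total gain p/n.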
ROC is a measure of comparison between individual models to determine thresholds
which yield a high proportion of positive hits. ROC curves aid users in selecting samples
by minimizing error rates. ROC was originally used in signal detection theory to gauge the
true hit versus false alarm ratio when sending signals over a noisy channel.
The horizontal axis of an ROC graph measures the false positive rate as a percentage. The
vertical axis shows the true positive rate. The top left hand corner is the optimal location in
an ROC curve, indicating high TP (true-positive) rate versus low FP (false-positive) rate.
The ROC Area Under the Curve is useful as a quantitative measure for the overall performance of models over the entire evaluation data set. The larger this number is for a specific model, the better. However, if the user wants to use a subset of the scored data, the
ROC curves help in determining which model will provide the best results at a specific
threshold.
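As an illustration of how the area under the ROC curve can be obtained from scored records, the sketch below lowers the threshold one record at a time and accumulates the area with the trapezoid rule. The class name is hypothetical, and the sketch assumes that scores are distinct and that both positive and negative records are present; tied scores would need to be grouped before adding a point to the curve.

```java
import java.util.Arrays;
import java.util.Comparator;

/** Illustrative ROC area computation; assumes distinct scores. */
final class RocSketch {

    /**
     * @param score    predicted probability of the positive target for each record
     * @param positive whether each record actually has the positive target
     * @return area under the ROC curve (false positive rate on x, true positive rate on y)
     */
    static double auc(double[] score, boolean[] positive) {
        int n = score.length;
        int totalPositives = 0;
        for (boolean b : positive) {
            if (b) totalPositives++;
        }
        int totalNegatives = n - totalPositives;
        if (totalPositives == 0 || totalNegatives == 0) {
            throw new IllegalArgumentException("need both positive and negative records");
        }
        Integer[] order = new Integer[n];
        for (int i = 0; i < n; i++) order[i] = i;
        Arrays.sort(order, Comparator.comparingDouble((Integer i) -> -score[i]));

        double area = 0.0;
        double previousFpr = 0.0;
        double previousTpr = 0.0;
        int truePositives = 0;
        int falsePositives = 0;
        for (int index : order) {
            if (positive[index]) truePositives++; else falsePositives++;
            double tpr = (double) truePositives / totalPositives;
            double fpr = (double) falsePositives / totalNegatives;
            area += (fpr - previousFpr) * (tpr + previousTpr) / 2.0;  // trapezoid slice
            previousFpr = fpr;
            previousTpr = tpr;
        }
        return area;
    }
}
```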
In addition to computing test metrics from a task specification, JDM enables the computation of test metrics using a scored dataset. Here, by specifying a dataset that has the
required attributes, e.g., actual value, predicted value, the confusion matrix can be computed. Similar capabilities exist for computing lift and ROC. This separation of apply
results from the test computation provides greater application flexibility as well as enables
computing test metrics using data produced outside of JDM or a data mining system.
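To make the idea of computing metrics from an already scored dataset concrete, the sketch below tallies a confusion matrix from pairs of actual and predicted values. The class `ConfusionMatrix` here is a hypothetical illustration and not the JDM interface of the same name; in JDM the computation is requested through a task over a dataset containing the required attributes.

```java
import java.util.Map;
import java.util.TreeMap;

/** Illustrative confusion matrix built from (actual, predicted) pairs. */
final class ConfusionMatrix {
    // actual value -> predicted value -> number of cases
    private final Map<String, Map<String, Integer>> cells = new TreeMap<>();
    private int total;
    private int correct;

    /** Records one scored case. */
    void add(String actual, String predicted) {
        cells.computeIfAbsent(actual, k -> new TreeMap<>()).merge(predicted, 1, Integer::sum);
        total++;
        if (actual.equals(predicted)) {
            correct++;
        }
    }

    /** Number of cases with the given actual and predicted values. */
    int count(String actual, String predicted) {
        return cells.getOrDefault(actual, Map.of()).getOrDefault(predicted, 0);
    }

    /** Fraction of cases predicted correctly. */
    double accuracy() {
        return total == 0 ? 0.0 : (double) correct / total;
    }
}
```

Because only the actual and predicted columns are needed, the scored data could equally come from a system outside JDM.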

3.2.3 Applying a model


In many applications, the ability to make predictions is the main purpose for mining data.
Applying a model to a case produces one or more predictions or assignments. JDM
enables batch scoring as well as single case scoring, intended for real-time response. In
supervised mining, model apply produces predictions or scores along with their corresponding probability. In unsupervised mining, such as clustering, apply assigns a case to a
cluster, with the probability indicating how well the case fits with a given cluster. Where
the probability of a prediction or assignment is high, e.g., close to 1.0, the algorithm
encoded in the model that this pattern (combination of predictors with the target or grouping) was seen frequently in the build data.
Using an apply settings specification, a user may tailor the content of the result. For example, a user may want the customer identifier, along with the score, probability, and rule
number, e.g., if the model is a decision tree, to be output for each case in the apply data. In
the case of classification, users can also specify a cost matrix in the apply settings.
As with test above, the data input to the trained model must match the model's signature.
However, not all attributes present in the signature must be present in the physical data for
apply (or test). Missing values handling depends on the algorithm or its implementation.
For supervised models, the target attribute, of course, is not needed. The result of the apply
operation is placed according to the specification of the apply settings and destination
physical data location in the task.
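One common way a cost matrix influences classification results is to pick the class with the lowest expected cost rather than the highest probability. The sketch below shows that calculation; it is an assumption about how a cost matrix might be used, not a statement of how any particular DME applies it, and the class name and the cost[actual][predicted] layout are choices made for this example.

```java
/** Illustrative cost-sensitive choice of a predicted class. */
final class CostSensitiveChoice {

    /**
     * @param probability probability[i] is the model's probability that the case is class i
     * @param cost        cost[actual][predicted] is the cost of predicting one class
     *                    when another is true
     * @return index of the predicted class with the lowest expected cost
     */
    static int predict(double[] probability, double[][] cost) {
        int best = -1;
        double bestCost = Double.POSITIVE_INFINITY;
        for (int predicted = 0; predicted < probability.length; predicted++) {
            double expected = 0.0;
            for (int actual = 0; actual < probability.length; actual++) {
                expected += probability[actual] * cost[actual][predicted];
            }
            if (expected < bestCost) {
                bestCost = expected;
                best = predicted;
            }
        }
        return best;
    }
}
```

With classes ordered as no-buy (0) and buy (1), probabilities {0.7, 0.3}, and a matrix that charges 10 for missing a buyer but only 1 for a wasted offer, the expected costs are 3.0 for predicting no-buy and 0.7 for predicting buy, so the less probable class is chosen.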

3.2.4 Object import and export


The purposes of importing and exporting JDM objects include:
• interchange with other DMEs (homogeneous or heterogeneous)
• persistent storage outside the DME and MOR
• object inspection or manipulation
JDM explicitly enables the import and export of system metadata. The formats provided
are vendor-specific and may include XML, Java serialized objects, and proprietary formats. Since XML is a common format for metadata interchange, JDM cites three standard
definitions for data mining metadata in XML: PMML for mining models, CWM for
Data Mining (CWM-DM) for non-algorithm-specific build settings and tasks, and JDM itself as defined
for web services. Applications needing to export other objects in a data warehousing context can leverage the broader CWM metadata specification. The need still exists for a standard algorithm settings XML representation. Of course, vendors may choose to identify
proprietary representations.
Importing models to be used for applying, testing, or incremental learning requires the
vendor to establish a mapping between the importable object representation (e.g., PMML
XML) and JDM models and metadata (e.g., function and algorithm settings). JDM does
not specify the mapping between importable objects formats and JDM objects. This is
vendor-specific.

3.2.4.1 Use of PMML


PMML is becoming an increasingly popular standard representation for models in XML.
Most vendors now support export and/or import of PMML models for at least a couple of
model types. PMML does not specify the settings used to create the model, only the resulting model representation, e.g., the nodes in a decision tree or the clusters in a k-means
clustering model.
As such, users cannot expect to access an imported PMML model as a standard JDM
model with complete settings available. However, a PMML model does contain sufficient
information to enable apply and test. Depending on a vendor's implementation, the import
of a PMML model and subsequent export may not be lossless. It is possible that the export
of the same imported model may result in additional or reduced metadata in an exported
PMML model. Losslessness can be achieved if a vendor maintains the original PMML
string to be returned on export of the same model in the PMML format.
Some PMML models map readily to JDM as PMML served as input to aspects of the JDM
design. JDM also influenced certain aspects of PMML in the PMML 2.0 release. Note that
JDM is not a PMML viewer; not all contents of a PMML model are readily exposed
through Java objects.
Vendors may supplement a PMML model using extensions to include JDM metadata as
part of the model; however, this is outside the specification of PMML.
Even for PMML, not all information that comprises a JDM model is immediately exportable in the PMML standard, e.g., PMML does not specify build settings. Vendors may
choose to leverage the PMML extension provision where arbitrary text can be provided. In
this case, the CWM-DM or JDM XML representation of settings may be used.

3.2.4.2 Use of CWM


For metadata, the JDM metadata (excluding models) largely maps to CWM-DM. As such,
importing and exporting certain objects in CWM-DM XML may be lossless. Due to the
size and complexity of the CWM metamodel, only a few vendors actually support CWM-DM.

3.2.5 Computing statistics on data


An important aspect of the data mining process is understanding the nature of the data to
be mined. This is initially accomplished by looking at the univariate statistics derived from a
dataset. The statistics computed depend on the type of attributes present in the data. For
example, integer or string data corresponding to categorical values, such as marital status,
can be analyzed using frequency counts of each of the values, whereas floating point or
other integer data can be analyzed using the mean, mode, median, standard deviation, etc.
JDM supports the computation of univariate statistics using the compute statistics task on
a given physical data set.
It is implementation defined as to which statistics provided in the Statistics package are
computed for each attribute type. Users may optionally specify how the data should be
interpreted using a logical data object, described below.
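For a numeric attribute, the kind of summary such a task might return can be sketched as follows. The class `NumericSummary` is a hypothetical illustration rather than a type from the Statistics package, and the use of the sample (n - 1) standard deviation is an assumption, since which statistics are computed and how is implementation defined.

```java
import java.util.Arrays;

/** Illustrative univariate summary of a numeric attribute. */
final class NumericSummary {
    final double mean;
    final double median;
    final double standardDeviation;

    NumericSummary(double[] values) {
        if (values.length == 0) {
            throw new IllegalArgumentException("no values");
        }
        double[] sorted = values.clone();   // leave the caller's array untouched
        Arrays.sort(sorted);
        int n = sorted.length;

        mean = Arrays.stream(sorted).sum() / n;
        median = (n % 2 == 1)
                ? sorted[n / 2]
                : (sorted[n / 2 - 1] + sorted[n / 2]) / 2.0;

        double sumOfSquares = 0.0;
        for (double v : sorted) {
            sumOfSquares += (v - mean) * (v - mean);
        }
        // Assumption: sample standard deviation (divide by n - 1).
        standardDeviation = n > 1 ? Math.sqrt(sumOfSquares / (n - 1)) : 0.0;
    }
}
```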

3.2.6 Verifying task correctness


Tasks support the specification of parameters to data mining operations. Since a task may
be run asynchronously, users care if the task is likely to execute without obvious errors.
For example, users want to avoid the situation where a task is scheduled for an overnight
run, only for the user to come back the next morning to find the name of the data file was
mistyped, causing the operation to fail immediately.
Generally, function setting and task parameters need not be validated at the time the
parameter is set to avoid potentially time-consuming set operations. As such, each task
interface has a verify method that can be invoked after all parameters have been set. The
verify method returns NULL if no errors were detected, otherwise it returns a verification
report object, which contains vendor specific content describing the error.
The DME will also verify the parameters when a task executes. The DME may contain
some logic to determine whether a complete reverification is necessary, or it may simply
re-execute verify. This is at the vendor's discretion.
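The verify contract described above (no report when nothing is wrong, a report object otherwise) can be sketched like this. The `ApplyTask` and `Report` classes are hypothetical stand-ins with only two parameters; a real task interface carries more settings, and the report content is vendor specific.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative verify-before-execute pattern; not the JDM task interface. */
final class VerifySketch {

    /** Hypothetical report type listing the problems that were found. */
    static final class Report {
        final List<String> problems;

        Report(List<String> problems) {
            this.problems = problems;
        }
    }

    /** Hypothetical task holding only the parameters checked in this example. */
    static final class ApplyTask {
        String modelName;
        String dataUri;

        /** Returns null when no errors are detected, otherwise a report. */
        Report verify() {
            List<String> problems = new ArrayList<>();
            if (modelName == null || modelName.isEmpty()) {
                problems.add("model name is not set");
            }
            if (dataUri == null || dataUri.isEmpty()) {
                problems.add("data URI is not set");
            }
            return problems.isEmpty() ? null : new Report(problems);
        }
    }
}
```

Setters stay cheap because nothing is checked until verify is called, which matches the rationale given for deferring validation.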

3.3 Principal objects


In this section, we introduce the principal objects defined in JDM. This discussion is valuable to understand the code examples and Javadoc. In Section 4, we present each of the
packages and corresponding objects through a series of class diagrams. This depicts the
relationship of the objects presented in this section. For details of each class, refer to the
documentation generated using Javadoc.

3.3.1 Connection
JDM connection objects are abstractions for vendor specific access to the DME, e.g.,
javax.resource.cci.Connection or a JCA connection. JDM users access a DME by creating
a connection object via a connection factory, per the Java Connector Architecture (JCA).
The factory accepts a user name and password to gain access to the DME. Connections are
expected to be single-threaded, i.e., a single application thread is expected to use a given
connection instance, thereby avoiding concurrency control issues.
The connection also provides access to objects persistent in the MOR. Methods defined on
connection objects enable the creation, deletion, and retrieval of mining objects present in
a user namespace.
When a user establishes a connection to a DME, the connection provides access to objects
in the user's namespace. Although not required, users can reference objects in other user
namespaces using the convention <username>.<objectname> when supplying an object
name for applicable method arguments.
The connection does not provide direct access to the data to be mined. For this, other standard interfaces for talking to databases (e.g., JDBC) and file systems already exist. Data to
be used for data mining is specified via a URI, a reference to the actual data with a vendor-specific format.

3.3.2 Task
A task object serves as a container within which to specify arguments for data mining
operations to be performed by the DME. By providing an object to specify tasks, we separate the specification of the task from its execution. To support deferred or batch processing, one or more task objects can be saved and scheduled for execution by the application.
JDM defines tasks for each of the mining operations, i.e., build, test, and apply. It also
defines tasks for computing statistics and importing and exporting mining objects.

3.3.3 Execution handle and status


Tasks can execute synchronously or asynchronously. Asynchronous task execution, which
is optionally supported, results in an execution handle that can be used to manipulate running tasks (e.g., getting status and terminating) or completed tasks (e.g., getting status).
The specific behavior for terminating an executing task is determined by the vendor. It is
generally expected that any objects created by the task up to the point of termination
would be deleted.
An execution handle object also supports a method for blocking until the task execution
completes. As such, synchronous execution can be simulated. Applications that require
notification of task completion could leverage various notification mechanisms if integrated by the vendor, such as the Java Message Service (JMS), JMX remote MBeans,
and JCA 1.5 inbound communication.
By invoking getStatus on an execution handle, applications can determine the current status, e.g., executing or terminated. In addition, implementations may provide incremental
status on task execution using this mechanism to inform users, e.g., whether a decision
tree model is training or pruning, or what percentage of the model build or apply is complete. Applications can leverage this information to provide real-time feedback to end
users, especially if a visual interface is present.
Each time a task is executed asynchronously, an execution handle is created and associated with the task. A vendor implementation may choose to keep all or a subset of the past
execution handles with the task. The most recently executed handle must be provided.
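The behavior described for execution handles (status polling, blocking until completion, and termination) can be approximated with the standard library. The sketch below wraps a `CompletableFuture`; the class `ExecutionHandleSketch` and its `Status` values are hypothetical and simplified, and they are not the JDM execution handle or its status enumeration.

```java
import java.util.concurrent.CompletableFuture;

/** Illustrative execution handle built on CompletableFuture; not the JDM interface. */
final class ExecutionHandleSketch {

    /** Simplified, hypothetical status values. */
    enum Status { EXECUTING, SUCCESS, ERROR, TERMINATED }

    private final CompletableFuture<Void> future;

    private ExecutionHandleSketch(CompletableFuture<Void> future) {
        this.future = future;
    }

    /** Starts the task on a background thread and returns a handle immediately. */
    static ExecutionHandleSketch executeAsync(Runnable task) {
        return new ExecutionHandleSketch(CompletableFuture.runAsync(task));
    }

    /** Reports the current status without blocking. */
    Status getStatus() {
        if (!future.isDone()) {
            return Status.EXECUTING;
        }
        if (future.isCancelled()) {
            return Status.TERMINATED;
        }
        return future.isCompletedExceptionally() ? Status.ERROR : Status.SUCCESS;
    }

    /** Blocks until the task finishes, which simulates synchronous execution. */
    Status waitForCompletion() {
        try {
            future.join();
        } catch (RuntimeException e) {
            // The failure or cancellation is reflected in the status below.
        }
        return getStatus();
    }

    /** Marks the handle as terminated; the running code itself is not interrupted. */
    void terminate() {
        future.cancel(true);
    }
}
```

A caller that wants synchronous behavior simply calls `waitForCompletion()` right after `executeAsync(...)`; a user interface would instead poll `getStatus()` to show progress.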

3.3.4 Physical data set


Physical data sets refer to the data to be used as input to data mining operations. Physical
data set objects reference specific data via a URI, e.g., specifying a table or file, and a set
of physical attributes.
A physical attribute object typically corresponds to a field in a formatted file or column in
a database table. Using task objects, physical attributes can be mapped to the logical
attributes of a model's signature or logical data of a build settings object.
The physical data set object can support multiple data representations including relational
tables, row-column structured files, star schemas, XML files, and OLAP cubes. For data
mining, JDM expects the input data to be provided as a single entity.
Physical data set representations are discussed further in section 3.4.

3.3.5 Physical data record


Physical data records are used as input to apply data mining operations. In particular,
physical data records are used for single case scoring, for both input and output. This
enables real-time scoring in JDM.
Physical data record representations are discussed further in section 3.4.

3.3.6 Build settings


A build settings object captures the high-level specification input for building a model.
JDM specifies mining functions: classification, regression, attribute importance, association, and clustering.
Build settings allow a user to specify the type of result desired without having to specify
a particular algorithm. Although a build settings object allows for the specification of an
algorithm and its settings, if these are omitted, the DME selects an algorithm based on the build settings and possibly characteristics of the data.
Build settings may also be validated for correct parameters using the verify method.

3.3.7 Algorithm
A data mining algorithm is a technique or procedure that when applied to data produces a
model. The set of data mining algorithms is extensive and growing. As such, JDM does
not include a large set of algorithms, but specifies a framework for including new algorithms and their model representations. This enables vendors to provide additional algorithms and functionality in advance of their inclusion in the standard.
An algorithm consists of optional specifications:
1. Algorithm settings that specify parameters or inputs that affect model building
2. Model detail that defines a model's content, e.g., the specific decision tree representation that describes the tree nodes (predicates, support, etc.)
Each of these specifications may involve the definition of a new interface, the
reuse of existing interfaces, or the specialization of existing interfaces.

3.3.8 Algorithm settings


An algorithm settings object captures the parameters associated with a particular algorithm. It allows a knowledgeable user to fine-tune algorithm parameters. Generally, not all
parameters must be specified; however, those specified are taken into account by the
DME.
Distinguishing algorithm settings from build settings provides a natural and convenient
separation for those users experienced with data mining and those familiar only with mining functions.

3.3.9 Model
A model object is the result of applying an algorithm to data as specified in a build settings
object. The representation of a model is specific to the algorithm used and vendors may
choose whether to expose the model detail representation. Models can be (1) used for
direct inspection, e.g., to examine the rules produced from a decision tree or association,
(2) tested for accuracy, (3) applied to data for scoring, (4) exported to an external representation such as PMML and (5) imported for use in the DME.
A model references its build settings as well as the task that created it. A model has a signature, as described below.
In this first release, models are intended to be read-only objects and cannot be directly
modified via the Java API or explicitly stored in the mining object repository (MOR) by
the user.

3.3.10 Model signature


A model signature represents the input expected for applying a model, and testing where
applicable. The signature consists of a set of signature attributes that capture the attribute
name, type, and representation (data type). The model signature may contain fewer
attributes than specified in the logical data used to build the model. For example, a decision tree model may have used a subset of the logical attributes.

3.3.11 Model detail


A model detail object is the detailed state or representation of a model, which is often
algorithm-related. The interface Model and its subclasses provide a function-level, generic
representation for models. For example, a clustering model allows users to access clusters.
However, a decision tree model, resulting from a specific algorithm, has specific model
detail that a knowledgeable user may want to inspect.
Model detail need not be linked to a specific algorithm. It is possible that a given algorithm may provide none, one, or multiple detail representations for its models. In the case
of multiple representations, the algorithm settings may contain an enumeration for model
detail to be produced from the model build.

3.3.12 Logical attribute


A logical attribute describes a domain of data to be used as input to data mining operations. Logical attributes are typically either categorical or numerical, but can be indicated
as being both if the chosen algorithm is capable of handling such a specification. A logical
attribute can reference additional metadata that characterizes the attribute as either categorical, e.g., a list of the categories; ordinal, e.g., a list of ordered categories; or numerical, e.g., the bounds of the data.
Through the build settings interface, users can specify additional handling of attributes.
For example, a user may want to suppress automatic algorithm-applied transformations on
one or more logical attributes, or specify a weight for individual attributes.
A logical attribute can be identified as supplementary, meaning that it is not used for
model building, but may be used by the algorithm for other purposes. For example, in
clustering, attributes used to compute another attribute could be identified as supplementary. Here, the clustering model is built using the computed attributes and the algorithm
also computes cluster statistics on the supplementary attributes. Statistics on attributes
such as AGE and INCOME are more readily understood by those viewing cluster definitions than an attribute ratio computed as log (INCOME) / AGE*2.

3.3.13 Logical data


A logical data object is a set of logical attributes that describes the logical nature of the
data used as input for model building, essentially how physical data should be interpreted
for model building. Each logical attribute is uniquely named within a logical data object.
The specification of logical data is optional. If absent, the DME makes assumptions about
the logical data. For example, string physical attributes are treated as categoricals, whereas
number physical attributes are treated as numericals. Names and datatypes are carried forward, and all data is considered active and prepared.

3.3.14 Attribute statistics set


An attribute statistics set object is a container for univariate statistics on a related set of
attributes and/or cases. It results from computing statistics on a physical data set object
directly, as a characterization of the data used to build a model, or information on a particular model element, e.g., a cluster in a clustering model.
The univariate statistics include: continuous, numerical, and discrete. Continuous statistics are applicable to continuous numerical attributes and provide frequencies and sums of
squares for value and value range. Numerical statistics are applicable to continuous and
discrete numerical values. Numerical statistics include mean, median, variance, max and
min. Discrete statistics are applicable to discrete numeric and categorical data such as
strings. Discrete statistics include the mode and histogram data.
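As a rough illustration of the numerical and discrete statistics named above, the following sketch computes them over in-memory arrays. The class UnivariateStatsSketch and its method names are hypothetical and not part of JDM. The use of sample variance (dividing by n - 1) is an assumption, since the text does not say which form a vendor computes, and the inputs are assumed to be non-empty.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper; illustrates the statistics, not the JDM interfaces. */
final class UnivariateStatsSketch {
    static double mean(double[] v) {
        return Arrays.stream(v).average().orElse(Double.NaN);
    }

    static double median(double[] v) {
        double[] s = v.clone();
        Arrays.sort(s);
        int n = s.length;
        // Middle element for odd n, average of the two middle elements for even n.
        return n % 2 == 1 ? s[n / 2] : (s[n / 2 - 1] + s[n / 2]) / 2.0;
    }

    /** Sample variance (divides by n - 1); an assumption for this sketch. */
    static double variance(double[] v) {
        double m = mean(v);
        double sumSquares = 0.0;
        for (double x : v) {
            sumSquares += (x - m) * (x - m);
        }
        return sumSquares / (v.length - 1);
    }

    /** Histogram data for discrete values: value -> frequency, in first-seen order. */
    static Map<String, Integer> histogram(String[] values) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String s : values) {
            counts.merge(s, 1, Integer::sum);
        }
        return counts;
    }

    /** Most frequent value; ties go to the value seen first. */
    static String mode(String[] values) {
        String best = null;
        int bestCount = 0;
        for (Map.Entry<String, Integer> e : histogram(values).entrySet()) {
            if (e.getValue() > bestCount) {
                best = e.getKey();
                bestCount = e.getValue();
            }
        }
        return best;
    }
}
```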

3.3.15 Apply settings


An apply settings object allows users to tailor the results of an apply task. It contains a set
of ordered items. Output can consist of:

• data to be passed through to the output from the input dataset, e.g., key attributes
• values computed from the apply itself, e.g., score, probability and, in the case of decision trees, rule identifiers
• multi-class categories for their associated probabilities, e.g., in a classification model with target favoriteColor, users could select the specific colors to receive the probability that a given color is favorite.
Each mining function class defines a method to construct a default apply settings object.
This simplifies the programmer's effort if only standard output is desired. For example,
typical output for a classification apply would include the top prediction and its probability.
Apply settings may also be validated for correct parameters using the verify method.

3.3.16 Confusion matrix


A confusion matrix is a two-dimensional, N x N table that indicates the number of correct
and incorrect predictions a classification model made on specific test data. It provides a
measure of how well a classification model predicts outcomes and where it makes mistakes. The row and column indexes refer to the classes of the target. For example, consider
the table:
Actual \ Predicted    Churner               Non-Churner
Churner               250                   6 (Type I Error)
Non-Churner           21 (Type II Error)    506

The accuracy of the model on the test data is (250 + 506) / (250 + 506 + 21 + 6) = 96.6%
The error is (21 + 6) / (250 + 506 + 21 + 6) = 3.4%
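The arithmetic above generalizes to any N x N confusion matrix: accuracy is the sum of the diagonal divided by the sum of all cells. A minimal sketch follows; the class ConfusionMatrixSketch and the counts[actual][predicted] layout are assumptions for illustration, not the JDM ConfusionMatrix interface.

```java
/** Hypothetical helper; not the JDM ConfusionMatrix interface. */
final class ConfusionMatrixSketch {
    /** counts[actual][predicted]; the diagonal holds the correct predictions. */
    static double accuracy(long[][] counts) {
        long correct = 0;
        long total = 0;
        for (int actual = 0; actual < counts.length; actual++) {
            for (int predicted = 0; predicted < counts[actual].length; predicted++) {
                total += counts[actual][predicted];
                if (actual == predicted) {
                    correct += counts[actual][predicted];
                }
            }
        }
        return (double) correct / total;
    }

    /** Error is the complement of accuracy. */
    static double error(long[][] counts) {
        return 1.0 - accuracy(counts);
    }
}
```

With the churn table above passed as {{250, 6}, {21, 506}}, accuracy evaluates to 756 / 783 and error to 27 / 783.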

3.3.17 Lift
Lift is a measure of how much better prediction results are when using a model than could be obtained
by chance. For example, consider that 2% of the customers contacted from a mailing list
would purchase a product. To ensure all 2% would respond, a catalog mailing would have
to be sent to the entire mailing list. Using a data mining model to select catalog recipients,
we could select those customers most likely to make a purchase. For a given customer segment selected by the model, perhaps 10% of those sent a catalog would make a purchase. The lift is then computed
as 10/2, or 5. Lift can also be computed on a per-decile basis.
Lift may also be used as a measure to compare different data mining models. Since lift is
computed using a dataset with actual outcomes, lift compares how well a model performs
with respect to this dataset on predicted outcomes. Lift also indicates how well model predictions improve over random selection.
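The 10% versus 2% example can be written as a ratio of response rates. The sketch below also computes a cumulative lift over the top-scored cases of a test dataset, which is one common way to derive per-quantile figures. The class LiftSketch and its methods are hypothetical, and the input is assumed to be already ordered from highest to lowest model score.

```java
/** Hypothetical helper; not the JDM Lift interface. */
final class LiftSketch {
    /** Lift as the ratio of the response rate in the selected segment to the overall rate. */
    static double lift(double targetedRate, double baselineRate) {
        return targetedRate / baselineRate;
    }

    /**
     * Cumulative lift for the topN highest-scored cases.
     * actualSortedByScore holds the actual outcomes, ordered from highest to lowest score.
     */
    static double cumulativeLift(boolean[] actualSortedByScore, int topN) {
        int positives = 0;
        int positivesInTop = 0;
        for (int i = 0; i < actualSortedByScore.length; i++) {
            if (actualSortedByScore[i]) {
                positives++;
                if (i < topN) {
                    positivesInTop++;
                }
            }
        }
        double topRate = (double) positivesInTop / topN;
        double baselineRate = (double) positives / actualSortedByScore.length;
        return topRate / baselineRate;
    }
}
```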

3.3.18 Cost matrix


A cost matrix is a two-dimensional, N x N table that defines the cost associated with a prediction versus the actual value. A cost matrix is typically used in classification models,
where N is the number of classes in the target, and the columns and rows are labeled with
class values. For example, consider the following table:

Actual \ Predicted    Churner    Non-Churner
Churner                          300
Non-Churner           100

The problem is to determine who will churn (change service provider) and who will not. If
a person turns out to be a churner, but is predicted to be a non-churner, the cost may be
$300 since the service provider did not act with promotions to entice that customer not to
churn and replacing that customer is expensive. However, if the customer is predicted to
be a churner, and in fact would not churn, there is a cost of $100 in unnecessary promotions given to that customer.
To represent the case where an algorithm predicts unknown, the cost matrix may define an additional category for the target with CategoryProperty of unknown. This
allows a model to predict unknown if the model cannot make a prediction better than chance.
A cost matrix may be used when building a model, applying a model to data, and computing lift or return on investment. Note, however, that although a cost matrix may be specified for all classification problems, it may be ignored if the particular algorithm cannot
handle such input. It is up to each vendor to document the behavior in this case.
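One common way to use a cost matrix when applying a model is to choose the prediction with the lowest expected cost rather than the highest probability. The sketch below illustrates that idea only; JDM does not prescribe how an algorithm uses the matrix. The class CostMatrixSketch is hypothetical, and the zero cost for correct predictions is an assumption, since the example table leaves those cells empty.

```java
/** Hypothetical helper; not the JDM CostMatrix interface. */
final class CostMatrixSketch {
    /**
     * cost[actual][predicted] is the cost of predicting class `predicted`
     * when the true class is `actual`; probabilities[actual] is the model's
     * estimated probability of each class for one case.
     * Returns the index of the prediction with the lowest expected cost.
     */
    static int minExpectedCostPrediction(double[][] cost, double[] probabilities) {
        int best = -1;
        double bestCost = Double.POSITIVE_INFINITY;
        for (int predicted = 0; predicted < cost[0].length; predicted++) {
            double expected = 0.0;
            for (int actual = 0; actual < cost.length; actual++) {
                expected += probabilities[actual] * cost[actual][predicted];
            }
            if (expected < bestCost) {
                bestCost = expected;
                best = predicted;
            }
        }
        return best;
    }
}
```

With index 0 for Churner and index 1 for Non-Churner, costs {{0, 300}, {100, 0}}, and probabilities {0.3, 0.7}, predicting Churner has an expected cost of 70 and predicting Non-Churner has an expected cost of 90, so the sketch selects Churner even though Non-Churner is the more probable class.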

3.3.19 Prior probabilities


The prior probabilities, or priors, refer, in general, to a mathematical description of a
user's domain knowledge of the parameters characterizing the target prior to analyzing
data resulting from an experiment. The priors are typically combined with conditional
probabilities to compute posterior probabilities of the quantities of interest. The posterior
probabilities are the model output, used for prediction or other analyses. The form of the
conditional probabilities is usually known and its parameters are computed from the data.
Often, for example in a Naive Bayes model, the priors are taken to be the global distribution of the target.
The global distribution of the target can be computed from the data or specified by the
user. User-specification becomes critical when stratified sampling or other special sampling procedures have been applied where that stratification is with respect to target value.
For example, when building a model to predict fraud, the source dataset may include only
0.5% fraudulent cases. Yet in a dataset with 1 million cases, that may be sufficient to build a
model. The data might be sampled to provide an equal number of fraudulent and non-fraudulent cases. By specifying that the target priors were 0.995 for non-fraudulent and
0.005 for fraudulent cases, the algorithm can adjust the resulting model to more accurately
reflect the characteristics of the general population.

The sum of the priors must equal 1. If a value is present in the data that is not specified in
the priors and at least one prior has been specified, an exception is raised.
Note that stratification is a general procedure that can apply to attributes other than the target. It is a procedure that first groups data rows by some criterion and then samples differentially among the groups. In the more general case, a row weighting scheme would be
required for the model to be able to re-construct the relation of sample to population.
Note, however, that although priors may be specified for all classification problems, they
may be ignored if a particular algorithm cannot handle such input. It is up to each vendor
to document the behavior in this case.
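To make the fraud example concrete, the following sketch rescales class probabilities produced by a model trained on a balanced sample so that they reflect the population priors. This is one standard correction, shown as an assumption: the specification leaves the actual adjustment to the algorithm. The class PriorsSketch and its methods are hypothetical; checkPriors mirrors the rule that priors must sum to 1.

```java
/** Hypothetical helper; not part of the JDM API. */
final class PriorsSketch {
    /** Rejects priors that do not sum to 1 (within a small tolerance). */
    static void checkPriors(double[] priors) {
        double sum = 0.0;
        for (double p : priors) {
            sum += p;
        }
        if (Math.abs(sum - 1.0) > 1e-9) {
            throw new IllegalArgumentException("priors must sum to 1, got " + sum);
        }
    }

    /**
     * Reweights each class probability by populationPrior / samplePrior
     * and renormalizes so the adjusted probabilities sum to 1.
     */
    static double[] adjust(double[] modelProbs, double[] samplePriors, double[] populationPriors) {
        checkPriors(samplePriors);
        checkPriors(populationPriors);
        double[] adjusted = new double[modelProbs.length];
        double norm = 0.0;
        for (int c = 0; c < modelProbs.length; c++) {
            adjusted[c] = modelProbs[c] * populationPriors[c] / samplePriors[c];
            norm += adjusted[c];
        }
        for (int c = 0; c < adjusted.length; c++) {
            adjusted[c] /= norm;
        }
        return adjusted;
    }
}
```

For a case scored 0.9 fraudulent on a 50/50 sample, with population priors of 0.005 and 0.995, the adjusted fraud probability is 0.009 / 0.208, or roughly 4.3%.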

3.3.20 Category sets


A category, also referred to as a nominal value, is a discrete value that can belong to a categorical attribute. A related collection of categories is called a category set. For example,
the category set of colors could contain categories red, green, and blue. JDM limits category values to integer, double, and string datatypes. Since category values correspond to
physical attribute content, these values may be cryptic, e.g., the number 1 used to represent the color red. As such, categories may have a name associated with them for informational purposes. These names may serve as display names or contain identifiers to allow
mapping to external data tables, e.g., to display category names in different languages.
A category set can be associated with a categorical attribute to identify the categories
expected to be found in the physical attribute. If the data contains values different from
those specified in the category set for a target attribute, an exception is raised at task execution time and/or object validation time. It is vendor specific whether exceptions are
raised on non-target attributes where category sets are specified.
The category set can also identify categories that represent missing data, i.e., missing values. For example, some physical attributes may contain the value 999 to represent a
missing number. Others may contain null values or a "." character to indicate missing values. These can be specified in the category set with a corresponding property. Through a
function's or algorithm's capabilities specification, the user can determine if missing values
are handled automatically. Vendor documentation must specify how missing values are
handled.
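The ideas in this section (cryptic category values with display names, values flagged as missing, and validation of data against the expected categories) can be sketched as follows. The class CategorySetSketch and its methods are hypothetical and do not reproduce the JDM CategorySet interface. Treating null as missing is an assumption made for the example.

```java
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

/** Hypothetical stand-in for a category set; not the JDM interface. */
final class CategorySetSketch {
    private final Map<Object, String> displayNames = new LinkedHashMap<>();
    private final Set<Object> missingValues = new HashSet<>();

    /** Registers a valid category value with an informational display name. */
    CategorySetSketch add(Object value, String displayName) {
        displayNames.put(value, displayName);
        return this;
    }

    /** Registers a value that stands for missing data, e.g., 999. */
    CategorySetSketch addMissing(Object value) {
        missingValues.add(value);
        return this;
    }

    boolean isMissing(Object value) {
        return value == null || missingValues.contains(value);
    }

    boolean isValid(Object value) {
        return displayNames.containsKey(value);
    }

    String displayName(Object value) {
        return displayNames.get(value);
    }

    /** Mirrors the rule for target attributes: an unexpected value raises an exception. */
    void check(Object value) {
        if (!isMissing(value) && !isValid(value)) {
            throw new IllegalArgumentException("unexpected category: " + value);
        }
    }
}
```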

3.3.21 Taxonomy
A taxonomy represents hierarchical relationships between categories. Generally, the topmost categories are the most general, and the leaves are the most specific, referring to specific
item categories. For example, in the category taxonomy of beverages, there may be two
sub-categories, alcoholic and non-alcoholic. The category alcoholic may have further sub-categories of beer, wine, liquor, and sparkling wine.
Taxonomies can exist as explicit relationships between categories, represented as Java
objects, or as metadata that references external tabular data. Large taxonomies are
expected to exist as external tabular data.
Taxonomies are an optional specification for Association settings.
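A taxonomy held as explicit relationships between categories can be sketched as a child-to-parent map. The class TaxonomySketch and its methods are hypothetical, not the JDM Taxonomy interface, and the sketch assumes each category has at most one parent and that the hierarchy has no cycles.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical stand-in for a taxonomy; not the JDM interface. */
final class TaxonomySketch {
    private final Map<String, String> parentOf = new HashMap<>();

    /** Records that `child` is a sub-category of `parent`. */
    TaxonomySketch addChild(String parent, String child) {
        parentOf.put(child, parent);
        return this;
    }

    /** Ancestors from the immediate parent up to the topmost category. */
    List<String> ancestors(String category) {
        List<String> result = new ArrayList<>();
        for (String p = parentOf.get(category); p != null; p = parentOf.get(p)) {
            result.add(p);
        }
        return result;
    }

    /** True if `ancestor` lies above `category` in the hierarchy. */
    boolean isA(String category, String ancestor) {
        return ancestors(category).contains(ancestor);
    }
}
```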

3.3.22 Rules
Certain algorithms produce rules of the general form: antecedent implies consequent.
JDM defines two kinds of rules: itemset-based as used in association rules and predicate-based as used in decision trees and clustering.
Decision tree rules involve predicate-based antecedents with probability-based predictions
in the consequents. Clustering can express the component clusters as rules producing
assignments to a cluster.
Association rules are presented in the form of association objects that link an antecedent
itemset with a consequent itemset.
JDM provides methods for extracting rules from the corresponding models. Rule objects
may be translated into a vendor-defined format string, or a standard XML representation.
Rule objects provide methods giving access to the various rule features.

3.3.23 Verification report


A verification report results from executing the verify method on certain objects, e.g.,
tasks and settings. If the verify method finds no issue with the provided object, it returns
NULL. Issues may be warnings or errors. Warnings provide information to the user that
the DME may not be able to process the object as expected. Errors indicate the DME will
not process the object and will throw an exception if execution is attempted.
In its simplest form, a verification report indicates its type, warning or error, and contains
a text string. The format of this string is vendor-specific. Vendors may choose to subclass
VerificationReport with specific structure.

3.4 Physical data representations


Physical data can occur in several forms: individual records, single record case tables, and
multiple record case tables. By using URIs to specify single and multiple record case
tables, data may come from any vendor-supported datastore, e.g., databases,
file systems, OLAP cubes, star schemas, etc. In addition, data may be prepared or unprepared.

3.4.1 Individual record


Certain applications require apply results (e.g., predictions) for a single record of data. In
customer call center applications, for example, a customer calls and responds to certain
questions over the phone. This information combined with corporate customer demographic data can be used as input to a model to determine which offer should be made to
the customer, dynamically. These applications often expect real-time response.
JDM supports the scoring of individual records, where the data is provided in a PhysicalDataRecord object. Such records can be created by explicitly setting values for individual
attributes, or via string or input stream and a specified format, such as CWM or SQL/MM.
A RecordApplyTask specifies one PhysicalDataRecord for input and one for output. The
RecordApplyTask requires synchronous execution to facilitate real-time response.
The Java objects defining the input record may be reused by the application by resetting
the data values. This minimizes the number of objects created for apply. The input record
is not to be modified by the execution of the RecordApplyTask. The output record is automatically created by the system and reused upon each invocation. It is up to the application to copy the values from the output record as needed. The output record does
not contain valid results until the execution completes.

3.4.2 Single record case table


Most data used for data mining is stored in the single record case table format. In this format, each column of data corresponds to a logical attribute, e.g., age and income. Each
row of data corresponds to an individual case to be considered during mining.
Table 1 illustrates a typical example of a single record case table. There are two records,
representing two cases of different individuals. In a supervised learning situation, the column churner could be identified as the target. The column id would not be used for model
building, but would be used for scoring to enable matching a particular score with the corresponding record.

id     age    income    churner
100    45     100       False
200    23     25        True

TABLE 1. An example of a single record case table

3.4.3 Multi-record case table


Sparse data is more effectively stored in multi-record case format. Here, data that has a
variable number of entries (or items) from among many possible can be stored more compactly where only the items present are stored in the table. This representation is typically
used for association functions, but can be used in other functions as well. A common name
for data represented in this format is transactional data.
Table 2 illustrates the same data presented in Table 1 in multi-record case table format.

SeqID    Name       Value
100      age        45
100      income     100
100      churner    False
200      age        23
200      income     25
200      churner    True

TABLE 2. An example of a multi-record case table


In this format, a column no longer corresponds to a logical attribute, but assumes a role.
Here, columns assume the roles of a sequence identifier, item or attribute name, value, and
possibly ordering of the items within a transaction. The values in the attribute name column can be mapped to logical attributes. Note that the value column is not typically
available or necessary for association and is optionally specified.
A challenge with multi-record-case data is that the value column may need to allow multiple data types, e.g., strings as well as numbers. Some vendor implementations may require
a uniform datatype for all values. This can be achieved through data preprocessing.

3.4.4 Data preparation


Although transformations supporting data preparation are outside JDM's scope, JDM does
allow the specification, on a per-attribute basis, of whether an attribute has been prepared by
the user or not.
If the user does not want the DME to further manipulate an attribute's data values, perhaps
by binning or normalization, the attribute is flagged as prepared. If the DME cannot work
with the data as presented, for example because a neural network requiring normalized data was presented with data in an invalid range, the DME may choose to throw an exception or produce a poor model.
Some DMEs may be able to accept data in a more raw form and perform automated
transformations within the DME. In this case, the user may flag the data as unprepared and
expect the DME to preprocess the data. One benefit of allowing the DME to prepare the
data is that the transformations are typically embedded in the model. When data is scored
or model details are examined, values are presented in terms of the original data values.
Note that flagging data as unprepared does not mean that the user did not, or could not,
prepare the data in some way, perhaps removing or replacing missing values, or computing new attributes. Transformations done by the user on data prior to invoking a model
build task must be performed in a similar manner on data used for a model test or apply task.

3.5 Attribute mapping


The mapping of physical attributes to logical attributes is based primarily on the representation of physical data and how it should be interpreted for mining operations.
Attribute mapping is used from physical data to logical data for mining task input.
Attribute mapping is also used for mapping from apply settings to physical data for scoring.
JDM specifies two kinds of mapping: direct and pivot, as explained below.

3.5.1 Direct mapping


Direct mapping involves a one-to-one mapping of physical attributes to logical attributes
in a build settings object. Normally, this mapping can be omitted if the names of physical and
logical attributes match. However, if the names are not the same, users need to explicitly
provide the mapping between attributes.
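The name-matching rule for direct mapping can be sketched as follows: an explicit mapping wins when present, otherwise a physical attribute maps to the logical attribute of the same name. The class DirectMappingSketch and its resolve method are hypothetical. Leaving out physical attributes that match no logical attribute is an assumption of this sketch, not a rule stated by the specification.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical helper; not part of the JDM API. */
final class DirectMappingSketch {
    /**
     * Returns physical attribute name -> logical attribute name.
     * `explicit` holds user-supplied mappings for names that differ.
     */
    static Map<String, String> resolve(List<String> physicalNames,
                                       Set<String> logicalNames,
                                       Map<String, String> explicit) {
        Map<String, String> mapping = new LinkedHashMap<>();
        for (String physical : physicalNames) {
            // Fall back to the physical name itself when no explicit mapping is given.
            String logical = explicit.getOrDefault(physical, physical);
            if (logicalNames.contains(logical)) {
                mapping.put(physical, logical);
            }
        }
        return mapping;
    }
}
```

For physical attributes X, B, and id, logical attributes A and B, and an explicit mapping of X to A, the result maps X to A and B to B, and id is not mapped.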

[Figure: Model Signature, Apply Input (Physical Data), and Apply Output (Physical Data) with Score and Prob attributes]

FIGURE 1.2 Example of attribute mapping for apply

Figure 1.2 illustrates an example that maps apply input data (AID) to apply settings data
(ASD), and the mapping of AID to the model signature (MS). A direct mapping between
AID and MS can be specified explicitly. Here, since the physical attribute named X is different from the MS attribute named A, mapping is required. To output key values or other
data values from AID to ASD, a direct mapping from X to the ASD attribute named A is
specified. If AID attribute Z were a key, a direct mapping to ASD attribute Y would
rename the attribute and output key values for each case's score. From the model itself, the
user specifies to output the top score and corresponding probability (assuming a classification model).

3.5.2 Pivot mapping


Pivot mapping involves mapping multi-record case data to logical attributes. Here, the
physical attribute with role attributeName contains values that can be mapped to logical
attributes.
A user need not perform this mapping in the case of an association function as all
attributeName values are treated as items. In this case, no logical attributes need be identified. This is particularly beneficial if there are thousands of items that would otherwise
need to be mapped.
Mapping is required for other algorithms that use logical attribute specifications. Only
attributeName values that are mapped to logical attributes are used for data mining.
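Pivot mapping can be pictured as turning rows of (sequence identifier, attribute name, value) into one case per sequence identifier, keeping only the attribute names that are mapped. The sketch below is an illustration under assumptions: the class PivotMappingSketch is hypothetical, all values are held as strings, and each mapped attributeName value maps to a logical attribute of the same name.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

/** Hypothetical helper; not part of the JDM API. */
final class PivotMappingSketch {
    /**
     * rows: each row is {sequenceId, attributeName, value}.
     * Returns sequenceId -> (logical attribute name -> value).
     */
    static Map<String, Map<String, String>> pivot(String[][] rows, Set<String> mappedAttributes) {
        Map<String, Map<String, String>> cases = new LinkedHashMap<>();
        for (String[] row : rows) {
            if (!mappedAttributes.contains(row[1])) {
                continue; // unmapped attributeName values are not used for mining
            }
            cases.computeIfAbsent(row[0], k -> new LinkedHashMap<>()).put(row[1], row[2]);
        }
        return cases;
    }
}
```

Applied to the rows of Table 2 with only age and income mapped, the result holds two cases, each with an age and an income value, and the churner rows are ignored.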

3.6 Creating physical data objects


With the exception of individual records, physical data will reside in a database or file system. Individual records are created programmatically. Having metadata that describes the
name and datatype of each column or attribute of physical data is essential for automatically creating physical data objects.
In the case of databases, such metadata is readily available in system tables. A JDM
implementation can query the database to populate a physical data object's attributes automatically.
In the case of files, such metadata requires explicit specification. There are typically two
approaches to solving this problem: provide a separate file that describes the columns in
the file, or provide a header in the file that describes the rest of the file's content. JDM
does not specify file formats, but does provide for identifying a separate descriptor file, if
needed.

3.7 Persistence
JDM defines several named objects that can be stored in the MOR of the DME. These
objects can be categorized into input objects (build settings, logical data, physical data set,
cost matrix, taxonomy, apply settings), task objects, and output objects (model, test metrics). Named objects are defined to enable applications to reuse the objects and avoid having to maintain such application metadata independently. However, a vendor can choose
which objects are persistent and which are transient based on the needs of the vendor's
users. An object is persistent if it can be accessed across sessions. A session is defined as
the duration of an open connection to the DME. An object is transient if it is removed, or no
longer accessible, once the session terminates. Transient objects can be accessed by name
during the session.

Consider the following use cases:

• Some vendors may choose to persist all named objects across sessions. In this case, named objects are persisted independent of connection availability. Named objects will be available for reuse until the application explicitly removes them using the connection method. This is applicable when a vendor needs to support a full-scale MOR, perhaps supporting an end-user data mining tool.
• Some vendors may persist no metadata, e.g., named objects are persisted only for the lifetime of the connection. This is applicable when a vendor supports only synchronous execution of the tasks and has no need to persist objects once the mining operation completes.
• We expect most vendors will support persistence of some metadata to enable asynchronous execution. Data mining operations are often long running, so support for asynchronous execution can be critical for some applications. To support asynchronous task execution, both tasks and output objects need to be persisted across sessions. This is applicable when a vendor wishes to minimize metadata storage and maintenance requirements.
Through the use of Connection.supportsCapability (NamedObject, PersistenceOption)
and Connection.getNamedObjects (PersistenceOption), users of a vendor implementation
can determine which objects are persistent or transient.

3.8 Object references


JDM named objects are currently: task, build settings, model, logical data, physical data
set, result, taxonomy, cost matrix. Table 3 describes the type of referencing each kind of
named object uses in JDM.
Named Reference (a.k.a. reference by name) - one object refers to another by name.
Here, when a referenced object is replaced, all objects referencing that named object see
the new version. When a referenced object is deleted, attempting to access the referenced
object results in the exception, object not found.
Objects referencing named objects are maintained by name reference for non-result
objects. For example, a named build settings object references its logical data by name. If
the logical data object is replaced, the build settings object refers to the new logical data
object. If the logical data object is deleted, the build settings object would return an object
not found exception if the logical data object was accessed.
Owned Reference (a.k.a. reference by value) - one object references another by its unique
object identifier, but also owns this object. Here, the owned object cannot change or be
deleted from the referencing object. When the referencing object is deleted, the owned
object is deleted.
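The difference between the two reference kinds can be sketched with a small in-memory repository. This is a simplified illustration, not the JDM object model: the classes Repository, NamedRef, and OwnedRef are hypothetical, strings stand in for mining objects, and the owned reference is modeled as a private snapshot taken when the owner is created.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.NoSuchElementException;

/** Hypothetical illustration of named versus owned references. */
final class ReferenceSketch {
    /** Stand-in for the mining object repository; strings stand in for objects. */
    static final class Repository {
        private final Map<String, String> objects = new HashMap<>();

        void save(String name, String value) { objects.put(name, value); }

        void remove(String name) { objects.remove(name); }

        String get(String name) {
            String value = objects.get(name);
            if (value == null) {
                throw new NoSuchElementException("object not found: " + name);
            }
            return value;
        }
    }

    /** Named reference: resolved by name on every access, so replacements are visible. */
    static final class NamedRef {
        private final Repository repository;
        private final String name;

        NamedRef(Repository repository, String name) {
            this.repository = repository;
            this.name = name;
        }

        String resolve() { return repository.get(name); }
    }

    /** Owned reference: holds its own copy, unaffected by later replacement or deletion. */
    static final class OwnedRef {
        private final String snapshot;

        OwnedRef(Repository repository, String name) {
            this.snapshot = repository.get(name);
        }

        String resolve() { return snapshot; }
    }
}
```

Replacing the named object changes what a NamedRef resolves to, and deleting it makes the NamedRef fail with an object-not-found error, while an OwnedRef keeps returning the version it captured.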

Object             Referenced Objects       Comment

Task               named                    References physical data, build settings, model, apply settings

BuildSettings      named - outside model    References logical data, taxonomy, category set, cost matrix,
                   owned - within model     similarity matrix

Model              owned                    Contains build settings. Models maintain a complete immutable
                                            snapshot of name-referenced objects, e.g., the build settings
                                            referenced by a model is effectively replicated for the model
                                            and subobjects are fully instantiated. These subobjects, even
                                            if once named, no longer have their names available, i.e.,
                                            they become unnamed objects.

LogicalData        named - outside model    Contains LogicalAttributes which may reference CategorySet
                   owned - within model     and Taxonomy objects.

PhysicalDataSet    N/A                      Contains PhysicalAttributes.

Taxonomy           N/A

CostMatrix         N/A

TestMetrics        owned                    Subclass instances may contain, e.g., Lift and
                                            ReceiverOperatingCharacterstics objects.

ApplySettings      owned - once saved       ClassificationApplySettings may contain CostMatrix objects.

TABLE 3. Named and composite object referencing summary


JDM specifies three kinds of references between objects with associated behavior: named,
identifier, and owned. These are described below.

3.9 Reflection / introspection


Introspection allows classes to examine one another and determine the components comprising them. Reflection allows programs to determine at runtime the methods supported
by an object and the exceptions thrown.
Since JDM provides a la carte package compliance, and certain vendors may support different data mining capabilities or options, it is useful for programs to dynamically determine the capabilities of a vendor implementation.
We approach this need on several levels:

package support
capability support
default values for object variables
The Connection interface provides the method supportsPackage, which allows programs to
determine which packages the implementation supports. The standard java.lang.Class object
allows a program to determine which methods a class supports. Within appropriate classes, the
method supportsCapability allows a program to determine whether the implementation will use a
provided value for an object variable. For example, ClassificationSettings can be queried to
learn whether the cost matrix or priors specifications will have any effect on the model build.

Programs can determine the default values provided for objects by using the default constructor
for an object and calling its get methods. For example, invoking getMaxSurrogates on a
TreeSettings instance returns the default for this value. However, a program should first
invoke supportsCapability for MaxSurrogates to learn whether the implementation supports
surrogates. If the implementation does not support a capability, both the value returned by the
get method and the effect of a value supplied to the set method are undefined.
Methods whose signatures must be provided but that are not implemented throw
java.lang.UnsupportedOperationException.
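
The check-before-use pattern described above can be sketched as follows. This is an illustration under stated assumptions, not the JDM API: TreeCapability and TreeSettingsStub are hypothetical stand-ins, and the default of 0 for maxSurrogates is invented for the example.

```java
import java.util.EnumSet;
import java.util.Set;

class CapabilitySketch {
    /** Hypothetical capability names; the real enumeration is defined by the JDM Javadoc. */
    enum TreeCapability { MAX_SURROGATES, MAX_DEPTH }

    /** Hypothetical, trimmed-down stand-in for a tree settings object. */
    static final class TreeSettingsStub {
        private final Set<TreeCapability> supported;
        private int maxSurrogates = 0; // invented stand-in for a vendor default

        TreeSettingsStub(Set<TreeCapability> supported) { this.supported = supported; }

        boolean supportsCapability(TreeCapability capability) {
            return supported.contains(capability);
        }

        int getMaxSurrogates() { return maxSurrogates; }

        void setMaxSurrogates(int value) { maxSurrogates = value; }
    }

    /** Sets the value only when the capability is supported; reports whether it was set. */
    static boolean trySetMaxSurrogates(TreeSettingsStub settings, int value) {
        if (!settings.supportsCapability(TreeCapability.MAX_SURROGATES)) {
            return false; // the value would be undefined, so the caller should not rely on it
        }
        settings.setMaxSurrogates(value);
        return true;
    }

    public static void main(String[] args) {
        TreeSettingsStub settings = new TreeSettingsStub(EnumSet.of(TreeCapability.MAX_SURROGATES));
        System.out.println("default: " + settings.getMaxSurrogates());
        System.out.println("applied: " + trySetMaxSurrogates(settings, 3));
    }
}
```

A caller that skips the supportsCapability check still compiles and runs against an implementation without surrogates, but it has no way to know whether the value it set has any effect.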

4. Packages
In this section, we first introduce the notation used for depicting the JDM interfaces in
subsequent sections. Then, we introduce the packages that support the JDM specification.
This section is provided to show relationships graphically between the various components and objects. The methods on each interface are also depicted, without further comment. For details of the interfaces depicted below, refer to the accompanying Java
documentation produced using Javadoc.

4.1 Design overview


To achieve an a la carte specification, the package structure in JDM revolves around
required packages and various optional packages. By including a given optional package,
other optional packages may become required.
In section 4.3, several required packages exist to group interfaces in a logical framework.
This also facilitates introduction of other optional subpackages. For example, each of the
mining functions attribute importance, association, and clustering has a top level,
optional package. The optional supervised mining functions of classification and regression are grouped under the optional supervised package.
Within each function-related package, we include its build settings, top-level algorithm
settings (to be specialized in algorithm packages), and function-generic model representation.
The algorithm package contains subpackages for each of the algorithm settings selected
for standardization in this release of the JDM specification.
In addition to the model details noted above, we introduced a separate model detail package that allows for a more detailed view of a given model's representation. These representations are typically associated with a particular algorithm.

4.2 Notation
Each diagram in this section represents JDM objects with the following conventions, in
some cases, to facilitate code generation.
Package - a collection of interfaces, classes, and enumerations that maps to a Java package. A package is depicted as a tabbed folder. Figure 4.1 depicts three packages. PackageA is an
individual package. PackageC is a subpackage of PackageB.
Interface - a named specification for a set of methods that provide a service. Classes
implement interfaces. An interface is depicted as a rectangle, named, possibly with methods specified. Interfaces are distinguished from classes by the italicized name.
Class - a named specification for a set of methods. Classes are used in JDM where constructors and / or static methods are desired. A class is depicted as a rectangle, named, possibly with methods specified. A class may implement multiple interfaces. Note that JDM
defines only one class, JDMException. All other objects are described as interfaces.
Inheritance - a relationship between interfaces, classes, or between one or more interfaces and a class. Inheritance is depicted as an open triangle at the more general element,
and a line to the more specific element.

Association - a relationship between objects (classes or interfaces). We adopt an irregular
use of associations for the purposes of calling out object relationships. However, this does
not place restrictions on the underlying implementation. There are two kinds of associations:
ownership and reference. Ownership relationships are depicted as a solid diamond on the owner
side of the association with a line to the owned side. Figure 4.1 depicts this between ClassA
and ClassC. Reference relationships are depicted as directed arrows from the referencing object
to the referenced. Figure 4.1 depicts this between ClassA and ClassB.
Enumeration - a named class that lists an explicit set of values allowed for the class.
Enumerations are depicted as rectangles, named, with values listed and the stereotype
<<enumeration>> above the class name.
Dependency - a relationship between packages, or between a class or interface and an
enumeration class. This indicates that the class or interface uses the enumeration in at least
one method, either as input or output. Dependencies are depicted as dashed arrows. Figure 4.1
depicts a dependency of PackageA on PackageB and a dependency of ClassA on Enumeration. To aid
readability, not all dependencies are depicted in the diagrams.

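[Diagram: sample notation showing PackageA; PackageB containing subpackage PackageC; Interface
with methodZ(); ClassA with methodX() and methodY(); ClassB; ClassC; the associations
ABassociation (ClassA and ClassB) and ACassociation (ClassA and ClassC); and an
<<Enumeration>> named Enumeration with values value1 and value2.]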
FIGURE 4.1 JDM diagram notation

4.3 Package structure


The javax.datamining package consists of several sub-packages as described in this section. The top level packages are:

javax.datamining: Defines objects supporting all JDM subpackages.

javax.datamining.base: Defines objects supporting many top level mining objects.
Introduced to avoid cyclic package dependencies.

javax.datamining.resource: Defines objects that support connecting to the DME, and
executing tasks.

javax.datamining.data: Defines objects supporting logical and physical data, model
signature, taxonomy, category set and the generic superclass category matrix.

javax.datamining.statistics: Defines objects supporting attribute statistics.

javax.datamining.rules: Defines objects supporting rules and their predicate components.

javax.datamining.task: Defines objects supporting tasks for build, compute statistics,
import, and export. Task has an optional subpackage for apply since apply is used
mainly for supervised and clustering functions.

javax.datamining.association: Defines objects supporting the build settings and
model for association.

javax.datamining.clustering: Defines objects supporting the build settings and model
for clustering.

javax.datamining.attributeimportance: Defines objects supporting the build settings
and model for attribute importance.

javax.datamining.supervised: Defines objects supporting the build settings and models
for supervised learning functions, specifically: classification and regression, with
corresponding optional packages. It also includes a common test task for the
classification and regression functions.

javax.datamining.algorithm: Defines objects supporting the settings that are specific
to algorithms. The algorithm package has optional subpackages for different algorithms.

javax.datamining.modeldetail: Defines objects supporting details of various model
representations. ModelDetail has optional subpackages for different model details.
Figure 4.2 depicts the JDM top level package structure for the packages described above.

[Diagram: <<metamodel>> packages Base, Data, Statistics, Task, Apply, Rule, Association,
AttributeImportance, Clustering, Supervised, and Algorithm, together with the packages
Regression, Classification, KMeans, NaiveBayes, FeedForwardNeuralNet, Tree, and SVM.]

FIGURE 4.2 Top level package structure

4.4 Package javax.datamining


The top level javax.datamining package contains miscellaneous interfaces shared by other
JDM packages.

Collection
contains(object : Object) : boolean
containsAll(collection : Collection) : boolean
equals(o : Object) : boolean
hashCode() : int
isEmpty() : boolean
iterator() : Iterator
size() : int
toArray() : Object
toArray(objectArray : Object) : Object

Factory

VerificationReport
getReportText() : String
getReportType() : ReportType

<<enumeration>>
ReportType
error
warning

Note on Collection: java.util.Collection without mutable methods...

FIGURE 4.3 Common top level interfaces

Exception

Java
Exception

JDMException
JDMException(errorCode : int, errorMessage : String)
JDMException(errorCode : int, errorMessage : String, vendorCode : int, vendorMessage : String)
getErrorCode() : int
getVendorErrorCode() : int
getVendorErrorMessage() : String

ConnectionFailureException

TaskException

InvalidURIException

ObjectExistsException

UnsupportedOperationException

JDMUnsupportedFeatureException

IncompatibleSpecificationException

InvalidObjectException

ObjectNotFoundException

Java Runtime Exceptions

DuplicateEntryException

EntryNotFoundException

IllegalArgumentException

JDMIllegalArgumentException

FIGURE 4.4 Exception classes

Enum
getEnum() : String
isEqual(src : Enum) : boolean
<<enumeration>>
ReportType
error
warning

<<enumeration>>
OutlierTreatment
systemDefault
systemDetermined
asIs
asMissing

<<enumeration>>
MiningFunction
association
attributeImportance
regression
clustering
classification

<<enumeration>>
MiningAlgorithm
feedForwardNeuralNet
kMeans
naiveBayes
decisionTree
svmRegression
svmClassification

<<enumeration>>
LogicalAttributeUsage
active
supplementary
inactive

<<enumeration>>
ExecutionState
submitted
executing
success
error
terminating
terminated

<<enumeration>>
SortOrder
systemDefault
asIs
ascending
descending

<<enumeration>>
SizeUnit
count
percentage

<<enumeration>>
MiningTask
buildTask
testTask
applyTask
computeStatisticsTask
exportTask
importTask

<<enumeration>>
ImportExportFormat
systemDefault
PMML1_0
PMML2_0
PMML2_1
PMML3_0
CWM1_0
CWM1_1
JDM1_0

<<enumeration>>
NamedObject
task
buildSettings
model
logicalData
physicalDataSet
testMetrics
taxonomy
costMatrix
applySettings

FIGURE 4.5 Top level enumerations

ExecutionHandle
terminate() : ExecutionStatus
getLatestStatus() : ExecutionStatus
getStatus(fromTimestamp : Date) : Collection
getStartTime() : Date
waitForCompletion(timeoutInSeconds : int) : Execut...
getDurationInSeconds() : Integer
getTaskName() : String
getWarnings() : ExecutionStatus
containsWarning() : boolean

ExecutionStatus
getState() : ExecutionState
getTimestamp() : Date
getDescription() : String
containsWarning() : boolean

<<enumeration>>
ExecutionState
submitted
executing
success
error
terminating
terminated

FIGURE 4.6 Execution Handle
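
A client that follows a task through an execution handle has to decide when to stop polling. The sketch below shows one way to do that with the ExecutionState values listed in Figure 4.6. It assumes that success, error, and terminated are final states, which the figure itself does not state. StatusSource is a hypothetical stand-in for whatever object reports the current state; it is not a JDM interface.

```java
class ExecutionStateSketch {
    /** Values copied from the ExecutionState enumeration in Figure 4.6. */
    enum ExecutionState { submitted, executing, success, error, terminating, terminated }

    /** Hypothetical source of the latest state, e.g. a wrapper around a status query. */
    interface StatusSource {
        ExecutionState getState();
    }

    /** Assumption: success, error, and terminated mean the task will not change state again. */
    static boolean isFinished(ExecutionState state) {
        switch (state) {
            case success:
            case error:
            case terminated:
                return true;
            default:
                return false;
        }
    }

    /** Queries the source up to maxPolls times and returns the last state seen. */
    static ExecutionState pollUntilFinished(StatusSource source, int maxPolls) {
        ExecutionState state = source.getState();
        for (int polls = 1; polls < maxPolls && !isFinished(state); polls++) {
            state = source.getState();
        }
        return state;
    }
}
```

A real client would normally wait between polls or use a blocking call with a timeout; the loop here omits the delay to keep the example short.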

4.5 Package javax.datamining.base


The base package contains interfaces that support task, build settings, model, results, algorithm settings, and model detail.

<<enumeration>>
NamedObject
task
buildSettings
model
logicalData
physicalDataSet
testMetrics
taxonomy
costMatrix
applySettings

MiningObject
getObjectType() : NamedObject
getDescription() : String
setDescription(description : String)
getName() : String
getCreatorInfo() : String
getCreationDate() : Date
getObjectIdentifier() : String

BuildSettings

Task

Model

Taxonomy (from Data)

PhysicalDataSet (from Data)

ApplySettings (from Apply)

TestMetrics (from Supervised)

LogicalData (from Data)

CostMatrix (from Classification)

FIGURE 4.7 Package javax.datamining.base - Named Objects

MiningObject
getDescription()
setDescription()
getName()
getCreatorInfo()
getCreationDate()
getObjectIdentifier()

BuildSettings
getMiningFunction()
getDesiredExecutionTimeInMinutes()
setDesiredExecutionTimeInMinutes()
getAlgorithmSettings()
setAlgorithmSettings()
getLogicalData()
getLogicalDataName()
setLogicalDataName()
getLogicalAttributes()
getWeight()
setWeight()
setWeightAttribute()
getWeightAttribute()
getUsage()
setUsage()
setOutlierTreatment()
getOutlierTreatment()
setOutlierIdentification()
getOutlierIdentification()
verify()

Model
getUniqueIdentifier()
getVersion()
getMajorVersion()
getMinorVersion()
getProviderName()
getProviderVersion()
getApplicationName()
getMiningFunction()
getMiningAlgorithm()
getSignature()
getBuildSettings()
getEffectiveBuildSettings()
getModelDetail()
getAttributeStatistics()
getTaskIdentifier()
getBuildDuration()

TestMetrics
getTaskIdentifier() : Integer
getModelName() : String
getTestDataName() : String

Task
getExecutionHandle() : ExecutionHandle

FIGURE 4.8 Package javax.datamining.base - Build Settings, Model, and Task

MiningObject

LogicalData (from Data)

BuildSettings
getMiningFunction() : MiningFunction
getDesiredExecutionTimeInMinutes() : int
setDesiredExecutionTimeInMinutes(minutes : int)
getAlgorithmSettings() : AlgorithmSettings
setAlgorithmSettings(algorithmSettings : AlgorithmSettings)
getLogicalData() : LogicalData
getLogicalDataName() : String
setLogicalDataName(name : String)
getLogicalAttributes(usage : LogicalAttributeUsage) : Collection
getWeight(logicalAttrName : String) : double
setWeight(logicalAttrName : String, weight : double)
getWeightAttribute() : String
setWeightAttribute(logicalAttrName : String)
getUsage(logicalAttrName : String) : LogicalAttributeUsage
setUsage(logicalAttrName : String, usage : LogicalAttributeUsage)
getOutlierTreatment(logicalAttrName : String) : OutlierTreatment
setOutlierTreatment(logicalAttrName : String, treatment : OutlierTreatment)
getOutlierIdentification(logicalAttrName : String) : Interval
setOutlierIdentification(logicalAttrName : String, bounds : Interval)
getAttributeNames(retrievalType : AttributeRetrievalType) : String
verify() : VerificationReport

AlgorithmSettings
getMiningAlgorithm() : MiningAlgorithm
verify() : VerificationReport

buildSettingsRefLogicalData: +buildSettings 0..n, +logicalData 0..1
buildSettingsRefAlgorithmSettings: +buildSettings 0..n, +algorithmSettings 0..1

AssociationSettings

SupervisedSettings

ClassificationSettings

RegressionSettings

AttributeImportanceSettings

ClusteringSettings

<<enumeration>>
LogicalAttributeUsage
active
supplementary
inactive

<<enumeration>>
OutlierTreatment
systemDefault
systemDetermined
asIs
asMissing

<<enumeration>>
AttributeRetrievalType
usage
weight
outlierTreatment
outlierIdentification

FIGURE 4.9 Package javax.datamining.base - BuildSettings

MiningObject (from JDM Root)

Model
getUniqueIdentifier() : String
getVersion() : String
getMajorVersion() : String
getMinorVersion() : String
getProviderName() : String
getProviderVersion() : String
getApplicationName() : String
getMiningFunction() : MiningFunction
getMiningAlgorithm() : MiningAlgorithm
getSignature() : ModelSignature
getBuildSettings() : BuildSettings
getEffectiveBuildSettings() : BuildSettings
getModelDetail() : ModelDetail
getAttributeStatistics() : AttributeStatisticsSet
getTaskIdentifier() : String
getBuildDuration() : Integer

BuildSettings
modelHasSettings: +model 1, +settings 0..1
modelHasEffectiveSettings: +model 1, +effectiveSettings 0..1

ModelSignature (from Data)
modelHasSignature: +model 1, +signature 0..1

AttributeStatisticsSet (from Statistics)
miningModelHasStatist...: +model 1, +dataStatistics 0..1

ModelDetail
miningModelHasDetail: +model 1, +detail 0..1

SupervisedModel

RegressionModel

AssociationModel

AttributeImportanceModel

ClusteringModel

ClassificationModel

FIGURE 4.10 Package javax.datamining.base - Model

4.6 Package javax.datamining.resource


The resource package contains interfaces that support interaction with the DME. The
interfaces ConnectionFactory, Connection, ConnectionSpec, and ConnectionMetaData are
specified as part of the Java 2 Connector Architecture.

ConnectionSpec
getName() : String
setName(userName : String)
getURI() : String
setURI(uri : String)
setPassword(password : String)
setLocale(locale : Locale)
getLocale() : Locale

<<enumeration>>
ConnectionCapability
containerManaged
connectionSpec
jcxConnection
scoringEngine

ConnectionMetaData
getVersion() : String
getMajorVersion() : int
getMinorVersion() : int
getProviderName() : String
getProviderVersion() : String

ConnectionFactory
getConnection() : Connection
getConnection(spec : ConnectionSpec) : Connection
getConnection(connection : Connection) : Connection
getConnectionSpec() : ConnectionSpec
supportsCapability(capability : ConnectionCapability) : boolean

<<enumeration>>
PersistenceOption
transientObject
persistentObject

factoryHasConnections: +factory, +connection 0..n

Connection
close()
getFactory(objectName : String) : Factory
getMetaData() : ConnectionMetaData
getConnectionSpec() : ConnectionSpec
setLocale(locale : Locale)
getLocale() : Locale
getSupportedFunctions() : MiningFunction
getSupportedAlgorithms(function : MiningFunction) : MiningAlgorithm
supportsCapability(function : MiningFunction, algorithm : MiningAlgorithm, taskType : MiningTask) : boolean
supportsCapability(object : NamedObject, persistence : PersistenceOption) : boolean
getNamedObjects(persistenceOption : PersistenceOption) : NamedObject
getMaxNameLength() : int
getMaxDescriptionLength() : int
getDescription(objectName : String, objectType : NamedObject) : String
setDescription(objectName : String, objectType : NamedObject, description : String)
saveObject(name : String, object : MiningObject, replace : boolean)
removeObject(name : String, objectType : NamedObject)
renameObject(oldName : String, newName : String, objectType : NamedObject)
doesObjectExist(objectName : String, objectType : NamedObject) : boolean
retrieveObject(name : String, objectType : NamedObject) : MiningObject
retrieveObject(objectIdentifier : String) : MiningObject
retrieveObjects(createdAfter : Date, createdBefore : Date, objectType : NamedObject) : Collection
retrieveObjects(createdAfter : Date, createdBefore : Date, objectType : NamedObject, minorType : Enum) : Collection
getObjectNames(objectType : NamedObject) : Collection
getObjectNames(createdAfter : Date, createdBefore : Date, objectType : NamedObject) : Collection
getObjectNames(createdAfter : Date, createdBefore : Date, objectType : NamedObject, minorType : Enum) : Collection
getModelNames(function : MiningFunction, algorithm : MiningAlgorithm, createdAfter : Date, createdBefore : Date) : Collection
getCreationDate(objectName : String, objectType : NamedObject) : Date
retrieveModelObjects(function : MiningFunction, algorithm : MiningAlgorithm, createdAfter : Date, createdBefore : Date) : Collection
getLastExecutionHandle(taskName : String) : ExecutionHandle
getExecutionHandles(taskName : String) : ExecutionHandle
execute(taskName : String) : ExecutionHandle
execute(task : Task, timeout : Long) : ExecutionStatus
requestModelLoad(modelName : String)
requestModelUnload(modelName : String)
getLoadedModels() : String
requestDataLoad(dataURI : String)
requestDataUnload(dataURI : String)
getLoadedData() : String

FIGURE 4.11 Package javax.datamining.resource

4.7 Package javax.datamining.data


The data package contains several sets of objects each relating to different kinds of data
specification. These include:

physical data
logical data
model signature
taxonomy
category matrix
category set

caseIdRequired
multiAttributeCaseId

Attribute
getName() : String
getDescription() : String

PhysicalAttributeFactory
create(attrName : String, dataType : AttributeDataType) : PhysicalAttribute
create(attrNameArray : String, dataType : AttributeDataType) : PhysicalAttribute
create(attrName : String, dataType : AttributeDataType, role : PhysicalAttributeRole) : PhysicalAttribute
supportsCapability(capability : PhysicalAttributeCapability) : boolean

MiningObject (from JDMRoot)

PhysicalAttribute
setName(attributeName : String)
setDescription(description : String)
getDataType() : AttributeDataType
setDataType(dataType : AttributeDataType)
getRole() : PhysicalAttributeRole
setRole(role : PhysicalAttributeRole)

PhysicalDataSet
getAttributes() : Collection
getAttributeNames(dataType : AttributeDataType) : Collecti...
getAttributeNames(role : PhysicalAttributeRole) : Collection
getAttributeCount() : int
getAttribute(attributeName : String) : PhysicalAttribute
getAttributeIndex(attributeName : String) : Integer
getAttribute(index : int) : PhysicalAttribute
addAttribute(attribute : PhysicalAttribute)
addAttributes(attributeArray : PhysicalAttribute)
removeAttribute(name : String)
removeAllAttributes()
importMetaData()
getAttributeStatistics() : AttributeStatisticsSet
getURI() : String

physicalDataHasAttributes: +physicalData 1, +attribute 0..n

AttributeStatisticsSet (from Statistics)
physicalDataHasStatisticsSet: +physicalData 1, +statistics 0..1

PhysicalDataSetFactory
create(uri : String, importMetaData : boolean) : PhysicalDataSet
supportsCapability(capability : PhysicalDataSetCapability) : boolean

<<enumeration>>
PhysicalDataSetCapability
singleRecordCaseData
multiRecordCaseData

<<enumeration>>
AttributeDataType
integerType
doubleType
stringType
unknownType
<<enumeration>>
PhysicalAttributeRole
data
caseId
attributeName
attributeValue
taxonomyChildId
taxonomyParentId

PhysicalDataRecordFactory
create() : PhysicalDataRecord
create(signature : ModelSignature) : PhysicalDataRecord

PhysicalDataRecord
getValue(attributeName : String) : Object
setValue(attributeName : String, value : Object)
getAttributeNames() : Collection
getAttributeCount() : int
removeAttribute(attributeName : String)
resetValues()
removeAllAttributes()

FIGURE 4.12 Package javax.datamining.data - PhysicalData

LogicalData
getAttributes() : Collection
getAttribute(name : String) : LogicalAttribute
getAttributes(type : AttributeType) : Collection
addAttribute(attribute : LogicalAttribute)
removeAttribute(attributeName : String)
removeAllAttributes()

Attribute
getName() : String
getDescription() : Stri...

logicalDataHasAttributes: +logicalData 0..1, +attribute 1..n

LogicalAttribute
setName(attributeName : String)
setDescription(description : String)
getAttributeType() : AttributeType
setAttributeType(type : AttributeType)
getDataPreparationStatus() : DataPreparationStatus
setDataPreparationStatus(preparationStatus : DataPreparationStatus)
isDiscrete(isDiscrete : boolean)
isDiscrete() : boolean
setCategorySet(categorySet : CategorySet)
getCategorySet() : CategorySet

LogicalDataFactory
create() : LogicalData
create(physicalDataSet : PhysicalDataSet) : LogicalData
create(physicalDataSetName : String) : LogicalData

LogicalAttributeFactory
create(attrName : String, type : AttributeType) : LogicalAttribute
create(attrNameArray : String, type : AttributeType) : LogicalAttribute
supportsCapability(capability : LogicalAttributeCapability) : boolean

<<enumeration>>
DataPreparationStatus
unprepared
prepared

<<enumeration>>
AttributeType
categorical
ordinal
numerical
notSpecified

<<enumeration>>
LogicalAttributeCapability
discreteAttributes
boundedAttributes
ordinalAttributes
unpreparedAttributes
categorySetEnabled

FIGURE 4.13 Package javax.datamining.data - LogicalData

ModelSignature
getAttributes() : Collection
getAttribute(attributeName : String) : SignatureAttribute
getAttributesByRank(ordering : SortOrder) : Collection

modelSignatureHasAttribute: +modelSignature 1, +attribute 1..n

Attribute

SignatureAttribute
getAttributeType() : AttributeType
getDataType() : AttributeDataType
getRank() : int
getImportanceValue() : double

<<enumeration>>
AttributeType
categorical
ordinal
numerical
notSpecified

FIGURE 4.14 Package javax.datamining.data - ModelSignature

TaxonomyFactory
createTable(taxonomyName : String, physicalDataName : String) : TaxonomyTable
createObject() : TaxonomyObject
supportsCapability(capability : TaxonomyCapability) : boolean

Taxonomy
getChildren(parent : Object) : Collection
getParents(child : Object) : Collection
getRoots() : Collection
getLeaves() : Collection

TaxonomyTable
getPhysicalDataName() : String

<<enumeration>>
TaxonomyCapability
tableTaxonomy
objectTaxonomy

TaxonomyObject
addChildren(parent : Object, childArray : Object)
removeDescendants(parent : Object)
removeRelationship(parent : Object, childArray : Object)

FIGURE 4.15 Package javax.datamining.data - Taxonomy

CategoryMatrix
getCategories() : Collection
getValue(rowCategoryValue : Object, columnCategoryValue : Object) : Double
getCategorySet() : CategorySet

FIGURE 4.16 Package javax.datamining.data - CategoryMatrix

CategorySetFactory
create(dataType : AttributeDataType) : CategorySet
create(categorySet : CategorySet) : CategorySet

CategorySet
addCategory(categoryValue : Object, property : CategoryProperty) : int
insertCategory(categoryValue : Object, property : CategoryProperty, beforeIn...
removeCategory(index : int)
getSize() : int
getDataType() : AttributeDataType
getIndex(categoryValue : Object) : Integer
getValue(index : int) : Object
getValues() : Object
getValues(property : CategoryProperty) : Object
getName(index : int) : String
getProperty(index : int) : CategoryProperty
getDefaultProperty() : CategoryProperty
setDefaultProperty(property : CategoryProperty)

Interval
getIntervalClosure() : IntervalClosure
getStartPoint() : double
getEndPoint() : double

<<enumeration>>
IntervalClosure
closedClosed
closedOpen
openClosed
openOpen

<<enumeration>>
CategoryProperty
valid
error
unknown
missing

FIGURE 4.17 Package javax.datamining.data - CategorySet and Interval
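
The four IntervalClosure values determine whether each end point belongs to the interval. The following sketch shows a membership test under the assumption that the first word of each value describes the start point and the second word the end point (so closedOpen includes the start point and excludes the end point). The contains helper is hypothetical and is not part of the Interval interface shown in Figure 4.17.

```java
class IntervalSketch {
    /** Values copied from the IntervalClosure enumeration in Figure 4.17. */
    enum IntervalClosure { closedClosed, closedOpen, openClosed, openOpen }

    /** Hypothetical helper: tests whether x lies in the interval from start to end. */
    static boolean contains(double start, double end, IntervalClosure closure, double x) {
        // Assumption: first word refers to the start point, second word to the end point.
        boolean startClosed = closure == IntervalClosure.closedClosed
                || closure == IntervalClosure.closedOpen;
        boolean endClosed = closure == IntervalClosure.closedClosed
                || closure == IntervalClosure.openClosed;
        boolean aboveStart = startClosed ? x >= start : x > start;
        boolean belowEnd = endClosed ? x <= end : x < end;
        return aboveStart && belowEnd;
    }
}
```

Such a test could be used, for example, to decide whether an attribute value falls outside the bounds supplied to setOutlierIdentification; how an implementation actually applies the bounds is not specified here.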

4.8 Package javax.datamining.task


The task package defines the build, apply, import, and export tasks. The test task is specific to supervised functions and is included in the supervised package.

Task
getExecutionHandle() : ExecutionHandle
verify() : VerificationReport

BuildTask
getModelName() : String
setModelName(name : String)
getBuildDataName() : String
setBuildDataName(name : String)
getBuildSettingsName() : String
setBuildSettingsName(name : String)
getInputModelName() : String
setInputModelName(modelName : String)
getValidationDataName() : String
setValidationDataName(validationDataName : String)
getApplicationName() : String
setApplicationName(name : String)
getModelDescription() : String
setModelDescription(description : String)
getBuildDataMap() : Map
setBuildDataMap(buildDataMap : Map)
getValidationDataMap() : Map
setValidationDataMap(validationDataMap : Map)

BuildTaskFactory
create(buildData : String, buildSettingsName : String, modelName : String) : BuildTask
supportsCapability(function : MiningFunction, algorithm : MiningAlgorithm, capability : BuildTaskCapability) : boolean

<<enumeration>>
BuildTaskCapability
inputModel
validationData
dataMapping

FIGURE 4.18 Package javax.datamining.task - Build

Task
getExecutionHandle() : ExecutionHandle
verify() : VerificationReport

ImportSummary
getObjectCount() : int
getObjectNames() : String
getObjectTypes() : NamedObject
getObjectClassNames() : String
getObjectDescriptions() : String
getCreationDates() : Date
getFormat() : ImportExportFormat

importTaskHasSummary: +importTask 1, +summary 0..1

ExportTask
addObjectName(name : String, namedObjectType : NamedO...
removeObjectName(name : String, namedObjectType : Nam...
getURI() : String
setURI(uri : String)
getFormat() : ImportExportFormat
setFormat(format : ImportExportFormat)
getObjectNames() : String
setIncludeModelSettings(option : SettingsInclusionOption)
getIncludeModelSettings() : SettingsInclusionOption

ExportTaskFactory
create() : ExportTask
supportsCapability(objectType : NamedObject, exportFormat : ImportExportForm...

ImportTask
getURI() : String
setURI(uri : String)
includeModelSettings() : boolean
includeModelSettings(includeModelSettings : boolean)
useOriginalCreationDates(useOriginalCreationDates : boolean)
useOriginalCreationDates() : boolean
populateSummary()
getSummary() : ImportSummary
getObjectNamesMap() : Map
setObjectNamesMap(map : Map)

ImportTaskFactory
create() : ImportTask
create(uri : String, populateSummary : boolean) : ImportTask
supportsCapability(objectType : NamedObject, exportFormat : ImportExportForm...

<<enumeration>>
SettingsInclusionOption
systemDefault
none
settings
effectiveSettings
settingsOnly
effectiveSettingsOnly
...

<<enumeration>>
ImportExportFormat
systemDefault
PMML1_0
PMML2_0
PMML2_1
PMML3_0
CWM1_0
CWM1_1
JDM1_0
JDM1_1

FIGURE 4.19 Package javax.datamining.task - Import and Export

Task

ComputeStatisticsTask
getPhysicalDataName() : String
setPhysicalDataName(name : String)
getLogicalDataName() : String
setLogicalDataName(logicalDataName : String)

ComputeStatisticsTaskFactory
create(physicalDataName : String) : ComputeStatisticsTask
supportsCapability(capability : ComputeStatisticsTaskCapability) : boolean

<<enumeration>>
ComputeStatisticsTaskCapability
logicalData

FIGURE 4.20 Package javax.datamining.task - ComputeStatistics

4.8.1 Package task.apply


The apply subpackage supports those functions that allow models to be applied to data,
i.e., scoring. This package also enables users to specify the output desired from the apply
task.

Task

ApplyTask
getModelName() : String
setModelName(modelName : String)
getApplySettingsName() : String
setApplySettingsName(applySettingsName : String)
getApplyDataMap() : Map
setApplyDataMap(applyDataMap : Map)

DataSetApplyTask
getApplyOutputDestination() : String
setApplyOutputDestination(applyOutputDestinationURI : String)
getApplyDataName() : String
setApplyDataName(applyDataName : String)

RecordApplyTask
getInputRecord() : PhysicalDataRecord
setInputRecord(record : PhysicalDataRecord)
getOutputRecord() : PhysicalDataRecord

RecordApplyTaskFactory
create(applyRecord : PhysicalDataRecord, modelName : String, applySettingsName : String) : RecordApplyTask

DataSetApplyTaskFactory
create(applyDataName : String, modelName : String, applySettingsName : String, applyOutputDestinationURI : String) : DataSetApplyTask

ApplySettings
getSourceDestinationMap() : Map
setSourceDestinationMap(sourceDestinationMap : Map)
resetMapping()
verify() : VerificationReport

FIGURE 4.21 Package task.apply - ApplyTask and ApplySettings

4.9 Package javax.datamining.supervised


The supervised package consists of two sub-packages, classification and regression. It
also contains the definitions they share: common settings, the test task, and test metrics.

AlgorithmSettings

BuildSettings

SupervisedAlgorithmSettings
SupervisedSettings
getTargetAttributeName() : String
setTargetAttributeName(attributeName : String)

Model

SupervisedModel
getTargetAttributeName() : String

FIGURE 4.22 Package javax.datamining.supervised - Settings and Model

MiningObject
(from JDMRoot)

Task

TestTask
getTestDataName() : String
setTestDataName(testDataName : String)
getModelName() : String
setModelName(modelName : String)
getTestMetricsName() : String
setTestMetricsName(testMetricsName : String)
getTestDataMap() : Map
setTestDataMap(testDataMap : Map)
verify() : VerificationReport

TestMetrics
getTaskIdentifier() : Integer
getModelName() : String
getTestDataName() : String

TestMetricsTask
getApplyOutputDataName() : String
setApplyOutputDataName(applyOutputData : String)
getActualTargetAttrName() : String
setActualTargetAttrName(actualTargetAttrName : String)
getPredictedTargetAttrName() : String
setPredictedTargetAttrName(predictedTargetAttrName : String)
getPredictionRankingAttrName() : String
setPredictionRankingAttrName(predictionRankingAttrName : String)
getTestMetricsName() : String
setTestMetricsName(testMetricsName : String)
verify() : VerificationReport

FIGURE 4.23 Package javax.datamining.supervised - TestTask and TestMetrics

4.9.1 Package supervised.classification


The classification package describes the settings, model, test task, and test metrics for the
classification function.

SupervisedSettings

ClassificationSettings
getCostMatrixName() : String
setCostMatrixName(costMatrixName : String)
getPriorProbabilitiesMap(attributeName : String) : Map
setPriorProbabilitiesMap(attributeName : String, priorsMap : Map)
usePriors(usePriors : boolean)
getUsePriors() : boolean

SupervisedModel

ClassificationModel
getClassificationError() : double
getTargetCategorySet() : CategorySet
wasCostMatrixUsed() : boolean

ClassificationSettingsFactory
create() : ClassificationSettings
supportsCapability(capability : ClassificationCapability) : boolean
supportsCapability(algorithm : MiningAlgorithm, capability : ClassificationCapability) : boolean

<<enumeration>>
ClassificationCapability
costMatrix
priorProbability
weightedAttributes
ordinalAttributes
automatedDataPreparation
supplementaryAttributes
weightAttribute
classificationError
outlierTreatment
logicalAttributeUsage
logicalData

FIGURE 4.24 Package supervised.classification - Settings and Model

TestTask

ClassificationTestTask
setNumberOfLiftQuantiles(numberOfQuantiles : int)
computeMetric(testMetric : ClassificationTestMetricOption, flag : boolean)
computeMetric(testMetric : ClassificationTestMetricOption) : boolean
getNumberOfLiftQuantiles() : int
getPositiveTargetValue() : Object
setPositiveTargetValue(positiveTargetValue : Object)
setCostMatrixName(costMatrixName : String)
getCostMatrixName() : String
getTestMetricsDescription() : String
setTestMetricsDescription(description : String)

ClassificationTestTaskFactory
create(testDataName : String, modelName : String, testResultName : String) : ClassificationTestTask
supportsCapability(metricOption : ClassificationTestMetricOption) : boolean
supportsCapability(capability : TestTaskCapability) : boolean

<<enumeration>>
TestTaskCapability
dataMapping

<<enumeration>>
ClassificationTestMetricOption
confusionMatrix
lift
receiverOperatingCharacteristics

TestMetrics
(from Supervised)

ClassificationTestMetrics
getAccuracy() : Double
getConfusionMatrix() : ConfusionMatrix
getLift() : Lift
getROC() : ReceiverOperatingCharacterics

ReceiverOperatingCharacterics
getAreaUnderCurve() : double
getNumberOfThresholdCandidates() : int
getProbabilityThreshold(index : int) : double
getPositives(index : int, trueFalse : boolean) : long
getNegatives(index : int, trueFalse : boolean) : long
getHitRate(index : int) : double
getFalseAlarmRate(index : int) : double

Lift
getLift(lowerIndex : int, upperIndex : int) : Double
getCumulativeLift(index : int) : Double
getCases(lowerIndex : int, upperIndex : int) : long
getCumulativeCases(index : int) : long
getNumberOfPositiveCases(lowerIndex : int, upperIndex : int) : long
getCumulativePositiveCases(index : int) : long
getNumberOfNegativeCases(lowerIndex : int, upperIndex : int) : long
getCumulativeNegativeCases(index : int) : long
getPercentageSize(lowerIndex : int, upperIndex : int) : Double
getCumulativePercentageSize(index : int) : Double
getTargetDensity(lowerIndex : int, upperIndex : int) : Double
getCumulativeTargetDensity(index : int) : Double
getNumberOfQuantiles() : int
getTotalCases() : long
getTotalPositiveCases() : long
getTargetAttributeName() : String
getPositiveTargetValue() : Object

FIGURE 4.25 Package supervised.classification - TestTask and TestMetrics
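
The Lift interface in Figure 4.25 exposes lift per quantile without stating how a vendor derives it. The following is a minimal sketch of one conventional calculation, not the JDM implementation: cases are ranked by predicted probability of the positive target value, and cumulative lift is the positive rate in the top quantiles divided by the overall positive rate. The class name LiftSketch, the zero-based quantile index, and the rounding of the quantile boundary are assumptions made for illustration.

```java
import java.util.Arrays;
import java.util.Comparator;

final class LiftSketch {
    /**
     * Cumulative lift for quantiles 0..index (zero-based, an assumption of this sketch).
     * scores[i] is the predicted probability of the positive value; actual[i] is the true label.
     */
    static double cumulativeLift(double[] scores, boolean[] actual, int quantiles, int index) {
        Integer[] order = new Integer[scores.length];
        for (int i = 0; i < order.length; i++) order[i] = i;
        // Rank cases from most to least likely positive.
        Arrays.sort(order, Comparator.comparingDouble((Integer i) -> -scores[i]));

        // Number of top-ranked cases covered by the requested quantiles.
        int cutoff = (int) Math.round((double) scores.length * (index + 1) / quantiles);
        int positivesInTop = 0;
        int totalPositives = 0;
        for (int rank = 0; rank < order.length; rank++) {
            if (actual[order[rank]]) {
                totalPositives++;
                if (rank < cutoff) positivesInTop++;
            }
        }
        double topRate = (double) positivesInTop / cutoff;
        double overallRate = (double) totalPositives / scores.length;
        return topRate / overallRate;
    }

    public static void main(String[] args) {
        double[] scores = {0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1};
        boolean[] actual = {true, true, false, true, false, false, true, false};
        System.out.println(cumulativeLift(scores, actual, 4, 0)); // top quarter: 2.0
        System.out.println(cumulativeLift(scores, actual, 4, 3)); // all cases: 1.0
    }
}
```

A cumulative lift of 1.0 over all quantiles is expected by construction, since the top-ranked set then equals the whole test set.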

TestMetricsTask
(from Supervised)

ClassificationTestMetricsTask
getNumberOfLiftQuantiles() : int
setNumberOfLiftQuantiles(numberOfQuantiles : int)
getPositiveTargetValue() : Object
setPositiveTargetValue(positiveTargetValue : Object)
getCostMatrixName() : String
setCostMatrixName(costMatrixName : String)
computeMetrics(testMetric : ClassificationTestMetricOption, flag : boolean)
computeMetric(testMetric : ClassificationTestMetricOption) : boolean

<<enumeration>>
ClassificationTestMetricOption
confusionMatrix
lift
receiverOperatingCharacteristics

ClassificationTestMetricsTaskFactory
create(applyOutputData : String, actualTargetAttrName : String, predictedTargetAttrName : String, testMetricsName : String) : ClassificationTestMetricsTask
supportsCapability(metricOption : ClassificationTestMetricOption) : boolean

FIGURE 4.26 Package supervised.classification - ClassificationTestMetricsTask

<<enumeration>>
ClassificationApplyContent
predictedCategory
probability
cost
nodeId

ApplySettings

ClassificationApplySettings
mapByRank(content : ClassificationApplyContent, destPhysAttrNameArray : String, fromTop : boolean)
mapByCategory(content : ClassificationApplyContent, categoryValue : Object, destinationAttrName : String)
mapTopPrediction(content : ClassificationApplyContent, destPhysAttrName : String)
mapPredictions(content : ClassificationApplyContent, baseDestPhysAttrName : String)
getRank(destinationAttrName : String) : Integer
getRanks() : Integer
isFromTop() : boolean
getMappedCategories() : Object
getMappedDestinationAttrName(categoryValue : Object, contentType : ClassificationApplyContent) : String
getMappedDestinationAttrNames(content : ClassificationApplyContent) : String
getContent(destinationAttrName : String) : ClassificationApplyContent
getContentsByRank(rank : int) : ClassificationApplyContent
getContentsByCategory(categoryValue : Object) : ClassificationApplyContent
setCostMatrixName(costMatrixName : String)
getCostMatrixName() : String
getMappedContents() : ClassificationApplyContent
getMappedBaseDestinationAttributeName(content : ClassificationApplyContent) : String

ClassificationApplySettingsFactory
create() : ClassificationApplySettings
getDefaultApplySettings() : ClassificationApplySettings
supportsCapability(algorithm : MiningAlgorithm, content : ClassificationApplyContent) : boolean
supportsCapability(algorithm : MiningAlgorithm, capability : ClassificationApplyCapability) : boolean
getSupportedApplyContents(algorithm : MiningAlgorithm) : ClassificationApplyContent

<<enumeration>>
ClassificationApplyCapability
topSequentialRanks
bottomSequentialRanks
individualCategories
allPredictions
topPrediction
costMatrix

FIGURE 4.27 Package supervised.classification - ApplySettings

CategoryMatrix
getCategories() : Collection
getValue(rowCategoryValue : Object, columnCategoryValue : Object) : Double
getCategorySet() : CategorySet

ConfusionMatrix
getAccuracy() : double
getError() : double
getNumberOfPredictions(actualCategoryValue : Object, predictedCategoryValue : Object) : long

CostMatrix
setCellValue(actualTarget : Object, predictedTarget : Object, cost : double)
getCellValue(actualTarget : Object, predictedTarget : Object) : double

CostMatrixFactory
create(categorySet : CategorySet) : CostMatrix

FIGURE 4.28 Package supervised.classification - Confusion Matrix and Cost Matrix
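
The relationship between the confusion matrix and the cost matrix in Figure 4.28 can be illustrated with plain arrays. This is a sketch under stated assumptions rather than the JDM API: the class ConfusionSketch and the use of integer category indices (row = actual, column = predicted) are hypothetical stand-ins for the Object-keyed matrices above. Accuracy is taken as the fraction of cases on the diagonal, and total cost weights each cell count by the corresponding cost cell.

```java
final class ConfusionSketch {
    /** Fraction of test cases on the diagonal; counts[actual][predicted] holds case counts. */
    static double accuracy(long[][] counts) {
        long correct = 0;
        long total = 0;
        for (int a = 0; a < counts.length; a++) {
            for (int p = 0; p < counts[a].length; p++) {
                total += counts[a][p];
                if (a == p) correct += counts[a][p];
            }
        }
        return (double) correct / total;
    }

    /** Sum of cell counts weighted by cost[actual][predicted]. */
    static double totalCost(long[][] counts, double[][] cost) {
        double sum = 0.0;
        for (int a = 0; a < counts.length; a++) {
            for (int p = 0; p < counts[a].length; p++) {
                sum += counts[a][p] * cost[a][p];
            }
        }
        return sum;
    }

    public static void main(String[] args) {
        long[][] counts = {{40, 10}, {5, 45}};      // rows: actual, columns: predicted
        double[][] cost = {{0.0, 1.0}, {5.0, 0.0}}; // correct predictions cost nothing
        System.out.println(accuracy(counts));        // 85 of 100 cases correct: 0.85
        System.out.println(totalCost(counts, cost)); // 10 * 1.0 + 5 * 5.0 = 35.0
    }
}
```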


4.9.2 Package supervised.regression


This package describes the settings, model, test task, and test metrics for the regression
function.

SupervisedSettings

RegressionSettingsFactory
create() : RegressionSettings
supportsCapability(capability : RegressionCapability) : boolean
supportsCapability(algorithm : MiningAlgorithm, capability : RegressionCapability) : boolean

RegressionSettings

<<enumeration>>
RegressionCapability
weightedAttributes
automatedDataPreparation
supplementaryAttributes
weightAttribute
outlierTreatment
logicalAttributeUsage
logicalData

SupervisedModel

RegressionModel
getRSquared() : double

FIGURE 4.29 Package supervised.regression - Settings and Model

TestMetrics
(from Supervised)

RegressionTestMetrics
getMeanPredictedValue() : Double
getMeanActualValue() : Double
getMeanAbsoluteError() : Double
getRMSError() : Double
getRSquared() : Double

TestTask

RegressionTestTask

<<enumeration>>
TestTaskCapability
dataMapping

RegressionTestTaskFactory
create(inputDataName : String, modelName : String, testMetricsName : String) : RegressionTestTask
supportsCapability(capability : TestTaskCapability) : boolean

ApplySettings
(from Apply)

RegressionApplySettings
map(content : RegressionApplyContent, destPhysAttrName : String)
getContent(destinationAttrName : String) : RegressionApplyContent
getContents() : RegressionApplyContent
getMappedDestinationAttributeName(content : RegressionApplyContent) : String

<<enumeration>>
RegressionApplyContent
predictedValue
confidence

RegressionApplySettingsFactory
create() : RegressionApplySettings
getDefaultApplySettings() : RegressionApplySettings
supportsCapability(algorithm : MiningAlgorithm, content : RegressionApplyContent) : boolean
getSupportedApplyContents(algorithm : MiningAlgorithm) : RegressionApplyContent

FIGURE 4.30 Package supervised.regression - TestTask, and ApplySettings
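
The RegressionTestMetrics accessors in Figure 4.30 name their quantities but do not give formulas. The sketch below uses the conventional definitions (mean absolute error, root mean squared error, and R-squared as one minus the ratio of residual to total sum of squares); whether a given vendor computes them exactly this way is an assumption. The class RegressionMetricsSketch is hypothetical and not part of JDM.

```java
final class RegressionMetricsSketch {
    static double mean(double[] v) {
        double sum = 0.0;
        for (double x : v) sum += x;
        return sum / v.length;
    }

    static double meanAbsoluteError(double[] actual, double[] predicted) {
        double sum = 0.0;
        for (int i = 0; i < actual.length; i++) sum += Math.abs(actual[i] - predicted[i]);
        return sum / actual.length;
    }

    static double rmsError(double[] actual, double[] predicted) {
        double sum = 0.0;
        for (int i = 0; i < actual.length; i++) {
            double e = actual[i] - predicted[i];
            sum += e * e;
        }
        return Math.sqrt(sum / actual.length);
    }

    /** 1 - SSres / SStot, where SStot is measured around the mean actual value. */
    static double rSquared(double[] actual, double[] predicted) {
        double meanActual = mean(actual);
        double ssRes = 0.0;
        double ssTot = 0.0;
        for (int i = 0; i < actual.length; i++) {
            ssRes += (actual[i] - predicted[i]) * (actual[i] - predicted[i]);
            ssTot += (actual[i] - meanActual) * (actual[i] - meanActual);
        }
        return 1.0 - ssRes / ssTot;
    }

    public static void main(String[] args) {
        double[] actual = {1, 2, 3, 4};
        double[] predicted = {2, 3, 4, 5}; // every prediction is off by exactly one
        System.out.println(meanAbsoluteError(actual, predicted)); // 1.0
        System.out.println(rmsError(actual, predicted));          // 1.0
        System.out.println(rSquared(actual, predicted));          // approximately 0.2
    }
}
```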

TestMetricsTask
(from Supervised)

RegressionTestMetricsTask

RegressionTestMetricsTaskFactory
create(applyOutputData : String, actualTargetAttrName : String, predictedTargetAttrName : String, testMetricsName : String) : RegressionTestMetricsTask

FIGURE 4.31 Package supervised.regression - RegressionTestMetricsTask

4.9.3 Package attributeimportance


The attribute importance package contains the interfaces for defining build settings and
the interface for extracting content from an attribute importance model. Note that neither
the apply nor test tasks are applicable for attribute importance.

Model

AttributeImportanceModel
getAttributesByRank(ordering : SortOrder) : Collection
getAttributesByRank(lowerRank : int, upperRank : int) : Collection
getAttributesByPercentage(percent : double, ordering : SortOrder) : Collection
getAttributeCount() : int
getMaxRank() : int

AlgorithmSettings

AttributeImportanceAlgorithmSettings

BuildSettings

AttributeImportanceSettingsFactory
create() : AttributeImportanceSettings
supportsCapability(capability : AttributeImportanceCapability) : boolean
supportsCapability(algorithm : MiningAlgorithm, capability : AttributeImportanceCapability) : boolean

AttributeImportanceSettings
isSupervised() : boolean
setTargetAttributeName(targetAttrName : String)
getTargetAttributeName() : String
getMaxAttributeCount() : int
setMaxAttributeCount(maxCount : int)

<<enumeration>>
AttributeImportanceCapability
weightedAttributes
maximumResultSize
supervised
unsupervised
supplementaryAttributes
weightAttribute
outlierTreatment
logicalAttributeUsage
logicalData

FIGURE 4.32 Package javax.datamining.attributeimportance - Settings and Model
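
To illustrate what a rank-based accessor such as getAttributesByRank(lowerRank, upperRank) returns, the sketch below ranks attributes by a numeric importance score. It is not the JDM API: the class AttributeRankingSketch, the score map, and the convention that rank 1 is the most important attribute are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class AttributeRankingSketch {
    /** Attribute names with ranks in [lowerRank, upperRank]; rank 1 has the highest score. */
    static List<String> attributesByRank(Map<String, Double> importance, int lowerRank, int upperRank) {
        List<Map.Entry<String, Double>> entries = new ArrayList<>(importance.entrySet());
        // Sort by descending importance so that list position + 1 equals the rank.
        entries.sort((a, b) -> Double.compare(b.getValue(), a.getValue()));
        List<String> result = new ArrayList<>();
        for (int rank = Math.max(1, lowerRank); rank <= upperRank && rank <= entries.size(); rank++) {
            result.add(entries.get(rank - 1).getKey());
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, Double> importance = new LinkedHashMap<>();
        importance.put("age", 0.4);
        importance.put("income", 0.9);
        importance.put("zip", 0.1);
        System.out.println(attributesByRank(importance, 1, 2)); // [income, age]
    }
}
```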

4.10 Package javax.datamining.association


The association package contains the interfaces for the build settings and the model required by the association function. It also specifies interfaces for selecting rules from an association rules model.

Model
AlgorithmSettings

AssociationModel
getRules() : Collection
getRules(filter : RulesFilter) : Collection
getItems() : Collection
getItemsets() : Collection
getItemsets(itemsetSize : int) : Collection
getMaxTransactionSize() : int
getAverageTransactionSize() : Double
getNumberOfTransactions() : long
getNumberOfItems() : int
getNumberOfItemsets() : int
getMinAbsoluteSupport() : int
getMaxAbsoluteSupport() : int
getMinConfidence() : Double
getMaxConfidence() : Double
getMaxRuleLength() : int

AssociationRulesAlgorithmSettings

AssociationRulesAlgorithmSettingsFactory
create() : AssociationRulesAlgorithmSettings

AssociationSettingsFactory
create() : AssociationSettings
supportsCapability(capability : AssociationCapability) : boolean
supportsCapability(algorithm : MiningAlgorithm, capability : AssociationCapability) : boolean

BuildSettings

AssociationSettings
getMinSupport() : Double
setMinSupport(minSupport : double)
getMinConfidence() : Double
setMinConfidence(minConfidence : double)
getMaxRuleLength() : int
setMaxRuleLength(maxRuleLength : int)
getMaxRuleComponentLength(isAntecedent : boolean) : int
setMaxRuleComponentLength(maxLength : int, isAntecedent : boolean)
getMaxNumberOfRules() : int
setMaxNumberOfRules(maxRules : int)
getItems(included : boolean) : Object
addItem(item : Object, included : boolean)
addItems(itemArray : Object, included : boolean)
removeItem(item : Object, included : boolean)
removeItems(itemArray : Object, included : boolean)
setTaxonomyName(attributeName : String, taxonomyName : String)
getTaxonomyName(attributeName : String) : String

<<enumeration>>
AssociationCapability
minimumSupport
minimumConfidence
maximumRuleLength
maximumNumberOfRules
excludedItems
includedItems
antecedentLength
consequentLength
taxonomy
automatedDataPreparation
supplementaryAttributes
outlierTreatment
logicalAttributeUsage
logicalData

FIGURE 4.33 Package javax.datamining.association - Settings and Model

AssociationRule
getRuleIdentifier() : int
getAntecedent() : Itemset
getConsequent() : Itemset
getSupport() : double
getAbsoluteSupport() : int
getConfidence() : double
getLift() : double
getLength() : int

Itemset
getItems() : Object
getSupport() : double
getAbsoluteSupport() : int
getSize() : int

Associations (as shown in the figure):
associationRefAntecedent: +association 0..n, +antecedent 1
associationRefConsequent: +association 0..n, +consequent 1

RulesFilter
setRange(type : RuleProperty, minValue : double, maxValue : double)
getMaxValue(type : RuleProperty) : Double
getMinValue(type : RuleProperty) : Double
setThreshold(property : RuleProperty, compOp : ComparisonOperator, thresholdValue : double)
getThresholdValue(property : RuleProperty) : Double
getThresholdOperator(property : RuleProperty) : ComparisonOperator
getItems(componentOption : RuleComponentOption, included : boolean) : Object
setItems(itemArray : Object, componentOption : RuleComponentOption, included : boolean)
setOrderingCondition(orderByArray : RuleProperty, sortOrderArray : SortOrder)
getOrderingConditions() : RuleProperty
getOrderingCondition(orderBy : RuleProperty) : SortOrder
setMaxNumberOfRules(maxRules : int)
getMaxNumberOfRules() : int

<<enumeration>>
RuleProperty
support
confidence
length
lift

<<enumeration>>
RuleComponentOption
systemDefault
antecedent
consequent
antecedentOrConsequent

RulesFilterFactory
create() : RulesFilter
supportsCapability(property : RuleProperty) : boolean
supportsCapability(property : RuleProperty, compOp : ComparisonOperator) : boolean

FIGURE 4.34 Package javax.datamining.association - Rule Selection
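
The rule properties support, confidence, and lift are used above both as AssociationRule accessors and as RuleProperty filter keys. The sketch below computes them from a small transaction list using the conventional definitions; the figures themselves do not state the formulas, so treating these as the definitions a vendor uses is an assumption. The class RuleMetricsSketch and its helper items(...) are hypothetical.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

final class RuleMetricsSketch {
    /** Convenience constructor for an itemset (hypothetical helper). */
    static Set<String> items(String... names) {
        return new HashSet<>(Arrays.asList(names));
    }

    /** Number of transactions containing every item of the itemset. */
    static long absoluteSupport(List<Set<String>> transactions, Set<String> itemset) {
        return transactions.stream().filter(t -> t.containsAll(itemset)).count();
    }

    /** Fraction of transactions containing both antecedent and consequent. */
    static double support(List<Set<String>> tx, Set<String> antecedent, Set<String> consequent) {
        Set<String> union = new HashSet<>(antecedent);
        union.addAll(consequent);
        return (double) absoluteSupport(tx, union) / tx.size();
    }

    /** Of the transactions containing the antecedent, the fraction that also contain the consequent. */
    static double confidence(List<Set<String>> tx, Set<String> antecedent, Set<String> consequent) {
        Set<String> union = new HashSet<>(antecedent);
        union.addAll(consequent);
        return (double) absoluteSupport(tx, union) / absoluteSupport(tx, antecedent);
    }

    /** Confidence relative to how often the consequent occurs overall. */
    static double lift(List<Set<String>> tx, Set<String> antecedent, Set<String> consequent) {
        double consequentSupport = (double) absoluteSupport(tx, consequent) / tx.size();
        return confidence(tx, antecedent, consequent) / consequentSupport;
    }

    public static void main(String[] args) {
        List<Set<String>> tx = Arrays.asList(
                items("bread", "milk"), items("bread", "butter"),
                items("bread", "milk", "butter"), items("milk"));
        Set<String> a = items("bread");
        Set<String> c = items("milk");
        System.out.println(support(tx, a, c));    // 2 of 4 transactions: 0.5
        System.out.println(confidence(tx, a, c)); // 2 of the 3 bread transactions
        System.out.println(lift(tx, a, c));       // below 1: bread slightly lowers the chance of milk
    }
}
```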

4.11 Package javax.datamining.clustering


The clustering package contains interfaces supporting the settings, model, apply settings, and similarity matrix for the clustering function. All clustering models share a common representation based on the cluster object. The signature attribute object is further
specialized for clustering.

<<enumeration>>
ClusteringModelProperty
centroid
hierarchy
statistics
similarityScale
clusterSimilarity
splitPredicate
rules

Model

SignatureAttribute

ClusteringSignatureAttribute
getComparisonFunction() : AttributeComparisonFunction
getSimilarityScale() : double
getSimilarityMatrix() : SimilarityMatrix

ClusteringModel
getCluster(identifier : int) : Cluster
getNumberOfClusters() : int
getNumberOfLevels() : int
getRootClusters() : Collection
getClusters() : Collection
getLeafClusters() : Collection
getRules() : Collection
getSimilarity(clusterIdentifier1 : int, clusterIdentifier2 : int) : Double
hasProperty(property : ClusteringModelProperty) : boolean

<<enumeration>>
AttributeComparisonFunction
systemDefault
systemDetermined
absDiff
gaussSim
delta
equal
similarityMatrix

Rule

AttributeStatisticsSet

Cluster
getClusterId() : int
getName() : String
getParent() : Cluster
getAncestors() : Cluster
getLevel() : int
getCaseCount() : long
getSupport() : double
getCentroidCoordinate(numericalAttributeName : String) : Double
getCentroidCoordinate(categoricalAttributeName : String, category : Object) : Doub...
getSplitPredicate() : Predicate
getChildren() : Cluster
getStatistics() : AttributeStatisticsSet
isLeaf() : boolean
isRoot() : boolean
getRule() : Rule

Associations (role names as shown in the figure):
clusteringModelHasClusters: +clustering, +clusters
clusteringModelRefRoot: +clustering, +root
clusteringModelHasRules: +clustering, +rule
clusterHasStatisticsSet: +cluster, +statistics
clusterRefChildren: +cluster, +children
clusterRefSplitPredicate: +cluster, +splitPredicate
Multiplicity labels in the figure: 0..n, 1..n, 0..1, 1, 0..n, 0..n, 0..1.

Predicate

FIGURE 4.35 Package javax.datamining.clustering - Model

<<enumeration>>
AttributeComparisonFunction
systemDefault
systemDetermined
absDiff
gaussSim
delta
equal
similarityMatrix

BuildSettings

ClusteringSettings
getMaxNumberOfClusters() : int
setMaxNumberOfClusters(maxClusters : int)
getMinClusterCaseCount() : long
setMinClusterCaseCount(minCaseCount : long)
getMaxClusterCaseCount() : long
setMaxClusterCaseCount(maxCount : long)
getAggregationFunction() : AggregationFunction
setAggregationFunction(function : AggregationFunction)
getAttributeComparisonFunction(logicalAttributeName : String) : AttributeComparisonFunction
setAttributeComparisonFunction(logicalAttributeName : String, function : AttributeComparisonFunction)
getMaxLevels() : int
setMaxLevels(numberOfLevels : int)
getSimilarityMatrix(logicalAttributeName : String) : SimilarityMatrix
setSimilarityMatrix(logicalAttributeName : String, matrix : SimilarityMatrix)

<<enumeration>>
AggregationFunction
systemDefault
systemDetermined
euclidean
squaredEuclidean
chebychev
cityBlock
minkowski
simpleMatching
jaccard
tanimoto
binarySimilarity

ClusteringSettingsFactory
create() : ClusteringSettings
supportsCapability(capability : ClusteringCapability) : boolean
supportsCapability(aggregationFunction : AggregationFunction) : boolean
supportsCapability(comparisonFunction : AttributeComparisonFunction) : boolean
supportsCapability(algorithm : MiningAlgorithm, capability : ClusteringCapability) : boolean
supportsCapability(aggregationFunction : AggregationFunction, comparisonFunction : AttributeComparisonFunction) : boolean

<<enumeration>>
ClusteringCapability
minClusterCaseCount
maxClusterCaseCount
maxNumberOfClusters
weightedAttributes
automatedDataPreparation
supplementaryAttributes
weightAttribute
hierarchicalClusters
outlierTreatment
logicalAttributeUsage
logicalData

AlgorithmSettings

ClusteringAlgorithmSettings

FIGURE 4.36 Package javax.datamining.clustering - Settings
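
ClusteringSettings separates the per-attribute comparison function from the aggregation function that combines per-attribute results into one distance. The sketch below illustrates that split for four of the AggregationFunction values, using absolute difference as the per-attribute comparison. The formulas are the usual textbook ones and are assumed rather than taken from this specification; the class AggregationSketch and its nested enum are hypothetical.

```java
final class AggregationSketch {
    enum Aggregation { EUCLIDEAN, SQUARED_EUCLIDEAN, CHEBYCHEV, CITY_BLOCK }

    /** Distance between two numerical cases of equal length under the chosen aggregation. */
    static double distance(double[] a, double[] b, Aggregation f) {
        double sum = 0.0;
        double max = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = Math.abs(a[i] - b[i]); // per-attribute comparison (absolute difference)
            switch (f) {
                case CITY_BLOCK:
                    sum += d;
                    break;
                case CHEBYCHEV:
                    max = Math.max(max, d);
                    break;
                default: // EUCLIDEAN and SQUARED_EUCLIDEAN both accumulate squares
                    sum += d * d;
            }
        }
        switch (f) {
            case EUCLIDEAN:
                return Math.sqrt(sum);
            case CHEBYCHEV:
                return max;
            default:
                return sum;
        }
    }

    public static void main(String[] args) {
        double[] a = {0.0, 0.0};
        double[] b = {3.0, 4.0};
        for (Aggregation f : Aggregation.values()) {
            System.out.println(f + " = " + distance(a, b, f));
        }
    }
}
```

For the two points in main, the four aggregations give 5.0, 25.0, 4.0, and 7.0 respectively, which shows how strongly the choice of aggregation changes the scale of the result.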

<<enumeration>>
ClusteringApplyContent
clusterIdentifier
probability
qualityOfFit
distance

ApplySettings
(from Apply)

ClusteringApplySettings
mapByRank(content : ClusteringApplyContent, destPhysAttrNameArray : String, fromTop : boolean)
mapByClusterIdentifier(content : ClusteringApplyContent, clusterIdentifier : int, destPhysAttrName : String)
mapTopCluster(content : ClusteringApplyContent, destPhysAttrName : String)
mapClusters(content : ClusteringApplyContent, baseDestPhysAttrName : String)
getRank(destinationAttrName : String) : Integer
getRanks() : Integer
isFromTop() : boolean
getContent(destPhysAttrName : String) : ClusteringApplyContent
getContentsByCluster(clusterIdentifier : int) : ClusteringApplyContent
getContentsByRank(rank : int) : ClusteringApplyContent
getMappedClusterIdentifiers() : int
getMappedClusterIdentifier(destPhysAttrName : String) : Integer
getMappedDestinationAttrName(clusterIdentifier : int, contentType : ClusteringApplyContent) : String
getMappedDestinationAttrNames(content : ClusteringApplyContent) : String
getMappedContents() : ClusteringApplyContent
getMappedBaseDestinationAttributeName(content : ClusteringApplyContent) : String

ClusteringApplySettingsFactory
create() : ClusteringApplySettings
getDefaultApplySettings() : ClusteringApplySettings
supportsCapability(algorithm : MiningAlgorithm, content : ClusteringApplyContent) : boolean
supportsCapability(algorithm : MiningAlgorithm, mappingType : ClusteringApplyCapability) : boolean
getSupportedApplyContents(algorithm : MiningAlgorithm) : ClusteringApplyContent

<<enumeration>>
ClusteringApplyCapability
topSequentialRanks
bottomSequentialRanks
individualClusters
allClusters
topCluster

FIGURE 4.37 Package javax.datamining.clustering - ApplySettings

CategoryMatrix

SimilarityMatrix
getCellValue(category1 : Object, category2 : Object) : double
setCellValue(category1 : Object, category2 : Object, similarityValue : double)

SimilarityMatrixFactory
create(categorySet : CategorySet) : SimilarityMatrix

FIGURE 4.38 Package javax.datamining.clustering - Similarity Matrix

4.12 Package javax.datamining.rule


The rule package contains the interfaces that define predicate rules in support of clustering and trees. JDM supports several types of predicate specification: simple predicates of
the form attribute-operator-value(s), boolean predicates for TRUE or FALSE, and compound predicates.

Rule
getSupport() : double
getAbsoluteSupport() : long
getConfidence() : double
getAntecedent() : Predicate
getConsequent() : Predicate
getRuleIdentifier() : int
translate() : String
translate(format : RuleTranslationFormat) : String

<<enumeration>>
RuleTranslationFormat
systemDefault

Predicate

Associations (role names as shown in the figure):
ruleRefAntecedent: +rule, +antecedent
ruleRefConsequent: +rule, +consequent
compoundPredicateHasPredicates: +compoundPredicate, +predicate
Multiplicity and constraint labels in the figure: 0..n, 0..n, 1, 1..n, {ordered}.

CompoundPredicate
getOperator() : BooleanOperator
getPredicates() : Predicate

<<enumeration>>
BooleanOperator
or
and
xor
not
surrogate

BooleanPredicate
getValue() : boolean

SimplePredicate
getAttributeName() : String
getComparisonOperator() : ComparisonOperator
isNumericalValue() : boolean
getNumericalValue() : Double
getCategoryValues() : Object

<<enumeration>>
ComparisonOperator
equal
notEqual
lessThan
greaterThan
lessOrEqual
greaterOrEqual
in
notIn

FIGURE 4.39 Package javax.datamining.rule - Rule and Predicate
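
The distinction between simple and compound predicates can be sketched with java.util.function.Predicate standing in for the JDM Predicate interface. This is an illustration only: the class PredicateSketch, the reduced Op enum, the representation of a case as a Map of numerical attribute values, and the choice that a missing value never matches are all assumptions, not behavior defined by this specification.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Predicate;

final class PredicateSketch {
    /** A reduced set of comparison operators (hypothetical, for the sketch only). */
    enum Op { EQUAL, LESS_THAN, GREATER_OR_EQUAL }

    /** Simple predicate of the form attribute-operator-value over a numerical attribute. */
    static Predicate<Map<String, Double>> simple(String attribute, Op op, double value) {
        return record -> {
            Double actual = record.get(attribute);
            if (actual == null) return false; // sketch choice: a missing value never matches
            switch (op) {
                case EQUAL:
                    return actual == value;
                case LESS_THAN:
                    return actual < value;
                default:
                    return actual >= value;
            }
        };
    }

    public static void main(String[] args) {
        // Compound predicate: the "and" of two simple predicates.
        Predicate<Map<String, Double>> antecedent =
                simple("age", Op.GREATER_OR_EQUAL, 30).and(simple("income", Op.LESS_THAN, 50000));

        Map<String, Double> record = new HashMap<>();
        record.put("age", 42.0);
        record.put("income", 30000.0);
        System.out.println(antecedent.test(record)); // true: both simple predicates hold
    }
}
```

Composing with and, or, and negate mirrors the and, or, and not values of BooleanOperator; xor and surrogate have no direct counterpart in this stand-in.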

4.13 Package javax.datamining.statistics


The statistics package contains interfaces for accessing attribute statistics. Statistics may
be associated with physical attributes, logical attributes, or signature attributes. Vendors
determine which statistics are actually computed for their respective implementations, and
whether or not they are available on various objects, such as models. A task that computes
statistics on physical data is provided. Although the package is required, the use of statistics is optional.

AttributeStatisticsSet
getStatistics(attributeName : String) : UnivariateStatistics
getStatistics() : Collection
getNumberOfCases() : long
supportsCapability(capability : AttributeStatisticsSetCapability) : boolean
getStatisticsTimestamp() : Date

UnivariateStatistics
getName() : String
getValues() : Object
getFrequency(index : int) : long
getFrequencies() : long
getProbabilities() : double
getFrequency(property : CategoryProperty) : long
getDiscreteStatistics() : DiscreteStatistics
getNumericalStatistics() : NumericalStatistics
getContinuousStatistics() : ContinuousStatistics

NumericalStatistics
getVariance() : double
getQuantileLimits() : double
getQuantile(limit : double) : double
getMinimumValue() : double
getMaximumValue() : double
getMeanValue() : double
getStandardDeviation() : double
getMedianValue() : double
getInterQuartileRange() : double

<<enumeration>>
AttributeStatisticsSetCapability
missingFrequency
invalidValuesFrequency
continuousSum
continuousSumOfSquares
continuousSumsByInterval
numericalQuantiles
numericalMedian
numericalInterQuartileRange

ContinuousStatistics
getIntervals() : Interval
getFrequency(range : Interval) : long
getFrequencies() : long
getSum(range : Interval) : double
getSum() : double
getSumOfSquares(range : Interval) : doub...
getSumOfSquares() : double

DiscreteStatistics
getModalValue() : Object
getDiscreteValues() : Object
getFrequency(discreteValue : Object) : long
getFrequencies() : long

FIGURE 4.40 Package javax.datamining.statistics - AttributeStatistics
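
As an illustration of the values a NumericalStatistics object carries, the sketch below computes mean, variance, standard deviation, and median for one numerical attribute. This section does not say which variance estimator a vendor uses, so the population form (dividing by n) is an assumption here, as is the class name NumericalStatsSketch.

```java
import java.util.Arrays;

final class NumericalStatsSketch {
    static double mean(double[] v) {
        double sum = 0.0;
        for (double x : v) sum += x;
        return sum / v.length;
    }

    /** Population variance (divides by n); an assumption of this sketch. */
    static double variance(double[] v) {
        double m = mean(v);
        double sum = 0.0;
        for (double x : v) sum += (x - m) * (x - m);
        return sum / v.length;
    }

    static double standardDeviation(double[] v) {
        return Math.sqrt(variance(v));
    }

    /** Middle value of the sorted data, or the average of the two middle values. */
    static double median(double[] v) {
        double[] sorted = v.clone(); // leave the caller's array untouched
        Arrays.sort(sorted);
        int n = sorted.length;
        return n % 2 == 1 ? sorted[n / 2] : (sorted[n / 2 - 1] + sorted[n / 2]) / 2.0;
    }

    public static void main(String[] args) {
        double[] values = {2, 4, 4, 4, 5, 5, 7, 9};
        System.out.println(mean(values));              // 5.0
        System.out.println(variance(values));          // 4.0
        System.out.println(standardDeviation(values)); // 2.0
        System.out.println(median(values));            // 4.5
    }
}
```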

4.14 Package javax.datamining.algorithm


The Algorithm package consists of several sub-packages, each of which describes the
details that are specific to the algorithm and model.

javax.datamining.algorithm.tree: algorithm settings for decision tree algorithms

javax.datamining.algorithm.naivebayes: algorithm settings for the Naive Bayes algorithm

javax.datamining.algorithm.feedforwardneuralnet: algorithm settings for Feed Forward Neural Network algorithms

javax.datamining.algorithm.kmeans: algorithm settings for the K-Means clustering algorithm

4.14.1 Package algorithm.tree


The tree algorithm package defines settings for building classification and regression
trees. The algorithm interface can be used as a front end to several widely used algorithms.

SupervisedAlgorithmSettings

TreeSettings
getMaxSurrogates() : int
setMaxSurrogates(maxSurrogates : int)
getMaxDepth() : int
setMaxDepth(maxDepth : int)
determineMaxDepth(determineMaxDepth : boolean)
determineMaxDepth() : boolean
getMinNodeSize() : double
getMinNodeSizeUnit() : SizeUnit
getMinNodeSize(sizeUnit : SizeUnit) : double
setMinNodeSize(size : double, unit : SizeUnit)
getMinDecreaseInImpurity() : double
setMinDecreaseInImpurity(minImpurity : double)
getTreeSelectionMethod() : TreeSelectionMethod
setTreeSelectionMethod(selectionMethod : TreeSelectionMethod)
getMaxSplits() : int
setMaxSplits(maxSplits : int)
getMaximumPValue() : double
setMaximumPValue(maxPValue : double)
getBuildHomogeneityMetric() : TreeHomogeneityMetric
setBuildHomogeneityMetric(buildMetric : TreeHomogeneityMetric)
getPruningHomogeneityMetric() : TreeHomogeneityMetric
setPruningHomogeneityMetric(pruningMetric : TreeHomogeneityMetric)
computeNodeStatistics(computeNodeStatistics : boolean)
getComputeNodeStatistics() : boolean

TreeSettingsFactory
create() : TreeSettings
supportsCapability(capability : TreeCapability) : boolean
getMaxSurrogatesAllowed() : int
getMaxDepthAllowed() : int
supportsMinNodeSizeUnit(sizeUnit : SizeUnit) : SizeUnit

<<enumeration>>
SizeUnit
count
percentage

<<enumeration>>
TreeHomogeneityMetric
systemDetermined
systemDefault
meanSquaredError
meanAbsoluteDeviation
gini
entropy
misclassificationRatio

<<enumeration>>
TreeCapability
maxSurrogates
maxDepth
minAbsoluteSize
minPercentageSize
minDecreaseInImpurity
treeSelectionMethod
maxSplits
maxPValue
buildHomogeneityMetric
pruningHomogeneityMetric
nodeStatistics
missingValueHandling

<<enumeration>>
TreeSelectionMethod
systemDetermined
systemDefault
minimumErrorTree
oneStandardErrorTree

FIGURE 4.41 Package algorithm.tree - TreeSettings
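
Two of the TreeHomogeneityMetric values, gini and entropy, together with the notion of a decrease in impurity, can be sketched as follows. The formulas are the standard ones for classification trees and are assumed here; this specification names the metrics without defining them. The class ImpuritySketch is hypothetical, and how a vendor compares the computed decrease against setMinDecreaseInImpurity is not shown.

```java
final class ImpuritySketch {
    private static long total(long[] classCounts) {
        long n = 0;
        for (long c : classCounts) n += c;
        return n;
    }

    /** Gini impurity: 1 minus the sum of squared class proportions. */
    static double gini(long[] classCounts) {
        long n = total(classCounts);
        double sumSquares = 0.0;
        for (long c : classCounts) {
            double p = (double) c / n;
            sumSquares += p * p;
        }
        return 1.0 - sumSquares;
    }

    /** Entropy in bits; empty classes contribute nothing. */
    static double entropy(long[] classCounts) {
        long n = total(classCounts);
        double h = 0.0;
        for (long c : classCounts) {
            if (c == 0) continue;
            double p = (double) c / n;
            h -= p * Math.log(p) / Math.log(2.0);
        }
        return h;
    }

    /** Parent Gini impurity minus the case-weighted Gini impurity of the two children. */
    static double giniDecrease(long[] parent, long[] left, long[] right) {
        double n = total(parent);
        double weighted = total(left) / n * gini(left) + total(right) / n * gini(right);
        return gini(parent) - weighted;
    }

    public static void main(String[] args) {
        long[] parent = {5, 5};
        System.out.println(gini(parent));    // 0.5 for an even two-class node
        System.out.println(entropy(parent)); // one bit for an even two-class node
        // A split that separates the classes perfectly removes all impurity.
        System.out.println(giniDecrease(parent, new long[] {5, 0}, new long[] {0, 5}));
    }
}
```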


4.14.2 Package algorithm.naivebayes


The Naive Bayes algorithm package defines settings for building probabilistic models
using the Naive Bayes algorithm, or variants of it.

SupervisedAlgorithmSettings

NaiveBayesSettings
getSingletonThreshold() : double
setSingletonThreshold(singletonThreshold : double)
getPairwiseThreshold() : double
setPairwiseThreshold(pairwiseThreshold : double)

NaiveBayesSettingsFactory
create() : NaiveBayesSettings
supportsCapability(capability : NaiveBayesCapability) : boolean

<<enumeration>>
NaiveBayesCapability
singletonThreshold
pairwiseThreshold
missingValueHandling
singletonCount

FIGURE 4.42 Package algorithm.naivebayes - NaiveBayesSettings

4.14.3 Package algorithm.feedforwardneuralnet


The feed forward neural network algorithm package defines settings for building classification and regression models using neural networks. The algorithm interface can be used
as a front end to several popular variants of neural networks.

SupervisedAlgorithmSettings

FeedForwardNeuralNetSettings
getNeuralLayers() : NeuralLayer
setNeuralLayers(hiddenLayerArray : NeuralLayer)
getLearningAlgorithm() : LearningAlgorithm
setLearningAlgorithm(learningAlgorithm : LearningAlgorithm)
getMaxNumberOfIterations() : int
setMaxNumberOfIterations(maxIterations : int)
getMinErrorTolerance() : double
setMinErrorTolerance(minTolerance : double)
determineNumberOfNodesPerLayer() : boolean
determineNumberOfNodesPerLayer(determineNumberOfNodesPerLayer : boolean)

FeedForwardNeuralNetSettingsFactory
create() : FeedForwardNeuralNetSettings
supportsCapability(capability : FeedForwardNeuralNetCapability) : boolean

<<enumeration>>
FeedForwardNeuralNetCapability
backPropagation
backPropagationWithMomentum
bias
maximumIterations
minimumErrorTolerance
missingValueHandling
hiddenLayers

Associations (role names as shown in the figure):
backpropAlgorithmSettingsRefLearningAlgorithm: +backpropSettings, +learningAlgorithm
backpropAlgorithmSettingsHasLayer: +backpropSettings, +neuralLayer
Multiplicity labels in the figure: 1, 0..n, 1..n.

LearningAlgorithm

NeuralLayer

getNumberOfNodes() : int
setNumberOfNodes(nodes : int)
useBias() : boolean
useBias(useBias : boolean)
getActivationFunction() : ActivationFunction
setActivationFunction(function : ActivationFunction)

<<enumeration>>
ActivationFunction
systemDetermined
systemDefault
linearIdentity
logistic
hyperbolicTangent
sign
symmetricSign
softMax

Backpropagation
getLearningRate() : double
setLearningRate(rate : double)
getMomentum() : double
setMomentum(momentum : double)

Backpropagati onFactory
create() : Backpropa gation

NeuralLayerFactory
create(numberOfNodes : int) : NeuralLayer
getMaxNumberOfNodes() : int

FIGURE 4.43 Package algorithm.feedforwardneuralnet - FeedForwardNeuralNetSettings


4.14.4 Package algorithm.svm


The SVM algorithm package defines settings for building classification and regression
models.

SupervisedAlgorithmSettings
(from Supervised)

SVMClassificationSettings
getKernelFunction() : KernelFunction
setKernelFunction(kernelFunction : KernelFunction)
getCStrategy() : double
setCStrategy(cValue : double)
getTolerance() : double
setTolerance(tolerance : double)
getStandardDeviation() : double
setStandardDeviation(stdDeviation : double)
getComplexityFactor() : double
setComplexityFactor(factor : double)
getKernelCacheSize() : int
setKernelCacheSize(cacheSize : int)
getPolynomialDegree() : int
setPolynomialDegree(degree : int)

<<enumeration>>
KernelFunction
systemDefault
systemDetermined
kLinear
kGaussian
polynomial
hypertangent
sigmoid

<<enumeration>>
SVMClassificationCapability
cStrategy
tolerance
standardDeviation
complexityFactor
kernelCacheSize
polynomialDegree

SVMClassificationSettingsFactory
create() : SVMClassificationSettings
supportsCapability(capability : SVMClassificationCapability) : boolean
supportsCapability(kernelFunction : KernelFunction) : boolean

FIGURE 4.44 Package algorithm.svm.classification - SVMClassificationSettings

SupervisedAlgorithmSettings
(from Supervised)

SVMRegressionSettings
getKernelFunction() : KernelFunction
setKernelFunction(kernelFunction : KernelFunction)
getCStrategy() : double
setCStrategy(cValue : double)
getTolerance() : double
setTolerance(tolerance : double)
getStandardDeviation() : double
setStandardDeviation(stdDeviation : double)
getComplexityFactor() : double
setComplexityFactor(factor : double)
getKernelCacheSize() : int
setKernelCacheSize(cacheSize : int)
getPolynomialDegree() : int
setPolynomialDegree(degree : int)
getEpsilon() : double
setEpsilon(epsilon : double)

SVMRegressionSettingsFactory
create() : SVMRegressionSettings
supportsCapability(capability : SVMRegressionCapability) : boolean
supportsCapability(kernelFunction : KernelFunction) : boolean

<<enumeration>>
KernelFunction
systemDefault
systemDetermined
kLinear
kGaussian
polynomial
hypertangent
sigmoid

<<enumeration>>
SVMRegressionCapability
cStrategy
tolerance
standardDeviation
complexityFactor
kernelCacheSize
polynomialDegree
epsilon

FIGURE 4.45 Package algorithm.svm.regression - SVMRegressionSettings


4.14.5 Package algorithm.kmeans


The k-Means algorithm package defines settings for building clustering models. The algorithm interface can be used as a front end to several popularly used algorithms.

ClusteringAlgorithmSettings

KMeansSettings
getMaxNumberOfIterations() : int
setMaxNumberOfIterations(maxIterations : int)
getMinErrorTolerance() : double
setMinErrorTolerance(minErrorTolerance : double)
getDistanceFunction() : ClusteringDistanceFunction
setDistanceFunction(distanceFunction : ClusteringDistanceFunction)

KMeansSettingsFactory
create() : KMeansSettings
supportsCapability(capability : KMeansCapability) : boolean
supportsCapability(distanceFunction : ClusteringDistanceFunction) : boolean

<<enumeration>>
ClusteringDistanceFunction
systemDetermined
systemDefault
euclidean

<<enumeration>>
KMeansCapability
minimumErrorTolerance
missingValueHandling

FIGURE 4.46 Package algorithm.kmeans - KMeansSettings


4.15 Package javax.datamining.modeldetail


The modeldetail package consists of several sub-packages, each of which describes the
details specific to the representation of a particular kind of model.

javax.datamining.modeldetail.tree: representation for decision tree models
javax.datamining.modeldetail.feedforwardneuralnet: representation for feed forward neural network models
javax.datamining.modeldetail.naivebayes: representation for Naive Bayes models
javax.datamining.modeldetail.svm: representation for Support Vector Machine models


4.15.1 Package modeldetail.tree
The tree package provides a general representation for decision tree models.

ModelDetail

TreeModelDetail
getRootNode() : TreeNode
getNodes() : TreeNode
getNodeIdentifiers() : int
getRules() : Collection
getRule(nodeId : int) : Rule
getNode(nodeId : int) : TreeNode
getTreeDepth() : int
getNumberOfNodes() : int
getNumberOfLeafNodes() : int

<<enumeration>>
PredictionType
category
mean
median

+treeModel

treeRepresentationHasNode

1
+rootNode 1
TreeNode
getIdentifier() : int
getTargetCount(target : Object) : long
getTargetCounts() : long
getCaseCount() : long
getNumberOfChildren() : int
getParent() : TreeNode
getAncestors() : TreeNode
getChildren() : TreeNode
getPredicate() : Predicate
getSurrogates() : Predicate
getPrediction() : Object
getLevel() : int
getNodeStatistics() : AttributeStatisticsSet
getPredictionType() : PredictionType
isLeaf() : boolean
getRule() : Rule
+treeNode

+childNode
0..n

+parent

treeNodeHasPredic...
treeNodeHasChild
+predicate

0..1

Predicate

FIGURE 4.47 Package modeldetail.tree - TreeModelDetail


4.15.2 Package modeldetail.feedforwardneuralnet


The feedforwardneuralnet package provides a general representation for feed forward neural network models.
ModelDetail

NeuralNetworkModelDetail
getNumberOfLayers() : int
getLayerIdentifiers() : int
getActivationFunction(layerId : int) : ActivationFunction
getNumberOfNeurons(layerId : int) : int
getNeuronIdentifiers(layerId : int) : int
getWeight(parentLayerId : int, parentNeuronId : int, childNeuronId : int) : double
getBias(layerId : int, neuronId : int) : double

Layer ID
0 : input
1..n-1 : hidden
n : output

FIGURE 4.48 Package modeldetail.feedforwardneuralnet - NeuralNetworkModelDetail
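
The layer numbering shown in the figure (0 for input, 1..n-1 for hidden, n for output) suggests how the weights and biases can be walked layer by layer. The following is a minimal sketch of such a walk, not JDM code: the class ForwardPassSketch and its arrays are hypothetical stand-ins for the values that getWeight and getBias would return, and a logistic activation is assumed for every non-input layer.

```java
// Hypothetical stand-in for the values exposed by a neural network model
// detail; this is not a JDM class.
class ForwardPassSketch {

    // weights[l][i][j]: weight from neuron i in layer l to neuron j in layer l + 1
    // biases[l][j]:     bias of neuron j in layer l + 1
    static double[] forward(double[] input, double[][][] weights, double[][] biases) {
        double[] activations = input;               // layer 0 holds the inputs
        for (int l = 0; l < weights.length; l++) {  // produces layers 1..n
            double[] next = new double[biases[l].length];
            for (int j = 0; j < next.length; j++) {
                double sum = biases[l][j];
                for (int i = 0; i < activations.length; i++) {
                    sum += weights[l][i][j] * activations[i];
                }
                next[j] = logistic(sum);            // assumed activation function
            }
            activations = next;
        }
        return activations;                         // layer n holds the outputs
    }

    static double logistic(double x) {
        return 1.0 / (1.0 + Math.exp(-x));
    }

    public static void main(String[] args) {
        double[][][] weights = { { { 0.5 }, { -0.25 } } }; // 2 inputs, 1 output
        double[][] biases = { { 0.0 } };
        double[] out = forward(new double[] { 1.0, 2.0 }, weights, biases);
        System.out.println(out[0]);
    }
}
```

In a real model, the activation function of each layer would be taken from getActivationFunction(layerId) rather than assumed.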

4.15.3 Package modeldetail.naivebayes


The naivebayes package provides a general representation for Naive Bayes models.

ModelDetail

NaiveBayesModelDetail
getCount(attributeName : String, attributeValue : Object) : int
getPairCount(attributeName : String, predictorValue : Object, targetValue : Object) : int
getPairProbability(attributeName : String, predictorValue : Object, targetValue : Object) : double
getPairProbabilities(attrName : String, targetValue : Object) : Map
getTargetCount(targetValue : Object) : long
getTargetProbability(targetValue : Object) : double

FIGURE 4.49 Package modeldetail.naivebayes - NaiveBayesModelDetail
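
To show how the probabilities exposed by a Naive Bayes model detail could be combined, the following sketch computes normalized class scores for one record. It is not JDM code. It assumes that getTargetProbability returns the prior probability of a target value and that getPairProbability returns the probability of a predictor value given a target value; the specification text in this section does not state these semantics, and the class NaiveBayesScoreSketch is hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration; the maps stand in for values read from a model detail.
class NaiveBayesScoreSketch {

    // priors:       target value -> assumed P(target)
    // conditionals: target value -> assumed P(predictor value | target),
    //               one entry per predictor value present in the record
    static Map<String, Double> posterior(Map<String, Double> priors,
                                         Map<String, double[]> conditionals) {
        Map<String, Double> scores = new LinkedHashMap<>();
        double total = 0.0;
        for (Map.Entry<String, Double> prior : priors.entrySet()) {
            double score = prior.getValue();
            for (double p : conditionals.get(prior.getKey())) {
                score *= p;                  // independence assumption of Naive Bayes
            }
            scores.put(prior.getKey(), score);
            total += score;
        }
        for (Map.Entry<String, Double> e : scores.entrySet()) {
            e.setValue(e.getValue() / total); // normalize so the scores sum to 1
        }
        return scores;
    }

    public static void main(String[] args) {
        Map<String, Double> priors = new LinkedHashMap<>();
        priors.put("yes", 0.5);
        priors.put("no", 0.5);
        Map<String, double[]> conditionals = new LinkedHashMap<>();
        conditionals.put("yes", new double[] { 0.8 });
        conditionals.put("no", new double[] { 0.2 });
        System.out.println(posterior(priors, conditionals));
    }
}
```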


4.15.4 Package modeldetail.svm


The svm package provides a general representation for Support Vector Machine models.

ModelDetail
(from Base)

SVMModelDetail
isLinearSVMModel() : boolean
getNumberOfSupportVectors() : int
getNumberOfBoundedVectors() : int
getNumberOfUnboundedVectors() : int

SVMRegressionModelDetail
getCoefficient(categoricalAttrName : String, categoryValue : Object) : double
getCoefficient(numericalAttrName : String) : double
getCoefficients(attrName : String) : Map
getBias() : double

SVMClassificationModelDetail
getCoefficient(targetValue : Object, categoricalAttrName : String, categoryValue : Object) : double
getCoefficient(targetValue : Object, numericalAttrName : String) : double
getCoefficients(targetValue : Object, attrName : String) : Map
getBias(targetValue : Object) : double

FIGURE 4.50 Package modeldetail.svm - SVMModelDetail
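
For a linear model, per-attribute coefficients and a bias are commonly combined as a weighted sum. The following sketch illustrates that combination for numerical attributes only. It is not JDM code: the class LinearSvmScoreSketch is hypothetical, the map stands in for values that getCoefficient would return, and the assumption that a JDM vendor scores records this way is ours.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration of a linear decision value: bias + sum(coefficient * value).
class LinearSvmScoreSketch {

    static double score(Map<String, Double> numericCoefficients,
                        Map<String, Double> record,
                        double bias) {
        double sum = bias;
        for (Map.Entry<String, Double> c : numericCoefficients.entrySet()) {
            Double value = record.get(c.getKey());
            if (value != null) {             // attributes missing from the record are skipped
                sum += c.getValue() * value;
            }
        }
        return sum;
    }

    public static void main(String[] args) {
        Map<String, Double> coefficients = new HashMap<>();
        coefficients.put("A", 2.0);
        coefficients.put("B", -1.0);
        Map<String, Double> record = new HashMap<>();
        record.put("A", 3.0);
        record.put("B", 4.0);
        System.out.println(score(coefficients, record, 0.5));
    }
}
```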


5. Code examples
In this section, we provide several code examples to illustrate the intended use of the JDM
API. These examples do not explore all mining functions nor all algorithms. We have
selected a few data mining usage scenarios from which other examples could be derived
given the individual interface documentation.
In particular, we illustrate:

building a clustering model using the clustering mining function (Section 5.1)
applying a clustering model to a data set and specifying the apply settings (Section 5.2)
applying a clustering model to an individual record (Section 5.3)
building a classification model using the classification mining function (Section 5.4)
testing a classification model to determine model accuracy (Section 5.5)
extracting rules from a decision tree model (Section 5.6)
extracting rules from an association model (Section 5.7)
importing and exporting a model (Section 5.8)
using reflection (Section 5.9)
establishing a connection to the DME (Section 5.10)
In the examples, a connection to a DME is assumed to be available as dmeConn, and
exception handling is intentionally omitted for readability. For the same reason, vendor
capabilities are not checked. Uniform resource identifiers (URIs) represent the physical
data in the examples; refer to Section 5.11 for more information on URIs. Because the
URI format is vendor specific, we do not specify URI values in the examples.

5.1 Building a clustering model


The following code illustrates how to build a clustering model on a table stored in a location that is expressed as a URI.
// Create the physical representation of the data
(1) PhysicalDataSetFactory pdsFactory = (PhysicalDataSetFactory)
        dmeConn.getFactory( "javax.datamining.data.PhysicalDataSet" );
(2) PhysicalDataSet buildData = pdsFactory.create( uri, true );
(3) dmeConn.saveObject( "myBuildData", buildData, false );
// Create the logical representation of the data from physical data
(4) LogicalDataFactory ldFactory = (LogicalDataFactory)
        dmeConn.getFactory( "javax.datamining.data.LogicalData" );
(5) LogicalData ld = ldFactory.create( buildData );
(6) dmeConn.saveObject( "myLogicalData", ld, false );
// Create the settings to build a clustering model
(7) ClusteringSettingsFactory csFactory = (ClusteringSettingsFactory)
        dmeConn.getFactory( "javax.datamining.clustering.ClusteringSettings" );
(8) ClusteringSettings clusteringSettings = csFactory.create();
(9) clusteringSettings.setLogicalDataName( "myLogicalData" );
(10) clusteringSettings.setMaxNumberOfClusters( 20 );

(11) clusteringSettings.setMinClusterCaseCount( 5 );
(12) dmeConn.saveObject( "myClusteringBS", clusteringSettings, false );
// Create a task to build a clustering model with data and settings
(13) BuildTaskFactory btFactory = (BuildTaskFactory)
        dmeConn.getFactory( "javax.datamining.task.BuildTask" );
(14) BuildTask task = btFactory.create( "myBuildData", "myClusteringBS",
        "myClusteringModel" );
(15) dmeConn.saveObject( "myClusteringTask", task, false );
// Execute the task and check the status
(16) ExecutionHandle handle = dmeConn.execute( "myClusteringTask" );
(17) handle.waitForCompletion( Integer.MAX_VALUE ); // wait until done
(18) ExecutionStatus status = handle.getLatestStatus();
(19) if( ExecutionState.success.equals( status.getState() ) )
(20)     // task completed successfully...

In lines 1 to 3, we create a PhysicalDataSet object that specifies the location of the data
via a URI, and save the object in the mining server. In line 2, because the second
parameter is true, the PhysicalAttribute metadata is derived automatically from the data
identified by the URI and imported into the object.
In lines 4 to 6, we create a LogicalData instance based on the specified physical data, and
save the object in the mining server. Here, the default behavior is to create a LogicalAttribute instance for each physical attribute in the source data. Whether an attribute is
categorical or numerical is derived from the data type of its attribute (or column) and possibly from the number of unique values in the attribute. Note that the logical data must be
saved before it can be used in build settings. The logical data may be omitted from the build settings if
the mining function does not support it or if all attributes are to be used as they are. In this
example, no changes are made to the logical data after its content is populated from
the physical data, so it can be omitted. As such, lines 4 to 6 as well as line 9 are not required for
this example.
In lines 7 through 12, we create a ClusteringSettings instance, specify the logical data
name, the maximum number of clusters, and the minimum cluster size desired, and save
the settings under a name in the mining server through the connection.
In lines 13 through 15, we create and save a BuildTask instance, which specifies the data to be mined,
the settings describing the type of model to build, and the name under which the model is placed
in the repository connected with dmeConn. By default, the mapping between physical attributes (or
columns) in the data and logical attributes is defined by name equivalence.
However, if a specific mapping is desired, the following operations can be
used to map physical attributes to logical attributes:
java.util.Map buildDataMap = new java.util.HashMap();
buildDataMap.put( "PERSON_AGE", "AGE" );
task.setBuildDataMap( buildDataMap );

This code maps the physical attribute PERSON_AGE to the logical attribute AGE. The
name AGE appears in the logical data and in the model signature, and operations that use
this model, e.g., apply, must use the name AGE. Such a mapping is useful when a physical
attribute has a name that is difficult to understand.
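
The effect of such a map on a single record can be sketched in plain Java. The class AttributeNameMapping below is a hypothetical illustration, not part of JDM: it renames the keys of a record according to the map and leaves unmapped attributes under their physical names, mirroring the name-equivalence default.

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration of what a physical-to-logical name map does to one record.
class AttributeNameMapping {

    // Renames record keys using the map; unmapped attributes keep their
    // physical name (name equivalence).
    static Map<String, Object> toLogicalNames(Map<String, Object> physicalRecord,
                                              Map<String, String> physicalToLogical) {
        Map<String, Object> logicalRecord = new LinkedHashMap<>();
        for (Map.Entry<String, Object> e : physicalRecord.entrySet()) {
            String logicalName = physicalToLogical.getOrDefault(e.getKey(), e.getKey());
            logicalRecord.put(logicalName, e.getValue());
        }
        return logicalRecord;
    }

    public static void main(String[] args) {
        Map<String, String> buildDataMap = new HashMap<>();
        buildDataMap.put("PERSON_AGE", "AGE");

        Map<String, Object> row = new LinkedHashMap<>();
        row.put("PERSON_AGE", 45);
        row.put("INCOME", 75000);

        System.out.println(toLogicalNames(row, buildDataMap)); // {AGE=45, INCOME=75000}
    }
}
```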
In line 15, we explicitly save the task. The task itself and all objects associated with it
must be saved prior to asynchronous task execution. A task need not be saved for synchronous execution.


In line 16, we use the connection to execute the task. When no algorithm is specified in the
settings, the DME selects an algorithm with suitable default settings to produce the clustering model.
The resulting model is placed in the MOR represented by the connection through which the task is executed. The user can later refer to the model by name
to apply it to data.
In lines 17 through 19, the application waits for the asynchronous execution to complete
and then checks the latest execution status obtained from the execution handle.
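
The submit, wait, and check-status pattern is not specific to JDM. As a rough analogy only, the sketch below reproduces it with the standard java.util.concurrent classes; AsyncTaskSketch, its State enumeration, and runAndWait are hypothetical names, a Future plays the role of the execution handle, and a single-thread executor plays the role of the DME.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical analogy for execute / waitForCompletion / getLatestStatus.
class AsyncTaskSketch {

    enum State { SUCCESS, ERROR, EXECUTING }

    static State runAndWait(Callable<?> task, long timeoutSeconds) {
        ExecutorService engine = Executors.newSingleThreadExecutor();
        try {
            Future<?> handle = engine.submit(task);       // returns immediately
            handle.get(timeoutSeconds, TimeUnit.SECONDS); // block until done or timed out
            return State.SUCCESS;
        } catch (ExecutionException e) {
            return State.ERROR;                           // the task itself failed
        } catch (TimeoutException e) {
            return State.EXECUTING;                       // still running at the timeout
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return State.ERROR;
        } finally {
            engine.shutdownNow();                         // the sketch always stops its engine
        }
    }

    public static void main(String[] args) {
        State state = runAndWait(() -> "clustering model built", 5);
        if (state == State.SUCCESS) {
            System.out.println("task completed successfully");
        }
    }
}
```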

5.2 Applying a clustering model to data


The following code illustrates how to apply a clustering model to input data to find the
clusters and associated probabilities of each case in the input.
// Create the physical representation of the input data
(1) PhysicalDataSetFactory pdsFactory = (PhysicalDataSetFactory)
        dmeConn.getFactory( "javax.datamining.data.PhysicalDataSet" );
(2) PhysicalDataSet applyInputData = pdsFactory.create( uri, true );
(3) dmeConn.saveObject( "applyInput", applyInputData, false );
// Create the output specification of apply
(4) ClusteringApplySettingsFactory casFactory = (ClusteringApplySettingsFactory)
        dmeConn.getFactory( "javax.datamining.clustering.ClusteringApplySettings" );
(5) ClusteringApplySettings applySettings = casFactory.create();
(6) java.util.Map sourceDestMap = new java.util.HashMap();
(7) sourceDestMap.put( "CustomerId", "ID" ); // map CustomerId as ID
(8) applySettings.setSourceDestinationMap( sourceDestMap );
(9) applySettings.mapTopCluster( ClusteringApplyContent.clusterIdentifier,
        "ClusterId" ); // Output column for the top cluster
(10) applySettings.mapTopCluster( ClusteringApplyContent.probability,
        "Probability" ); // Output column for the top probability
(11) dmeConn.saveObject( "myApplySettings", applySettings, false );
// Create a task for apply with data set
(12) DataSetApplyTaskFactory datFactory = (DataSetApplyTaskFactory)
        dmeConn.getFactory( "javax.datamining.task.apply.DataSetApplyTask" );
(13) DataSetApplyTask applyTask = datFactory.create(
        "applyInput", "myClusteringModel", "myApplySettings", outputURI );
(14) dmeConn.saveObject( "myApplyTask", applyTask, false );
// Execute the apply task
(15) ExecutionHandle execHandle = dmeConn.execute( "myApplyTask" );
(16) execHandle.waitForCompletion( Integer.MAX_VALUE ); // wait until done

In lines 1 through 3, we create the PhysicalDataSet object from a URI. The URI provides the information necessary to access the apply data. The PhysicalDataSet object is populated with physical attributes (because the second parameter is true) and saved in the
repository represented by the connection.
In lines 4 through 11, we create a ClusteringApplySettings object to specify the
content of the apply output. In this example, the apply output table will have columns
for the customer id copied directly from the input table (under the new name ID), the id of the cluster
with the highest probability (under the name ClusterId), and its probability (under the name
Probability).
In lines 12 through 14, we create the DataSetApplyTask object with the names of the input
data, the clustering model, and the apply settings, and save the task. A URI is also provided indicating where the apply output data is to be persisted. In line 15, we execute the
apply operation using the data mining server connection. In line 16, we wait for the apply task to complete.
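
The "top cluster" written to the ClusterId and Probability columns is the cluster with the highest probability for a case. A minimal sketch of that selection follows. The class TopClusterSketch is hypothetical and not part of JDM, and the rule for breaking ties (first cluster encountered wins) is an assumption, since the selection criteria are left to the vendor.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration: pick the cluster with the highest probability for one case.
class TopClusterSketch {

    static Map.Entry<Integer, Double> topCluster(Map<Integer, Double> clusterProbabilities) {
        Map.Entry<Integer, Double> best = null;
        for (Map.Entry<Integer, Double> e : clusterProbabilities.entrySet()) {
            if (best == null || e.getValue() > best.getValue()) {
                best = e; // strictly greater, so the first of several equal maxima is kept
            }
        }
        return best;      // null when the map is empty
    }

    public static void main(String[] args) {
        Map<Integer, Double> probabilities = new LinkedHashMap<>();
        probabilities.put(1, 0.2);
        probabilities.put(2, 0.7);
        probabilities.put(3, 0.1);
        Map.Entry<Integer, Double> top = topCluster(probabilities);
        System.out.println("ClusterId=" + top.getKey() + " Probability=" + top.getValue());
    }
}
```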

5.3 Applying a clustering model to a record


The following code illustrates how to apply a model to an individual record in support of
real-time scoring. The output record is obtained from the apply task after execution.
// Create a physical data record for input
(1) PhysicalDataRecordFactory pdrFactory = (PhysicalDataRecordFactory)
        dmeConn.getFactory( "javax.datamining.data.PhysicalDataRecord" );
(2) PhysicalDataRecord applyInputRecord = pdrFactory.create();
(3) applyInputRecord.setValue( "ID", new Integer(1) );
(4) applyInputRecord.setValue( "AGE", new Integer(45) );
(5) applyInputRecord.setValue( "INCOME", new Integer(75000) );
// Create the output specification of apply (output will be a record too)
(6) ClusteringApplySettingsFactory casFactory = (ClusteringApplySettingsFactory)
        dmeConn.getFactory( "javax.datamining.clustering.ClusteringApplySettings" );
// The default settings give the top cluster
(7) ClusteringApplySettings applySettings =
        casFactory.getDefaultApplySettings();
(8) java.util.Map sourceDestMap = new java.util.HashMap();
(9) sourceDestMap.put( "ID", "ID" ); // copy the attribute to the output
(10) applySettings.setSourceDestinationMap( sourceDestMap );
(11) dmeConn.saveObject( "myRecordScoringSettings", applySettings, false );
// Load the model into memory for faster scoring (if supported)
(12) dmeConn.requestModelLoad( "myClusteringModel" );
// Create an apply task and execute (here, name equivalence is expected)
(13) RecordApplyTaskFactory ratFactory = (RecordApplyTaskFactory)
        dmeConn.getFactory( "javax.datamining.task.apply.RecordApplyTask" );
(14) RecordApplyTask applyTask = ratFactory.create( applyInputRecord,
        "myClusteringModel", "myRecordScoringSettings" );
// Synchronous execution without timeout
(15) ExecutionStatus status = dmeConn.execute( applyTask, null );
(16) if( !status.getState().equals( ExecutionState.success ) )
        { ... } // error
// Unload the model to free up the mining engine
(17) dmeConn.requestModelUnload( "myClusteringModel" );
// Retrieve the results from the task as a record
(18) PhysicalDataRecord applyOutputRecord = applyTask.getOutputRecord();
(19) Integer customerId = // attribute copied from the input record
        (Integer) applyOutputRecord.getValue( "ID" );
// Cluster_Identifier is in the default apply settings for predicted cluster
(20) Integer topClusterId =
        (Integer) applyOutputRecord.getValue( "Cluster_Identifier" );

An input record is created in lines 1 through 5. This record can be reused in subsequent scoring tasks by changing the specific attribute values. In lines 6 through 11, the
default apply output specification is used, so only the top cluster (determined by vendor-specific criteria) is included in the output, under the attribute name
Cluster_Identifier (line 20). The clustering apply settings are saved and referenced by
name in subsequent invocations. An attribute (ID) is directly copied from the input to
the output (lines 8 through 10) and is retrieved from the output (line 19). The attributes in
the input data must be compatible with those in the model signature, including their names.
Note that real-time record apply has its own task, RecordApplyTask.
Some implementations may support loading models into memory for faster real-time scoring. How
loaded models are managed is up to the implementation. If this feature is not supported, the load and unload requests (lines 12 and 17) are no-ops.

5.4 Building a classification model


The following code illustrates how to build a classification model on data stored in a location that is expressed as a URI.
// Create the physical representation of the data
(1) PhysicalDataSetFactory pdsFactory = (PhysicalDataSetFactory)
        dmeConn.getFactory( "javax.datamining.data.PhysicalDataSet" );
(2) PhysicalDataSet pd = pdsFactory.create( "\\autosales.data", true );
(3) dmeConn.saveObject( "myPD", pd, false );
// Create the logical representation of the data from physical data
(4) LogicalDataFactory ldFactory = (LogicalDataFactory)
        dmeConn.getFactory( "javax.datamining.data.LogicalData" );
(5) LogicalData ld = ldFactory.create( pd );
// Specify how attributes are to be used by the algorithm
(6) LogicalAttributeFactory laFactory = (LogicalAttributeFactory)
        dmeConn.getFactory( "javax.datamining.data.LogicalAttribute" );
(7) LogicalAttribute zipcode = ld.getAttribute( "zip_code" );
(8) zipcode.setAttributeType( AttributeType.categorical );
(9) dmeConn.saveObject( "myLD", ld, false );
// Create a settings object to be used for a classification model
(10) ClassificationSettingsFactory csFactory = (ClassificationSettingsFactory)
        dmeConn.getFactory(
        "javax.datamining.supervised.classification.ClassificationSettings" );
(11) ClassificationSettings settings = csFactory.create();
(12) settings.setTargetAttributeName( "purchase_car" );
(13) settings.setCostMatrixName( "salesCost" ); // predefined cost matrix
// Create the AlgorithmSettings and add it to the BuildSettings
(14) NaiveBayesSettingsFactory nbFactory = (NaiveBayesSettingsFactory)
        dmeConn.getFactory( "javax.datamining.algorithm.naivebayes.NaiveBayesSettings" );
(15) NaiveBayesSettings nbSettings = nbFactory.create();
(16) nbSettings.setSingletonThreshold( .01 );
(17) nbSettings.setPairwiseThreshold( .01 );
// Associate LogicalData and AlgorithmSettings with the BuildSettings
(18) settings.setAlgorithmSettings( nbSettings );
(19) settings.setLogicalDataName( "myLD" );
// Save the BuildSettings
(20) dmeConn.saveObject( "myBS", settings, false );
// Create the build task and verify the task before execution
(21) BuildTaskFactory btFactory = (BuildTaskFactory)
        dmeConn.getFactory( "javax.datamining.task.BuildTask" );
(22) BuildTask buildTask = btFactory.create( "myPD", "myBS", "myModel" );
(23) VerificationReport report = buildTask.verify();
(24) if( report != null ) { // either error or warning
(25)     ReportType reportType = report.getReportType();
(26)     // check if it's just a warning or an error
(27) }
(28) dmeConn.saveObject( "myBuildTask", buildTask, false );
// Execute the task and block until finished
(29) ExecutionHandle handle = dmeConn.execute( "myBuildTask" );
(30) handle.waitForCompletion( Integer.MAX_VALUE ); // wait until done
// Access the model if model was successfully built
(31) ExecutionStatus status = handle.getLatestStatus();
(32) if( ExecutionState.success.equals( status.getState() ) ) {
(33)     ClassificationModel model = (ClassificationModel)
             dmeConn.retrieveObject( "myModel", NamedObject.model );
(34)     // work with the model here...
(35) }

In this example, a classification model is built to identify customers who are likely to purchase cars. A
cost matrix is used to assess misclassification cost when building the model. Note that
CostMatrix is a named object in JDM, and a predefined cost matrix is used in this example.
The resulting classification model will be used to predict the class value of the attribute
purchase_car.
This example shows how to change the default attribute type through the logical data. Suppose that the attribute zip_code is numerical in the physical data, but must be
treated as categorical by the algorithm during model build. When the physical data is created in line 2, the type of the attribute is set to numerical by default. In the logical data, however, the type of the attribute is changed to categorical (lines 7 and 8).
The Naive Bayes algorithm is specified to build the classification model (lines 14 to 18).
The build task is verified before execution to avoid costly errors during model build
(lines 23 to 27). Verifying a task before submitting it for execution is good practice, particularly for long-running tasks.
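
How a vendor uses the cost matrix during model build is not described here. One common way a cost matrix influences predictions is to choose the class with the lowest expected misclassification cost rather than the most probable class. The sketch below illustrates that idea only; the class CostSensitiveChoice, the class ordering, and the cost values are hypothetical and are not taken from JDM.

```java
// Hypothetical illustration of a minimum-expected-cost decision.
class CostSensitiveChoice {

    // classProbabilities[a]: probability that the actual class is a
    // cost[a][p]:            cost of predicting class p when the actual class is a
    static int leastCostPrediction(double[] classProbabilities, double[][] cost) {
        int best = -1;
        double bestCost = Double.POSITIVE_INFINITY;
        for (int predicted = 0; predicted < cost[0].length; predicted++) {
            double expected = 0.0;
            for (int actual = 0; actual < classProbabilities.length; actual++) {
                expected += classProbabilities[actual] * cost[actual][predicted];
            }
            if (expected < bestCost) {
                bestCost = expected;
                best = predicted;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // Class 0 = purchases a car, class 1 = does not purchase (assumed ordering).
        double[] probabilities = { 0.2, 0.8 };
        // Missing a buyer is assumed to cost 10 times more than a wasted contact.
        double[][] salesCost = { { 0.0, 10.0 }, { 1.0, 0.0 } };
        System.out.println(leastCostPrediction(probabilities, salesCost));
    }
}
```

With these assumed costs, the sketch predicts class 0 (expected cost 0.8 versus 2.0) even though class 1 is more probable.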

5.5 Testing a classification model


The following code illustrates how to test a classification model to determine its accuracy
using a data table.
// Create the physical representation of the input data for test
(1) PhysicalDataSetFactory pdsFactory = (PhysicalDataSetFactory)
        dmeConn.getFactory( "javax.datamining.data.PhysicalDataSet" );
(2) PhysicalDataSet testInputData = pdsFactory.create( uri, true );
(3) dmeConn.saveObject( "testInputData", testInputData, false );
// Create a task to run test operation, the result is named myTestMetrics
(4) ClassificationTestTaskFactory cttFactory =
        (ClassificationTestTaskFactory) dmeConn.getFactory(
        "javax.datamining.supervised.classification.ClassificationTestTask" );
(5) ClassificationTestTask testTask = cttFactory.create(
        "testInputData", "myClassificationModel", "myTestMetrics" );
// Enable computation of confusion matrix as the result of test
(6) testTask.computeMetric( ClassificationTestMetricOption.confusionMatrix );
(7) dmeConn.saveObject( "myTestTask", testTask, false );
// Execute the task asynchronously, then wait until done
(8) ExecutionHandle execHandle = dmeConn.execute( "myTestTask" );
(9) execHandle.waitForCompletion( Integer.MAX_VALUE ); // wait until done

// Retrieve the test metrics
(10) ClassificationTestMetrics testMetrics = (ClassificationTestMetrics)
        dmeConn.retrieveObject( "myTestMetrics", NamedObject.testMetrics );
(11) Double accuracy = testMetrics.getAccuracy();
(12) ConfusionMatrix matrix = testMetrics.getConfusionMatrix();

In lines 1 and 2, we create the PhysicalDataSet object from a URI. This object is
populated with physical attributes that come directly from the specified data. In line 3, we
save the data in the mining server through the connection.
In lines 4 through 7, we create and save the test task object with the names of the test
data, the classification model, and the test metrics. In line 6, the test task is configured to produce a confusion
matrix as part of the result. Other optional test metrics include lift and receiver operating characteristics (ROC).
In line 8, we execute the test operation using the connection. In line 9, we wait for the
test task to complete. In line 10, we retrieve the classification
test metrics object by name. In line 11, we get the accuracy for this model as computed
from the input test data. In line 12, we get the confusion matrix.
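
Accuracy and the confusion matrix are closely related: accuracy is the share of test cases that fall on the diagonal of the matrix. The following sketch shows that relationship with a plain array. The class ConfusionMatrixSketch and the counts are hypothetical; the code does not use the JDM ConfusionMatrix interface.

```java
// Hypothetical illustration: accuracy derived from confusion matrix counts.
class ConfusionMatrixSketch {

    // counts[actual][predicted]: number of test cases with the given actual
    // and predicted class
    static double accuracy(long[][] counts) {
        long correct = 0;
        long total = 0;
        for (int actual = 0; actual < counts.length; actual++) {
            for (int predicted = 0; predicted < counts[actual].length; predicted++) {
                total += counts[actual][predicted];
                if (actual == predicted) {
                    correct += counts[actual][predicted]; // diagonal = correct predictions
                }
            }
        }
        return total == 0 ? 0.0 : (double) correct / total;
    }

    public static void main(String[] args) {
        long[][] counts = { { 40, 10 }, { 5, 45 } };
        System.out.println(accuracy(counts)); // 85 of 100 cases are on the diagonal
    }
}
```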

5.6 Building and extracting rules from a tree model


The following code illustrates how to build a tree model and how to extract rules from the
resulting tree model.
// Create the physical representation of the input data for build
(1) PhysicalDataSetFactory pdsFactory = (PhysicalDataSetFactory)
        dmeConn.getFactory( "javax.datamining.data.PhysicalDataSet" );
(2) PhysicalDataSet treeData = pdsFactory.create( uri, true );
(3) dmeConn.saveObject( "myTreeBuildData", treeData, false );
// Create the logical representation of the input data
(4) LogicalDataFactory ldFactory = (LogicalDataFactory)
        dmeConn.getFactory( "javax.datamining.data.LogicalData" );
(5) LogicalData logicalData = ldFactory.create( treeData );
(6) dmeConn.saveObject( "myTreeLD", logicalData, false );
// Create the settings to build a tree classification model
(7) ClassificationSettingsFactory csFactory =
        (ClassificationSettingsFactory) dmeConn.getFactory(
        "javax.datamining.supervised.classification.ClassificationSettings" );
(8) ClassificationSettings buildSettings = csFactory.create();
(9) buildSettings.setLogicalDataName( "myTreeLD" );
(10) buildSettings.setTargetAttributeName( "Buy_Product" );
// Create Tree algorithm settings to build a classification model
(11) TreeSettingsFactory tsFactory = (TreeSettingsFactory)
        dmeConn.getFactory( "javax.datamining.algorithm.tree.TreeSettings" );
(12) TreeSettings treeSettings = tsFactory.create();
(13) treeSettings.setMaxDepth( 5 );
(14) buildSettings.setAlgorithmSettings( treeSettings );
(15) dmeConn.saveObject( "myTreeSettings", buildSettings, false );
// Create build task and submit for execution
(16) BuildTaskFactory btFactory = (BuildTaskFactory)
        dmeConn.getFactory( "javax.datamining.task.BuildTask" );
(17) BuildTask buildTask = btFactory.create(
        "myTreeBuildData", "myTreeSettings", "myTreeModel" );

(18) dmeConn.saveObject( "myTreeBuildTask", buildTask, false );
(19) ExecutionHandle execHandle = dmeConn.execute( "myTreeBuildTask" );
(20) execHandle.waitForCompletion( Integer.MAX_VALUE ); // wait until done
(21) ExecutionStatus lastStatus = execHandle.getLatestStatus();
// If the status is success then retrieve the tree rules
(22) ExecutionState lastState = lastStatus.getState();
(23) if( lastState.equals( ExecutionState.success ) ) { // get the model
(24)     ClassificationModel treeModel = (ClassificationModel)
             dmeConn.retrieveObject( "myTreeModel", NamedObject.model );
         // Get the tree representation from the model
(25)     TreeModelDetail treeDetail =
             (TreeModelDetail) treeModel.getModelDetail();
(26)     Collection rulesCollection = treeDetail.getRules();
(27)     Iterator ruleIterator = rulesCollection.iterator();
(28)     while( ruleIterator.hasNext() ) {
(29)         Rule rule = (Rule) ruleIterator.next();
             // Translate the rule into the default format
(30)         String ruleString = rule.translate();
(31)         Predicate antecedent = rule.getAntecedent();
(32)         Predicate consequent = rule.getConsequent();
(33)         double support = rule.getSupport();
(34)         // do something here with the data from the rule
(35)     } // End of while
(36) } // End of if

In this example, a classification model is built using the tree algorithm. In line 14, we
specify the tree algorithm using tree settings. Once the model is built successfully (lines
19 to 23), the rules are extracted from the resulting decision tree model (lines 24 to 35).
In lines 1 and 2, we create the PhysicalDataSet object from a URI. This physical data set
object is populated with physical attributes that come directly from the specified physical
data. In line 3, we save the data in the mining server.
In lines 4 through 6, we create the logical data object using the physical data and save it in
the server. Here, the default behavior is to create a LogicalAttribute instance for each
physical attribute in the source data. Whether an attribute is categorical or numerical
is derived from the data type of its attribute (or column) and possibly from the number of unique
values in the attribute. Note that the logical data must be persisted before it can be used in build
settings. However, the logical data may be omitted from the build settings if the mining function does not support it or if all attributes are to be used with default behavior. In this
example, no changes are made to the logical data after its content is populated from
the physical data, so it can be omitted. In other words, lines 4 to 6 as well as line 9 are
not necessary for this example.
In lines 7 through 15, we create the classification build settings object with the tree
algorithm settings. In lines 16 through 18, we create and save a build task object, and in line 19, we
submit it for asynchronous execution through the connection. In line 20, we wait for the
completion of the task. In lines 21 through 23, we get the last state of the build task
to check whether the task has finished successfully, resulting in a tree classification model. In
lines 24 through 35, we retrieve the rules from the tree model.


5.7 Extracting rules from an association model


In this section we illustrate how to extract association rules from an association model
with various query criteria.

5.7.1 Get rules with minimum support


The following code illustrates how to extract association rules whose support values are at
least the specified minimum support from an association model.
// Restore an association model to extract rules from
AssociationModel assocModel = (AssociationModel)
    dmeConn.retrieveObject( "myAssocModel", NamedObject.model );
// Create a rules filter
RulesFilterFactory filterFactory = (RulesFilterFactory)
    dmeConn.getFactory( "javax.datamining.association.RulesFilter" );
RulesFilter rulesFilter = filterFactory.create();
// Specify rule selection criteria (support >= 3%)
rulesFilter.setRange( RuleProperty.support, 0.03, 1.0 );
// Specify rule ordering condition: ordered by support in descending order
RuleProperty[] properties = new RuleProperty[]{ RuleProperty.support };
SortOrder[] orders = new SortOrder[]{ SortOrder.descending };
rulesFilter.setOrderingCondition( properties, orders );
// Extract rules from the model using the filtering criteria
Collection rulesCollection = assocModel.getRules( rulesFilter );
Iterator ruleIterator = rulesCollection.iterator();
while( ruleIterator.hasNext() ) {
    AssociationRule r = (AssociationRule) ruleIterator.next();
    // work on the rule retrieved here...
}

The range of the support values to be used as rule selection criterion is 0.03 (3%) to 1.0
(100%), which is the maximum value for support. The rules are retrieved in the order of
descending support value, i.e., the rules with the higher support are placed in the returned
collection before the rules with lower support.

5.7.2 Get rules with minimum support and confidence


The following code illustrates how to extract association rules with minimum support and
confidence values from an association model.
// Restore an association model to extract rules from
AssociationModel assocModel = (AssociationModel)
    dmeConn.retrieveObject( "myAssocModel", NamedObject.model );
// Specify rule selection criteria (support >= 2% AND confidence >= 95%)
RulesFilterFactory filterFactory = (RulesFilterFactory)
    dmeConn.getFactory( "javax.datamining.association.RulesFilter" );
RulesFilter rulesFilter = filterFactory.create();
rulesFilter.setRange( RuleProperty.support, 0.02, 1.0 );
rulesFilter.setRange( RuleProperty.confidence, 0.95, 1.0 );
// Specify ordering condition: ordered by support, then confidence, both descending
RuleProperty[] props = new RuleProperty[]{ RuleProperty.support,
    RuleProperty.confidence };
SortOrder[] orders = new SortOrder[]{ SortOrder.descending, SortOrder.descending };
rulesFilter.setOrderingCondition( props, orders );
rulesFilter.setMaxNumberOfRules( 100 ); // maximum number of rules
// Extract rules from the model using the filtering criteria
Collection rulesCollection = assocModel.getRules( rulesFilter );
Iterator ruleIterator = rulesCollection.iterator();
while( ruleIterator.hasNext() ) {
    AssociationRule r = (AssociationRule) ruleIterator.next();
    // work with the rule retrieved here...
}

This example shows how two selection criteria can be specified to retrieve rules. The
range of the support values for the rule selection is 0.02 (2%) to 1.0 (100%). The range of
the confidence values for the rule selection is 0.95 (95%) to 1.0 (100%). Only the rules
that satisfy both conditions are returned.
The rules are retrieved in the order of descending support value and then descending confidence value if the support is equal. Only the first 100 rules that satisfy the selection criteria are returned if the number of selected rules exceeds 100.
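The selection semantics described above (two range criteria, a two-key descending order, and a cap on the number of rules) can be sketched in plain Java. This is only an illustration of the expected behavior, not the JDM API: SketchRule and RuleSelectionSketch are hypothetical stand-ins, and a real DME performs this filtering inside AssociationModel.getRules.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for an association rule; not part of the JDM API.
final class SketchRule {
    final String label;
    final double support;
    final double confidence;

    SketchRule(String label, double support, double confidence) {
        this.label = label;
        this.support = support;
        this.confidence = confidence;
    }
}

final class RuleSelectionSketch {
    // Keep rules whose support and confidence both fall in [min, 1.0],
    // order by support then confidence (both descending), and return
    // at most maxRules of them.
    static List<SketchRule> select(List<SketchRule> rules, double minSupport,
                                   double minConfidence, int maxRules) {
        List<SketchRule> selected = new ArrayList<>();
        for (SketchRule r : rules) {
            boolean supportOk = r.support >= minSupport && r.support <= 1.0;
            boolean confidenceOk = r.confidence >= minConfidence && r.confidence <= 1.0;
            if (supportOk && confidenceOk) {
                selected.add(r);
            }
        }
        selected.sort((a, b) -> {
            int bySupport = Double.compare(b.support, a.support);
            return bySupport != 0 ? bySupport : Double.compare(b.confidence, a.confidence);
        });
        return new ArrayList<>(selected.subList(0, Math.min(maxRules, selected.size())));
    }

    public static void main(String[] args) {
        List<SketchRule> rules = Arrays.asList(
                new SketchRule("A", 0.05, 0.96),
                new SketchRule("B", 0.05, 0.99),
                new SketchRule("C", 0.01, 0.99),   // fails the support range
                new SketchRule("D", 0.10, 0.90),   // fails the confidence range
                new SketchRule("E", 0.08, 0.95));
        for (SketchRule r : select(rules, 0.02, 0.95, 100)) {
            System.out.println(r.label + " support=" + r.support
                    + " confidence=" + r.confidence);
        }
    }
}
```

With the sample data, rule E comes first (highest support), and rules B and A follow in that order because they tie on support and B has the higher confidence.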

5.7.3 Get rules containing certain items


The following code illustrates how to extract association rules that contain the specified
items from an association model. The items used in the selection criteria are { milk, coke,
diaper } in the antecedent and { potato-chip, beer } in the consequent of the rules.
// Restore an association model to extract rules from
AssociationModel assocModel = (AssociationModel)
    dmeConn.retrieveObject( "myAssocModel", NamedObject.model );
// Specify rule selection criteria (rules that contain the specified items)
Object[] antecedentItems = new String[] { "milk", "diaper", "coke" };
Object[] consequentItems = new String[] { "beer", "potato-chip" };
RulesFilterFactory filterFactory = (RulesFilterFactory)
    dmeConn.getFactory( "javax.datamining.association.RulesFilter" );
RulesFilter rulesFilter = filterFactory.create();
rulesFilter.setItems( antecedentItems, RuleComponentOption.antecedent,
    true ); // rules with antecedents containing the specified items
rulesFilter.setItems( consequentItems, RuleComponentOption.consequent,
    true ); // rules with consequents containing the specified items
// Specify ordering condition: ordered by support in descending order
RuleProperty[] props = new RuleProperty[]{ RuleProperty.support };
SortOrder[] orders = new SortOrder[]{ SortOrder.descending };
rulesFilter.setOrderingCondition( props, orders );
// Extract rules from the model using the filtering criteria
Collection rulesCollection = assocModel.getRules( rulesFilter );
Iterator ruleIterator = rulesCollection.iterator();
while( ruleIterator.hasNext() ) {
    AssociationRule r = (AssociationRule) ruleIterator.next();
    // work with the rule retrieved here...
}


This example shows how to retrieve association rules whose antecedents contain any of {
milk, coke, diaper } and consequents contain any of { potato-chip, beer }. For example,
an association rule { milk, diaper, tomato => beer } will be extracted from the model.
Note that each component is a subset of the items specified for the component.
The rules are retrieved in the order of descending support value, i.e., the rules with the
higher support are placed in the returned collection before the rules with lower support.
For more complex rule retrieval, a range of support values and/or a range of confidence values can be specified to further restrict the rule selection.

5.7.4 Get rules that do not contain certain items


The following code illustrates how to extract association rules that do not contain the
specified items from an association model. The items to be excluded are { tv, dvd } in
either antecedent or consequent.
// Restore an association model to extract rules from
AssociationModel assocModel = (AssociationModel)
    dmeConn.retrieveObject( "myAssocModel", NamedObject.model );
// Specify rule selection criteria (rules that do not contain the specified items)
RulesFilterFactory filterFactory = (RulesFilterFactory)
    dmeConn.getFactory( "javax.datamining.association.RulesFilter" );
RulesFilter rulesFilter = filterFactory.create();
Object[] items = new String[] { "tv", "dvd" }; // items to be excluded
rulesFilter.setItems( items, RuleComponentOption.antecedentOrConsequent,
    false ); // false means excluded items
rulesFilter.setRange( RuleProperty.support, 0.05, 1.0 );
// Specify ordering condition: ordered by support in ascending order
RuleProperty[] props = new RuleProperty[]{ RuleProperty.support };
SortOrder[] orders = new SortOrder[]{ SortOrder.ascending };
rulesFilter.setOrderingCondition( props, orders );
// Extract rules from the model using the filtering criteria
Collection rulesCollection = assocModel.getRules( rulesFilter );
Iterator ruleIterator = rulesCollection.iterator();
while( ruleIterator.hasNext() ) {
    AssociationRule r = (AssociationRule) ruleIterator.next();
    // work with the rule retrieved
}

The filtering criteria used in this example identify the association rules with unusually
high support (5% or greater) that do not contain items { tv, dvd } in any component of the
rule. This example shows how a range of support or confidence values can be combined
with item containment.
The rules are retrieved in the order of ascending support value.
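As a rough illustration of how item exclusion combines with a support range, the following plain-Java sketch evaluates one rule at a time. It assumes that "contains" means "shares at least one item with the given set", as the prose of this section suggests. ItemFilterSketch and its methods are hypothetical and are not part of the JDM API.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

final class ItemFilterSketch {
    // True if the rule component shares at least one item with the given items.
    static boolean containsAny(Set<String> component, Set<String> items) {
        for (String item : items) {
            if (component.contains(item)) {
                return true;
            }
        }
        return false;
    }

    // Exclusion over "antecedent or consequent": the rule passes only if
    // neither component mentions an excluded item and its support lies
    // in [minSupport, 1.0].
    static boolean passesExclusion(Set<String> antecedent, Set<String> consequent,
                                   Set<String> excluded, double support,
                                   double minSupport) {
        boolean mentionsExcluded =
                containsAny(antecedent, excluded) || containsAny(consequent, excluded);
        return !mentionsExcluded && support >= minSupport && support <= 1.0;
    }

    // Small helper to build a set from literal items.
    static Set<String> setOf(String... items) {
        return new HashSet<>(Arrays.asList(items));
    }

    public static void main(String[] args) {
        Set<String> excluded = setOf("tv", "dvd");
        // Passes: no excluded item, support above the minimum.
        System.out.println(passesExclusion(setOf("milk", "diaper"), setOf("beer"),
                excluded, 0.06, 0.05));
        // Fails: the antecedent mentions an excluded item.
        System.out.println(passesExclusion(setOf("tv"), setOf("beer"),
                excluded, 0.06, 0.05));
        // Fails: support is below the minimum.
        System.out.println(passesExclusion(setOf("milk"), setOf("beer"),
                excluded, 0.04, 0.05));
    }
}
```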

5.8 Importing and exporting a model


In this section we illustrate how to import objects into the DME and how to export JDM objects to
a location in a given format.


5.8.1 Import an object using a URI


The following code illustrates how to import an XML string containing more than one
object.
// Create an import task
(1) ImportTaskFactory itFactory = (ImportTaskFactory) dmeConn.getFactory(
        "javax.datamining.task.ImportTask" );
(2) ImportTask importTask = itFactory.create();
(3) importTask.setURI( // a vendor specific URI pointing to an XML string
        "this/uri/points/to/a/model/in/my/database/schema:myModelInXML" );
// Peek into the import object to find info on the objects contained
(4) importTask.populateSummary(); // populate a summary on import object
(5) ImportSummary summary = importTask.getSummary();
(6) NamedObject[] objectTypes = summary.getObjectTypes();
(7) String[] objectNames = summary.getObjectNames();
(8) java.util.Map nameMap = new java.util.HashMap(); // object names mapped by index
// Treat the models differently
(9) for( int i=0; i<objectNames.length; i++) {
(10)   String objectName = null;
(11)   if( objectTypes[i].equals( NamedObject.model) )
(12)     objectName = new String( "Imp_Model" + i );
(13)   else
(14)     objectName = new String( "Imp_" + objectTypes[i].getEnum() + i );
(15)   nameMap.put( new Integer(i), objectName );
(16) }
(17) importTask.setObjectNamesMap( nameMap );
// Execute import synchronously without timeout
(18) ExecutionStatus status = dmeConn.execute( importTask, null );
(19) ExecutionState state = status.getState();
(20) if( state.equals( ExecutionState.success ) ) { // success
(21)   // do something here...
(22) }
(23) else { // error while importing
(24)   // report error
(25) }

When an object is imported, its content may not be readily known. The user may lack
information about the object format, the number of objects it contains, the object names, and so
forth. When such information is not available, the user can obtain an import summary
from the object before executing the import to avoid possible errors. In addition, this information allows the user to manage object names and creation dates.
In lines 1 through 3, we create an import task object and specify the location of the import
object as a URI. In line 4, an import summary is populated from the specified import
object, and the summary object is obtained in line 5. In lines 6 through 16, the types of the
contained objects are examined, and models are given a name Imp_Modelx, where x
ranges between 0 and the number of contained objects minus 1, whereas all other objects
are given a name based on their type. In line 17, the map that contains the index-name mappings
is passed to the import task. In lines 18 through 20, the import task is executed synchronously and its status is checked.
Note that the build settings, if any, will be imported together with the model by default,
and the creation date will be the time of import by default. The default behavior can be
altered using the ImportTask.includeModelSettings and ImportTask.useOriginalCreationDates methods.
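The index-to-name mapping built in the import example can also be written with typed collections. The sketch below is a minimal illustration under stated assumptions: object types are represented as plain strings instead of NamedObject values, and ImportNameMapSketch is a hypothetical helper, not part of the JDM API.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class ImportNameMapSketch {
    // Build index-to-name mappings: models become "Imp_Model<i>" and every
    // other object becomes "Imp_<type><i>". Types are plain strings here,
    // standing in for NamedObject values.
    static Map<Integer, String> buildNameMap(String[] objectTypes) {
        Map<Integer, String> nameMap = new LinkedHashMap<>();
        for (int i = 0; i < objectTypes.length; i++) {
            String name = "model".equals(objectTypes[i])
                    ? "Imp_Model" + i
                    : "Imp_" + objectTypes[i] + i;
            nameMap.put(Integer.valueOf(i), name);
        }
        return nameMap;
    }

    public static void main(String[] args) {
        String[] types = { "model", "buildSettings", "model" };
        System.out.println(buildNameMap(types));
    }
}
```

A LinkedHashMap is used only so that the printed order follows the object index; any Map implementation would serve for the mapping itself.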

5.8.2 Export a model


The following code illustrates how to export a model to a PMML 2.0 string with the build
settings contained in the model.
// Create an export task
ExportTaskFactory etFactory = (ExportTaskFactory) dmeConn.getFactory(
    "javax.datamining.task.ExportTask" );
ExportTask exportTask = etFactory.create();
// Designate the object to be exported and its target location
exportTask.addObjectName( "myModel", NamedObject.model );
exportTask.setURI( "target/URI/Location" );
// Include all the settings objects contained in the model
exportTask.setIncludeModelSettings( SettingsInclusionOption.all );
// Export the object into PMML 2.0
exportTask.setFormat( ImportExportFormat.PMML2_0 );
// Execute export synchronously without timeout
ExecutionStatus status = dmeConn.execute( exportTask, null );
ExecutionState state = status.getState();
if( state.equals( ExecutionState.success ) ) { // success
    // do something here...
}
else { // error while exporting
    // report error
}

When JDM named objects are exported, the target location and at least one JDM named
object must be specified. If the export format is not specified, the vendor default format is
selected. Note that multiple named objects can be exported into one location by invoking
the addObjectName method multiple times. However, a single settings inclusion control
applies to all models specified with the method.

5.8.3 Export an object to a destination


The following code illustrates how to export a build settings object to a string in the default
format.
// Export using the system default format
ExportTaskFactory etFactory = (ExportTaskFactory) dmeConn.getFactory(
    "javax.datamining.task.ExportTask" );
ExportTask export = etFactory.create();
export.setURI( uri );
export.addObjectName( "myBuildSettings", NamedObject.buildSettings );
// Execute synchronously without timeout
ExecutionStatus status = dmeConn.execute( export, null );
ExecutionState state = status.getState();
if( state.equals( ExecutionState.success ) ) { // success
    // do something here...
}
else { // error while exporting
    // report error
}

Since the JDM named object to be exported is not a model, it is not necessary to set the
settings inclusion control with the setIncludeModelSettings method. If the export format is
not specified, the vendor default format is selected.

5.9 Using reflection


The following code illustrates how to use the reflective capabilities in JDM to determine if
the classification function and decision tree algorithm are supported, along with specific
capabilities and default values.
// determine if the Classification function is supported
(1) if( dmeConn.supportsCapability( MiningFunction.classification, null, null ) ) {
(2)   ClassificationSettingsFactory classificationSettingsFactory =
          (ClassificationSettingsFactory) dmeConn.getFactory(
          "javax.datamining.supervised.classification.ClassificationSettings" );
      // determine if cost matrix is supported
(3)   if( classificationSettingsFactory.supportsCapability(
          ClassificationCapability.costMatrix ) )
(4)   { /* cost matrix is supported */ }
      // determine if the tree algorithm is supported
(5)   if( dmeConn.supportsCapability( MiningFunction.classification, MiningAlgorithm.tree, null ) )
(6)   { /* tree algorithm is supported */ }
      // determine if the TreeSettings supports use of surrogates
(7)   TreeSettingsFactory tsFactory = (TreeSettingsFactory) dmeConn.getFactory(
          "javax.datamining.algorithm.tree.TreeSettings" );
(8)   if( tsFactory.supportsCapability( TreeCapability.maxSurrogates ) ) {
        // return the max surrogates
(9)     return tsFactory.getMaxSurrogatesAllowed();
(10)  }
(11)  return 0;
(12) }
(13) else // report classification is not supported

In line 1, we check whether the DME supports the classification function.
An alternative is to use the Connection.getSupportedFunctions method,
which returns an array of the MiningFunction enums that are supported by the DME.
In lines 2 through 4, we check whether the classification function supports a cost matrix for
model build. In lines 5 and 6, we check whether the DME supports a tree algorithm for
classification. An alternative is to use the Connection.getSupportedAlgorithms method, which returns an array of the MiningAlgorithm enums that are supported by the
DME for a given mining function.
In lines 7 and 8, we ask whether the tree settings support maximum surrogates. Based on the
result of this inquiry, the code returns the maximum number of surrogates in line 9 or 0 in line
11.
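The "ask first, then call" pattern used in this example can be shown in isolation. The following is a minimal sketch with hypothetical types: Capability and SketchTreeFactory only mimic the shape of a JDM factory with a supportsCapability method, and UnsupportedOperationException stands in for the JDM unsupported feature exception.

```java
import java.util.EnumSet;
import java.util.Set;

final class CapabilityGuardSketch {
    // Hypothetical capability names; stand-ins for JDM capability enums.
    enum Capability { COST_MATRIX, MAX_SURROGATES }

    // Hypothetical factory exposing a capability check and a limit.
    static final class SketchTreeFactory {
        private final Set<Capability> supported;
        private final int maxSurrogates;

        SketchTreeFactory(Set<Capability> supported, int maxSurrogates) {
            this.supported = supported;
            this.maxSurrogates = maxSurrogates;
        }

        boolean supportsCapability(Capability c) {
            return supported.contains(c);
        }

        int getMaxSurrogatesAllowed() {
            if (!supportsCapability(Capability.MAX_SURROGATES)) {
                // Stand-in for the JDM unsupported feature exception.
                throw new UnsupportedOperationException("maxSurrogates not supported");
            }
            return maxSurrogates;
        }
    }

    // Query the capability before calling; fall back to 0 when it is absent.
    static int maxSurrogatesOrZero(SketchTreeFactory factory) {
        return factory.supportsCapability(Capability.MAX_SURROGATES)
                ? factory.getMaxSurrogatesAllowed()
                : 0;
    }

    public static void main(String[] args) {
        SketchTreeFactory withSurrogates =
                new SketchTreeFactory(EnumSet.of(Capability.MAX_SURROGATES), 5);
        SketchTreeFactory without =
                new SketchTreeFactory(EnumSet.noneOf(Capability.class), 0);
        System.out.println(maxSurrogatesOrZero(withSurrogates));
        System.out.println(maxSurrogatesOrZero(without));
    }
}
```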


5.10 Establishing a connection


The following example illustrates how to establish a connection to a JDM server using
the JNDI approach.
(1) Hashtable env = new Hashtable();
(2) env.put( Context.INITIAL_CONTEXT_FACTORY,
"com.myCompany.javax.datamining.resource.initialContextFactoryImpl" );
(3) env.put( Context.PROVIDER_URL, "http://myHost:myPort/myService" );
(4) env.put( Context.SECURITY_PRINCIPAL, "user" );
(5) env.put( Context.SECURITY_CREDENTIALS, "password" );
(6) InitialContext jndiContext = new javax.naming.InitialContext( env );
// Perform JNDI lookup to obtain the connection factory
(7) javax.datamining.resource.ConnectionFactory jdmCFactory =
(ConnectionFactory) jndiContext.lookup(
"java:comp/env/jdm/MyServer");
// Create a data mining server connection
(8) ConnectionSpec svrConnSpec = (javax.datamining.resource.ConnectionSpec)
jdmCFactory.getConnectionSpec();
(9) svrConnSpec.setName( user );
(10) svrConnSpec.setPassword( password );
(11) svrConnSpec.setURI( serverURI );
(12) javax.datamining.resource.Connection dmeConn = (javax.datamining.resource.Connection) jdmCFactory.getConnection( svrConnSpec );

In lines 1 through 6, we create an InitialContext object used to access the mining server
connection factory. In line 7, we perform a lookup to obtain the connection factory. In
lines 8 through 11, we obtain a ConnectionSpec object and specify the URI, user, and password. In line 12, we create a Connection object using the connection spec obtained in
line 8.
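The JNDI environment of lines 1 through 5 can be assembled with a typed Hashtable using only the standard javax.naming.Context constants. The sketch below stops before creating the InitialContext, since that step needs a vendor context factory on the classpath. JndiEnvironmentSketch is a hypothetical helper, and the factory class name and provider URL are the placeholders from the example above.

```java
import java.util.Hashtable;
import javax.naming.Context;

final class JndiEnvironmentSketch {
    // Collect the four JNDI properties used in the connection example.
    static Hashtable<String, Object> buildEnvironment(String factoryClass,
                                                      String providerUrl,
                                                      String user,
                                                      String password) {
        Hashtable<String, Object> env = new Hashtable<>();
        env.put(Context.INITIAL_CONTEXT_FACTORY, factoryClass);
        env.put(Context.PROVIDER_URL, providerUrl);
        env.put(Context.SECURITY_PRINCIPAL, user);
        env.put(Context.SECURITY_CREDENTIALS, password);
        return env;
    }

    public static void main(String[] args) {
        Hashtable<String, Object> env = buildEnvironment(
                "com.myCompany.javax.datamining.resource.initialContextFactoryImpl",
                "http://myHost:myPort/myService", "user", "password");
        // Passing env to new javax.naming.InitialContext(env) requires the
        // vendor's context factory, so that step is not performed here.
        System.out.println(env.get(Context.PROVIDER_URL));
    }
}
```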

5.11 Uniform resource identifiers


Uniform Resource Identifiers (URIs) are used within JDM to reference physical locations
and to access different objects:

The DME itself: its actual location may be specified by a URI in the connection specification.

Physical datasets: either input data (training or apply dataset) or output data created by
the DME are specified by a URI in the PhysicalDataSet object.

Imported objects and exported destinations: these are also specified using a URI in the
ImportTask and ExportTask.

URIs are defined by RFC 2396, and Java 1.4 includes an implementation of URI representation and parsing in the package java.net.
The general syntax of a URI is given by:
[scheme:]scheme-specific-part [#fragment]
There are two basic kinds of URI: opaque and hierarchical. An opaque URI has a
scheme-specific part that does not begin with a slash. A hierarchical URI is either an absolute URI (the scheme is
specified) whose scheme-specific part begins with a slash, or a relative URI (no scheme is specified).
Hierarchical URI syntax can be further refined:
[scheme:][//authority][path][?query][#fragment]
The authority itself being generally expressed as:
[user-info@]host[:port]
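The distinction between opaque and hierarchical URIs, and the components of the hierarchical form, can be checked with the java.net.URI class. The sketch below uses made-up example URIs; UriShapeSketch and its describe method are hypothetical names introduced for illustration.

```java
import java.net.URI;

final class UriShapeSketch {
    // Classify a URI as opaque, hierarchical and absolute, or hierarchical
    // and relative, using the predicates provided by java.net.URI.
    static String describe(URI uri) {
        if (uri.isOpaque()) {
            return "opaque";
        }
        return uri.isAbsolute() ? "hierarchical-absolute" : "hierarchical-relative";
    }

    public static void main(String[] args) {
        URI opaque = URI.create("mailto:analyst@example.com");
        URI absolute = URI.create(
                "ftp://uname@data.example.com:21/sets/train.csv?format=csv#header");
        URI relative = URI.create("sets/train.csv");

        System.out.println(describe(opaque));
        System.out.println(describe(absolute));
        System.out.println(describe(relative));

        // Components of [scheme:][//authority][path][?query][#fragment]
        System.out.println(absolute.getScheme());
        System.out.println(absolute.getAuthority());
        System.out.println(absolute.getPath());
        System.out.println(absolute.getQuery());
        System.out.println(absolute.getFragment());

        // Components of the authority: [user-info@]host[:port]
        System.out.println(absolute.getUserInfo());
        System.out.println(absolute.getHost());
        System.out.println(absolute.getPort());
    }
}
```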
While URI specification and processing are vendor specific, in JDM we recommend
some general guidelines. When accessing the DME, the ConnectionSpec object already
includes a user and password specification, hence no user specification should be included
in the URI. When accessing data sources requiring user authentication, if no user specification is included in the URI (either in the authority or in the query part), the Connection's
user specification may be used (single sign-on).
User authentication could be used in some cases in the URI to differentiate the DME user
and the data access user (for example, accessing a remote FTP location with a different
user than the DME authenticated user). In such cases, the user specification could be set in the
user-info part (using a "user:password" structure), or using custom query fields, such as in
scheme://host/path?user=uname&password=id.
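One way a vendor might resolve the data access user under these guidelines is sketched below. It is an assumption made for illustration, not behavior mandated by JDM: the user-info part is consulted first, then a custom "user" query field, and the connection user is the fallback. UriUserSketch and resolveUser are hypothetical names.

```java
import java.net.URI;

final class UriUserSketch {
    // Returns the user named in the URI (user-info first, then a "user"
    // query field), or the connection user when the URI names none.
    static String resolveUser(URI uri, String connectionUser) {
        String userInfo = uri.getUserInfo();
        if (userInfo != null && !userInfo.isEmpty()) {
            int colon = userInfo.indexOf(':');
            return colon >= 0 ? userInfo.substring(0, colon) : userInfo;
        }
        String query = uri.getQuery();
        if (query != null) {
            for (String field : query.split("&")) {
                if (field.startsWith("user=") && field.length() > "user=".length()) {
                    return field.substring("user=".length());
                }
            }
        }
        return connectionUser; // single sign-on fallback
    }

    public static void main(String[] args) {
        String dmeUser = "dmeUser";
        System.out.println(resolveUser(
                URI.create("ftp://uname:secret@host.example.com/path"), dmeUser));
        System.out.println(resolveUser(
                URI.create("ftp://host.example.com/path?user=uname&password=id"), dmeUser));
        System.out.println(resolveUser(
                URI.create("ftp://host.example.com/path"), dmeUser));
    }
}
```

Embedding a password in a URI exposes it wherever the URI is logged or displayed, so the fallback to the connection user is often the safer default.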
Vendors must clearly specify the schemes supported for the different URI usages (ConnectionSpec, PhysicalDataSet, ImportTask, and ExportTask). They should also indicate the
behaviour if relative URIs are specified: for example a relative URI may be used to specify a relative filename.
The expected behaviour of common schemes (file:, http:, ftp:, jdbc:, ...) must be respected
by the DME. Vendors are free to define and specify their own scheme.
See [URI], [URI-SCHEMES], and [Java-URI] for more information.


6. Conformance statement

Conformance to the JDM API standard is more flexible than for most other standards. JDM is
conceived as an a la carte standard that allows vendors to implement only those functions and algorithms of the standard that their product supports. For example, a vendor providing only neural
network algorithms for supervised learning would have no need for the clustering or association rules portions of the JDM specification. Adding functionality not specified in the
standard is enabled through interface and class specialization.

6.1 Required and optional features


JDM conformance is based on required packages, i.e., those that must be implemented,
and optional packages, i.e., those that a vendor may choose to implement. If a package is
required, or supported, all factories and methods must be implemented unless otherwise
specified (see footnote 1). If a package is optional, and not supported, none of its content is provided by
the vendor. However, supporting one or more optional packages may require the implementation of one or more other optional packages as specified below.
JDM provides an introspection/reflection interface for determining which features are supported through supportsCapability methods. If a supportsCapability method returns false
for a particular feature, methods related to that capability must throw the unsupported feature
exception; if they do not, their return values are undefined. It is vendor specific whether values
provided to set methods are persisted. If a particular interface in a supported package is
optional and not implemented, the Connection.getFactory method throws the unsupported
feature exception.
In data mining, there are many sub-algorithms or techniques that may be used to control
part of the mining process. For example, a neural network can specify one of many possible activation functions to affect the output of a neuron. JDM specifies system default and
system determined enumeration values, with a select few of the most common values to be
part of the standard. Whereas system default and system determined must be implemented
where specified, vendors may choose to implement a subset of those explicitly specified.
Using the reflection/introspection interface, an application can determine which of the
options is supported by a vendor implementation. The system default is constant for an
implementation and documented by the vendor. When system determined is used, the
implementation selects an enumeration value at runtime, possibly based on other settings
or the input data. However, this may be the same as the system default. Vendors extend
enumeration classes with additional values using subclassing.
It is expected that vendors will augment the enumeration attributes according to the features present in their specific products. JDM may specify additional enumeration
attributes that have a name and well-defined behavior. If a vendor uses a JDM-specified
name, the vendor must also implement the specified behavior. Subtle variations, such as
text case, use of dashes, dots or underbars, would violate this condition, and are not permitted. For example, vendors will not be allowed to have names HyperbolicTangent,
hyperbolic_tangent, and hyperBolic.TangenT for the JDM-specified activation function hyperbolicTangent, even if the behavior of those differs from the JDM-specified
name.
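The naming restriction above can be checked mechanically. The sketch below flags a vendor name that differs from a JDM-specified name only by case or by dashes, dots, or underscores. EnumNameCheckSketch is a hypothetical helper, and the list of standard names contains only names mentioned in this specification text.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

final class EnumNameCheckSketch {
    // Strip separators and case so that near-duplicates compare equal.
    static String normalize(String name) {
        return name.replaceAll("[^A-Za-z0-9]", "").toLowerCase(Locale.ROOT);
    }

    // A vendor name is a forbidden variation if it is not identical to a
    // standard name but matches one after normalization.
    static boolean isForbiddenVariation(String vendorName, List<String> standardNames) {
        for (String standard : standardNames) {
            if (!vendorName.equals(standard)
                    && normalize(vendorName).equals(normalize(standard))) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        List<String> standard = Arrays.asList(
                "hyperbolicTangent", "systemDefault", "systemDetermined");
        String[] candidates = { "HyperbolicTangent", "hyperbolic_tangent",
                "hyperBolic.TangenT", "hyperbolicTangent", "myVendorActivation" };
        for (String candidate : candidates) {
            System.out.println(candidate + " -> "
                    + isForbiddenVariation(candidate, standard));
        }
    }
}
```

The three variations named in the text are flagged, while the exact standard name and an unrelated vendor name are not.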

1. Some interfaces in a package may be partially implemented, for example, for the model apply (scoring)
engine (see Section 6.4.3).

6.2 Vendor extensions


The design of JDM allows vendors to readily extend the JDM specification with vendor-specific features. JDM employs several techniques to this end:

Package organization - mining functions, algorithms, and model detail are provided
in separate packages. This readily allows a vendor to choose which packages to provide in a compliant product, or which additional packages to provide. Vendors may add
new packages supporting proprietary algorithms for standard or other mining functions.

Subclassing - vendors may extend the functionality defined in this specification
through inheritance of the existing interfaces and classes, e.g., subclassing Algorithm
to add new algorithms. However, such extensions must be provided in a separate package.

Reflective capabilities - with the ability to add new functionality, or to limit which
packages are supported, comes the need to determine which capabilities are supported
by the vendor. JDM supports identifying which packages are present in an implementation, and which capabilities are supported at a class and enumeration level.
It is recommended that vendors who extend JDM have their new interfaces conform to the
JMI specification to ensure a consistent API for end users as well as the JDM framework.
Consider the following example. JDM defines the interface TreeSettings as non-abstract,
i.e., users can get a factory and create an instance implementing the TreeSettings interface.
However, tree is not a specific data mining algorithm. A vendor supporting multiple tree
algorithms may have to introduce implementations for specific tree algorithms, e.g.,
CART_TreeSettings and C45_TreeSettings. A vendor could just implement the generic
TreeSettings interface for either of these, or provide a specific CART_TreeSettings or
C45_TreeSettings interface which inherits from TreeSettings, or provide specific interfaces that inherit from Algorithm only. Vendors must provide corresponding factories for
extension interfaces. The Connection.getFactory (objectName) method must return these
factories. For example, if a vendor subclasses TreeSettings, the vendor would document
the name strings for users to specify in the getFactory method.
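The lookup of extension factories by documented name string can be pictured as a simple registry. The sketch below is hypothetical throughout: FactoryRegistrySketch is not a JDM class, the vendor name string is invented, java.util.function.Supplier stands in for a JDM factory, and UnsupportedOperationException stands in for the JDM unsupported feature exception.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Supplier;

final class FactoryRegistrySketch {
    // Documented name string -> factory; Supplier stands in for a JDM factory.
    private final Map<String, Supplier<Object>> factories = new HashMap<>();

    void register(String objectName, Supplier<Object> factory) {
        factories.put(objectName, factory);
    }

    // Mirrors the idea of Connection.getFactory(objectName): return the
    // factory registered under the name, or fail for unsupported names.
    Supplier<Object> getFactory(String objectName) {
        Supplier<Object> factory = factories.get(objectName);
        if (factory == null) {
            // Stand-in for the JDM unsupported feature exception.
            throw new UnsupportedOperationException("No factory for " + objectName);
        }
        return factory;
    }

    public static void main(String[] args) {
        FactoryRegistrySketch registry = new FactoryRegistrySketch();
        // Invented vendor name string for an extension settings interface.
        String cartName = "com.myCompany.datamining.tree.CART_TreeSettings";
        registry.register(cartName, () -> "a CART settings instance");
        System.out.println(registry.getFactory(cartName).get());
    }
}
```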

6.3 Compliance points


The following compliance points apply to all implementations intending to satisfy the
TCK:
1. A JDM implementation must support J2SE 1.4 or greater.
2. An implementation must support the API for one or more functional areas and pass the
TCK for those functional areas.
3. An implementation may support algorithm settings for specific algorithms enabling a
supported functional area. It must pass the TCK for those specific algorithms.
4. Vendors may add attributes to the standard enumerated classes. The enumerated values
systemDefault and systemDetermined must be implemented where specified. The
systemDefault must be documented. The systemDetermined may be documented
as to how the vendor selects an enumerated value.
5. If a vendor uses an enumerated name that is documented in the JDM standard, the
implementation must implement the same semantics specified by JDM.
6. All enumerated values must conform to the JDM camel-case naming convention,
i.e., lower case, no blanks, words separated by initial capital, e.g., Quasi-Newton
becomes quasiNewton.


7. For all user-provided strings, vendor implementations must support the minimum
string length of 1 character. Each vendor must allow a minimum of 8 characters for all
named objects. It is recommended that each implementation have a maximum string
length defined, however, JDM does not specify this maximum.
8. Vendors cannot add methods to standard interfaces in a JDM package. Vendors may
subclass JDM classes as necessary in a separate package.
9. Vendors need only support the definition of certain interfaces as input metadata, but do
not have to use that metadata when performing mining operations. For example, the
use of a weight attribute is possible for all mining functions and algorithms. However,
it is a vendor's option to leverage a weight attribute in any of these areas. An application can use the reflection/introspection interface to determine if the use of a weight
attribute is supported.
10. Vendors may subclass the JDM exception class to provide more specific error or warning feedback. However, no other top level exceptions should be introduced. Vendors
have the option to wrap internally raised exceptions as JDM exceptions, e.g., class cast
exceptions can be wrapped inside a JDM exception.
11. Vendors may subclass the VerificationReport to provide more specific verification
feedback.
12. Synchronous execution of tasks must be supported; asynchronous execution
is optional. If not implemented, the asynchronous execute method must throw the
unsupported feature exception.
13. Named objects are defined to enable referencing objects by name in methods, as well
as for applications to reuse the objects within or across sessions. However, a vendor
must specify the degree to which persistence is supported, using transient and persistent options.
14. Vendors must minimally support the import and export of models in some format, perhaps a native, proprietary format. Import and export of all other objects and formats are
optional and subject to introspection.
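The camel-case convention of compliance point 6 can be illustrated with a small converter from a free-text name to an enumeration name. This is a sketch under stated assumptions: CamelCaseSketch is a hypothetical helper, and words are taken to be separated by any non-alphanumeric character.

```java
import java.util.Locale;

final class CamelCaseSketch {
    // Lower-case the first word, capitalize the first letter of each
    // following word, and drop blanks, dashes, and other separators.
    static String toCamelCase(String text) {
        StringBuilder out = new StringBuilder();
        for (String word : text.trim().split("[^A-Za-z0-9]+")) {
            if (word.isEmpty()) {
                continue; // a leading separator yields an empty first token
            }
            String lower = word.toLowerCase(Locale.ROOT);
            if (out.length() == 0) {
                out.append(lower);
            } else {
                out.append(Character.toUpperCase(lower.charAt(0)))
                   .append(lower.substring(1));
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(toCamelCase("Quasi-Newton"));
        System.out.println(toCamelCase("hyperbolic tangent"));
    }
}
```

The converter is meant for free-text names only. A name that is already in camel case contains no separators and would be lower-cased as a single word.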

6.4 Determining conformance


All vendors must support the following required packages:

javax.datamining
javax.datamining.base
javax.datamining.data
javax.datamining.resource
javax.datamining.task
javax.datamining.statistics

6.4.1 Function level conformance


Vendors who support Classification also support the packages:

javax.datamining.supervised
javax.datamining.supervised.classification

Vendors who support Regression also support the packages:

supervised
supervised.regression

Vendors who support Association Rules also support the packages:

associationrules

Vendors who support Clustering also support the packages:

clustering
rule (optional)

Vendors who support Attribute Importance also support the packages:

attributeimportance
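A function-level conformance check can be sketched as a comparison between the packages an implementation provides and the packages its claimed functions require. The code below is illustrative only: PackageConformanceSketch is a hypothetical helper, it covers just the classification and regression functions, and it assumes that the short package names listed above carry the javax.datamining prefix.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

final class PackageConformanceSketch {
    // Packages that every vendor must support.
    static final List<String> REQUIRED = Arrays.asList(
            "javax.datamining", "javax.datamining.base", "javax.datamining.data",
            "javax.datamining.resource", "javax.datamining.task",
            "javax.datamining.statistics");

    // Additional packages per mining function (subset, prefix assumed).
    static final Map<String, List<String>> BY_FUNCTION = new LinkedHashMap<>();
    static {
        BY_FUNCTION.put("classification", Arrays.asList(
                "javax.datamining.supervised",
                "javax.datamining.supervised.classification"));
        BY_FUNCTION.put("regression", Arrays.asList(
                "javax.datamining.supervised",
                "javax.datamining.supervised.regression"));
    }

    // Packages needed by the claimed functions that are not provided.
    static List<String> missingPackages(Set<String> provided,
                                        List<String> claimedFunctions) {
        List<String> needed = new ArrayList<>(REQUIRED);
        for (String function : claimedFunctions) {
            List<String> extra = BY_FUNCTION.get(function);
            if (extra == null) {
                throw new IllegalArgumentException("Unknown function: " + function);
            }
            needed.addAll(extra);
        }
        List<String> missing = new ArrayList<>();
        for (String pkg : needed) {
            if (!provided.contains(pkg) && !missing.contains(pkg)) {
                missing.add(pkg);
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        Set<String> provided = new java.util.HashSet<>(REQUIRED);
        provided.addAll(BY_FUNCTION.get("classification"));
        System.out.println(missingPackages(provided,
                Arrays.asList("classification", "regression")));
    }
}
```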

6.4.2 Algorithm level conformance


The packages listed below are optional for the specific algorithms. To support a given
algorithm, the vendor may choose to support packages from the list provided.
Vendors who support Tree Models, may also support the packages:

Classification and/or Regression functions packages


algorithm
algorithm.tree
rule
modeldetail.tree (optional)

Vendors who support Feedforward Neural Network Models, may also support the packages:

Classification and/or Regression functions packages


algorithm
algorithm.feedforwardneuralnet
modeldetail.feedforwardneuralnet (optional)

Vendors who support Naive Bayes Models, may also support the packages:

Classification and/or Regression functions packages


algorithm
algorithm.naivebayes
modeldetail.naivebayes (optional)

Vendors who support Clustering Models may also support the packages:

Clustering function package


algorithm (optional if algorithm.kmeans package becomes optional)
algorithm.kmeans (optional)
rule (optional)

Vendors who support Association Rules Models may also support the packages:

Association Rules function package


algorithm (required if an association rules algorithm is chosen to be implemented)
6.4.3 Model apply engine conformance
Some vendors will be interested in providing a JDM compliant product that focuses solely
on applying a model to data, and in some cases, applying a model to individual records
only. The package structure of JDM is sliced according to function and algorithm, where
each mining operation is covered by each function slice.
To provide a minimal API to support a model apply engine, or scoring engine, we maintain the function and algorithm package organization, but allow implementation of specific subsets of package interfaces. Vendors who claim compliance with the model apply
engine must support the following interfaces:

Resource package - all interfaces must be supported to enable establishing a connection to the DME. However, a vendor may support only the synchronous interface Connection.execute (applyTask) and therefore need not implement ExecutionHandle.

Task package, ImportTask - the ImportTask interface and corresponding factory
must be supported to enable the import of models to be used for scoring. The implementation must be able to execute an ImportTask for all models with a mining function
f and an algorithm a where Connection.supportsCapability(f, a, null) returns TRUE.
There must be at least one supported mining algorithm.

Objects supporting Model, BuildSettings, AlgorithmSettings, ModelSignature - for all mining functions f and algorithms a where Connection.supportsCapability(f, a,
null) returns TRUE, the implementation must be able to retrieve models and manipulate component objects for that function and algorithm.

Task package, Apply subpackage - the implementation must support one or both of
RecordApplyTask and DataSetApplyTask. The implementation must also support the
ApplySettings interface and any function-specific subclasses.

Data package, Physical Data - if an implementation supports RecordApplyTask and
corresponding factory, it must also support PhysicalDataRecord and corresponding
factory. If an implementation supports DataSetApplyTask and corresponding factory, it
must also support PhysicalDataSet and corresponding factory.

6.5 Claiming conformance


For vendors to claim conformance, they must do so according to the following scheme:
Full Implementation - A vendor may claim a full implementation if their product implements all packages, and does not return FALSE for any supportsCapability method. The
expert group recognizes that few, if any, vendors will produce a full implementation.
Qualified Implementation - A vendor claims qualified implementation if their product
implements all required packages and one or more optional packages. The vendor must
list the optional packages supported.
Fully Enabled - A vendor claims that their product is fully enabled on a per package basis
if the product does not return FALSE for any supportsCapability method in a given package.

Partially Enabled - A vendor claims that their product is partially enabled on a per package basis if the product returns FALSE for any supportsCapability method in a given package.
As such, a vendor may claim Full Implementation, Fully Enabled if their product implements the entire standard. However, it is much more common for a vendor to claim a
Qualified Implementation, Partially Enabled.
An example of a vendor's claim statement may appear as:
Product: MyMiningSystem
JDM: Qualified Implementation
Classification: Fully Enabled
Tree: Fully Enabled
NaiveBayes: Partially Enabled
Regression: Partially Enabled
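
The per-package enablement labels in a claim statement follow mechanically from the supportsCapability results. The sketch below illustrates that rule under stated assumptions: the class ConformanceClaim and its enablement method are hypothetical helpers, not part of the JDM API, and the boolean lists stand in for the values that the supportsCapability methods of a given package would return.

```java
import java.util.List;

// Hypothetical helper, not part of JDM: derives the per-package claim label
// from the results of that package's supportsCapability methods.
final class ConformanceClaim {

    // "Fully Enabled" if no capability check returned FALSE,
    // otherwise "Partially Enabled".
    static String enablement(List<Boolean> capabilityResults) {
        return capabilityResults.contains(Boolean.FALSE)
                ? "Partially Enabled"
                : "Fully Enabled";
    }

    public static void main(String[] args) {
        // Stand-in values; a real product would collect these from its DME.
        System.out.println("Tree: " + enablement(List.of(true, true, true)));
        System.out.println("NaiveBayes: " + enablement(List.of(true, false)));
    }
}
```

A vendor could generate the per-package lines of its claim statement this way and add the implementation-level line (Full or Qualified) by hand.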

7. Summary
JSR-73 originated in July of 2000. As with many JSRs, the expert group was optimistic about
completing the specification in roughly a year's time. Now, sixteen face-to-face meetings
and over 150 conference calls later, JDM is seeing the light of day!
The expert group was encouraged by interest from the data mining community in our
progress and desire for public drafts as well as the reference implementation. The definition of Web services for data mining was a late addition as the expert group recognized the
importance of an XML-based representation and interface for the Java standard.
As noted earlier, there were several features that did not make it into this specification. At
the top of the list for version 2 enhancements are:
Sequential Patterns / Time Series - mining functions to address forecasting and modeling seasonal or periodic fluctuations in data.
Transformations interface - data preparation is a key aspect of any data mining solution.
A separate JSR for transformations is likely warranted. Having a close integration with
such a JSR and addressing transformations in the next version has high priority.
Ensemble models - define composite models structured with logic, e.g., boosting and
bagging approaches.
Apply for Association - augment specification to enable prediction based on association
rules.
Text Mining - enable mining of unstructured text data both by explicit feature extraction
and by accepting text attributes as model predictors.
Model Comparison - introduce ability to compare multiple models according to various
quality metrics, e.g., accuracy and lift for classification.
Multi-record real-time scoring - enable scoring of multiple records in the record apply
task as a performance optimization for applications.
Multi-target models - enable the specification of multiple targets for supervised models
as a model performance and representation optimization.
Other possible features under discussion include: multivariate statistics, mining stream
data, advanced statistical functions, algorithms for PCA and NMF in feature extraction,
integration with workflow, deviation detection, scoring multiple models in parallel with a
single pass over the data.

Appendix A. Glossary
algorithm

A specific technique or procedure for producing a data mining model. An algorithm uses a
specific model representation and may support one or more functional areas. Examples
include CART and CHAID for decision trees, backpropagation neural networks, Naive
Bayes, and Apriori association.

algorithm settings
A collection of settings detailing algorithm-specific behavior to be used during model
building.
apply

The data mining operation that scores data, i.e., applies a model to data to produce the
output specified by the apply settings.

apply data

The data used as input when applying a model. Also referred to as score data, i.e., the data
to be scored.

apply settings
A user specification detailing the output desired from applying a model to data. This output may include predicted values, associated probabilities, key values, and other supplementary data.
association

A machine learning technique that identifies relationships among items.

association rules
Association rules capture co-occurrence of items among transactions. A typical rule is an
implication of the form A -> B, which means that the presence of itemset A implies the
presence of itemset B with certain support and confidence. The support of the rule is the
ratio of the number of transactions where the itemsets A and B are present to the total
number of transactions. The confidence of the rule is the ratio of the number of transactions where the itemsets A and B are present to the number of transactions where itemset
A is present.
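
As an illustration of the two ratios defined above, the following minimal sketch computes support and confidence for a rule A -> B over a small in-memory transaction list. The class RuleMetrics, its method names, and the sample items are assumptions made for this example; they are not part of the JDM API.

```java
import java.util.List;
import java.util.Set;

// Illustrative only: support and confidence of a rule A -> B.
final class RuleMetrics {

    // Number of transactions that contain every item of both itemsets.
    static long countContaining(List<Set<String>> tx, Set<String> a, Set<String> b) {
        return tx.stream().filter(t -> t.containsAll(a) && t.containsAll(b)).count();
    }

    // support = transactions containing A and B / all transactions
    static double support(List<Set<String>> tx, Set<String> a, Set<String> b) {
        return (double) countContaining(tx, a, b) / tx.size();
    }

    // confidence = transactions containing A and B / transactions containing A
    static double confidence(List<Set<String>> tx, Set<String> a, Set<String> b) {
        return (double) countContaining(tx, a, b) / countContaining(tx, a, Set.of());
    }

    public static void main(String[] args) {
        List<Set<String>> tx = List.of(
                Set.of("bread", "milk"),
                Set.of("bread", "butter"),
                Set.of("bread", "milk", "butter"),
                Set.of("milk"));
        Set<String> a = Set.of("bread");
        Set<String> b = Set.of("milk");
        // 2 of 4 transactions contain both itemsets; 3 transactions contain A.
        System.out.println("support = " + support(tx, a, b));
        System.out.println("confidence = " + confidence(tx, a, b));
    }
}
```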
attribute

A generic column of data, minimally with a name and datatype. There are several specializations of attribute, see logical attribute, physical attribute, and signature attribute.
Attributes are used in statistics, machine learning, data mining, and other disciplines to
describe observations, objects, data records, and other entities. Sometimes attributes are
also referred to as variables, fields, dimensions, features, and properties. Attributes are
often categorized with regard to their mathematical properties, that is, in terms of the
intrinsic organization or structure of the associated values (or value range or scale).
Generally speaking, there are continuous or numerical attributes, and discrete or symbolic
attributes.

attribute assignment
The mapping of one attribute to another used to associate input data with a model's
attributes, or a model's output with an output table.
attribute importance
A measure of the importance of an attribute to a mining model. The measures of different
attributes in build data enable users to select the attributes that are found to be most relevant to a mining model.
attribute type

Commonly, four types of attributes are distinguished: nominal or categorical attributes,
ordinal or rank attributes, interval attributes, and real or real-valued attributes (also called
true measures). JDM restricts itself to three types: categorical, numerical, and ordinal.

attribute usage
Specifies how a logical attribute is to be used when building a model, e.g., active vs. supplementary, suppressing automatic data preprocessing, and assigning a weight to a particular attribute.
build

The data mining operation that produces a model.

build data

The data used as input to building a model. Also referred to as the training data.

build settings

A collection of parameters specifying the high level input for building a data mining
model, consisting of mining function and algorithm specifications. Mining functions consist of key areas including: classification, regression, association, sequences, attribute
importance, and clustering.

case

A collection of related attribute values used as input to model building or scoring. In a
simple table, a case corresponds to an individual record. In transactional format data, a
case may be represented by multiple records, where columns play the roles of identifier,
attribute name, and attribute value. See also single record case and multi-record case.

categorical attribute
An attribute where the values correspond to discrete categories. For example, state is a
categorical attribute with discrete values (CA, NY, MA, etc.). Categorical attributes are
either non-ordered (nominal) like state, gender etc. or ordered (ordinal) such as high,
medium or low temperatures.
Categorical attributes tell us which of several unordered categories a thing belongs to. For
example, we can say that a beverage is BEER, LIQUOR, LEMONADE, or WINE. Categorical attributes exhibit the lowest degree of organization, since the set of values such an
attribute or variable may assume possesses no systematic intrinsic organization or order. The
only relation between the values of such attributes is the identity relation. Because of the
lack of an order relation, it is not possible to tell if one attribute value is greater than
another, nor that one value is closer to a certain value than another. However, we can tell if
two values are equal or not equal.
For example, the categorical attribute beverage may be associated with the set, V, of possible attribute values, where V = {BEER, LIQUOR, LEMONADE, WINE}. Given this
variable, it is not possible to tell that LIQUOR is smaller than WINE, or that LIQUOR is
closer to BEER than WINE. However, we can tell that two values a and b are equal (identical) if, for example, a := BEER and b := BEER, then a = b.
category

Corresponds to a distinct value of a categorical attribute. Also referred to as a class.

category set

A named collection of related categories.

centroid

A cluster centroid is a vector that encodes, for each logical attribute, either the mean
(numerical attributes) or the mode (categorical attributes) of the cases in the build data
assigned to a cluster.
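
The two per-attribute summaries named in this definition can be sketched as follows. This is a minimal illustration, not a JDM interface: the class CentroidExample and its methods are hypothetical, and breaking ties for the mode by first occurrence is an assumption of the example.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustrative only: per-attribute centroid components for one cluster.
final class CentroidExample {

    // Mean of a numerical attribute over the cases assigned to the cluster.
    static double mean(double[] values) {
        if (values.length == 0) {
            throw new IllegalArgumentException("no cases assigned");
        }
        double sum = 0.0;
        for (double v : values) {
            sum += v;
        }
        return sum / values.length;
    }

    // Mode of a categorical attribute; ties go to the value seen first.
    static String mode(List<String> values) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String v : values) {
            counts.merge(v, 1, Integer::sum);
        }
        String best = null;
        int bestCount = 0;
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            if (e.getValue() > bestCount) {
                best = e.getKey();
                bestCount = e.getValue();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        System.out.println("age centroid = " + mean(new double[] {20.0, 30.0, 40.0}));
        System.out.println("state centroid = " + mode(List.of("CA", "NY", "CA")));
    }
}
```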

classification

The process of predicting the unknown value of the target attribute for new records using a
model built from records with known target values.

cluster

A collection of data objects that are similar to one another. Typically produced from a
clustering algorithm and stored with a clustering model.

clustering

Given a set of data points, each having a set of attributes, and a similarity measure among
them, clustering is the process of grouping the data points into different clusters such that
data points in the same cluster are more similar to one another and data points in different
clusters are less similar to one another.

cost matrix

A two-dimensional, N x N table that defines the cost associated with a prediction versus
the actual value. A cost matrix is typically used in classification models, where N is the
number of distinct values in the target, and the columns and rows are labeled with target
values.

cross validation
A method of evaluating the accuracy of a classification or regression model. The build
data is divided into several parts, with each part in turn being used to evaluate a model
built using the remaining parts.
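
The partitioning step of cross validation can be sketched as below. This is only an illustration under assumptions: the class CrossValidationSplit is hypothetical, rows are identified by index, and round-robin assignment of rows to parts is one simple choice among many (a real implementation might shuffle or stratify).

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only: divides row indices into k parts and, for each part,
// lists the rows left over for building the model.
final class CrossValidationSplit {

    // Assigns row indices 0..n-1 to k parts in round-robin order.
    static List<List<Integer>> split(int n, int k) {
        List<List<Integer>> parts = new ArrayList<>();
        for (int p = 0; p < k; p++) {
            parts.add(new ArrayList<>());
        }
        for (int row = 0; row < n; row++) {
            parts.get(row % k).add(row);
        }
        return parts;
    }

    // Rows used for building when part 'heldOut' is used for evaluation.
    static List<Integer> buildRows(List<List<Integer>> parts, int heldOut) {
        List<Integer> rows = new ArrayList<>();
        for (int p = 0; p < parts.size(); p++) {
            if (p != heldOut) {
                rows.addAll(parts.get(p));
            }
        }
        return rows;
    }

    public static void main(String[] args) {
        List<List<Integer>> parts = split(10, 3);
        for (int p = 0; p < parts.size(); p++) {
            System.out.println("evaluate on " + parts.get(p)
                    + ", build on " + buildRows(parts, p));
        }
    }
}
```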
data mining

The process of discovering hidden, previously unknown and usable information from a
large amount of data. This information is represented in a compact form, often referred to
as a model.

data mining engine


The component in the JDM architecture that implements the algorithms to support data
mining. The data mining engine may also support the persistent MOR. This is distinguished from the data mining server as the JDM implementation may not have a separate
server component, but support the API and client application directly.
data mining server
The component in the JDM architecture that implements the data mining engine and persistent MOR. This is distinguished from the data mining engine since a server implies a
separate component as in a client-server architecture.
data preparation status
An indication of whether a logical attribute provided as input to a build operation has been
prepared by the user, or if the user expects the algorithm to perform automatic data preparation on the input data. A user may specify a logical attribute as prepared or unprepared.
descriptive data mining
Data mining that results in a description of a data set in a concise and summary manner,
and presents interesting general properties of the data. See also predictive data mining.
DME

See Data Mining Engine.

DMS

See Data Mining Server.

enterprise information system


Generically, the application or enterprise system that supports a set of business processes
and information technology infrastructure. The business processes are provided as a set of
services. In support of data mining, an instance of an enterprise information system can be
the backend component(s) that provide data mining functionality to the enterprise.
EIS

See Enterprise Information System.

export

The operation that supports taking mining objects from within the DME and exporting
them to an external system such as a file or database table cell.

extension

A feature that is not covered by any of the relevant specifications or a non-standard implementation of a feature that is covered.

functional area
A subset of the data mining API that corresponds to a particular class of algorithm.
feature selection
Given a data set with lots of attributes, feature selection is the process of selecting the features (attributes) that are more important to the data mining model. Feature selection is
done based on the importance computed using attribute importance algorithms. See also
Attribute Importance.
import

The operation that supports taking mining objects from an external system such as a file or
database table cell and importing them to the DME and MOR.

item

An element that can be compared against another to determine if they are different. Typically used in the context of an Association Rules model.

itemset

A set of items, typically used as an antecedent or consequent in a rule, as produced from
an Association Rules model. No item in an itemset can appear more than once. Itemsets
can be compared to determine if they are different.

Java Data Mining


A Java Community Process based standard supporting data mining, Java Specification
Request 73.
Java Specification Request
Java Specification Requests (JSRs) are the actual descriptions of proposed and final specifications for the Java platform following Sun's Java Community Process. See
www.jcp.org.
JDM

See Java Data Mining.

JDM implementation
A JDM technology-enabled client API, resource adapter, and supporting data mining
engine. The resource adapter may provide support for features not implemented by the
supporting engine. It may also provide the mapping between standard syntax/semantics
and the native API implemented by the engine.
JMI

Java Metadata Interface (JSR-40)

JMS

Java Message Service (JSR-914)

JMX

Java Management Extensions (JSR-3)

JOLAP

Java Online Analytical Processing (JSR-69)

JSR

See Java Specification Request.

lift

A measure of how much better prediction results are using a model than could be obtained
by chance. For example, consider that 2% of the customers mailed a catalog without using
the model would make a purchase. However, using the model to select catalog recipients,
10% would make a purchase. Then the lift is 10/2 or 5. Lift may also be used as a measure
to compare different data mining models. Since lift is computed using a dataset with actual
outcomes, lift compares how well a model performs with respect to this dataset on predicted outcomes. Lift indicates how well the model improved the predictions over a random selection given actual results. Lift allows a user to infer how a model will perform on
new data.
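
The catalog example in this definition reduces to a single ratio, sketched below. The class LiftExample and its parameter names are hypothetical and serve only to make the arithmetic explicit.

```java
// Illustrative only: lift as the ratio of two response rates.
final class LiftExample {

    // Lift = response rate when using the model / response rate without it.
    // Both rates must be expressed in the same unit (here, percent).
    static double lift(double rateWithModel, double rateWithoutModel) {
        if (rateWithoutModel <= 0.0) {
            throw new IllegalArgumentException("baseline rate must be positive");
        }
        return rateWithModel / rateWithoutModel;
    }

    public static void main(String[] args) {
        // 10% of model-selected recipients purchase versus 2% without the model.
        System.out.println("lift = " + lift(10.0, 2.0)); // prints "lift = 5.0"
    }
}
```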

logical attribute
A description of a domain of data used as input to mining operations. Logical attributes
may be categorical, ordinal, or numerical.
logical data

A set of mining attributes used as input to building a mining model.

mining function
A major subdomain of data mining that shares common high level characteristics. Functions include: classification, regression, attribute importance, association, and clustering.
mining model

The result of building a model from a build settings object. The representation of the
model is specific to the algorithm specified by the user or selected by the underlying DMS
and defined by a ModelDetail object. A model can be used for direct inspection, e.g., to
examine the rules produced from a decision tree or association rules, or to score data.

mining object repository


The logical or physical architectural component that stores JDM mining objects, e.g.,
tasks, models, settings, and their components.
mining result

The end product(s) of a mining operation. For example, a build task produces a mining
model, a test task produces a test metrics object.


missing value

Data value that is missing because it was not measured, not answered, was unknown or
was lost. Data mining methods vary in the way they treat missing values. Typically, they
ignore the missing values, or omit any records containing missing values, or replace missing values with the mode or mean, or infer missing values from existing values.

model

An algorithm produces a compressed representation of input data called a model. A model
can be descriptive or predictive. A descriptive model helps in understanding underlying
processes or behavior. For example, an association model describes consumer behavior. A
predictive model is an equation or set of rules that makes it possible to predict an unseen
or unmeasured value (the dependent variable or target) from other, known values (independent variables or predictors).

model detail

The specific representation of a model that may be algorithm dependent. For example, a
classification model has some common Model object state; however, a decision tree is a
specific model detail that may have resulted from using the tree algorithm settings.

model signature
A collection of signature attributes, derived from the logical data used to build a model.
The input data to a model must be compatible with the model signature.
MOF

Meta Object Facility.

MOR

Mining Object Repository.

multi-record case
A representation of physical data that uses multiple records to store a single case. The data
typically has three columns with roles of sequence id, attribute name, and value.
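
To make the three-column layout concrete, the sketch below groups multi-record rows into one attribute map per case. Everything here is an assumption for illustration: the classes MultiRecordCase and Row are hypothetical, values are kept as strings, and the first column is treated as the identifier that ties the records of a case together.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustrative only: collects multi-record rows into single cases.
final class MultiRecordCase {

    // One record of multi-record data: (case id, attribute name, value).
    static final class Row {
        final int caseId;
        final String name;
        final String value;

        Row(int caseId, String name, String value) {
            this.caseId = caseId;
            this.name = name;
            this.value = value;
        }
    }

    // Groups rows by case id; each case becomes a name -> value map.
    static Map<Integer, Map<String, String>> toCases(List<Row> rows) {
        Map<Integer, Map<String, String>> cases = new LinkedHashMap<>();
        for (Row r : rows) {
            cases.computeIfAbsent(r.caseId, id -> new LinkedHashMap<>())
                 .put(r.name, r.value);
        }
        return cases;
    }

    public static void main(String[] args) {
        List<Row> rows = List.of(
                new Row(1, "age", "34"),
                new Row(1, "state", "CA"),
                new Row(2, "age", "51"));
        // Case 1 is assembled from two records, case 2 from one.
        System.out.println(toCases(rows));
    }
}
```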
numerical attribute
An attribute whose values are numbers. The numeric value can be either an integer or a
real number. Numerical attribute values are continuous as opposed to discrete or categorical values. See also Categorical Attribute and Ordinal Attribute.
OLAP

Online Analytical Processing.

ordinal attribute
An ordinal attribute is similar to a categorical attribute except that there is an order defined
on the discrete categorical values. For example, temperature where the discrete values are
high, medium and low. There is an order defined on the values; i.e., high > medium > low.
Ordinal attributes allow us to put things in order, because the set of values associated with
an ordinal attribute possesses an intrinsic organization, which is defined by a total order
relation. Therefore we can tell if one value is bigger or smaller than another, but we can
normally not tell or measure the difference or distance between two values (unlike with
interval attributes or variables). For example, if x, y, and z are ranked 5, 6, and 7, we can
tell x < y < z, but not if (z - y) < (y - x).
The ordinal attribute speed may take any of the following ranked values: STATIONARY,
SLOW, FAST, VERY FAST, where rank(STATIONARY) = 1, rank(SLOW) = 2,
rank(FAST) = 3, and rank(VERY FAST) = 4. This organization of the ordinal attribute
values allows us, for example, to tell that SLOW represents a smaller speed value than
FAST. However, it is not possible to tell if, for example, the difference between two adjacent values is the same or not. For example, we cannot tell if the difference between
SLOW and FAST is equal to, smaller or greater than the difference between the values
FAST and VERY FAST.
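
The speed example maps naturally onto a Java enum, whose declaration order provides the total order. The sketch below is illustrative only; the names OrdinalExample, Speed, rank, and isSlowerThan are hypothetical and not part of the JDM API.

```java
// Illustrative only: an ordinal attribute modeled as a Java enum.
final class OrdinalExample {

    // Declaration order defines the total order of the values.
    enum Speed {
        STATIONARY, SLOW, FAST, VERY_FAST;

        // Rank as in the definition: STATIONARY = 1 up to VERY_FAST = 4.
        int rank() {
            return ordinal() + 1;
        }

        // Order comparisons are meaningful for ordinal values.
        boolean isSlowerThan(Speed other) {
            return compareTo(other) < 0;
        }
    }

    public static void main(String[] args) {
        System.out.println(Speed.SLOW.rank());                   // prints 2
        System.out.println(Speed.SLOW.isSlowerThan(Speed.FAST)); // prints true
        // Subtracting ranks is deliberately not offered: the distance
        // between adjacent ordinal values carries no meaning.
    }
}
```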
outlier

A data item that does not come (or is thought not to have come) from the typical population of
data; in other words, a data item that falls outside the boundaries that enclose most other
data items in the data.
percentage

A value between 0 and 100 that represents a part of a whole. For example, 75% indicates
three quarters of a whole.

physical attribute
An object that corresponds to a field in a formatted file, or column in a database table.
Using tasks, physical attributes can be mapped to logical attributes of a model's signature
or logical data of a build settings object.
physical data set
Identifies data as a set of cases to be used as input to data mining. Through the use of
attribute assignment, attributes of the physical data are mapped to logical attributes of a
model's logical data. The data referenced by a physical data set object can be used in
model building, model application (scoring), lift computation, statistical analysis, etc.
physical data record
A collection of named attribute values used as input and output for single record scoring.
predictor

A logical attribute used as input to a supervised model or algorithm to build a model.

predictive data mining


Data mining that results in the construction of one or a set of models by performing inference on the available set of data, and attempting to predict outcomes for new data sets.
prior probabilities
The set of prior probabilities specifies the distribution of the various classes in the presampled data set. Also referred to as priors, these could be different from the distribution
observed in the data set.
probability

A value between zero and one (0..1) that indicates the likelihood of an event. Zero indicates there is no chance of the event occurring. One indicates it is probabilistically certain
the event will occur.

quality of fit

In clustering, a value between zero and one that is a measure of how well a given case fits
in the predicted cluster. Values closer to zero indicate a poor fit, values closer to one indicate a good fit.

receiver operating characteristics


ROC is a measure of comparison between individual models to determine thresholds
which yield a high proportion of positive hits. ROC curves aid users in selecting samples
by minimizing error rates. ROC was originally used in signal detection theory to gauge the
true hit versus false alarm ratio when sending signals over a noisy channel.
reference implementation
A software implementation of a JSR specification that validates the interface for practical
implementation and usage. It must meet the tests defined in the TCK.
regression

A mining function and class of supervised algorithms that predicts continuous targets.

ROC

Receiver Operating Characteristics.

ROI

Return On Investment.

rule

An expression of the general form if X, then Y. An output of certain models, e.g., association rules models or decision tree models. The X may be a compound predicate.

score data

See apply data.

session

The duration of an open connection to the DME.

settings

See build settings.

signature attribute

A type of attribute used to define one of the inputs to a model for test and apply. See model
signature.
single-record case
A representation of physical data that uses a single record to store each case. Each column contains data to be mined that can correspond to a logical attribute.
specified feature
A feature of JDM that must meet the specification as detailed in JDM.
supervised learning
The process of building data mining models using a known dependent variable, also
referred to as the target. All classification and regression techniques are supervised.
supported feature
A feature for which the JDM implementation supports standard syntax and semantics
(intentions, informal semantics, intended meaning) for that feature as defined in the relevant specifications.
system default

For an enumeration class, a vendor-defined default value that corresponds to one of the
allowed values for the enumeration class. This default value may be different according to
the context. Vendors must document the system default for each context.

system determined
For an enumeration class, a user may request the vendor implementation to determine
what is the best value for this enumeration. The implementation-selected value may take
into account, e.g., other settings or data to determine an enumeration value. Vendors must
document the behavior users can expect.
target

In supervised learning, the identified logical attribute that is to be predicted.

taxonomy
A hierarchical grouping of the categorical values. For example, a geography taxonomy
groups cities into states, states into regions, regions into countries and so on.
task

A container within which to specify arguments to data mining operations to be performed
by the data mining system. Data mining tasks include: model building, testing, applying
(scoring), import, and export.

TCK

See Technology Compatibility Kit.

Technology Compatibility Kit


The suite of tests, tools, and documentation that allows implementers of a Specification to
determine if their implementation is compliant with that Specification.
test

The data mining operation that determines the accuracy of a model. This is typically performed by using held-aside data identical in form to the build data, scoring that test data,
and comparing the actual target value with the predicted target value. Testing is only applicable for supervised models.

test data

The input data used for testing a model.

training

The step in the model building process that produces a possibly non-optimized form of
the model. For example, a tree algorithm may produce a full tree during training, but may
require an evaluation phase to effectively select the best subtree. See build.

training data

See build data.

transformation

A function applied to data resulting in a new form or representation of the data. For example, discretization and normalization are transformations on data.
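
The two transformations named in this definition can be sketched as follows. The specific variants shown, min-max normalization and equal-width discretization, are assumptions chosen for illustration, and the class Transformations is hypothetical; JDM 1.1 itself does not define a transformations interface.

```java
// Illustrative only: two simple transformations on a numerical value.
final class Transformations {

    // Min-max normalization: maps a value from [min, max] onto [0, 1].
    static double normalize(double value, double min, double max) {
        if (max <= min) {
            throw new IllegalArgumentException("max must exceed min");
        }
        return (value - min) / (max - min);
    }

    // Equal-width discretization: returns a bin index in 0..bins-1.
    // Values at or beyond the range limits are clamped to the end bins.
    static int discretize(double value, double min, double max, int bins) {
        int bin = (int) (normalize(value, min, max) * bins);
        return Math.min(Math.max(bin, 0), bins - 1);
    }

    public static void main(String[] args) {
        System.out.println(normalize(50.0, 0.0, 100.0));      // prints 0.5
        System.out.println(discretize(50.0, 0.0, 100.0, 4));  // prints 2
        System.out.println(discretize(100.0, 0.0, 100.0, 4)); // prints 3
    }
}
```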


UML

Unified Modeling Language

URI

Uniform Resource Identifier

unsupervised learning
The process of building data mining models without the guidance (supervision) of a
known, correct result. In supervised learning, this correct result is provided in the target
attribute. Unsupervised learning has no such target attribute. Clustering and association
are examples of unsupervised mining.
web service

A software application identified by a URI, whose interfaces and bindings are capable of
being defined, described, and discovered as XML artifacts. A Web service supports direct
interactions with other software agents using XML based messages exchanged via Internet-based protocols. [W3]

weight

A numeric value associated with an attribute or row. Weights associated with attributes
instruct the DME to consider the contribution of attributes with higher weights more
important than those with lower weights. Weights associated with rows, specified by identifying an
attribute as containing weight values, instruct the DME to consider the contribution of
rows with higher weights more important than those with lower weights.

wrapper

A type of algorithm that wraps other models to achieve better accuracy. Examples
include bagging and boosting.

Appendix B. Requirements
This section discusses the major requirements for the data mining API. It focuses on data
mining domain requirements, use of foundation technologies and related data mining standards, and system behavior requirements. The detailed requirements are expressed in the
UML model and corresponding Javadoc documentation.
The last section discusses specific features excluded from this version of the standard.
These include both domain and system exclusions.

B.1. Domain requirements


The following are high level domain requirements for JDM.
Requirement 1:

Provide an extensible framework for data mining.

Requirement 2:

Separate high-level function specification from algorithm details,
thereby allowing non-expert data miners to use JDM-compliant
implementations effectively.

Requirement 3:

Separate high-level function model representation from the specific
algorithm's model representation (model detail).

Requirement 4:

Support a representative set of data mining functionality for common usage of generally agreed upon algorithm interfaces.

Requirement 4.1:

Specify the mining functions Classification, Regression, Clustering,
Association, and Attribute Importance.

Requirement 4.2:

Specify function-level model representations for Classification,
Regression, Clustering, Association, and Attribute Importance.

Requirement 4.3:

Specify major behavior for function-level models as depicted in
Table 4.

Requirement 4.4:

Specify the algorithms Decision Trees, Feed Forward Neural Networks, SVM, and Naive Bayes for Classification and Regression;
and K-Means for Clustering.

Requirement 4.5:

Specify algorithm-specific model representations for Decision
Trees, Feed Forward Neural Networks, SVM, and Naive Bayes.

TABLE 4. Function-level model behavior


Build

Test

Apply

Classification

Regression

Association

Clustering

Attribute Importance

Requirement 5:

Support the import and export of JDM objects.

Requirement 6:

Disallow the ability to modify mining models, i.e., make them read-only.

Requirement 7:

Support computation of statistics on physical data and obtain those
statistics on a per-attribute basis.

Requirement 8:

Facilitate the specification of seed models supporting incremental
learning.

Requirement 9:

Enable verification of the correctness of a task specification prior to
execution.

B.2. Foundation technologies


JDM has the following requirements for foundation technologies, i.e., those that support
the design or infrastructure of JDM.
Requirement 10:

Target the J2EE and J2SE environments.

Requirement 11:

Conform to the approach taken by the Java Connector Architecture.

Requirement 12:

Conform to JMI naming conventions.

B.3. Data mining standards


Requirement 13:

Map JDM metadata closely to the PMML standard to facilitate the
generation of XML for mining models.

Requirement 14:

Map JDM metadata closely to the CWM 1.1 standard to facilitate
the generation of XML for mining objects such as build settings and
tasks.

Requirement 15:

Map the JDM API closely to the SQL/MM Data Mining standard to
facilitate a JDM implementation on top of SQL/MM.

B.4. System behavior


JDM has the following system behavior requirements. System behavior includes non-data
mining-specific functionality, such as object lifecycle, login, connections, data access, etc.

Requirement 16:

Application software should be portable without requiring significant application code modifications.

Requirement 17:

Enable users to access primary mining objects by name either within
a connection session (intra-session), or across connection session
boundaries (inter-session).

Requirement 18:

Support the synchronous and asynchronous execution of mining
operations.

Requirement 19:

Enable users to explicitly invoke a method to save, i.e., persist,
changes to objects in a repository.

Requirement 20:

Support creation, retrieval, renaming, and deletion of mining
objects.

Requirement 20.1:

Support the retrieval and deletion of objects based on timestamp
and metadata.

Requirement 20.2:

Uniquely name objects within a major object category, e.g., BuildSettings, Models, Results, etc.

Requirement 21:

Support reflection and introspection.

Requirement 22:

Specify a common set of error messages with associated exception
codes.

Requirement 23:

Support specification of character set encoding and language.

Requirement 24:

Facilitate real-time scoring.

B.5. Exclusions for version 1


B.5.1. Domain exclusions
B.5.1.1. Visualization
The JDM expert group concluded that an API to visualize data mining models and results
is not appropriate for a core data mining programmatic interface and potentially warrants a
separate JSR. However, as noted in Section 1, JDM does provide data objects that could
support a visualization interface.

B.5.1.2. Transformations
Data transformations are applicable beyond the realm of data mining, even though transformations are an important part of it. The expert group concluded that transformations are
beyond the scope of JDM version 1 and may deserve a separate JSR. As there are many
tools that support transformations, e.g., standalone applications and database management
systems, reproducing a small subset of commonly used mining transformations within
JDM seemed ill-advised. First, not all transformations could be covered. Second, users
would likely go outside JDM to include unsupported transformations.
Transformations are considered preprocessing. An algorithm may automatically transform
data internally, e.g., binning numerical data for Naive Bayes; however, the standard interface does not allow the specification of the number of bins or other binning options. Users
who want this level of control must preprocess the data before submitting it to JDM algorithms. Vendors who prefer to support some degree of preprocessing will find a natural
place within the specification to place such preprocessing.
Missing value treatment and outlier treatment are also viewed as forms of transformation.
Vendors who wish to include such transformations because their algorithms already provide such an option are free to provide vendor-specific algorithm settings.

B.5.2. System exclusions


The following system features are explicitly not addressed by the JDM specification, and
are considered out of scope.

Transactions - A transactional interface is not specified within JDM. Defining transaction boundaries around long running data mining operations would overly complicate the standard and the ability for vendors to support this standard. As such, we
suggest that individual operations provide atomicity whenever possible to ensure correct execution across multiple concurrent invocations from a single or multiple users
and in the presence of failures. Transactions are an area where we expect vendors to
differentiate themselves.

Thread Safety - The level of thread safety is not specified within JDM. The extent to
which multiple threads operate correctly is up to each vendor's implementation, as
multi-threaded applications may not be required in many domains.

Scheduling - The ability to perform sophisticated task scheduling is not defined within
JDM. The execution of multiple tasks, related tasks, or dependencies among tasks are
better handled by existing mechanisms, e.g., workflow systems, operating system support, etc. The ability to store and reference tasks for later execution, however, directly
supports applications.

Security - JDM does not address security issues except for specifying that some form
of login validation occur for access to the DME. Similar login information is provided
for accessing data such as files and database tables. Beyond this, vendors may address
security as part of their respective implementations.

Remote Method Invocation (RMI) - JDM does not specify the architecture, e.g., client-server, or implementation technique supporting client-server communication.

Serializable Objects - JDM does not specify techniques for transferring Java objects
inter- or intra-system. The Java serialized object feature, while commonly used and
well integrated into the Java framework, has alternatives such as XML representations.

Enterprise Java Beans (EJBs) - JDM strives to provide a straightforward Java API
that may be used in many contexts. Leveraging a technology such as EJBs places certain demands on an implementation that may not be necessary for a particular use. The
JDM API does not preclude being exposed through EJBs, but this is specific to the
vendor's implementation.

Appendix C. Optional Methods


The methods listed in Table 5 are optional. Vendors may choose not to implement the features represented by these methods. However, any such method that is not implemented must throw UnsupportedFeatureException when invoked.

TABLE 5. JDM optional methods for models and model details


Interface                 Optional Method

Model                     getAttributeStatistics
                          getBuildDuration
                          getEffectiveBuildSettings
                          getModelDetail
                          getUniqueIdentifier

AssociationModel          getAverageTransactionSize
                          getItems
                          getItemsets(int)
                          getMaxAbsoluteSupport
                          getMaxTransactionSize
                          getMinAbsoluteSupport
                          getNumberOfItems
                          getNumberOfTransactions
                          getRules(RulesFilter)

ClusteringModel           getRules
                          getSimilarity

Cluster                   getCentroidCoordinate(String)
                          getCentroidCoordinate(String, Object)
                          getName
                          getRule
                          getSplitPredicate
                          getStatistics

ClassificationModel       getClassificationError

RegressionModel           getRSquared

NaiveBayesModelDetail     getCount(String, Object)
                          getPairCount(String, Object, Object)
                          getPairProbability(String, Object, Object)
                          getTargetCount(Object)
                          getTargetProbability(Object)

SVMModelDetail            getNumberOfBoundedVectors
                          getNumberOfUnboundedVectors

TreeNode                  getNodeStatistics
                          getSurrogates
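
An application that must run against several vendors' implementations can guard each optional call. The following is a minimal sketch, assuming the exception thrown for an unimplemented optional method is an unchecked subclass of java.lang.UnsupportedOperationException (as Appendix D describes for JDMUnsupportedFeatureException). The Model interface, VendorModel class, exception class, and callOptional helper below are hypothetical stand-ins written for illustration; they are not the JDM API types.

```java
import java.util.Optional;
import java.util.function.Supplier;

// Hypothetical stand-ins for illustration only; not the javax.datamining types.
final class OptionalMethodSketch {

    /** Stand-in for the runtime exception thrown by an unimplemented optional method. */
    static class UnsupportedFeatureException extends UnsupportedOperationException {
        UnsupportedFeatureException(String feature) {
            super("Unsupported method " + feature);
        }
    }

    /** Stand-in for a model interface exposing one optional method. */
    interface Model {
        long getBuildDuration();
    }

    /** A vendor model that chooses not to implement the optional method. */
    static class VendorModel implements Model {
        @Override
        public long getBuildDuration() {
            throw new UnsupportedFeatureException("getBuildDuration");
        }
    }

    /** Invokes an optional method; an "unsupported" failure becomes an empty result. */
    static <T> Optional<T> callOptional(Supplier<T> call) {
        try {
            return Optional.ofNullable(call.get());
        } catch (UnsupportedOperationException e) {
            return Optional.empty();
        }
    }

    public static void main(String[] args) {
        Model model = new VendorModel();
        Optional<Long> duration = callOptional(model::getBuildDuration);
        System.out.println(duration.isPresent()
                ? "build duration: " + duration.get()
                : "getBuildDuration is not supported by this implementation");
    }
}
```

Catching the standard Java supertype keeps the calling code independent of any one vendor's exception subclass.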

Appendix D. Exceptions
Exceptions can be either checked or unchecked (runtime). For checked exceptions, where
the application can take appropriate actions for anticipated errors, JDM provides the
JDMException class which inherits from the standard Java Exception. All JDM methods
accepting parameters automatically include JDMException in their signature; others are as
specified in the interface documentation. JDM provides subclasses of JDMException to
allow specialized exception handling in applications.
Unchecked exceptions result from unanticipated application execution failure and may
require stopping the application. For unchecked exceptions, vendors may choose to throw
standard Java RuntimeException instances, wrap these as appropriate in JDMException or
JDMRuntimeException instances, or throw the JDM subclass of a Java runtime exception.
To keep the number of JDMException and JDM runtime exception subclasses relatively
small, yet still provide meaningful feedback to applications and developers, JDM defines
standard exception messages and error codes to support code portability. Vendors can
embed their specific error codes within the JDM exception-related classes, as well as wrap
other Java exceptions as appropriate. The table below lists standard JDM exception error
codes and their mapping to specific JDM exception subclasses.
Note that JDMException error codes are defined in the range 1000-1499, and JDMRuntimeException error codes are defined in the range 1500-1999. Error codes in the range 2000-9999 are reserved for vendor-specific error codes.
Standard exception messages and codes are necessary for code portability. Vendors can
embed their specific error codes within the JDMException, as well as wrap other exceptions as appropriate. The tables below list standard JDM exception error codes and corresponding exception classes.
JDMException has the following subclasses:

ConnectionFailureException
IncompatibleSpecificationException
InvalidURIException
TaskException
InvalidObjectException
EntryNotFoundException
DuplicateEntryException
ObjectNotFoundException
ObjectExistsException
JDM defines the following runtime exceptions:

JDMUnsupportedFeatureException
inherits from java.lang.UnsupportedOperationException

JDMIllegalArgumentException
inherits from java.lang.IllegalArgumentException
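
The code ranges described in this appendix can be checked mechanically, for example when logging an error whose origin is unknown. The following is a small sketch of such a check; the class and method names are hypothetical and are not part of JDM, and the 0-999 range is taken from the "Reserved for JDM" entry in Table 6.

```java
// Illustrative helper; ErrorCodeRanges and classify are hypothetical names, not JDM API.
final class ErrorCodeRanges {

    /** Maps an error code to the family that owns its range in this appendix. */
    static String classify(int code) {
        if (code >= 0 && code <= 999) {
            return "reserved for JDM";          // Table 6, first reserved range
        }
        if (code >= 1000 && code <= 1499) {
            return "JDMException";              // checked exceptions
        }
        if (code >= 1500 && code <= 1999) {
            return "JDMRuntimeException";       // unchecked exceptions
        }
        if (code >= 2000 && code <= 9999) {
            return "vendor-specific";
        }
        return "outside the defined ranges";
    }

    public static void main(String[] args) {
        int[] samples = {1004, 1501, 2500};
        for (int code : samples) {
            System.out.println(code + " -> " + classify(code));
        }
    }
}
```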

TABLE 6. JDMException codes and messages

Exception Class | Code | Title | Message | Remarks
- | 0-999 | - | - | Reserved for JDM.
JDMException | 1000 | GenericError | Generic Error. | Query the vendor specific error from the exception using provided methods. For example, disk space or memory exhausted, back end data connection failure, database failure, etc.
ConnectionFailureException | 1001 | ConnectionFailure | Operation {0} could not be completed due to connection failure. | Arg 0 is the operation name.
ConnectionFailureException | 1002 | ConnectionOpenFailed | Unable to open a connection to the data mining engine {0}. | Arg 0 is the DME name.
ConnectionFailureException | 1003 | ConnectionClosedFailed | Unable to close a connection to the data mining engine {0}. | Arg 0 is the DME name.
EntryNotFoundException | 1004 | EntryNotFound | The entry {0} does not exist in the {1} {2}. | Arg 0 is the entry name, arg 1 is the object type, and arg 2 is the object name. Entries may be attribute names in LogicalData.
- | 1005 | - | - | Reserved.
DuplicateEntryException | 1006 | DuplicateEntry | Multiple occurrences of {0} in object {1} {2}. | Arg 0 is the duplicate entry name, arg 1 is the object type, and arg 2 is the object name. E.g., an attribute may be referenced in a LogicalData twice.
InvalidURIException | 1007 | InvalidURI | Invalid URI specification {0}. | Arg 0 is the URI name. When the object cannot be located in {1}.
InvalidURIException | 1008 | InaccessibleURI | URI {0} cannot be accessed. | Arg 0 is the URI specified.
IncompatibleSpecificationException | 1009 | IncompatibleArgumentSpecification | Invalid specification {0} for argument {0}. | Arg 0 is the attribute name. E.g., a categorical attribute of type integer cannot be ordered alphabetically.
IncompatibleSpecificationException | 1010 | IncompatibleSpecification | {0} {1} is not compatible with {2} {3} | Arg 0 and arg 2 are object types. Arg 1 and arg 3 are object names. E.g., comparison between two model signatures or model signature and physical data.
IncompatibleSpecificationException | 1011 | InvalidUsage | Attribute usage {0} cannot be used for unsupervised mining functions. | Most likely, arg 0 is target.
IncompatibleSpecificationException | 1012 | InvalidSettings | Invalid settings to {0} model {1}. | E.g., when a settings object is invalid. Arg 0 is the operation (build, apply), arg 1 is the model name.
ObjectNotFoundException | 1013 | ObjectNotFound | {0} {1} was not found. | Occurs when named object is not found in MOR. Arg 0 is the type and arg 1 is the object name.
ObjectExistsException | 1014 | ObjectExists | {0} {1} already exists. | Occurs when object name already exists in MOR. Arg 0 is the type. Arg 1 is the object name.
TaskException | 1015 | TaskExecuting | Task {0} is currently executing. Cannot re-execute before completion. | When the task arg 0 is still running when another execute is tried.
TaskException | 1016 | TaskNotExecuting | Task {1} is not currently executing. | When terminate is tried on a task that is not executing.
TaskException | 1017 | TaskFailed | {0} mining operation for model {1} failed. {2}. | Arg 0 is an operation name: build, apply, test, etc. Arg 1 is the model name, arg 2 is a vendor specific message.
- | 1017-1499 | - | - | Reserved for future JDMException Error Codes.
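
The {0}, {1}, {2} placeholders in the messages above have the same shape as java.text.MessageFormat patterns. The specification text here does not state which formatter an implementation uses, so the following is only a sketch under the assumption that MessageFormat-style substitution is intended; the class name, constant names, and render helper are hypothetical.

```java
import java.text.MessageFormat;

// Illustrative only: assumes the Table 6 messages are MessageFormat-style templates.
final class MessageTemplateSketch {

    // Message templates copied from Table 6 (codes 1001 and 1013).
    static final String CONNECTION_FAILURE =
            "Operation {0} could not be completed due to connection failure.";
    static final String OBJECT_NOT_FOUND = "{0} {1} was not found.";

    /** Substitutes the positional arguments into the template. */
    static String render(String template, Object... args) {
        return MessageFormat.format(template, args);
    }

    public static void main(String[] args) {
        System.out.println(render(CONNECTION_FAILURE, "build"));
        System.out.println(render(OBJECT_NOT_FOUND, "Model", "myModel"));
    }
}
```

Note that MessageFormat treats a single quote as an escape character, so a template containing an apostrophe would need it doubled.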

TABLE 7. JDM runtime exceptions, codes, and messages

Exception Class | Code | Title | Message | Remarks
JDMRuntimeException | 1500 | GenericError | Generic Error. | Query the vendor specific error from the exception using provided methods.
JDMUnsupportedFeatureException | 1501 | UnsupportedFeature | Unsupported {0} {1}. | An unknown or unsupported feature is specified. Arg 0 is factory, method or enum. Arg 1 is the value of the feature. Arg 1 is optional.
JDMIllegalArgumentException | 1502 | NullArgument | The required argument {0} is null. Supply a non-null value. | Arg 0 is the argument name.
JDMIllegalArgumentException | 1503 | ArrayMismatch | Mismatch between {0} and {1} in size. | Args 0 and 1 are object names, e.g., when two arrays must be the same in size.
JDMIllegalArgumentException | 1504 | InvalidArgument | The argument {0} is invalid. The value must be {1}. | Arg 0 is the argument name. Arg 1 is, e.g., >=0, 1<=x<=100, one of {a,b,c}.
JDMIllegalArgumentException | 1505 | InvalidStringArgument | Provided string {0} is invalid for argument {1} | When the provided string arg 0 is not acceptable input to the specified argument in arg 1.
JDMIllegalArgumentException | 1506 | StringTooLong | Provided string {0} is too long. Maximum string length is {1}. | When the provided string arg 0 exceeds the maximum string length specified in arg 1.
JDMIllegalArgumentException | 1507 | InvalidClassName | Class {0} is invalid. | The named class is not available.
JDMIllegalArgumentException | 1508 | InvalidDataType | Invalid data type {0} in object {1}. The expected type is {2}. | E.g., if integer is specified where string is expected. Arg 1 is the class type where this happens.
JDMIllegalArgumentException | 1509 | ArraySizeExceeded | Size of input array {0} exceeds the maximum {1}. | Vendor specific array size limit.
JDMIllegalArgumentException | 1510 | InvalidObjectType | Mining object {0} is an instance of {1}. An instance of {2} is expected. | Arg 0 is the name of the object. Arg 1 and 2 are the class types.
JDMIllegalArgumentException | 1511 | InvalidObject | Mining object {0} is invalid. | Arg 0 is the name of the object.
- | 1512-1999 | - | - | Reserved for future JDMRuntimeExceptions

Appendix E. Web services


E.1. Introduction
Web services are growing in popularity for supporting distributed, loosely coupled applications. Data mining Web services provide an opportunity to facilitate integration of multiple data mining software implementations in a single application, enable a service-oriented architecture for leveraging multiple specific vendor implementations, and
enable language-independent development.
We include in the JDM specification a Web services definition for data mining based on
the JDM UML model. This enables vendors to leverage their investment in a JDM server
for both the Java and Web service interfaces: common metadata, object structure, and
capabilities. Since Java and XML are easily interchanged through standard protocols, e.g.,
JAXB, JDM vendors can provide a high degree of interoperability between the two interfaces.
By introducing a few strategically placed Java methods to JDM, applications can use the
Java interface to generate and consume XML objects more easily than the import and
export tasks. JDM's design to keep large inputs and results at the JDM server makes it naturally amenable to Web service design.
This specification includes the JDM WSDL types for document style Web services and the
XML schema definition for objects defined in JDM to enable JDM vendors to support the
Web services definition. Non-JDM vendors may also use this specification to enable their
data mining applications as Web services.
In the JDM XML Schema, we define a data model consistent with the JDM object model.
This schema follows a similar inheritance hierarchy of the JDM UML model and defines a
complex type for each object in JDM. Since JDM Java methods often do not imply an
implementation's object structure, the XML schema introduces the necessary structure.
For enumeration extensibility, we use the XML Schema simple types combined with a
union of member types: standard and enumeration extension option.
This JDM WSDL and XML Schema specification is being validated, yet is not part of the
JDM reference implementation and technology compatibility kit. Through the rigors of
implementation, we expect to make minor changes with input from vendors. We encourage users needing enhancements or modifications to this Web services specification to
work with the expert group to address those needs.
We expect that entities created via Java are accessible via Web services and vice versa. For
this version of JDM Web services, we make several assumptions, based on the availability
of other Web service standards in these areas:

User knows the location of their data and specifies a URL
Data is available and accessible to the DME
Security is dealt with separately as part of overall Web services framework
As we evolve JDM Web services, the expert group recognizes the efforts of the Web Services Interoperability Organization (WS-I) [WS-I] and has as a goal to be compliant with
WS-I recommendations.

E.2. Methods
JDM defines the following SOAP methods to communicate with a DME. Note that JDM
Web services follow document literal style for better interoperability.

listContents
getCapabilities
getObject
saveObject
removeObject
renameObject
getSubObjects
verifyObject
executeTask
getExecutionStatus
terminateTask
For JDM SOAP methods, http://www.jsr-73.org/2004/webservices/ is used as the
namespace. Each of these methods is detailed in the examples below.

E.2.1. WSDL Document Structure


The structure of the WSDL document is as follows:
<definitions name="DataMiningService"
    targetNamespace="http://www.jsr-73.org/2004/webservices/"
    xmlns:tns="http://www.jsr-73.org/2004/webservices/"
    xmlns="http://schemas.xmlsoap.org/wsdl/"
    xmlns:ns2="http://www.jsr-73.org/2004/webservices/types"
    xmlns:xsd="http://www.w3.org/2001/XMLSchema"
    xmlns:soap="http://schemas.xmlsoap.org/wsdl/soap/">
  <types>
    <schema targetNamespace="http://www.jsr-73.org/2004/webservices/types"
        xmlns:tns="http://www.jsr-73.org/2004/webservices/types"
        xmlns:soap11-enc="http://schemas.xmlsoap.org/soap/encoding/"
        xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
        xmlns:wsdl="http://schemas.xmlsoap.org/wsdl/"
        xmlns="http://www.w3.org/2001/XMLSchema"
        xmlns:jdm="http://www.jsr-73.org/2004/webservices/">
      <import namespace="http://schemas.xmlsoap.org/soap/encoding/"/>
      ...type definitions included in subsequent sections...
      <complexType name="faultResponse">
        <sequence>
          <element name="exception" type="JDMException"/>
        </sequence>
      </complexType>
      <element name="saveObjectElement" type="tns:saveObject"/>
      <element name="saveObjectResponseElement" type="tns:saveObjectResponse"/>
      ...method element definitions...
    </schema>
  </types>
  <message name="IDataMining_saveObject">
    <part name="parameters" element="ns2:saveObjectElement"/>
  </message>
  <message name="IDataMining_saveObjectResponse">
    <part name="result" element="ns2:saveObjectResponseElement"/>
  </message>
  <portType name="IDataMining">
    <operation name="saveObject">
      <input message="tns:IDataMining_saveObject"/>
      <output message="tns:IDataMining_saveObjectResponse"/>
      <fault message="tns:IDataMining_exception"/>
    </operation>
    ...messages definitions...
  </portType>
  <binding name="IDataMiningBinding" type="tns:IDataMining">
    <soap:binding transport="http://schemas.xmlsoap.org/soap/http"
        style="document"/>
    <operation name="saveObject">
      <soap:operation soapAction=""/>
      <input>
        <soap:body use="literal"/>
      </input>
      <output>
        <soap:body use="literal"/>
      </output>
    </operation>
    ...method bindings...
  </binding>
  <service name="DataMiningService">
    <port name="IDataMiningPort" binding="tns:IDataMiningBinding">
      <soap:address location="http://www.jsr-73.org/2004/webservices/DataMiningService"/>
    </port>
  </service>
</definitions>

E.2.2. Listing DME Contents


This method is used to list objects in a data-mining engine. It accepts the DME connection
details and an object filter specification.
Method
listContents ([in] object filter
[out] mining object header(s) )

WSDL Type
<complexType name="listContents">
  <sequence>
    <element name="objectFilter" type="ObjectFilter"/>
  </sequence>
</complexType>
<complexType name="listContentsResponse">
  <sequence>
    <element name="object" type="MiningObjectHeader"
        maxOccurs="unbounded"/>
  </sequence>
</complexType>
<complexType name="ObjectFilter">
  <attribute name="name" type="xsd:string" use="optional"/>
  <attribute name="type" type="xsd:string" use="optional"/>
  <attribute name="function" type="xsd:string" use="optional"/>
  <attribute name="algorithm" type="xsd:string" use="optional"/>
  <attribute name="creatorInfo" type="xsd:string" use="optional"/>
  <attribute name="createdBefore" type="xsd:date" use="optional"/>
  <attribute name="createdAfter" type="xsd:date" use="optional"/>
  <attribute name="objectIdentifier" type="xsd:string" use="optional"/>
  <attribute name="requestedContent" type="ObjectContentType"
      use="optional"/>
</complexType>
<simpleType name="ObjectContentType">
  <restriction base="string">
    <enumeration value="modelSignature"/>
    <enumeration value="buildSettings"/>
    <enumeration value="effectiveBuildSettings"/>
    <enumeration value="statistics"/>
    <enumeration value="modelDetail"/>
    <enumeration value="logicalData"/>
    <enumeration value="physicalData"/>
    <enumeration value="costMatrix"/>
    <enumeration value="applySettings"/>
  </restriction>
</simpleType>

Example
SOAP Request:
<SOAP-ENV:Envelope
    xmlns:SOAP-ENV="http://schemas.xmlsoap.org/soap/envelope/"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xmlns:xsd="http://www.w3.org/2001/XMLSchema">
  <SOAP-ENV:Header>
    <connectionSpec xmlns="http://www.jsr-73.org/2004/JDMSchema">
      <userName>miningGuru</userName>
      <password>mine</password>
      <uri>www.jsr-73.org</uri>
    </connectionSpec>
  </SOAP-ENV:Header>
  <SOAP-ENV:Body>
    <listContents xmlns="http://www.jsr-73.org/2004/webservices/"
        xmlns:jdm="http://www.jsr-73.org/2004/JDMSchema">
      <objectFilter type="CostMatrix"/>
    </listContents>
  </SOAP-ENV:Body>
</SOAP-ENV:Envelope>

SOAP Response:
<SOAP-ENV:Envelope ... >
  <SOAP-ENV:Body>
    <listContentsResponse
        xmlns="http://www.jsr-73.org/2004/webservices/"
        xmlns:jdm="http://www.jsr-73.org/2004/JDMSchema">
      <object xsi:type="CostMatrix" name="myCostMatrix" creatorInfo="jdmExpert">
      </object>
    </listContentsResponse>
  </SOAP-ENV:Body>
</SOAP-ENV:Envelope>
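
A client does not have to assemble such a request by string concatenation. The following is a minimal sketch that builds the body of the listContents request with the standard Java DOM and transformer APIs. The class and method names are hypothetical, the connectionSpec header is omitted for brevity, and the exact placement of namespace declarations in the output depends on the serializer in use.

```java
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

// Illustrative only: builds a listContents request like the example above.
final class ListContentsRequestSketch {

    static final String SOAP_NS = "http://schemas.xmlsoap.org/soap/envelope/";
    static final String JDM_WS_NS = "http://www.jsr-73.org/2004/webservices/";

    /** Returns a SOAP envelope whose body asks for objects of the given type. */
    static String buildRequest(String objectType) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true);
        Document doc = factory.newDocumentBuilder().newDocument();

        Element envelope = doc.createElementNS(SOAP_NS, "SOAP-ENV:Envelope");
        doc.appendChild(envelope);
        Element body = doc.createElementNS(SOAP_NS, "SOAP-ENV:Body");
        envelope.appendChild(body);

        // The method element and its filter live in the JDM Web services namespace.
        Element listContents = doc.createElementNS(JDM_WS_NS, "listContents");
        body.appendChild(listContents);
        Element filter = doc.createElementNS(JDM_WS_NS, "objectFilter");
        filter.setAttribute("type", objectType);
        listContents.appendChild(filter);

        // Serialize the DOM tree to a string without an XML declaration.
        Transformer transformer = TransformerFactory.newInstance().newTransformer();
        transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        StringWriter out = new StringWriter();
        transformer.transform(new DOMSource(doc), new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(buildRequest("CostMatrix"));
    }
}
```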

E.2.3. Introspection / Reflection


This method is used to determine the capabilities supported by the DME. It returns a complete list of capabilities.
Method Signature
getCapabilities ([out] capabilities report )

WSDL Type
<complexType name="getCapabilities"/>
<complexType name="getCapabilitiesResponse">
  <sequence>
    <element name="report" type="CapabilitiesReport"/>
  </sequence>
</complexType>
<xsd:complexType name="CapabilitiesReport">
  <xsd:sequence>
    <xsd:element name="capability" type="Capability" maxOccurs="unbounded"/>
  </xsd:sequence>
</xsd:complexType>
<xsd:complexType name="Capability">
  <xsd:attribute name="task" type="MiningTask" use="optional"/>
  <xsd:attribute name="function" type="MiningFunction" use="optional"/>
  <xsd:attribute name="algorithm" type="MiningAlgorithm" use="optional"/>
  <xsd:attribute name="enumName" type="xsd:string" use="optional"/>
  <xsd:attribute name="enumValue" type="xsd:string" use="optional"/>
  <xsd:attribute name="isSupported" type="xsd:boolean" use="required"/>
</xsd:complexType>

Example
SOAP Request:
<SOAP-ENV:Envelope ...>
  <SOAP-ENV:Header ... />
  <SOAP-ENV:Body>
    <getCapabilities xmlns="http://www.jsr-73.org/2004/webservices/"
        xmlns:jdm="http://www.jsr-73.org/2004/JDMSchema">
    </getCapabilities>
  </SOAP-ENV:Body>
</SOAP-ENV:Envelope>

SOAP Response:
<SOAP-ENV:Envelope ...>
  <SOAP-ENV:Body>
    <getCapabilitiesResponse
        xmlns="http://www.jsr-73.org/2004/webservices/"
        xmlns:jdm="http://www.jsr-73.org/2004/JDMSchema">
      <report enumeration="ActivationFunction">
        <capability task="Build" function="Regression" isSupported="true"/>
        ...
      </report>
    </getCapabilitiesResponse>
  </SOAP-ENV:Body>
</SOAP-ENV:Envelope>
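
On the client side, a capabilities report of this shape can be scanned with a namespace-aware DOM parser. The following is a minimal sketch; the class and method names are hypothetical, and it assumes the capability elements are in the http://www.jsr-73.org/2004/webservices/ namespace, as in the example response above.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Illustrative only: lists the functions a DME reports as supported for one task.
final class CapabilitiesSketch {

    static final String JDM_WS_NS = "http://www.jsr-73.org/2004/webservices/";

    /** Returns the function names whose capability entry matches the task and is supported. */
    static List<String> supportedFunctions(String responseXml, String task) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true);
        Document doc = factory.newDocumentBuilder()
                .parse(new ByteArrayInputStream(responseXml.getBytes(StandardCharsets.UTF_8)));

        NodeList capabilities = doc.getElementsByTagNameNS(JDM_WS_NS, "capability");
        List<String> functions = new ArrayList<>();
        for (int i = 0; i < capabilities.getLength(); i++) {
            Element capability = (Element) capabilities.item(i);
            // xsd:boolean also permits "1"; this sketch only checks the literal "true".
            boolean supported = "true".equals(capability.getAttribute("isSupported"));
            if (supported && task.equals(capability.getAttribute("task"))) {
                functions.add(capability.getAttribute("function"));
            }
        }
        return functions;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<report xmlns=\"" + JDM_WS_NS + "\">"
                + "<capability task=\"Build\" function=\"Regression\" isSupported=\"true\"/>"
                + "<capability task=\"Build\" function=\"Clustering\" isSupported=\"false\"/>"
                + "</report>";
        System.out.println(supportedFunctions(xml, "Build"));
    }
}
```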

E.2.4. Saving objects


This method is used to save any mining object in the specified Data Mining Engine
(DME). It requires the DME connection details and mining object details as input.
Method Signature
saveObject (
[in] object name
[in] overwrite flag
[in] verify flag
[in] mining object details
[out] verification report )

WSDL Type
<complexType name="saveObject">
  <sequence>
    <element name="object" type="MiningObject"/>
  </sequence>
  <attribute name="objectName" type="xsd:string" use="required"/>
  <attribute name="overwrite" type="xsd:boolean" use="optional"/>
  <attribute name="verify" type="xsd:boolean" use="optional"/>
</complexType>
<complexType name="saveObjectResponse">
  <sequence>
    <element name="report" type="VerificationReport" minOccurs="0"/>
  </sequence>
</complexType>

Example
SOAP Request:
<SOAP-ENV:Envelope ... >
  <SOAP-ENV:Header ... />
  <SOAP-ENV:Body>
    <saveObject xmlns="http://www.jsr-73.org/2004/webservices/"
        xmlns:jdm="http://www.jsr-73.org/2004/JDMSchema"
        name="myClassificationSettings-1" overwrite="true" verify="true">
      <object xsi:type="ClassificationSettings" miningFunction="classification">
        <algorithmSettings algorithm="naiveBayes" pairwiseThreshold="0.1" singletonThreshold="0.1"/>
        <buildAttribute attributeName="income" usage="active"
            outlierTreatment="asMissing"/>
        <buildAttribute attributeName="age" usage="active" outlierTreatment="asIs"/>
        <buildAttribute attributeName="numChildren" usage="active"
            outlierTreatment="asIs"/>
        <buildAttribute attributeName="ss#" usage="inactive"/>
      </object>
    </saveObject>
  </SOAP-ENV:Body>
</SOAP-ENV:Envelope>

SOAP Response:
<SOAP-ENV:Envelope ... >
  <SOAP-ENV:Body>
    <saveObjectResponse
        xmlns="http://www.jsr-73.org/2004/webservices/"
        xmlns:jdm="http://www.jsr-73.org/2004/JDMSchema">
      <verificationReport reportType="warning">
        <reportText>Details of report...</reportText>
      </verificationReport>
    </saveObjectResponse>
  </SOAP-ENV:Body>
</SOAP-ENV:Envelope>

E.2.5. Retrieving objects


This method is used to get the mining object from the specified Data Mining Engine. It
requires the DME connection details and mining object name as input and it returns the
mining object details as output.
Method Signature
getObject (
[in] object name
[in] object type
[out] mining object(s) )

WSDL Type
<complexType name="getObject">
  <attribute name="objectName" type="xsd:string" use="required"/>
  <attribute name="objectType" type="NamedObjectType" use="required"/>
</complexType>
<complexType name="getObjectResponse">
  <sequence>
    <element name="object" type="NamedObject"/>
  </sequence>
</complexType>
<xsd:complexType name="NamedObject">
  <xsd:sequence>
    <xsd:choice>
      <xsd:element name="task" type="Task"/>
      <xsd:element name="buildSettings" type="BuildSettings"/>
      <xsd:element name="model" type="Model"/>
      <xsd:element name="logicalData" type="LogicalData"/>
      <xsd:element name="physicalDataSet" type="PhysicalDataSet"/>
      <xsd:element name="testMetrics" type="TestMetrics"/>
      <xsd:element name="taxonomy" type="Taxonomy"/>
      <xsd:element name="costMatrix" type="CostMatrix"/>
      <xsd:element name="applySettings" type="ApplySettings"/>
    </xsd:choice>
  </xsd:sequence>
</xsd:complexType>

Example
SOAP Request:
<SOAP-ENV:Envelope ... >
  <SOAP-ENV:Header ... />
  <SOAP-ENV:Body>
    <getObject xmlns="http://www.jsr-73.org/2004/webservices/"
        xmlns:jdm="http://www.jsr-73.org/2004/JDMSchema"
        name="Census_A_ClassificationSettings"
        type="BuildSettings" />
  </SOAP-ENV:Body>
</SOAP-ENV:Envelope>

SOAP Response:
<SOAP-ENV:Envelope ... >
  <SOAP-ENV:Body>
    <getObjectResponse
        xmlns="http://www.jsr-73.org/2004/webservices/"
        xmlns:jdm="http://www.jsr-73.org/2004/JDMSchema">
      <object xsi:type="ClassificationSettings"
          name="Census_A_ClassificationSettings"
          miningFunction="classification"
          targetAttributeName="income">
        <algorithmSettings algorithm="naiveBayes" pairwiseThreshold="0.1"
            singletonThreshold="0.1"/>
        <buildAttribute attributeName="income" usage="active"
            outlierTreatment="asMissing"/>
        <buildAttribute attributeName="age" usage="active"
            outlierTreatment="asIs"/>
        <buildAttribute attributeName="numChildren" usage="active"
            outlierTreatment="asIs"/>
        <buildAttribute attributeName="ss#" usage="inactive"/>
      </object>
    </getObjectResponse>
  </SOAP-ENV:Body>
</SOAP-ENV:Envelope>

E.2.6. Removing objects


This method is used to remove any named mining object in the specified Data Mining
Engine (DME). It requires the DME connection details and mining object details as input.
Method Signature
removeObject (
[in] object name
[in] object type
[out] object name
[out] object type )

WSDL Types
<complexType name="removeObject">
  <attribute name="objectName" type="xsd:string" use="required"/>
  <attribute name="objectType" type="NamedObjectType" use="required"/>
</complexType>
<complexType name="removeObjectResponse">
  <attribute name="objectName" type="xsd:string" use="required"/>
  <attribute name="objectType" type="NamedObjectType" use="required"/>
</complexType>

Example
SOAP Request:
<SOAP-ENV:Envelope ... >
<SOAP-ENV:Header ... />
<SOAP-ENV:Body>
<removeObject xmlns="http://www.jsr-73.org/2004/webservices/"
xmlns:jdm="http://www.jsr-73.org/2004/JDMSchema"
objectName="myClassificationSettings"
objectType="BuildSettings">
</removeObject>
</SOAP-ENV:Body>
</SOAP-ENV:Envelope>

SOAP Response:
<SOAP-ENV:Envelope ... >
<SOAP-ENV:Body>
<removeObjectResponse xmlns="http://www.jsr-73.org/2004/webservices/"
xmlns:jdm="http://www.jsr-73.org/2004/JDMSchema"
objectName="myClassificationSettings"
objectType="BuildSettings">
</removeObjectResponse>
</SOAP-ENV:Body>
</SOAP-ENV:Envelope>
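
A client may prefer to assemble the removeObject request body programmatically so that
attribute values are always quoted and escaped. The following is a minimal sketch, assuming
the standard Java XML APIs (JAXP DOM and identity transformer). The class name
RemoveObjectRequestSketch and the method build are illustrative and are not part of JDM; the
namespace URI is taken from the getObject example and is treated as an assumption. The SOAP
envelope and the transport are omitted.

```java
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

// Illustrative only: this class is not part of the JDM API.
final class RemoveObjectRequestSketch {
    // Namespace URI as written in the getObject example; an assumption of this sketch.
    static final String WS_NS = "http://www.jsr-73.org/2004/webservices/";

    /** Builds the removeObject element; the serializer quotes and escapes attribute values. */
    static String build(String objectName, String objectType) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            Document doc = factory.newDocumentBuilder().newDocument();

            // Both attributes are declared use="required" in the WSDL type.
            Element request = doc.createElementNS(WS_NS, "removeObject");
            request.setAttribute("objectName", objectName);
            request.setAttribute("objectType", objectType);
            doc.appendChild(request);

            // Identity transform: serializes the DOM tree to text.
            Transformer serializer = TransformerFactory.newInstance().newTransformer();
            serializer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            serializer.transform(new DOMSource(doc), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException("could not build removeObject request", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(build("myClassificationSettings", "BuildSettings"));
    }
}
```

The resulting element would then be placed inside the SOAP body by whichever SOAP toolkit
the client uses.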

E.2.7. Renaming objects


This method is used to rename any named mining object in the specified Data Mining
Engine (DME). It requires the DME connection details, the current and new object names,
and the object type as input.
Method Signature
renameObject (
[in] from name
[in] to name
[in] object type
[out] from name
[out] to name
[out] object type )

WSDL Types
<complexType name="renameObject">
<attribute name="fromName" type="xsd:string" use="required"/>
<attribute name="toName" type="xsd:string" use="required"/>
<attribute name="objectType" type="NamedObjectType" use="required"/>
</complexType>
<complexType name="renameObjectResponse">
<attribute name="fromName" type="xsd:string" use="required"/>
<attribute name="toName" type="xsd:string" use="required"/>
<attribute name="objectType" type="NamedObjectType" use="required"/>
</complexType>

Example
SOAP Request:
<SOAP-ENV:Envelope ... >
<SOAP-ENV:Header ... />
<SOAP-ENV:Body>
<renameObject xmlns="http://www.jsr-73.org/2004/webservices/"
xmlns:jdm="http://www.jsr-73.org/2004/JDMSchema"
fromName="myClassificationSettings" toName="settings1"
objectType="BuildSettings">
</renameObject>
</SOAP-ENV:Body>
</SOAP-ENV:Envelope>

SOAP Response:
<SOAP-ENV:Envelope ... >
<SOAP-ENV:Body>
<renameObjectResponse xmlns="http://www.jsr-73.org/2004/webservices/"
xmlns:jdm="http://www.jsr-73.org/2004/JDMSchema"
fromName="myClassificationSettings" toName="settings1"
objectType="BuildSettings">
</renameObjectResponse>
</SOAP-ENV:Body>
</SOAP-ENV:Envelope>

E.2.8. Retrieving Object Components


This method is used to retrieve subobjects of a given named object. This is used, e.g., for
retrieving the model signature from a model for apply.
Method Signature
getSubObjects (
[in] content type
[in] object name
[in] object type
[out] subobjects )

WSDL Type
<complexType name="getSubObjects">
<sequence>
<element name="contentType" type="ObjectContentType"
maxOccurs="unbounded"/>
</sequence>
<attribute name="objectName" type="xsd:string" use="required"/>
<attribute name="objectType" type="NamedObjectType" use="required"/>
</complexType>
<complexType name="getSubObjectsResponse">
<sequence>
<element name="object" type="SubObjectResult" maxOccurs="unbounded"/>
</sequence>
</complexType>
<xsd:complexType name="SubObjectResult">
<xsd:sequence>
<xsd:element name="header" type="MiningObject"/>
<xsd:choice>
<xsd:element name="modelSignature" type="ModelSignature"/>
<xsd:element name="buildSettings" type="BuildSettings"/>
<xsd:element name="effectiveBuildSettings" type="BuildSettings"/>
<xsd:element name="statistics" type="AttributeStatisticsSet"/>
<xsd:element name="modelDetail" type="ModelDetail"/>
<xsd:element name="logicalData" type="LogicalData"/>
<xsd:element name="physicalDataSet" type="PhysicalDataSet"/>
<xsd:element name="taxonomy" type="Taxonomy"/>
<xsd:element name="costMatrix" type="CostMatrix"/>
<xsd:element name="applySettings" type="ApplySettings"/>
</xsd:choice>
</xsd:sequence>
<xsd:attribute name="objectCount" type="xsd:int"/>
</xsd:complexType>

Example
SOAP Request:
<SOAP-ENV:Envelope ... >
<SOAP-ENV:Header ... />
<SOAP-ENV:Body>
<getSubObjects xmlns="http://www.jsr-73.org/2004/webservices/"
xmlns:jdm="http://www.jsr-73.org/2004/JDMSchema"
objectName="myFavoriteModel" objectType="model">
<contentType>modelSignature</contentType>
</getSubObjects>
</SOAP-ENV:Body>
</SOAP-ENV:Envelope>

SOAP Response:
<SOAP-ENV:Envelope ... >
<SOAP-ENV:Body>
<getSubObjectsResponse
xmlns="http://www.jsr-73.org/2004/webservices/"
xmlns:jdm="http://www.jsr-73.org/2004/JDMSchema">
<object>
<header name="myFavoriteModel" creatorInfo="jdmExpert"/>
<modelSignature>
<attribute name="caseID" attributeType="notSpecified"
datatype="string"/>
<attribute name="age" attributeType="categorical"
datatype="integer"/>
<attribute name="income" attributeType="numerical"
datatype="double"/>
<attribute name="numChildren" attributeType="numerical"
datatype="integer"/>
</modelSignature>
</object>
</getSubObjectsResponse>
</SOAP-ENV:Body>
</SOAP-ENV:Envelope>

E.2.9. Verify Object


This method is used to determine if a mining object contains any errors or problems that
would inhibit execution or use.
Method Signature
verifyObject (
[in] object name (or) object
[in] object to be verified
[out] verification report)

WSDL Type
<complexType name="verifyObject">
<sequence>
<choice>
<element name="objectName" type="xsd:string"/>
<element name="object" type="MiningObject"/>
</choice>
</sequence>
<attribute name="objectType" type="xsd:string" use="optional"/>
</complexType>
<complexType name="verifyObjectResponse">
<sequence>
<element name="report" type="VerificationReport"/>
</sequence>
</complexType>

Example
SOAP Request:
<SOAP-ENV:Envelope ... >
<SOAP-ENV:Header ... />
<SOAP-ENV:Body>

<verifyObject xmlns="http://www.jsr-73.org/2004/webservices/"
xmlns:jdm="http://www.jsr-73.org/2004/JDMSchema"
objectType="buildSettings">
<objectName>mySettings</objectName>
</verifyObject>
</SOAP-ENV:Body>
</SOAP-ENV:Envelope>

SOAP Response:
<SOAP-ENV:Envelope ... >
<SOAP-ENV:Body>
<verifyObjectResponse
xmlns="http://www.jsr-73.org/2004/webservices/"
xmlns:jdm="http://www.jsr-73.org/2004/JDMSchema">
<verificationReport reportType="warning">
<reportText>Details of report...</reportText>
</verificationReport>
</verifyObjectResponse>
</SOAP-ENV:Body>
</SOAP-ENV:Envelope>

E.2.10. Executing tasks


This method is used to execute a mining task in the specified Data Mining Engine. It
requires the DME connection details and either the name of a stored task or an embedded
task specification as input.
A build task can be constructed in two ways: by referencing named objects, or by embedding
specifications such as the function settings. The first example below uses object references.
Method Signature
executeTask (
[in] task name (or) task
[out] execution status )

WSDL Type
<complexType name="executeTask">
<sequence>
<choice>
<element name="taskName" type="xsd:string"/>
<element name="task" type="Task"/>
</choice>
</sequence>
</complexType>
<complexType name="executeTaskResponse">
<sequence>
<choice>
<element name="status" type="ExecutionStatus"/>
<element name="recordValue" type="jdm:RecordElement"
maxOccurs="unbounded"/>
</choice>
</sequence>
</complexType>

Example
SOAP Request:
<SOAP-ENV:Envelope ... >
<SOAP-ENV:Header ... />

<SOAP-ENV:Body>
<executeTask xmlns="http://www.jsr-73.org/2004/webservices/"
xmlns:jdm="http://www.jsr-73.org/2004/JDMSchema">
<task xsi:type="BuildTask" name="myBuildTask-1">
<objectName>CensusBuildTask_A</objectName>
<modelName>Census_A</modelName>
<buildDataName>CensusBuild</buildDataName>
<buildSettingsName>Census_A_ClassificationSettings</buildSettingsName>
</task>
</executeTask>
</SOAP-ENV:Body>
</SOAP-ENV:Envelope>

SOAP Response:
<SOAP-ENV:Envelope ... >
<SOAP-ENV:Body>
<executeTaskResponse
xmlns="http://www.jsr-73.org/2004/webservices/"
xmlns:jdm="http://www.jsr-73.org/2004/JDMSchema">
<executionStatus state="queued" timestamp="April 16, 2004 13:21:33"/>
</executeTaskResponse>
</SOAP-ENV:Body>
</SOAP-ENV:Envelope>

The example above highlights a model build task. The following example provides a task
specification for single record apply involving two predictors for a churn classification
model.
SOAP Request:
<SOAP-ENV:Envelope ... >
<SOAP-ENV:Header ... />
<SOAP-ENV:Body>
<executeTask xmlns="http://www.jsr-73.org/2004/webservices/"
xmlns:jdm="http://www.jsr-73.org/2004/JDMSchema">
<task xsi:type="RecordApplyTask"
modelName="ChurnClassification32">
<recordValue name="CustomerAge" value="23"/>
<recordValue name="CustomerIncome" value="50000"/>
<recordValue name="CustomerID" value="1003-2203-120"/>
<applySettings xsi:type="ClassificationApplySettings">
<sourceDestinationMap sourceAttrName="CustomerID"
destinationAttrName="CustId"/>
<applyMap content="predictedCategory" destPhysAttrName="churn"
rank="1"/>
<applyMap content="probability" destPhysAttrName="churnProb"
rank="1"/>
</applySettings>
</task>
</executeTask>
</SOAP-ENV:Body>
</SOAP-ENV:Envelope>

SOAP Response:
<SOAP-ENV:Envelope ... >
<SOAP-ENV:Body>
<executeTaskResponse
xmlns="http://www.jsr-73.org/2004/webservices/"
xmlns:jdm="http://www.jsr-73.org/2004/JDMSchema">
<recordValue name="CustId" value="1003-2203-120"/>
<recordValue name="churn" value="1"/>
<recordValue name="churnProb" value=".87"/>
</executeTaskResponse>
</SOAP-ENV:Body>
</SOAP-ENV:Envelope>

E.2.11. Getting execution status


This method is used to get the status of a mining task in the specified Data Mining Engine.
It requires the DME connection details and the mining task name as input, and it returns
the execution status of the task as output.
Method Signature
getExecutionStatus (
[in] task name
[out] execution status )

WSDL Types
<complexType name="getExecutionStatus">
<attribute name="taskName" type="xsd:string" use="required"/>
</complexType>
<complexType name="getExecutionStatusResponse">
<sequence>
<element name="status" type="ExecutionStatus"/>
</sequence>
</complexType>

Example
SOAP Request:
<SOAP-ENV:Envelope ... >
<SOAP-ENV:Header ... />
<SOAP-ENV:Body>
<getExecutionStatus xmlns="http://www.jsr-73.org/2004/webservices/"
xmlns:jdm="http://www.jsr-73.org/2004/JDMSchema"
taskName="myBuildTask">
</getExecutionStatus>
</SOAP-ENV:Body>
</SOAP-ENV:Envelope>

SOAP Response:
<SOAP-ENV:Envelope ... >
<SOAP-ENV:Body>
<getExecutionStatusResponse
xmlns="http://www.jsr-73.org/2004/webservices/"
xmlns:jdm="http://www.jsr-73.org/2004/JDMSchema">
<executionStatus state="queued" timestamp="April 16, 2004 13:21:33"/>
</getExecutionStatusResponse>
</SOAP-ENV:Body>
</SOAP-ENV:Envelope>
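
On the client side, the state and timestamp can be read from the response body with a
namespace-aware XML parser. The following is a minimal sketch, assuming the standard Java
DOM parser (JAXP). The class name ExecutionStatusSketch and the method parse are
illustrative and are not part of JDM; the element name executionStatus and the namespace URI
follow the example above and are assumptions of this sketch. A well-formed response requires
quoted attribute values, so an unquoted value is rejected by the parser.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

// Illustrative only: this class is not part of the JDM API.
final class ExecutionStatusSketch {
    // Namespace URI as written in the example response; an assumption of this sketch.
    static final String WS_NS = "http://www.jsr-73.org/2004/webservices/";

    /** Returns {state, timestamp} read from the first executionStatus element. */
    static String[] parse(String xml) {
        Document doc;
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            doc = factory.newDocumentBuilder().parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            // Covers malformed XML, for example unquoted attribute values.
            throw new IllegalArgumentException("response is not well-formed XML", e);
        }
        Element status = (Element) doc.getElementsByTagNameNS(WS_NS, "executionStatus").item(0);
        if (status == null) {
            throw new IllegalArgumentException("no executionStatus element found");
        }
        return new String[] { status.getAttribute("state"), status.getAttribute("timestamp") };
    }

    public static void main(String[] args) {
        String body =
            "<getExecutionStatusResponse xmlns=\"" + WS_NS + "\">"
          + "<executionStatus state=\"queued\" timestamp=\"April 16, 2004 13:21:33\"/>"
          + "</getExecutionStatusResponse>";
        String[] status = parse(body);
        System.out.println(status[0] + " at " + status[1]);
    }
}
```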

E.2.12. Terminating Tasks


This method is used to terminate a queued or executing task.
Method Signature
terminateTask (
[in] task name
[out] execution status )

WSDL Types
<complexType name="terminateTask">
<attribute name="taskName" type="xsd:string" use="required"/>
</complexType>
<complexType name="terminateTaskResponse">
<sequence>
<element name="status" type="ExecutionStatus"/>
</sequence>
</complexType>

Example
SOAP Request:
<SOAP-ENV:Envelope ... >
<SOAP-ENV:Header ... />
<SOAP-ENV:Body>
<terminateTask xmlns="http://www.jsr-73.org/2004/webservices/"
xmlns:jdm="http://www.jsr-73.org/2004/JDMSchema"
taskName="myBuildTask">
</terminateTask>
</SOAP-ENV:Body>
</SOAP-ENV:Envelope>

SOAP Response:
<SOAP-ENV:Envelope ... >
<SOAP-ENV:Body>
<terminateTaskResponse
xmlns="http://www.jsr-73.org/2004/webservices/"
xmlns:jdm="http://www.jsr-73.org/2004/JDMSchema">
<executionStatus state="terminating" timestamp="April 16, 2004 13:21:33"/>
</terminateTaskResponse>
</SOAP-ENV:Body>
</SOAP-ENV:Envelope>

E.3. Java methods supporting XML


Since the XML Schema definition follows JDM closely, providing methods that produce
and consume an XML representation of JDM objects can enhance API ease of use. JDM
does provide an import and export capability, which accepts JDM 1.0 as a valid format;
however, methods that allow more immediate programmatic access to XML, outside of a task,
seem warranted.
Each named object and other major objects may be augmented with a method
toXML(): String.
Similarly, on corresponding Factory classes, the method
fromXML(String): <objectType>
translates the XML string into an instance of the specific objectType.
Lastly, to enable generic JDM object creation from an XML string, the Connection class
provides the method fromXML(String): Object. Here the string may contain any valid
JDM XML. The type of the resulting object is determined programmatically.
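
The toXML and fromXML pairing can be illustrated with a small round trip. The following is a
minimal sketch under stated assumptions, not the JDM implementation: the class NameMapSketch
is hypothetical and only mirrors the two attributes of the NameMap schema type, the element
name nameMap is invented for illustration, and fromXML is shown as a static method on the
class itself rather than on a Factory class.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

// Hypothetical stand-in for a JDM object; mirrors the NameMap schema type only.
final class NameMapSketch {
    final String sourceName;
    final String destinationName;

    NameMapSketch(String sourceName, String destinationName) {
        this.sourceName = sourceName;
        this.destinationName = destinationName;
    }

    /** Produces an XML representation; the element name "nameMap" is an assumption. */
    String toXML() {
        return "<nameMap sourceName=\"" + escape(sourceName)
            + "\" destinationName=\"" + escape(destinationName) + "\"/>";
    }

    /** Consumes the XML representation and returns an equivalent instance. */
    static NameMapSketch fromXML(String xml) {
        try {
            Element root = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml))).getDocumentElement();
            return new NameMapSketch(root.getAttribute("sourceName"),
                                     root.getAttribute("destinationName"));
        } catch (Exception e) {
            throw new IllegalArgumentException("not a well-formed nameMap document", e);
        }
    }

    // Escapes the characters that are not allowed verbatim in a quoted attribute value.
    private static String escape(String value) {
        return value.replace("&", "&amp;").replace("<", "&lt;").replace("\"", "&quot;");
    }

    public static void main(String[] args) {
        NameMapSketch original = new NameMapSketch("Census_A", "Census_B");
        NameMapSketch copy = NameMapSketch.fromXML(original.toXML());
        System.out.println(copy.sourceName + " -> " + copy.destinationName);
    }
}
```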

E.4. XML Schema Definition


The following XML Schema is based on the JDM UML model.

E.4.1. JDM Document


The JDM document definition supports object import and export using the JDM XML
representation for the ImportTask and ExportTask, respectively.
<xsd:element name="JDM">
<xsd:complexType>
<xsd:sequence>
<xsd:element name="header" type="Header"/>
<xsd:element name="object" type="NamedObject"
minOccurs="0" maxOccurs="unbounded"/>
</xsd:sequence>
<xsd:attribute name="version" type="xsd:string" use="required"/>
</xsd:complexType>
</xsd:element>
<xsd:complexType name="Header">
<xsd:sequence>
<xsd:element name="copyright" type="xsd:string" minOccurs="0"/>
<xsd:element name="timestamp" type="xsd:date" minOccurs="0"/>
<xsd:element name="applicationName" type="xsd:string" minOccurs="0"/>
<xsd:element name="applicationVersion" type="xsd:string" minOccurs="0"/>
<xsd:element name="description" type="xsd:string" minOccurs="0"/>
</xsd:sequence>
</xsd:complexType>

E.4.2. Task
<xsd:complexType name="Task">
<xsd:complexContent>
<xsd:extension base="MiningObject">
</xsd:extension>
</xsd:complexContent>
</xsd:complexType>
<xsd:complexType name="BuildTask">
<xsd:complexContent>
<xsd:extension base="Task">
<xsd:sequence>
<xsd:choice>
<xsd:element name="buildDataName" type="xsd:string"/>
<xsd:element name="buildData" type="PhysicalDataSet"/>
</xsd:choice>
<xsd:choice>
<xsd:element name="buildSettingsName" type="xsd:string"/>
<xsd:element name="buildSettings" type="BuildSettings"/>
</xsd:choice>
<xsd:choice>
<xsd:element name="validationDataName" type="xsd:string"
minOccurs="0"/>
<xsd:element name="validationData" type="PhysicalDataSet"
minOccurs="0"/>
</xsd:choice>
<xsd:element name="modelDescription" type="xsd:string" minOccurs="0"/>
<xsd:element name="buildDataMap" type="LogicalAttrNameMap"
minOccurs="0" maxOccurs="unbounded"/>
<xsd:element name="validationDataMap" type="AttributeNameMap"
minOccurs="0" maxOccurs="unbounded"/>
</xsd:sequence>

<xsd:attribute name="modelName" type="xsd:string" use="required"/>
<xsd:attribute name="inputModelName" type="xsd:string"
use="optional"/>
<xsd:attribute name="applicationName" type="xsd:string"
use="optional"/>
</xsd:extension>
</xsd:complexContent>
</xsd:complexType>
<xsd:complexType name="NameMap">
<xsd:attribute name="sourceName" type="xsd:string" use="required"/>
<xsd:attribute name="destinationName" type="xsd:string" use="required"/>
</xsd:complexType>
<xsd:complexType name="AttributeNameMap">
<xsd:attribute name="sourceAttrName" type="xsd:string" use="required"/>
<xsd:attribute name="destinationAttrName" type="xsd:string"
use="required"/>
</xsd:complexType>
<xsd:complexType name="LogicalAttrNameMap">
<xsd:attribute name="physAttrName" type="xsd:string" use="required"/>
<xsd:attribute name="logicalAttrName" type="xsd:string" use="required"/>
</xsd:complexType>
<xsd:complexType name="SignatureAttrNameMap">
<xsd:attribute name="physAttrName" type="xsd:string" use="required"/>
<xsd:attribute name="signatureAttrName" type="xsd:string"
use="required"/>
</xsd:complexType>
<xsd:complexType name="TestTask">
<xsd:complexContent>
<xsd:extension base="Task">
<xsd:sequence>
<xsd:choice>
<xsd:element name="testDataName" type="xsd:string"/>
<xsd:element name="testData" type="PhysicalDataSet"/>
</xsd:choice>
<xsd:element name="testDataMap" type="SignatureAttrNameMap"
minOccurs="0" maxOccurs="unbounded"/>
</xsd:sequence>
<xsd:attribute name="modelName" type="xsd:string" use="required"/>
<xsd:attribute name="testMetricsName" type="xsd:string"
use="required"/>
</xsd:extension>
</xsd:complexContent>
</xsd:complexType>
<xsd:complexType name="ClassificationTestTask">
<xsd:complexContent>
<xsd:extension base="TestTask">
<xsd:sequence>
<xsd:element name="computeMetric"
type="ClassificationTestMetricOption"
maxOccurs="unbounded"/>
<xsd:choice>
<xsd:element name="costMatrixName" type="xsd:string" minOccurs="0"/>
<xsd:element name="costMatrix" type="CostMatrix" minOccurs="0"/>
</xsd:choice>
<xsd:element name="positiveTargetValue" type="DataValueType"
minOccurs="0"/>
</xsd:sequence>
<xsd:attribute name="numberOfLiftQuantiles" type="xsd:int"
use="optional"/>
</xsd:extension>
</xsd:complexContent>
</xsd:complexType>

<xsd:complexType name="RegressionTestTask">
<xsd:complexContent>
<xsd:extension base="TestTask"/>
</xsd:complexContent>
</xsd:complexType>
<xsd:complexType name="ImportTask">
<xsd:complexContent>
<xsd:extension base="Task">
<xsd:sequence>
<xsd:element name="objectName" type="NameMap" maxOccurs="unbounded"/>
</xsd:sequence>
<xsd:attribute name="uri" type="xsd:anyURI" use="required"/>
<xsd:attribute name="includeModelSettings" type="xsd:boolean"
use="optional"/>
<xsd:attribute name="useOriginalCreationDates" type="xsd:boolean"
use="optional"/>
<xsd:attribute name="populateSummary" type="xsd:boolean"
use="optional"/>
</xsd:extension>
</xsd:complexContent>
</xsd:complexType>
<xsd:complexType name="ImportSummary">
<xsd:sequence>
<xsd:element name="objectName" type="xsd:string" maxOccurs="unbounded"/>
</xsd:sequence>
<xsd:attribute name="objectCount" type="xsd:int" use="required"/>
<xsd:attribute name="creationDate" type="xsd:string" use="required"/>
<xsd:attribute name="format" type="ImportExportFormat" use="required"/>
</xsd:complexType>
<xsd:complexType name="ExportTask">
<xsd:complexContent>
<xsd:extension base="Task">
<xsd:sequence>
<xsd:element name="objectName" type="xsd:string" maxOccurs="unbounded"/>
</xsd:sequence>
<xsd:attribute name="uri" type="xsd:anyURI" use="required"/>
<xsd:attribute name="format" type="xsd:string" use="required"/>
<xsd:attribute name="includeModelSettings" type="xsd:boolean"
use="optional"/>
</xsd:extension>
</xsd:complexContent>
</xsd:complexType>
<xsd:simpleType name="ImportExportFormatStd">
<xsd:restriction base="xsd:string">
<xsd:enumeration value="JDM1_0"/>
<xsd:enumeration value="PMML1_0"/>
<xsd:enumeration value="PMML2_0"/>
<xsd:enumeration value="PMML2_1"/>
<xsd:enumeration value="PMML3_0"/>
<xsd:enumeration value="CWM1_0"/>
<xsd:enumeration value="CWM1_1"/>
</xsd:restriction>
</xsd:simpleType>
<xsd:simpleType name="SettingsInclusionOption">
<xsd:restriction base="xsd:string">
<xsd:enumeration value="systemDefault"/>
<xsd:enumeration value="none"/>
<xsd:enumeration value="settings"/>
<xsd:enumeration value="effectiveSettings"/>
<xsd:enumeration value="settingsOnly"/>
<xsd:enumeration value="effectiveSettingsOnly"/>
<xsd:enumeration value="allSettingsOnly"/>

<xsd:enumeration value="all"/>
</xsd:restriction>
</xsd:simpleType>
<xsd:complexType name="ComputeStatisticsTask">
<xsd:complexContent>
<xsd:extension base="Task">
<xsd:sequence>
<xsd:choice>
<xsd:element name="physicalDataName" type="xsd:string"/>
<xsd:element name="physicalData" type="PhysicalDataSet"/>
</xsd:choice>
<xsd:choice>
<xsd:element name="logicalDataName" type="xsd:string" minOccurs="0"/>
<xsd:element name="logicalData" type="LogicalData" minOccurs="0"/>
</xsd:choice>
</xsd:sequence>
</xsd:extension>
</xsd:complexContent>
</xsd:complexType>

E.4.3. Task.Apply
<xsd:complexType name="DataSetApplyTask">
<xsd:complexContent>
<xsd:extension base="Task">
<xsd:sequence>
<xsd:choice>
<xsd:element name="applyDataName" type="xsd:string"/>
<xsd:element name="applyData" type="PhysicalDataSet"/>
</xsd:choice>
<xsd:choice>
<xsd:element name="applySettingsName" type="xsd:string"/>
<xsd:element name="applySettings" type="ApplySettings"/>
</xsd:choice>
<xsd:element name="applyDataMap" type="SignatureAttrNameMap"
minOccurs="0" maxOccurs="unbounded"/>
</xsd:sequence>
<xsd:attribute name="modelName" type="xsd:string" use="required"/>
</xsd:extension>
</xsd:complexContent>
</xsd:complexType>
<xsd:complexType name="RecordApplyTask">
<xsd:complexContent>
<xsd:extension base="Task">
<xsd:sequence>
<xsd:element name="recordValue" type="RecordElement"
maxOccurs="unbounded"/>
<xsd:choice>
<xsd:element name="applySettingsName" type="xsd:string"/>
<xsd:element name="applySettings" type="ApplySettings"/>
</xsd:choice>
</xsd:sequence>
<xsd:attribute name="modelName" type="xsd:string" use="required"/>
</xsd:extension>
</xsd:complexContent>
</xsd:complexType>
<xsd:complexType name="RecordElement">
<xsd:sequence>
<xsd:element name="value" type="DataValueType"/>
</xsd:sequence>
<xsd:attribute name="name" type="xsd:string" use="required"/>
</xsd:complexType>
<xsd:complexType name="ApplySettings">
<xsd:complexContent>
<xsd:extension base="MiningObject">
<xsd:sequence>
<xsd:element name="sourceDestinationMap" type="AttributeNameMap"
minOccurs="0" maxOccurs="unbounded"/>
</xsd:sequence>
</xsd:extension>
</xsd:complexContent>
</xsd:complexType>

E.4.4. Data
<xsd:complexType name="PhysicalDataSet">
<xsd:complexContent>
<xsd:extension base="MiningObject">
<xsd:sequence>
<xsd:element name="uri" type="xsd:anyURI"/>
<xsd:element name="physicalAttribute" type="PhysicalAttribute"
minOccurs="0" maxOccurs="unbounded"/>
<xsd:element name="attributeStatistics" type="AttributeStatisticsSet"
minOccurs="0"/>
</xsd:sequence>
<xsd:attribute name="attributeCount" type="xsd:int" use="optional"/>
</xsd:extension>
</xsd:complexContent>
</xsd:complexType>
<xsd:complexType name="PhysicalDataRecord">
<xsd:sequence>
<xsd:element name="entry" type="PhysicalAttributeValue" minOccurs="0"
maxOccurs="unbounded"/>
</xsd:sequence>
<xsd:attribute name="attributeCount" type="xsd:int" use="optional"/>
</xsd:complexType>
<xsd:complexType name="PhysicalAttributeValue">
<xsd:sequence>
<xsd:element name="attribute" type="PhysicalAttribute"/>
<xsd:element name="value" type="DataValueType"/>
</xsd:sequence>
</xsd:complexType>
<xsd:complexType name="DataValueType" abstract="true">
<xsd:sequence>
</xsd:sequence>
</xsd:complexType>
<xsd:complexType name="DecimalValue">
<xsd:complexContent>
<xsd:extension base="DataValueType">
<xsd:attribute name="decimal" type="xsd:double" use="required"/>
</xsd:extension>
</xsd:complexContent>
</xsd:complexType>
<xsd:complexType name="StringValue">
<xsd:complexContent>
<xsd:extension base="DataValueType">
<xsd:sequence>
<xsd:element name="string" type="xsd:string"/>
</xsd:sequence>
</xsd:extension>
</xsd:complexContent>
</xsd:complexType>
<xsd:complexType name="PhysicalAttribute">
<xsd:complexContent>
<xsd:extension base="Attribute">
<xsd:attribute name="dataType" type="AttributeDataType"
use="required"/>
<xsd:attribute name="role" type="PhysicalAttributeRole"
use="optional"/>
</xsd:extension>
</xsd:complexContent>
</xsd:complexType>
<xsd:complexType name="Attribute">
<xsd:attribute name="name" type="xsd:string" use="required"/>
<xsd:attribute name="description" type="xsd:string" use="optional"/>
</xsd:complexType>
<xsd:simpleType name="AttributeDataTypeStd">
<xsd:restriction base="xsd:string">
<xsd:enumeration value="unknownType"/>
<xsd:enumeration value="stringType"/>
<xsd:enumeration value="doubleType"/>
<xsd:enumeration value="integerType"/>
</xsd:restriction>
</xsd:simpleType>
<xsd:simpleType name="PhysicalAttributeRoleStd">
<xsd:restriction base="xsd:string">
<xsd:enumeration value="taxonomyParentId"/>
<xsd:enumeration value="taxonomyChildId"/>
<xsd:enumeration value="attributeValue"/>
<xsd:enumeration value="attributeName"/>
<xsd:enumeration value="caseId"/>
<xsd:enumeration value="data"/>
</xsd:restriction>
</xsd:simpleType>
<xsd:complexType name="LogicalData">
<xsd:complexContent>
<xsd:extension base="MiningObject">
<xsd:sequence>
<xsd:element name="logicalAttribute" type="LogicalAttribute"
minOccurs="0" maxOccurs="unbounded"/>
</xsd:sequence>
<xsd:attribute name="attributeCount" type="xsd:int" use="optional"/>
</xsd:extension>
</xsd:complexContent>
</xsd:complexType>
<xsd:complexType name="LogicalAttribute">
<xsd:complexContent>
<xsd:extension base="Attribute">
<xsd:sequence>
<xsd:element name="categorySet" type="CategorySet" minOccurs="0"/>
</xsd:sequence>
<xsd:attribute name="attributeType" type="AttributeType"
use="optional"/>
<xsd:attribute name="dataPreparationStatus" type="DataPreparationStatus"
use="optional"/>
<xsd:attribute name="isDiscrete" type="xsd:boolean" use="optional"/>
</xsd:extension>
</xsd:complexContent>
</xsd:complexType>
<xsd:simpleType name="AttributeTypeStd">
<xsd:restriction base="xsd:string">
<xsd:enumeration value="notSpecified"/>
<xsd:enumeration value="numerical"/>
<xsd:enumeration value="ordinal"/>
<xsd:enumeration value="categorical"/>
</xsd:restriction>
</xsd:simpleType>
<xsd:simpleType name="DataPreparationStatusStd">
<xsd:restriction base="xsd:string">
<xsd:enumeration value="prepared"/>
<xsd:enumeration value="unprepared"/>
</xsd:restriction>

</xsd:simpleType>
<xsd:complexType name="CategorySet">
<xsd:sequence>
<xsd:element name="categoryValue" type="CategoryValue" minOccurs="0"
maxOccurs="unbounded"/>
</xsd:sequence>
<xsd:attribute name="dataType" type="AttributeDataType" use="required"/>
<xsd:attribute name="size" type="xsd:int" use="optional"/>
<xsd:attribute name="name" type="xsd:string" use="required"/>
</xsd:complexType>
<xsd:simpleType name="CategoryPropertyStd">
<xsd:restriction base="xsd:string">
<xsd:enumeration value="missing"/>
<xsd:enumeration value="unknown"/>
<xsd:enumeration value="error"/>
<xsd:enumeration value="valid"/>
</xsd:restriction>
</xsd:simpleType>
<xsd:complexType name="CategoryValue">
<xsd:sequence>
<xsd:element name="categoryValue" type="DataValueType"/>
</xsd:sequence>
<xsd:attribute name="name" type="xsd:string" use="optional"/>
<xsd:attribute name="index" type="xsd:integer" use="optional"/>
<xsd:attribute name="property" type="CategoryProperty" use="optional"/>
</xsd:complexType>
<xsd:complexType name="Interval">
<xsd:sequence>
<xsd:element name="startPoint" type="xsd:double"/>
<xsd:element name="endPoint" type="xsd:double"/>
<xsd:element name="intervalClosure" type="IntervalClosure"/>
</xsd:sequence>
</xsd:complexType>
<xsd:simpleType name="IntervalClosure">
<xsd:restriction base="xsd:string">
<xsd:enumeration value="openOpen"/>
<xsd:enumeration value="openClosed"/>
<xsd:enumeration value="closedOpen"/>
<xsd:enumeration value="closedClosed"/>
</xsd:restriction>
</xsd:simpleType>
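
The four IntervalClosure values determine whether each end point belongs to the interval.
The following is a minimal sketch of a membership test, assuming that the first word of each
value describes the start point and the second word describes the end point (so closedOpen
includes startPoint and excludes endPoint). The class IntervalSketch is hypothetical and is
not part of JDM.

```java
// Hypothetical helper mirroring the Interval and IntervalClosure schema types.
final class IntervalSketch {
    enum Closure { openOpen, openClosed, closedOpen, closedClosed }

    final double startPoint;
    final double endPoint;
    final Closure closure;

    IntervalSketch(double startPoint, double endPoint, Closure closure) {
        this.startPoint = startPoint;
        this.endPoint = endPoint;
        this.closure = closure;
    }

    /** Returns true if x lies in the interval, honoring the closure of each end point. */
    boolean contains(double x) {
        // Assumption: first word = start point closure, second word = end point closure.
        boolean startClosed = closure == Closure.closedOpen || closure == Closure.closedClosed;
        boolean endClosed = closure == Closure.openClosed || closure == Closure.closedClosed;
        boolean aboveStart = startClosed ? x >= startPoint : x > startPoint;
        boolean belowEnd = endClosed ? x <= endPoint : x < endPoint;
        return aboveStart && belowEnd;
    }

    public static void main(String[] args) {
        IntervalSketch ageBand = new IntervalSketch(18.0, 65.0, Closure.closedOpen);
        System.out.println(ageBand.contains(18.0)); // start point is included
        System.out.println(ageBand.contains(65.0)); // end point is excluded
    }
}
```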
<xsd:complexType name="Taxonomy">
<xsd:complexContent>
<xsd:extension base="MiningObject">
<xsd:sequence>
<xsd:element name="name" type="xsd:string"/>
<xsd:choice>
<xsd:choice>
<xsd:element name="dataReference" type="PhysicalDataSet"/>
<xsd:element name="dataReferenceName" type="xsd:string"/>
</xsd:choice>
<xsd:element name="elements" type="TaxonomyElement" minOccurs="0"
maxOccurs="unbounded"/>
</xsd:choice>
</xsd:sequence>
</xsd:extension>
</xsd:complexContent>
</xsd:complexType>
<xsd:complexType name="TaxonomyElement">
<xsd:sequence>
<xsd:element name="parent" type="DataValueType"/>
<xsd:element name="child" type="DataValueType" maxOccurs="unbounded"/>
</xsd:sequence>
</xsd:complexType>
<xsd:complexType name="ModelSignature">
<xsd:sequence>
<xsd:element name="attribute" type="SignatureAttribute" minOccurs="0"
maxOccurs="unbounded"/>
</xsd:sequence>
</xsd:complexType>
<xsd:complexType name="SignatureAttribute">
<xsd:complexContent>
<xsd:extension base="Attribute">
<xsd:attribute name="attributeType" type="AttributeType"
use="required"/>
<xsd:attribute name="dataType" type="AttributeDataType"
use="required"/>
<xsd:attribute name="rank" type="xsd:int" use="optional"/>
<xsd:attribute name="importanceValue" type="xsd:double"
use="optional"/>
</xsd:extension>
</xsd:complexContent>
</xsd:complexType>
<xsd:complexType name="CategoryMatrixElement">
<xsd:sequence>
<xsd:element name="predictedCategory" type="DataValueType"/>
<xsd:element name="actualCategory" type="DataValueType"/>
</xsd:sequence>
<xsd:attribute name="value" type="xsd:double" use="required"/>
</xsd:complexType>

E.4.5. Supervised
<xsd:complexType name="SupervisedSettings" abstract="true">
<xsd:complexContent>
<xsd:extension base="BuildSettings">
<xsd:attribute name="targetAttributeName" type="xsd:string"
use="required"/>
</xsd:extension>
</xsd:complexContent>
</xsd:complexType>
<xsd:complexType name="TestMetrics">
<xsd:complexContent>
<xsd:extension base="MiningObject">
<xsd:sequence>
<xsd:choice>
<xsd:element name="testDataName" type="xsd:string"/>
<xsd:element name="testData" type="PhysicalDataSet"/>
</xsd:choice>
</xsd:sequence>
<xsd:attribute name="taskIdentifier" type="xsd:string"
use="optional"/>
<xsd:attribute name="modelName" type="xsd:string" use="required"/>
</xsd:extension>
</xsd:complexContent>
</xsd:complexType>
<xsd:complexType name="SupervisedAlgorithmSettings">
<xsd:complexContent>
<xsd:extension base="AlgorithmSettings"/>
</xsd:complexContent>
</xsd:complexType>
<xsd:complexType name="SupervisedModel" abstract="true">
<xsd:complexContent>
<xsd:extension base="Model">
<xsd:attribute name="targetAttributeName" type="xsd:string"/>
</xsd:extension>
</xsd:complexContent>
</xsd:complexType>

E.4.6. Supervised.Classification
<xsd:complexType name="ClassificationTestMetrics">
<xsd:complexContent>
<xsd:extension base="TestMetrics">
<xsd:sequence>
<xsd:element name="confusionMatrix" type="ConfusionMatrix"
minOccurs="0"/>
<xsd:element name="lift" type="Lift" minOccurs="0"/>
<xsd:element name="ROC" type="ReceiverOperatingCharacteristics"
minOccurs="0"/>
</xsd:sequence>
<xsd:attribute name="accuracy" type="xsd:double" use="optional"/>
</xsd:extension>
</xsd:complexContent>
</xsd:complexType>
<xsd:simpleType name="ClassificationTestMetricOptionStd">
<xsd:restriction base="xsd:string">
<xsd:enumeration value="confusionMatrix"/>
<xsd:enumeration value="lift"/>
<xsd:enumeration value="receiverOperatingCharacteristics"/>
</xsd:restriction>
</xsd:simpleType>
<xsd:complexType name="ConfusionMatrix">
<xsd:sequence>
<xsd:element name="category" type="DataValueType" minOccurs="2"
maxOccurs="unbounded"/>
<xsd:element name="countElement" type="CategoryMatrixElement"
minOccurs="4" maxOccurs="unbounded"/>
</xsd:sequence>
<xsd:attribute name="accuracy" type="xsd:decimal" use="optional"/>
<xsd:attribute name="error" type="xsd:decimal" use="optional"/>
<xsd:attribute name="numberOfPredictions" type="xsd:int"
use="optional"/>
</xsd:complexType>
<xsd:complexType name="CostMatrix">
<xsd:complexContent>
<xsd:extension base="MiningObject">
<xsd:sequence>
<xsd:element name="category" type="DataValueType" minOccurs="2"
maxOccurs="unbounded"/>
<xsd:element name="costElement" type="CategoryMatrixElement"
minOccurs="1" maxOccurs="unbounded"/>
</xsd:sequence>
</xsd:extension>
</xsd:complexContent>
</xsd:complexType>
<xsd:complexType name="ReceiverOperatingCharacteristics">
<xsd:sequence>
<xsd:element name="elements" type="ROCElement" maxOccurs="unbounded"/>
</xsd:sequence>
<xsd:attribute name="numberOfThresholdCandidates" type="xsd:int"/>
</xsd:complexType>
<xsd:complexType name="ROCElement">
<xsd:attribute name="index" type="xsd:int"/>
<xsd:attribute name="probabilityThreshold" type="xsd:double"/>
<xsd:attribute name="hitRate" type="xsd:double"/>
<xsd:attribute name="falseAlarmRate" type="xsd:double"/>
<xsd:attribute name="truePositiveCount" type="xsd:int"/>
<xsd:attribute name="trueNegativeCount" type="xsd:int"/>
<xsd:attribute name="falsePositiveCount" type="xsd:int"/>
<xsd:attribute name="falseNegativeCount" type="xsd:int"/>
</xsd:complexType>
<xsd:complexType name="Lift">
<xsd:sequence>
<xsd:element name="liftElement" type="LiftElement" minOccurs="2"
maxOccurs="unbounded"/>
<xsd:element name="positiveTargetValue" type="DataValueType"/>
</xsd:sequence>
<xsd:attribute name="numberOfQuantiles" type="xsd:int" use="required"/>
<xsd:attribute name="totalCases" type="xsd:int" use="required"/>
<xsd:attribute name="totalPositiveCases" type="xsd:int" use="required"/>
<xsd:attribute name="targetAttributeName" type="xsd:string"
use="required"/>
</xsd:complexType>
<xsd:complexType name="LiftElement">
<xsd:sequence>
<xsd:element name="cumulativeCases" type="xsd:int" minOccurs="0"/>
<xsd:element name="numberOfPositiveCases" type="xsd:int" minOccurs="0"/>
<xsd:element name="cumulativePositiveCases" type="xsd:int" minOccurs="0"/>
<xsd:element name="numberOfNegativeCases" type="xsd:int" minOccurs="0"/>
<xsd:element name="cumulativeNegativeCases" type="xsd:int" minOccurs="0"/>
<xsd:element name="percentageSize" type="xsd:decimal" minOccurs="0"/>
<xsd:element name="cumulativePercentageSize" type="xsd:double"
minOccurs="0"/>
<xsd:element name="targetDensity" type="xsd:double" minOccurs="0"/>
<xsd:element name="cumulativeTargetDensity" type="xsd:double"
minOccurs="0"/>
</xsd:sequence>
<xsd:attribute name="quantileIndex" type="xsd:int" use="required"/>
<xsd:attribute name="lift" type="xsd:double" use="required"/>
<xsd:attribute name="cumulativeLift" type="xsd:double" use="optional"/>
<xsd:attribute name="cases" type="xsd:int" use="optional"/>
</xsd:complexType>
<xsd:complexType name="ClassificationApplySettings">
<xsd:complexContent>
<xsd:extension base="ApplySettings">
<xsd:sequence>
<xsd:choice>
<xsd:element name="costMatrixName" type="xsd:string"
minOccurs="0"/>
<xsd:element name="costMatrix" type="CostMatrix"
minOccurs="0"/>
</xsd:choice>
<xsd:element name="rankMap" type="ClassificationApplyMap"
maxOccurs="unbounded"/>
<xsd:element name="categoryMap" type="ClassificationCategoryMap"
maxOccurs="unbounded"/>
<xsd:element name="predictionMap" type="PredictionMap"
maxOccurs="unbounded"/>
</xsd:sequence>
</xsd:extension>
</xsd:complexContent>
</xsd:complexType>
<xsd:complexType name="ClassificationApplyMap">
<xsd:attribute name="content" type="ClassificationApplyContent"
use="required"/>
<xsd:attribute name="destPhysAttrName" type="xsd:string"
use="required"/>
<xsd:attribute name="rank" type="xsd:int" use="optional"/>
<xsd:attribute name="category" type="xsd:string" use="optional"/>
</xsd:complexType>
<xsd:simpleType name="ClassificationApplyContentStd">
<xsd:restriction base="xsd:string">
<xsd:enumeration value="predictedCategory"/>
<xsd:enumeration value="probability"/>
<xsd:enumeration value="cost"/>
<xsd:enumeration value="nodeId"/>
</xsd:restriction>
</xsd:simpleType>
<xsd:complexType name="ClassificationSettings">
<xsd:complexContent>
<xsd:extension base="SupervisedSettings">
<xsd:sequence>
<xsd:choice>
<xsd:element name="costMatrixName" type="xsd:string" minOccurs="0"/>
<xsd:element name="costMatrix" type="CostMatrix" minOccurs="0"/>
</xsd:choice>
<xsd:element name="priorProbabilities" type="PriorProbabilities"
minOccurs="0"/>
</xsd:sequence>
<xsd:attribute name="usePriors" type="xsd:boolean" default="false"
use="optional"/>
</xsd:extension>
</xsd:complexContent>
</xsd:complexType>
<xsd:complexType name="PriorProbabilities">
<xsd:sequence>
<xsd:element name="entry" type="PriorsEntry" maxOccurs="unbounded"/>
</xsd:sequence>
<xsd:attribute name="attributeName" type="xsd:string"
use="required"/>
</xsd:complexType>
<xsd:complexType name="PriorsEntry">
<xsd:sequence>
<xsd:element name="attributeValue" type="DataValueType"/>
<xsd:element name="priorProbability" type="xsd:double"/>
</xsd:sequence>
</xsd:complexType>
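
The ROCElement type in this section carries a hit rate and a false alarm rate alongside raw outcome counts. As a rough illustration of how such values usually relate, the following Java sketch derives both rates from the counts. It assumes the conventional definitions (hit rate = TP / (TP + FN), false alarm rate = FP / (FP + TN)) and that the fourth count in ROCElement is the false-negative count. The class and method names are hypothetical and are not part of the JDM API.

```java
/** Hypothetical helper, not part of JDM: derives ROC rates from outcome counts. */
final class RocRates {
    private RocRates() {}

    /** Hit rate (true positive rate): TP / (TP + FN); 0.0 when there are no actual positives. */
    static double hitRate(int truePositives, int falseNegatives) {
        int actualPositives = truePositives + falseNegatives;
        return actualPositives == 0 ? 0.0 : (double) truePositives / actualPositives;
    }

    /** False alarm rate (false positive rate): FP / (FP + TN); 0.0 when there are no actual negatives. */
    static double falseAlarmRate(int falsePositives, int trueNegatives) {
        int actualNegatives = falsePositives + trueNegatives;
        return actualNegatives == 0 ? 0.0 : (double) falsePositives / actualNegatives;
    }

    public static void main(String[] args) {
        // Made-up counts for a single probability threshold.
        System.out.println("hitRate=" + hitRate(30, 10));
        System.out.println("falseAlarmRate=" + falseAlarmRate(5, 15));
    }
}
```

Returning 0.0 for an empty denominator is a choice made for this sketch only; an implementation might equally report the rate as undefined.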

E.4.7. Supervised.Regression
<xsd:complexType name="RegressionSettings">
<xsd:complexContent>
<xsd:extension base="SupervisedSettings"/>
</xsd:complexContent>
</xsd:complexType>
<xsd:complexType name="RegressionTestMetrics">
<xsd:complexContent>
<xsd:extension base="TestMetrics">
<xsd:attribute name="meanPredictedValue" type="xsd:decimal"
use="optional"/>
<xsd:attribute name="meanActualValue" type="xsd:decimal"
use="optional"/>
<xsd:attribute name="meanAbsoluteError" type="xsd:decimal"
use="optional"/>
<xsd:attribute name="rmsError" type="xsd:decimal" use="optional"/>
<xsd:attribute name="rSquared" type="xsd:decimal" use="optional"/>
</xsd:extension>
</xsd:complexContent>
</xsd:complexType>
<xsd:complexType name="RegressionApplySettings">
<xsd:complexContent>
<xsd:extension base="ApplySettings">
<xsd:sequence>
<xsd:element name="applyMap" type="RegressionApplyMap"
maxOccurs="unbounded"/>
</xsd:sequence>
</xsd:extension>
</xsd:complexContent>
</xsd:complexType>
<xsd:complexType name="RegressionApplyMap">
<xsd:attribute name="content" type="RegressionApplyContent"
use="required"/>
<xsd:attribute name="destPhysAttrName" type="xsd:string"
use="required"/>
</xsd:complexType>
<xsd:simpleType name="RegressionApplyContentStd">
<xsd:restriction base="xsd:string">
<xsd:enumeration value="predictedValue"/>
<xsd:enumeration value="confidence"/>
</xsd:restriction>
</xsd:simpleType>
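
RegressionTestMetrics exposes meanAbsoluteError, rmsError, and rSquared, but this listing does not state how they are computed. The Java sketch below shows the usual textbook definitions, on the assumption that JDM follows them; in particular, R-squared is taken to be 1 - SS_res / SS_tot. The class and its methods are hypothetical and are not part of the JDM API.

```java
/** Hypothetical helper, not part of JDM: common regression error measures. */
final class RegressionMetrics {
    private RegressionMetrics() {}

    /** Mean of |actual - predicted| over all cases. */
    static double meanAbsoluteError(double[] actual, double[] predicted) {
        checkSameLength(actual, predicted);
        double sum = 0.0;
        for (int i = 0; i < actual.length; i++) {
            sum += Math.abs(actual[i] - predicted[i]);
        }
        return sum / actual.length;
    }

    /** Square root of the mean squared error. */
    static double rmsError(double[] actual, double[] predicted) {
        return Math.sqrt(sumOfSquaredErrors(actual, predicted) / actual.length);
    }

    /** Assumed definition: 1 - SS_res / SS_tot. Not finite when all actual values are equal. */
    static double rSquared(double[] actual, double[] predicted) {
        double ssRes = sumOfSquaredErrors(actual, predicted);
        double mean = 0.0;
        for (double a : actual) {
            mean += a;
        }
        mean /= actual.length;
        double ssTot = 0.0;
        for (double a : actual) {
            ssTot += (a - mean) * (a - mean);
        }
        return 1.0 - ssRes / ssTot;
    }

    private static double sumOfSquaredErrors(double[] actual, double[] predicted) {
        checkSameLength(actual, predicted);
        double sum = 0.0;
        for (int i = 0; i < actual.length; i++) {
            double error = actual[i] - predicted[i];
            sum += error * error;
        }
        return sum;
    }

    private static void checkSameLength(double[] a, double[] b) {
        if (a.length == 0 || a.length != b.length) {
            throw new IllegalArgumentException("arrays must be non-empty and of equal length");
        }
    }

    public static void main(String[] args) {
        double[] actual = {1.0, 2.0, 3.0, 4.0};
        double[] predicted = {1.0, 2.0, 3.0, 6.0};
        System.out.println("meanAbsoluteError=" + meanAbsoluteError(actual, predicted));
        System.out.println("rmsError=" + rmsError(actual, predicted));
        System.out.println("rSquared=" + rSquared(actual, predicted));
    }
}
```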

E.4.8. Clustering
<xsd:complexType name="ClusteringApplySettings">
<xsd:complexContent>
<xsd:extension base="ApplySettings">
<xsd:sequence>
<xsd:element name="rankMap" type="ClusteringApplyMap" minOccurs="0"
maxOccurs="unbounded"/>
<xsd:element name="clusterIdentifierMap" type="ClusterIdentifierMap"
minOccurs="0" maxOccurs="unbounded"/>
<xsd:element name="ClusterMap" type="ClusteringApplyMap" minOccurs="0"
maxOccurs="unbounded"/>
</xsd:sequence>
</xsd:extension>
</xsd:complexContent>
</xsd:complexType>
<xsd:complexType name="ClusteringApplyMap">
<xsd:sequence>
<xsd:element name="destPhysicalAttrName" type="xsd:string"
maxOccurs="unbounded"/>
</xsd:sequence>
<xsd:attribute name="content" type="ClusteringApplyContent"
use="required"/>
<xsd:attribute name="fromTop" type="xsd:boolean" use="required"/>
</xsd:complexType>
<xsd:complexType name="ClusterIdentifierMap">
<xsd:attribute name="clusterID" type="xsd:int" use="required"/>
<xsd:attribute name="content" type="ClusteringApplyContent"
use="required"/>
<xsd:attribute name="destPhysicalAttrName" type="xsd:string"
use="required"/>
</xsd:complexType>
<xsd:complexType name="ClusterMap">
<xsd:attribute name="content" type="ClusteringApplyContent"
use="required"/>
<xsd:attribute name="baseDestPhysicalAttrName" type="xsd:string"
use="required"/>
</xsd:complexType>
<xsd:simpleType name="ClusteringApplyContentStd">
<xsd:restriction base="xsd:string">
<xsd:enumeration value="clusterIdentifier"/>
<xsd:enumeration value="probability"/>
<xsd:enumeration value="qualityOfFit"/>
<xsd:enumeration value="distance"/>
</xsd:restriction>
</xsd:simpleType>
<xsd:complexType name="ClusteringSettings">
<xsd:complexContent>
<xsd:extension base="BuildSettings">
<xsd:sequence>
<xsd:element name="aggregationFunction" type="AggregationFunction"
minOccurs="0"/>
<xsd:element name="maxClusterCaseCount" type="xsd:int" minOccurs="0"/>
<xsd:element name="maxLevels" type="xsd:int" minOccurs="0"/>
<xsd:element name="maxNumberOfClusters" type="xsd:int" minOccurs="0"/>
<xsd:element name="minClusterCaseCount" type="xsd:int" minOccurs="0"/>
<xsd:sequence minOccurs="0" maxOccurs="unbounded">
<xsd:element name="attrCompLogicalAttr" type="xsd:string"/>
<xsd:element name="attributeComparisonFunction"
type="AttributeComparisonFunction"/>
</xsd:sequence>
<xsd:sequence minOccurs="0" maxOccurs="unbounded">
<xsd:element name="similarityMatrixLogicalAttr"
type="xsd:string"/>
<xsd:element name="similarityMatrix" type="SimilarityMatrix"/>
</xsd:sequence>
</xsd:sequence>
</xsd:extension>
</xsd:complexContent>
</xsd:complexType>
<xsd:simpleType name="AggregationFunctionStd">
<xsd:restriction base="xsd:string">
<xsd:enumeration value="binarySimilarity"/>
<xsd:enumeration value="tanimoto"/>
<xsd:enumeration value="jaccard"/>
<xsd:enumeration value="simpleMatching"/>
<xsd:enumeration value="minkowski"/>
<xsd:enumeration value="cityBlock"/>
<xsd:enumeration value="chebychev"/>
<xsd:enumeration value="squaredEuclidean"/>
<xsd:enumeration value="euclidean"/>
<xsd:enumeration value="systemDetermined"/>
<xsd:enumeration value="systemDefault"/>
</xsd:restriction>
</xsd:simpleType>
<xsd:simpleType name="AttributeComparisonFunctionStd">
<xsd:restriction base="xsd:string">
<xsd:enumeration value="similarityMatrix"/>
<xsd:enumeration value="equal"/>
<xsd:enumeration value="delta"/>
<xsd:enumeration value="gaussSim"/>
<xsd:enumeration value="absDiff"/>
<xsd:enumeration value="systemDetermined"/>
<xsd:enumeration value="systemDefault"/>
</xsd:restriction>
</xsd:simpleType>
<xsd:complexType name="SimilarityMatrix">
<xsd:sequence>
<xsd:element name="category" type="DataValueType" minOccurs="2"
maxOccurs="unbounded"/>
<xsd:element name="similarityElement" type="CategoryMatrixElement"
minOccurs="1" maxOccurs="unbounded"/>
</xsd:sequence>
</xsd:complexType>
<xsd:complexType name="ClusteringSignatureAttribute">
<xsd:complexContent>
<xsd:extension base="SignatureAttribute">
<xsd:sequence>
<xsd:element name="attributeComparisonFunction"
type="AttributeComparisonFunction" minOccurs="0"/>
<xsd:element name="similarityMatrix" type="SimilarityMatrix"
minOccurs="0"/>
</xsd:sequence>
<xsd:attribute name="similarityScale" type="xsd:double"
use="optional"/>
</xsd:extension>
</xsd:complexContent>
</xsd:complexType>
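
AggregationFunctionStd names several distance measures (euclidean, squaredEuclidean, cityBlock, chebychev, minkowski) without defining them in this listing. The Java sketch below shows their usual textbook forms over numeric vectors. It is an assumption that JDM intends these definitions; how per-attribute comparison functions feed into the aggregation is not covered here. All names in the code are hypothetical and are not part of the JDM API.

```java
/** Hypothetical helper, not part of JDM: textbook distance measures over numeric vectors. */
final class Distances {
    private Distances() {}

    /** Sum of absolute coordinate differences. */
    static double cityBlock(double[] a, double[] b) {
        checkSameLength(a, b);
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            sum += Math.abs(a[i] - b[i]);
        }
        return sum;
    }

    /** Sum of squared coordinate differences. */
    static double squaredEuclidean(double[] a, double[] b) {
        checkSameLength(a, b);
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return sum;
    }

    /** Square root of the squared Euclidean distance. */
    static double euclidean(double[] a, double[] b) {
        return Math.sqrt(squaredEuclidean(a, b));
    }

    /** Largest absolute coordinate difference. */
    static double chebychev(double[] a, double[] b) {
        checkSameLength(a, b);
        double max = 0.0;
        for (int i = 0; i < a.length; i++) {
            max = Math.max(max, Math.abs(a[i] - b[i]));
        }
        return max;
    }

    /** Minkowski distance of order p (p >= 1): (sum |a_i - b_i|^p)^(1/p). */
    static double minkowski(double[] a, double[] b, double p) {
        checkSameLength(a, b);
        if (p < 1.0) {
            throw new IllegalArgumentException("p must be >= 1");
        }
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            sum += Math.pow(Math.abs(a[i] - b[i]), p);
        }
        return Math.pow(sum, 1.0 / p);
    }

    private static void checkSameLength(double[] a, double[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("vectors must have equal length");
        }
    }

    public static void main(String[] args) {
        double[] origin = {0.0, 0.0};
        double[] point = {3.0, 4.0};
        System.out.println("cityBlock=" + cityBlock(origin, point));
        System.out.println("euclidean=" + euclidean(origin, point));
        System.out.println("chebychev=" + chebychev(origin, point));
    }
}
```

With p = 1 the Minkowski form reduces to cityBlock and with p = 2 to euclidean, which is why a single parameterized function can stand in for both.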

E.4.9. Association
<xsd:complexType name="AssociationSettings">
<xsd:complexContent>
<xsd:extension base="BuildSettings">
<xsd:sequence>
<xsd:element name="includedItem" type="DataValueType"
minOccurs="0" maxOccurs="unbounded"/>
<xsd:element name="excludedItem" type="DataValueType"
minOccurs="0" maxOccurs="unbounded"/>
<xsd:element name="attributeTaxonomy" type="AttributeTaxonomy"
minOccurs="0" maxOccurs="unbounded"/>
</xsd:sequence>
<xsd:attribute name="maxNumberOfRules" type="xsd:int"
use="optional"/>
<xsd:attribute name="maxRuleLength" type="xsd:int" use="optional"/>
<xsd:attribute name="maxAntecedentComponentLength" type="xsd:int"
use="optional"/>
<xsd:attribute name="maxConsequentComponentLength" type="xsd:int"
use="optional"/>
<xsd:attribute name="minConfidence" type="xsd:double"
use="optional"/>
<xsd:attribute name="minSupport" type="xsd:double" use="optional"/>
</xsd:extension>
</xsd:complexContent>
</xsd:complexType>
<xsd:complexType name="AttributeTaxonomy">
<xsd:attribute name="attributeName" type="xsd:string"/>
<xsd:attribute name="taxonomyName" type="xsd:string"/>
</xsd:complexType>
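
AssociationSettings accepts minSupport and minConfidence thresholds. For readers unfamiliar with the terms, the Java sketch below computes both measures for a rule over a small in-memory set of transactions. It assumes the conventional definitions: support is the fraction of transactions containing all items, and confidence is the support of antecedent and consequent together divided by the support of the antecedent. The class, its methods, and the sample data are hypothetical and are not part of the JDM API.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical helper, not part of JDM: support and confidence of an association rule. */
final class RuleMeasures {
    private RuleMeasures() {}

    /** Convenience factory for an item set. */
    static Set<String> itemSet(String... items) {
        return new HashSet<>(Arrays.asList(items));
    }

    /** Fraction of transactions that contain every item in the given set. */
    static double support(List<Set<String>> transactions, Set<String> items) {
        if (transactions.isEmpty()) {
            return 0.0;
        }
        int hits = 0;
        for (Set<String> transaction : transactions) {
            if (transaction.containsAll(items)) {
                hits++;
            }
        }
        return (double) hits / transactions.size();
    }

    /** support(antecedent and consequent together) / support(antecedent); 0.0 if the antecedent never occurs. */
    static double confidence(List<Set<String>> transactions,
                             Set<String> antecedent, Set<String> consequent) {
        Set<String> both = new HashSet<>(antecedent);
        both.addAll(consequent);
        double antecedentSupport = support(transactions, antecedent);
        return antecedentSupport == 0.0 ? 0.0 : support(transactions, both) / antecedentSupport;
    }

    public static void main(String[] args) {
        // Made-up market-basket data.
        List<Set<String>> transactions = Arrays.asList(
                itemSet("bread", "milk"),
                itemSet("bread", "butter"),
                itemSet("milk"),
                itemSet("bread", "milk", "butter"));
        System.out.println("support(bread, milk)=" + support(transactions, itemSet("bread", "milk")));
        System.out.println("confidence(butter -> bread)="
                + confidence(transactions, itemSet("butter"), itemSet("bread")));
    }
}
```

A rule would pass the thresholds in AssociationSettings when its support is at least minSupport and its confidence is at least minConfidence, under the same assumed definitions.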

E.4.10. AttributeImportance
<xsd:complexType name="AttributeImportanceSettings">
<xsd:complexContent>
<xsd:extension base="BuildSettings">
<xsd:attribute name="maxAttributeCount" type="xsd:int"
use="optional"/>
<xsd:attribute name="targetAttributeName" type="xsd:string"
use="optional"/>
</xsd:extension>
</xsd:complexContent>
</xsd:complexType>
<xsd:complexType name="AttributeImportanceModel">
<xsd:complexContent>
<xsd:extension base="Model">
<xsd:sequence>
<xsd:element name="attribute" type="AttributeImportance"
maxOccurs="unbounded"/>
</xsd:sequence>
</xsd:extension>
</xsd:complexContent>
</xsd:complexType>
<xsd:complexType name="AttributeImportance">
<xsd:attribute name="attributeName" type="xsd:string" use="required"/>
<xsd:attribute name="attributeRank" type="xsd:int" use="required"/>
<xsd:attribute name="importanceValue" type="xsd:double" use="optional"/>
</xsd:complexType>

E.4.11. Statistics
<xsd:complexType name="AttributeStatisticsSet">
<xsd:sequence>
<xsd:element name="attrStatistics" type="UnivariateStatistics"
minOccurs="0" maxOccurs="unbounded"/>
</xsd:sequence>
<xsd:attribute name="statisticsTimestamp" type="xsd:time"
use="optional"/>
<xsd:attribute name="numberOfCases" type="xsd:int" use="optional"/>
</xsd:complexType>
<xsd:complexType name="UnivariateStatistics">
<xsd:sequence>
<xsd:element name="continuousStatistics" type="ContinuousStatistics"
minOccurs="0"/>
<xsd:element name="discrete" type="DiscreteStatistics"
minOccurs="0"/>
<xsd:element name="numericalStatistics" type="NumericalStatistics"
minOccurs="0"/>
<xsd:element name="frequencies" type="xsd:int" minOccurs="0"
maxOccurs="unbounded"/>
<xsd:element name="probabilities" type="xsd:double" minOccurs="0"
maxOccurs="unbounded"/>
<xsd:element name="values" type="DataValueType" minOccurs="0"
maxOccurs="unbounded"/>
</xsd:sequence>
<xsd:attribute name="attributeName" type="xsd:string"/>
</xsd:complexType>
<xsd:complexType name="ContinuousStatistics">
<xsd:sequence>
<xsd:element name="intervals" type="Interval" minOccurs="0"
maxOccurs="unbounded"/>
<xsd:element name="frequencies" type="xsd:int" minOccurs="0"
maxOccurs="unbounded"/>
<xsd:element name="sum" type="xsd:double" minOccurs="0"
maxOccurs="unbounded"/>
<xsd:element name="sumOfSquares" type="xsd:double" minOccurs="0"
maxOccurs="unbounded"/>
</xsd:sequence>
<xsd:attribute name="numberOfIntervals" type="xsd:int"/>
</xsd:complexType>
<xsd:complexType name="DiscreteStatistics">
<xsd:sequence>
<xsd:element name="modalValue" type="DataValueType" minOccurs="0"/>
<xsd:element name="discreteValues" type="DataValueType" minOccurs="0"
maxOccurs="unbounded"/>
<xsd:element name="frequencies" type="xsd:int" minOccurs="0"
maxOccurs="unbounded"/>
</xsd:sequence>
</xsd:complexType>
<xsd:complexType name="NumericalStatistics">
<xsd:sequence>
<xsd:element name="minimumValue" type="xsd:double" minOccurs="0"/>
<xsd:element name="maximumValue" type="xsd:double" minOccurs="0"/>
<xsd:element name="meanValue" type="xsd:double" minOccurs="0"/>
<xsd:element name="medianValue" type="xsd:double" minOccurs="0"/>
<xsd:element name="variance" type="xsd:double" minOccurs="0"/>
<xsd:element name="standardDeviation" type="xsd:double" minOccurs="0"/>
<xsd:element name="quantile" type="xsd:double" minOccurs="0"/>
<xsd:element name="quantileLimits" type="xsd:double" minOccurs="0"
maxOccurs="unbounded"/>
<xsd:element name="interQuartileRange" type="xsd:double" minOccurs="0"/>
</xsd:sequence>
</xsd:complexType>
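
ContinuousStatistics carries sum and sumOfSquares values, while NumericalStatistics carries meanValue, variance, and standardDeviation. The Java sketch below shows one common way the latter can be derived from the former and a case count. It assumes that the sums cover all cases being summarized and that the variance is the population variance (division by n); this listing does not say whether an implementation divides by n or n - 1. The class and method names are hypothetical and are not part of the JDM API.

```java
/** Hypothetical helper, not part of JDM: moments derived from running sums. */
final class MomentStatistics {
    private MomentStatistics() {}

    /** Arithmetic mean: sum / count. */
    static double mean(long count, double sum) {
        checkCount(count);
        return sum / count;
    }

    /** Population variance (assumed divisor n): sumOfSquares / n - mean^2. */
    static double variance(long count, double sum, double sumOfSquares) {
        double mean = mean(count, sum);
        return sumOfSquares / count - mean * mean;
    }

    /** Square root of the variance; small negative rounding errors are clamped to zero. */
    static double standardDeviation(long count, double sum, double sumOfSquares) {
        return Math.sqrt(Math.max(0.0, variance(count, sum, sumOfSquares)));
    }

    private static void checkCount(long count) {
        if (count <= 0) {
            throw new IllegalArgumentException("count must be positive");
        }
    }

    public static void main(String[] args) {
        // Sums for the made-up sample 2, 4, 4, 4, 5, 5, 7, 9.
        long count = 8;
        double sum = 40.0;
        double sumOfSquares = 232.0;
        System.out.println("mean=" + mean(count, sum));
        System.out.println("variance=" + variance(count, sum, sumOfSquares));
        System.out.println("standardDeviation=" + standardDeviation(count, sum, sumOfSquares));
    }
}
```

The sum-of-squares form is convenient because it needs only one pass over the data, but it can lose precision when the mean is large relative to the spread.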

E.4.12. Algorithm
<xsd:complexType name="NaiveBayesSettings">
<xsd:complexContent>
<xsd:extension base="SupervisedAlgorithmSettings">
<xsd:attribute name="pairwiseThreshold" type="xsd:double"
use="optional"/>
<xsd:attribute name="singletonThreshold" type="xsd:double"
use="optional"/>
</xsd:extension>
</xsd:complexContent>
</xsd:complexType>
<xsd:complexType name="SVMClassificationSettings">
<xsd:complexContent>
<xsd:extension base="SupervisedAlgorithmSettings">
<xsd:attribute name="cStrategy" type="xsd:double" use="optional"/>
<xsd:attribute name="complexityFactor" type="xsd:double"
use="optional"/>
<xsd:attribute name="kernelCacheSize" type="xsd:int" use="optional"/>
<xsd:attribute name="kernelFunction" type="KernelFunction"
use="optional"/>
<xsd:attribute name="polynomialDegree" type="xsd:int"
use="optional"/>
<xsd:attribute name="standardDeviation" type="xsd:double"
use="optional"/>
<xsd:attribute name="tolerance" type="xsd:double" use="optional"/>
</xsd:extension>
</xsd:complexContent>
</xsd:complexType>
<xsd:complexType name="SVMRegressionSettings">
<xsd:complexContent>
<xsd:extension base="SupervisedAlgorithmSettings">
<xsd:attribute name="cStrategy" type="xsd:double" use="optional"/>
<xsd:attribute name="complexityFactor" type="xsd:double"
use="optional"/>
<xsd:attribute name="epsilon" type="xsd:double" use="optional"/>
<xsd:attribute name="kernelCacheSize" type="xsd:int" use="optional"/>
<xsd:attribute name="kernelFunction" type="KernelFunction"
use="optional"/>
<xsd:attribute name="polynomialDegree" type="xsd:int"
use="optional"/>
<xsd:attribute name="standardDeviation" type="xsd:double"
use="optional"/>
<xsd:attribute name="tolerance" type="xsd:double" use="optional"/>
</xsd:extension>
</xsd:complexContent>
</xsd:complexType>
<xsd:simpleType name="KernelFunctionStd">
<xsd:restriction base="xsd:string">
<xsd:enumeration value="sigmoid"/>
<xsd:enumeration value="hypertangent"/>
<xsd:enumeration value="polynomial"/>
<xsd:enumeration value="kGaussian"/>
<xsd:enumeration value="kLinear"/>
<xsd:enumeration value="systemDetermined"/>
<xsd:enumeration value="systemDefault"/>
</xsd:restriction>
</xsd:simpleType>
<xsd:complexType name="TreeSettings">
<xsd:complexContent>
<xsd:extension base="SupervisedAlgorithmSettings">
<xsd:attribute name="buildHomogeneityMetric"
type="TreeHomogeneityMetric" use="optional"/>
<xsd:attribute name="computeNodeStatistics" type="xsd:boolean"
use="optional"/>
<xsd:attribute name="determineMaxDepth" type="xsd:boolean"
use="optional"/>
<xsd:attribute name="maxDepth" type="xsd:int" use="optional"/>
<xsd:attribute name="maxSplits" type="xsd:int" use="optional"/>
<xsd:attribute name="maxSurrogates" type="xsd:int" use="optional"/>
<xsd:attribute name="maximumPValue" type="xsd:double"
use="optional"/>
<xsd:attribute name="minDecreaseInImpurity" type="xsd:double"
use="optional"/>
<xsd:attribute name="minNodeSize" type="xsd:double" use="optional"/>
<xsd:attribute name="minNodeSizeUnit" type="SizeUnit"
use="optional"/>
<xsd:attribute name="pruningHomogeneityMetric"
type="TreeHomogeneityMetric" use="optional"/>
<xsd:attribute name="treeSelectionMethod" type="TreeSelectionMethod"
use="optional"/>
</xsd:extension>
</xsd:complexContent>
</xsd:complexType>
<xsd:simpleType name="TreeHomogeneityMetricStd">
<xsd:restriction base="xsd:string">
<xsd:enumeration value="misclassificationRatio"/>
<xsd:enumeration value="entropy"/>
<xsd:enumeration value="gini"/>
<xsd:enumeration value="meanAbsoluteDeviation"/>
<xsd:enumeration value="meanSquaredError"/>
<xsd:enumeration value="systemDefault"/>
<xsd:enumeration value="systemDetermined"/>
</xsd:restriction>
</xsd:simpleType>
<xsd:simpleType name="TreeSelectionMethodStd">
<xsd:restriction base="xsd:string">
<xsd:enumeration value="oneStandardErrorTree"/>
<xsd:enumeration value="minimumErrorTree"/>
<xsd:enumeration value="systemDefault"/>
<xsd:enumeration value="systemDetermined"/>
</xsd:restriction>
</xsd:simpleType>
<xsd:complexType name="FeedForwardNeuralNetSettings">
<xsd:complexContent>
<xsd:extension base="SupervisedAlgorithmSettings">
<xsd:sequence>
<xsd:element name="neuralLayers" type="NeuralLayer" minOccurs="0"
maxOccurs="unbounded"/>
<xsd:element name="learningAlgorithm" type="LearningAlgorithm"
minOccurs="0"/>
</xsd:sequence>
<xsd:attribute name="determineNumberOfNodesPerLayer" type="xsd:boolean"
use="optional"/>
<xsd:attribute name="maxNumberOfIterations" type="xsd:int"
use="optional"/>
<xsd:attribute name="minErrorTolerance" type="xsd:double"
use="optional"/>
</xsd:extension>
</xsd:complexContent>
</xsd:complexType>
<xsd:complexType name="LearningAlgorithm">
</xsd:complexType>
<xsd:complexType name="Backpropagation">
<xsd:complexContent>
<xsd:extension base="LearningAlgorithm">
<xsd:attribute name="learningRate" type="xsd:double"/>
<xsd:attribute name="momentum" type="xsd:double"/>
</xsd:extension>
</xsd:complexContent>
</xsd:complexType>
<xsd:complexType name="NeuralLayer">
<xsd:attribute name="activationFunction" type="ActivationFunction"/>
<xsd:attribute name="numberOfNodes" type="xsd:decimal"/>
</xsd:complexType>
<xsd:simpleType name="ActivationFunctionStd">
<xsd:restriction base="xsd:string">
<xsd:enumeration value="softMax"/>
<xsd:enumeration value="symmetricSign"/>
<xsd:enumeration value="sign"/>
<xsd:enumeration value="hyperbolicTangent"/>
<xsd:enumeration value="logistic"/>
<xsd:enumeration value="linearIdentity"/>
<xsd:enumeration value="systemDefault"/>
<xsd:enumeration value="systemDetermined"/>
</xsd:restriction>
</xsd:simpleType>
<xsd:complexType name="AssociationRulesAlgorithmSettings">
<xsd:complexContent>
<xsd:extension base="AlgorithmSettings"/>
</xsd:complexContent>
</xsd:complexType>
<xsd:complexType name="AttributeImportanceAlgorithmSettings">
<xsd:complexContent>
<xsd:extension base="AlgorithmSettings"/>
</xsd:complexContent>
</xsd:complexType>
<xsd:complexType name="ClusteringAlgorithmSettings">
<xsd:complexContent>
<xsd:extension base="AlgorithmSettings"/>
</xsd:complexContent>
</xsd:complexType>
<xsd:complexType name="KMeansSettings">
<xsd:complexContent>
<xsd:extension base="ClusteringAlgorithmSettings">
<xsd:attribute name="distanceFunction" type="ClusteringDistanceFunction"
use="optional"/>
<xsd:attribute name="maxNumberOfIterations" type="xsd:int"
use="optional"/>
<xsd:attribute name="minErrorTolerance" type="xsd:double"
use="optional"/>
</xsd:extension>
</xsd:complexContent>
</xsd:complexType>
<xsd:simpleType name="ClusteringDistanceFunctionStd">
<xsd:restriction base="xsd:string">
<xsd:enumeration value="euclidean"/>
<xsd:enumeration value="systemDefault"/>
<xsd:enumeration value="systemDetermined"/>
</xsd:restriction>
</xsd:simpleType>

E.4.13. Base
<xsd:complexType name="BuildSettings" abstract="true">
<xsd:complexContent>
<xsd:extension base="MiningObject">
<xsd:sequence>
<xsd:element name="algorithmSettings" type="AlgorithmSettings"
minOccurs="0"/>
<xsd:element name="weightAttribute" type="xsd:string"/>
<xsd:element name="buildAttribute" type="BuildAttribute" minOccurs="0"
maxOccurs="unbounded"/>
<xsd:choice>
<xsd:element name="logicalData" type="LogicalData" minOccurs="0"/>
<xsd:element name="logicalDataName" type="xsd:string" minOccurs="0"/>
</xsd:choice>
</xsd:sequence>
<xsd:attribute name="miningFunction" type="MiningFunction"
use="required"/>
<xsd:attribute name="desiredExecutionTimeInMinutes" type="xsd:int"
use="optional"/>
</xsd:extension>
</xsd:complexContent>
</xsd:complexType>
<xsd:complexType name="BuildAttribute">
<xsd:attribute name="attributeName" type="xsd:string"/>
<xsd:attribute name="usage" type="LogicalAttributeUsage"
use="optional"/>
<xsd:attribute name="outlierTreatment" type="OutlierTreatment"
use="optional"/>
<xsd:attribute name="weight" type="xsd:double" use="optional"/>
</xsd:complexType>
<xsd:complexType name="Model">
<xsd:complexContent>
<xsd:extension base="MiningObject">
<xsd:sequence>
<xsd:element name="signature" type="ModelSignature" minOccurs="0"/>
<xsd:choice>
<xsd:element name="buildSettingsName" type="xsd:string"
minOccurs="0"/>
<xsd:element name="buildSettings" type="BuildSettings" minOccurs="0"/>
</xsd:choice>
<xsd:element name="effectiveBuildSettings" type="BuildSettings"
minOccurs="0"/>
<xsd:element name="attributeStatistics" type="AttributeStatisticsSet"
minOccurs="0"/>
<xsd:element name="modelDetail" type="ModelDetail" minOccurs="0"/>
</xsd:sequence>
<xsd:attribute name="uniqueIdentifier" type="xsd:string"
use="optional"/>
<xsd:attribute name="version" type="xsd:string" use="optional"/>
<xsd:attribute name="majorVersion" type="xsd:string" use="optional"/>
<xsd:attribute name="minorVersion" type="xsd:string" use="optional"/>
<xsd:attribute name="providerName" type="xsd:string" use="optional"/>
<xsd:attribute name="providerVersion" type="xsd:string"
use="optional"/>
<xsd:attribute name="applicationName" type="xsd:string"
use="optional"/>
<xsd:attribute name="miningFunction" type="MiningFunction"
use="optional"/>
<xsd:attribute name="miningAlgorithm" type="MiningAlgorithm"
use="optional"/>
<xsd:attribute name="taskIdentifier" type="xsd:string"
use="optional"/>
<xsd:attribute name="buildDuration" type="xsd:int" use="optional"/>
</xsd:extension>
</xsd:complexContent>
</xsd:complexType>
<xsd:complexType name="ModelDetail">
<xsd:sequence>
<xsd:any minOccurs="0"/>
</xsd:sequence>
<xsd:attribute name="format" type="ImportExportFormat" use="optional"/>
</xsd:complexType>
<xsd:complexType name="MiningObject">
<xsd:sequence>
<xsd:element name="description" type="xsd:string" minOccurs="0"/>
</xsd:sequence>
<xsd:attribute name="name" type="xsd:string" use="optional"/>
<xsd:attribute name="type" type="xsd:string" use="optional"/>
<xsd:attribute name="creatorInfo" type="xsd:string" use="optional"/>
<xsd:attribute name="creationDate" type="xsd:date" use="optional"/>
<xsd:attribute name="objectIdentifier" type="xsd:string"
use="optional"/>
</xsd:complexType>
<xsd:complexType name="MiningObjectHeader">
<xsd:complexContent>
<xsd:extension base="MiningObject">
</xsd:extension>
</xsd:complexContent>
</xsd:complexType>
<xsd:simpleType name="LogicalAttributeUsageStd">
<xsd:restriction base="xsd:string">
<xsd:enumeration value="inactive"/>
<xsd:enumeration value="supplementary"/>
<xsd:enumeration value="active"/>
</xsd:restriction>
</xsd:simpleType>
<xsd:simpleType name="OutlierTreatmentStd">
<xsd:restriction base="xsd:string">
<xsd:enumeration value="asMissing"/>
<xsd:enumeration value="asIs"/>
<xsd:enumeration value="systemDetermined"/>
<xsd:enumeration value="systemDefault"/>
</xsd:restriction>
</xsd:simpleType>
<xsd:complexType name="AlgorithmSettings" abstract="true">
<xsd:attribute name="miningAlgorithm" type="MiningAlgorithm"
use="required"/>
</xsd:complexType>
<xsd:simpleType name="NamedObjectType">
<xsd:restriction base="xsd:string">
<xsd:enumeration value="task"/>
<xsd:enumeration value="buildSettings"/>
<xsd:enumeration value="model"/>
<xsd:enumeration value="logicalData"/>
<xsd:enumeration value="physicalDataSet"/>
<xsd:enumeration value="testMetrics"/>
<xsd:enumeration value="taxonomy"/>
<xsd:enumeration value="costMatrix"/>
<xsd:enumeration value="applySettings"/>
</xsd:restriction>
</xsd:simpleType>
<xsd:simpleType name="MiningFunctionStd">
<xsd:restriction base="xsd:string">
<xsd:enumeration value="classification"/>
<xsd:enumeration value="clustering"/>
<xsd:enumeration value="regression"/>
<xsd:enumeration value="attributeImportance"/>
<xsd:enumeration value="association"/>
</xsd:restriction>
</xsd:simpleType>
<xsd:simpleType name="MiningAlgorithmStd">
<xsd:restriction base="xsd:string">
<xsd:enumeration value="svmClassification"/>
<xsd:enumeration value="svmRegression"/>
<xsd:enumeration value="decisionTree"/>
<xsd:enumeration value="naiveBayes"/>
<xsd:enumeration value="kMeans"/>
<xsd:enumeration value="feedForwardNeuralNet"/>
</xsd:restriction>
</xsd:simpleType>
<xsd:simpleType name="SizeUnit">
<xsd:restriction base="xsd:string">
<xsd:enumeration value="percentage"/>
<xsd:enumeration value="count"/>
</xsd:restriction>
</xsd:simpleType>

E.4.14. Root
<xsd:complexType name="JDMException">
<xsd:sequence>
</xsd:sequence>
<xsd:attribute name="errorcode" type="xsd:int" use="required"/>
<xsd:attribute name="message" type="xsd:string" use="optional"/>
<xsd:attribute name="vendorErrorcode" type="xsd:int" use="optional"/>
<xsd:attribute name="vendorMessage" type="xsd:string" use="optional"/>
</xsd:complexType>
<xsd:complexType name="VerificationReport">
<xsd:sequence>
<xsd:element name="reportText" type="xsd:string"/>
</xsd:sequence>
<xsd:attribute name="reportType" type="ReportType" use="required"/>
</xsd:complexType>
<xsd:simpleType name="ReportType">
<xsd:restriction base="xsd:string">
<xsd:enumeration value="error"/>
<xsd:enumeration value="warning"/>
</xsd:restriction>
</xsd:simpleType>
<xsd:simpleType name="SortOrder">
<xsd:restriction base="xsd:string">
<xsd:enumeration value="ascending"/>
<xsd:enumeration value="descending"/>
</xsd:restriction>
</xsd:simpleType>
<xsd:complexType name="ExecutionStatus">
<xsd:sequence>
<xsd:element name="description" type="xsd:string" minOccurs="0"/>
</xsd:sequence>
<xsd:attribute name="state" type="ExecutionState" use="required"/>
<xsd:attribute name="timestamp" type="xsd:string" use="required"/>
<xsd:attribute name="containsWarning" type="xsd:boolean"
use="optional"/>
</xsd:complexType>
<xsd:simpleType name="ExecutionState">
<xsd:restriction base="xsd:string">
<xsd:enumeration value="submitted"/>
<xsd:enumeration value="executing"/>
<xsd:enumeration value="success"/>
<xsd:enumeration value="error"/>
<xsd:enumeration value="terminating"/>
<xsd:enumeration value="terminated"/>
</xsd:restriction>
</xsd:simpleType>
<xsd:simpleType name="MiningTaskStd">
<xsd:restriction base="xsd:string">
<xsd:enumeration value="buildTask"/>
<xsd:enumeration value="testTask"/>
<xsd:enumeration value="applyTask"/>
<xsd:enumeration value="computeStatisticsTask"/>
<xsd:enumeration value="exportTask"/>
<xsd:enumeration value="importTask"/>
</xsd:restriction>
</xsd:simpleType>
<xsd:complexType name="ConnectionSpec">
<xsd:sequence>
<xsd:element name="userName" type="xsd:string"/>
<xsd:element name="password" type="xsd:string"/>
<xsd:element name="uri" type="xsd:anyURI"/>
<xsd:element name="locale" type="Locale"/>
</xsd:sequence>
</xsd:complexType>

E.4.15. Enumeration extension


<xsd:simpleType name="EnumerationExtension">
<xsd:restriction base="xsd:string">
<xsd:pattern value="ext:\S.*"/>
</xsd:restriction>
</xsd:simpleType>
<xsd:simpleType name="ImportExportFormat">
<xsd:union memberTypes="ImportExportFormatStd
EnumerationExtension"/>
</xsd:simpleType>
<xsd:simpleType name="AttributeDataType">
<xsd:union memberTypes="AttributeDataTypeStd
EnumerationExtension"/>
</xsd:simpleType>
<xsd:simpleType name="PhysicalAttributeRole">
<xsd:union memberTypes="PhysicalAttributeRoleStd
EnumerationExtension"/>
</xsd:simpleType>
<xsd:simpleType name="AttributeType">
<xsd:union memberTypes="AttributeTypeStd
EnumerationExtension"/>
</xsd:simpleType>
<xsd:simpleType name="DataPreparationStatus">
<xsd:union memberTypes="DataPreparationStatusStd
EnumerationExtension"/>
</xsd:simpleType>
<xsd:simpleType name="CategoryProperty">
<xsd:union memberTypes="CategoryPropertyStd
EnumerationExtension"/>
</xsd:simpleType>
<xsd:simpleType name="ClassificationTestMetricOption">
<xsd:union memberTypes="ClassificationTestMetricOptionStd
EnumerationExtension"/>
</xsd:simpleType>
<xsd:simpleType name="ClassificationApplyContent">
<xsd:union memberTypes="ClassificationApplyContentStd
EnumerationExtension"/>
</xsd:simpleType>
<xsd:simpleType name="RegressionApplyContent">
<xsd:union memberTypes="RegressionApplyContentStd
EnumerationExtension"/>
</xsd:simpleType>
<xsd:simpleType name="ClusteringApplyContent">
<xsd:union memberTypes="ClusteringApplyContentStd
EnumerationExtension"/>
</xsd:simpleType>
<xsd:simpleType name="AggregationFunction">
<xsd:union memberTypes="AggregationFunctionStd
EnumerationExtension"/>
</xsd:simpleType>
<xsd:simpleType name="AttributeComparisonFunction">
<xsd:union memberTypes="AttributeComparisonFunctionStd
EnumerationExtension"/>
</xsd:simpleType>
<xsd:simpleType name="KernelFunction">
<xsd:union memberTypes="KernelFunctionStd
EnumerationExtension"/>
</xsd:simpleType>
<xsd:simpleType name="TreeHomogeneityMetric">
<xsd:union memberTypes="TreeHomogeneityMetricStd
EnumerationExtension"/>
</xsd:simpleType>
<xsd:simpleType name="TreeSelectionMethod">
<xsd:union memberTypes="TreeSelectionMethodStd
EnumerationExtension"/>
</xsd:simpleType>
<xsd:simpleType name="ActivationFunction">
<xsd:union memberTypes="ActivationFunctionStd
EnumerationExtension"/>
</xsd:simpleType>
<xsd:simpleType name="ClusteringDistanceFunction">
<xsd:union memberTypes="ClusteringDistanceFunctionStd
EnumerationExtension"/>
</xsd:simpleType>
<xsd:simpleType name="LogicalAttributeUsage">
<xsd:union memberTypes="LogicalAttributeUsageStd
EnumerationExtension"/>
</xsd:simpleType>
<xsd:simpleType name="OutlierTreatment">
<xsd:union memberTypes="OutlierTreatmentStd
EnumerationExtension"/>
</xsd:simpleType>
<xsd:simpleType name="MiningFunction">
<xsd:union memberTypes="MiningFunctionStd
EnumerationExtension"/>
</xsd:simpleType>
<xsd:simpleType name="MiningAlgorithm">
<xsd:union memberTypes="MiningAlgorithmStd
EnumerationExtension"/>
</xsd:simpleType>
<xsd:simpleType name="MiningTask">
<xsd:union memberTypes="MiningTaskStd
EnumerationExtension"/>
</xsd:simpleType>
<xsd:simpleType name="ObjectContentType">
<xsd:union memberTypes="ObjectContentTypeStd
EnumerationExtension"/>
</xsd:simpleType>

Appendix F. References
[Alur2001]

Deepak Alur, John Crupi, and Dan Malks, Core J2EE Patterns: Best Practices
and Design Strategies, Prentice Hall, 2001.

[BL1997]

Michael Berry and Gordon Linoff, Data Mining Techniques: For Marketing,
Sales, and Customer Support, 1997.

[CWM]

http://www.omg.org/technology/cwm

[CWM-DM]

http://cgi.omg.org/docs/ad/01-02-01.pdf

[Java-URI]

Class java.net.URI, http://java.sun.com/j2se/1.4.2/docs/api/java/net/URI.html

[JSR16]

http://jcp.org/jsr/detail/16.jsp

[JSR40]

http://jcp.org/jsr/detail/40.jsp

[Marinescu2002]

Marinescu, F., EJB Design Patterns, Wiley, 2002.


[Gamma1994]

E. Gamma, R. Helm, R. Johnson, J. Vlissides, Design Patterns: Elements of
Reusable Object-Oriented Software, Addison-Wesley, 1994.

[Mitchell1997]

Tom Mitchell, Machine Learning, McGraw-Hill, 1997.

[PMML]

http://www.dmg.org

[Sharma2001]

Rahul Sharma, Beth Stearns, Tony Ng, J2EE Connector Architecture and Enterprise Application Integration, Addison-Wesley, 2001.

[SQL/MM-DM]

http://www.sql-99.org/SC32/WG4/Progression_Documents/Informal_working_drafts/wd-datamining-2000-07.pdf

[SUN-Blueprints1]

http://java.sun.com/blueprints/guidelines/designing_enterprise_applications_2e/deployment/deployment4.html

[SUN-Blueprints2]

http://java.sun.com/blueprints/guidelines/designing_enterprise_applications_2e/web-tier/web-tier5.html

[URI]

RFC2396: Uniform Resource Identifiers (URI): Generic Syntax,
http://www.ietf.org/rfc/rfc2396.txt

[URI-SCHEMES]

Uniform Resource Identifier (URI) SCHEMES, http://www.iana.org/assignments/uri-schemes


[W3]

http://dev.w3.org/cvsweb/~checkout~/2002/ws/arch/glossary/wsa-glossary.html

[WS-I]

http://www.ws-i.org/
