Clementine Lab Handout

CMM510 Data Mining. Devised by W. Ji; updated by C.H. Bryant and I. Arana.
Lab: Data Mining with Clementine
Aims
• To become familiar with the Clementine programming interface
• To learn about the visualization functions in Clementine
• To predict iris types using Clementine modelling functions
Background
The Clementine data mining toolkit was originally developed by Integral
Solutions Limited. The company was later acquired by SPSS Inc. in 1999.
SPSS (Statistical Package for the Social Sciences) is a software package for
comprehensive data mining (not its initial objective) and analytic applications
for enhanced decision making. The main strength of SPSS lies in
statistical analysis: it contains a systematic series of statistical functions,
from descriptive analysis and parametric and nonparametric tests to nonlinear
regression.
Clementine is regarded as a complement to SPSS, providing many intelligent
modelling functions (compared to the traditional statistical techniques). C5.0 is
one such example. Clementine and SPSS run independently. However, to
strengthen Clementine’s speciality without losing generality in
statistical analysis, Clementine not only embeds most SPSS functions in its
interface but also provides a facility to export its processes to SPSS.
As a data mining tool, Clementine follows the basic
preprocessing-modelling-postprocessing routine to reveal the information and
knowledge behind the data.
© The Robert Gordon University

School of Computing P1/7
Clementine Programming Interface

Start Clementine
Desktop => Start => Programs => Clementine Desktop 9.0 => Clementine
Desktop 9.0
The Clementine programming interface is shown in Fig. 1.
Fig. 1 Clementine programming interface
Seek Help
On the Clementine tool bar, clicking Help and then Help Topics opens the
Clementine user manual. You may need to consult it occasionally during the
lab.
Alternatively, you can open an individual topic directly, for example:
C:\Program Files\Clementine Desktop 9.0\Help\i18n\English_US\Clemhelp\clem_intro.htm
Build and Work With Stream
The Clementine programming interface is divided into blocks. The main one
is the large blank area, which is where the virtual programs are built.
Generated models are automatically placed in the upper right-hand block. The
nodes, representing operations to be performed on the data, are located in
seven palettes at the bottom of the interface. Nodes can be dragged into the
programming area and connected by links, which indicate the direction of data
flow. This virtual program, consisting of nodes and links, is called a
stream.
In summary, to build a data stream:
• Add nodes to the stream pane
• Connect the nodes to form a stream
• Specify any node or stream options
• Execute the stream
Once a stream has been built, it can be run from any node, which is useful for
debugging the program step by step.
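As a rough analogy only (this is not Clementine's scripting language or API), a stream can be thought of as a source followed by a chain of functions, each playing the role of a node. The sketch below uses hypothetical names (run_stream, source_node, select_large, keep_id_only) and made-up records. Running only part of the chain mirrors executing a stream from an intermediate node when debugging.

```python
from typing import Callable, Dict, Iterable, List

Record = Dict[str, int]
Node = Callable[[List[Record]], List[Record]]


def run_stream(source: Callable[[], List[Record]], nodes: Iterable[Node]) -> List[Record]:
    """Read records from the source, then pass them through each node in order."""
    records = source()
    for node in nodes:
        records = node(records)
    return records


def source_node() -> List[Record]:
    # Hypothetical in-memory source with made-up values.
    return [{"id": 1, "value": 4}, {"id": 2, "value": 9}, {"id": 3, "value": 16}]


def select_large(records: List[Record]) -> List[Record]:
    # Keeps only the records whose value exceeds 5.
    return [r for r in records if r["value"] > 5]


def keep_id_only(records: List[Record]) -> List[Record]:
    # Drops every field except "id".
    return [{"id": r["id"]} for r in records]


stream = [select_large, keep_id_only]

# Run only up to the first node to inspect the intermediate result.
partial_result = run_stream(source_node, stream[:1])

# Run the whole stream.
full_result = run_stream(source_node, stream)
```

Inspecting partial_result before full_result is the code equivalent of attaching a table node part-way along a stream.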
Master Clementine Nodes
Nodes are programming functions that represent different objects and actions.
They are grouped into 6 palettes according to their operations.
• Source nodes
Source nodes can be used to import the contents of various flat files and to
connect to data contained within ODBC-compliant relational databases.
Various file formats can be read through source nodes, such as txt, csv and
tsv. The csv format can be produced by many data processing programs, e.g.
spreadsheets.
Later in this lab, you will use the variable file node to read in a csv data file.
• Record operation nodes
Data sets are composed of records (also called cases or instances). These
records can be manipulated by record operation nodes.
We often need to split a data set into subsets to allow the results of
learning to be evaluated, e.g. in hold-out and multi-fold cross-validation. In
Clementine this can be done using the sample node. Later in this lab, you will
use the sample node to split the data set into training and testing subsets.
Records can be purposely selected (select node) according to their similarity.
When new attributes are obtained for the records in the data set, the merge
node allows you to extend the attribute space. If new records are collected
for the data set, the append node can add them to it.
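The following minimal sketch illustrates the idea behind the select, append and merge operations on records held as Python dictionaries. It is an analogy, not how Clementine implements these nodes. The function names, the customer records, and the use of a simple condition for selection are all assumptions made for illustration.

```python
def select(records, condition):
    """Keep only the records for which the condition holds."""
    return [r for r in records if condition(r)]


def append(records, new_records):
    """Add new records (rows) to the end of the data set."""
    return records + new_records


def merge(records, extra, key):
    """Add new attributes (columns) by matching records on a key field."""
    extra_by_key = {e[key]: e for e in extra}
    return [{**r, **extra_by_key.get(r[key], {})} for r in records]


# Made-up example data.
customers = [{"id": 1, "age": 34}, {"id": 2, "age": 51}]
more_customers = [{"id": 3, "age": 27}]
incomes = [{"id": 1, "income": 30}, {"id": 2, "income": 45}, {"id": 3, "income": 22}]

all_customers = append(customers, more_customers)        # more records
with_income = merge(all_customers, incomes, key="id")    # more attributes
over_30 = select(with_income, lambda r: r["age"] > 30)   # fewer records
```

Append grows the data set by rows, merge grows it by columns, and select shrinks it by rows.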
• Field operation nodes
Fields store the values of each attribute. Most data pre-processing tasks are
carried out by field operation nodes. The type node is the most frequently
used node because it allows you to assign a data type, direction, and blank
handling for each field in the data. Blank handling in the type node is one of
two ways in Clementine to deal with missing values. When a smaller attribute
space is required, e.g. after attribute selection, the filter node can
exclude unwanted attributes or eliminate fields with a high proportion of
missing values. If new attributes are required based on the current attributes
(sales per transaction, for instance), the derive node can create new fields for
this purpose.
In this lab, you will use:
• the type node to assign types and define input and output properties for
each field;
• the filter node to choose appropriate attributes for decision making;
• the derive node to generate a new field to support decision making.
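To make the three field operations concrete, here is a small sketch of what typing, filtering and deriving fields amount to, using the sales per transaction example mentioned above. The function names and the store records are hypothetical and the code only mimics the effect of the nodes; it does not use any Clementine API.

```python
def set_types(records, types):
    """Type node analogue: convert each raw string to its declared type."""
    return [{name: types[name](value) for name, value in r.items()} for r in records]


def filter_fields(records, drop):
    """Filter node analogue: remove the named fields from every record."""
    return [{k: v for k, v in r.items() if k not in drop} for r in records]


def derive(records, new_field, formula):
    """Derive node analogue: add a field computed from the existing ones."""
    return [{**r, new_field: formula(r)} for r in records]


# Made-up raw data, as it might arrive from a text file (all strings).
raw = [
    {"store": "A", "sales": "1200.0", "transactions": "40", "notes": ""},
    {"store": "B", "sales": "600.0", "transactions": "24", "notes": ""},
]

typed = set_types(raw, {"store": str, "sales": float, "transactions": int, "notes": str})
filtered = filter_fields(typed, drop={"notes"})
derived = derive(filtered, "sales_per_transaction",
                 lambda r: r["sales"] / r["transactions"])
```

Each step returns new records and leaves its input unchanged, much as each node passes data downstream without altering the source.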
• Graph nodes
The nodes in this palette play the major role in information visualization in
Clementine. Although the graph nodes themselves are not very
‘colourful’ (compared to some dedicated visualization software packages), they
do have the following desirable features:
(a) a graph node can be used at any stage of data mining (i.e. it may be
connected to any non-output node in the stream) to acquire information on
the intermediate results achieved so far;
(b) a displayed graph can be stored in several formats (.ghp, .bmp, .ps) that
can be used by other software or processes later on;
(c) the data at any stage of processing (stream) can be exported and, if
needed, visualized using the powerful functions of other software, e.g.
Excel, SPSS, or partners of Clementine such as AISoft@re/Visualmine.
• Modelling nodes
Modelling nodes are the heart of the data mining process. They are taken
from machine learning, artificial intelligence, and statistics. The number of
nodes in this palette will increase in line with the development of new data
mining technology. The modelling nodes include Neural Net (a neural
network), C5.0 (the successor to C4.5) and C&R Tree (a tree-based
classification and prediction model).
Later in this lab, you will use C5.0.
• Output nodes
The nodes in this palette provide powerful analyses of the results from
further upstream. Output nodes are the terminals of the stream, which
means there are no further nodes downstream of them.
The node most frequently used to display the data at any stage of data mining
is the Table node. The Analysis node is another popular node; it lets you
examine the results and assess the performance of data mining. Output nodes
also provide a mechanism for exporting data in various formats to interface
with other software tools.
Later in this lab, you will use both the Table node and the Analysis node.

Step by Step Exercises

Tasks
To predict iris types using C5.0 and neural networks.
Dataset
In this lab, you will use the Iris dataset. This dates back to seminal work by
the eminent statistician R.A. Fisher in the mid-1930s. It contains 150
examples in total. There are 50 examples of each of three types of plants.
The classes are the types of plants. There are four attributes:
1. sepal length (sepallen)
2. sepal width (sepalwid)
3. petal length (petallen)
4. petal width (petalwid)
All four attributes are numeric and are measured in cm.
Downloading the dataset
Create a new folder called ‘clementine_lab’ inside your ‘H:\CMM510’ folder.
Download the data set from http://www.comp.rgu.ac.uk/staff/chb/teach.html.
Save it to your ‘H:\CMM510\clementine_lab’ folder. The data set is stored in
csv format, with the field names on the first line.
Data Set Input and Display
In the Sources palette, click and drag the variable file node to the
programming area. Double click the node and a new window appears. Set the
correct path to the data file (iris data) and, according to the file format,
choose the following options: get field names from file; skip header
characters 0; delimiter characters ‘,’; delimiter on new line.
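For comparison, the same reading options (field names taken from the first line, comma as delimiter, one record per line) can be sketched with Python's standard csv module. The four attribute names come from this handout; the class column name "iristype", the class labels, and the inline three-row sample are assumptions about the file layout, so check them against the file you downloaded.

```python
import csv
import io

# A three-row stand-in for the downloaded file; the header layout is assumed.
SAMPLE_CSV = """sepallen,sepalwid,petallen,petalwid,iristype
5.1,3.5,1.4,0.2,setosa
7.0,3.2,4.7,1.4,versicolor
6.3,3.3,6.0,2.5,virginica
"""

NUMERIC_FIELDS = ("sepallen", "sepalwid", "petallen", "petalwid")


def read_iris(text_file):
    """Read comma-delimited records, using the first line as field names."""
    reader = csv.DictReader(text_file, delimiter=",")
    rows = []
    for row in reader:
        # The csv module returns strings, so convert the measurements to numbers.
        for name in NUMERIC_FIELDS:
            row[name] = float(row[name])
        rows.append(row)
    return rows


rows = read_iris(io.StringIO(SAMPLE_CSV))
```

To read the real file instead, pass an open file object, e.g. open(path, newline=""), in place of the io.StringIO object.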
Now, in the Output palette, click and drag the table node to the programming
area. Next, create a link between your two nodes as follows. Click and hold
the middle button/scroll button of the mouse on your file node, then drag to
your table node. You should be able to see a link from the file node to the
table node. Double click the table node and choose execute (or right click the
node and choose execute from the drop-down menu); a table
appears displaying the iris data.
Save your stream in the folder you created.
Define Type and Input/Output
In the Field Ops palette, choose the type node and add it to the programming
area. Draw a link from your file node to the type node. Double clicking the
type node opens the option setting window. Make sure the types are correct for
the attributes and the iris type (class). Clicking the direction column allows
you to change the input/output mode. Set the four attributes as IN and only
the class as OUT.
Feature Selection
Add a filter node from the Field Ops palette after the type node. Work out how
to discard the sepal length attribute. Display the new data set by connecting
your filter node to a new table node.
Training and Testing Data
Add two sample nodes from the Record Ops palette. Set one to include sample
and 1-in-2, and the other to discard sample and 1-in-2. Link the first to the
filter node to act as the training branch. Link the second to the filter node
as well, to act as the testing branch.
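The effect of the two sample nodes can be sketched as follows: one branch keeps every second record and the other keeps exactly the records the first one skipped, so the two subsets are disjoint and together cover the whole data set. The function name one_in_n is hypothetical, and whether Clementine starts counting at the first or the second record is an assumption here.

```python
def one_in_n(records, n, mode="include"):
    """Keep every n-th record ("include") or all the other records ("discard")."""
    if mode not in ("include", "discard"):
        raise ValueError("mode must be 'include' or 'discard'")
    # Assumption: counting starts at the first record.
    picked = [i % n == 0 for i in range(len(records))]
    if mode == "include":
        return [r for r, p in zip(records, picked) if p]
    return [r for r, p in zip(records, picked) if not p]


records = list(range(1, 11))  # stand-in for ten data records

training = one_in_n(records, 2, mode="include")
testing = one_in_n(records, 2, mode="discard")
```

Because both branches use the same 1-in-2 setting, no record is used for both training and testing.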
C5.0 Prediction
Link a C5.0 node to the sample node in the training branch, then double
click the node to execute the stream. A model will be generated and placed in
the model palette. Click and drag the model (which is now a node) to the
programming area and link it to the sample node in the testing branch. The
rule set can be viewed by double clicking the model.
Prediction Analysis
Add an analysis node to the testing branch after the model, double click it to
set the options, then execute. You should be able to see the prediction results.
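The kind of summary to look for in the analysis output is the proportion of correct predictions and a table of actual versus predicted classes. A minimal sketch of computing both is given below. The function name analyse is hypothetical, and the two label lists are made up purely to show the calculation; they are not results from this lab.

```python
from collections import Counter


def analyse(actual, predicted):
    """Return overall accuracy and counts of (actual, predicted) pairs."""
    if not actual or len(actual) != len(predicted):
        raise ValueError("actual and predicted must be non-empty and equal in length")
    confusion = Counter(zip(actual, predicted))
    correct = sum(count for (a, p), count in confusion.items() if a == p)
    return correct / len(actual), confusion


# Made-up labels for illustration only.
actual = ["setosa", "setosa", "versicolor", "versicolor", "virginica", "virginica"]
predicted = ["setosa", "setosa", "versicolor", "virginica", "virginica", "virginica"]

accuracy, confusion = analyse(actual, predicted)
```

Entries of confusion where the two labels differ show which classes the model mixes up, which is often more informative than the accuracy figure alone.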
Exercises
1. Use the nodes in the Graphs palette to display histograms of sepal
length, sepal width, petal length, and petal width.
2. Use a node in the Graphs palette to display scatterplots of some pairs
of attributes. Investigate why we discarded the sepal length attribute by
looking at pairs that include it.

