Aims
• To become familiar with the Clementine programming interface
• To learn about the visualization functions in Clementine
• To predict iris types using Clementine modelling functions
Background
The Clementine data mining toolkit was originally developed by Integral
Solutions Limited. The company was acquired by SPSS Inc. in 1999.
SPSS (Statistical Package for the Social Sciences) is a software package for
comprehensive data mining (not its initial objective) and analytic applications
for enhanced decision making. The main strength of SPSS lies in
statistical analysis: it contains a systematic series of statistical functions,
from descriptive analysis, through parametric and nonparametric tests, to
nonlinear regression.
Desktop => Start => Programs => Clementine Desktop 9.0 => Clementine
Desktop 9.0
Seek Help
On the Clementine tool bar, clicking Help and then Help Topics will bring up
the Clementine user manual. You may need to consult it occasionally during the
lab.
Alternatively, you can open an individual help topic directly.
The Clementine programming interface is divided into blocks. The largest
is the blank programming area, where visual programs are built.
Generated models are automatically placed in the upper right-hand block. The
nodes, representing operations to be performed on the data, are located in
seven palettes at the bottom of the interface. Nodes can be dragged into the
programming area and connected by links, which indicate the direction of data
flow. The resulting visual program, consisting of nodes and links, is called a
stream.
Once a stream has been built, it can be run from any node, which is useful for
debugging the program step by step.
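The idea of a stream can be illustrated outside Clementine as a chain of functions through which records flow. The Python below is only an analogy and is not Clementine's own scripting interface; the names `source`, `select_long_petals`, `count_records` and `run_stream`, the class names and the three records are all invented for this sketch.

```python
# Illustrative analogy only: a "stream" as an ordered chain of nodes.
# None of these names come from Clementine; the records are made-up values.

def source():
    """Source node: supplies the records (here, a tiny hard-coded sample)."""
    return [
        {"petallen": 1.4, "class": "setosa"},
        {"petallen": 4.7, "class": "versicolor"},
        {"petallen": 6.0, "class": "virginica"},
    ]

def select_long_petals(records):
    """Record-operation node: keeps records with petal length above 2 cm."""
    return [r for r in records if r["petallen"] > 2.0]

def count_records(records):
    """Output (terminal) node: reports how many records reached it."""
    return len(records)

def run_stream(source_node, *downstream_nodes):
    """Pass the data from the source through each linked node in order."""
    data = source_node()
    for node in downstream_nodes:
        data = node(data)
    return data

# Stopping at an intermediate node mirrors running only part of a stream,
# which is how intermediate results can be inspected while debugging.
partial = run_stream(source, select_long_petals)
full = run_stream(source, select_long_petals, count_records)
print(partial)
print(full)
```

Each function plays the role of a node, and the order of the arguments to `run_stream` plays the role of the links.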
Nodes are programming functions that represent different objects and actions.
They are grouped into 6 palettes according to their operations.
• Source nodes
Source nodes can be used to import the contents of various flat files and to
connect to data contained within ODBC-compliant relational databases.
Various file formats can be read through source nodes, such as txt, csv and
tsv. The csv format can be produced by many data processing packages, e.g.
spreadsheets.
Later in this lab, you will use the variable file node to read in a csv data file.
• Record operation nodes
Data sets are composed of records (also called cases or instances). These
records can be manipulated by record operation nodes.
• Field operation nodes
Fields store the values of each attribute. Most data pre-processing tasks are
carried out by field operation nodes. The type node is the most frequently
used node because it allows you to assign a data type, direction, and blank
handling for each field in the data. Blank handling in the type node is one of
two ways in Clementine to deal with missing values. When a small attribute
space is wanted, e.g. after attribute selection, the filter node can
screen out unwanted attributes or eliminate fields with a high proportion of
missing values. If new attributes are required based on the current attributes
(sales per transaction, for instance), the derive node can create new fields for
this purpose.
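The sales-per-transaction example can be made concrete with a small sketch. This is plain Python rather than a derive node; the field names `sales` and `transactions`, the sample values, and the choice to leave the derived value blank when there are no transactions are all assumptions made for illustration.

```python
# Hypothetical records; field names and values are invented for this sketch.
records = [
    {"sales": 120.0, "transactions": 4},
    {"sales": 75.0, "transactions": 3},
    {"sales": 0.0, "transactions": 0},
]

def derive_sales_per_transaction(record):
    """Return a copy of the record with one derived field added."""
    derived = dict(record)  # copy, so the original record is left untouched
    if record["transactions"]:
        derived["sales_per_transaction"] = record["sales"] / record["transactions"]
    else:
        # No transactions: store a blank (None) instead of dividing by zero.
        derived["sales_per_transaction"] = None
    return derived

derived_records = [derive_sales_per_transaction(r) for r in records]
for row in derived_records:
    print(row)
```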
• Graph nodes
The nodes in this palette play the major role in information visualization in
Clementine. Although the graph nodes themselves are not very
‘colourful’ (compared with some dedicated visualization packages), they do
have the following desirable features:
(a) a graph node can be used at any stage of data mining (i.e. it may be
connected to any non-output node in the stream) to obtain information on
the intermediate results achieved so far;
(b) a displayed graph can be stored in several formats (.ghp, .bmp, .ps) that
can be used by other software or processes later on;
(c) the data at any stage of processing (stream) can be exported and, if
needed, visualized using the more powerful functions of other software,
e.g. Excel, SPSS, or Clementine partner products such as AISoft@re/Visualmine.
• Modelling nodes
Modelling nodes are the heart of the data mining process. Their methods are
drawn from machine learning, artificial intelligence, and statistics. The
number of nodes in this palette will increase as new data mining techniques
are developed. The modelling nodes include Neural Net (a neural
network), C5.0 (the successor to C4.5) and C&R Tree (a tree-based
classification and prediction model).
• Output nodes
The nodes in this palette provide powerful analyses of the results from
further upstream. Output nodes are the terminals of a stream, which
means there are no further nodes downstream of them.
The node most frequently used to display the data at any stage of data mining
is the Table node. The Analysis node is another popular node; it lets you
assess the results and the performance of data mining.
Later in this lab, you will use both the Table node and the Analysis node.
Dataset
In this lab, you will use the Iris dataset, which dates back to seminal work by
the eminent statistician R. A. Fisher in the mid-1930s. It contains 150
examples in total: 50 examples of each of three types of plant.
The classes are the types of plant. There are four attributes:
1. sepal length (sepallen)
2. sepal width (sepalwid)
3. petal length (petallen)
4. petal width (petalwid)
All four attributes are numeric and are measured in cm.
In the Sources palette, click and drag the variable file node to the
programming area. Double click the node and a new window appears. Set the
correct path to the data file (iris data) and, according to the file format,
select the following options: get field names from file, skip header
characters 0, delimiter characters ‘,’, delimiter on new line.
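For comparison, the same import settings (field names taken from the first line, nothing skipped, comma as delimiter, one record per line) can be sketched with Python's standard `csv` module. The sample text stands in for the lab's data file, which is not reproduced here; the column name `class`, the class labels and the helper name `read_iris` are assumptions.

```python
import csv
import io

# Stand-in for the lab's data file. The column name "class", the class
# labels and the exact layout are assumptions made for this sketch.
SAMPLE_CSV = """sepallen,sepalwid,petallen,petalwid,class
5.1,3.5,1.4,0.2,setosa
7.0,3.2,4.7,1.4,versicolor
6.3,3.3,6.0,2.5,virginica
"""

NUMERIC_FIELDS = ["sepallen", "sepalwid", "petallen", "petalwid"]

def read_iris(handle):
    """Read comma-delimited records, taking field names from the first line."""
    reader = csv.DictReader(handle, delimiter=",")
    records = []
    for row in reader:
        for name in NUMERIC_FIELDS:
            row[name] = float(row[name])  # the four measurements are numeric
        records.append(row)
    return records

iris = read_iris(io.StringIO(SAMPLE_CSV))
print(len(iris), iris[0])
```

To read a real file instead of the sample text, the handle could be opened with `open(path, newline="")`, where `path` is wherever the data file is stored.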
Now, in the Output palette, click and drag the table node to the programming
area. Next, create a link between your two nodes as follows. Click and hold
the middle button/scroll button of the mouse on your file node, then drag to
your table node. You should be able to see a link from the file node to the
table node. Double click the table node and choose execute (or right click the
node and choose execute from the dropdown menu); a table
appears displaying the iris data.
In the Field Ops palette, choose the type node and add it to the programming
area. Draw a link from your file node to the type node. Double clicking the
type node brings up the options window. Make sure the types are correct for
the attributes and the iris type (class). Clicking the direction column allows
you to change the input/output mode. Set the four attributes as IN and only
the class as OUT.
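The effect of the direction setting can be pictured as a small table of field metadata that later steps consult to decide which fields are predictors and which is the target. The dictionary layout, the type labels and the helper `fields_with_direction` below are invented for illustration and do not correspond to any Clementine file format.

```python
# Hypothetical stand-in for the type node's settings. The layout and the
# type labels are assumptions made for this sketch only.
FIELD_SETTINGS = {
    "sepallen": {"type": "range", "direction": "IN"},
    "sepalwid": {"type": "range", "direction": "IN"},
    "petallen": {"type": "range", "direction": "IN"},
    "petalwid": {"type": "range", "direction": "IN"},
    "class": {"type": "set", "direction": "OUT"},
}

def fields_with_direction(settings, direction):
    """List the field names whose direction matches, in declaration order."""
    return [name for name, s in settings.items() if s["direction"] == direction]

inputs = fields_with_direction(FIELD_SETTINGS, "IN")    # predictor fields
targets = fields_with_direction(FIELD_SETTINGS, "OUT")  # field to predict
print(inputs, targets)
```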
Feature Selection
Add a filter node from the Field Ops palette after the type node. Determine
how to discard the sepal length attribute. Display the new data set by
connecting your filter node to a new table node.
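What the filter step does to the data can be sketched in a few lines: every record passes through, but the discarded field is left out. This does not show how to configure the filter node itself; the helper name `filter_fields` and the sample record are assumptions.

```python
def filter_fields(records, discard):
    """Return copies of the records without the fields named in `discard`."""
    return [
        {name: value for name, value in record.items() if name not in discard}
        for record in records
    ]

# One made-up record in the assumed layout of the iris file.
records = [
    {"sepallen": 5.1, "sepalwid": 3.5, "petallen": 1.4, "petalwid": 0.2,
     "class": "setosa"},
]

filtered = filter_fields(records, discard={"sepallen"})
print(filtered[0])
```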
Add two sample nodes from the Record Ops palette: set one to
include sample and 1-in-2, and the other to discard sample and 1-in-2. Link
the first to the filter node to act as the training branch. Link the second to
the filter node as well, to act as the testing branch.
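The reason the two sample settings give complementary training and testing sets can be seen in a short sketch. It assumes that 1-in-2 sampling starts counting at the first record, which may not match Clementine exactly; the function names are hypothetical.

```python
def include_one_in_n(records, n):
    """Keep every n-th record, starting with the first (an assumption)."""
    return [r for i, r in enumerate(records) if i % n == 0]

def discard_one_in_n(records, n):
    """Drop the records that include_one_in_n keeps and pass the rest."""
    return [r for i, r in enumerate(records) if i % n != 0]

# Record numbers 1..10 stand in for real records.
records = list(range(1, 11))

training = include_one_in_n(records, 2)  # training branch
testing = discard_one_in_n(records, 2)   # testing branch
print(training)
print(testing)
```

Because one branch keeps exactly the records the other discards, no record is used for both training and testing.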
C5.0 Prediction
Link a C5.0 node to the sample node in the training branch, then double
click the node to execute the stream. A model will be generated and placed in
the model palette. Click and drag the model (which is now a node) to the
programming area and link it to the sample node in the testing branch. The
rule set can be viewed by double clicking the model.
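The rules in the generated model are tests of the form "attribute <= threshold". As background, the sketch below scores one such test with entropy-based information gain, the kind of criterion used by C4.5-family learners. It is not the C5.0 implementation, which also involves refinements such as pruning that are not modelled here; the six data points and class labels are invented.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())

def information_gain(values, labels, threshold):
    """Entropy reduction from splitting the records at value <= threshold."""
    left = [l for v, l in zip(values, labels) if v <= threshold]
    right = [l for v, l in zip(values, labels) if v > threshold]
    if not left or not right:
        return 0.0  # the split separates nothing
    n = len(labels)
    remainder = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(labels) - remainder

# Invented toy data: six petal lengths with their class labels.
petallen = [1.4, 1.5, 4.7, 4.5, 6.0, 5.1]
labels = ["setosa", "setosa", "versicolor", "versicolor",
          "virginica", "virginica"]

gain = information_gain(petallen, labels, threshold=2.0)
print(round(gain, 4))
```

A tree learner of this family would evaluate many candidate thresholds on each attribute and prefer the tests that score best.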
Prediction Analysis
Add an analysis node to the testing branch after the model, double click it to
set its options, then execute it. You should be able to see the prediction
results.
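The kind of summary such an analysis produces can be sketched as the proportion of correct predictions together with a count of each (actual, predicted) pair. The function name `analyse` and the four example labels are made up; they are not output from the lab.

```python
from collections import Counter

def analyse(actual, predicted):
    """Return overall accuracy and counts of (actual, predicted) pairs."""
    correct = sum(a == p for a, p in zip(actual, predicted))
    accuracy = correct / len(actual)
    confusion = Counter(zip(actual, predicted))
    return accuracy, confusion

# Invented labels for four test records.
actual = ["setosa", "versicolor", "virginica", "virginica"]
predicted = ["setosa", "versicolor", "versicolor", "virginica"]

accuracy, confusion = analyse(actual, predicted)
print(accuracy)
for (a, p), count in sorted(confusion.items()):
    print(a, "predicted as", p, ":", count)
```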
Exercises
1. Use the nodes in the Graphs palette to display histograms of sepal
length, sepal width, petal length, and petal width.
2. Use a node in the Graphs palette to display scatterplots of some pairs
of attributes. Investigate why we discarded the sepal length attribute by
looking at pairs that include it.