Introduction
In predictive modeling and data mining, one is often confronted with a large number of inputs
(explanatory variables). The number of potential inputs to choose from may be as large as 2000
or higher. Some of these inputs may not have any relation to the target. An initial screening is
therefore necessary to eliminate irrelevant variables and keep the number of inputs to a manageable
size. The Variable Selection node of SAS Enterprise Miner provides alternative methods for
eliminating irrelevant variables and selecting variables that have predictive power. In the
process of variable selection, the Variable Selection node creates binned variables from
interval-scaled inputs and grouped variables from nominal inputs. Sometimes a binned input is
more strongly correlated with the target variable than the original input, indicating a non-linear
relationship between the input and the target. The grouped variables are created by collapsing or
grouping the categories of a nominal input. With fewer categories, the grouped variables are
easier to use in modeling than the original ungrouped variables.
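The point about binned inputs can be illustrated with a small sketch. It is not the Variable Selection node's own computation: it assumes equal-width bins, treats the binned input as a class variable, and measures its strength with a one-way ANOVA R-Square. The function names and the toy data are hypothetical, and the input is assumed to be non-constant.

```python
def r_square_linear(xs, ys):
    """Squared Pearson correlation between a raw interval input and the target."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy ** 2 / (sxx * syy)


def r_square_binned(xs, ys, n_bins):
    """R-Square of the target on an equal-width binned input (bins used as classes)."""
    lo, hi = min(xs), max(xs)
    width = (hi - lo) / n_bins
    groups = {}
    for x, y in zip(xs, ys):
        # The maximum value would fall just past the last bin, so clamp it.
        bin_id = min(int((x - lo) / width), n_bins - 1)
        groups.setdefault(bin_id, []).append(y)
    grand_mean = sum(ys) / len(ys)
    ss_total = sum((y - grand_mean) ** 2 for y in ys)
    ss_between = sum(
        len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups.values()
    )
    return ss_between / ss_total


# Toy data with a purely non-linear (quadratic) relationship.
xs = list(range(-10, 11))
ys = [x * x for x in xs]
raw_r2 = r_square_linear(xs, ys)        # essentially zero
binned_r2 = r_square_binned(xs, ys, 4)  # clearly larger than the raw value
```

In this toy case the raw input shows no linear association with the target, while four bins already capture most of the variation.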
The predictive power of the inputs can sometimes be enhanced by making suitable
transformations. One can use the Transform Variables node to select the best mathematical
transformation for any given input, based on criteria such as maximizing normality or
maximizing correlation with the target. The Transform Variables node can also be used for
optimally binning the interval inputs and creating dummy variables from categorical inputs.
Variable selection and transformation are also performed by the Decision Tree node. The inputs that
give significant splits in creating a decision tree are selected by the Decision Tree node and
passed to the next node, which may be a Regression or Neural Network node. In addition to
variable selection, the Decision Tree node creates a special categorical variable which indicates
the leaf node to which a given record is assigned.
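The two roles of the tree described above can be sketched with a hand-written tree. Everything here is hypothetical: the tree, its cut points, the input names, and the dictionary layout are invented for illustration and do not come from Enterprise Miner.

```python
# A hypothetical fitted tree: internal nodes split on one input, leaves carry an id.
TREE = {
    "input": "income", "cut": 50,
    "left": {"leaf": 1},
    "right": {
        "input": "age", "cut": 30,
        "left": {"leaf": 2},
        "right": {"leaf": 3},
    },
}


def split_inputs(node):
    """Collect the inputs that appear in at least one split (the 'selected' inputs)."""
    if "leaf" in node:
        return set()
    return {node["input"]} | split_inputs(node["left"]) | split_inputs(node["right"])


def leaf_of(record, node=TREE):
    """Return the id of the leaf that a record falls into."""
    while "leaf" not in node:
        branch = "left" if record[node["input"]] < node["cut"] else "right"
        node = node[branch]
    return node["leaf"]


record = {"income": 60, "age": 25, "tenure": 3}
selected = split_inputs(TREE)   # 'tenure' never splits, so it is not selected
leaf_id = leaf_of(record)       # categorical value passed on as a new input
```

The leaf id plays the part of the special categorical variable: each record gets exactly one value, and a downstream model can use it like any other class input.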
This paper discusses the details of the variable selection methods, transformations and the options
available in these three nodes.
NESUG 2007
In the R-Square method, variable selection is performed in two steps. In the first step, the R-Square
(squared correlation) between each input and the target is calculated, and all variables with an
R-Square above a specified threshold are selected. The variables selected in the first step then
enter the second step of variable selection.
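The first screening step can be sketched as follows, assuming the R-Square of a single interval input is its squared Pearson correlation with the target. The function names, the toy inputs, and the threshold value of 0.1 are hypothetical and chosen only for illustration.

```python
def squared_correlation(xs, ys):
    """Squared Pearson correlation; returns 0.0 if either variable is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy ** 2 / (sxx * syy)


def screen_inputs(inputs, target, threshold):
    """Keep the inputs whose R-Square with the target is above the threshold."""
    kept = {}
    for name, values in inputs.items():
        r2 = squared_correlation(values, target)
        if r2 > threshold:
            kept[name] = r2
    return kept


target = [1, 2, 3, 4, 5, 6]
inputs = {
    "x1": [2, 4, 6, 8, 10, 12],   # perfectly linear in the target
    "x2": [1, -1, 1, -1, 1, -1],  # weakly related to the target
    "x3": [7, 7, 7, 7, 7, 7],     # constant, carries no information
}
kept = screen_inputs(inputs, target, threshold=0.1)
```

Only the inputs returned by the screening would go on to the second step.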
Step 1:
This transformation is available for binary targets only. The input is split into a
number of bins, and the splits are placed so as to make the distribution of the
target levels (for example, response and non-response) in each bin significantly
different from the distribution in the other bins.
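The text does not state how the split points are found. As one possible reading, the sketch below searches for a single cut point that maximizes the Pearson chi-square statistic between bin membership and a 0/1 target. The chi-square criterion, the restriction to one split, and all names are assumptions made for illustration.

```python
def chi_square_2x2(a, b, c, d):
    """Pearson chi-square statistic for the 2x2 table [[a, b], [c, d]]."""
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        return 0.0
    return n * (a * d - b * c) ** 2 / denom


def best_split(xs, ys):
    """Return (cut, chi2) for the best single split; ys must be 0/1.

    The two bins are x < cut and x > cut, with cut placed midway
    between adjacent distinct input values.
    """
    pairs = sorted(zip(xs, ys))
    total_pos = sum(ys)
    total_neg = len(ys) - total_pos
    best_cut, best_chi2 = None, 0.0
    left_pos = left_neg = 0
    for i in range(len(pairs) - 1):
        x, y = pairs[i]
        left_pos += y
        left_neg += 1 - y
        next_x = pairs[i + 1][0]
        if next_x == x:
            continue  # no cut point exists between equal input values
        chi2 = chi_square_2x2(
            left_pos, left_neg, total_pos - left_pos, total_neg - left_neg
        )
        if chi2 > best_chi2:
            best_cut, best_chi2 = (x + next_x) / 2, chi2
    return best_cut, best_chi2


xs = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
ys = [0, 0, 0, 0, 1, 1, 0, 1, 1, 1]   # 1 = response, 0 = non-response
cut, chi2 = best_split(xs, ys)
```

Applying the same search again inside each resulting bin would give more than two bins, which is closer to the multi-bin result the text describes.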
Best Power Transformations
The Transform Variables node selects the best power transformation from among
$X$, $\log(X)$, $\sqrt{X}$, $e^{X}$, $X^{1/4}$, $X^{2}$, and $X^{4}$, where $X$ is the input. Four
criteria are available for choosing the best transformation:
Maximum Normal: To find the transformation that maximizes normality, sample
quantiles from each of the transformations listed above are compared with the
theoretical quantiles of a normal distribution. The transformation that yields quantiles
that are closest to the normal distribution is chosen.
Suppose $Y$ is obtained by applying one of the above transformations to $X$. For
example, the 0.75-sample quantile of the transformed variable $Y$ is that value of $Y$ at or
below which 75% of the observations in the data set fall. The 0.75-quantile for a
standard normal distribution is 0.6745, given by $P(Z \le 0.6745) = 0.75$, where $Z$ is a
normal random variable with mean 0 and standard deviation 1. The 0.75-sample
quantile for $Y$ is compared with 0.6745, and similarly the other quantiles are compared
with the corresponding quantiles of the standard normal distribution.
Maximum Correlation: This is available only for continuous targets. The
transformation that yields the highest linear correlation with the target is chosen.
Equalize Spread with Target Levels: This method requires a class target. The method
first calculates the variance of a given transformed variable within each target class. Then,
for each transformation, it calculates the variance of these variances. It chooses the
transformation that yields the smallest variance of the variances.
Optimal Maximum Equalize Spread with Target Level: This method requires a
class target. It chooses the method that equalizes spread with the target.
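The first three criteria can be sketched in a few lines. This is an illustration under stated assumptions, not the node's implementation: the transformed variable is standardized before its quantiles are compared with standard normal quantiles, the plotting positions (i + 0.5)/n and the mean squared quantile difference are choices made here, the correlation criterion uses the absolute value, and the spread criterion follows the text literally without rescaling. All function names and data are hypothetical, inputs are assumed positive so that the logarithm is defined, and the fourth (optimal) criterion is not covered.

```python
import math
from statistics import NormalDist, mean, pstdev, pvariance, variance

# The seven candidate transformations listed in the text.
TRANSFORMS = {
    "x": lambda x: x,
    "log(x)": math.log,
    "sqrt(x)": math.sqrt,
    "exp(x)": math.exp,
    "x**(1/4)": lambda x: x ** 0.25,
    "x**2": lambda x: x ** 2,
    "x**4": lambda x: x ** 4,
}


def normality_distance(ys):
    """Mean squared gap between standardized sample quantiles and normal quantiles."""
    n = len(ys)
    m, s = mean(ys), pstdev(ys)
    z = sorted((y - m) / s for y in ys)
    normal = NormalDist()
    return sum((zi - normal.inv_cdf((i + 0.5) / n)) ** 2 for i, zi in enumerate(z)) / n


def pearson(xs, ys):
    """Pearson correlation coefficient."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def spread_imbalance(ys, classes):
    """Variance of the within-class variances of a transformed input."""
    by_class = {}
    for y, c in zip(ys, classes):
        by_class.setdefault(c, []).append(y)
    return pvariance([variance(vals) for vals in by_class.values()])


def best_transform(xs, score, largest=False):
    """Name of the transformation with the smallest (or largest) score."""
    scores = {name: score([f(x) for x in xs]) for name, f in TRANSFORMS.items()}
    pick = max if largest else min
    return pick(scores, key=scores.get)


# Maximum Normal: log-normal data, so the log should look most normal.
lognormal_xs = [math.exp(NormalDist().inv_cdf((i + 0.5) / 50)) for i in range(50)]
most_normal = best_transform(lognormal_xs, normality_distance)

# Maximum Correlation: the target is linear in sqrt(x).
xs = [1, 4, 9, 16, 25, 36]
target = [2 * math.sqrt(x) + 1 for x in xs]
most_correlated = best_transform(xs, lambda ys: abs(pearson(ys, target)), largest=True)

# Equalize Spread: class B is class A multiplied by 10, so the log equalizes spread.
values = [1, 2, 4, 10, 20, 40]
classes = ["A", "A", "A", "B", "B", "B"]
most_equal = best_transform(values, lambda ys: spread_imbalance(ys, classes))
```

Because the spread criterion is applied literally here, it is sensitive to the scale of each transformed variable; whether Enterprise Miner rescales before comparing is not stated in the text.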
Transformations of Class Inputs
For class inputs, two types of transformations are available.
Group Rare Levels transformation:
This transformation combines the rare levels into a separate group, _OTHER_. To define
a rare level, you define a cutoff value.
Dummy Indicators Transformation:
This transformation creates dummy (indicator) variables from the levels of a class input.
To choose one of these available transformations, select the Transform Variables node
and set the value of the Class Inputs property to the desired transformation.
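Both class-input transformations can be pictured with a short sketch. It assumes the cutoff is expressed as a proportion of observations and that one 0/1 column is produced per level; the function names, the column naming, and the sample data are hypothetical.

```python
from collections import Counter


def group_rare_levels(values, cutoff):
    """Replace levels whose share of observations is below cutoff with _OTHER_."""
    counts = Counter(values)
    n = len(values)
    rare = {level for level, count in counts.items() if count / n < cutoff}
    return ["_OTHER_" if v in rare else v for v in values]


def dummy_indicators(values):
    """Build one 0/1 indicator column per distinct level."""
    levels = sorted(set(values))
    return {level: [1 if v == level else 0 for v in values] for level in levels}


regions = ["N"] * 6 + ["S"] * 3 + ["E"]          # "E" covers 10% of the records
grouped = group_rare_levels(regions, cutoff=0.2)  # "E" becomes _OTHER_
dummies = dummy_indicators(grouped)               # columns for N, S and _OTHER_
```

Grouping rare levels first keeps the number of indicator columns small, which is the practical reason for applying it before creating dummies.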
Display 3
Display 4 shows the property settings of the Decision Tree node for variable selection
and variable transformation.
In order to use the Decision Tree node for variable selection and transformation, you
should set the Variable Selection property to YES, the Leaf Variable property to YES,
and the Leaf Role property to Input, as shown in Display 4. For a detailed discussion of the
Decision Tree node see Predictive Modeling with SAS Enterprise Miner by the
author of this paper.
To make the data available to the project, one must first create a data source.
Creation of a data source is illustrated step-by-step in the book Predictive Modeling with
SAS Enterprise Miner. From the property panel shown in Display 5, it can be seen
that the name of the data set is NESUG2007 and it is in the library assigned to T1.
From the property panel shown in Display 6, it can be seen that 40% of the records are
allocated for training, 30% for validation and 30% for test and the data is split by the
default method. For binary targets the default method is stratified sampling.
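A stratified 40/30/30 partition of this kind can be sketched as below. This is not the partitioning node's algorithm: the shuffling, the rounding of stratum sizes, the seed, and the function and field names are all assumptions made for illustration.

```python
import random
from collections import defaultdict


def stratified_partition(records, target_of, fractions=(0.4, 0.3, 0.3), seed=12345):
    """Split records into train/validation/test within each target level."""
    rng = random.Random(seed)
    by_level = defaultdict(list)
    for rec in records:
        by_level[target_of(rec)].append(rec)

    train, validate, test = [], [], []
    for level_records in by_level.values():
        rng.shuffle(level_records)
        n = len(level_records)
        n_train = round(fractions[0] * n)
        n_valid = round(fractions[1] * n)
        train += level_records[:n_train]
        validate += level_records[n_train:n_train + n_valid]
        test += level_records[n_train + n_valid:]  # remainder goes to test
    return train, validate, test


# 100 hypothetical records with a 20% response rate.
records = [{"id": i, "target": 1 if i < 20 else 0} for i in range(100)]
train, validate, test = stratified_partition(records, lambda r: r["target"])
```

Because the split is done separately within each target level, each partition keeps the same response rate as the full data set.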
Display 7 shows the properties panel for the Variable Selection node.
The transformations chosen in Display 8 are Maximum Normal for
interval inputs and Dummy Indicators for class inputs. These are the default methods.
However, one can open the Variables window of the Transform Variables node and
specify different transformations for different inputs.
Display 9 shows the transformations available for interval inputs in Enterprise Miner, and
Display 10 shows the transformations available for class inputs.
Reference