
Knowledge-Based Systems 9 (1996) 67-72

Letter
Dimensionality reduction via discretization
Huan Liu, Rudy Setiono
Department of Information Systems and Computer Science, National University of Singapore, Singapore 0511, Singapore

Received 9 May 1995; revised 22 August 1995; accepted 25 August 1995

Abstract

The existence of numeric data and large numbers of records in a database present a challenging task in terms of extracting explicit concepts from the raw data. The paper introduces a method that reduces data vertically and horizontally, keeps the discriminating
power of the original data, and paves the way for extracting concepts. The method is based on discretization (vertical reduction) and
feature selection (horizontal reduction). The experimental results show that (a) the data can be effectively reduced by the proposed
method; (b) the predictive accuracy of a classifier (C4.5) can be improved after data and dimensionality reduction; and (c) the
classification rules learned are simpler.

Keywords: Dimensionality reduction; Discretization; Knowledge discovery

1. Introduction

The wide use of computers brings about the proliferation of databases. Without the aid of computers, little of this raw data will ever be seen and exploited by humans. Knowledge discovery systems in databases [1] are designed to analyze the data, find regularities in the data (knowledge) and present it to humans in understandable formats.

One of the goals of knowledge discovery in databases is to extract explicit concepts from the raw data [1-5]. The existence of numeric data and large numbers of records in a database present a challenging task in terms of reaching this goal, due to the huge data space determined by the numeric attributes. This paper introduces a method that reduces numeric data vertically and horizontally, keeps the discriminating power of the original data, and paves the way for extracting concepts. The method is based on discretization (vertical reduction) and feature selection (horizontal reduction). The χ² distribution is employed to continue the discretization of the numeric attributes until the original discriminating power of the data cannot be maintained. This step significantly reduces the possible data space from a continuum to discreteness according to the characteristics of the data by merging attribute values. In addition, after discretization, duplicates may occur in the data. Removing these duplicates amounts to reducing the amount of data. Hence, the original database, if viewed as a large table, is shortened in terms of its vertical dimension. Feature (attribute) selection is a process in which relevant attributes are chosen from many for a certain problem [6-9]. The selection is accomplished by retaining those attributes having more than one discrete value. Other attributes can be removed. Both discretization and feature selection maintain the discriminating power of the processed data.

A data and dimensionality reduction (DDR) system is built according to the vertical and horizontal reduction (VHR) method. The experimental results show that (a) the data can be effectively reduced by the VHR method; (b) the predictive accuracy of a classifier (C4.5 [10]) can be improved after data and dimensionality reduction; and (c) the classification rules learned are simpler. In other words, the VHR method reduces the size of the database, limits the possible search space for a classifier, and produces simpler learned concepts.


2. DDR system for continuous attributes

The key algorithm of the proposed DDR system is the VHR method (referred to hereafter as VHR), which is summarized below. VHR uses the χ² distribution. The idea is to check the correlation between an attribute and the class values, based on which VHR tries to merge the ordered values of that attribute as much as is allowed by the χ² distribution for a given significance level. VHR begins with some significance level, e.g. 0.5, for the discretization of all the numeric attributes. Each attribute i is associated with a sigLevel[i], and the attributes are merged in turn. Each attribute is sorted according to its values. Then the following is performed: (a) calculate the χ² value for every pair of adjacent intervals (at the beginning, each pattern is put into its own interval); and (b) merge the pair of adjacent intervals with the lowest χ² value. Merging continues until all the pairs of intervals have χ² values exceeding the parameter determined by the sigLevel (initially, the χ² value for significance level 0.5 is 0.455 if the degree of freedom is 1). The above process is repeated with a decremental sigLevel[i] until a given inconsistency rate δ is exceeded in the discretized data. Consistency checking is conducted after each attribute's merging. If no inconsistency is found, sigLevel[i] is decremented for attribute i's next round of merging; otherwise, attribute i will not be involved in further merging. This process is continued until no attribute's values can be merged. At the end, if an attribute is merged to only one value, it simply means that this attribute is not relevant in representing the original dataset. As a result, when discretization ends, feature selection is also accomplished.

The VHR algorithm is as follows.

VHR algorithm.

set all sigLevel[i] = 0.5 for attribute i;
do until no-attribute-can-be-merged {
    for each mergeable attribute i {
        Sort(attribute i, data);
        chi-sq-initialization(attribute i, data);
        do {
            chi-sq-calculation(attribute i, data);
        } while (Merge(data) is TRUE)
        if (Inconsistency(data) < δ)
            sigLevel[i] = decreSigLevel(sigLevel[i]);
        else
            attribute i is not mergeable;
    }
}

The formula for computing the χ² value can be found in any standard statistics book.
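For two adjacent intervals and k classes, the statistic is χ² = Σᵢ Σⱼ (Aᵢⱼ − Eᵢⱼ)² / Eᵢⱼ, where Aᵢⱼ is the number of patterns of class j in interval i and Eᵢⱼ = (row total × column total) / N is the expected count. The sketch below illustrates steps (a) and (b) for a single attribute. It is an editorial illustration rather than the authors' implementation; the use of scipy for the threshold lookup and the 0.1 substitution for zero expected counts are assumptions of this sketch, not details given in the paper.

```python
# Sketch of the chi-square merging for one attribute: intervals are rows of
# per-class counts; adjacent pairs are scored and the lowest-scoring pair is
# merged until every remaining pair exceeds the threshold for sigLevel.
from scipy.stats import chi2


def pair_chi_square(a, b):
    """Chi-square value for two adjacent intervals given as lists of class counts."""
    k = len(a)
    row_totals = [sum(a), sum(b)]
    col_totals = [a[j] + b[j] for j in range(k)]
    n = sum(row_totals)
    value = 0.0
    for i, row in enumerate((a, b)):
        for j in range(k):
            expected = row_totals[i] * col_totals[j] / n
            expected = expected if expected > 0 else 0.1  # guard against empty classes
            value += (row[j] - expected) ** 2 / expected
    return value


def merge_attribute(intervals, sig_level, num_classes):
    """Merge adjacent intervals of one sorted attribute as far as sig_level allows."""
    threshold = chi2.isf(sig_level, df=num_classes - 1)
    intervals = [list(row) for row in intervals]
    while len(intervals) > 1:
        scores = [pair_chi_square(intervals[i], intervals[i + 1])
                  for i in range(len(intervals) - 1)]
        lowest = min(range(len(scores)), key=scores.__getitem__)
        if scores[lowest] > threshold:  # every remaining pair is significant
            break
        merged = [intervals[lowest][j] + intervals[lowest + 1][j]
                  for j in range(num_classes)]
        intervals[lowest:lowest + 2] = [merged]
    return intervals


# chi2.isf(0.5, df=1) is about 0.455, the initial threshold quoted in the text
# for significance level 0.5 with one degree of freedom.
```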
3. Experiments

In order to measure how data and dimensionality reduction is achieved, we need to consider several aspects. First, the dimensionally reduced data should still have the same discriminating power as the original; second, the reduced data should have gains for a pattern classifier in terms of predictive accuracy as well as the simplicity of the learned concepts.

For the first aspect, it is sufficient to show that the absolute number of inconsistencies does not increase after the reduction. VHR guarantees this property since the number of inconsistencies is the stopping criterion for VHR. Given a dataset, it is not difficult¹ to compute the number of inconsistencies in the set. For the second aspect, however, a pattern classifier is needed in the experiments. C4.5 [10] is chosen because (a) it can handle both numeric and nominal data; and (b) it is well known, widely available, and works quite well in many domains. Therefore there is no need to explain it in detail. The output of C4.5 is a decision tree. Whether a learned concept is simple or not can be linked to the size of a tree. In other words, if tree A is larger than tree B, then tree B is simpler.
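One standard way to obtain the inconsistency count mentioned above is to group records on their attribute values and, for each group, count the records outside the group's largest class; the total over all groups is the inconsistency count. The following is a small editorial sketch of that computation, assuming the data is given as (attribute-tuple, class-label) pairs; it is an illustration, not the authors' code.

```python
from collections import Counter, defaultdict


def inconsistency_count(records):
    """records: iterable of (attribute_tuple, class_label) pairs.
    Records are inconsistent when they match on all attributes but differ in
    class; each group of matching records contributes its size minus the size
    of its largest class."""
    groups = defaultdict(Counter)
    for attributes, label in records:
        groups[attributes][label] += 1
    return sum(sum(counts.values()) - max(counts.values())
               for counts in groups.values())


# Two identical attribute vectors with different classes -> one inconsistency.
assert inconsistency_count([((1, 2), 'a'), ((1, 2), 'b'), ((3, 4), 'a')]) == 1
```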
The experimental procedure for each dataset is as follows:

(1) Apply VHR to reduce the data.
(2) Run C4.5 on both the original and the reduced data.
(3) Obtain results on the predictive accuracy and tree size.

A DDR algorithm should do more than reduce the data; an effective DDR algorithm can improve a pattern classifier's accuracy and simplify the learned concepts as well as reduce the data. We want to show that VHR possesses these features.

3.1. Datasets

The three datasets considered are the University of California at Irvine iris set, Wisconsin breast cancer set, and heart disease set². They have different types of attributes. The iris data consists of continuous attributes, the breast cancer data consists of ordinal discrete attributes, and the heart disease data contains mixed attributes (numeric and discrete). The three datasets are described briefly below.

The iris dataset contains 50 patterns each of the classes Iris setosa, Iris versicolor, and Iris virginica. Each pattern is described using four numeric attributes: sepal-length, sepal-width, petal-length, and petal-width. The odd-numbered patterns of the original data are selected for training, and the rest for testing.

The breast cancer dataset contains 699 samples of breast fine-needle aspirates collected at the University of Wisconsin Hospital, USA. There are nine discrete attributes valued on a scale of 1 to 10. The class value is either 'benign' or 'malignant'. The dataset is split randomly into two sets: 350 patterns for training and 349 for testing.

¹ The time required is O(n²).
² These can all be obtained from the University of California at Irvine, USA, machine learning repository via anonymous ftp to ics.uci.edu.

The heart disease dataset contains data on medical cases of heart disease. It contains both nominally and numerically valued features; there are eight nominally valued attributes and five numerically valued attributes. The two class values are 'healthy heart' and 'diseased heart'. We removed patterns with missing attribute values, and used 299 patterns, in which one-third were randomly chosen for testing, and the rest for training.

3.2. Detailed example

The two stages (intermediate and final) of VHR processing for the iris dataset are described to demonstrate the behavior of VHR. The intermediate stage is where all four attributes have the same minimum significance level (sigLevel = 0.2, χ² = 3.22), keeping the number of inconsistencies under the threshold δ (δ = 75 × 5%). The final stage is where no further attribute value merging is possible without sacrificing discriminating power.

Table 1 shows the intervals, class frequencies, and χ² values for the sepal-length attribute after the data initialization by VHR. The results for the four attributes at the two stages are shown in Tables 2-9. With the χ² threshold 3.22, for example, five discrete values are needed for the sepal-length attribute: < 4.9 → 1, ..., < 6.1 → 4, and ≥ 6.1 → 5. The last one means that, if a numeric value is greater than or equal to 6.1, it is quantized to 5. When VHR terminates, the values of both the sepal-length and sepal-width attributes are merged into one value, so they can be removed; the petal-length and petal-width attributes are discretized into three discrete values each. Tables 2-9 summarize each attribute's intervals, class frequencies, and χ² values. All the thresholds mentioned are automatically determined by VHR. Tables 2-9 are generated using the 75 training data patterns. If all 150 patterns were used, the tables could be different. Here we assume that the testing data is not available at the stage of dimensionality reduction.
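As a concrete illustration of this mapping, the sketch below quantizes a sepal-length value using the intermediate-stage interval boundaries listed in Table 2 (4.9, 5.0, 5.5 and 6.1). It is an editorial sketch of the quantization rule just described, not code from the paper.

```python
# Quantization sketch: a numeric value is mapped to the 1-based index of the
# interval it falls in, where cut_points are the left ends of intervals 2..n.
from bisect import bisect_right


def quantize(value, cut_points):
    """Return 1 plus the number of cut points that do not exceed value."""
    return bisect_right(cut_points, value) + 1


sepal_length_cuts = [4.9, 5.0, 5.5, 6.1]   # intermediate stage, from Table 2
print(quantize(4.5, sepal_length_cuts))    # 1  (< 4.9 -> 1)
print(quantize(6.1, sepal_length_cuts))    # 5  (>= 6.1 -> 5)
```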

Table 1
Initial intervals, class frequencies, and χ² values for sepal-length

Interval   Class frequency   χ²
4.4        3   0   0         0.20
4.6        1   0   0         0.20
4.7        1   0   0         0.20
4.8        1   0   0         1.97
4.9        1   0   1         2.62
5.0        1   1   0         0.10
5.1        1   1   0         0.70
5.2        1   0   0         0.20
5.3        1   0   0         0.41
5.4        1   1   0         1.32
5.5        1   2   0         1.66
5.6        0   4   0         2.50
5.7        1   1   0         1.28
5.8        1   2   2         1.20
5.9        0   1   0         0.54
6.0        0   2   1         1.43
6.1        0   0   1         0.54
6.2        0   1   2         0.14
6.3        0   2   3         0.14
6.4        0   1   2         0.16
6.5        0   1   3         1.97
6.6        0   1   0         2.50
6.7        0   1   4         0.73
6.8        0   1   1         0.10
6.9        0   1   1         0.85
7.0        0   1   0         2.10
7.1        0   0   1         0.20
7.4        0   0   1         0.20
7.7        0   0   2

Table 2
Intervals, class frequencies and χ² values for sepal-length at intermediate stage

Interval   Class frequency   χ²
4.4        9   0   0         5.05
4.9        1   0   1         8.11
5.0        12  3   0         13.64
5.5        3   12  3         14.23
6.1        0   10  21

χ² threshold: 3.22.

Table 3
Intervals, class frequencies and χ² values for sepal-length at final stage

Interval   Class frequency
4.4        25  25  25

χ² threshold: 50.6.

Table 4
Intervals, class frequencies and χ² values for sepal-width at intermediate stage

Interval   Class frequency   χ²
2.0        0   4   0         4.90
2.5        0   8   12        8.67
2.9        1   5   0         5.80
3.0        6   8   11        6.14
3.4        5   0   2         4.23
3.5        13  0   0

χ² threshold: 3.22.

Table 5
Intervals, class frequencies and χ² values for sepal-width at final stage

Interval   Class frequency
2.0        25  25  25

χ² threshold: 40.6.

Table 6
Intervals, class frequencies and χ² values for petal-length at intermediate stage

Interval   Class frequency   χ²
1.0        25  0   0         47.00
3.0        0   21  1         4.18
4.8        0   4   2         17.21
5.0        0   0   22

χ² threshold: 3.22.



Table 7
Intervals, class frequencies and χ² values for petal-length at final stage

Interval   Class frequency   χ²
1.0        25  0   0         53.00
3.0        0   25  3         39.39
5.0        0   0   22

χ² threshold: 10.6.

Table 8
Intervals, class frequencies and χ² values for petal-width at intermediate stage

Interval   Class frequency   χ²
0.1        25  0   0         38.10
1.0        0   13  0         4.72
1.4        0   2   1         3.37
1.5        0   9   0         11.35
1.7        0   1   5         3.40
1.9        0   0   19

χ² threshold: 3.22.

Table 9
Intervals, class frequencies and χ² values for petal-width at final stage

Interval   Class frequency   χ²
0.1        25  0   0         50.00
1.0        0   24  1         42.42
1.7        0   1   24

χ² threshold: 10.6.

3.3. Results

The results are summarized below, where 'before' or 'after' means that the result was obtained before or after using VHR. The data reduced by VHR preserves the discriminating power of the original data. This is measured by the number of inconsistencies before and after VHR processing. The characteristics of the data remain as well, since the accuracy of C4.5 for the data processed (after) by VHR is at least as good as that for the original data (before). The tree size is smaller after using VHR. The number of distinguishable patterns decreases, and so does the number of attributes. The details are as follows:

• Accuracy: It was explained earlier that VHR preserves the discriminating power of the original data. However, this preservation is not useful if a classifier cannot learn well from the dimensionally reduced data; that is, it is only useful if the use of VHR gives rise to a better or similar performance. For the three sets of data, the results with VHR are at least as good as those without using VHR, regardless of other considerations (to be discussed subsequently) (see Table 10).

Table 10
Accuracy before and after using VHR

                         Accuracy, %
                         Before   After
Iris data                94.7     94.7
Breast cancer dataset    92.6     94.6
Heart disease dataset    72.7     78.8

• Tree size: An immediate benefit of applying a DDR system is that the learned concept can be simpler. It creates a smaller tree for a decision tree approach such as C4.5. For the datasets chosen, the tree size can be reduced by as much as half of the original size (see Table 11).

Table 11
Tree size before and after using VHR

                         Tree size
                         Before   After
Iris data                5        5
Breast cancer dataset    21       11
Heart disease dataset    43       22

• Dataset size: This is defined by the number of items (records) in the training data (a database). After VHR processing, the number of nonduplicate items is reduced (see Table 12; a short sketch of this duplicate removal follows the list). For the iris data, only six distinct items remain, with one inconsistency. For such cases, even an exhaustive search method can be employed to produce high quality classification rules without resorting to monothetic methods such as C4.5. Two sets of rules are presented here to illustrate the point. Ruleset A is produced by C4.5, and ruleset B is produced by a rule generator that induces rules from a small dataset heuristically [11]. The accuracies of the two rulesets for the training data are 97% and 99%, and for the testing data they are 94% and 97%, respectively.

Ruleset A:
    petal-length < 1.9 → 1
    petal-length > 1.9 & petal-width < 1.6 → 2
    petal-width > 1.6 → 3
    default → 1

Ruleset B:
    petal-length < 3.0 → 1
    petal-length < 5.0 & petal-width < 1.7 → 2
    default → 3

Table 12
Dataset size before and after using VHR

                         Dataset size
                         Before   After
Iris data                75       6
Breast cancer dataset    350      75
Heart disease dataset    198      173

• Number of attributes: One of the most important advantages of VHR is that it can reduce the number of attributes (see Table 13). Only relevant attributes are chosen and irrelevant ones are deleted. This will be a great help in reducing work and minimizing resource use in future data collection and classification. It also helps human experts and data analysts to focus on the important dimensions.

Table 13
Number of attributes before and after using VHR

                         Number of attributes
                         Before   After
Iris data                4        2
Breast cancer dataset    13       10
Heart disease dataset    9        6
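The duplicate removal that produces the dataset sizes in Table 12 can be pictured with the small sketch below. It assumes the discretized training data is available as (attribute-tuple, class-label) pairs; whether the class label takes part in the duplicate test is not spelled out in the text, so treating the whole record as the key is one plausible reading rather than the paper's definition.

```python
# Editorial sketch: after discretization many records become identical, and
# dropping the duplicates yields the reduced dataset sizes reported in Table 12.
def drop_duplicates(records):
    """records: iterable of (attribute_tuple, class_label) pairs.
    Returns the distinct records, preserving their first-seen order."""
    seen = set()
    unique = []
    for record in records:
        if record not in seen:
            seen.add(record)
            unique.append(record)
    return unique


# Example with already-discretized values: three records collapse to two.
assert len(drop_duplicates([((1, 1), 'setosa'),
                            ((1, 1), 'setosa'),
                            ((3, 2), 'virginica')])) == 2
```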
4. Conclusions

We have introduced a DDR system based on the VHR algorithm. The key idea is to apply techniques of discretization and feature selection to data and dimensionality reduction in the context of numeric attributes. Discretization merges the values of each attribute, and it thus significantly decreases the number of values a continuous attribute can take, and reduces the data in a vertical dimension. Normally this process will generate some duplicates in the data; by removing the duplicates, the database becomes smaller while keeping the same discriminating power. The horizontal dimensionality reduction is achieved by feature selection that eliminates those discretized attributes having only one possible value.

The advantages of having dimensionally reduced data are fourfold: (a) it narrows down the search space determined by the attributes; (b) it allows faster learning for a classifier; (c) it helps a classifier produce simpler learned concepts; and (d) it improves predictive accuracy. However, it has its limitations. As of now, we do not see any straightforward way to extend the method to handle higher order correlations in data regardless of the computational cost of the permutation of multiple attributes. Since the possibility of having high order correlated data cannot be ruled out, further work should be done in this direction. Another possible extension is to data with mixed nominal and ordinal attributes. Since the nominal attributes are masked out in VHR, the inconsistency checking of VHR can be done with or without the masked attributes, which leads to either under- or over-discretization. Underdiscretization is caused by the possibility that some masked attributes could be irrelevant. Overdiscretization is due to the fact that masked attributes do contribute to discriminating one record from another. More study is needed. Another line of research is to investigate the relationship between the discriminating power of a database and its real distribution. In the present work, we have used an indirect measure: predictive accuracy. That is, a high accuracy means the dimensionally reduced data keeps the original distribution. The VHR method has been successfully applied to many problems. With these extensions, the VHR method can be more flexible and more generally applicable.

References

[1] W. Frawley, G. Piatetsky-Shapiro and C. Matheus, Knowledge discovery in databases: an overview, AI Magazine (Fall 1992).
[2] IEEE Transactions on Knowledge and Data Engineering, 5(6) (1993) (special issue on learning and discovery in databases).
[3] J. Han, Y. Cai and N. Cercone, Knowledge discovery in databases: an attribute-oriented approach, in Proc. VLDB Conf., 1992, pp. 547-559.
[4] C.J. Matheus, P.K. Chan and G. Piatetsky-Shapiro, Systems for knowledge discovery in databases, IEEE Transactions on Knowledge and Data Engineering, 5(6) (1993).
[5] International Journal of Intelligent Systems, 7(7) (1992) (special issue on knowledge discovery in databases).
[6] H. Almuallim and T.G. Dietterich, Learning boolean concepts in the presence of many irrelevant features, Artificial Intelligence, 69 (1994) 279-305.
[7] U.M. Fayyad and K.B. Irani, The attribute selection problem in decision tree generation, in Proc. AAAI-92: Ninth National Conf. Artificial Intelligence, MIT Press, USA, 1992, pp. 104-110.
[8] H. Ragavan and L. Rendell, Lookahead feature construction for learning hard concepts, in Proc. Seventh Int. Conf. Machine Learning, Morgan Kaufmann, USA, 1993, pp. 252-259.
[9] N. Wyse, R. Dubes and A.K. Jain, A critical evaluation of intrinsic dimensionality algorithms, in E.S. Gelsema and L.N. Kanal (eds.), Pattern Recognition in Practice, Morgan Kaufmann, USA, 1980, pp. 415-425.
[10] J.R. Quinlan, C4.5: Programs for Machine Learning, Morgan Kaufmann, 1993.
[11] H. Liu and S.T. Tan, X2R: a fast rule generator, in Proc. IEEE Int. Conf. Systems, Man and Cybernetics, IEEE, 1995.
