ELSEVIER Knowledge-Based Systems 9 (1996) 67-72
Letter
Dimensionality reduction via discretization
Huan Liu, Rudy Setiono
Department of Information Systems and Computer Science, National University of Singapore, Singapore 0511, Singapore
Abstract
The existence of numeric data and large numbers of records in a database presents a challenging task: extracting explicit concepts from the raw data. This paper introduces a method that reduces data vertically and horizontally, keeps the discriminating power of the original data, and paves the way for extracting concepts. The method is based on discretization (vertical reduction) and feature selection (horizontal reduction). The experimental results show that (a) the data can be effectively reduced by the proposed method; (b) the predictive accuracy of a classifier (C4.5) can be improved after data and dimensionality reduction; and (c) the classification rules learned are simpler.
ordered values of that attribute as much as is allowed by the χ² distribution for a given significance level. VHR begins with some significance level, e.g. 0.5, for all the numeric attributes to be discretized. Each attribute i is associated with a sigLevel[i], and the attributes are merged in turn. Each attribute is sorted according to its values. Then the following is performed: (a) calculate the χ² value for every pair of adjacent intervals (at the beginning, each pattern is put into its own interval); and (b) merge the pair of adjacent intervals with the lowest χ² value. Merging continues until all the pairs of intervals have χ² values exceeding the parameter determined by the sigLevel (initially, the χ² value for significance level 0.5 is 0.455 if the degree of freedom is 1). The above process is repeated with a decreasing sigLevel[i] until a given inconsistency rate δ is exceeded in the discretized data. Consistency checking is conducted after each attribute's merging. If no inconsistency is found, sigLevel[i] is decremented for attribute i's next round of merging; otherwise, attribute i will not be involved in further merging. This process is continued until no attribute's values can be merged. At the end, if an attribute is merged to only one value, it simply means that this attribute is not relevant in representing the original dataset. As a result, when discretization ends, feature selection is also accomplished.

The VHR algorithm is as follows.

VHR algorithm.

set all sigLevel[i] = 0.5 for attribute i;
do until no-attribute-can-be-merged {
    for each mergeable attribute i {
        Sort(attribute i, data);
        chi-sq-initialization(attribute i, data);
        do {
            chi-sq-calculation(attribute i, data);
        } while (Merge(data) is TRUE)
        if (Inconsistency(data) < δ)
            sigLevel[i] = decreSigLevel(sigLevel[i]);
        else
            attribute i is not mergeable;
    }
}

The formula for computing the χ² value can be found in any standard statistics book.

3. Experiments

In order to measure how data and dimensionality reduction is achieved, we need to consider several aspects. First, the dimensionally reduced data should still have the same discriminating power as the original; second, the reduced data should have gains for a pattern classifier in terms of predictive accuracy as well as the simplicity of the learned concepts.

For the first aspect, it is sufficient to show that the absolute number of inconsistencies does not increase after the reduction. VHR guarantees this property since the number of inconsistencies is the stopping criterion for VHR. Given a dataset, it is not difficult¹ to compute the number of inconsistencies in the set. For the second aspect, however, a pattern classifier is needed in the experiments. C4.5 [10] is chosen because (a) it can handle both numeric and nominal data; and (b) it is well known, widely available, and works quite well in many domains. Therefore there is no need to explain it in detail. The output of C4.5 is a decision tree. Whether a learned concept is simple or not can be linked to the size of a tree. In other words, if tree A is larger than tree B, then tree B is simpler.

The experimental procedure for each dataset is as follows:

(1) Apply VHR to reduce the data.
(2) Run C4.5 on both the original and the reduced data.
(3) Obtain results on the predictive accuracy and tree size.

A DDR algorithm should do more than reduce the data; an effective DDR algorithm can improve a pattern classifier's accuracy and simplify the learned concepts as well as reduce the data. We want to show that VHR possesses these features.

3.1. Datasets

The three datasets considered are the University of California at Irvine iris set, Wisconsin breast cancer set, and heart disease set². They have different types of attributes. The iris data consists of continuous attributes, the breast cancer data consists of ordinal discrete attributes, and the heart disease data contains mixed attributes (numeric and discrete). The three datasets are described briefly below.

The iris dataset contains 50 patterns each of the classes Iris setosa, Iris versicolor, and Iris virginica. Each pattern is described using four numeric attributes: sepal-length, sepal-width, petal-length, and petal-width. The odd-numbered data are selected for training, and the rest for testing.

The breast cancer dataset contains 699 samples of breast fine-needle aspirates collected at the University of Wisconsin Hospital, USA. There are nine discrete attributes valued on a scale of 1 to 10. The class value is either 'benign' or 'malignant'. The dataset is split randomly into two sets: 350 patterns for training and 349 for testing.

The heart disease dataset contains data on medical

¹ The time required is O(n²).
² These can all be obtained from the University of California at Irvine, USA, machine learning repository via anonymous ftp to ics.uci.edu.
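To make the per-attribute merging step concrete, it can be sketched in Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the helper names (chi2_pair, merge_attribute) are invented here, and replacing a zero expected count by 0.1 follows a common ChiMerge convention that the paper does not spell out.

```python
from collections import Counter

def chi2_pair(freq_a, freq_b):
    """Chi-square statistic for two adjacent intervals.

    freq_a, freq_b: per-class counts, e.g. [25, 0, 0].
    A zero expected count is replaced by 0.1 (a common ChiMerge
    convention; an assumption here, not stated in the paper).
    """
    n = sum(freq_a) + sum(freq_b)
    col_totals = [a + b for a, b in zip(freq_a, freq_b)]
    chi2 = 0.0
    for row in (freq_a, freq_b):
        row_total = sum(row)
        for observed, col_total in zip(row, col_totals):
            expected = row_total * col_total / n
            if expected == 0:
                expected = 0.1
            chi2 += (observed - expected) ** 2 / expected
    return chi2

def merge_attribute(values, labels, threshold):
    """Bottom-up merging of one attribute's sorted values.

    Each distinct value starts as its own interval; the pair of
    adjacent intervals with the lowest chi-square value is merged,
    repeatedly, until every adjacent pair exceeds `threshold`.
    Returns the lower bounds of the surviving intervals.
    """
    classes = sorted(set(labels))
    # one interval per distinct value: [lower_bound, class-frequency list]
    intervals = []
    for v in sorted(set(values)):
        counts = Counter(l for x, l in zip(values, labels) if x == v)
        intervals.append([v, [counts.get(c, 0) for c in classes]])
    while len(intervals) > 1:
        chi2s = [chi2_pair(intervals[i][1], intervals[i + 1][1])
                 for i in range(len(intervals) - 1)]
        lowest = min(range(len(chi2s)), key=chi2s.__getitem__)
        if chi2s[lowest] > threshold:
            break  # all adjacent pairs exceed the threshold
        lo, hi = intervals[lowest], intervals[lowest + 1]
        lo[1] = [a + b for a, b in zip(lo[1], hi[1])]  # pool class counts
        del intervals[lowest + 1]
    return [iv[0] for iv in intervals]
```

As a sanity check, for the first two petal-length intervals reported in Table 7, chi2_pair([25, 0, 0], [0, 25, 3]) evaluates to approximately 53.0, matching the table's first χ² value.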
Table 1
Initial intervals, class frequencies and χ² values for sepal-length

Table 3
Intervals, class frequencies and χ² values for sepal-length at final stage

Table 7
Intervals, class frequencies and χ² values for petal-length at final stage

Interval   Class frequency   χ²
1.0        25  0  0          53.00
3.0         0 25  3          39.39
5.0         0  0 22

χ² threshold: 10.6.

Table 8
Intervals, class frequencies and χ² values for petal-width at intermediate stage

Interval   Class frequency   χ²
0.1        25  0  0          38.10
1.0         0 13  0           4.72
1.4         0  2  1           3.37
1.5         0  9  0          11.35
1.7         0  1  5           3.40
1.9         0  0 19

Table 10
Accuracy before and after using VHR

                        Accuracy, %
                        Before   After
Iris data               94.7     94.7
Breast cancer dataset   92.6     94.6
Heart disease dataset   12.7     78.8

Table 11
Tree size before and after using VHR

                        Tree size
                        Before   After
Iris data                5        5
Breast cancer dataset   21       11
Heart disease dataset   43       22
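The inconsistency check that drives VHR's stopping criterion, and the duplicate removal that shrinks the discretized dataset, can be sketched as follows. The paper does not spell out the exact formula, so this sketch takes one common reading, counting, within each group of identical records, the patterns outside the majority class; the helper names (inconsistency_count, deduplicate) are illustrative assumptions.

```python
from collections import Counter

def inconsistency_count(records, labels):
    """Number of inconsistencies in a (discretized) dataset.

    Identical records with mixed class labels contribute the group
    size minus the count of the group's majority class -- one common
    reading of the inconsistency measure, assumed here.
    """
    groups = {}
    for rec, lab in zip(records, labels):
        groups.setdefault(tuple(rec), Counter())[lab] += 1
    return sum(sum(c.values()) - max(c.values()) for c in groups.values())

def deduplicate(records, labels):
    """Drop duplicate discretized records, keeping one copy of each
    distinct record together with its majority class label."""
    groups = {}
    for rec, lab in zip(records, labels):
        groups.setdefault(tuple(rec), Counter())[lab] += 1
    return [(list(r), c.most_common(1)[0][0]) for r, c in groups.items()]
```

Under this reading, a discretized training set can collapse to a handful of distinct records; the paper reports that the discretized iris training data reduces to six distinct items with one inconsistency.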
When VHR terminates, the values of both the sepal-length and sepal-width attributes are merged to a single value; that is, these two attributes are not relevant and can be removed.

• Tree size: An immediate benefit of applying a DDR system is that the learned concept can be simpler. It creates a smaller tree for a decision tree approach such as C4.5. For the datasets chosen, the tree size can be reduced by as much as half of the original size (see Table 11).

• Dataset size: This is defined by the number of items (records) in the training data (a database). After VHR processing, the number of nonduplicate items is reduced (see Table 12). As with the iris data, only six distinct items remain with one inconsistency. For such cases, even an exhaustive search method can be employed to produce high quality classification rules without resorting to monothetic methods such as C4.5. Two sets of rules are presented here to illustrate the point. Ruleset A is produced by C4.5, and ruleset B is produced by a rule generator that induces rules from a small dataset heuristically [11]. The accuracies of the two rulesets for the training data are 97% and 99%, and for the testing data they are 94% and 97%, respectively.

Ruleset A:

petal-length < 1.9 → 1
petal-length > 1.9 & petal-width < 1.6 → 2
petal-width > 1.6 → 3
default → 1

Ruleset B:

petal-length < 3.0 → 1
petal-length < 5.0 & petal-width < 1.7 → 2
default → 3

• Number of attributes: One of the most important advantages of VHR is that it can reduce the number of attributes (see Table 13). Only relevant attributes are chosen and irrelevant ones are deleted. This will be a great help in reducing work and minimizing resource use in future data collection and classification. It also helps human experts and data analysts to focus on the important dimensions.

4. Conclusions

We have introduced a DDR system based on the VHR algorithm. The key idea is to apply techniques of discretization and feature selection to data and dimensionality reduction in the context of numeric attributes. Discretization merges the values of each attribute; it thus significantly decreases the number of values a continuous attribute can take, and reduces the data in a vertical dimension. Normally this process will generate some duplicates in the data; by removing the duplicates, the database becomes smaller while keeping the same discriminating power. The horizontal dimensionality reduction is achieved by feature selection that eliminates those discretized attributes having only one possible value.

The advantages of having dimensionally reduced data are fourfold: (a) it narrows down the search space determined by the attributes; (b) it allows faster learning for a classifier; (c) it helps a classifier produce simpler learned concepts; and (d) it improves predictive accuracy. However, it has its limitations. As of now, we do not see any straightforward way to extend the method to handle higher order correlations in data regardless of the computational cost of the permutation of multiple attributes. Since the possibility of having high order correlated data cannot be ruled out, further work should be done in this direction. Another possible extension is to data with mixed nominal and ordinal attributes. Since the nominal attributes are masked out in VHR, the inconsistency checking of VHR can be done with or without the masked attributes, which leads to either under- or over-discretization. Underdiscretization is caused by the possibility that some masked attributes could be irrelevant. Overdiscretization is due to the fact that masked attributes do contribute to discriminating one record from another. More study is needed. Another line of research is to investigate the relationship between the discriminating power of a database and its real distribution. In the present work, we have used an indirect measure: predictive accuracy. That is, a high accuracy means the dimensionally reduced data keeps the original distribution. The VHR method has been successfully applied to many problems. With these extensions, the VHR method can be more flexible and more generally applicable.

References

[1] W. Frawley, G. Piatetsky-Shapiro and C. Matheus, Knowledge discovery in databases: an overview, AI Magazine (Fall 1992).
[2] IEEE Transactions on Knowledge and Data Engineering, 5(6) (1993) (special issue on learning and discovery in databases).
[3] J. Han, Y. Cai and N. Cercone, Knowledge discovery in databases: an attribute oriented approach, in Proc. VLDB Conf., 1992, pp. 547-559.
[4] C.J. Matheus, P.K. Chan and G. Piatetsky-Shapiro, Systems for knowledge discovery in databases, IEEE Transactions on Knowledge and Data Engineering, 5(6) (1993).
[5] International Journal of Intelligent Systems, 7(7) (1992) (special issue on knowledge discovery in databases).
[6] H. Almuallim and T.G. Dietterich, Learning boolean concepts in the presence of many irrelevant features, Artificial Intelligence, 69 (1994) 279-305.
[7] U.M. Fayyad and K.B. Irani, The attribute selection problem in decision tree generation, in Proc. AAAI-92: Ninth National Conf. Artificial Intelligence, MIT Press, USA, 1992, pp. 104-110.
[8] H. Ragavan and L. Rendell, Lookahead feature construction for learning hard concepts, in Proc. Seventh Int. Conf. Machine Learning, Morgan Kaufmann, USA, 1993, pp. 252-259.
[9] N. Wyse, R. Dubes and A.K. Jain, A critical evaluation of intrinsic dimensionality algorithms, in E.S. Gelsema and L.N. Kanal (eds.), Pattern Recognition in Practice, Morgan Kaufmann, USA, 1980, pp. 415-425.
[10] J.R. Quinlan, C4.5: Programs for Machine Learning, Morgan Kaufmann, 1993.
[11] H. Liu and S.T. Tan, X2R: a fast rule generator, in Proc. IEEE Int. Conf. Systems, Man and Cybernetics, IEEE, 1995.