
Density Conscious Subspace Clustering for High-Dimensional Data


Yi-Hong Chu, Jen-Wei Huang, Kun-Ta Chuang,
De-Nian Yang, Member, IEEE, and Ming-Syan Chen, Fellow, IEEE
Abstract: Instead of finding clusters in the full feature space, subspace clustering is an emergent task which aims at detecting clusters embedded in subspaces. Most previous works in the literature are density-based approaches, where a cluster is regarded as a high-density region in a subspace. However, the identification of dense regions in previous works lacks consideration of a critical problem, called the density divergence problem in this paper, which refers to the phenomenon that region densities vary in different subspace cardinalities. Without considering this problem, previous works utilize a single density threshold to discover the dense regions in all subspaces, which incurs a serious loss of clustering accuracy (in either the recall or the precision of the resulting clusters) in different subspace cardinalities. To tackle the density divergence problem, in this paper we devise a novel subspace clustering model to discover the clusters based on the relative region densities in the subspaces, where the clusters are regarded as regions whose densities are relatively high as compared to the region densities in a subspace. Based on this idea, different density thresholds are adaptively determined to discover the clusters in different subspace cardinalities. Due to the infeasibility of applying previous techniques in this novel clustering model, we also devise an innovative algorithm, referred to as DENCOS (DENsity COnscious Subspace clustering), which adopts a divide-and-conquer scheme to efficiently discover clusters satisfying the different density thresholds in different subspace cardinalities. As validated by our extensive experiments on various data sets, DENCOS can discover the clusters in all subspaces with high quality, and the efficiency of DENCOS outperforms that of previous works.
Index Terms: Data mining, data clustering, subspace clustering.

1 INTRODUCTION
Clustering has been recognized as an important and valuable capability in the data mining field [8], [12], [13]. For high-dimensional data, recent research has reported that traditional clustering techniques may suffer from the problem of discovering meaningful clusters due to the curse of dimensionality. Specifically, the curse of dimensionality [1], [3], [6], [16] refers to the phenomenon that, as the dimension cardinality increases, the distance from a given point r to its nearest point approaches the distance from r to its farthest point. Due to this loss of distance discrimination in high dimensions, discovering meaningful, separable clusters is very challenging, if not impossible.
A common approach to cope with the curse of
dimensionality problem for mining tasks is to reduce the
data dimensionality by using the techniques of feature
transformation and feature selection [24]. The feature
transformation techniques, such as principal component
analysis (PCA) and singular value decomposition (SVD),
summarize the data in a smaller set of dimensions derived from combinations of the original data attributes. However, the transformed features/dimensions no longer have any intuitive meaning, and thus the resulting clusters are hard to interpret and analyze [17]. On the other hand,
the feature selection methods [7], [20] reduce the data
dimensionality by trying to select the most relevant
attributes from the original data attributes. In this way, only a particular subspace (a subspace of the original data space is the space formed by a subset of the data dimensions [4]) is selected to discover the clusters. However, in many real data sets, clusters may be embedded in varying subspaces, and thus in the feature selection approaches the information of data points clustered differently in varying subspaces is lost [17].
Motivated by the fact that different groups of points may be clustered in different subspaces, a significant amount of research has been devoted to subspace clustering, which
aims at discovering clusters embedded in any subspace of
the original feature space [17]. The applicability of subspace
clustering has been demonstrated in various applications,
including gene expression data analysis, E-commerce, DNA
microarray analysis, and so forth [11], [17], [18], [21]. For
example, in the gene expression data, each data record
stores the expression levels, i.e., the intensity of the
expression, of a gene derived from different samples, which
may represent time slots. Clustering the genes in subspaces
may help to identify the genes whose expression levels are
similar in a subset of samples, where co-expressed genes
usually are functionally correlated [17]. Note that genes of
different functionalities may be clustered in different
subsets of samples.
Most previous subspace clustering works [4], [9], [17], [23] discover the subspace clusters by regarding the clusters as regions of higher density than their surroundings in a subspace. In addition, they identify the high-density regions (clusters) by imposing a density threshold on region densities, such that a region is identified as dense if its region density exceeds the density threshold. However, we find that previous works may have difficulty achieving high cluster quality in all subspaces, since their identification of high-density regions lacks consideration of a critical problem, called the density divergence problem in this paper. The density divergence problem refers to the phenomenon that cluster densities vary in different subspace cardinalities (the cardinality of a subspace is the number of dimensions forming it). Note that as the number of dimensions increases, data points are spread out in a larger dimensional space and thus are naturally more sparsely populated, showing varying region densities in different subspace cardinalities. This implies that extracting clusters in higher subspaces requires a lower density threshold (otherwise true clusters in higher subspaces will be lost). Due to the requirement of varying density thresholds for discovering clusters in different subspace cardinalities, it is challenging for subspace clustering to simultaneously achieve high precision and recall (for a cluster, recall is the percentage of the data points in the true cluster that are identified in the discovered cluster, and precision is the percentage of the data points in the discovered cluster that really belong to the true cluster) for clusters in different subspace cardinalities. More explicitly, since previous works identify the dense regions by utilizing a single threshold on region densities, the trade-off between recall and precision is inevitably faced: a high threshold leads to high precision at the cost of recall; in contrast, a low threshold leads to high recall at the cost of precision.
To clearly illustrate the problems incurred by the density divergence problem, we apply the previous subspace clustering works [4], [9], which utilize the grid structure for identifying the dense regions, to the two-dimensional example data set shown in Fig. 1a. In previous works adopting the grid structure, the data space is first partitioned into a number of equisized units, and then clusters are discovered by grouping the connected dense units, where a unit is said to be dense if the number of data points contained in it exceeds a prespecified threshold τ. Therefore, we apply the grid-based subspace clustering algorithms on the example data set in Fig. 1a by first partitioning the two-dimensional space into 6 × 6 units. As can be seen, there are three clusters in the data set: two one-dimensional clusters, u3 and v4, and one two-dimensional cluster, (u3 × v3) ∪ (u3 × v4), where ui (vj) denotes the ith (jth) interval of the horizontal (vertical) dimension. In Fig. 1b, we show the counts of the units, i.e., the numbers of data points contained in the units, related to the clusters. Note that by setting τ to 54, the two one-dimensional clusters, u3 and v4, can be discovered with high quality. However, in this case, the density threshold τ is set too high, and the two-dimensional cluster (u3 × v3) ∪ (u3 × v4) cannot be found. That is, we get high precision and recall for the discovered one-dimensional clusters, but low precision and recall for the two-dimensional one. On the other hand, for the two-dimensional cluster to be discovered, τ should be set to 15, but this leads to low precision for the two one-dimensional clusters. In this scenario, the one-dimensional cluster u3 is discovered by joining u3 with one more unit, u4, such that the precision decreases. A similar result is derived for the one-dimensional cluster v4, which is combined with one more unit, v3. Therefore, it is infeasible for previous subspace clustering models to simultaneously achieve high precision and recall for clusters in different subspace cardinalities.
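For intuition, the following toy sketch (our own numbers, loosely modeled on Fig. 1, not the paper's actual data) reproduces this trade-off: no single absolute threshold separates both cardinalities well.

```python
# Unit counts inspired by Fig. 1: 1-dimensional cluster units hold ~54+
# points, while the 2-dimensional cluster units hold only ~15-20 points.
counts_1d = {"u3": 60, "v4": 58, "u4": 16, "v3": 17}   # per 1-dim unit
counts_2d = {("u3", "v3"): 16, ("u3", "v4"): 18}       # per 2-dim unit

for tau in (54, 15):
    dense_1d = [u for u, c in counts_1d.items() if c >= tau]
    dense_2d = [u for u, c in counts_2d.items() if c >= tau]
    print(tau, dense_1d, dense_2d)
# tau = 54: finds u3, v4 but misses the whole 2-dim cluster (low recall there).
# tau = 15: finds the 2-dim cluster but also u4, v3 (low 1-dim precision).
```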
Considering the varying region densities in different subspace cardinalities, we note that a more appropriate way to determine whether a region in a subspace should be identified as dense is to compare its density with the region densities in that subspace. Motivated by this idea, in this paper we devise a novel subspace clustering model which discovers the clusters based on the relative region densities. In our subspace clustering model, we regard the clusters in a subspace as the regions which have relatively high densities as compared to the average region density in the subspace. To discover such clusters, we introduce a novel density parameter α for users to specify the expected ratio between the densities of the dense regions and the average region density in a subspace. Then, given a user-specified α value, due to the different average region densities in different subspace cardinalities, we adaptively determine different density thresholds to discover clusters of relatively high densities in subspaces of different cardinalities.
Discovering clusters in different cardinalities with different density thresholds is useful but quite challenging. Note that due to the large search space of subspaces in high-dimensional data, previous works constrain the search for dense regions based on the monotonicity property, under which a region in a subspace will not be examined as a candidate dense region if any of its projections in lower subspaces is not a dense region. However, since in our model different density thresholds are utilized to discover dense regions in different
Fig. 1. Illustration of the density divergence problem. (a) An example two-dimensional data set and (b) the unit counts related to the subspace clusters.
subspace cardinalities, the monotonicity property no longer exists; that is, if a k-dimensional region is dense, its (k-1)-dimensional projections may not be dense. Without the monotonicity property, the Apriori-like generate-and-test scheme adopted in most previous works to constrain the search for dense regions is infeasible in our model. A naive method would need to exhaustively examine all regions to discover the dense ones. For this
challenge, we devise an innovative algorithm, referred to as DENsity COnscious Subspace clustering (abbreviated as DENCOS), to efficiently discover the clusters satisfying the different density thresholds in different subspace cardinalities. In DENCOS, we devise a mechanism that computes upper bounds of region densities to constrain the search for dense regions, where the regions whose density upper bounds are lower than the density thresholds are pruned away in identifying the dense regions. We compute the upper bounds of region densities by utilizing a novel data structure, the DFP-tree (Density FP-tree), in which we store summarized information of the dense regions. In addition, from the DFP-tree we also propose to calculate lower bounds of the region densities to accelerate the identification of the dense regions. Therefore, in DENCOS, the dense region discovery is devised as a divide-and-conquer scheme. At first, the information of the lower bounds of region densities is utilized to efficiently extract the dense regions, i.e., the regions whose density lower bounds exceed the density thresholds. Then, for the remaining regions, the search for dense regions is constrained to the regions whose upper bounds of region densities exceed the density thresholds. The experimental results on extensive data sets reveal that our proposed algorithm DENCOS achieves better performance in both clustering quality and efficiency than previous works.
The remainder of the paper is organized as follows. In Section 2, related works on subspace clustering are presented. In Section 3, we give the preliminaries and the formal definition of our proposed subspace clustering model. The proposed DENCOS algorithm is described in Section 4. In Section 5, we present the experimental results. The paper concludes with Section 6.
2 RELATED WORK
The subspace clustering algorithms can be divided into two categories according to whether the grid structure is used or not. The CLIQUE [4] algorithm and its successors, ENCLUS [9] and MAFIA [23], form the first category of approaches to subspace clustering, and they are all grid-based algorithms. In CLIQUE, the data space is first partitioned into equal-sized units (grids). Then, subspace clusters are discovered by grouping the connected dense units, which are defined as the ones whose densities exceed a threshold τ. However, as shown by the experiments conducted in this paper, by utilizing an absolute unit density as the threshold to discover the dense units in all subspaces, it is difficult for CLIQUE to discover high-quality clusters in all subspaces. This is because the unit densities vary in different subspace cardinalities, so that identifying the dense units in all subspaces with an absolute unit density threshold suffers from the trade-off between precision and recall stated in Section 1.
In addition, the algorithm ENCLUS [9] extends CLIQUE and aims at accelerating the clustering process by introducing the concept of subspace entropy for selecting interesting subspaces in which to discover the clusters. However, ENCLUS adopts the same clustering model as CLIQUE in discovering the clusters, and thus evidently faces the density divergence problem. A more significant modification of CLIQUE is presented in MAFIA [23]. In MAFIA, higher dimensional candidate dense units (CDUs) are formed by first identifying variable-sized units (bins) in each dimension and then grouping them into units in higher subspaces. From these CDUs, dense units are extracted with the thresholds defined for each variable-sized bin in each dimension. However, CDUs in higher subspaces may have much lower densities than the one-dimensional bins, so that the thresholds defined for the one-dimensional bins can hardly capture the varying densities of CDUs in different subspace cardinalities. Thus, MAFIA may also have difficulty discovering high-quality clusters in all subspace cardinalities.
Instead of using grids, the SUBCLU algorithm [17] adopts the definition of density-connected clusters underlying the algorithm DBSCAN [10] to discover the clusters. In SUBCLU, a cluster is defined to be comprised of several density-connected core objects and the surrounding border objects. Two parameters, ε and μ, are utilized to define the core objects, which are the data points containing at least μ data points in their ε-neighborhood. The notion of core objects also holds the monotonicity property, and thus an Apriori-like algorithm is devised in SUBCLU to discover the core objects in all subspaces. However, SUBCLU also suffers from the density divergence problem. Note that for a data point in a higher subspace, the number of data points in its ε-neighborhood will be smaller than the one in a lower subspace due to the sparser distribution of the data points in higher subspaces. Thus, SUBCLU has difficulty identifying the core objects in different subspace cardinalities by using the same parameter setting, thus resulting in poor clustering results.
The algorithm FIRES [19] proposes a generic framework for efficient subspace clustering, which allows the use of any clustering notion of choice and a variable cluster criterion. For efficiency, FIRES only discovers the clusters of maximal dimensionality; however, this may greatly lose the information of the clusters in lower subspaces. The algorithm DUSC [5] proposes a dimensionality-unbiased subspace clustering model, where the redundancy pruning that removes clusters having portions of data points overlapping with other clusters may result in incomplete clustering results.
The problem of projected clustering [2], [22], [25], [27], [28] has also been devised for clustering high-dimensional data. Projected clustering aims at finding a (k+1)-partition {C_1, ..., C_k, O} of the data, where C_i, representing the ith cluster, is a set of objects that are closely correlated in a subspace, and O contains the outliers. While projected clustering techniques also aim at discovering correlations among data in various subspaces, their output is significantly different from the output of subspace clustering [2]: projected clustering aims to partition the data into disjoint sets, whereas subspace clustering allows a data point to belong to different subspace clusters. Thus, the projected clustering algorithms are not considered here.
3 DENSITY CONSCIOUS SUBSPACE CLUSTERING
3.1 Preliminary
In view of the density divergence problem illustrated in Section 1, we devise a novel subspace clustering model to discover the clusters based on the relative region densities in the subspaces, where the clusters are regarded as regions whose densities are relatively high as compared to the average region density in a subspace. The grid structure [4] is adopted in our clustering model to discover the subspace clusters. For ease of illustration, let us first give some definitions. Let {a_1, a_2, ..., a_d} be the set of d attributes of the data set, and S = a_1 × a_2 × ... × a_d be the corresponding d-dimensional data space. Any k-dimensional subspace of S is the space with k dimensions drawn from the d attributes, where k ≤ d. The cardinality of a subspace is defined as the number of dimensions forming this subspace.

To discover clusters, the grid structure is derived by partitioning the data space S into a number of nonoverlapping rectangular units. These rectangular units are derived by partitioning each attribute into δ equal-length intervals (where δ is an input parameter), such that a unit in space S is the intersection of one interval from each of the d attributes. Consider the projection of the data set onto a k-dimensional subspace. A k-dimensional unit is defined as the intersection of one interval from each of the k attributes. In addition, the count value of a k-dimensional unit is defined as the number of data points in this unit. Based on these units, the clusters in a k-dimensional subspace are discovered by first identifying the k-dimensional dense units and then grouping the connected ones into clusters, where two k-dimensional units u_1 and u_2 are connected if they have a common face or if there exists another k-dimensional unit u_3 such that both u_1 and u_2 are connected to u_3 [4].
In view of the density divergence problem, we identify the dense units in a subspace by discovering the units whose unit counts are relatively high as compared to the average unit count in the subspace. Therefore, we devise a novel density parameter α, called the unit strength factor, to denote the user-expected ratio between the unit counts of the dense units and the average unit count in a subspace. Then, based on the density parameter α, different density thresholds are determined for discovering the dense units in different subspace cardinalities, as shown in the following:

Definition 1 (density thresholds). Let τ_k denote the density threshold for the subspace cardinality k, and let N be the total number of data points. The density threshold τ_k is defined as

    τ_k = α · (N / δ^k).    (1)
Note that in (1), the term N/δ^k is the average unit count of the units in a k-dimensional subspace. The idea of taking the average unit count as the base value for identifying the dense units is as follows. When the data points are uniformly distributed in a k-dimensional subspace (the case where the data points have no correlations in this subspace), the number of data points in each of the δ^k k-dimensional units in this subspace will be approximately N/δ^k, i.e., the average unit count. In this scenario, no clusters will be discovered, because every unit has almost the same density, corresponding to the fact that no clusters of correlated data points exist in the subspace. As the data are more compacted into clusters, the units within clusters become much denser and thus have larger count values than the average unit count. Therefore, the input parameter α is introduced for users to specify their perception of the ratio between the unit counts of the dense units and the average unit count. Then, the threshold τ_k is so defined that a k-dimensional unit is identified as a dense one if its count value exceeds α times the average unit count N/δ^k. Since α is independent of the subspace cardinality, it is much easier for users to specify the density parameter α to discover the clusters in all subspaces, because they need not consider the varying unit counts in different subspace cardinalities. Thus, given a user-specified α, the different density thresholds shown in (1) are determined automatically to identify the dense units in different subspace cardinalities.
In addition to the parameter α, in discovering the clusters we also introduce the maximal subspace cardinality, denoted as k_max, to ensure that the density threshold τ_{k_max} > 1. Setting the maximal cardinality for discovering the clusters is requisite. Note that as the subspace cardinality increases, the average unit count N/δ^k decreases. When the subspace cardinality reaches some higher value, the derived density threshold τ_k falls to 1 or below. However, using a density threshold smaller than or equal to one to identify dense units is meaningless, because in this circumstance all the units containing at least one data point would be identified as dense, resulting in meaningless clustering results. Therefore, the cluster discovery is performed in subspaces of cardinalities up to k_max for discovering meaningful clusters.
Problem Definition. Given the unit strength factor α and the maximal subspace cardinality k_max, for the subspaces of cardinality k from 1 to k_max, find the clusters each of which is a maximal set of connected dense k-dimensional units whose unit counts are larger than or equal to the density threshold τ_k.
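To make the adaptive thresholds concrete, here is a minimal Python sketch of (1) and of deriving k_max; the function names and the example numbers are ours, not the paper's.

```python
def density_thresholds(n_points, delta, alpha, k_limit):
    """Compute tau_k = alpha * N / delta^k for k = 1..k_limit; see (1)."""
    return {k: alpha * n_points / delta ** k for k in range(1, k_limit + 1)}

def max_cardinality(n_points, delta, alpha):
    """Largest k_max with tau_{k_max} > 1, i.e., alpha * N > delta^k_max."""
    k = 1
    while alpha * n_points / delta ** (k + 1) > 1:
        k += 1
    return k

# Example: N = 10000 points, delta = 6 intervals per dimension, alpha = 15.
k_max = max_cardinality(10000, 6, 15)          # tau_k stays above 1 up to k_max
taus = density_thresholds(10000, 6, 15, k_max) # adaptive, decreasing thresholds
```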
4 THE DENCOS ALGORITHM
In this paper, we devise the DENCOS algorithm, standing for DENsity COnscious Subspace clustering, to discover the clusters with our proposed density thresholds. In DENCOS, we cast the problem of density conscious subspace clustering as a problem similar to frequent itemset mining. We note that by regarding the intervals in all dimensions as a set of unique items in the frequent itemset mining problem, any k-dimensional unit can be regarded as a k-itemset, i.e., an itemset of cardinality k. Thus, identifying the dense units satisfying the density thresholds in subspace clustering is similar to mining the frequent itemsets satisfying the minimum support in frequent itemset mining.
However, our proposed density conscious subspace clustering problem is significantly different from the frequent itemset mining problem, since different density thresholds are utilized to discover the dense units in different subspace cardinalities, and thus the frequent itemset mining techniques cannot be adopted here to discover the clusters. Specifically, the monotonicity property no longer exists in such situations; that is, if a k-dimensional unit is dense, a (k-1)-dimensional projection of this unit may not be dense. Therefore, the Apriori-like candidate generate-and-test scheme, which is adopted in most previous subspace clustering works [4], [9], [17], is infeasible in our clustering model. In addition, the FP-tree-based mining algorithm FP-growth [14], proposed to mine the frequent itemsets, cannot be adopted to discover the clusters with multiple thresholds, since the theorem in [14] (Lemma 3.2 in [14]) does not hold in such an environment. The theorem shows that when the FP-tree contains only a single path, the enumeration of all the combinations of the nodes in the path yields exactly the frequent itemsets (which are dense units in the subspace clustering problem). However, because the units in the enumeration of the combinations of the nodes in the path are of various cardinalities, they may not all be dense units in this problem (since dense units of different cardinalities are defined to satisfy different density thresholds), thus showing the infeasibility of the FP-growth algorithm in our subspace clustering model. For this challenge, we devise the DENCOS algorithm in this paper to efficiently discover the clusters with our proposed density thresholds.
The DENCOS algorithm focuses on discovering the dense units, because after the dense units are mined, we can follow the procedure proposed in [4] to group the connected dense units into clusters. In DENCOS, the dense unit discovery is performed by utilizing a novel data structure, the DFP-tree (Density FP-tree), which is constructed on the data set to store the complete information of the dense units. From the DFP-tree, we compute lower bounds and upper bounds of the unit counts to accelerate the dense unit discovery, and this information is utilized in a divide-and-conquer scheme to mine the dense units. Therefore, DENCOS is devised as a two-phase algorithm comprised of the preprocessing phase and the discovering phase. The preprocessing phase constructs the DFP-tree on the transformed data set, where the data set is transformed with the purpose of casting the density conscious subspace clustering problem as a similar frequent itemset mining problem. Then, in the discovering phase, the DFP-tree is employed to discover the dense units by using a divide-and-conquer scheme. The details of the two phases are described in Sections 4.1 and 4.2, respectively.
4.1 Preprocessing Phase
In the preprocessing phase, we first transform the data set by converting each d-dimensional data point into a set of d one-dimensional units, corresponding to the intervals it resides in within the d dimensions. For illustration, let u_{i,j} denote the one-dimensional unit for the jth interval of the attribute a_i, where 1 ≤ i ≤ d and 1 ≤ j ≤ δ.

Example 1. Consider transforming the four-dimensional example data set in Fig. 2a. Suppose that each dimension has been normalized into [0, 60]. Consider δ = 6, so that each interval is of length 60/6 = 10. In this manner, the data set is transformed as shown in the rightmost column of Fig. 2a.
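As a concrete illustration of this transformation, the following sketch (with our own naming and normalization assumptions, matching Example 1's [0, 60] range and δ = 6) maps each coordinate to its interval index:

```python
def to_units(point, delta=6, lo=0.0, hi=60.0):
    """Map a d-dimensional point to its d one-dimensional units u_{i,j}.

    Each unit is encoded as a pair (i, j): attribute index i and
    interval index j, with 1 <= j <= delta.
    """
    width = (hi - lo) / delta
    units = []
    for i, x in enumerate(point, start=1):
        j = min(int((x - lo) // width) + 1, delta)  # clamp x == hi into interval delta
        units.append((i, j))
    return units

# A 4-dimensional point becomes four one-dimensional units:
print(to_units([3.0, 25.0, 59.9, 10.0]))  # [(1, 1), (2, 3), (3, 6), (4, 2)]
```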
After transforming the data set, the DFP-tree is constructed to condense the transformed data set. In this paper, we devise the DFP-tree by adding an extra feature to the FP-tree [14] for discovering the dense units with different density thresholds. Note that, as stated at the beginning of this section, the FP-tree-based algorithm FP-growth [14], proposed to mine the frequent itemsets, cannot be adopted here to discover the dense units satisfying different density thresholds in different cardinalities, since the theorem in [14] does not hold in such an environment. In this paper, we propose to compute upper bounds of unit counts for constraining the search for dense units, and we add extra features into the DFP-tree for this computation. For ease of presentation, we first illustrate the construction of the DFP-tree; the details of the extra feature added to the DFP-tree are described with the discovering phase in Section 4.2.2.

The DFP-tree is constructed by inserting each transformed data record as a path in the DFP-tree, with the nodes storing the one-dimensional units of the record. Paths with common prefix nodes are merged and their node counts accumulated. Before inserting the transformed data into the DFP-tree, we first utilize the smallest threshold among the thresholds for subspace cardinalities 1 to k_max, i.e., τ_{k_max}, to remove from each record the one-dimensional units whose total occurrences in the data set do not exceed τ_{k_max}. This is because these removed one-dimensional units cannot be extended to generate higher dimensional dense units satisfying thresholds larger than τ_{k_max}, so that they need not be inserted into the DFP-tree for discovering the dense units.
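A minimal sketch of this construction follows; the classes and names are our own simplification, and the real DFP-tree additionally keeps the header-table surplus counts introduced in Section 4.2.2:

```python
from collections import Counter, defaultdict

class Node:
    def __init__(self, unit, parent=None):
        self.unit, self.parent, self.count, self.children = unit, parent, 0, {}

def build_dfp_tree(records, tau_kmax):
    """records: list of unit lists (see to_units). Prune units whose total
    occurrences fall below tau_kmax, then insert each record as a path,
    merging common prefixes and accumulating node counts."""
    freq = Counter(u for rec in records for u in rec)
    root, header = Node(None), defaultdict(list)   # header: unit -> its nodes
    order = {u: i for i, (u, _) in enumerate(freq.most_common())}
    for rec in records:
        kept = sorted((u for u in rec if freq[u] >= tau_kmax), key=order.get)
        node = root
        for u in kept:
            if u not in node.children:
                node.children[u] = Node(u, node)
                header[u].append(node.children[u])
            node = node.children[u]
            node.count += 1
    return root, header
```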
Example 2. Fig. 3 shows the DFP-tree constructed on the transformed data set in Fig. 2a. Suppose that k_max is 3, τ_2 is 12, and τ_3 is 2. The construction process is shown in Fig. 3b.

Note that the surplus counts added in the header table of the DFP-tree are the extra feature added for dense unit discovery, and will be described in Section 4.2.2.
4.2 Discovering Phase
In the discovering phase, the DFP-tree is employed to discover the dense units of cardinalities from 1 to k_max. The one-dimensional dense units are directly discovered from the header table by identifying the one-dimensional units whose stored total unit counts exceed τ_1.
Fig. 2. A four-dimensional example data set. (a) Illustration of transforming the data set and (b) illustration of sorting the transformed data set in (a) for insertion into the DFP-tree.
To discover the higher dimensional dense units from the DFP-tree, a naive method would first enumerate the combinations of the nodes in each path to generate the candidate units, and then examine the total occurrences of these units in the DFP-tree against the corresponding density thresholds. In this paper, instead of the brute-force generation of the candidate units from each path, only the candidate units that are checked to have the possibility of being dense are generated from the paths, thus resulting in a smaller set of candidate units for the dense unit discovery. In the following, we first utilize an example to illustrate the idea of constraining the search for dense units by generating a smaller set of candidate units.
Example 3. Let us consider discovering the dense units from the DFP-tree shown in Fig. 4 with k_max = 3, τ_2 = 30, and τ_3 = 5. Consider the set of prefix paths of the nodes carrying the one-dimensional unit u_22: {⟨u_11:4, u_25:4, u_35:4⟩, ⟨u_11:32, u_51:32⟩, ⟨u_25:4, u_35:4⟩, ⟨u_35:12⟩}. Note that in each of these paths, all the prefix nodes of the node u_22 adjust their node counts to u_22's node count, denoted as n_{u_22}.count, since each of them co-occurs with u_22 exactly n_{u_22}.count times. In addition, these paths form a small database of units which co-occur with the one-dimensional unit u_22, so that from these paths we can discover the dense units associated with u_22. In the following, we consider discovering the two-dimensional dense units related to u_22 from these paths.

Within these paths, the prefix path ⟨u_11:32, u_51:32⟩ is derived from the node u_22 whose node count, 32, exceeds τ_2. In this case, the two-dimensional units which can be derived from this path are necessarily dense, because their unit counts are at least the n_{u_22}.count (i.e., 32) acquired from this path, and this count exceeds τ_2. Based on this idea, we can directly utilize the prefix path ⟨u_11:32, u_51:32⟩ to discover the two-dimensional dense units associated with u_22, and obtain two dense units: u_11 × u_22 and u_51 × u_22.

For the remaining prefix paths of u_22, ⟨u_11:4, u_25:4, u_35:4⟩, ⟨u_25:4, u_35:4⟩, and ⟨u_35:12⟩, we check whether they can produce two-dimensional dense units when taken together into consideration. However, no two-dimensional dense units associated with u_22 can be discovered, because the maximal possible unit count of the two-dimensional units derived from these paths (excluding the dense units u_11 × u_22 and u_51 × u_22 already discovered from the path ⟨u_11:32, u_51:32⟩) is 20 (= 4 + 4 + 12), which corresponds to the case that a unit can be derived from all these paths and thus acquires its unit count from all of them. Clearly, since this maximal possible unit count is smaller than τ_2, no two-dimensional dense units can be generated from these paths. Thus, we need not apply the candidate generate-and-test process on these paths to generate two-dimensional candidate units for further testing against τ_2, thus showing the reduced set of candidate units generated for discovering the dense units.
Based on the idea illustrated in Example 3, we devise a divide-and-conquer scheme to discover the dense units from the DFP-tree. First, we discover the dense units by considering the nodes whose node counts satisfy the thresholds. The basic idea is that, for a node, the unit count of any unit which can be derived from this node's prefix path is at least equal to this node's count; that is, we can take the node count as the lower bound of the unit counts of these units. Thus, if a node's count exceeds τ_k, the k-dimensional units generated from this node's prefix path are certainly dense. In this paper, we call the dense units discovered via the nodes satisfying the density thresholds the inherent dense units.

Next, we consider the nodes which do not satisfy the thresholds. The set of nodes carrying the same one-dimensional unit and all having node counts below τ_k are taken together into consideration in this stage to discover the k-dimensional dense units. For these nodes, we first compute the upper bound of the unit counts of the units derived from their prefix paths, i.e., the maximal possible unit count of the units derived from these prefix paths. Only if the computed upper bound exceeds τ_k will these paths be taken into consideration in discovering the k-dimensional dense units. In this paper, we call the dense units discovered by utilizing the nodes which do not satisfy the thresholds the acquired dense units. The details of the discovery of the inherent dense units and the acquired dense units are described in Sections 4.2.1 and 4.2.2, respectively.
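The two bounds driving this split can be distilled as follows (hypothetical helper names; Node is from the construction sketch above):

```python
def lower_bound_dense(node, tau_k):
    """Inherent case: node.count lower-bounds the count of every unit
    derived from node's prefix path, so such units are dense if it passes."""
    return node.count >= tau_k

def upper_bound_prunable(nodes_below_threshold, tau_k):
    """Acquired case: the summed counts of the low-count nodes carrying the
    same 1-dimensional unit upper-bound any unit derivable from their
    prefix paths; if even this bound misses tau_k, skip the paths entirely."""
    return sum(n.count for n in nodes_below_threshold) < tau_k
```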
4.2.1 Generation of Inherent Dense Units
In this discovering stage, we utilize the nodes satisfying the thresholds to discover the dense units. For the nodes whose node counts satisfy the thresholds for some set of subspace cardinalities, we take their prefix paths to generate the dense units of the satisfied subspace cardinalities. However, a naive method to discover these dense units would require each node to traverse its prefix path several times to generate the dense units for its set of satisfied subspace cardinalities. We have explored that the set of dense units a node needs to discover from its prefix path can be directly generated from the dense units discovered by its prefix nodes, thus avoiding repeated scans of the prefix paths of the nodes. Therefore, with a single traversal of the DFP-tree, we can efficiently discover the dense units for all nodes satisfying the thresholds.
Fig. 3. (a) The DFP-tree constructed on the data set in Fig. 2 and (b) the construction process.
Let us take a node n_i (1 ≤ i ≤ ℓ) in a path of length ℓ in the DFP-tree to illustrate the generation of the dense units from a node, where the nodes in the path are labeled in such an order that n_ℓ is the leaf node. For ease of presentation, we first give some definitions for n_i. Let n_i.count denote the count value of the node n_i. In addition, let D^k_{n_i} denote the set of k-dimensional units (k ≤ i) derived by concatenating n_i's unit with any (k-1)-combination of its prefix nodes. Notice that if k > i, D^k_{n_i} is empty, because n_i has only (i-1) prefix nodes.

It is noted that the prefix nodes of the node n_i co-occur with n_i exactly n_i.count times. Therefore, the units in D^k_{n_i} with 2 ≤ k ≤ i gain n_i.count toward their unit counts from n_i's prefix path, so that their unit counts are at least n_i.count. Thus, if the node count of the node n_i exceeds the threshold for the subspace cardinality k, the units in D^k_{n_i} are certainly dense, so that we can employ n_i's prefix path to discover the k-dimensional dense units D^k_{n_i}. Therefore, let k_i denote the subspace cardinality with τ_{k_i} ≤ n_i.count < τ_{k_i - 1}. That is, the node count n_i.count satisfies the thresholds for the subspace cardinalities k_i and above, due to the decreasing thresholds in higher subspaces. If k_i ≤ k_max, we take the prefix path of the node n_i to generate the dense units D^k_{n_i} for all subspace cardinalities k in the range [k_i, k_max].

To discover the dense units from the nodes satisfying the thresholds, we have explored that the dense units generated by the nodes at lower levels (assuming the root is at level 0) can be utilized to generate the dense units for the nodes at higher levels, so that repeated scans of the prefix paths of the nodes can be avoided. For illustration, let us first consider the generation of the units D^k_{n_i} from n_i's prefix path. According to the definition of D^k_{n_i}, we have the following recursive formulation:

    D^k_{n_i} = ⋃_{j=1..i-1} ( u_{n_i} ⊗ D^{k-1}_{n_j} ),    (2)

where the node n_j denotes a prefix node of n_i (1 ≤ j < i), and the concatenation notation u_{n_i} ⊗ D^{k-1}_{n_j} operates by concatenating n_i's one-dimensional unit u_{n_i} to each (k-1)-dimensional unit in D^{k-1}_{n_j}. Equation (2) shows that for n_i to generate the units of cardinality k, we require each prefix node n_j of the node n_i to provide the units of cardinality k-1.
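A compact sketch of recursion (2), assuming a path is represented as a list of (unit, count) pairs (our own representation):

```python
from itertools import combinations

def derived_units(path, i, k):
    """D^k_{n_i}: k-dimensional units formed by concatenating the unit of
    node n_i (1-indexed position i in `path`) with any (k-1)-combination
    of its prefix nodes' units. Empty if k > i."""
    unit_i = path[i - 1][0]
    prefixes = [u for u, _ in path[:i - 1]]
    return [frozenset(c) | {unit_i} for c in combinations(prefixes, k - 1)]

# For the path <u11:32, u51:32, u22:32>, node n_3 (= u22) with count 32
# yields the 2-dimensional units {u11, u22} and {u51, u22}.
path = [("u11", 32), ("u51", 32), ("u22", 32)]
print(derived_units(path, 3, 2))
```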
Based on (2), for the node n_i, the generation of the dense units D^k_{n_i} for the cardinalities k in the range [k_i, k_max] requires each prefix node of n_i to provide the units of cardinalities in the range [k_i - 1, k_max - 1]. In addition, the range [k_j, k_max] is the subspace cardinality range in which a prefix node n_j of the node n_i is itself employed to generate dense units. We explore that these two cardinality ranges, [k_i - 1, k_max - 1] and [k_j, k_max], are highly overlapped. Thus, we can directly take the dense units generated by the prefix nodes of n_i to generate the dense units for n_i. We can examine the overlap between [k_i - 1, k_max - 1] and [k_j, k_max] by considering the relationship between k_i and k_j. Because k_j ≤ k_i, due to the fact that n_j.count ≥ n_i.count, the following two cases of k_i and k_j are taken into consideration.

Case 1: k_j < k_i. In this case, the cardinality range [k_j, k_max] wholly encompasses [k_i - 1, k_max - 1]; that is, the dense units generated by the prefix node n_j encompass the set of units that n_j should provide for generating the dense units for n_i in the range [k_i, k_max].

Case 2: k_j = k_i. Except for the cardinality k_i - 1, the range [k_j, k_max] encompasses [k_i - 1, k_max - 1]. Therefore, only the units D^{k_i - 1}_{n_j} need to be additionally generated for n_i to produce the k_i-dimensional dense units.

Based on the foregoing, the dense units for the nodes satisfying the thresholds can be efficiently discovered with a preorder traversal of the DFP-tree, where the dense units generated when traversing to a node n_i are buffered for later use by n_i's descendants to generate their related dense units. The procedure GenerateInherentDenseUnits devised to discover the inherent dense units is shown in Fig. 5.
4.2.2 Generation of Acquired Dense Units
In this discovering stage, for the nodes whose node counts do not exceed τ_k, we take the nodes carrying the same one-dimensional unit together into consideration in discovering the k-dimensional dense units. Let N^k_u denote the set of nodes which carry the one-dimensional unit u and all have node counts smaller than τ_k. Instead of directly applying the discovering process on the prefix paths of the nodes in N^k_u to discover the k-dimensional dense units, we first examine whether these paths have the possibility of generating k-dimensional dense units. The examination is performed by checking the maximal possible unit counts of the units generated from the prefix paths of the nodes in N^k_u, and the novel feature, the surplus count, in the DFP-tree is introduced for this examination.

Fig. 4. An example DFP-tree for illustrating the dense unit discovery.
Surplus count. For each header table entry, we introduce k_max - 1 count values, called the surplus counts, for the subspace cardinalities from 2 to k_max. Specifically, consider a header table entry with the one-dimensional unit u. Let SC^k_u (2 ≤ k ≤ k_max) denote the surplus count of unit u for subspace cardinality k. The value SC^k_u stores the summation of the node counts of the nodes in N^k_u. For example, consider the DFP-tree shown in Fig. 4, where k_max = 3, τ_2 = 30, and τ_3 = 5. Let us take the one-dimensional unit u_22 for illustration. There are four nodes carrying u_22 in the DFP-tree, with respective node counts 4, 32, 4, and 12. Therefore, SC^2_{u_22} is calculated as 20 (= 4 + 4 + 12).
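Under the same list-of-counts representation as before, the surplus counts, and the candidate list candList_u used in the procedure below, can be sketched as:

```python
def surplus_counts(node_counts, taus, k_max):
    """SC^k_u for k = 2..k_max: sum of the counts of u's nodes that fall
    below tau_k (the nodes in N^k_u)."""
    return {k: sum(c for c in node_counts if c < taus[k])
            for k in range(2, k_max + 1)}

def cand_list(sc, taus):
    """candList_u = {k | SC^k_u >= tau_k}: only these cardinalities can still
    yield acquired dense units for u."""
    return [k for k, v in sc.items() if v >= taus[k]]

# Fig. 4 example: u22's nodes have counts 4, 32, 4, 12; tau_2 = 30, tau_3 = 5.
sc = surplus_counts([4, 32, 4, 12], {2: 30, 3: 5}, 3)  # {2: 20, 3: 8}
print(cand_list(sc, {2: 30, 3: 5}))                    # [3]
```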
Procedure of discovering acquired dense units. Note that the surplus count SC^k_u is the maximal possible unit count of the units that can be generated from the prefix paths of the nodes in N^k_u, corresponding to the case when a unit can be derived from all these paths, so that this unit has a unit count equal to the summation of the node counts of the nodes in N^k_u, i.e., SC^k_u. Clearly, if SC^k_u < τ_k, no k-dimensional dense units can be discovered from the prefix paths of the nodes in N^k_u, so that we need not apply the discovery process on N^k_u to explore the k-dimensional dense units.

Therefore, we discover the acquired dense units by first examining the surplus counts against the density thresholds to accelerate the discovery process. The procedure GenerateAcquiredDenseUnits devised to discover the acquired dense units is shown in Fig. 6. In this procedure, each time we consider a header table entry with the one-dimensional unit u. Let P^k_u denote the set of prefix paths of the nodes in N^k_u. We set the buffer candList_u = {k | SC^k_u ≥ τ_k and 2 ≤ k ≤ k_max}. Thus, for u, only the subspace cardinalities k in candList_u are considered for taking the paths P^k_u to discover the k-dimensional dense units. However, for a one-dimensional unit u, we may expend a huge amount of time in discovering dense units of different cardinalities from different sets of paths. To overcome this problem, we explore Theorem 1 to discover these dense units in only one mining process.
Theorem 1. Consider a one-dimensional unit u in the header table. For any two subspace cardinalities k_1 and k_2, if k_1 < k_2, then P^{k_1}_u ⊇ P^{k_2}_u.

Proof. According to the definition of P^k_u, for a node which carries unit u, once its prefix path is inserted into P^{k_2}_u, due to the fact that its node count is smaller than τ_{k_2}, its prefix path will also be inserted into P^{k_1}_u for any subspace cardinality k_1 with k_1 < k_2 (due to τ_{k_1} ≥ τ_{k_2}). Based on this observation, if k_1 < k_2, P^{k_1}_u contains all the elements in P^{k_2}_u, i.e., P^{k_1}_u ⊇ P^{k_2}_u. Furthermore, in case k_1 < k_2, it is clear that if there exist no nodes whose node counts are in the range [τ_{k_2}, τ_{k_1}), P^{k_1}_u will be equal to P^{k_2}_u. Based on the foregoing, we have P^{k_1}_u ⊇ P^{k_2}_u for any two subspace cardinalities k_1 and k_2 with k_1 < k_2. □
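This subsumption is easy to check in code; a small sketch with prefix paths abstracted as tuples (our own representation):

```python
def prefix_path_sets(nodes, taus, cardinalities):
    """P^k_u for each k: prefix paths of u's nodes whose count < tau_k.
    `nodes` is a list of (prefix_path, count) pairs for the unit u."""
    return {k: {p for p, c in nodes if c < taus[k]} for k in cardinalities}

nodes = [(("u11", "u25", "u35"), 4), (("u11", "u51"), 32),
         (("u25", "u35"), 4), (("u35",), 12)]
P = prefix_path_sets(nodes, {2: 30, 3: 5}, [2, 3])
assert P[2] >= P[3]   # Theorem 1: tau_2 > tau_3 implies P^2_u superset of P^3_u
```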
Our procedure MineDenseUnits, devised for a one-dimensional unit u to discover the acquired dense units of the cardinalities in candList_u, is shown in Fig. 6. The main component of this procedure is the path removal technique, devised based on Theorem 1 to utilize one mining process to discover the dense units of different subspace cardinalities from different sets of paths.

Fig. 5. Procedures to discover the inherent dense units.
Fig. 6. Procedures to discover the acquired dense units.

Theorem 1 shows that, for the subspace cardinalities in candList_u, the paths P^k_u for lower subspace cardinalities k subsume the paths P^k_u for higher subspace cardinalities k. Based on this phenomenon, the technique of path removal is devised within the recursive mining scheme of [14]. Let cand_max denote the maximal subspace cardinality in candList_u. In this recursive mining scheme, we generate the candidate units of cardinalities in [1, cand_max] for discovering the dense units of the cardinalities in candList_u. More precisely, we generate the candidate units by starting from the units of subspace cardinality 1 and recursively extending by one one-dimensional unit at a time until reaching units of cardinality cand_max; in this generation process, when we come to candidate units of the cardinalities in candList_u, these candidate units are examined against the corresponding density thresholds for discovering the dense units. Furthermore, in this discovering process, we first utilize the paths P^k_u of lower subspace cardinalities k in generating lower dimensional candidate units (for discovering lower dimensional dense units). Then, as the candidate units are extended, the paths which do not belong to the paths P^k_u for the higher subspace cardinalities k are removed for discovering the higher dimensional dense units.
More precisely, in each recursive mining step, we have a current unit B, called the base unit, and a conditional DFP-tree related to the base unit B, denoted as Tree_B. (This recursive mining process for discovering the acquired dense units of a one-dimensional unit u starts by taking u as the base unit B and the DFP-tree constructed on the paths P^k_u with the smallest k in candList_u as the input.) For each header table entry in Tree_B, which carries a one-dimensional unit û, we derive a one-unit-extended unit, denoted as E, by concatenating û to B, that is, E = û ∪ B. We are then in the state of generating the candidate units of cardinality |E| (|E| denotes the cardinality of E). Next, we consider applying the recursive mining process on the DFP-tree constructed on û's conditional pattern base extracted from Tree_B, to recursively extend the unit E to discover higher dimensional dense units. (For a one-dimensional unit û, its conditional pattern base [14] extracted from a DFP-tree is the set of prefix paths of the nodes carrying û in that tree; it is a small database of the units co-occurring with û, from which we can discover the units associated with û.) However, before constructing the DFP-tree on û's conditional pattern base for the subsequent recursive mining, Theorem 2 is explored to show that if |E| ∈ candList_u, we must first perform the path removal on û's conditional pattern base for the correct discovery of the higher dimensional dense units.
Theorem 2. Let E denote the unit E = û ∪ B in each recursive mining step (in procedure MineDenseUnits shown in Fig. 6). If |E| ∈ candList_u, there are some paths in û's conditional pattern base which must first be removed for correctly discovering the higher dimensional dense units.

Proof. Two cases of |E| are examined. Let cand_next denote the smallest cardinality in candList_u which is larger than |E|.

Case 1: |E| ∈ candList_u. We are in the state of discovering the dense units of cardinality |E|, so that û's conditional pattern base contains the information of the paths in P^{|E|}_u. However, the goal of the subsequent recursive mining process on Tree_û, constructed on û's conditional pattern base, is to discover the dense units for the next cardinality cand_next. Since P^{|E|}_u ⊇ P^{cand_next}_u, we need to first perform the path removal on û's conditional pattern base to retain only the information of P^{cand_next}_u in Tree_û for further discovering the dense units of cardinality cand_next.

Case 2: |E| ∉ candList_u. We are in the middle of extending candidate units with the goal of discovering dense units of cardinality cand_next. Thus, û's conditional pattern base contains the information of the paths P^{cand_next}_u, and there is no need to perform the path removal on û's conditional pattern base. □
The technique of path removal. According to Theorem 2, in the situation |E| ∈ candList_u, our devised technique of path removal is utilized to process the paths in û's conditional pattern base for the subsequent dense unit discovery. Let cand_next denote the smallest cardinality in candList_u which is larger than |E|. We are in the state of discovering the dense units of cardinality |E|, and the purpose of the subsequent discovering process is to discover the dense units of cardinality cand_next. Since P^{|E|}_u ⊇ P^{cand_next}_u by Theorem 1, the purpose of path removal is to retain only the information of P^{cand_next}_u in û's conditional pattern base, which currently contains the information of P^{|E|}_u. Thus, the paths whose information needs to be removed are the paths which are in P^{|E|}_u but not in P^{cand_next}_u; let these paths be denoted as PathList.
To remove the information of PathList from û's conditional pattern base, we cannot simply delete the paths of PathList from û's conditional pattern base directly. This is because there may be some paths in PathList that do not exist in û's conditional pattern base, and the paths in PathList may not be in the form in which they would appear in û's conditional pattern base (since û's conditional pattern base is extracted from Tree_B, which is recursively constructed from the original DFP-tree). Therefore, we first apply a two-step reconstruction process on PathList to prepare the paths for removal from û's conditional pattern base. The two steps of the reconstruction process are step 1, path exclusion, and step 2, path reorganization, whose correctness is shown in Theorem 3.
Specifically, step 1, path exclusion, of the reconstruction process deletes the paths in PathList which cannot exist in û's conditional pattern base: we delete the paths in PathList which do not contain E - u, the unit obtained by removing the unit u from û ∪ B. Step 2, path reorganization, of the reconstruction process reorganizes the remaining paths in PathList into the form in which they would appear in û's conditional pattern base. For each path p in PathList, the path reorganization is performed by first deleting the nodes which are positioned below û in the header table of Tree_B, and then sorting the remaining nodes of p according to the unit order in the header table of Tree_B.
After applying the two-step reconstruction process on PathList, we directly delete the paths of PathList from û's conditional pattern base. Then, after performing the path removal on û's conditional pattern base, the mining process proceeds by taking E and the DFP-tree constructed on û's conditional pattern base to recursively extend the unit E to discover higher dimensional dense units.
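A sketch of the two reconstruction steps on a list-based stand-in for the conditional pattern base (our own representation; `order` is the unit order of Tree_B's header table):

```python
def path_exclusion(path_list, required_units):
    """Step 1: keep only paths that contain E - u (they are the only ones
    that can exist in u_hat's conditional pattern base)."""
    return [p for p in path_list if required_units <= set(p)]

def path_reorganization(path_list, order, u_hat):
    """Step 2: drop units ranked below u_hat in the header table of Tree_B,
    then sort the rest by the header-table unit order."""
    return [sorted((x for x in p if order[x] < order[u_hat]), key=order.get)
            for p in path_list]
```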
Theorem 3. In the path removal technique, the two steps of the path reconstruction process, i.e., path exclusion and path reorganization, correctly prepare the paths for performing the path removal.

Proof. In step 1, path exclusion, the paths in PathList cannot exist in the current û's conditional pattern base if they do not contain the set of one-dimensional units in E, i.e., û ∪ B. Because û's conditional pattern base is constructed by extracting the prefix paths of û's nodes in the DFP-tree Tree_B, the paths in PathList which do not contain û cannot exist in û's conditional pattern base. Furthermore, because the paths in û's conditional pattern base are subpaths of Tree_B, the paths in PathList can be in û's conditional pattern base only if they can be paths of Tree_B. The paths in PathList can be in Tree_B only if they contain the one-dimensional units in B, since the DFP-tree Tree_B is constructed from the original DFP-tree by recursively conditioning on the set of one-dimensional units in B. Therefore, the paths in PathList cannot be in û's conditional pattern base if they do not contain E. Based on the foregoing, we delete the paths in PathList which do not contain E - u, the unit obtained by removing the unit u from û ∪ B. The reason for excluding the unit u from û ∪ B when considering deleting the paths is that the paths in PathList are the prefix paths of the nodes carrying u in the original DFP-tree, so that the unit u does not appear in the paths in PathList.

Step 2, path reorganization, of the reconstruction process reorganizes the remaining paths in PathList into the form in which they would appear in û's conditional pattern base. Note that û's conditional pattern base is the set of prefix paths of û's nodes in Tree_B. Thus, the paths in û's conditional pattern base contain only the one-dimensional units whose positions in the header table of Tree_B are above û, and their carried one-dimensional units are sorted according to the unit order in the header table of Tree_B. Based on this fact, for each path p in PathList, the path reorganization is performed by first deleting the nodes which are positioned below û in the header table of Tree_B, and then sorting the remaining nodes of p according to the unit order in the header table of Tree_B. □
5 EXPERIMENTAL EVALUATION
All the experiments were conducted on a Windows XP Professional platform with 2 GB of RAM and a 1.7 GHz Pentium 4 CPU. Section 5.1 describes the data sets utilized in our experimental studies. The evaluation of the accuracy of DENCOS is presented in Section 5.2, where the analysis of parameter settings is given in Section 5.2.4. The experimental results evaluating the efficiency of DENCOS are presented in Section 5.3.
5.1 Data Set
5.1.1 Synthetic Data
Several synthetic data sets, shown in Table 1, are used to assess the qualitative performance of DENCOS. In these data sets, 10 percent of the data points are random noise. These synthetic data sets are generated by using the data generation method utilized in CLIQUE [4]. In a data set, the clusters are generated by specifying the following terms: 1) the dimensions of the subspace in which the cluster is embedded, and 2) for each attribute a_i of that subspace, the range [a_i.start, a_i.end] of a_i in which the cluster is embedded. Then, we generate the data set such that the average densities of the data points inside the clusters are much larger than those of their surrounding regions. The data points assigned to a cluster are generated with a uniform distribution: for a data point p assigned to a cluster, its attribute values are assigned as follows. For each attribute a_i of the subspace in which the cluster is embedded, the value of attribute a_i of p is drawn randomly from the range [a_i.start, a_i.end]; for the remaining attributes, the value is drawn randomly from the entire range of the attribute.
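A minimal sketch of this generation procedure (our reconstruction under simplifying assumptions: attribute domains normalized to [0, 1] and round-robin assignment of points to clusters; numpy is assumed available):

import numpy as np

def generate_dataset(n_points, n_dims, clusters, noise_frac=0.10, seed=0):
    # clusters: one dict per embedded cluster, mapping each attribute index
    # of the cluster's subspace to the range (start, end) it occupies.
    rng = np.random.default_rng(seed)
    n_clustered = n_points - int(n_points * noise_frac)
    # Noise points (and non-subspace attributes) span the full [0, 1] range.
    data = rng.uniform(0.0, 1.0, size=(n_points, n_dims))
    for i in range(n_clustered):
        cluster = clusters[i % len(clusters)]        # round-robin assignment
        for attr, (start, end) in cluster.items():
            data[i, attr] = rng.uniform(start, end)  # confine subspace attrs
    return data

# e.g., a cluster in subspace {0, 1, 2} and another in subspace {1, 3}
clusters = [{0: (0.2, 0.3), 1: (0.5, 0.6), 2: (0.1, 0.2)},
            {1: (0.7, 0.8), 3: (0.4, 0.5)}]
data = generate_dataset(10000, 10, clusters)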
5.1.2 Real Data
Two real data sets from the UCI Machine Learning Repository [26] are also utilized to evaluate the clustering results of DENCOS. These real data sets are the adult database and the thyroid disease database. The adult database contains 32,561 census records with six numerical attributes, and the
thyroid disease database contains 18,152 thyroid diagnoses
with six numerical attributes.
5.2 Algorithm Accuracy
5.2.1 Algorithm Accuracy on Synthetic Data
In this section, we utilize the synthetic data sets shown in Table 1 to compare the clustering results of DENCOS with those of CLIQUE and SUBCLU. To evaluate the clustering results, we take the dense regions generated by the data generator as the known clusters, and evaluate the quality of these known clusters discovered by each clustering algorithm using two metrics, precision and recall. For a discovered cluster, precision is defined as the percentage of the data points in this cluster that really belong to the known cluster. Recall is defined as the percentage of the data points in a known cluster that are identified in this cluster.
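For clarity, the two metrics can be computed from point-id sets as in the following sketch (ours, under the assumption that clusters are represented as sets of data point ids):

def precision_recall(discovered, known):
    # Precision: fraction of the discovered cluster's points that really
    # belong to the known cluster; recall: fraction of the known cluster's
    # points identified by the discovered cluster.
    overlap = len(discovered & known)
    precision = overlap / len(discovered) if discovered else 0.0
    recall = overlap / len(known) if known else 0.0
    return precision, recall

# e.g.
print(precision_recall({1, 2, 3, 4}, {2, 3, 4, 5, 6}))  # (0.75, 0.6)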
For DENCOS, we use δ = 6 to partition each dimension into six intervals. For each data set, a number of α values are used to find the best clustering result, and the average value of α for the experimental results shown is 15. We set the maximal subspace cardinality k_max according to the value of α used, such that the density threshold τ_{k_max} for the maximal subspace cardinality is higher than one. On the other hand, for CLIQUE, we also use δ = 6 to partition the dimensions. In all data sets, CLIQUE and SUBCLU are studied with a broad range of parameter settings and the best results are taken for fair comparison.

TABLE 1
The Synthetic Data Sets
Data sets DS01–DS03. We execute DENCOS, CLIQUE, and SUBCLU on these three data sets, and the results reveal that all of them can accurately discover the clusters with both precision and recall close to unity.
Data set DS04. In data set DS04, the two three-dimensional clusters are generated in the same subspace. In Table 2a, we show five experimental results conducted on DS04. Specifically, CLIQUE(3) and CLIQUE(5) are the results of executing CLIQUE with different parameter settings: CLIQUE(3) shows the best result in discovering the three-dimensional clusters, and CLIQUE(5) shows the best one in discovering the five-dimensional cluster. Similar notations are applied for SUBCLU, i.e., SUBCLU(3) and SUBCLU(5). As can be seen in Table 2a, DENCOS can discover all clusters with both high precision and recall, whereas CLIQUE and SUBCLU have difficulties in simultaneously discovering the three clusters with high quality.
In the result of CLIQUE(5), we found that the two three-dimensional clusters in the same subspace are merged into one, which is reflected in the low recall. A possible reason is that the low threshold for discovering the five-dimensional cluster leads the low-density units between the two three-dimensional clusters to be identified as dense ones, such that the two three-dimensional clusters are combined into one by these low-density units. On the other hand, in CLIQUE(3), the threshold which is set for discovering the three-dimensional clusters with high quality may be too high as compared to the low density of the five-dimensional cluster, resulting in poor quality of the five-dimensional cluster.
Similar results are also derived in SUBCLU(3) and SUBCLU(5). In SUBCLU, clusters are discovered by identifying the core objects, where a point is a core object if the number of data points in its ε-neighborhood is larger than m. The same thresholds ε and m are imposed to define core objects in all subspace cardinalities. As shown in SUBCLU(5), the two thresholds ε and m are relaxed to identify core objects in higher subspaces because data are more sparsely populated in higher subspaces. However, the relaxed thresholds would make some data points between the two clusters also be identified as core objects, and these core objects will be linked together with the core objects in the two three-dimensional clusters, resulting in only one cluster being identified.
Data sets DS05 and DS06. In data sets DS05 and DS06, we test the effect of increasing the number of subspace clusters. The comparative results on data sets DS05 and DS06 are shown in Tables 2b and 3, respectively. As can be seen, when the clusters are embedded in very different subspace cardinalities, it is more difficult for CLIQUE and SUBCLU to discover all clusters with high quality. In these data sets, DENCOS achieves higher quality for clusters in different subspace cardinalities than CLIQUE and SUBCLU, thus demonstrating the applicability of our proposed subspace clustering model.
5.2.2 Algorithm Accuracy on Real Data
Two real data sets, the adult and the thyroid disease data sets described in Section 5.1.2, are also utilized to compare our clustering quality with that of CLIQUE and SUBCLU. The quality of a clustering result is evaluated in terms of the density ratio, abbreviated as DR. For a subspace S', let Density_cluster(S') be the average region density of the regions inside the clusters of S', and Density_noncluster(S') be the average region density of the regions outside the clusters of S'. Thus, the density ratio of this subspace, denoted as DR(S'), is defined as

    DR(S') = Density_cluster(S') / Density_noncluster(S').    (3)
Since the algorithms DENCOS, CLIQUE, and SUBCLU all aim at extracting high-density regions in a subspace as clusters, a higher DR(S') value in a subspace S' means that the regions of higher densities can be better separated from the regions of lower densities, and thus indicates a better-quality result.

TABLE 2
Experimental Results for (a) the Data Set DS04 and (b) the Data Set DS05

TABLE 3
Experimental Results for the Data Set DS06

Furthermore, for DENCOS and CLIQUE,
Density_cluster(S') is calculated as the average unit count of the units in the clusters of S', and Density_noncluster(S') is the average unit count of the units which are not contained in the clusters of S'. For SUBCLU, to determine the regions inside the clusters, we also partition the space into units, and we utilize a looser condition to identify the units in clusters. In a subspace S', a unit is determined to be in a cluster if the percentage of this unit's enclosed data points that are included in the clusters of SUBCLU exceeds 80 percent, and we use the total number of data points contained in this unit in the calculation of Density_cluster(S').
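As a concrete illustration of how (3) is computed for the grid-based results, the following sketch (ours; the dictionary-based unit representation is an assumption, not the authors' data structure) averages unit counts inside and outside the clusters of a subspace S':

def density_ratio(unit_counts, cluster_units):
    # unit_counts: maps each populated unit of subspace S' to its unit count.
    # cluster_units: the set of units belonging to the clusters of S'.
    # Assumes both groups are nonempty.
    inside = [c for u, c in unit_counts.items() if u in cluster_units]
    outside = [c for u, c in unit_counts.items() if u not in cluster_units]
    avg_inside = sum(inside) / len(inside)       # Density_cluster(S')
    avg_outside = sum(outside) / len(outside)    # Density_noncluster(S')
    return avg_inside / avg_outside              # DR(S') as in (3)

# e.g., units keyed by their interval coordinates in S'
counts = {(0, 1): 40, (0, 2): 35, (3, 4): 2, (5, 1): 3}
print(density_ratio(counts, {(0, 1), (0, 2)}))  # 37.5 / 2.5 = 15.0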
Fig. 7 shows the quality of the clustering results of DENCOS compared to CLIQUE and SUBCLU on the adult and thyroid disease data sets. When comparing the clustering results, we show the DR gain for each subspace cardinality k. For a subspace cardinality k, the DR gain is the ratio of the average density ratio of the subspaces of cardinality k calculated on the clustering result of DENCOS to the one calculated on the compared clustering result. In these two data sets, we run DENCOS by setting δ = 5, α = 2, and k_max = 6, and DENCOS can discover the clusters in subspaces of cardinalities from 1 to 6. As for CLIQUE and SUBCLU, for each data set we run with a wide range of parameter settings, and two clustering results are compared as shown in Fig. 7.
As can be seen from Fig. 7, DENCOS attains high DR gains on these two data sets. For each data set, CLIQUE and SUBCLU cannot achieve high-quality clusters in all subspaces. More specifically, in CLIQUE and SUBCLU, a strict threshold setting can derive good-quality clusters in lower subspaces, but clusters in higher subspaces cannot be discovered. On the other hand, a looser threshold setting can discover clusters in higher subspaces but will decrease the quality of the clusters in lower subspaces. For example, in Fig. 7a, if τ = 0.4 is used in CLIQUE, the five-dimensional and six-dimensional clusters cannot be discovered. However, if the threshold τ is lowered to 0.15, the five-dimensional clusters can be discovered but the clusters in subspace cardinalities from 1 to 4 have decreased quality. The decreased quality of clusters in lower subspaces can be seen from the increased DR gain for each subspace cardinality k in [1, 4] when the threshold τ is changed from 0.4 to 0.15. For example, when τ is changed from 0.4 to 0.15, the DR gains for k = 1, k = 2, k = 3, and k = 4 respectively change from 1 to 1.17, from 1.59 to 1.84, from 1.43 to 1.79, and from 1.15 to 1.7. Similar results can also be observed for SUBCLU. Therefore, as shown in Fig. 7, DENCOS attains high DR gains and thus produces better-quality clustering results than CLIQUE and SUBCLU.
5.2.3 Algorithm Performance on High-Dimensional Real Data
In this section, we further utilize two real data sets with much higher data dimensionalities to assess the effectiveness of DENCOS on higher dimensional data sets. These real data sets are 1) the Corel Image Features data set in the UCI KDD Archive [15] and 2) the Letter Recognition data set in the UCI Machine Learning Repository [26]. The Corel Image Features data set contains image features (co-occurrence texture) extracted from a Corel image collection, and the Letter Recognition data set contains numerical attributes (statistical moments and edge counts) extracted from stimulus images of English alphabets. Both of these real data sets have 16 data attributes.
The experimental results of DENCOS on these two real data sets are shown in Table 4, where δ is set to 5. The setting of α is based on the parameter setting method described in Section 5.2.4. Thus, for the Corel Image Features data set, α is set to 2, and for the Letter Recognition data set, α is set to 3. In Table 4, for each subspace cardinality k, we show the average density ratio (defined in (3)) of the subspaces of cardinality k. As shown in Table 4, DENCOS can discover the clusters in lower subspaces even though these two real data sets are of high data dimensionality. In Table 4, for these data sets, the average density ratio for each cardinality k is very high; that is, the discovered clusters have much higher densities than their surrounding regions. This reveals that the clusters discovered by DENCOS are real regions of high density in the subspaces, thus showing the effectiveness of DENCOS in discovering the subspace clusters for real data sets of high data dimensionality.
Fig. 7. Accuracy of DENCOS (δ = 5) on the adult and thyroid disease data sets. (a) Compared with CLIQUE on the adult data set, (b) compared with CLIQUE on the thyroid disease data set, (c) compared with SUBCLU on the adult data set, and (d) compared with SUBCLU on the thyroid disease data set.

TABLE 4
Performance of DENCOS on Two Real Data Sets, Where the Average Density Ratio (DR) for Each Subspace Cardinality k Is Shown
5.2.4 Analysis on Parameter Setting
In DENCOS, we only need to set δ, α, and k_max to discover the density-conscious subspace clusters. The unit strength factor α is a relative value on unit counts in a subspace, and is independent of the subspace cardinalities; that is, users only need to specify one α value to discover the clusters in all subspaces. Thus, it is much easier for users to specify α to discover clusters in all subspaces, since they need not consider the varying region densities in different subspace cardinalities. Furthermore, we can automatically recommend a suitable α for a given δ by the following approach. We first run DENCOS by setting α to ⌊δ/2⌋, and iteratively increase or decrease α if the quality of the clustering result can be improved. The idea of first setting α to a value smaller than δ is to ensure that the one-dimensional clusters, if they exist, can be discovered. In addition, the method of deciding whether to increase or decrease a current α' value is as follows. For each side, i.e., increasing α' or decreasing α', we calculate the number of net quality-increased subspace cardinalities. A subspace cardinality k is called quality-increased (quality-decreased) if the increased (decreased) percentage of the average density ratio of the subspaces of cardinality k exceeds 10 percent. Then, the number of net quality-increased subspace cardinalities is the number of quality-increased subspace cardinalities minus the number of quality-decreased subspace cardinalities. Therefore, we choose the side where this net value is positive, and if both sides have positive net values, we choose the side where the net value is larger. With this approach, the α values in the experiments conducted in this section are set up automatically.
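To make the tuning procedure concrete, the following sketch (our illustration; run_dencos is a hypothetical driver returning the average density ratio for each subspace cardinality, and the unit step size and identical key set across runs are assumptions) implements the increase/decrease rule described above:

def net_quality_increase(dr_old, dr_new, thresh=0.10):
    # A cardinality k is quality-increased (quality-decreased) if its average
    # DR rises (drops) by more than 10 percent; return increased - decreased.
    inc = sum(1 for k in dr_old if dr_new[k] > dr_old[k] * (1 + thresh))
    dec = sum(1 for k in dr_old if dr_new[k] < dr_old[k] * (1 - thresh))
    return inc - dec

def recommend_alpha(delta, run_dencos, max_steps=10):
    # run_dencos(alpha) -> {k: average DR of subspaces of cardinality k}
    alpha = max(1, delta // 2)   # start below delta so that one-dimensional
    dr = run_dencos(alpha)       # clusters, if they exist, can be discovered
    for _ in range(max_steps):
        up = run_dencos(alpha + 1)
        net_up = net_quality_increase(dr, up)
        if alpha > 1:
            down = run_dencos(alpha - 1)
            net_down = net_quality_increase(dr, down)
        else:
            down, net_down = None, float("-inf")
        if net_up <= 0 and net_down <= 0:
            break                          # neither side improves quality
        if net_up >= net_down:             # prefer the larger net gain
            alpha, dr = alpha + 1, up
        else:
            alpha, dr = alpha - 1, down
    return alpha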
Note that k_max is introduced in this paper to ensure that the density threshold τ_{k_max} > 1. The reason is that using a density threshold smaller than or equal to one to identify dense units is meaningless, because in this circumstance all the units containing at least one data point would be identified as dense units, resulting in meaningless clustering results. Therefore, with the given δ and α, k_max can be set up automatically to ensure that τ_{k_max} > 1.
5.3 Algorithm Efficiency
5.3.1 Algorithm Efficiency on Synthetic Data
In this section, we utilize the synthetic data sets in Table 1 to evaluate the efficiency of DENCOS. For a fair comparison, CLIQUE is extended to be able to discover the dense units defined by our settings of thresholds. First, we run CLIQUE once by setting its single threshold to the value of τ_{k_max}, so that the units in all subspaces whose unit counts exceed τ_{k_max} are identified. After that, for these units, a filtering procedure is run through all subspaces to remove the k-dimensional units whose unit counts do not satisfy τ_k.
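A sketch of this two-phase extension (our illustration; the flat dictionary representation of units is an assumption):

def filter_dense_units(units, tau):
    # units: {(k, unit_id): unit_count} for the units reported by the single
    # run of CLIQUE with threshold tau[k_max] (so every count > tau[k_max]).
    # tau: {k: tau_k}. Keep a k-dimensional unit only if it satisfies tau_k.
    return {(k, u): c for (k, u), c in units.items() if c > tau[k]}

# e.g., per-cardinality thresholds tau_1 > tau_2 > tau_3
units = {(1, "a"): 500, (2, "b"): 60, (3, "c"): 20}
print(filter_dense_units(units, {1: 400.0, 2: 80.0, 3: 16.0}))
# {(1, 'a'): 500, (3, 'c'): 20}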
On the other hand, since SUBCLU cannot be extended here to discover the clusters under our proposed thresholds, because it does not adopt the grid structure to discover the clusters, we run SUBCLU with parameters such that the clusters in the highest subspace cardinality can be discovered with high quality.
Data set size. We vary the data size of the data sets DS03, DS04, and DS05 in Table 1 to assess the scalability against the data set size. On these three data sets, the performance of DENCOS compared with CLIQUE and SUBCLU is shown in Figs. 8a, 8b, and 8c and Figs. 8d, 8e, and 8f, respectively. As can be seen in Fig. 8, when the data set size increases, the execution time of DENCOS does not increase very much. It is noted that the clusters embedded in a synthetic data set are not changed when we vary the data set size. Thus, for different data set sizes, the structures of the constructed DFP-trees do not change much, so that the execution time of mining the dense units from the DFP-tree does not increase very much. In addition, this behavior does not depend on the data distribution. Our experimental results shown in Fig. 8 reveal that for each synthetic data set (DS03, DS04, and DS05), the execution time of DENCOS does not change much when the data set size increases.
Fig. 8. The scalability against the data set size on data sets DS03, DS04, and DS05. (a) DS03, (b) DS04, (c) DS05, (d) DS03, (e) DS04, and (f) DS05.

As can be seen in Fig. 8, for the different data sets, DENCOS has better performance than the extended CLIQUE and SUBCLU. In CLIQUE, the Apriori-like candidate generate-and-test scheme is utilized to search for the dense units. Thus, setting the density threshold to τ_{k_max} in order to discover the units satisfying our proposed thresholds would result in an explosion of candidate units in the lower subspaces, because τ_{k_max} is too low as compared with the densities of the lower dimensional units, thus resulting in larger execution times. In addition, for SUBCLU, exponential execution time may be incurred by the range queries executed for each data point to calculate the number of data points within its distance ε when checking whether it is a core object. The large number of range queries inevitably incurs a large execution time in SUBCLU.
Fig. 9. The scalability against the data dimensionality on data sets (a) DS04 and (b) DS05.

Dimensionality of the data set. Fig. 9 shows the log-log relationship between the execution time and the data dimensionality on the data sets DS04 and DS05 in Table 1. We vary the data dimensionality from 10 to 20. All three algorithms, DENCOS, CLIQUE, and SUBCLU, exhibit quadratic behavior with respect to the data dimensionality. Note that for a data set of dimensionality d, the number of subspaces of cardinality k is O(d^k), illustrating the intrinsic complexity of the subspace clustering problem. As can be seen in Fig. 9, DENCOS performs better than CLIQUE and SUBCLU.
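For concreteness, the subspace count behind this observation follows from a standard binomial bound (our added derivation, not a formula from the paper):

\[
\#\{\text{subspaces of cardinality } k\} = \binom{d}{k} \le \frac{d^k}{k!} = O(d^k) \quad \text{for fixed } k,
\]

so, for example, a data set with d = 20 and k = 3 already has \(\binom{20}{3} = 1140\) three-dimensional subspaces to examine.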
5.3.2 Algorithm Efficiency on Real Data
In addition to utilizing synthetic data sets to evaluate the scalability of DENCOS in Section 5.3.1, we also use the real data sets described in Section 5.1.2 to assess the efficiency of DENCOS. For comparison, the extensions of the algorithms CLIQUE and SUBCLU described in Section 5.3.1 are also evaluated. Table 5 shows the execution times of the algorithms DENCOS, CLIQUE, and SUBCLU on the two real data sets. As shown in Table 5, DENCOS has much better time efficiency than CLIQUE and SUBCLU.

TABLE 5
Efficiency of DENCOS on Two Real Data Sets
6 CONCLUSIONS
In this paper, we devised a novel subspace clustering model to discover the subspace clusters. We note that previous works fail to consider a critical problem, called the density divergence problem, in discovering the clusters, in that they utilize an absolute density value as the density threshold to identify the dense regions in all subspaces. Therefore, as shown in the conducted experimental results, previous works have difficulties in achieving high-quality clusters in all subspaces. In view of the density divergence problem, we identify the dense regions (clusters) in a subspace by discovering the regions which have relatively high densities as compared to the average region density in the subspace. Therefore, in our model, different density thresholds are utilized to discover the clusters in different subspace cardinalities. For this novel clustering model, an innovative algorithm, called DENCOS (DENsity COnscious Subspace clustering), is also devised. As shown by our experimental results, DENCOS can discover the clusters in all subspaces with high quality, and the efficiency of DENCOS significantly outperforms previous works, thus demonstrating its practicability for subspace clustering.
ACKNOWLEDGMENTS
The work was supported in part by the National Science Council of Taiwan, ROC, under Contract NSC95-2752-E-002-006-PAE. In addition, the authors would like to thank K. Kailing, H.-P. Kriegel, and P. Kröger for offering the code of CLIQUE and SUBCLU, variations of which are used in our experimental studies.
REFERENCES
[1] C.C. Aggarwal, A. Hinneburg, and D. Keim, "On the Surprising Behavior of Distance Metrics in High Dimensional Space," Proc. Eighth Int'l Conf. Database Theory (ICDT), 2001.
[2] C.C. Aggarwal and C. Procopiuc, "Fast Algorithms for Projected Clustering," Proc. ACM SIGMOD Int'l Conf. Management of Data, 1999.
[3] C.C. Aggarwal and P.S. Yu, "The IGrid Index: Reversing the Dimensionality Curse for Similarity Indexing in High Dimensional Space," Proc. Sixth ACM SIGKDD Int'l Conf. Knowledge Discovery and Data Mining, 2000.
[4] R. Agrawal, J. Gehrke, D. Gunopulos, and P. Raghavan, "Automatic Subspace Clustering of High Dimensional Data for Data Mining Applications," Proc. ACM SIGMOD Int'l Conf. Management of Data, 1998.
[5] I. Assent, R. Krieger, E. Müller, and T. Seidl, "DUSC: Dimensionality Unbiased Subspace Clustering," Proc. IEEE Int'l Conf. Data Mining (ICDM), 2007.
[6] K. Beyer, J. Goldstein, R. Ramakrishnan, and U. Shaft, "When Is Nearest Neighbors Meaningful?" Proc. Seventh Int'l Conf. Database Theory (ICDT), 1999.
[7] A. Blum and P. Langley, "Selection of Relevant Features and Examples in Machine Learning," Artificial Intelligence, vol. 97, pp. 245-271, 1997.
[8] M.-S. Chen, J. Han, and P.S. Yu, "Data Mining: An Overview from a Database Perspective," IEEE Trans. Knowledge and Data Eng., vol. 8, no. 6, pp. 866-883, Dec. 1996.
[9] C.H. Cheng, A.W. Fu, and Y. Zhang, "Entropy-Based Subspace Clustering for Mining Numerical Data," Proc. Fifth ACM SIGKDD Int'l Conf. Knowledge Discovery and Data Mining, 1999.
[10] M. Ester, H.-P. Kriegel, J. Sander, and X. Xu, "A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise," Proc. Second Int'l Conf. Knowledge Discovery and Data Mining (SIGKDD), 1996.
[11] H. Fang, C. Zhai, L. Liu, and J. Yang, "Subspace Clustering for Microarray Data Analysis: Multiple Criteria and Significance," Proc. IEEE Computational Systems Bioinformatics Conf., 2004.
[12] U.M. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy, Advances in Knowledge Discovery and Data Mining. MIT Press, 1996.
[13] J. Han and M. Kamber, Data Mining: Concepts and Techniques. Morgan Kaufmann, 2000.
[14] J. Han, J. Pei, and Y. Yin, "Mining Frequent Patterns without Candidate Generation," Proc. 2000 ACM SIGMOD Int'l Conf. Management of Data, 2000.
[15] S. Hettich and S. Bay, "The UCI KDD Archive," http://kdd.ics.uci.edu, 1999.
[16] A. Hinneburg, C.C. Aggarwal, and D. Keim, "What Is the Nearest Neighbor in High Dimensional Spaces?" Proc. 26th Int'l Conf. Very Large Data Bases (VLDB), 2000.
[17] K. Kailing, H.-P. Kriegel, and P. Kröger, "Density-Connected Subspace Clustering for High-Dimensional Data," Proc. Fourth SIAM Int'l Conf. Data Mining (SDM), 2004.
[18] Y.B. Kim, J.H. Oh, and J. Gao, "Emerging Pattern Based Subspace Clustering of Microarray Gene Expression Data Using Mixture Models," Proc. Int'l Conf. Bioinformatics and Its Applications (ICBA), 2004.
[19] H.-P. Kriegel, P. Kröger, M. Renz, and S. Wurst, "A Generic Framework for Efficient Subspace Clustering of High-Dimensional Data," Proc. Fifth IEEE Int'l Conf. Data Mining (ICDM), 2005.
[20] H. Liu and H. Motoda, Feature Selection for Knowledge Discovery and Data Mining. Kluwer Academic Publishers, 1998.
[21] L. Lu and R. Vidal, "Combined Central and Subspace Clustering for Computer Vision Applications," Proc. 23rd Int'l Conf. Machine Learning (ICML), 2006.
[22] G. Moise, J. Sander, and M. Ester, "P3C: A Robust Projected Clustering Algorithm," Proc. Sixth IEEE Int'l Conf. Data Mining (ICDM), 2006.
[23] H.S. Nagesh, S. Goil, and A. Choudhary, "Adaptive Grids for Clustering Massive Data Sets," Proc. First SIAM Int'l Conf. Data Mining (SDM), 2001.
[24] L. Parsons, E. Haque, and H. Liu, "Subspace Clustering for High Dimensional Data: A Review," ACM SIGKDD Explorations Newsletter, vol. 6, pp. 90-105, 2004.
[25] C.M. Procopiuc, M. Jones, P.K. Agarwal, and T.M. Murali, "A Monte Carlo Algorithm for Fast Projective Clustering," Proc. 2002 ACM SIGMOD Int'l Conf. Management of Data, 2002.
[26] UCI Repository of Machine Learning Databases, http://mlearn.ics.uci.edu/MLRepository.html, 1998.
[27] K.-G. Woo, J.-H. Lee, M.-H. Kim, and Y.-J. Lee, "FINDIT: A Fast and Intelligent Subspace Clustering Algorithm Using Dimension Voting," Information and Software Technology, vol. 46, pp. 255-271, 2004.
[28] M.L. Yiu and N. Mamoulis, "Frequent-Pattern Based Iterative Projected Clustering," Proc. Third IEEE Int'l Conf. Data Mining (ICDM), 2003.
Yi-Hong Chu received the BS degree in elec-
trical engineering from National Taiwan Univer-
sity, Taipei, Taiwan, in 2002. She is currently a
PhD candidate in the Electrical Engineering
Department, National Taiwan University, Taipei,
Taiwan. Her research interests include data
mining, data clustering, and databases.
Jen-Wei Huang received the BS and PhD degrees in electrical engineering from National Taiwan University, Taipei, Taiwan, in 2002 and 2009, respectively, and is currently an assistant professor in the Computer Science Department at Yuan Ze University, Taiwan. He did research at the IBM Almaden Research Center from 2008 to 2009. He majors in computer science and specializes in data mining. His research interests include data mining, mobile computing, and bioinformatics. Among these, Web mining, incremental mining, mining data streams, time series issues, sequential pattern mining, and multimedia data mining are his special interests. In addition, some of his research is on mining general temporal association rules, sequential clustering, data broadcasting, and progressive sequential pattern mining.
Kun-Ta Chuang received the BS degree from
National Taiwan Normal University, Taipei,
Taiwan, R.O.C., in 2000, and the PhD degree
in communication engineering from National
Taiwan University, Taipei, Taiwan, R.O.C., in
2006. He is currently a software engineer at SYNOPSYS Inc., developing physical verification tools. His research interests include
data mining, mobile data management, and
electronic design automation.
De-Nian Yang received the BS and PhD
degrees from the Department of Electrical
Engineering, National Taiwan University, Tai-
pei, Taiwan, in 1999 and 2004, respectively.
He is currently an assistant research fellow in the Institute of Information Science, Academia
Sinica, Taiwan. He is a member of the IEEE.
His research interests include data manage-
ment and networking.
Ming-Syan Chen received the BS degree in
electrical engineering from National Taiwan
University, Taipei, Taiwan, and the MS and
PhD degrees in computer, information and
control engineering from The University of
Michigan, Ann Arbor, in 1985 and 1988, respec-
tively. He is now a distinguished research fellow
and the director of Research Center of Informa-
tion Technology Innovation (CITI) in the Acade-
mia Sinica, Taiwan, and is also a distinguished
professor jointly appointed by EE Department, CSIE Department, and
Graduate Institute of Communication Eng. (GICE) at National Taiwan
University. He was a research staff member at IBM Thomas J. Watson
Research Center, Yorktown Heights, New York, from 1988 to 1996, the
director of GICE from 2003 to 2006, and also the president/CEO of
Institute for Information Industry (III), which is one of the largest
organizations for information technology in Taiwan, from 2007 to 2008.
His research interests include databases, data mining, mobile comput-
ing systems, and multimedia networking, and he has published more
than 270 papers in his research areas. In addition to serving as program
chairs/vice-chairs and keynote/tutorial speakers in many international
conferences, he was an associate editor of the IEEE Transactions on
Knowledge and Data Engineering and also the Journal of Information
Systems Education, is currently on the editorial boards of the Very Large
Database (VLDB) Journal, the Knowledge and Information Systems
(KAIS) Journal, and the International Journal of Electrical Engineering
(IJEE), and was a distinguished visitor of the IEEE Computer Society for Asia-
Pacific from 1998 to 2000, and also from 2005 to 2007. He holds, or has
applied for, 18 US patents and seven ROC patents in his research
areas. He is a recipient of the Academic Award, the NSC (National
Science Council) Distinguished Research Award, the Pan Wen Yuan
Distinguished Research Award, the Teco Award, the Honorary Medal of
Information, and the K.-T. Li Research Breakthrough Award for his
research work, and also the Outstanding Innovation Award from IBM
Corporate for his contribution to a major database product. He also
received numerous awards for his research, teaching, inventions, and
patent applications. He is a fellow of the ACM and the IEEE.