Abstract—Clustering of big data has received much attention recently. In this paper, we present a new clusiVAT algorithm and compare it with four other popular data clustering algorithms. Three of the four comparison methods are based on the well known, classical batch k-means model. Specifically, we use k-means, single pass k-means, online k-means, and clustering using representatives (CURE) for numerical comparisons. clusiVAT is based on sampling the data, imaging the reordered distance matrix to estimate the number of clusters in the data visually, clustering the samples using a relative of single linkage (SL), and then noniteratively extending the labels to the rest of the dataset using the nearest prototype rule. Previous work has established that clusiVAT produces true SL clusters in compact-separated data. We have performed experiments to show that k-means and its modified algorithms suffer from initialization issues that cause many failures. On the other hand, clusiVAT needs no initialization, and almost always finds partitions that accurately match ground truth labels in labeled data. CURE also finds SL type partitions but is much slower than the other four algorithms. In our experiments, clusiVAT proves to be the fastest and most accurate of the five algorithms; e.g., it recovers 97% of the ground truth labels in the real world KDD-99 cup data (4 292 637 samples in 41 dimensions) in 76 s.

Index Terms—Big data cluster analysis, cluster tendency assessment, data analytics, Internet of things, single linkage.

Manuscript received April 29, 2014; revised January 21, 2015 and June 9, 2015; accepted August 29, 2015. This work was supported by the Australian Research Council (ARC) Research Network on Intelligent Sensors, Sensor Networks and Information Processing under REDUCE Project Grant EP/I000232/1 through the Digital Economy Programme run by Research Councils U.K.—a cross council initiative led by EPSRC and contributed to by Arts and Humanities Research Council, Economic and Social Research Council, and Medical Research Council; and ARC under Grant LP120100529 and Grant LF120100129. This paper was recommended by Associate Editor F. Karray.

D. Kumar, M. Palaniswami, and S. Rajasegarar are with the Department of Electrical and Electronic Engineering, University of Melbourne, Melbourne, VIC 3010, Australia (e-mail: dheerajk@student.unimelb.edu.au; palani@unimelb.edu.au; sraja@unimelb.edu.au).

J. C. Bezdek and C. Leckie are with the Department of Computing and Information Systems, University of Melbourne, Melbourne, VIC 3010, Australia (e-mail: jcbezdek@gmail.com; caleckie@unimelb.edu.au).

T. C. Havens is with the Department of Electrical and Computer Engineering, Michigan Technological University, Houghton, MI 49931, USA (e-mail: thavens@mtu.edu).

Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.

Digital Object Identifier 10.1109/TCYB.2015.2477416

2168-2267 © 2015 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.

I. INTRODUCTION

DATA clustering is the problem of partitioning a set of unlabeled objects O = {o1, o2, . . . , on} into k groups of similar objects, where 1 < k < n. Before clusters can be sought, it is necessary to estimate k; this is the cluster tendency problem. When each object is represented by a vector of attributes, data clustering is performed on feature vectors xi ∈ Rp, where xi is the p-dimensional feature vector for oi, 1 ≤ i ≤ n. These data can also be represented in the form of an n × n dissimilarity matrix D, where Dij represents the dissimilarity (distance) between oi and oj. Usually the Euclidean distance ||xi − xj|| is taken as the dissimilarity measure, but it can be any norm on Rp.

Many papers and books describe different data clustering approaches and their applications [1]–[6]. Among the large number of clustering approaches in the literature, the two largest groups are based on hierarchical clustering and centroid-based clustering. Hierarchical clustering relies on the fact that nearby objects have a higher probability of belonging to the same cluster than to a cluster containing objects that are farther away. This category includes single linkage (SL), which is based on cutting large edges in a minimum spanning tree (MST) [7]. In this paper, we discuss two connectivity-based algorithms, clusiVAT and clustering using representatives (CURE). Centroid-based clustering algorithms represent clusters as groups located in close proximity to their cluster centers. Most centroid-based models depend on optimizing an objective function, which typically measures a property such as: 1) intercluster separation; 2) within-cluster variance; or 3) both.

Technologies such as social media, mobile computing, and the realization of the Internet of Things (IoT) generate an exorbitant amount of data every day, which comprises the big data problem [8]–[11]. Big data approaches currently consider one or more aspects of the so called 5Vs (volume, velocity, variety, value, and veracity) [12]. This paper concentrates on the volume aspect of big data, which cannot be handled by conventional data clustering algorithms and therefore requires novel techniques.

Our main contributions in this paper are as follows.
1) We present our new clusiVAT algorithm for big data clustering and perform experiments to compare its performance with other popular big data clustering algorithms: k-means, single pass k-means (spkm), online k-means (okm), and CURE.
2) We perform experiments on 24 2-D datasets having Gaussian clusters of up to 1 000 000 samples, nine high-dimensional sets of Gaussian clusters (having a maximum of 500 000 500-dimensional datapoints), and two real-life datasets (the largest of which has 4 292 637 vectors with 41 features each), to show the usefulness of clusiVAT in terms of CPU time and partition accuracy (PA).
3) To illustrate the utility of clusiVAT for unlabeled data, we perform clustering experiments on indoor office
This article has been accepted for inclusion in a future issue of this journal. Content is final as presented, with the exception of pagination.
environment energy usage data from the University of Surrey, U.K. [13]–[15]. While clusiVAT is able to suggest the number of clusters in the dataset, other algorithms must rely on intuition or guesses, or need to use a clustering tendency algorithm for an estimate of k. clusiVAT partitions have the largest value of Dunn's index (DI) amongst candidates found by the five algorithms.
4) We apply the Friedman test to show the statistical significance of the accuracy ranking for the various clustering algorithms examined in this paper.

The rest of this paper is structured as follows. Section II contains related work. Our new clusiVAT model is discussed in Section III. Section IV gives brief descriptions of k-means and two big data relatives, spkm and okm. CURE is reviewed in Section V, and Section VI gives an overview of the Friedman test. In Section VII, we discuss the computational complexity of the clusiVAT algorithm. Section VIII contains numerical comparisons using real and synthetic datasets before summarizing in Section IX.

II. RELATED WORK

Data clustering is primarily concerned with separating objects into k different groups, which presupposes one important preclustering task, namely, estimating the number of clusters in the data (clustering tendency). The visual assessment of tendency (VAT) algorithm [16] addresses the question of clustering tendency by reordering the dissimilarity matrix D to obtain D*, so that different clusters may be displayed as dark blocks along the diagonal of the image of D*.

SL proceeds by connecting the next nearest vertex to the current edge until the complete MST is formed. k clusters are then formed by cutting the largest k − 1 edges of the MST. SL performs best if the clusters are long, chain-like clouds, well separated from each other. As cluster separation decreases and the clusters in the data start merging with each other, SL becomes unreliable. Nonetheless, SL has been successfully used in many data clustering applications. In the field of astronomy, dark matter halos were discovered by Lacey and Cole [17] using SL. In the field of wireless sensor networks, Moshtaghi et al. [18] used SL for anomaly detection. Dendrograms, which are visual representations of linkage clusters, are used in many numerical taxonomy applications [19]. In the field of healthcare, SL has been used to segment time-series sensor data for patient monitoring at eldercare facilities [20]. Zhang et al. [21] discussed a commercial application of VAT for role-based access security.

The k-means algorithm is one of the most popular clustering algorithms, mainly because of its simplicity and applicability in various fields, and it is used extensively in the literature as a benchmark for clustering algorithms. k-means was developed independently in different scientific fields. For continuous multidimensional data, k-means was first explicitly proposed by Steinhaus [22] in 1956 in the field of mechanics. In the field of communication, Lloyd [23] proposed k-means for least squares quantization in pulse code modulation in 1957 as a Bell Laboratory technical note (it was later published in 1982). Ball and Hall [24] proposed a k-means based method named ISODATA for data analysis and pattern classification in 1965. These algorithms are all batch methods that attempt to minimize a global objective function. Unfortunately, there is a sequential version of this model, essentially a competitive learning algorithm, that is also called k-means. This sequential version was first proposed by MacQueen [25] in 1967. Sequential k-means is an application of vector quantization which tries to find k means in the dataset for k clusters, where each data point belongs to the cluster with the nearest mean. In this paper, "k-means" refers to the batch version.

The k-means algorithm is easy to implement and is computationally efficient, but it has various limitations. For example, the number of clusters is an input for k-means, which is usually not known. More worrisome is the fact that k-means often gets stuck at a local trap state of its objective function, which may lead to incorrect cluster interpretations. This problem is usually ascribed to poor initialization. Another limitation of k-means is that its distance-based model for identifying good clusters depends on the topology of the norm used in its objective function. The usual model uses an inner product norm whose topology matches well with elliptically shaped clusters. Furthermore, k-means tries to impose the same shape on all k clusters. Thus, in some sense, k-means and SL work well for data distributions at geometrically opposite extremes.

A large number of algorithms based on both SL and k-means have been proposed for the big data clustering problem. To the best of our knowledge, the first scalable SL-based algorithm was proposed in [26], where it was called scalable-VAT (sVAT)-SL. The clusiVAT model and algorithm proposed in this paper are extensions of the ideas presented in [26]. Another scalable relative of sVAT-SL was discussed and compared to a fast MST algorithm called filter-Kruskal in [27]. As for the big data versions of k-means, a hierarchical version that divides the data into two parts at each step before clustering, named bisecting k-means, was proposed in [28]. A fast, scalable version of k-means was presented in [29], which does not require all the data to be stored in main memory at the same time. A fuzzy algorithm based on k-means for big data was proposed in [30]. Eschrich et al. [31] replaced groups of points with the group centroid to speed up a fuzzy version of k-means for big data. Feldman et al. [32] used coresets to approximate a large number of datapoints from big data by a single point. In this paper, we have used two big data adaptations of k-means, namely spkm and okm, which split the big dataset into small chunks of data before clustering for faster run time. An application of k-means based clustering is presented in [33].

CURE [34] is a sample-based algorithm for large datasets, which performs clustering on a small subset, and then extends the results to the entire dataset. It is able to identify clusters having nonspherical shapes and is robust to outliers. CURE seeks a middle ground between SL and k-means by initializing clusters in the sample, and then, akin to SL, merging nearest clusters until the desired value of k is attained. Since CURE combines elements of linkage-based and central tendency-based clustering methods, it is in some sense an ideal comparison method for the experiments presented later.
… Czekanowski [40] in 1909. Though it was a manual approach on a very small dataset of only 13 samples, it opened the way …

where Pij is the set of all paths from object i (oi) to object j (oj) in O. We use the recursive version of iVAT presented in [43] as
Fig. 1. Data scatterplot, VAT, sVAT, and siVAT images for small Gaussian clusters (top) and big Gaussian clusters (bottom). (a) Dataset N = 5000.
(b) VAT for N = 5000. (c) sVAT for n = 500. (d) siVAT for n = 500. (e) Dataset N = 1 000 000. (f) VAT for N = 1 000 000. (g) sVAT for n = 500.
(h) siVAT for n = 500.
it has time complexity of O(n²) as compared to O(n³) for the iterative construction of D* in [42]. Importantly, the theory that connects SL to VAT also holds for recursive iVAT, which preserves VAT order.

Though VAT and iVAT work fine for small datasets, they both suffer from resolution and memory constraints that limit their usefulness to input matrices sized on the order of 10⁵ or so. To overcome this limitation, scalable-VAT [sVAT (Algorithm 3)] was introduced in [44], which works by sampling the big dataset and then constructing a VAT or iVAT image of the sample. sVAT finds a small n × n distance matrix Dn of a subset of the big data X = {x1, x2, . . . , xN}, where n is a "VAT-sized" fraction of N. siVAT is just like sVAT, except it uses iVAT after the sampling step.

To illustrate VAT, sVAT, and siVAT, consider Fig. 1. Fig. 1(a) is the scatterplot of 5000 2-D data points randomly drawn from a four-component Gaussian mixture having equal prior probabilities. Its VAT image is shown in Fig. 1(b). Fig. 1(c) shows the sVAT image of 500 samples (10% of the total dataset), which was made in about 0.1% of the time taken to compute the full VAT image. Fig. 1(b) and (c) both suggest the presence of four clusters in the data by four dark blocks along the diagonal, but these dark blocks are much clearer in the siVAT image of the n = 500 sampled points [Fig. 1(d)]. To illustrate the extension of this idea to big data, Fig. 1(e) is a scatterplot of N = 1 000 000 2-D points drawn from the same four-component Gaussian mixture, with 250 000 points per cluster. In this case, we cannot generate a VAT image, indicated by the question mark (?) in Fig. 1(f). However, we can generate sVAT and siVAT images for this big dataset by sampling n = 500 points (0.05% of the total dataset) from DN. The sVAT image [Fig. 1(g)] suggests four clusters, which again are much sharper in the siVAT image [Fig. 1(h)].

Algorithm 4: clusiVAT
  Input:  X = {x1, x2, . . . , xN} − N p-dimensional data points
          k′ − overestimate of the actual number of clusters
          n − approximating sample size
  Output: D*n − n × n iVAT reordered dissimilarity matrix of Dn
          u : X → {1, 2, . . . , k} − cluster membership
  Apply siVAT on X, returning D*n, S, P, d
  Choose the number of clusters k using the siVAT image
  t = arg max_{1≤i≤k} d_i
  Form the aligned partition: u* = {t1 : t2 − t1 : . . . : tk − tk−1}
  u_{S_i} = u*_{P_i}, 1 ≤ i ≤ k
  for x̂ ∈ X̂ = X − X_S do
      j = arg min_{i∈S} dist(x̂, xi)
      u_x̂ = u_j (NPR)
  end

The only role played by the MST built by Prim's algorithm in VAT and iVAT is to acquire the array of indices as edges joining the MST. This array is used in the reordering operation. Now suppose that one of these images suggests that the best guess for the number of clusters in D is k. Having this estimate, we cut the k − 1 largest edges in the MST, resulting in k connected subtrees (the clusters). The essential step in clusiVAT is to extend this k-partition of Dn noniteratively to the unlabeled objects in X using the nearest (object) prototype rule (NPR). Pseudocode for our new clusiVAT algorithm is given in Algorithm 4.
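The NPR extension in the final loop of Algorithm 4 is a plain nearest-neighbour assignment from each unsampled point to the labelled sample. A minimal pure-Python sketch (illustrative names, not the paper's code):

```python
def npr_extend(sample, sample_labels, rest):
    """Nearest prototype rule: each point in `rest` receives the label
    of its nearest point in the labelled sample."""
    def d2(a, b):  # squared Euclidean distance; enough for an argmin
        return sum((u - v) ** 2 for u, v in zip(a, b))
    return [sample_labels[min(range(len(sample)), key=lambda i: d2(x, sample[i]))]
            for x in rest]
```

Because the pass over X − X_S is noniterative, this step costs O((N − n) · n) distance evaluations and needs no convergence test, which is what keeps the extension cheap relative to iterative relabelling schemes.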
recursive version of iVAT [43] that we use in clusiVAT preserves VAT ordering, the same clustering principle applies to clusiVAT as well.

For datasets having DI V_D(U, X) < 1, SL clusters cannot be guaranteed by clusiVAT. For such datasets, clusiVAT becomes a novel clustering algorithm that, in our experience to date, produces much better clusters than SL [27]. Very few datasets have the CS property, but clusiVAT can be used for arbitrarily large datasets (whether CS or not) and, as we shall demonstrate in our comparison experiments, it produces highly accurate clusters in less time than several well known k-means related alternatives for clustering big data.

IV. k-MEANS AND RELATED ALGORITHMS

Consider N p-dimensional points, X = {x1, x2, . . . , xN}, to be clustered into k clusters. Let the set of clusters be U ∼ {X1, X2, . . . , Xk}. k-means seeks a partition having an overall minimum squared error between the sample means of the clusters and the points in the clusters. The mean mi of cluster Xi is given by

    mi = ( Σ_{xj∈Xi} xj ) / |Xi|                              (5)

where |Xi| is the number of data points in cluster Xi. The squared error for cluster Xi is defined as

    J(Xi) = Σ_{xj∈Xi} ||xj − mi||².                           (6)

The norm || · || can be any vector norm on Rp. The usual choice is the Euclidean norm, and that is what we use in our numerical experiments. k-means aims at minimizing the sum of squared errors for all k clusters, defined as

    J(U) = Σ_{i=1}^{k} J(Xi) = Σ_{i=1}^{k} Σ_{xj∈Xi} ||xj − mi||².   (7)

Minimizing the squared error is an NP-hard problem [47]. We have used the standard k-means algorithm which uses …

    Jw(U) = Σ_{i=1}^{k} Σ_{xj∈Xi} wj ||xj − mw,i||².          (9)

wkm attempts to minimize the overall weighted (within groups) sum of squared errors. wkm is used in the next two sections, and takes as input n p-dimensional datapoints {x1, x2, . . . , xn}, their corresponding weights {w1, w2, . . . , wn}, and the number of clusters k. An alternative initialization is to specify initial centroids {m1, m2, . . . , mk}. If initial centroids are not provided, we randomly select k input points as initial centroids. The outputs returned by wkm are a set of k clusters U ∼ {X1, X2, . . . , Xk}, their centroids M = {mw,1, mw,2, . . . , mw,k}, and the cluster membership function u : X → {1, 2, . . . , k}.

A. Single Pass k-Means

spkm (Algorithm 5) is a crisp adaptation of the single pass fuzzy c-means algorithm discussed in [48]. Let N be the number of points in a dataset that cannot all be loaded into main memory; let n denote the number of points that can be loaded into memory; and let s = ⌈N/n⌉. spkm divides the N points in the big dataset into s chunks of small data. A portion of the data is loaded into memory and k-means is applied to obtain k clusters. This first set of input data is replaced by the k weighted means {mw,i : 1 ≤ i ≤ k}, where the weights are the numbers of points in each cluster, {wi = |Xi| : 1 ≤ i ≤ k}. These k weighted centroids are then merged with the next data chunk and wkm is applied to this merged set, with the centroids from the previous k-means run taken as initial centroids. This process is repeated until the whole dataset is loaded and processed. After obtaining the final k centroids, the big data are labeled based on the NPR as shown in Algorithm 5.

spkm can be used with arbitrarily large input data. The two most important disclaimers about its effectiveness are that: 1) each of the s steps in this procedure is subject to the same limitations and problems as the parent (k-means) and 2) the output is clearly dependent on the way the big data are divided into s chunks. This becomes clear if you interpret …
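The chunked data flow of spkm can be sketched in one dimension as follows. This is a schematic toy (1-D points, a naive deterministic initialization, and hypothetical helper names), intended only to show how each processed chunk is condensed into k weighted means that seed wkm on the next chunk; it is not the algorithm of [48]:

```python
def wkm(points, weights, centroids, iters=20):
    """Weighted k-means (1-D toy): Lloyd iterations on weighted points."""
    centroids = list(centroids)
    for _ in range(iters):
        groups = [[] for _ in centroids]
        for p, w in zip(points, weights):
            j = min(range(len(centroids)), key=lambda i: (p - centroids[i]) ** 2)
            groups[j].append((p, w))
        for j, g in enumerate(groups):
            tot = sum(w for _, w in g)
            if tot > 0:  # empty clusters keep their old centroid
                centroids[j] = sum(p * w for p, w in g) / tot
    return centroids

def spkm(chunks, k, iters=20):
    """Single pass k-means sketch: after each chunk, replace the processed
    data by k weighted means that are merged with the next chunk."""
    carried_pts, carried_wts, cents = [], [], []
    for chunk in chunks:
        pts = carried_pts + list(chunk)
        wts = carried_wts + [1.0] * len(chunk)
        init = cents if cents else sorted(pts)[:k]  # naive deterministic init
        cents = wkm(pts, wts, init, iters)
        # condense: each cluster -> (centroid, total weight of its points)
        carried_pts, carried_wts = [], []
        for j, c in enumerate(cents):
            w = sum(wt for p, wt in zip(pts, wts)
                    if min(range(len(cents)), key=lambda i: (p - cents[i]) ** 2) == j)
            if w > 0:
                carried_pts.append(c)
                carried_wts.append(w)
        cents = carried_pts
    return cents
```

The memory footprint per step is one chunk plus k weighted means, which is what allows the pass over arbitrarily large N; the two disclaimers above (local minima, chunk-order dependence) apply to this sketch exactly as they do to the full method.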
V. CURE ALGORITHM

CURE [34] is a hierarchical clustering algorithm that randomly samples a fixed number of points from the large dataset so that the representative points (hopefully) retain the geometry of the entire dataset. Let the sampled dataset be S, and assume that it contains k clusters. Each cluster is represented by a fixed number χ of well-scattered points called representative points, which are shrunken toward the mean of the cluster by a fraction α to have a compact representation of the cluster as well as to minimize the …

VI. FRIEDMAN TEST

The Friedman test [37] ranks the algorithms for each dataset separately, so that the best performing algorithm gets rank 1, the second best rank 2, and so on. In the case of ties, average ranks are assigned to each of them. Let r_i^j be the rank of the jth of the A algorithms on the ith of the B datasets. The average rank of the jth algorithm over all the datasets is then given by R_j = (1/B) Σ_{i=1}^{B} r_i^j. Under the null hypothesis, which states that all the algorithms behave similarly and thus
Fig. 2. Distinguished object and random sample selection for clusiVAT experiment of a big non-CS dataset having N = 1 000 000 and k = 10 [PA = 99.92%
(left and right tip of the yellow cluster and bottom of the green cluster are the errors)]. (a) Ground truth scatter plot. (b) Random samples from each partition.
(c) siVAT image of samples. clusiVAT (d) MST image of 100 point sample, (e) partition image of sample points, and (f) partition image of entire dataset.
their ranks R_j should be equal, the Friedman statistic

    χ²_F = [ 12B / (A(A + 1)) ] [ Σ_{j=1}^{A} R_j² − A(A + 1)²/4 ]      (10)

is distributed according to a χ² distribution with A − 1 degrees of freedom. Iman and Davenport [52] showed that Friedman's χ²_F presents a conservative behavior and proposed a better statistic

    F_F = (B − 1)χ²_F / ( B(A − 1) − χ²_F )                              (11)

which is distributed according to the F-distribution with A − 1 and (A − 1) × (B − 1) degrees of freedom.

VII. COMPUTATIONAL COMPLEXITY

In this section, we discuss the computational complexity and PA of the clusiVAT algorithm. Consider a dataset X having N p-dimensional datapoints. The first step in clusiVAT is the selection of k′ distinguished objects which are at a maximum distance from each other. This step divides the entire dataset into k′ partitions which (on average) span almost equally sized subspaces of Rp. This step has time complexity linear in k′. The next step in clusiVAT is to randomly select objects from the k′ partitions to get a total of n samples. The number of objects selected from each partition is proportional to the number of datapoints in that partition. These n samples, which are just a small fraction of N, retain the approximate geometry of the dataset. In the next step, VAT is applied to the n samples, which (including construction of Dn from X) has a time complexity of O(n²). So the N × N distance matrix for the big dataset (DN) is never needed, but just the n × n distance matrix of the sampled dataset (Dn). The time reported for all the experiments in this paper includes the time taken to calculate this small distance matrix of the sampled points.

To illustrate the above point, consider the 2-D non-CS big dataset shown in Fig. 2(a). It consists of k = 10 clusters comprising 1 000 000 points, which are intermixed with each other and hence difficult to cluster for any algorithm. In this experiment we have taken k′ = 20 and n = 100. Fig. 2(a) also shows the 20 distinguished objects found using the sampling procedure of the clusiVAT algorithm (shown by bold black dots) and their corresponding partition of the dataset (shown by solid black lines). Fig. 2(b) shows the 100 randomly chosen samples, to which the iVAT algorithm is applied. Different clusters are more clearly visible in Fig. 2(b), and hence easier to cluster. Fig. 2(c) shows the siVAT image, which suggests four clusters at a coarse level [if you view Fig. 2(b) from a distance, you see four clusters at the four corners of the frame], while finer level examination of Fig. 2(c) reveals the presence of ten clusters. Fig. 2(d) shows the MST generated using clusiVAT. Its largest nine edges are shown in green; these are cut to obtain the ten clusters of the sampled data shown in Fig. 2(e). Finally, Fig. 2(f) shows the partition image for the entire dataset generated using the NPR.

This example shows that clusiVAT can find a high quality partition using a very small subsample of the dataset, and the sampling process ensures that we get a fairly good representation of the entire dataset with a small number of sample points. k-means and related algorithms process the entire dataset either all at once (for k-means) or in parts (spkm and okm). For CURE, the initial sampling step randomly selects a fixed number of samples from the dataset, which does not guarantee that the samples retain the geometry of the entire dataset. If the number of clusters is small, the probability of CURE samples retaining the data geometry is high, but as the number of clusters increases, the accuracy of CURE decreases because the samples do not always retain the geometry of the entire dataset.
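Equations (10) and (11), together with the tie-aware ranking of Section VI, translate directly into code. The sketch below is an illustrative helper (not the paper's implementation): it first assigns per-dataset ranks where tied entries share the average of the ranks they span, then computes the Friedman statistic and the Iman–Davenport correction:

```python
def friedman_ranks(scores):
    """Rank one dataset's scores (higher is better): best gets rank 1;
    tied entries share the average of the ranks they span."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    ranks = [0.0] * len(scores)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and scores[order[end + 1]] == scores[order[pos]]:
            end += 1
        avg = (pos + 1 + end + 1) / 2.0  # ranks are 1-based
        for idx in order[pos:end + 1]:
            ranks[idx] = avg
        pos = end + 1
    return ranks

def friedman_statistics(rank_matrix):
    """rank_matrix[i][j] = rank of algorithm j on dataset i (B x A).
    Returns (chi2_F, F_F) following (10) and (11)."""
    B, A = len(rank_matrix), len(rank_matrix[0])
    R = [sum(row[j] for row in rank_matrix) / B for j in range(A)]  # average ranks R_j
    chi2 = (12.0 * B / (A * (A + 1))) * (sum(r * r for r in R) - A * (A + 1) ** 2 / 4.0)
    F = (B - 1) * chi2 / (B * (A - 1) - chi2)  # Iman-Davenport correction
    return chi2, F
```

For the tie example discussed later in Section VIII-G (PA values 100, 92.5, 100, 80, 100), `friedman_ranks` returns [2, 4, 2, 5, 2]: the three algorithms tied at PA = 100 each receive the average of ranks 1–3.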
TABLE I
AVERAGE RESULTS OF 25 RUNS FOR THE 12 BIG DATASETS OF CS GAUSSIAN CLUSTERS

TABLE II
AVERAGE RESULTS OF 25 RUNS FOR THE 12 BIG DATASETS OF NON-CS GAUSSIAN CLUSTERS

TABLE III
AVERAGE RESULTS OF 25 RUNS FOR THE NINE HIGH-DIMENSIONAL BIG DATASETS HAVING 100 NON-CS GAUSSIAN CLUSTERS

TABLE IV
AVERAGE RESULTS OF 25 RUNS FOR FOREST COVER TYPE DATASET
Fig. 5. siVAT reordered distance matrix image for n = 70 sample points of the Forest dataset.

Fig. 6. siVAT reordered distance matrix image for n = 230 sample points of the KDD-99 training dataset.
clusiVAT recovers the highest percentage (44.7%) of the ground truth labels with a run time of 4.23 s, and CURE is a very close second, with PA = 43.6%, but at a time cost that is about 12 times higher than clusiVAT. The siVAT image in Fig. 5 of the forest data carries no suggestion that k = 7 is a good assessment of cluster tendency in the forest data. In fact, this image suggests k = 2 clusters at low resolution, and within the larger cluster perhaps k = 15 smaller clusters; so, we are not unhappy with these accuracies.

E. Example 5 (KDD Cup 99 Data Experiment)

This data set was used for the Third International Knowledge Discovery and Data Mining Tools Competition. The KDD-99 training dataset consists of 4 292 637 instances of 41-dimensional vectors and is labeled data that specifies the attack type (normal or attack). We normalized the 41 features to the interval [0, 1] so that they all had the same scale. The ground truth partition is not CS, as DI for the ground truth labels is 0.

KDD-99 has 22 simulated attack types, which fall into one of the following four categories [54]. Denial of service attack makes some computing or memory resources too busy or too full to handle legitimate requests. It consists of the attacks: "neptune," "back," "smurf," "pod," "land," and "teardrop." Users to root attack starts out with access to a normal user account on the system and is able to exploit some vulnerability to obtain root access to the system. It contains the following attacks: "buffer_overflow," "loadmodule," "rootkit," and "perl." Remote to local attack sends packets to a machine to which the attacker does not have legitimate access, and exploits some vulnerability to gain local access as a user. It consists of the attacks: "warezclient," "multihop," "ftp_write," "imap," "guess_passwd," "warezmaster," "spy," and "phf." Probing attack attempts to gather information about a network of computers for the apparent purpose of circumventing its security. It contains the following attacks: "portsweep," "satan," "nmap," and "ipsweep."

Fig. 6 shows a siVAT image for the KDD-99 training dataset. The middle big black block represents the "smurf" attack, comprising 60% of the total dataset. The top left corner black block represents the "normal" case (approximately 18% of the total datapoints) and the bottom right corner block represents the "neptune" attack (approximately 20% of the total datapoints). The remaining attack types are represented by very small black subblocks along the diagonal.

We have performed experiments to cluster different attack types, and the average results of 25 runs are given in Table V. clusiVAT performs well, having an average accuracy of 97.1% and a minimum average time of 76.0 s.
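The per-feature scaling to [0, 1] used above for KDD-99 is ordinary min–max normalization. A minimal sketch (constant columns are mapped to 0 here, one reasonable convention; the paper does not specify its handling of that case):

```python
def minmax_normalize(X):
    """Scale each column of X (a list of equal-length rows) to [0, 1]."""
    cols = list(zip(*X))
    lo = [min(c) for c in cols]
    hi = [max(c) for c in cols]
    return [[(v - l) / (h - l) if h > l else 0.0
             for v, l, h in zip(row, lo, hi)]
            for row in X]
```

Without this step, features on large numeric ranges (e.g., byte counts) would dominate the Euclidean distances used by every algorithm compared here.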
TABLE V
AVERAGE RESULTS OF 25 RUNS FOR KDD-99 TRAINING DATASET

TABLE VI
PIR, NOISE, AND LIGHT SENSOR VALUE PLOTS FOR THREE CLUSTERS AT NODE 4

TABLE VII
AVERAGE RESULTS OF 25 RUNS FOR REDUCE ENERGY DATASET
These DI values indicate that the clusiVAT partitions are superior with regard to Dunn's validity measure.

G. Friedman Test

In this experiment, we apply the Friedman test [37] to the results obtained from all A = 5 clustering algorithms on all B = 39 datasets (described in Examples 1-6). For the datasets in Examples 1-5 we use PA, and for Example 6 we use DI, as the measure for ranking the algorithms. The algorithm with the highest PA/DI is assigned rank 1, the second highest rank 2, and so on. If two or more algorithms have the same value of PA/DI, each is assigned the average of their ranks. For example, for the N = 100 000, k = 3 CS big dataset in Example 1 [Section VIII-A], the PA values for clusiVAT, k-means, spkm, okm, and CURE are 100, 92.5, 100, 80, and 100, respectively. The highest PA of 100 is achieved by clusiVAT, spkm, and CURE, so each of them is assigned a rank of 2 (the average of ranks 1-3); k-means is assigned rank 4 and okm rank 5.

The average ranks of the algorithms over all 39 datasets are 1.56, 4.18, 2.17, 4.36, and 2.73 for clusiVAT, k-means, spkm, okm, and CURE, respectively, giving the Friedman statistics χF² = 94.64 and FF = 106.52 [using (10) and (11)]. With A = 5 algorithms and B = 39 datasets, χF² is distributed according to a χ² distribution with A − 1 = 4 degrees of freedom, and FF according to an F distribution with A − 1 = 4 and (A − 1) × (B − 1) = 152 degrees of freedom. The probability of the null hypothesis (that all the algorithms behave similarly, so their ranks Rj should be equal), computed using both χ²(4) and F(4, 152), is effectively 0, so the null hypothesis is rejected. Hence, on the basis of the experiments performed in this paper, we conclude that the ranking of the clustering algorithms based on PA (DI for Example 6) is consistent: from best to worst, clusiVAT, spkm, CURE, k-means, and okm.

IX. CONCLUSION

In this paper, we have illustrated our new clusiVAT algorithm for big datasets and have compared its performance to four other popular clustering algorithms: 1) k-means; 2) spkm; 3) okm; and 4) CURE.

To show the usefulness of clusiVAT in terms of CPU time and PA, we performed experiments on 24 2-D synthetic datasets (with a maximum of 1 000 000 datapoints), nine high-dimensional synthetic datasets (with a maximum of 500 000 datapoints of 500 dimensions each), and two real-life big datasets (the largest of which has 4 292 637 vectors with 41 features each). We found that for CS datasets our new clusiVAT gives an accuracy of 100% in much less time than k-means and its variants, and CURE. For 2-D non-CS datasets, clusiVAT gives quite high accuracy (≥99.8%) in 12-18 times less CPU time than k-means and its relatives, and 60-90 times less CPU time than CURE. To illustrate the utility of clusiVAT for unlabeled data, we performed experiments on an energy dataset and demonstrated that clusiVAT produced clusters having a DI much greater than 1. Since the data are unlabeled, there is no way to assess PA for the energy data. The Friedman test performed on the PA and DI results for the different datasets validates the performance ranking of the various clustering algorithms, the average ranks being 1.56, 4.18, 2.17, 4.36, and 2.73 for clusiVAT, k-means, spkm, okm, and CURE, respectively.

k-means and its big data variants are sometimes plagued by initialization issues, which result in extremely poor performance; this brings down their average accuracy. Of the k-means algorithms, spkm seemed to perform the best because of its accuracy on small datasets. CURE creates relatively good clusters for smallish CS and non-CS datasets, but takes much more time than clusiVAT. A significant advantage of clusiVAT, as compared to CURE, is the siVAT image of the clusiVAT sample, which provides a best guess for k, whereas CURE must be initialized with an uninformed, user-supplied best guess. In summary, we think that the examples presented here justify further study of clusiVAT for detecting substructure in big data. We are especially interested in testbeds for real-world problems, so our next project is to move this model and algorithm into the big data applications domain.

REFERENCES

[1] A. Jain, M. Murty, and P. Flynn, "Data clustering: A review," ACM Comput. Surv., vol. 31, no. 3, pp. 264-323, Sep. 1999.
[2] D. Jiang, C. Tang, and A. Zhang, "Cluster analysis for gene expression data: A survey," IEEE Trans. Knowl. Data Eng., vol. 16, no. 11, pp. 1370-1386, Nov. 2004.
[3] A. K. Jain, "Data clustering: 50 years beyond k-means," in Machine Learning and Knowledge Discovery in Databases. Berlin, Germany: Springer, 2008, pp. 3-4.
[4] J. Bezdek, Pattern Recognition With Fuzzy Objective Function Algorithms. New York, NY, USA: Plenum, 1981.
[5] Y. Yang, Z. Ma, Y. Yang, F. Nie, and H. T. Shen, "Multitask spectral clustering by exploring intertask correlation," IEEE Trans. Cybern., vol. 45, no. 5, pp. 1069-1080, May 2015.
[6] H. Zhu, C. Liu, Y. Ge, H. Xiong, and E. Chen, "Popularity modeling for mobile Apps: A sequential approach," IEEE Trans. Cybern., vol. 45, no. 7, pp. 1303-1314, Jul. 2015.
[7] R. Sibson, "SLINK: An optimally efficient algorithm for the single-link cluster method," Comput. J. (Brit. Comput. Soc.), vol. 16, no. 1, pp. 30-34, Jan. 1973.
[8] J. Gubbi, R. Buyya, S. Marusic, and M. Palaniswami, "Internet of Things (IoT): A vision, architectural elements, and future directions," Future Gener. Comput. Syst., vol. 29, no. 7, pp. 1645-1660, Sep. 2013.
This article has been accepted for inclusion in a future issue of this journal. Content is final as presented, with the exception of pagination.
[9] A. Shilton, S. Rajasegarar, C. Leckie, and M. Palaniswami, "DP1SVM: A dynamic planar one-class support vector machine for Internet of Things environment," in Proc. Int. Conf. Rec. Adv. Internet Things (RIoT), Singapore, Apr. 2015, pp. 1-6.
[10] J. Jin, J. Gubbi, S. Marusic, and M. Palaniswami, "An information framework for creating a smart city through Internet of Things," IEEE Internet Things J., vol. 1, no. 2, pp. 112-121, Apr. 2014.
[11] Internet of Things (IoT) for Creating Smart Cities. [Online]. Available: http://issnip.unimelb.edu.au/research_program/sensor_networks/Internet_of_Things, accessed Jun. 5, 2015.
[12] D. Laney, 3D-Data Management: Controlling Data Volume, Velocity and Variety, Gartner, Stamford, CT, USA, 2001.
[13] L. Rashidi et al., "Profiling spatial and temporal behaviour in sensor networks: A case study in energy monitoring," in Proc. IEEE 9th Int. Conf. Intell. Sensors Sensor Netw. Inf. Process. (ISSNIP), Singapore, Apr. 2014, pp. 1-7.
[14] M. Nati, A. Gluhak, H. Abangar, and W. Headley, "SmartCampus: A user-centric testbed for Internet of Things experimentation," in Proc. 16th Int. Symp. Wireless Pers. Multimedia Commun. (WPMC), Atlantic City, NJ, USA, Jun. 2013, pp. 1-6.
[15] S. Rajasegarar et al., "Ellipsoidal neighbourhood outlier factor for distributed anomaly detection in resource constrained networks," Pattern Recognit., vol. 47, no. 9, pp. 2867-2879, Sep. 2014.
[16] J. C. Bezdek and R. J. Hathaway, "VAT: A tool for visual assessment of (cluster) tendency," in Proc. Int. Joint Conf. Neural Netw. (IJCNN), Honolulu, HI, USA, May 2002, pp. 2225-2230.
[17] C. Lacey and S. Cole, "Merger rates in hierarchical models of galaxy formation. II: Comparison with N-body simulations," Mon. Not. Roy. Astron. Soc., vol. 271, no. 3, pp. 676-692, Feb. 1994.
[18] M. Moshtaghi et al., "Clustering ellipses for anomaly detection," Pattern Recognit., vol. 44, no. 1, pp. 55-69, Jan. 2011.
[19] P. H. A. Sneath and R. R. Sokal, Numerical Taxonomy—The Principles and Practice of Numerical Classification. San Francisco, CA, USA: W. H. Freeman, 1973.
[20] A. Wilbik, J. M. Keller, and J. C. Bezdek, "Linguistic prototypes for data from eldercare residents," IEEE Trans. Fuzzy Syst., vol. 22, no. 1, pp. 110-123, Mar. 2013.
[21] D. Zhang, K. Ramamohanarao, S. Versteeg, and R. Zhang, "RoleVAT: Visual assessment of practical need for role based access control," in Proc. Conf. Comput. Security Appl., Honolulu, HI, USA, Dec. 2009, pp. 13-22.
[22] H. Steinhaus, "Sur la division des corp materiels en parties," Bull. Acad. Polon. Sci., vol. 4, no. 12, pp. 801-804, 1956.
[23] S. Lloyd, "Least squares quantization in PCM," IEEE Trans. Inf. Theory, vol. 28, no. 2, pp. 129-137, Mar. 1982.
[24] G. Ball and D. Hall, "ISODATA, a novel method of data analysis and pattern classification," Stanford Res. Inst., Stanford, CA, USA, Tech. Rep. NTIS AD 699616, 1965.
[25] J. MacQueen, "Some methods for classification and analysis of multivariate observations," in Proc. 5th Berkeley Symp. Math. Stat. Probab., Berkeley, CA, USA, 1967, pp. 281-297.
[26] T. Havens, J. C. Bezdek, and M. Palaniswami, "Scalable single linkage clustering for big data," in Proc. IEEE ISSNIP, Melbourne, VIC, Australia, Apr. 2013, pp. 396-401.
[27] D. Kumar et al., "clusiVAT: A mixed visual/numerical clustering algorithm for big data," in Proc. IEEE Int. Conf. Big Data, Silicon Valley, CA, USA, Oct. 2013, pp. 112-117.
[28] M. Steinbach, G. Karypis, and V. Kumar, "A comparison of document clustering techniques," in Proc. Workshop KDD, 2000.
[29] P. Bradley, U. Fayyad, and C. Reina, "Scaling clustering algorithms to large databases," in Proc. 4th Int. Conf. Knowl. Disc. Data Mining, Menlo Park, CA, USA, 1998, pp. 9-15.
[30] T. C. Havens, J. C. Bezdek, C. Leckie, L. O. Hall, and M. Palaniswami, "Fuzzy c-means algorithms for very large data," IEEE Trans. Fuzzy Syst., vol. 20, no. 6, pp. 1130-1146, Dec. 2012.
[31] S. Eschrich, J. Ke, L. O. Hall, and D. B. Goldgof, "Fast accurate fuzzy clustering through data reduction," IEEE Trans. Fuzzy Syst., vol. 11, no. 2, pp. 262-270, Apr. 2003.
[32] D. Feldman, M. Schmidt, and C. Sohler, "Turning big data into tiny data: Constant-size coresets for k-means, PCA and projective clustering," in Proc. 24th Annu. ACM Symp. Discrete Algorithms, New Orleans, LA, USA, 2013, pp. 1434-1453.
[33] J. Cao, Z. Wu, J. Wu, and H. Xiong, "Sail: Summation-based incremental learning for information-theoretic text clustering," IEEE Trans. Cybern., vol. 43, no. 2, pp. 570-584, Apr. 2013.
[34] S. Guha, R. Rastogi, and K. Shim, "CURE: An efficient clustering algorithm for large databases," in Proc. ACM SIGMOD Int. Conf. Manage. Data, New York, NY, USA, Jun. 1998, pp. 73-84.
[35] J. Demšar, "Statistical comparisons of classifiers over multiple data sets," J. Mach. Learn. Res., vol. 7, pp. 1-30, Dec. 2006. [Online]. Available: http://dl.acm.org/citation.cfm?id=1248547.1248548
[36] S. García, A. Fernández, J. Luengo, and F. Herrera, "Advanced nonparametric tests for multiple comparisons in the design of experiments in computational intelligence and data mining: Experimental analysis of power," Inf. Sci., vol. 180, no. 10, pp. 2044-2064, May 2010. [Online]. Available: http://dx.doi.org/10.1016/j.ins.2009.12.010
[37] M. Friedman, "The use of ranks to avoid the assumption of normality implicit in the analysis of variance," J. Amer. Statist. Assoc., vol. 32, no. 200, pp. 675-701, Dec. 1937.
[38] R. A. Fisher, Statistical Methods and Scientific Inference. New York, NY, USA: Hafner, 1959.
[39] W. Petrie, "Sequences in prehistoric remains," J. Anthropol. Inst. Great Britain Ireland, vol. 29, nos. 3-4, pp. 295-301, 1899.
[40] J. Czekanowski, "Zur differentialdiagnose der neandertal-gruppe" [On the differential diagnosis of the Neanderthal group], Korrespondenzblatt Deutsch. Ges. Anthropol. Ethnol. Urgesch., vol. 40, nos. 6-7, pp. 44-47, 1909.
[41] L. Wilkinson and M. Friendly, "The history of the cluster heat map," Amer. Statist., vol. 63, no. 2, pp. 179-184, May 2009.
[42] L. Wang, X. Geng, J. Bezdek, C. Leckie, and K. Ramamohanarao, "Enhanced visual analysis for cluster tendency assessment and data partitioning," IEEE Trans. Knowl. Data Eng., vol. 22, no. 10, pp. 1401-1414, Oct. 2010.
[43] T. C. Havens and J. C. Bezdek, "An efficient formulation of the improved visual assessment of cluster tendency (iVAT) algorithm," IEEE Trans. Knowl. Data Eng., vol. 24, no. 5, pp. 813-822, May 2012.
[44] R. J. Hathaway, J. C. Bezdek, and J. M. Huband, "Scalable visual assessment of cluster tendency for large data sets," Pattern Recognit., vol. 39, no. 7, pp. 1315-1324, Jul. 2006.
[45] J. C. Dunn, "A fuzzy relative of the ISODATA process and its use in detecting compact well-separated clusters," J. Cybern., vol. 3, no. 3, pp. 32-57, 1973.
[46] T. C. Havens, J. C. Bezdek, J. M. Keller, M. Popescu, and J. M. Huband, "Is VAT really single linkage in disguise?" Ann. Math. Artif. Intell., vol. 55, nos. 3-4, pp. 237-251, Apr. 2009.
[47] D. Aloise, A. Deshpande, P. Hansen, and P. Popat, "NP-hardness of Euclidean sum-of-squares clustering," Mach. Learn., vol. 75, no. 2, pp. 245-248, Jan. 2009. [Online]. Available: http://dx.doi.org/10.1007/s10994-009-5103-0
[48] P. Hore, L. Hall, and D. Goldgof, "Single pass fuzzy C means," in Proc. IEEE Int. Fuzzy Syst. Conf., London, U.K., Jul. 2007, pp. 1-7.
[49] P. Hore et al., "A scalable framework for segmenting magnetic resonance images," J. Signal Process. Syst., vol. 54, nos. 1-3, pp. 183-203, Jan. 2009.
[50] H. Samet, Spatial Data Structures. Reading, MA, USA: Addison-Wesley, 1995.
[51] T. H. Cormen, C. Leiserson, R. Rivest, and C. Stein, Introduction to Algorithms. Cambridge, MA, USA: MIT Press, 2001.
[52] R. L. Iman and J. M. Davenport, "Approximations of the critical region of the Friedman statistic," Commun. Statist., vol. 9, pp. 571-595, Jan. 1980.
[53] J. A. Blackard and D. J. Denis, "Comparative accuracies of artificial neural networks and discriminant analysis in predicting forest cover types from cartographic variables," Comput. Electron. Agri., vol. 24, no. 3, pp. 131-151, 2000.
[54] M. Tavallaee, E. Bagheri, W. Lu, and A. Ghorbani, "A detailed analysis of the KDD'99 CUP data set," in Proc. 2nd IEEE Symp. Comput. Intell. Security Defense Appl. (CISDA), Ottawa, ON, Canada, 2009, pp. 1-6.

Dheeraj Kumar received the B.Tech. and M.Tech. dual degrees in electrical engineering from the Indian Institute of Technology Kanpur, Kanpur, India, in 2010. He is currently pursuing the Ph.D. degree with the Department of Electrical and Electronic Engineering, University of Melbourne, Melbourne, VIC, Australia.
His current research interests include big data clustering, incremental clustering, spatio-temporal estimations, Internet of Things, machine learning, pattern recognition, and signal processing.
James C. Bezdek (LF'10) received the Ph.D. degree in applied mathematics from Cornell University, Ithaca, NY, USA, in 1973.
His current research interests include optimization, pattern recognition, clustering in very large data, co-clustering, visual clustering, and cluster validity.
Prof. Bezdek was a recipient of the IEEE 3rd Millennium, the IEEE Computational Intelligence Society Fuzzy Systems Pioneer, the IEEE Frank Rosenblatt Technical Field Award, and the Kampe de Feriet medals. He is the Past President of the North American Fuzzy Information Processing Society, the International Fuzzy Systems Association (IFSA), and the IEEE Computational Intelligence Society, the Founding Editor of the International Journal of Approximate Reasoning and the IEEE Transactions on Fuzzy Systems, and a Life Fellow of the IFSA.

Sutharshan Rajasegarar received the B.Sc. Engineering degree in electronic and telecommunication engineering (First Class Hons.) from the University of Moratuwa, Moratuwa, Sri Lanka, in 2002, and the Ph.D. degree from the University of Melbourne, Melbourne, VIC, Australia, in 2009.
He is currently a Research Fellow with the Department of Electrical and Electronic Engineering, University of Melbourne. His current research interests include wireless sensor networks, anomaly/outlier detection, spatio-temporal estimations, Internet of Things, machine learning, pattern recognition, signal processing, and wireless communication.