
IAETSD Journal for Advanced Research in Applied Sciences, Volume 4, Issue 1, Jan-June/2017

ISSN (Online): 2394-8442

CUMULATIVE SUPERVISED CLUSTERING ENSEMBLE


APPROACHES FOR HIGH DIMENSIONAL DATA
S.Aravind
Student, Department of Computer Science and Engineering, Karunya University, Coimbatore.

ABSTRACT.
Water quality is fundamental to good river health. Water is essential to human life and the health of the environment. As a
valuable natural resource, it comprises marine, estuarine, freshwater (river and lake) and groundwater environments that stretch across
coastal and inland areas. Water quality is commonly defined by its physical, chemical, biological and aesthetic (appearance and smell)
characteristics. A first step toward addressing this challenge is the use of clustering techniques to identify polluted water bodies or
polluted areas. Many clustering algorithms have been applied to water quality data to distinguish the range of water quality across various
parts of India. An ensemble clustering approach is proposed to cluster the sample data by reaching a consensus on the position of the centroids
in the target clustering. From the result, the water quality is identified and the water is assigned to different purposes depending on its
quality range. The aim is to group similar centroids in each consensus chain by solving the cluster correspondence problem; the objective is to
globally optimize the assignment Γ* out of the set F of all possible families of centroid assignments to k consensus chains.

Keywords: Clustering, water quality, high dimensional data, cumulative addition, overfitting

I. INTRODUCTION
Water quality is important because it directly affects the health of the people, animals and plants that drink or otherwise use the water. When
water quality is compromised, its usage puts users at risk of developing health complications, and the environment also suffers when the quality
of water is low. Various parameters have been measured across water bodies to identify the quality of water. Water quality data from 2500
monitoring stations, located on all important rivers and lakes and including stations for groundwater assessment, form the basis of the National
Water Monitoring Programme (NWMP). The ambient water quality data of aquatic resources in India for 2012 are used here. The minimum, maximum and
mean values of water quality parameters such as temperature, Dissolved Oxygen (D.O.), pH, conductivity, Biochemical Oxygen Demand (B.O.D.),
Nitrate-N and Nitrite-N, Faecal Coliform and Total Coliform are provided in the dataset. From this dataset we can explore the quality of water
across various parts of India and also determine how to use the water for various purposes through the cluster ensemble approach.

II. PROBLEM STATEMENT


Existing cluster ensembles do not satisfy a good quality assumption; that is, the predictions produced by a cluster ensemble do not have good
accuracy. The result may also be biased for high dimensional data: normal ensemble methods deal only with low dimensional data and
never with high dimensional data.

Issues with existing clustering techniques: current clustering techniques do not address all the requirements adequately. Dealing with a
large number of dimensions and a large number of data items can be problematic because of time complexity. The effectiveness of a method
depends on the definition of "distance" (for distance-based clustering); if an obvious distance measure does not exist, one must be defined,
which is not always easy, especially in multidimensional spaces. Moreover, the result of a clustering algorithm (which in many cases can be
arbitrary itself) can be interpreted in different ways.

III. OBJECTIVE
To generate high quality and robust clusterings for high dimensional data using a cluster ensemble framework. Different clustering
algorithms such as k-means and agglomerative clustering are applied, and each clustering result is analysed. The objective is to combine
multiple clusterings into a single consolidated clustering without accessing the features of the multidimensional data, while also obtaining
a stable result from the high dimensional data.

IV. METHODOLOGY
The methodology followed in this work is shown in Fig. 1. In order to have a quality dataset, data pre-processing is carried out. In our
analysis, we filled the missing values with the attribute mean and found extreme outliers by box-plot analysis. Outliers in each parameter
were removed and replaced with their medians.
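
As an illustration of this pre-processing step, the following Python sketch performs mean imputation followed by median replacement of extreme box-plot outliers. The 3 x IQR fence, the pandas API choices, and the file name in the usage comment are assumptions for illustration, not details taken from the original work.

```python
import numpy as np
import pandas as pd

def preprocess(df: pd.DataFrame) -> pd.DataFrame:
    """Mean-impute missing values, then replace extreme box-plot
    outliers with the column median."""
    df = df.copy()
    for col in df.select_dtypes(include=np.number).columns:
        # Fill missing values with the attribute mean, as in the paper.
        df[col] = df[col].fillna(df[col].mean())
        # Box-plot fences: extreme outliers lie beyond 3 * IQR (assumed).
        q1, q3 = df[col].quantile([0.25, 0.75])
        iqr = q3 - q1
        lower, upper = q1 - 3 * iqr, q3 + 3 * iqr
        mask = (df[col] < lower) | (df[col] > upper)
        # Replace the detected outliers with the column median.
        df.loc[mask, col] = df[col].median()
    return df

# Hypothetical usage on the water quality table:
# clean = preprocess(pd.read_csv("water_quality_2012.csv"))
```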


After data pre-processing, different data mining techniques were applied to the data, including correlation analysis, scatter plots for the data
distribution, regression models using Curve Estimation techniques, and clustering in order to find the quality index. For clustering, hierarchical
techniques using different intervals were applied. For measuring prediction accuracy, a loss function, the Relative Absolute Error (RAE), has been
used. Predictor accuracy measures how far the predicted value is from the actual known value, and the loss function measures the error between the
actual and predicted values. The RAE is 0 for a perfect predictor and 1 for the naive mean predictor; the lower the error value, the higher the
prediction accuracy:

RAE = ( Σ_{i=1}^{d} |y_i - ŷ_i| ) / ( Σ_{i=1}^{d} |y_i - ȳ| )

where y_i represents the actual value, ŷ_i is the predicted value, ȳ is the mean of the actual values and d is the total number of values.
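
A minimal Python implementation of this loss function, matching the formula above (the example values are hypothetical):

```python
import numpy as np

def relative_absolute_error(y_true, y_pred):
    """RAE = sum |y_i - yhat_i| / sum |y_i - ybar|: 0 is a perfect fit,
    1 matches the naive predictor that always outputs the mean."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.abs(y_true - y_pred).sum() / np.abs(y_true - y_true.mean()).sum()

# A predictor close to the actual values yields a small RAE.
print(relative_absolute_error([7.1, 6.8, 7.4], [7.0, 6.9, 7.3]))
```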

V. CLUSTER ENSEMBLE
Clustering is the process of organizing objects into groups whose members are similar in some way: a cluster is a collection of objects
that are similar to each other and dissimilar to the objects belonging to other clusters. A cluster ensemble generates a set of clusterings
from the same dataset and combines them into a final clustering to obtain a better result. The goal of this combination process is to improve
the quality of the individual data clusterings.

The base clusterings can be produced by k-means or any other algorithm. When all the clusters are grouped together, we can gain a sense
of the data. The multiple clusterings have to be combined into a single consolidated clustering without accessing the features of the
multi-dimensional data, and the stability of the result should be obtained from the high dimensional data; hence the proposed consolidated
clustering system is the cumulative supervised clustering ensemble approach for high dimensional data clustering.

Fig. 1: Cluster ensemble framework
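
The paper does not include code, so the sketch below shows one plausible way to build such an ensemble and its cumulative matrix Z: each row of Z records, for one base cluster from one k-means run, which of the N objects it contains. The choice of k-means as the base clusterer, the number of runs, and k are all assumptions for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans

def build_cumulative_matrix(X, n_runs=10, k=4, seed=0):
    """Stack one row per cluster per run: row (run, j) marks which of
    the N objects fell in cluster j of that run. Z is (n_runs*k, N)."""
    rng = np.random.RandomState(seed)
    rows = []
    for _ in range(n_runs):
        labels = KMeans(n_clusters=k, n_init=10,
                        random_state=rng.randint(1 << 30)).fit_predict(X)
        for j in range(k):
            rows.append((labels == j).astype(float))
    return np.vstack(rows)

X = np.random.rand(200, 27)     # stand-in for 27 water quality attributes
Z = build_cumulative_matrix(X)  # the ensemble of base clusters
```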

Motivated by what is believed to be a reasonable discriminating strategy based on the average of a chosen distance measure between clusters,
the proposed algorithm uses group-average hierarchical clustering. The combining algorithm starts by computing the distances between the
rows of Z (i.e. the cumulative clusters). This is a total of c(c-1)/2 distances, and one minus the binary Jaccard measure is used to compute
them. Group-average hierarchical clustering is then used to cluster the clusters, hence the name meta-clustering. In this algorithm, the
distance between a pair of clusters d(C1, C2) is defined as the average distance between the objects in each cluster, where the objects in this
case are the cumulative clusters. It is computed as d(C1, C2) = mean_{(z1,z2) ∈ C1 × C2} d(z1, z2), where d(z1, z2) = 1 - J(z1, z2). The
dendrogram is cut to generate k meta-clusters {M_j}_{j=1}^{k} representing a partitioning of the cumulative clusters {z_i}_{i=1}^{c}. The merged
clusters are averaged in a k × N matrix M = {m_ji} for j ∈ {1, ..., k} and i ∈ {1, ..., N}. So far, only the binary version of the cumulative
matrix has been used for distance computations. Now, in determining the final clustering, the frequency values accumulated in Z are averaged in
the meta-cluster matrix M and used to compute the cluster assignment probabilities; each object is then assigned to its most likely meta-cluster.
Let M be a random variable indexing the meta-clusters and taking values in {1, ..., k}, let X be a random variable indexing the patterns and
taking values in {1, ..., N}, and let p(M = j | X = i) be the conditional probability of each of the k meta-clusters given an object i, which
we also write as p(M_j | x_i). Here, x_i denotes the object index of the pattern x(i), and M_j denotes the meta-cluster represented by row j
of M. The probability estimates are computed as p(M_j | x_i) = m_ji / Σ_{l=1}^{k} m_li.
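
The meta-clustering step just described can be sketched with SciPy as follows. This is a minimal reading of the text (average linkage over the 1 - Jaccard distances between rows of Z, a cut into k meta-clusters, averaging into M, and assignment by the largest p(M_j | x_i)), not the authors' exact implementation; it also assumes the dendrogram cut yields k non-empty meta-clusters.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist

def meta_cluster(Z, k):
    """Cluster the cumulative clusters (rows of Z) with group-average
    linkage under 1 - Jaccard, average each meta-cluster into M, and
    assign every object to its most probable meta-cluster."""
    # c*(c-1)/2 pairwise distances between the binary cluster rows.
    d = pdist(Z > 0, metric="jaccard")
    tree = linkage(d, method="average")               # group-average linkage
    meta = fcluster(tree, t=k, criterion="maxclust")  # cut dendrogram into k
    # Average the merged frequency rows into the k x N matrix M.
    M = np.vstack([Z[meta == j].mean(axis=0) for j in range(1, k + 1)])
    # p(M_j | x_i) = m_ji / sum_l m_li; assign each object to the argmax.
    P = M / M.sum(axis=0, keepdims=True)
    return P.argmax(axis=0)

# labels = meta_cluster(Z, k=4)   # Z as built in the previous sketch
```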

VI. CUMULATIVE MERGER


After clustering is applied to each disjoint subset, a set of centroids is available which describes each partitioned subset. Clustering or
partitioning a subset S_i produces a set of centroids, the base clustering solution {C_{i,j}}_{j=1}^{k}, where k is the number of clusters. When
r subsets are chosen there will be r sets of centroids, i.e. {C_{1,j}}_{j=1}^{k}, {C_{2,j}}_{j=1}^{k}, ..., {C_{r,j}}_{j=1}^{k}, forming an
ensemble of centroids. To produce the final partition, we need to reach a consensus on the position of the centroids in the target clustering.
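
To make this setup concrete, here is a hedged sketch of producing the ensemble of centroids from r disjoint subsets. K-means is assumed as the base clusterer, and subset sizes are kept as the centroid weights used by the merger later in this section.

```python
import numpy as np
from sklearn.cluster import KMeans

def centroid_ensemble(X, r=5, k=4, seed=0):
    """Split X into r disjoint subsets, cluster each with k-means, and
    return the r sets of k centroids plus each subset's size (weight)."""
    rng = np.random.RandomState(seed)
    idx = rng.permutation(len(X))
    ensembles, weights = [], []
    for part in np.array_split(idx, r):
        km = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(X[part])
        ensembles.append(km.cluster_centers_)  # shape (k, n_features)
        weights.append(len(part))              # subset size -> centroid weight
    return ensembles, weights
```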

One way to reach a global consensus is to partition the ensemble of centroids into k consensus chains [8], where each consensus chain contains
r centroids {c_{l1}, ..., c_{lr}}, one from each of the subsets, and l runs from 1 to k. The aim is to group similar centroids in each consensus
chain by solving the cluster correspondence problem. The objective is to globally optimize the assignment Γ* out of the set F of all possible
families of centroid assignments to k consensus chains:

Γ* = argmin_{Γ ∈ F} cost(Γ)    (1)

cost(Γ) = Σ_{l=1}^{k} cost(cons_chain_l)    (2)

cost(cons_chain_l) = Σ_{i=1}^{r} Σ_{j=i+1}^{r} d(c_{li}, c_{lj})    (3)

where d(., .) is the distance function between centroid vectors in a consensus chain. The Euclidean distance is used in computing the cost
in (3), as the underlying clustering solutions were also obtained with the Euclidean distance metric. In a graph theoretic formulation, finding
the globally optimal value of the objective function (1) reduces to the minimum-weight perfect r-partite matching problem, which is intractable
for r > 2. Because the optimization problem is intractable (NP-hard), a heuristic method is used to group the centroids.

We know that for two partitions (r = 2) there is a polynomial time algorithm, minimum-weight perfect bipartite matching, to globally
optimize the above objective function [4]. After matching a pair of partitions, we keep the centroids of one of the pair as the reference, and
a new partition is randomly chosen and matched to it by minimum-weight bipartite matching. The centroids of this new matched partition are
placed in the same consensus chain to which the centroids of the reference partition belong. In this way we continue grouping the centroids of
new partitions into consensus chains one by one until they are exhausted [9]. After the consensus chains are created, we simply compute the
weighted arithmetic mean of the centroids in a consensus chain to represent a global centroid, where the weight of a centroid is determined by
the size of the subset from which it was created. In some cases, especially in the knowledge reuse framework, the weights (importance) of the
base clustering solutions may be difficult to obtain; in those cases, all the base clustering solutions may be considered to have the same
weight. Each consensus chain tells us which centroid from which subset is matched to a centroid in another subset. A final partition, in the
form of label vectors, may be obtained by assigning each example to the nearest global centroid. It should be noted that the bipartite merger
algorithm partitions the ensemble of centroids into k perfectly balanced chains, that is, each chain has the same number of centroids (one from
each base clustering solution). More about consensus chains can be found in our earlier work.
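
SciPy's Hungarian solver, linear_sum_assignment, solves exactly this minimum-weight bipartite matching, so the greedy chain-building heuristic can be sketched as follows. The fixed-reference strategy mirrors the description above, while details such as carrying subset sizes as weights come from the assumed centroid_ensemble sketch earlier.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from scipy.spatial.distance import cdist

def bipartite_merger(ensembles, weights):
    """Keep the first partition's centroids as reference chains, then
    Hungarian-match every other partition to them; return the k
    weighted-mean global centroids."""
    chains = [[c] for c in ensembles[0]]   # one chain per reference centroid
    chain_w = [[weights[0]] for _ in chains]
    for cents, wt in zip(ensembles[1:], weights[1:]):
        refs = np.array([ch[0] for ch in chains])
        cost = cdist(refs, cents)                 # Euclidean by default
        rows, cols = linear_sum_assignment(cost)  # min-weight bipartite matching
        for i, j in zip(rows, cols):
            chains[i].append(cents[j])
            chain_w[i].append(wt)
    # The weighted arithmetic mean of each chain is a global centroid.
    return np.array([np.average(ch, axis=0, weights=w)
                     for ch, w in zip(chains, chain_w)])

# globals_ = bipartite_merger(*centroid_ensemble(X))  # sketches above
# final_labels = cdist(X, globals_).argmin(axis=1)    # nearest global centroid
```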

VII. DATA SET USED


While this clustering approach is designed for data sets that are too large to fit in main memory, for comparison purposes we show
results on tractably sized data sets for which we can compare against clustering all of the data. Some data sets were chosen to allow
comparison with published results. The cluster ensemble approach is applied to the water quality data set. Many clustering algorithms
have been applied to water quality data to distinguish the range of water quality across various parts of India. The ensemble clustering
approach is proposed to cluster the sample data. In this work data is selected randomly, a clustering algorithm such as hierarchical
clustering is applied, and the similarities and dissimilarities between the clusters are found. From the result, the water quality is
identified and the water is assigned to different purposes depending on its quality range.

The water quality data contains 27 attributes representing various chemical measurements used to check the quality of the water. The random
subspace method is used to select instances from the data set, and the hierarchical clustering algorithm is applied to the randomly selected
instances. Here, 100 instances are randomly selected; 84 instances are classified correctly and 16 incorrectly.
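
For illustration, selecting 100 random instances and applying average-linkage hierarchical clustering might look as follows in scikit-learn; the sample size follows the text, while the data matrix, the number of clusters, and the seed are assumptions.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

X = np.random.rand(500, 27)  # stand-in for the 27-attribute water quality data
rng = np.random.RandomState(42)
sample = X[rng.choice(len(X), size=100, replace=False)]  # 100 random instances
labels = AgglomerativeClustering(n_clusters=2,
                                 linkage="average").fit_predict(sample)
```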

Fig. 2: Threshold curve

From the water quality data we can predict that modelling techniques will help decision making for the improvement of water quality. In this
regard, initial data trends, parametric satisfactory analysis, regression models, the quality index, and finally the source of contamination as
well as the possible reasons for high contamination are found. For finding the quality index of water, the Average Linkage (Within Groups)
method of hierarchical clustering using Euclidean distance performed better than the other techniques. Similarly, for classification, the MLP
produces more accurate results compared to its counterparts.

VIII. RESULT ANALYSIS

Fig. 3: Performance measures by class (F-measure, ROC area, root mean squared error) for REPTree and Decision Stump


A decision stump makes a prediction based on the value of just a single input feature; decision stumps are sometimes also called 1-rules. They
are often used as components (called "weak learners" or "base learners") in machine learning ensemble techniques such as bagging and boosting.
For example, the state-of-the-art Viola-Jones face detection algorithm employs AdaBoost with decision stumps as weak learners.
One of the questions that arises in a decision tree algorithm is the optimal size of the final tree: a tree that is too large risks overfitting
the training data and generalizing poorly to new samples, while a tree that is too small might not capture important structural information
about the sample space. In the REPTree algorithm it is impossible to tell whether the addition of a single extra node will dramatically decrease
the error; this problem is known as the horizon effect. Pruning should reduce the size of a learning tree without reducing predictive accuracy
as measured by a cross-validation set, and there are many pruning techniques that differ in the measurement used to optimize performance.
REPTree gives higher accuracy: its F-measure is 0.875, which is greater than the decision stump's F-measure of 0.754.
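
Weka's REPTree and Decision Stump have no direct scikit-learn ports, but the comparison can be imitated with a depth-1 tree (a stump) against a cost-complexity-pruned tree. Everything in this sketch (the synthetic data and the pruning parameter) is an assumption for illustration; it will not reproduce the 0.875 and 0.754 figures reported above.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data with 27 features, mirroring the dataset's width.
X, y = make_classification(n_samples=500, n_features=27, random_state=0)
Xtr, Xte, ytr, yte = train_test_split(X, y, random_state=0)

stump = DecisionTreeClassifier(max_depth=1).fit(Xtr, ytr)      # one split: a 1-rule
pruned = DecisionTreeClassifier(ccp_alpha=0.01).fit(Xtr, ytr)  # pruned full tree

print("stump F-measure :", f1_score(yte, stump.predict(Xte)))
print("pruned F-measure:", f1_score(yte, pruned.predict(Xte)))
```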

IX. CONCLUSION
In this paper we proposed methods for merging an ensemble of clustering solutions in a framework that is scalable in terms of time and
space complexity. We evaluated our algorithms under both balanced and unbalanced distributions. Under a balanced distribution, the
centroid-based cumulative merger algorithm is used to produce the final partition, for which we need to reach a consensus on the position of
the centroids in the target clustering; one way to reach a global consensus is to partition the ensemble of centroids into k consensus chains.

REFERENCES
1. Strehl A, Ghosh J. Cluster ensembles - a knowledge reuse framework for combining multiple partitions. Journal of Machine Learning
Research. 2002;3:583-617.
2. Long B, Zhang Z (Mark), Yu PS. Combining multiple clusterings by soft correspondence. ICDM. 2005:282-289.
3. Fataei E, Shiralipoor S. Evaluation of surface water quality using cluster analysis. World Journal of Fish and Marine Sciences.
2011;3(5).
4. Kuhn HW. The Hungarian method for the assignment problem. Naval Research Logistics Quarterly. 1955;2:83-97.
5. Fischer B, Buhmann JM. Path-based clustering for grouping of smooth curves and texture segmentation. IEEE Transactions on Pattern
Analysis and Machine Intelligence. 2003;25(4):513-518.
6. Dudoit S, Fridlyand J. Bagging to improve the accuracy of a clustering procedure. Bioinformatics. 2003;19(9):1090-1099.
7. Kriegel H, Kroger P, Pryakhin A, Schubert M. Effective and efficient distributed model-based clustering. ICDM. 2005:258-265.
8. Yu Z, Chen H, You J, Han G, Li L. Hybrid fuzzy cluster ensemble framework for tumor clustering from biomolecular data. IEEE/ACM
Transactions on Computational Biology and Bioinformatics. 2013;10(3):657-670.
9. Iam-On N, Boongoen T, Garrett S, Price C. A link-based approach to the cluster ensemble problem. IEEE Transactions on Pattern
Analysis and Machine Intelligence. 2011;33(12):2396-2409.
10. Lu Z, Peng Y. Exhaustive and efficient constraint propagation: a graph-based learning approach and its applications. International
Journal of Computer Vision. 2013;103(3):306-325.
