Chapter 0

EVALUATING OPTIMAL CLUSTERING
TECHNIQUES FOR EFFICIENT
STORAGE RETRIEVAL METHODS IN LARGE
DATABASE USING SOFT COMPUTING
TECHNIQUES
THESIS SUBMITTED TO
BHARATI VIDYAPEETH UNIVERSITY, PUNE
FOR AWARD OF DEGREE OF
DOCTOR OF PHILOSOPHY IN COMPUTER APPLICATION
UNDER THE FACULTY OF MANAGEMENT STUDIES
SUBMITTED BY
Miss Swati Sah
UNDER THE GUIDANCE OF

Dr. Ashutosh Gaur
RESEARCH CENTRE
Bharati Vidyapeeth Deemed University
Institute of Management
Kolhapur, Maharashtra, India
MARCH 2016
CERTIFICATE
This is to certify that the work incorporated in the thesis entitled
EVALUATING OPTIMAL CLUSTERING TECHNIQUES FOR EFFICIENT
STORAGE RETRIEVAL METHODS IN LARGE DATABASE USING SOFT
COMPUTING TECHNIQUES for the degree of Doctor of Philosophy in the
subject of Computer Application under the faculty of Management Studies has been
carried out by Miss Swati Sah in the Department of Computers at Bharati Vidyapeeth
Deemed University Institute of Management, Kolhapur, Maharashtra, India during the
period from November 2012 to February 2016 under the guidance of Dr. Ashutosh
Gaur.
Dr. Nitin D. Nayak

Professor and Director
Bharati Vidyapeeth University,
Institute of Management,
Kolhapur, Maharashtra, India
Place:
Date:
Seal
CERTIFICATION OF GUIDE
This is to certify that the work incorporated in the thesis entitled
EVALUATING OPTIMAL CLUSTERING TECHNIQUES FOR EFFICIENT
STORAGE RETRIEVAL METHODS IN LARGE DATABASE USING SOFT
COMPUTING TECHNIQUES submitted by Miss Swati Sah for the degree of
Doctor of Philosophy in the subject of Computer Application under the faculty of
Management Studies has been carried out in the Department of Computers at Bharati
Vidyapeeth Deemed University Institute of Management, Kolhapur, Maharashtra,
India during the period from November 2012 to February 2016 under my direct
supervision and guidance.
Dr. Ashutosh Gaur
Associate Professor
Bharati Vidyapeeth University, Pune
Place :
Date :
DECLARATION BY THE CANDIDATE
I hereby declare that the thesis entitled EVALUATING OPTIMAL
CLUSTERING TECHNIQUES FOR EFFICIENT STORAGE RETRIEVAL
METHODS IN LARGE DATABASE USING SOFT COMPUTING TECHNIQUES
submitted by me to the Bharati Vidyapeeth University, Pune for the degree of Doctor
of Philosophy (Ph.D.) in Computer Application under the Faculty of Management
Studies, is an original piece of work carried out by me under the supervision of Dr.
Ashutosh Gaur.
I further declare that it has not been submitted to this or any other university
or Institution for the award of any degree or Diploma.
I also confirm that all the materials, which I have borrowed from other
sources and incorporated in my thesis, are duly acknowledged. If any material is not
duly acknowledged and found incorporated in this thesis, it is entirely my
responsibility. I am fully aware of the implications of any such act which might have
been committed by me advertently or inadvertently.
Miss Swati Sah

Research Scholar
Place :
Date :
Acknowledgement
First and foremost, my deepest gratitude goes to God for giving me the grace,
the strength and the wisdom to undertake and complete this research work even when
all hopes were lost in troubled times.
Accomplishment of this doctoral thesis was possible with the support of
several people. I would like to express my sincere gratitude to all of them. First of all,
I am extremely grateful to my research guide, Dr. Ashutosh Gaur, Associate Professor,
Bharati Vidyapeeth Deemed University, for his valuable guidance, scholarly
inputs and consistent encouragement throughout the research work. This
achievement was possible only because of the unconditional support provided by him.
A person with a friendly and positive disposition, my supervisor has always made
himself available to clarify my doubts despite his busy schedule, and I consider it a
great opportunity to do my doctoral programme under his guidance and to learn from
his research expertise. Thank you very much for all your help and support.
I thank Dr. Nitin Nayak, Director, BVIM Kolhapur, for the academic
support and the facilities provided to carry out the research work at the institute. I
enrolled in the Ph.D. programme during his tenure; he has been very encouraging and
supportive, and I express my gratitude to him.
I would like to thank the faculty members of the Institute, who have been
kind enough to extend their help at various phases of this research whenever I
approached them, and I hereby acknowledge all of them.
I owe a lot to my parents, who encouraged and helped me at every stage of my
personal and academic life, and longed to see this achievement come true. I deeply
miss my mother, the late Mrs Saroj Sah, and my father, the late Mr Madan Lal Sah, who are
not with me to share this joy.
I am thankful to my family for their constant support and for all their
encouragement. I am very much indebted to my brother, Mr Manish Sah, and
my Bhabhi, Mrs Nupur Sah, who have been a great support. I owe them so much for
their care. I would also like to thank my sister, Mrs Shalini Sah, and my brother-in-law,
Mr Manu Sah.
I would also like to thank my sweet niece Myra for not disturbing me while I
worked, and my cousin Ritika Sah for helping me.
I further acknowledge Mr. Ganesh Dixit for his continuous help and support in the
preparation of this thesis. I would also like to thank Mrs Anuja Sharma and Mr. T. P.
Sharma for their kind help and support throughout this journey of research.
Above all, I owe it all to Almighty God for granting me the wisdom, health
and strength to undertake this research task and enabling me to complete it.
List of figures
S. No. Title of Figure Page No.

Figure 1.1 Knowledge Discovery in Database 13
Figure 2.1 Steps of Cluster Analysis 21
Figure 2.2 Hierarchical Clustering flow chart 26
Figure 2.3 Dendrogram representing hierarchical clustering 27
Figure 2.4 Partitional Clustering 28
Figure 2.5 SOM neighbour weight distance 37
Figure 2.6 Different types of learning problems, i.e. Supervised and Unsupervised 43
Figure 2.7 Semi-supervised learning. (a) Input image with must-link (solid blue lines) and must-not-link (broken red lines) constraints. (b) Clustering (segmentation) without constraints. (c) Improved clustering with 10% of the data points included in the pair-wise constraints. 45
Figure 2.8 Hard Computing vs. Soft Computing 50
Figure 4.1 Histogram_sepal_Length 81
Figure 4.2 Histogram_sepal_Width 81
Figure 4.3 Histogram_Petal_Length 82
Figure 4.4 Histogram_Petal_Width 82
Figure 4.5 Densityplot_Sepal_Length 82
Figure 4.6 Densityplot_Sepal_Width 83
Figure 4.7 Densityplot_Petal_Length 83
Figure 4.8 Densityplot_Petal_Width 83
Figure 4.9 Frequency Distribution 84
Figure 4.10 Barplot_Species 84
Figure 4.11 Boxplot_Sepal.Length 85
Figure 4.12 Boxplot_Sepal.Width 86
Figure 4.13 Boxplot_Petal.Length 86
Figure 4.14 Boxplot_Petal.Width 87
Figure 4.15 Scatterplot_Sepal 87
Figure 4.16 Scatterplot_Petal 88
Figure 4.17 Scatterplot_Petal.Length 88
Figure 4.18 Scatterplot_Petal.Width 88
Figure 4.19 Jitter_Scatterplot_Sepal 89
Figure 4.20 Jitter_Scatterplot_Petal 89
Figure 4.21 3D_Scatterplot_Sepal 90
Figure 4.22 3D_Scatterplot_Petal 90
Figure 4.23 Matrix_Scatterplot 91
Figure 4.24 Levelplot 92
Figure 4.25 Parallel_coordinate_plot 92
Figure 4.26 Parallel_plot 93
Figure 4.27 qp_plot 93
Figure 4.28 Contour Plot 94
Figure 4.29 Self-Organizing Map 100
Figure 4.30 Correlation Analysis with pair plot 104
Figure 4.31 Correlation Analysis 104
Figure 4.32 Correlation matrix plot 105
Figure 4.33 K-Means Clustering with 3 clusters 106
Figure 4.34 a Clusters and their center point 106
Figure 4.34 b K-means clustering (K=3) 107
Figure 4.35 K-means clustering with 4 clusters 107
Figure 4.36 Clusters and their center point 108
Figure 4.37 Distribution of data in 3 clusters using K-medoid 110
Figure 4.38 Comparison of time complexity of K-medoid and K-means 110
Figure 4.39 Comparison of time complexity of K-medoid and K-means 112
Figure 4.40 Comparison of space complexity of K-medoid and K-means 112
Figure 4.41 Dendrogram of newiris dataset 112
Figure 4.42 Clusters of newiris dataset using agglomerative algorithm 113
Figure 4.43 Dendrogram showing four Clusters of newiris dataset using agglomerative algorithm 113
Figure 4.44 a 3X5 vector code books of newiris data 114
Figure 4.44 b Energy evolution of SOM clustering 114
Figure 4.45 SOM Grid 116
Figure 4.46 Overview of SOM clusters 116
Figure 4.47 Overview of Fuzzy c-means clustering 118
Figure 4.48 Center points of Fuzzy c-means clustering 118
Figure 4.49 Time complexity of fuzzy c-means and k-means on number of clusters 119
Figure 4.50 Time complexity of fuzzy c-means and k-means on number of iterations 120
List of Tables
S. No. Title of Table Page No.
Table 1.1 Evolution of internet from 1995 to 2015 3
Table 2.1 Comparison between k-means and k-medoids algorithms 32
Table 2.2 Difference between Descriptive & Predictive data mining 41
Table 2.3 Example applications of large-scale data clustering 45
Table 4.1 Summary of the dataset 80
Table 4.2 Total no of species in data set 84
Table 4.3 Summary of dataset 103
Table 4.4 Mean of 3 clusters by k-means method 105
Table 4.5 Distribution of species in 3 clusters of k-means method 107
Table 4.6 Mean of 4 clusters by k-means method 108
Table 4.7 Distribution of species in 3 clusters of k-means method 108
Table 4.8 Distribution of data in 3 clusters using K-medoid 109
Table 4.9 Comparison of time complexity of K-medoid and K-means 110
Table 4.10 Comparison of time complexity of K-medoid and K-means considering no. of iterations 111
Table 4.11 Comparison of space complexity of K-medoid 111
Table 4.12 Comparative results of the hierarchical and K-means clustering 113
Table 4.13 Comparison of K-means, Hierarchical Clustering and SOM 117
Table 4.14 Membership functions 118
Table 4.15 Distribution of different species among all 3 clusters 119
Table 4.16 Comparison of time complexity of fuzzy c-means and k-means on no. of cluster 119
Table 4.17 Comparison of time complexity of fuzzy c-means and k-means on no. of iterations 120
Table of Contents Page No.
Certificate i
Certification from Guide ii
Declaration iii
Acknowledgement iv
List of Figures vi
List of Tables viii
Chapter 1 Introduction 1-19
1.1 Introduction 1
1.2 Clustering methods 7
1.2.1 Hierarchical Clustering 7
1.2.1.1 Agglomerative Hierarchical Clustering 8
1.2.1.2 Divisive Hierarchical Clustering 8
1.2.2 Partitional Clustering 8
1.3 Soft Computing 8
1.4 Data Analysis 10
1.4.1 Types of Data 11
1.4.2 Types of Features 11
1.4.3 Types of Analysis 12
1.5 Data mining models 14
1.6 Background of the study 15
1.7 Motivation 15
1.8 Objectives 16
1.9 Organization of the thesis 16
References 18
Chapter 2 Review of Literature 20-66
2.1 Introduction 20
2.2 General Types of Clusters 23
2.3 Applications of Clustering 24
2.4 Purpose of Clustering 24
2.5 Types of Clustering methods 25
2.5.1 Hierarchical Clustering 26
2.5.1.1 Agglomerative Hierarchical Clustering 27
2.5.1.2 Divisive Hierarchical Clustering 27
2.5.2 Partitional Clustering 28
2.5.2.1 K means algorithm 29
2.5.2.2 K medoid algorithm 31
2.6 Other Clustering Methods 32
2.6.1 BIRCH 32
2.6.2 Partitioning around Medoids 32
2.6.3 CLARA 32
2.6.4 DBSCAN 33
2.6.5 Fuzzy C Means 34
2.6.6 Self Organizing Maps 37
2.6.7 Particle Swarm Optimization 38
2.7 Data Mining 39
2.7.1 Exploratory Data Analysis 40
2.7.1.1 Descriptive modeling 40
2.7.1.2 Predictive modeling 41
2.8 Types of data mining 41
2.9 Large Scale clustering 45

2.9.1 Efficient Nearest Neighbour Search 46
2.9.2 Data Summarization 46
2.9.3 Distributed Computing 46
2.9.4 Incremental Clustering 46
2.9.5 Sampling based methods 47
2.9.6 Multiway Clustering 47
2.10 Heterogeneous Data 48
2.11 Soft Computing 49
2.11.1 Soft Computing Paradigm 50
2.11.2 Genetic Algorithm 51
2.11.3 Evolutionary Computing 53
References 55
Chapter 3 Aims and Objectives 67-74
3.1 Introduction 67
3.2 Research objectives 67
References 73
Chapter 4 Observations and Results 75-123
4.1 Introduction 75
4.2 Introduction to Data set and R Language 75
4.2.1 Data set 75
4.2.2 R Language 76
4.3 Exploratory Data Analysis 77
4.4 Clustering Algorithms 94
4.4.1 Major Partitioning clustering methods: K-Means and K-Medoids 94
4.5 Experiments and results 101
4.5.1 Data Understanding 102
4.5.2 Correlation Analysis 103
4.5.3 Implementation of K-means algorithm for clustering 105
4.5.4 Implementation of K-medoid algorithm for clustering 109
4.5.5 Implementation of Hierarchical algorithm (agglomerative) for clustering 112
4.5.6 Implementation of Self Organizing Map 114
4.5.7 Implementation of Fuzzy C-means Clustering 117
4.5.8 Particle Swarm Optimization 121
4.5.8.1 PSO Based K-Means And Fuzzy C-Means Clustering 121
Chapter 5 Findings and Synthesis 124-134
5.1 Introduction 124
5.2 Objectives of the study 125
5.3 Observations and Result 126
5.3.1 Exploratory Data Analysis 127
5.3.2 Experiments and results 130
5.4 Conclusion 132
