
EVALUATING OPTIMAL CLUSTERING TECHNIQUES FOR EFFICIENT
STORAGE RETRIEVAL METHODS IN LARGE DATABASE USING SOFT
COMPUTING TECHNIQUES

THESIS SUBMITTED TO
BHARATI VIDYAPEETH UNIVERSITY, PUNE
FOR AWARD OF DEGREE OF
DOCTOR OF PHILOSOPHY IN COMPUTER APPLICATION
UNDER THE FACULTY OF MANAGEMENT STUDIES

SUBMITTED BY
Miss Swati Sah

UNDER THE GUIDANCE OF


Dr. Ashutosh Gaur

RESEARCH CENTRE
Bharati Vidyapeeth Deemed University
Institute of Management
Kolhapur, Maharashtra, India

MARCH 2016
CERTIFICATE

This is to certify that the work incorporated in the thesis entitled "EVALUATING OPTIMAL CLUSTERING TECHNIQUES FOR EFFICIENT STORAGE RETRIEVAL METHODS IN LARGE DATABASE USING SOFT COMPUTING TECHNIQUES" for the degree of Doctor of Philosophy in the subject of Computer Application under the faculty of Management Studies has been carried out by Miss Swati Sah in the Department of Computers at Bharati Vidyapeeth Deemed University Institute of Management, Kolhapur, Maharashtra, India during the period from November 2012 to February 2016 under the guidance of Dr. Ashutosh Gaur.

Dr. Nitin D. Nayak


Professor and Director
Bharati Vidyapeeth University,
Institute of Management,
Kolhapur, Maharashtra, India
Place:
Date:

Seal

CERTIFICATION OF GUIDE

This is to certify that the work incorporated in the thesis entitled "EVALUATING OPTIMAL CLUSTERING TECHNIQUES FOR EFFICIENT STORAGE RETRIEVAL METHODS IN LARGE DATABASE USING SOFT COMPUTING TECHNIQUES" submitted by Miss Swati Sah for the degree of Doctor of Philosophy in the subject of Computer Application under the faculty of Management Studies has been carried out in the Department of Computers at Bharati Vidyapeeth Deemed University Institute of Management, Kolhapur, Maharashtra, India during the period from November 2012 to February 2016 under my direct supervision and guidance.

Dr. Ashutosh Gaur

Associate Professor
Bharati Vidyapeeth University, Pune

Place :

Date :

DECLARATION BY THE CANDIDATE

I hereby declare that the thesis entitled "EVALUATING OPTIMAL CLUSTERING TECHNIQUES FOR EFFICIENT STORAGE RETRIEVAL METHODS IN LARGE DATABASE USING SOFT COMPUTING TECHNIQUES" submitted by me to the Bharati Vidyapeeth University, Pune for the degree of Doctor of Philosophy (Ph.D.) in Computer Application under the Faculty of Management Studies, is an original piece of work carried out by me under the supervision of Dr. Ashutosh Gaur.

I further declare that it has not been submitted to this or any other university or institution for the award of any degree or diploma.

I also confirm that all the materials which I have borrowed from other sources and incorporated in my thesis are duly acknowledged. If any material is not duly acknowledged and found incorporated in this thesis, it is entirely my responsibility. I am fully aware of the implications of any such act which might have been committed by me advertently or inadvertently.

Miss Swati Sah


Research Scholar
Place :
Date :

Acknowledgement

First and foremost, my deepest gratitude goes to God for giving me the grace, the strength and the wisdom to undertake and complete this research work even when all hopes were lost in troubled times.

Accomplishment of this doctoral thesis was possible with the support of several people. I would like to express my sincere gratitude to all of them. First of all, I am extremely grateful to my research guide, Dr. Ashutosh Gaur, Associate Professor, Bharati Vidyapeeth Deemed University, for his valuable guidance, scholarly inputs and consistent encouragement throughout the research work. This achievement was possible only because of the unconditional support provided by him. A person with a friendly and positive disposition, my supervisor has always made himself available to clarify my doubts despite his busy schedule, and I consider it a great opportunity to have done my doctoral programme under his guidance and to learn from his research expertise. Thank you very much for all your help and support.

I thank Dr. Nitin Nayak, Director, BVIM Kolhapur, for the academic support and the facilities provided to carry out the research work at the institute. I enrolled in the Ph.D. program during his tenure; he has been very encouraging and supportive, and I express my gratitude to him.

I would like to thank the faculty members of the Institute, who have been kind enough to extend their help at various phases of this research whenever I approached them, and I do hereby acknowledge all of them.

I owe a lot to my parents, who encouraged and helped me at every stage of my personal and academic life, and longed to see this achievement come true. I deeply miss my mother, Late Mrs Saroj Sah, and my father, Late Mr Madan Lal Sah, who are not with me to share this joy.

I am thankful to my family for their constant support and for all their encouragement. I am very much indebted to my brother, Mr Manish Sah, and my Bhabhi, Mrs Nupur Sah, who have been a great support. I owe them so much for their care. I would also like to thank my sister, Mrs Shalini Sah, and my brother-in-law, Mr Manu Sah.

I would also like to thank my sweet niece Myra for not disturbing me while I worked, and my cousin Ritika Sah for helping me.

I further acknowledge Mr. Ganesh Dixit for his continuous help and support in the preparation of this thesis. I would also like to thank Mrs Anuja Sharma and Mr. T. P. Sharma for their kind help and support throughout this journey of research.

Above all, I owe it all to Almighty God for granting me the wisdom, health and strength to undertake this research task and enabling me to complete it.

List of figures

S. No. Title of Figure Page No.


Figure 1.1 Knowledge Discovery in Database 13
Figure 2.1 Steps of Cluster Analysis 21
Figure 2.2 Hierarchical Clustering flow chart 26
Figure 2.3 Dendrogram representing hierarchical clustering 27
Figure 2.4 Partitional Clustering 28
Figure 2.5 SOM neighbour weight distance 37
Figure 2.6 Different types of learning problems, i.e. Supervised and Unsupervised 43
Figure 2.7 Semi-supervised learning. (a) Input image with must-link (solid blue lines) and must not link (broken red lines) constraints. (b) Clustering (segmentation) without constraints. (c) Improved clustering with 10% of the data points included in the pair-wise constraints. 45
Figure 2.8 Hard Computing vs. Soft Computing 50
Figure 4.1 Histogram_sepal_Length 81
Figure 4.2 Histogram_sepal_Width 81
Figure 4.3 Histogram_Petal_Length 82
Figure 4.4 Histogram_Petal_Width 82
Figure 4.5 Densityplot_Sepal_Length 82
Figure 4.6 Densityplot_Sepal_Width 83
Figure 4.7 Densityplot_Petal_Length 83
Figure 4.8 Densityplot_Petal_Width 83
Figure 4.9 Frequency Distribution 84
Figure 4.10 Barplot_Species 84
Figure 4.11 Boxplot_Sepal.Length 85
Figure 4.12 Boxplot_Sepal.Width 86
Figure 4.13 Boxplot_Petal.Length 86
Figure 4.14 Boxplot_Petal.Width 87
Figure 4.15 Scatterplot_Sepal 87
Figure 4.16 Scatterplot_Petal 88
Figure 4.17 Scatterplot_Petal.Length 88
Figure 4.18 Scatterplot_Petal.Width 88
Figure 4.19 Jitter_Scatterplot_Sepal 89
Figure 4.20 Jitter_Scatterplot_Petal 89
Figure 4.21 3D_Scatterplot_Sepal 90

Figure 4.22 3D_Scatterplot_Petal 90
Figure 4.23 Matrix_Scatterplot 91
Figure 4.24 Levelplot 92
Figure 4.25 Parallel_coordinate_plot 92
Figure 4.26 Parallel_plot 93
Figure 4.27 qp_plot 93
Figure 4.28 Contour Plot 94
Figure 4.29 Self-Organizing Map 100
Figure 4.30 Correlation Analysis with pair plot 104
Figure 4.31 Correlation Analysis 104
Figure 4.32 Correlation matrix plot 105
Figure 4.33 K-Means Clustering with 3 clusters 106
Figure 4.34 a clusters and their center point 106
Figure 4.34 b K-means clustering (K=3) 107
Figure 4.35 K-means clustering with 4 clusters 107
Figure 4.36 clusters and their center point 108
Figure 4.37 Distribution of data in 3 clusters using K medoid 110
Figure 4.38 Comparison of time complexity of K-medoid and K-means 110
Figure 4.39 Comparison of time complexity of K-medoid and K-means 112
Figure 4.40 Comparison of space complexity of K-medoid and K-means 112
Figure 4.41 Dendrogram of newiris dataset 112
Figure 4.42 Clusters of newiris dataset using agglomerative algorithm 113
Figure 4.43 Dendrogram showing four Clusters of newiris dataset using agglomerative algorithm 113
Figure 4.44 a 3X5 vector code books of newiris data 114
Figure 4.44 b Energy evolution of SOM clustering 114
Figure 4.45 SOM Grid 116
Figure 4.46 Overview of SOM clusters 116
Figure 4.47 overview of Fuzzy c-means clustering 118
Figure 4.48 center points of Fuzzy c-means clustering 118
Figure 4.49 Time complexity of fuzzy c-means and k-means on number of clusters 119
Figure 4.50 Time complexity of fuzzy c-means and k-means on number of iterations 120

List of Tables

S. No. Title of Table Page No.

Table 1.1 Evolution of internet from 1995 to 2015 3
Table 2.1 Comparison between k-means and k-medoids algorithms 32
Table 2.2 Difference between Descriptive & Predictive data mining 41
Table 2.3 Example applications of large-scale data clustering 45
Table 4.1 Summary of the dataset 80
Table 4.2 Total no of species in data set 84
Table 4.3 Summary of dataset 103
Table 4.4 Mean of 3 clusters by k-means method 105
Table 4.5 Distribution of species in 3 clusters of k-means method 107
Table 4.6 Mean of 4 clusters by k-means method 108
Table 4.7 Distribution of species in 3 clusters of k-means method 108
Table 4.8 Distribution of data in 3 clusters using K-medoid 109
Table 4.9 Comparison of time complexity of K-medoid and K-means 110
Table 4.10 Comparison of time complexity of K-medoid and K-means considering no. of iterations 111
Table 4.11 Comparison of space complexity of K-medoid 111
Table 4.12 Comparative results of the hierarchical and K-means clustering 113
Table 4.13 Comparison of K-means, Hierarchical Clustering and SOM 117
Table 4.14 Membership functions 118
Table 4.15 Distribution of different species among all 3 clusters 119
Table 4.16 Comparison of time complexity of fuzzy c-means and k-means on no. of cluster 119
Table 4.17 Comparison of time complexity of fuzzy c-means and k-means on no. of iterations 120

Table of Contents Page No.

Certificate i

Certification from Guide ii

Declaration iii

Acknowledgement iv

List of Figures vi

List of Tables viii

Chapter 1 Introduction 1-19

1.1 Introduction 1

1.2 Clustering methods 7

1.2.1 Hierarchical Clustering 7

1.2.1.1 Agglomerative Hierarchical Clustering 8

1.2.1.2 Divisive Hierarchical Clustering 8

1.2.2 Partitional Clustering 8

1.3 Soft Computing 8

1.4 Data Analysis 10

1.4.1 Types of Data 11

1.4.2 Types of Features 11

1.4.3 Types of Analysis 12

1.5 Data mining models 14

1.6 Background of the study 15

1.7 Motivation 15

1.8 Objectives 16

1.9 Organization of the thesis 16

References 18

Chapter 2 Review of Literature 20-66

2.1 Introduction 20

2.2 General Types of Clusters 23

2.3 Applications of Clustering 24

2.4 Purpose of Clustering 24

2.5 Types of Clustering methods 25

2.5.1 Hierarchical Clustering 26

2.5.1.1 Agglomerative Hierarchical Clustering 27

2.5.1.2 Divisive Hierarchical Clustering 27

2.5.2 Partitional Clustering 28

2.5.2.1 K means algorithm 29

2.5.2.2 K medoid algorithm 31

2.6 Other Clustering Methods 32

2.6.1 BIRCH 32

2.6.2 Partitioning around Medoids 32

2.6.3 CLARA 32

2.6.4 DBSCAN 33

2.6.5 Fuzzy C Means 34

2.6.6 Self Organizing Maps 37

2.6.7 Particle Swarm Optimization 38

2.7 Data Mining 39

2.7.1 Exploratory Data Analysis 40

2.7.1.1 Descriptive modeling 40

2.7.1.2 Predictive modeling 41

2.8 Types of data mining 41

2.9 Large Scale clustering 45


2.9.1 Efficient Nearest Neighbour Search 46

2.9.2 Data Summarization 46

2.9.3 Distributed Computing 46

2.9.4 Incremental Clustering 46

2.9.5 Sampling based methods 47

2.9.6 Multiway Clustering 47

2.10 Heterogeneous Data 48

2.11 Soft Computing 49

2.11.1 Soft Computing Paradigm 50

2.11.2 Genetic Algorithm 51

2.11.3 Evolutionary Computing 53

References 55

Chapter 3 Aims and Objectives 67-74

3.1 Introduction 67

3.2 Research objectives 67

References 73

Chapter 4 Observations and Results 75-123

4.1 Introduction 75

4.2 Introduction to Data set and R Language 75

4.2.1 Data set 75

4.2.2 R Language 76

4.3 Exploratory Data Analysis 77

4.4 Clustering Algorithms 94


4.4.1 Major Partitioning clustering methods: K-Means and K-Medoids 94

4.5 Experiments and results 101

4.5.1 Data Understanding 102

4.5.2 Correlation Analysis 103

4.5.3 Implementation of K-means algorithm for clustering 105

4.5.4 Implementation of K-medoid algorithm for clustering 109


4.5.5 Implementation of Hierarchical algorithm (agglomerative) for clustering 112
4.5.6 Implementation of Self Organizing Map 114

4.5.7 Implementation of Fuzzy C-means Clustering 117

4.5.8 Particle Swarm Optimization 121

4.5.8.1 PSO Based K-Means And Fuzzy C-Means Clustering 121

Chapter 5 Findings and Synthesis 124-134

5.1 Introduction 124

5.2 Objectives of the study 125

5.3 Observations and Result 126

5.3.1 Exploratory Data Analysis 127

5.3.2 Experiments and results 130

5.4 Conclusion 132
