
JOURNAL OF COMPUTING, VOLUME 3, ISSUE 12, DECEMBER 2011, ISSN 2151-9617, https://sites.google.com/site/journalofcomputing, WWW.JOURNALOFCOMPUTING.ORG

An Integration of K-means and Decision Tree (ID3) towards a more Efficient Data Mining Algorithm
Dost Muhammad Khan¹, Nawaz Mohamudally²

¹ Assistant Professor, Department of Computer Science & IT, The Islamia University of Bahawalpur, Pakistan & PhD Student, School of Innovative Technologies & Engineering, University of Technology, Mauritius

² Associate Professor & Manager, Consultancy & Technology Transfer Centre, University of Technology, Mauritius (UTM)

Abstract
The K-means clustering algorithm is commonly used to find clusters in huge datasets because of its simplicity of implementation and fast execution. After applying K-means to a dataset, however, it is difficult to interpret and extract the required results from the clusters unless an appropriate data mining tool or algorithm is used. The decision tree (ID3) algorithm is the best choice for interpreting the clusters produced by K-means: it is user-friendly, faster to generate, and yields simpler, more understandable decision rules than the other commonly used data mining algorithms. In this research paper we integrate the K-means clustering algorithm with the decision tree (ID3) algorithm to come up with a more efficient data mining algorithm using an intelligent agent, called the Learning Intelligent Agent (LIAgent), which is capable of performing classification, clustering and interpretation tasks on datasets.

Keywords: LIAgent, Data Mining Algorithms, Dataset, Clusters, Partitioned Clustered Dataset, Visualization

1. Introduction

Data mining algorithms are applied to discover hidden patterns and relations among the different fields of complex datasets. Coding these algorithms with intelligent agent technology provides more sophisticated and powerful tools for data mining. An agent is an object that has an independent thread of control and can be initiated [1][2][3][4]. Figure 1 depicts the different states of an agent.

Fig. 1 States of an Agent

An agent initializes itself in the first state; in the second state the agent starts to operate; the next state is stop, and the agent may start again depending on the environment and the tasks it is trying to accomplish. The agent reaches its complete state after completing all of its tasks. The intelligence of an Intelligent Agent (IA) is derived from Artificial Intelligence (AI) techniques, and its mobility follows traditional computing in Distributed Systems (DS). Machine learning deals with the development of techniques that allow a computer to learn, and the agents must be able to learn to perform classification, clustering and prediction using learning algorithms. K-means is an unsupervised learning algorithm that outputs the clusters of a given dataset, whereas decision tree (ID3) is a supervised learning algorithm that outputs the decision rules of a given dataset in if-then-else form; the LIAgent is therefore a hybrid integration of supervised and unsupervised machine learning [5][6][7][8].
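To illustrate this lifecycle, a minimal sketch of the four agent states as a simple state machine follows; the enum and class names are illustrative assumptions, not the authors' implementation.

from enum import Enum, auto

class AgentState(Enum):
    INITIALIZED = auto()  # the agent initializes itself
    RUNNING = auto()      # the agent starts to operate
    STOPPED = auto()      # the agent stops and may start again
    COMPLETE = auto()     # all tasks have been accomplished

class Agent:
    def __init__(self):
        self.state = AgentState.INITIALIZED

    def start(self):
        self.state = AgentState.RUNNING

    def stop(self, tasks_done: bool):
        # The agent reaches COMPLETE only after finishing all of its
        # tasks; otherwise it stops and may later be restarted.
        self.state = AgentState.COMPLETE if tasks_done else AgentState.STOPPED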

The remainder of this paper is organized as follows: Section 2 reviews the relevant data mining algorithms, namely K-means clustering and decision tree (ID3); section 3 presents the methodology, a hybrid integration of these two data mining algorithms; section 4 discusses the findings and obtained results; and section 5 presents the conclusion.

2. Overview of Data Mining Algorithms


Data mining algorithms are widely accepted nowadays due to their robustness, scalability and efficiency in fields of study such as bioinformatics, genetics, medicine and education, among many other areas. Classification, clustering, interpretation and data visualization are the main areas of data mining. Decision Tree (ID3), K-Nearest Neighbour, Naïve Bayesian classification and Neural Networks are the most commonly used classification algorithms; K-means and Self-organized (Kohonen) maps are the well-accepted clustering algorithms; and 2D or 3D scatter graphs are used for visualization. Once the clusters have been discovered by applying the K-means algorithm to a dataset, there are different ways to utilize the clustering results: cluster membership can be used as a label for separate classification problems, descriptive data mining techniques such as ID3 (decision rules) can be used to find descriptions of the clusters, and the clusters can be visualized using 2D or 3D scatter graphs [9][10][11]. Figure 2 shows the different combinations of data mining algorithms.

Fig. 2 The Combinations of Data Mining Algorithms

We can combine the K-means clustering algorithm with four classification algorithms, namely ID3, Neural Networks, Naïve Bayesian and K-Nearest Neighbour; similarly, the Self-organized (Kohonen) map, another clustering algorithm, can be combined with the same four classification algorithms. This gives eight possible combinations of clustering and classification algorithms: (K-means, ID3), (K-means, Neural Networks), (K-means, Naïve Bayesian), (K-means, K-Nearest Neighbour), (Self-organized map, ID3), (Self-organized map, Neural Networks), (Self-organized map, Naïve Bayesian) and (Self-organized map, K-Nearest Neighbour). To perform the tasks of classification, clustering and interpretation on a dataset, there are two candidate choices among these combinations, (K-means, ID3) and (Self-organized map, ID3), as illustrated in figure 2; no other combination accomplishes all three tasks. The problem with the second option is that Self-organized (Kohonen) maps produce output in the form of graphs or maps (bitmap images), whereas ID3 requires numeric input, which only the first option provides; therefore the combination of K-means with ID3 is selected for this research paper. For data visualization, 2D scatter graphs can be drawn. In this section we discuss the two algorithms of this preferred combination.

2.1 K-means clustering Algorithm


The following steps describe the K-means clustering algorithm:

Step 1: Enter the number of clusters and the number of iterations, which are the required basic inputs of the K-means clustering algorithm.

Step 2: Compute the initial centroids using the Range Method shown in equations 1 and 2.

c_i = ((max X − min X) / k) × n        (1)

c_j = ((max Y − min Y) / k) × n        (2)


The initial centroid is C(c_i, c_j), where max X, max Y, min X and min Y represent the maximum and minimum values of the X and Y attributes respectively; k represents the number of clusters; and i, j and n vary from 1 to k, where k is an integer. In this way we can calculate the initial centroids, which form the starting point of the algorithm. The value (max X − min X) gives the range of the X attribute; similarly, (max Y − min Y) gives the range of the Y attribute. The number of iterations should be small, otherwise the time and space complexity will be very high, and the initial centroids may fall out of the range of the given dataset, which is a drawback of the algorithm.

Step 3: Calculate the distance using the Euclidean distance formula in equation 3. On the basis of the distances, generate the partition by assigning each sample to the closest cluster. Euclidean distance formula:

d(x_i, x_j) = √( Σ_{k=1}^{N} (x_{ik} − x_{jk})² )        (3)

where d(x_i, x_j) is the distance between x_i and x_j, the attribute vectors of two given objects, and the index k runs over the attributes from 1 to N, where N is the total number of attributes of a given object; i, j and N are integers.

Step 4: Compute new cluster centres as the centroids of the clusters, then recompute the distances and regenerate the partition. Repeat this process until the cluster memberships stabilize [12][13].
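To make these steps concrete, the following is a minimal Python sketch of the algorithm as described above, using the paper's Range Method for initialization; the function names and the stopping test (centroid stability as a proxy for membership stability) are illustrative, not the authors' implementation.

import numpy as np

def initial_centroids(data, k):
    # Range Method of equations (1) and (2), as stated in the paper:
    # centroid n is ((max - min) / k) * n, computed per attribute.
    lo, hi = data.min(axis=0), data.max(axis=0)
    return np.array([(hi - lo) / k * n for n in range(1, k + 1)])

def kmeans(data, k=5, max_iter=50):
    centroids = initial_centroids(data, k)
    for _ in range(max_iter):
        # Step 3: Euclidean distance from every sample to every centroid.
        dist = np.linalg.norm(data[:, None, :] - centroids[None, :, :], axis=2)
        labels = dist.argmin(axis=1)  # assign each sample to the closest cluster
        # Step 4: recompute centroids and repeat until memberships stabilize.
        new_centroids = np.array([
            data[labels == c].mean(axis=0) if np.any(labels == c) else centroids[c]
            for c in range(k)])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

With the settings used later in the paper (k = 5, 50 iterations), this would be invoked as labels, centroids = kmeans(X, k=5, max_iter=50).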
Strengths and weaknesses of the K-means clustering algorithm are shown in Table 1.

Table 1: Strengths and Weaknesses of the K-means Clustering Algorithm

Strengths:
1. Time complexity is O(nkl), i.e. linear in the size of the dataset.
2. Space complexity is O(k + n).
3. It is an order-independent algorithm: it generates the same partition of the data irrespective of the order of the samples.

Weaknesses:
1. Although it is easy to implement, it has the drawback of depending on the initial centres provided.
2. If a distance measure does not exist, especially in multidimensional spaces, the distance must first be defined, which is not always easy.
3. The results obtained from this clustering algorithm can be interpreted in different ways.
4. All clustering techniques fail to address all of the requirements adequately and concurrently.

The K-means clustering algorithm is applied in marketing, biology, libraries, insurance, city planning, earthquake studies, the WWW, medical sciences, etc. [14][15].

2.2 Decision Tree (ID3) Algorithm


The decision tree (ID3) algorithm produces decision rules in if-then-else form as output. These rules can be used in decision support systems and for classification and prediction, and they help to form an accurate, balanced picture of the risks and rewards that can result from a particular choice. The function of the decision tree (ID3) algorithm is shown in figure 3.

Fig. 3 The Function of the Decision Tree (ID3) Algorithm

The cluster is the input data for the decision tree (ID3) algorithm, which produces the decision rules of that cluster. The following steps describe the ID3 algorithm:

Step 1: Let S be a training set. If all instances in S are positive, create a YES node and halt; if all instances in S are negative, create a NO node and halt. Otherwise, select a feature F with values v1, ..., vn and create a decision node.

Step 2: Partition the training instances in S into subsets S1, S2, ..., Sn according to the values of F.

Step 3: Apply the algorithm recursively to each of the subsets Si [16][17].

Table 2 shows the strengths and weaknesses of the ID3 algorithm.

Table 2: Strengths and Weaknesses of the Decision Tree (ID3) Algorithm

Strengths:
1. It generates understandable rules.
2. It performs classification without requiring much computation.
3. It is suitable for handling both continuous and categorical variables.
4. It provides an indication of which fields matter most for prediction or classification.

Weaknesses:
1. It is less appropriate for continuous attributes.
2. It does not perform well on problems with many classes and a small number of training examples.
3. Growing a decision tree is computationally expensive, because each node must be sorted before finding the best split.
4. It splits on a single field at a time and does not treat non-rectangular regions well.
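The steps above translate into a short recursive implementation. The following is a minimal Python sketch; the paper leaves the feature-selection criterion implicit, so this sketch uses the standard information-gain criterion of ID3, and all names are illustrative.

import math
from collections import Counter

def entropy(labels):
    # Shannon entropy of the class distribution.
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())

def best_feature(rows, labels, features):
    # Standard ID3: pick the feature with the highest information gain.
    def gain(f):
        split = {}
        for row, y in zip(rows, labels):
            split.setdefault(row[f], []).append(y)
        rem = sum(len(ys) / len(labels) * entropy(ys) for ys in split.values())
        return entropy(labels) - rem
    return max(features, key=gain)

def id3(rows, labels, features):
    # Step 1: if all instances share one class, emit a leaf and halt.
    if len(set(labels)) == 1:
        return labels[0]
    if not features:
        return Counter(labels).most_common(1)[0][0]
    f = best_feature(rows, labels, features)
    node = {}
    # Step 2: partition the training instances by the values of f.
    for v in set(row[f] for row in rows):
        sub = [(r, y) for r, y in zip(rows, labels) if r[f] == v]
        # Step 3: apply the algorithm recursively to each subset.
        node[(f, v)] = id3([r for r, _ in sub], [y for _, y in sub],
                           [g for g in features if g != f])
    return node

Here rows are dictionaries mapping attribute names to values, and the tree is represented as a nested dictionary keyed by (feature, value) pairs with class labels at the leaves.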

3. Methodology

For a given dataset, create n partitions and input one partitioned dataset at a time to start the execution of the proposed algorithm, the LIAgent, which creates m clusters, converts the partitioned dataset into a partitioned clustered dataset, and produces o decision rules for this partitioned clustered dataset, where n, m and o are all non-zero positive integers. A medical dataset, Breastcancer, of more than 200 records is selected for this research paper. The data in this dataset lies on an interval scale between 1 and 10 and is properly cleansed. The attributes are: Clump Thickness (CT), Uniformity of Cell Size (UCS), Uniformity of Cell Shape (UCSh), Marginal Adhesion (MAd), Single Epithelial Cell Size (SECS), Bare Nuclei (BNu), Bland Chromatin (BCh), Normal Nucleoli (NNu), Mitoses, and Class (benign, malignant) [18]. We create five vertical partitions of the Breastcancer dataset, after selecting the proper number of attributes for each partition. These partitions are shown in tables 3 to 7; a code sketch of the partitioning step follows the tables.

Table 3: 1st Vertical Partition of the Breastcancer Dataset

CT    UCS   Class
4     6     benign
2     3     malignant
2     2     benign

Table 4: 2nd Vertical Partition of the Breastcancer Dataset

UCSh  MAd   Class
7     5     benign
6     1     malignant
4     3     benign

Table 5: 3rd Vertical Partition of the Breastcancer Dataset

SECS  BNu   Class
5     3     benign
9     2     malignant
2     4     benign

Table 6: 4th Vertical Partition of the Breastcancer Dataset

BCh   NNu   Class
1     4     benign
8     9     malignant
5     6     benign

Table 7: 5th Vertical Partition of the Breastcancer Dataset

BCh   Mitoses  Class
148   0        benign
85    94       malignant
185   168      benign
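As a rough illustration of this vertical partitioning step, the sketch below splits the dataset into the five attribute groups shown in tables 3 to 7; the pandas usage is an assumption and the file name is hypothetical.

import pandas as pd

# Hypothetical file name; the Breastcancer dataset is publicly available [18].
df = pd.read_csv("breastcancer.csv")

# Five vertical partitions, each retaining the Class label, as in tables 3-7.
groups = [["CT", "UCS"], ["UCSh", "MAd"], ["SECS", "BNu"],
          ["BCh", "NNu"], ["BCh", "Mitoses"]]
partitions = [df[cols + ["Class"]] for cols in groups]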

Each partitioned table is a dataset of more than 200 records; only 3 example records are shown in each table. For visualization of the results of the clusters, 2D scatter graphs are also drawn. Figure 4 illustrates the architecture of the LIAgent.

Fig. 4 The Architecture of LIAgent

Each partitioned dataset is an input for the LIAgent, which generates two outputs: the clusters of the partitioned dataset and the decision rules of the partitioned clustered dataset. The tasks of classification, prediction and interpretation are accomplished by the LIAgent as a single, efficient data mining algorithm: K-means produces the clusters of the given dataset, and ID3 uses each clustered dataset to generate decision rules for classification and interpretation, which help the user to take decisions on the given dataset. Visualization is done by plotting 2D scatter graphs.

4. Results and Discussion

We tested the proposed algorithm, the LIAgent, on the medical dataset Breastcancer. For processing, the value of k (the number of clusters) and the value of n (the number of iterations) are set to 5 and 50 respectively, i.e. k = 5 and n = 50. The outputs produced by the algorithm are five sets of clusters, one for each partitioned dataset, and five sets of decision rules, one for each partitioned clustered dataset. The results of only two partitioned clustered datasets, the 2nd and the 5th, are demonstrated. Decision rules for the 2nd partition of the Breastcancer dataset are shown in figure 5.

Rule 1: if UCSh = 1 then
  Rule 2: if MAd = 1 then Class = "benign", "malignant"
          else Class = "benign"
else Rule 3: if MAd = 1 then
  Rule 4: if UCSh = 8 then Class = "malignant"
  else Rule 5: if UCSh = 5 then Class = "malignant"
  else Rule 6: if UCSh = 4 then Class = "malignant", "benign"
  else Rule 7: if UCSh = 3 then Class = "benign", "malignant"
  else Class = "benign"
else Rule 8: if UCSh = 2 then
  Rule 9: if MAd = 2 then Class = "benign"
  else Rule 10: if MAd = 3 then Class = "benign"
  else Rule 11: if MAd = 10 then Class = "benign"
  else Class = "malignant"
else Rule 12: if UCSh = 4 then
  Rule 13: if MAd = 5 then Class = "benign"
  else Rule 14: if MAd = 2 then Class = "benign"
  else Rule 15: if MAd = 4 then Class = "malignant", "benign"
  else Class = "malignant"
else Rule 16: if MAd = 3 then
  Rule 17: if UCSh = 3 then Class = "benign", "malignant"
  else Rule 18: if UCSh = 5 then Class = "benign", "malignant"
  else Class = "malignant"
else Rule 19: if MAd = 6 then
  Rule 20: if UCSh = 5 then Class = "benign"
  else Class = "malignant"
else Class = "malignant"

Fig. 5 The Decision Rules for the Second Partition

There are twenty decision rules for the 2nd partition of the dataset. The result for the 2nd partition of the Breastcancer dataset is: if the value of both attributes Marginal Adhesion and Uniformity of Cell Shape is 1, then the patient has the benign class of breast cancer; otherwise, the malignant class of breast cancer. Decision rules for the 5th partition of the Breastcancer dataset are given below in figure 6.

Rule 1: if Mitoses = 1 then
  Rule 2: if BCh = 7 then Class = "malignant", "benign"
  else Rule 3: if BCh = 1 then Class = "benign"
  else Rule 4: if BCh = 2 then Class = "benign", "malignant"
  else Rule 5: if BCh = 3 then Class = "benign", "malignant"
  else Rule 6: if BCh = 4 then Class = "malignant", "benign"
  else Rule 7: if BCh = 8 then Class = "malignant"
  else Rule 8: if BCh = 9 then Class = "malignant"
  else Rule 9: if BCh = 10 then Class = "malignant"
  else Class = "benign", "malignant"
else Rule 10: if Mitoses = 2 then
  Rule 11: if BCh = 1 then Class = "malignant"
  else Rule 12: if BCh = 6 then Class = "malignant"
  else Rule 13: if BCh = 3 then Class = "malignant", "benign"
  else Class = "malignant", "benign"
else Rule 14: if BCh = 1 then Class = "benign"
else Class = "malignant"

Fig. 6 The Decision Rules for the Fifth Partition
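Rule sets such as those in figures 5 and 6 translate directly into nested conditionals. As a partial, hypothetical rendering, the single-class rules of figure 6 could be applied to a record as follows (rules whose outcome lists both classes are omitted for brevity; the function name is illustrative).

def classify_partition5(record):
    # Partial rendering of figure 6: only single-class rules are shown.
    if record["Mitoses"] == 1:
        if record["BCh"] == 1:
            return "benign"            # Rule 3
        if record["BCh"] in (8, 9, 10):
            return "malignant"         # Rules 7, 8, 9
        return None                    # mixed-class rules omitted
    if record["Mitoses"] == 2:
        if record["BCh"] in (1, 6):
            return "malignant"         # Rules 11, 12
        return None                    # mixed-class rules omitted
    if record["BCh"] == 1:
        return "benign"                # Rule 14
    return "malignant"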


There are fourteen decision rules generated for the 5th partition of the dataset, fewer than for the 2nd partition. The result for the 5th partition of the Breastcancer dataset is: if the value of the attribute Mitoses is 1 and the value of the attribute Bland Chromatin is 1, 2 or 3, then the patient has the benign class of breast cancer; otherwise, the malignant class of breast cancer. From the two decision rule sets for the Breastcancer dataset in figures 5 and 6 we can draw the conclusion that if the values of the attributes vary then the class is malignant, and the class is benign where the values of the attributes are constant. Visualization is an important tool that provides a better understanding of the data and illustrates the relationships among the attributes of the data. For visualization, 2D scatter graphs are drawn for all the partitioned clustered datasets. Figure 7 shows the 2D scatter graph of the 1st partitioned clustered dataset of Breastcancer.
Fig. 7 2D Scattered Graph between CT and UCS attributes of Breastcancer dataset

The graph in figure 7 shows that the values of the attributes Uniformity of Cell Size and Clump Thickness vary in the beginning and then become constant from record 50 to the end of the graph. The graph can be divided into three regions: the first from 0 to 50, the second from 50 to 100 and the third from 130 to 230. In the first region, the values of Clump Thickness and Uniformity of Cell Size vary from 4 to 10; in the next two regions the values are almost constant. The outcome of this graph is that if the values of the attributes vary then the patient has the malignant class of breast cancer, and if the values are constant then the benign class of breast cancer. Figure 8 shows the 2D scatter graph of the 2nd partitioned clustered dataset of Breastcancer.

Fig. 8 2D Scattered Graph between UCSh and MAd attributes of Breastcancer dataset

The graph in figure 8 shows that the values of the attributes Marginal Adhesion and Uniformity of Cell Shape are both the same from record 50 to the end of the graph. The graph can be divided into two regions: one from 0 to 80 and the second from 80 to 230. In the first region, the values of Uniformity of Cell Shape and Marginal Adhesion vary from 3 to 10, and they are constant in the subsequent second region. The outcome of this graph is that if the values of the attributes vary then the patient has the malignant class of breast cancer, and the benign class of breast cancer where the values are constant. Figure 9 shows the 2D scatter graph of the 3rd partitioned clustered dataset of Breastcancer.

Fig. 9 2D Scattered Graph between SECS and BNu attributes of Breastcancer dataset

The graph in figure 9 shows that the values of the attributes Bare Nuclei and Single Epithelial Cell Size are constant in the beginning, from 1 to 150, after which there is variation in the values of these attributes. The graph has two main regions: one from 0 to 130 and the second from 130 to 230. The values of Single Epithelial Cell Size and Bare Nuclei are constant in the first region of the graph and then vary between 2 and 10 in the subsequent region. The outcome of this graph is that if the values of the attributes vary then the patient has the malignant class of breast cancer, and if the values are constant then the class of breast cancer is benign.
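Graphs like figures 7 to 9 take only a few lines to produce; the following is a minimal sketch, assuming the pandas partitions from the earlier snippet and plotting each attribute against the record index, as the figures do (styling is illustrative).

import matplotlib.pyplot as plt

def scatter_partition(part, x_col, y_col, title):
    # Plot both attributes of a partition against the record index.
    plt.scatter(part.index, part[x_col], label=x_col, marker="s")
    plt.scatter(part.index, part[y_col], label=y_col, marker="o")
    plt.xlabel("Record")
    plt.ylabel("Attribute value")
    plt.title(title)
    plt.legend()
    plt.show()

scatter_partition(partitions[0], "CT", "UCS",
                  "2D Scattered Graph of 1st Partition")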
Figure 10 shows the 2D scatter graph of the 4th partitioned clustered dataset of Breastcancer.

Fig. 10 2D Scattered Graph between BCh and NNu attributes of Breastcancer dataset

The structure of the graph in figure 10 is similar to that of figure 7. The values of the attributes Normal Nucleoli and Bland Chromatin are constant from record 40 to the end of the graph. The graph can be divided into three regions: the first from 0 to 50, the second from 50 to 110 and the third from 110 to 230. The values of Bland Chromatin and Normal Nucleoli vary from 4 to 10 in the first region and then become constant in the next two regions of the graph. The outcome of this graph is that if the values of the attributes vary then the patient has the malignant class of breast cancer, and if the values are constant then the class of breast cancer is benign. Figure 11 shows the 2D scatter graph of the 5th partitioned clustered dataset of Breastcancer.

Fig. 11 2D Scattered Graph between BCh and Mitoses attributes of Breastcancer dataset

The structure of the graph in figure 11 is similar to the graphs in figures 7 and 8. The values of the attributes Mitoses and Bland Chromatin are almost constant throughout the graph. The graph can be divided into two main regions: the first from 0 to 10 and the second from 10 to 230. The values of Bland Chromatin and Mitoses vary from 3 to 10 in the first region and remain constant in the subsequent region. The outcome of this graph is that if the values of the attributes vary then the patient has the malignant class of breast cancer, and otherwise the benign class of breast cancer for constant values of the attributes. Table 8 summarizes the results of the LIAgent.

Table 8: The Summary of the Results of the LIAgent

Partition   # Clusters   # Records   # Decision Rules
1st         5            200         22
2nd         5            200         20
3rd         5            200         27
4th         5            200         20
5th         5            200         14

The table shows that the number of clusters and the number of records in each partitioned dataset are fixed, but the number of decision rules varies from one partition to another. Not all clusters are equally useful for classification; only the required and useful clusters are selected for further processing. The algorithm works properly and produces satisfactory and consistent results.

5. Conclusion

The objectives set at the outset of this research paper were, firstly, to integrate the K-means and ID3 algorithms and, secondly, to validate the efficiency of the integration. For the first objective, correct diagnoses were obtained by applying the proposed LIAgent to the medical dataset Breastcancer. The second objective is quite exhaustive and requires a complete study of the integration of these two algorithms; we therefore cannot conclude at this stage that the experiment conducted with this combination is especially efficient. In future, a more extensive study should be carried out to show whether this combination is superior to the other combinations of data mining algorithms. We conclude that the first objective, the integration of the two algorithms, works perfectly; as far as the second objective is concerned, the results are partially satisfactory.

Future Work

Further study can be carried out on different medical datasets to validate the effectiveness of the methodology proposed in this research paper.


Acknowledgment
The authors are thankful to The Islamia University of Bahawalpur, Pakistan for providing financial assistance to carry out this research activity under HEC project 6467/F II.

References
[1] Donovan, A. M., "Knowledge Discovery in Databases and Information Retrieval in Knowledge Management Systems", Knowledge Management System, LIS385T, The University of Austin, April 22, 2003.
[2] Mohammadian, M. (ed.), Intelligent Agents for Data Mining and Information Retrieval, ISBN: 1591401941, Idea Group Publishing, 2004.
[3] Ullman, J. D., Data Mining: A Knowledge Discovery in Databases, URL: http://www-db.stanford.edu/~ullman/mining.
[4] Kantardzic, M., Data Mining: Concepts, Models, Methods, and Algorithms, ISBN: 0471228524, John Wiley & Sons, 2003.
[5] Liu, B., Web Data Mining: Exploring Hyperlinks, Contents, and Usage Data, ISBN-13: 978-3-540-37881-5, Springer, Berlin Heidelberg New York, pp. 124-139.
[6] Peng, Y., Kou, G., Shi, Y., Chen, Z., "A Descriptive Framework for the Field of Data Mining and Knowledge Discovery", International Journal of Information Technology and Decision Making, Vol. 7, Issue 4, pp. 639-682, 2008.
[7] Bigus, J. P., Constructing Intelligent Agents Using Java, 2nd ed., ISBN: 0-471-39601-X, 2001.
[8] Bigus, J. P., "Agent Building and Learning Environment", in Proceedings of the International Conference on Autonomous Agents 2000, Association for Computing Machinery, pp. 108-109, Barcelona, Spain.
[9] Wang, J., Data Mining: Opportunities and Challenges, Idea Group Publishing, ISBN: 1-59140-051-1, chapter IX p. 235 and chapter XVI p. 381.
[10] Liu, B., Web Data Mining: Exploring Hyperlinks, Contents, and Usage Data, ISBN-13: 978-3-540-37881-5, Springer, Berlin Heidelberg New York, chapters 3 and 4.
[11] Introduction to Data Mining and Knowledge Discovery, Third Edition, Two Crows Corporation, ISBN: 1-892095-02-5, pp. 11-13, 15.
[12] Bezek, A., and Gams, M., "An Agent Version of a Cluster Server", IEEE/ACM, CCGRID 2003.
[13] Davidson, I., "Understanding K-Means Non-hierarchical Clustering", SUNY Albany Technical Report 02-2, 2002.
[14] Fung, G., "A Comprehensive Overview of Basic Clustering Algorithms", June 22, 2001.
[15] Robinson, F., Apon, A., Brewer, D., Dowdy, L., Hoffman, D., Lu, B., "Initial Starting Point Analysis for K-Means Clustering: A Case Study".
[16] MacQueen, J. B., "Some Methods for Classification and Analysis of Multivariate Observations", in Proceedings of the 5th Berkeley Symposium on Mathematical Statistics and Probability, Berkeley, University of California Press, 1:281-297.
[17] Quinlan, J. R., C4.5: Programs for Machine Learning, San Francisco: Morgan Kaufmann, ISBN: 1-55860-238-0, 1993.
[18] US Census Bureau, Iris, Diabetes, Vote and Breast datasets, publicly available at URL: www.sgi.com/tech/mlc/db.

Dost Muhammad Khan: M.Sc. (Computer Science) from B.Z. University, Multan, Pakistan. Currently a PhD student at SITE, UTM, Mauritius, and Assistant Professor, Department of Computer Science & IT, The Islamia University of Bahawalpur, Pakistan. Has published 5 research papers in international conferences and journals. Fields of interest are Data Mining, Intelligent Agents, and Object Relational Database Management.
