
X-CLUSTER: A Novel and Efficient Clustering Tool

SUBMITTED BY: MUGDHA SHARMA (0622083108) SAURABH NAHATA (0682083108) IVY JAIN (0692083108) ANKIT GOYAL (0722083108)

About The Project


The aim of this project is to devise a new clustering algorithm for data mining. The main functionalities implemented in the system are preprocessing and clustering. In the preprocessing stage, an MS-Excel (.xls) input file is chosen, any null values are removed, and redundant or duplicate data sets are eliminated. In the clustering stage, the data is distributed into groups so that the degree of association is strong between members of the same cluster and weak between members of different clusters.
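As a concrete illustration of this preprocessing step, the sketch below reads an .xls sheet with the jxl (JExcelApi) library mentioned in the conclusion and drops rows that contain blank cells or that duplicate an earlier row. The class name Preprocessor and the exact cleaning rules are illustrative assumptions rather than the project's actual code.

import java.io.File;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

import jxl.Sheet;
import jxl.Workbook;

public class Preprocessor {

    // Reads an .xls sheet and returns its rows with null/blank cells and duplicates removed.
    public static List<String[]> cleanRows(File xlsFile) throws Exception {
        Workbook workbook = Workbook.getWorkbook(xlsFile);
        Sheet sheet = workbook.getSheet(0);

        Set<List<String>> seen = new LinkedHashSet<>();   // remembers rows already kept
        List<String[]> cleaned = new ArrayList<>();

        for (int r = 0; r < sheet.getRows(); r++) {
            String[] row = new String[sheet.getColumns()];
            boolean hasNull = false;
            for (int c = 0; c < sheet.getColumns(); c++) {
                row[c] = sheet.getCell(c, r).getContents();
                if (row[c] == null || row[c].trim().isEmpty()) {
                    hasNull = true;                        // null/blank cell: drop the row
                }
            }
            if (!hasNull && seen.add(Arrays.asList(row))) { // add() returns false for duplicates
                cleaned.add(row);
            }
        }
        workbook.close();
        return cleaned;
    }
}

The cleaned rows returned here are the data sets from which the KD-tree described later is built.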

What is Clustering?
Clustering is one of the most popular data mining techniques. Objects are assigned to groups (clusters) according to their characteristics. It is a type of unsupervised learning that helps in pattern recognition, so that objects in one group behave similarly to each other and differently from objects in other groups. A major application of clustering is target marketing, in which a company decides which group of customers to target based on their characteristics.

Present Tool: WEKA


WEKA (Waikato Environment for Knowledge Analysis) is a popular suite of machine learning software developed at the University of Waikato, New Zealand. The Explorer interface features several panels providing access to the main components of the workbench. The Preprocess panel has facilities for importing data from a database or a CSV file and for preprocessing this data using a so-called filtering algorithm. These filters can be used to transform the data (e.g., turning numeric attributes into discrete ones) and make it possible to delete instances and attributes according to specific criteria. The Cluster panel gives access to the clustering techniques in WEKA, e.g., the simple k-means algorithm. There is also an implementation of the expectation-maximization algorithm for learning a mixture of normal distributions.
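For later comparison with our tool, the short sketch below drives the same WEKA components programmatically through WEKA's Java API instead of the Explorer GUI: a preprocessing filter followed by the simple k-means clusterer. The file name weather.arff and the choice of two clusters are assumptions made for illustration.

import weka.clusterers.SimpleKMeans;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.ReplaceMissingValues;

public class WekaKMeansDemo {
    public static void main(String[] args) throws Exception {
        Instances data = new DataSource("weather.arff").getDataSet();

        // Preprocess panel equivalent: apply one of WEKA's filtering algorithms
        // (here, fill in missing values).
        ReplaceMissingValues filter = new ReplaceMissingValues();
        filter.setInputFormat(data);
        Instances filtered = Filter.useFilter(data, filter);

        // Cluster panel equivalent: WEKA's simple k-means implementation.
        SimpleKMeans kmeans = new SimpleKMeans();
        kmeans.setNumClusters(2);
        kmeans.buildClusterer(filtered);

        System.out.println(kmeans);   // prints the cluster centroids and cluster sizes
    }
}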

Proposed System : X Cluster


Initially, in the data preprocessing phase, an MS-Excel file is taken as input, so there is no need for CSV or ARFF file(s). The input file is first cleaned by removing null data sets, and then redundant or duplicate data sets are removed. After removing the redundant/duplicate data sets from the input file, a KD-tree is built from the collected data sets. Our clustering algorithm uses this KD-tree extensively to improve its time complexity.

PREPROCESSING IN WEKA
A data set with the following statistics was run both on WEKA and on our tool:
Relation = weather
No. of attributes = 3
No. of instances = 20 (including redundant/duplicate and null instances)

ARFF file of the data sets.

Preprocessing of the ARFF file in WEKA Tool

Microsoft Excel File of the data sets

Preprocessing of the Microsoft Excel file in the X-CLUSTER tool

KD Trees
K-Dimensional trees: a space-partitioning data structure
Splitting planes are perpendicular to the coordinate axes
Reduces the nearest-neighbor search time to O(log n) on average (see the sketch below)
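A minimal Java sketch of such a k-d tree is given below: each node stores one point and splits space with a plane perpendicular to one coordinate axis, cycling through the axes by depth and splitting at the median so that the tree stays roughly balanced. The class and field names are illustrative assumptions, not the project's actual implementation.

import java.util.Arrays;
import java.util.Comparator;

public class KDTree {

    static class Node {
        double[] point;    // the data point stored at this node
        int axis;          // axis the splitting plane is perpendicular to
        Node left, right;  // points on either side of the splitting plane
    }

    // Builds a roughly balanced k-d tree by splitting on the median along each axis in turn.
    static Node build(double[][] points, int depth, int k) {
        if (points.length == 0) return null;

        int axis = depth % k;                               // cycle through the k axes
        Arrays.sort(points, Comparator.comparingDouble(p -> p[axis]));
        int median = points.length / 2;

        Node node = new Node();
        node.point = points[median];
        node.axis = axis;
        node.left = build(Arrays.copyOfRange(points, 0, median), depth + 1, k);
        node.right = build(Arrays.copyOfRange(points, median + 1, points.length), depth + 1, k);
        return node;
    }
}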

KD Tree formed according to the data sets provided by the user

Nearest Neighbor Search

How are KD Trees faster?
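Continuing the k-d tree sketch above, the nearest-neighbor search below shows where the speed-up comes from: the query first descends into the subtree on its own side of each splitting plane and crosses the plane only when the plane is closer than the best match found so far, so large parts of the tree are pruned and a query costs O(log n) on average for a balanced tree. The helper class KDSearch is an assumption for illustration.

public class KDSearch {

    private double[] best;     // closest point found so far
    private double bestDist;   // its squared distance to the query

    public double[] nearest(KDTree.Node root, double[] query) {
        best = null;
        bestDist = Double.POSITIVE_INFINITY;
        search(root, query);
        return best;
    }

    private void search(KDTree.Node node, double[] query) {
        if (node == null) return;

        double d = squaredDistance(node.point, query);
        if (d < bestDist) { bestDist = d; best = node.point; }

        double diff = query[node.axis] - node.point[node.axis];
        KDTree.Node near = diff < 0 ? node.left : node.right;
        KDTree.Node far = diff < 0 ? node.right : node.left;

        search(near, query);            // descend on the query's side of the plane first
        if (diff * diff < bestDist) {   // cross the plane only if it could hide a closer point
            search(far, query);
        }
    }

    private static double squaredDistance(double[] a, double[] b) {
        double sum = 0;
        for (int i = 0; i < a.length; i++) sum += (a[i] - b[i]) * (a[i] - b[i]);
        return sum;
    }
}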

CLUSTERING IN WEKA: K-Means


K-Means is a commonly used clustering technique in which the user attempts to group data sets into k clusters based on a specific distance measure. The main steps of the k-means clustering algorithm are given below (a plain-Java sketch follows the list):
1. k initial means are selected at random from the data sets.
2. k clusters are formed by associating every observation with the nearest mean.
3. The centroid of each cluster becomes the new mean.
4. Steps 2 and 3 are repeated until the assignments no longer change.
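The sketch below is a minimal plain-Java rendering of these four steps (Lloyd's iterations). The initial centroids are passed in explicitly, where standard k-means would pick them at random and X-CLUSTER lets the user choose them; the class name KMeans and the convergence test (stop when no assignment changes) are illustrative assumptions.

import java.util.Arrays;

public class KMeans {

    // Repeats the assignment and update steps until no observation changes cluster.
    static int[] cluster(double[][] points, double[][] centroids) {
        int[] assignment = new int[points.length];
        Arrays.fill(assignment, -1);                  // no observation assigned yet
        boolean changed = true;

        while (changed) {
            changed = false;

            // Step 2: associate every observation with the nearest centroid.
            for (int i = 0; i < points.length; i++) {
                int nearest = nearestCentroid(points[i], centroids);
                if (nearest != assignment[i]) { assignment[i] = nearest; changed = true; }
            }

            // Step 3: move each centroid to the mean of its cluster.
            for (int c = 0; c < centroids.length; c++) {
                double[] sum = new double[points[0].length];
                int count = 0;
                for (int i = 0; i < points.length; i++) {
                    if (assignment[i] != c) continue;
                    for (int d = 0; d < sum.length; d++) sum[d] += points[i][d];
                    count++;
                }
                if (count > 0) {
                    for (int d = 0; d < sum.length; d++) centroids[c][d] = sum[d] / count;
                }
            }
            // Step 4: repeat steps 2 and 3 until the assignments no longer change.
        }
        return assignment;
    }

    private static int nearestCentroid(double[] p, double[][] centroids) {
        int best = 0;
        double bestDist = Double.POSITIVE_INFINITY;
        for (int c = 0; c < centroids.length; c++) {
            double dist = 0;
            for (int d = 0; d < p.length; d++) dist += (p[d] - centroids[c][d]) * (p[d] - centroids[c][d]);
            if (dist < bestDist) { bestDist = dist; best = c; }
        }
        return best;
    }
}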

Disadvantages of clustering in WEKA:


The clustering algorithm implemented in WEKA does not allow us to choose the initial cluster centre.
The clusters formed are poor when the data set is small or sparse.
K-Means cannot handle non-globular data of different sizes and densities.
K-Means will not identify outliers.
There is no guarantee of converging to the global optimum.
In the worst case, k-means is slow and can take exponential time.

Poor clusters formed by WEKA

Clusters formed by X-CLUSTER (where the number of clusters chosen is 4)

Features Included:
Initialize the cluster centre
Choose the number of clusters
Data sets are filtered using a k-d tree
History option available

BENEFITS OF PROPOSED APPROACH


Ensures data accuracy and saves disk space.
Using an Excel file for input removes the need to learn a new file format.
Null data sets were removed comfortably, and duplicate data sets were also removed. This reduced the processing time that would otherwise be required with an ARFF file, which would not ignore null and duplicate sets.
The proposed algorithm also provides an option to choose the initial cluster centre.
The algorithm uses a filtering algorithm based on KD-trees to speed up clustering, reducing the time complexity.
The user need not depend on third-party software such as WEKA or any other similar tool.

CONCLUSION
We devised a new clustering algorithm with the following features. The system provides a user-friendly interface. MS-Excel file(s) are successfully read, handled and processed by the system with the help of the jxl.jar library; using this library, we learned new features and functionalities for working with Excel documents. Null data sets were removed comfortably, and redundant and duplicate data sets were removed as well. A filtering algorithm that uses KD-trees is included to speed up each clustering step. The algorithm chooses better starting clusters, i.e., better initial values (or seeds) for the clustering algorithm. The tool also offers a rich History feature that shows the movement of the cluster centres from their initial positions to their final positions.

References
Han, J. & Kamber, M., Data Mining: Concepts and Techniques, 2nd Edition, Morgan Kaufmann Publishers (an imprint of Elsevier), 2010.
Osama Abu Abbas, Comparisons between Data Clustering Algorithms, The International Arab Journal of Information Technology, Vol. 5, No. 3, July 2008.
Tapas Kanungo et al., An Efficient K-Means Clustering Algorithm: Analysis and Implementation, IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 24, No. 7, July 2002.
Airel Perez Suarez et al., A New Graph-Based Algorithm for Clustering Documents, IEEE International Conference on Data Mining Workshops, 2008.
Prabhjot Kaur, Anjana Gosain, Improving the Performance of Fuzzy Clustering Algorithms through Outlier Identification, FUZZ-IEEE International Conference on Fuzzy Systems, 2009.
Sun Yuepeng, Liu Min, Wu Cheng, A Modified K-Means Algorithm for Clustering Problem with Balancing Constraints, Third International Conference on Measuring Technology and Mechatronics Automation, 2011.
Geoffrey Holmes et al., WEKA: A Machine Learning Workbench, Department of Computer Science, University of Waikato, 1994.
Jian Yu, Pengwei Hao, The Worse Clustering Performance Analysis, IEEE International Conference on Granular Computing, 2007.
Hai-Dong Meng et al., Research and Implementation of Clustering Algorithm for Arbitrary Clusters, International Conference on Computer Science and Software Engineering, 2008.

THANK YOU.
