PART A
(PART A : TO BE REFERRED BY STUDENTS)
Experiment No.05
Aim: Implementation of 2 dimensional K-means Algorithm for Clustering.
Prerequisites: C/C++/Java Programming
Learning Outcomes:
Concepts of K-means Algorithm and Clustering.
Theory:
K-means is a partitioning method of clustering: given n objects and a desired number of clusters k, it assigns each object to the cluster whose mean (centroid) is nearest, recomputes the means, and repeats until the assignments stop changing.
Algorithm:
1. Choose k initial means (e.g. randomly from the data).
2. Assign each object to the cluster with the nearest mean.
3. Recompute each cluster's mean from the objects assigned to it.
4. Repeat steps 2 and 3 until no object changes its cluster.
Example:
Problem Statement: Given {2,4,10,12,3,20,30,11,25} and k=2, randomly assign initial means m1=3, m2=4.
Iteration 1: K1={2,3}, K2={4,10,12,20,30,11,25}, m1=2.5, m2=16
Iteration 2: K1={2,3,4}, K2={10,12,20,30,11,25}, m1=3, m2=18
Iteration 3: K1={2,3,4,10}, K2={12,20,30,11,25}, m1=4.75, m2=19.6
Iteration 4: K1={2,3,4,10,11,12}, K2={20,30,25}, m1=7, m2=25
Iteration 5: no point changes cluster, so the algorithm stops.
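The trace above can be reproduced with a short one-dimensional sketch. This is illustrative only; the class name KMeans1D and the hard-coded initial means are assumptions made for this example, not part of the experiment's required program.

```java
import java.util.Arrays;

// One-dimensional K-means matching the worked example above.
public class KMeans1D {
    // Alternates assignment and update steps; returns the final means.
    static double[] cluster(double[] data, double[] means) {
        int[] assign = new int[data.length];
        boolean moved = true;
        while (moved) {
            moved = false;
            // Assignment step: each point joins the cluster of its nearest mean.
            for (int p = 0; p < data.length; p++) {
                int best = 0;
                for (int k = 1; k < means.length; k++)
                    if (Math.abs(data[p] - means[k]) < Math.abs(data[p] - means[best]))
                        best = k;
                if (assign[p] != best) { assign[p] = best; moved = true; }
            }
            // Update step: each mean becomes the average of its members.
            for (int k = 0; k < means.length; k++) {
                double sum = 0;
                int n = 0;
                for (int p = 0; p < data.length; p++)
                    if (assign[p] == k) { sum += data[p]; n++; }
                if (n > 0) means[k] = sum / n;
            }
        }
        return means;
    }

    public static void main(String[] args) {
        double[] data = {2, 4, 10, 12, 3, 20, 30, 11, 25};
        double[] m = cluster(data, new double[]{3, 4});
        System.out.println(Arrays.toString(m));   // final means m1=7, m2=25
    }
}
```

Running it passes through the same intermediate means as the hand trace (2.5/16, 3/18, 4.75/19.6) before settling at 7 and 25.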
PART B
(PART B : TO BE COMPLETED BY STUDENTS)
(Students must submit the soft copy as per the following segments within two hours of the practical slot.
The soft copy must be uploaded on Blackboard or emailed to the concerned lab in-charge faculty
at the end of the practical in case there is no Blackboard access available.)
Roll No. E059
Class : B.Tech CS
Date of Experiment:
Grade :
Date of Grading:
/**
 * K-means clustering of 2-D points.
 * @author mpstme.student
 */
//package means;
import java.util.ArrayList;
import java.util.Scanner;

public class KMeans {
    public static int NUM_CLUSTERS;
    public static int TOTAL_DATA;
    private static ArrayList<Data> dataSet = new ArrayList<>();
    private static ArrayList<Centroid> centroids = new ArrayList<>();

    // Pick NUM_CLUSTERS distinct random samples as the initial centroids.
    private static void initialize(double SAMPLES[][])
    {
        ArrayList<Integer> temp = new ArrayList<>();
        for (int i = 0; i < NUM_CLUSTERS; i++) {
            int t = (int) Math.floor(Math.random() * TOTAL_DATA);
            if (!temp.contains(t)) {
                temp.add(t);
                centroids.add(new Centroid(SAMPLES[t][0], SAMPLES[t][1]));
            } else {
                i--;    // duplicate index drawn: try again
            }
        }
        System.out.println("Centroids initialized at:");
        for (int i = 0; i < NUM_CLUSTERS; i++) {
            System.out.println("     (" + centroids.get(i).X() + ", " + centroids.get(i).Y() + ")");
        }
    }

    // Euclidean distance between a data point and a centroid.
    private static double distance(Data d, Centroid c)
    {
        return Math.sqrt(Math.pow(c.X() - d.X(), 2) + Math.pow(c.Y() - d.Y(), 2));
    }

    private static void kMeanCluster(double SAMPLES[][])
    {
        final double bigNumber = Math.pow(10, 10);
        boolean isStillMoving = true;

        // Initial assignment: each sample joins its nearest centroid's cluster.
        for (int sampleNumber = 0; sampleNumber < TOTAL_DATA; sampleNumber++) {
            Data newData = new Data(SAMPLES[sampleNumber][0], SAMPLES[sampleNumber][1]);
            double minimum = bigNumber;
            for (int i = 0; i < NUM_CLUSTERS; i++) {
                double d = distance(newData, centroids.get(i));
                if (d < minimum) {
                    minimum = d;
                    newData.cluster(i);
                }
            }
            dataSet.add(newData);
        }

        while (isStillMoving) {
            // Update step: move each centroid to the mean of its cluster.
            for (int i = 0; i < NUM_CLUSTERS; i++) {
                double totalX = 0.0;
                double totalY = 0.0;
                double totalInCluster = 0.0;
                for (int j = 0; j < dataSet.size(); j++) {
                    if (dataSet.get(j).cluster() == i) {
                        totalX += dataSet.get(j).X();
                        totalY += dataSet.get(j).Y();
                        totalInCluster++;
                    }
                }
                if (totalInCluster > 0) {
                    centroids.get(i).X(totalX / totalInCluster);
                    centroids.get(i).Y(totalY / totalInCluster);
                }
            }
            // Assignment step: reassign the points; stop when nothing moves.
            isStillMoving = false;
            for (int j = 0; j < dataSet.size(); j++) {
                Data tempData = dataSet.get(j);
                double minimum = bigNumber;
                int cluster = tempData.cluster();
                for (int i = 0; i < NUM_CLUSTERS; i++) {
                    double d = distance(tempData, centroids.get(i));
                    if (d < minimum) {
                        minimum = d;
                        cluster = i;
                    }
                }
                if (cluster != tempData.cluster()) {
                    tempData.cluster(cluster);
                    isStillMoving = true;
                }
            }
        }
    }

    // A 2-D data point together with its current cluster assignment.
    private static class Data
    {
        private double mX = 0.0;
        private double mY = 0.0;
        private int mCluster = 0;

        public Data(double x, double y)
        {
            this.mX = x;
            this.mY = y;
        }
        public double X()
        {
            return this.mX;
        }
        public double Y()
        {
            return this.mY;
        }
        public void cluster(int clusterNumber)
        {
            this.mCluster = clusterNumber;
        }
        public int cluster()
        {
            return this.mCluster;
        }
    }

    // A cluster centre, updated in place as the algorithm iterates.
    private static class Centroid
    {
        private double mX = 0.0;
        private double mY = 0.0;

        public Centroid(double newX, double newY)
        {
            this.mX = newX;
            this.mY = newY;
        }
        public void X(double newX)
        {
            this.mX = newX;
        }
        public double X()
        {
            return this.mX;
        }
        public void Y(double newY)
        {
            this.mY = newY;
        }
        public double Y()
        {
            return this.mY;
        }
    }

    public static void main(String[] args)
    {
        Scanner sc = new Scanner(System.in);
        System.out.println("Enter total no of clusters");
        NUM_CLUSTERS = sc.nextInt();
        do {
            System.out.println("Total No of data");
            TOTAL_DATA = sc.nextInt();
            if (TOTAL_DATA < NUM_CLUSTERS) {
                System.out.println("Number of data should be at least equal to number of clusters");
            }
        } while (TOTAL_DATA < NUM_CLUSTERS);
        double SAMPLES[][] = new double[TOTAL_DATA][2];
        System.out.println("Enter sample values");
        for (int i = 0; i < TOTAL_DATA; i++) {
            for (int j = 0; j < 2; j++) {
                SAMPLES[i][j] = sc.nextDouble();
            }
        }
        initialize(SAMPLES);
        kMeanCluster(SAMPLES);
        for (int i = 0; i < NUM_CLUSTERS; i++) {
            System.out.println("Cluster " + i + " includes:");
            for (int j = 0; j < TOTAL_DATA; j++) {
                if (dataSet.get(j).cluster() == i) {
                    System.out.println("     (" + dataSet.get(j).X() + ", " + dataSet.get(j).Y() + ")");
                }
            }
            System.out.println();
        }
        System.out.println("Centroids finalized at:");
        for (int i = 0; i < NUM_CLUSTERS; i++) {
            System.out.println("     (" + centroids.get(i).X() + ", " + centroids.get(i).Y() + ")");
        }
        System.out.print("\n");
    }
}
Input Data:
debug:
Enter total no of clusters
5
Total No of data
7
Enter sample values
1
2
3
4
5
6
7
8
8
9
8
8
9
9
Centroids initialized at:
(3.0, 4.0)
(1.0, 2.0)
(5.0, 6.0)
(8.0, 9.0)
(9.0, 9.0)
Cluster 0 includes:
(3.0, 4.0)
Cluster 1 includes:
(1.0, 2.0)
Cluster 2 includes:
(5.0, 6.0)
Cluster 3 includes:
(7.0, 8.0)
(8.0, 8.0)
Cluster 4 includes:
(8.0, 9.0)
(9.0, 9.0)
Centroids finalized at:
(3.0, 4.0)
(1.0, 2.0)
(5.0, 6.0)
(7.5, 8.0)
(8.5, 9.0)
BUILD SUCCESSFUL (total time: 55 seconds)
Output Data:
B.4 Conclusion:
(Students must write the conclusions based on their learning)
After successful completion of this experiment, we learned to implement the k-means method for
clustering the given objects using centroids. We observed that the objects are clustered according to
their distances from the centroids, which are chosen randomly at first and then refined iteratively.
2) Hierarchical algorithms: create a hierarchical decomposition of the data set using some criterion.
Advantages:
- The resulting structure is more informative.
- Does not require the number of clusters to be specified in advance.
Disadvantages:
- The selection of merge points is critical.
- Split decisions, if not well chosen, may lead to low-quality clusters.
3) Density-based algorithms: based on connectivity and density functions.
Advantage:
- Clusters are formed by connecting points within certain distance thresholds.
Disadvantage:
- They expect some kind of density drop to detect cluster borders.
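The connectivity idea behind density-based methods can be illustrated with a small sketch: points within a distance threshold eps of each other end up in the same cluster. This is a simplified, DBSCAN-like grouping on 1-D data; the class name DensityConnect, the eps value, and the sample data are assumptions made for illustration (real DBSCAN also requires a minimum neighbour count).

```java
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Deque;

// Simplified density/connectivity clustering: flood-fill over eps-neighbourhoods.
public class DensityConnect {
    // Returns a cluster label for each point; points reachable via steps of
    // length <= eps share a label.
    static int[] label(double[] pts, double eps) {
        int[] label = new int[pts.length];
        Arrays.fill(label, -1);                 // -1 means "not yet visited"
        int next = 0;
        for (int i = 0; i < pts.length; i++) {
            if (label[i] != -1) continue;
            Deque<Integer> stack = new ArrayDeque<>();
            stack.push(i);
            label[i] = next;
            while (!stack.isEmpty()) {          // flood-fill everything reachable from i
                int p = stack.pop();
                for (int q = 0; q < pts.length; q++)
                    if (label[q] == -1 && Math.abs(pts[p] - pts[q]) <= eps) {
                        label[q] = next;
                        stack.push(q);
                    }
            }
            next++;
        }
        return label;
    }

    public static void main(String[] args) {
        // {1,2,3} chain together; {10,11} form a second dense group.
        System.out.println(Arrays.toString(label(new double[]{1, 2, 3, 10, 11}, 1.5)));
    }
}
```

The "density drop" between 3 and 10 is what separates the two clusters, matching the disadvantage noted above: without such a gap the groups would merge.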
Q2. Explain Hierarchical algorithms for clustering with example.
Hierarchical clustering is a method of cluster analysis which seeks to build a hierarchy of clusters.
Strategies for hierarchical clustering generally fall into two types:
Agglomerative: Start with the points as individual clusters. At each step, merge the closest pair of
clusters until only one cluster (or k clusters) remains.
Divisive: Start with one, all-inclusive cluster. At each step, split a cluster until each
cluster contains a single point (or there are k clusters).
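The agglomerative strategy can be sketched as follows. This is illustrative only: it uses single linkage (closest pair of points) on 1-D data, and the class name Agglomerative and the sample values are assumptions made for this example.

```java
import java.util.ArrayList;
import java.util.List;

// Agglomerative (bottom-up) clustering with single linkage on 1-D points:
// repeatedly merge the closest pair of clusters until k clusters remain.
public class Agglomerative {
    static List<List<Double>> cluster(double[] points, int k) {
        List<List<Double>> clusters = new ArrayList<>();
        for (double p : points) {               // start: every point is its own cluster
            List<Double> c = new ArrayList<>();
            c.add(p);
            clusters.add(c);
        }
        while (clusters.size() > k) {
            int bestA = 0, bestB = 1;
            double bestDist = Double.MAX_VALUE;
            // Find the closest pair of clusters (single linkage: min point distance).
            for (int a = 0; a < clusters.size(); a++)
                for (int b = a + 1; b < clusters.size(); b++) {
                    double d = minDist(clusters.get(a), clusters.get(b));
                    if (d < bestDist) { bestDist = d; bestA = a; bestB = b; }
                }
            clusters.get(bestA).addAll(clusters.remove(bestB));   // merge the pair
        }
        return clusters;
    }

    static double minDist(List<Double> a, List<Double> b) {
        double min = Double.MAX_VALUE;
        for (double x : a) for (double y : b) min = Math.min(min, Math.abs(x - y));
        return min;
    }

    public static void main(String[] args) {
        // Merges proceed bottom-up until two clusters, {2,3,4} and {20,25,30}, remain.
        System.out.println(cluster(new double[]{2, 3, 4, 20, 25, 30}, 2));
    }
}
```

Stopping at k = 1 instead would record the full merge hierarchy, which is why hierarchical methods do not need k fixed in advance.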
BIRCH (Balanced Iterative Reducing and Clustering using Hierarchies) is an algorithm used to perform
hierarchical clustering over particularly large data sets. An advantage of BIRCH is its ability to
incrementally and dynamically cluster incoming, multi-dimensional metric data points in an attempt to
produce the best quality clustering for a given set of resources (memory and time constraints). In most
cases, BIRCH requires only a single scan of the database. In addition, BIRCH also claims to be the
"first clustering algorithm proposed in the database area to handle 'noise' (data points that are not
part of the underlying pattern) effectively", beating DBSCAN by two months.