Fig. 1. 100 Square Grid Results
Fig. 3. 200 Square Grid Results
of different variance, mean, and position. Two of the
distributions are circular, while the third is oblong.
As shown in Figure 1, the clustering attempt with 100x100
cells resulted in one large cluster. The cell size is too large,
and enough neighbor cells contain points such that there is
a connection between all of the clusters. Figure 2 shows that
with 150x150 cells, the upper two clusters are now separated
from the lower cluster. The overlap of the lower cluster
is not as severe as that of the upper clusters; thus, this cell
size was able to separate the top clusters from the bottom cluster.
Finally, with 200x200 cells, Figure 3 shows the three clusters
fully separated. The graphs are filtered to not show clusters of
exceedingly small size.
B. ELENA Concentric
The ELENA concentric database consists of 2500 two-dimensional
data points divided into two non-overlapping
clusters. One cluster is a ring shape, while the other cluster
is a circular shape embedded within the first cluster's ring.
There is no appreciable gap between the two clusters. The
grid algorithm is not limited to clustering circles or ellipses.

Mesh-based clustering is a useful tool that requires additional
research. Since identifying the appropriate grid size is
a major limiting factor of the Choudhari algorithm, the
implemented algorithm performs incremental subdivisions until
a threshold of cluster membership is reached. Unfortunately,
there are still problems needing resolution with this algorithm
and with mesh clustering in general.
Overlapping and sparse distributions create major problems
for this form of grid clustering. For overlapping data sets, it is
desirable to have a small cell size to prevent the clusters being
grouped together. For sparse data sets, it is desirable to have a
large cell size to capture more of the neighbors and join more
points together into a common cluster. Unfortunately, as the
ELENA clouds experiment above shows, it is possible to have
both overlapping and sparse distributions occur together.
Additionally, the lack of a stopping criterion allows for a
great variety of possible cluster shapes and sizes. Unfortunately,
if the cell size is not optimal, then far more cells may be
joined together into a cluster than should be. All it takes is for
one small chain of boxes to connect two clusters and join them,
no matter how far apart they are. Furthermore, the practice of
adding adjacent neighbors to the same cluster is impractical in
ICS 635 - MACHINE LEARNING - DR. SUSANNA STILL
V. FUTURE WORK
Although it is possible to simply rerun the algorithm with
larger or smaller grid sizes, it seems unnecessary to include all
of the cells in each subsequent run. Although a single run of
the algorithm has O(n) complexity, repeated runs increase the
total run time.
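To make the cost of rerunning concrete, the following is a minimal sketch of a generic grid clustering pass, not the Choudhari algorithm itself. It assumes points are (x, y) pairs normalized to [0, 1] and that occupied cells touching in the 8-neighborhood belong to the same cluster; the names grid_cluster, count_clusters, and cells_per_side are hypothetical and chosen only for illustration.

```python
from collections import deque


def grid_cluster(points, cells_per_side):
    """Label points by connected groups of occupied grid cells.

    Assumption: points are (x, y) pairs normalized to [0, 1].
    """
    # Bin every point into its cell in a single pass over the data.
    cells = {}
    for index, (x, y) in enumerate(points):
        col = min(int(x * cells_per_side), cells_per_side - 1)
        row = min(int(y * cells_per_side), cells_per_side - 1)
        cells.setdefault((col, row), []).append(index)

    labels = [-1] * len(points)
    cell_label = {}
    next_label = 0
    for start in cells:
        if start in cell_label:
            continue
        # Flood fill across occupied cells that touch each other
        # (8-neighborhood is an assumption of this sketch).
        cell_label[start] = next_label
        queue = deque([start])
        while queue:
            col, row = queue.popleft()
            for idx in cells[(col, row)]:
                labels[idx] = next_label
            for dc in (-1, 0, 1):
                for dr in (-1, 0, 1):
                    neighbor = (col + dc, row + dr)
                    if neighbor in cells and neighbor not in cell_label:
                        cell_label[neighbor] = next_label
                        queue.append(neighbor)
        next_label += 1
    return labels


def count_clusters(points, cells_per_side):
    """Number of distinct clusters found at the given resolution."""
    return len(set(grid_cluster(points, cells_per_side)))
```

Calling count_clusters once per candidate resolution repeats the binning pass over every point each time, which is the repeated cost described above.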
As such, it may be possible to optimize future iterations
by limiting the number of cells being subdivided. Instead of
dividing all of the cells, which is roughly equivalent to
increasing the grid size, we can divide only the cells
that are important. That is, we first discard cells that contain
no data points. Next, we should not need to subdivide cells
that hold a high proportion of similar data points relative to free
space.
Thus, with normalized data, we can start the algorithm
with the largest possible cell size, and allow the algorithm to
continue subdividing the cells where subdivision is suspected
to result in improved clustering. This process will likely
result in clusters composed of cells of heterogeneous size.
Cluster centers will likely be larger cells surrounded by smaller
cells that further define the cluster boundaries.
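The selective subdivision proposed above could be sketched roughly as follows. This is one possible reading, not an implemented or evaluated method: it assumes points normalized to [0, 1], splits each cell into four quadrants, and treats a cell as needing no further subdivision when a hypothetical fill_ratio (the fraction of probe x probe sub-squares that contain at least one point) reaches fill_threshold. The names adaptive_cells, fill_ratio, fill_threshold, probe, and max_depth are all assumptions introduced for illustration.

```python
def adaptive_cells(points, max_depth=4, fill_threshold=0.75, probe=4):
    """Return occupied leaf cells as (x0, y0, size, point_indices).

    Assumption: points are (x, y) pairs normalized to [0, 1].
    """
    leaves = []

    def fill_ratio(indices, x0, y0, size):
        # Hypothetical density measure: share of probe x probe
        # sub-squares of the cell that hold at least one point.
        occupied = set()
        for i in indices:
            x, y = points[i]
            col = min(int((x - x0) / size * probe), probe - 1)
            row = min(int((y - y0) / size * probe), probe - 1)
            occupied.add((col, row))
        return len(occupied) / (probe * probe)

    def split(indices, x0, y0, size, depth):
        if not indices:
            return  # discard cells with no data points
        well_filled = fill_ratio(indices, x0, y0, size) >= fill_threshold
        if depth == max_depth or well_filled:
            leaves.append((x0, y0, size, indices))
            return
        # Otherwise divide only this cell into four quadrants.
        half = size / 2
        quadrants = {(0, 0): [], (1, 0): [], (0, 1): [], (1, 1): []}
        for i in indices:
            x, y = points[i]
            quadrants[(int(x >= x0 + half), int(y >= y0 + half))].append(i)
        for (qx, qy), members in quadrants.items():
            split(members, x0 + qx * half, y0 + qy * half, half, depth + 1)

    # Start from the coarsest grid: a single cell covering all the data.
    split(list(range(len(points))), 0.0, 0.0, 1.0, 0)
    return leaves
```

Under these assumptions, a well-filled region stays as one large cell while sparsely filled regions are refined down to max_depth, giving the mix of cell sizes anticipated above. Joining the resulting leaf cells into clusters is left out of this sketch.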
REFERENCES
[1] R. O. Duda, P. Hart, and D. G. Stork, Pattern Classification, 2nd ed.
Wiley, 2001.
[2] V. N. Choudhari A., Hanmandlu M. and C. R.D, Mesh based clustering
without stopping criterion, in INDICON, 2005 Annual IEEE, 2005.
[3] A. Hinneburg and D. A. Keim, Optimal grid-clustering: Towards
breaking the curse of dimensionality in high-dimensional clustering, in
Proceedings of the 25th VLDB Conference, 1999, pp. 506-517. [Online].
Available: http://fusion.cs.uni-magdeburg.de/pubs/optigrid.pdf
[4] Elena database, April 2005. [Online]. Available:
http://www.dice.ucl.ac.be/mlg/DataBases/ELENA/ARTIFICIAL/