

FSMA Information Systems Magazine


n. 4 (2009) pp. 18-36
http://www.fsma.edu.br/si/sistemas.html

Clustering Techniques

Ricardo Linden

Abstract

This chapter describes clustering methods that allow interesting features to be extracted from data by separating the objects into functional groups or by organizing them hierarchically for further study.

Keywords: Clustering; Heuristics; K-Means; Neural Networks; Graphs

1. Introduction

In this section, the main concepts associated with the clustering problem will be introduced. The reader will perceive the difficulty associated with the process and will understand the main uses of the technique. In addition, the basic mathematical tooling is presented so that the reader can understand the algorithms introduced in section 2.

1.1 Clustering: Basic Concepts

Cluster analysis, or clustering, is the name given to the group of computational techniques whose purpose is to separate objects into groups, based on the characteristics that these objects possess. The basic idea is to put objects that are similar, according to some pre-determined criterion, into the same group.
The criterion is usually based on a dissimilarity function, which receives two objects and returns the distance between them. The main functions will be discussed in section 1.2. The groups determined must have, according to some quality metric, high internal homogeneity and high external separation (heterogeneity). This means that the elements of a given group must be mutually similar and, preferably, very different from the elements of other groups.

Objects are also called examples, tuples or records. Each object represents a data entry that can be constituted by a vector of attributes, which are numeric or categorical fields (a categorical field is one that can assume one value from a set of predefined values). Examples of numerical data include age (integer), temperature (real) and salary (real), among others, while examples of categorical data include DNA bases (one of the values A, C, G or T), military rank, or whether a person is sick (a Boolean value, which can assume the values true or false) (GORDON, 1981).

Cluster analysis is a useful tool for analyzing data in many different situations. It can be used to reduce the size of a data set by summarizing a wide range of objects by the information of their cluster centers. Since clustering is an unsupervised learning technique (when learning is supervised, the process is called classification), it can also serve to extract hidden characteristics from the data and to develop hypotheses about its nature.

Unfortunately, the clustering problem has exponential complexity. This means that brute-force methods, such as enumerating all possible groupings and choosing the best configuration, are not feasible. For example, if we wanted to separate 100 elements into 5 groups, there would be about 5^100 ≈ 10^70 possible groupings. Even on a computer that could test 10^9 different configurations per second, it would take more than 10^53 years to complete the task (about 40 orders of magnitude longer than the Earth's age). Obviously, it is then necessary to seek an efficient heuristic that allows solving the problem before the Sun cools and life on Earth is extinguished.

1.2 Measures of Dissimilarity

All clustering methods we will describe in the next section assume that all relevant relationships between objects can be described by a matrix containing a measure of dissimilarity or proximity between each pair of objects.

Each entry p_ij in this matrix is a numeric value that expresses how close objects i and j are. Some metrics calculate similarity, others calculate dissimilarity, but in essence they are equivalent.

All dissimilarity coefficients are functions d: G × G → ℝ, where G denotes the set of objects with which we are working. Basically, these functions allow one to transform the data matrix, given by:
$$G = \begin{bmatrix} x_{11} & \cdots & x_{1f} & \cdots & x_{1p} \\ \vdots & & \vdots & & \vdots \\ x_{i1} & \cdots & x_{if} & \cdots & x_{ip} \\ \vdots & & \vdots & & \vdots \\ x_{n1} & \cdots & x_{nf} & \cdots & x_{np} \end{bmatrix} \qquad (1)$$
into a matrix of distances, given by:

$$D = \begin{bmatrix} 0 & & & & \\ d(2,1) & 0 & & & \\ d(3,1) & d(3,2) & 0 & & \\ \vdots & \vdots & \vdots & \ddots & \\ d(n,1) & d(n,2) & \cdots & \cdots & 0 \end{bmatrix} \qquad (2)$$

where the entry d(i, j) represents exactly the distance between elements i and j.
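As a concrete illustration of this transformation, the short sketch below builds such a distance matrix from a small data matrix. It is a minimal example, assuming NumPy and SciPy are available (neither is mentioned in the original text), and the variable names are ours.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

# Data matrix G: one row per object, one column per (numeric) attribute.
data = np.array([[0.0, 2.0, 1.0],
                 [1.0, 18.0, 2.0],
                 [0.5, 3.0, 1.5]])

# pdist returns the condensed vector of pairwise distances;
# squareform expands it into the full symmetric matrix D with a zero diagonal.
dist_matrix = squareform(pdist(data, metric='euclidean'))
print(dist_matrix)
```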
All dissimilarity functions must comply with the following basic criteria:

· d(i, j) ≥ 0, ∀ i, j ∈ G

· d(i, j) = d(j, i), ∀ i, j ∈ G. This rule states that the distance between two elements does not vary regardless of the point from which it is measured. Therefore, the matrix (2) is shown as lower triangular: since it is symmetric, the values above the diagonal are implicitly defined.

· d(i, j) + d(j, k) ≥ d(i, k), ∀ i, j, k ∈ G. This is known as the triangle inequality, and basically specifies that the shortest distance between two points is a straight line.

If, in addition to all the properties listed above, the metric also satisfies d(ax, ay) = |a| d(x, y), then it is called a norm. We will now look at the main dissimilarity measures used in clustering algorithms.
The first metric to be mentioned is the Minkowski metric. It is given by the following formula:

$$d(i, j) = \sqrt[q]{|x_{i1}-x_{j1}|^q + |x_{i2}-x_{j2}|^q + \cdots + |x_{ip}-x_{jp}|^q} \qquad (3)$$

As one can easily see from the formula, the Minkowski metrics are norms. To verify this, simply multiply all coordinates by a: inside the root, the factor |a|^q can be factored out and then taken out of the root as |a|, as required by the definition.

The best-known special case of this metric is when q = 2. In this case, we have the well-known Euclidean distance between two points, which is given by:

$$d(i, j) = \sqrt{|x_{i1}-x_{j1}|^2 + |x_{i2}-x_{j2}|^2 + \cdots + |x_{ip}-x_{jp}|^2} \qquad (4)$$
Two other special cases of the Minkowski metric are popular. When q equals 1, we have the so-called Manhattan distance, and when q goes to infinity, we have the Chebyshev distance. In this last case, the distance is given by:

$$d(i, j) = \max_{k} |x_{ik} - x_{jk}| \qquad (5)$$

The choice of the value of q depends solely and exclusively on the emphasis one wishes to give to larger differences. The higher the value of q, the greater the sensitivity of the metric to larger coordinate differences. A simple example is the distance between the points (0, 2, 1) and (1, 18, 2). Using different values of q, we obtain the following distances:

· q = 1: d = |1 − 0| + |18 − 2| + |2 − 1| = 18

· q = 2: d = [(1 − 0)² + (18 − 2)² + (2 − 1)²]^(1/2) = 258^(1/2) ≈ 16.06

· q = 3: d = [(1 − 0)³ + (18 − 2)³ + (2 − 1)³]^(1/3) = 4098^(1/3) ≈ 16.003

· q = ∞: d = max(|1 − 0|, |18 − 2|, |2 − 1|) = 16

Figure 1 shows the difference in interpretation among the various versions of the Minkowski metric when we are in two dimensions.


Figure 1. Example of distances calculated by the different metrics. The Euclidean distance (thin line) is measured along a straight line between the points. The Manhattan distance is measured one block (unit) at a time; the fact that the path is not straight does not change the distance (check!). The Chebyshev distance (dashed line) is given by the larger of the two coordinate differences.
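To make the effect of q tangible, the sketch below recomputes the example above for the points (0, 2, 1) and (1, 18, 2). It is a minimal illustration assuming only NumPy; the helper name minkowski_distance is ours, not part of any library mentioned in the text.

```python
import numpy as np

def minkowski_distance(x, y, q):
    """Minkowski distance of order q between two vectors (equation 3)."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    if np.isinf(q):
        # Chebyshev distance: the limit of the Minkowski metric as q -> infinity.
        return np.max(np.abs(x - y))
    return np.sum(np.abs(x - y) ** q) ** (1.0 / q)

a, b = (0, 2, 1), (1, 18, 2)
for q in (1, 2, 3, np.inf):
    print(q, minkowski_distance(a, b, q))
# Expected: 18.0 (Manhattan), ~16.06 (Euclidean), ~16.003, 16.0 (Chebyshev)
```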

Another distance metric whose calculation is extremely simple is the Canberra distance. This metric
calculates the sum of the fractional differences between the coordinates of pairs of objects. The formula for
this metric is given by:

$$d(\mathbf{x}, \mathbf{y}) = \sum_{i=1}^{p} \frac{|x_i - y_i|}{|x_i| + |y_i|} \qquad (6)$$
If one of the coordinates is equal to zero, the corresponding term is equal to 1. In all other cases, the term belongs to the interval [0, 1]. The total sum always gives a number belonging to the interval [0, p], where p is the size of the vectors.

If both vectors have the i-th coordinate equal to zero, the term i is defined to be zero (strictly speaking, it would be 0/0). One problem with this metric is that the distance is very sensitive to small variations when both coordinates are very close to zero.
Another problem associated with the Canberra metric is that coordinate pairs whose difference is the same, but whose values have different magnitudes, contribute differently. For example, imagine that coordinate i has the values 1 and 4. Then the contribution of term i will be |1 − 4| / (1 + 4) = 0.6. However, if coordinate j has the values 30 and 33, its contribution will be |30 − 33| / (30 + 33) ≈ 0.05. That is, the contributions are of different orders of magnitude, even though the differences are identical.
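A minimal sketch of the Canberra distance as defined in equation (6), including the 0/0 convention mentioned above; the function name is ours and only NumPy is assumed.

```python
import numpy as np

def canberra_distance(x, y):
    """Canberra distance (equation 6): sum of |x_i - y_i| / (|x_i| + |y_i|)."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    num = np.abs(x - y)
    den = np.abs(x) + np.abs(y)
    # By convention, a term where both coordinates are zero contributes 0 (avoids 0/0).
    terms = np.where(den == 0, 0.0, num / np.where(den == 0, 1.0, den))
    return terms.sum()

print(canberra_distance([1, 30], [4, 33]))  # 3/5 + 3/63 = 0.6 + ~0.048
```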
The Mahalanobis distance is a metric that differs from the Euclidean distance by taking into account the correlation between the data. The formula for the Mahalanobis distance between two vectors of the same distribution, with covariance matrix S, is given by:

$$d(\mathbf{x}, \mathbf{y}) = \sqrt{(\mathbf{x} - \mathbf{y})\, S^{-1}\, (\mathbf{x} - \mathbf{y})^T} \qquad (7)$$

If the covariance matrix is the identity matrix, this formula reduces to the Euclidean distance. If the covariance matrix is diagonal, the Mahalanobis distance reduces to the normalized Euclidean distance, whose formula is given by:

$$d(\mathbf{x}, \mathbf{y}) = \sqrt{\sum_{i=1}^{p} \frac{(x_i - y_i)^2}{s_i^2}} \qquad (8)$$


In this formula, p is the dimension of the vectors and s_i is the standard deviation of x_i in the sample space.

It can be shown that the surfaces where the Mahalanobis distance is constant are ellipsoids centered on the mean of the sample space. In the special case where the characteristics are uncorrelated, these surfaces are spherical, just as in the case of the Euclidean distance.
The use of the Mahalanobis distance corrects some of the limitations of the Euclidean distance,
since it automatically takes into account the scale of the coordinate axes and takes into account the
correlation between the characteristics. However, since there is no free lunch, there is a (high) price to pay
for these benefits. Covariance matrices can be difficult to determine, and memory and computation time
grow quadratically with the number of characteristics.
One can also use the Mahalanobis metric to measure the distance between an element x and a cluster of elements whose mean is given by m = {m_1, m_2, ..., m_n} and whose covariance matrix is S. In this case, the distance is given by the formula

$$d = \sqrt{(\mathbf{x} - \mathbf{m})\, S^{-1}\, (\mathbf{x} - \mathbf{m})^T}$$

Conceptually, it is as if we were assessing the relevance of an element not only by its distance to the center of the cluster, but also by the cluster's distribution, thus expressing the distance of the element x as a comparison with the standard deviation of the cluster. The higher the value of this metric, the greater the number of standard deviations by which the element is away from the center of the cluster, and the lower its chance of being an element of the cluster.

The detection of outliers is one of the most common uses of the Mahalanobis distance, since a high value indicates that an element lies several standard deviations away from the center (with all axis correlations taken into account) and is therefore probably an outlier. If we make the comparison against several existing groups whose distributions are known, this metric can be used as a way of determining which group an element most likely belongs to.
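A minimal sketch of the Mahalanobis distance between an element and a cluster, as described above; the cluster data are invented for illustration, the function name is ours and NumPy is assumed.

```python
import numpy as np

def mahalanobis_distance(x, cluster):
    """Mahalanobis distance from point x to the cluster's mean (see equation 7)."""
    cluster = np.asarray(cluster, dtype=float)
    mean = cluster.mean(axis=0)
    # rowvar=False: each row is an observation, each column a characteristic.
    cov_inv = np.linalg.inv(np.cov(cluster, rowvar=False))
    diff = np.asarray(x, dtype=float) - mean
    return np.sqrt(diff @ cov_inv @ diff)

cluster = np.array([[1.0, 2.0], [2.0, 2.5], [3.0, 1.5], [4.0, 3.0]])
print(mahalanobis_distance([2.5, 2.25], cluster))  # close to the cluster's center
print(mahalanobis_distance([2.5, 9.0], cluster))   # many standard deviations away: likely outlier
```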

Another interesting metric is the correlation. It measures the strength of the relationship between two variables, so it is a measure of similarity. The correlation coefficient is represented by the letter "r" and has no units. Its range of values is [−1, 1], where the magnitude of the correlation represents its strength and its sign represents the type of relationship: positive signs indicate a direct association (when one grows, the other also grows), whereas negative signs indicate an inverse association (when one grows, the other decreases).

The correlation is calculated using the following formula:

$$r = \frac{1}{n-1} \sum_{i=1}^{n} \left( \frac{x_i - \bar{x}}{s_x} \right) \left( \frac{y_i - \bar{y}}{s_y} \right) \qquad (9)$$

In this formula, x̄ and s_x represent the mean and standard deviation of the values of x, ȳ and s_y those of the values of y, and n is the number of individuals.
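The sketch below computes r for two short vectors, both by direct application of equation (9) and with NumPy's built-in corrcoef, which should agree; the data values are illustrative only.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.1, 5.9, 8.2, 9.8])

# Direct application of equation (9), using sample standard deviations (ddof=1).
n = len(x)
r_manual = np.sum((x - x.mean()) / x.std(ddof=1) * (y - y.mean()) / y.std(ddof=1)) / (n - 1)

# NumPy's corrcoef returns the full 2x2 correlation matrix; entry [0, 1] is r.
r_numpy = np.corrcoef(x, y)[0, 1]
print(r_manual, r_numpy)  # both close to 1: strong direct association
```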

1.3 Representation of Groups

There are several different ways to represent the groups created by the clustering algorithms explained in this
chapter. The most usual ways are shown in Figure 2. Some formats are associated with specific methods,
such as dendrograms (Figure 2d), which are associated with hierarchical methods, while others are more
generic such as partitions and Venn diagrams (Figures 2a and b, respectively). The tables (Figure 2c) are
more useful in the case of methods that use nebulous logic, but can be used in other cases, having similar
representation power to the Venn diagrams, for example. Each format will be explained with details and used
when discussing grouping heuristics in section 2.


Figure 2. Examples of various forms of group representation.

1.4 Applications of Clustering Techniques

Grouping techniques have been used in several different areas. Some of the specific applications, in
addition to the bioinformatics area, which will be described later, include:
· Marketing: It can help people linked to marketing areas to discover distinct groups in their
customer bases, so that this knowledge is then used to develop targeted marketing programs
(CHIANG, 2003).
· Land use : Identification of the possibility of allocation of land use for agrarian and / or urban purposes
in a satellite observation database of the entire planet Earth (LEVIA JR, 2000).
· Insurance: Identify groups of people who have car insurance with a high accident rate (YEOH,
2001).
· World Wide Web : grouping of documents according to semantic similarities, in order to improve
the results offered by search sites (HAMMOUDA, 2002).
· Earthquake studies : Analysis of real and synthetic data of earthquakes for the extraction of
characteristics that allow prediction of seismic event precursors (DZWINNEL, 2005).

2. Clustering Heuristics

In this section, we will present some of the algorithms most used in the clustering area, with each algorithm described in full, together with a discussion of its main strengths and weaknesses.

2.1. K-Means


K-Means is a non-hierarchical clustering heuristic that iteratively seeks to minimize the distance from the elements to a set of k centers given by C = {x_1, x_2, ..., x_k}. The distance between a point p_i and the set of clusters, given by d(p_i, C), is defined as the distance from the point to the center closest to it. The function to be minimized is then given by:

$$d(P, C) = \frac{1}{n} \sum_{i=1}^{n} d(p_i, C)^2$$
The algorithm depends on a parameter (k = number of clusters) defined ad hoc by the user. This is usually a
problem, since it is not usually known how many clusters exist a priori .
The K-Means algorithm can be described as follows:

1. Choose k different values for centers of groups (possibly, randomly)


2. Associate each point with the nearest center
3. Recalculate the center of each group
4. Repeat steps 2-3 until no element changes group.

This algorithm is extremely fast, usually converging in a few iterations to a stable configuration in
which no element is designated for a cluster whose center is not closest to it. An example of the execution of
the K-Means algorithm can be seen in figure 3.

Figure 3. Example of the execution of the K-Means algorithm. (a) Each element was assigned to one of three randomly selected groups and the centroids (larger circles) of each group were calculated. (b) The elements were then assigned to the groups whose centroids are closest to them. (c) The centroids were recalculated. The groups are already in their final form. If they were not, we would repeat steps (b) and (c) until they were.

This iterative algorithm tends to converge to a minimum of the energy function defined above. One possible problem is that this condition emphasizes the issue of homogeneity and ignores the important issue of good separation between clusters. This can cause a bad separation of the sets in the case of a bad initialization of the centers, since this initialization is done arbitrarily (randomly) at the beginning of the execution. In Figure 4 we can see the effect of a bad initialization on the execution of the K-Means algorithm.

Another point that can affect the quality of the results is the choice of the number of sets made by
the user. Too few sets can cause two natural clusters to merge, while too large a number can cause one natural
cluster to be artificially broken in two.


Figure 4. Example of the effect of a bad initialization of the K-Means algorithm . There
are three natural clusters in this set, two of which were assigned to the solid color
group. The problem is that the third cluster is well separated from the other two
clusters, and if two clusters are initialized from that side of the figure, they will be on
that side.

The great advantage of this heuristic is its simplicity. An experienced programmer can implement his own version in about an hour's work. In addition, several demos are available on various websites, among which we can highlight the one located at http://home.dei.polimi.it/matteucc/Clustering/tutorial_html/AppletKM.html (last accessed November 2009).
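As a rough idea of what such an implementation looks like, the sketch below follows the four steps listed earlier (random initialization, assignment, centroid update, repetition). It is a minimal NumPy version, the function name and the convergence test are ours, and the demo data are invented.

```python
import numpy as np

def k_means(points, k, max_iter=100, seed=0):
    """Minimal K-Means: returns (centers, labels) for an (n, d) array of points."""
    rng = np.random.default_rng(seed)
    # Step 1: choose k distinct points as initial centers (randomly).
    centers = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(max_iter):
        # Step 2: assign each point to the nearest center.
        distances = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Step 3: recompute each center as the mean of its group.
        new_centers = np.array([points[labels == j].mean(axis=0) if np.any(labels == j)
                                else centers[j] for j in range(k)])
        # Step 4: stop when no center (and hence no assignment) changes.
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers, labels

points = np.vstack([np.random.default_rng(1).normal(loc, 0.3, size=(20, 2))
                    for loc in ((0, 0), (3, 3), (0, 4))])
centers, labels = k_means(points, k=3)
print(centers)
```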

2.2. Hierarchical Algorithms

Hierarchical algorithms create a hierarchy of relationships between the elements. They are very popular in the field of bioinformatics and work very well, although they have no theoretical justification based on statistics or information theory, constituting an ad hoc technique of high effectiveness.

There are two versions: the agglomerative, which operates by creating sets from isolated elements,
and the divisive one, which begins with a large set and breaks it into parts until it reaches isolated elements.
The agglomerative version works according to the following algorithm:

1. Make a cluster for each element


2. Find the most similar pair of clusters, according to a chosen distance measure
3. Merge them into a larger cluster and recalculate the distance of this new cluster to all other elements
4. Repeat steps 2 and 3 until a single cluster remains.

We can work through a simple example to visualize the operation of the hierarchical algorithms. To do so, let us take the set of points given by S = {2; 4; 6.3; 9; 11.6} and cluster these elements. First we calculate the distances between the points directly, using, for example, the Euclidean distance:


$$D = \begin{bmatrix} 0 & & & & \\ 2 & 0 & & & \\ 4.3 & 2.3 & 0 & & \\ 7 & 5 & 2.7 & 0 & \\ 9.6 & 7.6 & 5.3 & 2.6 & 0 \end{bmatrix}$$


As we can see, the closest elements are the first (2) and the second (4). We then perform the grouping shown in figure 5 (b). The distances can be recalculated by replacing the two grouped elements by their mean (3), and they are now as follows:
$$D = \begin{bmatrix} 0 & & & \\ 3.3 & 0 & & \\ 6 & 2.7 & 0 & \\ 8.6 & 5.3 & 2.6 & 0 \end{bmatrix}$$
At this point, the two closest elements are the third (9) and the fourth (11.6), and we perform the grouping shown in figure 5 (c). The new distances are recalculated, obtaining the following values:

$$D = \begin{bmatrix} 0 & & \\ 3.3 & 0 & \\ 7.3 & 4 & 0 \end{bmatrix}$$
We perform the next grouping between the first (3) and the second (6.3) elements, obtaining the sets shown in figure 5 (d). Finally, it remains to group the last two groups, obtaining the final grouping shown in figure 5 (e).

Figure 5. Step-by-step representation of the hierarchical algorithm usage example.

It should be noted that the hierarchical structure formed by the successive unions between the elements was represented by a dendrogram, which is the most usual representation of the results of hierarchical algorithms and intuitively shows the order of the groupings. The higher the line connecting two clusters, the later their grouping was performed. Therefore, the height of the line connecting two clusters is proportional to their distance.
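The worked example above can also be reproduced with standard library routines. The following is a minimal sketch assuming SciPy and Matplotlib are installed (neither is mentioned in the original text); centroid linkage is used because it mirrors the "replace the merged elements by their mean" rule of the step-by-step matrices.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram

# The points of the worked example, as a column vector (one 1-D observation per row).
points = np.array([2.0, 4.0, 6.3, 9.0, 11.6]).reshape(-1, 1)

# 'centroid' linkage merges the closest pair and represents the merged group by its mean.
# Z encodes the whole merge history: each row lists the two clusters merged,
# their distance and the size of the new cluster.
Z = linkage(points, method='centroid')
print(Z)

dendrogram(Z)   # draws the dendrogram of the merge history
plt.show()
```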

There are three main ways of measuring the distance between two clusters, which are described below and exemplified in figure 6.
· Single link: The distance between two clusters is given by the distance between their closest points, also called "nearest neighbor clustering". It is a greedy method, which prioritizes the closer elements and leaves the more distant ones in the background.
· Average link: The distance is given by the distance between their centroids. The problem is the need to recalculate the centroids, and the distances between clusters, each time a cluster changes.
· Complete link: The distance between two clusters is the distance between their most distant points.

Figure 6. Distance measures between two clusters for use in hierarchical algorithms.
(a) single link, where the distance between clusters is given by the distance between its
closest points. (b) average link, where the distance between the clusters is given by the
distance between their centroids and (c) complete link, where the distance between the
clusters is given by the distance between their most distant points.

These three methods of measuring distance are not fully equivalent. Different choices among these
ways of measuring distances can generate different results in the clustering, as can be seen in figure 7.

Figure 7. Example of hierarchical algorithm application with two different ways of measuring distance. Given the distances between cities of the United States, the hierarchy was constructed using the single-link distance measure (on the left) and the complete-link measure (on the right). Note that the generated sets are distinct. In the single-link case, Miami is the last city to be added, while in the complete-link case it is quickly added to the left set (example taken from http://www.analytictech.com/networks/hiclus.htm , last accessed November 2009).

The problem with this algorithm is that it yields a hierarchy of relationships, but not specific groups. To obtain k groups, simply cut the k-1 highest edges of the dendrogram. An example can be seen in figure 8.

Figure 8. Example of cutting a dendrogram obtained through the hierarchical algorithm in order to obtain clusters. In figure (a), one cut was made and two groups were obtained. In figure (b), three cuts were made and four groups were obtained ({A, B}, {E}, {C, D, F, G}, {H}).
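Under the same assumption that SciPy is available, this cut can be sketched with fcluster; the choice of two groups below is ours, for illustration, and reuses the five points of the worked example.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

points = np.array([2.0, 4.0, 6.3, 9.0, 11.6]).reshape(-1, 1)
Z = linkage(points, method='centroid')

# Keep exactly 2 groups: equivalent to removing the k-1 = 1 highest edge of the dendrogram.
labels = fcluster(Z, t=2, criterion='maxclust')
print(labels)   # e.g. [1 1 1 2 2]: {2, 4, 6.3} in one group, {9, 11.6} in the other
```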

There is a divisive version of the hierarchical clustering algorithm. This version starts with a single cluster containing all the elements and begins to divide it. In each iteration, a flat algorithm (such as K-Means, for example) is used to separate the current set into smaller clusters, repeating the process recursively until only sets made up of a single element remain or until a predetermined stopping criterion is met.

Note that it is customary to divide the selected cluster into two smaller clusters, because this makes the technique appear to be the inverse of the agglomerative algorithms. However, the number of clusters generated does not necessarily have to be equal to two. One way to choose the number of clusters to generate is to use a cut cost optimization metric. This uses two basic costs, called intra-cluster and extra-cluster similarities, and seeks to maximize the relationship between the two in order to obtain a partitioning in which the generated clusters have greater cohesion. Mathematically, we have the following:

Given:

· V = {v_1, v_2, ..., v_n}, the set of all existing objects;

· {C_1, C_2, ..., C_k}, with V = ∪_{i=1}^{k} C_i, a partitioning of the objects; then we have:

· the intra-cluster similarity, given by

$$\text{intra}_p = \sum_{v_i, v_j \in C_p} \text{sim}(v_i, v_j)$$

· the extra-cluster similarity, given by

$$\text{extra}_p = \sum_{v_i \in C_p,\, v_j \notin C_p} \text{sim}(v_i, v_j)$$

Then, the cost of a partitioning with k sets is given by

$$\text{cost}_k = \sum_{p=1,\dots,k} \frac{\text{intra}_p}{\text{extra}_p}$$

choosing then the partitioning for which this value is as large as possible, that is, the one whose clusters are most cohesive relative to their separation.
This technique is extremely computationally costly, because it requires a large number of partitionings to be evaluated at each iteration, which can take a long time if the number of elements is large. The cost may be limited by setting a ceiling on the value of k, but even so this algorithm may have a running time too long to be acceptable.


The operation of the divisive version is more complex than that of the agglomerative algorithm,
since it depends on a flat algorithm such as the K-Means for its operation. However, it has two main
advantages:
1. It is not necessary to recalculate all distances between clusters at each step and we can interrupt the
procedure before we reach the last level of the tree. Thus, the algorithm is potentially faster than its
agglomerative version.
2. We begin by using information about the whole, rather than making decisions based on local information, as in the agglomerative case. Thus, the clusters obtained tend to be more faithful to the real distribution of the information among the elements.

An interesting fact is that this is one of the possible ways to determine how many sets we will define for the K-Means technique. Instead of defining k a priori, we perform the following algorithm:

Define a set with all the elements and insert it into a queue of sets to be treated;
While there are sets in the queue do
    Remove the first set from the queue;
    Calculate the loss of information related to this set;
    If the value of the loss of information is greater than a pre-set limit
        Perform K-Means with k = 2;
        Place the two sets obtained in the queue of sets to be treated;
    End If
End While

It may seem that we merely substitute one arbitrariness for another (after all, at first glance we only replace k by the information loss limit). However, we notice that, instead of arbitrarily defining the
number of sets we want, we define a quality metric for each set that must be defined according to the
problem domain we are trying to solve. That is, our problem imposes an inherent quality on the sets and this
is what we are going to look for.
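A minimal sketch of the queue-based procedure above is shown below. The "loss of information" is taken here as the mean squared distance to the centroid, which is our assumption (any domain-specific quality metric could be plugged in), and scikit-learn's KMeans, not mentioned in the original text, is assumed to be available for the k = 2 splits.

```python
import numpy as np
from collections import deque
from sklearn.cluster import KMeans

def bisecting_clusters(points, loss_limit):
    """Split clusters with K-Means (k=2) while their 'information loss' exceeds loss_limit."""
    queue = deque([np.arange(len(points))])      # queue of clusters, stored as index arrays
    final_clusters = []
    while queue:
        idx = queue.popleft()
        cluster = points[idx]
        # Loss of information: mean squared distance of the cluster's points to its centroid.
        loss = np.mean(np.sum((cluster - cluster.mean(axis=0)) ** 2, axis=1))
        if loss > loss_limit and len(idx) > 1:
            labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(cluster)
            queue.append(idx[labels == 0])
            queue.append(idx[labels == 1])
        else:
            final_clusters.append(idx)
    return final_clusters

rng = np.random.default_rng(0)
data = np.vstack([rng.normal(c, 0.2, size=(15, 2)) for c in ((0, 0), (4, 0), (2, 5))])
print([len(c) for c in bisecting_clusters(data, loss_limit=0.5)])   # expect roughly [15, 15, 15]
```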

Another way of performing a divisive hierarchical algorithm is to start from the minimum spanning tree. This can be defined as the tree that connects all the elements of the set in such a way that the sum of the distances of all its edges is minimal. Once this tree is available (there is an algorithm for determining it that is practically linear with respect to the number of elements), it can be divided into two sets in one of the following ways:

· Removing the largest of all edges;


· Removing an edge of length l such that l >> l', where l' represents the average length of the edges of the group. This method has the advantage of allowing denser clusters to also have their outliers removed, which might otherwise take longer, since, being dense, all their edges tend to have a relatively small length.
Using any of the techniques we will obtain two clusters for the cluster being divided. We then
repeat the process until the desired number of clusters is reached.
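A minimal sketch of the first variant (removing the largest edge), assuming SciPy: it builds the minimum spanning tree of the complete distance graph and removes its largest edge, leaving two connected components. The sample points are invented for illustration.

```python
import numpy as np
from scipy.sparse.csgraph import minimum_spanning_tree, connected_components
from scipy.spatial.distance import pdist, squareform

points = np.array([[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]], dtype=float)

# Complete graph weighted by Euclidean distances, then its minimum spanning tree.
dist = squareform(pdist(points))
mst = minimum_spanning_tree(dist).toarray()

# Remove the largest of all MST edges and read off the two resulting components.
i, j = np.unravel_index(np.argmax(mst), mst.shape)
mst[i, j] = 0
n_comp, labels = connected_components(mst, directed=False)
print(n_comp, labels)   # 2 components: the first three points vs. the last three
```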
In the past, hierarchical algorithms were extremely popular, but with the advancement of flat techniques the latter have gained ground. Therefore, the decision to use hierarchical algorithms is complex to make, and we must take into account a number of important factors.

The first factor is memory consumption. To use a hierarchical algorithm we must store O(n²) distances. Even with the price of memory falling extremely fast, if we have 10,000 elements we will have to store a number of distances of the order of 10^8. This value is still within the reach of a personal computer, but it is already a respectable amount and must be managed with care (and efficiency).
Non-hierarchical algorithms usually depend on a series of factors that are arbitrarily determined by the researcher, such as the number of sets and the seeds of each set. This may have a negative impact on the quality of the generated sets. Agglomerative hierarchical algorithms are not subject to these factors, being totally deterministic and independent of parameterization.
The main advantage of the hierarchical algorithms is that they offer not only the clusters obtained, but also the entire structure of the data, and they make it easy to obtain sub-sets within these data. In addition, they allow one to visualize directly, through the dendrogram, the way the data are linked and the true similarity between two different points, whether they belong to the same cluster or not.

2.3 Self-Organizing Feature Maps

A system is a group of parts that interact and that operate as a whole, possessing collective
properties distinct from each of the parts. Systems have an organization to obtain a specific function. The
organization can be of two types:
· External, when it is imposed by external factors such as machines, walls, etc.
· Self-organization: when the system evolves into an organized form on its own. Examples: crystallization, the brain, the economy, etc.
The brain is clearly an organized system that organizes itself spontaneously, without a teacher. In the old days, the dominant scientific view was that there were homunculi within the brain guiding learning. Nowadays, Hebb's learning rule is more widely accepted. It postulates that when a neuron A repeatedly and persistently helps to fire a neuron B, the efficacy of the association between the two cells increases (HEBB, 1949).

These changes occur in the brain in three possible ways: by increasing the number of transmitters
emitted by cell A, by increasing the strength of the synaptic connection or by forming new synapses. This

allows us to conclude that the brain is a self-organized system , and can learn by modifying the
interconnections between neurons.

The main result of brain self-organization is the mapping of features onto linear or planar topologies in the brain. Two real examples are the tonotopic map (sound frequencies are spatially mapped in the cortex in a progression from the lowest to the highest frequencies) and the retinotopic map (the visual field is mapped in the visual cortex with higher resolution at the center).
There is a type of neural network that implements the concepts of cerebral self-organization. These networks are known as Self-Organizing Feature Maps (SOFMs). Other common names for them are competitive filter associative memories or Kohonen networks, in honor of their creator, Dr. Eng. Teuvo Kohonen.

A Kohonen network consists of two layers (one input and one competitive), as we can see in figure
9. Each input layer neuron represents a dimension of the input pattern and distributes its vector component
to the competitive layer.


Figure 9. Example of the topology of a Kohonen network. The neurons of the input layer
are connected to all neurons of the (competitive) output layer through weights. However,
the neurons of the output layer are connected to some neurons of the same layer, and
these constitute their neighborhood (BRAGA, 2000).

Each neuron in the competitive layer receives the weighted sum of the inputs and has a neighborhood of k neurons, which can be arranged in 1, 2 or n dimensions. Upon receiving an input, some neurons will be excited enough to fire. Each neuron that fires can have an excitatory or inhibitory effect on its neighborhood.

After the initialization of the synaptic weights, three basic processes follow:


· Competition: The neuron with the best value of the discriminant function is selected as the winner, given by i(x) = arg min_j ||x − w_j||, j = 1, 2, ..., l. The winner determines the location of the center of the neighborhood of neurons to be trained (excited).
· Cooperation: The neighbors of the winning neuron are also excited, through a neighborhood function.
· Synaptic adaptation: The excited neurons adjust their synaptic weights. When a neuron wins a competition, not only it, but also all the nodes located in its neighborhood are adjusted, according to the function shown in figure 10.

Figure 10. Adjustment applied to the elements in the neighborhood of the winning neuron when the synaptic weights are updated. Note that the closest elements are adjusted positively, while the farther ones are inhibited (adjusted negatively). The very distant elements receive a small positive adjustment, so as to avoid that outlier neurons are never adjusted (BRAGA, 2000).

The algorithm can then be given by the following steps:


Start with small random weights
Repeat until the network stabilizes or until the number of iterations exceeds t
    Normalize the weights
    Apply an input I_u to the network
    Find the winning neuron, i*
    Adjust the weights of neuron i* and of its neighbors according to:

        Δw_j = η · O · (I_uj − w_j),

    where O is the neuron output, calculated according to the following criterion:
        the output of i* is 1,
        that of its neighborhood is given by the Mexican hat function of figure 10,
    and j is the index of the coordinate, j = 1, 2, ..., p.

Kohonen networks are excellent for classification problems. Their problem is that the system is a black box, with no guarantee of convergence for networks of larger dimensions, and training can be extremely long for large classification problems. On the other hand, they have the advantage of not only discovering the sets but also showing an ordering among them, given the notion of neural neighborhood (OJA et al, 2002), which can be useful in many applications.
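A minimal NumPy sketch of the training loop described above, for a one-dimensional map; the Gaussian neighborhood used here is a common simplification of the Mexican hat function of Figure 10, and all names, parameter values and decay schedules are ours, not the article's.

```python
import numpy as np

def train_som(data, n_neurons=10, n_iter=500, lr=0.5, sigma=2.0, seed=0):
    """Train a 1-D Kohonen map: returns the (n_neurons, d) weight matrix."""
    rng = np.random.default_rng(seed)
    weights = rng.random((n_neurons, data.shape[1]))           # small random weights
    positions = np.arange(n_neurons)                           # positions on the 1-D map
    for t in range(n_iter):
        x = data[rng.integers(len(data))]                      # apply one input to the network
        winner = np.argmin(np.linalg.norm(weights - x, axis=1))   # competition: argmin ||x - w_j||
        # Cooperation: neighborhood function centered on the winner (shrinks over time).
        s = sigma * (1 - t / n_iter) + 0.01
        h = np.exp(-((positions - winner) ** 2) / (2 * s ** 2))
        # Synaptic adaptation: w_j += eta * h_j * (x - w_j), for the winner and its neighbors.
        eta = lr * (1 - t / n_iter) + 0.01
        weights += eta * h[:, None] * (x - weights)
    return weights

data = np.random.default_rng(1).random((200, 2))
print(train_som(data))   # neighboring neurons end up with similar (ordered) weight vectors
```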

2.4 Methods based on graphs

A graph is a mathematical model that represents relations between objects. A graph is a pair given by G = (V, E), where V is a finite set of points, usually called nodes or vertices, and E is a relationship between vertices, i.e., a set of pairs of V × V. In Figure 11 we can see some examples of graphs. In example (a), we have a graph with 5 nodes and the set V = {1, 2, 3, 4, 5}. These nodes are connected to each other by some edges, defining the set E = {(1,1), (1,2), (3,4), (3,5), (4,5)}. A graph can be represented by an adjacency matrix A, where A_ij = 0 if there is no edge (i, j) and A_ij = 1 if the edge in question exists, as can be seen in figure 11.

The number of edges that leave or enter a certain vertex is called the degree or the connectivity of this vertex. For example, in figure 11 (a), vertex 4 has connectivity 2, whereas in figure 11 (b), vertex 4 has connectivity equal to 3.

The vertices and edges of a graph can receive values or labels, in which case the graph is called a labeled graph. Labels can represent the strength of the connection between two elements. The examples of figure 11 (a) and (b) are unlabeled, while the example of figure 11 (c) is a labeled graph, in which, for example, the edge between nodes 3 and 4 has a label equal to 8.

The edges of a graph can have a direction, and in this case the graph is called a directed graph. Otherwise, the graph is called undirected. To indicate the direction, the edges are represented with arrows at the destination end. In the case of an undirected graph, the edges (e_1, e_2) and (e_2, e_1) are identical, which does not happen in a directed graph. In figure 11 (b) we see a directed graph, and one can see how it was necessary to add an edge (4,1) in addition to the edge (1,4). If, in addition to being directed, the graph is labeled, it is called a network.


Figure 11. Examples of graphs. (a) An undirected, unconnected, unlabeled clique graph. (b) A directed, unlabeled and unconnected graph. This graph is not a network because its edges are not labeled. (c) An undirected, connected and labeled graph.

A graph is said to be connected if there is a path between any two nodes of the graph, where a path is a sequence of edges {(e_1, e_2), (e_2, e_3), ..., (e_{k-1}, e_k)}, in which e_1 is the origin and e_k is the destination. The example of figure 11 (c) is connected, but examples (a) and (b) are not. In the first case, it is not possible to find a path between nodes 1 and 4, while in the second there is no path leaving node 2 and arriving at node 1 (the edge exists, but it is directed from 1 to 2, and not vice versa).

A graph is said to be complete if every pair of vertices is connected by an edge. A clique graph is a graph in which each connected component is a complete graph. The example of figure 11 (a) is a clique graph, since the graph can be separated into two subgraphs G_1 and G_2, as can be seen by separating the adjacency matrix into sub-matrices (G_1 is represented by the matrix in the upper left corner, while G_2 is represented by the matrix in the lower right corner; the other two sub-matrices have all elements equal to zero, indicating that the subgraphs are not connected to each other). The first contains vertices 1 and 2, while the second contains vertices 3, 4 and 5. Note that, in each of the subgraphs, every vertex is linked by an edge to all of its subgraph mates.

One way to group elements consists of defining a graph based on the data and assuming that the graph obtained represents a corrupted clique graph. It is then enough to look for the minimum number of edges to be added or removed in order to obtain a clique graph. The complete subgraphs obtained will then represent the most suitable clusters for the data. An example of this technique can be seen in figure 12.

Unfortunately this problem, called the corrupted cliques problem, is NP-hard, implying that no practical exact algorithm to solve it is known. Fortunately there are heuristics to help us. Let us discuss one of them, but first we will understand how we can create the graph on which the heuristic will be applied.

Figure 12. Example of obtaining a clique graph through a minimum number of changes to a graph. (a) The original graph. The edges marked with an X will be added. (b) The obtained clique graph. The dotted edges were inserted. The minimum number of changes needed to obtain this clique graph is 4.


If the distances between the elements have been calculated using any of the metrics defined in section 1.2, we can assemble a matrix D in which each element D_ij represents the distance between the elements i and j. One can then arbitrarily choose a threshold θ such that there will be an edge between the nodes i and j if the distance d_ij is less than θ (greater, in case we use the correlation). An example of the application of this technique can be seen in figure 13.

Figure 13. Example of obtaining a graph from the calculated distances. A threshold θ = 1.1 was applied to the distance matrix obtained. The entries marked in bold are below this distance. A graph representation matrix is then generated, with entries equal to one at the points marked in bold. The graph represented by this matrix can be seen in its graphical form next to it.
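A short sketch of this construction, in the spirit of Figure 13: distances below a threshold θ become edges of the graph (adjacency 1), everything else 0. NumPy is assumed and the sample distance matrix is illustrative, not the one in the figure.

```python
import numpy as np

# Symmetric distance matrix between 4 hypothetical elements (zero diagonal).
D = np.array([[0.0, 0.9, 2.5, 1.0],
              [0.9, 0.0, 3.1, 2.2],
              [2.5, 3.1, 0.0, 0.8],
              [1.0, 2.2, 0.8, 0.0]])

theta = 1.1
# Edge (i, j) exists when d_ij < theta; the diagonal is cleared (no self-loops).
A = (D < theta).astype(int)
np.fill_diagonal(A, 0)
print(A)
```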

An extremely practical algorithm to tackle the corrupted cliques problem is CAST (Cluster Affinity Search Technique), initially defined in (BEN-DOR et al, 1999). The algorithm iteratively constructs the partition P of a set S by finding a cluster C such that no element i ∈ C is distant from C and no element i ∉ C is close to C, given a graph formed from a distance threshold θ. The algorithm is as follows:

S = set of vertices of the graph G
P = ∅
While S ≠ ∅
    C = {v}, where v is the vertex of maximum degree in graph G
    Add to C all vertices (genes) connected to an element of C
    P = P ∪ C
    S = S \ C
    Remove the vertices of cluster C from graph G
End While
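A minimal Python sketch of the simplified loop above, operating on an adjacency matrix like the one built for Figure 13; it follows the greedy pseudocode given here, not the full affinity-based procedure of BEN-DOR et al. (1999), and all names and the sample matrix are ours.

```python
import numpy as np

def cast_like_clusters(A):
    """Greedy clustering over an adjacency matrix A, following the pseudocode above."""
    remaining = set(range(len(A)))
    partition = []
    while remaining:
        # C = {v}, where v is the remaining vertex of maximum degree in the current graph.
        seed = max(remaining, key=lambda v: sum(A[v, u] for u in remaining))
        cluster = {seed}
        # Add to C all remaining vertices connected to an element of C.
        cluster |= {v for v in remaining if any(A[v, u] for u in cluster)}
        partition.append(sorted(cluster))    # P = P + C
        remaining -= cluster                 # S = S \ C (C's vertices leave the graph)
    return partition

A = np.array([[0, 1, 1, 0, 0],
              [1, 0, 1, 0, 0],
              [1, 1, 0, 1, 0],
              [0, 0, 1, 0, 1],
              [0, 0, 0, 1, 0]])
print(cast_like_clusters(A))   # [[0, 1, 2, 3], [4]]
```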

There are several other graph-based algorithms that can be considered in place of CAST. Among them we can mention PCC, defined in the same article as CAST (BEN-DOR et al, 1999), which is based on a stochastic error model for the inputs, and CLICK (SHARAN et al, 2000), which is also based on a statistical model that assigns a probabilistic meaning to the edges of the graph. Both have also been widely used and are much more complex than CAST, but they should be considered by all those who wish to expand their knowledge of graph-based methods.

3. Conclusion

In this tutorial we have seen the main concepts of clustering and several heuristics applied to this problem, which can be applied to several areas of knowledge.

When no preexisting information exists, there is no "optimal" classification for a data set. In this case, clustering methods serve as a tool for exploring the data which, if used impartially, can partition the data without the effect of prejudices and arbitrary notions. Given that this problem is NP-hard, it is necessary to apply a heuristic to solve it.
The large number of existing clustering techniques can be confusing for the beginner. Sometimes, instead of choosing, the ideal may be to use all of them at once. There are works, such as (HU et al, 2004), that simultaneously use several techniques and combine their results to make a final decision.

A point that should always be emphasized is that whoever works with data from a field of knowledge should not disregard the knowledge of that field. All data and results should be analyzed in the light of existing knowledge: no algorithm should be used in isolation in any area, without thinking about the ramifications of the conclusions that its use can impose.

All of these issues are important, not only for clustering techniques but also for the other algorithms used in the desired application area. Often, clustering techniques will serve as input to other algorithms or as preparation for experiments.

4. References

BEN-DOR, A., SHAMIR, R., YAKHINI, Z. (1999), "Clustering Gene Expression Patterns", Journal of Computational Biology, v. 6, n. 3, pp. 281-297.
BRAGA, A. P., CARVALHO, A. C. P. L. F., LUDERMIR, T. B. (2000), "Artificial Neural Networks - Theory and Applications", 1st edition, LTC Editora, Brazil.
CHIANG, I. W-Y., LIANG, G. S., YAHALOM, S. Z. (2003), "The fuzzy clustering method: Applications in the air transport market in Taiwan", The Journal of Database Marketing & Customer Strategy Management, v. 11, pp. 149-158.
DZWINEL, W., YUEN, D. A., BORYCZKO, K., et al. (2005), "Nonlinear multidimensional scaling and visualization of earthquake clusters over space, time and feature space", Nonlinear Processes in Geophysics, n. 12, pp. 117-128.
GORDON, A. D. (1981), Classification, Chapman and Hall.
HAIR, J. F. JR., ANDERSON, R. E., et al. (1995), Multivariate Data Analysis, Prentice Hall.
HAMMOUDA, K. M. (2002), "Web Mining: Identifying Document Structure for Web Document Clustering", Master's thesis, Department of Systems Design Engineering, University of Waterloo, Canada.
HAYKIN, S. (2001), Neural Networks: Principles and Practices, Ed. Bookman.
HEBB, D. O. (1949), The Organization of Behavior: A Neuropsychological Theory, New York: Wiley.
HU, X., YOO, I. (2004), "Cluster ensemble and its applications in gene expression analysis", ACM International Conference Proceeding Series, v. 55, Proceedings of the Second Conference on Asia-Pacific Bioinformatics, v. 29, pp. 297-302, New Zealand.
JACKSON, C. R., DUGAS, S. L. (2003), "Phylogenetic analysis of bacterial and archaeal arsC gene sequences suggests an ancient, common origin for arsenate reductase", BMC Evolutionary Biology, v. 3, n. 18.
JIANG, D., ZHANG, A. (2002), "Cluster Analysis for Gene Expression Data: A Survey", University of New York.
JONES, N. C., PEVZNER, P. (2004), An Introduction to Bioinformatics Algorithms, MIT Press, USA.
KIM, J., WARNOW, T. (1999), "Tutorial on Phylogenetic Tree Estimation", Yale University.
LEVIA JR., D. F., PAGE, D. R. (2000), "The Use of Cluster Analysis in Distinguishing Farmland Prone to Residential Development: A Case Study of Sterling, Massachusetts", Environmental Management, v. 25, n. 5, pp. 541-548.
LINDEN, R. (2005), "A Hybrid Algorithm for Extracting Knowledge in Bioinformatics", Doctoral Thesis, COPPE-UFRJ, Rio de Janeiro, Brazil.
MARTÍNEZ-DIAZ, R. A., ESCARIO, J. E., NOGAL-RUIZ, J. J., et al. (2001), "Relationship between Biological Behavior and Randomly Amplified Polymorphic DNA Profiles of Trypanosoma cruzi Strains", Mem. Inst. Oswaldo Cruz, v. 96, n. 2, pp. 251-256, Brazil.
MATIOLI, S. R. (ed.) (2001), Molecular Biology and Evolution, Holos Editora.
OJA, M., NIKKILA, J., TORONEN, P., et al. (2002), "Exploratory clustering of gene expression profiles of mutated yeast strains", in Computational and Statistical Approaches to Genomics, Zhang, W. and Shmulevich, I. (eds.), pp. 65-78, Kluwer Press.
SHARAN, R., SHAMIR, R. (2000), "CLICK: A clustering algorithm with applications to gene expression analysis", Proc. Int. Conf. Intell. Syst. Mol. Biol., v. 8, pp. 307-316.
YEOH, E., ROSS, M. E., SHURTLEFF, S. A., et al. (2002), "Classification, subtype discovery, and prediction of outcome in pediatric acute lymphoblastic leukemia by gene expression profiling", Cancer Cell, v. 1, n. 1, pp. 133-143.
