
Mathematical and Computer Modelling 44 (2006) 439–450

www.elsevier.com/locate/mcm

Optimising the mutual information of ecological data clusters using evolutionary algorithms
H.R. Maier a,∗ , A.C. Zecchin a , L. Radbone a , P. Goonan b
a Centre for Applied Modelling in Water Engineering, School of Civil and Environmental Engineering, The University of Adelaide, Adelaide,
SA 5005, Australia
b Environment Protection Authority, GPO Box 2607, Adelaide, SA 5001, Australia

Received 6 December 2004; accepted 11 February 2005

Abstract

The Australian River Assessment System (AusRivAS) is a nation-wide programme designed to assess the health of Australian
rivers and streams. In order to produce river health assessments, the AusRivAS method uses the outcomes of cluster analysis
applied to macroinvertebrate data from a number of different locations. At present, the clustering step is conducted using the
statistical Unweighted Pair Group Arithmetic Averaging (UPGMA) method. A potential shortcoming of this approach is that it
uses a linear performance measure for grouping similar data points. A recently developed approach for clustering ecological data
(MIR-max) overcomes this limitation by using mutual information as the performance measure. However, MIR-max uses a hill-
climbing approach for optimising mutual information, which could become trapped in local optima of the search space. In this
paper, the potential of using evolutionary algorithms (EAs), such as genetic algorithms and ant colony optimisation algorithms, for
maximising the mutual information of ecological data clusters is investigated. The MIR-max and EA-based approaches are applied
to the South Australian combined season riffle AusRivAS data, and the results obtained are compared with those obtained using
the UPGMA method. The results indicate that the overall mutual information values of the clusters obtained using MIR-max and
the EA-based approaches are significantly higher than those obtained using the UPGMA method, and that the use of genetic and
ant colony optimisation algorithms is successful in determining clusters with higher overall mutual information values compared
with those obtained using MIR-max for the case study considered.
© 2005 Elsevier Ltd. All rights reserved.

Keywords: AusRivAS; River health assessment; Clustering; Genetic algorithm; Ant colony optimisation; Mutual information

1. Introduction

1.1. Background

The complexity of river ecology makes any assessment of river health difficult. Traditional assessment measures,
including physical and chemical parameters, may fail to recognize significant changes in river health, such as the
effect of acute short term pollution events. In order to overcome these limitations, biological indicators are being used

∗ Corresponding author. Tel.: +61 8 8303 4139; fax: +61 8 8303 4359.
E-mail address: hmaier@civeng.adelaide.edu.au (H.R. Maier).

0895-7177/$ - see front matter © 2005 Elsevier Ltd. All rights reserved.
doi:10.1016/j.mcm.2006.01.004

increasingly. Such indicators include macroinvertebrates, fish, algae, diatoms, micro-organisms and macrophytes.
Macroinvertebrate data, in particular, have been used extensively for the assessment of river health, as they are present
in almost all rivers, and different species have various sensitivities to environmental stress. In addition, they only travel
within a limited range and have a lifespan that is adequate for detecting disturbances, while sufficiently brief to detect
recolonisation after a disturbance [1].
The River Invertebrate Prediction and Classification Scheme (RIVPACS), which is used to assess river health
in Britain, was the first regional-scale model that incorporated the use of macroinvertebrate data as an alternative
to physical and chemical data. Since the development of RIVPACS, similar methods of assessing river health have
been set up in other countries, including the Australian River Assessment System (AusRivAS) [2] and a similar
system in California [3]. Recently, O’Connor and Walley [4] developed a River Pollution Diagnostic System (RPDS)
for the British Environment Agency using a novel information-theoretic clustering system called MIR-max (Mutual
Information and Regression maximisation) [5].

1.2. Australian River Assessment System

The Australian River Assessment System (AusRivAS) was established in 1992 as a nation-wide programme
designed to assess the health of Australian rivers and streams. The general AusRivAS method involves the
establishment of a database of reference sites, which are sites that are considered to be minimally affected by
anthropogenic impacts. These sites are then grouped into clusters of similar macroinvertebrate communities. The
clusters are analysed to find relationships between the physical, geographical and chemical properties of sites in a
cluster and the corresponding macroinvertebrate communities. The relationships found are then used to predict the
macroinvertebrate communities at non-reference sites that would be expected if these sites were equivalent to least
disturbed reference conditions. To determine the level of river health, the expected macroinvertebrate community is
compared with the observed community.

1.3. Clustering of reference sites

As part of AusRivAS, the clustering of reference sites is performed using the statistical Unweighted Pair
Group Arithmetic Averaging (UPGMA) method [6], which is an agglomerative hierarchical technique [7]. Sites are
agglomerated in a stepwise fashion to produce a hierarchical order, which is presented in the form of a dendrogram [7].
As part of the agglomeration process, Euclidean distance is used as the performance measure to assess the similarity
of sites based on the macroinvertebrate communities present.
A limitation of agglomerative methods, such as UPGMA, is that if clusters are joined sub-optimally, they cannot be
separated. Thus, errors created in previous steps of the clustering process cannot be overcome [7]. Another potential
shortcoming of the UPGMA method is that it uses a linear performance measure (i.e. Euclidean distance) for grouping
similar data points, which can fail to capture the non-linear relationships that are a feature of ecological systems.
The MIR-max system introduced by Walley and O’Connor [5] overcomes the limitations of the UPGMA approach
outlined above. MIR-max clusters the data by maximising the mutual information (MI) between the clusters and the
attributes of the data, and then arranges the clusters in two dimensions in a way that aims to preserve their relative
positions in n-dimensional data space. The mutual information criterion [8] is used as the performance measure, as
it caters for both linear and non-linear dependence between variables. In addition, the approach is not agglomerative,
enabling data points to move between clusters during the clustering process. In this paper, only the clustering aspect
of MIR-max is addressed.
As part of the MIR-max approach, mutual information is maximised using a hill-climbing method. This involves
selecting two sampling sites from different clusters and swapping them. If the MI score is increased as a result of
the swap, the change takes place; if not, the sites return to their original clusters. This process is continued until no
improvement is made for a user-defined number of iterations [5]. In this paper, an alternative approach for maximising
mutual information is introduced, which uses evolutionary algorithms (EAs) as the optimisation engine. EAs are
robust, stochastic search algorithms that work with populations of solutions and use mechanisms based on examples
from nature to evolve better solutions. Consequently, they are more likely to find globally optimum solutions when
compared with local search methods, such as that used in MIR-max. Another advantage EAs have over traditional
optimisation techniques is that they do not require the use of the gradient of a fitness function, only the value of the

fitness function itself. In this paper, formulations for clustering ecological data so as to maximise mutual information
are developed for two types of EAs, namely genetic algorithms (GAs) [9] and ant colony optimisation algorithms
(ACOAs) [10].
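The swap-based hill climbing used in MIR-max, as described above, can be sketched as follows. This is a minimal illustration rather than the authors' implementation; the function name and the `overall_mi` callback (which would evaluate the overall mutual information of a layout) are hypothetical:

```python
import random

def hill_climb(clusters, overall_mi, max_no_improve=1000):
    """Swap-based hill climbing over a cluster layout, as in MIR-max.

    clusters: list of cluster indices, one per sampling site.
    overall_mi: function returning the overall MI score of a layout.
    Stops after max_no_improve consecutive non-improving attempts.
    """
    best = overall_mi(clusters)
    stagnant = 0
    while stagnant < max_no_improve:
        # pick two sites and attempt to swap their cluster allocations
        i, j = random.sample(range(len(clusters)), 2)
        if clusters[i] == clusters[j]:
            stagnant += 1  # no-op swap; count it to guarantee termination
            continue
        clusters[i], clusters[j] = clusters[j], clusters[i]
        score = overall_mi(clusters)
        if score > best:
            best = score  # keep the improving swap
            stagnant = 0
        else:
            clusters[i], clusters[j] = clusters[j], clusters[i]  # undo
            stagnant += 1
    return clusters, best
```

Because only improving swaps are accepted, the search is monotone but can stall in a local optimum, which is the motivation for the EA-based alternatives developed in this paper.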

1.4. Research objectives

The objectives of this research are:


1. To develop alternative approaches for clustering ecological data based on mutual information by replacing the hill-
climbing approach for optimising mutual information currently used in MIR-max with two types of evolutionary
algorithms, including genetic algorithms and ant colony optimisation algorithms.
2. To compare the MI between clusters and the attributes of the data points to be clustered obtained using the
UPGMA, MIR-max (use of hill-climbing to optimise MI), GAMI-max (Genetic Algorithm Mutual Information
maximisation) and ACMI-max (Ant Colony Mutual Information maximisation) clustering approaches for the South
Australian combined season riffle AusRivAS data.

2. Proposed approach

The proposed approach for clustering ecological data involves the use of mutual information as the performance
measure for determining the similarity between clusters and the attributes of the data points to be clustered, and to
maximise the performance measure (i.e. mutual information) using genetic and ant colony optimisation algorithms.
Details of the proposed approach are given below.

2.1. Performance measure

The mutual information between two continuous random variables X and Y is given by [11]:
f X,Y (x, y)
Z Z  
M(X, Y ) = f X,Y (x, y) loge dxdy (1)
f X (x) f Y (y)
where f X (x) and f Y (y) are the marginal probability density functions of continuous random variables X and Y ,
respectively, and f X,Y (x, y) is the joint probability density function of continuous random variables X and Y. In the
case of discrete random variables X and Y , the mutual information is given by:
p X,Y (x, y)
XX  
M(X, Y ) = p X,Y (x, y) loge (2)
x y p X (x) pY (y)

where p X (x) and pY (y) are the marginal probability functions of discrete random variables X and Y , respectively, and
p X,Y (x, y) is the joint probability function of discrete random variables X and Y . Mutual information measures the
reduction in uncertainty of Y as a result of knowledge about X . If there is no dependence between X and Y , then the
two random variables are statistically independent and, by definition, the joint probability density f X,Y (x, y) would
equal the product of the marginal densities ( f X (x) f Y (y)), or in the discrete case, p X,Y (x, y) = p X (x) pY (y). In both
instances, the MI score would be zero. If, on the other hand, the random variables were strongly related, the value of
MI would be relatively high.
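As a concrete illustration of the discrete case in (2), the following sketch (a hypothetical helper, not part of any AusRivAS software) computes the mutual information of a joint probability table. Independent variables give a score of zero, while two perfectly dependent equiprobable binary variables give $\log_e 2$:

```python
import math

def mutual_information(p_xy):
    """Mutual information (in nats) of a discrete joint distribution,
    following Eq. (2): M(X, Y) = sum_{x,y} p(x,y) ln[p(x,y) / (p(x) p(y))].

    p_xy: joint probability table, p_xy[i][j] = Pr(X = i, Y = j).
    """
    p_x = [sum(row) for row in p_xy]          # marginal of X (row sums)
    p_y = [sum(col) for col in zip(*p_xy)]    # marginal of Y (column sums)
    mi = 0.0
    for i, row in enumerate(p_xy):
        for j, p in enumerate(row):
            if p > 0:  # 0 * log(0) is taken as 0 by convention
                mi += p * math.log(p / (p_x[i] * p_y[j]))
    return mi
```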
Intuitively, the clustering problem involves the grouping of T samples into n clusters, such that the “similarity” of
the samples within each cluster is maximized, where the “similarity” is based on the state that each of the m attributes
is in for each sample. Consider the set of samples S = {ωk : k = 1, . . . , T }, where each sample ωk has m attributes
and each attribute is in one of s states (i.e. (ωk ) X j ∈ {1, . . . , s}, k = 1, . . . , T , j = 1, . . . , m, where (ωk ) X j is the
state of the jth attribute of sample ωk ), and consider also a given cluster allocation C = {Ci : i = 1, . . . , n}, where
set Ci is the set containing the samples in cluster i (i.e. Ci = {ω : (ω)C = i, ω ∈ S} where (ω)C is the cluster to
which sample ω is allocated. The objective of the clustering problem is to maximize overall “similarity”. Within the
MI framework, the overall “similarity” of the cluster layout C can be measured by the uncertainty about the state of
the attributes of a particular sample, given that it is known which cluster the sample came from. For example, if there
is a high degree of “similarity” between samples in a given cluster, for a randomly selected sample from that cluster

it would be expected that, given some knowledge about the distribution of attribute states of the remaining samples
within the cluster, one could be reasonably certain about the value of the states of the attributes for the selected sample.
Conversely, if there is a low degree of “similarity” between the samples in a given cluster, for a randomly selected
sample from that cluster it would be expected that, given some knowledge about the distribution of attribute states for
the remaining samples within the cluster, there is high uncertainty associated with predicting the values of the states
of the attributes, as the samples within the cluster have a high variation in attribute states.
To formalize the above concept, consider two random variables. The first random variable, C, describes the
cluster allocation of a sample selected at random from a given cluster layout C (i.e. Pr{C = c} is the probability
that any randomly selected sample is from cluster c for a given cluster layout C, where the state space of C is
{1, . . . , n}). The second random variable, X j , describes the state of attribute j of a sample selected at random from S
(i.e. Pr{X j = x} is the probability that any randomly selected sample has attribute j in state x, where the state space
of $X_j$ is $\{1, \ldots, s\}$). The marginal probability distributions $p_C(c)$ and $p_{X_j}(x)$ of C and $X_j$, respectively, and the joint distribution $p_{C X_j}(c, x)$ for a given $\mathcal{C}$ are (adapted from [5]):

$$p_C(c) = \Pr\{C = c\} = \begin{cases} |C_c|/T & \text{for } c = 1, \ldots, n \\ 0 & \text{otherwise} \end{cases} \qquad (3)$$

$$p_{X_j}(x) = \Pr\{X_j = x\} = \begin{cases} |\{\omega : (\omega)_{X_j} = x,\ \omega \in S\}|/T & \text{for } x = 1, \ldots, s \\ 0 & \text{otherwise} \end{cases} \qquad (4)$$

$$p_{C X_j}(c, x) = \Pr\{C = c,\ X_j = x\} = \begin{cases} |\{\omega : (\omega)_{X_j} = x,\ \omega \in C_c\}|/T & \text{for } c = 1, \ldots, n,\ x = 1, \ldots, s \\ 0 & \text{otherwise} \end{cases} \qquad (5)$$
where |A| is the cardinality of set A, $\{\omega : (\omega)_{X_j} = x,\ \omega \in S\}$ is the set of samples in S with attribute j in state x, and $\{\omega : (\omega)_{X_j} = x,\ \omega \in C_c\}$ is the set of samples in cluster c with attribute j in state x. The mutual information of C, for a given $\mathcal{C}$, and $X_j$ is (adapted from [5]):

$$M(C, X_j) = \sum_{c=1}^{n} \sum_{x=1}^{s} p_{C X_j}(c, x) \log_e \left[ \frac{p_{C X_j}(c, x)}{p_C(c)\, p_{X_j}(x)} \right]. \qquad (6)$$

The overall MI of C with respect to the set of m attribute random variables $X_j$, j = 1, ..., m, is given as [5]:

$$G(\mathcal{C}) = \sum_{j=1}^{m} M(C, X_j) \qquad (7)$$

where $G(\mathcal{C})$ is the overall MI given the cluster layout $\mathcal{C}$.


In the case of AusRivAS, a reference site is considered as a sample, and (6) measures the dependence between
the cluster layout $\mathcal{C}$ and the jth macroinvertebrate community (i.e. the attribute $X_j$ of each sample). The state of
a macroinvertebrate community either refers to presence or absence (i.e. s = 2), or one of a number of discrete
abundance levels (i.e. s = total number of discrete abundance levels). A high MI score signifies a high dependence
between the cluster allocation and the state of the macroinvertebrate community. Consequently, sites that have
macroinvertebrate communities in similar states will cluster together. By summing the mutual information score for
each macroinvertebrate community sampled (as in (7)), the overall mutual information for a cluster layout C can be
determined with respect to the set of attributes.
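Since each probability in (3)–(5) is simply a count divided by T, the overall MI of a cluster layout can be computed directly from counts. The sketch below is a hypothetical helper written for clarity rather than speed, not the MIR-max implementation:

```python
import math
from collections import Counter

def overall_mi(states, clusters):
    """Overall mutual information G(C) of a cluster layout, per Eqs. (3)-(7).

    states: T x m table; states[k][j] is the state of attribute j of sample k.
    clusters: length-T list; clusters[k] is the cluster of sample k.
    """
    T = len(states)
    m = len(states[0])
    cluster_counts = Counter(clusters)               # |C_c|, Eq. (3)
    G = 0.0
    for j in range(m):
        col = [row[j] for row in states]
        attr_counts = Counter(col)                   # Eq. (4) numerators
        joint_counts = Counter(zip(clusters, col))   # Eq. (5) numerators
        # Eq. (6): with probabilities count/T, the ratio inside the log
        # simplifies to n_cx * T / (|C_c| * n_x); zero-count terms vanish
        for (c, x), n_cx in joint_counts.items():
            G += (n_cx / T) * math.log(n_cx * T / (cluster_counts[c] * attr_counts[x]))
    return G                                         # Eq. (7): sum over attributes
```

For example, two clusters whose samples agree perfectly on a single binary attribute score $\log_e 2$, while a layout independent of the attribute states scores zero.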

2.2. Optimisation of performance measure

2.2.1. Genetic algorithm


Genetic algorithms (GAs) are heuristic iterative search techniques that attempt to find the best solution in a
given decision space based on an algorithm that mimics Darwinian evolution and survival of the fittest in a natural
environment [9]. In keeping with genetics terminology, the decision space is referred to as the environment, the

Fig. 1. Representation of a solution as a string of integers (chromosome).

potential solutions to the optimisation problem are called chromosomes (or strings of information that represent a
decision set) and the total number of chromosomes is called the population size. The iterations of the optimisation
process are called generations and the GA proceeds by evaluating the best sets of chromosomes in the population at
each generation. These sets of chromosomes are found by evaluating the objective function for each chromosome in
the population and by using this objective function value to indicate the fitness of the chromosomes. The chromosomes
in a population compete with each other for survival, based on their fitness levels, and more fit individuals are given a
higher probability of mating and reproducing and hence influencing the nature of the chromosomes in the following
generations. Through competition for survival, the population evolves to contain high-performing chromosomes.
In order to apply GAs to maximising the mutual information between clusters and the attributes of the data points
to be clustered, the problem has to be formulated as follows.
1. The decision variables are the cluster allocations of each of the data samples (e.g. reference sites). Consequently,
the number of decision variables is equal to the number of data samples, and the number of values each decision
variable can take is equal to the number of clusters to which the data samples should be allocated.
2. A solution consists of cluster allocations for all data points. Each solution is represented as a string of integers (i.e. a
chromosome), as shown in Fig. 1. The total number of integers is equal to the number of data points (e.g. reference
sites), and the values each integer can take range from 1 to n, where n is the number of clusters the data points can
be allocated to (Fig. 1).
The optimisation process is summarised in Fig. 2. At the start of the process, a number of solutions (population)
are generated at random, and the “fitness” of each solution is calculated in accordance with (7). Next, the “fittest”
chromosomes are selected as potential parents for the next generation. In this research, tournament selection was
used, where two chromosomes from the population are paired off at random, and the “fitter” of the two chromosomes
survives, whereas the other chromosome is eliminated. In order to ensure that the population size stays constant, the
number of tournaments conducted is equal to the population size. In the adopted algorithm, the two chromosomes that
participate in each tournament were chosen from the total population pool at random, with replacement.
Next, members of the parent pool, which consist of the winners of the tournaments, are paired up at random and
have the opportunity to exchange information via a process called crossover. The probability that a pair of strings
will exchange information is referred to as the probability of crossover. In this research, two point crossover was
used, in which two parent chromosomes are “cut” at two identical, random locations, and the integers in the parent
chromosomes (i.e. cluster locations) between the cuts are swapped.
In order to ensure sufficient exploration of the decision space, the values of some of the integers in a chromosome
(i.e. cluster locations) are changed at random in a process called mutation. Whether mutation of a particular integer
occurs is governed by the probability of mutation.
The chromosomes obtained after the application of the processes of selection, crossover and mutation (i.e. the
children) become the parents in the next generation and the process is repeated until certain stopping criteria, such as
the completion of a fixed number of iterations, are met. In this research, elitism was employed, which ensures that the

Fig. 2. Steps in the genetic algorithm optimisation process.

fittest member of a generation is guaranteed to survive the selection process in the next generation. As a result, there
is no reduction in fitness from one generation to the next.
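The GA loop described above (random initial population, tournament selection with replacement, two-point crossover, mutation, and elitism) can be sketched as follows. This is an illustrative Python reimplementation under stated assumptions, not the authors' code; the function name and the user-supplied `fitness` callback (which would be Eq. (7) in the clustering case) are hypothetical:

```python
import random

def gami_max_sketch(T, n, fitness, pop_size=30, generations=200,
                    p_cross=0.9, p_mut=0.001):
    """GA sketch for maximising a fitness function over cluster allocations.

    A chromosome is a length-T list of cluster indices in 1..n (cf. Fig. 1).
    pop_size is assumed even so parents pair up exactly.
    """
    pop = [[random.randint(1, n) for _ in range(T)] for _ in range(pop_size)]
    elite = max(pop, key=fitness)
    for _ in range(generations):
        # tournament selection: pop_size tournaments, drawn with replacement
        parents = []
        for _ in range(pop_size):
            a, b = random.choice(pop), random.choice(pop)
            parents.append(a if fitness(a) >= fitness(b) else b)
        # two-point crossover on randomly paired parents
        children = []
        random.shuffle(parents)
        for a, b in zip(parents[::2], parents[1::2]):
            a, b = a[:], b[:]
            if random.random() < p_cross:
                i, j = sorted(random.sample(range(T), 2))
                a[i:j], b[i:j] = b[i:j], a[i:j]  # swap the segment between cuts
            children += [a, b]
        # mutation: reassign an integer to a random cluster
        for c in children:
            for k in range(T):
                if random.random() < p_mut:
                    c[k] = random.randint(1, n)
        # elitism: the best chromosome found so far always survives
        best = max(children, key=fitness)
        if fitness(best) > fitness(elite):
            elite = best[:]
        children[0] = elite[:]
        pop = children
    return elite
```

With elitism in place, the best fitness seen is non-decreasing across generations, matching the guarantee noted above.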

2.2.2. Ant colony optimisation algorithm


Ant colony optimisation (ACO), developed by Dorigo et al. [10], is an iterative, population based combinatorial
optimisation algorithm based on the analogy of foraging ants. Over a period of time, a colony of ants is able to
determine the shortest path from its nest to a food source. The exhibited ‘swarm intelligence’ of the ant colony is
achieved via an indirect form of communication that involves the ants following and depositing a chemical substance,
called pheromone, on the paths they travel. Over time, shorter paths are reinforced with greater amounts of pheromone,
as they require less time to be traversed, thus becoming the dominant paths for the colony.
Within ant colony optimisation algorithms (ACOAs), an optimisation problem is represented as a graph consisting of T + 1 nodes, where each node i = 1, ..., T is connected to its adjacent node via a set of directed edges.¹ For example, $\theta_i = \{\text{edge}(i, j) : j = 1, \ldots, n\}$ is the set of directed edges connecting node i to node i + 1. A solution S, termed a path in ACO terminology, comprises the selection of an edge at each node i ≤ T. Therefore, a path (i.e. solution) can be seen as a vector of the selected edges (cf. the GA solution string), that is:

$$S = \{\text{selection}_i : \text{selection}_i \in \theta_i,\ \forall i = 1, \ldots, T\}. \qquad (8)$$
Fig. 3 illustrates the translation of the clustering problem to the corresponding ACO graph structure. The clustering
problem can be seen as the assignment of one of n clusters to each of the T samples. Thus, the ACO graph of this
problem is a consecutive chain of T edge sets θi , one for each sample, as in Fig. 3(a), where each θi consists of n edges,
one for each cluster, as in Fig. 3(b). As also seen in Fig. 3(a), an ant generates a solution by consecutively selecting an
edge from each θi , i = 1, . . . , T , where the selection of edge (i, j) from node i corresponds to the selection of cluster
j for sample i.
Edge selection at each node is based upon a probabilistic decision policy. This policy considers a trade-off between
the pheromone intensity on a particular edge and the desirability of that edge with respect to its individual influence on
the objective function. The desirability has different definitions for different problems. For example, if the objective
is to minimise distance, e.g. the travelling salesperson problem, the desirability of an edge may be set equal to the
inverse of the distance associated with that edge, thus making shorter edges more desirable. By taking the pheromone
intensity and desirability of an edge into account, ACOAs effectively utilise heuristic information that has been learned

1 The definition of the graph in this case differs slightly from that represented in other papers (e.g. [12]) to allow for a more intuitive application
to the clustering problem considered.

(a) Example of graph structure.

(b) Detail of edge set for sample ωi .

Fig. 3. Graph structure and decision option details for the clustering problem. All text in the “{. . . }” brackets in the figure symbolises the ACO
terminology for the indicated element. A solution is generated from the graph in (a) as an ant wanders from the first node to the end node selecting
an edge to traverse between each node. The set of selected edges {ω1 , ω2 , . . . , ωT : ωi ∈ θi , i = 1, 2, . . . , T } is the ant’s solution. A more detailed
view of the decision options θi for sample ωi is seen in (b). The decision options for sample ωi are a selection of one of the n clusters. An ant’s
selection of edge (i, j) (i.e. the jth edge linking node i to node i + 1) is equivalent to the assignment of sample ωi to cluster j in the ant’s solution
generation, that is, si ← edge (i, j).

(represented as pheromone intensity) in addition to incorporating a bias towards edges that are of a greater desirability.
The probability that an ant will select edge (i, j) from node i within iteration t is given as [10]:

$$p_{i,j}(t) = \frac{[\tau_{i,j}(t)]^{\alpha} [\eta_{i,j}]^{\beta}}{\sum_{l \in \theta_i} [\tau_{i,l}(t)]^{\alpha} [\eta_{i,l}]^{\beta}} \qquad (9)$$

where $p_{i,j}(t)$ is the probability that edge (i, j) is chosen in iteration t, $\tau_{i,j}(t)$ is the concentration of pheromone associated with edge (i, j) in iteration t, $\eta_{i,j}$ is the desirability of edge (i, j), and α and β are the parameters controlling the relative importance of the pheromone intensity and desirability, respectively, for each ant's decision. For α > β,
pheromone intensity provides the dominant influence in the probability weights, as differences between pheromone
values are emphasised more. Similarly, for α < β, desirability has the dominant influence, resulting in a higher
probability of lower cost edges being selected and the algorithm behaving more like a greedy heuristic.
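In practice, the decision rule in (9) amounts to a roulette-wheel choice over the edges in $\theta_i$, with weights $\tau^{\alpha}\eta^{\beta}$. A minimal sketch (the function name is hypothetical):

```python
import random

def select_edge(tau, eta, alpha=1.0, beta=1.0):
    """Choose an edge index from one edge set theta_i per Eq. (9):
    p_{i,j} proportional to tau_{i,j}^alpha * eta_{i,j}^beta.

    tau: pheromone intensities for the n edges leaving this node.
    eta: desirabilities for the same edges (all 1.0 for clustering).
    """
    weights = [t**alpha * e**beta for t, e in zip(tau, eta)]
    # roulette-wheel selection: draw a point in [0, total) and find its bin
    r = random.random() * sum(weights)
    for j, w in enumerate(weights):
        r -= w
        if r <= 0:
            return j
    return len(weights) - 1  # guard against floating-point round-off
```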
For the clustering problem, there is no meaningful definition of desirability. The desirability value is only useful
for those optimisation problems for which a decision (i.e. edge in the ACOA graph) possesses some inherent property
such that an isolated cost on the objective, that is independent of the combinatorics of an entire solution, can be
ascribed to that decision. For the clustering problem, the choice of selecting a cluster for a sample has no definable
isolated impact on G in (7), as it is only the combinatorics of an entire solution that enables G to be determined. As a
result, within this research, all desirability levels were set to the arbitrary value of 1.
Similar to real ant colonies, ACOAs aim to incrementally improve the quality of the generated solutions by
gradually adjusting the pheromone intensities so that better path elements have a higher selection probability. The
pheromone updating process consists of a decaying action to mimic the gradual evaporation of real pheromone, and
an additive action to model the pheromone addition of the traversing ants. The decaying action is an important carry-
over from the foraging ant analogy, as it serves to “limit the memory” of the ACOA’s process and allows the search to

more effectively use newer (and potentially better) information. The pheromone updating equation is given by [10]:

$$\tau_{i,j}(t + 1) = \rho \tau_{i,j}(t) + \Delta\tau_{i,j}(t) \qquad (10)$$

where ρ is the coefficient representing pheromone persistence (note 0 ≤ ρ ≤ 1) and $\Delta\tau_{i,j}(t)$ is the pheromone addition for edge (i, j) from the solutions found in iteration t. The formulation of $\Delta\tau_{i,j}(t)$ is the main underlying difference between many ACOAs, as it is representative of the way an algorithm uses learned information. Within this research, the adopted formulation of $\Delta\tau_{i,j}(t)$ is that used in the Elitist–Rank Ant System (AS$_{\text{rank}}$) algorithm [13], as this algorithm has been found to perform well for other problem types [13,14]. AS$_{\text{rank}}$ implements a rank-based scheme where only σ paths receive non-zero pheromone additions at the end of each iteration. The σ paths are the best path found so far (i.e. the elitist solution $S^{\text{elite}}$) and the top σ − 1 ranked solutions found within the iteration. The value of $\Delta\tau_{i,j}(t)$ is given as:

$$\Delta\tau_{i,j}(t) = \underbrace{\sigma \cdot G(S^{\text{elite}}(t)) \cdot I_{S^{\text{elite}}(t)}\{(i, j)\}}_{\text{I}} + \underbrace{\sum_{\mu=1}^{\sigma-1} (\sigma - \mu) \cdot G(S^{\mu}(t)) \cdot I_{S^{\mu}(t)}\{(i, j)\}}_{\text{II}} \qquad (11)$$

where G(·) is the mutual information function (see (7)); note that here G is presented as a function of S rather than $\mathcal{C}$, which is valid as there is a one-to-one mapping from S to $\mathcal{C}$ (i.e. $S = \{\text{edge}(i, (\omega_i)_C) : i = 1, \ldots, T\}$), $S^{\text{elite}}(t)$ is the elitist solution found up to iteration t, $I_A\{x\}$ is the indicator function for set A and is equal to 1 if x ∈ A and zero otherwise, and $S^{\mu}(t)$ is the solution of the μth ranking ant from iteration t. Term I in (11) is the pheromone addition from the elitist ant and term II corresponds to the addition from the σ − 1 ranking ants. Important points to note about
(11) are: (i) as the problem is a maximisation problem, the pheromone addition based on a particular solution S is
proportional to the objective function value of that solution G(S), so edges resulting in better solutions receive greater
additions (cf. for a minimization problem, the pheromone addition is inversely proportional to the objective value),
(ii) through the indicator function, non-zero pheromone additions based on a solution S are only given to those edges
that are an element of S (i.e. pheromone is only given to the edges traversed by the ant that generated S), and (iii)
greater pheromone addition is given to the edges from higher ranking solutions (i.e. the elitist solution, ranked first,
is scaled up by a factor of σ , but the σ th ranked solution is only scaled by 1). The weighted pheromone addition of
the top σ solutions is aimed at providing a compromise between exploitation of the best information and inclusion of
enough information to allow for guided exploration.
Fig. 4 outlines the main steps taken in the general ACOA process. Before the first iteration, initialisation routines
are undertaken including all pheromone intensities τi, j (1), i = 1, . . . , T , j = 1, . . . , n being set to the initial level of
τ0 . At the beginning of each iteration, the probability weight for each edge at each node is calculated, as in (9), based
on the current pheromone levels. Each ant then generates a solution according to the probabilistic decision rule (9).
After all ants have generated solutions, the objective function values are calculated and the top σ − 1 ranked solutions
are stored. A check is performed to update the elitist solution if required (i.e. S elite ← S 1 if G(S elite ) < G(S 1 ) where
S 1 is the solution of the top ranked ant). After the determination of S elite and S µ , µ = 1, . . . , σ − 1, the pheromone
paths are all updated according to (10) and (11) and the iteration ends. The process loops through all iterations until
the stopping criteria are satisfied.
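The decay and additive actions of (10) and (11) can be sketched as one update step. This is a hypothetical helper under the paper's representation, in which a solution is the list of selected edges, one per node, as in (8):

```python
def update_pheromone(tau, rho, sigma, elite, elite_G, ranked):
    """AS_rank pheromone update, per Eqs. (10) and (11).

    tau: T x n matrix of pheromone intensities tau_{i,j} (updated in place).
    rho: pheromone persistence coefficient, 0 <= rho <= 1.
    elite: best-so-far solution (selected edge j at each node i), with
           objective value elite_G.
    ranked: top sigma-1 solutions of the iteration as (solution, G) pairs,
            best first.
    """
    # decaying action (evaporation): tau <- rho * tau on every edge
    for i in range(len(tau)):
        for j in range(len(tau[i])):
            tau[i][j] *= rho
    # term I: elitist addition, scaled by sigma, on the elite ant's edges only
    for i, j in enumerate(elite):
        tau[i][j] += sigma * elite_G
    # term II: rank-weighted additions, the mu-th ranked ant scaled by sigma - mu
    for mu, (sol, G) in enumerate(ranked, start=1):
        for i, j in enumerate(sol):
            tau[i][j] += (sigma - mu) * G
    return tau
```

Because the addition is proportional to G(S), edges appearing in higher-MI cluster layouts accumulate more pheromone and so are more likely to be selected in later iterations.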

3. Case study

3.1. Data

The data used in the analyses are the South Australian combined (Autumn/Spring) riffle macroinvertebrate data,
which have been collected by the South Australian Environment Protection Authority (EPA). The data contain 151
reference sites (i.e. T = 151) and information on 67 macroinvertebrate families (i.e. m = 67). In order to compare the
results directly with those obtained using the UPGMA method currently used in the AusRivAS model, the number of
clusters used was 6 (i.e. n = 6). As part of the current AusRivAS model, only presence/absence data are considered
(i.e. s = 2). However, in this research, analyses were conducted using both presence/absence and abundance data.
For the abundance data, the five states adopted by Walley and O'Connor [5] were used (i.e. s = 5), namely (i) abundance = 0, (ii) 1 ≤ abundance ≤ 9, (iii) 10 ≤ abundance ≤ 99, (iv) 100 ≤ abundance ≤ 999, and (v) abundance ≥ 1000.
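The mapping from a raw abundance count to one of the five discrete states can be sketched as follows (a hypothetical helper; state 1 corresponds to absence):

```python
def abundance_state(a):
    """Map a raw abundance count to one of the five discrete states
    used by Walley and O'Connor (s = 5)."""
    if a == 0:
        return 1   # absent
    if a <= 9:
        return 2   # 1-9
    if a <= 99:
        return 3   # 10-99
    if a <= 999:
        return 4   # 100-999
    return 5       # 1000 or more
```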

Fig. 4. Steps in the Ant Colony Optimisation process.

3.2. Analyses conducted

The presence/absence and abundance data were clustered using MIR-max and the proposed GA-based (GAMI-
max) and ACO-based (ACMI-max) approaches. For each data set, and for each of the three algorithms considered, the
analyses were repeated 20 times with different random starting positions, to allow for the effect that the random starting position in the search space, and, for GAMI-max and ACMI-max, the probabilistic nature of the operators that control the searching behaviour of the algorithms, have on the results.
Eq. (6) was used to calculate the mutual information between the cluster allocation of a particular reference site
and the state of a specific macroinvertebrate community at that site for the current AusRivAS clusters obtained using
the UPGMA method, as well as the clusters obtained using the MIR-max, GAMI-max and ACMI-max approaches.
The overall mutual information for each cluster set was then calculated using (7). The calculations were conducted
using presence/absence and abundance data.
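Eqs. (6) and (7) are not reproduced in this section, but the quantities involved follow the standard definition of mutual information between the cluster allocation and the state of each taxon, summed over all taxa. A minimal base-2 sketch is given below; frequencies are estimated directly from the T sites, variable names are illustrative, and any normalisation used in the original equations is omitted:

```python
import math
from collections import Counter

def mutual_information(clusters, states):
    """I(C; S) for one taxon: clusters[i] is the cluster allocated to
    site i, states[i] the taxon's state (e.g. presence/absence) there."""
    T = len(clusters)
    pc = Counter(clusters)            # cluster marginal counts
    ps = Counter(states)              # state marginal counts
    joint = Counter(zip(clusters, states))
    mi = 0.0
    for (c, s), n in joint.items():
        p_cs = n / T
        # p_cs * log2(p_cs / (p_c * p_s)), with counts converted to probabilities
        mi += p_cs * math.log2(p_cs * T * T / (pc[c] * ps[s]))
    return mi

def overall_mutual_information(clusters, data):
    """Sum of I(C; S_j) over all m taxa; data[j] lists the states of
    taxon j across the T sites."""
    return sum(mutual_information(clusters, taxon) for taxon in data)
```

For example, a clustering that perfectly separates a presence/absence pattern yields 1 bit for that taxon, while an allocation independent of the pattern yields 0.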
Details of the clustering process for the two new approaches are given below.

3.2.1. Genetic algorithm (GAMI-max)


The software code required for implementing the GAMI-max approach was developed in Fortran 77. A population
size of 30 and a stopping criterion of 5000 generations were used for all simulations. The optimal probabilities of
crossover and mutation were determined by trial and error. The probabilities of crossover investigated ranged from
0.6 to 1.0. The best results obtained were for a probability of crossover of 0.9, although the performance of GAMI-
max was relatively insensitive to this parameter. Probabilities of mutation investigated ranged from 0 (i.e. no mutation)
to 0.1. The best results were obtained when the probability of mutation was 0.001. The performance of GAMI-max
decreased significantly for higher values of probability of mutation, such as 0.01. The same GAMI-max parameters
were found to be optimal for the presence/absence and abundance data.
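The chromosome encoding is not detailed in this excerpt; a natural choice, assumed here, is one integer gene per reference site holding that site's cluster label, with standard one-point crossover and random-reassignment mutation. All names and the operator details below are illustrative:

```python
import random

def crossover(parent1, parent2, p_cross=0.9):
    """One-point crossover on integer cluster-label chromosomes
    (one gene per reference site)."""
    if random.random() > p_cross:
        return parent1[:], parent2[:]
    cut = random.randrange(1, len(parent1))
    return (parent1[:cut] + parent2[cut:],
            parent2[:cut] + parent1[cut:])

def mutate(chromosome, n_clusters=6, p_mut=0.001):
    """Reassign each gene to a random cluster with probability p_mut."""
    return [random.randrange(n_clusters) if random.random() < p_mut else gene
            for gene in chromosome]
```
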

Table 1
Overall mutual information statistics (maximum, mean, minimum, standard deviation) obtained using the three clustering approaches considered
(GAMI-max, ACMI-max, MIR-max) for 20 runs with different random starting positions for the South Australian combined (Autumn/Spring) riffle
presence/absence macroinvertebrate data, and the corresponding mutual information for the UPGMA clusters

        GAMI-max   ACMI-max   MIR-max   UPGMA
Max     7.415      7.395      7.250     6.663
Mean    7.189      7.098      6.967     –
Min     7.023      6.773      6.745     –
Stdev   0.088      0.136      0.127     –

Table 2
Overall mutual information statistics (maximum, mean, minimum, standard deviation) obtained using the three clustering approaches considered
(GAMI-max, ACMI-max, MIR-max) for 20 runs with different random starting positions for the South Australian combined (Autumn/Spring) riffle
abundance macroinvertebrate data, and the corresponding mutual information for the UPGMA clusters

        GAMI-max   ACMI-max   MIR-max   UPGMA
Max     12.033     12.002     11.855    9.143
Mean    11.796     11.689     11.668    –
Min     11.406     11.088     11.151    –
Stdev   0.165      0.249      0.169     –

3.2.2. Ant colony optimisation algorithm (ACMI-max)


The software code required for implementing the ACMI-max approach was developed in Fortran 90. A trial and
error approach to calibrating the algorithm’s parameters was adopted. The tested parameter ranges were as follows:
h ∈ [5, 500], α ∈ [0, 5], ρ ∈ [0.05, 0.999], τ0 ∈ [1, 1000] and σ ∈ [2, 220] (note the range of values that σ can take is
bounded above by h for each simulation). As there is no meaningful definition of desirability for this problem type, β
was set to 1 and thus required no calibration. As with GAMI-max, the maximum number of evaluations used in each
ACMI-max simulation was 150 000. This meant that the maximum number of iterations Imax used in each simulation
was Imax = int{150 000/h}.
In general, the results of the parameter sensitivity analyses indicated that for low values of h, the best solution was
found early in the run, implying premature convergence. For high values of h, the best solution was generally found in
the last few iterations, implying that convergence had not yet been achieved. The best values of h were in the middle
third of the tested parameter range. The performance of ACMI-max deteriorated for α ≠ 1, as the algorithm was observed
to search almost randomly for lower values of α and converged prematurely for higher values of this parameter. The
algorithm’s performance was better for higher values of ρ, as at these values ACMI-max was observed to explore for
longer periods. ACMI-max was relatively insensitive to changes in τ0 . Higher values of σ resulted in longer search
times before good solutions were found. This can be attributed to the dilution of the influence of the best information
resulting from the inclusion of more information in the pheromone updating phase. The best values of σ occurred in
the vicinity of σ = h/10 (i.e. when, approximately, the top 10% of ants were updating their paths).
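The elitist update described above can be sketched as follows. The deposit rule is an assumption (pheromone added in proportion to each ranked ant's overall mutual information; a rank-weighted scheme in the style of Bullnheimer et al. [13] would be similar), and ρ is interpreted as a persistence factor, so trails decay by a factor (1 − ρ) each iteration:

```python
def update_pheromone(tau, ranked_ants, rho=0.98, sigma=16):
    """tau[site][cluster] is the trail strength for allocating a site
    to a cluster. ranked_ants is sorted best-first as
    (overall_mi, assignment) pairs, where assignment[site] is the
    cluster that ant chose for the site. Only the top sigma ants deposit."""
    for site_trails in tau:
        for c in range(len(site_trails)):
            site_trails[c] *= rho  # persistence: old trail decays by (1 - rho)
    for overall_mi, assignment in ranked_ants[:sigma]:
        for site, c in enumerate(assignment):
            tau[site][c] += overall_mi  # reinforce the allocations this ant made
    return tau
```
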
When the presence/absence data were used, the best parameter settings were as follows: h = 160 (i.e. Imax = 938),
α = 1, ρ = 0.98, τ0 = 220, and σ = 16. When the abundance data were used, the best parameter settings were as
follows: h = 220 (i.e. Imax = 682), α = 1, ρ = 0.96, τ0 = 160, and σ = 20.

4. Results and discussion

The overall mutual information values of the clusters obtained when the presence/absence and abundance data were
used in conjunction with the three clustering approaches investigated (i.e. MIR-max, GAMI-max and ACMI-max) are
shown in Tables 1 and 2, respectively. The corresponding values for the clusters obtained from the UPGMA approach
are also presented for comparison. For the clusters obtained using MIR-max, GAMI-max and ACMI-max, the statistics
of the results of the 20 runs with different starting values are presented, including the maximum, mean, minimum and
standard deviation of the best overall mutual information values found in each run. It can be seen from Tables 1 and 2
that, for the case study considered, the overall mutual information values of the clusters obtained using the MIR-max,
GAMI-max and ACMI-max approaches were significantly higher than those obtained using the UPGMA approach.

However, as pointed out by O'Connor and Walley [4], it is not surprising that clustering methods that are designed
to optimise mutual information achieve higher MI scores than clustering methods that use an alternative performance
measure. In addition, it should be noted that only the presence/absence data were used to obtain the UPGMA clusters,
whereas separate cluster solutions were obtained for the presence/absence and abundance data when the other three
clustering approaches were used. This is likely to be one of the reasons for the decreased performance of the UPGMA
algorithm relative to that of the other three algorithms for the abundance data.
A comparison of the results presented in Tables 1 and 2 indicates that “better” clusters were obtained when the
abundance data, rather than the presence/absence data, were used. The results in Tables 1 and 2 also suggest that,
for the case study considered, the EAs used as part of GAMI-max and ACMI-max are generally more successful at
maximising the MI between clusters and the attributes of the samples to be clustered than the hill-climbing approach
used in MIR-max. However, the improvements obtained using the EAs are only marginal and further comparative
studies are needed on more challenging case studies, such as that used by O’Connor and Walley [4] to develop RPDS,
which clustered 6038 samples into 250 clusters.
A comparison of the results obtained using GAMI-max and ACMI-max shows that the GA-driven clustering
algorithm performs slightly better than the one using the ACOA, both in terms of the best solution found and in
relation to the variation in the solutions obtained from different random starting positions in the search space. These
results were consistent for both data sets investigated (i.e. presence/absence and abundance).

5. Conclusions

A new approach for clustering ecological data was introduced in this paper, which uses mutual information
as the performance measure and evolutionary algorithms for optimising this performance measure. Formulations
were developed which enable the mutual information of ecological data clusters to be optimised using both genetic
algorithms (GAs) and ant colony optimisation algorithms (ACOAs). The GA-based approach (GAMI-max) and
the ACOA-based approach (ACMI-max) were applied to the South Australian combined season riffle AusRivAS
presence/absence and abundance data.
It was found that the overall mutual information between clusters and the attributes of the samples to be clustered
obtained using the new approaches was significantly higher than that obtained using the UPGMA clustering method,
which is currently used in the South Australian AusRivAS model, and slightly higher than that obtained using the
MIR-max approach, which uses a hill-climbing approach for optimising mutual information. This indicates that the
proposed approach shows promise, but further comparative tests are needed.

Acknowledgments

The authors would like to thank Mark O’Connor from the Centre for Intelligent Environmental Systems at
Staffordshire University for his valuable advice and comments on an earlier version of this manuscript and supplying
the MIR-max software, and Tom Finkemeyer, Tiana Hume and Miranda Butchart from the School of Civil and
Environmental Engineering at the University of Adelaide for their assistance with preliminary modelling work.

References

[1] H.F. Dallas, The derivation of ecological reference conditions for riverine macroinvertebrates, in: Southern Waters Ecological Research and
Consulting, 2000.
[2] P.E. Davies, Development of a national river bioassessment system (AUSRIVAS) in Australia, in: J.F. Wright, D.W. Sutcliffe, M.T. Furse
(Eds.), Assessing the Biological Quality of Fresh Waters: RIVPACS and Other Techniques, Freshwater Biological Association, Ambleside,
UK, 2000, pp. 113–124.
[3] C.P. Hawkins, R.H. Norris, J.N. Houge, J.W. Feminella, Development and evaluation of predictive models for measuring the biological
integrity of streams, Ecological Applications 10 (5) (2000) 1456–1477.
[4] M.A. O’Connor, W.J. Walley, River Pollution Diagnostic System (RPDS) — computer-based analysis and visualisation for bio-monitoring
data, Water Science and Technology 46 (3) (2002) 17–23.
[5] W.J. Walley, M.A. O’Connor, Unsupervised pattern recognition for the interpretation of ecological data, Ecological Modelling 146 (2001)
219–230.
[6] P.E. Davies, River Bioassessment Manual, in: National River Processes and Management Program, Monitoring River Health Initiative, 1994.
[7] L. Kaufman, P.J. Rousseeuw, Finding Groups in Data: An Introduction to Cluster Analysis, John Wiley and Sons, Inc., Brisbane, 1990.

[8] A.M. Fraser, H.L. Swinney, Independent coordinates for strange attractors from mutual information, Physical Review A 33 (2) (1986)
1134–1140.
[9] D.E. Goldberg, Genetic Algorithms in Search, Optimization and Machine Learning, Addison-Wesley, Reading, MA, 1989.
[10] M. Dorigo, V. Maniezzo, A. Colorni, Ant system: Optimization by a colony of cooperating agents, IEEE Transactions on Systems, Man, and
Cybernetics. Part B, Cybernetics 26 (1996) 29–42.
[11] A. Sharma, Seasonal to interannual rainfall probabilistic forecasts for improved water supply management: Part 1 — A strategy for system
predictor identification, Journal of Hydrology 239 (2000) 232–239.
[12] M. Dorigo, G. Di Caro, L.M. Gambardella, Ant algorithms for discrete optimization, Artificial Life 5 (1999) 137–172.
[13] B. Bullnheimer, R.F. Hartl, C. Strauss, An improved ant system algorithm for the vehicle routing problem, Annals of Operations Research 89
(1999) 319–328.
[14] T. Stützle, H.H. Hoos, MAX-MIN ant system, Future Generation Computer Systems 16 (2000) 889–914.
