
You can run the examples in this article if you have Perl 5.005 or later installed on your system.

Preferably, your system should be a recent (2000 or later) mainstream UNIX (Linux, Solaris, BSD)
installation, but other types of operating systems may work as well. The examples may work with
earlier versions of Perl and UNIX, and other operating systems, but their failure to function should
be considered an exercise for the reader to solve.

History

The progress of genetics in the 20th century was rivaled in speed and reach only by the evolution of
electronics and computer science. It is fitting that one of the most intriguing algorithms to come
about in the 20th century is the genetic algorithm.

Emerging in the early 1960s, the genetic algorithm (and evolutionary algorithms in general) took a
place in computer science between deterministic and non-deterministic algorithms. The genetic
algorithm, essentially, is as deterministic as you want to make it, meaning that the user can decide
the number of iterations and termination criteria. It mimics Darwinian natural selection, making
"fitness" (as determined by a formula applied to an individual) the chief selector of individuals for
survival and procreation, together with mutation.

Other evolutionary algorithms attempt to mimic Lamarckian evolution, where behavior as a
survival mechanism can be transferred between generations, and there are even evolutionary
programs that write themselves for a particular purpose. All of those are beyond the scope of this
article.

The main drawback of Perl for genetic algorithms is speed. Because of the computing needs of
genetic algorithms, they are more efficiently implemented in C or other low-level compiled
languages. The Perl examples shown here are not as fast as their equivalents in C, but they will
show you how the genetic algorithm works, and they are fast enough for some problems.

So what is the genetic algorithm?

The genetic algorithm is simple enough for anyone to understand, using biological terms taught in
high school. Take a population of individuals, each with DNA of its own. Then measure each
individual's fitness (a function applied to the individual's DNA) and make fitter individuals more
likely to procreate; extremely unfit individuals are killed off. Every survivor gets a chance to
procreate (and it is important that procreation is never denied to a survivor, only made less likely
for the less fit). Procreation is simulated by merging the parents' DNA and then applying a random
mutation to the merged DNA. The new individual will, in theory, be about as fit as its parents, plus
or minus a small variance caused by the mutation. The cycle is then repeated.

Genetic Algorithms FAQ

See the Resources section for the link to the Genetic Algorithms FAQ. The FAQ includes the
Algorithm GA and points to suites of genetic algorithm software, both free and commercial.

There are, obviously, many variables that can affect the genetic algorithm, including population
size, generations (iterations of the algorithm), merging method, fitness function, how fitness affects
procreation chances, and how much mutation happens.

There are some drawbacks to the algorithm, as well. It works best if the fitness function is applied
to the DNA as a series of bits -- in other words, it's best if the DNA is a series of binary options,
yes or no. Blue eyes? Black eyes? Red hair? Black hair? Also, merging of parent DNA and
subsequent mutation must be kept from producing certain combinations of bits, because the
resulting DNA would not be a valid solution to the original problem. Remember that the "DNA" is
nothing more than input to a mathematical fitness formula, and some values of that input may be
invalid -- causing, for instance, a division by zero.
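One common safeguard, sketched below, is to give invalid DNA a fitness of zero so it is never
selected for procreation. The formula here is invented purely for illustration; it is not the fitness
function used in the listings.

```perl
#!/usr/bin/perl -w
# Sketch: guarding a fitness formula against invalid DNA values.
# The formula (1/$dna) is an invented illustration, not the article's.
use strict;

sub safe_fitness
{
    my $dna = shift @_;
    # DNA of 0 would cause a division by zero -- treat it as completely unfit
    return 0 if $dna == 0;
    return 1 / $dna;
}
```

A DNA value of 0 now simply scores 0 and gets culled by the survival step, instead of aborting the
whole run.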

In addition, the genetic algorithm has no natural end point; you pick the number of generations.
You can define a goal -- for instance, "look for an individual with fitness of 0.99999" -- and stop
there, but then the algorithm may never end if it never finds such an individual. That can be a
problem if you set unrealistic goals or too few generations. Trial and error, and a good feel for the
problem, are the best ways around this issue.

The fitness formula should return a floating point number between 0 and 1. Other ranges can be
used, but in my experience floating point numbers work best. You may want a range between 0 and
32767, for instance, if you want the fitness to be a 15-bit integer for optimization.

Of course, delaying optimization until you know it's needed is a good idea, so you should at least
start with a simple fitness formula. The fitness formula is the most-used function in the genetic
algorithm (it will be invoked (population size) x (number of generations) times), so you should
make it as simple and as fast as possible.

There are three "good" ways to exit the genetic algorithm. First, you can decide to exit if there is no
variety in the DNA pool anymore. This is a tricky test, actually, and a CPAN module for finding the
distance of strings might be useful here, as long as you can represent the DNA as a string. Second,
you can exit if you have reached a fitness goal. Unless you have a very good understanding of the
fitness formula (in which case, you probably don't need the genetic algorithm anyway), setting a
fitness goal can result in either infinite loops, or an individual who is only "good enough." Third,
you can exit the algorithm after a set number of iterations, or "generations."
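The three exit tests can be combined in a single check, sketched below. The thresholds and the
helper name should_stop() are invented for illustration; they are not part of the listings that follow.

```perl
#!/usr/bin/perl -w
# Sketch: combining the three GA exit criteria described above.
# All names and thresholds here are illustrative assumptions.
use strict;

my $max_generations = 1000; # third criterion: a fixed generation cap
my $fitness_goal    = 0.99; # second criterion: a fitness target
my $min_variety     = 2;    # first criterion: distinct DNA values left in the pool

sub should_stop
{
    my ($generation, $best_fitness, $variety) = @_;
    return 1 if $generation >= $max_generations; # out of generations
    return 1 if $best_fitness >= $fitness_goal;  # goal reached
    return 1 if $variety < $min_variety;         # DNA pool has converged
    return 0;
}
```

In practice the generation cap acts as a backstop, so an unreachable fitness goal cannot hang the
program.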

In practice, all three ways (or at least the second and third) are used to control the genetic
algorithm. A few test runs, maybe 10 or 20, will give you a good feel for how quickly the algorithm
converges and what kind of fitness you can expect.

A simple example

The code in Listing 1 uses a single byte as DNA (value between 0 and 255, 8 bits). The fitness
formula is applied once to every new individual, taking the byte value of the DNA and dividing it
by 256. This means that the fitness formula will always return a number between 0 and 255/256, so
it can never reach 1. What do you think the fittest DNA will be?

Listing 1. numbers.pl

#!/usr/bin/perl -w
# GA demonstration with numeric DNA (between 0 and 255)

use strict;
use Data::Dumper;

# individuals in the population - no sense making more than DNA can provide for
my $popsize = 256;
my $mut_rate = 0.01; # the mutation rate
my $min_fitness = 0.1; # the minimum fitness for survival
my $generation_count = 100000; # run for this many generations
my $generation = 0; # generation counter
my $pop_ref = []; # a reference to a population array
init_population($pop_ref, $popsize);

do
{
    evaluate_fitness($pop_ref, \&fitness);

    # print out a generation summary line
    my @sorted_population = sort { $a->{fitness} <=> $b->{fitness} } @$pop_ref;
    printf "generation %d: size %d, least fit DNA [%d], most fit DNA [%d]\n",
           $generation,
           scalar @sorted_population,
           $sorted_population[0]->{dna},
           $sorted_population[-1]->{dna};

    survive($pop_ref, $min_fitness); # select survivors from the population
    select_parents($pop_ref);
    $pop_ref = recombine($pop_ref);  # returns a new population array reference

    # from this point on, we are working with a new generation in $pop_ref
    mutate($pop_ref, $mut_rate);     # apply mutation to the individuals
} while ($generation++ < $generation_count); # run until we are out of generations

sub init_population
{
    my $population = shift @_;
    my $pop_size = shift @_;

    # for each individual
    foreach my $id (1 .. $pop_size)
    {
        # insert an anonymous hash reference in the population array with the individual's data
        # the DNA is equal to the individual's number minus 1 (0-255)
        push @$population, { dna => $id-1, survived => 1, parent => 0, fitness => 0 };
    }
}

sub evaluate_fitness
{
    my $population = shift @_;
    my $fitness_function = shift @_;

    foreach my $individual (@$population)
    {
        # set the fitness to the result of invoking the fitness function
        # on the individual's DNA
        $individual->{fitness} = $fitness_function->($individual->{dna});
    }
}

sub survive
{
    my $population = shift @_;
    my $min_fitness = shift @_;

    foreach my $individual (@$population)
    {
        # an individual survives only if its fitness meets the minimum
        $individual->{survived} = $individual->{fitness} >= $min_fitness;

        # set the fitness to 0 for unfit individuals (so they won't procreate)
        $individual->{fitness} = 0 if $individual->{fitness} < $min_fitness;
    }
}

sub select_parents
{
    my $population = shift @_;
    my $pop_size = scalar @$population; # population size

    # create the weights array: select only survivors from the population,
    # then use map to have only the fitness come through
    my @weights = map { $_->{fitness} } grep { $_->{survived} } @$population;

    # remember which population slot each weight came from, so that an
    # index into @weights can be mapped back to the population
    my @survivor_indexes = grep { $population->[$_]->{survived} } 0 .. $pop_size-1;

    # if we have less than 2 survivors, we're in trouble
    die "Too few survivors in population of size $pop_size" if scalar @weights < 2;

    # we need to fill $pop_size parenting slots, to preserve the population size
    foreach my $slot (1 .. $pop_size)
    {
        my $index = sample(\@weights); # we pass a reference to the weights array here

        # do sanity checking on $index
        die "Undefined index returned by sample()"
            unless defined $index;
        die "Invalid index $index returned by sample()"
            unless $index >= 0 && $index < scalar @weights;

        # increase the parenting slots for this population member
        $population->[$survivor_indexes[$index]]->{parent}++;
    }
}

sub recombine
{
    my $population = shift @_;
    my $pop_size = scalar @$population; # population size
    my @parent_population;
    my @new_population;

    my $total_parent_slots = 1;

    while ($total_parent_slots)
    {
        # find out how many parent slots are left
        $total_parent_slots = 0;
        $total_parent_slots += $_->{parent} foreach @$population;

        last unless $total_parent_slots;

        # if we are here, we're sure we have at least one individual with parent > 0
        my $individual = undef; # start with an undefined individual
        do
        {
            # select a random individual
            $individual = $population->[int(rand($pop_size))];
            # the individual is acceptable only if it can be a parent
            undef($individual) unless $individual->{parent};
        } while (not defined $individual);

        push @parent_population, $individual; # insert the individual in the parent population
        $individual->{parent}--;              # decrease the parenting slots by 1
    }

    foreach my $parent (@parent_population)
    {
        # select a random individual from the parent population (parent #2)
        my $parent2 = $parent_population[int(rand(scalar @parent_population))];

        my $child = { survived => 1, parent => 0, fitness => 0 };

        # this is breeding!
        my $bitmask = int(rand(256)); # a random byte between 0 and 255
        # swap random bits between parents, according to the bitmask
        $child->{dna} = ($parent2->{dna} & $bitmask) | ($parent->{dna} & ~$bitmask);

        push @new_population, $child; # the child is now a part of the new generation
    }

    return \@new_population;
}

sub mutate
{
    my $population = shift @_;
    my $mut_rate = shift @_;

    foreach my $individual (@$population)
    {
        # mutate an individual only if rand() returns less than mut_rate
        next if rand > $mut_rate;

        # mutate the DNA by and-ing and then or-ing it with two random
        # integers between 0 and 255
        my $old_dna = $individual->{dna};
        $individual->{dna} &= int(rand(256));
        $individual->{dna} |= int(rand(256));
        # print "Mutated $old_dna to ", $individual->{dna}, "\n";
    }
}

sub fitness
{
    my $dna = shift @_;
    return $dna/256;
}

# Function to sample from an array of weighted elements
# originally written by Abigail <abigail@foad.org>
sub sample
{
    # get the reference to the weights array
    my $weights = shift @_ or return undef;

    # internal counting variables ($count must start at 0 to avoid warnings)
    my ($count, $sample) = (0, undef);

    for (my $i = 0; $i < scalar @$weights; $i++)
    {
        $count += $weights->[$i];
        $sample = $i if rand($count) < $weights->[$i];
    }

    # return an index into the weights array
    return $sample;
}

There are several interesting things about Listing 1. Its main loop is at the beginning, and you
should understand all the pieces and how they work together on the population (and yet they are
separate, so we can reuse them in the next example).

We build the weights array in the select_parents() function by stacking map() on top of grep().
While it could have been written as a loop, the one-line solution is much cleaner, and does not slow
the program down significantly.

Listing 2. One-line solution

my @weights = map { $_->{fitness} } grep { $_->{survived} } @$population;

The $population array reference is de-referenced. Then, only elements of the array with the
"survived" field (set earlier by the survive() function) get through the grep. The survivors are then
distilled to their fitness numbers, which get put in their places in the weights array.

The population size was chosen to be 256, because that way it was easy to initialize every
individual to a number equal to its index. Feel free to start with different population sizes.

Mutation rates of more than 1% caused the maximum and minimum fitness to fluctuate wildly; the
population never stabilized towards high fitness. Low mutation rates caused the population to reach
high fitness as a whole much more slowly. In the end, 1% was about right for our population size.

The breeding selection algorithm picks one parent by looking at the weights - in effect, every
individual gets a chance to be a parent, but the number of parent slots is finite. The second parent is
picked at random from the parent population. Why? Well, we could have used weights to determine
the second parent as well, but this way we ensure every parenting individual has a chance to
participate in the breeding process.

The actual breeding is accomplished with a random 8-bit bitmask. We AND the bitmask with one
parent's DNA (remember, it's just a byte), AND the inverted bitmask with the other parent's DNA,
and then OR the two results together. The effect is that we pick random bits from one parent, and
the rest of the bits come from the other parent.
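Here is the same bitmask trick as a stand-alone sketch, with concrete (arbitrarily chosen) byte
values:

```perl
#!/usr/bin/perl -w
# Sketch: uniform crossover of two one-byte DNA values with a bitmask.
# The parent bytes and mask below are arbitrary illustrations.
use strict;

sub crossover
{
    my ($dna1, $dna2, $bitmask) = @_;
    # bits set in the mask come from parent 2, the rest from parent 1;
    # the & 0xFF keeps the complement of the mask within one byte
    return ($dna2 & $bitmask) | ($dna1 & ~$bitmask & 0xFF);
}

my $parent1 = 240; # 11110000
my $parent2 = 15;  # 00001111
my $mask    = 170; # 10101010
print crossover($parent1, $parent2, $mask), "\n"; # prints 90 (01011010)
```

Note that crossing an individual with itself always reproduces the same DNA, whatever the mask.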

The mutation is accomplished by AND and OR of individual DNA with random 8-bit bitmasks.
Oh, and by the way, the most fit DNA was 255, of course. You don't need to wait 100,000
generations. Just hit Ctrl-C when you are done admiring the status line.

Breeding words

For this example, we will make the DNA 40 bits (5 bytes). Each byte will represent a letter or a
space. We could have packed more information into each byte, but it would have obscured the
purpose of the example. Every byte's value (0-255, numerically) will be either A through Z (if the
value is between 65 and 90, conveniently chosen to match the ASCII set), or a space (if the value is
0 to 64, or 91 to 255).
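That byte-to-letter mapping can be sketched as follows. The helper name dna_byte_to_char() is
invented here for illustration; the actual listing may name things differently.

```perl
#!/usr/bin/perl -w
# Sketch: map one DNA byte (0-255) to a letter or a space, as described
# above. The helper name dna_byte_to_char() is an invented illustration.
use strict;

sub dna_byte_to_char
{
    my $byte = shift @_;
    # 65 through 90 are the ASCII codes for 'A' through 'Z'
    return chr($byte) if $byte >= 65 && $byte <= 90;
    # everything else (0-64 and 91-255) decodes to a space
    return ' ';
}
```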

Note how much of the example is similar to the example in Listing 1. The dictionary of words
follows the program.

The main problem with this example was that DNA lengths greater than 32 bits were hard to
manage. I started out trying to do my own bit operations, which were not only unwieldy, but also
extremely slow. Then I tried the Math::BigInt package, which was extremely slow for this purpose.

Finally, I settled on the vec() function -- it is quite fast, and was the right choice for handling DNA
(essentially, the DNA is a bit vector, a built-in Perl data structure). Use "perldoc -f vec" to find out
more about the vec() function.
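As a quick illustration of why vec() suits this job, here is a sketch that treats a plain Perl string
as a vector of 8-bit DNA slots (the values are arbitrary):

```perl
#!/usr/bin/perl -w
# Sketch: using vec() to read and write 8-bit slots in a bit vector.
# The values below are arbitrary illustrations.
use strict;

my $dna = '';                # an empty bit vector (just a Perl string)
vec($dna, 0, 8) = 72;        # set byte 0
vec($dna, 1, 8) = 73;        # set byte 1
print vec($dna, 0, 8), "\n"; # prints 72
print $dna, "\n";            # prints "HI" (ASCII 72, 73)
```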

It is possible to end up with 1024 individuals with 0 fitness. That's why this example has better
safeguards against such an "accident" than the first example.

The init_population(), recombine(), and mutate() functions were changed to handle bit vectors
instead of bytes.

The dna_to_words() function is somewhat inefficient, but is not invoked often enough to make a
difference. The biggest slowdown comes from the fitness() function trying to match all the words in
the dictionary, plus all the letters in the alphabet.

Fitness was calculated as: 2 for each letter in the DNA, plus the frequency of that letter in the
dictionary, plus 2^N for every dictionary word of length N in the DNA. The dictionary array and
letter frequencies hash were obtained only once (using a closure). Feel free to modify the fitness
function and the dictionary to breed your own English words. The fitness formula shown is heavily
biased towards letters, and takes a while to converge on English words (though "on" and "in" were
pretty frequent visitors).
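A fitness calculation along those lines can be sketched as below. The tiny dictionary and letter
frequency table are invented samples for illustration; the real program builds them once from its
dictionary file via a closure.

```perl
#!/usr/bin/perl -w
# Sketch: a word-breeding fitness score as described above.
# The @dictionary and %frequency data are tiny invented samples.
use strict;

my @dictionary = qw(IN ON TIN);
my %frequency = (I => 0.1, N => 0.2, O => 0.1, T => 0.05);

sub word_fitness
{
    my $text = shift @_;
    my $fitness = 0;

    # 2 points, plus the dictionary frequency, for every letter in the text
    foreach my $letter ($text =~ /([A-Z])/g)
    {
        $fitness += 2 + ($frequency{$letter} || 0);
    }

    # 2**N points for every dictionary word of length N found in the text
    foreach my $word (@dictionary)
    {
        $fitness += 2 ** length($word) if index($text, $word) >= 0;
    }

    return $fitness;
}
```

With these samples, "TIN" scores for its three letters plus the words "TIN" and "IN", which shows
how heavily whole words outweigh loose letters as words get longer.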

Conclusion

The evolutionary genetic algorithm is a fascinating topic that can hardly be exhausted in a single
article. I hope you experiment with the examples and create your own Darwinian breeding grounds.
It is especially entertaining to play with the fitness function in the second example, and watch the
words that evolve.