
Artificial Intelligence:

Machine Learning - 2
Neural Networks & Genetic Algorithms

 Russell & Norvig: Sections 18.1 to 18.3 and Section 4.1.4

Today

 Last time: Decision Trees (Evaluation & Unsupervised Learning)
 Neural Networks
   Perceptrons
   Multilayer Neural Networks
 Genetic Algorithms

Neural Networks

 A radically different approach to reasoning and learning
 Inspired by biology:
   a set of many simple processing units (neurons) connected together
   the behavior of each neuron is very simple, like the neurons in the human brain
   but a collection of neurons can have sophisticated behavior and can be used for complex tasks
 In a neural network, the behavior depends on the weights of the connections between the neurons
 The weights are learned from the training data

Biological Neurons

 Human brain  200 billion neurons, 32 trillion synapses
 Each neuron is connected to thousands of other neurons
 A neuron is made of:
   the soma: the body of the neuron
   dendrites: filaments that provide input to the neuron
   the axon: sends an output signal
   synapses: connections with other neurons; they release certain quantities of chemicals called neurotransmitters to other neurons

Behavior of a Neuron

 A neuron receives inputs from its neighbors
 If enough inputs are received at the same time:
   the neuron is activated
   and fires an output to its neighbors
 Repeated firings across a synapse increase its sensitivity and the future likelihood of its firing
 If a particular stimulus repeatedly causes activity in a group of neurons, they become strongly associated

Today

 Last time: Decision Trees (Evaluation & Unsupervised Learning)
 Neural Networks
   Perceptrons
   Multilayer Neural Networks
 Genetic Algorithms

A Simple Perceptron

 A simple computational neuron
 Input:
   input signals xi
   a weight wi for each feature xi
     represents the strength of the connection with the neighboring neurons
 Output:
   if the weighted sum of the inputs >= some threshold t, the neuron fires (output = 1); otherwise output = 0
   If (w1x1 + ... + wnxn) >= t
    Then output = 1
    Else output = 0
 Learning:
   use the training data to adjust the weights in the network (see the sketch below)

source: Luger (2005)
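As a sketch, the firing rule above can be written in a few lines of Python (our own code; the function name and example values are illustrative, not from the slides):

```python
# A minimal sketch of the perceptron firing rule, assuming binary features.
def perceptron_output(weights, inputs, threshold):
    """Fire (return 1) if the weighted sum of the inputs reaches the threshold."""
    weighted_sum = sum(w * x for w, x in zip(weights, inputs))
    return 1 if weighted_sum >= threshold else 0

# Example from the next slides: Richard's features (A last year, male,
# works hard, drinks) with all weights at 0.2 and threshold t = 0.55
print(perceptron_output([0.2, 0.2, 0.2, 0.2], [1, 1, 0, 1], 0.55))  # -> 1
```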

The Idea

| Student | First last year? | Male? | Works hard? | Drinks? | First this year? (output) |
|---------|------------------|-------|-------------|---------|---------------------------|
| Richard | Yes              | Yes   | No          | Yes     | No                        |
| Alan    | Yes              | Yes   | Yes         | No      | Yes                       |

Training procedure:

 Step 1: Set the weights to random values
 Step 2: Feed the perceptron with a set of inputs
 Step 3: Compute the network output
 Step 4: Adjust the weights:
   if the output is correct  the weights stay the same
   if the output = 0 but should be 1  increase the weights on the active connections (i.e. inputs with xi = 1)
   if the output = 1 but should be 0  decrease the weights on the active connections (i.e. inputs with xi = 1)
 Step 5: Repeat steps 2 to 4 a large number of times, until the network converges to the right results for the given training examples (a training-loop sketch follows below)

source: Cawsey (1998)
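A minimal training-loop sketch in Python, assuming binary features, a fixed threshold, and a constant learning rate (all names here are ours, not from the slides):

```python
# A sketch of the threshold-perceptron training procedure (steps 1-5).
import random

def train_perceptron(data, threshold=0.55, rate=0.05, epochs=100):
    """data is a list of (inputs, target) pairs with binary values."""
    n = len(data[0][0])                               # number of features
    weights = [random.random() for _ in range(n)]     # Step 1
    for _ in range(epochs):                           # Step 5: repeat
        for inputs, target in data:                   # Step 2: feed inputs
            s = sum(w * x for w, x in zip(weights, inputs))
            output = 1 if s >= threshold else 0       # Step 3
            if output != target:                      # Step 4: adjust weights
                delta = rate if target == 1 else -rate
                weights = [w + delta * x for w, x in zip(weights, inputs)]
    return weights
```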

A Simple Example

 Each feature (works hard, male, ...) is an xi:
   if x1 = 1, the student got an A last year
   if x1 = 0, the student did not get an A last year
 Initially, set all weights to the same starting value (all 0.2 here; normally these would be random)
 Assume:
   threshold t = 0.55
   constant learning rate α = 0.05

source: Cawsey (1998)

A Simple Example (2)

| Student | A last year? | Male? | Works hard? | Drinks? | A this year? (output) |
|---------|--------------|-------|-------------|---------|-----------------------|
| Richard | Yes          | Yes   | No          | Yes     | No                    |
| Alan    | Yes          | Yes   | Yes         | No      | Yes                   |
| Alison  | No           | No    | Yes         | No      | No                    |
| Jeff    | No           | Yes   | No          | Yes     | No                    |
| Gail    | Yes          | No    | Yes         | Yes     | Yes                   |
| Simon   | No           | Yes   | Yes         | Yes     | No                    |

 Richard:
   (1 × 0.2) + (1 × 0.2) + (0 × 0.2) + (1 × 0.2) = 0.6 >= 0.55  output is 1
   but he did not get an A this year
   so reduce all weights on active connections (inputs with value 1) by 0.05
   we get w1 = 0.15, w2 = 0.15, w3 = 0.2, w4 = 0.15

A Simple Example (3)

(same student table as above)

 Alan:
   (1 × 0.15) + (1 × 0.15) + (1 × 0.2) + (0 × 0.15) = 0.5 < 0.55  output is 0
   but he got an A this year
   so increase all weights on active connections by 0.05
   we get w1 = 0.2, w2 = 0.2, w3 = 0.25, w4 = 0.15
 Continue with Alison, Jeff, Gail, Simon, then cycle through Richard, Alan, ... again
 After 2 iterations over the training set, we get:
   w1 = 0.2, w2 = 0.1, w3 = 0.25, w4 = 0.1

A Simple Example (4)

(same student table as above)

 Let's check (w1 = 0.2, w2 = 0.1, w3 = 0.25, w4 = 0.1):
   Richard: (1 × 0.2) + (1 × 0.1) + (0 × 0.25) + (1 × 0.1) = 0.4 < 0.55  output is 0 
   Alan:    (1 × 0.2) + (1 × 0.1) + (1 × 0.25) + (0 × 0.1) = 0.55 >= 0.55  output is 1 
   Alison:  (0 × 0.2) + (0 × 0.1) + (1 × 0.25) + (0 × 0.1) = 0.25 < 0.55  output is 0 
   Jeff:    (0 × 0.2) + (1 × 0.1) + (0 × 0.25) + (1 × 0.1) = 0.2 < 0.55  output is 0 
   Gail:    (1 × 0.2) + (0 × 0.1) + (1 × 0.25) + (1 × 0.1) = 0.55 >= 0.55  output is 1 
   Simon:   (0 × 0.2) + (1 × 0.1) + (1 × 0.25) + (1 × 0.1) = 0.45 < 0.55  output is 0 
 The network now classifies all six training examples correctly (checked in code below)
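A quick check of this result in Python (our own code, not from the slides):

```python
# Verify that the learned weights classify all six students correctly.
students = {         # (A last year, Male, Works hard, Drinks) -> A this year
    "Richard": ([1, 1, 0, 1], 0), "Alan":  ([1, 1, 1, 0], 1),
    "Alison":  ([0, 0, 1, 0], 0), "Jeff":  ([0, 1, 0, 1], 0),
    "Gail":    ([1, 0, 1, 1], 1), "Simon": ([0, 1, 1, 1], 0),
}
weights, threshold = [0.2, 0.1, 0.25, 0.1], 0.55
for name, (xs, target) in students.items():
    s = sum(w * x for w, x in zip(weights, xs))
    assert (1 if s >= threshold else 0) == target, name
print("all six students classified correctly")
```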

Decision Boundaries of Perceptrons

 So we have just learned the function:
   If (0.2x1 + 0.1x2 + 0.25x3 + 0.1x4 >= 0.55) then 1, otherwise 0
   equivalently: If (0.2x1 + 0.1x2 + 0.25x3 + 0.1x4 - 0.55 >= 0) then 1, otherwise 0
 Assume we only had 2 features:
   If (w1x1 + w2x2 - t >= 0) then 1, otherwise 0
   The learned function describes a line in the input space
   This line is used to separate the two classes C1 and C2
   t (the threshold, later called the bias b) is used to shift the line

[Figure: the decision boundary w1x1 + w2x2 - t = 0 in the (x1, x2) plane; the decision region where w1x1 + w2x2 - t >= 0 is on one side of the line, and the region where w1x1 + w2x2 - t < 0 is on the other, separating C1 from C2.]

Decision Boundaries of Perceptrons

 More generally, with n features, the learned function describes a hyperplane in the input space.

[Figure: a hyperplane decision boundary in (x1, x2, x3) space separating the decision region for C1 from the decision region for C2.]

Adding a Bias

 We can avoid having to figure out the threshold by using a bias
 A bias is equivalent to a weight on an extra input feature that always has a value of 1
 The neuron then computes b + Σi xiwi

[Figure: a neuron with inputs x1 and x2, weights w1 and w2, and bias b.]

Perceptron - More Generally

 Inputs x1, ..., xn with weights w1, ..., wn
 A bias input xn+1 = 1 with weight wn+1 (to replace the threshold)
 The perceptron computes the weighted sum Σ_{i=1..n+1} wi xi, then applies an activation function f
 The final classification is:

  y = f( Σ_{i=1..n+1} wi xi )

Common Activation Functions

 Hard-limit activation functions:
   step:  y = 1 if Σ_{i=1..n+1} wi xi >= t,  0 otherwise
   sign:  y = +1 if Σ_{i=1..n+1} wi xi >= 0,  -1 otherwise
 Sigmoid function:
   y = 1 / (1 + e^(-Σ_{i=1..n+1} wi xi))

[Russell & Norvig, 1995]
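A sketch of the three activation functions in Python (our own code), each taking the already-computed weighted sum as input:

```python
import math

def step(s, t=0.0):
    return 1 if s >= t else 0          # hard limit at threshold t

def sign(s):
    return 1 if s >= 0 else -1         # hard limit at 0

def sigmoid(s):
    return 1.0 / (1.0 + math.exp(-s))  # smooth, differentiable squashing
```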

Delta Rule

 Error = target output - actual output = (T - O)
 1. The weight update can use a constant learning rate α (as in the previous example):

  Δw = α (T - O)

   So:
     if T = 0 and O = 1 (i.e. a false positive)  decrease w by α
     if T = 1 and O = 0 (i.e. a false negative)  increase w by α
     if T = O (i.e. no error)  don't change w
 2. Or, additionally, a fraction of the input feature xi:

  Δwi = α (T - O) xi

   So the update is proportional to the value of the input feature xi:
     if T = 0 and O = 1 (i.e. a false positive)  decrease wi by α xi
     if T = 1 and O = 0 (i.e. a false negative)  increase wi by α xi
     if T = O (i.e. no error)  don't change wi
 This is called the delta rule (or perceptron learning rule); a sketch follows below
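A minimal sketch of the per-feature delta rule in Python (our own code; the Alan step from the earlier example is used to check it):

```python
# w_i <- w_i + alpha * (T - O) * x_i for every weight.
def delta_rule_update(weights, inputs, target, output, alpha=0.05):
    err = target - output              # +1 (false negative), -1, or 0
    return [w + alpha * err * x for w, x in zip(weights, inputs)]

# False negative (T=1, O=0): active weights increase by alpha * x_i
print(delta_rule_update([0.15, 0.15, 0.2, 0.15], [1, 1, 1, 0], 1, 0))
# -> ~[0.2, 0.2, 0.25, 0.15] (up to float rounding), matching the Alan step
```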

Perceptron Convergence Theorem

 Cycle through the set of training examples
 Suppose a solution with zero error exists
 Then the delta rule will find a solution in finite time

Example of the Delta Rule

[Figure: a table of training data, a plot of the training data, and the perceptron used to learn it.]

source: Luger (2005)

Let's Train the Perceptron

 Assume random initialization:
   w1 = 0.75
   w2 = 0.5
   w3 = -0.6
 Assume:
   the sign activation function (threshold = 0)
   learning rate α = 0.2

source: Luger (2005)

Training

 data #1: f(0.75×1 + 0.5×1 - 0.6×1) = f(0.65)  1  correct, no update
 data #2: f(0.75×9.4 + 0.5×6.4 - 0.6×1) = f(9.65)  1  wrong
    error = (T - O) = (-1 - 1) = -2
    w1 = w1 + (-2 × 0.2 × 9.4) = 0.75 - 3.76 = -3.01
    w2 = w2 + (-2 × 0.2 × 6.4) = -2.06
    w3 = w3 + (-2 × 0.2 × 1) = -1.00
 data #3: f(-3.01×2.5 - 2.06×2.1 - 1×1) = f(-12.84)  -1  wrong
    error = (T - O) = (1 - (-1)) = 2
    w1 = -3.01 + (2 × 0.2 × 2.5) = -2.01
    w2 = -2.06 + (2 × 0.2 × 2.1) = -1.22
    w3 = -1.00 + (2 × 0.2 × 1) = -0.60
 Repeating over 500 iterations, we converge to:
   w1 = -1.3, w2 = -1.1, w3 = 10.9

source: Luger (2005)

Limits of the Perceptron

 In 1969, Minsky and Papert showed formally what functions could and could not be represented by perceptrons
 Only linearly separable functions can be represented by a perceptron

source: Luger (2005)

AND and OR Perceptrons

[Figure: perceptrons computing the AND and OR functions, with their weights and thresholds.]

source: Luger (2005)

Example: the XOR Function

 We cannot build a perceptron to learn the exclusive-or function
 To learn the XOR function (truth table over x1, x2  output), we would need:
   two inputs x1 and x2
   two weights w1 and w2
   a threshold t
 i.e. we must have (one constraint per line of the truth table):
   (1 × w1) + (1 × w2) < t
   (1 × w1) + 0 >= t
   0 + (1 × w2) >= t
   0 + 0 < t
 This system has no solution, so a perceptron cannot learn the XOR function

The XOR Function - Visually

 In a 2-dimensional space (2 features for the input X):
 No straight line in two dimensions can separate (0, 1) and (1, 0) from (0, 0) and (1, 1).

source: Luger (2005)

A Perceptron Network

 So far, we have looked at a single perceptron
 But if the output needs to be more than a binary (yes/no) decision, we need one output unit per class (C1/~C1, C2/~C2, ..., C6/~C6, ...)
 Ex: learning to recognize digits  10 possible outputs  we need a perceptron network

Real World Problems

 Real-world problems cannot always be represented by linearly-separable functions
   Ex: in speech recognition, recognize the vowel sound between the letters "h_d" (ex: hAd, hID, hEAD, ...)
 This caused a decrease in interest in neural networks in the 1970s

source: Tom Mitchell, Machine Learning (1997)

Real World Problems

 Solution: have a hidden layer  multi-layered neural networks
   replace the hard-limit activation function with the sigmoid function (which is differentiable)
   eventually with several output nodes (for non-binary decisions)

Today

 Last time: Decision Trees (Evaluation & Unsupervised Learning)
 Neural Networks
   Perceptrons
   Multilayer Neural Networks
 Genetic Algorithms

Multilayer Neural Networks

 A multilayer neural network can learn even if the problem is not linearly separable
 A multilayer network has:
   an output layer
   (one or more) hidden layers
   an input layer

source: Rich & Knight, Artificial Intelligence, p. 502

Decision Boundaries

 ... of a single perceptron:
   straight lines (hyperplanes)  only linearly separable problems

Decision Boundaries

 ... of a multilayer perceptron with 1 hidden layer:
   convex areas (open or closed)

Decision Boundaries

 ... of a multilayer perceptron with 2 hidden layers:
   combinations of convex areas

Multilayer Neural Networks

 Several algorithms exist to train a multilayer neural network
 Learning is the same as in a perceptron:
   feed the network with training data
   if there is an error (a difference between the output and the target), adjust the weights
 So we must assign the blame for an error to the contributing weights

Feed-forward + Backpropagation

 Feed-forward:
   input from the features is fed forward in the network, from the input layer towards the output layer
 Backpropagation:
   a method to assign the blame for errors to the weights
   the error rate flows backwards from the output layer to the input layer (to adjust the weights in order to minimize the output error)
 Iterate until the error rate is minimized:
   repeat the forward pass and back pass for the next data point, until all data points are examined (1 epoch)
   repeat this entire exercise until the overall error is minimized
   Ex: squared errors = ½ Σ_{iOutputLayer} (TargetOutput_i - ActualOutput_i)² < 0.001

Backpropagation

 In a multilayer network:
   computing the error in the output layer is clear
   computing the error in the hidden layer is not clear, because we don't know what its output should be
 Intuitively:
   a hidden node h is responsible for some fraction of the error in each of the output nodes to which it connects
   so the error values (δ) are divided according to the weight of the connection between the hidden node and the output node, and are propagated back to provide the error values (δ) for the hidden layer

Backpropagation Visually

 Goal: minimize the error

[Figure: the error surface over the weight space; gradient descent moves from (w1, w2) to (w1+Δw1, w2+Δw2), downhill on the error surface.]

 The delta rule is a gradient descent technique for updating the weights in a single-layer perceptron
 Backpropagation is a generalised case of the delta rule

The Sigmoid Function

 Backpropagation requires a differentiable activation function
   the sigmoidal (or squashed, or logistic) function is the standard choice
 f returns a value between 0 and 1 (instead of 0 or 1)
 f indicates how close/how far the output of the network is compared to the right answer (the error term)

Training the Network

 Step 0: Initialise the weights of the network randomly

// feed-forward
 Step 1: Do a forward pass through the network (use the sigmoid):

  Oi = sigmoid(Σj wji xj) = 1 / (1 + e^(-Σj wji xj))

// propagate the errors backwards
 Step 2: For each output unit k, calculate its error term δk:

  δk = g'(x) × Errk = Ok (1 - Ok) (Tk - Ok)

  (note: with the sigmoid, g'(x) = g(x) (1 - g(x)), the derivative of the sigmoid)

 Step 3: For each hidden unit h, calculate its error term δh as the sum of the weighted error terms of the output nodes that h is connected to (i.e. h contributed to the errors δk):

  δh = Oh (1 - Oh) Σ_{koutputs} wkh δk

 Step 4: Update each network weight wij:

  wij  wij + Δwij   where Δwij = α δj xi

 Repeat Steps 2, 3 & 4 until the error is minimised to a given level (see the sketch below)
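A sketch of Steps 1-4 in Python for the 2-2-1 network of the XOR example that follows (our own code; the weight names and initial values mirror the slides, with learning rate α = 0.1):

```python
import math

def sigmoid(s):
    return 1.0 / (1.0 + math.exp(-s))

def backprop_step(x1, x2, target, w, alpha=0.1):
    # Step 1: forward pass (biases theta are subtracted, as in the slides)
    o3 = sigmoid(x1 * w["w13"] + x2 * w["w23"] - w["t3"])
    o4 = sigmoid(x1 * w["w14"] + x2 * w["w24"] - w["t4"])
    o5 = sigmoid(o3 * w["w35"] + o4 * w["w45"] - w["t5"])
    # Step 2: error term of the output unit
    d5 = o5 * (1 - o5) * (target - o5)
    # Step 3: error terms of the hidden units
    d3 = o3 * (1 - o3) * d5 * w["w35"]
    d4 = o4 * (1 - o4) * d5 * w["w45"]
    # Step 4: update weights (and biases, whose "input" is -1)
    w["w13"] += alpha * d3 * x1; w["w23"] += alpha * d3 * x2
    w["w14"] += alpha * d4 * x1; w["w24"] += alpha * d4 * x2
    w["w35"] += alpha * d5 * o3; w["w45"] += alpha * d5 * o4
    w["t3"] += alpha * d3 * -1; w["t4"] += alpha * d4 * -1
    w["t5"] += alpha * d5 * -1
    return o5

weights = {"w13": 0.5, "w23": 0.4, "w14": 0.9, "w24": 1.0,
           "w35": -1.2, "w45": 1.1, "t3": 0.8, "t4": -0.1, "t5": 0.3}
backprop_step(1, 1, 0, weights)   # reproduces the worked example below
```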

Example: XOR

 Network: 2 input nodes + 2 hidden nodes (3 and 4) + 1 output node (5, producing O5) + 3 bias nodes

source: Negnevitsky, Artificial Intelligence, p. 181

Example: Step 0 (initialization)

 Step 0: Initialize the network at random:
   w13 = 0.5, w23 = 0.4, w14 = 0.9, w24 = 1.0
   w35 = -1.2, w45 = 1.1
   θ3 = 0.8, θ4 = -0.1, θ5 = 0.3

Step 1: Feed Forward

 Step 1: Feed the inputs and calculate the outputs:

  Oi = sigmoid(Σj wji xj) = 1 / (1 + e^(-Σj wji xj))

 With (x1 = 1, x2 = 1) as input and target output T = 0:
   Output of hidden node 3:
    O3 = sigmoid(x1 w13 + x2 w23 - θ3) = 1 / (1 + e^-(1×0.5 + 1×0.4 - 1×0.8)) = 0.5250
   Output of hidden node 4:
    O4 = sigmoid(x1 w14 + x2 w24 - θ4) = 1 / (1 + e^-(1×0.9 + 1×1.0 + 1×0.1)) = 0.8808
   Output of output node 5:
    O5 = sigmoid(O3 w35 + O4 w45 - θ5) = 1 / (1 + e^-(0.5250×(-1.2) + 0.8808×1.1 - 1×0.3)) = 0.5097

Step 2: Calculate the Error Term of the Output Layer

  δk = Ok (1 - Ok) (Tk - Ok)

 Error term of neuron 5 in the output layer:
   δ5 = O5 (1 - O5) (T5 - O5)
     = (0.5097) × (1 - 0.5097) × (0 - 0.5097)
     = -0.1274
 This error δ5 will be used to modify w35 and w45

Step 3: Calculate the Error Terms of the Hidden Layer

  δh = Oh (1 - Oh) Σ_{koutputs} wkh δk

 Error terms of neurons 3 & 4 in the hidden layer:
   δ3 = O3 (1 - O3) δ5 w35
     = (0.5250) × (1 - 0.5250) × (-0.1274) × (-1.2)
     = 0.0381
   δ4 = O4 (1 - O4) δ5 w45
     = (0.8808) × (1 - 0.8808) × (-0.1274) × (1.1)
     = -0.0147
 δ3 will be used to modify w13 and w23; δ4 will be used to modify w14 and w24

Step 4: Update Weights

  wij  wij + Δwij   where Δwij = α δj xi

 Update all weights (assume a learning rate α = 0.1):
   Δw13 = α δ3 x1 = 0.1 × 0.0381 × 1 = 0.0038
   Δw14 = α δ4 x1 = 0.1 × -0.0147 × 1 = -0.0015
   Δw23 = α δ3 x2 = 0.1 × 0.0381 × 1 = 0.0038
   Δw24 = α δ4 x2 = 0.1 × -0.0147 × 1 = -0.0015
   Δw35 = α δ5 O3 = 0.1 × -0.1274 × 0.5250 = -0.00669  // O3 is the input from node 3 to node 5
   Δw45 = α δ5 O4 = 0.1 × -0.1274 × 0.8808 = -0.01122  // O4 is the input from node 4 to node 5
   Δθ3 = α δ3 (-1) = 0.1 × 0.0381 × -1 = -0.0038
   Δθ4 = α δ4 (-1) = 0.1 × -0.0147 × -1 = 0.0015
   Δθ5 = α δ5 (-1) = 0.1 × -0.1274 × -1 = 0.0127

Step 4: Update Weights (con't)

 wij  wij + Δwij:
   w13 = w13 + Δw13 = 0.5 + 0.0038 = 0.5038
   w14 = w14 + Δw14 = 0.9 - 0.0015 = 0.8985
   w23 = w23 + Δw23 = 0.4 + 0.0038 = 0.4038
   w24 = w24 + Δw24 = 1.0 - 0.0015 = 0.9985
   w35 = w35 + Δw35 = -1.2 - 0.00669 = -1.20669
   w45 = w45 + Δw45 = 1.1 - 0.01122 = 1.08878
   θ3 = θ3 + Δθ3 = 0.8 - 0.0038 = 0.7962
   θ4 = θ4 + Δθ4 = -0.1 + 0.0015 = -0.0985
   θ5 = θ5 + Δθ5 = 0.3 + 0.0127 = 0.3127

Step 5: Iterate Through the Data

 After adjusting all the weights, repeat the forward pass and back pass for the next data point, until all data points are examined
 Repeat this entire exercise until the overall error is minimised
   Ex: squared errors = ½ Σ_{iOutputLayer} (Ti - Oi)² < 0.001

The Result

 After 224 epochs (1 epoch = going through all the data once), we get:
   w13 = 4.76, w23 = 4.76, w14 = 6.39, w24 = 6.39
   w35 = -10.38, w45 = 9.77
   θ3 = 7.31, θ4 = 2.84, θ5 = 4.56

Error is Minimized

| x1 | x2 | Target output T5 | Actual output O5 | Error e |
|----|----|------------------|------------------|---------|
| 1  | 1  | 0                | 0.0155           | -0.0155 |
| 0  | 1  | 1                | 0.9849           | 0.0151  |
| 1  | 0  | 1                | 0.9849           | 0.0151  |
| 0  | 0  | 0                | 0.0175           | -0.0175 |

 Squared errors = ½ × (0.0155² + 0.0151² + 0.0151² + 0.0175²) < 0.001 (some threshold)  stop! (checked below)
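The stopping criterion can be checked in a couple of lines (our own code):

```python
# Squared-error stopping criterion for the four XOR data points above.
errors = [-0.0155, 0.0151, 0.0151, -0.0175]
sse = 0.5 * sum(e * e for e in errors)
print(sse)   # ~0.0005, below the 0.001 threshold -> stop
```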

Applications of Neural Networks

 Handwritten digit recognition:
   training set = a set of handwritten digits (0-9)
   task: given a bitmap, determine what digit it represents
   input: 1 feature for each pixel of the bitmap
   output: 1 output unit for each possible character (only 1 should be activated)
   after training, the network should work for fonts (handwriting) never encountered
 Related pattern recognition applications:
   recognize postal codes
   recognize signatures

Applications of Neural Networks

 Speech synthesis: learning to pronounce English words
   a difficult task for a rule-based system, because English pronunciation is highly irregular
   examples:
     the letter "c" can be pronounced [k] (cat) or [s] (cents)
     "woman" vs "women"
 NETtalk:
   uses the context and the letters around a letter to learn how to pronounce it
   input: a letter and its surrounding letters
   output: a phoneme

NETtalk Architecture

 Ex: "a cat"  the "c" is pronounced [k]
 The network is made of 3 layers of units:
   the input layer corresponds to a 7-character window in the text
   each position in the window is represented by 29 input units (26 letters + 3 for punctuation and spaces)
   26 output units  one for each possible phoneme

source: Luger (2005)

Neural Networks

 Disadvantage:
   the result is not easy for humans to understand (a set of weights, compared to a decision tree)  it is a black box
 Advantage:
   robust to noise in the input (small changes in the input do not normally cause a change in the output), and degrades gracefully

Today

 Last time: Decision Trees (Evaluation & Unsupervised Learning)
 Neural Networks
   Perceptrons
   Multilayer Neural Networks
 Genetic Algorithms

Genetic Algorithms

 A learning technique based on evolution
   also called an evolutionary algorithm
   based on a biological metaphor (like neural networks)
 Learning = competition among a population of evolving candidate solutions to a problem
 A fitness function evaluates each solution to decide if it will contribute to the next generation of solutions
 Through genetic operators, the algorithm creates a new population of candidate solutions from the previous generation

Genetic Algorithms

 Let P(t) be the set of candidate solutions at time t:
   set time t  0
   initialize the population P(t)   // typically, chosen at random
   while P(t) does not include an acceptable solution:
     evaluate the fitness of each member of the population P(t)
     select pairs of solutions from P(t) based on fitness
     produce the offspring of these pairs using genetic operators
     replace the weakest candidates (based on fitness) of P(t) with these offspring
     mutation (optional)
     set time t  t+1
 (a code sketch of this loop follows below)
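A sketch of this loop in Python, with the fitness function and the genetic operators passed in since they depend on the problem (all names are ours; selection here is a simple fitness-based truncation rather than the proportional selection shown later):

```python
import random

def genetic_algorithm(init_pop, fitness, crossover, mutate,
                      acceptable, mutation_rate=0.05):
    population = list(init_pop)                    # P(0), typically random
    while not any(acceptable(c) for c in population):
        ranked = sorted(population, key=fitness, reverse=True)
        parents = ranked[:len(ranked) // 2]        # select based on fitness
        offspring = []
        for a, b in zip(parents[::2], parents[1::2]):
            offspring.extend(crossover(a, b))      # produce offspring
        # replace the weakest candidates with the offspring
        population = ranked[:len(ranked) - len(offspring)] + offspring
        population = [mutate(c) if random.random() < mutation_rate else c
                      for c in population]         # optional mutation
    return max(population, key=fitness)
```

With the {0, 1, #} operators and fitness function sketched on the following slides, this loop behaves like the worked student example below.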

Solution Representation

 Solutions are typically represented as a fixed-length string over a finite alphabet
   in real DNA, the alphabet is AGTC (adenine, guanine, thymine, cytosine)
     Ex: AATAGC
   in machine learning, the alphabet is typically {0, 1, #} (where # means that the feature is not relevant)
     Ex: for the features "A last year? Male? Works hard? Drinks?", a candidate solution could be #100
 The string is called a chromosome
 Each element of the string is called a gene

The Fitness Function

 Determines whether a candidate solution is good or not
 The exact function depends a lot on the problem
 But very often, the function computes the proportion of the examples that are correctly classified by the candidate solution
   e.g. how many training examples are correctly classified with #110, 100#, ... (see the sketch below)
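A sketch of this fitness function in Python for the student data used below (our own code; a pattern over {0, 1, #} predicts output 1 iff every non-# gene matches the corresponding feature):

```python
STUDENTS = [  # (A last year, Male, Works hard, Drinks) -> A this year
    ([1, 1, 0, 1], 0), ([1, 1, 1, 0], 1), ([0, 0, 1, 0], 0),
    ([0, 1, 0, 1], 0), ([1, 0, 1, 1], 1), ([0, 1, 1, 1], 0),
]

def predicts(pattern, features):
    """1 if every non-# gene matches the corresponding feature, else 0."""
    return int(all(g == "#" or int(g) == x
                   for g, x in zip(pattern, features)))

def fitness(pattern):
    """Fraction of training examples classified correctly."""
    return sum(predicts(pattern, f) == t for f, t in STUDENTS) / len(STUDENTS)

print(fitness("11##"))   # -> 0.666..., i.e. 4/6, as in the example below
```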

Genetic Operators

 Crossover:
   takes 2 candidates and swaps components to produce 2 new candidates
   Ex: 1#0# + 0#10  1#10 and 0#0#
 Mutation:
   takes a single candidate and randomly changes a gene
   important, because the initial population may not include an essential gene
   Ex: 0#10  0#11
 (both operators are sketched in code below)
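A sketch of both operators in Python for {0, 1, #} strings, assuming a single random split point for crossover (our own code):

```python
import random

ALPHABET = "01#"

def crossover(a, b):
    """Swap tails after a random split point; returns 2 offspring."""
    point = random.randrange(1, len(a))
    return a[:point] + b[point:], b[:point] + a[point:]

def mutate(chromosome):
    """Randomly change one gene to a different symbol."""
    i = random.randrange(len(chromosome))
    gene = random.choice(ALPHABET.replace(chromosome[i], ""))
    return chromosome[:i] + gene + chromosome[i + 1:]

# Ex: crossover("1#0#", "0#10") with split point 2 -> ("1#10", "0#0#")
```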

Example (1)

| Student | A last year? | Male?   | Works hard? | Drinks? | A this year? f(X) |
|---------|--------------|---------|-------------|---------|-------------------|
| Richard | Yes / 1      | Yes / 1 | No / 0      | Yes / 1 | No / 0            |
| Alan    | Yes / 1      | Yes / 1 | Yes / 1     | No / 0  | Yes / 1           |
| Alison  | No / 0       | No / 0  | Yes / 1     | No / 0  | No / 0            |
| Jeff    | No / 0       | Yes / 1 | No / 0      | Yes / 1 | No / 0            |
| Gail    | Yes / 1      | No / 0  | Yes / 1     | Yes / 1 | Yes / 1           |
| Simon   | No / 0       | Yes / 1 | Yes / 1     | Yes / 1 | No / 0            |

 Let P(0) = {#1##, 0#10, 11##, #11#}   // chosen at random
 Fitness function:
   #1## (only males will get an A) correctly predicts 2/6 -- Alan, Alison
   0#10 correctly predicts 3/6 -- Richard, Jeff, Simon
   11## correctly predicts 4/6 -- Alan, Alison, Jeff, Simon
   #11# correctly predicts 4/6 -- Richard, Alan, Alison, Jeff
 The best candidates are 11## and #11#

Example (2)

 Let's cross over 11## + #11#  111# and #1##
 Let's replace the weakest solutions with the 2 new offspring:
   P(0) = {#1##, 0#10, 11##, #11#}
   P(1) = {111#, #1##, 11##, #11#}
 Fitness function:
   111#  5/6, #1##  2/6, 11##  4/6, #11#  4/6
 We could go on, but in fact, with a split-in-two crossover, we could never do better than 5/6
 With random mutation:
   if the best solution so far happens to mutate to 1#1#
   the fitness of 1#1# is 6/6!

Proportional Selection

 Main idea: better individuals get a higher chance of being selected
 Selection of a parent is proportional to its fitness score
   assume 000110010111 correctly predicts 3 outputs over 6
   so it has a 50% chance of getting selected for crossover
 Implementation: the roulette wheel technique
   Ex: fitness(A) = 3/6  50% of the wheel, fitness(B) = 1/6  17%, fitness(C) = 2/6  33%

Proportional Cross-over

 Main idea: better individuals transmit more genes
   non-equal split crossover

source: Russell & Norvig (2003)

Example: N-Queens Problem

 N-Queens dates back to the 19th century
 Problem: place N queens on an N×N chessboard such that no queen can attack any other
 If N = 8:
   4,426,165,368 (64 choose 8) possible arrangements of eight queens on an 8×8 board
   but only 92 solutions

12 Unique Solutions

 + 80 others that are symmetrical

Problem Representation

 Observation that eliminates many arrangements from consideration: no queen can reside in a row or a column that contains another queen
 So, assume 1 queen per column, and represent a board by the row position of the 1st queen, 2nd queen, 3rd queen, ...
   bottom row = 1, top row = 8
   Ex: <16257483>

source: Russell & Norvig (2003)

Fitness Function

 Number of non-attacking pairs of queens
   nb of pairs of different queens = 7+6+5+4+3+2+1 = 28 pairs
   so fitness = 28 - nb of possible attacks
   Ex: a board with 1 attacking pair has fitness = 28 - 1 = 27 (see the sketch below)

source: Russell & Norvig (2003)
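A sketch of this fitness function in Python (our own code; the board representation is the one from the previous slide):

```python
# Count the pairs of queens that do not attack each other.
def queens_fitness(board):
    """board[i] = row of the queen in column i (rows numbered from 1)."""
    n, attacks = len(board), 0
    for i in range(n):
        for j in range(i + 1, n):
            same_row = board[i] == board[j]
            same_diag = abs(board[i] - board[j]) == j - i
            attacks += same_row or same_diag
    return n * (n - 1) // 2 - attacks   # 28 - attacks when n = 8

print(queens_fitness([1, 6, 2, 5, 7, 4, 8, 3]))  # <16257483> -> 27 (1 attacking pair)
```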

Crossing Over

[Figure: four candidate boards with fitness 28 - 4 = 24, 28 - 5 = 23, 20, and 11 non-attacking pairs; selection probabilities are proportional to fitness: 24/(24+23+20+11) = 31%, 23/(24+23+20+11) = 29%, etc.]

source: Russell & Norvig (2003)

Genetic Algorithms

 Applications:
   optimization -- scheduling problems
   heuristic search
   machine learning
 Difficulties:
   finding a good representation and fitness function (depends a lot on the problem domain)
   may converge to a local optimum

Today

 Last time: Decision Trees (Evaluation & Unsupervised Learning)
 Neural Networks
   Perceptrons
   Multilayer Neural Networks
 Genetic Algorithms
