
Artificial Intelligence:

Machine Learning - 2
Neural Networks & Genetic Algorithms

 Russell & Norvig: Sections 18.1 to 18.3 and Section 4.1.4

Today

 Last time: Decision Trees (Evaluation & Unsupervised Learning)
 Neural Networks
   Perceptrons
   Multilayer Neural Networks
 Genetic Algorithms

Neural Networks

 A radically different approach to reasoning and learning
 Inspired by biology:
   a set of many simple processing units (neurons) connected together
   the behavior of each neuron is very simple, like the neurons in the human brain
   but a collection of neurons can have sophisticated behavior and can be used for complex tasks
 In a neural network, the behavior depends on the weights of the connections between the neurons
 The weights are learned from the training data

Biological Neurons

 Human brain  200 billion neurons, 32 trillion synapses
 Each neuron is connected to thousands of other neurons
 A neuron is made of:
   the soma: the body of the neuron
   dendrites: filaments that provide input to the neuron
   the axon: sends an output signal
   synapses: connections with other neurons; they release certain quantities of chemicals called neurotransmitters to other neurons

Behavior of a Neuron

 A neuron receives inputs from its neighbors
 If enough inputs are received at the same time:
   the neuron is activated
   and fires an output to its neighbors
 Repeated firings across a synapse increase its sensitivity and the future likelihood of its firing
 If a particular stimulus repeatedly causes activity in a group of neurons, they become strongly associated

Today

 Last time: Decision Trees (Evaluation & Unsupervised Learning)
 Neural Networks
   Perceptrons
   Multilayer Neural Networks
 Genetic Algorithms

A Simple Perceptron

 A simple computational neuron
 Input:
   input signals xi
   a weight wi for each feature xi
     represents the strength of the connection with the neighboring neurons
 Output:
   if the weighted sum of the inputs >= some threshold t, the neuron fires (output = 1); otherwise output = 0
   If (w1x1 + ... + wnxn) >= t
    Then output = 1
    Else output = 0
 Learning:
   use the training data to adjust the weights in the network (see the sketch below)

source: Luger (2005)
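As a sketch, the firing rule above can be written in a few lines of Python (our own code; the function name and example values are illustrative, not from the slides):

```python
# A minimal sketch of the perceptron firing rule, assuming binary features.
def perceptron_output(weights, inputs, threshold):
    """Fire (return 1) if the weighted sum of the inputs reaches the threshold."""
    weighted_sum = sum(w * x for w, x in zip(weights, inputs))
    return 1 if weighted_sum >= threshold else 0

# Example from the next slides: Richard's features (A last year, male,
# works hard, drinks) with all weights at 0.2 and threshold t = 0.55
print(perceptron_output([0.2, 0.2, 0.2, 0.2], [1, 1, 0, 1], 0.55))  # -> 1
```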

The Idea

| Student | First last year? | Male? | Works hard? | Drinks? | First this year? (output) |
|---------|------------------|-------|-------------|---------|---------------------------|
| Richard | Yes              | Yes   | No          | Yes     | No                        |
| Alan    | Yes              | Yes   | Yes         | No      | Yes                       |

Training procedure:

 Step 1: Set the weights to random values
 Step 2: Feed the perceptron with a set of inputs
 Step 3: Compute the network output
 Step 4: Adjust the weights:
   if the output is correct  the weights stay the same
   if the output = 0 but should be 1  increase the weights on the active connections (i.e. inputs with xi = 1)
   if the output = 1 but should be 0  decrease the weights on the active connections (i.e. inputs with xi = 1)
 Step 5: Repeat steps 2 to 4 a large number of times, until the network converges to the right results for the given training examples (a training-loop sketch follows below)

source: Cawsey (1998)
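A minimal training-loop sketch in Python, assuming binary features, a fixed threshold, and a constant learning rate (all names here are ours, not from the slides):

```python
# A sketch of the threshold-perceptron training procedure (steps 1-5).
import random

def train_perceptron(data, threshold=0.55, rate=0.05, epochs=100):
    """data is a list of (inputs, target) pairs with binary values."""
    n = len(data[0][0])                               # number of features
    weights = [random.random() for _ in range(n)]     # Step 1
    for _ in range(epochs):                           # Step 5: repeat
        for inputs, target in data:                   # Step 2: feed inputs
            s = sum(w * x for w, x in zip(weights, inputs))
            output = 1 if s >= threshold else 0       # Step 3
            if output != target:                      # Step 4: adjust weights
                delta = rate if target == 1 else -rate
                weights = [w + delta * x for w, x in zip(weights, inputs)]
    return weights
```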

A Simple Example

 Each feature (works hard, male, ...) is an xi:
   if x1 = 1, the student got an A last year
   if x1 = 0, the student did not get an A last year
 Initially, set all weights to the same starting value (all 0.2 here; normally these would be random)
 Assume:
   threshold t = 0.55
   constant learning rate α = 0.05

source: Cawsey (1998)

A Simple Example (2)

| Student | A last year? | Male? | Works hard? | Drinks? | A this year? (output) |
|---------|--------------|-------|-------------|---------|-----------------------|
| Richard | Yes          | Yes   | No          | Yes     | No                    |
| Alan    | Yes          | Yes   | Yes         | No      | Yes                   |
| Alison  | No           | No    | Yes         | No      | No                    |
| Jeff    | No           | Yes   | No          | Yes     | No                    |
| Gail    | Yes          | No    | Yes         | Yes     | Yes                   |
| Simon   | No           | Yes   | Yes         | Yes     | No                    |

 Richard:
   (1 × 0.2) + (1 × 0.2) + (0 × 0.2) + (1 × 0.2) = 0.6 >= 0.55  output is 1
   but he did not get an A this year
   so reduce all weights on active connections (inputs with value 1) by 0.05
   we get w1 = 0.15, w2 = 0.15, w3 = 0.2, w4 = 0.15

A Simple Example (3)

(same student table as above)

 Alan:
   (1 × 0.15) + (1 × 0.15) + (1 × 0.2) + (0 × 0.15) = 0.5 < 0.55  output is 0
   but he got an A this year
   so increase all weights on active connections by 0.05
   we get w1 = 0.2, w2 = 0.2, w3 = 0.25, w4 = 0.15
 Continue with Alison, Jeff, Gail, Simon, then cycle through Richard, Alan, ... again
 After 2 iterations over the training set, we get:
   w1 = 0.2, w2 = 0.1, w3 = 0.25, w4 = 0.1

A Simple Example (4)

(same student table as above)

 Let's check (w1 = 0.2, w2 = 0.1, w3 = 0.25, w4 = 0.1):
   Richard: (1 × 0.2) + (1 × 0.1) + (0 × 0.25) + (1 × 0.1) = 0.4 < 0.55  output is 0 
   Alan:    (1 × 0.2) + (1 × 0.1) + (1 × 0.25) + (0 × 0.1) = 0.55 >= 0.55  output is 1 
   Alison:  (0 × 0.2) + (0 × 0.1) + (1 × 0.25) + (0 × 0.1) = 0.25 < 0.55  output is 0 
   Jeff:    (0 × 0.2) + (1 × 0.1) + (0 × 0.25) + (1 × 0.1) = 0.2 < 0.55  output is 0 
   Gail:    (1 × 0.2) + (0 × 0.1) + (1 × 0.25) + (1 × 0.1) = 0.55 >= 0.55  output is 1 
   Simon:   (0 × 0.2) + (1 × 0.1) + (1 × 0.25) + (1 × 0.1) = 0.45 < 0.55  output is 0 
 The network now classifies all six training examples correctly (checked in code below)
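A quick check of this result in Python (our own code, not from the slides):

```python
# Verify that the learned weights classify all six students correctly.
students = {         # (A last year, Male, Works hard, Drinks) -> A this year
    "Richard": ([1, 1, 0, 1], 0), "Alan":  ([1, 1, 1, 0], 1),
    "Alison":  ([0, 0, 1, 0], 0), "Jeff":  ([0, 1, 0, 1], 0),
    "Gail":    ([1, 0, 1, 1], 1), "Simon": ([0, 1, 1, 1], 0),
}
weights, threshold = [0.2, 0.1, 0.25, 0.1], 0.55
for name, (xs, target) in students.items():
    s = sum(w * x for w, x in zip(weights, xs))
    assert (1 if s >= threshold else 0) == target, name
print("all six students classified correctly")
```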

Decision Boundaries of Perceptrons

 So we have just learned the function:
   If (0.2x1 + 0.1x2 + 0.25x3 + 0.1x4 >= 0.55) then 1, otherwise 0
   equivalently: If (0.2x1 + 0.1x2 + 0.25x3 + 0.1x4 - 0.55 >= 0) then 1, otherwise 0
 Assume we only had 2 features:
   If (w1x1 + w2x2 - t >= 0) then 1, otherwise 0
   The learned function describes a line in the input space
   This line is used to separate the two classes C1 and C2
   t (the threshold, later called the bias b) is used to shift the line

[Figure: the decision boundary w1x1 + w2x2 - t = 0 in the (x1, x2) plane; the decision region where w1x1 + w2x2 - t >= 0 is on one side of the line, and the region where w1x1 + w2x2 - t < 0 is on the other, separating C1 from C2.]

Decision Boundaries of Perceptrons

 More generally, with n features, the learned function describes a hyperplane in the input space.

[Figure: a hyperplane decision boundary in (x1, x2, x3) space separating the decision region for C1 from the decision region for C2.]

Adding a Bias

 We can avoid having to figure out the threshold by using a bias
 A bias is equivalent to a weight on an extra input feature that always has a value of 1
 The neuron then computes b + Σi xiwi

[Figure: a neuron with inputs x1 and x2, weights w1 and w2, and bias b.]

Perceptron - More Generally

 Inputs x1, ..., xn with weights w1, ..., wn
 A bias input xn+1 = 1 with weight wn+1 (to replace the threshold)
 The perceptron computes the weighted sum Σ_{i=1..n+1} wi xi, then applies an activation function f
 The final classification is:

  y = f( Σ_{i=1..n+1} wi xi )

Common Activation Functions

 Hard-limit activation functions:
   step:  y = 1 if Σ_{i=1..n+1} wi xi >= t,  0 otherwise
   sign:  y = +1 if Σ_{i=1..n+1} wi xi >= 0,  -1 otherwise
 Sigmoid function:
   y = 1 / (1 + e^(-Σ_{i=1..n+1} wi xi))

[Russell & Norvig, 1995]
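A sketch of the three activation functions in Python (our own code), each taking the already-computed weighted sum as input:

```python
import math

def step(s, t=0.0):
    return 1 if s >= t else 0          # hard limit at threshold t

def sign(s):
    return 1 if s >= 0 else -1         # hard limit at 0

def sigmoid(s):
    return 1.0 / (1.0 + math.exp(-s))  # smooth, differentiable squashing
```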

Delta Rule

 Error = target output - actual output = (T - O)
 1. The weight update can use a constant learning rate α (as in the previous example):

  Δw = α (T - O)

   So:
     if T = 0 and O = 1 (i.e. a false positive)  decrease w by α
     if T = 1 and O = 0 (i.e. a false negative)  increase w by α
     if T = O (i.e. no error)  don't change w
 2. Or, additionally, a fraction of the input feature xi:

  Δwi = α (T - O) xi

   So the update is proportional to the value of the input feature xi:
     if T = 0 and O = 1 (i.e. a false positive)  decrease wi by α xi
     if T = 1 and O = 0 (i.e. a false negative)  increase wi by α xi
     if T = O (i.e. no error)  don't change wi
 This is called the delta rule (or perceptron learning rule); a sketch follows below
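A minimal sketch of the per-feature delta rule in Python (our own code; the Alan step from the earlier example is used to check it):

```python
# w_i <- w_i + alpha * (T - O) * x_i for every weight.
def delta_rule_update(weights, inputs, target, output, alpha=0.05):
    err = target - output              # +1 (false negative), -1, or 0
    return [w + alpha * err * x for w, x in zip(weights, inputs)]

# False negative (T=1, O=0): active weights increase by alpha * x_i
print(delta_rule_update([0.15, 0.15, 0.2, 0.15], [1, 1, 1, 0], 1, 0))
# -> ~[0.2, 0.2, 0.25, 0.15] (up to float rounding), matching the Alan step
```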

Perceptron Convergence Theorem

 Cycle through the set of training examples
 Suppose a solution with zero error exists
 Then the delta rule will find a solution in finite time

Example of the Delta Rule

[Figure: a table of training data, a plot of the training data, and the perceptron used to learn it.]

source: Luger (2005)

Let's Train the Perceptron

 Assume random initialization:
   w1 = 0.75
   w2 = 0.5
   w3 = -0.6
 Assume:
   the sign activation function (threshold = 0)
   learning rate α = 0.2

source: Luger (2005)

Training

 data #1: f(0.75×1 + 0.5×1 - 0.6×1) = f(0.65)  1  correct, no update
 data #2: f(0.75×9.4 + 0.5×6.4 - 0.6×1) = f(9.65)  1  wrong
    error = (T - O) = (-1 - 1) = -2
    w1 = w1 + (-2 × 0.2 × 9.4) = 0.75 - 3.76 = -3.01
    w2 = w2 + (-2 × 0.2 × 6.4) = -2.06
    w3 = w3 + (-2 × 0.2 × 1) = -1.00
 data #3: f(-3.01×2.5 - 2.06×2.1 - 1×1) = f(-12.84)  -1  wrong
    error = (T - O) = (1 - (-1)) = 2
    w1 = -3.01 + (2 × 0.2 × 2.5) = -2.01
    w2 = -2.06 + (2 × 0.2 × 2.1) = -1.22
    w3 = -1.00 + (2 × 0.2 × 1) = -0.60
 Repeating over 500 iterations, we converge to:
   w1 = -1.3, w2 = -1.1, w3 = 10.9

source: Luger (2005)

Limits of the Perceptron

 In 1969, Minsky and Papert showed formally what functions could and could not be represented by perceptrons
 Only linearly separable functions can be represented by a perceptron

source: Luger (2005)

AND and OR Perceptrons

[Figure: perceptrons computing the AND and OR functions, with their weights and thresholds.]

source: Luger (2005)

Example: the XOR Function

 We cannot build a perceptron to learn the exclusive-or function
 To learn the XOR function (truth table over x1, x2  output), we would need:
   two inputs x1 and x2
   two weights w1 and w2
   a threshold t
 i.e. we must have (one constraint per line of the truth table):
   (1 × w1) + (1 × w2) < t
   (1 × w1) + 0 >= t
   0 + (1 × w2) >= t
   0 + 0 < t
 This system has no solution, so a perceptron cannot learn the XOR function

The XOR Function - Visually

 In a 2-dimensional space (2 features for the input X):
 No straight line in two dimensions can separate (0, 1) and (1, 0) from (0, 0) and (1, 1).

source: Luger (2005)

A Perceptron Network

 So far, we have looked at a single perceptron
 But if the output needs to be more than a binary (yes/no) decision, we need one output unit per class (C1/~C1, C2/~C2, ..., C6/~C6, ...)
 Ex: learning to recognize digits  10 possible outputs  we need a perceptron network

Real World Problems

 Real-world problems cannot always be represented by linearly-separable functions
   Ex: in speech recognition, recognize the vowel sound between the letters "h_d" (ex: hAd, hID, hEAD, ...)
 This caused a decrease in interest in neural networks in the 1970s

source: Tom Mitchell, Machine Learning (1997)

Real World Problems

 Solution: have a hidden layer  multi-layered neural networks
   replace the hard-limit activation function with the sigmoid function (which is differentiable)
   eventually with several output nodes (for non-binary decisions)

Today

 Last time: Decision Trees (Evaluation & Unsupervised Learning)
 Neural Networks
   Perceptrons
   Multilayer Neural Networks
 Genetic Algorithms

Multilayer Neural Networks

 A multilayer neural network can learn even if the problem is not linearly separable
 A multilayer network has:
   an output layer
   (one or more) hidden layers
   an input layer

source: Rich & Knight, Artificial Intelligence, p. 502

Decision Boundaries

 ... of a single perceptron:
   straight lines (hyperplanes)  only linearly separable problems

Decision Boundaries

 ... of a multilayer perceptron with 1 hidden layer:
   convex areas (open or closed)

Decision Boundaries

 ... of a multilayer perceptron with 2 hidden layers:
   combinations of convex areas

Multilayer Neural Networks

 Several algorithms exist to train a multilayer neural network
 Learning is the same as in a perceptron:
   feed the network with training data
   if there is an error (a difference between the output and the target), adjust the weights
 So we must assign the blame for an error to the contributing weights

Feed-forward + Backpropagation

 Feed-forward:
   input from the features is fed forward in the network, from the input layer towards the output layer
 Backpropagation:
   a method to assign the blame for errors to the weights
   the error rate flows backwards from the output layer to the input layer (to adjust the weights in order to minimize the output error)
 Iterate until the error rate is minimized:
   repeat the forward pass and back pass for the next data point, until all data points are examined (1 epoch)
   repeat this entire exercise until the overall error is minimized
   Ex: squared errors = ½ Σ_{iOutputLayer} (TargetOutput_i - ActualOutput_i)² < 0.001

Backpropagation

 In a multilayer network:
   computing the error in the output layer is clear
   computing the error in the hidden layer is not clear, because we don't know what its output should be
 Intuitively:
   a hidden node h is responsible for some fraction of the error in each of the output nodes to which it connects
   so the error values (δ) are divided according to the weight of the connection between the hidden node and the output node, and are propagated back to provide the error values (δ) for the hidden layer

Backpropagation Visually

 Goal: minimize the error

[Figure: the error surface over the weight space; gradient descent moves from (w1, w2) to (w1+Δw1, w2+Δw2), downhill on the error surface.]

 The delta rule is a gradient descent technique for updating the weights in a single-layer perceptron
 Backpropagation is a generalised case of the delta rule

The Sigmoid Function

 Backpropagation requires a differentiable activation function
   the sigmoidal (or squashed, or logistic) function is the standard choice
 f returns a value between 0 and 1 (instead of 0 or 1)
 f indicates how close/how far the output of the network is compared to the right answer (the error term)

Training the Network

 Step 0: Initialise the weights of the network randomly

// feed-forward
 Step 1: Do a forward pass through the network (use the sigmoid):

  Oi = sigmoid(Σj wji xj) = 1 / (1 + e^(-Σj wji xj))

// propagate the errors backwards
 Step 2: For each output unit k, calculate its error term δk:

  δk = g'(x) × Errk = Ok (1 - Ok) (Tk - Ok)

  (note: with the sigmoid, g'(x) = g(x) (1 - g(x)), the derivative of the sigmoid)

 Step 3: For each hidden unit h, calculate its error term δh as the sum of the weighted error terms of the output nodes that h is connected to (i.e. h contributed to the errors δk):

  δh = Oh (1 - Oh) Σ_{koutputs} wkh δk

 Step 4: Update each network weight wij:

  wij  wij + Δwij   where Δwij = α δj xi

 Repeat Steps 2, 3 & 4 until the error is minimised to a given level (see the sketch below)
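A sketch of Steps 1-4 in Python for the 2-2-1 network of the XOR example that follows (our own code; the weight names and initial values mirror the slides, with learning rate α = 0.1):

```python
import math

def sigmoid(s):
    return 1.0 / (1.0 + math.exp(-s))

def backprop_step(x1, x2, target, w, alpha=0.1):
    # Step 1: forward pass (biases theta are subtracted, as in the slides)
    o3 = sigmoid(x1 * w["w13"] + x2 * w["w23"] - w["t3"])
    o4 = sigmoid(x1 * w["w14"] + x2 * w["w24"] - w["t4"])
    o5 = sigmoid(o3 * w["w35"] + o4 * w["w45"] - w["t5"])
    # Step 2: error term of the output unit
    d5 = o5 * (1 - o5) * (target - o5)
    # Step 3: error terms of the hidden units
    d3 = o3 * (1 - o3) * d5 * w["w35"]
    d4 = o4 * (1 - o4) * d5 * w["w45"]
    # Step 4: update weights (and biases, whose "input" is -1)
    w["w13"] += alpha * d3 * x1; w["w23"] += alpha * d3 * x2
    w["w14"] += alpha * d4 * x1; w["w24"] += alpha * d4 * x2
    w["w35"] += alpha * d5 * o3; w["w45"] += alpha * d5 * o4
    w["t3"] += alpha * d3 * -1; w["t4"] += alpha * d4 * -1
    w["t5"] += alpha * d5 * -1
    return o5

weights = {"w13": 0.5, "w23": 0.4, "w14": 0.9, "w24": 1.0,
           "w35": -1.2, "w45": 1.1, "t3": 0.8, "t4": -0.1, "t5": 0.3}
backprop_step(1, 1, 0, weights)   # reproduces the worked example below
```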

Example: XOR

 Network: 2 input nodes + 2 hidden nodes (3 and 4) + 1 output node (5, producing O5) + 3 bias nodes

source: Negnevitsky, Artificial Intelligence, p. 181

Example: Step 0 (initialization)

 Step 0: Initialize the network at random:
   w13 = 0.5, w23 = 0.4, w14 = 0.9, w24 = 1.0
   w35 = -1.2, w45 = 1.1
   θ3 = 0.8, θ4 = -0.1, θ5 = 0.3

Step 1: Feed Forward

 Step 1: Feed the inputs and calculate the outputs:

  Oi = sigmoid(Σj wji xj) = 1 / (1 + e^(-Σj wji xj))

 With (x1 = 1, x2 = 1) as input and target output T = 0:
   Output of hidden node 3:
    O3 = sigmoid(x1 w13 + x2 w23 - θ3) = 1 / (1 + e^-(1×0.5 + 1×0.4 - 1×0.8)) = 0.5250
   Output of hidden node 4:
    O4 = sigmoid(x1 w14 + x2 w24 - θ4) = 1 / (1 + e^-(1×0.9 + 1×1.0 + 1×0.1)) = 0.8808
   Output of output node 5:
    O5 = sigmoid(O3 w35 + O4 w45 - θ5) = 1 / (1 + e^-(0.5250×(-1.2) + 0.8808×1.1 - 1×0.3)) = 0.5097

Step 2: Calculate the Error Term of the Output Layer

  δk = Ok (1 - Ok) (Tk - Ok)

 Error term of neuron 5 in the output layer:
   δ5 = O5 (1 - O5) (T5 - O5)
     = (0.5097) × (1 - 0.5097) × (0 - 0.5097)
     = -0.1274
 This error δ5 will be used to modify w35 and w45

Step 3: Calculate the Error Terms of the Hidden Layer

  δh = Oh (1 - Oh) Σ_{koutputs} wkh δk

 Error terms of neurons 3 & 4 in the hidden layer:
   δ3 = O3 (1 - O3) δ5 w35
     = (0.5250) × (1 - 0.5250) × (-0.1274) × (-1.2)
     = 0.0381
   δ4 = O4 (1 - O4) δ5 w45
     = (0.8808) × (1 - 0.8808) × (-0.1274) × (1.1)
     = -0.0147
 δ3 will be used to modify w13 and w23; δ4 will be used to modify w14 and w24

Step 4: Update Weights

  wij  wij + Δwij   where Δwij = α δj xi

 Update all weights (assume a learning rate α = 0.1):
   Δw13 = α δ3 x1 = 0.1 × 0.0381 × 1 = 0.0038
   Δw14 = α δ4 x1 = 0.1 × -0.0147 × 1 = -0.0015
   Δw23 = α δ3 x2 = 0.1 × 0.0381 × 1 = 0.0038
   Δw24 = α δ4 x2 = 0.1 × -0.0147 × 1 = -0.0015
   Δw35 = α δ5 O3 = 0.1 × -0.1274 × 0.5250 = -0.00669  // O3 is the input from node 3 to node 5
   Δw45 = α δ5 O4 = 0.1 × -0.1274 × 0.8808 = -0.01122  // O4 is the input from node 4 to node 5
   Δθ3 = α δ3 (-1) = 0.1 × 0.0381 × -1 = -0.0038
   Δθ4 = α δ4 (-1) = 0.1 × -0.0147 × -1 = 0.0015
   Δθ5 = α δ5 (-1) = 0.1 × -0.1274 × -1 = 0.0127

Step 4: Update Weights (con't)

 wij  wij + Δwij:
   w13 = w13 + Δw13 = 0.5 + 0.0038 = 0.5038
   w14 = w14 + Δw14 = 0.9 - 0.0015 = 0.8985
   w23 = w23 + Δw23 = 0.4 + 0.0038 = 0.4038
   w24 = w24 + Δw24 = 1.0 - 0.0015 = 0.9985
   w35 = w35 + Δw35 = -1.2 - 0.00669 = -1.20669
   w45 = w45 + Δw45 = 1.1 - 0.01122 = 1.08878
   θ3 = θ3 + Δθ3 = 0.8 - 0.0038 = 0.7962
   θ4 = θ4 + Δθ4 = -0.1 + 0.0015 = -0.0985
   θ5 = θ5 + Δθ5 = 0.3 + 0.0127 = 0.3127

Step 5: Iterate Through the Data

 After adjusting all the weights, repeat the forward pass and back pass for the next data point, until all data points are examined
 Repeat this entire exercise until the overall error is minimised
   Ex: squared errors = ½ Σ_{iOutputLayer} (Ti - Oi)² < 0.001

The Result

 After 224 epochs (1 epoch = going through all the data once), we get:
   w13 = 4.76, w23 = 4.76, w14 = 6.39, w24 = 6.39
   w35 = -10.38, w45 = 9.77
   θ3 = 7.31, θ4 = 2.84, θ5 = 4.56

Error is Minimized

| x1 | x2 | Target output T5 | Actual output O5 | Error e |
|----|----|------------------|------------------|---------|
| 1  | 1  | 0                | 0.0155           | -0.0155 |
| 0  | 1  | 1                | 0.9849           | 0.0151  |
| 1  | 0  | 1                | 0.9849           | 0.0151  |
| 0  | 0  | 0                | 0.0175           | -0.0175 |

 Squared errors = ½ × (0.0155² + 0.0151² + 0.0151² + 0.0175²) < 0.001 (some threshold)  stop! (checked below)
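The stopping criterion can be checked in a couple of lines (our own code):

```python
# Squared-error stopping criterion for the four XOR data points above.
errors = [-0.0155, 0.0151, 0.0151, -0.0175]
sse = 0.5 * sum(e * e for e in errors)
print(sse)   # ~0.0005, below the 0.001 threshold -> stop
```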

Applications of Neural Networks

 Handwritten digit recognition:
   training set = a set of handwritten digits (0-9)
   task: given a bitmap, determine what digit it represents
   input: 1 feature for each pixel of the bitmap
   output: 1 output unit for each possible character (only 1 should be activated)
   after training, the network should work for fonts (handwriting) never encountered
 Related pattern recognition applications:
   recognize postal codes
   recognize signatures

Applications of Neural Networks

 Speech synthesis: learning to pronounce English words
   a difficult task for a rule-based system, because English pronunciation is highly irregular
   examples:
     the letter "c" can be pronounced [k] (cat) or [s] (cents)
     "woman" vs "women"
 NETtalk:
   uses the context and the letters around a letter to learn how to pronounce it
   input: a letter and its surrounding letters
   output: a phoneme

NETtalk Architecture

 Ex: "a cat"  the "c" is pronounced [k]
 The network is made of 3 layers of units:
   the input layer corresponds to a 7-character window in the text
   each position in the window is represented by 29 input units (26 letters + 3 for punctuation and spaces)
   26 output units  one for each possible phoneme

source: Luger (2005)

Neural Networks

 Disadvantage:
   the result is not easy for humans to understand (a set of weights, compared to a decision tree)  it is a black box
 Advantage:
   robust to noise in the input (small changes in the input do not normally cause a change in the output), and degrades gracefully

Today

 Last time: Decision Trees (Evaluation & Unsupervised Learning)
 Neural Networks
   Perceptrons
   Multilayer Neural Networks
 Genetic Algorithms

Genetic Algorithms

 A learning technique based on evolution
   also called an evolutionary algorithm
   based on a biological metaphor (like neural networks)
 Learning = competition among a population of evolving candidate solutions to a problem
 A fitness function evaluates each solution to decide if it will contribute to the next generation of solutions
 Through genetic operators, the algorithm creates a new population of candidate solutions from the previous generation

Genetic Algorithms

 Let P(t) be the set of candidate solutions at time t:
   set time t  0
   initialize the population P(t)   // typically, chosen at random
   while P(t) does not include an acceptable solution:
     evaluate the fitness of each member of the population P(t)
     select pairs of solutions from P(t) based on fitness
     produce the offspring of these pairs using genetic operators
     replace the weakest candidates (based on fitness) of P(t) with these offspring
     mutation (optional)
     set time t  t+1
 (a code sketch of this loop follows below)
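A sketch of this loop in Python, with the fitness function and the genetic operators passed in since they depend on the problem (all names are ours; selection here is a simple fitness-based truncation rather than the proportional selection shown later):

```python
import random

def genetic_algorithm(init_pop, fitness, crossover, mutate,
                      acceptable, mutation_rate=0.05):
    population = list(init_pop)                    # P(0), typically random
    while not any(acceptable(c) for c in population):
        ranked = sorted(population, key=fitness, reverse=True)
        parents = ranked[:len(ranked) // 2]        # select based on fitness
        offspring = []
        for a, b in zip(parents[::2], parents[1::2]):
            offspring.extend(crossover(a, b))      # produce offspring
        # replace the weakest candidates with the offspring
        population = ranked[:len(ranked) - len(offspring)] + offspring
        population = [mutate(c) if random.random() < mutation_rate else c
                      for c in population]         # optional mutation
    return max(population, key=fitness)
```

With the {0, 1, #} operators and fitness function sketched on the following slides, this loop behaves like the worked student example below.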

Solution Representation

 Solutions are typically represented as a fixed-length string over a finite alphabet
   in real DNA, the alphabet is AGTC (adenine, guanine, thymine, cytosine)
     Ex: AATAGC
   in machine learning, the alphabet is typically {0, 1, #} (where # means that the feature is not relevant)
     Ex: for the features "A last year? Male? Works hard? Drinks?", a candidate solution could be #100
 The string is called a chromosome
 Each element of the string is called a gene

The Fitness Function

 Determines whether a candidate solution is good or not
 The exact function depends a lot on the problem
 But very often, the function computes the proportion of the examples that are correctly classified by the candidate solution
   e.g. how many training examples are correctly classified with #110, 100#, ... (see the sketch below)
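A sketch of this fitness function in Python for the student data used below (our own code; a pattern over {0, 1, #} predicts output 1 iff every non-# gene matches the corresponding feature):

```python
STUDENTS = [  # (A last year, Male, Works hard, Drinks) -> A this year
    ([1, 1, 0, 1], 0), ([1, 1, 1, 0], 1), ([0, 0, 1, 0], 0),
    ([0, 1, 0, 1], 0), ([1, 0, 1, 1], 1), ([0, 1, 1, 1], 0),
]

def predicts(pattern, features):
    """1 if every non-# gene matches the corresponding feature, else 0."""
    return int(all(g == "#" or int(g) == x
                   for g, x in zip(pattern, features)))

def fitness(pattern):
    """Fraction of training examples classified correctly."""
    return sum(predicts(pattern, f) == t for f, t in STUDENTS) / len(STUDENTS)

print(fitness("11##"))   # -> 0.666..., i.e. 4/6, as in the example below
```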

Genetic Operators

 Crossover:
   takes 2 candidates and swaps components to produce 2 new candidates
   Ex: 1#0# + 0#10  1#10 and 0#0#
 Mutation:
   takes a single candidate and randomly changes a gene
   important, because the initial population may not include an essential gene
   Ex: 0#10  0#11
 (both operators are sketched in code below)
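A sketch of both operators in Python for {0, 1, #} strings, assuming a single random split point for crossover (our own code):

```python
import random

ALPHABET = "01#"

def crossover(a, b):
    """Swap tails after a random split point; returns 2 offspring."""
    point = random.randrange(1, len(a))
    return a[:point] + b[point:], b[:point] + a[point:]

def mutate(chromosome):
    """Randomly change one gene to a different symbol."""
    i = random.randrange(len(chromosome))
    gene = random.choice(ALPHABET.replace(chromosome[i], ""))
    return chromosome[:i] + gene + chromosome[i + 1:]

# Ex: crossover("1#0#", "0#10") with split point 2 -> ("1#10", "0#0#")
```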

Example (1)

| Student | A last year? | Male?   | Works hard? | Drinks? | A this year? f(X) |
|---------|--------------|---------|-------------|---------|-------------------|
| Richard | Yes / 1      | Yes / 1 | No / 0      | Yes / 1 | No / 0            |
| Alan    | Yes / 1      | Yes / 1 | Yes / 1     | No / 0  | Yes / 1           |
| Alison  | No / 0       | No / 0  | Yes / 1     | No / 0  | No / 0            |
| Jeff    | No / 0       | Yes / 1 | No / 0      | Yes / 1 | No / 0            |
| Gail    | Yes / 1      | No / 0  | Yes / 1     | Yes / 1 | Yes / 1           |
| Simon   | No / 0       | Yes / 1 | Yes / 1     | Yes / 1 | No / 0            |

 Let P(0) = {#1##, 0#10, 11##, #11#}   // chosen at random
 Fitness function:
   #1## (only males will get an A) correctly predicts 2/6 -- Alan, Alison
   0#10 correctly predicts 3/6 -- Richard, Jeff, Simon
   11## correctly predicts 4/6 -- Alan, Alison, Jeff, Simon
   #11# correctly predicts 4/6 -- Richard, Alan, Alison, Jeff
 The best candidates are 11## and #11#

Example (2)

 Let's cross over 11## + #11#  111# and #1##
 Let's replace the weakest solutions with the 2 new offspring:
   P(0) = {#1##, 0#10, 11##, #11#}
   P(1) = {111#, #1##, 11##, #11#}
 Fitness function:
   111#  5/6, #1##  2/6, 11##  4/6, #11#  4/6
 We could go on, but in fact, with a split-in-two crossover, we could never do better than 5/6
 With random mutation:
   if the best solution so far happens to mutate to 1#1#
   the fitness of 1#1# is 6/6!

Proportional Selection

 Main idea: better individuals get a higher chance of being selected
 Selection of a parent is proportional to its fitness score
   assume 000110010111 correctly predicts 3 outputs over 6
   so it has a 50% chance of getting selected for crossover
 Implementation: the roulette wheel technique
   Ex: fitness(A) = 3/6  50% of the wheel, fitness(B) = 1/6  17%, fitness(C) = 2/6  33%

Proportional Cross-over

 Main idea: better individuals transmit more genes
   non-equal split crossover

source: Russell & Norvig (2003)

Example: N-Queens Problem

 N-Queens dates back to the 19th century
 Problem: place N queens on an N×N chessboard such that no queen can attack any other
 If N = 8:
   4,426,165,368 (64 choose 8) possible arrangements of eight queens on an 8×8 board
   but only 92 solutions

12 Unique Solutions

 + 80 others that are symmetrical

Problem Representation

 Observation that eliminates many arrangements from consideration: no queen can reside in a row or a column that contains another queen
 So, assume 1 queen per column, and represent a board by the row position of the 1st queen, 2nd queen, 3rd queen, ...
   bottom row = 1, top row = 8
   Ex: <16257483>

source: Russell & Norvig (2003)

Fitness Function

 Number of non-attacking pairs of queens
   nb of pairs of different queens = 7+6+5+4+3+2+1 = 28 pairs
   so fitness = 28 - nb of possible attacks
   Ex: a board with 1 attacking pair has fitness = 28 - 1 = 27 (see the sketch below)

source: Russell & Norvig (2003)
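A sketch of this fitness function in Python (our own code; the board representation is the one from the previous slide):

```python
# Count the pairs of queens that do not attack each other.
def queens_fitness(board):
    """board[i] = row of the queen in column i (rows numbered from 1)."""
    n, attacks = len(board), 0
    for i in range(n):
        for j in range(i + 1, n):
            same_row = board[i] == board[j]
            same_diag = abs(board[i] - board[j]) == j - i
            attacks += same_row or same_diag
    return n * (n - 1) // 2 - attacks   # 28 - attacks when n = 8

print(queens_fitness([1, 6, 2, 5, 7, 4, 8, 3]))  # <16257483> -> 27 (1 attacking pair)
```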

Crossing Over

[Figure: four candidate boards with fitness 28 - 4 = 24, 28 - 5 = 23, 20, and 11 non-attacking pairs; selection probabilities are proportional to fitness: 24/(24+23+20+11) = 31%, 23/(24+23+20+11) = 29%, etc.]

source: Russell & Norvig (2003)

Genetic Algorithms

 Applications:
   optimization -- scheduling problems
   heuristic search
   machine learning
 Difficulties:
   finding a good representation and fitness function (depends a lot on the problem domain)
   may converge to a local optimum

Today

 Last time: Decision Trees (Evaluation & Unsupervised Learning)
 Neural Networks
   Perceptrons
   Multilayer Neural Networks
 Genetic Algorithms
