
Learning with linear neurons

Adapted from Lectures by Geoffrey Hinton and Others


Updated by N. Intrator, May 2007
Prehistory
W.S. McCulloch & W. Pitts (1943). “A logical calculus of the ideas immanent in
nervous activity”, Bulletin of Mathematical Biophysics, 5, 115-137.

• This seminal paper pointed out that simple artificial “neurons” could
be made to perform basic logical operations such as AND, OR and
NOT.

[Figure: a McCulloch-Pitts unit computing logical AND. Inputs x and y each have weight +1, and a constant input of 1 has weight -2, so the unit computes sum = x + y - 2. The output is 0 if sum < 0, else 1.]

Truth table for logical AND:

x  y  x&y
0  0   0
0  1   0
1  0   0
1  1   1
Nervous Systems as Logical Circuits
Groups of these “neuronal” logic gates could carry out any computation, even
though each neuron was very limited.

• Could computers built from these simple units reproduce the


computational power of biological brains?
• Were biological neurons performing logical operations?

[Figure: a McCulloch-Pitts unit computing logical OR. Inputs x and y each have weight +1, and a constant input of 1 has weight -1, so the unit computes sum = x + y - 1. The output is 0 if sum < 0, else 1.]

Truth table for logical OR:

x  y  x|y
0  0   0
0  1   1
1  0   1
1  1   1
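A minimal sketch of how such a unit could be simulated in Python; the weights and bias values come from the two figures above, while the function and variable names are illustrative only.

def mcculloch_pitts(inputs, weights, bias):
    # Weighted sum plus bias; output 0 if the sum is negative, else 1.
    total = sum(x * w for x, w in zip(inputs, weights)) + bias
    return 0 if total < 0 else 1

# AND: weights +1, +1 and bias -2 (the weight on the constant input of 1)
# OR:  weights +1, +1 and bias -1
for x in (0, 1):
    for y in (0, 1):
        print(x, y,
              mcculloch_pitts([x, y], [1, 1], -2),   # x AND y
              mcculloch_pitts([x, y], [1, 1], -1))   # x OR y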
The Perceptron
Frank Rosenblatt (1962). Principles of Neurodynamics, Spartan,
New York, NY.

Subsequent progress was driven by the invention of learning rules inspired by ideas from neuroscience.

Rosenblatt's Perceptron could automatically learn to categorise or classify input vectors into types.

It obeyed the following rule: if the weighted sum of the inputs, Σ_i x_i w_i, exceeds a threshold, output 1; otherwise output -1.

[Figure: inputs pass through weights into a summing unit that is followed by a threshold.]

output = +1 if Σ_i input_i · weight_i > threshold
output = -1 if Σ_i input_i · weight_i < threshold


Linear neurons

• The neuron has a real-valued output which is a weighted sum of its inputs:

  ŷ = Σ_i w_i x_i = w^T x

  (w is the weight vector, x is the input vector, and ŷ is the neuron's estimate of the desired output.)

• The aim of learning is to minimize the discrepancy between the desired output and the actual output.
  – How do we measure the discrepancies?
  – Do we update the weights after every training case?
  – Why don't we solve it analytically?
A motivating example

• Each day you get lunch at the cafeteria.


– Your diet consists of fish, chips, and beer.
– You get several portions of each
• The cashier only tells you the total price of the meal
– After several days, you should be able to figure out
the price of each portion.
• Each meal price gives a linear constraint on the prices of
the portions:

price = x_fish w_fish + x_chips w_chips + x_beer w_beer


Two ways to solve the equations
• The obvious approach is just to solve a set of
simultaneous linear equations, one per meal.
• But we want a method that could be implemented in a
neural network.
• The prices of the portions are like the weights of a linear neuron:

  w = (w_fish, w_chips, w_beer)
• We will start with guesses for the weights and then
adjust the guesses to give a better fit to the prices
given by the cashier.
The cashier's brain

[Figure: a linear neuron whose weights are the true prices: 150 for fish, 50 for chips, 100 for beer. With 2 portions of fish, 5 portions of chips and 3 portions of beer, the price of the meal is 850.]
A model of the cashier's brain with arbitrary initial weights

[Figure: the same meal (2 portions of fish, 5 of chips, 3 of beer) fed into a linear neuron whose weights are all initialised to 50, giving a predicted price of 500.]

• Residual error = 350
• The learning rule is: Δw_i = ε x_i (y - ŷ)
• With a learning rate ε of 1/35, the weight changes are +20, +50, +30
• This gives new weights of 70, 100, 80
• Notice that the weight for chips got worse!
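A small sketch that reproduces the numbers on this slide with one online delta-rule update; the portion counts, prices and learning rate are taken from the slides, and the variable names are illustrative.

x = [2, 5, 3]                 # portions of fish, chips, beer
true_w = [150, 50, 100]       # the cashier's actual prices
w = [50.0, 50.0, 50.0]        # arbitrary initial guesses
eps = 1.0 / 35                # learning rate

y = sum(xi * wi for xi, wi in zip(x, true_w))    # 850, the price the cashier reports
y_hat = sum(xi * wi for xi, wi in zip(x, w))     # 500, the model's prediction
residual = y - y_hat                             # 350

# Delta rule: delta w_i = eps * x_i * (y - y_hat)
w = [wi + eps * xi * residual for wi, xi in zip(w, x)]
print(w)    # [70.0, 100.0, 80.0]; the chips weight has moved away from its true value of 50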
Behavior of the iterative learning procedure

• Do the updates to the weights always make them get


closer to their correct values? No!
• Does the online version of the learning procedure
eventually get the right answer? Yes, if the learning
rate gradually decreases in the appropriate way.
• How quickly do the weights converge to their correct
values? It can be very slow if two input dimensions
are highly correlated (e.g. ketchup and chips).
• Can the iterative procedure be generalized to much
more complicated, multi-layer, non-linear nets? YES!
Deriving the delta rule
• Define the error as the squared residuals summed over all training cases:

  E = (1/2) Σ_n (y_n - ŷ_n)²

• Now differentiate to get the error derivatives for the weights:

  ∂E/∂w_i = (1/2) Σ_n (∂ŷ_n/∂w_i)(dE_n/dŷ_n) = -Σ_n x_{i,n} (y_n - ŷ_n)

• The batch delta rule changes the weights in proportion to their error derivatives summed over all training cases:

  Δw_i = -ε ∂E/∂w_i
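A minimal batch version of the same rule, sketched in Python; the toy data and all names here are made up for illustration.

import numpy as np

def batch_delta_rule(X, y, eps=0.02, epochs=20000):
    # Gradient descent on E = 0.5 * sum_n (y_n - X_n . w)^2
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        y_hat = X @ w
        grad = -X.T @ (y - y_hat)    # dE/dw, summed over all training cases
        w -= eps * grad              # delta w = -eps * dE/dw
    return w

# Toy cashier-style data: rows are meals, columns are portion counts.
X = np.array([[2, 5, 3], [1, 2, 1], [3, 1, 2], [2, 2, 2]], dtype=float)
y = X @ np.array([150.0, 50.0, 100.0])    # prices generated by the true weights
print(batch_delta_rule(X, y))             # approaches [150, 50, 100]

The many epochs are needed because, as noted above, convergence can be very slow when input dimensions are correlated.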
The error surface
• The error surface lies in a space with a horizontal axis
for each weight and one vertical axis for the error.
– For a linear neuron, it is a quadratic bowl.
– Vertical cross-sections are parabolas.
– Horizontal cross-sections are ellipses.

[Figure: the error E as a quadratic bowl over the (w1, w2) plane.]
Online versus batch learning

• Batch learning does steepest descent on the error surface.
• Online learning zig-zags around the direction of steepest descent.

[Figure: left, batch learning follows the gradient of the elliptical error contours in (w1, w2) space; right, online learning bounces between the constraint lines from training case 1 and training case 2.]
Adding biases

• A linear neuron is a more flexible model if we include a bias:

  ŷ = b + Σ_i x_i w_i

• We can avoid having to figure out a separate learning rule for the bias by using a trick:
  – A bias is exactly equivalent to a weight on an extra input line that always has an activity of 1.

[Figure: a neuron with inputs 1, x1, x2 and corresponding weights b, w1, w2.]
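A quick sketch of that trick (the array and variable names below are made up for illustration): prepending a constant feature of 1 lets the first weight act as the bias.

import numpy as np

X = np.array([[2.0, 5.0], [1.0, 3.0], [4.0, 1.0]])     # three training cases, two inputs
X_aug = np.hstack([np.ones((X.shape[0], 1)), X])        # extra input line that is always 1

w_aug = np.array([0.5, 1.5, -2.0])    # first component plays the role of the bias b
y_hat = X_aug @ w_aug                 # same as b + X @ w with w = [1.5, -2.0]
print(y_hat)                          # [-6.5 -4.   4.5]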
Preprocessing the input vectors

• Instead of trying to predict the answer directly from


the raw inputs we could start by extracting a layer of
“features”.
– Sensible if we already know that certain
combinations of input values would be useful
– The features are equivalent to a layer of hand-
coded non-linear neurons.
• So far as the learning algorithm is concerned, the
hand-coded features are the input.
Is preprocessing cheating?
• It seems like cheating if the aim is to show how powerful learning is. The really hard bit is done by the preprocessing.
• It's not cheating if we learn the non-linear preprocessing.
  – This makes learning much more difficult and much more interesting.
• It's not cheating if we use a very big set of non-linear features that is task-independent.
  – Support Vector Machines make it possible to use a huge number of features without much computation or data.
Statistical and ANN Terminology

• A perceptron model with a linear transfer function is


equivalent to a possibly multiple or multivariate linear
regression model [Weisberg 1985; Myers 1986].
• A perceptron model with a logistic transfer function is a
logistic regression model [Hosmer and Lemeshow
1989].
• A perceptron model with a threshold transfer function is
a linear discriminant function [Hand 1981; McLachlan
1992; Weiss and Kulikowski 1991]. An ADALINE is a
linear two-group discriminant.
Transfer functions

• The transfer function determines the output of a neuron from the summation of its weighted inputs:

  O_j = f_j( Σ_i w_ij x_i )

• It maps any real number into a domain normally bounded by 0 to 1 or -1 to 1, i.e. it is a squashing function. The most common transfer functions are sigmoid functions:

  logistic: f(x) = 1 / (1 + e^(-x))

  hyperbolic tangent: f(x) = (e^x - e^(-x)) / (e^x + e^(-x))
Healthcare Applications of ANNs
• Predicting/confirming myocardial infarction (heart attack) from EKG output waves
  – Physicians had a diagnostic sensitivity and specificity of 73.3% and 81.1%, while ANNs achieved 96.0% and 96.0%
• Identifying dementia from EEG patterns; performed better than both Z statistics and discriminant analysis, and better than LDA (91.1% vs. 71.9%) in classifying Alzheimer's disease
• Papnet: a Pap smear screening system by Neuromedical Systems, approved for use by the US FDA
• Predict mortality risk of preterm infants, screening
tool in urology, etc.
Classification Applications of ANNs

• Credit Card Fraud Detection: AMEX,


Mellon Bank, Eurocard Nederland
• Optical Character Recognition (OCR): Fax
Software
• Cursive Handwriting Recognition: Lexicus
• Petroleum Exploration: Arco & Texaco
• Loan Assessment: Chase Manhattan for
vetting commercial loans
• Bomb detection by SAIC
Time Series Applications of ANNs

• Trading systems: Citibank London (FX).


• Portfolio selection and Management: LBS
Capital Management (>US$1b), Deere & Co.
pension fund (US$100m).
• Forecasting weather patterns & earthquakes.
• Speech technology: verification and
generation.
• Medical: Predicting heart attacks from EKGs
and mental illness from EEGs.
Advantages of Using ANNs

• Works well with large sets of noisy data, in


domains where experts are unavailable or there
are no known rules.
• Simplicity of using it as a tool
• Universal approximator.
• Does not impose a structure on the data.
• Possible to extract rules.
• Ability to learn and adapt.
• Does not require an expert or a knowledge
engineer.
• Well suited to non-linear type of problems.
• Fault tolerant
Problem with the Perceptron

• Can only learn linearly separable tasks.


• Cannot solve any ‘interesting’ problems, i.e. linearly nonseparable ones, e.g. the exclusive-or (XOR) function, the simplest nonseparable function:

X1  X2  Output
0   0   0
0   1   1
1   0   1
1   1   0
The Fall of the Perceptron
Marvin Minsky & Seymour Papert (1969). Perceptrons, MIT Press, Cambridge, MA.

• Before long researchers had begun to discover the


Perceptron’s limitations.
• Unless input categories were “linearly separable”, a
perceptron could not learn to discriminate between them.
• Unfortunately, it appeared that many important categories
were not linearly separable.
• E.g., those inputs to an XOR gate that give an output of 1
(namely 10 & 01) are not linearly separable from those that
do not (00 & 11).
The Fall of the Perceptron

[Figure: a 2×2 grid with hours in the gym per week (few vs. many) on one axis and success (successful vs. unsuccessful) on the other; footballers and academics sit in diagonally opposite cells, an XOR-like pattern.]

In this example, a perceptron would not be able to discriminate between the footballers and the academics, despite the simplicity of their relationship:

  Academics = Successful XOR Gym

This failure caused the majority of researchers to walk away.
The simple XOR example masks a deeper problem ...
[Figure: four panels (1-4) of line drawings, some connected and some disconnected; the dashed circles mark the only regions the perceptron receives input from.]

Consider a perceptron classifying shapes as connected or disconnected, taking its inputs only from the dashed circles in panel 1.
In going from 1 to 2, the change at the right-hand end alone must be sufficient to change the classification (i.e. to push the linear sum up or down through 0).
Similarly, the change at the left-hand end alone must be sufficient to change the classification.
Therefore changing both ends must take the sum even further across the threshold, so the perceptron must misclassify one of the four cases.
The problem is that with a single layer of processing, local knowledge cannot be combined into global knowledge. So add more layers ...
THE PERCEPTRON CONTROVERSY

There is no doubt that Minsky and Papert's book was a block to the funding of
research in neural networks for more than ten years.
The book was widely interpreted as showing that neural networks are basically
limited and fatally flawed.

What IS controversial is whether Minsky and Papert shared and/or promoted this belief.
Following the rebirth of interest in artificial neural networks, Minsky and Papert
claimed that they had not intended such a broad interpretation of the conclusions
they reached in the book Perceptrons.

However, Jianfeng was present at MIT in 1974, and reached


a different conclusion on the basis of the internal reports circulating at MIT. What
were Minsky and Papert actually saying to
their colleagues in the period after the publication of their book?
Minsky and Papert describe a neural network with a hidden layer as
follows:
GAMBA PERCEPTRON: A number of linear threshold systems have their outputs connected to the inputs of a linear threshold system. Thus we have a linear threshold function of many linear threshold functions.
Minsky and Papert then state:
Virtually nothing is known about the computational capabilities of
this latter kind of machine. We believe that it can do little more
than can a low order perceptron. (This, in turn, would mean,
roughly, that although they could recognize (sp) some relations
between the points of a picture, they could not handle relations
between such relations to any significant extent.) That we cannot
understand mathematically the Gamba perceptron very well is, we
feel, symptomatic of the early state of development of elementary
computational theories.
The connectivity of a perceptron

The input is recoded using hand-picked features that do not adapt. Only the last layer of weights is learned.

The output units are binary threshold neurons and are learned independently.

[Figure: input units feed a layer of non-adaptive, hand-coded features, which in turn feed the output units; only the feature-to-output weights are learned.]
Binary threshold neurons

• McCulloch-Pitts (1943)
  – First compute a weighted sum of the inputs from other neurons:

    z = Σ_i x_i w_i

  – Then output a 1 if the weighted sum exceeds the threshold:

    y = 1 if z ≥ θ
    y = 0 otherwise

[Figure: the output y as a step function of z, jumping from 0 to 1 at the threshold θ.]
The perceptron convergence procedure

• Add an extra component with value 1 to each input vector. The


“bias” weight on this component is minus the threshold. Now we
can forget the threshold.
• Pick training cases using any policy that ensures that every
training case will keep getting picked
– If the output unit is correct, leave its weights alone.
– If the output unit incorrectly outputs a zero, add the input
vector to the weight vector.
– If the output unit incorrectly outputs a 1, subtract the input
vector from the weight vector.
• This is guaranteed to find a suitable set of weights if any such set
exists.
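A compact sketch of the procedure described above. The bias trick (an extra input fixed at 1) is included; the training data and names are illustrative only, not part of the original slides.

import numpy as np

def train_perceptron(X, targets, epochs=100):
    # X has one row per training case; targets are 0/1.
    X = np.hstack([np.ones((X.shape[0], 1)), X])   # extra component of 1; its weight is minus the threshold
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for x, t in zip(X, targets):               # keep cycling through every training case
            out = 1 if x @ w > 0 else 0
            if out == t:
                continue                           # correct: leave the weights alone
            w += x if t == 1 else -x               # wrongly 0: add the input; wrongly 1: subtract it
    return w

# A linearly separable task: logical OR.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
w = train_perceptron(X, [0, 1, 1, 1])
print([1 if np.r_[1.0, x] @ w > 0 else 0 for x in X])   # [0, 1, 1, 1]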
Weight space

• Imagine a space in which each axis corresponds to a weight.
  – A point in this space is a weight vector.
• Each training case defines a plane through the origin.
  – On one side of the plane the output is wrong.
• To get all training cases right we need to find a point on the right side of all the planes.

[Figure: a plane through the origin, defined by an input vector, with the good weight vectors on one side and the bad weight vectors on the other.]
Why the learning procedure works

• Consider the squared distance between any satisfactory weight vector and the current weight vector.
  – Every time the perceptron makes a mistake, the learning algorithm moves the current weight vector towards all satisfactory weight vectors (unless it crosses the constraint plane).
• So consider "generously satisfactory" weight vectors that lie within the feasible region by a margin at least as great as the largest update.
  – Every time the perceptron makes a mistake, the squared distance to all of these weight vectors is always decreased by at least the squared length of the smallest update vector.
What perceptrons cannot do

• The binary threshold output units cannot even tell if two single-bit numbers are the same!
  Same: (1,1) → 1; (0,0) → 1
  Different: (1,0) → 0; (0,1) → 0
• The following set of inequalities is impossible:
  w1 + w2 ≥ θ,  0 ≥ θ
  w1 < θ,  w2 < θ
  (The second inequality gives θ ≤ 0, so adding the last two gives w1 + w2 < 2θ ≤ θ, contradicting the first.)

[Figure: data space with the four points (0,0), (0,1), (1,0), (1,1); the positive and negative cases cannot be separated by a plane.]
What can perceptrons do?

• They can only solve tasks if the hand-coded features


convert the original task into a linearly separable one.
How difficult is this?
• The N-bit parity task :
– Requires N features of the form: Are at least m bits
on?
– Each feature must look at all the components of
the input.
• The 2-D connectedness task
– requires an exponential number of features!
The N-bit even parity task

• There is a simple solution that requires N hidden units.
  – Each hidden unit computes whether more than M of the inputs are on.
  – This is a linearly separable problem.
• There are many variants of this solution.
  – It can be learned.
  – It generalizes well if 2^N >> N².

[Figure: for N = 4, hidden units with thresholds >0, >1, >2, >3 all see the inputs (shown as 1 0 1 0) and feed an output unit through weights -2, +2, -2, +2, with a bias of +1.]
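A sketch of the construction in the figure, generalized to any N; the alternating -2/+2 output weights and the +1 bias are taken from the diagram, the rest is illustrative.

from itertools import product

def even_parity(bits):
    # N hidden threshold units feeding one output threshold unit.
    n = len(bits)
    s = sum(bits)
    # Hidden unit m fires if more than m inputs are on (thresholds >0, >1, ..., >N-1).
    hidden = [1 if s > m else 0 for m in range(n)]
    # Output unit: alternating weights -2, +2, -2, +2, ... plus a bias of +1.
    net = 1 + sum(h * (-2 if m % 2 == 0 else 2) for m, h in enumerate(hidden))
    return 1 if net > 0 else 0

# Check the network against true even parity for all 16 four-bit inputs.
assert all(even_parity(b) == 1 - sum(b) % 2 for b in product([0, 1], repeat=4))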
Why connectedness is hard to compute

• Even for simple line drawings, there are


exponentially many cases.
• Removing one segment can break
connectedness
– But this depends on the precise
arrangement of the other pieces.
– Unlike parity, there are no simple
summaries of the other pieces that tell us
what will happen.
• Connectedness is easy to compute with an
iterative algorithm.
– Start anywhere in the ink
– Propagate a marker
– See if all the ink gets marked.
Distinguishing T from C in any orientation and position

• What kind of features are required to


distinguish two different patterns of 5
pixels independent of position and
orientation?
– Do we need to replicate T and C
templates across all positions and
orientations?
  – Looking at pairs of pixels will not work.
  – Looking at triples will work if we assume that each input image only contains one object.

Replicate the following two feature detectors in all positions:

[Figure: two small templates of + and - pixel weights.]

If any of these equal their threshold of 2, it's a C. If not, it's a T.
Beyond perceptrons
• Need to learn the features, not just how to weight them to
make a decision. This is a much harder task.
– We may need to abandon guarantees of finding optimal
solutions.
• Need to make use of recurrent connections, especially for
modeling sequences.
– The network needs a memory (in the activities) for events
that happened some time ago, and we cannot easily put an
upper bound on this time.
• Engineers call this an “Infinite Impulse Response” system.
– Long-term temporal regularities are hard to learn.
• Need to learn representations without a teacher.
– This makes it much harder to define what the goal of
learning is.
Beyond perceptrons
• Need to learn complex hierarchical representations
for structures like: “John was annoyed that Mary
disliked Bill.”
– We need to apply the same computational
apparatus to the embedded sentence as to the
whole sentence.
• This is hard if we are using special purpose hardware in
which activities of hardware units are the
representations and connections between hardware units
are the program.
• We must somehow traverse deep hierarchies using
fixed hardware and sharing knowledge between
levels.
Sequential Perception
• We need to attend to one part of the sensory input at a time.
– We only have high resolution in a tiny region.
– Vision is a very sequential process (but the scale varies)
– We do not do high-level processing of most of the visual input
(lack of motion tells us nothing has changed).
– Segmentation and the sequential organization of sensory
processing are often ignored by neural models.
• Segmentation is a very difficult problem
– Segmenting a figure from its background seems very easy because we are so good at it, but it's actually very hard.
– Contours sometimes have imperceptible contrast, but we still
perceive them.
– Segmentation often requires a lot of top-down knowledge.
Fisher Linear Discrimination

• Reduce the problem from multi-dimensional to one-dimensional

– Let ‘v’ be a vector in our space


– Project the data on the vector ‘v’
– Estimate the ‘scatterness’ of the data as projected
on ‘v’
– Use this ‘v’ to create a classifier
Fisher Linear Discrimination
• Suppose we are in a 2D space
• Which of the three vectors is an optimal ‘v’?
Fisher Linear Discrimination

• The optimal vector maximizes the ratio of the between-group sum of squares to the within-group sum of squares, denoted

  (v^T B v) / (v^T W v)

[Figure: two clusters of points; B measures the spread between the classes, W the spread within each class.]
Fisher Linear Discrimination
Suppose the case of two classes.

• Mean of each class's samples:

  m_i = (1/n_i) Σ_{x ∈ X_i} x

• Mean of the projected samples:

  m̃_i = (1/n_i) Σ_{y ∈ Y_i} y = (1/n_i) Σ_{x ∈ X_i} v^T x = v^T m_i

• 'Scatterness' of the projected samples:

  s̃_i² = Σ_{y ∈ Y_i} (y - m̃_i)²

• Criterion function:

  J(v) = |m̃_1 - m̃_2|² / (s̃_1² + s̃_2²)
Fisher Linear Discrimination

• The criterion function should be maximized.
• Present J as a function of the vector v:

  W_i = Σ_{x ∈ X_i} (x - m_i)(x - m_i)^T,   W = W_1 + W_2

  s̃_i² = Σ_{x ∈ X_i} (v^T x - v^T m_i)² = Σ_{x ∈ X_i} v^T (x - m_i)(x - m_i)^T v = v^T W_i v

  s̃_1² + s̃_2² = v^T W v

  B = (m_1 - m_2)(m_1 - m_2)^T

  (m̃_1 - m̃_2)² = (v^T m_1 - v^T m_2)² = v^T (m_1 - m_2)(m_1 - m_2)^T v = v^T B v

  J(v) = (v^T B v) / (v^T W v)
Fisher Linear Discrimination

• The matrix version of the criterion works the same way for more than two classes.
• J(v) is maximized when

  B v = λ W v

  i.e. when v is a generalized eigenvector of (B, W) with the largest eigenvalue λ.
Fisher Linear Discrimination

Classification of a new observation ‘x’:


• Let the class of ‘x’ be the class whose mean vector is
closest to ‘x’ in terms of the discriminant variables
• In other words, the class whose mean vector’s
projection on ‘v’ is the closest to the projection of ‘x’
on ‘v’
A Regularized Fisher LDA

• S_w can be singular (inaccurate inversion)
• Regularization: S_w^reg = S_w + λI
  – (Ridge-regression style)
• Adds a compensation for the noise, λ, to the diagonal
• Choose λ as a percentile of the eigenvalues of S_w
• “λ-FLDA”
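A two-class sketch along these lines. It uses the standard closed-form direction v ∝ (S_w + λI)⁻¹(m_1 - m_2) and classifies a new point by the nearest projected class mean; the data, the 10th-percentile choice of λ, and all names are illustrative.

import numpy as np

def fit_reg_flda(X1, X2, lam_percentile=10):
    # Returns the discriminant direction v and the two projected class means.
    m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
    Sw = (X1 - m1).T @ (X1 - m1) + (X2 - m2).T @ (X2 - m2)        # within-class scatter
    lam = np.percentile(np.linalg.eigvalsh(Sw), lam_percentile)    # lambda from the eigenvalue spectrum
    v = np.linalg.solve(Sw + lam * np.eye(Sw.shape[0]), m1 - m2)   # regularized Fisher direction
    return v, v @ m1, v @ m2

def classify(x, v, pm1, pm2):
    # Assign x to the class whose projected mean is closest to the projection of x on v.
    p = v @ x
    return 1 if abs(p - pm1) <= abs(p - pm2) else 2

rng = np.random.default_rng(0)
X1 = rng.normal([0.0, 0.0], 1.0, size=(50, 2))
X2 = rng.normal([3.0, 1.0], 1.0, size=(50, 2))
v, pm1, pm2 = fit_reg_flda(X1, X2)
print(classify(np.array([2.5, 1.0]), v, pm1, pm2))   # most likely class 2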
Linear regression

[Figure: a 3-D scatter plot of temperature against two input variables.]

Given examples (x_i, y_i), i = 1, ..., n.
Predict the output y for a new point x.
Linear regression

[Figure: the same temperature data with a fitted linear model; two points read off the model are labelled "Prediction".]
Ordinary Least Squares (OLS)

[Figure: a scatter plot with a fitted line; the vertical gap between an observation and the corresponding prediction is the error, or "residual".]

Sum squared error:

  E(w) = Σ_i (y_i - w^T x_i)²

Minimize the sum squared error: setting the derivative with respect to w to zero gives a linear equation in w, i.e. a linear system (the normal equations):

  X^T X w = X^T y

where X has one input vector per row and y is the vector of outputs.

Alternative derivation: write the error as ||Xw - y||² and set its derivative to zero,

  d/dw ||Xw - y||² = 2 X^T (Xw - y) = 0.

Solve the system (it's better not to invert the matrix).
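A sketch of solving the normal equations directly rather than inverting X^T X; the synthetic data and names below are arbitrary.

import numpy as np

rng = np.random.default_rng(1)
X = np.hstack([np.ones((100, 1)), rng.normal(size=(100, 2))])   # constant feature plus two inputs
true_w = np.array([4.0, 2.0, -1.0])
y = X @ true_w + 0.1 * rng.normal(size=100)

# Solve X^T X w = X^T y as a linear system, without forming an explicit inverse.
w = np.linalg.solve(X.T @ X, X.T @ y)
# Equivalently, and more numerically stable: np.linalg.lstsq(X, y, rcond=None)[0]
print(w)   # close to [4, 2, -1]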
LMS Algorithm
(Least Mean Squares)

Update the weights one observation at a time:

  w ← w + η (y_i - w^T x_i) x_i

where η is a small learning rate. This is an online algorithm: it makes a small correction after each example instead of solving the whole system at once.
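An online LMS sketch on the same kind of synthetic data; the step size, epoch count and names are illustrative.

import numpy as np

def lms(X, y, eta=0.01, epochs=50):
    # Online least-mean-squares: one small weight correction per observation.
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for x_i, y_i in zip(X, y):
            w += eta * (y_i - w @ x_i) * x_i   # move w to reduce this example's squared error
    return w

rng = np.random.default_rng(2)
X = np.hstack([np.ones((200, 1)), rng.normal(size=(200, 2))])
y = X @ np.array([4.0, 2.0, -1.0]) + 0.1 * rng.normal(size=200)
print(lms(X, y))   # approaches [4, 2, -1]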
Beyond lines and planes

[Figure: a 1-D data set fitted with a curve rather than a straight line.]

We can fit curves such as ŷ = w_0 + w_1 x + w_2 x², which is still linear in the weights w.

Everything is the same as before once the input is replaced by a vector of basis functions (features), e.g. x → (1, x, x²).


Geometric interpretation

[Figure: a 3-D illustration of the least-squares fit.]

[Matlab demo]
Ordinary Least Squares [summary]

Given examples (x_i, y_i), i = 1, ..., n.
Let X be the matrix whose i-th row is x_i^T (including a constant feature of 1), and let y be the vector of targets.
For example, for a line in one variable the i-th row of X is (1, x_i).
Minimize ||Xw - y||² by solving X^T X w = X^T y.
Predict ŷ = w^T x for a new point x.
Probabilistic interpretation

[Figure: the fitted line with a Gaussian noise distribution drawn around the prediction at each input.]

Assume y_i = w^T x_i + ε_i with Gaussian noise ε_i ~ N(0, σ²). The likelihood of the data is then a product of Gaussians, and maximizing it over w is equivalent to minimizing the sum squared error.
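A short derivation of that equivalence under the Gaussian-noise assumption stated above (σ² is the noise variance):

\log p(y_{1:n} \mid X, w)
  = \sum_{i=1}^{n} \log \frac{1}{\sqrt{2\pi\sigma^2}}
    \exp\!\left( -\frac{(y_i - w^{\top} x_i)^2}{2\sigma^2} \right)
  = -\frac{n}{2}\log(2\pi\sigma^2)
    - \frac{1}{2\sigma^2} \sum_{i=1}^{n} (y_i - w^{\top} x_i)^2

The first term does not depend on w, so maximizing the log-likelihood over w is the same as minimizing Σ_i (y_i - w^T x_i)².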
