
Chapter 6: Artificial Intelligence
Learning

By Getaneh T.
Introduction
knowledge infusion is not always the best way of
providing an agent with knowledge
  impractical, tedious
  incomplete, imprecise, possibly incorrect
adaptivity
  an agent can expand and modify its knowledge base to
  reflect changes
improved performance
  through learning the agent can make better decisions
autonomy
  without learning, an agent can hardly be considered
  autonomous
Motivation
learning is important for agents to deal with
  unknown environments
  changes
the capability to learn is essential for the
autonomy of an agent
in many cases, it is more efficient to train an
agent via examples than to manually extract
knowledge from the examples and instill it
into the agent
agents capable of learning can improve their
performance
Objectives
be aware of the necessity of learning for
autonomous agents
understand the basic principles and limitations
of inductive learning from examples
apply decision tree learning to deterministic
problems characterized by Boolean functions
understand the basic learning methods of
perceptrons and multi-layer neural networks
know the main advantages and problems of
learning in neural networks
Learning
an agent tries to improve its behavior through
observation, reasoning, or reflection
learning from experience
  memorization of past percepts, states, and actions
  generalizations, identification of similar experiences
forecasting
  prediction of changes in the environment
theories
  generation of complex models based on observations
  and reasoning
Learning from Observation
  Learning Agents
  Inductive Learning
  Learning Decision Trees
Learning Agents
based on previous agent designs, such as
reflexive, model-based, goal-based agents
  those aspects of agents are encapsulated into the
  performance element of a learning agent
a learning agent has an additional learning
element
  usually used in combination with a critic and a
  problem generator for better learning
most agents learn from examples
  inductive learning
Learning Agent Model
[Diagram: sensors feed percepts to the critic, which compares them
against a performance standard and gives feedback to the learning
element; the learning element exchanges knowledge with the
performance element, sends it changes, and sets learning goals for
the problem generator; the performance element drives the agent's
effectors, which act on the environment.]
Forms of Learning
supervised learning
  an agent tries to find a function that matches examples
  from a sample set
  each example provides an input together with the correct
  output
  a teacher provides feedback on the outcome
    the teacher can be an outside entity, or part of the
    environment
unsupervised learning
  the agent tries to learn from patterns without
  corresponding output values
reinforcement learning
  the agent does not know the exact output for an input,
  but it receives feedback on the desirability of its behavior
  the feedback can come from an outside entity, the
  environment, or the agent itself
  the feedback may be delayed, and not follow the respective
  action immediately
Two types of learning in AI
Deductive: deduce rules/facts from already known
rules/facts.
Inductive: learn new rules/facts from a data set.
We will be dealing with the latter, inductive learning, now.
Inductive Learning
tries to find a function h (the hypothesis) that
approximates a set of samples defining a
function f
  the samples are usually provided as
  input-output pairs (x, f(x))
supervised learning method
relies on inductive inference, or induction
  conclusions are drawn from specific instances to more
  general statements
Hypotheses
finding a suitable hypothesis can be difficult
  since the function f is unknown, it is hard to tell if the
  hypothesis h is a good approximation
the hypothesis space describes the set of
hypotheses under consideration
  e.g. polynomials, sinusoidal functions, propositional logic,
  predicate logic, ...
  the choice of the hypothesis space can strongly influence
  the task of finding a suitable function
  while a very general hypothesis space (e.g. Turing
  machines) may be guaranteed to contain a suitable
  function, it can be difficult to find it
Ockham's razor: if multiple hypotheses are
consistent with the data, choose the simplest one
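To make this concrete, here is a minimal Python sketch (added for illustration; the sample data and the consistency tolerance are invented): the hypothesis spaces are polynomials of increasing degree, and, following Ockham's razor, we keep the simplest hypothesis consistent with the samples (x, f(x)).

```python
# Choosing the simplest polynomial hypothesis consistent (within a
# tolerance) with the observed samples (x, f(x)).
import numpy as np

xs = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
ys = np.array([1.0, 3.0, 5.0, 7.0, 9.0])   # secretly f(x) = 2x + 1

def fit_and_score(degree):
    """Fit a polynomial hypothesis h of the given degree; return (h, max error)."""
    coeffs = np.polyfit(xs, ys, degree)
    h = np.poly1d(coeffs)
    return h, np.max(np.abs(h(xs) - ys))

# Ockham's razor: scan hypothesis spaces from simplest to most complex,
# keep the first hypothesis that is consistent with the data.
for degree in range(0, 5):
    h, err = fit_and_score(degree)
    if err < 1e-6:
        print(f"simplest consistent hypothesis has degree {degree}:\n{h}")
        break
```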
Learning and Decision Trees
based on a set of attributes as input, a predicted
output value, the decision, is learned
  this is called classification learning for discrete values,
  regression for continuous values
Boolean or binary classification
  output values are true or false
  conceptually the simplest case, but still quite powerful
making decisions
  a sequence of tests is performed, testing the value of one
  of the attributes in each step
  when a leaf node is reached, its value is returned
  good correspondence to human decision-making
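Such a sequence of tests can be sketched in a few lines of Python (the tree and the attributes Patrons and Hungry are invented for this sketch, echoing the textbook's restaurant example):

```python
# A decision as a walk through a tree: one attribute test per step.
tree = ("Patrons",                      # test this attribute first
        {"None": False,                 # leaf: decision is False
         "Some": True,                  # leaf: decision is True
         "Full": ("Hungry",             # internal node: test another attribute
                  {"Yes": True,
                   "No": False})})

def decide(node, example):
    """Walk the tree, testing one attribute per step, until a leaf is reached."""
    while not isinstance(node, bool):   # leaves are Boolean values
        attribute, branches = node
        node = branches[example[attribute]]
    return node

print(decide(tree, {"Patrons": "Full", "Hungry": "Yes"}))   # True
```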


Terminology
example or sample
  describes the values of the attributes and the goal
  a positive sample has the value true for the goal predicate, a
  negative sample false
sample set
  collection of samples used for training and validation
training
  the training set consists of samples used for constructing
  the decision tree
validation
  the test set is used to determine if the decision tree
  performs correctly
  ideally, the test set is different from the training set
Expressiveness of Decision Trees
decision trees can also be expressed in logic
as implication sentences
in principle, they can express propositional
logic sentences
  each row in the truth table of a sentence can be
  represented as a path in the tree
  often there are more efficient trees
some functions require exponentially large
decision trees
  parity function, majority function
Constructing Decision Trees
in general, constructing the smallest possible
decision tree is an intractable problem
  algorithms exist for constructing reasonably small
  trees
basic idea: test the most important attribute first
  the attribute that makes the most difference for the
  classification of an example
  can be determined through information theory
  hopefully will yield the correct classification with few tests
Decision Tree Algorithm
recursive formulation
  select the best attribute to split positive and negative examples
  if only positive or only negative examples are left, we are done
  if no examples are left, no such examples were observed
    return a default value calculated from the majority classification
    at the node's parent
  if we have positive and negative examples left, but no
  attributes to split them, we are in trouble
    samples have the same description, but different classifications
    may be caused by incorrect data (noise), by a lack of information,
    or by a truly non-deterministic domain
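A minimal sketch of this recursive formulation in Python, using information gain for attribute selection (the helper names are my own; examples are (attribute-dict, goal) pairs):

```python
import math
from collections import Counter

def entropy(examples):
    counts = Counter(goal for _, goal in examples)
    total = len(examples)
    return -sum(c / total * math.log2(c / total) for c in counts.values())

def information_gain(examples, attr):
    remainder = 0.0
    for value in {ex[attr] for ex, _ in examples}:
        subset = [(ex, g) for ex, g in examples if ex[attr] == value]
        remainder += len(subset) / len(examples) * entropy(subset)
    return entropy(examples) - remainder

def majority(examples):
    return Counter(g for _, g in examples).most_common(1)[0][0]

def dtl(examples, attributes, default):
    if not examples:                  # no such examples observed: parent majority
        return default
    goals = {g for _, g in examples}
    if len(goals) == 1:               # only positive or only negative left: done
        return goals.pop()
    if not attributes:                # same description, different classes (noise)
        return majority(examples)
    best = max(attributes, key=lambda a: information_gain(examples, a))
    branches = {}
    for value in {ex[best] for ex, _ in examples}:
        subset = [(ex, g) for ex, g in examples if ex[best] == value]
        rest = [a for a in attributes if a != best]
        branches[value] = dtl(subset, rest, majority(examples))
    return (best, branches)
```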
Performance of Decision Tree Learning
quality of predictions
  predictions for the classification of unknown examples that
  agree with the correct result are obviously better
  can be measured easily after the fact
  it can be assessed in advance by splitting the available
  examples into a training set and a test set
  learn the training set, and assess the performance via the
  test set
size of the tree
  a smaller tree (especially depth-wise) is a more concise
  representation
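For illustration only, the split-and-assess methodology sketched with scikit-learn (assuming that library is available; the iris data set merely stands in for any collection of samples):

```python
# Assess predictive quality in advance: learn the training set,
# measure agreement with the correct results on the held-out test set.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

tree = DecisionTreeClassifier().fit(X_train, y_train)   # learn the training set
print("test-set accuracy:", accuracy_score(y_test, tree.predict(X_test)))
```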
Ensemble Learning
multiple hypotheses (an ensemble) are generated,
and their predictions combined
  by using multiple hypotheses, the likelihood of
  misclassification is hopefully lower
  also enlarges the hypothesis space
boosting is a frequently used ensemble method
  each example in the training set has a weight associated
  the weights of incorrectly classified examples are increased,
  and a new hypothesis is generated from this new weighted
  training set
  the final hypothesis is a weighted-majority combination of all
  the generated hypotheses
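A compact sketch of this boosting scheme (an AdaBoost-style variant; the weak learner is left abstract, and labels are assumed to be +1 or -1):

```python
import math

def boost(examples, labels, weak_learner, rounds):
    n = len(examples)
    w = [1.0 / n] * n                     # every example starts with equal weight
    ensemble = []
    for _ in range(rounds):
        h = weak_learner(examples, labels, w)    # hypothesis from weighted set
        err = sum(wi for wi, x, y in zip(w, examples, labels) if h(x) != y)
        if err == 0 or err >= 0.5:        # degenerate cases: stop early
            break
        alpha = 0.5 * math.log((1 - err) / err)  # weight of this hypothesis
        # increase the weights of misclassified examples, decrease the rest
        w = [wi * math.exp(alpha if h(x) != y else -alpha)
             for wi, x, y in zip(w, examples, labels)]
        total = sum(w)
        w = [wi / total for wi in w]      # renormalize into a distribution
        ensemble.append((alpha, h))
    # final hypothesis: weighted-majority vote of the generated hypotheses
    def vote(x):
        s = sum(alpha * (1 if h(x) == 1 else -1) for alpha, h in ensemble)
        return 1 if s >= 0 else -1
    return vote
```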
Computational Learning Theory
relies on methods and techniques from theoretical
computer science, statistics, and AI
used for the formal analysis of learning algorithms
basic principles
  if a hypothesis is seriously wrong, it will most likely generate
  a false prediction even for small numbers of examples
  if a hypothesis is consistent with a large number of examples,
  most likely it is quite good, or probably approximately correct
Probably Approximately Correct (PAC) Learning
approximately correct hypothesis
  its error lies within a small constant of the true result
  by testing a sufficient number of examples, one can see if a
  hypothesis has a high probability of being approximately
  correct
stationary assumption
  training and test sets follow the same probability distribution
  there is a connection between the past (known) and the
  future (unknown)
  a selection of non-representative examples will not result in
  good learning
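For reference, the standard sample-complexity bound behind "a sufficient number of examples" (added here, not printed on the original slide; |H| is the size of the hypothesis space, ε the error bound, δ the failure probability):

```latex
m \;\ge\; \frac{1}{\epsilon}\left(\ln\frac{1}{\delta} + \ln\lvert H\rvert\right)
```

A hypothesis consistent with that many examples is, with probability at least 1 - δ, approximately correct, i.e. its error is at most ε.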
Knowledge in Learning
A logical formulation of learning
What are the goal and hypotheses?
  goal predicate Q, e.g. WillWait
  learning is to find an equivalent logical expression
  with which we can classify examples
  each hypothesis proposes such an expression,
  a candidate definition of Q
A logical formulation of learning (cont.)
the hypothesis space is the set of all
hypotheses the learning algorithm is
designed to entertain
one of the hypotheses is correct:
  H1 ∨ H2 ∨ ... ∨ Hn
each Hi predicts a certain set of examples,
the extension of the goal predicate
two hypotheses with different extensions
are logically inconsistent with each other;
otherwise, they are logically equivalent
What are Examples?
an example is an object of some logical
description to which the goal concept
may or may not apply
  Alt(X1) ∧ ¬Bar(X1) ∧ ¬Fri/Sat(X1) ∧ ...
ideally, we want to find a hypothesis
that agrees with all the examples
the possible relations between f and h are: ++, --,
+- (false negative), and -+ (false positive); if
either of the last two occurs, the example and h
are logically inconsistent
Inductive learning in the logical setting
the objective is to find a hypothesis that explains
the classifications of the examples, given their
descriptions
  Hypothesis ∧ Descriptions ⊨ Classifications
  Hypothesis: unknown, explains the observations
  Descriptions: the conjunction of all the example
  descriptions
  Classifications: the conjunction of all the
  example classifications
knowledge-free learning
  e.g. decision trees: the hypothesis directly maps
  Descriptions to Classifications
A cumulative learning process
observations, knowledge-based learning, hypotheses,
and prior knowledge
the new approach is to design agents that
already know something and are trying to
learn some more
intuitively, this should be faster and better
than learning without knowledge, assuming
what's known is always correct
how can we implement this cumulative learning
with increasing knowledge?
Some examples of using knowledge
one can leap to general conclusions
after only one observation
  have you had such an experience?
  traveling to Brazil: language and name
  a pharmacologically ignorant but
  diagnostically sophisticated medical
  student
Inductive logic programming (ILP)
ILP can formulate hypotheses in general
first-order logic
  other methods, like decision trees, use more
  restricted languages
prior knowledge is used to reduce the
complexity of learning:
  prior knowledge further reduces the hypothesis space
  prior knowledge helps find a shorter hypothesis
  again, assuming the prior knowledge is correct
Explanation-based learning
a method to extract general rules from individual
observations
the goal is to solve a similar problem faster next
time
memoization: speed up by saving results and
avoiding solving a problem from scratch
EBL goes one step further: from observations
to rules
Learning using relevant information
prior knowledge: people in a country
usually speak the same language
  Nat(x,n) ∧ Nat(y,n) ∧ Lang(x,l) ⇒ Lang(y,l)
observation: given nationality, the language
is fully determined
given that Fernando is Brazilian and speaks Portuguese
  Nat(Fernando,B) ∧ Lang(Fernando,P)
we can logically conclude
  Nat(y,B) ⇒ Lang(y,P)
Inductive logic programming
it combines inductive methods with FOL
ILP represents theories as logic programs
ILP offers complete algorithms for inducing
general, first-order theories from examples
it can learn successfully in domains where
attribute-based algorithms fail completely
an example: a typical family tree (Fig 19.11)
Statistical Learning Methods
Statistical Learning
data: instantiations of some or all of the random
variables describing the domain; they are evidence
hypotheses: probabilistic theories of how the
domain works
the surprise candy example: two flavors in very
large bags of 5 kinds, indistinguishable from
outside
  h1: 100% cherry, P(c|h1) = 1, P(l|h1) = 0
  h2: 75% cherry + 25% lime
  h3: 50% cherry + 50% lime
  h4: 25% cherry + 75% lime
  h5: 100% lime
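A short Bayesian-learning sketch for this candy example (the prior probabilities are an assumption, following the textbook's version of the example): the posterior over h1..h5 is renormalized after each observed lime candy.

```python
# Posterior P(hi | d) after observing a run of lime candies.
priors = [0.1, 0.2, 0.4, 0.2, 0.1]        # P(h1) .. P(h5), assumed
p_lime = [0.0, 0.25, 0.5, 0.75, 1.0]      # P(lime | hi) from the slide

posterior = priors[:]
for _ in range(10):                        # observe 10 lime candies in a row
    posterior = [p * pl for p, pl in zip(posterior, p_lime)]
    total = sum(posterior)
    posterior = [p / total for p in posterior]

print([round(p, 4) for p in posterior])    # mass concentrates on h5 (all lime)
```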
Learning with Complete Data
parameter learning: find the numerical parameters
for a probability model whose structure is fixed
data are complete when each data point contains
values for every variable in the model
maximum-likelihood (ML) parameter learning: discrete
model
  two examples: one and three parameters
  (treated in the next slides, from the book authors' slides)
with complete data, the ML parameter learning problem
for a Bayesian network decomposes into separate learning
problems, one for each parameter
a significant problem with ML learning: a probability of 0
may be assigned to events that have not been observed
  various tricks are used to avoid this problem; one is to
  assign count = 1 instead of 0 to each event
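A tiny sketch of that count = 1 trick (Laplace smoothing) for a discrete maximum-likelihood estimate (function and variable names are invented):

```python
# ML estimate of a discrete distribution, with pseudocounts so that
# unobserved events do not receive probability 0.
from collections import Counter

def ml_estimate(observations, domain, pseudocount=1):
    counts = Counter(observations)
    total = len(observations) + pseudocount * len(domain)
    return {v: (counts[v] + pseudocount) / total for v in domain}

# 'lime' was never observed, yet it does not get probability 0:
print(ml_estimate(["cherry"] * 9, domain=["cherry", "lime"]))
# {'cherry': 10/11 ~ 0.909, 'lime': 1/11 ~ 0.091}
```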
Learning with Hidden Variables
many real-world problems have hidden
variables, which are not observable in the
data available for learning
question: if a variable (e.g. a disease) is not
observed, why not construct a model
without it?
answer: hidden variables can dramatically
reduce the number of parameters required
to specify a Bayesian network; this reduces
the amount of data needed for learning
Learning with Hidden Variables (cont.): the EM algorithm
the E-step computes the expected value pij of the
hidden indicator variables Zij, where Zij is 1 if xj
was generated by the i-th component, 0 otherwise
the M-step finds the new values of the parameters
that maximize the log likelihood of the data,
given the expected values of Zij
  Fig 20.8(c) is learned from Fig 20.8(a)
  Fig 20.9(a) plots the log likelihood of the data as EM
  progresses
EM increases the log likelihood of the data at each
iteration
EM can reach a local maximum in likelihood
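A minimal EM sketch for a two-component, one-dimensional Gaussian mixture, following the E-step/M-step description above (the data, initialization, and iteration count are invented):

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-2, 1, 200), rng.normal(3, 1, 200)])

mu = np.array([-1.0, 1.0])                 # initial guesses
sigma = np.array([1.0, 1.0])
w = np.array([0.5, 0.5])                   # mixing weights

def gauss(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

for _ in range(50):
    # E-step: expected values p_ij of the hidden indicators Z_ij
    p = np.stack([w[i] * gauss(x, mu[i], sigma[i]) for i in range(2)])
    p /= p.sum(axis=0)
    # M-step: parameters maximizing expected log likelihood given p_ij
    n = p.sum(axis=1)
    mu = (p * x).sum(axis=1) / n
    sigma = np.sqrt((p * (x - mu[:, None]) ** 2).sum(axis=1) / n)
    w = n / len(x)

print(mu, sigma, w)   # the log likelihood of the data rises at each iteration
```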
Instance-based Learning
parametric vs. nonparametric learning
  parametric learning focuses on fitting the parameters of a
  restricted family of probability models to an
  unrestricted data set
  parametric learning methods are often simple and
  effective, but can oversimplify what's really
  happening
  nonparametric learning allows the hypothesis
  complexity to grow with the data
IBL is nonparametric, as it constructs hypotheses
directly from the training data
Nearest-neighbor models
the key idea: neighbors are similar
density estimation example: estimate x's probability
density by the density of its neighbors
  connects with table lookup, NBC, decision trees, ...
how to define the neighborhood N
  if too small, it contains no data points
  if too big, the density is the same everywhere
  a solution is to define N to contain k points, where k
  is large enough to ensure a meaningful estimate
  for a fixed k, the size of N varies (Fig 21.12)
  the effect of the size of k (Fig 21.13)
  for most low-dimensional data, k is usually between 5 and 10
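A sketch of k-nearest-neighbor density estimation in one dimension (all values invented): the density at x is estimated as k divided by the number of data points times the length of the smallest interval around x that contains exactly k points.

```python
import numpy as np

rng = np.random.default_rng(1)
data = rng.normal(0.0, 1.0, 1000)       # samples from an unknown density

def knn_density(x, data, k=10):
    dists = np.sort(np.abs(data - x))
    r = dists[k - 1]                     # radius of the neighborhood N holding k points
    return k / (len(data) * 2 * r)       # k points in an interval of length 2r

print(knn_density(0.0, data))            # near the true N(0,1) density ~0.399
print(knn_density(3.0, data))            # much smaller out in the tail
```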
Learning in Neural Networks
  Introduction
  Neurons and the Brain
  Neural Networks
  Perceptrons
  Multi-layer Networks
  Applications
What is a neural network?
a machine designed to model the way in
which the brain performs a particular task
or function of interest; the network is
implemented by using electronic
components or is simulated in software on
a digital computer [1]
viewed as an adaptive machine: a
massively parallel, distributed processor
made up of simple processing units, which
has a propensity for storing experiential
knowledge and making it available for use
[2]
Neural Networks
complex networks of simple computing
elements
capable of learning from examples
  with appropriate learning methods
a collection of simple elements performs
high-level operations
  thought
  reasoning
  consciousness: being able to use your senses and
  mental powers to understand what is happening
Neural Networks and the Brain
brain
  a set of interconnected modules
  performs information processing
  operations at various levels
    sensory input analysis
    memory storage and retrieval
    reasoning
    feelings
    consciousness
neurons
  basic computational elements
  heavily interconnected with
  other neurons
[Russell & Norvig, 1995]
Neuron Diagram
  soma: cell body
  dendrites: incoming branches
  axon: outgoing branch
  synapse: junction between a dendrite and an axon from
  another neuron
[Russell & Norvig, 1995]
Computer vs. Brain

                     Computer                         Brain
Computational units  1-1000 CPUs, 10^7 gates/CPU      10^11 neurons
Storage units        10^10 bits RAM, 10^11 bits disk  10^11 neurons, 10^14 synapses
Cycle time           10^-9 sec (1 GHz)                10^-3 sec (1 kHz)
Bandwidth            10^9 bits/sec                    10^14 bits/sec
Neuron updates/sec   10^5                             10^14
Computer Brain vs. Cat Brain
in 2009, IBM announced a supercomputer simulation
billed as significantly smarter than a cat
  IBM announced a software simulation of a mammalian
  cerebral cortex that is significantly more complex than
  the cortex of a cat
  and, just like the actual brain that it simulates,
  researchers still have to figure out how it works
Artificial Neuron Diagram
[Russell & Norvig, 1995]
weighted inputs are summed up by the input function
the (nonlinear) activation function calculates the
activation value, which determines the output
Common Activation Functions
[Russell & Norvig, 1995]
  Step_t(x) = 1 if x >= t, else 0
  Sign(x) = +1 if x >= 0, else -1
  Sigmoid(x) = 1 / (1 + e^(-x))
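The activation functions above, plus the weighted input sum, written out in Python for concreteness (a sketch, not from the slides):

```python
import math

def step(x, t=0.0):                       # Step_t(x)
    return 1 if x >= t else 0

def sign(x):                              # Sign(x)
    return 1 if x >= 0 else -1

def sigmoid(x):                           # Sigmoid(x)
    return 1 / (1 + math.exp(-x))

def unit_output(weights, inputs, activation=sigmoid):
    """Artificial neuron: activation function applied to the weighted input sum."""
    return activation(sum(w * i for w, i in zip(weights, inputs)))
```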
Neural Networks and Logic Gates
[Russell & Norvig, 1995]
simple neurons can act as logic gates with an
appropriate choice of activation function, threshold,
and weights
  e.g. a step function as the activation function
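A minimal illustration of the point: with weights (1, 1), a step-function neuron computes AND at threshold 1.5 and OR at threshold 0.5 (values chosen for this sketch):

```python
def gate(inputs, weights, threshold):
    # step-function neuron: fire iff the weighted input sum reaches the threshold
    return 1 if sum(w * i for w, i in zip(weights, inputs)) >= threshold else 0

for a in (0, 1):
    for b in (0, 1):
        print(a, b, "AND:", gate((a, b), (1, 1), 1.5),
                    "OR:", gate((a, b), (1, 1), 0.5))
```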
Perceptrons
an artificial network which is
intended to copy the brain's
ability to recognize things and
see the differences between
things
single-layer, feed-forward
network
historically one of the first
types of neural networks
  late 1950s
the output is calculated as
a step function applied to
the weighted sum of inputs
capable of learning simple,
linearly separable functions
[Russell & Norvig, 1995]
Perceptrons and Linear Separability
[Figure: the four Boolean inputs (0,0), (0,1), (1,0), (1,1)
plotted for AND and for XOR]
perceptrons can deal with linearly separable functions
some simple functions are not linearly separable
  e.g. the XOR function
[Russell & Norvig, 1995]
Perceptrons and Linear Separability (cont.)
[Russell & Norvig, 1995]
linear separability can be extended to more than two dimensions
  but it is more difficult to visualize
Perceptrons and Learning
perceptrons can learn from examples through a
simple learning rule
  calculate the error of a unit Erri as the difference between the
  correct output Ti and the calculated output Oi
    Erri = Ti - Oi
  adjust the weight Wij of the input Iij such that the error
  decreases
    Wij := Wij + α · Iij · Erri
    α is the learning rate
this is a gradient descent search through the weight space
perceptrons led to great enthusiasm in the late 50s and early
60s, until Minsky & Papert in 1969 analyzed the class of
representable functions and found the linear separability problem
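The learning rule in code, trained on the linearly separable AND function (a sketch; the bias weight, learning rate, and epoch count are invented):

```python
def train_perceptron(samples, alpha=0.1, epochs=50):
    w = [0.0, 0.0, 0.0]                   # w[0] is a bias weight (input fixed at 1)
    for _ in range(epochs):
        for (x1, x2), target in samples:
            inputs = [1.0, x1, x2]
            # step-function output on the weighted input sum
            output = 1 if sum(wj * ij for wj, ij in zip(w, inputs)) >= 0 else 0
            err = target - output         # Err = T - O
            # Wj := Wj + alpha * Ij * Err
            w = [wj + alpha * ij * err for wj, ij in zip(w, inputs)]
    return w

and_samples = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
print(train_perceptron(and_samples))      # weights implementing AND
```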
Multi-Layer Networks
research in more complex networks with more
than one layer was very limited until the 1980s
  learning in such networks is much more complicated
  the problem is to assign the blame for an error to the
  respective units and their weights in a constructive way
the back-propagation learning algorithm can be
used to facilitate learning in multi-layer networks
Diagram Multi-Layer Network
[Figure: a two-layer network with input units Ik, hidden units aj,
and output units Oi; weights Wkj connect inputs to hidden units,
weights Wji connect hidden units to outputs]
  two-layer network
    input units Ik are usually not counted as a separate layer
    hidden units aj
    output units Oi
  usually all nodes of one layer have weighted
  connections to all nodes of the next layer
Back-Propagation Algorithm
assigns blame to individual units in the respective
layers
  essentially based on the connection strength
  proceeds from the output layer to the hidden layer(s)
  updates the weights of the units leading to the layer
essentially performs gradient-descent search on the
error surface
  relatively simple since it relies only on local information
  from directly connected units
  has convergence and efficiency problems
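A minimal back-propagation sketch (network size, learning rate, and iteration count are invented): one hidden layer of sigmoid units trained by gradient descent on XOR, with blame propagated from the output layer back to the hidden layer.

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
T = np.array([[0], [1], [1], [0]], dtype=float)   # XOR targets

W1, W2 = rng.normal(0, 1, (2, 2)), rng.normal(0, 1, (2, 1))
b1, b2 = np.zeros(2), np.zeros(1)
sig = lambda x: 1 / (1 + np.exp(-x))

for _ in range(10000):
    # forward pass
    a1 = sig(X @ W1 + b1)                 # hidden activations
    out = sig(a1 @ W2 + b2)               # output activations
    # backward pass: assign blame, output layer first, then hidden layer
    d_out = (out - T) * out * (1 - out)
    d_hid = (d_out @ W2.T) * a1 * (1 - a1)
    # gradient-descent updates on the error surface
    W2 -= 0.5 * a1.T @ d_out
    b2 -= 0.5 * d_out.sum(axis=0)
    W1 -= 0.5 * X.T @ d_hid
    b1 -= 0.5 * d_hid.sum(axis=0)

# should approach [[0], [1], [1], [0]]; backprop can get stuck in local minima
print(out.round(2))
```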
Capabilities of Multi-Layer Neural Networks
expressiveness
  weaker than predicate logic
  good for continuous inputs and outputs
computational efficiency
  training time can be exponential in the number of inputs
  depends critically on parameters like the learning rate
  local minima are problematic
    can be overcome by simulated annealing, at additional cost
generalization
  works reasonably well for some functions (classes of
  problems)
  no formal characterization of these functions
Capabilities of Multi-Layer Neural Networks (cont.)
sensitivity to noise
  very tolerant
  they perform nonlinear regression
transparency
  neural networks are essentially black boxes
  there is no explanation or trace for a particular answer
  tools for the analysis of networks are very limited
  some limited methods to extract rules from networks
prior knowledge
  very difficult to integrate, since the internal representation
  of the networks is not easily accessible
Applications
domains and tasks where neural networks are
successfully used
  handwriting recognition
  control problems
    juggling, truck backup problem
  series prediction
    weather, financial forecasting
  categorization
    sorting of items (fruit, characters, phonemes, ...)
