
Chapter 6: Artificial Intelligence
Learning

By Getaneh T.
Introduction
knowledge infusion is not always the best way of
providing an agent with knowledge
  impractical, tedious
  incomplete, imprecise, possibly incorrect
adaptivity
  an agent can expand and modify its knowledge base to
  reflect changes
improved performance
  through learning the agent can make better decisions
autonomy
  without learning, an agent can hardly be considered
  autonomous
Motivation
learning is important for agents to deal with
  unknown environments
  changes
the capability to learn is essential for the
autonomy of an agent
in many cases, it is more efficient to train an
agent via examples than to manually extract
knowledge from the examples and instill it
into the agent
agents capable of learning can improve their
performance
Objectives
be aware of the necessity of learning for
autonomous agents
understand the basic principles and limitations
of inductive learning from examples
apply decision tree learning to deterministic
problems characterized by Boolean functions
understand the basic learning methods of
perceptrons and multi-layer neural networks
know the main advantages and problems of
learning in neural networks
Learning
an agent tries to improve its behavior through
observation, reasoning, or reflection
learning from experience
  memorization of past percepts, states, and actions
  generalizations, identification of similar experiences
forecasting
  prediction of changes in the environment
theories
  generation of complex models based on observations
  and reasoning
Learning from Observation
  Learning Agents
  Inductive Learning
  Learning Decision Trees
Learning Agents
based on previous agent designs, such as
reflexive, model-based, goal-based agents
  those aspects of agents are encapsulated into the
  performance element of a learning agent
a learning agent has an additional learning
element
  usually used in combination with a critic and a
  problem generator for better learning
most agents learn from examples
  inductive learning
Learning Agent Model
[Diagram: sensors feed percepts to the critic, which compares them
against a performance standard and gives feedback to the learning
element; the learning element exchanges knowledge with the
performance element, sends it changes, and sets learning goals for
the problem generator; the performance element drives the agent's
effectors, which act on the environment.]
Forms of Learning
supervised learning
  an agent tries to find a function that matches examples
  from a sample set
  each example provides an input together with the correct
  output
  a teacher provides feedback on the outcome
    the teacher can be an outside entity, or part of the
    environment
unsupervised learning
  the agent tries to learn from patterns without
  corresponding output values
reinforcement learning
  the agent does not know the exact output for an input,
  but it receives feedback on the desirability of its behavior
  the feedback can come from an outside entity, the
  environment, or the agent itself
  the feedback may be delayed, and not follow the respective
  action immediately
Two types of learning in AI
Deductive: deduce rules/facts from already known
rules/facts.
Inductive: learn new rules/facts from a data set.
We will be dealing with the latter, inductive learning, now.
Inductive Learning
tries to find a function h (the hypothesis) that
approximates a set of samples defining a
function f
  the samples are usually provided as
  input-output pairs (x, f(x))
supervised learning method
relies on inductive inference, or induction
  conclusions are drawn from specific instances to more
  general statements
Hypotheses
finding a suitable hypothesis can be difficult
  since the function f is unknown, it is hard to tell if the
  hypothesis h is a good approximation
the hypothesis space describes the set of
hypotheses under consideration
  e.g. polynomials, sinusoidal functions, propositional logic,
  predicate logic, ...
  the choice of the hypothesis space can strongly influence
  the task of finding a suitable function
  while a very general hypothesis space (e.g. Turing
  machines) may be guaranteed to contain a suitable
  function, it can be difficult to find it
Ockham's razor: if multiple hypotheses are
consistent with the data, choose the simplest one
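To make this concrete, here is a minimal Python sketch (added for illustration; the sample data and the consistency tolerance are invented): the hypothesis spaces are polynomials of increasing degree, and, following Ockham's razor, we keep the simplest hypothesis consistent with the samples (x, f(x)).

```python
# Choosing the simplest polynomial hypothesis consistent (within a
# tolerance) with the observed samples (x, f(x)).
import numpy as np

xs = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
ys = np.array([1.0, 3.0, 5.0, 7.0, 9.0])   # secretly f(x) = 2x + 1

def fit_and_score(degree):
    """Fit a polynomial hypothesis h of the given degree; return (h, max error)."""
    coeffs = np.polyfit(xs, ys, degree)
    h = np.poly1d(coeffs)
    return h, np.max(np.abs(h(xs) - ys))

# Ockham's razor: scan hypothesis spaces from simplest to most complex,
# keep the first hypothesis that is consistent with the data.
for degree in range(0, 5):
    h, err = fit_and_score(degree)
    if err < 1e-6:
        print(f"simplest consistent hypothesis has degree {degree}:\n{h}")
        break
```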
Learning and Decision Trees
based on a set of attributes as input, a predicted
output value, the decision, is learned
  this is called classification learning for discrete values,
  regression for continuous values
Boolean or binary classification
  output values are true or false
  conceptually the simplest case, but still quite powerful
making decisions
  a sequence of tests is performed, testing the value of one
  of the attributes in each step
  when a leaf node is reached, its value is returned
  good correspondence to human decision-making
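Such a sequence of tests can be sketched in a few lines of Python (the tree and the attributes Patrons and Hungry are invented for this sketch, echoing the textbook's restaurant example):

```python
# A decision as a walk through a tree: one attribute test per step.
tree = ("Patrons",                      # test this attribute first
        {"None": False,                 # leaf: decision is False
         "Some": True,                  # leaf: decision is True
         "Full": ("Hungry",             # internal node: test another attribute
                  {"Yes": True,
                   "No": False})})

def decide(node, example):
    """Walk the tree, testing one attribute per step, until a leaf is reached."""
    while not isinstance(node, bool):   # leaves are Boolean values
        attribute, branches = node
        node = branches[example[attribute]]
    return node

print(decide(tree, {"Patrons": "Full", "Hungry": "Yes"}))   # True
```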


Terminology
example or sample
  describes the values of the attributes and the goal
  a positive sample has the value true for the goal predicate, a
  negative sample false
sample set
  collection of samples used for training and validation
training
  the training set consists of samples used for constructing
  the decision tree
validation
  the test set is used to determine if the decision tree
  performs correctly
  ideally, the test set is different from the training set
Expressiveness of Decision Trees
decision trees can also be expressed in logic
as implication sentences
in principle, they can express propositional
logic sentences
  each row in the truth table of a sentence can be
  represented as a path in the tree
  often there are more efficient trees
some functions require exponentially large
decision trees
  parity function, majority function
Constructing Decision Trees
in general, constructing the smallest possible
decision tree is an intractable problem
  algorithms exist for constructing reasonably small
  trees
basic idea: test the most important attribute first
  the attribute that makes the most difference for the
  classification of an example
  can be determined through information theory
  hopefully will yield the correct classification with few tests
Decision Tree Algorithm
recursive formulation
  select the best attribute to split positive and negative examples
  if only positive or only negative examples are left, we are done
  if no examples are left, no such examples were observed
    return a default value calculated from the majority classification
    at the node's parent
  if we have positive and negative examples left, but no
  attributes to split them, we are in trouble
    samples have the same description, but different classifications
    may be caused by incorrect data (noise), by a lack of information,
    or by a truly non-deterministic domain
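A minimal sketch of this recursive formulation in Python, using information gain for attribute selection (the helper names are my own; examples are (attribute-dict, goal) pairs):

```python
import math
from collections import Counter

def entropy(examples):
    counts = Counter(goal for _, goal in examples)
    total = len(examples)
    return -sum(c / total * math.log2(c / total) for c in counts.values())

def information_gain(examples, attr):
    remainder = 0.0
    for value in {ex[attr] for ex, _ in examples}:
        subset = [(ex, g) for ex, g in examples if ex[attr] == value]
        remainder += len(subset) / len(examples) * entropy(subset)
    return entropy(examples) - remainder

def majority(examples):
    return Counter(g for _, g in examples).most_common(1)[0][0]

def dtl(examples, attributes, default):
    if not examples:                  # no such examples observed: parent majority
        return default
    goals = {g for _, g in examples}
    if len(goals) == 1:               # only positive or only negative left: done
        return goals.pop()
    if not attributes:                # same description, different classes (noise)
        return majority(examples)
    best = max(attributes, key=lambda a: information_gain(examples, a))
    branches = {}
    for value in {ex[best] for ex, _ in examples}:
        subset = [(ex, g) for ex, g in examples if ex[best] == value]
        rest = [a for a in attributes if a != best]
        branches[value] = dtl(subset, rest, majority(examples))
    return (best, branches)
```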
Performance of Decision Tree Learning
quality of predictions
  predictions for the classification of unknown examples that
  agree with the correct result are obviously better
  can be measured easily after the fact
  it can be assessed in advance by splitting the available
  examples into a training set and a test set
  learn the training set, and assess the performance via the
  test set
size of the tree
  a smaller tree (especially depth-wise) is a more concise
  representation
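For illustration only, the split-and-assess methodology sketched with scikit-learn (assuming that library is available; the iris data set merely stands in for any collection of samples):

```python
# Assess predictive quality in advance: learn the training set,
# measure agreement with the correct results on the held-out test set.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

tree = DecisionTreeClassifier().fit(X_train, y_train)   # learn the training set
print("test-set accuracy:", accuracy_score(y_test, tree.predict(X_test)))
```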
Ensemble Learning
multiple hypotheses (an ensemble) are generated,
and their predictions combined
  by using multiple hypotheses, the likelihood of
  misclassification is hopefully lower
  also enlarges the hypothesis space
boosting is a frequently used ensemble method
  each example in the training set has a weight associated
  the weights of incorrectly classified examples are increased,
  and a new hypothesis is generated from this new weighted
  training set
  the final hypothesis is a weighted-majority combination of all
  the generated hypotheses
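A compact sketch of this boosting scheme (an AdaBoost-style variant; the weak learner is left abstract, and labels are assumed to be +1 or -1):

```python
import math

def boost(examples, labels, weak_learner, rounds):
    n = len(examples)
    w = [1.0 / n] * n                     # every example starts with equal weight
    ensemble = []
    for _ in range(rounds):
        h = weak_learner(examples, labels, w)    # hypothesis from weighted set
        err = sum(wi for wi, x, y in zip(w, examples, labels) if h(x) != y)
        if err == 0 or err >= 0.5:        # degenerate cases: stop early
            break
        alpha = 0.5 * math.log((1 - err) / err)  # weight of this hypothesis
        # increase the weights of misclassified examples, decrease the rest
        w = [wi * math.exp(alpha if h(x) != y else -alpha)
             for wi, x, y in zip(w, examples, labels)]
        total = sum(w)
        w = [wi / total for wi in w]      # renormalize into a distribution
        ensemble.append((alpha, h))
    # final hypothesis: weighted-majority vote of the generated hypotheses
    def vote(x):
        s = sum(alpha * (1 if h(x) == 1 else -1) for alpha, h in ensemble)
        return 1 if s >= 0 else -1
    return vote
```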
Computational Learning Theory
relies on methods and techniques from theoretical
computer science, statistics, and AI
used for the formal analysis of learning algorithms
basic principles
  if a hypothesis is seriously wrong, it will most likely generate
  a false prediction even for small numbers of examples
  if a hypothesis is consistent with a large number of examples,
  most likely it is quite good, or probably approximately correct
Probably Approximately Correct (PAC) Learning
approximately correct hypothesis
  its error lies within a small constant of the true result
  by testing a sufficient number of examples, one can see if a
  hypothesis has a high probability of being approximately
  correct
stationary assumption
  training and test sets follow the same probability distribution
  there is a connection between the past (known) and the
  future (unknown)
  a selection of non-representative examples will not result in
  good learning
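For reference, the standard sample-complexity bound behind "a sufficient number of examples" (added here, not printed on the original slide; |H| is the size of the hypothesis space, ε the error bound, δ the failure probability):

```latex
m \;\ge\; \frac{1}{\epsilon}\left(\ln\frac{1}{\delta} + \ln\lvert H\rvert\right)
```

A hypothesis consistent with that many examples is, with probability at least 1 - δ, approximately correct, i.e. its error is at most ε.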
Knowledge in Learning
A logical formulation of learning
What are the goal and hypotheses?
  goal predicate Q, e.g. WillWait
  learning is to find an equivalent logical expression
  with which we can classify examples
  each hypothesis proposes such an expression,
  a candidate definition of Q
A logical formulation of learning (cont.)
the hypothesis space is the set of all
hypotheses the learning algorithm is
designed to entertain
one of the hypotheses is correct:
  H1 ∨ H2 ∨ ... ∨ Hn
each Hi predicts a certain set of examples,
the extension of the goal predicate
two hypotheses with different extensions
are logically inconsistent with each other;
otherwise, they are logically equivalent
What are Examples?
an example is an object of some logical
description to which the goal concept
may or may not apply
  Alt(X1) ∧ ¬Bar(X1) ∧ ¬Fri/Sat(X1) ∧ ...
ideally, we want to find a hypothesis
that agrees with all the examples
the possible relations between f and h are: ++, --,
+- (false negative), and -+ (false positive); if
either of the last two occurs, the example and h
are logically inconsistent
Inductive learning in the logical setting
the objective is to find a hypothesis that explains
the classifications of the examples, given their
descriptions
  Hypothesis ∧ Descriptions ⊨ Classifications
  Hypothesis: unknown, explains the observations
  Descriptions: the conjunction of all the example
  descriptions
  Classifications: the conjunction of all the
  example classifications
knowledge-free learning
  e.g. decision trees: the hypothesis directly maps
  Descriptions to Classifications
A cumulative learning process
observations, knowledge-based learning, hypotheses,
and prior knowledge
the new approach is to design agents that
already know something and are trying to
learn some more
intuitively, this should be faster and better
than learning without knowledge, assuming
what's known is always correct
how can we implement this cumulative learning
with increasing knowledge?
Some examples of using knowledge
one can leap to general conclusions
after only one observation
  have you had such an experience?
  traveling to Brazil: language and name
  a pharmacologically ignorant but
  diagnostically sophisticated medical
  student
Inductive logic programming (ILP)
ILP can formulate hypotheses in general
first-order logic
  other methods, like decision trees, use more
  restricted languages
prior knowledge is used to reduce the
complexity of learning:
  prior knowledge further reduces the hypothesis space
  prior knowledge helps find a shorter hypothesis
  again, assuming the prior knowledge is correct
Explanation-based learning
a method to extract general rules from individual
observations
the goal is to solve a similar problem faster next
time
memoization: speed up by saving results and
avoiding solving a problem from scratch
EBL goes one step further: from observations
to rules
Learning using relevant information
prior knowledge: people in a country
usually speak the same language
  Nat(x,n) ∧ Nat(y,n) ∧ Lang(x,l) ⇒ Lang(y,l)
observation: given nationality, the language
is fully determined
given that Fernando is Brazilian and speaks Portuguese
  Nat(Fernando,B) ∧ Lang(Fernando,P)
we can logically conclude
  Nat(y,B) ⇒ Lang(y,P)
Inductive logic programming
it combines inductive methods with FOL
ILP represents theories as logic programs
ILP offers complete algorithms for inducing
general, first-order theories from examples
it can learn successfully in domains where
attribute-based algorithms fail completely
an example: a typical family tree (Fig 19.11)
Statistical Learning Methods
Statistical Learning
data: instantiations of some or all of the random
variables describing the domain; they are evidence
hypotheses: probabilistic theories of how the
domain works
the surprise candy example: two flavors in very
large bags of 5 kinds, indistinguishable from
outside
  h1: 100% cherry, P(c|h1) = 1, P(l|h1) = 0
  h2: 75% cherry + 25% lime
  h3: 50% cherry + 50% lime
  h4: 25% cherry + 75% lime
  h5: 100% lime
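A short Bayesian-learning sketch for this candy example (the prior probabilities are an assumption, following the textbook's version of the example): the posterior over h1..h5 is renormalized after each observed lime candy.

```python
# Posterior P(hi | d) after observing a run of lime candies.
priors = [0.1, 0.2, 0.4, 0.2, 0.1]        # P(h1) .. P(h5), assumed
p_lime = [0.0, 0.25, 0.5, 0.75, 1.0]      # P(lime | hi) from the slide

posterior = priors[:]
for _ in range(10):                        # observe 10 lime candies in a row
    posterior = [p * pl for p, pl in zip(posterior, p_lime)]
    total = sum(posterior)
    posterior = [p / total for p in posterior]

print([round(p, 4) for p in posterior])    # mass concentrates on h5 (all lime)
```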
Learning with Complete Data
parameter learning: find the numerical parameters
for a probability model whose structure is fixed
data are complete when each data point contains
values for every variable in the model
maximum-likelihood (ML) parameter learning: discrete
model
  two examples: one and three parameters
  (treated in the next slides, from the book authors' slides)
with complete data, the ML parameter learning problem
for a Bayesian network decomposes into separate learning
problems, one for each parameter
a significant problem with ML learning: a probability of 0
may be assigned to events that have not been observed
  various tricks are used to avoid this problem; one is to
  assign count = 1 instead of 0 to each event
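A tiny sketch of that count = 1 trick (Laplace smoothing) for a discrete maximum-likelihood estimate (function and variable names are invented):

```python
# ML estimate of a discrete distribution, with pseudocounts so that
# unobserved events do not receive probability 0.
from collections import Counter

def ml_estimate(observations, domain, pseudocount=1):
    counts = Counter(observations)
    total = len(observations) + pseudocount * len(domain)
    return {v: (counts[v] + pseudocount) / total for v in domain}

# 'lime' was never observed, yet it does not get probability 0:
print(ml_estimate(["cherry"] * 9, domain=["cherry", "lime"]))
# {'cherry': 10/11 ~ 0.909, 'lime': 1/11 ~ 0.091}
```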
Learning with Hidden Variables
many real-world problems have hidden
variables, which are not observable in the
data available for learning
question: if a variable (e.g. a disease) is not
observed, why not construct a model
without it?
answer: hidden variables can dramatically
reduce the number of parameters required
to specify a Bayesian network; this reduces
the amount of data needed for learning
Learning with Hidden Variables (cont.): the EM algorithm
the E-step computes the expected value pij of the
hidden indicator variables Zij, where Zij is 1 if xj
was generated by the i-th component, 0 otherwise
the M-step finds the new values of the parameters
that maximize the log likelihood of the data,
given the expected values of Zij
  Fig 20.8(c) is learned from Fig 20.8(a)
  Fig 20.9(a) plots the log likelihood of the data as EM
  progresses
EM increases the log likelihood of the data at each
iteration
EM can reach a local maximum in likelihood
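A minimal EM sketch for a two-component, one-dimensional Gaussian mixture, following the E-step/M-step description above (the data, initialization, and iteration count are invented):

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-2, 1, 200), rng.normal(3, 1, 200)])

mu = np.array([-1.0, 1.0])                 # initial guesses
sigma = np.array([1.0, 1.0])
w = np.array([0.5, 0.5])                   # mixing weights

def gauss(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

for _ in range(50):
    # E-step: expected values p_ij of the hidden indicators Z_ij
    p = np.stack([w[i] * gauss(x, mu[i], sigma[i]) for i in range(2)])
    p /= p.sum(axis=0)
    # M-step: parameters maximizing expected log likelihood given p_ij
    n = p.sum(axis=1)
    mu = (p * x).sum(axis=1) / n
    sigma = np.sqrt((p * (x - mu[:, None]) ** 2).sum(axis=1) / n)
    w = n / len(x)

print(mu, sigma, w)   # the log likelihood of the data rises at each iteration
```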
Instance-based Learning
parametric vs. nonparametric learning
  parametric learning focuses on fitting the parameters of a
  restricted family of probability models to an
  unrestricted data set
  parametric learning methods are often simple and
  effective, but can oversimplify what's really
  happening
  nonparametric learning allows the hypothesis
  complexity to grow with the data
IBL is nonparametric, as it constructs hypotheses
directly from the training data
Nearest-neighbor models
the key idea: neighbors are similar
density estimation example: estimate x's probability
density by the density of its neighbors
  connects with table lookup, NBC, decision trees, ...
how to define the neighborhood N
  if too small, it contains no data points
  if too big, the density is the same everywhere
  a solution is to define N to contain k points, where k
  is large enough to ensure a meaningful estimate
  for a fixed k, the size of N varies (Fig 21.12)
  the effect of the size of k (Fig 21.13)
  for most low-dimensional data, k is usually between 5 and 10
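A sketch of k-nearest-neighbor density estimation in one dimension (all values invented): the density at x is estimated as k divided by the number of data points times the length of the smallest interval around x that contains exactly k points.

```python
import numpy as np

rng = np.random.default_rng(1)
data = rng.normal(0.0, 1.0, 1000)       # samples from an unknown density

def knn_density(x, data, k=10):
    dists = np.sort(np.abs(data - x))
    r = dists[k - 1]                     # radius of the neighborhood N holding k points
    return k / (len(data) * 2 * r)       # k points in an interval of length 2r

print(knn_density(0.0, data))            # near the true N(0,1) density ~0.399
print(knn_density(3.0, data))            # much smaller out in the tail
```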
Learning in Neural Networks
  Introduction
  Neurons and the Brain
  Neural Networks
  Perceptrons
  Multi-layer Networks
  Applications
What is a neural network?
a machine designed to model the way in
which the brain performs a particular task
or function of interest; the network is
implemented by using electronic
components or is simulated in software on
a digital computer [1]
viewed as an adaptive machine: a
massively parallel, distributed processor
made up of simple processing units, which
has a propensity for storing experiential
knowledge and making it available for use
[2]
Neural Networks
complex networks of simple computing
elements
capable of learning from examples
  with appropriate learning methods
a collection of simple elements performs
high-level operations
  thought
  reasoning
  consciousness: being able to use your senses and
  mental powers to understand what is happening
Neural Networks and the Brain
brain
  a set of interconnected modules
  performs information processing
  operations at various levels
    sensory input analysis
    memory storage and retrieval
    reasoning
    feelings
    consciousness
neurons
  basic computational elements
  heavily interconnected with
  other neurons
[Russell & Norvig, 1995]
Neuron Diagram
  soma: cell body
  dendrites: incoming branches
  axon: outgoing branch
  synapse: junction between a dendrite and an axon from
  another neuron
[Russell & Norvig, 1995]
Computer vs. Brain

                     Computer                         Brain
Computational units  1-1000 CPUs, 10^7 gates/CPU      10^11 neurons
Storage units        10^10 bits RAM, 10^11 bits disk  10^11 neurons, 10^14 synapses
Cycle time           10^-9 sec (1 GHz)                10^-3 sec (1 kHz)
Bandwidth            10^9 bits/sec                    10^14 bits/sec
Neuron updates/sec   10^5                             10^14
Computer Brain vs. Cat Brain
in 2009, IBM announced a supercomputer simulation
billed as significantly smarter than a cat
  IBM announced a software simulation of a mammalian
  cerebral cortex that is significantly more complex than
  the cortex of a cat
  and, just like the actual brain that it simulates,
  researchers still have to figure out how it works
Artificial Neuron Diagram
[Russell & Norvig, 1995]
weighted inputs are summed up by the input function
the (nonlinear) activation function calculates the
activation value, which determines the output
Common Activation Functions
[Russell & Norvig, 1995]
  Step_t(x) = 1 if x >= t, else 0
  Sign(x) = +1 if x >= 0, else -1
  Sigmoid(x) = 1 / (1 + e^(-x))
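The activation functions above, plus the weighted input sum, written out in Python for concreteness (a sketch, not from the slides):

```python
import math

def step(x, t=0.0):                       # Step_t(x)
    return 1 if x >= t else 0

def sign(x):                              # Sign(x)
    return 1 if x >= 0 else -1

def sigmoid(x):                           # Sigmoid(x)
    return 1 / (1 + math.exp(-x))

def unit_output(weights, inputs, activation=sigmoid):
    """Artificial neuron: activation function applied to the weighted input sum."""
    return activation(sum(w * i for w, i in zip(weights, inputs)))
```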
Neural Networks and Logic Gates
[Russell & Norvig, 1995]
simple neurons can act as logic gates with an
appropriate choice of activation function, threshold,
and weights
  e.g. a step function as the activation function
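A minimal illustration of the point: with weights (1, 1), a step-function neuron computes AND at threshold 1.5 and OR at threshold 0.5 (values chosen for this sketch):

```python
def gate(inputs, weights, threshold):
    # step-function neuron: fire iff the weighted input sum reaches the threshold
    return 1 if sum(w * i for w, i in zip(weights, inputs)) >= threshold else 0

for a in (0, 1):
    for b in (0, 1):
        print(a, b, "AND:", gate((a, b), (1, 1), 1.5),
                    "OR:", gate((a, b), (1, 1), 0.5))
```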
Perceptrons
an artificial network which is
intended to copy the brain's
ability to recognize things and
see the differences between
things
single-layer, feed-forward
network
historically one of the first
types of neural networks
  late 1950s
the output is calculated as
a step function applied to
the weighted sum of inputs
capable of learning simple,
linearly separable functions
[Russell & Norvig, 1995]
Perceptrons and Linear Separability
[Figure: the four Boolean inputs (0,0), (0,1), (1,0), (1,1)
plotted for AND and for XOR]
perceptrons can deal with linearly separable functions
some simple functions are not linearly separable
  e.g. the XOR function
[Russell & Norvig, 1995]
Perceptrons and Linear Separability (cont.)
[Russell & Norvig, 1995]
linear separability can be extended to more than two dimensions
  but it is more difficult to visualize
Perceptrons and Learning
perceptrons can learn from examples through a
simple learning rule
  calculate the error of a unit Erri as the difference between the
  correct output Ti and the calculated output Oi
    Erri = Ti - Oi
  adjust the weight Wij of the input Iij such that the error
  decreases
    Wij := Wij + α · Iij · Erri
    α is the learning rate
this is a gradient descent search through the weight space
perceptrons led to great enthusiasm in the late 50s and early
60s, until Minsky & Papert in 1969 analyzed the class of
representable functions and found the linear separability problem
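The learning rule in code, trained on the linearly separable AND function (a sketch; the bias weight, learning rate, and epoch count are invented):

```python
def train_perceptron(samples, alpha=0.1, epochs=50):
    w = [0.0, 0.0, 0.0]                   # w[0] is a bias weight (input fixed at 1)
    for _ in range(epochs):
        for (x1, x2), target in samples:
            inputs = [1.0, x1, x2]
            # step-function output on the weighted input sum
            output = 1 if sum(wj * ij for wj, ij in zip(w, inputs)) >= 0 else 0
            err = target - output         # Err = T - O
            # Wj := Wj + alpha * Ij * Err
            w = [wj + alpha * ij * err for wj, ij in zip(w, inputs)]
    return w

and_samples = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
print(train_perceptron(and_samples))      # weights implementing AND
```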
Multi-Layer Networks
research in more complex networks with more
than one layer was very limited until the 1980s
  learning in such networks is much more complicated
  the problem is to assign the blame for an error to the
  respective units and their weights in a constructive way
the back-propagation learning algorithm can be
used to facilitate learning in multi-layer networks
Diagram Multi-Layer Network
[Figure: a two-layer network with input units Ik, hidden units aj,
and output units Oi; weights Wkj connect inputs to hidden units,
weights Wji connect hidden units to outputs]
  two-layer network
    input units Ik are usually not counted as a separate layer
    hidden units aj
    output units Oi
  usually all nodes of one layer have weighted
  connections to all nodes of the next layer
Back-Propagation Algorithm
assigns blame to individual units in the respective
layers
  essentially based on the connection strength
  proceeds from the output layer to the hidden layer(s)
  updates the weights of the units leading to the layer
essentially performs gradient-descent search on the
error surface
  relatively simple since it relies only on local information
  from directly connected units
  has convergence and efficiency problems
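A minimal back-propagation sketch (network size, learning rate, and iteration count are invented): one hidden layer of sigmoid units trained by gradient descent on XOR, with blame propagated from the output layer back to the hidden layer.

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
T = np.array([[0], [1], [1], [0]], dtype=float)   # XOR targets

W1, W2 = rng.normal(0, 1, (2, 2)), rng.normal(0, 1, (2, 1))
b1, b2 = np.zeros(2), np.zeros(1)
sig = lambda x: 1 / (1 + np.exp(-x))

for _ in range(10000):
    # forward pass
    a1 = sig(X @ W1 + b1)                 # hidden activations
    out = sig(a1 @ W2 + b2)               # output activations
    # backward pass: assign blame, output layer first, then hidden layer
    d_out = (out - T) * out * (1 - out)
    d_hid = (d_out @ W2.T) * a1 * (1 - a1)
    # gradient-descent updates on the error surface
    W2 -= 0.5 * a1.T @ d_out
    b2 -= 0.5 * d_out.sum(axis=0)
    W1 -= 0.5 * X.T @ d_hid
    b1 -= 0.5 * d_hid.sum(axis=0)

# should approach [[0], [1], [1], [0]]; backprop can get stuck in local minima
print(out.round(2))
```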
Capabilities of Multi-Layer Neural Networks
expressiveness
  weaker than predicate logic
  good for continuous inputs and outputs
computational efficiency
  training time can be exponential in the number of inputs
  depends critically on parameters like the learning rate
  local minima are problematic
    can be overcome by simulated annealing, at additional cost
generalization
  works reasonably well for some functions (classes of
  problems)
  no formal characterization of these functions
Capabilities of Multi-Layer Neural Networks (cont.)
sensitivity to noise
  very tolerant
  they perform nonlinear regression
transparency
  neural networks are essentially black boxes
  there is no explanation or trace for a particular answer
  tools for the analysis of networks are very limited
  some limited methods to extract rules from networks
prior knowledge
  very difficult to integrate, since the internal representation
  of the networks is not easily accessible
Applications
domains and tasks where neural networks are
successfully used
  handwriting recognition
  control problems
    juggling, truck backup problem
  series prediction
    weather, financial forecasting
  categorization
    sorting of items (fruit, characters, phonemes, ...)
