
UNIT V: LEARNING

LEARNING
Learning from Observation, Inductive Learning, Decision Trees, Explanation-based Learning, Statistical Learning Methods, Reinforcement Learning

Learning
An agent tries to improve its behavior through observation, reasoning, or reflection
learning from experience
memorization of past percepts, states, and actions; generalization and identification of similar experiences

forecasting
prediction of changes in the environment

theories
generation of complex models based on observations and reasoning

Learning from Observation


Learning Agents, Inductive Learning, Learning Decision Trees


Learning Agents
based on previous agent designs, such as reflex, model-based, and goal-based agents
those aspects of agents are encapsulated into the performance element of a learning agent

a learning agent has an additional learning element


usually used in combination with a critic and a problem generator for better learning

most agents learn from examples


inductive learning

Learning Agent Model


[Figure: learning agent model. The critic compares percepts from the sensors against a performance standard and gives feedback to the learning element; the learning element makes changes to the performance element and sets learning goals for the problem generator; the performance element selects actions carried out by the effectors; the agent interacts with the environment through sensors and effectors.]

Forms of Learning
supervised learning
an agent tries to find a function that matches examples from a sample set
each example provides an input together with the correct output

a teacher provides feedback on the outcome


the teacher can be an outside entity, or part of the environment

unsupervised learning
the agent tries to learn from patterns without corresponding output values

reinforcement learning
the agent does not know the exact output for an input, but it receives feedback on the desirability of its behavior
the feedback can come from an outside entity, the environment, or the agent itself
the feedback may be delayed, and not follow the respective action immediately


Feedback
provides information about the actual outcome of actions
supervised learning
both the input and the output of a component can be perceived by the agent directly
the output may be provided by a teacher

reinforcement learning
feedback concerning the desirability of the agent's behavior is available
not in the form of the correct output
may not be directly attributable to a particular action
feedback may occur only after a sequence of actions

Prior Knowledge
background knowledge available before a task is tackled can increase performance or decrease learning time considerably
many learning schemes assume that no prior knowledge is available
in reality, some prior knowledge is almost always available
but often in a form that is not immediately usable by the agent

Inductive Learning
tries to find a function h (the hypothesis) that approximates a set of samples defining a function f
the samples are usually provided as input-output pairs (x, f(x))

supervised learning method relies on inductive inference, or induction


conclusions are drawn from specific instances to more general statements

Hypotheses
finding a suitable hypothesis can be difficult
since the function f is unknown, it is hard to tell if the hypothesis h is a good approximation

the hypothesis space describes the set of hypotheses under consideration


e.g. polynomials, sinusoidal functions, propositional logic, predicate logic, ...
the choice of the hypothesis space can strongly influence the task of finding a suitable function
while a very general hypothesis space (e.g. Turing machines) may be guaranteed to contain a suitable function, it can be difficult to find it
Ockham's razor: if multiple hypotheses are consistent with the observations, prefer the simplest one

Example Inductive Learning 1

[Figure: input-output pairs (x, f(x)) displayed as points in a plane]

the task is to find a hypothesis (a function) that connects the points
either all of them, or most of them
various performance measures
number of points connected, minimal surface, lowest tension

Example Inductive Learning 2

[Figure: a piecewise-linear hypothesis through the points (x, f(x))]

hypothesis is a function consisting of linear segments
fully incorporates all sample pairs (goes through all points)
very easy to calculate
has discontinuities at the joints of the segments
moderate predictive performance

Example Inductive Learning 3

[Figure: a polynomial hypothesis through the points (x, f(x))]

hypothesis expressed as a polynomial function
incorporates all samples
more complicated to calculate than linear segments
no discontinuities
better predictive power

Example Inductive Learning 4

[Figure: a single linear hypothesis approximating the points (x, f(x))]

hypothesis is a linear function
does not incorporate all samples
extremely easy to compute
low predictive power
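The trade-offs in Examples 2-4 can be made concrete with a small sketch. This is illustrative Python with invented sample points, not part of the original slides.

```python
import numpy as np

x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
fx = np.array([1.0, 2.2, 1.8, 3.5, 3.0])   # invented sample values for f(x)

# Example 4: a single linear function; easy to compute, misses most points
h_linear = np.poly1d(np.polyfit(x, fx, deg=1))

# Example 3: a degree-(n-1) polynomial passes through all n points
h_poly = np.poly1d(np.polyfit(x, fx, deg=len(x) - 1))

# Example 2: linear segments through all points (piecewise interpolation)
def h_segments(q):
    return np.interp(q, x, fx)

for q in (1.5, 2.5):
    print(q, h_linear(q), h_poly(q), h_segments(q))
```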

Learning and Decision Trees


based on a set of attributes as input, a predicted output value (the decision) is learned
it is called classification learning for discrete values, regression for continuous values

Boolean or binary classification


output values are true or false
conceptually the simplest case, but still quite powerful

making decisions
a sequence of tests is performed, testing the value of one of the attributes in each step; when a leaf node is reached, its value is returned

Boolean Decision Trees


compute yes/no decisions based on sets of desirable or undesirable properties of an object or a situation
each node in the tree reflects one yes/no decision based on a test of the value of one property of the object
the root node is the starting point
leaf nodes represent the possible final decisions

branches are labeled with possible values

the learning aspect is to predict the value of a goal predicate (also called goal concept)

Terminology
example or sample
describes the values of the attributes and the goal
a positive sample has the value true for the goal predicate, a negative sample has false

sample set
collection of samples used for training and validation

training
the training set consists of samples used for constructing the decision tree

validation
the test set is used to determine if the decision tree performs adequately on samples not used during training

Restaurant Sample Set


Example  Alt  Bar  Fri  Hun  Pat   Price  Rain  Res  Type     Est    WillWait (Goal)
X1       Yes  No   No   Yes  Some  $$$    No    Yes  French   0-10   Yes
X2       Yes  No   No   Yes  Full  $      No    No   Thai     30-60  No
X3       No   Yes  No   No   Some  $      No    No   Burger   0-10   Yes
X4       Yes  No   Yes  Yes  Full  $      Yes   No   Thai     10-30  Yes
X5       Yes  No   Yes  No   Full  $$$    No    Yes  French   >60    No
X6       No   Yes  No   Yes  Some  $$     Yes   Yes  Italian  0-10   Yes
X7       No   Yes  No   No   None  $      Yes   No   Burger   0-10   No
X8       No   No   No   Yes  Some  $$     Yes   Yes  Thai     0-10   Yes
X9       No   Yes  Yes  No   Full  $      Yes   No   Burger   >60    No
X10      Yes  Yes  Yes  Yes  Full  $$$    No    Yes  Italian  10-30  No
X11      No   No   No   No   None  $      No    No   Thai     0-10   No
X12      Yes  Yes  Yes  Yes  Full  $      No    No   Burger   30-60  Yes

Decision Tree Example


[Figure: a decision tree for the question "To wait, or not to wait?". The root tests Patrons?; depending on the outcome, further tests such as EstWait?, Hungry?, Alternative?, Bar?, Walkable?, and Driveable? are applied until a Yes/No leaf is reached.]

Learning Decision Trees


Problem: find a decision tree that agrees with the training set
trivial solution: construct a tree with one branch for each sample of the training set
works perfectly for the samples in the training set
may not work well for new samples (generalization)
results in relatively large trees

better solution: find a concise tree that still agrees with all samples

Ockham's Razor
The most likely hypothesis is the simplest one that is consistent with all observations.
general principle for inductive learning
a simple hypothesis that is consistent with all observations is more likely to be correct than a complex one


Constructing Decision Trees


in general, constructing the smallest possible decision tree is an intractable problem
algorithms exist for constructing reasonably small trees
basic idea: test the most important attribute first
the attribute that makes the most difference for the classification of an example
can be determined through information theory (see the sketch below)
hopefully this will yield the correct classification with a small number of tests
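A hedged sketch of that idea in Python: entropy-based information gain as the measure of attribute importance. The representation (samples as dictionaries keyed by attribute name, with the goal as one key) is an assumption for illustration.

```python
import math
from collections import Counter

def entropy(labels):
    # information content of a label distribution, in bits
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(samples, attribute, goal="WillWait"):
    # entropy reduction achieved by splitting the samples on the attribute
    labels = [s[goal] for s in samples]
    remainder = 0.0
    for value in {s[attribute] for s in samples}:
        subset = [s[goal] for s in samples if s[attribute] == value]
        remainder += len(subset) / len(samples) * entropy(subset)
    return entropy(labels) - remainder
```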

Decision Tree Algorithm


recursive formulation
select the best attribute to split positive and negative examples
if only positive or only negative examples are left, we are done
if no examples are left, no such examples were observed
return a default value calculated from the majority classification at the node's parent
if we have positive and negative examples left, but no attributes to split them, we are in trouble
the data are noisy or the attributes insufficient; a common fallback is a majority vote
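The recursion above, sketched in Python; it reuses the information_gain helper from the previous section and the same dictionary-based sample representation, which are assumptions for illustration.

```python
from collections import Counter

def majority(samples, goal):
    # most common goal value among the samples
    return Counter(s[goal] for s in samples).most_common(1)[0][0]

def learn_tree(samples, attributes, goal, default=None):
    if not samples:                      # no examples observed:
        return default                   # majority value from the node's parent
    if len({s[goal] for s in samples}) == 1:
        return samples[0][goal]          # only positive or only negative left
    if not attributes:                   # out of attributes: fall back to majority
        return majority(samples, goal)
    best = max(attributes, key=lambda a: information_gain(samples, a, goal))
    tree = {best: {}}
    for value in {s[best] for s in samples}:
        subset = [s for s in samples if s[best] == value]
        rest = [a for a in attributes if a != best]
        tree[best][value] = learn_tree(subset, rest, goal, majority(samples, goal))
    return tree
```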

Performance of Decision Tree Learning


quality of predictions
predictions for the classification of unknown examples that agree with the correct result are obviously better
can be measured easily after the fact
can be assessed in advance by splitting the available examples into a training set and a test set
learn the training set, and assess the performance via the test set
size of the tree

Noise and Over-fitting


the presence of irrelevant attributes (noise) may lead to more degrees of freedom in the decision tree
the hypothesis space is unnecessarily large

overfitting makes use of irrelevant attributes to distinguish between samples that have no meaningful differences
e.g. using the day of the week when predicting the outcome of rolling dice
over-fitting is a general problem for all learning algorithms

decision tree pruning identifies attributes that are likely to be irrelevant


very low information gain

cross-validation splits the sample data in different training and test sets
results are averaged

Ensemble Learning
Multiple hypotheses (an ensemble) are generated, and their predictions combined
by using multiple hypotheses, the likelihood for misclassification is hopefully lower
it also enlarges the hypothesis space

Boosting is a frequently used ensemble method


each example in the training set has an associated weight
the weights of incorrectly classified examples are increased, and a new hypothesis is generated from this new weighted training set
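A loose AdaBoost-style sketch of that reweighting loop; weak_learn stands in for any weak learner that accepts example weights and is an assumption, not a specific library call.

```python
import math

def boost(samples, labels, weak_learn, rounds):
    n = len(samples)
    w = [1.0 / n] * n                       # start with equal weights
    ensemble = []
    for _ in range(rounds):
        h = weak_learn(samples, labels, w)  # hypothesis trained on weighted set
        err = sum(wi for wi, x, y in zip(w, samples, labels) if h(x) != y)
        if err == 0 or err >= 0.5:
            break
        alpha = 0.5 * math.log((1 - err) / err)
        # increase the weights of incorrectly classified examples, decrease the rest
        w = [wi * math.exp(alpha if h(x) != y else -alpha)
             for wi, x, y in zip(w, samples, labels)]
        total = sum(w)
        w = [wi / total for wi in w]        # renormalize
        ensemble.append((alpha, h))
    return ensemble
```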

Explanation-Based Learning
Learning complex concepts using induction procedures typically requires a substantial number of training instances, but people seem to be able to learn quite a bit from single examples. An EBL system attempts to learn from a single example x by explaining why x is an example of the target concept. The explanation is then generalized, and the system's performance is improved through the availability of this knowledge.

EBL
An EBL program accepts the following as input:
a training example
a goal concept: a high-level description of what the program is supposed to learn
an operationality criterion: a description of which concepts are usable
a domain theory: a set of rules that describe relationships between objects and actions in a domain

From this EBL computes a generalization of the training example that is sufficient to describe the goal concept, and also satisfies the operationality criterion.
Explanation-based generalization (EBG) is an algorithm for EBL with two steps: (1) explain, (2) generalize. During the explanation step, all the unimportant aspects of the training example are pruned away with respect to the goal concept, yielding an explanation. The next step is to generalize the explanation as far as possible while still describing the goal concept.

Statistical Learning Methods

Statistical Learning
Data: instantiations of some or all of the random variables describing the domain; they are evidence
Hypotheses: probabilistic theories of how the domain works
The surprise candy example: candy comes in two flavors (cherry and lime), sold in very large bags of 5 kinds that are indistinguishable from outside
h1: 100% cherry, i.e. P(cherry|h1) = 1, P(lime|h1) = 0
h2: 75% cherry + 25% lime
h3: 50% cherry + 50% lime
h4: 25% cherry + 75% lime
h5: 100% lime

Problem formulation
Given a new bag, the random variable H denotes the bag type (h1 ... h5); Di is a random variable for the flavor of the i-th candy (cherry or lime); after seeing D1, D2, ..., DN, predict the flavor (value) of DN+1.

Bayesian learning
Calculates the probability of each hypothesis given the data, and makes predictions on that basis
P(hi|d) = α P(d|hi) P(hi), where d are the observed values of D and α is a normalization constant
predictions use a likelihood-weighted average over the hypotheses
the hi are intermediaries between raw data and predictions
no need to pick one best-guess hypothesis
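A small sketch of this update for the candy example; the priors (0.1, 0.2, 0.4, 0.2, 0.1) are the ones used in the textbook version of the example.

```python
priors = {"h1": 0.1, "h2": 0.2, "h3": 0.4, "h4": 0.2, "h5": 0.1}
p_lime = {"h1": 0.0, "h2": 0.25, "h3": 0.5, "h4": 0.75, "h5": 1.0}

def update(posterior, flavor):
    # P(hi|d) = alpha * P(d|hi) * P(hi)
    unnorm = {h: (p_lime[h] if flavor == "lime" else 1 - p_lime[h]) * p
              for h, p in posterior.items()}
    alpha = 1.0 / sum(unnorm.values())
    return {h: alpha * p for h, p in unnorm.items()}

posterior = priors
for candy in ["lime"] * 5:    # unwrap five limes in a row
    posterior = update(posterior, candy)
print(posterior)              # h5 (all lime) now dominates
```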

Learning with Complete Data


Parameter learning - to find the numerical parameters for a probability model whose structure is fixed

Data are complete when each data point contains values for every variable in the model
Maximum-likelihood parameter learning: discrete model
with complete data, the ML parameter learning problem decomposes into separate learning problems, one for each parameter

Naive Bayes models


the most common Bayesian network model used in machine learning
it assumes that the attributes are conditionally independent of each other, given the class
a deterministic prediction can be obtained by choosing the most likely class
P(C | x1, x2, ..., xn) = α P(C) Πi P(xi | C)

NBC has no difficulty with noisy data
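A minimal sketch of such a classifier over discrete attributes; the Laplace smoothing is an added detail (one standard way to keep unseen values from zeroing the product), not something stated on the slide.

```python
import math
from collections import Counter, defaultdict

def train_nb(samples, labels):
    class_counts = Counter(labels)
    value_counts = defaultdict(Counter)      # (attribute, class) -> value counts
    for x, c in zip(samples, labels):
        for a, v in x.items():
            value_counts[(a, c)][v] += 1
    return class_counts, value_counts

def predict_nb(model, x):
    class_counts, value_counts = model
    n = sum(class_counts.values())
    scores = {}
    for c, cc in class_counts.items():
        score = math.log(cc / n)             # log P(C)
        for a, v in x.items():
            counts = value_counts[(a, c)]
            # log P(xi|C), with crude Laplace smoothing
            score += math.log((counts[v] + 1) / (cc + len(counts) + 1))
        scores[c] = score
    return max(scores, key=scores.get)       # most likely class
```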

Learning with Hidden Variables


Many real-world problems have hidden variables, which are not observable in the data available for learning.
Question: if a variable (e.g. a disease) is not observed, why not construct a model without it?
Answer: hidden variables can dramatically reduce the number of parameters required to specify a Bayesian network, which in turn reduces the amount of data needed for learning.

EM: Learning mixtures of Gaussians


The unsupervised clustering problem: if we knew which component generated each xj, we could recover the parameters of that component
if we knew the parameters of each component, we would know which ci each xj should belong to
however, we know neither

EM expectation and maximization


pretend we know the parameters of the model, and then infer the probability that each xj was generated by each component

E-step: computes the expected value pij of the hidden indicator variables Zij, where Zij is 1 if xj was generated by the i-th component, 0 otherwise
M-step: finds the new values of the parameters that maximize the log likelihood of the data, given the expected values of Zij
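An illustrative EM loop for a mixture of 1-D Gaussians along those lines, using NumPy; the initialization and iteration count are arbitrary choices for the sketch.

```python
import numpy as np

def em_gmm(x, k, iters=50):
    rng = np.random.default_rng(0)
    mu = rng.choice(x, size=k)               # "pretend we know the parameters"
    var = np.full(k, np.var(x))
    w = np.full(k, 1.0 / k)
    for _ in range(iters):
        # E-step: p[i, j] = expected indicator that component i generated x_j
        p = np.array([w[i] * np.exp(-(x - mu[i]) ** 2 / (2 * var[i]))
                      / np.sqrt(2 * np.pi * var[i]) for i in range(k)])
        p /= p.sum(axis=0)
        # M-step: re-estimate parameters to maximize the expected log likelihood
        n_i = p.sum(axis=1)
        mu = (p @ x) / n_i
        var = (p * (x[None, :] - mu[:, None]) ** 2).sum(axis=1) / n_i
        w = n_i / len(x)
    return mu, var, w
```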

Instance-based Learning
Parametric vs. nonparametric learning
parametric learning focuses on fitting the parameters of a restricted family of probability models to an unrestricted data set
parametric learning methods are often simple and effective, but can oversimplify what's really happening
nonparametric learning allows the hypothesis complexity to grow with the data
IBL is nonparametric, as it constructs hypotheses directly from the training data

Nearest-neighbor models
The key idea: Neighbors are similar
density estimation example: estimate x's probability density by the density of its neighbors
connections with table lookup, NBC, decision trees, ...

How to define the neighborhood N


if too small, it contains no data points; if too big, the estimated density is the same everywhere
a solution is to define N to contain k points, where k is large enough to ensure a meaningful estimate
for a fixed k, the size of N varies; the choice of k affects the estimate
for most low-dimensional data, k is usually between 5 and 10

K-NN for a given query x


which data point is nearest to x?
we need a distance metric D(x1, x2); Euclidean distance DE is a popular one
when each dimension measures something different, it is inappropriate to use DE unmodified (the dimension with the largest scale dominates), so it is important to standardize the scale for each dimension
Mahalanobis distance is one solution

discrete features should be dealt with differently
e.g. Hamming distance
use k-NN to predict the value for a given query x, as in the sketch below
high dimensionality poses another problem (the curse of dimensionality)
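A small k-NN sketch with a Euclidean metric; standardizing each dimension first is one answer to the scale problem raised above.

```python
import math
from collections import Counter

def standardize(points):
    # rescale each dimension to zero mean and unit variance
    cols = list(zip(*points))
    means = [sum(c) / len(c) for c in cols]
    stds = [(sum((v - m) ** 2 for v in c) / len(c)) ** 0.5 or 1.0
            for c, m in zip(cols, means)]
    return [tuple((v - m) / s for v, m, s in zip(p, means, stds)) for p in points]

def knn_predict(train_x, train_y, query, k=5):
    # majority vote among the k nearest neighbors of the query
    dist = lambda a, b: math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))
    neighbors = sorted(zip(train_x, train_y), key=lambda p: dist(p[0], query))[:k]
    return Counter(y for _, y in neighbors).most_common(1)[0][0]
```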

Summary
Bayesian learning formulates learning as a form of probabilistic inference, using the observations to update a prior distribution over hypotheses. Maximum a posteriori (MAP) selects a single most likely hypothesis given the data. Maximum likelihood simply selects the hypothesis that maximizes the likelihood of the data (= MAP with a uniform prior). EM can find local maximum likelihood solutions for hidden variables. Instance-based models, such as the nearest-neighbor method, use the collection of data points to represent a distribution.

Reinforcement Learning
In which we examine how an agent can learn from success and failure, reward and punishment.

Introduction
Learning to ride a bicycle:
The goal given to the Reinforcement Learning system is simply to ride the bicycle without falling over
it begins riding the bicycle and performs a series of actions that result in the bicycle being tilted 45 degrees to the right


Introduction
Learning to ride a bicycle:
the RL system turns the handle bars to the LEFT
result: CRASH!!! It receives negative reinforcement
the RL system turns the handle bars to the RIGHT
result: CRASH!!! It receives negative reinforcement

Introduction
Learning to ride a bicycle:
the RL system has learned that the state of being tilted 45 degrees to the right is bad
it repeats the trial, this time using 40 degrees to the right
by performing enough of these trial-and-error interactions with the environment, the RL system will ultimately learn how to prevent the bicycle from ever falling over

Passive Learning in a Known Environment


Passive Learner: A passive learner simply watches the world going by, and tries to learn the utility of being in various states. Another way to think of a passive learner is as an agent with a fixed policy trying to determine its benefits.

Passive Learning in a Known Environment


In passive learning, the environment generates state transitions and the agent perceives them. Consider an agent trying to learn the utilities of the states shown below [grid-world figure not reproduced]:

Passive Learning in a Known Environment

the agent can move {North, East, South, West}; a trial terminates on reaching [4,2] or [4,3]

Passive Learning in a Known Environment


the object is to use this information about rewards to learn the expected utility U(i) associated with each nonterminal state i
utilities can be learned using 3 approaches:
1) LMS (least mean squares)
2) ADP (adaptive dynamic programming)
3) TD (temporal difference learning)
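As a taste of the third approach, the TD update nudges U(i) toward R(i) + U(j) after each observed transition from state i to state j; the state names and the reward value below are made up for illustration.

```python
def td_update(U, i, j, reward, alpha=0.1):
    # temporal-difference update for passive utility learning
    U[i] += alpha * (reward + U[j] - U[i])

U = {"s1": 0.0, "s2": 0.0, "terminal": 1.0}   # hypothetical states
td_update(U, "s1", "s2", reward=-0.04)        # -0.04: an assumed step cost
```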

Active Learning in an Unknown Environment

An active agent must consider:
what actions to take
what their outcomes may be
how they will affect the rewards received

Active Learning in an Unknown Environment


Minor changes to the passive learning agent:

the environment model now incorporates the probabilities of transitions to other states, given a particular action
the agent must maximize its expected utility
the agent needs a performance element to choose an action at each step

Learning in Neural Networks


Neurons and the Brain, Neural Networks, Perceptrons, Multi-layer Networks, Applications


Neural Networks
complex networks of simple computing elements capable of learning from examples
with appropriate learning methods

a collection of simple elements performs high-level operations
thought, reasoning, consciousness

Neural Networks and the Brain


brain
set of interconnected modules performs information processing operations at various levels

[Russell & Norvig, 1995]

sensory input analysis, memory storage and retrieval, reasoning, feelings, consciousness

neurons
basic computational elements, heavily interconnected with other neurons

Artificial Neuron Diagram

[Russell & Norvig, 1995]

weighted inputs are summed up by the input function
the (nonlinear) activation function calculates the activation value, which determines the output

Common Activation Functions

[Russell & Norvig, 1995]

Step_t(x) = 1 if x >= t, else 0
Sign(x) = +1 if x >= 0, else -1
Sigmoid(x) = 1 / (1 + e^(-x))
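The same three functions, transcribed directly into Python:

```python
import math

def step(x, t=0.0):
    # hard threshold at t
    return 1 if x >= t else 0

def sign(x):
    # symmetric hard threshold
    return 1 if x >= 0 else -1

def sigmoid(x):
    # smooth, differentiable squashing function
    return 1.0 / (1.0 + math.exp(-x))
```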

Network Structures
in principle, networks can be arbitrarily connected
occasionally done to represent specific structures
semantic networks logical sentences

makes learning rather difficult

layered structures
networks are arranged into layers
interconnections mostly between two layers
some networks have feedback connections

Perceptrons
single-layer, feed-forward network
historically one of the first types of neural networks
late 1950s

the output is calculated as a step function applied to the weighted sum of inputs
capable of learning simple functions
linearly separable

[Russell & Norvig, 1995]


Perceptrons and Learning


perceptrons can learn from examples through a simple learning rule:
calculate the error of a unit i as the difference between the correct output Ti and the calculated output Oi: Erri = Ti - Oi
adjust the weight Wj of input Ij such that the error decreases: Wj := Wj + α * Ij * Erri
α is the learning rate

this is a gradient descent search through the weight space
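A sketch of this rule in Python; learning OR, with a constant 1 input serving as the bias, is an invented example.

```python
def train_perceptron(samples, targets, weights, alpha=0.1, epochs=100):
    for _ in range(epochs):
        for inputs, target in zip(samples, targets):
            # step-function output on the weighted sum of inputs
            output = 1 if sum(w * i for w, i in zip(weights, inputs)) >= 0 else 0
            error = target - output                   # Err = T - O
            weights = [w + alpha * i * error          # Wj := Wj + alpha * Ij * Err
                       for w, i in zip(weights, inputs)]
    return weights

data = [(1, 0, 0), (1, 0, 1), (1, 1, 0), (1, 1, 1)]   # leading 1 = bias input
targets = [0, 1, 1, 1]                                # the OR function
print(train_perceptron(data, targets, weights=[0.0, 0.0, 0.0]))
```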

Multi-Layer Networks
research in the more complex networks with more than one layer was very limited until the 1980s
learning in such networks is much more complicated
the problem is to assign the blame for an error to the respective units and their weights in a constructive way

the back-propagation learning algorithm can be used to facilitate learning in multi-layer networks

Diagram Multi-Layer Network


[Figure: two-layer network with input units Ik, weights Wkj, hidden units aj, weights Wji, and output units Oi]

two-layer network
input units Ik
usually not counted as a separate layer
hidden units aj
output units Oi

usually all nodes of one layer have weighted connections to all nodes of the next layer


Back-Propagation Algorithm
assigns blame to individual units in the respective layers
essentially based on the connection strength
proceeds from the output layer to the hidden layer(s)
updates the weights of the units leading to the layer

essentially performs gradient-descent search on the error surface
relatively simple, since it relies only on local information
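A compact NumPy sketch of one such gradient step for a single hidden layer of sigmoid units; the learning rate is an arbitrary choice.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def backprop_step(x, t, W_kj, W_ji, alpha=0.5):
    # forward pass: inputs I_k -> hidden a_j -> outputs O_i
    a = sigmoid(W_kj @ x)
    o = sigmoid(W_ji @ a)
    # backward pass: blame is apportioned via the connection strengths
    delta_o = (t - o) * o * (1 - o)
    delta_a = (W_ji.T @ delta_o) * a * (1 - a)
    # gradient-descent step on the error surface
    W_ji = W_ji + alpha * np.outer(delta_o, a)
    W_kj = W_kj + alpha * np.outer(delta_a, x)
    return W_kj, W_ji
```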

Capabilities of Multi-Layer Neural Networks


expressiveness
weaker than predicate logic
good for continuous inputs and outputs

computational efficiency
training time can be exponential in the number of inputs
depends critically on parameters like the learning rate
local minima are problematic
can be overcome by simulated annealing, at additional cost

generalization
works reasonably well for some (classes of) functions

Capabilities of Multi-Layer Neural Networks (cont.)


sensitivity to noise
very tolerant, since they perform nonlinear regression

transparency
neural networks are essentially black boxes
there is no explanation or trace for a particular answer
tools for the analysis of networks are very limited
there are some limited methods to extract rules from networks

Applications
domains and tasks where neural networks are successfully used
handwriting recognition
control problems
juggling, truck backup problem

series prediction
weather, financial forecasting

categorization
sorting of items (fruit, characters, phonemes, ...)
