Course Objectives
Introduce the concepts & principles governing reinforcement-based machine learning systems
Review fundamental theory:
RL learning schemes (Q-Learning, TD-Learning, etc.)
Limitations of existing techniques and how they can be improved
Discuss software and hardware implementation considerations
Long-term goal: to contribute to your understanding of the formalism, trends and challenges in constructing RL-based agents
ECE-517 - Reinforcement Learning in AI
Markov Decision Processes (MDPs)
Dynamic Programming (DP)
Practical systems; role of Neural Networks in NDP
Course Prerequisites
A course on probability theory, or background in probability theory, is required
Matlab/C/C++ competency
A Matlab tutorial has been posted on the course website (under the schedule page)
Course Assignments
2 small projects
Main goal: provide students with basic hands-on experience in ML behavioral simulation and result interpretation
Analysis complemented by simulation
MATLAB programming oriented
Reports should include all background, explanations and results
Assignments will cover the majority of the topics discussed in class
Assignments should be handed in before the beginning of the class
Each student/group is assigned a topic
Project report & in-class presentation
Final project
Sony AIBO Lab
Located at SERF 204
6 Sony AIBO dog robots (3rd generation)
Local wireless network (for communicating with the dogs)
Code for lab project/s will be written in Matlab
Interface has been prepared
Time slots should be coordinated with Instructor & TA
Office Hours: T/Tr 2:00-3:00 PM (FH 401-A)
TA: Derek Rose (derek@utk.edu), office @ SERF 213
My email: itamar@eecs.utk.edu
Students are strongly encouraged to visit the course website (www.ece.utk.edu/~itamar/courses/ECE-517) for announcements, lecture notes, updates etc.
An essential feature of the University of Tennessee, Knoxville, is a commitment to maintaining an atmosphere of intellectual integrity and academic honesty. As a student of the university, I pledge that I will neither knowingly give nor receive any inappropriate assistance in academic work, thus affirming my own personal commitment to honor and integrity.
Robotics
Machine learning (in the general sense)
Legacy AI (symbolic reasoning, logic, etc.)
Image/vision/signal processing
Control systems theory
Dynamic Programming
Applications and case studies
Final project presentations: Nov 15 - Nov 29, 2011
A detailed schedule is posted at the course website
Optical character recognition
Face detection
Spoken language understanding
Customer segmentation
Weather prediction, etc.
Introduction
Pattern recognition (speech, vision)
Data mining
Military applications
many more
Introduction (cont.)
Learning by interacting with our environment is probably the first form of learning that comes to mind when we think about the nature of learning
Humans have no direct teachers
We do have a direct sensorimotor connection to the environment
We learn as we go along
Interaction with the environment teaches us what works and what doesn't
We construct a model of our environment
This course explores a computational approach to learning from interaction with the environment
Trial-and-error: adapting an internal representation, based on experience, to improve future performance
Delayed reward: actions are produced so as to yield long-term (not just short-term) rewards
An RL agent must:
Sense its environment
Produce actions that can affect the environment
Have a goal (momentary cost metric) relating to its state
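The sense-act-goal loop above can be sketched as a minimal program. The toy environment, state encoding and reward below are illustrative assumptions, not part of the course material:

```python
def run_episode(env_step, choose_action, initial_state, max_steps=100):
    """Generic agent-environment loop: sense state, act, accumulate reward."""
    state, total_reward = initial_state, 0.0
    for _ in range(max_steps):
        action = choose_action(state)                  # agent produces an action
        state, reward, done = env_step(state, action)  # environment reacts
        total_reward += reward                         # reward may be delayed
        if done:
            break
    return total_reward

# Hypothetical environment: walk right from position 0 to 5;
# the reward arrives only at the goal (delayed reward).
def env_step(state, action):
    nxt = max(0, state + (1 if action == "right" else -1))
    return nxt, (1.0 if nxt == 5 else 0.0), nxt == 5

print(run_episode(env_step, lambda s: "right", 0))  # 1.0
```

The agent here is a fixed policy (always "right"); a learning agent would change `choose_action` based on the rewards it observes.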
Necessitates an accurate model of the environment being controlled/interacted-with Something animals and humans do very well, and computers do very poorly
Philosophically
Computationally
Practically (implementation considerations)
Supervised Learning: learn from labeled examples Unsupervised Learning: process unlabeled examples
Reinforcement Learning: learn from interaction
Example: clustering data into groups
Defined by the problem
Many approaches are possible (including evolutionary)
Here we will focus on a particular family of approaches
Autonomous learning
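To make the unsupervised-learning example concrete, here is a minimal 1-D k-means sketch; the data values and the naive initialization are made up for illustration:

```python
def kmeans(points, k=2, iters=20):
    """Minimal 1-D k-means: group unlabeled data with no teacher signal."""
    centers = points[:k]                       # naive initialization
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:                       # assign each point to nearest center
            i = min(range(k), key=lambda c: abs(p - centers[c]))
            clusters[i].append(p)
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]  # recompute means
    return centers

print(sorted(kmeans([1.0, 1.2, 0.8, 9.0, 9.5, 8.5])))  # [1.0, 9.0]
```

Note there are no labels anywhere: the grouping emerges from the data alone, which is what distinguishes this from supervised learning.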
Confinement to the Von Neumann architecture
The human brain has ~10^11 processing units operating at once
However, each runs at ~150 Hz
It's the massive parallelism that gives it its power
FPGA devices (reconfigurable computing)
GPUs
ASIC prospect
UTK/MIL group focus
Exploitation of what worked in the past (to yield high reward)
Exploration of new, alternative action paths, so as to learn how to make better action selections in the future
The dilemma is that neither exploration nor exploitation can be pursued exclusively without failing at the task
On a stochastic task, each action must be tried many times to gain a reliable estimate of its expected reward
We will review mathematical methods proposed to address this basic issue
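One standard method for this trade-off is ε-greedy action selection; the multi-armed-bandit setup below, its arm means, ε value and step count are illustrative assumptions:

```python
import random

def epsilon_greedy_bandit(true_means, epsilon=0.1, steps=5000, seed=0):
    """Balance exploration and exploitation on a stochastic bandit task."""
    rng = random.Random(seed)
    n = len(true_means)
    counts, estimates = [0] * n, [0.0] * n
    for _ in range(steps):
        if rng.random() < epsilon:             # explore: try a random arm
            a = rng.randrange(n)
        else:                                  # exploit: best estimate so far
            a = max(range(n), key=lambda i: estimates[i])
        r = rng.gauss(true_means[a], 1.0)      # noisy reward: many tries needed
        counts[a] += 1
        estimates[a] += (r - estimates[a]) / counts[a]  # incremental mean
    return estimates

est = epsilon_greedy_bandit([0.2, 0.5, 1.0])
print(est.index(max(est)))                     # identifies the best arm
```

With ε = 0 the agent can lock onto a suboptimal arm forever; with ε = 1 it never profits from what it has learned, which is exactly the dilemma stated above.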
[Diagram: agent-environment interaction loop - agent selects Actions, environment returns Reward]
Some Examples of RL
A master chess player
A mobile robot decides whether to enter a room or try to find its way back to a battery-charging station
Playing backgammon
In all cases, the agent tries to achieve a goal despite uncertainty about its environment
The effect of an action cannot be fully predicted
In all cases, experience allows the agent to improve its performance over time
Artificial Intelligence
Control Theory (MDP)
Operations Research
Cognitive Science and Psychology
More recently, Neuroscience
RL has solid foundations and is a well-established research field
2) Reward function - defines the goal in an RL learning problem. Roughly speaking, it maps each perceived state (or state-action pair) of the environment to a single number, a reward, indicating the intrinsic desirability of that state
o Agent's goal is to maximize the reward over time
o May be stochastic
o Drives the policy employed and its adaptation
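A reward function can be as simple as a lookup on the transition; the grid-world below, its GOAL/PIT coordinates and the reward values are hypothetical, chosen only to illustrate the mapping:

```python
# Hypothetical 4x4 grid-world; GOAL and PIT coordinates are made up.
GOAL, PIT = (3, 3), (1, 2)

def reward(state, action, next_state):
    """Maps a perceived transition to a single scalar reward."""
    if next_state == GOAL:
        return 1.0    # intrinsically desirable state
    if next_state == PIT:
        return -1.0   # intrinsically undesirable state
    return -0.01      # small step cost: encourages reaching the goal quickly

print(reward((3, 2), "down", (3, 3)))  # 1.0
```

The agent never maximizes this number in isolation; it maximizes the accumulated reward over time, which is what makes the -0.01 step cost shape behavior.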
As a side note: RL is essentially an optimization problem. However, it is one of the many optimization problems that are extremely hard to (optimally) solve.
We are playing against an imperfect player Draws and losses are equally bad for us
Q: Can we design a player that'll find imperfections in the opponent's play and learn to maximize its chances of winning?
Classical machine learning schemes would never visit a state that has the potential to lead to a loss
We want to exploit the weaknesses of the opponent, so we may decide to visit a state that has the potential of leading to a loss
An Extended Example: Tic-Tac-Toe (cont.)
Using dynamic programming (DP), we can compute an optimal solution for any opponent
However, we would need specifications of the opponent (e.g. state-action probabilities)
Such information is usually unavailable to us
In RL we estimate this information from experience
We later apply DP, or other sequential decision-making schemes, based on the model we obtained by experience
A policy tells the agent how to make its next move based on the state of the board
All states with three Xs in a row have win prob. of 1
All states with three Os in a row have win prob. of 0
All other states are preset to prob. 0.5
When playing the game, we make a move that we predict would result in the state with the highest value (exploitation)
Occasionally, we choose randomly among the non-zero valued states (exploratory moves)
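This initialization and move selection might be sketched as follows; the 9-character board encoding and helper names are assumptions for illustration, not the course's lab code:

```python
import random

LINES = [(0,1,2), (3,4,5), (6,7,8),   # rows
         (0,3,6), (1,4,7), (2,5,8),   # columns
         (0,4,8), (2,4,6)]            # diagonals

def initial_value(board):
    """board: 9-char string over 'XO-'; value = estimated prob. of X winning."""
    if any(all(board[i] == 'X' for i in line) for line in LINES):
        return 1.0    # three Xs in a row: win prob. 1
    if any(all(board[i] == 'O' for i in line) for line in LINES):
        return 0.0    # three Os in a row: win prob. 0
    return 0.5        # all other states preset to 0.5

def choose_move(board, values, epsilon=0.1):
    """Mostly greedy (exploitation), occasionally random (exploration)."""
    moves = [i for i, c in enumerate(board) if c == '-']
    if random.random() < epsilon:
        return random.choice(moves)    # exploratory move
    def value_after(i):                # value of the resulting state
        nxt = board[:i] + 'X' + board[i+1:]
        return values.get(nxt, initial_value(nxt))
    return max(moves, key=value_after) # exploitation: highest-valued successor

print(initial_value('XXX------'))  # 1.0
print(initial_value('X-O------'))  # 0.5
```

`values` is the learned table; states not yet visited fall back to the preset initial values.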
V(s) ← V(s) + α [ V(s') − V(s) ]
where: 0 < α ≤ 1 is a learning parameter (step-size param.)
s - state before move / s' - state after move
This update rule is an example of a Temporal-Difference learning method
This method performs quite well - it converges to the optimal policy (for a fixed opponent)
Can be adjusted to allow for slowly-changing opponents
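A minimal sketch of this temporal-difference update, assuming state values are kept in a dictionary with unvisited states defaulting to 0.5 (an illustrative choice, matching the tic-tac-toe initialization):

```python
def td_update(values, s, s_next, alpha=0.1):
    """V(s) <- V(s) + alpha * (V(s') - V(s)), with 0 < alpha <= 1."""
    v = values.get(s, 0.5)
    v_next = values.get(s_next, 0.5)
    values[s] = v + alpha * (v_next - v)   # move V(s) toward V(s')
    return values[s]

values = {'won': 1.0}           # hypothetical state names for illustration
td_update(values, 'mid', 'won') # 0.5 is pulled toward 1.0
print(round(values['mid'], 3))  # 0.55
```

Repeated over many games, each state's value is pulled toward the values of the states that follow it, which is how the delayed win/loss signal propagates back to earlier moves.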
Emphasis on learning from interaction - in this case, with the opponent
Clear goal - correct planning takes into account delayed rewards
RL can also:
be applied to infinite-horizon problems (no terminal state)
be applied to cases where there is no external adversary (e.g. a game against nature)