Course Objectives
Introduce the concepts & principles governing reinforcement-based machine learning systems
Review fundamental theory:
RL learning schemes (Q-Learning, TD-Learning, etc.)
Limitations of existing techniques and how they can be improved
Discuss software and hardware implementation considerations
Long-term goal: to contribute to your understanding of the formalism, trends and challenges in constructing RL-based agents
ECE-517 - Reinforcement Learning in AI
Markov Decision Processes (MDPs)
Dynamic Programming (DP)
Practical systems; role of Neural Networks in NDP
Course Prerequisites
A course on probability theory, or background in probability theory, is required
Matlab/C/C++ competency
A Matlab tutorial has been posted on the course website (under the schedule page)
Course Assignments
2 small projects
Main goal: provide students with basic hands-on experience in ML behavioral simulation and result interpretation
Analysis complemented by simulation
MATLAB programming oriented
Reports should include all background, explanations and results
Assignments will cover the majority of the topics discussed in class
Assignments should be handed in before the beginning of the class
Each student/group is assigned a topic
Project report & in-class presentation
Final project
Sony AIBO Lab
Located at SERF 204
6 Sony AIBO dog robots (3rd generation)
Local wireless network (for communicating with the dogs)
Code for lab project/s will be written in Matlab
Interface has been prepared
Time slots should be coordinated with Instructor & TA
Office Hours: T/Tr 2:00-3:00 PM (FH 401-A)
TA: Derek Rose (derek@utk.edu), office @ SERF 213
My email: itamar@eecs.utk.edu
Students are strongly encouraged to visit the course website (www.ece.utk.edu/~itamar/courses/ECE-517) for announcements, lecture notes, updates etc.
An essential feature of the University of Tennessee, Knoxville, is a commitment to maintaining an atmosphere of intellectual integrity and academic honesty. As a student of the university, I pledge that I will neither knowingly give nor receive any inappropriate assistance in academic work, thus affirming my own personal commitment to honor and integrity.
Robotics
Machine learning (in the general sense)
Legacy AI (symbolic reasoning, logic, etc.)
Image/vision/signal processing
Control systems theory
Dynamic Programming
Applications and case studies
Final project presentations: Nov 15 - Nov 29, 2011
A detailed schedule is posted at the course website
Optical character recognition
Face detection
Spoken language understanding
Customer segmentation
Weather prediction, etc.
Introduction
Pattern recognition (speech, vision)
Data mining
Military applications
many more
Introduction (cont.)
Learning by interacting with our environment is probably the first form of learning that comes to mind when we think about the nature of learning
Humans have no direct teachers
We do have a direct sensorimotor connection to the environment
We learn as we go along
Interaction with the environment teaches us what works and what doesn't
We construct a model of our environment
This course explores a computational approach to learning from interaction with the environment
Trial-and-error: adapting an internal representation, based on experience, to improve future performance
Delayed reward: actions are produced so as to yield long-term (not just short-term) rewards
An RL agent must:
Sense its environment
Produce actions that can affect the environment
Have a goal (momentary cost metric) relating to its state
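The sense-act-goal loop above can be sketched as a minimal program. The toy environment, state encoding and reward below are illustrative assumptions, not part of the course material:

```python
def run_episode(env_step, choose_action, initial_state, max_steps=100):
    """Generic agent-environment loop: sense state, act, accumulate reward."""
    state, total_reward = initial_state, 0.0
    for _ in range(max_steps):
        action = choose_action(state)                  # agent produces an action
        state, reward, done = env_step(state, action)  # environment reacts
        total_reward += reward                         # reward may be delayed
        if done:
            break
    return total_reward

# Hypothetical environment: walk right from position 0 to 5;
# the reward arrives only at the goal (delayed reward).
def env_step(state, action):
    nxt = max(0, state + (1 if action == "right" else -1))
    return nxt, (1.0 if nxt == 5 else 0.0), nxt == 5

print(run_episode(env_step, lambda s: "right", 0))  # 1.0
```

The agent here is a fixed policy (always "right"); a learning agent would change `choose_action` based on the rewards it observes.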
Necessitates an accurate model of the environment being controlled/interacted-with Something animals and humans do very well, and computers do very poorly
Philosophically
Computationally
Practically (implementation considerations)
Supervised Learning: learn from labeled examples Unsupervised Learning: process unlabeled examples
Reinforcement Learning: learn from interaction
Example: clustering data into groups
Defined by the problem
Many approaches are possible (including evolutionary)
Here we will focus on a particular family of approaches
Autonomous learning
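To make the unsupervised-learning example concrete, here is a minimal 1-D k-means sketch; the data values and the naive initialization are made up for illustration:

```python
def kmeans(points, k=2, iters=20):
    """Minimal 1-D k-means: group unlabeled data with no teacher signal."""
    centers = points[:k]                       # naive initialization
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:                       # assign each point to nearest center
            i = min(range(k), key=lambda c: abs(p - centers[c]))
            clusters[i].append(p)
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]  # recompute means
    return centers

print(sorted(kmeans([1.0, 1.2, 0.8, 9.0, 9.5, 8.5])))  # [1.0, 9.0]
```

Note there are no labels anywhere: the grouping emerges from the data alone, which is what distinguishes this from supervised learning.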
Confinement to the Von Neumann architecture
The human brain has ~10^11 processing units operating at once
However, each runs at ~150 Hz
It's the massive parallelism that gives it its power
FPGA devices (reconfigurable computing)
GPUs
ASIC prospect
UTK/MIL group focus
Exploitation of what worked in the past (to yield high reward)
Exploration of new, alternative action paths, so as to learn how to make better action selections in the future
The dilemma is that neither exploration nor exploitation can be pursued exclusively without failing at the task
On a stochastic task, each action must be tried many times to gain a reliable estimate of its expected reward
We will review mathematical methods proposed to address this basic issue
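One standard method for this trade-off is ε-greedy action selection; the multi-armed-bandit setup below, its arm means, ε value and step count are illustrative assumptions:

```python
import random

def epsilon_greedy_bandit(true_means, epsilon=0.1, steps=5000, seed=0):
    """Balance exploration and exploitation on a stochastic bandit task."""
    rng = random.Random(seed)
    n = len(true_means)
    counts, estimates = [0] * n, [0.0] * n
    for _ in range(steps):
        if rng.random() < epsilon:             # explore: try a random arm
            a = rng.randrange(n)
        else:                                  # exploit: best estimate so far
            a = max(range(n), key=lambda i: estimates[i])
        r = rng.gauss(true_means[a], 1.0)      # noisy reward: many tries needed
        counts[a] += 1
        estimates[a] += (r - estimates[a]) / counts[a]  # incremental mean
    return estimates

est = epsilon_greedy_bandit([0.2, 0.5, 1.0])
print(est.index(max(est)))                     # identifies the best arm
```

With ε = 0 the agent can lock onto a suboptimal arm forever; with ε = 1 it never profits from what it has learned, which is exactly the dilemma stated above.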
[Diagram: agent-environment interaction loop - agent selects Actions, environment returns Reward]
Some Examples of RL
A master chess player
A mobile robot decides whether to enter a room or try to find its way back to a battery-charging station
Playing backgammon
In all cases, the agent tries to achieve a goal despite uncertainty about its environment
The effect of an action cannot be fully predicted
In all cases, experience allows the agent to improve its performance over time
Artificial Intelligence
Control Theory (MDP)
Operations Research
Cognitive Science and Psychology
More recently, Neuroscience
RL has solid foundations and is a well-established research field
2) Reward function - defines the goal in an RL learning problem. Roughly speaking, it maps each perceived state (or state-action pair) of the environment to a single number, a reward, indicating the intrinsic desirability of that state
o Agent's goal is to maximize the reward over time
o May be stochastic
o Drives the policy employed and its adaptation
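A reward function can be as simple as a lookup on the transition; the grid-world below, its GOAL/PIT coordinates and the reward values are hypothetical, chosen only to illustrate the mapping:

```python
# Hypothetical 4x4 grid-world; GOAL and PIT coordinates are made up.
GOAL, PIT = (3, 3), (1, 2)

def reward(state, action, next_state):
    """Maps a perceived transition to a single scalar reward."""
    if next_state == GOAL:
        return 1.0    # intrinsically desirable state
    if next_state == PIT:
        return -1.0   # intrinsically undesirable state
    return -0.01      # small step cost: encourages reaching the goal quickly

print(reward((3, 2), "down", (3, 3)))  # 1.0
```

The agent never maximizes this number in isolation; it maximizes the accumulated reward over time, which is what makes the -0.01 step cost shape behavior.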
As a side note: RL is essentially an optimization problem. However, it is one of the many optimization problems that are extremely hard to (optimally) solve.
We are playing against an imperfect player Draws and losses are equally bad for us
Q: Can we design a player that'll find imperfections in the opponent's play and learn to maximize its chances of winning?
Classical machine learning schemes would never visit a state that has the potential to lead to a loss
We want to exploit the weaknesses of the opponent, so we may decide to visit a state that has the potential of leading to a loss
An Extended Example: Tic-Tac-Toe (cont.)
Using dynamic programming (DP), we can compute an optimal solution for any opponent
However, we would need specifications of the opponent (e.g. state-action probabilities)
Such information is usually unavailable to us
In RL we estimate this information from experience
We later apply DP, or other sequential decision-making schemes, based on the model we obtained by experience
A policy tells the agent how to make its next move based on the state of the board
All states with three Xs in a row have win prob. of 1
All states with three Os in a row have win prob. of 0
All other states are preset to prob. 0.5
When playing the game, we make a move that we predict would result in the state with the highest value (exploitation)
Occasionally, we choose randomly among the non-zero valued states (exploratory moves)
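This initialization and move selection might be sketched as follows; the 9-character board encoding and helper names are assumptions for illustration, not the course's lab code:

```python
import random

LINES = [(0,1,2), (3,4,5), (6,7,8),   # rows
         (0,3,6), (1,4,7), (2,5,8),   # columns
         (0,4,8), (2,4,6)]            # diagonals

def initial_value(board):
    """board: 9-char string over 'XO-'; value = estimated prob. of X winning."""
    if any(all(board[i] == 'X' for i in line) for line in LINES):
        return 1.0    # three Xs in a row: win prob. 1
    if any(all(board[i] == 'O' for i in line) for line in LINES):
        return 0.0    # three Os in a row: win prob. 0
    return 0.5        # all other states preset to 0.5

def choose_move(board, values, epsilon=0.1):
    """Mostly greedy (exploitation), occasionally random (exploration)."""
    moves = [i for i, c in enumerate(board) if c == '-']
    if random.random() < epsilon:
        return random.choice(moves)    # exploratory move
    def value_after(i):                # value of the resulting state
        nxt = board[:i] + 'X' + board[i+1:]
        return values.get(nxt, initial_value(nxt))
    return max(moves, key=value_after) # exploitation: highest-valued successor

print(initial_value('XXX------'))  # 1.0
print(initial_value('X-O------'))  # 0.5
```

`values` is the learned table; states not yet visited fall back to the preset initial values.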
V(s) ← V(s) + α [ V(s') − V(s) ]
where: 0 < α ≤ 1 is a learning parameter (step-size param.)
s - state before move / s' - state after move
This update rule is an example of a Temporal-Difference learning method
This method performs quite well - it converges to the optimal policy (for a fixed opponent)
Can be adjusted to allow for slowly-changing opponents
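A minimal sketch of this temporal-difference update, assuming state values are kept in a dictionary with unvisited states defaulting to 0.5 (an illustrative choice, matching the tic-tac-toe initialization):

```python
def td_update(values, s, s_next, alpha=0.1):
    """V(s) <- V(s) + alpha * (V(s') - V(s)), with 0 < alpha <= 1."""
    v = values.get(s, 0.5)
    v_next = values.get(s_next, 0.5)
    values[s] = v + alpha * (v_next - v)   # move V(s) toward V(s')
    return values[s]

values = {'won': 1.0}           # hypothetical state names for illustration
td_update(values, 'mid', 'won') # 0.5 is pulled toward 1.0
print(round(values['mid'], 3))  # 0.55
```

Repeated over many games, each state's value is pulled toward the values of the states that follow it, which is how the delayed win/loss signal propagates back to earlier moves.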
Emphasis on learning from interaction - in this case, with the opponent
Clear goal - correct planning takes into account delayed rewards
RL can also:
be applied to infinite-horizon problems (no terminal state)
be applied to cases where there is no external adversary (e.g. a game against nature)