
ECE-517: Reinforcement Learning in Artificial Intelligence

Lecture 1: Course Logistics, Introduction


August 18, 2011

Dr. Itamar Arel


College of Engineering Electrical Engineering and Computer Science Department The University of Tennessee Fall 2011

But first, a quick anonymous survey

ECE-517 - Reinforcement Learning in AI

Outline

Course logistics and requirements
Course roadmap & outline
Introduction


Course Objectives

Introduce the concepts & principles governing reinforcement-based machine learning systems
Review fundamental theory

Markov Decision Processes (MDPs)
Dynamic Programming (DP)
Practical systems; role of Neural Networks in NDP

RL learning schemes (Q-Learning, TD-Learning, etc.)
Limitations of existing techniques and how they can be improved
Discuss software and hardware implementation considerations

Long-term goal: to contribute to your understanding of the formalism, trends and challenges in constructing RL-based agents

Course Prerequisites

A course on probability theory, or equivalent background in probability theory, is required
Matlab/C/C++ competency

A Matlab tutorial has been posted on the course website (under the schedule page)

Open-mindedness & imagination


Course Assignments

5 problem assignment sets

Will cover the majority of the topics discussed in class
Assignments should be handed in before the beginning of the class

2 small projects

Main goal: provide students with basic hands-on experience in ML behavioral simulation and result interpretation
Analysis complemented by simulation
MATLAB programming oriented
Reports should include all background, explanations and results

Final project

Each student/group is assigned a topic
Project report & in-class presentation

Sony AIBO Lab

Located at SERF 204
6 Sony AIBO dog robots (3rd generation)
Local wireless network (for communicating with the dogs)
Code for lab projects will be written in Matlab
An interface has been prepared
Time slots should be coordinated with the Instructor & TA


Textbooks & Reference Material

Lecture notes will be posted weekly on the course website (www.ece.utk.edu/~itamar/courses/ECE-517), as well as:

Updated schedule
Assignment sets, sample codes
Grades
General announcements

Reading assignments will be posted on the schedule page

Textbook:

R. Sutton and A. Barto, Reinforcement Learning: An Introduction, 1998. (available online!)


Grading policy & office hours

2 Small Projects: 25% (12.5 points each)
5 Assignment sets: 25% (5 points each)
Midterm (in-class): 20%
Final project: 30%

Instructor: Dr. Itamar Arel

Office Hours: T/Tr 2:00-3:00 PM (FH 401-A)
My email: itamar@eecs.utk.edu

TA: Derek Rose (derek@utk.edu), office @ SERF 213

Office hours: contact TA

Students are strongly encouraged to visit the course website (www.ece.utk.edu/~itamar/courses/ECE-517) for announcements, lecture notes, updates etc.

UTK Academic Honesty Statement

An essential feature of the University of Tennessee, Knoxville, is a commitment to maintaining an atmosphere of intellectual integrity and academic honesty. As a student of the university, I pledge that I will neither knowingly give nor receive any inappropriate assistance in academic work, thus affirming my own personal commitment to honor and integrity.

Bottom line: DO YOUR OWN WORK!


Understanding and constructing intelligence

What is intelligence? How do we define/evaluate it?
How can we design adaptive systems that optimize their performance as time goes by?
What are the limitations of RL-based algorithms?
How can artificial Neural Networks help us scale intelligent systems?
In what ways can knowledge be efficiently represented?

This course is NOT about:

Robotics
Machine learning (in the general sense)
Legacy AI - symbolic reasoning, logic, etc.
Image/vision/signal processing
Control systems theory

Why the course name "RL in AI"?


Course Outline & Roadmap

Introduction
Review of basic probability theory

Discrete-time/space probability theory
Discrete Markov Chains

Dynamic Programming
Markov Decision Processes (MDPs)
Partially Observable Markov Decision Processes (POMDPs)
Temporal Difference (TD) Learning, Planning
Approximate Dynamic Programming (a.k.a. Reinforcement Learning)

Midterm: Tuesday, Oct 4, 2011 (in-class)

Neuro-Dynamic Programming

Feedforward & Recurrent Neural Networks
Neuro-dynamic RL architecture

Applications and case studies
Final project presentations: Nov 15 - Nov 29, 2011
A detailed schedule is posted at the course website

Outline

Course logistics and requirements
Course outline & roadmap
Introduction


What is Machine Learning?

Discipline focusing on computer algorithms that learn to perform intelligent tasks
Learning is based on observation of data
Generally: learning to do better in the future based on what has been observed/experienced in the past
ML is a core subarea of AI, which also intersects with physics, statistics, theoretical CS, etc.

Examples of ML problems:

Optical character recognition
Face detection
Spoken language understanding
Customer segmentation
Weather prediction, etc.


Introduction

Why do we need good ML technology?

Human beings are lazy creatures
Service robotics

A $10B market by 2015 - in Japan alone!

Pattern recognition (speech, vision)
Data mining
Military applications, and many more

Many ML problems can be formulated as RL problems


Introduction (cont.)

Learning by interacting with our environment is probably the first idea that occurs to us when we think about the nature of learning
Humans have no direct teachers
We do have a direct sensorimotor connection to the environment
We learn as we go along

Interaction with the environment teaches us what works and what doesn't
We construct a model of our environment

This course explores a computational approach to learning from interaction with the environment


What is Reinforcement Learning?

Reinforcement learning is learning what to do - how to map situations to actions - in order to maximize a long-term objective function driven by rewards.

It is a form of unsupervised learning

Two key components at the core of RL:

Trial-and-error - adapting the internal representation, based on experience, to improve future performance
Delayed reward - actions are produced so as to yield long-term (not just short-term) rewards

The agent must be able to:

Sense its environment
Produce actions that can affect the environment
Have a goal (momentary cost metric) relating to its state

What is Reinforcement Learning? (cont.)

RL attempts to solve the Credit Assignment problem - what is the long-term impact of an action taken now?
Unique to RL systems - a major challenge in ML

Necessitates an accurate model of the environment being controlled/interacted with
Something animals and humans do very well, and computers do very poorly

We'll spend most of the semester formalizing solutions to this problem:

Philosophically
Computationally
Practically (implementation considerations)

The Big Picture

Reinforcement Learning is a subfield of Machine Learning, which is itself a subfield of Artificial Intelligence

Types of Machine Learning

Supervised Learning: learn from labeled examples
Unsupervised Learning: process unlabeled examples

Example: clustering data into groups

Reinforcement Learning: learn from interaction

Defined by the problem
Many approaches are possible (including evolutionary)
Here we will focus on a particular family of approaches
Autonomous learning


Software vs. Hardware

Historically, ML has been on CS turf

Confinement to the Von Neumann architecture
Software limits scalability

The human brain has ~10^11 processors operating at once
However, each runs at only ~150 Hz
It's the massive parallelism that gives it power
Even 256 processors is not massive parallelism

Computer Engineering perspective:

FPGA devices (reconfigurable computing)
GPUs
ASIC prospect - UTK/MIL group focus


Exploration vs. Exploitation

A fundamental trade-off in RL:

Exploitation of what worked in the past (to yield high reward)
Exploration of new, alternative action paths so as to learn how to make better action selections in the future

The dilemma is that neither exploration nor exploitation can be pursued exclusively without failing at the task
On a stochastic task, each action must be tried many times to gain a reliable estimate of its expected reward
We will review mathematical methods proposed to address this basic issue
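One common way to balance the two is an epsilon-greedy rule: exploit the best-known action most of the time, but explore a random action with small probability, while estimating each action's expected reward from repeated trials. A minimal sketch (in Python rather than the course's Matlab; the hidden reward probabilities are made up for the demo):

```python
import random

def epsilon_greedy(q_values, epsilon=0.1):
    """With prob. epsilon pick a random action (explore),
    otherwise pick the action with the highest estimate (exploit)."""
    if random.random() < epsilon:
        return random.randrange(len(q_values))
    return max(range(len(q_values)), key=q_values.__getitem__)

random.seed(0)
true_means = [0.2, 0.5, 0.8]   # hidden per-action reward probabilities (assumed)
q = [0.0] * 3                  # estimated expected reward per action
counts = [0] * 3
for _ in range(2000):
    a = epsilon_greedy(q)
    r = 1.0 if random.random() < true_means[a] else 0.0   # stochastic reward
    counts[a] += 1
    q[a] += (r - q[a]) / counts[a]   # incremental sample-average update
```

Note how each action must be sampled repeatedly before its estimate becomes reliable - exactly why purely greedy behavior fails on stochastic tasks.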

The Reinforcement Learning Framework

(Diagram: the Agent (a.k.a. Controller) receives Observations and a Reward from the Environment (a.k.a. Plant), and sends Actions back to the Environment.)
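In code, this framework reduces to a simple interaction loop. The toy environment and random agent below are invented placeholders (a Python sketch, not course material), just to make the agent/environment/action/observation/reward roles concrete:

```python
import random

class Environment:
    """Toy plant: the state is a counter; reward is 1 whenever it reaches +5."""
    def __init__(self):
        self.state = 0

    def step(self, action):
        self.state += action                  # the action affects the state
        reward = 1 if self.state == 5 else 0  # the reward signals the goal
        return self.state, reward             # observation + reward to the agent

class Agent:
    """Placeholder controller: senses the observation, produces an action."""
    def act(self, observation):
        return random.choice([-1, 1])         # no learning yet - just acting

random.seed(0)
env, agent = Environment(), Agent()
obs, total_reward = 0, 0
for t in range(1000):
    action = agent.act(obs)         # agent -> environment: action
    obs, reward = env.step(action)  # environment -> agent: observation, reward
    total_reward += reward
```

Every RL algorithm discussed later plugs a smarter `act` (and a learning update) into this same loop.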

Some Examples of RL

A master chess player

Planning - anticipating replies and counter-replies
Immediate, intuitive judgment

A mobile robot decides whether to enter a room or try to find its way back to a battery-charging station

Playing backgammon

Obviously, a strategy is necessary
Some luck involved (stochastic game)

In all cases, the agent tries to achieve a goal despite uncertainty about its environment

The effect of an action cannot be fully predicted

In all cases, experience allows the agent to improve its performance over time

Origins of Reinforcement Learning

Artificial Intelligence
Control Theory (MDP)
Operations Research
Cognitive Science and Psychology
More recently, Neuroscience

RL has solid foundations and is a well-established research field


Elements of Reinforcement Learning

Beyond the agent and the environment, we have the following four main elements in RL:

1) Policy - defines the learning agent's way of behaving at any given time. Roughly speaking, a policy is a mapping from (perceived) states of the environment to actions to be taken when in those states.

o Usually stochastic (adapts as you go along)
o Enough to determine the agent's behavior

2) Reward function - defines the goal in an RL problem. Roughly speaking, it maps each perceived state (or state-action pair) of the environment to a single number, a reward, indicating the intrinsic desirability of that state.

o The agent's goal is to maximize the reward over time
o May be stochastic
o Drives the policy employed and its adaptation


Elements of Reinforcement Learning (cont.)

3) Value function - whereas a reward function indicates what is good in an immediate sense, a value function specifies what is good in the long run. Roughly speaking, the value of a state is the total amount of reward an agent can expect to accumulate over the future, starting from that state.

o It allows the agent to look over the horizon
o Actions are derived from value estimates, not rewards
o We measure rewards, but we estimate and act upon values - this corresponds to strategic/long-term thinking
o Intuitively a prerequisite for intelligence/intelligent control (plants vs. animals)

Obtaining a good value function is a key challenge in designing good RL systems


Elements of Reinforcement Learning (cont.)

4) Model - an observable entity that mimics the behavior of the environment.

o For example, given a state and action, the model might predict the resultant next state and next reward
o As we will later discuss, predictability and auto-associative memory are key attributes of the mammalian brain
o Models are used for planning - any way of deciding on a course of action by considering possible future scenarios before they actually occur
o Note that RL can work (sometimes very well) with an incomplete model
o We'll go over a range of model platforms to achieve the above

As a side note: RL is essentially an optimization problem. However, it is one of many optimization problems that are extremely hard to solve (optimally).

An Extended Example: Tic-Tac-Toe

Consider a classical tic-tac-toe game, whereby the winner places three marks in a row, horizontally, vertically or diagonally

Let's assume:

We are playing against an imperfect player
Draws and losses are equally bad for us

Q: Can we design a player that'll find imperfections in the opponent's play and learn to maximize its chances of winning?

Classical machine learning schemes would never visit a state that has the potential to lead to a loss
We want to exploit the weaknesses of the opponent, so we may decide to visit a state that has the potential of leading to a loss

An Extended Example: Tic-Tac-Toe (cont.)

Using dynamic programming (DP), we can compute an optimal solution for any opponent

However, we would need specifications of the opponent (e.g. state-action probabilities)
Such information is usually unavailable to us
Winning probabilities can be derived by knowing the opponent

In RL we estimate this information from experience
We later apply DP, or other sequential decision making schemes, based on the model we obtained from experience
A policy tells the agent how to make its next move based on the state of the board

An Extended Example: Tic-Tac-Toe (cont.)

How do we solve this in RL?

Set up a table of numbers - one for each state of the game
This number will reflect the probability of winning from that particular state
This is treated as the state's value, and the entire learned table denotes the value function
If V(a) > V(b), state a is preferred over state b

All states with three X's in a row have win prob. of 1
All states with three O's in a row have win prob. of 0
All other states are preset to prob. 0.5

When playing the game, we make the move that we predict would result in the state with the highest value (exploitation)
Occasionally, we choose randomly among the non-zero valued states (exploratory moves)
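The table initialization and move-selection rule above can be sketched as follows. This is a hedged Python illustration (the course uses Matlab); the 9-character 'X'/'O'/'.' board encoding and the function names are assumptions, not course code:

```python
import random

LINES = [(0, 1, 2), (3, 4, 5), (6, 7, 8),   # rows
         (0, 3, 6), (1, 4, 7), (2, 5, 8),   # columns
         (0, 4, 8), (2, 4, 6)]              # diagonals

def initial_value(state):
    """Preset values: win prob. 1 for three X's, 0 for three O's, 0.5 otherwise."""
    if any(all(state[i] == 'X' for i in line) for line in LINES):
        return 1.0
    if any(all(state[i] == 'O' for i in line) for line in LINES):
        return 0.0
    return 0.5

V = {}  # the value table, filled in lazily as states are encountered

def choose_move(state, epsilon=0.1):
    """Mostly pick the successor state with the highest value (exploitation);
    occasionally pick a random successor (exploratory move)."""
    successors = [state[:i] + 'X' + state[i + 1:]
                  for i, c in enumerate(state) if c == '.']
    for s in successors:
        V.setdefault(s, initial_value(s))
    if random.random() < epsilon:
        return random.choice(successors)
    return max(successors, key=V.__getitem__)
```

For instance, with epsilon=0.0, `choose_move('XX.OO....')` completes the top row, since that successor is preset to value 1.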



An Extended Example: Tic-Tac-Toe (cont.)

While playing, we update the values of the states
The current value of the earlier state is adjusted to be closer to the value of the later state

V(s) ← V(s) + α[V(s') − V(s)]

where: 0 < α < 1 is a learning parameter (step-size param.)
s - state before the move / s' - state after the move

This update rule is an example of the Temporal-Difference Learning method
This method performs quite well - it converges to the optimal policy (for a fixed opponent)
It can be adjusted to allow for slowly-changing opponents
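The update rule itself is one line of code. A minimal Python sketch (the state names are arbitrary placeholders):

```python
def td_update(V, s, s_next, alpha=0.1):
    """Temporal-difference update: move V(s) a fraction alpha
    of the way toward V(s'), i.e. V(s) <- V(s) + alpha*[V(s') - V(s)]."""
    V[s] += alpha * (V[s_next] - V[s])
    return V[s]

# An earlier state valued 0.5, followed by a state valued 1.0 (e.g. a win):
V = {'before_move': 0.5, 'after_move': 1.0}
td_update(V, 'before_move', 'after_move', alpha=0.1)  # nudges V('before_move') toward 0.55
```

Repeated over many games, these small corrections propagate the win/loss values at terminal states back through the table.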


An Extended Example: Tic-Tac-Toe (cont.)

RL features encountered:

Emphasis on learning from interaction - in this case, with the opponent
Clear goal - correct planning takes into account delayed rewards

For example, setting up traps for a shortsighted opponent

No model of the opponent exists a priori

Although this example is a good one, RL methods can also:

Be applied to infinite-horizon problems (no terminal state)
Be applied to cases where there is no external adversary (e.g. a game against nature)

Backgammon example: ~10^20 states, using Neural Nets


For next class

Read chapter 1 in S&B (up to and including section 1.5)
Read chapter 2 in S&B

