
The Reinforcement Learning Toolbox,

Reinforcement Learning for Optimal Control Tasks


Diploma thesis at the
Institut für Grundlagen der Informationsverarbeitung (IGI)
Technisch-Naturwissenschaftliche Fakultät der
Technischen Universität Graz (University of Technology, Graz)
submitted by
Gerhard Neumann
Degree programme: Telematik
Supervisor: O. Univ.-Prof. Dr.rer.nat. DI Wolfgang Maass
Graz, May 2005
Contents
0.1 Abstract . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
0.2 Thesis Structure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
1 Reinforcement Learning 13
1.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
1.2 Problems of reinforcement learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
1.3 Successes in Reinforcement Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
1.3.1 RL for Games . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
1.3.2 RL for Control Tasks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
1.3.3 RL for Robotics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
1.4 Basic Definitions for RL . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
1.4.1 Markov Decision Process (MDP) . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
1.4.2 Partially Observable Markov Decision Processes (POMDP) . . . . . . . . . . . . . 19
2 The Reinforcement Learning Toolbox 21
2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
2.1.1 Design Issues . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
2.1.2 Programming Issues . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
2.1.3 Libraries and Utility classes used . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
2.2 Structure of the Learning system . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
2.2.1 The Listeners . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
2.2.2 The Agent . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
2.2.3 The Environment Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
2.2.4 The Action Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
2.2.5 The Agent Controllers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
2.2.6 The State Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
2.2.7 Logging the Training Trials . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
2.2.8 Parameter representation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
2.2.9 A general interface for testing the learning performance . . . . . . . . . . . . . . . 36
3 State Representations in RL 39
3.1 Discrete State Representations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
3.1.1 Discretization of continuous Problems . . . . . . . . . . . . . . . . . . . . . . . . . 39
3.1.2 State Discretization in the RL Toolbox . . . . . . . . . . . . . . . . . . . . . . . . . 39
3.1.3 Discretizing continuous state variables . . . . . . . . . . . . . . . . . . . . . . . . . 40
3.1.4 Combining discrete state variables . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
3.1.5 Combining discrete state objects . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
3.1.6 State substitutions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
3.2 Linear Feature States . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
3.2.1 Tile coding . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
3.2.2 Linear interpolation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
3.2.3 RBF-Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
3.2.4 Linear features in the RL Toolbox . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
3.2.5 Laying uniform grids over the state space . . . . . . . . . . . . . . . . . . . . . . . 44
3.2.6 Calculating features from a single continuous state variable . . . . . . . . . . . . . . 45
3.3 States for Neural Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
4 General Reinforcement Learning Algorithms 48
4.1 Theory on Value based approaches . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
4.1.1 Value Functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
4.1.2 Q-Functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
4.1.3 Optimal Value Functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
4.1.4 Implementation in the RL Toolbox . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
4.2 Dynamic Programming . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
4.2.1 Evaluating the V-Function of a given policy . . . . . . . . . . . . . . . . . . . . . . 52
4.2.2 Evaluating the Q-Function . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
4.2.3 Policy Iteration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
4.2.4 Value iteration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
4.2.5 The Dynamic Programming Implementation in the Toolbox . . . . . . . . . . . . . 53
4.3 Learning the V-Function . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
4.3.1 Temporal Difference Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
4.3.2 TD(λ) V-Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
4.3.3 Eligibility traces for Value Functions . . . . . . . . . . . . . . . . . . . . . . . . . 56
4.3.4 Implementation in the RL Toolbox . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
4.4 Learning the Q-Function . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
4.4.1 TD Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
4.4.2 TD(λ) Q-Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
4.4.3 Implementation in the RL Toolbox . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
4.5 Action Selection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
4.5.1 Action Selection with V-Functions using Planning . . . . . . . . . . . . . . . . . . 61
4.5.2 Implementation in the RL Toolbox . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
4.6 Actor-Critic Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
4.6.1 Actors for two different actions . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
4.6.2 Actors for a discrete action set . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
4.6.3 Implementation in the RL Toolbox . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
4.7 Exploration in Reinforcement Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
4.7.1 Undirected Exploration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
4.7.2 Directed Exploration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
4.7.3 Model Free and Model Based directed exploration . . . . . . . . . . . . . . . . . . 68
4.7.4 Distal Exploration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
4.7.5 Selective Attention . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
4.7.6 Implementation in the RL Toolbox . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
4.8 Planning and model based learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
4.8.1 Planning and Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
4.8.2 The Dyna-Q algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
4.8.3 Prioritized Sweeping . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
4.8.4 Implementation in the RL Toolbox . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
5 Hierarchical Reinforcement Learning 75
5.1 Semi Markov Decision Processes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
5.1.1 Value and action value functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
5.1.2 Implementation in the RL Toolbox . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
5.2 Hierarchical Reinforcement Learning Approaches . . . . . . . . . . . . . . . . . . . . . . . 78
5.2.1 The Option framework . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79
5.2.2 Hierarchy of Abstract Machines . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79
5.2.3 MAX-Q Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80
5.2.4 Hierarchical RL used in optimal control tasks . . . . . . . . . . . . . . . . . . . . . 83
5.3 The Implementation of the Hierarchical Structure in the Toolbox . . . . . . . . . . . . . . . 84
5.3.1 Extended actions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
5.3.2 The Hierarchical Controller . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
5.3.3 Hierarchic SMDPs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86
5.3.4 Intermediate Steps . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86
5.3.5 Implementation of the Hierarchic architectures . . . . . . . . . . . . . . . . . . . . 88
6 Function Approximators for Reinforcement Learning 90
6.1 Gradient Descent . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90
6.1.1 Efficient Gradient Descent . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
6.2 The Gradient Calculation Model in the Toolbox . . . . . . . . . . . . . . . . . . . . . . . . 92
6.2.1 Representing the Gradient . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
6.2.2 Updating the Weights . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
6.2.3 Calculating the Gradient . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94
6.2.4 Calculating the Gradient of V-Functions and Q-Functions . . . . . . . . . . . . . . 94
6.2.5 Calculating the gradient of stochastic Policies . . . . . . . . . . . . . . . . . . . . 95
6.2.6 The supervised learning framework . . . . . . . . . . . . . . . . . . . . . . . . . . 96
6.3 Function Approximation Schemes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97
6.3.1 Tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97
6.3.2 Linear Approximators . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97
6.3.3 Gaussian Softmax Basis Function Networks (GSBFN) . . . . . . . . . . . . . . . . 98
6.3.4 Feed Forward Neural Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99
6.3.5 Gauss Sigmoid Neural Networks (GS-NNs) . . . . . . . . . . . . . . . . . . . . . . 100
6.3.6 Other interesting or utilized architectures for RL . . . . . . . . . . . . . . . . . . . 100
6.3.7 Implementation in the RL Toolbox . . . . . . . . . . . . . . . . . . . . . . . . . . . 103
7 Reinforcement learning for optimal control tasks 106
7.1 Using continuous actions in the Toolbox . . . . . . . . . . . . . . . . . . . . . . . . . . . . 106
7.1.1 Continuous action controllers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107
7.1.2 Gradient calculation of Continuous Policies . . . . . . . . . . . . . . . . . . . . . . 108
7.1.3 Continuous action Q-Functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109
7.1.4 Interpolation of Action Values . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 110
7.1.5 Continuous State and Action Models . . . . . . . . . . . . . . . . . . . . . . . . . 110
7.1.6 Learning the transition function . . . . . . . . . . . . . . . . . . . . . . . . . . . . 111
7.2 Value Function Approximation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 112
7.2.1 Direct Gradient Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 113
7.2.2 Residual Gradient Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 114
7.2.3 Residual Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 114
7.2.4 Generalizing the Results to TD-Learning . . . . . . . . . . . . . . . . . . . . . . . 115
7.2.5 TD(λ) with Function approximation . . . . . . . . . . . . . . . . . . . . . . . . . . 116
7.2.6 Implementation in the RL Toolbox . . . . . . . . . . . . . . . . . . . . . . . . . . . 117
7.3 Continuous Time Reinforcement Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . 118
7.3.1 Continuous Time RL formulation . . . . . . . . . . . . . . . . . . . . . . . . . . . 119
7.3.2 Learning the continuous time Value Function . . . . . . . . . . . . . . . . . . . . . 120
7.3.3 Continuous TD(λ) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 121
7.3.4 Finding the Greedy Control Variables . . . . . . . . . . . . . . . . . . . . . . . . . 122
7.3.5 Implementation in the RL Toolbox . . . . . . . . . . . . . . . . . . . . . . . . . . . 123
7.4 Advantage Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 124
7.4.1 Advantage Learning Update Rules . . . . . . . . . . . . . . . . . . . . . . . . . . . 125
7.4.2 Implementation in the RL Toolbox . . . . . . . . . . . . . . . . . . . . . . . . . . . 125
7.5 Policy Search Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 125
7.5.1 Policy Gradient Update Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . 126
7.5.2 Calculating the learning rate . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 126
7.5.3 The GPOMDP algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 127
7.5.4 The PEGASUS algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 129
7.5.5 Implementation in the RL Toolbox . . . . . . . . . . . . . . . . . . . . . . . . . . . 132
7.6 Continuous Actor-Critic Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 134
7.6.1 Stochastic Real Valued Unit (SRV) Algorithm . . . . . . . . . . . . . . . . . . . . 134
7.6.2 Policy Gradient Actor Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . 135
7.6.3 Implementation in the RL Toolbox . . . . . . . . . . . . . . . . . . . . . . . . . . . 137
8 Experiments 139
8.1 The Benchmark Tasks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 140
8.1.1 The Pendulum Swing Up Task . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 140
8.1.2 The Cart-Pole Swing Up Task . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 142
8.1.3 The Acrobot Swing Up Task . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 144
8.1.4 Approaches from Optimal Control . . . . . . . . . . . . . . . . . . . . . . . . . . . 146
8.2 V-Function Learning Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 146
8.2.1 Learning the Value Function . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 147
8.2.2 Action selection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 151
8.2.3 Comparison of Different Time Scales . . . . . . . . . . . . . . . . . . . . . . . . . 152
8.2.4 The influence of the Eligibility Traces . . . . . . . . . . . . . . . . . . . . . . . . . 153
8.2.5 Directed Exploration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 155
8.2.6 N-step V-Planning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 159
8.2.7 Hierarchical Learning with Subgoals . . . . . . . . . . . . . . . . . . . . . . . . . . 160
8.2.8 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 161
8.3 Q-Function Learning Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 162
8.3.1 Learning the Q-Function . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 163
8.3.2 Comparison of different time scales . . . . . . . . . . . . . . . . . . . . . . . . . . 165
8.3.3 Dyna-Q learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 165
8.3.4 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 166
8.4 Actor-Critic Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 166
8.4.1 Actor-Critic with Discrete Actions . . . . . . . . . . . . . . . . . . . . . . . . . . . 166
8.4.2 The SRV algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 167
8.4.3 Policy Gradient Actor-Critic Learning . . . . . . . . . . . . . . . . . . . . . . . . . 170
8.4.4 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 173
8.5 Comparison of the algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 173
8.6 Policy Gradient Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 175
8.6.1 GPOMDP . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 175
8.6.2 The PEGASUS algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 176
8.6.3 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 177
8.7 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 178
A List of Abbreviations 179
B List of Notations 180
C Bibliography 182
List of Figures
1.1 The Cart-pole Problem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
1.2 The Acrobot . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
1.3 The Truck Backer Upper Problem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
1.4 The robot stand up task . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
2.1 The structure of the learning system . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
2.2 Interaction of the agent with the environment . . . . . . . . . . . . . . . . . . . . . . . . . 27
2.3 Using an individual transition function for the Environment . . . . . . . . . . . . . . . . . 28
2.4 Action objects, Action sets and action data objects . . . . . . . . . . . . . . . . . . . . . . 29
2.5 The interaction of the agent with the controller . . . . . . . . . . . . . . . . . . . . . . . . 30
2.6 State Objects and State Properties . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
2.7 Calculating modified states and storing them in state collections . . . . . . . . . . . . . 32
2.8 The adaptable parameter representation of the Toolbox . . . . . . . . . . . . . . . . . . . . 35
3.1 Discretizing a single continuous state variable . . . . . . . . . . . . . . . . . . . . . . . . . 40
3.2 Combining several discrete state objects with the and operator . . . . . . . . . . . . . . . . 41
3.3 State Substitutions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
3.4 Tilings . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
3.5 Grid Based RBF-Centers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
3.6 Single state feature calculator . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
4.1 The value function . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
4.2 Q-Functions for a finite action set . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
4.3 Representation of the Transition Matrix . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
4.4 V-Function Planning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
4.5 Stochastic Policies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
4.6 The general Actor-Critic architecture. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
4.7 The class architecture of the Toolbox for the Actor-Critic framework . . . . . . . . . . . . . 66
4.8 Dyna-Q Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
5.1 Illustration of the execution of an MDP, an SMDP and an MDP with options . . . . . . . 76
5.2 Temporal Difference Learning with options . . . . . . . . . . . . . . . . . . . . . . . . . 78
5.3 State transition structure of a simple HAM . . . . . . . . . . . . . . . . . . . . . . . . . . . 80
5.4 Illustration of the taxi task . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
5.5 MAX-Q task decomposition for the Taxi problem . . . . . . . . . . . . . . . . . . . . . . . 81
5.6 Subgoals defined for the robot stand-up task . . . . . . . . . . . . . . . . . . . . . . . . . 83
5.7 The hierarchic controller architecture of the Toolbox . . . . . . . . . . . . . . . . . . . . . 85
5.8 The hierarchic Semi-MDP is used for learning in different hierarchy levels. . . . . . . . . 87
5.9 Intermediate steps of an option . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
5.10 Realization of the option framework . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89
5.11 Realization of the MAX-Q framework . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89
6.1 Interface for updating the weights of a parameterized FA . . . . . . . . . . . . . . . . . . . 93
6.2 Interface for parameterized FAs which provide the gradient calculation . . . . . . . . . . . . 94
6.3 Value Function class which uses a gradient function as function representation . . . . . . 95
6.4 Single Neuron . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100
6.5 Gaussian-Sigmoidal Neural Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101
7.1 Continuous action controllers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107
7.2 Limited control policy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109
7.3 Direct and Residual Gradients . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 115
8.1 The Pendulum Task . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 141
8.2 The Cart-pole Task . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 143
8.3 The Acrobot Task . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 144
8.4 Pendulum V-RBF Performances . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 147
8.5 Pendulum V-RBF Performances . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 148
8.6 Pendulum FF-NN Performance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 149
8.7 Pendulum FF-NN Learning Trials . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 150
8.8 Cart-Pole FF-NN Performance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 151
8.9 Pendulum GS-NN Performance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 151
8.10 Performances of different action selection schemes . . . . . . . . . . . . . . . . . . . . . 152
8.11 Comparison with different time scales . . . . . . . . . . . . . . . . . . . . . . . . . . . . 153
8.12 Pendulum V-RBF Performance with different e-traces (Direct Gradient) . . . . . . . . . . 154
8.13 Pendulum V-RBF Performance with different e-traces (Residual) . . . . . . . . . . . . . . 154
8.14 Pendulum V-FF-NN Performance with different e-traces (Direct) . . . . . . . . . . . . . . 155
8.15 Pendulum V-FF-NN Performance with different e-traces (Residual = 0.6) . . . . . . . . . 156
8.16 Pendulum V-FF-NN Performance with different e-traces (variable ) . . . . . . . . . . . . 156
8.17 Pendulum V-GS-NN Performance with different e-traces (Direct) . . . . . . . . . . . . . . 157
8.18 Pendulum V-GS-NN Performance with different e-traces (Residual = 0.6) . . . . . . . . . 157
8.19 Performance of directed exploration schemes . . . . . . . . . . . . . . . . . . . . . . . . . 158
8.20 Performance of directed exploration schemes . . . . . . . . . . . . . . . . . . . . . . . . . 158
8.21 Performance of V-Planning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 159
8.22 Performance of Hierarchic Learning with Subgoals . . . . . . . . . . . . . . . . . . . . . . 161
8.23 Pendulum Q-RBF Performance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 163
8.24 Cart-Pole Q-RBF Performance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 164
8.25 Pendulum Q-FF-NN Performance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 164
8.26 Performance of Q-Learning with different time scales and Dyna-Q Learning . . . . . . . . 165
8.27 Performance of discrete Actor-Critic Learning . . . . . . . . . . . . . . . . . . . . . . . . . 167
8.28 Pendulum performance of the SRV algorithm with RBF critic . . . . . . . . . . . . . . . . . 168
8.29 Pendulum performance of the SRV algorithm with FF-NN critic . . . . . . . . . . . . . . . 169
8.30 Pendulum performance of the SRV algorithm with FF-NN critic . . . . . . . . . . . . . . . 169
8.31 CartPole performance of the SRV algorithm with an RBF critic . . . . . . . . . . . . . . . . 170
8.32 Pendulum performance of the PGAC algorithm . . . . . . . . . . . . . . . . . . . . . . . . 171
8.33 Cart-Pole performance of the PGAC algorithm . . . . . . . . . . . . . . . . . . . . . . . . 172
8.34 Cart-Pole learning curve of the PGAC algorithm with a FF-NN critic . . . . . . . . . . . . . 172
8.35 Comparison of the algorithms for the pendulum task with a RBF network . . . . . . . . . . 174
8.36 Comparison of the algorithms for the cart-pole task with an RBF network . . . . . . . . . . 174
8.37 Comparison of the algorithms for the Pendulum-Task with an FF-NN network . . . . . . . . 175
8.38 Performance of the GPOMDP algorithm for the pendulum task . . . . . . . . . . . . . . . . 176
8.39 Performance of the PEGASUS algorithm for the pendulum task . . . . . . . . . . . . . . . 177
0.1 Abstract
This thesis investigates the use of reinforcement learning for optimal control problems and tests the performance of the most common existing RL algorithms on different optimal control benchmark problems. The tests consist of an exhaustive comparison of the introduced RL algorithms with different parameter settings and different function approximators. The tests also demonstrate the influence of specific parameters on the algorithms. To our knowledge, these tests are the most exhaustive benchmark tests done for RL.
We also developed a software framework for RL which makes our tests easily extendable. This framework is called the Reinforcement Learning Toolbox 2.0 (RLT 2.0), a general C++ library for all kinds of reinforcement learning problems. The Toolbox was designed to be of general use, to be extendable and to provide satisfying computational speed. Nearly all common RL algorithms, such as TD(λ) learning for the V-Function and Q-Function, discrete Actor-Critic learning, dynamic programming approaches and prioritized sweeping (see [49] for an introduction to these algorithms), are included in the Toolbox, as well as special algorithms for continuous state and action spaces which are particularly suited for optimal control tasks. These algorithms are TD(λ)-Learning with value function approximation (a new, slightly extended version of the TD learning residual algorithm (see [6]) has been used), continuous time RL [17], Advantage Learning [6], the stochastic real valued algorithm (SRV, see [19], [17]) as an Actor-Critic algorithm for continuous action spaces, and also two policy gradient algorithms, namely the GPOMDP algorithm [11] and modified versions of the PEGASUS algorithm presented in [33]. In addition to these mostly already existing algorithms, a new Actor-Critic algorithm is introduced. This new algorithm is referred to as policy-gradient Actor-Critic (PGAC) learning, and will be presented in section 7.6.2.
Most of these algorithms can be used with different kinds of function approximators; we implemented constant and also adaptive normalized RBF networks (GSBFNs, see [30]), feed forward neural networks (FF-NNs) and Gaussian sigmoidal neural networks [42]. The Toolbox uses a modular design, so extending the Toolbox with new algorithms is very easy because much of the necessary functionality is likely to have already been implemented.
The second part of this thesis concerns the evaluation of the learning algorithms for continuous control tasks and how they cope with different function approximators. The benchmark tests are done for all the algorithms mentioned above. These algorithms were tested on the pendulum, cart-pole and acrobot swing up tasks with constant normalized RBF networks, feed forward neural networks and also Gaussian-sigmoidal neural networks [42], as far as this was possible. The influence of certain parameters of the algorithms, such as the λ value, is also evaluated for the different function approximators, as is the use of different time scales. Furthermore, we investigated the use of planning, directed exploration and hierarchic learning to boost the performance of the algorithms.
0.2 Thesis Structure
The thesis is divided into two parts: the first part covers the Reinforcement Learning Toolbox and RL algorithms in general. We will discuss several algorithms and other theoretical aspects of RL. At the end of every theoretical discussion, the implementation issues of the Toolbox are explained, so this part of
the thesis can also be used as a manual for the Toolbox. The second part of this thesis (chapter 8) covers the
benchmark tests for optimal control tasks.
The thesis begins with a brief look at reinforcement learning itself, the successes and problems of reinforcement learning in general and specifically for continuous control tasks.
The next two sections of the thesis are more software related and deal with the general requirements and
structure of the Toolbox. These sections will cover the agent, environment, actions and state models of
the Toolbox. We will then take a look at the general reinforcement learning approaches, which includes first of all a theoretical discussion of value-based algorithms such as dynamic programming approaches and temporal difference learning. Actor-Critic learning and planning methods such as prioritized sweeping are also discussed, as is the problem of efficiently exploring the state space. The next section will cover
hierarchical reinforcement learning and how this is done in the Toolbox. In chapter six we will take a look at
function approximation using gradient descent in general, and its use in RL in particular. In this chapter we will also introduce the function approximation schemes used in RL and in our benchmark tests. In chapter
seven we will discuss more specialized algorithms for dealing with continuous state and action spaces.
Firstly we will cover algorithms for value function approximation [6], then we will come to continuous time RL [17] and advantage learning [6]. After this, two policy gradient algorithms, GPOMDP [11] and PEGASUS [33], are introduced, and general issues about policy gradient algorithms are discussed. At the end of this chapter, two different Actor-Critic algorithms are introduced: the stochastic real valued algorithm (SRV, [19]) and the newly proposed policy gradient Actor-Critic algorithm.
Chapter eight will cover the experiments with the pendulum, cart-pole and acrobot swing up benchmark
tasks, which are explained in the beginning of this chapter. For each algorithm the best parameter settings
are pointed out and the potentials, traps, advantages and disadvantages are discussed. The tests include
exhaustive tests with V-Function learning, using different approximation algorithms and different function approximators. The influence of crucial parameters of the algorithms is also evaluated, as is the performance of the algorithms using different time scales. The results are compared to Q-Function learning
algorithms and Actor-Critic approaches. The use of planning methods, directed exploration and hierarchical learning is also investigated in the case of V-Function learning. At the end, the experiments
with policy gradient algorithms are presented.
In the conclusion, we summarize the results and talk about further possibilities to improve the performance
of the algorithms.
Chapter 1
Reinforcement Learning
In this chapter we will explain the basics of reinforcement learning: what it is, its achievements and its problems.
1.1 Introduction
We define RL as learning from a reward signal to choose an optimal (or near optimal) action $a^*$ in the current state $s_t$ of the agent. Generally, the goal of all reinforcement learning algorithms is to find a good action-selection policy which optimizes the long-term reward. There are algorithms for optimizing the finite horizon, un-discounted reward $V(t_0) = \sum_{i}^{T} r(t_i)$, the (in)finite horizon discounted reward $V(t_0) = \sum_{i}^{\infty} \gamma^i r(t_i)$ ($\gamma$ is the discount factor) or also the average reward $V_A(t_0) = \lim_{T \to \infty} \frac{1}{T} \sum_{i}^{T} r(t_i)$, but the infinite horizon discounted reward is most commonly used. The agent learns from trial and error and attempts to adapt its action selection policy according to the received rewards.
Reinforcement learning is an unsupervised learning approach, which is one reason why reinforcement learning is so popular. In the best case, we only have to define our reward function, start our learning algorithm and we get an action selection policy that maximizes the long-term reward. Usually it is not that easy.
There is a huge variety of RL algorithms; the most common are value-based algorithms (those which try to learn the expected discounted reward for each state) and policy search algorithms, where the search is done directly in the space of the policy parameters. For policy search algorithms we can actually use any optimization algorithm we want, so there are approaches which use genetic algorithms or simulated annealing to search for a good policy (this is mentioned here to point out that RL does not necessarily require learning a value function). In this thesis we will emphasize the value-based algorithms; the use of policy search algorithms is not discussed and tested as exhaustively.
1.2 Problems of reinforcement learning
In practice, a learning problem faces many restrictions in order to achieve an optimal (or at least good)
policy. In general, Reinforcement Learning algorithms suffer from the following problems:
The curse of dimensionality: Many algorithms need to discretize the state space, which is impossible
for control problems with high dimensionality, because the number of discrete states would explode.
The choice of the state space is crucial in reinforcement learning, so a lot of time has to be spent
on designing the state space. In chapter six, we will discuss function approximation methods, which
overcome this problem, but we will also see that new problems are introduced by this approach.
Many learning trials: Most algorithms need a huge number of learning trials, especially if the state space is large, so it is very difficult to apply reinforcement learning to real-world tasks like robot learning. However, RL can also be very time-consuming even for learning a simulated control task.
Finding good parameters for the algorithms: Many algorithms work well, but only with the right
parameter setting. Searching for a good parameter setting is therefore crucial, in particular for time-
consuming learning processes. Thus algorithms which work with fewer parameters or allow a wider
range of parameter settings are preferable.
Exploration-Exploitation Dilemma: Often, even with a good state representation and enough learn-
ing trials for the learning process, the agent will become mired in suboptimal solutions, because the
agent has not searched through the state space thoroughly enough. On the other hand, if too many
exploration steps are used, the agent will not find a good policy at all. So the amount of exploration is another parameter to be considered (in a few cases, we can set an individual exploration parameter, for example, the noise of a controller); a minimal example of such an exploration scheme is sketched after this list.
A skilled reinforcement learner is needed: Merely defining the reward function (which, in itself, is not always easy) is not enough; we must also define a good state space representation or a good function approximator, choose an appropriate algorithm and set the parameters of the algorithm. Consequently, much knowledge and experience is needed when dealing with reinforcement learning. The argument that anybody can program a reinforcement learning agent (because one only needs to define a reward function intuitively) is not true in most cases.
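To make the exploration-exploitation trade-off mentioned above concrete, the following is a minimal sketch of ε-greedy action selection, one common undirected exploration scheme with a single exploration parameter. The function name and the plain vector of action values are illustrative only and are not taken from the Toolbox.

#include <cstdlib>
#include <vector>

// Generic epsilon-greedy action selection (illustration only): with probability
// epsilon a random action index is returned (exploration), otherwise the index
// of the highest estimated action value is returned (exploitation).
int selectEpsilonGreedy(const std::vector<double> &actionValues, double epsilon)
{
    int numActions = (int) actionValues.size();
    double u = (double) std::rand() / RAND_MAX;
    if (u < epsilon)
    {
        return std::rand() % numActions;   // explore: uniformly random action
    }
    int best = 0;
    for (int i = 1; i < numActions; i++)
    {
        if (actionValues[i] > actionValues[best])
        {
            best = i;                      // exploit: greedy action so far
        }
    }
    return best;
}

The single parameter epsilon directly controls the amount of exploration; more elaborate, directed exploration schemes are discussed in chapter 4.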
It appears that there are many problems to solve; nevertheless, reinforcement learning has been applied successfully in many different domains. The restrictions were only mentioned in order to emphasize that RL is by no means a panacea.
1.3 Successes in Reinforcement Learning
Because of the generality of Reinforcement Learning, many researchers have tried to apply reinforcement
learning to different fields, many of them successfully.
1.3.1 RL for Games
Games often have the problem of a huge state space, which cannot be represented as a table. Thus a very good function approximator is needed for learning, which makes the learning time very long, so supervised learning from human experts, search and planning methods or rule-based systems are often preferred to RL in games. But with a clever choice of the function approximator, and possibly by adding a hierarchic structure, RL is entirely applicable to the field of games.
Backgammon: The most popular and impressive RL approach is TD-Gammon by Gerald Tesauro [50]. The algorithm uses a feed forward neural network to determine the probability of winning for a given state. For learning, TD(λ) is used. For training, the algorithm uses self-play (over one million self-play games were used). This algorithm outperforms all other AI approaches and matches the performance of the world's best human players.
Chess and Checkers: Other approaches applying RL to chess (the KnightCap program by Baxter [10]) or checkers (Schaeffer [41]) were successful as well, but could not compete with human experts.
Settlers of Catan: M. Pfeiffer [38] used Reinforcement Learning to learn a good policy for the game Settlers of Catan. He employed hierarchical reinforcement learning and a model tree as function
approximator for his Q-Function. Even though the game is quite complex, the algorithm manages to
compete with skilled human players.
1.3.2 RL for Control Tasks
Reinforcement Learning for control tasks is a challenging problem, because we typically have continuous
state and action spaces. For learning with continuous state and action space, a function approximator must
be used. Since in most cases RL with function approximators needs many learning steps to converge, results exist mostly for simulated tasks. RL has been used to solve the following problems:
Cart-Pole Swing Up: The task is to swing up a pole hinged on a cart by applying a bounded force to
the cart. The cart must not leave a specified area. This task has one control variable (the force applied
to the cart) and four state variables. The task was solved successfully by Doya [17] with an RBF
network and also by Coulom [15] with a feed-forward neural network (FF-NN). Other approaches
using Actor-Critic learning have been investigated by Morimoto [31] and Si [43].
Figure 1.1: The Cart-pole Problem, taken from Coulom [15]
Acrobot: The acrobot has two links, one attached to the end of the other. There is one motor at the joint between the two links which can apply a limited torque. The task is to swing both links into the upward position. Here again we have one control variable and four state variables, but this task is more complex than the cart-pole swing up task. The task was solved by Coulom [15] with an FF-NN, and by Yonemura [56] with a hierarchic switching approach between several controllers from optimal
control theory.
Double Pendulum: Here we have the same constellation as in the acrobot task, but there is an additional motor at the base of the pendulum. Thus we have two control variables and four state
variables. Randlov [39] solved this problem with the help of an LQR (linear quadratic regulator)
controller around the area of the target state.
Figure 1.2: The Acrobot. In the double pendulum problem, an additional torque can also be applied to the fixed joint. The figure is taken from Yoshimoto [57]
Bicycle Problem: Randlov has written a bicycle simulator where the agent must balance a bike. Different tasks have been learned with this simulator, such as simply balancing the bicycle (the direction of motion does not matter) or riding in a specified direction. The agent can use two control variables: the torque applied to the handlebars and the displacement of the center of mass from the bicycle's plane. Depending on the task, the problem has four state variables (balancing the bike) or seven state variables (riding to a specific place). The simulator has been used by several researchers. Randlov [39] solved both tasks with Q-Learning; she used tilings for her state representation. Ng and Jordan [33] solved this task much more efficiently with the PEGASUS policy search algorithm.
Truck-Backer-Upper Problem: In this task the agent must navigate a trailer truck backwards to a docking point (see figure 1.3). The truck has to avoid an inner blocking between the cab and the trailer, a too large steering angle and hitting the wall. The goal is to navigate the trailer to a specified position perpendicular to the wall. The dynamics of the TBU task are highly nonlinear; the standard task configuration used by Vollbrecht [53] has six different continuous state variables: the x and y position of the trailer, the orientation of the trailer $\theta_{trailer}$, the inner orientation of the cab $\theta_{cab}$, its derivative $\dot{\theta}_{cab}$ and the current steering angle. To control the truck we can change the steering angle; the truck moves with a constant velocity. Vollbrecht successfully learned the TBU task, employing a hierarchic Q-Learning approach using an adaptive kd-tree for the state discretization.
Inverted Helicopter Flight: Ng [32] managed to learn inverted helicopter flight on a simulator with very good results. Inverted helicopter flight at a constant position is a highly nonlinear and unstable process, making it a difficult task for human experts. Ng created a model of the inverted helicopter flight by standard system identification methods. Learning was done with parameterized
regulators using the PEGASUS policy search algorithm in the simulator. The learned policy could
also be transferred successfully to the real model helicopter.
Swimmer Problem: The simulated swimmer consists of three or more connected links. The swimmer
must move in a two dimensional pool. The goal is to swim as quickly as possible in a given direction
by using the friction of the water. The state of an n-segment swimmer is defined by the n angles and the angular velocities of the segments, as well as the x and y velocity of the center of mass. This gives
Figure 1.3: The Truck Backer Upper Problem. The figure is taken from Vollbrecht [53]
us a state space of 2n + 2 dimensions. We can control every joint separately to arrive at n - 1 control variables.
Coulom [15] managed to learn the swimmer task for a 3, 4 and 5-segment swimmer (which gives
us a maximum of 12 state variables, which is quite respectable). He used continuous time RL and a
feed forward neural network with 30 neurons for the simpler and 60 neurons for the more complex
swimmers as function approximators. Training was done for more than 2 million learning trials to
get good policies. Consequently the learning time was huge. The learning performance also showed
many instabilities, but the learning system always managed to recover from these instabilities, except
in the case of the five-segment swimmer, where the learning performance collapsed after over two
million learning trials.
Racetrack Problem: For this problem the Robot Auto Racing Simulator (RARS) is used to learn to
drive a car. The simulator uses a very simple two dimensional model of car driving, where a single car
has four state variables (the 2-dimensional position p and the velocity v) and two control variables.
Additional state information about the track can be added to the state space (e.g. if different tracks are used during learning). The aim of the task is to drive around the track as fast as possible, either on the empty track or in a race with opponents. There are annual championships where several different algorithms can compete in a race. Current algorithms either calculate an optimal path off-line first,
which results in very good lap times, but which is poor if passing an opponent is necessary. Other
approaches try to find a good policy by observing the current state and by using clever heuristics. These policies are usually good at passing opponents, but the lap times are not as good as those of the off-line path calculation. Coulom [15] tried to learn a policy which has good lap times and is also good at passing, but he had very limited success; the learned controller could not compete with either of the existing approaches. Coulom tried two different approaches using continuous time RL, one with a 30-neuron feed forward neural network and one using specific useful features, which performed
better. The best policy managed to solve the given training track in 38 seconds, which is 8 seconds
slower than one of the fastest existing policies.
Robot Stand-Up Problem: In this task, a three-linked planar robot has to stand up from the lying position. The robot has up to 10 state variables ($\theta_1, \dot{\theta}_1, \theta_2, \dot{\theta}_2, \theta_3, \dot{\theta}_3, x, \dot{x}, y, \dot{y}$), but for the stand-up task only the first six state variables are used. The robot can be controlled by applying torques to the
two joints of the robot. Morimoto and Doya [30] successfully used hierarchic RL to learn this task.
Q-Learning was used for the upper hierarchy level and an Actor-Critic approach was used for the
lower hierarchy. An adaptive normalized RBF-network (GSBFN) was used as function approximator.
Figure 1.4: The robot stand up task, the figure is taken from Morimoto [30]
1.3.3 RL for Robotics
RL is rather difficult to apply to robotics because it needs many learning trials, and in many cases we can only measure a part of the state of the robot. Nevertheless, RL has been applied successfully to robots by several researchers; usually policy gradient approaches are used in this case due to the high-dimensional state spaces of these problems.
Quadruped Gait Control and Ball Acquisition: Stone, Kohl [24] and Fidelman [18] used a policy
gradient algorithm to learn fast locomotion with the dog-like robot Aibo. The same approach was
used to learn Ball Acquisition, where the task was to capture the ball under the chin of the robot without kicking it away. They used a parameterized open-loop controller with 12 or 4 parameters; the gradient of the policy was estimated using a numerical differentiation approach directly on the AIBO robot. The learned locomotion policy was faster than all the other existing hand-coded and learned policies. The ball acquisition task could be learned successfully as well; this task had to be optimized manually for different gait controls and different walking surfaces.
Robot Navigation Tasks: Smart [45], [47] uses Q-Learning in combination with locally weighted
learning [4] for navigating a mobile robot, including obstacle avoidance tasks.
Humanoid Robots: Peters [37] uses the natural Actor-Critic algorithm [23] for point-to-point movements with a humanoid robot arm, given the desired trajectory. A pre-defined regulator is used for the policy, and the parameters of the regulator are optimized. The successful approach in this paper indicates that RL can also be useful for very high-dimensional tasks like humanoid robot learning.
As we can see, many interesting problems have been solved using RL in robotics and optimal control, but also in many other interesting learning domains. If used correctly, RL can solve very hard learning problems.
1.4 Basic Denitions for RL
1.4.1 Markov Decision Process (MDP)
In the formal definition (taken from Sutton's book [49]), an MDP consists of:
The State-Space S: The space of all possible states. It can be discrete ($S \subset \mathbb{N}$), continuous ($S \subset \mathbb{R}^n$) or a mixture of both.
The Action Space A: The space of all actions the agent can choose from; again, it can be discrete (a set of actions), continuous or a mixture of both.
A numerical reward function $r : S \times A \rightarrow \mathbb{R}$.
The state transition function $f : S \times A \rightarrow S$.
An initial state distribution $d : S \rightarrow [0, 1]$ over the state space.
The agent is the acting object in the environment; at each step the agent can choose an action to execute, which affects the state of the agent (according to the state transition function). The typical task of the agent is to find a policy $\pi : S \rightarrow A$ that maximizes the future discounted reward at time t: $V_t = \sum_{k=0}^{\infty} \gamma^k r(t + k)$. $\gamma$ is the discount factor and is restricted to the interval [0, 1]. For $\gamma < 1$ the sum is bounded if the reward function is bounded; for $\gamma = 1$ the sum can diverge even for bounded rewards. The state transition function f, the reward function r and the policy $\pi$ may be stochastic functions. For MDPs we have to make an additional assumption on the state transition function and the reward function, which is called the Markov property: both functions may only depend on the current state and action, not on any state, action or reward that occurred in the past. Consequently, we can write $r(t) = r(s_t, a_t, s_{t+1})$ for the reward function and $s_{t+1} = f(s_t, a_t)$ for the transition function (or, when talking about probability distributions, $P(s_{t+1} \mid s_t, a_t, s_{t-1}, a_{t-1}, s_{t-2}, a_{t-2}, \ldots) = P(s_{t+1} \mid s_t, a_t)$). Most of the algorithms require the Markov property for their proven convergence to the optimal policy, but may still work if the Markov property is not violated too drastically.
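As a simple illustration of the discounted reward criterion defined above, the following sketch computes the (truncated) return of a finite, logged reward sequence. The function name and the use of a plain reward vector are hypothetical and do not correspond to actual Toolbox classes.

#include <vector>

// Illustration only (not Toolbox code): computes the truncated discounted return
// V_t = sum_k gamma^k * r(t + k) over a finite, logged reward sequence.
double discountedReturn(const std::vector<double> &rewards, unsigned int t, double gamma)
{
    double value = 0.0;
    double discount = 1.0;                 // gamma^0
    for (unsigned int k = t; k < rewards.size(); k++)
    {
        value += discount * rewards[k];    // adds gamma^(k - t) * r(k)
        discount *= gamma;
    }
    return value;
}

For an infinite horizon, the sum is approximated by such a truncated episode; with gamma < 1 and bounded rewards the neglected tail becomes arbitrarily small.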
1.4.2 Partially Observable Markov Decision Processes (POMDP)
POMDPs lack the Markov property: the next state depends not only on the current state and action, it can depend on the whole history. There are two points of view for POMDPs. If the current state of the POMDP can be definitely determined from the history of states, we can convert a POMDP into an MDP by adding the whole history to the current state. Then the decision process would have the Markov property again. But this approach vastly increases the state space size, so it is not applicable.
The second point of view is that we can see a POMDP as an MDP with belief states. This is applicable
when parts of the state of the POMDP are not visible to the agent. Here, the agent maintains a probability
distribution of what it believes about the current, not directly observable state; this distribution can then be updated according to Bayes' rule. The belief distribution itself can now be seen as the state of the decision process; as a consequence, the process is an MDP again.
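As an illustration of such a belief update (the notation here is ours and not introduced in this thesis: $b$ denotes the current belief, $P$ the transition probabilities and $O$ an assumed observation model), a Bayesian update after executing action $a$ and observing $o$ can be written as
\[
b'(s') = \frac{O(o \mid s', a) \sum_{s \in S} P(s' \mid s, a)\, b(s)}{\sum_{s'' \in S} O(o \mid s'', a) \sum_{s \in S} P(s'' \mid s, a)\, b(s)}.
\]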
After having fixed the basic definitions of RL, we can take a closer look at the RL Toolbox in the next
chapter.
Chapter 2
The Reinforcement Learning Toolbox
2.1 Introduction
The Reinforcement Learning Toolbox (RLT) is a general C++ Library for all kinds of reinforcement learning
problems (not just continuous ones). The Toolbox was designed to be of general use, to be extendable and to provide a satisfactory computational speed. The library can be used on Windows and Linux. The RLT is a general tool for researchers, and also for students, who want to use reinforcement learning; it spares them a lot of additional programming work and allows the researcher to concentrate on the learning problem instead. Since it requires more effort to write a Toolbox of general use instead of coding the programs just for a specific case, the Toolbox is a main part of this thesis.
The Toolbox contains a large selection of the most common RL algorithms. There are TD(λ) for Q- and V-Learning, all with two different residuals for discrete and continuous time learning, the Residual (Gradient) algorithm, Advantage Learning, several Actor-Critic methods and policy gradient algorithms (like GPOMDP [11]) and a version of PEGASUS [32]. Most of these algorithms can be used with different kinds of function approximators; we implemented constant and also adaptive normalized RBF networks (GSBFNs, see [30]), feed forward neural networks (FF-NNs) and Gaussian sigmoidal neural networks [42].
One main goal of the Toolbox is to enable the end-user to use RL without having to do any programming.
The Toolbox (the current and the old version) has been used by 20 to 30 researchers from all over the world,
and hopefully further users will test the Toolbox in the future.
2.1.1 Design Issues
For the design of the Toolbox, we attached importance to the following points:
Adaptable Learning System: The learning system should be very adaptable, so that new algorithms can be added easily. A general interface for the learning algorithms is needed. In order to be able to try many different algorithms for a learning problem, it should be possible to exchange the learning algorithm easily. The possibility of learning with more than one algorithm at a time (which can be used for off-policy learning) also has to be considered. Since reinforcement learning problems should have the Markov property, we decided to provide each algorithm with just the tuple $<s_t, a_t, s_{t+1}>$ at each time step (a minimal interface sketch is given after this list).
Learning from other Controllers: One good way to induce prior knowledge is to show the learning
algorithm how to solve the learning problem (usually not in an optimal way, otherwise you would
not need a learning tool). You can do this with a self written controller. It is often easy to write a
simple policy which solves the learning problem, but for the learning algorithm these simple policies
are difficult to find if the state space is very large. There has to be the possibility of using a controller
independent from the learning algorithm, so a general interface for controllers is needed.
The algorithms should be independent from the used state representation: Very few algorithms
depend on a single kind of state representation (e.g. discrete states or linear features) or on a sin-
gle kind of function approximator. So the algorithms should work with any kind of representation we
want to use for the Q-Functions or learned policies. In general, there are three different representations we can learn with the different algorithms: we can learn V-Functions, Q-Functions or the policy directly, depending on the algorithm used. These three different kinds of representations need a general interface for getting, setting and updating the value of the function for a specific state. Consequently, the algorithms will work no matter what function approximator (Tables, Linear approximators, Feed
Forward NNs) is used for the learned representation.
Easy methods for constructing your state space individually: An RL system should provide tools
for constructing and adapting the state space very easily, because this is one of the most crucial aspects
of RL. We have to provide tools for partitioning continuous state variables, combining discrete state
variables and substituting a discrete state space (which is more accurate) for a specific discrete state
number of another discrete state space (see chapter 3).
Tools for logging, analyzing policies and error recognition: In order to provide the opportunity
to analyze the learning process we have to construct tools for logging the episodes, analyzing V-
Functions, Q-Functions and Controllers. There are also a few areas like robotics where episodes are
very expensive to obtain (i.e. time consuming). There also has to be the possibility of learning from stored episodes instead of learning online. Learning from stored episodes will not work as well as online learning (since it is off-policy learning), but the stored episodes can be used as a kind of prior knowledge for the agent. The stored episode data is also used by a few planning algorithms.
We also added tools for analyzing and visualizing policies, V and Q-Functions and a sophisticated
debugging system.
Representation of the actions: Since there is such a wide range of applications for RL, a single data structure that matches all possible actions does not exist. There are actions having only a specific index, actions having continuous action values and actions having different durations. For hierarchical learning, actions can consist of other actions. A class structure must be developed to match all these requirements.
Hierarchical Reinforcement Learning: In hierarchical reinforcement learning we can construct different layers of hierarchy for the learning problem. In each hierarchy level we then again have a Markov decision process with the same requirements as for the original learning system. So, simply put, we have an agent in each hierarchy level which must decide what to do. The design of the agent has to consider that it can also act in a hierarchic MDP instead of directly in the environment.
Speed: For most reinforcement learning problems we need a huge number of trials to learn from. Very often the parameters are not chosen correctly at first, so the learning process must be repeated many times to find good parameter regimes. Thus speed is a crucial design issue if the Toolbox is to be usable. In many situations, a more complex implementation has been chosen to obtain better performance. There is always a trade-off between good performance and the generality of a software package; consequently, the Toolbox will never be as quick as specialized solutions which are optimized just for one
algorithm and for a specific learning problem. But with a good implementation of the classes, quite impressive performance can be reached.
Easy to Use: Above all, the Toolbox should be user friendly, and not just for RL experts. Thus the class system has to be intuitive.
2.1.2 Programming Issues
The Toolbox conforms to the following programming standards (if not explicitly mentioned otherwise):
All class names begin with a C prefix.
All method names begin with a lower case letter; each further word within a method name begins with an upper case letter.
The use of references instead of pointers is intentionally avoided.
For standard input and output operations, the ANSI C functions printf and scanf have always been used. The use of cout, cin and other C++ streams has been avoided deliberately.
All objects that are created by the Toolbox are deleted when they are not used any more. If an object is instantiated by the user, the Toolbox will never delete it.
Nearly all objects or data fields that are needed more than once and must be dynamically allocated are created just once globally for the class and are not deleted until after their final use. This is done very rigorously, especially if the data field is needed in each step of the learning trial.
If a class uses a data array given from outside the class, it always creates its own copy and never stores just the pointer to the given array. This is done for usability, so that we can also pass statically allocated arrays, which lose their scope later on, to objects.
Please follow these standards when extending the Toolbox.
2.1.3 Libraries and Utility classes used
The Torch library
The Torch library (www.torch.ch) is a C++ class framework for the most common supervised learning algorithms. We use the Torch library because of its good neural network support. With the Torch library we can create arbitrary feed forward NNs with an arbitrary number of different layers. The layers can be interconnected as needed (but usually a straightforward NN is used). For these neural networks the gradient with respect to the weights can be calculated, given a specific input and output. For the gradient calculation the Torch library uses the back propagation algorithm; the gradient is needed by many different algorithms to update the policy or the V-Function. We integrated the whole Torch library gradient calculation into the Toolbox, so that every Torch gradient machine can be used. For further details consult the online reference of the Torch library (www.torch.ch).
The Math Utilities
For convenience we also developed a small mathematical utility library for vector/matrix calculations. There
are two main classes:
CMyVector: The vector class represents an n-dimensional row or column vector containing real numbers. The main mathematical operations, such as multiplication with a scalar, the dot product, vector addition and multiplication with a matrix, have been implemented.
CMyMatrix: The matrix class represents an n × m matrix of real numbers. Again, basic mathematical operations have been implemented (matrix multiplication, multiplication with a vector), but there are no complex operations like the calculation of the inverse.
We intentionally did not use an existing library, because this approach fits best into our class system and, moreover, only basic mathematical operations are needed. Many objects that can be represented as a vector (e.g. a state or a continuous action value vector) are derived directly from the vector class, so mathematical calculations with these objects are very easy.
Debugging Tools
Since the Toolbox and reinforcement learning processes in general are very complex systems, we need a good debugging tool to recognize errors more easily. If the learning process does not show the expected results, it is usually very hard to distinguish bugs from incorrect parameter settings or the wrong use of an algorithm. So the learning process must be exactly trackable. But tracking the learning process is very time consuming, and since we usually do not need the whole learning process, but rather only parts of it, we decided on the following system:
The debug output is written to specified files via calls to DebugPrint.
Debug outputs are always assigned to a specific symbol (e.g. 'q' stands for Q-Functions). We can assign an individual debug file for each symbol or a general debug file for all symbols.
The debug output of a certain symbol is only written to a file if a file has been specified with DebugInit(symbol, filename). Otherwise the DebugPrint call is ignored.
The symbol '+' used with DebugInit is an abbreviation for all debug symbols, so it enables all debug outputs.
Nearly all objects needed for learning have their individual debugging messages. But usually the output is not totally intuitive; thus it is often necessary to look up in the source code where a debug message occurred.
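As a minimal sketch of how this debugging system is meant to be used (the exact signatures of DebugInit and DebugPrint are assumptions based on the description above; only the symbol/filename idea is taken from the text):

#include <stdio.h>
// Hypothetical usage sketch of the debugging system; signatures are assumed.
void debugUsageExample()
{
    // Register a debug file for the symbol "q" (Q-Function debug output);
    // DebugInit("+", ...) would enable all debug symbols at once.
    DebugInit("q", "qfunction_debug.log");

    double oldValue = 0.0, newValue = 1.5;
    // Written only because a file was registered for "q" above;
    // otherwise this DebugPrint call would be ignored.
    DebugPrint("q", "updated Q-Value from %f to %f\n", oldValue, newValue);
}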
2.2 Structure of the Learning system
To learn an optimal or good policy for an MDP we need the tuple < s_t, a_t, r_t, s_{t+1} >, no matter what algorithm we use. Thus we need a good and robust system which provides these values for our learning algorithms. The learning system is structured into three main parts. The agent (class CAgent) interacts with its environment. It has an internal state and can execute actions in its environment which affect this internal state. The second part of the learning system is the listeners (class CSemiMDPListener). The agent maintains a list of listener objects. At each step the agent informs the listeners about the current state, the executed action and the next state (so the listeners obtain the tuple < s_t, a_t, s_{t+1} >). The agent also informs the listeners when a new episode is started.
What the listener class does with this information is not determined at this point. The listeners can be used for different learning algorithms, but also for logging or parameter adaptation. Through this principle, many
Figure 2.1: The structure of the learning system, with the agent, the environment, the agent listeners as interface for the learning algorithms, and the agent controllers
listeners can trace the training trials, so we can do logging and learning simultaneously. It is also possible to use more than one learning algorithm at a time, but we typically have to do off-policy learning when using several learning algorithms at once, so this is only partially recommendable.
The final main part is the controller of the agent. A controller tells the agent which action to execute in the current state. This controller commonly calculates the action with a Q-Function, but some algorithms use another representation of the policy. It would also be convenient to be able to use self-coded controllers to improve the learning performance. For that reason we must design a general interface for controllers which is decoupled from any learning algorithm, so that we can use arbitrary controllers with any learning algorithm as listener.
2.2.1 The Listeners
Listeners all inherit from the class CSemiMDPListener. This class defines the interface the agent uses to send the step information < s_t, a_t, s_{t+1} > and the beginning of a new episode to the listeners. The interface basically consists of two functions which must be overridden by the subclasses:
nextStep(CStateCollection *, CAction *, CStateCollection *): called by the agent to send the < s_t, a_t, s_{t+1} > tuple to the listener.
newEpisode(): when this function is called, the agent indicates that a new episode has begun.
So, as already discussed, each listener gets the < s_t, a_t, s_{t+1} > tuple, but for learning we also need to know the reward output of this step.
Reward Functions and Reward Listeners
A reward function has to return the reward for each step, so it implements the function r_t = r(s_t, a_t, s_{t+1}). Since we want to be as flexible as possible, we do not want to use just one fixed reward function per learning problem. We decouple the reward function from the rest of the environment model. Reward functions
are implemented through the interface CRewardFunction, where the function getReward must be implemented. There is also a class representing the reward function r_t = r(s_t), which depends only on the current state. This class is called CStateReward.
The information coming from a reward function is provided to the listeners by the class CSemiMDPRewardListener. This kind of listener gets a reward function object as an argument in the constructor and passes the reward as additional information to the nextStep method. Thus we can define different reward functions for different listeners. For reward listeners the nextStep method has the following signature:
nextStep(CStateCollection *, CAction *, double reward, CStateCollection *)
Nearly all learning algorithms implement the CSemiMDPRewardListener interface.
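As a small sketch of how a component hooks into this interface, the following hypothetical listener sums up the reward per episode. The nextStep and newEpisode signatures follow the description above; the constructor argument and the required headers are assumptions.

#include <stdio.h>
// Hypothetical reward listener; the base-class constructor details are assumed.
class CEpisodeRewardTracker : public CSemiMDPRewardListener
{
protected:
    double episodeReward;   // reward accumulated during the current episode

public:
    CEpisodeRewardTracker(CRewardFunction *rewardFunction)
        : CSemiMDPRewardListener(rewardFunction), episodeReward(0.0) {}

    // Receives the < s_t, a_t, r_t, s_t+1 > information at each step.
    virtual void nextStep(CStateCollection *oldState, CAction *action,
                          double reward, CStateCollection *newState)
    {
        episodeReward += reward;
    }

    // Called by the agent whenever a new episode begins.
    virtual void newEpisode()
    {
        printf("Accumulated episode reward: %f\n", episodeReward);
        episodeReward = 0.0;
    }
};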
2.2.2 The Agent
As already mentioned, the agent is the acting object. The agent has a current internal state and can execute actions to change this state. At each step it stores the current state, executes the action and stores the next state. This information is then sent to all listeners. Since this approach should work for different environments, we must decouple the internal state representation and the state transitions from the agent class. Therefore we introduce environment models. An environment model stores the current state and implements the state transitions. The agent itself is independent of the learning problem and is implemented in the class CAgent. The agent has a set of actions from which it can choose. Usually it follows a policy coming from a controller object (CAgentController, set by setController(CAgentController *)).
The agent provides functions for executing a single step or a given number of episodes with a specified maximum number of steps. The agent class also provides a function for starting a new episode. With this function, the model is reset and the new episode event is sent to the listeners. Of course this is only possible for simulated tasks where we have the possibility of resetting the environment. In a robotic task, where the robot has to be placed at a given starting position, this would not be possible.
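To illustrate the typical control flow, the following sketch sets up an agent for a given environment model and runs a number of episodes. setController is taken from the description above; addSemiMDPListener and doControllerEpisode are assumed method names for attaching a listener and for running episodes.

// Hypothetical training loop; method names other than setController are assumptions.
void runTraining(CEnvironmentModel *environment,
                 CAgentController *policy,
                 CSemiMDPListener *learner)
{
    CAgent agent(environment);            // the agent acts on the given model
    agent.setController(policy);          // the controller chooses the actions
    agent.addSemiMDPListener(learner);    // the learner receives every step tuple

    // Run 100 episodes with at most 1000 steps each; at the start of every
    // episode the model is reset and newEpisode() is sent to all listeners.
    agent.doControllerEpisode(100, 1000);
}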
2.2.3 The Environment Models
The environment model maintains the current internal state and describes the agent's internal state transitions when executing an action. It also determines whether an episode has ended. In the Toolbox we distinguish state transitions which can be described exactly (e.g. all simulated tasks) from those which cannot (e.g. robotic tasks). We provide an individual interface for each of these two different types of environment models:
The CEnvironmentModel class
The class CEnvironmentModel represents the agent's environment model. It provides functions for fetching the current state into a state object, executing an action and determining whether the episode has ended. For this functionality the user has to implement the following methods:
The doNextState(CPrimitiveAction *) function has to calculate the internal state transition (or execute an action and measure the new state). To indicate that the model has to be reset after the current step (because the episode has ended) we must set the reset flag; to indicate that the episode failed, we set the failed flag.
The getState(CState *state) function allows the agent to fetch the current state. The internal state variables have to be written into the state object. We will discuss the state model later.
doResetModel(): Here we have to reset the internal model variables. For example, in simulated tasks we can set the internal state to an initial position; in a robot learning task we would have to wait until the robot has been moved to its initial position.
Figure 2.2: Interaction of the agent with the environment
The CTransitionFunction Class
This interface represents the transition function which is used for the internal state transitions. This class should be implemented instead of the CEnvironmentModel class if the transition function is known, which is generally true for all (self-coded) simulated tasks. The class has to implement the function s' = f(s, a). Additionally there are functions for retrieving an initial state for a new episode and for determining whether the model should be reset in a given state. For this functionality we provide the following interface methods to the user:
transitionFunction(CState *oldState, CAction *action, CState *newState, CActionData *actionData): Here the state transition s_{t+1} = f(s_t, a_t) must be implemented. The calculated new state has to be written into the specified newState object.
getResetState(CState *resetState): Here we can specify our initial states for the episodes. These can be random states or a set of specified states. The initial state has to be written into the specified resetState object. A few initial state sampling methods (like random sampling or initialization with zero) are already implemented; the kind of initial states we want to use is specified with the function setResetType.
isFailedState(CState *state): returns whether the episode has failed in the given state.
isResetState(CState *state): returns whether the model should be reset after visiting this state (similar to isFailedState).
This transition function, which is specified by the user, can now be used to create an environment model which maintains the current state of the agent and uses the transition function for the state transitions. It also resets the model according to the specified functions. This functionality is provided by the class CTransitionFunctionEnvironment.
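As an illustration, the following sketch implements the transition function of a one-dimensional point mass (state = position and velocity, continuous action = applied force). The transitionFunction signature is taken from the description above; the constructor arguments, the state accessors (getContinuousState/setContinuousState) and the continuous action data interface are assumptions.

// Hypothetical transition function for a 1D point mass; accessor names are assumed.
class CPointMassTransitionFunction : public CTransitionFunction
{
protected:
    double dt;     // length of one simulation step
    double mass;   // mass of the point

public:
    CPointMassTransitionFunction(CStateProperties *properties, CActionSet *actions)
        : CTransitionFunction(properties, actions), dt(0.01), mass(1.0) {}

    virtual void transitionFunction(CState *oldState, CAction *action,
                                    CState *newState, CActionData *actionData)
    {
        double x = oldState->getContinuousState(0);
        double v = oldState->getContinuousState(1);

        // Use the action data passed as a parameter if given, otherwise the
        // action's native data object (as required by the action data model).
        CContinuousActionData *data = (CContinuousActionData *)
            (actionData != 0 ? actionData : action->getActionData());
        double force = data->getActionValue(0);

        // Simple Euler integration of the dynamics s_{t+1} = f(s_t, a_t).
        newState->setContinuousState(0, x + v * dt);
        newState->setContinuousState(1, v + force / mass * dt);
    }
};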
Figure 2.3: Using an individual transition function for the Environment
2.2.4 The Action Model
As already discussed in the design issues section, the action model has to match a wide variety of applications. When only using a finite set of actions, the index of the action in the action set is typically the only information the action contains.
But there are actions which take more than one step to execute; for that reason, we should also be able to store the number of steps needed. This number does not have to be fixed, but can depend on the current state.
For a continuous action space, the action object must store the action values which have been used. Again, these actions can last for more than one step. There should also be the possibility of intermixing continuous actions with discrete ones (e.g. robot soccer: navigating and shooting).
Other actions contain primitive actions, as in hierarchic learning, but we will discuss this kind of action later.
We decided on the following action model: discrete actions coming from an action set do not contain any information; the information is just the index of the action in the action set. An action object is always created only once, and the action object pointer serves as the search criterion in an action set. This approach works well for actions which do not contain any changeable data, but it will not work, for example, for continuous actions. For this changeable data we provide a general interface for setting, obtaining and creating data objects. We introduce action data objects for decoupling the action from its changeable information. Each type of action with changeable data has a specific kind of action data object; multi-step actions store the duration in the action data object, while continuous actions store the action value vector. The current action data is stored in the action's own action data object.
But this approach leads to another problem. What happens if an algorithm wants to change the action data (for example, set other continuous action values and determine another Q-Value)? If the action data object is changed, all other listeners get this falsified action data object, because all listeners receive the same action object. So we would need to rely on all listeners to change the action data back to its original state after using it. This would not be a good approach; therefore we introduce additional action data parameters for all methods which receive action objects. This action data parameter always has higher priority than the action data object of the action itself. As a result, the action data passed as the parameter is always used. If the action data parameter is omitted, the action's native action data object is used. A listener must not change the action data object of an action; instead it has to use its own individual action data objects. The data object of the action is only changed by the agent itself and always represents the action executed in the current step.
All actions provide a general interface for obtaining the action data object of the action (a null pointer is
returned if no action data object is used) and for creating a new action data object of the correct type. Additionally, all action data objects provide functions for setting and copying the action data from another action data object, so it is easy to create new action data objects and use them if needed.
Figure 2.4: Action objects, action sets and action data objects
The following action types with different action data objects are implemented in the Toolbox.
Discrete Actions
As already mentioned, all the information of a discrete action is contained in the action index, which is already represented by the action pointer and an action set. What else do we need to represent discrete actions? In some states a specific action may not be available, so it must be possible to restrict the action set for certain states. As a result, the action object has to provide a function which determines whether the action is available in a given state. An action can also last for more than one step (we only allow a fixed duration here). These functionalities are already implemented in the action base class CAction. The CAction class already defines the interface for the action data objects, but its individual action data object is always empty, because there is no data to store.
Multi-step Actions
For actions which do not have a fixed duration, we need another implementation, because this variable duration must be stored in an action data object. The action data object contains the following information:
The number of steps the action has already been executed.
Whether the action has finished in the current step (usually used for hierarchical learning).
Multi-step actions are represented by the class CMultiStepAction. Whether an action has finished has to be decided in the current step (so it depends on < s_t, s_{t+1} >). This approach gives us two possibilities for using multi-step actions:
The duration can be set by the environment model. This is useful in robotics, for example, where we do not know the exact duration of a specific action before execution. After execution, the duration can be measured and then stored in the multi-step action data object. In this case, the finished flag will always be true to indicate that another action can be chosen in the next step.
The duration and the finished flag can be set by a hierarchic controller, which increments the duration at each step and decides whether it should continue executing the action. We will discuss this approach later in the hierarchical reinforcement learning section.
Continuous Actions
Continuous actions store an action value vector with one action value per control variable. In order to facilitate calculations with continuous action data objects, these data objects are derived directly from the CMyVector class.
2.2.5 The Agent Controllers
For controllers, we need an interface which is decoupled from the learning algorithms used, so that we can use any controller we want, no matter which listeners we use. This is accomplished with relative ease by introducing an individual controller object for the agent, which can be set by the user. The controller can choose from a given action set and has to return an action for a given state. This approach works well when using discrete actions which contain no changeable information; in this case we can return the action pointer and we are finished. But this approach does not work for our action data model. The controller is not allowed to change the action data object of the action itself (only the agent is allowed to do that), so how can we return changeable action data from a controller?
At this point we introduce action data sets. An action data set is the companion piece of an action set: for each action in an action set we store a corresponding new action data object in the set (provided that type of action has an action data object). Note that this can be action data of any kind, so we are able to mix the different action types. When the agent (or any other object) wants to retrieve an action from a controller, it always passes an additional action data set to the controller. The controller then chooses a specific action, modifies the action data object assigned to the chosen action and returns the pointer of the chosen action. In order to access the action data object, the agent gets it from its individual action data set. After retrieving the action from the controller, the agent changes the content of the action data of the current action to the action data calculated by the controller. The native action data objects can only be changed by the agent.
Figure 2.5: The interaction of the agent with the controller
2.2.6 The State Model
Since the choice of the state representation used for a learning problem is one of the most essential steps, we must design a very powerful state model.
In order to avoid misunderstandings due to the different formulations, we use the following notation for the state model:
state: everything that is a state object.
state variable: a single state variable from a state object (so, for example, continuous state variable number one, which could be the x location of the agent).
model state: the state object obtained from the environment model, thus the agent's internal state.
modified state: a state object that can be calculated from the model state. For instance, this can be a discretization of the continuous model state.
For general reinforcement learning tasks, we have an arbitrary number of continuous and discrete state variables. Our state model collects these state variables in one state object.
A state is represented by the class CState and consists of an arbitrary number of continuous and discrete state variables.
The state properties object CStateProperties stores the number of discrete and continuous state variables a state object maintains. It also stores the discrete state sizes for the discrete state variables and the valid ranges for the continuous state variables. For continuous state variables we can additionally specify whether the variable is periodic or not (e.g. for angles). The state properties are created either by the environment model (where the user has to specify the exact properties) or, for modified states, by the state modifier. All state objects describing the same state maintain a pointer to the specified state properties object; they do not create their own copy.
The model state should contain all information about the agent's internal state: usually a few continuous and discrete state variables. There is no need for the model state to contain any discretization of the continuous state variables, because this type of information is stored elsewhere.
Figure 2.6: State Objects and State Properties
The State Modifiers
Up to this point we have defined a general representation for the agent's internal state, but we can usually not use this state directly for learning. In general we need to discretize the model state or calculate the activation
factors of different RBF centers. All these new state representations can also be represented by our state class (for example, a discrete state is a CState object containing just one discrete state variable). So we need to define an interface which takes the model state and calculates a modified state representation from this model state. This is done by the class CStateModifier. A state modifier gets the model state (or even other modified states) and returns a modified state, which can now be, for example, a state containing only one discrete state variable for the discretization. Thus every component with access to the model state can calculate a discretization of that state if needed. To maintain flexibility, we do not restrict the components to using just one specific state representation.
All modified state representations have their individual state properties; therefore we derive the class from the state properties class, so that the state modifiers can also be used to create new state objects. The state modifiers have to implement the interface function getModifiedState(CStateCollection *originalStates, CState *modifiedState), where the modified state is calculated and stored in the given modifiedState object.
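As a small example of this interface, the following hypothetical state modifier replaces an angle variable of the model state by its sine and cosine, a common preprocessing step for function approximators. The getModifiedState signature is taken from the description above; the CStateModifier constructor arguments (number of continuous and discrete state variables) and the state accessors are assumptions.

#include <math.h>
// Hypothetical state modifier; constructor arguments and accessors are assumed.
class CAngleToSinCosModifier : public CStateModifier
{
protected:
    int angleIndex;   // index of the angle variable in the model state

public:
    CAngleToSinCosModifier(int angleIndex)
        : CStateModifier(2, 0), angleIndex(angleIndex) {}

    virtual void getModifiedState(CStateCollection *originalStates,
                                  CState *modifiedState)
    {
        double angle = originalStates->getState()->getContinuousState(angleIndex);
        modifiedState->setContinuousState(0, sin(angle));
        modifiedState->setContinuousState(1, cos(angle));
    }
};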
The State Collections
It is likely that one and the same modified state is used by more than one component (for example, the different V-Functions of a Q-Function usually use the same state representation). In order to avoid redundant calculations of modified states we introduce state collections (class CStateCollection). State collections maintain a collection of different state objects: the model state and each modified state needed by the whole learning system. We then pass state collections instead of single state objects to our learning components. Whenever a component needs to access a specific state, it retrieves the specified state from the collection. The modified state is only calculated the first time it is needed for the current model state and is subsequently stored in the state collection. The state modifier gets a state collection as input, so it can also use other modified states for its calculation. The modified state is also marked as valid, so the state collection knows that it does not have to be calculated repeatedly.
Figure 2.7: Calculating modified states and storing them in state collections
Our state collections use the state properties pointer as the index for a state object. Thus, in order to retrieve a specific state, we only need to pass the state properties object of the desired state. If no state properties object is specified, the collection will always return the model state.
2.2.7 Logging the Training Trials
For the Toolbox we also need tools for logging the training trials in order to trace the learned policy or to reuse the stored trials for learning. We must store the states and the actions, as well as the reward values. For the states it would be convenient if we could log more than one state of the state collection, for example if we wanted to store the features of an RBF network too, because calculating these features can be quite time consuming.
In a few areas it is very difficult to gather the learning data (e.g. in robotics), so it is useful to have a tool that is able to log entire training trials and then use these trials to learn again with other parameters for the algorithm or even with another learning algorithm. Of course, the stored episodes can only be used for off-policy learning, that is, a different policy is learned than the one that was followed. Off-policy learning often leads to worse performance, but can be used as a kind of prior knowledge before starting the real learning. Due to our listener design, creating logging tools is very easy, because the loggers can be implemented as agent listeners. A listing of the most important classes for logging follows below.
State Lists
For storing a whole episode in memory we need a list of states. Creating a new state object for each step and placing it in a list would be possible, but it is rather slow, because a state object must be dynamically allocated each time. Thus we decided to design an individual state list class (CStateList) which maintains a vector for each state variable (double vectors for continuous state variables, integer vectors for discrete state variables). Since we use STL (Standard Template Library) vectors, the vectors are dynamically enlarged as needed. The class provides functions for appending a state at the end of the list and for retrieving the state at a given index (an already existing state object is passed as a buffer). The class also supports saving/loading a state list to/from disk. The output format follows the structure of the state list class: the vector for each state variable is stored separately on a new line. So if we look at the output file, we see the state transitions for each state variable separately.
State Collection Lists
The class CStateCollectionList stores a list of state collections. Therefore this class contains a set of state lists; we can choose which states from the state collection we want to store.
It is therefore possible to store not only the model states of a learning trial, but also other states such as the calculated RBF features (which are rather time consuming to compute for big RBF networks). Storing RBF features, on the other hand, requires quite a lot of memory, but that should not be a problem nowadays.
Action Lists
As already discussed, an action consists mainly of its action pointer and its action data object. Storing the action pointer does not make sense, so we store only the index of the action in a given action set. We also have to store the action data object of the action. For the action data we store copies of the current action data object, so these have to be dynamically allocated.
In the output format we can see the sequence of action indices; each index is followed by the action data of the action (if there is any).
The Episodes
The episode objects (CEpisode) can store one episode in memory. They are already designed as listeners. Since only one episode can be stored, the episode object discards all stored data once a new episode begins. The class maintains a state collection list and an action list, so we can specify which states we want to store. In an episode there are obviously numSteps + 1 states and numSteps actions to store. The episode objects can already be used to store the current episode to a file: first the state collection list is stored, and then the action list is written to the file.
Logging the entire learning process
The agent logger (CAgentLogger) is able to store more than one episode. It also implements the CSemiMDPListener interface in order to get the data from the agent. The agent logger maintains a list of episode objects, so whole episodes can be retrieved from the logger, and from the episodes the single states can be retrieved again. The number of episodes the logger should hold in memory can be set. When storing the whole learning trial to a file, the output function of the episode objects is used; the same holds when loading from a file.
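A typical use of the logger is sketched below: the logger is attached to the agent as a listener and the held episodes are written to a file after the trial. The constructor arguments, the saveData name and the other agent methods are assumptions based on the description above.

#include <stdio.h>
// Hypothetical logging sketch; constructor arguments and method names are assumed.
void logTrainingTrial(CAgent *agent, CStateProperties *modelState, CActionSet *actions)
{
    // Keep at most 50 episodes in memory and log only the model state.
    CAgentLogger *logger = new CAgentLogger(modelState, actions, 50);
    agent->addSemiMDPListener(logger);

    agent->doControllerEpisode(50, 1000);   // run the training trial

    FILE *file = fopen("training_trial.data", "w");
    logger->saveData(file);                 // write all held episodes to disk
    fclose(file);
}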
The Episode Output class
The output format of the agent logger class is easy for the computer to parse when reloading it, but since it is not readable for humans, we additionally created an individual class for better readability. For us it is more practical to read the logged learning trials as a sequence of < s_t, a_t, r_t, s_{t+1} > tuples. The class CEpisodeOutput provides this functionality. It does not hold anything in memory, but writes the step tuple directly to the file. For this class we can only specify one single state from the state collection.
There is also a second class (CEpisodeOutputStateChanged) which does the same, but only when the specified state changes. This is useful if we want to trace discrete states, which do not change very often.
Using stored episodes for learning
We already mentioned that we would like a tool which enables us to learn from a stored learning trial. In order to provide this functionality we create the interface CEpisodeHistory. This interface represents a set of stored episodes.
For presenting the stored episodes to the listeners, we need an environment model which steps through the stored states of an agent logger and a controller class which steps through the stored actions. Both the controller class and the environment class are implemented by the class CStoredEpisodeModel. If the episodes of the agent logger contain more than the model state, the class copies the additional states into the state collection of the agent logger and marks them as valid. Therefore stored modified states can be reused and do not need to be calculated again.
Another method of using previous episodes is to perform batch updates as mentioned in [49]. When performing batch updates, after each episode we show one or more previous episodes to the learning algorithm again. This can improve learning for certain kinds of algorithms (e.g. Q-Learning), but it also falsifies the state transition distribution, so we have to be careful with model based algorithms. Batch updates are represented by the class CBatchEpisodeUpdate, which presents a specified number of stored episodes to a specified listener after each new episode. The batch update class also defines its own public functions for presenting a specific episode, N random episodes or all episodes to the listener.
Only the episodes the agent logger currently holds in memory can be used, which gives us the ability to use only the newest K episodes for the batch updates.
Using the stored steps for learning
It is also possible to use the stored single-step information for learning from past episodes. Here we send randomly chosen (and thereby temporally unrelated) steps to the listeners. Since the steps are temporally unrelated, we can see each step as an individual episode.
Again we introduce an individual interface, the CStepHistory interface, which represents a (time independent) set of steps. This interface also implements functions for presenting N randomly chosen steps or all steps to a given listener.
Similar to batch updates, we may use past step information during the learning process. We can perform these step updates as long as there is time (until the next step begins). For batch step updates we create the class CBatchStepUpdate. We can specify the number of steps from the history presented to a listener after each real step and after each episode. The steps are chosen randomly.
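A sketch of attaching such batch step updates to the learning process follows; the constructor arguments of CBatchStepUpdate and the agent method name are assumptions based on the description above.

// Hypothetical batch step update setup; constructor arguments are assumed.
void addBatchStepUpdates(CAgent *agent, CStepHistory *history, CSemiMDPListener *learner)
{
    // Replay 5 randomly chosen stored steps after each real step and
    // another 50 stored steps after each episode.
    CBatchStepUpdate *batchUpdate = new CBatchStepUpdate(learner, history, 5, 50);
    agent->addSemiMDPListener(batchUpdate);
}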
Whether the episode or the step update is better is not clear; this depends on the problem. Both approaches can only be used for Q-Value based algorithms. The advantage of the episode update is that algorithms with e-traces can be used; the advantage of the step update is that the steps can be chosen randomly. By intermixing the step information, the algorithm might discover a better action selection strategy.
The idea of doing the step updates is also strongly connected with approaches that combine planning and learning, like the Dyna-Q algorithm [49]. We will discuss this algorithm in the next chapter.
2.2.8 Parameter representation
In our design we also need a general interface for the algorithms' parameters. We decided on the following concept. Each parameter is represented as a < string, double > pair, where the string represents the name and the double value determines the parameter's value. The parameters of an object are stored in a string map. The class CParameters provides functions for adding a new parameter and for getting and setting a parameter's value. All objects that maintain some sort of parameters implement this interface, thus a parameter's value is always retrieved or changed in the same way.
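A short sketch of how such a parameter interface is typically used follows; the method names setParameter/getParameter and the parameter names themselves are assumptions, only the < string, double > concept is taken from the description above.

#include <stdio.h>
// Hypothetical parameter usage; method and parameter names are assumed.
void configureLearner(CParameterObject *learner)
{
    // Each parameter is addressed by its name and holds a double value.
    learner->setParameter("QLearningRate", 0.2);
    learner->setParameter("DiscountFactor", 0.95);

    printf("Learning rate: %f\n", learner->getParameter("QLearningRate"));
}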
Figure 2.8: The adaptable parameter representation of the Toolbox
Parameters of different object hierarchies
An object A is on a lower hierarchical level than another object B if A is referenced by B. We will call A a child object of B (note that this has a different meaning than for child classes, where the class information is
inherited; in this case child objects are only referenced). Often objects on a low hierarchical level contain additional parameters. In general, it is complicated to retrieve these objects, so we decided on a more complex parameter representation.
Each object referencing other objects with parameters takes on the parameters of its children. Thus all parameters of objects on a lower hierarchical level are added to the parameter map of the parent object. All objects also maintain a list of their child objects, and every time a parameter is changed at the parent object, the corresponding parameter of the child object(s) is adapted too. Thus we can change a parameter which is actually a parameter of a child object, for instance of the e-traces of a learning algorithm, directly at the learning algorithm.
All listeners, all agent controllers and all classes which represent learned data are subclasses of the CParameterObject class, so parameter handling is generalized for all these objects.
Adaptive parameter calculation
With our design it is also easy to add the ability to adapt parameter values dynamically. In the area of RL, this approach can be used for many different parameters, such as the learning rate or an exploration rate. In the normal case, the parameter's value depends on one of the following quantities:
the number of steps
the number of episodes
the average reward of the last N steps
the estimated future discounted reward (coming from a value function)
For each of these quantities we provide a corresponding adaptive parameter calculator class (CAdaptiveParameterFromNStepsCalculator, CAdaptiveParameterFromNEpisodesCalculator, CAdaptiveParameterFromAverageRewardCalculator, CAdaptiveParameterFromValueCalculator). All of these are subclasses of CAdaptiveParameterCalculator and can therefore be assigned to a parameter of a parameter object. If an adaptive parameter calculator has been assigned to a certain parameter, the value coming from the calculator is used instead of the constant value from the map. These adaptive parameter classes also provide a huge degree of freedom for calculating the parameter's value. We can set different offsets and scales for both the target value (number of steps, episodes, average reward, ...) and the parameter value. For the target-value/parameter-value mapping we can choose from different functions, such as a linear, square or logarithmic function.
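As an illustration, the sketch below attaches an adaptive calculator that decays a learning rate with the number of steps; the constructor arguments and the name setParameterCalculator are assumptions.

// Hypothetical adaptive parameter setup; all constructor arguments are assumed.
void attachDecayingLearningRate(CParameterObject *learner)
{
    // Decay the learning rate from 0.5 towards 0.05 over the first 100000
    // steps, using a linear target-value/parameter-value mapping.
    CAdaptiveParameterFromNStepsCalculator *calculator =
        new CAdaptiveParameterFromNStepsCalculator(0.5, 0.05, 100000);

    learner->setParameterCalculator("QLearningRate", calculator);
}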
2.2.9 A general interface for testing the learning performance
For this thesis, we need to test the performance of specific algorithms with specific parameter settings. We will call an algorithm with a specific parameter setting a test suite. Test suites are particularly important for the benchmark tests of this thesis. With an interface for evaluating a test suite, we could also write tools for finding good parameter settings of an algorithm automatically. In order to design such tools we need the following preliminaries.
An interface for learned data
Every RL algorithm needs to store the learned data in a specific kind of representation (e.g. a Q-Function or directly the policy). Due to the wide range of algorithms, this learned data can be stored in many
different representations. Nevertheless, there are functionalities for such learned data objects which are needed for all these representations, such as resetting, storing or loading the learned data. Therefore we create the interface CLearnedDataObject, which provides abstract functions for these functionalities. All classes which maintain some kind of learned data implement this interface.
Policy Evaluation
How do we decide whether a policy is a good or a bad one? We can estimate the future discounted reward for a certain number of states or the average reward during a certain number of episodes from real or simulated experience. These methods are also called Monte Carlo methods. We provide the classes CValueCalculator and CAverageRewardCalculator. Both classes are subclasses of CPolicyEvaluator and can thus be used in the same way. For both classes we can set the number of episodes used for the evaluation. The initial states of the episodes are sampled as usual (determined by the environment model). If the initial states are sampled randomly, we need a large number of episodes, in particular for large initial state spaces, in order to get a reliable result. We also make it possible to use the same set of initial states for each evaluation with the classes CSameStateValueCalculator and CSameStateAverageRewardCalculator. Always using the same states does not make the result more reliable, but since there is less variance between the results of different policy evaluation trials, this method is better suited for tracing the learning process.
Test Suites
As already mentioned, we refer to a specific algorithm-parameter setting as a test suite. We require that the test suites can be evaluated with a scalar value, i.e. we want a scalar value indicating how good the learning performance of this test suite is. We also want to be able to change some of the algorithm's parameters and then evaluate the test suite once more.
In our approach a test suite consists of one or more listeners (representing the learning algorithm) as well as one or more learned data objects (representing the Q-Functions, V-Functions or policies). The test suite class (CListenerTestSuite) maintains a list of both object categories, the listeners and the learned data objects. The class also has access to the agent, the controller used during learning and the controller used for evaluation. The user has the opportunity to employ different controllers for evaluation and learning. For example, the learning controller can use exploration steps, which are not desirable for the evaluation process.
The test suite class already provides the functionality needed for learning a given number of steps and episodes (for this, the agent is needed).
Evaluating test suites
There are many ways to evaluate a test suite. For example, we can measure the average reward or some other quantity during the learning process. This gives us a good estimate of the algorithm's performance. But it is also possible to count the number of episodes the algorithm needs to achieve the goal of the learning task several times. We created the interface CTestSuiteEvaluator as a common interface for all these kinds of test suite evaluation. We only implemented the first approach.
In our test suite evaluation approach we begin by learning for a given number of steps and episodes. Then the learners are disabled (removed from the agent's listener list) and the test suite's evaluation policy is evaluated with a given policy evaluator. This value is stored and then the learning is resumed. This is not a very fast approach, since quite a bit of simulation time is spent on policy evaluation, but it is more reliable because an individual evaluation policy can be used without the falsifying effect of exploration. The average of the
values evaluated during the learning trial is used as the result. But evaluating the learning process for just one learning trial is not very reliable, so the evaluation can be repeated several times.
The policy evaluation values obtained during the learning process are also stored in a database-like file format. If a test suite with a specific parameter setting has already been evaluated, the stored values are reused. They are also used for the creation of the diagrams.
Searching for good parameter settings
The Toolbox provides tools for searching the parameter space of one specified parameter. We may simply evaluate specific, given parameter values and return the best one. Or we can specify a starting point, the number of search iterations and the search interval, and the Toolbox will try to find a good parameter setting. This is done by the class CParameterCalculator. When searching within the specified interval, the class begins with the starting point and then evaluates the policy at double and at half of the starting parameter's value. It continues with this process until it finds a specific maximum (more than 25% better than the worst result), or until it leaves the given interval. After this procedure it tries to locate the maximum value more accurately if there are any iterations left.
Chapter 3
State Representations in RL
In RL there are three common state representations which are used for learning:
Discrete States: Discrete states identify the current state of the agent with just one discrete state number. This state number is then used for look-up tables.
Feature States: These states are used for linear function approximators. A feature state consists of n features, each having an activation factor typically within [0, 1].
Other function approximators (like feed-forward neural networks): States for other function approximators usually have no requirements; they can consist of any number of discrete and continuous state variables.
All of the discussed state representations can be used for most learning algorithms, which use these states as input for their Q-Functions (or V-Functions) or directly for their learned policy. We will discuss each of these state models, as well as the function representations that can be used with these states.
3.1 Discrete State Representations
3.1.1 Discretization of continuous Problems
For continuous learning tasks, a discrete state representation can be problematic. The continuous MDP can lose its Markov property if the state discretization is too coarse. As a consequence, there are states which are not distinguishable by the agent, but which have quite different effects on the agent's future. It also follows that the probabilities P(s' | s, a) change for different policies (since we have lost the Markov property). Nevertheless, if this effect is not too dramatic, most of the algorithms can cope with it. There are some successful examples of using a discrete state representation for continuous state problems, such as Sutton's Actor-Critic cart-pole balancing task ([49], p. 183). These approaches obviously could not calculate the optimal policy, but they worked sufficiently well, at least for such easy cases. In general, it is advisable to use discrete state representations only for discrete problems.
3.1.2 State Discretization in the RL Toolbox
For calculating a discrete state representation of the model state, we introduce the class CAbstractStateDiscretizer. The user can implement any state discretization he wants by deriving from this class and overriding the getDiscreteStateNumber method, which returns a discrete state number given the current
model state. But because calculating this state discretization by hand is normally tedious, we provide tools to simplify the process. When dealing with discrete state variables, it is necessary to consider the following scenarios:
We have one or more continuous state variables and want to discretize them.
We have two or more discrete state variables within the same state object and we want to combine them.
We have two or more discrete state objects and want to combine them.
We want to add a more precise state representation only for one or more discrete state numbers. For example, the discrete state object A contains useful information only if the discrete state object B is in state X; otherwise we can neglect the information in A. Combining A and B would give us a state size of |A| · |B| states, but by substituting the state object A for the state X of state object B (instead of being in state X, we can now be in one of the states of A) we get a state size of |A| + |B| − 1. We will call this approach a state substitution.
We designed classes which support all these scenarios, so that a discrete state representation can be built easily. All the classes mentioned below are subclasses of CAbstractStateDiscretizer, so they fit into our state modifier model.
3.1.3 Discretizing continuous state variables
In our approach we can only discretize a single continuous state variable at a time, so we get a discrete state object for each continuous state variable. These objects can then be combined later on.
The class CSingleStateDiscretizer implements this approach. We can specify an arbitrary partition array and the continuous state variable to be discretized.
Figure 3.1: Discretizing a single continuous state variable
3.1.4 Combining discrete state variables
In a few model states we have more than one discrete state variable, e.g. the x and y coordinates in a grid world. The class CModelStateDiscretizer combines these state variables into one discrete state object. We can also specify which discrete state variables we want to use. The new discrete state size is obviously the product of all initial discrete state sizes.
3.1.5 Combining discrete state objects
If we have more than one discrete state object (e.g. if we have discretized several continuous variables), we can combine them into one discrete state object with the class CDiscreteStateOperatorAnd. We can use an arbitrary number of discrete state objects for the and operator, and the new discrete state size is again the product of all discrete state sizes.
Figure 3.2: Combining several discrete state objects with the and operator
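The following sketch builds a discrete state for a two-dimensional continuous model state (e.g. position and velocity) by discretizing each variable separately and combining the results with the and operator. The class names are taken from the description above; the constructor arguments of CSingleStateDiscretizer and the name of the method for adding discretizers to the operator are assumptions.

// Hypothetical discretization setup; constructor and method arguments are assumed.
CStateModifier* buildDiscreteState()
{
    static double positionBorders[] = {-1.0, -0.5, 0.0, 0.5, 1.0};
    static double velocityBorders[] = {-2.0, 0.0, 2.0};

    // One discretizer per continuous state variable ...
    CSingleStateDiscretizer *positionDisc =
        new CSingleStateDiscretizer(0, 5, positionBorders);
    CSingleStateDiscretizer *velocityDisc =
        new CSingleStateDiscretizer(1, 3, velocityBorders);

    // ... combined into one discrete state object whose size is the product
    // of the single discrete state sizes.
    CDiscreteStateOperatorAnd *discreteState = new CDiscreteStateOperatorAnd();
    discreteState->addStateDiscretizer(positionDisc);
    discreteState->addStateDiscretizer(velocityDisc);
    return discreteState;
}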
3.1.6 State substitutions
The circumstances in which state substitutions are needed have already been explained. It would be advantageous to be able to use such state substitutions for each discretizer. We therefore add this functionality to the abstract discretizer class. With the function addStateSubstitution we can substitute a discrete state representation coming from another discretizer object for a given discrete state number.
Figure 3.3: Substituting state object B for state a_5 of state object A. Two state scenarios are sketched in green and yellow. In the green case, state object A is in state a_5, so the state b_3 from state object B is used. In the yellow case, state object A is in state a_2, so the information from B is neglected.
With these functionalities we have covered the most common use cases, so a discrete state representation can easily be defined with these classes.
3.2 Linear Feature States
These states are used for linear function approximators. Linear function approximators are very popular, because they generalize better than discrete states and are also easy to learn, at least when using local features. A feature state consists of N features, each having an activation factor between [0, 1]. Linear approximators calculate their function value as

f(x) = \sum_{i=1}^{N} \phi_i(x) w_i    (3.1)

where \phi_i(x) is the activation function and w_i is the weight of feature i. Note that the discrete state representation is a special case of a linear function approximator, where exactly one feature has the activation factor 1.0 and all others have the activation factor 0.0. It is therefore possible to treat linear feature states and discrete states in the same way, because we can divide a feature state into several discrete states with different weightings. If a feature has only local influence, as in RBF networks, adapting the weights of the approximator changes the function value only within a neighborhood. That is one reason why learning with these linear function approximators yields a much better performance than learning with feed forward NNs. Another reason is, of course, that it is a linear function, which is typically easy to learn. The drawback of local features is that, as for discrete states, the number of features grows exponentially with the number of dimensions of the model state space. So we do not get rid of the curse of dimensionality by using local linear features; only the generalization ability improves in comparison to discrete states. There are many ways to calculate the feature factors. We will discuss three common approaches: tile coding, RBF networks and linear interpolation.
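To make equation (3.1) concrete, the following generic sketch (not Toolbox code) evaluates a linear approximator for a sparse feature state, where only the active features contribute to the sum:

// Generic evaluation of f(x) = sum_i phi_i(x) * w_i over the active features only.
#include <cstddef>

double linearValue(const int *activeIndices, const double *activeFactors,
                   std::size_t numActive, const double *weights)
{
    double value = 0.0;
    for (std::size_t i = 0; i < numActive; ++i)
        value += activeFactors[i] * weights[activeIndices[i]];   // phi_i(x) * w_i
    return value;
}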
3.2.1 Tile coding
In tile coding, the features are grouped into exhaustive partitions of the input state space. Each partition is called a tiling, and each element of a partition is called a tile. There is always exactly one active feature per tiling, but we can use several tilings simultaneously, so that the number of active features equals the number of tilings used. The shape of the partitions is not restricted, but we usually use grid based partitions of the state space. We can combine partitions with different sizes, different offsets or even partitions over different input space variables.
3.2.2 Linear interpolation
For linear interpolation, each feature has a center position. In each dimension just two features are active, namely those that are nearest to the current state. All other features have an activation factor of zero. The feature factors are scaled linearly with the distance to the feature centers. Thus we get 2^N active features, where N is the number of input dimensions, and the factors are calculated by

\phi_i(x) = \prod_{j=1}^{N} \frac{dist_j - |x_j - Pos(\phi_i)_j|}{dist_j}    (3.2)

where dist_j is the distance between the two adjacent features in dimension j, and Pos(\phi_i)_j is the j-th dimension of the position vector of the i-th feature.
3.2.3 RBF-Networks
Here we use RBF functions with fixed centers and sigmas, so we just have to learn the linear scale factors of the RBF functions. The RBF function is given by

\phi_i(x) = \exp\left(-\frac{1}{2} (x - c_i) \Sigma_i^{-1} (x - c_i)^T\right)    (3.3)

where c_i is the center and \Sigma_i the covariance of the i-th RBF function.
A fixed uniform grid of centers is typically used for RBF functions and linear interpolators. A more sophisticated distribution of the centers is often useful and also necessary, but such a distribution is hard to find, just as a good discrete state representation is.
3.2.4 Linear features in the RL Toolbox
The linear feature factors do not depend on the weights of the approximator, so they can easily be represented in our state model. The linear feature factors are always calculated by a state modifier and stored in a state object. So, if the feature state is needed more than once for a state, no redundant recalculation is needed. All feature states are created by subclasses of the interface CFeatureCalculator, which in turn is a subclass of the CStateModifier class. This class receives the number of features and the maximum number of active features as input; with this information the state properties are initialized correctly.
Additionally, we want to be able to combine different feature states easily, therefore we provide two feature operator classes.
The Or Operator
The or operator provides the possibility of using different, independent feature states simultaneously.
We can use an arbitrary number of feature calculators for the or operator; all the active features from the
different modifiers are then simultaneously active in the same feature state. The feature state size is the sum
of all sub-feature state sizes, and the number of active features is obviously the sum of all numbers of active
features. All feature factors are normalized after this calculation so that the sum of all factors is 1.0.
The feature operator or is used to combine two or more (for the most part) independent feature states
describing the same continuous state space. Examples of this include tilings or RBF networks with different
offsets/resolutions, which increase the accuracy or perhaps the generalization properties of the linear state
representation.
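A minimal sketch of the idea behind the or operator is given below, assuming active features are passed around as (index, factor) pairs; the names and signatures are illustrative and do not reproduce the actual Toolbox interface.

#include <vector>

struct Feature { int index; double factor; };

// Minimal sketch of an "or" feature operator: the active features of several
// independent calculators are placed side by side in one feature state.
// Indices of the i-th sub-state are shifted by the sizes of its predecessors,
// and all factors are re-normalized so that they sum to 1.0.
std::vector<Feature> orOperator(const std::vector<std::vector<Feature>>& subStates,
                                const std::vector<int>& subStateSizes)
{
    std::vector<Feature> combined;
    int offset = 0;
    double sum = 0.0;
    for (size_t i = 0; i < subStates.size(); ++i) {
        for (Feature f : subStates[i]) {
            combined.push_back({f.index + offset, f.factor});
            sum += f.factor;
        }
        offset += subStateSizes[i];
    }
    for (Feature& f : combined)
        f.factor /= sum;   // normalization step described in the text
    return combined;
}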
The And Operator
The and operator allows us to use different feature states that describe different, dependent continuous
state variables simultaneously. This class works primarily like the discrete and operator class, in that it
calculates, for the tuple $< f_i, f_j, f_k, \ldots, f_n >$ of active features, where each feature comes from another feature
calculator, a new unique feature index. The new activation factor of the feature is the product of all feature
factors $\phi_i(x) \cdot \phi_j(x) \cdot \phi_k(x) \cdots \phi_n(x)$. The feature state size of the operator is the product of all feature state
sizes, and the number of active states is the product of all numbers of active sub-feature states.
The and feature operator is used to combine two or more dependent states: for example, if we use features
coming from single continuous state variables.
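For two sub-feature states the operation can be sketched as follows; repeated application handles more sub-states. The struct and function names are again illustrative assumptions, not the Toolbox classes.

#include <vector>

struct Feature { int index; double factor; };

// Minimal sketch of an "and" feature operator for two sub-feature states:
// every pair <f_a, f_b> of active features gets a unique combined index
// (f_a.index * sizeB + f_b.index) and the product of the two factors.
std::vector<Feature> andOperator(const std::vector<Feature>& a,
                                 const std::vector<Feature>& b,
                                 int sizeB)
{
    std::vector<Feature> combined;
    for (const Feature& fa : a)
        for (const Feature& fb : b)
            combined.push_back({fa.index * sizeB + fb.index,
                                fa.factor * fb.factor});
    return combined;
}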
3.2.5 Laying uniform grids over the state space
In many cases we want to lay a grid over the state space, because we do not have enough knowledge or time
to specify the distribution of the feature centers more sophisticatedly. The grid can represent tilings, RBF
centers, or linear interpolation centers (or any other functions). The base class for all these different grids
is CGridFeatureCalculator. This class contains functions to specify the grid, to calculate the position of a
feature (the exact position of a feature is always the middle of a tile) and to determine the active tiling of
the grid (the tiling which contains the current state). We can specify which dimensions we want to use, how
many partitions to use per dimension and an offset for each dimension. Additionally a scaling factor (1.0 is
the default value) for each dimension can be defined; in combination with the offset we can lay the grid just
over specified intervals of the state variable. The grid based classes always use the normalized interval [0, 1]
for the continuous state variables. We have to consider this when we specify the offsets.
Tilings
Tilings are represented by the class CTilingFeatureCalculator. This subclass of CGridFeatureCalculator
always returns the active tile with activation factor 1.0, so there is always just one active feature.
We can use the or operator to combine several tilings. So, actually, one individual tiling can be seen as
a discrete state representation.
Figure 3.4: The use of more than one tiling with the or operator
Grids with more than one active feature
These grids are represented by the class CLinearMultiFeatureCalculator, which is a subclass of our grid
base class. We can also specify the number of active features for each dimension. The $n_i$ features nearest
to the current state are always considered active in dimension $i$. For each feature that is in the active area,
the feature factor is calculated by the interface function getFeatureFactor, which receives the position of
the feature and the current state vector as input. This function is implemented by the subclasses. After
calculating the feature factors, the active feature factors are normalized again.
RBF-Networks
For the RBF-network class we additionally have to specify the $\sigma$ values for each dimension (always referred
to the interval [0, 1]), so it is not possible in our approach to specify any cross-correlation between the
state variables. The following, simplified formula is used to calculate the feature factors:

$$\phi_i(x) = \exp\left(-\sum_{j=1}^{n} \frac{(x_j - \mu_{ij})^2}{2\sigma_{ij}^2}\right) \qquad (3.4)$$

All features within the range of $2\sigma$ are considered active, but at least two features per dimension must be
active. The number of active features per dimension is crucial for the speed of the Toolbox, so we have to
tune the sigma values carefully. At the end there is, as usual, the normalization step; so we use a normalized
RBF network as it is used by Doya [17] and Morimoto [30]. Often, such RBF networks are also referred to
as Gaussian Soft-Max Basis Function Networks (see chapter 6).
Figure 3.5: Using a grid of RBF-Centers for the feature state. For the x-dimension we use eight RBF
centers, for the y-dimension four centers. The sigma values are chosen in such a way that there are two
active features per dimension.
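The normalization described above can be illustrated with a small sketch that evaluates equation (3.4) for all centers and then rescales the factors. For simplicity the sketch evaluates every center, whereas the Toolbox restricts the computation to the centers within roughly two sigma of the current state; the function name and parameter layout are assumptions made for illustration.

#include <vector>
#include <cmath>

// Minimal sketch of normalized ("soft-max") RBF features with axis-aligned
// Gaussians, following equation (3.4).
std::vector<double> normalizedRBFFeatures(const std::vector<double>& state,
                                          const std::vector<std::vector<double>>& centers,
                                          const std::vector<double>& sigma)
{
    std::vector<double> factors(centers.size());
    double sum = 0.0;
    for (size_t i = 0; i < centers.size(); ++i) {
        double exponent = 0.0;
        for (size_t j = 0; j < state.size(); ++j) {
            double diff = state[j] - centers[i][j];
            exponent += diff * diff / (2.0 * sigma[j] * sigma[j]);
        }
        factors[i] = std::exp(-exponent);
        sum += factors[i];
    }
    for (double& f : factors)
        f /= sum;   // normalization -> Gaussian soft-max basis functions
    return factors;
}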
Linear Interpolation
For the linear interpolation approximator only two features per dimension are active. The feature factor is
calculated as the product over all dimensions of the (linearly scaled) distances to the current state:

$$\phi_i(x) = \prod_{j=1}^{d} \left( dist_j - |x_j - Pos(\phi_i)_j| \right) \qquad (3.5)$$

As usual, these factors are normalized in the end.
3.2.6 Calculating features from a single continuous state variable
We also provide functionality for specifying the centers of the features more accurately. To this end, we
provide classes which enable us to choose the centers of the features explicitly for a single input dimension.
These feature states can then be combined by the and operator.
The super class for creating features from a single continuous state variable is CSingleStateFeatureCalculator.
For this abstract class we can specify the location of the one-dimensional centers of the features and the
number of active features. The method for calculating the feature factors is again abstract and implemented
by the subclasses.
Figure 3.6: Calculation of the RBF features from single continuous state variables and combining them with
the and operator. In this example we use seven RBF features for dimension i and six RBF features for
dimension j. Both feature states use two active features simultaneously, resulting in a feature state with four
active features after the and operation.
RBF Features
Additionally we have to specify the sigma values for each RBF center. The feature factor is calculated using
the standard RBF-equation for one dimension.
Linear Interpolation Features
There are always two active features; the feature factors are scaled linearly between the two neighboring
features.
All the discussed feature calculators take periodic continuous state variables into consideration, so it is always
the nearest features that are chosen to be active.
3.3 States for Neural Networks
Feed-forward neural networks do not have any requirements for the state representation, but some pre-processing can be useful. We use two pre-processing steps for continuous state variables.
Periodic state variables are scaled to the interval $[-\pi, +\pi]$. In order to represent the periodicity for a neural network more accurately, we replace the scaled periodic state with two new state variables, one
representing $\sin(x)$ and the other $\cos(x)$.
Non-periodic state variables are scaled to the interval [-1, 1], so that they have the same scale as the
periodic state variables.
These pre-processing steps are done by the class CNeuralNetworkStateModifier. We always have to consider
that the resulting input state for the neural network contains one additional continuous variable for each
periodic state variable. We can also use discrete state variables as input states. For discrete state variables
no generalization between the values is typically intended, so a separate input variable for each discrete state
number is usually used. The value of the input variable $d_i$ is 1.0 if $i$ is the current state number and 0.0
otherwise.
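A minimal sketch of these pre-processing steps follows; the free functions are illustrative stand-ins, not the interface of CNeuralNetworkStateModifier.

#include <vector>
#include <cmath>

// A periodic variable (already scaled to [-pi, pi]) is replaced by sin(x)
// and cos(x); a non-periodic variable with range [min, max] is scaled to [-1, 1].
void addPeriodicInput(double x, std::vector<double>& networkInput)
{
    networkInput.push_back(std::sin(x));
    networkInput.push_back(std::cos(x));
}

void addContinuousInput(double x, double min, double max,
                        std::vector<double>& networkInput)
{
    networkInput.push_back(2.0 * (x - min) / (max - min) - 1.0);
}

// A discrete variable with numValues values becomes a one-hot block.
void addDiscreteInput(int value, int numValues, std::vector<double>& networkInput)
{
    for (int i = 0; i < numValues; ++i)
        networkInput.push_back(i == value ? 1.0 : 0.0);
}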
Chapter 4
General Reinforcement Learning Algorithms
In this chapter we will discuss common RL approaches which are, in general, designed for a discrete state
and action space. We will discuss the algorithms in their theoretical form and also their implementation in
the Toolbox. We begin with value based approaches to learning the V-Function and the Q-Function. After
that, we will cover discrete Actor-Critic architectures and finally we will discuss model based approaches.
At the conclusion of each theoretical discussion, an additional section discusses the implementation issues
in the RL Toolbox.
4.1 Theory on Value based approaches
Value based methods estimate how desirable it is to be in a given state or to execute a certain action in a
given state. Therefore, the algorithms use so-called value functions (V-Functions) or action value functions
(Q-Functions).
4.1.1 Value Functions
Value functions estimate how desirable it is to be in state $s$. The value of state $s$ is defined to be the expected
future discounted reward the agent receives if it starts in state $s$ and follows the policy $\pi$. Formally, this can
be written as

$$V^{\pi}(s) = E\left[\sum_{k=0}^{\infty} \gamma^k\, r(t + k)\right] \qquad (4.1)$$

where the successor states $s_{t+1}$ are sampled from the distribution $f(s_t, \pi(s_t))$ for all $t$. The expectation is
always calculated over all stochastic variables ($\pi$ and $f$). We can write 4.1 in the recursive form

$$V^{\pi}(s_t) = E[r(t) + \gamma V^{\pi}(s_{t+1})] \qquad (4.2)$$

When referring to the value $V(\pi)$ of a policy $\pi$, we always mean the expected discounted reward when
following $\pi$, beginning at a typical initial state $s_0$. The value of policy $\pi$ is given by:

$$V(\pi) = E_{s_0 \sim D}\left[V^{\pi}(s_0)\right] \qquad (4.3)$$

where $D$ is the initial state distribution of the given MDP.
4.1.2 Q-Functions
A value function can be used to estimate the goodness of a certain state, but we can only use it for action
selection if we know the transition function. Action value functions (referred to as Q-Functions) estimate
the future discounted reward (i.e. the value) if the agent chooses the action $a$ in state $s$ and then follows policy
$\pi$ again. Hence Q-Functions estimate the goodness of executing action $a$ in state $s$.

$$Q^{\pi}(s, a) = E[r(s, a, s') + \gamma V^{\pi}(s')] \qquad (4.4)$$

Note that $E_{\pi}[Q^{\pi}(s, \pi(s))] = V^{\pi}(s)$, so we can also write

$$Q^{\pi}(s, a) = E[r(s, a, s') + \gamma Q^{\pi}(s', a')] \qquad (4.5)$$

where the action $a'$ is chosen according to the policy $\pi$.
4.1.3 Optimal Value Functions
Given the definition of the value functions, we can also compare two policies. A policy $\pi_1$ is better than or as good
as policy $\pi_2$ if $V^{\pi_1}(s) \geq V^{\pi_2}(s)$ for all $s \in S$. From the definition we see that the agent gathers at least as
great a reward following $\pi_1$ as it does following $\pi_2$. So the optimal policy $\pi^*$ satisfies the condition

$$V^{\pi^*}(s) \geq V^{\pi}(s) \qquad (4.6)$$

for all states and all possible policies $\pi$. We define the optimal value function as $V^*(s) = V^{\pi^*}(s)$.
Optimal policies also have optimal action values, i.e. $Q^*(s, a) = \max_{\pi} Q^{\pi}(s, a)$ (where $Q^*$ is already defined
as $Q^{\pi^*}$). The optimal policy always chooses the action with the best action value, so it is also clear that

$$\max_{a \in A_s} Q^*(s, a) = V^*(s) \qquad (4.7)$$

for all states $s$, where $A_s$ is the set of available actions in state $s$.
Inserting the optimal policy in 4.4, we can also write for $Q^*$:

$$Q^*(s, a) = E\left[r(s, a, s') + \gamma \max_{a' \in A_{s'}} Q^*(s', a')\right] \qquad (4.8)$$

This equation is called the Bellman optimality equation. It can also be stated for value functions:

$$V^*(s) = \max_{a} E[r(s, a, s') + \gamma V^*(s')] \qquad (4.9)$$
4.1.4 Implementation in the RL Toolbox
V-Functions
Given the above theory, V-Functions need to provide the following functionality:
Return an estimated value for a given state object.
Update the value for a given state object.
In the discrete case V-Functions are usually represented as tables, but for more complex problems any
representation of a function (polynomials, neural networks, etc.) can be used. We design a general interface
for V-Functions, which then has to be implemented by the different V-Function implementations.
This interface is called CAbstractVFunction. It contains interface functions for retrieving, updating (adding
a value) and setting a value for a given state. In this class we can also set which state representation the
V-Function will use; the specified state object is then automatically retrieved from the state collection and
passed to the interface functions.
Figure 4.1: Representation of the Value Function. The value function can choose any state representation from the state collection.
Value Functions for discrete States
Value functions for discrete states commonly store the value information in tabular form, so we have a
value entry for each discrete state number. In the Toolbox, tabular V-Functions are represented by the class
CFeatureVFunction, which can be used for discrete states and linear features. We can specify a discretizer
object for the value function; this object determines the state properties object (which has to be a discretizer
or feature calculator) used to retrieve the state from the state collection. The size of the table is
also taken from the discretizer. The feature V-Function also supports value manipulation directly with the
discrete state number, so we do not have to use the state objects. This possibility is used, for example, by the
implemented dynamic programming algorithms.
When using feature states, the feature state is decomposed into its single features before the functions
for the discrete state indices are called. The feature factors are used as weights for the value calculation or the value
updates, respectively.
Q-Functions
Q-Functions return the action value of a given state-action pair. Again, we provide a general interface for
all implementations of Q-Functions. The interface contains functions for getting, setting and updating a Q-Value,
so it has the same functionality as for V-Functions, but with the action as an additional input parameter.
Each of these methods contains additional arguments for the action, consisting of the action pointer
itself and an action data object (see 2.2.4). Additionally, the interface provides functions to accomplish the
following:
getActionValues: calculate the values of all actions in a given action set and write them into a double
array.
getMaxValue: calculate the maximum action value for a given state.
getMax: return the action with the maximum action value for a given state.
The Q-Function interface is called CAbstractQFunction.
Q-Functions for a set of actions
If we have a discrete set of actions, the action values for each action can be seen as separate, independent
functions. It would be optimal if we could use different representations for each function, e.g. if we could
use other discrete state representations for the different actions or even use a neural network for one action
and a linear function approximator for another.
If we look at each action separately, the corresponding action value function only has to store state values, so
it has the same functionality as a V-Function. Consequently, it is rather obvious to use V-Functions for the
single action value functions of the Q-Function. For each action, we can specify an individual V-Function,
so that different function representations can be used for different actions. This approach of representing a
Q-Function is managed by the class CQFunction. For this class the user has to set a V-Function for each
specified action.
Figure 4.2: Q-Functions for a finite action set: For each action the Q-Function maintains an individual V-Function object.
Q-Functions for Discrete or Feature States
For discrete Q-Functions we only need to use discrete V-Functions (CFeatureVFunction) for our single
action value functions. If we need a Q-Function which uses the same discrete or feature state representation
for each action, creating and setting the V-Functions each time can be very arduous. Thus we provide the
class CFeatureQFunction, which creates the feature V-Function objects by itself, all with the same state
representation.
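The idea of building a Q-Function from one value table per action can be illustrated with the small sketch below. The class shown is a toy stand-in and not the CQFunction interface; it uses a plain map per action where the Toolbox would plug in an arbitrary V-Function implementation.

#include <vector>
#include <map>
#include <limits>

// Minimal sketch: one value table per action, queried and updated like a Q-Function.
class SimpleQFunction {
public:
    explicit SimpleQFunction(int numActions) : values(numActions) {}

    double getValue(int state, int action) const {
        auto it = values[action].find(state);
        return it == values[action].end() ? 0.0 : it->second;
    }
    void updateValue(int state, int action, double delta) {
        values[action][state] += delta;
    }
    int getMaxAction(int state) const {
        int best = 0;
        double bestValue = -std::numeric_limits<double>::infinity();
        for (int a = 0; a < (int)values.size(); ++a) {
            double v = getValue(state, a);
            if (v > bestValue) { bestValue = v; best = a; }
        }
        return best;
    }
private:
    std::vector<std::map<int, double>> values;  // one table per action
};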
4.2 Dynamic Programming
Dynamic Programming (DP) approaches are iterative methods to estimate the value or action value
functions with the use of a perfect model of the MDP. Due to the need for the perfect model and the huge
computational expense, these algorithms are used only rarely in practice, but they are still very important
theoretically. DP methods do not use any real experience of the agent either (the agent never executes an
action); only the model is used. Therefore DP methods are not really learning methods, but rather a planning
approach.
DP methods use the perfect model to calculate the (optimal) value or action value for a given state s, assuming
the values of the successor states of s are correct. In general this assumption is false, because we do not
know the values of the successor states, but by repeating this step for all states infinitely often, the algorithm
converges to the required solution.
4.2.1 Evaluating the V-Function of a given policy
The recursive equation of the V-Function for a stochastic policy $\pi$ and stochastic transition probabilities
$P(s'|s, a)$ is given by:

$$V^{\pi}(s) = E[r(t) + \gamma V(s')] = \sum_{a} \pi(s, a) \sum_{s'} P(s'|s, a)\left(r(s, a, s') + \gamma V(s')\right) \qquad (4.10)$$

This equation is iteratively applied for all states, resulting in a sequence of value functions $V_0, V_1, V_2, \ldots, V_k$,
where $V_k$ is

$$V_k(s) = \sum_{a} \pi(s, a) \sum_{s'} P(s'|s, a)\left(r(s, a, s') + \gamma V_{k-1}(s')\right), \quad \text{for all } s \qquad (4.11)$$

We can think of the backups as being done in a sweep through the state space. These updates are also commonly
referred to as full backups, since all possible transitions from one state $s$ to all successor states $s'$ are used.
This sequence of value functions is proved to converge to $V^{\pi}$ if the updates are done infinitely often. This process is often
called iterative policy evaluation. Note that we can use the V-Function in combination with the model of the
MDP to create a Q-Function for action selection.

$$Q^{\pi}(s, a) = E[r(s, a, s') + \gamma V^{\pi}(s')] = \sum_{s'} P(s'|s, a)\left(r(s, a, s') + \gamma V^{\pi}(s')\right) \qquad (4.12)$$
4.2.2 Evaluating the Q-Function
Of course we can also evaluate the action values with dynamic programming straightaway. In this case we use
the following iterative equation:

$$Q_k(s, a) = \sum_{s'} P(s'|s, a)\left(r(s, a, s') + \gamma \sum_{a'} \pi(s', a')\, Q_{k-1}(s', a')\right), \quad \text{for all } s \text{ and } a \qquad (4.13)$$

In fact, evaluating the V-Function and evaluating the Q-Function are theoretically equivalent. In the equation
for the V-Function update, we have to calculate the Q-Values of the current state, and in the equation for the
Q-Function update we have to calculate the V-Values of the successor states. In practice, the difference lies
only in how we represent our data.
4.2.3 Policy Iteration
Up to this point, we can calculate the values (or action values) of a given policy, but we actually want to evaluate
the values of the optimal policy. There are two common ways to do this. The first is called policy iteration:
we evaluate the V-Function of a (fixed) policy (policy evaluation step), then we create a policy that is
greedy on that V-Function (policy improvement). This new policy is proved to be better than or at least as good
as the old policy. Then we repeat the entire process. Policy iteration is guaranteed to converge to the optimal
policy (and value function), but it is very time consuming because we have to do an entire policy evaluation
for each improvement step.
4.2.4 Value iteration
Value iteration combines the two steps of policy evaluation and policy improvement. We directly evaluate
the values of the greedy policy (greedy on the current values, not on the old values as in policy iteration). Thus
we get the following equation for the value function:

$$V_k(s) = \max_{a} \sum_{s'} P(s'|s, a)\left(r(s, a, s') + \gamma V_{k-1}(s')\right), \quad \text{for all } s \qquad (4.14)$$

And for the action value function:

$$Q_k(s, a) = \sum_{s'} P(s'|s, a)\left(r(s, a, s') + \gamma \max_{a'} Q_{k-1}(s', a')\right), \quad \text{for all } s \text{ and } a \qquad (4.15)$$

These are the iterative equations for the optimal value and action value function. Value iteration is also
guaranteed to converge to the optimal policy.
All these approaches only work for a discrete state space representation, otherwise we get problems with
representing the transition probabilities. Adapted versions for a continuous state space are called Neuro-Dynamic Programming (see [12] or [15] for an exact description of these algorithms), but these approaches
work only with limited success due to the huge computational expense. The performance of DP methods
suffers mostly from the sweeps through the state space. Theoretically the value updates must be done for
every state, but for the major part of the state space, the values remain unchanged. We will discuss an
algorithm called Prioritized Sweeping, which has a more sophisticated update schema.
4.2.5 The Dynamic Programming Implementation in the Toolbox
The Toolbox supports evaluating the value function or the action value function of any stochastic policy, so
value iteration is also possible if we use a greedy policy. Why is it useful to provide dynamic programming
methods for both value and action value learning if these two approaches are equivalent? There are several
approaches to combining dynamic programming with other learning methods (actually, in the Toolbox we
can use any other value based algorithm in combination with DP; we just have to use the same V- or Q-Function),
and depending on what other type of learning algorithm we want to use, we need either the
V-Function or the Q-Function.
For dynamic programming we need additional data structures to represent the following:
Transition probability matrix $P(s'|s, a)$: Typically, most of the entries of this matrix are zero, because
for common problems only a few states can be reached from a specific state within one step.
Some algorithms also need the backward transition probability $P(s|s', a)$, which is actually the
same quantity, but we need a data structure which provides quick access to the probabilities greater
than zero in both directions. We want to access these probabilities by supplying either the successor
state or the predecessor state.
Stochastic policies: The policies have to return a probability distribution over the actions for each
state. This will be discussed in section 4.5. It is enough to know that such stochastic policies exist
and are used.
Representing the Transition probabilities
The requirements for our transition probability matrix have already been explained. It is understood that we
cannot store the probability values in a full matrix because we would waste a lot of memory and computation
time. Instead we implement an interface class CAbstractFeatureStochasticModel and an implementation
class CFeatureStochasticModel. This gives room for other implementations. In our approach we store lists
of transitions. A transition object contains the index of the initial state, the index of the final state and the
probability of the transition. For each state-action pair we maintain two lists, one for the forward transitions
and one for the backward transitions. If a transition is added to the probability matrix, the transition is added
to the initial state's forward transition list and to the end state's backward transition list for the specified
action. This allows quick access to all successor (or predecessor) states of a given state. The action itself is
not stored by the transition object; the action is already implicitly specified by the location of the transition
object.
Figure 4.3: Representation of the Transition Matrix. For each state-action pair we have a list of forward and
a list of backward transitions.
The CFeatureStochasticModel class offers functions that retrieve the probability of a specified transition
directly, or retrieve all forward (or backward) transitions of a given state-action pair.
For the stochastic model, the state (or alternatively the feature index) is represented as an integer variable,
because these updates have to be done several hundred thousand times for larger MDPs; therefore we do
not want to waste any computational resources by using state objects.
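A minimal sketch of such a forward/backward list structure is given below, with states and actions as plain integers. The class and method names are illustrative and do not reproduce the CFeatureStochasticModel API.

#include <vector>

struct Transition { int fromState; int toState; double probability; };

// Minimal sketch of a sparse transition model with one forward and one
// backward transition list per state-action pair.
class SparseTransitionModel {
public:
    SparseTransitionModel(int numStates, int numActions)
        : forward(numStates * numActions), backward(numStates * numActions),
          numActions(numActions) {}

    void addTransition(int s, int a, int sNext, double prob) {
        Transition t{s, sNext, prob};
        forward[s * numActions + a].push_back(t);       // successors of s
        backward[sNext * numActions + a].push_back(t);  // predecessors of sNext
    }
    const std::vector<Transition>& getForward(int s, int a) const {
        return forward[s * numActions + a];
    }
    const std::vector<Transition>& getBackward(int s, int a) const {
        return backward[s * numActions + a];
    }
private:
    std::vector<std::vector<Transition>> forward, backward;
    int numActions;
};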
Discrete Reward Functions
For the DP update we need reward functions which take discrete states as arguments, instead of our usual
state objects. For performance reasons we represent the discrete states with integer variables in this context.
Therefore we create the class CFeatureRewardFunction, which returns a reward value for states represented
by integers.
The Value Iteration Algorithm
As already mentioned, the Toolbox supports calculating the values or the action values of a given stochastic
policy. Therefore we introduce some tool classes:
Converting V-Functions to Q-Functions: We build an extra read-only Q-Function class (CQFunctionFromStochasticModel)
which calculates the action values given a value function, the transition
probabilities and the reward function. For the action value calculation the standard equation
$$Q(s, a) = \sum_{s'} P(s'|s, a)\left(r(s, a, s') + \gamma V(s')\right)$$
is used. Of course the sum over all states $s'$ is only computed for the successor states in the forward
list of state $s$. Note that this is already a main part of policy evaluation.
Converting Q-Functions to V-Functions: The class CVFunctionFromQFunction is used to calculate
the V value from a Q-Function and the corresponding policy. For this calculation, the standard
equation
$$V(s) = \sum_{a} \pi(s, a)\, Q(s, a)$$
is used. Note that combining these two conversions already defines the update step used for value
iteration.
Depending on what we want to estimate (values or action values), the algorithm takes a stochastic policy
and either a Q- or a V-Function as input. If no policy has been specified, a greedy policy is used.
Value estimation: The given V-Function is converted to a Q-Function. Then we convert that Q-Function
back to a V-Function. For each update step we set the value of state s in the original V-Function
to the value of this virtual V-Function.
Action value estimation: Here it works vice versa: we create a V-Function from the given Q-Function
and then convert the V-Function back to a Q-Function. For each update step, the values
calculated by the virtual Q-Function are used.
We also added some useful extensions to the standard value iteration algorithm for choosing which states to
update. The states which get updated are usually chosen at random or sequentially in a sweep. With
this update scheme it is unlikely that we update states where the value will change significantly. In our
approach, a priority list for the state updates is used. The state with the highest priority is chosen for the
update of the value function. If a state update has been made for state $s$ and the value of that state changed
significantly, it is likely that the values of the predecessor states will also change. If state $s$ has been updated
and the Bellman error $b = |V_{k+1}(s) - V_k(s)|$ is the difference between the new and the old value, the priorities
of all predecessor states $s'$ are increased by the expected change of the value of state $s'$, which is $P(s|s', a) \cdot b$.
Only priorities above a given threshold are added. The priority update is listed in Algorithm 1.

Algorithm 1 Priority Lists
  b = |V_k(s) − V_{k-1}(s)|
  for all actions a do
    for all predecessor states s' do
      priority(s') += P(s|s', a) · b
    end for
  end for

We provide the following functions (a sketch of the resulting update loop is given below):
doUpdateSteps: Update the first N states from the priority list.
doUpdateStepsUntilEmptyList: Do updates until the priority list is empty.
doUpdateBackwardStates: Update all predecessor states of a given state. This can be used to give the
algorithm hints on where to start the updates; for example, it is useful to begin in known target
or failure states.
As already mentioned, this is not the pure value iteration algorithm any more. It is a sort of intermediate
step towards the prioritized sweeping algorithm discussed in section 4.8.1.
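The following C++ sketch shows such a prioritized update loop under two assumptions: updateValue(s) performs one full backup of state s and returns its Bellman error, and predecessors(s, a) enumerates the backward transition list as (state, probability) pairs. All names are illustrative, not the Toolbox API.

#include <vector>
#include <queue>
#include <cmath>

struct PrioritizedState { int state; double priority; };
struct CompareByPriority {
    bool operator()(const PrioritizedState& a, const PrioritizedState& b) const {
        return a.priority < b.priority;    // highest priority is popped first
    }
};

template <class UpdateFn, class PredecessorFn>
void doUpdateSteps(int numSteps, double threshold,
                   std::priority_queue<PrioritizedState, std::vector<PrioritizedState>,
                                       CompareByPriority>& queue,
                   int numActions, UpdateFn updateValue, PredecessorFn predecessors)
{
    for (int step = 0; step < numSteps && !queue.empty(); ++step) {
        int s = queue.top().state;
        queue.pop();
        double bellmanError = std::fabs(updateValue(s));     // full backup of state s
        for (int a = 0; a < numActions; ++a)
            for (const auto& pred : predecessors(s, a)) {    // pred = {state, probability}
                double priority = pred.second * bellmanError;
                if (priority > threshold)                    // only priorities above threshold
                    queue.push({pred.first, priority});
            }
    }
}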
4.3 Learning the V-Function
4.3.1 Temporal Difference Learning
TD learning approaches calculate the error of the Bellman equation for a one step sample $< s_t, a_t, r_t, s_{t+1} >$.
So, in contrast to dynamic programming, TD methods use one sample backups instead of full backups.
Since we use only a single sample for the update, the calculated values have to be averaged. For this, the
learning rate $\alpha$ is used. For the value function, the Bellman equation is

$$V^{\pi}(s_t) = E[r(t) + \gamma V(s_{t+1})]$$

We obtain the following update for a single step sample:

$$V_{k+1}(s_t) = (1 - \alpha)\, V_k(s_t) + \alpha\left(r_t + \gamma V_k(s_{t+1})\right) \qquad (4.16)$$

The one step error of the Bellman equation for a step tuple $< s_t, a_t, r_t, s_{t+1} >$ is also called the temporal difference
(TD) and is calculated by

$$td = r_t + \gamma V(s_{t+1}) - V(s_t) \qquad (4.17)$$

We can also use the TD value to express the update of the V value of state $s_t$:

$$\Delta V(s_t) = \alpha \cdot td = \alpha\left(r_t + \gamma V(s_{t+1}) - V(s_t)\right) \qquad (4.18)$$

By following a given policy $\pi$, we can estimate $V^{\pi}$ using this approach.
4.3.2 TD(λ) V-Learning
The normal TD learning algorithm only updates the current state with the temporal difference. But
in general, the states from the past are also responsible for the achieved temporal difference. Eligibility
traces (e-traces) are a common approach used to speed up the convergence of the TD learning algorithm
(see [49]). The e-trace $e(s)$ of a state represents the influence of the state $s$ on the current TD update. Now
each state is updated with the help of its e-trace:

$$\Delta V(s) = \alpha \cdot td \cdot e(s), \quad \text{for all } s \qquad (4.19)$$
4.3.3 Eligibility traces for Value Functions
There are different ways of calculating the eligibility of a state. The eligibility of each state has to be
decreased by a given attenuation factor $\gamma\lambda$ at each step, except for the current state, whose eligibility
is increased. The two most common eligibility trace update methods are:
Replacing e-traces:

$$e_{t+1}(s) = \begin{cases} \gamma\lambda\, e_t(s), & \text{if } s \neq s_t \\ 1, & \text{else} \end{cases} \qquad (4.20)$$

Accumulating (non-replacing) e-traces:

$$e_{t+1}(s) = \begin{cases} \gamma\lambda\, e_t(s), & \text{if } s \neq s_t \\ \gamma\lambda\, e_t(s) + 1, & \text{else} \end{cases} \qquad (4.21)$$

At the beginning of an episode the e-traces must obviously be reset to zero. In general it is not clear which
approach works better. Non-replacing e-traces are more common, but they can falsify the V-Function update
considerably if the learning task allows the agent to stay in the same state for a long time. Replacing e-traces were
introduced by Singh [44]. In his experiments, replacing e-traces had a considerably better learning
performance.
4.3.4 Implementation in the RL Toolbox
E-Traces for V-Functions
There are different types of e-traces for the different types of value functions, so we again have to provide a
general interface for the e-traces. This interface is called CAbstractVETraces. An e-trace object is always
bound to a V-Function object, which is passed to the constructor of the e-traces object. The interface contains
functions for the following:
Adding the current state to the e-traces (addETraces(CStateCollection *)).
Updating the current e-traces by multiplying the e-trace values with the attenuation factor (updateETraces()).
Updating the value function with a given td (updateVFunction(double td)).
Resetting the e-traces for all states (resetETraces()).
In the following we will only discuss the implementation for the discrete and linear feature state representation
(which is implemented by the class CFeatureVETraces).
In the TD(λ) update rule we have to update all states in each step. But in general (and particularly for large
state spaces) this is not necessary; the e-traces of most states will be zero anyway. This conclusion arises
from the general assumption that the agent only uses a local part of the state space for a long time period.
As a result, we do not store the e-traces in an array, but store the index and the eligibility factor in a
list. All states which are not in the list have an e-trace factor of zero. The list is also sorted by the eligibility
factors. In order to find a $< s, e(s) >$ tuple faster, we also maintain an integer map for the tuples. In this
context the state number serves as map index, so we can search for the eligibility factor of state s (as well as
decide whether the state is in the list) very quickly. This is needed in order to add to the eligibility of a given state.
This kind of list is called a feature list (class CFeatureList in the Toolbox) and is also needed for other classes
and functionalities. There are sorted and unsorted feature lists.
When updating the value function, the class calls the update method of the feature V-Function (the one using
integers as the state parameter) for every state in the eligibility list.
We also implement replacing and accumulating e-traces; this behavior can be set by the parameter ReplacingETraces.
We provide two additional parameters which control the speed and accuracy of the e-traces.
The first parameter (ETraceMaxListSize) controls the maximum number of states in the e-trace list;
if there are more states in the list, the states with the smallest factors are deleted. The second
parameter controls the minimum factor of an e-trace (ETraceTreshold); again, all states with an
eligibility factor lower than the given threshold are deleted.
Additionally, each V-Function provides a method for creating a new e-traces object of the correct type, already
initialized with the V-Function. Feature V-Functions create feature e-trace objects as standard, but it
is also possible to use other kinds of e-traces with feature V-Functions.
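A minimal sketch of such a sparse e-trace container follows. For brevity it uses a single unordered map instead of the sorted list plus integer map described above, and the names are illustrative rather than the CFeatureVETraces interface.

#include <vector>
#include <unordered_map>

// Minimal sketch of sparse feature e-traces: only features with non-zero
// eligibility are stored; traces below a threshold are dropped.
class FeatureETraces {
public:
    FeatureETraces(double gamma, double lambda, double threshold, bool replacing)
        : attenuation(gamma * lambda), threshold(threshold), replacing(replacing) {}

    void updateETraces() {
        for (auto it = traces.begin(); it != traces.end(); ) {
            it->second *= attenuation;                  // e(s) = gamma*lambda*e(s)
            if (it->second < threshold) it = traces.erase(it);
            else ++it;
        }
    }
    void addETrace(int feature, double factor = 1.0) {
        double& e = traces[feature];
        e = replacing ? factor : e + factor;            // replacing vs. accumulating
    }
    // V(s) += alpha * td * e(s) for every feature in the list
    void updateVFunction(std::vector<double>& valueTable, double alphaTd) {
        for (auto& e : traces) valueTable[e.first] += alphaTd * e.second;
    }
    void resetETraces() { traces.clear(); }
private:
    std::unordered_map<int, double> traces;  // feature index -> eligibility
    double attenuation, threshold;
    bool replacing;
};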
TD(λ) V-Learning
Since we have already designed the V-Function and the e-trace objects, implementing the TD(λ) algorithm
itself is straightforward. The algorithm implements the CSemiMDPRewardListener interface and gets
a reward function and a value function passed to the constructor. At each new step and each new episode it
performs the updates listed in Algorithm 2. A value function alone cannot be used for action selection, but
in combination with a transition function it can be used to calculate a policy (see section 4.5).
Algorithm 2 TD(λ) V-Learning
  for each new episode do
    etraces->resetETraces()
  end for
  for each new step <s_t, a_t, r_t, s_{t+1}> do
    td = r_t + γ·V(s_{t+1}) − V(s_t)
    etraces->updateETraces()        // e(s) = γλ·e(s)
    etraces->addETrace(s_t)         // e(s) += 1.0 or e(s) = 1.0
    etraces->updateVFunction(α·td)  // V(s) += α·td·e(s)
  end for
4.4 Learning the Q-Function
4.4.1 TD Learning
As we have seen, the Q-Function is defined according to 4.5. Similar to learning the V-Function, we take a
one step sample $< s, a, r, s' >$ of this equation and calculate its error when using the estimated
Q-Values:

$$td = r(s, a, s') + \gamma Q^{\pi}(s', a') - Q^{\pi}(s, a) \qquad (4.22)$$

The action $a'$ in 4.22 is chosen by the policy $\pi$, because we estimate the action values of policy $\pi$; this policy is called
the estimation policy. In the case of Q-Function learning the estimation policy does not have to be the policy
that is used by the agent. The TD value is then used to update the Q value of state $s$ and action $a$:

$$\Delta Q(s, a) = \alpha \cdot td \qquad (4.23)$$

If every state-action pair is visited infinitely often and the learning rate is decreased over time, the TD
algorithm is guaranteed to converge to $Q^{\pi}$.
We can use any policy as the estimation policy $\pi$, but in general we want to estimate the values of the
optimal policy. There are two main algorithms, SARSA and Q-Learning, which differ purely in the choice
of the estimation policy.
In Q-Function learning we can use stored experience from the past (like batch updates) for learning, because
we can specify an individual estimation policy, which is learned. When learning the V-Function, the step
tuples always have to be generated by the estimated policy; therefore past information cannot be used.
SARSA Learning
State Action Reward State Action learning ([49], p. 145) always uses the policy the agent follows as the estimation
policy. That is, it uses the $< s_t, a_t, r_t, s_{t+1}, a_{t+1} >$ tuple for the updates, which is why it is called SARSA.
This approach is called on-policy learning, because we learn the policy we follow. Usually the agent follows
a greedy policy with some kind of exploration. For this reason, SARSA does not estimate the optimal policy,
but also takes the exploration of the policy into account. If the policy gradually changes towards a greedy
policy, SARSA converges to $Q^*$.
Q Learning
Q-Learning ([49], p. 148) uses a greedy policy as the estimation policy. Thus it estimates the action values of the
optimal policy. Since we estimate a policy other than the one we follow (due to some random exploration
term), this type of update is called off-policy learning.
In general it is not clear which method is better. Q-Learning is likely to converge faster. SARSA learning
has an advantage if there are areas of high negative reward, because in that case SARSA stays away
from these areas more than Q-Learning does. The reason for this is that SARSA also considers the
exploring actions, which could lead the agent into these areas by chance.
4.4.2 TD(λ) Q-Learning
Similar to V-Function learning, we can also use eligibility traces for Q-Function learning. Now the e-traces
are not maintained for states, but for state-action pairs. The update rules are the same as in the V-Learning
case (here again, there are replacing and non-replacing e-traces).
But there is a difference when resetting the e-traces. In this case we have to distinguish between actions
which were chosen by the estimation policy and actions where the estimation policy differed from the step
information. Once we recognize that we have not followed the estimation policy, we need to reset the
e-traces. This is done because we want to estimate the action values of the estimation policy, and if the
estimation policy has not been followed for one step, the state-action pairs from the past are not responsible
for any further TD updates.
Nevertheless, there are also approaches where the e-traces are never reset, even if the estimated action contradicts
the executed action. This may falsify the Q-Function updates, but it can also improve performance because
the e-traces reach further into the past. In the initial learning phase in particular, where we have many
exploratory steps, this can be a significant advantage.
Of course there are extensions of Q and SARSA learning using e-traces. They are called Q(λ) and SARSA(λ).
In this case, the SARSA(λ) algorithm has a small advantage over the Q(λ) algorithm, because the agent always
follows the estimation policy; thus we never have to reset the e-traces during an episode.
4.4.3 Implementation in the RL Toolbox
E-Traces for Q-Functions
We have already defined e-traces for V-Functions, which store the eligibility of a state. Now we have
to store the eligibility of a state-action pair. For Q-ETraces we provide a similar interface class,
CAbstractQETraces. The interface contains functions for the following:
Add the current state-action pair to the e-traces (addETrace(s, a)).
Update the current e-traces by multiplying them by the attenuation factor (updateETraces()).
Update the Q-Function, given td (updateQFunction(td)).
Reset the e-traces for all state-action pairs (resetETraces()).
The design of the Q-ETraces class is patterned on the design of the Q-Functions. Again, we maintain a
V-ETraces object for each action. Each V-ETraces object is assigned to the corresponding V-Function of the
CQFunction object. Thus, if the Q-Function uses feature V-Functions for the single action value functions,
the Q-ETrace object will use feature V-ETraces. This functionality is covered by the class CQETraces,
which implements the general CAbstractQETraces interface. Since we reuse the e-traces for V-Functions, we get
the entire functionality discussed previously, including replacing or non-replacing e-traces and setting
the maximum list size and the minimum e-trace value before a state is discarded from the list.
Each Q-Function contains a method for retrieving a new standard Q-ETraces object for that type of Q-Function,
similar to the V-Function approach.
TD(λ) Q-Learning
We have defined all necessary parts of the algorithm, so we can now easily combine these parts and build
the learner class CTDLearner. The algorithm implements the CSemiMDPRewardListener interface. It gets
a reward function, a Q-Function object and an estimation policy object as parameters. Additionally we can
specify an individual Q-ETraces object; otherwise the standard Q-ETraces object for the given Q-Function
is used. The algorithm is listed in Algorithm 3.
Algorithm 3 TD(λ) Q-Learning
  a_e ... estimated action
  π_e ... estimation policy
  for each new episode do
    etraces->resetETraces()
  end for
  for each new step <s_t, a_t, r_t, s_{t+1}> do
    if a_e ≠ a_t then
      etraces->resetETraces()       // e(s) = 0, for all s
    else
      etraces->updateETraces()      // e(s) = γλ·e(s), for all s
    end if
    a_e ← π_e(s_{t+1})
    td ← r_t + γ·Q(s_{t+1}, a_e) − Q(s_t, a_t)
    etraces->addETrace(s_t, a_t)    // e(s) += 1.0 or e(s) = 1.0
    etraces->updateQFunction(α·td)
  end for
We can disable resetting the e-traces when an incorrectly estimated action occurs with the parameter
ResetETracesOnWrongEstimate.
There is an individual class for SARSA learning, CSARSALearner, which uses the agent as its estimation
policy (remember that the agent is also a deterministic controller, which always returns the action
executed in the current state). There is also an individual class for Q-Learning, CQLearning, which automatically
uses a greedy policy as its estimation policy.
4.5 Action Selection
If we use a discrete set of actions, the action can be selected according to a distribution based on the action values.
But always taking the greedy action is not advisable, because we also need to incorporate some exploring
actions. There are three commonly used ways of selecting an action from the Q-Values:
The greedy policy: always take the best action.
The ε-greedy policy: take a random action with probability ε and the greedy action with
probability 1 − ε. This gives us the following action distribution:

$$P(s, a_i) = \begin{cases} 1 - \varepsilon + \frac{\varepsilon}{|A_s|}, & \text{if } a_i = \arg\max_{a' \in A_s} Q(s, a') \\ \frac{\varepsilon}{|A_s|}, & \text{else} \end{cases} \qquad (4.24)$$
The advantage of the ε-greedy policy is that the exploration factor ε can be set very intuitively.
The soft-max policy: the soft-max policy uses the Boltzmann distribution for action selection:

$$P(s, a_i) = \frac{\exp(\beta\, Q(s, a_i))}{\sum_{a_j \in A_s} \exp(\beta\, Q(s, a_j))} \qquad (4.25)$$

The parameter $\beta$ controls the exploration rate. The higher the value, the sharper the distribution
becomes; for $\beta \to \infty$ it converges to the greedy policy. When using the soft-max distribution, actions
with high Q-Values are more likely to be chosen than actions with a lower value, so it generally
has a better performance than ε-greedy policies, because the exploration is more guided. The
disadvantage is that the exploration rate also depends on the magnitude of the Q-Values, so it is harder
to find a good parameter setting for $\beta$.
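Both distributions are easy to sample from. The following sketch is purely illustrative (plain functions over an array of action values) and does not reproduce the CEpsilonGreedyDistribution or CSoftMaxDistribution implementation.

#include <vector>
#include <cmath>
#include <random>
#include <algorithm>

// values holds Q(s, a_i) for all available actions; both functions return
// the index of the sampled action.
int epsilonGreedy(const std::vector<double>& values, double epsilon, std::mt19937& rng)
{
    std::uniform_real_distribution<double> uniform(0.0, 1.0);
    if (uniform(rng) < epsilon) {
        std::uniform_int_distribution<int> randomAction(0, (int)values.size() - 1);
        return randomAction(rng);                       // exploring action
    }
    return (int)(std::max_element(values.begin(), values.end()) - values.begin());
}

int softMax(const std::vector<double>& values, double beta, std::mt19937& rng)
{
    std::vector<double> weights(values.size());
    for (size_t i = 0; i < values.size(); ++i)
        weights[i] = std::exp(beta * values[i]);        // Boltzmann weights
    std::discrete_distribution<int> dist(weights.begin(), weights.end());
    return dist(rng);
}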
4.5.1 Action Selection with V-Functions using Planning
We already covered how to learn the value function of a specified policy using TD(λ). Of course we usually
want to learn the optimal policy, and not just estimate the value function of a given policy. With the V-Function
alone, we cannot make any decision about whether an action is good or bad, so we cannot learn
a policy either. But we can use the transition function $f$ of the model (or learn the transition function if it
is not known) to make a one step forward prediction of the current state for every action. We can then use
the equation

$$Q^{\pi}(s, a) = E_{s'}[r(s, a, s') + \gamma V^{\pi}(s')]$$

where $s'$ is sampled from the distribution defined by $f(s, a)$, to calculate action values from these state
predictions. For strongly stochastic processes we would have to repeat the prediction several times, which is
rather time consuming. But for deterministic (or hardly stochastic) processes, for example processes with
a small amount of noise, we can omit the expectation and just calculate the Q-Value with a one step
forward prediction:

$$Q^{\pi}(s, a) = r(s, a, s') + \gamma V^{\pi}(f(s, a))$$

These action values have the same meaning as the Q-Values of an ordinary Q-Function. From this it follows
that by following a greedy policy (or a policy gradually converging to a greedy policy), we can also learn
the optimal policy $\pi^*$. The advantage of this approach is that we use the transition function (and the reward
function) as a kind of prior knowledge for our policy. This can boost the learning performance considerably,
in particular for continuous control tasks where the transition function usually defines some complex
dynamics.
Even if we do not know the transition function, we can use any kind of supervised learning algorithm to learn
the model. Learning the model is typically easier than learning the Q-Function (because it is a supervised
learning task), so we can divide the entire task of learning the Q-Function into learning the V-Function and
learning the transition function.
The disadvantage of this approach is that it requires comparatively much computation time. The policy has
to calculate the state prediction $s'$ and the value $V(s')$ for each action. Calculating this value can be quite
time consuming, especially if we use large RBF networks.
Planning for more than one step
Another nice advantage of this approach is that we can combine heuristic search approaches with V-Learning.
We are not restricted to using just a one step forward prediction: we can also use an N-step forward prediction
and span a search tree over the state space to calculate the action values. Hence we use a heuristic
search with a learned value function. The point of searching deeper than one step is to obtain a better action
selection, in particular if we have a perfect model but an imperfect value function. The action value of action
$a_t$ is then

$$Q(s_t, a_t) = r(s_t, a_t, s_{t+1}) + \max_{a_{t+1}, a_{t+2}, \ldots, a_{t+N-1}} \left[ \sum_{i=1}^{N-1} \gamma^i\, r(s_{t+i}, a_{t+i}, s_{t+i+1}) \right] + \gamma^N V(s_{t+N}) \qquad (4.26)$$

The number of states to predict ($\sum_{i=1}^{N} |A|^i = \frac{|A|^{N+1} - |A|}{|A| - 1}$) and the number of V-Functions to evaluate ($|A|^N$)
increase exponentially; consequently, planning can obviously only be done for small $N$.
Figure 4.4: In this example, a two step forward prediction is used to select the best action a_1.
4.5.2 Implementation in the RL Toolbox
Stochastic Policies
All the policies mentioned above can be seen as stochastic policies. In our design, stochastic policies are
represented by the abstract class CStochasticPolicy. The class calculates the action values of all available
actions in the current state and passes them to an action distribution object, which calculates the probability
distribution. Then an action is sampled from this distribution and returned. Since we use only a discrete
action set, we do not have to cope with action data objects in this case. The probability distribution itself
is also needed by a few algorithms; for that reason, we calculate the distribution in an individual public
method getActionProbabilities. There are three action distribution classes for the three discussed policies
(CGreedyDistribution, CEpsilonGreedyDistribution, CSoftMaxDistribution).
How the action values for calculating the distribution are obtained is not specified at this point (therefore
the class is abstract; this functionality is determined by the subclasses). The class CQStochasticPolicy uses
a Q-Function object for calculating the action values.
Figure 4.5: The stochastic policy: The distribution is determined by the action distribution object, which
can be a greedy, epsilon greedy or soft-max policy.
V-Planning
We want to use the same policies as for Q-Functions. For that reason, we implement an extra Q-Function
class which calculates the action values from the value function, the transition function and the reward
function. Of course this Q-Function is read-only and cannot be modified.
The class is called CQFunctionFromTransitionFunction. It takes a value function, the transition function
and the reward function as arguments. For the state prediction, a list of state modifiers is needed as an additional
argument. These modifiers are used by the V-Function (typically we can specify the state modifier list of
the agent here). Because we can only calculate model states with the transition function, we have to maintain
a separate state collection object which can also calculate and store the modified states.
For the N-step prediction the algorithm is more complicated. The search tree is built by calling the search
function recursively with a decremented search depth argument. The search process is stopped when we
reach a search depth of zero (that is, we are in a leaf). We maintain a stack of already calculated state collections,
and the search function always uses the first state collection from the stack as the current predicted
state. Before the recursive function calls (for each action we create a new branch), the new predicted state is
pushed onto the stack. When the recursive calls are finished, the predicted state collection is removed again.
The search depth of the search tree can be specified by the parameter SearchDepth.
This Q-Function can be used with the stochastic policy class. The class CVMStochasticPolicy is inherited
from CQStochasticPolicy and creates such a Q-Function itself, making it more comfortable to use.
4.6 Actor-Critic Learning
Actor-Critic algorithms are methods which represent the value function and the policy in distinct data structures.
The policy is called the actor; the value function is known as the critic. The value function is learned
in the usual way (with any V-Learning approach we want). The critique coming from the V-Function is usually
the temporal difference (although there are a few algorithms which use other quantities coming from the
V-Function), which indicates whether the executed action of the actor was better than expected (positive
critique) or worse (negative critique). The actor can then adapt its policy according to this critique. Actor-Critic
learning has two main advantages:
We can learn an explicitly stochastic policy, which can be useful in competitive or non-Markovian
processes.
For a continuous action set, we can represent the policy directly and calculate the continuous action
vector. When learning action values, we would have to search through an infinite set of actions to
pick the best one. Although it is possible to discretize the action space, we still get a huge number of
actions for a high dimensional continuous action space.
Figure 4.6: The general Actor-Critic architecture.
4.6.1 Actors for two different actions
To illustrate the Actor-Critic design we discuss a very simple actor. The algorithm was proposed by Barto
[8] in 1983. The actor can only choose between two different actions. It stores an action value p(s) just for
the first action.
The action value is then updated according to the rule

$$\Delta p_t(s) = \begin{cases} 0.5 \cdot \beta \cdot critique, & \text{if } a_t = a_1 \\ -0.5 \cdot \beta \cdot critique, & \text{else} \end{cases} \qquad (4.27)$$

where $\beta$ is the learning rate of the actor. If the first action has been used, the critique is added to the action
value; otherwise, the critique is subtracted. Thus the action value increases if the first action yielded a
good result or the second action yielded a bad one.
The first action is then selected with the probability

$$P(a_t = a_1) = \frac{1}{1 + \exp(-p(s_t))} \qquad (4.28)$$

Thus action number one is taken with high probability if the action value is positive.
We can also use this algorithm with e-traces. The current state is added to the e-traces by the following
equation:

$$\Delta e_t(s_t) = \begin{cases} +0.5, & \text{if } a_t = a_1 \\ -0.5, & \text{else} \end{cases} \qquad (4.29)$$

The remaining e-trace updates (attenuation, replacing or accumulating traces) and the updates of the action
values are carried out as usual. Consequently, we arrive at the following action value update rule:

$$\Delta p_t(s) = \beta \cdot critique \cdot e(s), \quad \text{for all } s \qquad (4.30)$$
4.6.2 Actors for a discrete action set
Actors for a discrete action set calculate a probability distribution for taking an action in state s. Action
values for each state are used to indicate the preference for selecting an action (similar to Q-Functions).
The action can then be selected again by a probability distribution (as discussed earlier in the subsection on action
selection).
There are different approaches for updating this action value function:
The main approach, discussed in [49], p. 151, is related to the TD update of a Q-Function:

$$\Delta p(s_t, a_t) = \beta \cdot critique \qquad (4.31)$$

The difference to TD learning is that in this case we use a separate value function to estimate the
temporal difference.
Another approach is to include the inverse probability of selecting action $a_t$. As a result, the action
values of rarely chosen actions receive a higher emphasis:

$$\Delta p(s_t, a_t) = \beta \cdot critique \cdot (1 - \pi_t(s_t, a_t)) \qquad (4.32)$$

where $\pi_t$ is the stochastic policy at time t.
The problem with this approach is that actions with a high probability are no longer updated at all.
In the Toolbox we intermix the two approaches, using a minimum learning rate for all actions and
higher learning rates for actions with low probabilities. Both approaches are also implemented with e-traces.
Again, we can use the e-traces already discussed for Q-Functions.
4.6.3 Implementation in the RL Toolbox
The critic is already implemented; we can use any V-Function learner class. To receive the critique for
a given state-action pair, we create the interface CErrorListener. Error listeners receive an error value of
an unspecified function (to maintain generality, the type of function is not specified here) for a given state-action
pair. In our case this function is the value function; therefore we extend our V-Learner and Q-Learner
classes. The TD learner classes can also maintain a list of error listeners, and after calculating the TD error,
this quantity is sent to the error listeners. Therefore all actors needing the TD value must implement the
error listener interface. Then we only have to add the actor to the error listener list of the TD learner. Note
that through this approach we can use either a V-Function or a Q-Function as the critic.
Actors can also implement the agent controller interface directly, but this is not mandatory for every actor,
as we will see.
Actors for two different actions
The class representing this type of actor is called CActorFromValueFunction. This class maintains a V-Function
object representing the p values and a designated e-traces object. The actor class also implements
the discussed policy using the agent controller interface. The rest of the implementation is straightforward,
using the already implemented classes.
Figure 4.7: The class architecture of the Toolbox for the Actor-Critic framework
Actors for a discrete action set
In the RL Toolbox, these two approaches are implemented by the classes CActorFromQFunction and
CActorFromQFunctionAndPolicy. Both algorithms use Q-Functions to represent the action values. Because the
actors themselves cannot be used as policies, we can use the Q-Function for the CQStochasticPolicy class. This policy
must also be passed to the latter class, because the stochastic policy is needed for the action value update.
For the e-traces, a designated Q-ETraces object is used, so we have the entire previously discussed functionality
for the e-traces.
4.7 Exploration in Reinforcement Learning
In RL problems, we usually face a trade-off between exploration and exploitation. In order to find an
optimal policy and increase the reward during learning, we have to follow the action considered best at the
time (short-term optimization). But without exploration (long-term optimization), we can never be certain
that the supposedly best action is really the optimal action, because certain values or action values may still
be wrong. Thus we have to make sure that we visit all areas of the state space thoroughly enough while
still following a good policy. In this section we will discuss the results from Thrun [51] and Wyatt [55], who
deal with the problem of directed exploration.
There are basically two methods for integrating exploration in RL: undirected exploration and directed
exploration. Undirected exploration schemes induce exploration only randomly, while directed exploration
approaches use further knowledge about the learning process.
Thrun [51] proved that for finite deterministic MDPs, the worst case complexity of learning a given
task is exponential in the size of the state space if we use an undirected exploration method, while the
worst case bound is polynomial if we use a directed exploration approach. These results were not generalized
to arbitrary infinite stochastic MDPs, but intuitively we can say that directed exploration also reduces the
complexity of learning in general MDPs.
4.7.1 Undirected Exploration
Undirected exploration schemes rely only on the knowledge of the optimal control; they make better actions more likely, and exploration is ensured only at random. We already discussed two undirected exploration schemes: the soft-max policy and the ε-greedy policy. If we initialize our action value function with the upper bound Q_max, we get a special case of undirected exploration where each action is guaranteed to be tried at least once by the agent. In addition, actions which have been selected more often in a certain state are likely to have lower Q-Values than rarely chosen actions; consequently, the probability of taking a frequently chosen action again is reduced.
In this case we therefore get a sort of mixture of the undirected and the directed exploration schemes. This method of incorporating exploration is often called optimistic value initialization. Note that this is not possible with all function approximation schemes that we will discuss in chapter 6. For example, neural networks are global function approximators, i.e. changing the value of state s_t changes the value of many other states s', even if s' has never been visited. Consequently, the value of a state cannot be used to estimate how often the state has been visited. Taking the Q-Value as the exploration measure is often referred to in the literature as a utility-based measure.
4.7.2 Directed Exploration
Different exploration measures can be used, among them counter-based, recency-based and error-based measures. In this thesis, only the use of counter-based measures is investigated in more detail, but the other exploration measures can be implemented in the Toolbox easily. In the following discussion, the exploration measure of executing action a in state s will be called ε(s, a). This exploration measure (exploration term) is usually linearly combined with a Q-Value (exploitation term) to calculate new action values:

Eval(s, a) = Q(s, a) + α · ε(s, a)   (4.33)

The action value Eval(s, a) can again be used for action selection in the usual way, for example with any stochastic policy. The exploration factor α is typically decreased over time in order to converge to a greedy policy. Thrun [51] also proposed a method for dynamically adjusting this weighting, which is called Selective Attention. The different exploration measures can be combined to obtain a more effective, but also more complex exploration policy.
Counter-based measures
Counter-based measures count the number of visits C(s) of each state. The exploration measure is typically the expected number of visits of the next state s_{t+1}:

ε(s_t, a) = E_{s_{t+1}}[ C(s_{t+1}) | s_t, a ]   (4.34)
An exploration policy would try to minimize this term, so using a linear combination with an exploitation term (which has to be maximized) does not work. For this reason, and in order to ensure that counter-based exploration measures converge with time, Thrun proposed the following counter-based exploration measure:

ε(s_t, a) = C(s_t) / E_{s_{t+1}}[ C(s_{t+1}) | s_t, a ]   (4.35)
which is the ratio between the number of visits of the current state and that of the successor state. In order to obtain an exploration policy, this measure now has to be maximized, which also makes it suitable for the linear combination again. The expectancy E_{s_{t+1}}[ C(s_{t+1}) | s_t, a ] can either be learned or estimated using a model of the MDP.
For the counter, on the other hand, a decay term can be used in order to incorporate recency information (see next section) about the state visits. In this case the counter is updated by

C(s) ← λ · C(s) + 1,  if s = s_t
C(s) ← λ · C(s),      otherwise
(for all s)   (4.36)

where λ is the decay factor. Counter-based measures with decay can be seen as a combination of counter-based and recency-based measures.
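The decayed counter update of equation 4.36 amounts to only a few lines of code. The following sketch assumes a tabular counter and a decay factor λ passed in as a constructor argument; it is an illustration, not the Toolbox implementation.

#include <vector>
#include <cstddef>

// Decayed state-visit counter as in equation 4.36 (illustrative, tabular version).
class DecayedVisitCounter {
public:
    DecayedVisitCounter(int numStates, double decay)
        : counts(numStates, 0.0), lambda(decay) {}

    // Called once per step with the current state s_t.
    void visit(int currentState) {
        for (std::size_t s = 0; s < counts.size(); ++s)
            counts[s] *= lambda;           // decay all counters
        counts[currentState] += 1.0;       // increment the visited state
    }

    double count(int s) const { return counts[s]; }

private:
    std::vector<double> counts;
    double lambda;   // decay factor in (0, 1]; lambda = 1 gives the plain counter
};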
Recency-based measures
This exploration measure estimates the time that has elapsed since a state was last visited; it is therefore well suited for changing environments. Sutton [48] suggested using the square root of the time τ(s, a) elapsed since the last selection of a in state s:

ε(s, a) = √(τ(s, a))   (4.37)
Error-based measures
Another way to construct an exploration measure is to use the expected error of the value function in state s. If the error of the value function is large in state s, visiting state s and updating V(s) is understood to be preferable. We can use the average over the last temporal difference values as an estimate of the expected error of V(s).
4.7.3 Model Free and Model Based directed exploration
In this section we will take a more detailed look at calculating the counter-based exploration measure. For counter-based measures, we need an estimate of E_{s_{t+1}}[ C(s_{t+1}) | s_t, a ]. The expectancy can be estimated by one of the following two methods.

Model Based Method: In this case we have a model of the (stochastic) MDP. The expectancy is given by

E_{s_{t+1}}[ C(s_{t+1}) | s_t, a ] = Σ_{s'} P(s' | s_t, a) · C(s')   (4.38)

If the model is not given, we can learn the stochastic model (see section 4.8.3).
Model Free Method: In this case, we can estimate the expectancy with an exponentially weighted average:

Ê_{t+1}[ C(s_{t+1}) | s_t, a ] = (1 − α) · Ê_t[ C(s_{t+1}) | s_t, a ] + α · C(s_{t+1})   (4.39)
Wyatt [55] empirically confirmed the intuitive statement that model-based exploration methods work better than model-free ones.
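The model-free estimate of equation 4.39 is a simple exponentially weighted average per state-action pair. A minimal tabular sketch (class and member names are illustrative):

#include <vector>

// Model-free estimate of E[C(s_{t+1}) | s_t, a] as in equation 4.39.
class VisitExpectancyEstimator {
public:
    VisitExpectancyEstimator(int numStates, int numActions, double learningRate)
        : estimate(numStates, std::vector<double>(numActions, 0.0)), alpha(learningRate) {}

    // Called with each observed transition (s_t, a_t, s_{t+1}) and the counter value C(s_{t+1}).
    void update(int s, int a, double counterOfNextState) {
        estimate[s][a] = (1.0 - alpha) * estimate[s][a] + alpha * counterOfNextState;
    }

    double expectedNextCount(int s, int a) const { return estimate[s][a]; }

private:
    std::vector<std::vector<double>> estimate;
    double alpha;   // learning rate of the exponential average
};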
4.7.4 Distal Exploration
All the discussed exploration measures take only the next state s_{t+1} into account in their calculations. But it is advantageous to base the action decision on future exploration measures as well. In this case, we can either use planning methods in the same way as discussed for V-Planning, or we can learn the future exploration measure as done by Wyatt. We can state the exploration problem as a reinforcement learning problem by taking the exploration measure as the reward signal. For this reward signal, we can define a separate exploration value function (here for the counter-based case)

V_ε(s) = E_{s'}[ C(s') + γ V_ε(s') | s, π(s) ]   (4.40)

and also an exploration action value function

Q_ε(s, a) = E_{s'}[ C(s') + γ V_ε(s') | s, a ]   (4.41)

Having formulated these equations, we can use the same algorithms to learn the exploration value function V_ε(s) as we use for the standard value function V(s). For action selection, the exploration value function is used instead of the immediate exploration measure ε(s, a). For example, the distal counter-based approach uses the following equation to evaluate the merit of an action:

Eval(s, a) = C(s) / Q_ε(s, a)   (4.42)

Wyatt [55] proposed two different methods for learning the exploration function: a model-free approach and a model-based approach. These approaches correspond exactly to the value function learning algorithms TD(λ)-Learning and Prioritized Sweeping, so they are not covered in more detail in this section. The principal difference between an exploration value function and a standard value function is that the exploration value function is non-stationary, because the reward signal (the exploration measure) changes over time. Consequently, calculating the exploration value function is a more complex task than calculating the value function. Nevertheless, Wyatt showed empirically that distal exploration methods can outperform their local counterparts in a gridworld example.
4.7.5 Selective Attention
All the discussed methods use a fixed linear combination of the exploitation and the exploration term. This can be ineffective, particularly if the optimal actions for exploitation and for exploration point in exactly opposite directions; the exploration rule might then yield an action which neither explores nor exploits. A basic idea for overcoming this problem is to use selective attention to determine the current behavior (exploration or exploitation); the dynamically calculated attention parameter η ∈ [0, 1] is then used for action selection:

Eval(s, a) = η · Q(s, a) + (1 − η) · ε(s, a)   (4.43)

Consequently, a value of η = 1.0 totally exploits and a value of η = 0.0 totally explores. Thrun [51] proposed the following rules to calculate the attention parameter η:

Δ = η_{t−1} · Q_t(s, a) / V_t(s) − (1 − η_{t−1}) · ε(s, a)   (4.44)

η_t = η_min + σ(Δ) · (η_max − η_min)   (4.45)

where σ is a sigmoidal function. Typical values for η_min and η_max are 0.1 and 0.9. The value Δ estimates the exploration-exploitation tradeoff under the current attention setting. The new η value is calculated by squashing Δ with the sigmoidal function. If the exploitation measure Q_t(s, a)/V_t(s) is large compared to the exploration measure, Δ will be positive and η is pushed in one direction, and vice versa for large exploration measures. Because the previous η value is incorporated when calculating Δ, the η values cannot change abruptly, so either exploring or exploiting actions are executed over several time steps.
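The attention update of equations 4.44 and 4.45 can be summarized in a few lines. The sketch below uses a logistic function as the sigmoidal squashing function and takes Q_t(s, a), V_t(s) and the exploration measure as plain numbers; it is an illustrative reading of the equations, not the Toolbox code.

#include <cmath>

// Selective attention update (equations 4.44 and 4.45), illustrative version.
struct SelectiveAttention {
    double eta = 0.5;                  // current attention parameter
    double etaMin = 0.1, etaMax = 0.9;

    // q, v: current Q(s, a) and V(s); eps: exploration measure of (s, a).
    void update(double q, double v, double eps) {
        double exploitation = (v != 0.0) ? q / v : q;           // relative exploitation measure
        double delta = eta * exploitation - (1.0 - eta) * eps;  // imbalance of the two terms
        double squashed = 1.0 / (1.0 + std::exp(-delta));       // sigmoidal squashing
        eta = etaMin + squashed * (etaMax - etaMin);
    }

    // Combined action value as in equation 4.43.
    double eval(double q, double eps) const {
        return eta * q + (1.0 - eta) * eps;
    }
};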
4.7.6 Implementation in the RL Toolbox
In the Toolbox, only the counter-based exploration measures are implemented directly, but the other exploration schemes can be implemented easily. For example, to estimate the expected error of the value function, we can exploit the already existing error listener interface. For the counter-based methods, we use a value function object as counter, because it already provides all the necessary functionality. We implement two classes to count the state visits and the state-action visits (CVisitStateCounter and CVisitStateActionCounter). These classes take a feature V-Function (or feature Q-Function) and increase the value of the current state by one at each step. Through this approach, we can use tables and feature functions (e.g. RBF networks) as counters.
The exploration Q-Function and exploration Policy
The exploration Q-Function CExplorationQFunction calculates the exploration measure

ε(s_t, a) = C(s_t) / E_{s_{t+1}}[ C(s_{t+1}) | s_t, a ]

Thus it takes a value function which represents C(s_t) and an action value function which represents E_{s_{t+1}}[ C(s_{t+1}) | s_t, a ] for the local case, or Q_ε(s, a) for the distal case. The class is a read-only Q-Function class, so only the getValue function is implemented in the described way.
The exploration policy CQStochasticExplorationPolicy implements the weighted sum of an exploitation Q-Function and an exploration Q-Function to calculate the action values. Both are given as abstract Q-Function objects. The sum is calculated by

Eval(s, a) = η · Q(s, a) + (1 − η) · ε(s, a)

where the weighting η is a parameter which can be set by the parameters interface and is initialized with 0.5. The class is a subclass of the stochastic policy class; because the remaining action selection part is not changed, we can use any action distribution for action selection.
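A minimal sketch of how such an exploration policy can combine an exploitation and an exploration Q-Function before handing the values to an action distribution; the interfaces are simplified stand-ins for the abstract Q-Function objects mentioned above:

#include <vector>

// Abstract Q-Function stand-in.
class QFunction {
public:
    virtual ~QFunction() {}
    virtual double getValue(int state, int action) const = 0;
};

// Combines exploitation and exploration values, as in the weighted sum used for Eval(s, a).
class ExplorationPolicyValues {
public:
    ExplorationPolicyValues(const QFunction *exploit, const QFunction *explore, double weight)
        : qExploit(exploit), qExplore(explore), eta(weight) {}

    // Action values that a stochastic action distribution (e.g. soft-max) can work on.
    std::vector<double> actionValues(int state, int numActions) const {
        std::vector<double> values(numActions);
        for (int a = 0; a < numActions; ++a)
            values[a] = eta * qExploit->getValue(state, a)
                      + (1.0 - eta) * qExplore->getValue(state, a);
        return values;
    }

private:
    const QFunction *qExploit;
    const QFunction *qExplore;
    double eta;   // weighting parameter, e.g. initialized with 0.5
};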
Local exploration
For the local exploration case we have two different ways to calculate E_{s_{t+1}}[ C(s_{t+1}) | s_t, a ]:

Model Based: Here we can use the standard planning techniques which are already implemented in the Toolbox. If we have a stochastic model, we can use the class CQFunctionFromStochasticModel; for a deterministic model we can use CQFunctionFromTransitionFunction.

Model Free: The model-free approach requires an additional learner class CVisitStateActionEstimator, which is derived from the agent listener class and estimates E_{s_{t+1}}[ C(s_{t+1}) | s_t, a ] according to equation 4.39.
Distal exploration
Distal exploration approaches use the same learning algorithms as value function (likewise action value function) learning; we just need to define an appropriate reward function. A new reward function class CRewardFunctionFromValueFunction is implemented, which takes a value function as input and always returns the value of the current state as the reward. For this value function, the counter is used.
Another value function object is needed for learning the exploration value function V_ε(s). We can use any of the already existing architectures to do this; we can learn the exploration value function V_ε(s) and calculate the action values via planning, or we can learn the exploration action values Q_ε(s, a) directly. Alternatively, prioritized sweeping (see section 4.8.3) can be used as the model-based approach.
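To illustrate the distal approach, the sketch below turns a state counter into a reward function for an ordinary value function learner, which is the role played by the counter together with CRewardFunctionFromValueFunction in the Toolbox. The class names and the tabular counter are illustrative simplifications.

#include <vector>

// A state counter used as the reward signal of an "exploration MDP" (illustrative).
class StateCounter {
public:
    explicit StateCounter(int numStates) : counts(numStates, 0.0) {}
    void visit(int s) { counts[s] += 1.0; }
    double value(int s) const { return counts[s]; }
private:
    std::vector<double> counts;
};

// Reward function adapter: the reward of a step is the counter value of the current state.
class RewardFromCounter {
public:
    explicit RewardFromCounter(const StateCounter *c) : counter(c) {}
    double reward(int s, int /*action*/, int /*nextState*/) const {
        return counter->value(s);
    }
private:
    const StateCounter *counter;
};

Any TD learner can then be trained on this reward signal to obtain the exploration value function, keeping in mind that the signal is non-stationary because the counter keeps changing.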
Selective Attention
For selective attention, we implement an additional agent listener class CSelectiveExplorationCalculator. The class takes an exploration policy object and calculates the attention value η according to equations 4.43 and 4.45; it adapts the η value of the exploration policy at each step.
4.8 Planning and model based learning
In this section, we will discuss a few planning methods which can be used in combination with learning. First we will look at the different definitions of planning and learning; then we will discuss the Dyna-Q algorithm and Prioritized Sweeping as an extension of dynamic programming.
4.8.1 Planning and Learning
Planning and learning are two very popular keywords in the area of Artificial Intelligence. But what exactly is the difference between planning and learning? We use the same definition as Sutton [49], p. 227, who distinguished learning from planning in the following way:

Learning: Improving the policy from real experience. Learning always uses the current step tuple < s_t, a_t, r_t, s_{t+1} > to improve its performance.

Planning: Produces or improves a policy with simulated experience. Planning methods use models to generate new experience, or experience from the past. This experience is called simulated, because it has not been experienced by the agent at time step t. So we see that planning can also improve the policy without executing any action.
We will now take a closer look at state space planning, which is viewed primarily as a search through the
state space for a good policy. We already discussed the V-Planning approach and dynamic programming as
representatives of this approach. Two basic ideas are very similar to the value based learning approaches:
State space planning methods compute value functions to improve their policy.
The value function is computed by backup operations applied to simulated experience.
Thus the main purpose of using planning methods in combination with learning is to exploit the training
data more efficiently.
4.8.2 The Dyna-Q algorithm
The Dyna-Q algorithm, as proposed in [49], integrates planning and learning methods. The original Dyna-Q algorithm uses a standard Q-Learning algorithm without e-traces for estimating the action values. After the common update (the learning part), the algorithm updates the Q-Values for N randomly chosen state-action pairs from the past (the planning part). The stored step information thus serves as a kind of replacement for the e-traces, accumulating the current TD error onto past states. The number of planning update steps can vary, depending on how much time is left before the next action decision takes place (for example, in robotic tasks surplus computational resources can be spent on these update steps).
The problem is that for large state spaces, we get many different state-action pairs which we have already experienced in the past, so it is unlikely that good state-action pairs with a high temporal difference occur. The advantage over e-traces is that we break the temporal order of the e-traces and thereby may discover better action selection strategies. How to sample state-action pairs from the past is still an open research problem. A few approaches [49] attempt to sample the on-policy distribution (the distribution of the state-action pairs visited when following the policy). However, estimating this distribution accurately requires many evaluation steps itself.
Another positive aspect of this approach is that we do not make any assumptions about the used state space. Other planning algorithms like DP only work for discrete state spaces, but here we only need to store the steps taken during the learning trial.
The realization of the Dyna Q-Learning architecture in the Toolbox is quite simple, because all parts of this architecture have already been designed. The Q-Learner already exists and the batch step update class provides a uniform sampling of the past steps. We will also investigate the appropriateness of this architecture for other learning algorithms like Q(λ) or SARSA(λ) and find out whether or not this can improve the performance of the standard algorithm.
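The core Dyna-Q loop can be sketched as one real Q-Learning update followed by N planning updates on randomly replayed past steps. The following tabular C++ sketch is an illustration under these simplifying assumptions, not the Toolbox's agent logger / batch step update implementation.

#include <vector>
#include <random>
#include <algorithm>
#include <cstddef>

struct Step { int s, a, sNext; double r; };

// Simplified tabular Dyna-Q: learning update plus N replayed planning updates.
class DynaQ {
public:
    DynaQ(int numStates, int numActions, double alpha, double gamma, int planningSteps)
        : q(numStates, std::vector<double>(numActions, 0.0)),
          alpha(alpha), gamma(gamma), N(planningSteps), rng(0) {}

    void observe(const Step &step) {
        qUpdate(step);                 // learning part: update with the real step
        history.push_back(step);       // remember the step for planning
        std::uniform_int_distribution<std::size_t> pick(0, history.size() - 1);
        for (int i = 0; i < N; ++i)    // planning part: replay N random past steps
            qUpdate(history[pick(rng)]);
    }

private:
    void qUpdate(const Step &st) {
        double best = *std::max_element(q[st.sNext].begin(), q[st.sNext].end());
        double td = st.r + gamma * best - q[st.s][st.a];
        q[st.s][st.a] += alpha * td;
    }

    std::vector<std::vector<double>> q;
    std::vector<Step> history;
    double alpha, gamma;
    int N;
    std::mt19937 rng;
};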
Figure 4.8: Using the Toolbox for Dyna-Q Learning with the agent logger and batch step update classes.
4.8.3 Prioritized Sweeping
Prioritized Sweeping (PS) is a model-based RL method which combines learning with planning (DP). PS performs the same full backups as Dynamic Programming, but attempts to update only the most important states. Therefore, PS maintains a priority list for the states; at each update of a state s, the priorities of all states s' that can reach s with a single action a are increased by the expected size of the change in their value. The expected change in the value of a predecessor state s' can be calculated by P(s | s', a) · b(s), where b(s) is the Bellman error that occurred in state s, given by the difference of the value before and after the update. Thus the Bellman error can be expressed by the equation

b(s) = Σ_a π(s, a) Σ_{s'} P(s' | s, a) · ( r(s, a, s') + γ V(s') − V(s) ) = V_new(s) − V_old(s)   (4.46)
In PS the agent actually performs actions, so in contrast to DP, it is not just a planning algorithm. The value
of the current state is constantly updated (using the same procedure for the priorities of the predecessor
states); this is an additional help in choosing states for the updates. After updating the current state, the states
from the priority list can be updated as long as there is time remaining. In addition, PS does not require a
completely known model of the MDP; instead the model can be learned online with the information from
the performed steps of the agent. The PS algorithm is listed in algorithm 4:
Algorithm 4 Prioritized Sweeping

function addPriorities(state s, b)
  for all actions a do
    for all states s' that can reach s with a do
      priority(s') += P(s | s', a) · b
    end for
  end for

function updateValue(state s)
  V_old ← V(s)
  V_new ← Σ_a π(s, a) Σ_{s'} P(s' | s, a) · ( r(s, a, s') + γ V(s') )
  V(s) ← V_new
  b ← V_new − V_old
  addPriorities(s, b)

for each new step < s_t, a_t, r_t, s_{t+1} > do
  update the model parameters
  updateValue(s_t)
  while there is time do
    s' ← argmax_s priority(s)
    updateValue(s')
  end while
end for
DP methods use a discrete state representation for planning, but can easily be extended to work with linear features. The planning updates themselves still operate on discrete state indices, but it makes no difference whether these indices represent discrete states or linear feature indices. Hence, if the model learning part can cope with linear features, we can use linear function approximation for PS as well.
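The priority bookkeeping of algorithm 4 can be sketched with a simple map from states to priorities. The predecessor information (which states can reach s with which probability) is assumed to come from the learned model; all names and the data layout are illustrative simplifications.

#include <map>
#include <vector>
#include <utility>
#include <cmath>

// Skeleton of the prioritized sweeping bookkeeping (illustrative, not the Toolbox classes).
class PrioritySweeper {
public:
    // Each entry is a pair (s', P(s | s', a)) for a predecessor s' of the updated state s.
    using PredecessorList = std::vector<std::pair<int, double>>;

    void addPriorities(int /*s*/, double bellmanError, const PredecessorList &preds) {
        for (const auto &pred : preds)
            priority[pred.first] += pred.second * std::fabs(bellmanError);
    }

    // Returns the state with the highest priority and removes it from the list.
    bool popMaxPriorityState(int &state) {
        if (priority.empty()) return false;
        auto best = priority.begin();
        for (auto it = priority.begin(); it != priority.end(); ++it)
            if (it->second > best->second) best = it;
        state = best->first;
        priority.erase(best);
        return true;
    }

private:
    std::map<int, double> priority;
};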
There is another approach called Generalized Prioritized Sweeping [2], which can cope with general parametric representations of the model and the value function. This algorithm was tested with dynamic Bayesian networks on a grid-world example. There was no time to implement this particular algorithm, but such algorithms are not very promising for continuous problems anyway.
4.8.4 Implementation in the RL Toolbox
The priority list algorithm has already been implemented for the value iteration class, so we derive our PS algorithm from that class. Remember that we are able to learn either the V-Function or the Q-Function with that approach. The rest of the algorithm is implemented in a straightforward manner within the agent listener interface. We can specify either a number of updates K that are done after each step or the maximum time the updates may take.
Learning the transition Model
In the Toolbox, we provide techniques for learning the distribution model for discrete state representations as well as for linear feature states. In this section, we always refer to discrete integer states instead of state objects; when using feature states, by state indices we always mean single feature indices. For both state representations the same super class CAbstractFeatureStochasticEstimatedModel is used; this class is a subclass of CFeatureStochasticModel and can consequently be used by the DP classes. Our restriction in representing the learned data is that access to the transition probabilities must work in exactly the same way as with the fixed model (see the DP section). Therefore the estimated model class also maintains a visit counter for each state-action pair (CStateActionVisitCounter). The transition objects themselves still store only the probabilities.
The number of state-action visits can then be used in combination with the probability to calculate the frequency of occurrence of a specified transition < s, a, s' >. In order to update the probability of the current transition < s_t, a_t, s_{t+1} >, we calculate the frequency of occurrence for each transition beginning with < s_t, a_t >, increment the number of visits of < s_t, a_t >, and increment the calculated number of occurrences of the transition < s_t, a_t, s_{t+1} >. Then we calculate the new probabilities again by dividing all transition counts N(s_t, a_t, s') by the new visit counter N(s_t, a_t).
This approach can cope with two different state representations: discrete and linear feature states. For discrete states we just add the occurred transition to the probability matrix as explained; this is performed by the class CDiscreteStochasticEstimatedModel.
For feature states, these updates have to be done for all combinations of feature indices from the current and the next state. In this case, we do not increment the visit counters by 1.0; rather, we increment the visit counter of the transition < f_i, a_k, f_j > by the product of the two feature activation factors.
Learning the reward Function
Since PS is a DP method, we still need a reward function for our discrete state (or feature) indices. For the reward function, we need to store the reward value for each transition < s_t, a_t, s_{t+1} >. Obviously the reward value of many transitions will be zero, since only a few transitions are possible. In our approach, we store a map for each state-action pair < s, a > with all possible successor states s' as index and the reward as function value. This representation allows quick and efficient access to the reward values. At each step, the reward r_t is added to the current value of the map. To calculate the average, the estimated stochastic models are used to fetch the number of occurrences of the transitions.
Chapter 5
Hierarchical Reinforcement Learning
One of RL's principal problems is the curse of dimensionality: the number of parameters increases exponentially with the size of a compact state representation. One approach to combatting this curse is temporal abstraction of the learning problem. That is, decisions do not have to be made at each step; rather, temporally extended behaviors are invoked which then follow their individual policies. Adding a hierarchical control structure to a control task is a popular way of achieving this temporal abstraction. In this chapter we begin with a theoretical section on hierarchical RL, in particular Semi-MDPs and options. Then we briefly discuss three approaches to hierarchical RL: learning with Options [40], the Hierarchy of Abstract Machines (HAM, [36]) and MAX-Q learning [16]. After this theoretical part, we will take a look at hierarchical approaches used for continuous control tasks. We conclude with the RL Toolbox implementation of hierarchical RL.
5.1 Semi Markov Decision Processes
Semi-Markov Decision Processes (SMDPs), as defined in [40], have the same properties as MDPs, with one exception: the actions are now temporally extended. Each temporally extended action defines its own policy and thus returns primitive actions. We will call these temporally extended actions options. An option o_i ∈ O has the following parts:
The initiation set I ⊆ S, the set of all states where the option can be initiated.
Either a stochastic policy π_i : S × A → [0, 1] or a deterministic policy π_i : S → A.
A termination condition β_i.
For Markov options, the termination condition and the policy must depend only on the current state s_t. In this case, we can assume that all states where the option does not have to end are also part of the initiation set of the option. Consequently, it is sufficient to define Markov options only on their initiation set. However, in many cases we also want to terminate an option differently, for example after a given period of time, which contradicts the Markov option property (the termination condition then depends not on the current state, but on the sequence of past states). Therefore we soften the requirements on β and the policy. For Semi-Markov options, the termination condition and the policy may depend on the entire sequence of states and primitive actions that occurred since the option was initiated. We will call this sequence < s_t, a_t, r_{t+1}, s_{t+1}, a_{t+1}, ..., r_τ, s_τ > the history of option o, which began at time step t. The set of all possible histories is called Ω. Thus we have β : Ω → [0, 1] and π : Ω → A. Note that the composition of two Markov options is in general not Markov, because the actions are selected differently before and after the first option has ended; that is, a composed option also depends on the history.
Policies over options select an option o ∈ O according to a given probability distribution π(s, o) and execute it until the termination condition is met; then a new option is selected. The options themselves return primitive actions, which are executed by the agent.
A policy which returns the primitive actions arising from the current option is called the flat policy. In general, the flat policy is semi-Markov, even if the policy itself and the options are Markov (since the flat policy is a composition of options).
This implies that for hierarchic architectures, where options can in turn choose from options (and so define a policy over options), we always have to deal with semi-Markov options. For any MDP and any option set O defined on this MDP, the decision process which selects from those options is a Semi-MDP. Note that primitive actions can also be seen as options which always terminate after one step.
Figure 5.1: Illustration of the execution of an MDP, an SMDP and an MDP with options.
5.1.1 Value and action value functions
Of course we must also consider the options in our V- and Q-Functions. The value function of an SMDP is defined as

V^π(s_t) = E[ r_{t+1} + γ r_{t+2} + ... + γ^{k−1} r_{t+k} + γ^k V^π(s_{t+k}) ]   (5.1)

For the recursive form of the value function, we need an additional definition. Let r(s, o) = E[ r_{t+1} + γ r_{t+2} + ... + γ^{k−1} r_{t+k} | s_t = s, o_t = o ] be the expected discounted reward accumulated while executing option o in state s. The value function can then be expressed by:

V^π(s) = Σ_{o ∈ O} π(s, o) [ r(s, o) + Σ_{s'} P(s' | s, o) Σ_k P(k | s, o, s') γ^k V^π(s') ]   (5.2)

where P(s' | s, o) is the probability that the executed option terminates in s', and P(k | s, o, s') is the probability that option o ends after k steps, given that o is already known to end in state s'.
The corresponding equation for the Q-Function is:

Q^π(s_t, o) = E[ r_{t+1} + γ r_{t+2} + ... + γ^{k−1} r_{t+k} + γ^k V^π(s_{t+k}) | o_t = o ]
            = r(s, o) + Σ_{s'} P(s' | s, o) Σ_k P(k | s, o, s') γ^k Σ_{o' ∈ O} π(s', o') Q^π(s', o')   (5.3)
Similar to the MDP approach, we arrive at the optimal value function V* and the optimal action value function Q* by using a greedy policy for π.
5.1.2 Implementation in the RL Toolbox
Some differences arise naturally when implementing the learner classes for semi-MDP learning. In this case, we need to represent the duration of an action and incorporate the duration information into the Dynamic Programming and Temporal Difference algorithms.
Temporally Extended Actions in the RL-Toolbox
We have already discussed multi-step actions in the action model; now we know the precise requirements of these actions. The multi-step action (CMultiStepAction) has two additional data fields: the number of steps the action has been executed so far and a flag indicating whether the action has finished in the current step. The duration and the finished flag are updated at every step. Whenever an action has finished, the step information containing the duration field of the action is sent to all listeners and a new action is selected. The isFinished method initially depends only on the current state, so it is primarily designed for Markov options. If we want to use semi-Markov options, the action object additionally needs access to an episode object, or it can use its duration field if the duration of the option is all that is needed.
Having discussed how the duration information is passed to the listeners, we can now look at the differences in the algorithms which we have already discussed for the MDP case. All algorithms support SMDP learning automatically if they recognize multi-step actions (unless told otherwise explicitly), so we do not need new algorithm classes.
Naturally, there are differences in estimating the value or action value function, which will be discussed in the following sections.
Dynamic Programming
The DP approach uses the SMDP equations in order to calculate the value or action value function in the iterations. Therefore, we also need to represent the probabilities of the durations of a transition. For Semi-MDPs, since we have to store the probabilities P(s', k | s, o) = P(s' | s, o) · P(k | s, o, s'), we also have to store the duration probabilities. This is accomplished in the same way as for single-step actions (see class CAbstractFeatureStochasticModel); in fact it is done by the same class. We simply use another type of transition object, which is used for every multi-step action automatically. In addition to the transition probabilities, these objects store the relative probabilities of the durations of that transition. An individual map for the durations is therefore maintained in the transition object, which can be retrieved by an extra function call of the transition object (getDuration). The probability of the transition is retrieved in the same way as for normal transitions (in order to allow the use of algorithms which cannot deal with Semi-MDPs).
We also have to extend the stochastic model learner classes CFeatureEstimatedModel and CDiscreteEstimatedModel. For SMDP transitions, the probabilities of the transitions are updated in the same way as for single-step actions: the number of visits of the state-action pair < s, a > is used to calculate the frequency with which the transition < s, a, s' > has occurred in the past. Carrying out the update of the relative probabilities P(k | s_t, a_t, s_{t+1}) of SMDP transitions is similar to updating the probability of a transition. Before incrementing the visit counter N(s_t, a_t, s_{t+1}) of the occurred transition, we multiply the relative probabilities of the durations by this visit counter. Then the counter of the occurred duration k_t is incremented; finally we divide again by the new visit counter of < s_t, a_t, s_{t+1} > to arrive at the probabilities.
Since the SMDP transition objects and the SMDP update rule are used automatically for multi-step actions, we can use the DP and the Prioritized Sweeping classes for SMDP learning. The reward functions used for SMDP learning obviously also have to consider the options and return r(s, o). The Toolbox does not contain a method to calculate this reward function or the transition probabilities P(s', k | s, o) automatically from the flat models of the reward and transition functions and the model of the option; all possible trajectories from s to s' in k steps would have to be calculated to obtain the mean reward and the transition probabilities. However, the transition probabilities and the reward for the options can be learned with the Toolbox.
Temporal Dierence Methods
As already discussed in the previous section, TD methods update their value functions with a one-step sample backup. For SMDP learning we have the following equations for the temporal difference:

td = r^o_s + γ^k V(s') − V(s)   (5.4)

for the V-Function update and

td = r^o_s + γ^k Q(s', o') − Q(s, o)   (5.5)

for the Q-Function update. Here r^o_s is a sample of the expected option reward r(s, o); it is given by r^o_s = Σ_{i=0}^{k−1} γ^i r_{t+i}. Thus the differences to the original algorithm are the need to calculate the discounted reward received during the execution of the option and the need to exponentiate the discount factor by the duration of the action.
For calculating the reward of an option, we use an individual reward function class. Then the only difference for the TD-learning classes is that we exponentiate the discount factor by the duration if a multi-step action is used. The SMDP reward function class stores all rewards coming from the flat reward function during an episode. To calculate the reward of the step < s_t, o_t, s_{t+d} >, the reward function retrieves the duration d of option o_t and then sums up the last d flat reward values.
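The only changes compared to the one-step temporal difference are thus the discounted sum over the option's rewards and the exponentiated discount factor. A small illustrative sketch of the Q-Function case, with the flat rewards collected during the option passed in as a vector:

#include <vector>
#include <cmath>

// SMDP temporal difference for one completed option execution (equation 5.5), illustrative.
// rewards contains the flat rewards r_t, ..., r_{t+k-1} collected while the option ran.
double smdpTemporalDifference(double qCurrent, double qNext,
                              const std::vector<double> &rewards, double gamma) {
    double optionReward = 0.0;
    double discount = 1.0;
    for (double r : rewards) {             // r^o_s = sum_i gamma^i * r_{t+i}
        optionReward += discount * r;
        discount *= gamma;
    }
    int k = static_cast<int>(rewards.size());
    return optionReward + std::pow(gamma, k) * qNext - qCurrent;
}

// Usage: q[s][o] += alpha * smdpTemporalDifference(q[s][o], q[sNext][oNext], rewards, gamma);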
Figure 5.2: Temporal Difference Learning with options.
5.2 Hierarchical Reinforcement Learning Approaches
We have discussed SMDPs and options as the theoretical framework of all hierarchical methods. Now we take a look at three different hierarchical structures that have been proposed in the literature. We will discuss the option approach in more detail [40], then cover the Hierarchy of Abstract Machines (HAM, [36]) and the MAX-Q value decomposition approach [16]. Each of these approaches is discussed only theoretically, because they are not directly part of the Toolbox; however, they inspired the design of the hierarchic structure implemented in the Toolbox. While the Toolbox only supports the general options approach (but with deeper hierarchy levels, since options can choose from other options), the other frameworks can be described using this general options framework, so the Toolbox can be extended to support the discussed approaches easily.
5.2.1 The Option framework
We have already defined the options framework, which is explained in more detail in [40]. The options framework can be specialized to Markov options. Markov options can be initiated in every state where they are active (so it suffices to define them by their initiation set I). TD learning methods apply only one update for the tuple < s, o, s' > after the execution of an option. But for Markov options we can also update the states which were visited during the execution of the option. And we can further update the option values of other options which are not active, if the current state is in the initiation set of such an option and its policy could have selected the action a_t in this state. This is the motivation for the intra-option learning method.
The intra-option Q-Learning algorithm works as follows: for each step < s_t, a_t, s_{t+1} >, the algorithm updates the option values of all options that could have executed the action a_t (so π_o(s_t, a_t) > 0):

Q(s_t, o) ←_α r_{t+1} + γ U(s_{t+1}, o), for all such o   (5.6)

U(s, o) = (1 − β(s)) · Q(s, o) + β(s) · max_{o' ∈ O_s} Q(s, o')   (5.7)

U(s, o) is the value of state s given that option o was executed before the arrival in state s. This value is either the value of the option in state s if the option does not terminate (with probability 1 − β(s)), or, if a new option is chosen (with probability β(s)), the maximum option value (or any other option value, depending on the estimation policy) of all options available in state s. If all the options in O are deterministic and Markov, the algorithm will converge to Q*.
This approach can only be applied to Markov options, so we cannot use it for a more sophisticated hierarchic architecture. The primary motivation for the option framework is to allow the addition of temporally extended behaviors to the action set without precluding the choice of primitive actions. The resulting task might then be easier to learn because the goal is attainable in fewer decision steps.
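A minimal tabular sketch of the intra-option update of equations 5.6 and 5.7. The option descriptions (policy probabilities π_o(s, a) and termination probabilities β_o(s)) are passed in as callbacks, which is an illustrative simplification:

#include <vector>
#include <functional>
#include <algorithm>

// Intra-option Q-Learning update for one primitive step (equations 5.6 and 5.7), illustrative.
class IntraOptionQLearner {
public:
    // policyProb(o, s, a): pi_o(s, a); termProb(o, s): beta_o(s).
    IntraOptionQLearner(int numStates, int numOptions, double alpha, double gamma,
                        std::function<double(int,int,int)> policyProb,
                        std::function<double(int,int)> termProb)
        : q(numStates, std::vector<double>(numOptions, 0.0)),
          alpha(alpha), gamma(gamma), pi(policyProb), beta(termProb) {}

    void step(int s, int a, double r, int sNext) {
        for (int o = 0; o < (int)q[s].size(); ++o) {
            if (pi(o, s, a) <= 0.0) continue;                  // option could not have chosen a
            double maxNext = *std::max_element(q[sNext].begin(), q[sNext].end());
            double u = (1.0 - beta(o, sNext)) * q[sNext][o]    // option continues
                     + beta(o, sNext) * maxNext;               // or a new option is chosen
            q[s][o] += alpha * (r + gamma * u - q[s][o]);
        }
    }

private:
    std::vector<std::vector<double>> q;
    double alpha, gamma;
    std::function<double(int,int,int)> pi;
    std::function<double(int,int)> beta;
};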
5.2.2 Hierarchy of Abstract Machines
Parr [36] developed a hierarchic structuring approach called the Hierarchy of Abstract Machines (HAM). This approach attempts to simplify a complex MDP by restricting the number of realizable policies rather than by adding more action choices. The higher hierarchy level supervises the behaviors and intervenes when the current state of its subordinate behavior enters a boundary state; then a new low-level behavior is selected. This is very similar to hybrid control systems, except that the low-level behavior can be formulated as an MDP and the high-level process as an SMDP.
The idea of HAM is that the low-level policies of the MDP work as programs, based on their individual state and the state of the MDP. So every behavior H_i is a finite state machine. Each H_i has four types of states: action, call, choice and stop. An action state determines an action to be executed by the MDP; the action selection is based on the current state of the finite state machine H_i and of the MDP, so the behavior H_i defines a policy π(m^i_t, s_t). A call state interrupts the execution of H_i and executes another finite state machine H_j until H_j has finished; then the execution of H_i is continued. A choice state selects a new internal state of H_i, and a stop state obviously stops the execution of H_i. Parr defines a HAM as the initial states of all machines together with all states reachable from these initial states.
In figure 5.3 we can see the structure of a simple HAM for a navigation task in a gridworld. Each time
an obstacle is encountered, a choice state is entered. The agent can choose whether to back away from the
obstacle or to try to avoid the obstacle by following the wall.
Figure 5.3: State transition structure of a simple HAM, taken from Parr [36]
The composition H ∘ M of a HAM H and an MDP M defines an SMDP. The actions are the choices allowed in the choice states; these actions can only change the internal state of the HAM. This framework defines an SMDP because after a choice state, the HAM runs independently until the next choice state occurs. For action selection, only those states of the MDP and internal states of the HAM which are possible in the choice states must be considered, so the complexity of the problem is reduced considerably. HAM drastically reduces the possible set of policies, depending on the designer's prior knowledge of efficient ways to control the MDP M.
We can use the standard SMDP learning rules for the choice states of the HAM. The state of the SMDP is defined as [s_c, m_c], where s_c is the state of the MDP and m_c is the state of the HAM. So for two successive choice states we get the following Q-Learning update:
Q([s_c, m_c], a_c) ←_α r_t + γ r_{t+1} + ... + γ^{τ−1} r_{t+τ−1} + γ^τ max_{a'_c} Q([s'_c, m'_c], a'_c)   (5.8)
Parr and Russell [2] illustrate the advantages of HAMs for simulated robot navigation tasks, but no larger
scale problem is known to have been solved with HAM.
5.2.3 MAX-Q Learning
MAX-Q learning was proposed by Dietterich [16]. MAX-Q tries to decompose the value function hierarchically. It builds a hierarchy of SMDPs, where each SMDP is learned simultaneously. MAX-Q begins with a decomposition of the MDP M into n subtasks < M_0, M_1, ..., M_n >. In addition, a hierarchic tree structure is defined, so we have a root subtask M_0 (which means that solving M_0 solves M). Each subtask can have the policy of another subtask in its action set (so the actions are the branches of the tree).
The hierarchic structure is visualized in a task graph, as shown in figure 5.5(b) for the taxi task, which is used as a benchmark by Dietterich. The problem is to pick up a passenger at a specific location and bring him to a particular destination in a grid-world. In the initial setting there are four different destinations and locations (R, G, Y, B), as illustrated in figure 5.4(a). This is a shortest-time problem, so the agent gets a negative reward for each step.
Figure 5.4: Illustration of the taxi task, taken from Dietterich [16]
We have six primitive actions: one for each direction, one for picking up the passenger and one for dropping him off again. This problem can be divided into two primary subtasks: getting the passenger from his initial location and putting him at his final destination. As we can see, these subtasks in turn consist of subtasks. The get subtask can either navigate to one of the possible locations or pick up the passenger at the current location. The put subtask can drop the passenger off or use the same navigation subtasks. A navigation subtask is parameterized with the location; an individual subtask exists for each parameter value. Clearly, the subtasks can be used by several other subtasks. Because the choice of actions (subtasks) is always made by the policy, the order of a subtask's child nodes is unimportant.
Figure 5.5: MAX-Q task decomposition for the Taxi task, taken from Dietterich [16]
A subtask M_i consists of a policy π_i which can select other subtasks (including primitive actions), a set of active states where the subtask can be executed, a set of termination states and a pseudo-reward function. The pseudo-reward function is only used to learn a specific subtask and does not affect the hierarchic solution. Note that a subtask has the same definition as an option, with an additional pseudo-reward function.
Each subtask defines its own SMDP by its state set S_i and its action set A_i, which consists of the subtask's children. The transition probabilities P(s', k | s, a) are also well defined by the policies of the subtasks.
The main feature of MAX-Q is that we can now decompose the value function of the MDP with the help of the hierarchic structure. In each subtask M_i we can learn an individual value function V^π(i, s), defined through our SMDP learning rules. The reward gained during the execution of a subtask is estimated by its value function, so we can express the reward values of an option with the learned value function:

V^π(i, s) = V^π(π_i(s), s) + Σ_{s', k} P(s', k | s, π_i(s)) γ^k V^π(i, s')   (5.9)

The definition of the Q-Function can also be extended to the subtask approach:

Q^π(i, s, a) = V^π(a, s) + Σ_{s', k} P(s', k | s, a) γ^k Q^π(i, s', π_i(s'))   (5.10)

The second term is called the completion function C; it represents the expected reward the agent will get after having executed subtask a. Thus C is defined as

C^π(i, s, a) = Σ_{s', k} P(s', k | s, a) γ^k Q^π(i, s', π_i(s'))   (5.11)

So we can write for Q:

Q^π(i, s, a) = V^π(a, s) + C^π(i, s, a)   (5.12)

This decomposition can also be done for subtask a, yielding, for all active subtasks from the root task M_0 down to the primitive action M_k, the following form of the value function of subtask M_0 (and thus of the MDP):

V^π(0, s) = V^π(a_k, s) + C(a_{k−1}, s, a_k) + C(a_{k−2}, s, a_{k−1}) + ... + C(0, s, a_1)   (5.13)

The value of a primitive action's subtask (remember that primitive actions are themselves considered as subtasks) is defined as the expectation of the reward in state s when executing primitive action a_k:

V^π(a_k, s) = Σ_{s'} P(s' | s, a_k) · r(s, a_k, s')   (5.14)
This decomposition is the basis of the learning algorithm; the C-Function can be learned with temporal difference methods. If a pseudo-reward function is used to guide the subtask to specified subgoals, we have to learn two C-Functions: an internal C-Function for the policy of the subtask and an external C-Function for the value decomposition.
In the described algorithm, the policies of each subtask converge to an optimal policy individually. Therefore, the hierarchical policy consisting of < π_0, π_1, ..., π_n > can only converge to the recursively optimal hierarchical policy. Recursively optimal policies do not take the context of the subtask into consideration (i.e. which subtasks are active at the higher hierarchy levels). For example, the optimal solution for navigating to a destination could also depend on what we intend to do after arrival. To learn the hierarchically optimal policy, we would have to include the context of the subtask in the state space; but by doing this, we would lose the ability to reuse subtasks as descendants of several other subtasks.
The MAX-Q algorithm has shown good results for the taxi problem, as Dietterich illustrates using different settings; yet the problem seems too simple to estimate the benefits of MAX-Q well. The MAX-Q approach was also used successfully in the multi-agent control of an automated guided vehicle scheduling task [27]. It outperformed the commonly used heuristics and thereby became one of very few successful multi-agent learning examples. The taxi domain is also part of the Toolbox; even though MAX-Q and HAMs are not implemented, this example task can be used to experiment with the hierarchic structure of the Toolbox.
The preceding paragraphs were only a brief overview of the hierarchic algorithms, intended to demonstrate how concepts from programming languages can fit into the RL framework. These approaches also influenced the design of the hierarchic structure in the Toolbox.
5.2.4 Hierarchical RL used in optimal control tasks
Only a few researchers have tried hierarchical RL for optimal control tasks. Most of the algorithms and frameworks were only tested on discrete domains like the taxi domain from Dietterich [16] or the simulated robot navigation task in a grid world used for the HAM algorithms. To our knowledge, none of the introduced architectures has been used for continuous optimal control tasks, but there are a few other approaches which use other, usually simplified, hierarchic architectures.
Using Predened Subgoals
Morimoto and Doya [30] used a subgoal approach for the robot stand-up problem (see 1.3). The basic idea is to divide a highly non-linear task into several local, high-dimensional tasks and a global task with lower dimensionality. The subgoals are predefined by target areas in the (continuous) state space. Each subgoal has its own reward function, which depends on the distance to the target area of the subgoal. Each subgoal is learned independently with an individual value function (in the case of Morimoto, an Actor-Critic algorithm was used). The individual reward functions of the subgoals simplify the global problem drastically. If a subgoal reaches its target area, the subgoal has finished and a new subgoal is selected by the upper-level controller. At the upper level, the sequence of subgoals can be fixed or, alternatively, learned by a Q-Learning algorithm in a reduced state space.
Figure 5.6: Subgoals defined for the robot stand-up task, taken from Morimoto [29]
Morimoto specified three different subgoals, as illustrated in figure 5.6, each of which defines a different posture of the robot (so the velocities of the joint angles do not matter in reaching a subgoal). With the help of these subgoals, the learning time and the required number of RBF centers drop drastically.
RL with the via-point representation
The goal of this approach, also used by Morimoto [31], is to learn via-points with an Actor-Critic architecture. A via-point defines a certain point in the state space which the trajectory should reach at a certain time. The actor tries to learn good via-points for the trajectory and the times t_n at which these via-points should be reached. This information is then used by a local trajectory planning controller to create the control vector. As a result, the algorithm can choose its own time scale for executing an action. When the algorithm was used for the cart-pole swing-up problem, it outperformed the flat architecture.
Hierarchic Task Composition
The only approach that employs a complex hierarchical framework is the method used by Vollbrecht [53] for the TBU (Truck-Backer-Upper) example. There are three different categories of subtasks: avoidance tasks, goal-seeking tasks and state-maintaining tasks. These groups of tasks may interact via the veto principle, the subtask principle and the perturbation principle. The tasks are learned individually in a bottom-up manner, so that all subtasks which are needed by a higher subtask T_H are learned in individual learning trials before the subtask T_H can be used.
In the veto principle, an avoidance task T_1 may veto an action selected by task T_2. The first task learns in isolation which actions lead, for example, to a collision. When the second task is learned, the agent may only take those actions which are not predicted by the first subtask to lead to a collision.
In the subtask principle, a task T can choose from several subtasks T_i. If T_i has been chosen by T, one action is executed by T_i; then a new subtask is selected by T, no matter whether T_i has finished. It is assumed that the task T learns the composition of all goals of its subtasks T_i, which is only possible if certain conditions are met for the goals of the subtasks T_i.
For the perturbation principle, we have two hierarchically related subtasks T_H and T_L, where T_H is at the higher hierarchical level. If T_H perturbs the goal state of the lower-level task T_L by executing action a, the lower task T_L is activated until it has reached its goal state once more; then the control returns to T_H again. The advantage of this approach is that the state space of the high-level subtask can be reduced to the goal area of the low-level subtask, since if this area is left, T_L immediately interrupts the execution of T_H. It is also clear that this only works for certain kinds of subtasks; for example, it is possible that the subtask T_L always undoes the action of the high-level task T_H in order to restore the goal state, which would not be the desired effect.
The approach worked well for the TBU task, probably due to the specific hierarchic nature of that task. Intuitively it is hard to scale this approach to a general framework, and no other usage of this approach has been reported. Although this approach is totally different from the option approach, it illustrates the possibilities of interaction between the different hierarchic levels very well.
5.3 The Implementation of the Hierarchical Structure in the Toolbox
The hierarchical structure in the Toolbox is a mixture of the option framework and the MAX-Q framework, but the HAM approach can also be implemented easily. As has already been mentioned, the standard SMDP learning rules are implemented, but there are no algorithms for intra-option learning or for the MAX-Q value decomposition.
5.3.1 Extended actions
In the Toolbox, we use extended actions (CExtendedAction) as options. An extended action is a multi-step action, so it is a temporally extended action; it stores the duration and the finished flag and has its individual termination condition. Additionally, the extended action defines its own policy interface, so it has to return another action for a given state via the function getNextHierarchyLevel. This action can in turn be either an extended action or a primitive action; consequently, more than one extended action can be active at a time. The design is modeled on the hierarchy levels of the task graph explained for MAX-Q learning. Extended actions cannot be executed directly by the agent, as the agent can only execute primitive actions. We therefore need a hierarchic controller which manages the appropriate execution of the active extended actions.
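The essential interface of an extended action can be summarized in a few lines. The sketch below uses simplified, hypothetical signatures (an integer state instead of a state object); the Toolbox's CExtendedAction carries additional bookkeeping.

// Illustrative skeleton of a temporally extended action (option), simplified from the text.
class Action {
public:
    virtual ~Action() {}
};

class ExtendedAction : public Action {
public:
    int duration = 0;        // number of primitive steps executed so far
    bool finished = false;   // set when the termination condition fires

    // Termination condition of the option; for Markov options it depends on the state only.
    virtual bool isFinished(int currentState) const = 0;

    // Policy of the option: returns the action of the next (lower) hierarchy level,
    // which is either another extended action or a primitive action.
    virtual Action *getNextHierarchyLevel(int currentState) = 0;
};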
5.3.2 The Hierarchical Controller
The hierarchical controller (CHierarchicalController) has three functionalities: it manages the execution of the active extended actions, builds a hierarchic stack, and sends this hierarchic stack to the specified hierarchic stack listeners. The hierarchic stack is the list of all actions that are currently active, so it begins with the root action a_0 and ends with a primitive action a_k. The hierarchic controller also has to be used as the controller for the agent, because it returns the primitive action which was returned by the last extended action.
The hierarchic controller contains a reference to the root action of the hierarchy. From this root action, it builds the hierarchic stack by recursively calling the getNextHierarchyLevel function of the extended actions until a primitive action is reached. All these actions are stored in the hierarchic stack, beginning with the root action and ending with the primitive action. This primitive action is then returned to the agent via the standard controller interface. The agent executes this action and sends the step information < s_t, a_t, s_{t+1} > to the listeners as usual. Note that through this approach, the agent's listeners are always informed about the flat policy, so they do not get any information about the extended actions which have been used.
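The stack-building recursion can be sketched as follows: starting from the root action, the controller descends through getNextHierarchyLevel calls until a primitive action is reached and records every visited action. The types mirror the extended action sketch from section 5.3.1 and are again hypothetical simplifications.

#include <vector>

// Minimal, self-contained stand-ins for the action hierarchy.
class Action { public: virtual ~Action() {} };
class ExtendedAction : public Action {
public:
    virtual Action *getNextHierarchyLevel(int currentState) = 0;
};

// Builds the hierarchic stack: root action first, primitive action last (illustrative).
std::vector<Action*> buildHierarchicStack(ExtendedAction *root, int currentState) {
    std::vector<Action*> stack;
    Action *current = root;
    stack.push_back(current);
    // Descend until the returned action is no longer an extended action.
    while (ExtendedAction *ext = dynamic_cast<ExtendedAction*>(current)) {
        current = ext->getNextHierarchyLevel(currentState);
        stack.push_back(current);
    }
    return stack;   // stack.back() is the primitive action handed to the agent
}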
Figure 5.7: The hierarchic controller architecture of the Toolbox as the main part of the hierarchic framework.
The hierarchic controller also implements the agent listener interface, so it receives the step information too. With the step information it can update the hierarchic stack. The duration of the primitive action is added to each extended action in the action stack (note that even primitive actions can have different durations, which are generally fixed or set by the environment model). It also calls the isFinished method of each extended action and updates the isFinished flag of the action with this information. If an extended action ends in the current state, the finished flags of all other extended actions following it on the stack (i.e. actions at a lower hierarchy level) are set as well. After updating the hierarchic stack h, the hierarchic step information < s_t, h, s_{t+1} > is sent to all hierarchic stack listeners. Hierarchic stack listeners define almost the same interface as common listeners, but they get the hierarchic stack instead of an action object as the argument of the nextStep interface.
After sending the hierarchical step information, the primitive action and all terminated extended actions are deleted from the stack. At the next call of the getNextAction interface, the stack is filled again with the action objects returned by the getNextHierarchyLevel calls. We also provide a parameter for setting the maximum number of hierarchic execution steps: each extended action (except the root action) is terminated if it has been executed longer than this maximum number of steps. This parameter can, for example, be used to slowly converge back to a flat policy once the hierarchy levels have already been learned (this was also tried by Dietterich for his MAX-Q approach).
5.3.3 Hierarchic SMDPs
Now we know how the hierarchic controller builds the hierarchic stack and therefore how the hierarchic
policy is executed, but we also want to learn in the dierent hierarchy levels. For learning in dierent
hierarchy levels we have to send the step information < s
t
, o
t
, s
t+k
>; we also want to oer almost the same
functionalities as for the agent, like using more than one listener or setting an arbitrary controller as policy
for the SMDP.
A hierarchical SMDP (CHierarchicalSemiMarkovDecisionProcess) has a set of options O, an individual
policy , which can choose from these options, and a termination condition . Also it obviously must be
a member of the hierarchy structure, so it is a subclass of the extended action class. Additionally, the
hierarchic SMDP class is a subclass of the standard semi MDP class (CSemiMarkovDecisionProcess). This
class is the super class of the agent and supports the agents primary functionality for managing the listeners
list, like sending the step information to all specied listeners. The semi MDP class already maintains a
controller object and is itself a deterministic controller, storing the current action to be executed. Hence we
do not have to calculate the action in the current state twice. The specied controller of a hierarchic SMDP
is used for the getNextHierarchyLevel function of the extended action interface, meaning that it implements
the policy of the extended action.
In addition, the hierarchical SMDP implements the hierarchical step listener interface, so it retrieves the hierarchic stack from the hierarchic controller at each step. Once the hierarchic SMDP has been executed, it becomes a member of the stack. If that is the case, it searches for the next action on the stack (i.e. the action which the policy of the SMDP has selected). If this action has been completed in the current step, the step information <s_t, o_t, s_{t+k}> can be sent to all listeners. We get the hierarchical step information from the hierarchic controller, but this information contains only the states <s_{t+k-1}, h, s_{t+k}>. In order to find the correct initial state s_t of the executed option, the hierarchic SMDP class contains an extra state collection object which always stores the state s_t at the beginning of an option.
For this approach, all hierarchical SMDPs have to be added to the hierarchical step listener list of the hierarchic controller. See also figure 5.8 for an illustration of the class system.
5.3.4 Intermediate Steps
For Markov options we can also use the intermediate states (as done in intra-option learning) for the value updates. A Markov option could have been started in any of the intermediate states that occurred during the execution of the option. Hence, we can create additional step information <s_{t+k}, o_t, s_{t+τ}> for every k in {1, ..., τ-1}, τ being the duration of the option, and send it to the listeners. To retrieve this intermediate step information, the hierarchic SMDP needs access to an episode object, which has to be specified in the constructor if the intermediate steps are used.
Figure 5.8: The hierarchic semi-MDP is used for learning in different hierarchy levels.
This step information must be treated differently because these steps do not correspond to an ascending temporal sequence. The steps sent by the standard agent listener interface are supposed to be temporally sequenced as long as no new episode has started. For this reason we use a separate interface function called intermediateStep, which is added to the agent listener interface but does not have to be implemented. The intermediate steps are sent after the standard step information; for every intermediate step, the duration of the option o_t is also set correctly. For extended actions, we can choose whether we want to send the intermediate step information to the listeners or not. We implement an individual intermediate step treatment for the TD learning algorithms, which all other algorithms ignore.
Figure 5.9: Intermediate Steps of an option can also be used for learning updates when using Markov
options.
TD-learning with intermediate steps
TD learning with intermediate steps has the same effect as intra-option learning. The problem with intermediate steps is their missing temporally ascending sequence, so we cannot add them to the e-traces in the usual way. But we can carry out the standard TD-learning update without e-traces (these updates work for any step information, they do not require any temporal sequence), and we can also add the intermediate states to the e-trace list, because these states are also predecessor states of s_{t+τ}. This has to be done after the normal TD(λ) update of the standard step information <s_t, o_t, s_{t+τ}>. The intermediate step update is shown in Algorithm 5.
Algorithm 5 TD learning with intermediate steps
    for each new step <s_t, o_t, s_{t+τ}> do
        etraces->updateETraces()
        etraces->addETrace(s_t, o_t)
        td = Σ_{i=0}^{τ-1} γ^i · r_{t+i} + γ^τ · Q(s_{t+τ}, o') − Q(s_t, o_t)
        Q(s, o) ← Q(s, o) + α · td · e(s, o)
        for k = 1 to τ − 1 do
            td = Σ_{i=k}^{τ-1} γ^{i-k} · r_{t+i} + γ^{τ-k} · Q(s_{t+τ}, o') − Q(s_{t+k}, o_t)
            Q(s_{t+k}, o_t) ← Q(s_{t+k}, o_t) + α · td
            etraces->addETrace(s_{t+k}, o_t)
        end for
    end for

This is done for all TD Q-Function and V-Function learning algorithms.
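To make the update concrete, the following C++ sketch applies Algorithm 5 to a plain tabular Q-Function. This is a minimal illustration, assuming a (γλ)^τ trace decay for a τ-step option and a greedy bootstrap value standing in for Q(s_{t+τ}, o'); the types (Table, SMDPStep) are our simplifications, not the actual Toolbox classes.

    #include <map>
    #include <vector>
    #include <utility>
    #include <cmath>

    typedef std::pair<int, int> StateOption;          // (state index, option index)
    typedef std::map<StateOption, double> Table;      // used for Q-values and e-traces

    struct SMDPStep {                                 // one step <s_t, o_t, s_{t+tau}>
        std::vector<int> states;                      // s_t, s_{t+1}, ..., s_{t+tau}
        std::vector<double> rewards;                  // r_t, ..., r_{t+tau-1}
        int option;                                   // o_t
    };

    // Greedy bootstrap value max_o Q(s, o), standing in for Q(s_{t+tau}, o').
    double maxQ(const Table &Q, int state, int numOptions) {
        double best = 0.0;
        for (int o = 0; o < numOptions; ++o) {
            Table::const_iterator it = Q.find(std::make_pair(state, o));
            double q = (it != Q.end()) ? it->second : 0.0;
            if (o == 0 || q > best) best = q;
        }
        return best;
    }

    // Algorithm 5: TD update of the full option step plus all intermediate starts.
    void tdUpdateWithIntermediateSteps(Table &Q, Table &e, const SMDPStep &step,
                                       int numOptions, double alpha, double gamma,
                                       double lambda) {
        int tau = (int)step.rewards.size();
        int sEnd = step.states[tau];

        // decay all existing e-traces over the tau-step option, then add the new trace
        double traceDecay = std::pow(gamma * lambda, (double)tau);
        for (Table::iterator it = e.begin(); it != e.end(); ++it)
            it->second *= traceDecay;
        e[std::make_pair(step.states[0], step.option)] += 1.0;

        // temporal difference of the standard step <s_t, o_t, s_{t+tau}>
        double ret = 0.0, disc = 1.0;
        for (int i = 0; i < tau; ++i) { ret += disc * step.rewards[i]; disc *= gamma; }
        double td = ret + disc * maxQ(Q, sEnd, numOptions)
                        - Q[std::make_pair(step.states[0], step.option)];

        // e-trace update of all stored state-option pairs
        for (Table::iterator it = e.begin(); it != e.end(); ++it)
            Q[it->first] += alpha * td * it->second;

        // intermediate starting points s_{t+k}, k = 1 ... tau-1 (plain update, no traces)
        for (int k = 1; k < tau; ++k) {
            double retK = 0.0, discK = 1.0;
            for (int i = k; i < tau; ++i) { retK += discK * step.rewards[i]; discK *= gamma; }
            StateOption key = std::make_pair(step.states[k], step.option);
            double tdK = retK + discK * maxQ(Q, sEnd, numOptions) - Q[key];
            Q[key] += alpha * tdK;
            e[key] += 1.0;   // the intermediate state is also a predecessor of s_{t+tau}
        }
    }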
5.3.5 Implementation of the Hierarchic architectures
Now we want to take a closer look at the option framework and the MAX-Q hierarchic structure and how these can be implemented in the Toolbox.
The option framework
The option framework consists of one hierarchy level, namely an SMDP which can choose from different options. Nevertheless, we need the hierarchic controller in order to use options. The hierarchic controller executes the options and returns the primitive actions to the agent. The options have to be implemented by the user, and all options must be subclasses of the extended action class. For learning, we have to add a hierarchic SMDP as hierarchic stack listener to the controller. This hierarchic SMDP is also the root element of the hierarchic structure. To this hierarchic SMDP we can add our learning algorithms and specify agent controllers in the usual way. If we add listeners to the hierarchic SMDP, they will be informed about the option steps and, if required, about the intermediate steps. If we add listeners to the agent, the listeners will be informed about the flat step information. So we can learn the option values and the values of primitive actions simultaneously. For example, we can add one TD learner to the hierarchic SMDP and another to the agent; both TD learners can use the same Q-Function, which contains action values for the options and the primitive actions. The realization of this approach is also illustrated in figure 5.10.
The MAX-Q framework
In the MAX-Q framework we have a hierarchic task graph; for each subtask a policy can be learned. We only discuss the creation of the hierarchical structure of MAX-Q learning in the Toolbox; the MAX-Q value decomposition algorithm is not implemented.
Figure 5.10: Realization of the option framework with the hierarchic architecture of the Toolbox.
A subtask has almost the same structure as options do. Again, we represent the subtask as a hierarchic SMDP with an individual controller as policy and an arbitrary number of listeners for learning. The difference to the standard option approach is that the actions selected by a subtask's policy can in turn be hierarchic SMDPs. So the policies can choose from either hierarchic SMDPs or primitive actions (but intermixing with self-coded options is also possible). The only functionality that is missing is the termination condition of the subtasks; this has to be implemented by the user. The standard hierarchic SMDP termination condition is always false. Thus an individual class is needed for each subtask to implement the termination condition; this class can simultaneously be used for the individual reward functions.
Figure 5.11: Realization of the MAX-Q framework with the hierarchic architecture of the Toolbox.
Chapter 6
Function Approximators for Reinforcement
Learning
In this chapter, we discuss several function approximation schemes that have been used successfully with RL. In the context of RL, function approximation is needed to approximate the value function or the policy for continuous or large discrete state spaces. We begin the chapter by briefly covering gradient descent algorithms for function approximation. Then we take a look at the function representations available in the Toolbox, including tables, linear feature representations, adaptive Gaussian soft-max basis function networks, feed forward neural networks and Gaussian sigmoidal networks. Then we explain how these FAs can be used to approximate either the V-Function, the Q-Function or the policy directly. All the function approximation schemes are implemented independently of the RL algorithms, so we can use almost any approximator for a given algorithm.
6.1 Gradient Descent
Gradient descent is a general mathematical optimization approach. Given a smooth scalar function f(w), we want to find the weight vector w* which minimizes the function f. One approach for this is gradient descent. In general, the vector w is initialized randomly; then the weights are updated according to the negative gradient:

Δw_t = −η_t · df(w)/dw   (6.1)

This algorithm is proven to converge to a local minimum if the learning rate η_t satisfies the following properties:

Σ_{t=0}^{∞} η_t = ∞   (6.2)

Σ_{t=0}^{∞} η_t² < ∞   (6.3)
For regression and other supervised learning problems we usually want to approximate a function g of which just n input-output values are known. So, for every input vector x_i we know the output value g(x_i). Our goal is to find a parameterized function g̃(x; w) which approximates g as well as possible (at least at the given input points x_i).
In this case the function which we want to minimize is the quadratic error function

f(w) = E(w) = ½ · Σ_{i=1}^{n} (g̃(x_i; w) − g(x_i))²   (6.4)
There are two different gradient descent update approaches:

Simple Gradient Descent (Batch Learning): This approach calculates the gradient over all input points for one weight update; as a result we have the following update rule:

Δw = −η_t · Σ_{i=1}^{n} (g̃(x_i; w) − g(x_i)) · dg̃(x_i; w)/dw   (6.5)

The conditions of convergence are well understood for batch learning; another advantage is that several acceleration techniques like the Conjugate Gradient, the Levenberg-Marquardt or the QuickProp algorithm can only be applied to batch learning updates.
Incremental Gradient Descent: In this case the weight update is already done after the gradient calculation of a single input point:

Δw = −η_t · (g̃(x_i; w) − g(x_i)) · dg̃(x_i; w)/dw,   for a random input x_i   (6.6)
Simple gradient descent corresponds to epoch-wise learning, incremental gradient descent to incremental learning. Incremental learning is usually faster than batch learning. This is a consequence of the fact that there usually exist several similar input-output patterns; in this case batch learning wastes time computing and adding several similar gradients before performing the weight update. Because of the randomness of incremental learning, we are also more likely to avoid local minima; for that reason incremental learning often leads to better results.
For all these update schemes the convergence results have been studied extensively [12].
General RL algorithms like TD(λ) all use an incremental learning update, since the updates are done immediately after each step.
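As a small illustration of equation 6.6, the following sketch performs one incremental gradient descent update for a linear model g̃(x; w) = wᵀx; the function names are our own, not Toolbox interfaces.

    #include <vector>
    #include <cstddef>

    // Linear model g(x; w) = w^T x.
    double predict(const std::vector<double> &w, const std::vector<double> &x) {
        double y = 0.0;
        for (std::size_t i = 0; i < w.size(); ++i) y += w[i] * x[i];
        return y;
    }

    // One incremental gradient descent step (equation 6.6) on a single example (x, target):
    // w <- w - eta * (g(x; w) - target) * dg(x; w)/dw, where dg/dw = x for a linear model.
    void incrementalStep(std::vector<double> &w, const std::vector<double> &x,
                         double target, double eta) {
        double error = predict(w, x) - target;
        for (std::size_t i = 0; i < w.size(); ++i)
            w[i] -= eta * error * x[i];
    }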
6.1.1 Ecient Gradient Descent
One major aspect of gradient descent is how to choose the learning rate η. If η is too small, learning will be very slow; on the other hand, if η is too large, learning can diverge. In general the optimal η values differ for different weights, so applying different learning rates can be useful. If we assume a quadratic shape of the error function, the optimal learning rate depends on the second order derivative (curvature) with respect to the weights. Since we have different curvatures for different weights, most algorithms try to transform the weight space in order to have uniform curvatures in all directions. Most of the algorithms that deal with this problem are batch algorithms, so they cannot easily be used for reinforcement learning tasks. But there exists a method called the Vario-η algorithm, used by [15], which can be used for incremental learning.
The Vario-η Algorithm

The Vario-η algorithm measures the variance of each weight update during learning. This variance is then used to scale the individual learning rates:

v_{k+1}(i) = (1 − β) · v_k(i) + β · (Δw_k(i) / η_k(i))²   (6.7)

η_{k+1}(i) = η / (√(v_{k+1}(i)) + ε)   (6.8)

v_k(i) measures the variance of the updates of weight w_i (it is assumed that the variance is high in comparison to the mean, which has been empirically verified), β is the variance decay factor (usually small values like 0.01 are used) and ε is a small constant which prevents a division by zero. The initial learning rate η gets divided by the standard deviation of the weight updates; as a result the updates for all weights have the same variance η².
The Vario-η algorithm is not used during online learning in the Toolbox; instead it is used for observing the variance and calculating the learning rates for a given function approximation framework in advance. In this thesis we use the same results as Coulom [15] for FF-NNs.
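A minimal sketch of the Vario-η bookkeeping in equations 6.7 and 6.8, assuming one running variance estimate per weight; the class name and members are illustrative, not the Toolbox's CAdaptiveEtaCalculator API.

    #include <vector>
    #include <cmath>

    // Tracks the variance of each weight update and derives per-weight learning rates
    // (equations 6.7 and 6.8).
    class VarioEta {
    public:
        VarioEta(std::size_t numWeights, double eta, double beta, double epsilon)
            : eta_(eta), beta_(beta), eps_(epsilon),
              v_(numWeights, 1.0), rates_(numWeights, eta) {}

        // Call with the raw update deltaW(i) that was applied to weight i.
        void observeUpdate(std::size_t i, double deltaW) {
            double scaled = deltaW / rates_[i];                       // deltaW(i) / eta_k(i)
            v_[i] = (1.0 - beta_) * v_[i] + beta_ * scaled * scaled;  // eq. 6.7
            rates_[i] = eta_ / (std::sqrt(v_[i]) + eps_);             // eq. 6.8
        }

        double learningRate(std::size_t i) const { return rates_[i]; }

    private:
        double eta_, beta_, eps_;
        std::vector<double> v_;       // running variance estimate per weight
        std::vector<double> rates_;   // current per-weight learning rate eta_k(i)
    };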
6.2 The Gradient Calculation Model in the Toolbox
For the Toolbox it is desirable to have a general interface for calculating the gradient of a specific function representation. Function representations which allow calculating the gradient are obviously always parameterized functions. In our case we assume that the function has a fixed set of weights. We need interface functions for updating the weights given the gradient, and for calculating the gradient given an input vector. But first we have to take a look at how we represent a gradient vector.
6.2.1 Representing the Gradient
The properties of gradients can be very different: some are sparse (as for RBF networks), while for other function representations the gradient can be non-zero everywhere. We already defined an appropriate data structure to handle these demands efficiently in section 4.3.3 when discussing e-traces (see class CFeatureList). For our gradient representation we use the same, but unsorted, feature list. In this case the feature index represents the weight index and the feature factor represents the value of the gradient vector. For all weight indices which are not in the list, we again assume that the derivative with respect to that weight is zero.
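The following fragment sketches this sparse representation with a simplified stand-in for CFeatureList; the struct and helper below are illustrative only and do not reproduce the real class interface.

    #include <vector>

    // Simplified stand-in for a feature list entry: weight index plus gradient value.
    struct FeatureEntry {
        int index;       // weight index
        double factor;   // partial derivative with respect to that weight
    };

    typedef std::vector<FeatureEntry> SparseGradient;

    // Apply w <- w + eta * gradient, touching only the listed (non-zero) components.
    void applyGradient(std::vector<double> &weights, const SparseGradient &gradient,
                       double eta) {
        for (std::size_t i = 0; i < gradient.size(); ++i)
            weights[gradient[i].index] += eta * gradient[i].factor;
    }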
6.2.2 Updating the Weights
We introduce a separate interface for updating the weights, given the gradient vector as a feature list. This interface is called CGradientUpdateFunction. Gradient update functions additionally provide the functionality to get and set the weight vector directly by specifying the weight vector as a double array. A method to retrieve the number of weights is also provided.
The weight update is done by two different functions. The actual gradient update is done by the interface function updateWeights(CFeatureList *gradient), which has to be implemented by the subclasses. As discussed in section 6.1.1 about efficient gradient descent, it can be advantageous to apply different learning rates to different weights. This is done by extra objects called adaptive η-calculators. Whenever the weights of a gradient function have to be updated, the updateGradient function is called. First the gradient vector is passed to the η-calculator (if one has been defined), which applies the individual learning rates for each weight. Then the updateWeights function is called.
The gradient update function interface does not specify any input or output types of the function; it just provides the interface for updating an arbitrary parameterized function approximator with a gradient vector.
Applying different learning rates

In order to apply different learning rates to different weights, we introduce the abstract class CAdaptiveEtaCalculator. We can assign an adaptive η-calculator to each gradient update function. If an η-calculator has been specified for an update function, the calculator's interface function getWeightUpdates is called before performing the actual weight update. This function can then be used to apply different learning rates to the gradient vector. Two general implementations of η-calculators are already provided in the Toolbox:

Individual η-calculator: This class maintains its own array of learning rates (initialized with 1.0); we can set the learning rate for each weight individually.

Vario-η calculator: This implements the algorithm discussed in section 6.1.1 to calculate an optimal learning rate for each weight, considering the variances of the weight updates. In practice this algorithm is not used online for performance reasons, but its results are used to calculate static learning rates for specific function approximators.

The structure of a gradient update function and its interaction with the adaptive η-calculator class is also illustrated in figure 6.1.
Figure 6.1: Interface for updating the weights of a parameterized FA
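To illustrate how the pieces interact, here is a condensed sketch of the update flow: a base class that scales the incoming gradient with an optional η-calculator before handing it to updateWeights. The class and method names follow the text, but the signatures are simplified assumptions rather than the real Toolbox headers.

    #include <vector>

    struct FeatureEntry { int index; double factor; };
    typedef std::vector<FeatureEntry> SparseGradient;

    // Simplified eta-calculator interface: rescales each gradient component in place.
    class AdaptiveEtaCalculator {
    public:
        virtual ~AdaptiveEtaCalculator() {}
        virtual void getWeightUpdates(SparseGradient &gradient) = 0;
    };

    // Simplified gradient update function: subclasses implement updateWeights.
    class GradientUpdateFunction {
    public:
        explicit GradientUpdateFunction(AdaptiveEtaCalculator *etaCalc = 0)
            : etaCalc_(etaCalc) {}
        virtual ~GradientUpdateFunction() {}

        // Entry point used by the learning algorithms.
        void updateGradient(SparseGradient gradient, double eta) {
            for (std::size_t i = 0; i < gradient.size(); ++i)
                gradient[i].factor *= eta;                       // global learning rate
            if (etaCalc_) etaCalc_->getWeightUpdates(gradient);  // per-weight scaling
            updateWeights(gradient);                             // actual parameter change
        }

    protected:
        virtual void updateWeights(const SparseGradient &gradient) = 0;

    private:
        AdaptiveEtaCalculator *etaCalc_;
    };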
Delayed Weight Updates

In a few cases it is worthwhile to postpone the weight updates to a later time (e.g. it can be advantageous to apply the updates only when a new episode has started). Therefore we implemented the class CGradientDelayedUpdateFunction, which encapsulates another gradient update function. The updates to the encapsulated gradient function are stored until the method updateOriginalGradientFunction is called; then the stored updates are transferred to the original gradient function.
With the class CDelayedFunctionUpdater we can choose when we want to update the specified function approximator, i.e. how many episodes and/or steps have to elapse until the next update is performed.
6.2.3 Calculating the Gradient
For calculating the gradient we provide the interface class CGradientFunction, which is a subclass of CGradientUpdateFunction. In contrast to the gradient update function, we already define interfaces for calculating an m-dimensional output vector given an n-dimensional input and for calculating the gradient at a given input vector. Thus the input and output behavior is already fixed for this class; for the input and output vectors CMyVector objects are used. The gradient calculation interface additionally gets an error vector e as input; the returned gradient is calculated in the following way:

grad = Σ_{i=1}^{m} (df_i(x)/dw) · e_i   (6.9)

where i denotes the i-th output dimension. Hence we can specify which output dimensions we want to use for the gradient calculation by specifying an appropriate error vector e.
Figure 6.2: Interface for parameterized FAs which provide the gradient calculation with respect to the
weights.
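The error-weighted gradient of equation 6.9 can be illustrated with a tiny multi-output linear function f_i(x) = w_iᵀx; the code below is a self-contained example with our own names, not the CGradientFunction signature.

    #include <vector>

    // Multi-output linear function: output i is the dot product of row w[i] with x.
    // Returns the error-weighted gradient of equation 6.9,
    //   grad = sum_i e_i * d f_i(x) / dw,
    // flattened over the m x n weight matrix (row-major).
    std::vector<double> errorWeightedGradient(const std::vector<std::vector<double> > &w,
                                              const std::vector<double> &x,
                                              const std::vector<double> &e) {
        std::size_t m = w.size(), n = x.size();
        std::vector<double> grad(m * n, 0.0);
        for (std::size_t i = 0; i < m; ++i)          // output dimensions
            for (std::size_t j = 0; j < n; ++j)      // weights of output i
                grad[i * n + j] = e[i] * x[j];       // d f_i / d w_ij = x_j, scaled by e_i
        return grad;
    }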
6.2.4 Calculating the Gradient of V-Functions and Q-Functions
We still need to implement value and action value functions which support our gradient interfaces. The main difference to our general gradient calculation design is that we now need state and action data objects as input. Therefore we create two additional classes, CGradientVFunction and CGradientQFunction, which are both subclasses of CGradientUpdateFunction. Both classes have an additional interface function for calculating the gradient given either the current state or the state and the action object as input (in contrast to CGradientFunction, where the input is a CMyVector object).
Gradient V and Q-Functions are updated through the gradient update function interface. As a result we can already implement the updateValue methods, which use the gradient calculation and weight update functions.
Now we could implement gradient V-Functions and gradient Q-Functions independently, but we do not want to implement the same type of function approximator twice. For this reason we already introduced the CGradientFunction interface. With the help of this interface we design classes which encapsulate a gradient function object and implement the V-Function respectively Q-Function functionality. Therefore we create a separate class for V-Functions (CVFunctionFromGradientFunction) and one for Q-Functions (CQFunctionFromGradientFunction), which implement the gradient V-Function respectively gradient Q-Function interface and use a given gradient function object for all calculations. It is assumed that the given gradient function has the correct number of input and output dimensions; otherwise an error is thrown. The number of outputs is obviously always one; the number of inputs depends on the number of discrete and continuous state variables of the input state. With this approach we can create a V- or Q-Function just by specifying a gradient function object.
Figure 6.3: Value function class which uses a gradient function as function representation. The function calls are just passed to the gradient function object; for the getValue and getGradient functions of the V-Function object we first have to convert the state objects to vector objects in order to be able to use the gradient function's interface methods.
6.2.5 Calculating the gradient of stochastic Policies
Many policy search algorithms need to calculate the gradient of the likelihood dπ(s,a)/dθ or of the log-likelihood d log π(s,a)/dθ = (1/π(s,a)) · dπ(s,a)/dθ of a stochastic policy π, where θ is the parametrization of the policy. In the case where the policy depends on action values (CQStochasticPolicy) of a Q-Function (or on values reconstructed from a V-Function), the policy parametrization θ is equal to the weights w of the (action) value function. The gradient can in this case be expressed by

dπ(s, a_i)/dθ = dπ(s, a_i)/dw = dπ(s, a_i)/dQ(s, ·) · dQ(s, ·)/dw = Σ_{j=1}^{|A_s|} dπ(s, a_i)/dQ(s, a_j) · dQ(s, a_j)/dw   (6.10)
dQ(s, ·)/dw is the derivative of the Q-Function for all action values, hence it is an m × p matrix [dQ(s, a_1)/dw; dQ(s, a_2)/dw; ...], m being the number of actions and p the number of weights. dπ(s, a_i)/dQ(s, ·) is an m-dimensional row vector representing the derivatives of the action distribution with respect to the Q-Values; thus this distribution has to be differentiable. The only action distribution we discussed which fulfills this requirement is the soft-max distribution

π(s, a_i) = exp(β · Q(s, a_i)) / Σ_{j=1}^{|A_s|} exp(β · Q(s, a_j))

The gradient of the soft-max distribution with respect to Q(s, a_j) is given by

dπ(s, a_i)/dQ(s, a_j) = β · π(s, a_i) · (1 − π(s, a_j))   if a_i = a_j
dπ(s, a_i)/dQ(s, a_j) = −β · π(s, a_i) · π(s, a_j)        otherwise   (6.11)
We extend our design of the action distribution objects by an additional interface, which calculates the gradient vector dπ(s, a_i)/dQ(s, ·) = [dπ(s, a_i)/dQ(s, a_1), dπ(s, a_i)/dQ(s, a_2), ...]. This interface function is not obligatory, since it can only be implemented by the soft-max distribution class. It gets the action values [Q(s, a_1), Q(s, a_2), ...] as input and returns the gradient vector. Since not all action distributions support the gradient calculation, an additional boolean function indicates whether the gradient calculation is supported.
The stochastic policy class is also extended by three functions: one for calculating dπ(s,a)/dw, one for d log π(s,a)/dw and one for calculating the gradient dQ(s,a)/dw. The latter is used by the two former functions, in combination with the action distribution's gradient function, to calculate the demanded gradients of the likelihood resp. log-likelihood. The gradient dπ(s,a)/dw is simply calculated as the weighted sum of the single gradients dQ(s,a_j)/dw (represented as feature lists), as given by equation 6.10.
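As an illustration of equation 6.11, the following sketch computes the gradient of a soft-max action distribution with respect to the action values; it is a standalone example, not the Toolbox's action distribution interface.

    #include <vector>
    #include <cmath>

    // Soft-max probabilities pi(s, a_i) = exp(beta * Q_i) / sum_j exp(beta * Q_j).
    std::vector<double> softmax(const std::vector<double> &q, double beta) {
        std::vector<double> p(q.size());
        double sum = 0.0;
        for (std::size_t j = 0; j < q.size(); ++j) { p[j] = std::exp(beta * q[j]); sum += p[j]; }
        for (std::size_t j = 0; j < q.size(); ++j) p[j] /= sum;
        return p;
    }

    // Gradient d pi(s, a_i) / d Q(s, a_j) for all j (equation 6.11).
    std::vector<double> softmaxGradient(const std::vector<double> &q, double beta,
                                        std::size_t i) {
        std::vector<double> p = softmax(q, beta);
        std::vector<double> grad(q.size());
        for (std::size_t j = 0; j < q.size(); ++j)
            grad[j] = (j == i) ? beta * p[i] * (1.0 - p[j])   // diagonal term
                               : -beta * p[i] * p[j];         // off-diagonal term
        return grad;
    }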
6.2.6 The supervised learning framework
In RL we may also need supervised learning algorithms, for example for learning the state dynamics of a dynamic system. In this section we briefly present the supervised learning interfaces of the Toolbox; an example of how we can learn the dynamics of a model is given in section 7.1.6.
The class CSupervisedLearner is the super class of all supervised learning algorithms. Supervised learning is only supported for regression problems like continuous state prediction, so we have an n-dimensional continuous input and an m-dimensional continuous output space. The inputs and outputs are all represented as vector objects. The supervised learning base class consists of two methods: one for evaluating an input vector and returning the output of the function approximator, and one for using a given input-output vector pair for learning. Both functions are only interfaces and have to be implemented by the subclasses.
The Toolbox provides just one supervised learning algorithm implementation, using incremental gradient descent as discussed in the theoretical part of this section. The class CSupervisedGradientFunctionLearner gets a gradient function as input and serves as connection between the supervised learning interface and
the gradient function interface. When learning a new example, it calculates the error between the output of the gradient function and the given target output; then the gradient and the error are used to update the weights of the gradient function according to the stochastic gradient equations. We can additionally specify a momentum factor α for the supervised gradient learner, which calculates a weighted average over the gradient updates:

Δw_{k+1} = α · Δw_k − η · ∇_w E(x, g(x))   (6.12)

This is a common approach to boost the performance of supervised gradient descent algorithms.
Through this approach we can also use any implemented gradient-based function approximation scheme for supervised learning.
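A condensed sketch of such a supervised gradient learner with momentum (equation 6.12) for a linear model with one output; names and signatures are simplified assumptions and do not match CSupervisedGradientFunctionLearner exactly.

    #include <vector>

    // Supervised incremental gradient learner with momentum for a linear model
    // g(x; w) = w^T x.
    class MomentumLearner {
    public:
        MomentumLearner(std::size_t n, double eta, double momentum)
            : w_(n, 0.0), delta_(n, 0.0), eta_(eta), alpha_(momentum) {}

        double evaluate(const std::vector<double> &x) const {
            double y = 0.0;
            for (std::size_t i = 0; i < w_.size(); ++i) y += w_[i] * x[i];
            return y;
        }

        // Learn one example: delta_w <- alpha * delta_w - eta * (g(x) - target) * dg/dw.
        void learnExample(const std::vector<double> &x, double target) {
            double error = evaluate(x) - target;
            for (std::size_t i = 0; i < w_.size(); ++i) {
                delta_[i] = alpha_ * delta_[i] - eta_ * error * x[i];
                w_[i] += delta_[i];
            }
        }

    private:
        std::vector<double> w_, delta_;
        double eta_, alpha_;
    };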
6.3 Function Approximation Schemes
In this section we discuss the function approximation schemes which are implemented in the Toolbox. All these function approximators are updated via gradient descent. Section 6.3.7 covers the implementation of the discussed function representations.
6.3.1 Tables
Tables are the simplest function approximators. As already discussed in section 3.1, a single state index is used to represent the state; the value of the state is then determined through a tabular look-up:

g(s_i; w) = w_i   (6.13)

Continuous problems have to be discretized in order to get a single state index, which is a very crucial task, so tables are only recommendable for discrete problems. Nevertheless we can treat tables like normal function approximators, and therefore we can also calculate the gradient with respect to the weights:

dg(s_i; w)/dw = e_i   (6.14)

where e_i is the i-th unit vector.
6.3.2 Linear Approximators
We already discussed linear feature state representations in section 3.2. For linear approximators we have, similar to tables, one entry for each feature; the difference is that the function value is interpolated between several table entries:

g(s; w) = Σ_{i=1}^{n} φ_i(s) · w_i = Φ(s) · w   (6.15)

Here n is the total number of features and φ_i calculates the activation factor of feature i. The gradient of linear approximators is calculated very easily:

dg(s; w)/dw = Φ(s)   (6.16)

There are different ways to create the linear feature state representation; the two most common are:
Tile Coding
Normalized Gaussian networks
For a detailed discussion of these two approaches see chapter 3.2.
Linear feature approximators have, depending on the choice of the feature space, very good learning properties. An important advantage is that we can choose our features to have only local influence on the global function (RBF features, tile coding), so updating state s only changes the function value in a local neighborhood of s. Another important advantage is that the gradient does not depend on the current weight values, so it is purely specified by the state vector. The disadvantage of approaches with just local features is that they suffer from the curse of dimensionality, i.e. the number of features increases exponentially with the number of state variables.
6.3.3 Gaussian Softmax Basis Function Networks (GSBFN)
GSBFNs are a special case of RBF networks, where the sum of all feature factors is normalized to 1.0. As a result, the difference to the standard RBF network approach is that GSBFNs have an additional extrapolation property for areas where no RBF center is located nearby. Doya and Morimoto successfully used adaptive GSBFNs to teach a planar, two-link robot to stand up [30].
For a given n-dimensional input vector x the activation function of center i is calculated by the standard RBF formula

a_i(x) = exp(−½ · ‖(x − c_i) / s_i‖²)   (6.17)

where c_i is the location of the center and the vector s_i determines the shape of the activation function. For simplicity we choose to specify the shape of the function just by a vector instead of a matrix. So we can specify the shape (size of the bell-shaped curve) for each dimension separately, but we cannot specify any correlated expanse.
The soft-max basis activation function is then given by

φ_i(x) = a_i(x) / Σ_{j=1}^{n} a_j(x)   (6.18)

The function value is then calculated in a straightforward way, as in the linear feature case:

g(x; w, C, S) = Σ_{i=1}^{n} φ_i(x) · w_i = Φ(x) · w   (6.19)

The gradient with respect to w is then given by

dg(x; w, C, S)/dw = Φ(x)   (6.20)

which is the same as for linear feature approximators. Consequently the non-adaptive case of a GSBFN can be treated as a usual linear function approximator.
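The following sketch computes the normalized soft-max basis activations of equations 6.17 and 6.18 for a fixed set of centers; it is a minimal standalone example and not the CAdaptiveSoftMaxNetwork implementation.

    #include <vector>
    #include <cmath>

    struct RBFCenter {
        std::vector<double> c;   // center location
        std::vector<double> s;   // per-dimension shape (width)
    };

    // Gaussian activation a_i(x) = exp(-0.5 * || (x - c_i) / s_i ||^2)   (equation 6.17)
    double activation(const RBFCenter &center, const std::vector<double> &x) {
        double sq = 0.0;
        for (std::size_t d = 0; d < x.size(); ++d) {
            double z = (x[d] - center.c[d]) / center.s[d];
            sq += z * z;
        }
        return std::exp(-0.5 * sq);
    }

    // Normalized soft-max basis factors phi_i(x) = a_i(x) / sum_j a_j(x)   (equation 6.18)
    std::vector<double> softmaxBasis(const std::vector<RBFCenter> &centers,
                                     const std::vector<double> &x) {
        std::vector<double> phi(centers.size());
        double sum = 0.0;
        for (std::size_t i = 0; i < centers.size(); ++i) { phi[i] = activation(centers[i], x); sum += phi[i]; }
        if (sum > 0.0)
            for (std::size_t i = 0; i < centers.size(); ++i) phi[i] /= sum;
        return phi;
    }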
Adaptive GSBFN
The framework of GSBFNs can be extended to have adaptable activation functions; we adapt the location and the shape of the centers. There are three additional update schemes for adaptive GSBFNs, as proposed in [30].

Add a center: A new center is allocated if the approximation error is larger than a specified criterion e_max and the activation factors a_k of all existing centers are smaller than a given threshold a_min:

|g̃(x; w) − g(x)| > e_max  and  max_k a_k(x) < a_min

The new center is added at position x with given initial shape parameters s_0, and the weight w_i is initialized with the current function value. If a new basis function is allocated, the shapes of neighboring basis functions also change due to the normalization step.
Update the center positions: We can also calculate the gradient with respect to the position of a center c_i to adjust the location of the centers:

dg(x; w, C, S)/dc_i = dΦ(x)/da_i · da_i/dc_i = (1 − φ_i(x)) · φ_i(x) · ((x − c_i) / s_i²) · w_i   (6.21)

Update the center shapes: The shape of the centers is updated by:

dg(x; w, C, S)/ds_i = dΦ(x)/da_i · da_i/ds_i = (1 − φ_i(x)) · φ_i(x) · ((x − c_i)² / s_i³) · w_i   (6.22)

Actually both gradient calculations are just approximations of the real gradient, because they neglect the fact that φ_i(x) is a function of c_j and s_j even if i ≠ j.
In practice the adaptation of the center positions and shapes has to be done very slowly, so usually individual learning rates η_c and η_s are used for these parameters.
Adaptive GSBFNs do not force the user to choose the positions and shapes of the centers very accurately, but they still suffer from the curse of dimensionality.
6.3.4 Feed Forward Neural Networks
Feed forward neural networks (FF-NNs) consist of a graph of nodes, called neurons, connected by weighted links. These nodes and links form a directed, acyclic graph. FF-NNs contain one input layer, one or more hidden layers and one output layer. Usually only neurons of two neighboring layers are connected through links. For each node n_i the input variables of the neuron are multiplied by the weights w_{ij} and summed up (figure 6.4); each node has its own activation function, which can be a sigmoid, a tansig or a linear function (the latter usually used for the output layer). The sigmoidal transfer functions of the hidden neurons divide the input space with a hyperplane into two regions. Therefore these functions are global functions, in contrast to RBF centers, which use only a small region close to the center.
The weights are usually updated with the back-propagation algorithm (backprop), which exists in several modifications. The backprop algorithm calculates the gradient of the error function with respect to the weights of the neural network by propagating the initial error back through the network. This algorithm is not covered in this thesis; see the references given in [15].
FF-NNs are not used as often as linear approximators because they are quite tricky to use for RL. They have poor locality, learning can be trapped in local minima, and we have very few convergence guarantees.
The major strength of FF-NNs is that they can deal with high dimensional input; in other words, FF-NNs do not suffer from the curse of dimensionality [7]. Another advantage is that the hidden layer of a FF-NN has a global generalization ability, which can be reused for similar problems once a task has been learned.
FF-NNs have been used extensively by Coulom [15] for several optimal control tasks like the cart-pole swing up, the acrobot swing up and a high dimensional swimmer task. These results show that FF-NNs can solve problems which are too complex for linear function approximators.
Figure 6.4: A single neuron: y_i = σ_i(w_{i0} + Σ_{j=1}^{n} w_{ij} · x_j)

Coulom showed empirically with the Vario-η algorithm that the variance of the weights to the linear output units is typically n times larger than that of the weights of the internal connections (n being the total number of neurons in the hidden layer). So good learning rates are simply obtained by dividing the learning rates of the output units by √n.
The weights also have to be initialized carefully, because with bad initial weights we are likely to get stuck in a local minimum. Le Cun [26] proposed to initialize all weights of a node randomly according to a normal distribution with a variance of 1/m, m being the number of inputs of the node.
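A small sketch of these two heuristics, Le Cun's initialization with variance 1/m and the 1/√n scaling of the output-layer learning rate; the helper functions are illustrative and independent of the Torch classes used by the Toolbox.

    #include <vector>
    #include <cmath>
    #include <cstdlib>

    // Crude zero-mean Gaussian sample (sum of twelve uniforms); sufficient for initialization.
    double gaussianSample(double stddev) {
        double sum = 0.0;
        for (int i = 0; i < 12; ++i) sum += std::rand() / (double)RAND_MAX;
        return (sum - 6.0) * stddev;
    }

    // Le Cun initialization: each weight of a node drawn with variance 1/m,
    // m being the number of inputs of that node.
    void initNodeWeights(std::vector<double> &weights) {
        double stddev = 1.0 / std::sqrt((double)weights.size());
        for (std::size_t i = 0; i < weights.size(); ++i)
            weights[i] = gaussianSample(stddev);
    }

    // Learning rate of the linear output units, scaled down by sqrt(n)
    // (n = number of hidden neurons), as suggested by the Vario-eta measurements.
    double outputLayerEta(double hiddenEta, int numHiddenNeurons) {
        return hiddenEta / std::sqrt((double)numHiddenNeurons);
    }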
6.3.5 Gauss Sigmoid Neural Networks (GS-NNs)
This function approximation scheme tries to combine the benefits of the local RBF functions and the global sigmoidal NNs. This approach has been proposed and used by Shibata [42] for learning hand-reaching movements with visual sensors, and also for a biologically inspired arm motion task by Izawa [20].
GS-NNs consist of two layers. The first layer is the Gaussian localization layer. This layer uses RBF networks (or rather GSBFNs) to localize the n-dimensional input space. Actually we can use any kind of feature calculator we want for localization; even adaptive GSBFNs, where the centers and the shapes of the activation functions are adapted, can be used. In this case the learning rate of the adaptive GSBFN has to be very small to keep learning stable. The second layer is a sigmoid layer as in a FF-NN. This layer contains the global generalization ability. The input to the second layer is obviously the feature vector calculated by the first layer.
Through the second, global layer we gain the advantage that we can use less accurate feature representations (and consequently fewer features). Another approach, used by Izawa [20], is to do the localization of each state variable separately and rely on the global layer to combine the localized state variables correctly. This is a huge advantage because we escape the curse of dimensionality: in this case the number of features only increases linearly with the number of state variables. Whether this approach also scales up to more complex problems still has to be investigated.
6.3.6 Other interesting or utilized architectures for RL
In this section we briefly provide an overview of other kinds of function approximators which are used in the area of RL, in particular for optimal control tasks. These architectures are just mentioned and their interesting properties pointed out; they are not implemented in the Toolbox.

Figure 6.5: (a) RBF networks: there are no hidden units which can represent global information. (b) GS-NN with an added sigmoidal hidden layer to provide better generalization properties. Taken from Shibata [42].
Normalized Gaussian networks with linear regression (NG-net)
NG-nets approximate an m-dimensional function over an n-dimensional input space. The output of an NG-net is given by

y = Σ_{i=1}^{M} [ a_i(x) / Σ_{j=1}^{M} a_j(x) ] · (W_i · x + b_i)   (6.23)
The a_i again represent Gaussian RBF functions, so the inner term is the same as for GSBFNs. W_i is a linear regression matrix and b_i is the offset for the i-th Gaussian kernel. Instead of just summing the weighted activation factors of the radial basis functions as in GSBFNs, each radial basis function defines its own linear regression, and the NG-net forms the sum of these linear regression kernels. In addition to the parameters of the Gaussian kernel functions (their centers and widths), we have the m × n linear regression matrices W_i and the m-dimensional offset vectors b_i as parameters. Usually this approach needs fewer RBF centers for a good approximation, because linear regression is more powerful than linear function approximation.
Again we can use gradient descent methods to find a good parameterization of the NG-net. Yoshimoto and Ishii [57] used another, also very interesting approach: they used an EM algorithm to calculate the parameter setting. This can be done by defining a stochastic model for the NG-net and then using the standard Expectation-Maximization algorithm, where the probability P(x, y | w) is maximized at each maximization step. In their approach an Actor-Critic algorithm was used. They use two different training phases: one with a fixed actor for estimating the value function; the other training phase is used to improve the actor given the fixed estimated value function of the policy defined by the actor. For further details please refer to the given literature references.
Adaptive state space discretization using a kd-tree
Vollbrecht uses an adaptive kd-tree to discretize the state space in the truck-backer-upper example. This approach is quite interesting because no prior knowledge is needed to construct the discrete state space. The basic partitioning structure is a kd-tree which divides the state space into n-dimensional cuboid-like cells. The whole state space gets split successively with (n − 1)-dimensional hyperplanes, which cut a cell into two halves along a selected dimension; consequently a kd-tree can be represented as a binary tree. Two neighboring cells cannot differ in their box length in any dimension by more than a factor of two. Within a cell the Q-Value is constant, similar to using tables. An action is executed as long as the Q-Value does not change, i.e. until the state leaves the current cell of the kd-tree. Every time a certain condition is met, the current cell of the kd-tree is split in half and two new cells replace the existing one. Vollbrecht uses different kinds of tasks for his hierarchic structure: avoidance, goal seeking and state maintaining tasks. For each of these tasks different rules are used to split the cells of the kd-tree. For a more detailed discussion of the used hierarchical system see section 5.2.4.
Echo state networks
Echo state networks (ESNs) have been proposed by Herbert Jäger [22], [21] for non-linear time series prediction. The idea of echo state networks is to use a fixed recurrent neural network with sparse internal connections and to learn only the linear output mapping. The internal connections are chosen randomly at the beginning and are not learned at all; only the linear output mapping is learned, which can easily be done by the LMS rule or alternatively online via gradient descent. Under certain conditions the internal nodes of the network represent echo state functions, which are functions of the input history. The network uses these echo state functions as the basis functions for the linear mapping; if we have a large pool of uncorrelated basis functions, we are likely to find a good linear output mapping. The recurrent neural network has echo states (i.e. usable uncorrelated basis functions) if the input function is compact and certain conditions on the internal connection matrix W are met. ESNs have not been used with RL so far, but this approximation scheme has many interesting properties. We do not need to incorporate any knowledge into the function approximator (as in linear function approximation), but we can still use linear learning rules, which usually converge faster. Another interesting aspect is the incorporation of state information from the past with the help of the recurrent network, which can be advantageous for learning POMDPs. Unfortunately there was no time for an additional investigation of these ideas.
Locally weighted regression
Locally weighted learning (LWL, [4], [3]) is a popular supervised learning approach for learning the forward model or the inverse model of the system dynamics. In contrast to all the other discussed approaches this method is memory based, that is to say it maintains all the experience (input-output vectors) in memory (similar to nearest neighbor algorithms). Hence locally weighted learning is a non-parametric function representation. Using the set of the k nearest input points to the query point x, a local model of the learned function is created and used to calculate the function value at x. The local model can be any kind of parameterized function; usually a linear or quadratic model is used, but using a small neural network is also possible. The parameters of the local model have to be recalculated for each query point, which is, in combination with the look-up of the k nearest neighbors, computationally more expensive than using a global parameterized model. The advantage of this approach is that hardly any time is needed for learning (we just have to add the new input point to the memory), and that the same input point only has to be learned once (in contrast to gradient descent methods, where we have to use a learning example more than once to train the desired function value more exactly). Atkeson gives a very good overview of locally weighted learning and how to use it for control tasks [3]; in that paper, however, LWL is only used in the context of RL for learning the forward model of the system in order to improve the performance of a reinforcement learning agent. Smart [46], [47], [45] uses LWL directly to represent the Q-Function of the learning problem, using a kd-tree for a faster look-up of the neighboring inputs. Since the Q-Values change with the policy, special algorithms have to be used to update the already existing examples in memory. Smart focuses on robot learning; the approach was tested on corridor following and obstacle avoidance tasks. The LWL approach is particularly interesting since LWL does not suffer from the curse of dimensionality as much as linear feature state representations.
6.3.7 Implementation in the RL Toolbox
Linear Approximators and Tables
Tables and linear approximators are represented by the class CFeatureFunction. Because of our state representation model, this class is not a subclass of CGradientFunction (it needs a state collection object as input). Instead we directly derive a value function class CFeatureVFunction from the base class CGradientVFunction. We already introduced this class in chapter 4.
For the gradient calculation we just store the given feature state object in a gradient feature list. So the main functionality of a linear approximator is still implemented in a feature calculator object.
Adaptive GSBFNs
In this case we decided not to use the gradient function interface directly for the base class (CAdaptiveSoftMaxNetwork), for the sake of extensibility, because there exist approaches which do not just use the features for a linear approximation, like the NG-network (see 6.3.6). Our base class implements just the localization layer, without the weights w for the linear approximation. Thus the base class contains only the center position and shape information and represents just a part of a function approximator. Therefore the class is a subclass of CGradientUpdateFunction instead of CGradientFunction, because the gradient and the input/output behavior are not known at this point.
The weights w for the linear approximation are maintained by a common feature function object. In the constructor we have to fix the maximum number of centers that can be allocated and that can be active simultaneously. The class also implements the feature calculator interface and calculates the activation factors of the maxFeatures most active features, which are stored in a feature state object. We can treat adaptive GSBFNs as feature calculators if we assume only small changes of the center positions and shapes during one learning episode. In general this assumption holds at least for small learning rates η_c and η_s.
The class maintains a list of RBF centers. An RBF center is stored in its own data structure which contains vectors for the location and the shape of the center.
The calculation of the derivatives with respect to the location and shape of a single center is implemented in the function getGradient(CStateCollection *state, int featureIndex, CFeatureList *gradientFeatures). Given the input state collection, it returns the gradient [dφ_i(x)/dc_i, dφ_i(x)/ds_i], calculated as discussed above. Each weight of the location and shape of a center is assigned a unique index. This index is used to identify the center associated with a certain weight w_i and also to update the data structures of the centers (implemented in the interface function updateWeights).
Since we use a gradient update function as base class, we can use an individual η-calculator for our adaptive GSBFN. This η-calculator identifies each weight as a location or shape weight and then applies the specified learning rate η_c respectively η_s to the weight updates. If η_c or η_s is set to zero, the corresponding derivatives are not calculated, resulting in the constant case again, but new centers can still be added.
The calculation of the feature factors φ_k(x) is done in the getModifiedState function of the feature calculator interface. We can specify a rectangular neighborhood for each dimension i for the local search of nearby RBF centers. At the beginning of the search, all centers are stored in the search list. Then the first dimension of each center's location is checked to be in the specified neighborhood of the current state. If that is not the case, the center is deleted from the list. This process is repeated for all dimensions. At the end the search list contains only centers which are located in the neighborhood. The activation factors of all these centers are evaluated and the maxFeatures most active ones are written into the feature state object. Then the feature state is normalized as usual. The feature state is also needed by the gradient calculation itself, so it has to be added to the state collection objects of the agent.
For adding an RBF center automatically, the method addCenterOnError is provided. The function adds a center at the current position if the activation factors of all centers are smaller than a_min and the error is larger than e_max. But here a problem arises with our state representation using the feature calculator interface. Feature calculators can only be used if the feature factors do not vary for the same state over time. As already mentioned, we can neglect the drift of the center locations and shapes over one training episode, but if we add a new center, the feature activation factors change very abruptly. We decided on the following workaround: we can use the standard state model of the Toolbox as long as no new center has been allocated, which spares us a lot of computation time in a convenient way (because we calculate the feature factors just once). The state collection - state modifier interface is adapted slightly: each state modifier maintains a list of all state collections which store a state object of that modifier. As a consequence the modifier can always inform the state collections whenever a state object has become invalid and has to be calculated once again. This is done each time a new RBF center is added.
In addition we have the possibility to specify an initial set of RBF centers. This can either be done by specifying the centers individually or by specifying a grid feature calculator object. Then an RBF center is added at each tile of the grid.
Feed Forward Neural Networks
The RL Toolbox uses the Torch library to represent all FF-NNs. With the Torch library we can create arbitrary FF-NNs with an arbitrary number of different layers. The layers can be interconnected as needed (but usually a straightforward NN is used). For these neural networks the gradient with respect to the weights can be calculated, given a specific input and output data structure. All objects which support gradient calculation in the Torch library are subclasses of the class GradientMachine. The class CTorchGradientFunction encapsulates a gradient machine object from Torch; it is also a subclass of the CGradientFunction interface, so it can be used to create a V-Function or a Q-Function. All the communication with the Torch library is done by this class, which involves converting the input and output vectors between the Torch internal structure and the data structures used by the Toolbox, and also getting and updating the weight vectors directly.
A standard FF-NN is created very easily with the Torch library; for further details consult the Torch documentation of the class MLP.
FF-NNs additionally use their own η-calculator, which scales the learning rates of the output weights by a factor of 1/√n, n being the number of neurons in the hidden layers.
At the creation of the FF-NN the weights are initialized in the resetData method by the method proposed by Le Cun. We can additionally scale the variance of the initial weights by the parameter InitWeightVarianceFactor.
Gaussian Sigmoidal Neural Networks
The two main parts of GS-NNs already exist: the localization part and the FF-NN part. What is still missing is the interface between both. We have to provide a conversion from the sparse feature state representation to the full feature state vector (including the features with activation factor 0.0). This full feature state vector can then be used directly as input for a FF-NN. The conversion is done by the class CFeatureStateNNInput.
Through this approach any constant feature representation can be used for the GS-NN, so we can for example choose whether we want to localize the global state space or each state variable separately (using the and respectively or feature operators). Unfortunately this approach does not work for adaptive GSBFNs, because we would have to calculate the gradient of the composition of the GSBFN and the FF-NN, which is quite tricky and time consuming (and the question is whether that is really necessary for small learning rates of the adaptive GSBFN). Hence the use of GS-NNs with an adaptable localization layer is not supported in the Toolbox.
Chapter 7
Reinforcement learning for optimal control
tasks
Reinforcement learning for control tasks is a challenging problem, because we usually have a high dimensional continuous state and action space. For learning with continuous state and action spaces, usually function approximation techniques are used. Many interesting problems in the area of optimal control have been solved with different RL algorithms; for a detailed description of the successes of RL in this area see chapter 1. Using RL with function approximation needs many learning steps to converge; consequently almost all results are for simulated tasks.
In this chapter we discuss a few commonly used algorithms for optimal control tasks. First we take a closer look at the use of continuous actions and extend the framework of the Toolbox for continuous action learning. Then we come to value approximation algorithms, which allow us to use TD(λ) learning even with function approximators [6]. We also cover two newer value-based approaches, namely continuous time RL [17] and advantage learning [6]. The next section covers two policy search algorithms (GPOMDP [11] and PEGASUS [33]). Finally we come to continuous Actor-Critic learning, where two different approaches are discussed: the stochastic real valued algorithm (SRV, [19]) and a newly proposed algorithm called policy gradient Actor-Critic learning (PGAC).
7.1 Using continuous actions in the Toolbox
For continuous control tasks we need to use continuous actions. For a low dimensional action space we could alternatively discretize the action space, but this can impair the quality of the policy, and it is not possible for high dimensional action spaces anyway.
Several limitations arise when using continuous actions. First, action value functions cannot be used straightforwardly: since we want to search for the best action value in state s, we somehow need to discretize the action space or use a more sophisticated search method. The same is true for V-Function planning, where an action discretization is also necessary for finding the optimal action.
Concerning the Toolbox, we already discussed the concept of continuous actions, using the action pointer to identify the action and the action data object for the continuous action values (see 2.2.4). We additionally design a continuous controller interface and also an interface for Q-Functions with continuous action vectors as input. Even if we need to discretize the action space for the Q-Function in order to search for the best action value, it can be advantageous to use continuous inputs for learning in order to induce some generalization effects between the actions.
As a discretized version of a continuous action we introduce static continuous actions. Static continuous actions have the same properties as continuous actions, so they maintain an action data object representing the continuous action value vector, but this action value is now fixed; the action represents a fixed point in the action space. Static continuous actions are for example used for the discretization needed by Q-Functions to search for the best action value. For each static continuous action, we can additionally calculate the distance to any other continuous action object in the action space, which is needed later on for interpolating Q-Values.
7.1.1 Continuous action controllers
We need to design a continuous action controller interface which still fits in our agent controller architecture. An agent controller always returns an action pointer (to identify the action) and stores, if needed, the action data object associated with the returned action in a given action data set. A controller specifically built for continuous actions always returns the same action pointer (the pointer of the continuous action object) and stores the calculated action vector in the corresponding action data object. For that reason we create the interface class CContinuousActionController, which is a subclass of CAgentController and has an additional interface function getNextContinuousAction, which gets the state as input and has to store the action vector in a given continuous action data object. This function is called by the getNextAction method of the controller; the continuous action data object is automatically passed to the getNextContinuousAction interface. Consequently this class simplifies the design of continuous action controllers, because we do not have to worry about the action model any more. Also consult figure 7.1 for an illustration.
Figure 7.1: Continuous action controllers and their interaction with the agent controller interface.
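A condensed sketch of such a controller: a hypothetical proportional controller that implements only the continuous interface function and leaves the action bookkeeping to the base class. The class and method signatures are simplified assumptions modeled on the description above, not the exact Toolbox headers.

    #include <vector>

    // Simplified stand-ins for the Toolbox state and action data types.
    typedef std::vector<double> StateVector;
    typedef std::vector<double> ContinuousActionData;

    // Simplified continuous action controller interface as described in the text:
    // subclasses only fill in the continuous action vector for the given state.
    class ContinuousActionController {
    public:
        virtual ~ContinuousActionController() {}
        virtual void getNextContinuousAction(const StateVector &state,
                                             ContinuousActionData &action) = 0;
    };

    // Hypothetical example: a proportional controller u = -K * x per dimension.
    class ProportionalController : public ContinuousActionController {
    public:
        explicit ProportionalController(const std::vector<double> &gains) : k_(gains) {}

        virtual void getNextContinuousAction(const StateVector &state,
                                             ContinuousActionData &action) {
            action.resize(k_.size());
            for (std::size_t i = 0; i < k_.size(); ++i)
                action[i] = -k_[i] * state[i];    // one control value per gain
        }

    private:
        std::vector<double> k_;
    };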
Every continuous action controller already has its own noise controller; this noise is added to the action value each time getNextAction is called. If u(s) is the action vector coming from the getNextContinuousAction function, the policy is defined to be

π(s_t) = u(s_t) + n_t   (7.1)
Random Controllers
For the noise we provide a general noise controller called CContinuousActionRandomPolicy. The noise is normally distributed with zero mean and a specified sigma value. In order to obtain a smoother noise signal we can also choose to low-pass filter it. Hence the noise vector is calculated in the following way:

n_{t+1} = α · n_t + N(0, σ)

α being the smoothing factor and N(0, σ) a normally distributed random variable. In order to switch the noise off for a certain continuous controller we have to set the σ value to zero.
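A minimal sketch of this low-pass filtered Gaussian noise signal; the Gaussian sampler and class name are our own illustrations, not the CContinuousActionRandomPolicy implementation.

    #include <vector>
    #include <cstdlib>

    // Crude zero-mean, unit-variance Gaussian sample (sum of twelve uniforms).
    double standardNormal() {
        double sum = 0.0;
        for (int i = 0; i < 12; ++i) sum += std::rand() / (double)RAND_MAX;
        return sum - 6.0;
    }

    // Low-pass filtered Gaussian noise: n_{t+1} = alpha * n_t + N(0, sigma).
    class SmoothedNoise {
    public:
        SmoothedNoise(std::size_t dimensions, double sigma, double alpha)
            : n_(dimensions, 0.0), sigma_(sigma), alpha_(alpha) {}

        const std::vector<double> &next() {
            for (std::size_t i = 0; i < n_.size(); ++i)
                n_[i] = alpha_ * n_[i] + sigma_ * standardNormal();
            return n_;
        }

    private:
        std::vector<double> n_;
        double sigma_, alpha_;
    };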
The noise used for a given action is needed by the SRV algorithm. Since the noise signal is not stored with the action object, we have to recalculate the noise signal if we know the action vector a_t. The continuous agent controller interface therefore also supports calculating the noise vector given a control vector a_t and a state s_t. The noise (i.e. the deviation from the original control vector) is then obviously calculated by

n_t = a_t − π(s_t)   (7.2)
7.1.2 Gradient calculation of Continuous Policies
For continuous policies the gradient dπ(s)/dθ is needed by a few algorithms which represent the continuous policy directly. dπ(s)/dθ is an m × p matrix, m being the dimensionality of the action space U and p the number of weights used to represent the policy. We can use the already discussed function approximation schemes, which are implemented as gradient functions, for the continuous policies; hence we encapsulate gradient function objects. For the gradient calculation we can then simply use the encapsulated gradient function object.
Implementations of the continuous gradient policies
There are two implementations of this interface
Gradient function encapsulation :(CContinuousActionPolicyFromGradientFunction) Agradient func-
tion object with n inputs and m outputs is used to represent the policy. The user has to specify a
gradient function with the correct number of input and output values. The class itself is the interface
between the gradient function and the continuous gradient policy classes, so all function calls are
passed to the corresponding functions of the gradient function object. In this case only one gradient
function for all control variables is used. This class can for example be used to encapsulate a FF-NN
from the Torch library and use it as policy.
Single state gradient function encapsulation : (CContinuousActionPolicyFromSingleGradientFunc-
tion). In this case we can use a list of gradient functions, for each control variable an individual gra-
dient function is used (so the gradient functions have n inputs and one output dimension. We can use
this class to represent our policy with independent functions for the dierent control variables, for
example with individual feature functions for each control variable.
Control Limits
Usually we have limits for our control variables which in the simplest case are given by an interval [u_min, u_max] (e.g. see [15]). Just clipping the control vector to the interval limits would be a possibility, but this definitely falsifies the gradient calculation. Our approach is to use sigmoidal functions for the control variables which saturate at the limit values. We use the following function to get a quasi-linear behavior between the limits, where u_j is the output of the policy without any limits (this function is also illustrated in figure 7.2):

u'_j = σ(u_j) = u_{j,min} + (u_{j,max} − u_{j,min}) · logsig( −2 + 4 · (u_j − u_{j,min}) / (u_{j,max} − u_{j,min}) )    (7.3)
Figure 7.2: Limited control policy; the action u is limited to [u_min, u_max] by a sigmoidal function. In the middle of the interval the limited control policy is quasi-identical to the unlimited policy.
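The following few lines of C++ compute this squashing function for one control variable. They are only a numerical illustration of equation 7.3 as reconstructed above; the Toolbox's own implementation lives in CContinuousActionSigmoidPolicy and may differ in detail.

#include <cmath>

// Standard logistic sigmoid, saturating at 0 and 1.
double logsig(double x) { return 1.0 / (1.0 + std::exp(-x)); }

// Squashes an unlimited control value u into [uMin, uMax] following eq. 7.3.
// Near the middle of the interval the mapping is almost the identity.
double limitControl(double u, double uMin, double uMax)
{
    double range = uMax - uMin;
    double arg = -2.0 + 4.0 * (u - uMin) / range;
    return uMin + range * logsig(arg);
}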
Obviously the gradient also changes:

dπ'(s)/dθ = (u_{j,max} − u_{j,min}) · logsig'( −2 + 4 · (π(s) − u_{j,min}) / (u_{j,max} − u_{j,min}) ) · 4 / (u_{j,max} − u_{j,min}) · dπ(s)/dθ    (7.4)
Due to the scaling of the sigmoidal function argument, we get a quasi-identical behavior of the original and the sigmoidal function for 90% of the allowed control space. For introducing the limits of the control variables we create a separate class CContinuousActionSigmoidPolicy, which encapsulates another gradient policy class and uses the introduced sigmoid function to limit the control variables and also to calculate the new gradient. For sigmoidal policies we can use a different kind of noise, which we will refer to as internal noise. The internal noise is added before the sigmoidal function is applied:

π'(s) = σ(π(s) + n)    (7.5)

This internal noise has a lower effect if the control value of the original policy is outside the given control interval, because of the saturation effect. Usually a value outside the limits means that the algorithm is quite sure about taking the maximum or minimum control value, so it makes sense to reduce the effect of noise in these areas.
For the inverse calculation of the noise n_t given the executed action vector a_t and the state s_t we have to use the inverse sigmoidal function if an internal noise controller was used.
7.1.3 Continuous action Q-Functions
Similar to the continuous action controllers we also create such an interface for Q-Functions, which now does not take an action pointer plus action data object as input any more; instead the interface functions immediately get a continuous action data object as input. This class is already a subclass of the gradient Q-Function interface, so its subclasses also have to provide full gradient support.
7.1.4 Interpolation of Action Values
One negative effect of discretizing the action space is that we get a non-smooth policy which is usually sub-optimal. An approach to overcome this problem is linear interpolation of the Q-Values. For example, if we have three different discretized action vectors [a_min, a_0, a_max] and the Q-Values of two neighboring actions are almost the same, it can be useful to take the average of both action vectors.
On the other hand, if we have executed the continuous action a_t and its continuous action vector does not match any discretized action vector, we can update the Q-Values of the nearby discretized action vectors.
The action selection part is done by the class CContinuousActionPolicy. The class has almost the same functionality as the stochastic policies: it takes a set of actions (this time all of them have to be static continuous actions) and an action distribution. In contrast to the stochastic policies, the class first samples one action according to the given action distribution and then calculates the weighted sum of all static action vectors in the neighborhood of the sampled action:
π(s) = Σ_{||a* − a_i|| < ε} P(a_i) · a_i    (7.6)

where a* is the sampled action. The size ε of the neighborhood that is searched for nearby action vectors can be specified.
An interpolated Q-Function is represented by the class CCALinearFAQFunction. This class encapsulates another Q-Function object and uses it for calculating the interpolated values. All actions used for the encapsulated Q-Function have to be subclasses of CLinearFAContinuousAction. These action objects are derived from the static continuous action class, so they represent a fixed point in the action space. Additionally they provide a function to calculate an activation factor of the static action, given the currently used action vector. This approach is related to linear feature states, which do the same in the state space. The class CContinuousRBFAction implements an RBF activation function for static actions, but classes for linear interpolation can also be implemented easily. In the end the activation factors of the static actions are normalized so that they sum to 1. The interpolation Q-Function class calculates the action activation factors of all static actions for a given action vector, normalizes these activation factors and then calls the corresponding functions of the encapsulated Q-Function. For the update functions, the update value for each static action is scaled by the corresponding activation factor; for the getValue functions the value is calculated as the weighted sum of the action values.
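A minimal sketch of this interpolation idea is given below: Gaussian (RBF-like) activation factors are computed for a set of static action values, normalized, and used to form a weighted Q-Value. The function name and the simple data structure are purely illustrative; they are not the CCALinearFAQFunction API. An update for an executed continuous action would be distributed over the static actions with the same normalized factors.

#include <cmath>
#include <vector>

// Q-Values of the static (discretized) actions in one fixed state.
struct StaticAction { double actionValue; double qValue; };

// Interpolated Q-Value of a continuous action 'u':
// RBF activation of every static action, normalized, then a weighted sum.
double interpolatedQ(const std::vector<StaticAction> &actions, double u, double width)
{
    std::vector<double> activation(actions.size());
    double sum = 0.0;
    for (std::size_t i = 0; i < actions.size(); ++i)
    {
        double d = u - actions[i].actionValue;
        activation[i] = std::exp(-(d * d) / (2.0 * width * width));
        sum += activation[i];
    }
    if (sum <= 0.0)
        return 0.0;                                       // no static action is activated
    double q = 0.0;
    for (std::size_t i = 0; i < actions.size(); ++i)
        q += (activation[i] / sum) * actions[i].qValue;   // normalized weighting
    return q;
}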
7.1.5 Continuous State and Action Models
In optimal control tasks our model consists of n continuous state variables and m continuous control variables. Often we know the model of the state dynamics, or at least we can learn it. The model can be represented as a transition function

s_{t+1} = F(s_t, a_t)

for discrete time processes, or it can be directly represented by the state dynamics of the system

ṡ_t = f(s_t, a_t).

The state transition from s_t to s_{t+Δt} can then be calculated by using a numerical integration method like the Runge-Kutta method. For the discrete time transition function we have already discussed the CTransitionFunction interface, which already fits our requirements (i.e. it can cope with continuous states and actions).
Continuous Time Models
Continuous Time Models are represented by the state dynamics ṡ_t = f(s_t, a_t). For calculating state transitions with a continuous time model the class CContinuousTimeTransitionFunction is implemented. This class is a subclass of the transition function class and maintains an additional interface getDerivationX which represents the state dynamics ṡ_t. With this interface the class can calculate the state transitions; we just have to specify the simulation time step Δt_s. This integration is done by the method doSimulationStep. A usual first order integration is used for calculating the state:

s_{t+Δt} = s_t + Δt · ṡ_t

This method has to be overridden if a second or higher order integration method is needed for certain state variables. For example it is advisable to use second order integration for positions p and first order integration for velocities v = ṗ to get a more accurate simulation result. In order to stay accurate, the simulation time step is much smaller than the time steps which are useful for learning. Consequently the numerical integration of the learning time step Δt is divided into several simulation steps with the simulation time step Δt_s = Δt / N; the small step integration is done by the doSimulationStep method. The whole integration step is done by the interface function of the transition function class, transitionFunction; the number of simulation steps per learning time step can be set separately.
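The sketch below shows this sub-stepping scheme in plain C++: a learning time step Δt is split into N first-order (Euler) simulation steps of size Δt_s = Δt/N. The dynamics function, vector type and function names are stand-ins chosen for the example, not the Toolbox's actual transitionFunction/doSimulationStep signatures.

#include <vector>

typedef std::vector<double> State;
typedef std::vector<double> Control;

// State dynamics s_dot = f(s, u); here a damped point mass as a placeholder:
// s = (position, velocity), u = (force).
State dynamics(const State &s, const Control &u)
{
    State sdot(2);
    sdot[0] = s[1];                       // d(position)/dt = velocity
    sdot[1] = u[0] - 0.1 * s[1];          // d(velocity)/dt = force - friction
    return sdot;
}

// One learning step of length dt, integrated with N first-order (Euler) sub-steps.
State transition(State s, const Control &u, double dt, int N)
{
    double dts = dt / N;                  // simulation time step
    for (int i = 0; i < N; ++i)
    {
        State sdot = dynamics(s, u);
        for (std::size_t k = 0; k < s.size(); ++k)
            s[k] += dts * sdot[k];        // doSimulationStep equivalent
    }
    return s;
}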
The class CContinuousTimeAndActionTransitionFunction represents a continuous time model where the action has to consist of continuous control variables (in contrast to the normal continuous time transition function class, where the action can be any action object). The interface for calculating the state derivative is adapted to take a continuous action data object as input instead of an action object.
Models Linear with respect to the control variables
For most motor control problems, where a mechanical system is driven by torques and forces, the state dynamics are linear with respect to the control variables, i.e. the state dynamics can be represented by

f(s, u) = B(s) · u + a(s)

The state dynamics are then completely described by the matrix B(s) and the vector a(s). This is modeled by the class CLinearActionContinuousTimeTransitionFunction, a subclass of CContinuousTimeAndActionTransitionFunction. This class has two additional interface functions for retrieving a(s) and B(s), which have to be implemented by the subclasses. The state derivative ṡ_t is then calculated in the described way.
All the models used for the benchmark tests can be described in this form, so this class is the super class of all our model classes.
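As a concrete instance of this structure, the sketch below writes the dynamics of a torque-driven pendulum in the form f(s, u) = B(s)·u + a(s). The member names and constants are chosen for this example only; the Toolbox's benchmark models realize the same decomposition through the interface functions of CLinearActionContinuousTimeTransitionFunction, whose exact names are not reproduced here.

#include <cmath>
#include <vector>

// Pendulum state s = (angle phi, angular velocity phi_dot), control u = (torque).
// Dynamics: phi_ddot = (u - m*g*l*sin(phi) - mu*phi_dot) / (m*l*l),
// written as s_dot = B(s) * u + a(s).
struct PendulumModel
{
    double m, l, g, mu;   // mass, length, gravity, friction

    // a(s): drift part of the dynamics (no control applied).
    std::vector<double> a(const std::vector<double> &s) const
    {
        std::vector<double> drift(2);
        drift[0] = s[1];
        drift[1] = (-m * g * l * std::sin(s[0]) - mu * s[1]) / (m * l * l);
        return drift;
    }

    // B(s): how the control variable enters the state derivatives (2x1 matrix).
    std::vector<double> B(const std::vector<double> &s) const
    {
        (void)s;                               // B is state-independent for this model
        std::vector<double> gain(2);
        gain[0] = 0.0;
        gain[1] = 1.0 / (m * l * l);
        return gain;
    }

    // f(s, u) = B(s) * u + a(s)
    std::vector<double> f(const std::vector<double> &s, double u) const
    {
        std::vector<double> sdot = a(s);
        std::vector<double> gain = B(s);
        for (std::size_t i = 0; i < sdot.size(); ++i)
            sdot[i] += gain[i] * u;
        return sdot;
    }
};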
7.1.6 Learning the transition function
It is often useful to use state predictions like in V-Planning, even if the transition function is not known, as for robotic tasks or external simulators. In this case we can learn the transition function s_{t+1} = f(s_t, a_t). This is a supervised learning problem, which is usually much easier than learning a Q or a V-Function. Although Q-Learning methods do not need any model of the MDP, it can be advantageous to learn the transition function and then use V-Planning methods, because we have divided the complexity of the original learning problem.
We can use our already existing supervised learning interface for learning the transition function. The class CLearnedTransitionFunction represents a learned transition function. It is inherited from the agent listener and also from the transition function class, thus it can be used as a standard transition function. The class gets a supervised learner object as input; in the agent listener interface, the old state vector s_t and the action vector a_t are combined into the input vector and the new state vector s_{t+1} is stored in the output vector, which are both passed to the supervised learner object. Thus at each step a new training example is created. For the transition function interface, the testExample method of the supervised learner is used to calculate the learned output vector.
Any supervised learning method can be used with this design to learn the transition function, although only simple gradient descent methods are currently implemented in the Toolbox.
7.2 Value Function Approximation
There is a well developed theory for learning the value function with lookup tables, and also for guaranteeing the convergence of supervised learning algorithms on function approximators, but when the two concepts are combined the problem becomes more complex. In this chapter we introduce three concepts for gradient-based value function approximation: the direct gradient, the residual gradient and the residual algorithm. These algorithms are discussed in more detail in [6].
At first let us fix some notation issues: the real value function is still called V^π(s), the approximated value function is written as V̂_w. If we want to refer to the gradient of the value function or some other function with respect to the weights (i.e. dV̂_w(s)/dw) we will use the ∇_w operator.
For approximating the value function we need to minimize the Mean Squared Error (MSE):

E = 1/n · Σ_s E[V^π(s) − V̂_w(s)]^2    (7.7)
where V^π(s) is the real value of the state s and V̂_w(s) is the approximated value coming from our function approximator. Usually we do not know the real value of state s, consequently we estimate it by

V^π(s) = E[r(s, a) + γ V^π(s')] ≈ E[r(s, a) + γ V̂_w(s')]    (7.8)
which is supposed to be a more accurate estimate of V^π(s) than V̂_w(s). Consequently V^π(s) is again an approximation, because we have to use the function approximator for the value V(s'). The resulting error function

E = 1/(2n) · Σ_s E[r(s, a) + γ V̂_w(s') − V̂_w(s)]^2 = 1/(2n) · Σ_s E[residual(s, s')]^2    (7.9)
is called the mean squared Bellman residual. We define the residual to be the inner term of the error function:

residual(s, s') = r(s, a) + γ V̂_w(s') − V̂_w(s)    (7.10)
Note that this residual is basically the same as the temporal difference value used for TD-learning. For a finite state space this error function is only zero if we have an exact approximation of the value function. Now we can do stochastic gradient descent on this error function; the gradient of the error function in a state s is then given by

∇_w E = E[residual(s, s') · ∇_w residual(s, s')]    (7.11)
For deterministic processes we can omit the expectation and directly use the successor state s'. For stochastic processes we need an unbiased estimate of the gradient of the error function in state s. An unbiased estimate of the gradient of the squared expectation E[Y]^2 is given by y_1 · ∇y_2, where y_1 and y_2 are two independent samples from Y (using y_1 · ∇y_1 instead would estimate the gradient of the expected square E[Y^2]). Consequently, for stochastic processes we would have to calculate the gradient from two independent samples of s':

∇_w E = residual(s, s'_1) · ∇_w residual(s, s'_2)    (7.12)

But if the nondeterministic part of the process is small (for example only a small noise term is added), then we can still use 7.11 as a good approximation of the gradient. Since we use only deterministic processes (or slightly stochastic processes) in this thesis, only one sample of the next state is used in the Toolbox for the gradient calculation.
An underlying problem of value function approximation is the accuracy of the approximated value function. Even a good approximation of the value function does not guarantee a good performance of the resulting policy. For example, there exist infinite horizon MDPs which can be proved to have the following properties [12]: If the maximum approximation error is given by

ε = max_{s∈S} |V̂_w(s) − V^π(s)|    (7.13)

then the worst case of the expected discounted reward (V(π) = Σ_{s∈D} d(s) V^π(s), where D is the set of all initial states and d(s) is the distribution over these states) of the greedy policy π̃ following V̂ is only bounded by

V(π̃) ≥ V(π) − 2γε / (1 − γ)    (7.14)

where V(π) is the real expected discounted reward of the policy. So even for a good approximation, the value function can generate bad policies for γ values close to one. Certainly, this holds only for a specific MDP (which was built on purpose for the proof) and it is the worst case, but we have to keep in mind that value function approximation can be crucial.
7.2.1 Direct Gradient Algorithm
The direct gradient method is the most obvious algorithm implementing value function approximation, so it was also the first algorithm that was investigated [54]. Again we try to adjust the weights of the function approximation system to move the current output V̂_w(s_t) closer to the desired output r(s_t, a_t, s_{t+1}) + γ V̂_w(s_{t+1}). So we get the following update for the weights:

Δw_D = α (r(s_t, a_t, s_{t+1}) + γ V̂_w(s_{t+1}) − V̂_w(s_t)) · ∇_w V̂_w(s_t)    (7.15)
If we look at our error function E, the direct gradient method neglects the fact that the desired output also depends on the weights of the function approximator, because we use V̂_w(s') to estimate it. Although this is the most obvious way to do value function approximation, this approach is not guaranteed to converge. Tsitsiklis and Van Roy [52] gave very simple examples of a two state MDP where this algorithm diverges. For a more detailed discussion of these examples also refer to [6]. A reason why this method does not work is that, if we change the value of one state with function approximation, we will usually change the values of other states too, including the value of the successor state s'. As a result this also changes the target value r + γ V̂_w(s'), which may actually let V̂_w(s) move away from the target value and hence cause divergence.
7.2.2 Residual Gradient Algorithm
The residual gradient algorithm calculates the real gradient of the residual given by

residual(s, s') = r(s, a) + γ V̂_w(s') − V̂_w(s)
So we have the following gradient of the error function:

∇_w E = (r(s_t, a_t, s_{t+1}) + γ V̂_w(s_{t+1}) − V̂_w(s_t)) · (γ ∇_w V̂_w(s_{t+1}) − ∇_w V̂_w(s_t))    (7.16)

and thus the weight update rule

Δw_RG = −α ∇_w E = −α (r(s_t, a_t, s_{t+1}) + γ V̂_w(s_{t+1}) − V̂_w(s_t)) · (γ ∇_w V̂_w(s_{t+1}) − ∇_w V̂_w(s_t))    (7.17)
Since we do stochastic gradient descent on the error function with the real gradient, this method is guaranteed to converge to a local minimum of the error function. The residual gradient algorithm updates both states, s and s', to achieve convergence on the error function. But these convergence results unfortunately do not necessarily mean that this algorithm learns as quickly as the direct gradient algorithm, or that the solution is the solution of the dynamic programming problem. In practice it turned out that the residual gradient algorithm is in fact significantly slower and does not find as good solutions as the direct algorithm. The advantage of this algorithm is that it is proven to be stable.
7.2.3 Residual Algorithm
The residual algorithm tries to combine the two algorithms to get the advantages of both of them, fast and stable learning. In figure 7.3 we see an illustration of the two gradients, the direct and the residual gradient. The direct gradient is known to learn fast, while the residual gradient always decreases the error function E. The dotted line represents a plane perpendicular to the residual gradient. Each vector that lies on the same side of this hyperplane as the residual gradient also decreases the error function. So, if the angle between the two gradient vectors is acute, we can use the direct gradient. If the angle is obtuse, the direct gradient would lead to divergence, but we can use a vector that is as close as possible to the direct gradient while still lying on the same side of the hyperplane as the residual gradient (see figure 7.3). This can be achieved by using a weighted average of the two gradient vectors [6]. So for a β ∈ [0, 1] we can calculate our new weight update by:
Δw_R = (1 − β) Δw_D + β Δw_RG
     = α (r(s_t, a_t, s_{t+1}) + γ V̂_w(s_{t+1}) − V̂_w(s_t)) · [(1 − β) ∇_w V̂_w(s_t) + β (∇_w V̂_w(s_t) − γ ∇_w V̂_w(s_{t+1}))]
     = α (r(s_t, a_t, s_{t+1}) + γ V̂_w(s_{t+1}) − V̂_w(s_t)) · (∇_w V̂_w(s_t) − β γ ∇_w V̂_w(s_{t+1}))    (7.18)
So in the residual algorithm we additionally attenuate the influence of the successor state with the factor β. By this definition the residual gradient and the direct gradient algorithm are both special cases of the residual algorithm. Depending on the β value, this method is guaranteed to converge to a local minimum of the error function. There are two methods proposed by Baird for the choice of the β value:
Constant β weighting: The two gradients are summed up with a constant weight factor β. The β value can then be found by trial and error; the smallest β value should be chosen that does not blow up the value function.
Figure 7.3: (a) Acute angle, (b) obtuse angle, (c) optimal update vector. In the case of an acute angle we can directly take the direct gradient weight update; for obtuse angles we have to calculate the new update vector Δw_R.
Calculate β: We can alternatively calculate the lowest possible β value in the range [0, 1] which still ensures that the angle between the epoch-wise residual gradient ΔW_RG and the residual weight update ΔW_R is acute (ΔW_RG · ΔW_R > 0). This value can be found by taking a slightly larger value than the β that fulfills the equation ΔW_RG · ΔW_R = 0:

((1 − β) ΔW_D + β ΔW_RG) · ΔW_RG = 0
β = (ΔW_D · ΔW_RG) / ((ΔW_D − ΔW_RG) · ΔW_RG) + ε    (7.19)

If this equation yields a β value outside the interval [0, 1], the angle between the residual gradient and the direct gradient is already acute, so a β value of 0.0 can be used to provide maximum learning speed. The disadvantage of this adaptive β calculation is that we need estimates of the real epoch-wise direct gradient and residual gradient. Using the stochastic gradient of only one step (which is used by TD-methods) does not work because these gradient estimates are very noisy.
But we can estimate the epoch-wise gradient incrementally. The epoch-wise calculated weight updates ΔW_D and ΔW_RG can be estimated by traces for the direct and the residual gradient. The traces used for the direct and the residual gradient weight updates are updated in the following way:
Δw_d = (1 − μ) · Δw_d + residual(s_t, s_{t+1}) · ∇_w V̂_w(s_t)    (7.20)
Δw_rg = (1 − μ) · Δw_rg + residual(s_t, s_{t+1}) · (∇_w V̂_w(s_t) − γ ∇_w V̂_w(s_{t+1}))    (7.21)

where μ denotes the attenuation factor of the traces.
Now we can use the traces Δw_d and Δw_rg as estimates for ΔW_D and ΔW_RG in the adaptive β calculation.
In general we can not say which algorithm works best. This again depends on the problem and in particular on the used function approximator.
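A few lines of C++ are enough to illustrate the adaptive β choice of equation 7.19 from two estimated epoch-wise gradients; the clipping to [0, 1] and the small safety margin ε follow the description above, while the function name itself is only illustrative.

#include <vector>

// Adaptive beta (eq. 7.19): smallest beta in [0,1] such that the combined update
// still forms an acute angle with the residual gradient estimate.
double adaptiveBeta(const std::vector<double> &dWd,   // estimated direct gradient update
                    const std::vector<double> &dWrg,  // estimated residual gradient update
                    double epsilon)
{
    double d_dot_rg = 0.0, rg_dot_rg = 0.0;
    for (std::size_t i = 0; i < dWd.size(); ++i)
    {
        d_dot_rg  += dWd[i]  * dWrg[i];
        rg_dot_rg += dWrg[i] * dWrg[i];
    }
    double denom = d_dot_rg - rg_dot_rg;              // (dWd - dWrg) . dWrg
    if (denom == 0.0)
        return 0.0;                                   // gradients already agree
    double beta = d_dot_rg / denom + epsilon;
    if (beta < 0.0 || beta > 1.0)
        beta = 0.0;                                   // angle already acute: fastest learning
    return beta;
}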
7.2.4 Generalizing the Results to TD-Learning
These three weight update schemes can be generalized to all the discussed value-based learning algorithms; we just need to adapt the choice of our residual function. Each of the proposed algorithms can be implemented with one of the three gradient calculation methods. Here we show the equations for learning either the V-Function or the Q-Function with TD(0)-Learning and the residual algorithm.
TD V-Function Learning:
Δw = α (r(s_t, a_t, s_{t+1}) + γ V̂_w(s_{t+1}) − V̂_w(s_t)) · (∇_w V̂_w(s_t) − β γ ∇_w V̂_w(s_{t+1}))    (7.22)
TD Q-Function Learning:
Δw = α (r(s_t, a_t, s_{t+1}) + γ Q̂_w(s_{t+1}, a_{t+1}) − Q̂_w(s_t, a_t)) · (∇_w Q̂_w(s_t, a_t) − β γ ∇_w Q̂_w(s_{t+1}, a_{t+1}))    (7.23)
7.2.5 TD() with Function approximation
We derived the equations for TD(0) learning, but as we have already seen, the use of e-traces can considerably improve the learning performance. Can we use eligibility traces with function approximation? With function approximation we do not have any discrete state representation, so we cannot calculate the responsibility of a state for a TD update. But we can use the current gradient as a sort of state representation and calculate eligibility traces for the weights of the approximator instead of eligibility traces for states. For the direct gradient method we can use the same justification for using eligibility traces as for the discrete state model: we change the value of states which are likely to be responsible for an occurred TD-error. The direct gradient method has been extensively studied; the strongest theoretical results from Tsitsiklis and Van Roy [52] prove that the algorithm converges with linear function approximation when learning is performed along the trajectories given by a fixed policy (policy evaluation). The policy can then be improved by discrete policy improvement / policy evaluation steps.
But for the residual and residual gradient algorithm we do not update the value of states any more, we minimize the error function E at each step. So can we still use e-traces? Although this problem has, to our knowledge, not been addressed in the literature, we answer this question with yes. If we minimize the error function at step t, which is given by E_t = 1/2 · [r_t + γ V̂(s_{t+1}) − V̂(s_t)]^2, and the residual is positive, the value of the current state s_t will increase (and the value of the next state s_{t+1} decrease); as a result the residual from step t − 1 will also increase (since V̂(s_t) increased). Consequently it makes sense to update E(t − 1) also with a positive residual error. This suggests that using eligibility traces makes sense even if we calculate the gradient of an error function instead of the gradient of a value function.
For the e-traces we can again use replacing or non-replacing traces:
Non-replacing e-traces: Here the updates of the eligibility traces are simply summed up:

e_{t+1} = γλ · e_t − ∇_w residual(s, s')    (7.24)
Replacing e-traces: In this case it becomes more complicated, because the eligibility traces can now have different signs. We decided on the following approach:

e_{t+1}(w_i) = absmax( γλ · e_t(w_i), −∇_{w_i} residual(s, s') )   if sign(−∇_{w_i} residual(s, s')) = sign(e_t(w_i))
e_{t+1}(w_i) = −∇_{w_i} residual(s, s')                            otherwise    (7.25)
If the eligibility trace e_t(w_i) for weight w_i and the current negative derivative of the residual, −∂residual(s, s')/∂w_i, have the same sign, the value with the larger magnitude is chosen. So the largest weight update from the past is kept as long as the sign does not change. Otherwise, if the e-trace and the derivative point in different directions, just the value of the derivative is used and the old e-trace value is discarded. This is done because the updates from the past are likely to contradict the current weight update if they point in a different direction. This approach was empirically evaluated to work well, in most cases better than the accumulating e-traces algorithm.
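The replacing-trace rule 7.25 translates into a short update routine like the one below, where each weight keeps its own trace entry; the sign handling and the absmax selection follow the description above, and the names are again only illustrative (the Toolbox's versions live in CGradientVETraces and CGradientQETraces).

#include <cmath>
#include <vector>

// Replacing eligibility traces over the weights (eq. 7.25).
// gradResidual[i] is the derivative of the residual w.r.t. weight i.
void updateReplacingTraces(std::vector<double> &e,
                           const std::vector<double> &gradResidual,
                           double gamma, double lambda)
{
    for (std::size_t i = 0; i < e.size(); ++i)
    {
        double decayed = gamma * lambda * e[i];
        double current = -gradResidual[i];          // negative residual derivative
        if ((current >= 0.0) == (e[i] >= 0.0))
            // Same sign: keep whichever update has the larger magnitude (absmax).
            e[i] = (std::fabs(decayed) > std::fabs(current)) ? decayed : current;
        else
            // Different sign: discard the old trace, use only the current derivative.
            e[i] = current;
    }
}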
Both approaches rely on the assumption that ∇_w residual(s, s') is constant over time, which is in general not true, because changing the weights in one state usually also changes the gradient of the value function in all other states. But if we assume only small weight changes during one episode we can use this approximation [15]. Only for linear function approximators is this assumption always true.
Another approach to eligibility traces is to store the state vectors from the past (or from the last n steps) and to recalculate the gradient of all these states at each step. This approach gets rid of the assumption that the gradient is constant over time, but it is obviously computationally considerably more expensive.
7.2.6 Implementation in the RL Toolbox
Since the different gradient calculation schemes can be used for any value based algorithm, a general implementation of the gradient calculation is needed.
The Residual Functions
In our approach we designed individual interfaces for defining the residual (CResidualFunction) and for defining the gradient of the residual (CResidualGradientFunction). The residual interface gets the old value V_t, the new value V_{t+1}, the reward and the duration of the step as input. The residual gradient interface gets the gradient of the V-Function in the old state, ∇_w V̂(s_t), the gradient in the new state, ∇_w V̂(s_{t+1}), and the duration of the step as input. It has to return the gradient of the residual. The duration is used to calculate the corresponding SMDP updates (exponentiate the γ value). For both interfaces there is no difference whether these values come from a Q-Function or a V-Function.
For the residual function we define at the moment only one class, which calculates the standard residual function

residual(V_t, r_t, V_{t+1}) = r_t + γ V_{t+1} − V_t

We will define additional residuals later.
For the gradient calculation interface we define one class for calculating the direct gradient (it just returns the gradient of state s_t) and one for the residual gradient.
For the residual algorithm, we provide the class CResidualBetaFunction, which superimposes the direct and the residual gradient vector with the β variable in the described way. The β value is calculated by a separate interface, CAbstractBetaCalculator, which also gets the direct and the residual gradient as inputs. There are two implementations of this interface.
CConstantBetaCalculator: Always returns a constant β value, which can be set through the parameter interface.
CVariableBetaCalculator: Calculates the best beta value for the given direct and residual gradient (see 7.2.3).
Using Eligibility Traces
The e-traces classes for tracing the gradient are basically the same as for the discrete state representation, because from the point of view of the software functionality there is no difference between calculating e-traces for state indices or for weight indices. We add functions for directly adding a gradient to the e-traces list instead of adding a state collection object; the target value function is also updated directly through the gradient update function interface. The update method for replacing e-traces (see 7.25) also has to be changed slightly. These extensions were made for the e-trace classes for value and action value functions (CGradientVETraces and CGradientQETraces).
The TD-Gradient Learner classes
New learner classes for learning the value function (CVFunctionGradientLearner) and the action value function (CTDGradientLearner) with function approximation are created. Both classes are subclasses of the already existing corresponding TD-Learner classes. These classes additionally get a residual and a residual gradient function as input. At each step the values of the new and old states are calculated and then passed to the residual function. This residual error value is then used as the temporal difference. The gradient at the current state and at the next state is also calculated and passed to the residual gradient function object. The result is then used to update the gradient e-traces object of the learner. The rest of the functionality is inherited from the super classes.
The TD-Residual Learner classes
The variable β calculation is a special case of the gradient learner model, because in this case we need an estimate of the epoch-wise gradient, not only the gradient for a single step. If we use the previously defined gradient learner classes (CVFunctionGradientLearner and CTDGradientLearner) for the residual algorithm, only the single step gradients are used to calculate the β value. This obviously works for a constant β value, but for the variable β calculation we have to pass an epoch-wise estimate of the gradient to the β calculator interface.
We use the traces Δw_d and Δw_rg as discussed for the direct and the residual gradient vector to estimate them epoch-wise; here the usual eligibility traces classes are used and updated in the described way. The residual learner class also gets a beta calculator as input, so it can calculate the best β value for the estimated epoch-wise gradients with the adaptive β calculation class CVariableBetaCalculator. The residual learner classes (CVFunctionResidualLearner and CTDResidualLearner) also maintain two eligibility traces objects instead of one, one for the direct gradient update, e_d, and one for the residual gradient update, e_rg. After calculating the β value, the direct gradient e-trace is weighted by (1 − β) and the residual gradient e-trace by β for the weight update:

Δw = α ((1 − β) e_d + β e_rg) · residual(s, s')

Through this approach the best estimate of β is always used, even for error functions from the past (through the use of e-traces).
The residual learner classes are again designed for value function (CVFunctionResidualLearner) and action value function learning (CTDResidualLearner).
7.3 Continuous Time Reinforcement Learning
Doya [17] proposed a value based RL framework for continuous time dynamical systems that works without an a priori discretization of time, state and action space. This framework was also extensively used by Coulom [15] in his PhD thesis and by Morimoto [30], [29] in his experiments.
7.3.1 Continuous Time RL formulation
The system is now described by the continuous time deterministic system

ṡ(t) = f(s(t), u(t))    (7.26)

where s ∈ S ⊂ R^n and u ∈ U ⊂ R^m; S is the set of all states and U is the set of all possible actions.
Continuous Time Value-Functions
Again we want to find a (real valued) policy π(s) which maximizes the cumulative future reward, but in this case the equations are formulated in continuous time. Consequently we have to integrate the future reward signal over time to calculate the value of state s:

V^π(s(t_0)) = ∫_{t_0}^{∞} exp(−(t − t_0) · s_γ) · r(s(t), u(t)) dt    (7.27)

where s(t) and u(t) follow the system dynamics and the given policy, respectively. s_γ is the continuous discount factor and corresponds to the inverse decay time of the reward signal r. Again we define the optimal value function V*(s) = max_π V^π(s) for all s ∈ S. If we consider a time step of length Δt we can write the value function in its recursive form:

V^π(s(t_0)) = ∫_{t_0}^{t_0 + Δt} exp(−(t − t_0) · s_γ) · r(s(t), u(t)) dt + exp(−Δt · s_γ) · V^π(s(t_0 + Δt))    (7.28)
For small Δt values this equation can be approximated by

V^π(s(t_0)) ≈ Δt · r(s(t_0), π(s(t_0))) + (1 − s_γ · Δt) · V^π(s(t_0) + Δs)    (7.29)

with

Δs = f(s_0, π(s_0)) · Δt    (7.30)
The Hamilton-Jacobi-Bellman Equation
This is still similar to the discrete time equations, but now we can subtract V^π(s(t)) on both sides and divide by Δt:

0 = r(s, π(s)) − s_γ · V^π(s + Δs) + (V^π(s + Δs) − V^π(s)) / Δt    (7.31)
After performing the limit Δt → 0 we get the following equation:

0 = r(s, π(s)) − s_γ · V^π(s) + dV^π(s)/dt = r(s, π(s)) − s_γ · V^π(s) + (∂V^π(s)/∂s) · f(s, π(s))    (7.32)
A similar equation can be found for the optimal value function by always performing the greedy action; this equation is called the Hamilton-Jacobi-Bellman equation and is given by:

0 = max_{u∈U} [ r(s, u) − s_γ · V*(s) + (∂V*(s)/∂s) · f(s, u) ]    (7.33)
This is the continuous time counterpart of the Bellman Optimality Equation. We also define the Hamiltonian H for any value function V^π to be:

H(t) = r(s(t), u(t)) − s_γ · V^π(s(t)) + (∂V^π(s(t))/∂s) · f(s(t), u(t))    (7.34)

This definition is analogous to the definition of the discrete time Bellman residual or the temporal difference error. The Hamiltonian of an estimated value function V̂ equals 0 for all states s only if the estimate V̂ equals the real value function V^π.
7.3.2 Learning the continuous time Value Function
Now that we have derived the continuous time residual we can use the same techniques as for discrete time. Basically we want to minimize the error function

E = 1/2 · Σ_t H(t)^2    (7.35)

per step with gradient descent techniques.
Updating the Value and the Slope
The most obvious way is to take the Hamiltonian as it is and calculate either the direct gradient, the residual gradient or the residual algorithm's weight update. For example, the weight update of the residual gradient algorithm is given by

Δw = −α ∇_w E(t) = α H(t) · [ s_γ · ∇_w V̂_w(s_t) − ∇_w (∂V̂_w(s_t)/∂s) · f(s_t, u_t) ]    (7.36)

There are two problems that arise for this method. The algorithm can only be used if the process has continuous and differentiable state dynamics. But many interesting problems have discrete deterministic discontinuities, like a mechanical shock that causes a discontinuity in the velocity. In this case we can not calculate the derivative ṡ(t) = f(s_t, u_t). The second potential problem is the symmetry in time of the Hamiltonian: the value function update is only calculated with the current state s_t. This symmetry in time is reported by Doya and Coulom to be a severe problem that causes the algorithm to blow up.
Approximating the Hamiltonian
The Hamiltonian H can be approximated by replacing the time derivative V̇(t) by an approximation. This approximation usually contains the asynchronous time information (thus the value of the next state). In the literature we can find two different approximation schemes:
Euler Differentiation: The time derivative of V is approximated by the difference quotient dV(t)/dt ≈ (V(t + Δt) − V(t)) / Δt. As a result our residual looks the following way:

residual(t) = r(t) + 1/Δt · [ (1 − s_γ · Δt) · V(t + Δt) − V(t) ]    (7.37)

This method was proposed and used by Doya. By setting a fixed step size Δt, scaling the value function through V_d = 1/Δt · V(t) and setting γ = 1 − s_γ · Δt, the Euler TD-error coincides with the conventional TD-error td_d(t) = r(t) + γ V_d(t + 1) − V_d(t).
Complete Interval Approximation: Here we additionally approximate the value of V(t) in the interval [t, t + Δt] by the average of the values at the interval limits:

residual(t) = r(t) − s_γ · (V(t + Δt) + V(t)) / 2 + (V(t + Δt) − V(t)) / Δt    (7.38)
            = r(t) + 1/Δt · [ (1 − s_γ·Δt/2) · V(t + Δt) − (1 + s_γ·Δt/2) · V(t) ]    (7.39)

This method was proposed and used by Coulom.
Both methods are just slightly different approximation schemes, so they are supposed to give the same results. Again we can use the direct gradient, the residual gradient or the residual algorithm for these residuals.
Comparing the residuals to the discrete case we can see that, because of the approximation of the Hamiltonian H, which eliminates the derivative of the value function V̇, the main difference is a different weighting of the current reward with respect to the value function. The magnitude of the continuous time value function is Δt times smaller than in the discrete case, but the influence of the value function in the residual calculation is therefore 1/Δt times higher. As a result of this scaling of the value function in the residual calculation we have to use smaller learning rates in order to avoid divergence of the algorithm. Usually the residual is multiplied by Δt to achieve that (to calculate Δw from the gradient). Consequently we get the following residual function (e.g. for the Euler residual):

residual(t) = r(t) · Δt + (1 − s_γ · Δt) · V(t + Δt) − V(t)

If we set γ = (1 − s_γ · Δt) we can see an additional view of continuous RL, namely that the reward is weighted by the time step Δt.
7.3.3 Continuous TD()
Of course we can also use eligibility traces for the continuous time formulation. Coulom derived the e-traces equations in continuous time for the direct gradient algorithm. In this thesis we will not look at this derivation, but we will use his results and extend them to the residual gradient and the residual algorithm. The continuous time eligibility traces for the direct gradient algorithm are given by

ė = −(s_γ + s_λ) · e + ∇_w V̂_w(s(t))    (7.40)

This equation can easily be extended to the residual or residual gradient algorithms with the same justifications we used for the discrete time case:

ė = −(s_γ + s_λ) · e − ∇_w H(t)    (7.41)

The weight update using e-traces is given by

Δw = α · H(t_0) · Δt_0 · e(t_0)    (7.42)

These equations again assume that the gradient of H(t) is independent of the weight vector, which is only true for linear approximators.
By using a fixed discretization step Δt_i = Δt we get the following discretized equation:

e(t) = e(t − 1) + Δe = e(t − 1) + Δt · ė = (1 − (s_γ + s_λ) · Δt) · e(t − 1) − ∇_w H(s_t, s_{t+1}) · Δt    (7.43)

which is the same as in the discrete time TD(λ) algorithm if we set γλ = (1 − (s_γ + s_λ) · Δt) and use a 1/Δt times higher learning rate (which is already included in the residual calculation in the Toolbox, so the same range of learning rates can be used for discrete and continuous time learning).
The continuous time TD(λ) equations are almost the same as in the discrete case; the differences are:
Another set of parameters is used: s_γ and s_λ instead of γ and λ.
The value function is scaled by a factor of 1/Δt in the residual calculation, resulting in a higher emphasis of the V-Function in comparison to the reward function.
The residual (the TD error) is calculated slightly differently, depending on the used approximation scheme.
So for value function learning there are only small differences, and there is nothing really new compared to the discrete algorithm. The big difference to discrete time learning would be the incorporation of the gradient knowledge dV(s)/ds, but this is reported to be unstable in the proposed way; using the gradient information in some other way could nevertheless be an approach worth investigating. There are, however, quite a few differences and advantages in the action selection part, which we will discuss in the following section.
7.3.4 Finding the Greedy Control Variables
Although there is hardly any difference in learning the value function for continuous time learning, we can use the state dynamics of the system in continuous time RL for action selection, which gives us the advantage of using the system dynamics as a sort of prior knowledge (comparable to V-Function planning).
For the continuous case we defined the optimal policy to maximize the Hamiltonian H:

π(s) = argmax_{u∈U} [ r(s, u) − s_γ · V*(s) + (∂V*(s)/∂s) · f(s, u) ] = argmax_{u∈U} [ r(s, u) + (∂V*(s)/∂s) · f(s, u) ]    (7.44)

This is the continuous counterpart of V-Function planning. The advantage is that we do not need to predict all next states; the gradient of the value function (which has to be calculated only once) and the state derivative f(s, u) are used instead. So this approach is computationally cheaper (if the gradients can be calculated easily), but the action is only optimal for an infinitely small time interval Δt, so in general it will not find solutions as good as the standard planning technique. We also still have to use either a discretized action set or complex optimization techniques to find the optimal control vector.
Value Gradient Based Policies
If we are using a model that is linear with respect to the control variables (f(s, u) = B(s) · u + a(s), see 7.1.5) and the reward signal r(s, u) has certain properties, the optimization problem of 7.44 has a unique solution and we can find a closed form expression of the greedy policy. The greedy action is now defined through the equation

π'(s) = argmax_{u∈U} [ r(s, u) + (∂V*(s)/∂s) · B(s) · u ]    (7.45)
If the reward signal is independent of the action u (so r(s, u) = r(s)) and the control vector is limited to the interval [u_min, u_max], the greedy action can easily be found by looking at the signs of (∂V*(s)/∂s) · B(s). If a component of this vector is positive, the maximum control value should be taken, otherwise the minimum control value. This kind of policy is called the optimal bang-bang control policy:

π(s) = u_min + (sign((∂V*(s)/∂s) · B(s)) + 1) / 2 · (u_max − u_min)    (7.46)
The bang-bang control law always chooses the limit values of the control variables. Although the bang-bang policy is optimal (for infinitely small time steps), the control is not smooth, and chattering can destroy physical systems. Therefore we can introduce a smoothed version of the bang-bang policy by smoothing out the sign(x) function. This is done with the logsig function, a sigmoid function that saturates at logsig(−∞) = 0 and logsig(∞) = 1:

π(s) = u_min + logsig( c · (∂V*(s)/∂s) · B(s) ) · (u_max − u_min)    (7.47)

The vector c specifies the smoothness of the control; for c → ∞ the policy becomes a bang-bang policy again. This control law smoothes out the chattering, but it is also less effective.
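The following fragment evaluates both control laws for a single control variable, given the value gradient ∂V/∂s and the control column B(s) of the linear model; it is only a numerical illustration of equations 7.46 and 7.47, not the CContinuousTimeAndActionSigmoidVMPolicy implementation.

#include <cmath>
#include <vector>

// Projection of the value gradient onto the control direction: (dV/ds) . B(s).
double valueGradientProjection(const std::vector<double> &dVds,
                               const std::vector<double> &B)
{
    double p = 0.0;
    for (std::size_t i = 0; i < dVds.size(); ++i)
        p += dVds[i] * B[i];
    return p;
}

// Optimal bang-bang control (eq. 7.46): take u_max if the projection is positive.
double bangBangControl(double projection, double uMin, double uMax)
{
    return (projection >= 0.0) ? uMax : uMin;
}

// Smoothed version (eq. 7.47): replaces sign() by a scaled logistic sigmoid.
double smoothedControl(double projection, double uMin, double uMax, double c)
{
    double logsig = 1.0 / (1.0 + std::exp(-c * projection));
    return uMin + logsig * (uMax - uMin);
}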
If the reward depends linearly on the control and/or the control variables are not limited by an interval but by a more complex convex region, the greedy action can be found by using a linear program, or a quadratic program for a reward with quadratic action costs (e.g. typical for energy-optimal control).
7.3.5 Implementation in the RL Toolbox
The design of the Toolbox has the same separation of value function learning and optimal action selection as discussed in the theory section. So we can interchange the different approaches and use, for example, continuous time RL for value function learning but another approach like V-Planning for action selection. In the Toolbox a fixed discretization time step Δt is used, as was done by Doya [17] and Coulom [15] in their experiments.
Learning the continuous time value function
As we have seen there are only a few differences to the standard TD techniques in continuous time RL. Actually the only difference is how the residual is defined. Our design of the TD-Learner classes already allows the definition of custom residual functions, thus it is convenient to use the same already existing classes and just add new residual classes. In continuous time TD(λ) the update rules of the weights and the e-traces are given by

Δw = α · H(t) · Δt · e(t)
e(t) = (1 − (s_γ + s_λ) · Δt) · e(t − 1) − ∇_w H(t) · Δt

In order to use the same TD learner classes, and to be able to use the same magnitude of learning rates, we multiply the continuous time residual H by the time step Δt. As a result we get rid of the multiplication with Δt in both equations. The difference in the factors used for attenuating the e-traces ((1 − (s_γ + s_λ) · Δt) instead of γλ) is neglected, because it is just another choice of parameters, and the choice of the λ parameter is more intuitive anyway. The Toolbox contains two additional continuous time residuals.
Euler Residual: Uses the standard Euler numerical differentiation to approximate the time derivative of V.

residual(r(t), V(t), V(t + 1)) = [ r(t) + 1/Δt · ((1 − s_γ · Δt) · V(t + 1) − V(t)) ] · Δt
                              = r(t) · Δt + (1 − s_γ · Δt) · V(t + 1) − V(t)    (7.48)

∇_w residual(∇_w V(t), ∇_w V(t + 1)) = (1 − s_γ · Δt) · ∇_w V(t + 1) − ∇_w V(t)    (7.49)
Coulom Residual: Approximates the Hamiltonian over the entire interval:

residual(r(t), V(t), V(t + 1)) = r(t) · Δt + (1 − s_γ·Δt/2) · V(t + 1) − (1 + s_γ·Δt/2) · V(t)    (7.50)

∇_w residual(∇_w V(t), ∇_w V(t + 1)) = (1 − s_γ·Δt/2) · ∇_w V(t + 1) − (1 + s_γ·Δt/2) · ∇_w V(t)    (7.51)
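For illustration, both continuous time residuals can be computed side by side as below; the function takes the current reward, the two value estimates, the time step Δt and the continuous discount factor s_γ. It is a plain numerical sketch, not the CResidualFunction subclasses themselves.

#include <iostream>

// Euler residual (eq. 7.48), already multiplied by dt as done in the Toolbox.
double eulerResidual(double r, double v, double vNext, double dt, double sGamma)
{
    return r * dt + (1.0 - sGamma * dt) * vNext - v;
}

// Coulom residual (eq. 7.50): averages the value over the interval [t, t + dt].
double coulomResidual(double r, double v, double vNext, double dt, double sGamma)
{
    return r * dt + (1.0 - 0.5 * sGamma * dt) * vNext - (1.0 + 0.5 * sGamma * dt) * v;
}

int main()
{
    double r = 1.0, v = 10.0, vNext = 10.5, dt = 0.01, sGamma = 0.5;
    std::cout << "Euler:  " << eulerResidual(r, v, vNext, dt, sGamma)  << "\n";
    std::cout << "Coulom: " << coulomResidual(r, v, vNext, dt, sGamma) << "\n";
    return 0;
}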
Action Selection
For continuous time V-Planning with a finite action set (i.e. the actions have been discretized) we use the same approach as for the discrete time V-Planning method. We build an extra read-only Q-Function class, CContinuousTimeQFunctionFromTransitionFunction, which implements the continuous time version of the action values:

Q(s, a) = r(s, a) + (∂V*(s)/∂s) · f(s, a)    (7.52)

This Q-Function can again be used with any stochastic policy in the Toolbox.
The derivative of the value function with respect to the input state is calculated numerically (by the class CVFunctionNumericInputDerivationCalculator) with the three point rule

dV(s)/ds_i = ( V(s + Δ_i e_i) − V(s − Δ_i e_i) ) / (2 Δ_i)    (7.53)

where Δ_i is a step size parameter for dimension i and e_i the corresponding unit vector. The step size can be chosen for each state variable separately. The class is a subclass of CVFunctionInputDerivationCalculator, so other approaches for calculating this derivative can be added easily. An analytic approach was not used because it would be very complex to design in general for linear feature functions, due to the adaptive feature state model of the Toolbox. But implementing the analytic approach for a single function approximation scheme would be worth trying because of the expected increase in speed and perhaps also in the performance of the policy.
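A minimal version of this numerical input derivative is shown below for an arbitrary value function passed in as a callable; the per-dimension step sizes correspond to the Δ_i above, and the code is an illustration rather than the CVFunctionNumericInputDerivationCalculator class.

#include <functional>
#include <vector>

// Central-difference derivative of V with respect to each state variable
// (the "three point rule" of eq. 7.53).
std::vector<double> numericInputDerivation(
    const std::function<double(const std::vector<double>&)> &V,
    const std::vector<double> &s,
    const std::vector<double> &stepSizes)
{
    std::vector<double> dVds(s.size());
    for (std::size_t i = 0; i < s.size(); ++i)
    {
        std::vector<double> sPlus = s, sMinus = s;
        sPlus[i]  += stepSizes[i];
        sMinus[i] -= stepSizes[i];
        dVds[i] = (V(sPlus) - V(sMinus)) / (2.0 * stepSizes[i]);
    }
    return dVds;
}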
The implementation of the smooth value gradient policy is done by the class CContinuousTimeAndActionSigmoidVMPolicy. The class is a subclass of CContinuousActionController, so it directly calculates continuous action vectors and does not choose a particular action object. The same approach as above is used for calculating the derivative dV(s)/ds. Both approaches need a model which is linear with respect to the control variables (class CLinearActionContinuousTimeTransitionFunction).
7.4 Advantage Learning
Advantage Learning was proposed by Baird [6] as an improvement of the Advantage Updating algorithm [5], which is not covered in this thesis. Instead of learning the action value Q(s, a), the algorithm tries to estimate for each state-action pair <s, a> the advantage of performing action a instead of the currently considered best action a*. Thus this algorithm can usually only be used for a discrete set of actions, comparable to Q or SARSA learning.
The optimal advantage function A*(s, a) is defined to be

A*(s, a) = V*(s) + ( E[r(s, a) + γ^Δt · V*(s')] − V*(s) ) / (Δt · K)    (7.54)

Here γ^Δt is the discount factor per time step (consequently γ is the discount factor for one second), and K is the time unit scaling factor. The value of state s is defined to be the maximum advantage of state s (similar to the definition of the Q-Function):

V*(s) = max_a A*(s, a)    (7.55)
The advantage can also be expressed in terms of action values:

A(s, a) = V(s) − ( max_{a'} Q(s, a') − Q(s, a) ) / (Δt · K)    (7.56)
Under this definition, to provide a better understanding, we can see the advantage as the sum of the value of the current state plus the expected amount by which performing action a affects the total discounted reward. The second term is obviously zero only for an optimal action and negative for all suboptimal actions.
Another important aspect of Advantage Learning is that the advantage gets scaled by the time step. For Δt · K = 1 the algorithm completely coincides with the Q-Learning algorithm. In an optimal control task, as we choose a smaller Δt value, the Q-Values of state s will all approach the value V(s), because performing different actions has less consequence for small time steps. But for advantage learning the difference between the advantages of the actions stays the same, because they get scaled by the time step Δt.
7.4.1 Advantage Learning Update Rules
Advantage Learning is also a value based algorithm, so we just need to define a residual and we can use the theory about value function approximation discussed in the previous sections. The residual of advantage learning can easily be found by subtracting A*(s, a) on both sides of equation 7.54 and inserting equation 7.55:
residual(t) = ( r_t + γ^Δt · max_{a'} A(s_{t+1}, a') ) · 1/(Δt·K) + (1 − 1/(Δt·K)) · max_{a'} A(s_t, a') − A(s_t, a_t)    (7.57)

This gives us the following residual gradient:

∇_w residual(t) = γ^Δt/(Δt·K) · ∇_w max_{a'} A(s_{t+1}, a') + (1 − 1/(Δt·K)) · ∇_w max_{a'} A(s_t, a') − ∇_w A(s_t, a_t)    (7.58)
7.4.2 Implementation in the RL Toolbox
The structure of the advantage learning residual differs from our standard residual design because it depends on three values; additionally the optimal advantage of the current state is needed. As a result we can not use our residual gradient framework directly. For the advantage learning algorithm we derive an individual class from the Q-Function residual learner class. Just the calculation of the residual and the residual gradient is changed, the rest of the functionality remains the same. As a result we can use the residual algorithm for advantage learning with either a constant residual factor β or the optimal β factor. Again the direct gradient algorithm is obtained by choosing β = 0.0 and the residual gradient algorithm by choosing β = 1.0.
7.5 Policy Search Algorithms
Policy search approaches try to find a good policy directly in the policy parameter space; no value function is learned. Often the gradient of the value of a policy, ∇V(θ) (or of some other performance measure), with respect to the policy parameters is estimated and then used to improve the policy. These methods are usually referred to as policy gradient approaches. But any other kind of optimization method can also be used for improving the policy, like genetic algorithms, simulated annealing or swarm optimization techniques. Policy search algorithms try to avoid learning the value function; this can be advantageous because learning the (optimal) value function is usually very difficult. Moreover, when using value function approximation, even a good approximation of the value function can yield a bad greedy policy, which is another disadvantage of value function learning (see 7.2).
The disadvantage of policy search methods is that the increase of performance is harder to estimate when no value function is learned. When searching in the policy parameter space with gradient descent, we are also likely to end up in a local maximum of the performance measure.
In this thesis we will only discuss policy gradient algorithms. We will consider either a stochastic policy π(s, a, θ) or a deterministic policy π(s, θ), which is parameterized by θ. As usual we want to maximize the expected total discounted reward V, which for a finite set of states is given by

V = Σ_{s∈D} d(s) · V(s)    (7.59)

where D is the set of initial states and d(s) is the probability distribution over these states. Consequently all policy gradient algorithms perform gradient ascent on V, and thus we get the weight update rule (we will again refer to the gradient dV/dθ as ∇_θ V, or short ∇V):

Δθ = α · ∇_θ V    (7.60)

In the next sections we will first discuss a method for updating the weights given the gradient direction and the learning rate, then we will cover two different approaches for learning rate adaptation, and in the end we will come to two approaches for estimating the gradient direction: GPOMDP [11] for stochastic policies using a discrete action set, and PEGASUS [33], which also works for real valued policies.
7.5.1 Policy Gradient Update Methods
At first we will discuss methods for updating the weights if we already have an estimate of the gradient. Baxter [11] proposed the CONJPOMDP algorithm, which uses a variant of the Polak-Ribiere conjugate gradient descent algorithm for the weight updates. Although Baxter uses CONJPOMDP together with his GPOMDP algorithm, it can be used with any other gradient estimation approach. The algorithm is listed in algorithm 6.
Algorithm 6 CONJPOMDP
  g = h = getGradient()
  while ||g|| ≥ ε do
    α = getLearningRate(h, θ)
    θ = θ + α · h
    Δ = getGradient()
    γ = ((Δ − g) · Δ) / ||g||²
    h = Δ + γ · h
    if h · Δ < 0 then
      h = Δ
    end if
    g = Δ
  end while
The algorithm terminates when the norm of the gradient is smaller than a given constant ε. getGradient returns an estimate of the gradient direction and getLearningRate provides a good choice of the learning rate.
7.5.2 Calculating the learning rate
In this section we will discuss two different algorithms for calculating the learning rate; both are based on line search approaches. The estimation of the gradient usually needs a lot of training examples, so we have to exploit this information optimally. Gradient estimation schemes usually tell us the direction of the optimal update, but they do not tell us the step size of that update.
Value Based Line Search
This approach is a straightforward line search. Given a list of possible step sizes, it applies all step sizes and estimates the expected discounted reward V by simulation. At the end we can either search further between the given step sizes or immediately return the best step size. We estimate the V value by n different simulation trials with different initial states.
The disadvantage of this approach is that the list of step sizes has to be given, and the expected discounted reward estimates can be very noisy. Since we have to estimate the best learning rate, we have to compare the expected discounted rewards V(θ_i), thus we have to calculate sign[V(θ_i) − V(θ_j)]. If the estimates are noisy, the variance of sign[V(θ_i) − V(θ_j)] approaches 1.0 (the maximum) as θ_i approaches θ_j. But if we use the same initial state set for all estimations of V(θ_i) (which removes a lot of noise from the estimates), this effect can be reduced to a minimum, at least for the deterministic benchmark problems.
Gradient Based Line Search
Gradient Based Line Search (GSearch) was proposed by Baxter [11] and used with the GPOMDP algorithm. GSearch tries to find two points θ_1 and θ_2 in the direction of the current estimated gradient Δ = ∇V(θ_0) such that

∇V(θ_1) · Δ > 0    and    ∇V(θ_2) · Δ < 0    (7.61)

The maximum must lie between these two points. The advantage of this approach is that, even for noisy estimates, the variance of sign(∇V(θ_i) · Δ) is independent of the distance between the two parameter vectors. Calculating many new gradients only to update the parameter vector with a single gradient vector does not seem very effective, but we can use much noisier gradients (calculated from fewer training examples) for estimating the step size than the gradient used for updating the weights.
The GSearch algorithm starts with an initial step size α_0 and calculates a new gradient at this position. If its product with the given update gradient is positive, the algorithm doubles the step size at each trial until the product of the newly calculated gradient and the update gradient becomes negative (this condition is actually relaxed to falling below a certain ε in order to add robustness against errors in the gradient estimates). If the product of the first gradient with the update gradient is negative, the same procedure is carried out with the step size halved at each trial. The search is restricted to a given interval. The last two search points α_{k−1} and α_k either bracket the maximum (sign(∇V(θ_1)·Δ) ≠ sign(∇V(θ_2)·Δ)), or no maximum was found in the search interval (the last two search points are then close to one of the search limits). In either case the last two step sizes are used to determine the step size that is applied.
If a maximum was found (a positive and a negative gradient product), the two step sizes are used to estimate the location of the maximum between them by maximizing a quadratic defined by these two points and the slope information coming from the gradients. If no maximum was found, the algorithm applies the midpoint of the last two search step sizes.
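The interpolation step at the end can be sketched in a few lines: given two bracketing step sizes α_1 < α_2 with directional slopes s_1 > 0 and s_2 < 0 (the gradient products), the maximum of the fitted quadratic is where the linearly interpolated slope crosses zero. This is a generic helper, not code taken from the Toolbox.

// Given two step sizes that bracket the maximum (slope1 > 0 at alpha1, slope2 < 0 at alpha2),
// return the step size where the slope of the fitted quadratic vanishes, i.e. its maximum.
double interpolateMaximum(double alpha1, double slope1, double alpha2, double slope2)
{
    // The slope of a quadratic is linear in alpha; find its zero crossing between alpha1 and alpha2.
    return alpha1 + slope1 * (alpha2 - alpha1) / (slope1 - slope2);
}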
7.5.3 The GPOMDP algorithm
This algorithm was proposed by Baxter [9] [11] to estimate the gradient of a stochastic, parameterized policy for a partially observable Markov decision process (POMDP).
At first we consider an MDP with finite state space S. Given the parameterized stochastic policy π(s, a, θ), the stochastic matrix P(θ) = [p_ij(θ)] defines the transition probabilities from state s_i to state s_j. For the GPOMDP algorithm, we have to make some additional assumptions on the Markov chains created by the transition matrix P(θ). Each Markov chain M(θ) following the transition matrix P(θ) has to have a unique stationary distribution d(θ) = [d(s_0, θ), d(s_1, θ), ..., d(s_n, θ)] which satisfies the balance equation

d(\theta)^T P(\theta) = d(\theta)^T \qquad (7.62)

The stationary distribution gives the probability of being in state s after infinitely many transitions according to P(θ); it is independent of the initial state. The spectral resolution theorem [25] states that the distribution of the states converges to the stationary distribution at an exponential rate; the time constant of this rate is called the mixing time. The stationary distribution (if it exists) is the first left eigenvector (the eigenvector with the highest eigenvalue) of the transition probability matrix and has the eigenvalue λ_1 = 1. The mixing time is governed by the second eigenvalue λ_2 of P(θ).
The algorithm optimizes the average reward criterion

V_A(\theta) = \lim_{T \to \infty} \frac{1}{T} \sum_{t=0}^{T} r_t = d(\theta)^T r \qquad (7.63)

where r is the reward vector [r(s_0), r(s_1), r(s_2), ...]. The algorithm can also be extended to action dependent rewards, but this is not done in this thesis. Note that optimizing the average reward and optimizing the total discounted reward are theoretically equivalent; furthermore it can be shown that the discounted reward criterion V can be expressed through V_A [9]:

V = \frac{V_A}{1 - \gamma} \qquad (7.64)

so we do not lose any generality by optimizing the average reward instead of the discounted reward.
Baxter [9] proved that, in the limit β → 1, the gradient of the average reward, ∇V_A(θ) = ∇(d(θ)^T r), is approximated arbitrarily well by ∇_β V_A(θ), which is given by

\nabla_\beta V_A(\theta) = d(\theta)^T\, \nabla P(\theta)\, V_\beta = \sum_{i,j} d(i,\theta)\, \nabla p_{ij}(\theta)\, V_\beta(j) = \sum_{i,j} d(i,\theta)\, p_{ij}(\theta)\, \frac{\nabla p_{ij}(\theta)}{p_{ij}(\theta)}\, V_\beta(j) \qquad (7.65)

V_β measures the merit of a state s and corresponds to the usual definition of the value function:

V_\beta(s) = E\left[ \sum_{i=0}^{\infty} \beta^i\, r(s_i) \,\middle|\, s_0 = s \right]

The variable β sets the bias-variance trade-off: for β = 1 we have an unbiased estimator of the gradient, but with high variance; small values of β give lower variance, but the estimate ∇_β V_A(θ) might not even be close to the needed gradient ∇V_A(θ).
The term ∇p_ij(θ)/p_ij(θ) = ∇ ln p_ij(θ) can be interpreted as making the transition p_ij more probable. Intuitively, the equation therefore increases the probability of transitions to states with a high performance measure (V_β) more strongly than transitions to states with a smaller performance measure. This gradient can be rewritten for a stochastic policy π(s, a, θ) and action dependent transition probabilities p_ij(a). For the transition probability we can write p_ij(θ) = Σ_a p_ij(a) π(s_i, a, θ) and thus ∇p_ij(θ) = Σ_a p_ij(a) ∇π(s_i, a, θ). Inserting this into equation 7.65, we get the following gradient calculation rule:

\nabla_\beta V_A(\theta) = \sum_{i,j,a} d(i,\theta)\, p_{ij}(a)\, \frac{\nabla \pi(i,a,\theta)}{\pi(i,a,\theta)}\, \pi(i,a,\theta)\, V_\beta(j) \qquad (7.66)
GPOMDP uses one step samples of this equation to estimate ∇_β V_A(θ); the algorithm is listed in algorithm 7.
Algorithm 7 GPOMDP gradient estimation
z_0 = 0
Δ_0 = 0
for each new step <s_t, a_t, r_t, s_{t+1}>, t ∈ {1 ... T} do
    z_{t+1} = β·z_t + ∇ log π(s_t, a_t, θ)
    Δ_{t+1} = Δ_t + r_t·z_{t+1}
end for
Δ_T = Δ_T / T
return Δ_T
It uses two traces for each weight, z_t and Δ_t. Baxter [9] proved that the series Δ_t gives an unbiased estimate of equation 7.66, i.e. that

\lim_{t \to \infty} \Delta_t = \nabla_\beta V(\theta) \qquad (7.67)

In combination with the result that lim_{β→1} ∇_β V(θ) = ∇V(θ), this proves that GPOMDP produces unbiased estimates of ∇V(θ) if we set β to 1.0.
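The per-step part of the estimator can be summarized by the following sketch, which accumulates the two traces of Algorithm 7 for a parameter vector of fixed size. The arrays of log-policy gradients and rewards are assumed to have been collected beforehand; this is an illustration of the estimator, not the Toolbox class.

#include <vector>

// One GPOMDP gradient estimate over T steps (Algorithm 7).
// logPolicyGradients[t] holds grad log pi(s_t, a_t, theta); rewards[t] holds r_t.
std::vector<double> gpomdpEstimate(const std::vector<std::vector<double>> &logPolicyGradients,
                                   const std::vector<double> &rewards, double beta)
{
    const std::size_t n = logPolicyGradients.front().size();
    std::vector<double> z(n, 0.0), delta(n, 0.0);
    const std::size_t T = rewards.size();
    for (std::size_t t = 0; t < T; ++t) {
        for (std::size_t i = 0; i < n; ++i) {
            z[i] = beta * z[i] + logPolicyGradients[t][i];  // eligibility trace z_{t+1}
            delta[i] += rewards[t] * z[i];                  // accumulated gradient trace
        }
    }
    for (std::size_t i = 0; i < n; ++i) delta[i] /= static_cast<double>(T);
    return delta;
}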
7.5.4 The PEGASUS algorithm
PEGASUS stands for Policy Evaluation-of-Goodness And Search Using Scenarios and was proposed by Ng and Jordan [33]. They successfully used the PEGASUS algorithm to control an inverted helicopter flight in simulation and also obtained good results when using the learned policies on a real model helicopter. The PEGASUS algorithm can be used to learn any stochastic or deterministic policy, but it makes additional assumptions on the model, with the consequence that only simulated tasks can be learned.
A big problem for policy search algorithms is the noise in the performance estimates; consequently it is usually hard to decide which of two policies is better. The noise can be introduced by different initial state samples, by different (lucky or unfortunate) noise in the model, and also by the controller if we use some sort of exploration policy. As performance measure we again use the expected value of a policy, V(π). If we want to refer to the value of the policy in the (PO)MDP M we write V_M(π).
Converting stochastic (PO)MDPs to deterministic (PO)MDPs
PEGASUS solves this problem by making additional assumptions about the model. In optimal control, the MDP (or POMDP if parts of the model state are not observable) is defined via a generative model s_{t+1} ∼ f(s_t, a_t), which is usually a stochastic function. For PEGASUS we assume a stronger model: we use the deterministic function g : S × A × [0,1]^p → S for the state transition. The function g additionally depends on p random variables, so that g(s, a, p) with a uniformly distributed vector p has the same distribution as the stochastic transition function f(s_t, a_t). Consequently the function g has an additional input vector that specifies the internal random process of f. The model g is called a deterministic simulative model. From probability theory it is known that any distribution can be sampled by transforming one or more samples from the uniform distribution, so we can construct a deterministic simulative model for every generative model f. The deterministic model is obviously a stronger assumption than a generative model, but simulated tasks, which typically use uniform random samples from a random generator to simulate noise, already have this interface to the random generator indirectly, so the assumption of a deterministic simulative model is not a severe restriction when a simulated model is used.
Having access to the deterministic model g, it is easy to transform an arbitrary (PO)MDP M into an equivalent (PO)MDP M' with deterministic transitions. The transformation is accomplished by adding an infinite number of uniformly sampled random state variables to the initial state, s'_0 = <s_0, p_0, p_1, ...>. For simplicity we assume scalar p values; the extension to vectors is trivial. The transition s_{t+1} ∼ f(s_t, a_t) is now replaced by the transition

s'_{t+1} = <s_{t+1}, p_{t+1}, p_{t+2}, ...> = <g(s_t, a_t, p_t), p_{t+1}, p_{t+2}, ...>

Consequently the randomness of the MDP is fixed at the beginning of each episode. The rest of the MDP M remains the same in the MDP M'; the policy and the reward function R still depend only on the original state space, not on the additional random state variables. The initial state distribution D' now consists of the initial state distribution D of the original MDP and infinitely many uniform distributions for the random state variables.
The question is whether it makes sense to use the same random numbers for two different policies, because the random number p_t might affect the performance measure positively when following policy π_1, but it might have a negative effect when following policy π_2 because of the different context. Thus the performance estimates can still be noisy, because the random variables p do not provide exactly the same conditions for all policies.
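The idea of the deterministic simulative model can be illustrated with a small sketch: the stochastic transition draws its noise from a pre-drawn list of uniform samples that belongs to the scenario, so repeating an episode with the same list (and the same policy) reproduces exactly the same trajectory. The transition function used here is purely illustrative.

#include <vector>
#include <cstddef>

// A scenario fixes the randomness of one episode: the initial state plus a list of
// uniform [0,1] samples that the deterministic simulative model g(s, a, p) consumes.
struct Scenario {
    double initialState;
    std::vector<double> uniforms;
    std::size_t next = 0;
    double drawUniform() { return uniforms[next++]; }
};

// Illustrative deterministic simulative model: the same dynamics as a generative model
// s' = s + a + noise, but the noise is derived from the scenario's uniform sample.
double deterministicModel(double s, double a, Scenario &scenario)
{
    double p = scenario.drawUniform();        // the extra input of g(s, a, p)
    double noise = 0.2 * (2.0 * p - 1.0);     // transform the uniform sample into noise
    return s + a + noise;
}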
PEGASUS Policy Search Methods
If only the original state space s is observed during an episode, one obtains a sequence that is drawn from the same distribution as would have been generated by the original MDP M. Thus a policy has the same expected value in M as in M' (V_{M'}(π) = V_M(π)), and as a result we can optimize V_{M'}(π) instead of V_M(π).
The value of the policy is given by

V_{M'}(\pi) = E_{s_0 \sim D'}\left[ V^{\pi}_{M'}(s_0) \right]

The expectation can be estimated by choosing n samples from the initial state distribution D':

V_{M'}(\pi) \approx \frac{1}{n} \sum_{i=1}^{n} V^{\pi}_{M'}(s_0^i) \qquad (7.68)
The value of an initial state, V^π(s_0^i), can be calculated by simulation, summing up the discounted rewards. The true value is an infinite sum; a standard approximation is to truncate the sum and use only the first H reward values:

V^{\pi}_{M'}(s_0) = \sum_{t=0}^{H} \gamma^t\, r(s_t, a_t, s_{t+1}) \qquad (7.69)

Since we use a finite horizon H, we can restrict the initial state to H random variables; the rest are never needed. For a given approximation error ε, H can be chosen as H_ε = \log_\gamma\left(\frac{\epsilon(1-\gamma)}{2 r_{max}}\right), where r_max is the maximum absolute reward value. Due to the fixed randomization of the (PO)MDP, the value of the policy V_{M'}(π) is a deterministic function. Consequently we can use any standard optimization technique for finding a good policy. If, as in our case, the state, action and policy parameter spaces are continuous and all the relevant quantities are differentiable, we can use gradient ascent methods for the optimization. A common problem for gradient ascent is that the reward signal must be continuous and differentiable, which is not the case if, for example, rewards are given only for specific target states or regions. One approach to deal with this barrier, which is often used in RL for optimal control anyway, is to smooth out the reward signal and use, for example, a distance measure to the target state as reward. This method is often referred to as shaping the reward function. (For a more detailed discussion about using shaping for RL see [39].)
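Putting the pieces together, the scenario-based policy evaluation can be sketched as follows, reusing the Scenario struct from the earlier sketch; the callbacks policy, model and reward are placeholders, and the scenarios are passed by value so every evaluation starts from the same fixed random samples.

#include <vector>
#include <functional>

// Estimate V(pi) as the average truncated discounted return over n fixed scenarios.
double evaluatePolicyOnScenarios(std::vector<Scenario> scenarios,
                                 const std::function<double(double)> &policy,
                                 const std::function<double(double, double, Scenario&)> &model,
                                 const std::function<double(double)> &reward,
                                 double gamma, int H)
{
    double sum = 0.0;
    for (Scenario &scenario : scenarios) {
        double s = scenario.initialState, ret = 0.0, discount = 1.0;
        for (int t = 0; t < H; ++t) {          // truncate the return after H steps
            double a = policy(s);
            s = model(s, a, scenario);
            ret += discount * reward(s);
            discount *= gamma;
        }
        sum += ret;
    }
    return sum / static_cast<double>(scenarios.size());
}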
The Toolbox contains two ways of calculating the gradient, which are introduced in the next sections.
Calculating the gradient numerically
One obvious way to calculate the gradient is to use numerical methods. We use the three point rule to calculate the derivative of the value of the policy with respect to each policy parameter θ_i:

\frac{\partial V(\theta)}{\partial \theta_i} = \frac{V(\theta + \epsilon e_i) - V(\theta - \epsilon e_i)}{2\epsilon} \qquad (7.70)

Thus the value of the policy has to be estimated twice for each weight of the policy, which is computationally very expensive but gives quite accurate estimates of the gradient.
Ng and Jordan [33] used the numerical gradient for most of their experiments, but no detailed explanation was given of how they calculated the gradient numerically.
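A generic sketch of this central-difference estimate; evaluatePolicy is again a placeholder that returns the (scenario-based) value estimate for a given parameter vector.

#include <vector>
#include <cstddef>
#include <functional>

// Central difference estimate of dV/dtheta_i for every policy parameter (equation 7.70).
std::vector<double> numericalPolicyGradient(std::vector<double> theta,
        const std::function<double(const std::vector<double>&)> &evaluatePolicy, double eps)
{
    std::vector<double> grad(theta.size(), 0.0);
    for (std::size_t i = 0; i < theta.size(); ++i) {
        const double original = theta[i];
        theta[i] = original + eps;
        double plus = evaluatePolicy(theta);
        theta[i] = original - eps;
        double minus = evaluatePolicy(theta);
        theta[i] = original;
        grad[i] = (plus - minus) / (2.0 * eps);   // three point rule
    }
    return grad;
}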
Calculating the gradient analytically
The gradient of the value of the policy can also be calculated analytically. For simplicity we assume that the reward signal depends only on the current state and is differentiable with respect to the state variables.
The value of state s_0 is given by

V^{\pi}_{M'}(s_0) = r(s_0) + \gamma\, r(s_1) + \gamma^2\, r(s_2) + \dots \qquad (7.71)

Given the derivatives of the reward function, of the model and of the policy with respect to its parameters, we can also calculate the gradient of V^π_{M'}(s_0) analytically, with considerable savings in computation time:

\nabla_\theta V^{\pi}_{M'}(s_0) = \gamma\, \frac{dr(s_1)}{ds} \frac{ds_1}{d\theta} + \gamma^2\, \frac{dr(s_2)}{ds} \frac{ds_2}{d\theta} + \dots \qquad (7.72)
The derivative of the successor state s_{t+1} with respect to the policy parameters, ds_{t+1}/dθ, can be calculated incrementally from the derivative of state s_t:

\frac{ds_{t+1}}{d\theta} = \frac{d\, g(s_t, \pi(s_t,\theta), p)}{d\theta} = \begin{bmatrix} \frac{dg(s_t,\pi(s_t,\theta),p)}{ds} & \frac{dg(s_t,\pi(s_t,\theta),p)}{da} \end{bmatrix} \cdot \begin{bmatrix} \frac{ds_t}{d\theta} \\[4pt] \frac{d\pi(s_t,\theta)}{d\theta} \end{bmatrix} \qquad (7.73)
The derivative dπ(s_t, θ)/dθ can be further resolved:

\frac{d\pi(s_t,\theta)}{d\theta} = \begin{bmatrix} \frac{d\pi(s_t,\theta)}{ds} & \frac{\partial \pi(s_t,\theta)}{\partial \theta} \end{bmatrix} \cdot \begin{bmatrix} \frac{ds_t}{d\theta} \\[4pt] I \end{bmatrix} \qquad (7.74)
where I is a p × p identity matrix, p being the number of policy parameters. Thus we need to know the state derivatives dr(s)/ds, dg(s,a,p)/ds, dg(s,a,p)/da and dπ(s_t,θ)/ds. If an analytical solution for these gradients is not available, these quantities can also be calculated numerically with only a small loss of performance.
Note that, to our knowledge, this is the first time the gradient has been calculated analytically in this way; we will show in the experiments chapter that this approach is significantly faster than the numerical approach at a comparable learning performance.
7.5.5 Implementation in the RL Toolbox
The design of the policy gradient learner classes matches our description of the structure of policy gradient
methods. Since the policy gradient update scheme does not match the standard per step update in RL, policy
gradient learner classes cannot be used as pure agent listeners. The policy gradient estimator classes usually
have a listener part, to get informed about the sequence of steps, but they also have methods for controlling
the agent class (e.g. to tell the agent to simulate n episodes).
Policy Updater Classes
Policy updater classes (subclasses of CGradientPolicyUpdater) receive the estimated gradient as input and have to update the policy. To do so, the class is supposed to calculate a good learning rate and then directly update the policy via the gradient update function interface of the policy. There are three implementations of the updater class.
Constant Step Size Update (CConstantPolicyGradientUpdater): Almost self-explanatory, uses a
constant learning rate for each update.
Value Based Line Search (CLineSearchPolicyGradientUpdater): Implements the discussed value-based algorithm. For the value estimation, a policy evaluator object is used, hence we can use either the value or the average reward of a policy as performance measure. We can also set the number of episodes and steps per episode used for the performance estimation. The algorithm searches at the given step sizes and stores the performance. If there are any search steps left (searchSteps < maxSteps) after all given learning rates have been evaluated, the algorithm continues the search in the neighborhood of the maximum by searching in the middle of two adjacent points. Eventually the learning rate with the maximum value is applied to update the policy's parameters.
Gradient Based Line Search (CGSearchPolicyGradientUpdater): The GSearch class has a policy gradient estimator object as input, so it can calculate the gradient for a specified learning rate. This gradient estimator is usually a less accurate version of the gradient direction used for updating. The search is then done in the way discussed above and at the end the best learning rate is applied. We can set the search interval [α_min, α_max] and the initial learning rate α_0.
Policy Gradient Learner Classes
The task of the policy learner classes is to combine the functionalities of the gradient estimator and the gradient updater classes. In the Toolbox there is just one implementation, the CONJPOMDP algorithm, but averaging over the old gradient estimates can be turned off. The class has access to a policy updater object and a policy gradient estimator object. The gradient for the update is calculated according to algorithm 6 as the weighted average over former gradient estimates.
Policy Gradient Estimator Classes
The gradient estimates themselves are calculated by subclasses of CPolicyGradientCalculator; these classes usually have direct access to the agent. The gradient is again represented by a feature list, like the one already used for the gradient calculation of the value function. We now come to the different implementations of the estimator classes for the GPOMDP and the PEGASUS algorithms.
GPOMDP Gradient Estimation
The GPOMDP gradient estimator class (CGPOMDPGradientCalculator) implements the policy gradient estimator interface and also acts as an agent listener. The class has access to the agent; in the policy gradient estimator interface, the class adds itself to the agent listener list and executes the specified number of episodes and steps. In the agent listener interface, the class maintains the two traces z_t and Δ_t for the local gradient of one episode. The gradient ∇ log π(s_t, a_t, θ) is calculated by the stochastic policy interface (see 6.2.5). After each episode the local gradient of the episode is added to a global gradient object, which is finally returned by the policy gradient estimator interface.
PEGASUS Gradient Estimation
The transformation of the stochastic (PO)MDP M into the deterministic MDP M' is not done explicitly; instead we use an individual random generator function rlt_rand. For this random generator function we can set a list of random variables uniformly distributed in [0, 1]. If this list has been specified, values are taken from the list in ascending order instead of real random numbers. This has the same effect as a deterministic simulative model, provided that all simulated models use this random number generator instead of the standard one. The problem is that we do not know in advance how many random samples are needed for one episode. Therefore we use the following approach: in the first trial of the PEGASUS gradient estimation no list is used, but the list is created simultaneously. For all following PEGASUS calls this list is used; if more random samples happen to be needed later on, the list is enlarged again. As a result we obtain our deterministic (PO)MDP M'.
Estimation of the numerical gradient is implemented by the class CPEGASUSNumericPolicyGradientCalculator, which uses the three point method.
The analytical algorithm is more complex. Here we again use the policy gradient estimator and the agent listener interface simultaneously (similar to the GPOMDP algorithm). The policy gradient estimator part adds itself to the agent listener list and then starts the agent for a specified number of episodes and steps per episode.
In the agent listener part we need to calculate the derivatives of the reward function, of the policy and of the transition function g. For calculating the derivative of the reward function, a new method is added to the state dependent reward interface (CStateReward). In this method the user has to implement the derivative of the reward function if he wants to use the PEGASUS algorithm.
For calculating the derivative of the transition function and of the policy (with respect to the state), dedicated differentiation classes are used (CTransitionFunctionInputDerivationCalculator and CCAGradientPolicyInputDerivationCalculator). There is one implementation for each of these differentiation classes which does the differentiation numerically using the three point method, but an analytical solution can easily be added within this approach. The described equations could all have been implemented by matrix multiplications, but because we are dealing with gradients, which are likely to be sparse, we decided on another approach. For ds_t/dθ, ds_{t+1}/dθ and also for dπ(s_t,θ)/dθ we maintain an individual data structure, which consists of a list of gradient objects (feature lists): one list entry for each continuous state variable (for the derivatives of the successor states) and one for each control variable (for the derivative of the policy). We can write equation 7.73 in the following vector form:
\frac{ds_i(t+1)}{d\theta} = \sum_{j=1}^{n} \frac{dg_i(s(t),a,p)}{ds_j} \cdot \frac{ds_j(t)}{d\theta} + \sum_{k=1}^{m} \frac{dg_i(s(t),a,p)}{da_k} \cdot \frac{\partial \pi_k(s(t))}{\partial \theta} + \sum_{l=1}^{m} \frac{dg_i(s(t),a,p)}{da_l} \sum_{p=1}^{n} \frac{d\pi_l(s(t))}{ds_p} \cdot \frac{ds_p(t)}{d\theta} \qquad (7.75)
where the subscripts i, j, k, l, p are the indices of the state or control variables. Hence all the mathematical operations consist of multiplying a list of gradients with a matrix (the derivatives of the policy and the transition function) to obtain another list of gradients, and adding the result to an existing list of gradients. We implemented a dedicated function for this operation. It takes an input gradient list of size n, an output gradient list of size m and a multiplication matrix of size m × n; it then calculates the product of the i-th input gradient with the i-th column of the matrix and adds the result to the output gradient list:

o_j = o_j + M(j,i) \cdot u_i, \quad \text{for all } 1 \le i \le n \text{ and } 1 \le j \le m \qquad (7.76)
where o is the output list, u the input list and M the multiplication matrix. With this operation we can calculate ds_{t+1}/dθ from ds_t/dθ with equation 7.75. We maintain a separate list of gradients for the derivative of the policy, dπ(s_t)/dθ, which is calculated by equation 7.74. The gradient feature list of ds_t/dθ is always stored for the calculations required in the next step; this gradient list is only cleared at the beginning of a new episode. Finally, having calculated ds(t)/dθ, the gradient list is multiplied with the reward gradient vector dr(s(t))/ds and the result is added to a global gradient feature list grad:

grad = grad + \gamma^{t} \sum_{i=1}^{n} \frac{dr(s(t))}{ds_i} \frac{ds_i(t)}{d\theta} \qquad (7.77)
After having executed all the gradient estimation episodes, this global gradient is returned from the gradient
estimator interface.
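The multiply-and-add operation of equation 7.76 on lists of gradients can be sketched as follows; plain dense vectors are used here instead of the Toolbox's sparse feature lists to keep the example short, so this only illustrates the structure of the operation.

#include <vector>
#include <cstddef>

using Gradient = std::vector<double>;   // one gradient w.r.t. the policy parameters

// o_j += M(j, i) * u_i for all i, j (equation 7.76).
// input holds n gradients, output holds m gradients, M is an m x n matrix.
void multiplyAddGradientLists(const std::vector<Gradient> &input,
                              std::vector<Gradient> &output,
                              const std::vector<std::vector<double>> &M)
{
    const std::size_t n = input.size();
    const std::size_t m = output.size();
    for (std::size_t j = 0; j < m; ++j)
        for (std::size_t i = 0; i < n; ++i)
            for (std::size_t k = 0; k < input[i].size(); ++k)
                output[j][k] += M[j][i] * input[i][k];
}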
7.6 Continuous Actor-Critic Methods
Actor-Critic methods can be viewed as a mixture of value based and policy search methods. They learn the value function while representing the policy in a separate data structure. We have already discussed Actor-Critic methods for discrete state and action sets in chapter 4. By using a function approximator for the value function and using the gradient of the policy for the updates instead of the state indices, these approaches are easily extended to a continuous state space. But if we already use an individual parametrization for the policy, it is more efficient to use continuous control values for the policy as well. In this chapter we present two different methods for Actor-Critic learning with continuous control policies. First we discuss the stochastic real valued (SRV) algorithm [19], and then we come to a new approach proposed in this thesis, which we call the policy gradient Actor-Critic (PGAC) algorithm.
7.6.1 Stochastic Real Valued Unit (SRV) Algorithm
The SRV algorithm was proposed by Gullapalli [19] for continuous optimal control problems. The initial definition covered only associative reinforcement learning, i.e. the algorithm was only used to optimize the immediate performance return. But by taking a learned value function as performance measure, the algorithm is easily extended to discounted infinite horizon control problems [17]. In this case, as for the other Actor-Critic algorithms, the algorithm is independent of the critic part, so we can use any V-Learning algorithm to learn the V-Function. For more details about the Actor-Critic architecture please refer to section 4.6.
SRV Units
In the SRV algorithm, we have one SRV unit for each continuous control variable. An SRV unit returns a sample from the normal distribution N(μ(s_t, θ), σ(s_t, w)).
The mean value μ is defined by the actor's parameters θ. The actor can be represented by any kind of function approximation scheme. σ is a monotonically decreasing, non-negative function of the performance estimate. The variance may depend on the current performance estimate, which in our case is the value of the current state, V_w(s_t): the better the performance estimate, the lower the σ values that are used. For example, we can use a linear scaling of the value function for calculating the σ value,

\sigma(t) = K \left( 1 - \frac{V(s_t) - V_{min}}{V_{max} - V_{min}} \right) \qquad (7.78)

where K is a scaling constant. A multi-valued policy is defined by several SRV units,

\pi(s_t) = \mu(s_t, \theta) + n(t) \qquad (7.79)

where n(t) is the noise vector sampled from the distribution N(0, σ(s_t)). Alternatively we can also use a filtered noise signal to obtain certain continuity properties of the policy. In order to impose the control limits of the control variables we can use a saturating function as discussed in section 7.1.1.
SRV Update Rules
The key idea of the SRV algorithm is to perturb the current policy with a known noise signal (whose magnitude is defined by σ). If the performance of the perturbed control signal is better than the estimated performance of the original policy, the policy's output value is adapted to move in the direction of the noise signal. If the performance is worse, the output value is moved in the opposite direction.
The old performance estimate V_old(s_t) = V̂_w(s_t) uses the value function at state s_t. The new performance estimate is calculated in the standard value-based way, V_new(s_t) = r_t + γ V̂_w(s_{t+1}). The difference of the two coincides with the temporal difference error from TD-learning.
Consequently, the parameters of the actor are updated in the following way:

\Delta\theta_t = \alpha \cdot td(t) \cdot \frac{n(t)}{\sigma(t)} \cdot \frac{d\mu(s_t, \theta)}{d\theta} \qquad (7.80)
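A compact sketch of one SRV step for a single control variable: the executed action is the actor's mean plus Gaussian noise, and the actor parameters are moved along the noise direction scaled by the TD error as in equation 7.80. The actor here is a plain linear function of a feature vector; this is an illustration, not the Toolbox class.

#include <vector>
#include <cstddef>
#include <random>

// Sample the exploratory action of one SRV unit: mu(s) = theta^T phi(s) plus Gaussian noise.
double srvAction(const std::vector<double> &theta, const std::vector<double> &features,
                 double sigma, std::mt19937 &rng)
{
    double mean = 0.0;
    for (std::size_t i = 0; i < theta.size(); ++i) mean += theta[i] * features[i];
    std::normal_distribution<double> dist(0.0, sigma);
    return mean + dist(rng);
}

// One SRV update (equation 7.80) for the linear actor; tdError = r_t + gamma*V(s_{t+1}) - V(s_t)
// is supplied by the critic, noise is the perturbation actually executed.
void srvUpdate(std::vector<double> &theta, const std::vector<double> &features,
               double noise, double sigma, double tdError, double learningRate)
{
    // d mu / d theta_i = phi_i(s) for a linear actor.
    for (std::size_t i = 0; i < theta.size(); ++i)
        theta[i] += learningRate * tdError * (noise / sigma) * features[i];
}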
7.6.2 Policy Gradient Actor Learning
Policy Gradient Actor-Critic learning is a new Actor-Critic approach proposed in this thesis. It is a mixture of the analytical PEGASUS algorithm (see 7.5.4) and V-Learning (we need an exact model of the process). Again we want to calculate the gradient of the value of the policy with respect to the policy's parameters, dV/dθ, but now we learn the value function explicitly with an arbitrary V-Learning algorithm. Furthermore we assume once more that the reward function depends only on the current state s_t and is differentiable. We can estimate the value of the policy in state s_t with the standard value based approach, V(s_t) = r(s_t) + γ V̂(s_{t+1}). The successor state s_{t+1} was created by following the policy π(s_t, θ), so this equation depends on θ. Thus an obvious approach is to calculate the derivative of this equation with respect to θ:
\frac{dV(s_t)}{d\theta} = \gamma\, \frac{d\hat{V}(s_{t+1})}{ds} \cdot \frac{ds_{t+1}}{d\theta} \qquad (7.81)
ds_{t+1}/dθ can be calculated in a similar way as in the analytical PEGASUS algorithm:

\frac{ds_{t+1}}{d\theta} = \frac{dg(s_t, \pi(s_t,\theta))}{da} \cdot \frac{d\pi(s_t,\theta)}{d\theta} \qquad (7.82)
By updating the weights of the actor we try to move the state s_{t+1} in the direction of the gradient of the value function, dV̂(s_{t+1})/ds. Thus the policy is improved, provided the value function is correct and an appropriate learning rate is used.
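For a scalar state and action, the one-step update of equations 7.81 and 7.82 reduces to a product of derivatives, as the following sketch shows; all derivative arguments are placeholders supplied by a differentiable critic, model and policy, and the function is not the Toolbox implementation.

#include <vector>
#include <cstddef>

// One-step PGAC update for scalar state and action:
// dV(s_t)/dtheta = gamma * dV(s_{t+1})/ds * dg/da * dpi(s_t,theta)/dtheta.
void pgacOneStepUpdate(std::vector<double> &theta,
                       double dValue_dState,                       // dV(s_{t+1})/ds from the critic
                       double dModel_dAction,                      // dg(s_t, a)/da from the model
                       const std::vector<double> &dPolicy_dTheta,  // dpi(s_t, theta)/dtheta
                       double gamma, double learningRate)
{
    for (std::size_t i = 0; i < theta.size(); ++i)
        theta[i] += learningRate * gamma * dValue_dState * dModel_dAction * dPolicy_dTheta[i];
}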
We can further extend this approach. At time step t we can look k steps into the past and l steps into the future to estimate the value of the policy in state s_{t−k} more accurately:

V(s_{t-k}) = r(s_{t-k}) + \gamma\, r(s_{t-k+1}) + \gamma^2\, r(s_{t-k+2}) + \dots + \gamma^{k}\, r(s_t) + \dots + \gamma^{k+l}\, \hat{V}(s_{t+l}) \qquad (7.83)
Calculating the gradient of equation 7.83 with respect to θ gives a more accurate version of equation 7.81:

\Delta\theta = \frac{dV(s_{t-k})}{d\theta} = \gamma\, \frac{dr(s_{t-k+1})}{ds}\frac{ds_{t-k+1}}{d\theta} + \gamma^2\, \frac{dr(s_{t-k+2})}{ds}\frac{ds_{t-k+2}}{d\theta} + \dots + \gamma^{k}\, \frac{dr(s_t)}{ds}\frac{ds_t}{d\theta} + \dots + \gamma^{k+l}\, \frac{d\hat{V}(s_{t+l})}{ds}\frac{ds_{t+l}}{d\theta} \qquad (7.84)
Each derivative ds_τ/dθ can be calculated recursively from ds_{τ−1}/dθ by equations 7.73 and 7.74.
Hence we can choose an l-step prediction horizon and a k-step horizon into the past. The PGAC algorithm is listed in algorithm 8.
There is only a small difference between using prediction horizons or past horizons, which will be confirmed by our experiments. With the forward horizon, the policy gets updated before the action for state s_t is actually executed. Another advantage is that the predicted states are recalculated at each step, while for the backward horizon the states were stored up to k steps earlier. Thus, using large k values for the backward horizon can be risky: because the policy parameters change at every step, the stored state sequence might no longer be representative of the current parameter setting. Using an l-step prediction horizon is related to the presented V-Planning method using a search tree over the value function; it helps to reduce the effect of imprecise value functions.
One possible improvement would be an adaptive learning rate for the actor: since we know ds_{t+k}/dw, the new state s'_{t+k} that is reached by the agent if we update the weights by Δw and simulate k steps can easily be estimated by

s'_{t+k} = s_{t+k} + \frac{ds_{t+k}}{dw}\, \Delta w

Thus a line search on V(s'_{t+k}) can be implemented to find a good learning rate for Δw.
Algorithm 8 The Policy Gradient Actor-Critic algorithm
states = list of the last k states
for each new step <s_t, a_t, r_t, s_{t+1}> do
    put s_{t+1} at the end of states
    predict l − 1 states from s_{t+1} and put them at the end of states
    ds/dθ = 0
    grad = 0
    for i = t − k to t + l − 1 do
        s ← states(i)
        grad = grad + γ^{i−(t−k)} · (dr(s)/ds) · (ds/dθ)
        dπ/dθ = [ dπ(s,θ)/ds   ∂π(s,θ)/∂θ ] · [ ds/dθ ; I ]
        ds/dθ = [ dg(s,π(s,θ))/ds   dg(s,π(s,θ))/da ] · [ ds/dθ ; dπ/dθ ]
    end for
    s ← states(t + l)
    grad = grad + γ^{k+l} · (dV̂(s)/ds) · (ds/dθ)
    Δθ = α · grad
    dismiss the predicted states and s_{t−k} from states
end for
Another idea for improvement is to use different time intervals for the updates during learning. For example, at the beginning of a learning trial large prediction/backward horizons can be used, because the value function estimate is still very noisy at this stage of learning; the intervals can then be reduced in a later learning phase, when the value function estimate is more reliable.
7.6.3 Implementation in the RL Toolbox
SRV Algorithm
The SRV algorithm fits perfectly into our Actor-Critic architecture, so we can implement the actor as an error listener of the TD error. The actor maintains a continuous action gradient policy object, which is also supposed to be used as the agent controller. Continuous action controllers already contain their own noise controller, so SRV units are already implicitly implemented. The dependency of the random controller's variance on the value function can be modeled by our adaptive parameter approach; here we can use the CAdaptiveParameterFromValueCalculator class.
From the policy object, the SRV algorithm (CActorFromContinuousActionGradientPolicy) can retrieve both the noise vector n(t) and the σ value used (see section 7.1.1). In this approach, the noise vector is always recalculated as the difference between the executed control signal and the control signal without noise. This has to be done because the noise vector is not stored with the action object. With this approach it is also theoretically possible to use a different controller as the agent controller (e.g. imitation learning), because the difference between the policy's output and the executed control signal is always used as the noise signal n(t). With this information, the update of the actor is straightforward and given by equation 7.80.
PGAC Algorithm
Policy gradient Actor-Critic learning is implemented by the class CVPolicyLearner. This class implements the discussed policy update rules, given a differentiable value function as critic and a differentiable policy as actor. The updates are done in the agent listener interface; hence the algorithm consists of two agent listeners, one for the critic updates and one for the policy updates. The critic updates have to be done before the update of the policy, consequently the critic learner has to be added to the listener list before the policy learner. We maintain a list of states (state collection objects) for <s_{t−k}, s_{t−k+1}, ..., s_{t+l}>. The backward horizon k and the prediction horizon l can both be set through the parameter interface of the Toolbox.
The k latest past states are always stored in the list. At each new step, the predicted l future states are added to the list; these future states are deleted from the list again at the end of the update. Corresponding to this list of states, a list of vectors containing the derivatives dr(s_t)/ds is maintained. Since these derivatives are time independent for a given state, we do not have to recalculate the reward derivatives for the whole state list, only the derivatives for the newly predicted future states. Additionally we implement a function for calculating ds_{t+1}/dθ from ds_t/dθ, which is done in a similar manner as in the analytical PEGASUS algorithm (see 7.5.4). With this function, the state list <s_{t−k}, s_{t−k+1}, ..., s_{t+l}> and the list of reward derivatives <dr(s_{t−k})/ds, dr(s_{t−k+1})/ds, ..., dr(s_{t+l})/ds>, we can calculate the gradient of equation 7.83 given in equation 7.84. This gradient calculation is done at each time step, so it is quite time consuming for larger update intervals [t−k, t+l].
Chapter 8
Experiments
In this chapter we test the RL Toolbox on three continuous control benchmark problems: the pendulum swing up task, the cart-pole swing up task and the acrobot swing up task. These are standard benchmark problems for optimal control with a relatively small state space (two resp. four state variables) and only one continuous control variable. The benchmark tests were done quite exhaustively, which meant we had to choose tasks with a rather small state and action space to keep the required computation time manageable. The simulation time step was set to 3 1/3 milliseconds for our experiments, which was a good trade-off between accuracy and computation speed. The time step used for learning was set to 0.05 seconds if not stated otherwise. For all tests, the average height of the end point was taken as performance measure. Learning was stopped every k episodes, and the average height was measured for l episodes, following the fixed policy of the learner. Then learning was continued; this was repeated until a fixed number of episodes had been reached. Hence, the plotted learning curves show the average height measured every k episodes. For one learning curve the whole process was repeated n times and averaged to get a more reliable estimate. When we talk about the performance of a specific test-suite (a specific algorithm with a fixed parameter setting), we always mean the average height during learning, obtained by averaging all the measurement points of all the learning trials using the same test-suite.
We begin by defining the system dynamics of the tasks and discussing their properties in the context of learning. Then we come to the comparison of our algorithms. We compare different value function learning algorithms in combination with different action selection policies. The influence of the time step Δt used for learning on the performance is also evaluated. Additionally, different types of eligibility traces have been used. This is all done for grid-based constant GSBFN networks, FF-NNs and also Gaussian-Sigmoidal networks. We also investigate the improvement in performance of these methods if we use a prediction horizon greater than one with V-Planning or if we use directed exploration strategies.
After this we investigate Q-Function based algorithms which do not require knowledge of the model, namely Q-Learning and Advantage Learning. Basically we ran the same tests for the used time steps as for V-Learning. Additionally, we tried the Dyna-Q approach and discuss the performance improvement.
After this we come to the Actor-Critic methods. First we test the standard Actor-Critic approaches for a discrete action set, then the continuous action algorithms. For the SRV algorithm, the performance was tested with different kinds of noise; the policy gradient Actor-Critic algorithm was tested with different backward and forward prediction horizons. This was done for the constant GSBFN networks and also for the FF-NNs. We also investigated a mixture of the function approximators, for example using a GSBFN as policy and an FF-NN to represent the value function. This test can illustrate whether it is helpful to use FF-NNs for the value representation even if we use good, easier to learn representations for the policy.
Then we look at the policy gradient methods (GPOMDP [11] and PEGASUS [33]). Both were tested for FF-NNs and the constant GSBFN network. At the end of each test there is a discussion of the results and how they could be further improved. The last section gives a general conclusion about the Toolbox and the algorithms.
In all experiments with a discretized action space, a soft-max policy was used to incorporate random exploration into the action selection. If a real valued policy was used, a noise controller was used for incorporating exploration. A filtered Gaussian noise was used if not stated otherwise,

n(t) = \beta\, n(t-1) + N(0, \sigma_t)

where β is the filter coefficient. σ_t was scaled by the value of the current state,

\sigma_t = \frac{V_{max} - V(t)}{V_{max} - V_{min}}

When using FF-NNs, the standard preprocessing steps were applied to the input state (scale all state variables, use cos(θ) and sin(θ) as input for all angles). The learning rates for the output weights were scaled according to the Vario-η algorithm by a factor of 1/m, m being the number of hidden neurons. All weights of the FF-NNs were initialized with a standard deviation of 1/k, k being the number of inputs of the neuron. This was also done for the sigmoidal part of the GS-NNs.
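The weight initialization and the scaling of the output learning rate described above can be sketched with two small helpers; these are generic illustrations and are independent of the Toolbox's network classes.

#include <vector>
#include <cstddef>
#include <random>

// Initialize the weight vector of a neuron with k inputs using a standard deviation of 1/k.
std::vector<double> initNeuronWeights(std::size_t k, std::mt19937 &rng)
{
    std::normal_distribution<double> dist(0.0, 1.0 / static_cast<double>(k));
    std::vector<double> w(k);
    for (double &wi : w) wi = dist(rng);
    return w;
}

// Scale the output-layer learning rate by 1/m for m hidden neurons.
double scaledOutputLearningRate(double baseRate, std::size_t m)
{
    return baseRate / static_cast<double>(m);
}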
8.1 The Benchmark Tasks
All benchmark tasks are mechanical models which are linear in the control variable u. Hence, all models are implemented by deriving from the class CLinearActionContinuousTimeTransitionFunction and specifying the matrix B(s) and the vector a(s) of the model ṡ = B(s)·u + a(s). For all benchmark problems, dedicated reward functions were implemented; the reward always depends only on the current state s_t. The derivative of the reward function dr(s_t)/ds (needed for the analytical policy gradient calculation and the policy gradient Actor-Critic algorithm) was also implemented, using the interface of the class CStateRewardFunction. For algorithms which need a discrete action set, the action space was discretized into three different actions for all benchmark problems: the minimum torque a_min, the maximum torque a_max and a zero torque action a_0.
8.1.1 The Pendulum Swing Up Task
In this task we have to swing up an inverted pendulum from the stable down position s_down to the up position s_up (see figure 8.1). We have two state variables, the angle θ and its derivative, the angular velocity θ̇, which is limited to |θ̇| < 10 in our implementation (higher absolute values are not relevant). We can apply a limited torque |u| < u_max at the fixed joint. Since the torque is not sufficient to reach the goal state s_up directly, the agent has to swing the system up and decelerate again when the goal state becomes reachable.
A few experiments with the pendulum swing up task can be found in the articles by Coulom [15], who used FF-NNs, and Doya [17], who ran different experiments with continuous time RL and the SRV algorithm. Generally, this task is not trivial because of the swing up, but it is still comparatively easy to learn. Its advantage is that it can be learned very quickly, so we can run many experiments, even with many trials for averaging. Even though the results cannot be directly transferred to more complex, high dimensional tasks, they can indicate what works well and what works poorly.
In the experiments one trial was simulated for 10 seconds, and a discretization time step of Δt = 0.05 was used, resulting in 200 steps per episode. If the average reward per episode for randomly chosen start states is above −0.5 (empirically evaluated), the swing up has been learned successfully.
Figure 8.1: Pendulum, taken from Coulom [15]
Parameters of the system
Name                      Symbol   Value
Maximal torque            u_max    10
Gravity acceleration      g        9.81
Mass of the pendulum      m        1.0
Length of the pendulum    l        1.0
Coefficient of friction   μ        1.0
Unlike Coulom [15] and Doya [17], we used a higher friction coefficient of μ = 1.0. This value was intuitively found to be more realistic. It also makes the swing up task a little harder to learn because of the reduced velocity of the system.
System Dynamics
The pendulum has the following dynamics:

\ddot{\theta} = \frac{1}{m l^2} \left( -\mu \dot{\theta} + m g l \sin\theta + u \right) \qquad (8.1)
From this equation the matrix B(s) and the vector a(s) can be found easily.
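A sketch of the pendulum model in the form ṡ = a(s) + B(s)·u of equation 8.1, using the parameter values from the table above; the struct and its Euler step are illustrative and not the Toolbox's transition-function class.

#include <array>
#include <cmath>

// Pendulum dynamics split into s_dot = a(s) + B(s) * u; state s = (theta, theta_dot).
struct PendulumModel {
    double m = 1.0, l = 1.0, g = 9.81, mu = 1.0;

    std::array<double, 2> a(const std::array<double, 2> &s) const {
        double theta = s[0], thetaDot = s[1];
        return { thetaDot, (-mu * thetaDot + m * g * l * std::sin(theta)) / (m * l * l) };
    }
    std::array<double, 2> B(const std::array<double, 2> &) const {
        return { 0.0, 1.0 / (m * l * l) };
    }
    // One Euler integration step of length dt with control torque u.
    std::array<double, 2> step(const std::array<double, 2> &s, double u, double dt) const {
        auto as = a(s); auto bs = B(s);
        return { s[0] + dt * (as[0] + bs[0] * u), s[1] + dt * (as[1] + bs[1] * u) };
    }
};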
Reward Function
The reward is simply a measure of the height of the pole. In order to keep the exploration benefit of an optimistic value initialization, we chose the negative distance of the pole tip to the horizontal plane through the top position:

r(\theta, \dot{\theta}) = \cos(\theta) - 1 \qquad (8.2)
Used Function Approximators
For the constant GSBFN network, the RBF centers were uniformly distributed on a 15 × 20 grid over the state space. For the sigma values the rule σ_i = 1/(2 p_i) is generally used, where p_i is the number of centers used for dimension i.
For the FF-NN we used 12 hidden neurons, which results in a network of 4·12 + 13 = 61 weights (remember that we have three input states for the neural network because of the angle θ, plus one offset weight per node).
As localization layer for the Gaussian-Sigmoidal NN, 10 RBF centers were distributed uniformly over each state variable. Thus the input state for the FF-NN of the sigmoidal part of the GS-NN has 20 input variables. We used 10 nodes in the hidden layer, which gives an NN with 21·10 + 11 = 221 weights.
8.1.2 The Cart-Pole Swing Up Task
Again the task is to swing a pole upwards, but now the pole is hinged on a cart (as illustrated in figure 8.2). We have four state variables: the position of the cart x, the pole angle θ with respect to the vertical axis, and their time derivatives ẋ and θ̇. We can apply a limited force u to the cart in the x-direction. This task is very popular in the optimal control and reinforcement learning literature, because it is already complex enough to be quite challenging, but it is still manageable by many algorithms from optimal control (fuzzy logic, energy based control; see [1] for a description of an optimal control approach using energy-based constraints). In the area of RL, this task has been solved by several researchers, for example by Doya [17], Coulom [15] and Miyamoto [28]. There is also another, simpler task, the cart-pole balancing task, where the pole just has to be balanced, beginning at an initial position near the goal state. This task can be seen as a subtask of the swing up task.
In our implementation, a fifth state variable θ_Σ was introduced. It represents the total angle rotated so far (thus it contains the same information as θ but is not periodic) and is only used to prevent the pole from over-rotating. If the pole was rotated more than five times (|θ_Σ| > 10π), or if the cart left the track, the learning trial was aborted. In the experiments each episode lasted for 20 seconds, and a time step of 0.05 s was used if not stated otherwise. The long episode length was chosen in order to investigate whether the pole could be balanced for long enough. As performance measure, once again the average height minus 1.0 was used; if an episode was aborted because of over-rotating the pole or leaving the track, the minimal height measure of −2.0 was used for the remaining time steps. An average performance measure better than −0.3 over the 20 seconds can be considered a successful swing up and balancing behavior.
Parameters of the system
Name                                           Symbol   Value
Maximal force                                  u_max    5
Gravity acceleration                           g        9.81
Mass of the pole                               m_p      0.5
Mass of the cart                               m_c      1.0
Length of the pole                             l        0.5
Coefficient of friction of the cart on track   μ_c      1.0
Coefficient of friction of the pivot           μ_p      0.1
Half length of the track                       L        2.4
Half length of the track L 2.4
Again, higher friction coecients were used than in [15] or [17] for a increased realism.
Figure 8.2: The Cart-pole Task, taken from Coulom [15]
System Dynamics
The system can be described by a set of differential equations to be solved (taken from the appendix of [15]):

\underbrace{\begin{bmatrix} l m_p \cos\theta & m_c + m_p \\ \frac{4}{3}\, l & \cos\theta \end{bmatrix}}_{C}
\begin{bmatrix} \ddot{\theta} \\ \ddot{x} \end{bmatrix}
=
\underbrace{\begin{bmatrix} l m_p \dot{\theta}^2 \sin\theta - \mu_c\, \mathrm{sign}(\dot{x}) \\ g \sin\theta - \frac{\mu_p \dot{\theta}}{l m_p} \end{bmatrix}}_{d}
+
\begin{bmatrix} u \\ 0 \end{bmatrix} \qquad (8.3)

By multiplying this equation with C^{-1}, we can again split it into a B(s)·u and an a(s) part:

\dot{s} = \underbrace{C^{-1}(s)\, d(s)}_{a(s)} + \underbrace{C^{-1}(s) \begin{bmatrix} 1 \\ 0 \end{bmatrix}}_{b(s)} u \qquad (8.4)
Reward Function
The negative distance to the horizontal plane through the top position was again used as reward signal. Since this reward function is very flat in the region of the goal state, we added a reward term depending on the distance to the upward position (θ = 0). This peak in the reward function resulted in a better performance for all algorithms, so it was preferred to the flat reward function. Additionally, we punished over-rotating and leaving the track by incorporating the distance of the relevant state variables (x for leaving the track, θ_Σ for over-rotating) into the reward function:

r(x, \dot{x}, \theta, \dot{\theta}) = \underbrace{(\cos\theta - 1) + \exp(-25\,\theta^2)}_{\text{target peak}} \;\; \underbrace{- 100\, \exp\!\left( 25\,(|x| - L) \right)}_{\text{punishment for leaving the track}} \;\; \underbrace{- 20\, \exp\!\left( |\theta_\Sigma| - 10\pi \right)}_{\text{punishment for over-rotating}} \qquad (8.5)
By using the exponential function, the original reward function is only disturbed by the punishment terms
in the relevant areas.
Used Function Approximators
For the constant GSBFN network we used a 7 × 7 × 15 × 15 grid, resulting in 11025 weights. For the adaptive GSBFN, a grid of 5 × 5 × 7 × 7 was used for the initial center distribution. The FF-NN has 20 hidden neurons, thus 6·20 + 21 = 141 weights. For the GS-NNs we used the same number of partitions as for the constant GSBFN (but partitioning each state variable separately), so we have 44 input states for the FF-NN. The FF-NN itself used 20 hidden neurons, resulting in 945 weights.
8.1.3 The Acrobot Swing Up Task
The acrobot has two links, one attached to the end of the other (see figure 8.3). There is one motor at the joint between the two links which can apply a limited torque (|u| < u_max). The task is to swing both links up to the top position. Again we have one control variable and four state variables (θ_1, θ_2 and their derivatives θ̇_1, θ̇_2).
Figure 8.3: The Acrobot Task, taken from Yoshimoto [57]
If the task is simply to balance the acrobot from an initial position in the neighborhood of the goal state, we talk about the acrobot balancing task. The acrobot task is also a standard benchmark problem in the field of optimal control [14], fuzzy controllers and energy based control; genetic algorithms have been used to solve this task as well. By using a pure planning approach (similar to predictive control), Boone [13] obtained probably the best results of anyone in this field.
The task is also very popular for testing RL algorithms and has been solved for different physical parameters with different levels of difficulty. Sutton gave an example of the acrobot learning task using a Tile Coding architecture and Q-Learning in [49], but in this case the task was just to swing the first leg up to 90 degrees. Coulom [15] used FF-NNs to learn the task with continuous time RL; the acrobot managed to swing up and reach the goal position at a very low angular velocity, but it could not keep its balance. Coulom used a maximum torque of u_max = 2.0 Nm, which is, to our knowledge, the most difficult configuration of the acrobot task used so far. Solving the task with FF-NNs is also the only approach with a flat, non-hierarchic architecture. No further information was given about the learning time, but judging from the other experiments Coulom did, the learning time was high. Nishimura et al. [34] used RL to learn a switching between several predefined local controllers. The task was successfully learned, but a maximum torque of 20 Nm was used. Thus, even though other physical parameters were used, the task is likely to be simpler than the configuration that Coulom used. The balancing task has also been investigated intensively, since it is already quite complex when more distant initial states are chosen. Here, we should mention the work of Yoshimoto [57] using an NG-net and an Actor-Critic algorithm.
Even though we still have only four continuous state variables, the standard acrobot task is, depending on the physical parameters, already very difficult to learn, as can be seen from the literature. Only a few experiments could be made with this task due to the lack of time available for optimizing the parameters. Usually many steps are needed to reach the goal; moreover, executing different actions can have very small immediate effects. Therefore an accurate value function is needed, which makes the use of RBF networks almost impossible. Another severe problem of this task is a very tempting local maximum of the value function, which corresponds to balancing the second link upwards while swinging the first link only slightly. This solution is easy to find and initially has a considerably better value than attempting the full swing up. Almost all approaches tried in our experiments only found this solution and did not recover from this local maximum.
None of the standard approaches which worked well for the previous tasks led to any success; they only found the sub-optimal solution described above. Experiments were done for different time scales and maximum torques. The task was only solved for u_max > 10, which simplifies it drastically. These experiments are not shown in this thesis.
Parameters of the system
Name                                            Symbol   Value
Maximal torque                                  u_max    2.0
Gravity acceleration                            g        9.81
Mass of first link                              m_1      1.0
Mass of second link                             m_2      1.0
Length of first link                            l_1      0.5
Length of second link                           l_2      0.5
Coefficient of friction for the first joint     μ_1      0.05
Coefficient of friction for the second joint    μ_2      0.05
System Dynamics
The system can again be described by a set of differential equations to be solved (taken from the appendix of [15]):

C(s) \begin{bmatrix} \ddot{\theta}_1 \\ \ddot{\theta}_2 \end{bmatrix} = d(s) + \begin{bmatrix} -u \\ u \end{bmatrix} \qquad (8.6)
with

C = \begin{bmatrix} \left(\frac{4}{3} m_1 + 4 m_2\right) l_1^2 & 2 m_2 l_1 l_2 \cos(\theta_2) \\ 2 m_2 l_1 l_2 \cos(\theta_2) & \frac{4}{3} m_2 l_2^2 \end{bmatrix} \qquad (8.7)
d = \begin{bmatrix} 2 m_2 l_1 l_2 \dot{\theta}_2^2 \sin(\theta_2) + (m_1 + 2 m_2)\, l_1 g \sin\theta_1 - \mu_1 \dot{\theta}_1 \\ -2 m_2 l_1 l_2 \dot{\theta}_1^2 \sin(\theta_2) + m_2 l_2 g \sin\theta_2 - \mu_2 \dot{\theta}_2 \end{bmatrix} \qquad (8.8)
Again we can multiply this equation with C^{-1} and split it into a B(s)·u and an a(s) part:

\dot{s} = \underbrace{C^{-1}(s)\, d(s)}_{a(s)} + \underbrace{C^{-1}(s) \begin{bmatrix} -1 \\ 1 \end{bmatrix}}_{B(s)} u \qquad (8.9)
Reward Function
Again, the distance of the tip to the horizontal plane through the top position was used as reward signal. Additionally, a peak is added to the reward function at the goal state:

r(\theta_1, \dot{\theta}_1, \theta_2, \dot{\theta}_2) = l_1 \cos(\theta_1) + l_2 \cos(\theta_1 + \theta_2) + \underbrace{0.5\, \exp\!\left(-25\,(\theta_1^2 + \theta_2^2)\right)}_{\text{target peak}} - l_1 - l_2 \qquad (8.10)
Used Function Approximators
Several grid-based and also more sophisticated RBF positioning schemes were used for the acrobot task, with very limited success. For the FF-NN a 30 neuron (241 weight) network was used. No tests were done for the GS-NN due to the lack of time and the already poor results on the cart-pole task.
8.1.4 Approaches from Optimal Control
An energy based control scheme for the cart-pole can be found in [1]. Another interesting approach is taken by Olfati-Saber [35], who uses a fixed point controller. A good, but unfortunately somewhat dated, overview of existing approaches from optimal control can be found in [14], where several control strategies for the acrobot are discussed. Usually two different controllers are combined, one for balancing and one for swinging up the acrobot. For the balancing task, a linear quadratic regulator (LQR) or a fuzzy controller is used. Both approaches require fine tuning of the parameters, which is done either by hand or by a genetic search algorithm. The swing up is controlled by a PD controller working on the linearized system, using feedback linearization. With feedback linearization, a controller can be designed that on average pumps energy into joint one during each swing, resulting in a swing up. Again, parameter tuning is needed for this approach.
These approaches work fine for an acrobot with a limited torque of |u| < 4.5 Nm and with physical parameters other than those used in this thesis. Hence, optimal control already provides working solutions for these problems. The advantage of RL is that no parameter tuning is needed (at least far fewer parameters than in optimal control) and that it is a general framework.
8.2 V-Function Learning Experiments
In this section we investigate the discrete time and continuous time V-Learning algorithms. We look at the performance of the different gradient calculation schemes for the three different function approximators: constant grid-based GSBFNs, FF-NNs and GS-NNs. The influence of eligibility traces with different settings is investigated, and we also test the algorithms for different time scales Δt.
Finally, we try to improve the performance of the V-Learning algorithms by using a longer prediction horizon for V-Planning, by incorporating directed exploration information into the policy and by using hierarchic learning architectures.
For all experiments γ = 0.95, s_γ = 1.0, λ = 0.9 and β = 20 (for the soft-max distribution) were used unless stated otherwise. The time discretization used was Δt = 0.05 s. As default, replacing e-traces were used.
8.2.1 Learning the Value Function
Continuous time RL and the standard, discrete time RL formulation are compared in this section. We want to determine which algorithm works best for learning the value function; hence all tested algorithms use the same policy, a one step lookahead V-Planning policy with a soft-max distribution for action selection. For the continuous time algorithms, we had to scale the V-Function by 1/Δt for the V-Planning part, because the continuous time value function is 1/Δt times smaller than the discrete time value function.
Constant Grid-Based RBF network
This FA has the best performance, and it is simple and easy to learn. The V-Learning methods managed to learn the pendulum task after five to ten episodes. In figure 8.4(a) we see three learning curves: for the discrete time residual, the Euler residual and the residual used by Coulom. There is no significant difference in the performance of these three algorithms. This is also due to the setting of the specific parameter s_γ to 1.0 and Δt = 0.05, resulting in an equivalent discrete time discount factor γ_d = 1 − s_γ·Δt = 0.95 for the continuous time algorithm. Figure 8.4(b) shows the comparison between the direct gradient, the residual gradient and the residual algorithm (with variable and constant β). Only the performance of the residual gradient algorithm (β = 1.0) falls off, as expected from the theory. All other gradient calculation algorithms do not differ significantly for the RBF network. Each learning trial lasted for 50 episodes, and the plots are averaged over 10 trials.
Figure 8.4: (a) Learning curves of the discrete time and continuous time algorithms with the RBF network on the pendulum task; a learning rate of 2.0 was used for all three algorithms. (b) Average reward during learning for different gradient calculation algorithms on the pendulum task, plotted over varying learning rates. The discrete time V-Learning algorithms were used for this illustration.
Figures 8.5(a) and (b) show the same results for the cart-pole task. The RBF network manages to learn the task in approximately 400 to 500 episodes. The results and conclusions are almost the same as for the pendulum swing up task, but this task is already much more complex: learning one trial takes about 570 s of real time (for 2000 episodes), compared with 30 s for the pendulum task (50 episodes). The direct gradient algorithm already stands out slightly, as it seems to be best suited for the use of linear feature states.
These experiments also illustrate the insensitivity of the RBF network with respect to the choice of the learning rate: the algorithm manages to learn the pendulum and cart-pole tasks for a large range of learning rates.
Figure 8.5: Average height during learning of the cart-pole task for different gradient calculation algorithms, plotted over varying learning rates. The discrete time V-Learning algorithms were used for this illustration, and the results were averaged over 5 trials, each trial lasting for 2000 episodes.
For the acrobot task, different resolutions (15×15×15×15 and 20×20×20×20) were used for the grid-based RBF positioning scheme. More sophisticated positioning schemes were also used, with a finer resolution for small velocities and around the neighborhood of θ_1 = π and θ_2 = 0. This was only a limited success: only a suboptimal solution was found for the acrobot task, and if the resolution around the downwards position is not optimized accurately enough, even this solution could not be found when starting in that position. In this case, the agent did not manage to leave the downwards position due to the small difference in the state space when executing different actions.
Feed Forward Neural Networks
FF-NNs are already difficult to learn for the pendulum task. On average, more than 400 episodes are needed for a successful swing up (with an optimized parameter setting), and finding a good parameter setting (learning rate, λ, and β for the residual algorithm) is more difficult because the algorithms only work in a very small parameter regime. We used a trial length of 3000 episodes for the pendulum task in order to investigate the long-term convergence behavior of the algorithms; the results are averaged over 10 trials. Figure 8.6 illustrates the performance of the discrete time and the continuous time algorithm using the Euler residual for different gradient calculation schemes. Surprisingly, the performance of the continuous time algorithm is significantly worse than that of its discrete time counterpart. Even using the best empirically ascertained parameter setting (β = 0.3, learning rate 16, see figure 8.6 (b)), the algorithm managed to learn the swing up in only four out of 10 trials. How can this behavior be explained? For the continuous time residual, the influence of the value function, relative to the received reward, is much higher than in the discrete time algorithm (in our case 1/Δt = 20 times higher). In the case of the linear approximator, this does not make any difference, because the V-Function is linear in the weights and thus just 1/Δt times smaller. But for FF-NNs, where we have a random initial weight vector, the influence of this initial weight vector is much higher for the continuous time algorithm. The different learning rate intervals can also be explained with the same argument: we have to use higher learning rates in order to increase the influence of the received reward on the weight update. We also ran one experiment to check these ideas, using the continuous time algorithm with a reward function which is 20 times higher than the original one (so that the reward function and the value function have the same relative weighting). Now the algorithm should do exactly the same as the discrete time algorithm, and in fact, the results were almost the same as in figure 8.6 (a) for the discrete time algorithm. Consequently, the performance of FF-NNs additionally depends on the relationship between the weighting of the value function and the reward function in the residual calculation. A question which still has to be examined is whether we can optimize this relationship for a given learning task and how much this optimization can improve the performance of FF-NNs.
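This weighting argument can be made explicit. Writing the discrete time TD residual and an Euler-discretized continuous time residual with the equivalent discount factor γ_d = 1 − s_γΔt (the exact residual formulation in the Toolbox may differ in details):

\mathrm{res}^{\mathrm{disc}}_t = r_t + \gamma_d V(s_{t+1}) - V(s_t), \qquad \mathrm{res}^{\mathrm{cont}}_t = r_t + \frac{(1 - s_\gamma \Delta t)\, V(s_{t+1}) - V(s_t)}{\Delta t}

The two expressions differ only in the factor 1/Δt = 20 in front of the value terms, i.e. the value function influences the continuous time residual 20 times more strongly relative to the reward. Multiplying the reward by 20 therefore turns the continuous time residual into a scaled copy of the discrete time one (the constant factor is absorbed by the learning rate), which is exactly what the control experiment above showed.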
Figure 8.6: Performance of different gradient calculation schemes for the (a) discrete time and (b) continuous time algorithm (Euler residual) when learning the value function of the pendulum task with an FF-NN.
For the discrete time algorithm, the plots of the different gradient calculation schemes are more expressive. Here we can see that the residual algorithm with a constant β factor clearly outperforms the direct gradient algorithm: the best performance of the residual algorithm is significantly better (an average height of −0.55 instead of −0.8), and the range of well-working learning rates is also larger. The residual gradient algorithm (β = 1.0) has a worse performance than the residual algorithm, but still outperforms the direct gradient algorithm. Surprisingly, the residual algorithm with the adaptive β calculation also falls off with respect to the constant β configuration. Apparently, the approximations of the real epoch-wise gradient with the traces w_D and w_RG were not good enough. This experiment was done for the averaging parameter values [1.0, 0.95, 0.9, 0.7, 0.0], and the plot uses the most efficient setting of 0.9. In figure 8.7 we see typical learning curves for the direct gradient algorithm and the residual algorithm with a constant β of 0.3. The direct gradient algorithm also learns quickly, but then, in almost every trial, it unlearns the swing up again very quickly (if it is learned at all). This is a consequence of the poor convergence behavior of the direct gradient algorithm and coincides with the theory. The residual algorithm with constant β also unlearns the swing up behavior occasionally, but not as often and not as quickly.
Figure 8.7: Learning curves of five different trials of (a) the direct gradient algorithm and (b) the residual algorithm (with constant β = 0.3) with an FF-NN as function approximator for the pendulum task. The thick red line represents the average of the five trials.

Learning the cart-pole task with an FF-NN is already a very hard task. Over 50000 episodes are needed to learn the task, and the algorithm only works within a very small range of parameter settings. One learning trial with 100000 episodes lasted for 24000 seconds, which made an exhaustive search for good parameter settings almost impossible. Nevertheless, we tested the use of FF-NNs with the discrete and continuous time algorithms and different gradient calculation schemes. Plots of the results can be seen in figure 8.8; the results are averaged over only 3 trials, so we have to consider that they are quite noisy. In general, the residual algorithm with a high β setting again tends to work best, but surprisingly not with the discrete time algorithm: for this task, the continuous time algorithm clearly outperformed the discrete time algorithm. Even though the discrete time algorithm also manages to learn the task, its learning performance is more unstable and it only works for an even smaller parameter range. Again, if a smaller learning rate is used for the discrete time algorithm, with otherwise the same parameter setting as for the continuous time algorithm, learning is not possible at all. Seemingly, the relationship between the influence of the value function and that of the reward function in the continuous time algorithm is the correct (or at least a better) one for the cart-pole task. These results suggest that the chosen relationship should depend on the complexity of the learning task, i.e. an algorithm with an adaptable relationship between the value function and the reward function could be preferable. This presumption obviously needs more investigation.

Figure 8.8: Performance of the (a) continuous time and (b) discrete time RL algorithms with an FF-NN as function approximator for the cart-pole task.
Gaussian Sigmoidal Neural Networks
GS-NNs are a mixture of the localizing RBF networks and the sigmoidal FF-NNs. They are supposed to be easier to learn than FF-NNs and to require fewer weights than RBF networks. For the pendulum task, the reduction in the number of weights is unfortunately not given; even worse, the GS-NN requires more weights for this low dimensional task (200 instead of 150 for the RBF network). But for the cart-pole task we need only 945 weights instead of the 11025 weights needed with the RBF network. Unfortunately, computing the GS-NN is quite slow, so the tests could not be done exhaustively. For the pendulum task, the results (figure 8.9(a) and (b) for the discrete time algorithm) show a performance comparable to FF-NNs. Again, the residual algorithm with constant β significantly outperforms the direct and residual gradient algorithms, and the variable β calculation also falls off in its performance. The continuous time algorithm again suffers from the same difficulties as with the FF-NN; these plots are not shown here. The disadvantage in comparison to FF-NNs is the increased computational complexity, which makes learning very slow. For the cart-pole task, a successful parameter regime has not yet been found. One reason for this is the long learning time needed for the cart-pole task. Another reason is that the GS-NN approach does not scale up easily to more complex tasks; at the least, it becomes even more sensitive to accurate parameter settings. We even tried to learn the value function while the agent followed an already learned policy, but learning ran for 30000 episodes (70000 seconds learning time) without any success.
Figure 8.9: Performance of the GS-NN for different gradient calculation algorithms on the pendulum task.
8.2.2 Action selection
Basically, we have three different kinds of policies: stochastic policies using a one step forward-prediction (discrete time V-Planning), stochastic policies using the continuous time system dynamics (continuous time V-Planning) and the real valued, value-gradient based sigmoidal policies (see 7.3.4). For the stochastic policies, a soft-max action distribution is used. In figure 8.10(a) we see the performance of the three policies with roughly optimized parameters (a soft-max parameter of 20 for the stochastic policies and C = 100 for the value-gradient based policy). The performance was only tested with the constant RBF network. As expected, the discrete time planning algorithm slightly outperforms the other two approaches for the pendulum task, because it uses the most accurate estimates of the value of the next state; the two continuous time approaches perform equally well for this task. For the cart-pole task, the difference is more drastic. While all three policy learning schemes manage to learn the task, both continuous time approaches need significantly more learning steps; the discrete time V-Planning approach is significantly more efficient due to its more accurate estimates of the values of the next states. An interesting question is how these algorithms would behave if we were to use a learned, inaccurate model of the system dynamics. In this case, the difference is likely to be less pronounced, but this has not been investigated. We can also see from the results in figure 8.10 (b) that the value gradient-based policy outperforms the continuous time V-Planning policy.

Figure 8.10: Average reward during learning for different policies, plotted over varying learning rates. The Euler residual was used for this illustration. (a) Pendulum task (b) Cart-pole task.
8.2.3 Comparison of Different Time Scales
We tested the discrete time algorithm with discrete time V-Planning, the continuous time algorithm with discrete time V-Planning, and the continuous time algorithm with the value-gradient based policy for different time scales. According to the theory, continuous time RL should work best for small time steps. For larger time steps, the approximations of the Hamiltonian, and in particular of the value gradient based policy, become inaccurate, hence the performance gets worse. Since continuous time RL uses an equivalent discrete discount factor of γ_d = 1 − s_γ·Δt, we also tested the discrete time algorithm with this adaptation law for the discount factor, just to be sure that the better performance is not simply due to cheating with a preferable discount factor setting.
The pendulum task results are shown in figure 8.11(a). The discrete time algorithm with the constant γ setting of 0.95 does not manage the swing up for small time steps, but by adapting the γ value, this algorithm has almost the same performance as the continuous time algorithm. Therefore, the only advantage of using continuous time RL for the value function is a better, time scale dependent choice of the discount factor.
The value gradient based policy has the best performance for small time steps, but, as expected, could not
learn the task for larger time steps.
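As a rough numeric illustration of this adaptation law (using the default s_γ = 1.0), the equivalent discrete discount factor is

\gamma_d(\Delta t) = 1 - s_\gamma \Delta t, \qquad \text{e.g. } \gamma_d(0.05) = 0.95, \quad \gamma_d(0.01) = 0.99, \quad \gamma_d(0.005) = 0.995.

A fixed γ = 0.95 per step, in contrast, corresponds to an effective horizon of roughly 20 steps regardless of Δt, i.e. only about 0.1s of look-ahead at Δt = 0.005s, which explains why the constant setting fails for small time steps.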
Figure 8.11: Performance plots for different Δt. discGamma refers to the discrete time algorithm with γ set to 1 − s_γ·Δt (instead of the usual value of 0.95). contEuler represents the learning curve of the value gradient based policy. (a) Pendulum task: one trial had 100 episodes, one episode lasted for 10s. (b) Cart-pole task: 2000 episodes, 20s per episode.
8.2.4 The Influence of the Eligibility Traces
The λ parameter has different effects for different function approximators. The following plots (figures 8.12 and 8.13) show the average height during the whole learning trial, plotted over varying learning rates for λ = [0.0, 0.5, 0.7, 0.9, 1.0]. There is always one plot for replacing e-traces and one for accumulating e-traces. In the case of linear approximators, where the gradient ∇_w V(s) does not depend on w, a high λ value results in a better learning performance. For global non-linear function approximators like FF-NNs, the results show that using high λ values is rather dangerous, because we rely on the assumption that ∇_w V(s) does not change during an episode. Due to the high number of different test cases, this experiment was only done for the pendulum task.
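For reference, the following sketch shows a TD(λ) weight update with the two e-trace variants compared below. The replacing rule for function approximator weights is one possible generalization and is not necessarily identical to the rule implemented in the Toolbox.

#include <algorithm>
#include <cmath>
#include <vector>

// Simplified TD(lambda) weight update with eligibility traces.
// grad[i] = dV(s_t)/dw_i, delta = TD (or residual) error at time t.
void tdLambdaUpdate(std::vector<double>& w, std::vector<double>& e,
                    const std::vector<double>& grad,
                    double delta, double alpha, double gamma,
                    double lambda, bool replacing)
{
    for (size_t i = 0; i < w.size(); ++i) {
        e[i] *= gamma * lambda;                      // decay the trace
        if (replacing)
            e[i] = (std::fabs(grad[i]) > std::fabs(e[i])) ? grad[i]
                                                          : e[i];   // replace trace entry
        else
            e[i] += grad[i];                         // accumulate trace entry
        w[i] += alpha * delta * e[i];                // TD(lambda) weight update
    }
}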
Constant Grid-Based GSBFNs
Figures 8.12 and 8.13 show the performance of the direct gradient and residual algorithms with different λ settings. For the linear approximator, the results are, as expected, nearly the same for the different gradient calculation schemes, but they already show that using e-traces for the residual and the residual gradient algorithm can be advantageous. There is also hardly any difference between replacing and accumulating e-traces when using a linear function approximator; accumulating (non-replacing) e-traces need a lower learning rate for obvious reasons.
Figure 8.12: Performance plots for different λ settings for the pendulum task with RBF networks, using the direct gradient. (a) Replacing e-traces (b) Non-replacing e-traces. The average reward is determined over 10 trials with 50 episodes each.
Figure 8.13: Performance plots for different λ settings for the pendulum task with RBF networks, using the residual algorithm with the variable β calculation scheme. (a) Replacing e-traces (b) Non-replacing e-traces. The average reward is determined over 10 trials with 50 episodes each.
Feed Forward Neural Networks
In figures 8.14, 8.15 and 8.16 we can see the performance plots of the direct gradient algorithm, the residual algorithm with constant β = 0.3 and the residual algorithm with variable β. The performance of the direct gradient algorithm could be improved slightly, since accumulating e-traces with a low λ value seem to perform better than the standard parameters used so far (λ = 0.9, replacing e-traces). The performance of the residual algorithm with the variable β calculation could be improved in the same way, using low λ values or no e-traces at all. Interestingly, this is not the case for an empirically optimized constant β value: here, using high λ values with replacing e-traces can significantly improve the performance. Surprisingly, accumulating e-traces have a significantly worse performance for the residual algorithm with a constant β setting. This result also suggests that eligibility traces are useful when used with the residual algorithm and, in particular, that our implementation of replacing e-traces for the weights of a function approximator is justified.
Figure 8.14: Performance plots for different λ settings for the pendulum task with FF-NNs, using the direct gradient. (a) Replacing e-traces (b) Non-replacing e-traces. The average reward is determined over 10 trials with 3000 episodes each.

Figure 8.15: Performance plots for different λ settings for the pendulum task with FF-NNs, using the residual algorithm with a constant β = 0.6. (a) Replacing e-traces (b) Non-replacing e-traces. The average reward is determined over 10 trials with 3000 episodes each.

Figure 8.16: Performance plots for different λ settings for the pendulum task with FF-NNs, using the residual algorithm with the variable β calculation scheme. (a) Replacing e-traces (b) Non-replacing e-traces. The average reward is determined over 10 trials with 3000 episodes each.
Gaussian Sigmoidal Neural Networks
GS-NNs behave a bit differently from FF-NNs when the λ parameter is changed. As a consequence of the localization layer, high λ values perform well for all gradient algorithms, even if they narrow the range of well-working learning rates (particularly for accumulating e-traces). The results are shown in figure 8.17 for the direct gradient, and in figure 8.18 for the residual algorithm with a constant β setting of 0.6. Our replacing e-traces approach clearly outperforms the standard accumulating e-traces approach, which, in combination with the results for the FF-NN, suggests that replacing e-traces are generally preferable for TD(λ) learning with function approximation.

Figure 8.17: Performance plots for different λ settings for the pendulum task with GS-NNs, using the direct gradient algorithm. (a) Replacing e-traces (b) Non-replacing e-traces. The average reward is determined over 10 trials with 3000 episodes each.

Figure 8.18: Performance plots for different λ settings for the pendulum task with GS-NNs, using the residual algorithm with β = 0.6. (a) Replacing e-traces (b) Non-replacing e-traces. The average reward is determined over 10 trials with 3000 episodes each.
8.2.5 Directed Exploration
In our experiments with different exploration strategies, we used a counter-based local and a distal exploration measure. For the counter and for the exploration value function we again used a function approximator, in this case the standard constant RBF network. Both approaches are model based, using the generative model for the prediction of the exploration measure of the next state. For distal exploration, a standard TD V-Learner was used with a learning rate of 0.5. For all other parameters, the standard values were used. We tested the benefits of directed exploration for the three function approximators (RBF, FF-NN and GS-NN); the results are illustrated in figure 8.19 for the pendulum task. The plots show the performance of the algorithms for an ascending exploration factor. Due to the higher exploration measures when using distal exploration (the measure is the expected future local exploration measure), we used smaller exploration factors than for local exploration. But the results of the local and the distal exploration measure do not differ significantly anyway. For learning the real value function, the discrete time algorithm was used with V-Planning as the policy. The best learning rates from the previous experiments were always used.
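The way the exploration measure enters the V-Planning action values can be sketched as follows; the interfaces are hypothetical, and the bonus term (negative visit counts) is only one simple choice of a local, counter-based measure.

#include <vector>

struct Model     { virtual double predict(const std::vector<double>& s, int a,
                                          std::vector<double>& sNext) = 0; };
struct VFunction { virtual double getValue(const std::vector<double>& s) = 0; };
// Approximated visit counter; for distal exploration an exploration value
// function (the expected future local measure) would be used instead.
struct Counter   { virtual double getCounts(const std::vector<double>& s) = 0; };

// Exploration-augmented one step values for V-Planning (sketch).
void explorationValues(Model& model, VFunction& V, Counter& counter,
                       const std::vector<double>& s, int numActions,
                       double gamma, double explFactor,
                       std::vector<double>& values)
{
    std::vector<double> sNext;
    values.resize(numActions);
    for (int a = 0; a < numActions; ++a) {
        double r = model.predict(s, a, sNext);
        double bonus = -counter.getCounts(sNext);   // fewer visits => higher bonus
        values[a] = r + gamma * V.getValue(sNext) + explFactor * bonus;
    }
    // the soft-max distribution is then applied to these augmented values
}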
The RBF network already uses a kind of directed exploration in the form of the optimistic value initialization, so using directed exploration could not improve its performance. But for FF-NNs and GS-NNs (note that, unlike in the previous experiments, only 1000 episodes were used for learning), the performance is significantly improved by directed exploration. This experiment was also done for the cart-pole task with the FF-NN and the RBF network. The results show again that using a directed exploration scheme with the RBF network does not result in any benefits, but the performance of the FF-NN was again considerably improved. In this experiment, we can also see the benefit of distal directed exploration over local directed exploration. Learning with the FF-NN was still very unstable, but good policies were found after 20000 episodes, which is a considerable improvement in performance.

Figure 8.19: Plots with varying exploration factor, showing the performance of the different FAs on the pendulum task. (a) Local exploration (b) Distal exploration. For the RBF network one learning trial has 50 episodes, for the non-linear FAs one trial has 1000 episodes. The results are averaged over 10 trials.

Figure 8.20: Plots with varying exploration factor, showing the performance of the different FAs on the cart-pole task. (a) Local exploration (b) Distal exploration. For the RBF network one learning trial has 2000 episodes, for the FF-NN one trial has 50000 episodes. The results are averaged over 10 and 5 trials, respectively.
These experiments show, on the one hand, that the poor performance of the non-linear approximation methods partially comes from their poor exploration ability. The drawback of this experiment is that we used a linear approximator for the counter, so we suffer again from the curse of dimensionality, which is exactly what the use of a non-linear FA should avoid. But intuitively, the function approximator of the counter does not need to be as exact as the one for the value function, so fewer features can be used. Another possibility is a memory based counter representation, storing the last n states and counting the experienced states in the region of the current state. These issues were not investigated any further. Using directed exploration has a low complexity and hardly affects the computation speed of the simulation, so its use should be considered, especially for non-linear FAs.
8.2.6 N-step V-Planning
Given knowledge of the model, planning methods can be used with a larger prediction horizon (in our experiments we used two, three or five time steps). The experiments were again done for our three common FAs. The results for the pendulum task are plotted in figure 8.21(a). We can see a significant improvement in performance, especially for the FF-NN and the GS-NN. With the RBF network, no significant improvement was observed due to the simplicity of the learning task. We also tested the n-step V-Planning approach for the cart-pole task, but only with RBF networks. The result is illustrated in figure 8.21(b): in this case performance increased dramatically, resulting in the best performance seen for the cart-pole task.
The disadvantage of this approach is clearly the computational cost, which is exponential in the prediction horizon. For the pendulum task, for example, one evaluation took 30s with search depth one, 90s with search depth two, 300s with three prediction steps and over 2400s with five prediction steps. For the cart-pole task, one trial lasted 570 seconds with a search depth of 1, and over 40000 seconds with a search depth of 5. Thus, large prediction horizons cannot be used, at least not for real time control. The speed can certainly be improved further by a faster implementation or by adding some kind of heuristic to prune the search tree, but the exponential time dependency remains.
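The exponential cost follows directly from the recursive search over all action sequences of length n; a minimal sketch with hypothetical model and value function interfaces and without any pruning:

#include <algorithm>
#include <vector>

struct Model     { virtual double predict(const std::vector<double>& s, int a,
                                          std::vector<double>& sNext) = 0; };
struct VFunction { virtual double getValue(const std::vector<double>& s) = 0; };

// n-step V-Planning value of state s: full search over all action
// sequences of length depth, i.e. O(|A|^depth) model evaluations.
double planValue(Model& model, VFunction& V, const std::vector<double>& s,
                 int numActions, double gamma, int depth)
{
    if (depth == 0)
        return V.getValue(s);
    double best = -1e30;
    std::vector<double> sNext;
    for (int a = 0; a < numActions; ++a) {
        double r = model.predict(s, a, sNext);
        best = std::max(best, r + gamma * planValue(model, V, sNext,
                                                    numActions, gamma, depth - 1));
    }
    return best;
}
// Action selection evaluates planValue for each immediate action and applies
// the soft-max distribution to the resulting values.

The number of evaluations grows as |A|^n, so, assuming the three discretized actions used for these tasks, a search depth of five costs about 3^5/3 = 81 times more than a depth of one, which is consistent with the roughly 80-fold increase in simulation time reported above.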
Figure 8.21: Average reward for different search depths using V-Planning. (a) Pendulum task (b) Cart-pole task.
8.2.7 Hierarchical Learning with Subgoals
In this experiment we investigate the use of subgoals (similar to the approach used by Morimoto [29], see section 5.2.4) for continuous control tasks. Hierarchical learning was applied to the cart-pole task, since the pendulum task was considered too simple for adding a hierarchy. In our subgoal model, a target region and a failed region can be defined for each subgoal. Additionally, we defined a sequential order of the subgoals, so if subgoal g_1 has reached its target area, subgoal g_2 is activated; if subgoal g_2 fails (reaches its failed region), subgoal g_1 is activated again. Each subgoal has its own reward function, and for every subgoal g_i an own value function V_i is learned. The reward function of a subgoal is an exponential function of the distance to the target area and the failed area:
r_t = R_C + r_1 \exp(-\mathrm{dist}_t / \sigma_1) - r_2 \exp(-\mathrm{dist}_f / \sigma_2) \qquad (8.11)
where dist_t is the minimum distance to the target area and dist_f the minimum distance to the failed area. R_C is a constant, negative reward offset, r_1 and r_2 specify the influence of the target and failed areas in the reward function, and σ_1 and σ_2 specify the attenuation of the reward with the distance to the specified area.
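A direct transcription of this reward function, assuming the exp(−dist/σ) form reconstructed in equation (8.11); the computation of the distances to the target and failed regions is left to a hypothetical helper:

#include <cmath>

// Subgoal reward as in equation (8.11): a constant negative offset R_C, an
// attractive term that grows near the target area and a repulsive term that
// grows near the failed area. distTarget and distFailed are the minimum
// distances of the current state to the two regions.
double subgoalReward(double distTarget, double distFailed,
                     double RC, double r1, double r2,
                     double sigma1, double sigma2)
{
    return RC + r1 * std::exp(-distTarget / sigma1)
              - r2 * std::exp(-distFailed / sigma2);
}

// Values used for the cart-pole subgoals described below:
// subgoalReward(dt, df, RC, 40.0, 40.0, 1.0 / 300.0, 1.0 / 300.0);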
In addition to the subgoal V-Functions, a global V-Function V_g is learned with the standard reward function. The global value function is needed because we define the target areas of the subgoals very coarsely, so this V-Function is used to determine good intersection points between the subgoals for achieving the global goal of swinging up and balancing the system. For the policy, we used the standard 1-step V-Planning algorithm; the value is calculated as the weighted sum of the current subgoal's value function and the global value function:
V = \alpha V_g + (1 - \alpha) V_i \qquad (8.12)
Our experiments test the hierarchical structure with different values of the weighting factor α (α = 0 corresponds to pure hierarchic learning, α = 1 to the standard flat learning approach).
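For clarity, the value used by the 1-step V-Planning policy in this hierarchical setting, together with the sequential subgoal switching, can be sketched as follows (the class layout and the symbol α are our own notation, not the Toolbox API):

#include <vector>

struct VFunction { virtual double getValue(const std::vector<double>& s) = 0; };

// Subgoal with its own value function and its target/failed predicates
// (hypothetical interface).
struct Subgoal {
    VFunction* V;
    virtual bool inTargetArea(const std::vector<double>& s) = 0;
    virtual bool inFailedArea(const std::vector<double>& s) = 0;
};

// Value used for planning: weighted sum of the global value function and
// the value function of the currently active subgoal, equation (8.12).
double hierarchicValue(VFunction& Vglobal, Subgoal& active,
                       const std::vector<double>& s, double alpha)
{
    return alpha * Vglobal.getValue(s) + (1.0 - alpha) * active.V->getValue(s);
}

// Sequential subgoal switching: advance on success, fall back on failure.
int updateActiveSubgoal(std::vector<Subgoal*>& goals, int active,
                        const std::vector<double>& s)
{
    if (goals[active]->inTargetArea(s) && active + 1 < (int)goals.size())
        return active + 1;
    if (goals[active]->inFailedArea(s) && active > 0)
        return active - 1;
    return active;
}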
For the cart-pole task we divided the whole task into three different subgoals. The first subgoal is to swing the pole up to an angle of π/2; the velocities and the x position were not further specified. The second subgoal has to swing the pole up to an angle of π/9; here we also restricted the velocities of the cart in the x direction and of the angle to the interval [−2, 2]. The second subgoal has failed if the pole is in a downwards position (θ > 0.5π or θ < −0.5π) with a low velocity (if the absolute angular velocity of the pole is lower than 0.1). The third subgoal is the balancing task and has no target area; this subgoal has failed if the absolute angle of the pole is larger than π/6. For the reward functions, we used the following values for every subtask: r_1 = r_2 = 40 and σ_1 = σ_2 = 1/300. All three subgoal value functions and the global value function use the standard RBF network as function representation, except for the third subgoal task. In this case, another grid for θ and θ̇ was used: θ was divided into five partitions within the interval [−π/9, π/9] and θ̇ into 10 partitions within the interval [−5, 5]. This was considered to be the relevant area for the balancing task.
In figure 8.22 (a), the results are illustrated for different α values. The hierarchical architecture significantly outperforms the flat approach (α = 1.0) for a broad range of α values. The pure hierarchical approach without the global value function, or using small α values, also has a bad performance due to the lack of good interconnection points between the subgoals. The learning curves of the flat, the pure hierarchic and the intermixed approach with α = 0.6 are illustrated in figure 8.22 (b). With a setting of α = 0.6, the agent successfully learned the swing up within 150 episodes, which is more than twice as fast as the flat approach. The performance of the learned policy is also remarkably good, comparable with policies obtained after 2000 learning episodes with the flat approach.
This approach was also tried for the acrobot task, using individually optimized RBF networks for each subtask; four subtasks were used in this case. The agent managed to escape the locally optimal solution mentioned earlier, and was also able to reach the upwards position, but with a non-zero velocity, so it could not hold the balance. Further optimization of the subgoals and the RBF networks would have been needed, which is already a lot of work for the acrobot task. This is the biggest disadvantage of the hierarchical approach.

Figure 8.22: (a) Average reward for different α factors for hierarchic learning of the cart-pole task. (b) Learning curves of the flat architecture (α = 1.0), the pure hierarchic architecture and the intermixed approach with α = 0.6.
8.2.8 Conclusion
Learning the value function is quite stable; it works with all the function approximation schemes used, at least for the pendulum task. RBF networks have a reasonable performance, but they cannot be used for high dimensional control tasks. Even for a 4-dimensional control task where a highly accurate value function is needed, as for the acrobot task, it was not possible to learn the swing up; many different accuracies (numbers of centers) and positioning strategies for the centers were tried without success. The advantages of RBF networks, or of linear approximators in general, are that they are much easier and faster to learn. They also work for a broad range of parameter regimes and for all the algorithms we used, which is crucial if a long learning time is needed and, consequently, a thorough search for good parameter settings is very hard. Another advantage is the ability to use optimistic value initialization, which spares us a more sophisticated exploration strategy.
Non-linear function representations have a very poor performance in comparison to linear function representations, so they are not yet useful for real applications. Our experiments show that, even for the low dimensional simulated tasks used in this thesis, the use of non-linear FAs is fraught with difficulty. FF-NNs are able to scale up to high dimensional problems (as shown by Coulom [15]), and are also more accurate than RBF networks. But due to their long learning time and their sensitivity to the parameter settings of the learning algorithm, they become impractical to use. After a long parameter search, we managed to learn the cart-pole task with a neural network, but given the learning time needed, we do not consider FF-NNs to be efficiently applicable. Another problem is the instability of learning: the algorithm may unlearn a good policy, and performance is also highly dependent on the initial weight vector of the FF-NN. The residual algorithm, with an appropriate constant β setting, can partially solve this problem and leads to considerably better results, but with the drawback of an additional parameter that must be optimized. E-traces are also not as easily applied as for linear approximators, which is another reason for the loss of performance. An approach for automatic learning rate adaptation is, in our opinion, very promising for making non-linear function approximators more usable for RL.
GS-NNs, as a mixture of the two approaches, also did not quite meet our expectations. Although it was possible to learn the pendulum task in fewer episodes than with the FF-NN, the real learning time increased due to the additional complexity of the GS-NN. Moreover, the performance gain was not such that we could say GS-NNs behave better than FF-NNs; the non-linear weight updates still seem to be problematic with GS-NNs. We did not manage to find a good parameter setting for the cart-pole task, which is a consequence of the very long learning time of a GS-NN. We even tried to learn the value function alone, given a policy that had already been optimized; learning was done in this configuration for over 30000 episodes without any success. In retrospect, we think that memory based representations like locally weighted regression, or adaptive GSBFNs (normalized RBF networks), are the most promising approaches for RL in continuous state spaces. Although adaptive GSBFNs are built into the Toolbox, there was not enough time to test them thoroughly.
The use of a directed exploration scheme, a higher planning horizon or an actor to stabilize learning are approaches which address the difficulties with FF-NNs and GS-NNs. Unfortunately, there was only time to experiment with these approaches on the pendulum task. Although the results are quite encouraging, further experiments with more complex tasks are needed to verify the benefits of these approaches. A principal benefit of incorporating knowledge of the system dynamics is the ability to plan, which drastically improved the performance of value function learning. On the other hand, the system dynamics can also be used for the continuous time value-gradient based policy, which provides a real valued policy. But the experiments show that its performance already falls off for the cart-pole task due to a less accurate value prediction, so planning approaches seem to be more effective, albeit at the expense of discretized actions and a higher computation time.
The comparison between the continuous time algorithm and the discrete time algorithm is somewhat ambivalent. While the experiments show no significant preference for the RBF network, the results differ quite notably for the FF-NN and GS-NN function representation schemes. The discrete time algorithm works well for the pendulum task, but did not manage to learn the cart-pole task; the opposite was observed for the continuous time algorithm. These results suggest that the optimal solution is an adjustable relationship between the weighting of the value function and the reward function in the residual calculation. This optimal relationship is likely to differ for different tasks and function approximators.
Hierarchical RL helps us to improve the speed of learning. It is also sometimes the only way to prevent the algorithm from getting stuck in a local minimum. Our hierarchical approach was very simple, with predefined, successive subgoals, but even this approach is already very difficult to apply to the acrobot task. More complex approaches with an automatic detection, or at least adaptation, of the hierarchic structure are very promising in this context and are undoubtedly necessary for a more sophisticated learning system for more complex tasks. But a lot of further research has to be done in this area.
8.3 Q-Function Learning Experiments
In this section, we compare the performance of the Q-Function learning algorithms, namely Q-Learning and Advantage Learning, with each other and also with the V-Learning algorithms, in order to determine the benefits of using the system dynamics as prior knowledge. The basic disadvantage of Q-Learning is the need for a discretization of the action set, but this causes no problems for the low dimensional control spaces of the benchmark tasks. In our Q-Learning experiments, only the RBF network and two different FF-NN architectures were used.
8.3.1 Learning the Q-Function
In these experiments, we compared Q-Learning to Advantage Learning. Again, different gradient algorithms were used and plots are shown for different learning rates (figure 8.23). For action selection, a soft-max distribution with a parameter setting of 20 was used.
Constant Grid-Based GSBFNs
For the pendulum task, one trial lasted for 200 episodes. All plots are averaged over 10 trials. The best configuration of the Q-Learning algorithm managed to learn the task in approximately 40 episodes. The results show that Q-Learning slightly outperforms Advantage Learning for the time scale factor K = 1, which might be a result of better optimized parameters for Q-Learning, because the difference is not significant. For advantage learning, a discount factor of γ_A = 0.95^(1/Δt) = 0.3585 (Δt = 0.05s) was used, which is equivalent to the discount factor used by Q-Learning. Experiments with a higher discount factor (e.g. γ_A = 0.95, corresponding to a per-step discount of γ = 0.95^Δt = 0.9974) resulted in a significantly worse performance.
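The relation between the two settings is just the conversion between a per-step and a per-unit-time discount factor (Δt = 0.05s):

\gamma_A = \gamma^{1/\Delta t} = 0.95^{20} \approx 0.3585, \qquad \gamma_A^{\Delta t} = 0.3585^{0.05} \approx 0.95,

i.e. γ_A discounts per second exactly what γ = 0.95 discounts per 0.05s step, while the "higher" setting γ_A = 0.95 corresponds to a per-step discount of 0.95^{0.05} ≈ 0.9974.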
Figure 8.23: Performance plots of (a) the Q-Learning algorithm and (b) the Advantage Learning algorithm (K = 1.0) for different gradient calculation schemes on the pendulum task.
For the cart-pole task, only the direct gradient and the residual gradient were tested; the results can be seen in figure 8.24. For the advantage learning algorithm, again a discount factor of γ_A = 0.3585 was used. One learning trial took 4000 episodes, and the results are averaged over five learning trials. In this task, the advantage learning algorithm already had a significantly worse performance than standard Q-Learning. The time scale factor (K = 1.0 was used) was not further optimized, but these results indicate that, at least without an optimization of K, advantage learning has no advantage over Q-Learning.

Figure 8.24: Performance plots of (a) the Q-Learning algorithm and (b) the Advantage Learning algorithm (K = 1.0) for different gradient calculation schemes on the cart-pole task.
FF-NNs
The use of FF-NNs for Q-Learning was only investigated for the pendulum task due to the long learning time. We tested two possibilities for representing the Q-Function with FF-NNs. The first representation uses one FF-NN which takes the action value as an additional input; this FF-NN was created with 12 neurons in the hidden layer (resulting in 5·12 + 13 = 73 weights). The second approach uses an individual FF-NN (again with 12 hidden neurons) for each discretized action; since there are three different discretized actions, we have 61·3 = 183 weights. The first approach might benefit from the generalization capabilities of the FF-NN, but is also intuitively harder to learn. In these experiments, one trial lasted for 5000 episodes, and the plots are averaged over 10 trials.
Figure 8.25: Performance plots for learning the Q-Function of the pendulum task with an FF-NN: (a) one single FF-NN, (b) an individual FF-NN for each action.
The results show that FF-NNs are even more difficult to use for Q-Functions than for V-Functions. For certain initial (random) configurations of the neural network, and good configurations of the residual algorithm, learning was actually successful and sometimes even quite fast, but this was not reproducible for all initial configurations of the neural network. A directed exploration strategy is likely to attenuate the influence of poor initial configurations of the FF-NN. Again, learning with the direct gradient algorithm was unsuccessful for both types of FF-NNs. The residual algorithm with a high β setting, or even the residual gradient algorithm (β = 1.0), is more promising: learning was successful in 7 out of 10 cases for the best configuration of the residual algorithm with the single FF-NN approach. The single FF-NN approach also performs slightly better than the second approach with three individual FF-NNs. This indicates that exploiting the generalization capability of the FF-NN across actions is a better approach than using separate FF-NNs for the actions.
8.3.2 Comparison of different time scales
This experiment is similar to the time scale experiments in the V-Learning section. We compare Q-Learning with a constant γ setting of 0.95, Q-Learning with an adaptive γ of 0.3585^Δt (which is equivalent to the discount used by advantage learning), and advantage learning itself (the time scale factor K was set to 1.0). This experiment was also done only for the pendulum task. The results are illustrated in figure 8.26.
The Q-Learning algorithm outperforms the advantage learning algorithm for larger time scales, but advantage learning is better for small time steps because the advantage values are scaled by the inverse time step 1/Δt. However, using time steps as small as Δt = 0.005 is typically not useful for learning, so this benefit of advantage learning is of limited value.
Another surprising result is that the Q-Learning algorithm with the adapted γ parameter performs worse for all time scales. High discount factors seem to be good in general (for V-Learning, the adaptive discount factor setting with its higher discount factors outperforms the constant discount factor setting of 0.95), but this does not hold in this case.
Figure 8.26: Experiments with Q-Function learning using (a) different time scales, (b) the Dyna-Q algorithm with a different number of planning updates.
8.3.3 Dyna-Q learning
In this section, we investigate the use of simulated experience from the past to update the Q-Function, as is done by the Dyna-Q algorithm (see section 4.8.1). At each step, we update the Q-Function with 0, 1, 2, 3 or 5 randomly chosen experiences from the last 40 episodes. We tested the Dyna-Q algorithm for both the pendulum and the cart-pole task using our standard RBF network. As the learning algorithm, the standard Q-Learning algorithm was used; for the Dyna-Q updates, a Q-Learner was also used, but without eligibility traces. The learning rate used for both learning algorithms was 0.75. The results are plotted in figure 8.26 (b). The plot is averaged over 20 trials for the pendulum task and over 10 trials for the cart-pole task. For the pendulum task, the Dyna-Q planning updates did not have any effect on the performance; this task seems to be too simple. The results for the cart-pole task do show a slight improvement in performance. Surprisingly, the performance gets worse again if more planning updates are used. The reason for this might be that too many off-policy planning updates disturb the approximation of the Q-Function.
Our experiments with Dyna-Q learning unfortunately do not show a clear advantage for this planning approach. Additional experiments are needed to illustrate the benefits of Dyna-Q learning.
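The replay step used here can be sketched as follows; the QLearner interface and the Transition container are simplified placeholders, not the Toolbox API. After every real step, k stored transitions from the recent episodes are replayed with plain Q-Learning updates and without eligibility traces.

#include <cstdlib>
#include <deque>
#include <vector>

struct Transition { std::vector<double> s, sNext; int a; double r; };

// Simplified Q-Learner interface.
struct QLearner {
    virtual double maxQ(const std::vector<double>& s) = 0;
    virtual void   update(const std::vector<double>& s, int a,
                          double target, double alpha) = 0;
};

// Dyna-Q style planning: replay k randomly chosen transitions from the history.
void dynaQUpdates(QLearner& q, const std::deque<Transition>& history,
                  int k, double alpha, double gamma)
{
    if (history.empty()) return;
    for (int i = 0; i < k; ++i) {
        const Transition& t = history[std::rand() % history.size()];
        double target = t.r + gamma * q.maxQ(t.sNext);   // one step Q target
        q.update(t.s, t.a, target, alpha);               // no e-traces here
    }
}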
8.3.4 Conclusion
Learning the Q-Function succeeded for all algorithms using the RBF network for the pendulum and the cart-pole task, thus Q-Learning can solve problems as complex as the V-Learning approach can; the advantage of V-Learning is its superior learning speed. When using FF-NNs, however, the comparison looks different: learning the Q-Function is in this case harder than learning the V-Function, resulting in a very unstable learning performance. Advantage Learning does not seem to have a significant advantage over Q-Learning, at least with RBF networks; further tests with FF-NNs or GS-NNs have not been done. The benefit of Q-Learning is that it can be used even if the system dynamics are not known. We also have the possibility of using the Dyna-Q algorithm, or other approaches, to incorporate experience from the past, which is not possible for V-Learning because V-Learning is an on-policy algorithm. An interesting approach would be to incorporate the planning part used by V-Learning into Q-Learning (e.g. by calculating the value of a state as V(s) = max_a Q(s, a)). We ran some tests for this first naive approach (using planning for action selection instead of the Q-Values), which only resulted in divergent behavior of the Q-Function. The reason for this is that the Q-Values which are considered best do not have to be taken at all, and so these values never get updated. A planning approach which also updates the Q-Values in the planning phase would be more appropriate.
8.4 Actor-Critic Algorithm
This section covers the two Actor-Critic algorithms for a discrete action set introduced in section 4.6, as well as the stochastic real valued (SRV) algorithm and the newly proposed policy gradient Actor-Critic (PGAC) algorithm.
8.4.1 Actor-Critic with Discrete Actions
In our experiments with the discrete Actor-Critic algorithms, an RBF network was used for both the actor and the critic. We tested the two different algorithms introduced in section 4.6. The first algorithm uses the temporal difference of the critic for the actor update, whereas the second algorithm additionally weights the update of the actor by the probability of taking the current action: at least half of the learning rate is used for the update even if the probability is very high. The tests were done both with and without eligibility traces for the actor. For creating a policy from the actor's action values, a standard soft-max policy with a parameter setting of 20 was used. Since these methods also use a discrete action set, they are comparable to the Q-Function learning algorithms.
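The two actor update rules can be summarized in a small sketch; this is our paraphrase of the description above, and the exact probability weighting used in the Toolbox may differ.

#include <vector>

// Actor update for the discrete Actor-Critic algorithms (sketch).
// preferences: actor values for the current state's discrete actions,
// tdError: temporal difference of the critic,
// pi_a: soft-max probability of the executed action.
void actorUpdate(std::vector<double>& preferences, int action,
                 double tdError, double alpha, double pi_a,
                 bool weightByProbability)
{
    double step = alpha;
    if (weightByProbability)
        step = alpha * (1.0 - 0.5 * pi_a);   // stays within [alpha/2, alpha]
    preferences[action] += step * tdError;
}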
In figure 8.27(a), we can see the results for the pendulum task, and in figure 8.27(b) for the cart-pole task. In both benchmark tasks, the algorithms perform well. For the pendulum task, the algorithms performed significantly better than the Q-Learning approach, while the performance for the cart-pole task was almost identical. Interestingly, the results of the tests with e-traces are drastically different for the two benchmark problems: while using e-traces resulted in a significantly worse performance for the pendulum task, it performed well for the cart-pole task. This indicates that using e-traces for the actor considerably improves the performance for more complex tasks. The comparison of the two Actor-Critic algorithms tends slightly towards the second approach; consequently, an emphasis on updates of actions with low probability is a good strategy.
Figure 8.27: Experiments with the discrete Actor-Critic algorithms for (a) the pendulum task (200 episodes, averaged over 10 trials) and (b) the cart-pole task (4000 episodes, averaged over 5 trials).
We also investigated the use of FF-NNs for the actor and the critic simultaneously, which did not work at all, even with the best known parameter configuration for the critic from the previous experiments.
8.4.2 The SRV algorithm
The SRV algorithm uses a continuous policy as actor, which is a clear advantage over Q-Learning and the preceding Actor-Critic algorithms. For the SRV algorithm, noise plays a significant role in learning, so we tested the SRV for different amounts of noise and also for different smoothness levels of the noise. A disadvantage of this algorithm is that the optimal learning rate depends on the noise of the controller, so the learning rate has to be optimized for each noise setting. We used a sigmoidal policy (see 7.1.2) as the actor, which is implemented either as an RBF network or as an FF-NN. For the critic, again either an RBF network or an FF-NN is used, both with the most efficient algorithm and parameter configuration determined in the previous experiments. Our experiments with different function representations for the actor and the critic illustrate whether it is useful to use a non-linear FA for the value function (due to the high dimensional state space) if we know an easy-to-learn representation of the policy (for example the RBF network or parameterized controllers from optimal control).
For the pendulum task, one learning trial lasted for 200 episodes if an RBF network was used for the actor and the critic, and 3000 episodes if one or more FF-NNs were used. All results are averaged over 10 learning trials. The tests were done for noise smoothness values of [0.0, 0.7, 0.9] and noise amplitudes of [15.0, 10.0, 7.5, 5.0, 1.0]. Note that the limit of the control variable is [−10, 10], so for high noise values, often just the limits of the control variable were taken. Thus, for filtered noise, often the same limit value is chosen for several time steps, which ensures a certain smoothness in time.
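The exploration noise used for the SRV experiments can be sketched as follows; the filter form (a simple first-order low-pass driven by Gaussian noise) and the class names are assumptions made for the illustration, while the clipping to the control limits [−10, 10] corresponds directly to the description above.

#include <algorithm>
#include <random>

// Filtered (low-pass) exploration noise for the SRV controller.
// smoothness in [0,1) is the filter coefficient (0.0 = white noise),
// amplitude is the standard deviation of the driving Gaussian noise.
class FilteredNoise {
public:
    FilteredNoise(double smoothness, double amplitude, unsigned seed = 0)
        : smooth_(smoothness), gauss_(0.0, amplitude), rng_(seed), state_(0.0) {}

    double next() {
        state_ = smooth_ * state_ + (1.0 - smooth_) * gauss_(rng_);
        return state_;
    }
private:
    double smooth_;
    std::normal_distribution<double> gauss_;
    std::mt19937 rng_;
    double state_;
};

// Executed control: sigmoidal actor output plus noise, clipped to the
// control limits [-10, 10], so large noise amplitudes saturate at the limits.
double exploringControl(double actorOutput, FilteredNoise& noise)
{
    return std::clamp(actorOutput + noise.next(), -10.0, 10.0);
}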
For each noise setting, the learning rate of the actor was roughly optimized in the interval [1, 4] for RBF networks, and in the interval [0.001, 0.025] for FF-NNs.
For the pendulum task, we can see the performance of the RBF actor with an RBF critic in figure 8.28(a), and with an FF-NN critic in figure 8.28(b). In the case of the RBF critic, high exploration factors seem to be advantageous, and the results for different noise smoothness factors do not differ significantly. But for the FF-NN critic, which cannot track the evolution of the actor as quickly, we see a huge difference in the performance for different smoothness factors. Obviously, the smoothness in time of the filtered noise signal resembles the smoothness in time of the optimal policy, resulting in a good performance for this noise distribution. The result even competes with the V-Planning policy using an FF-NN, which is quite remarkable, because the V-Planning policy uses the system dynamics as prior knowledge. It appears that the RBF actor can stabilize the learning process of the FF-NN critic because its parameter changes only affect the actor locally.
Figure 8.28: Performance plots of the SRV learning algorithm using an RBF actor with (a) an RBF critic and (b) an FF-NN critic for different noise signals (pendulum task).
The use of an FF-NN as actor is very problematic, resulting in a very unstable learning performance. Surprisingly, we get better results with the FF-NN critic than with the RBF critic, but perhaps the reason for this is that the parameters (the learning rate of the actor) were not optimized accurately enough. The performance curves for the FF-NN critic are plotted in figure 8.30(a); in figure 8.30(b) we can see the learning curves of 10 trials using the most efficient noise signal and learning rate. The algorithm managed to learn the swing up in two out of 10 cases, and even in these two cases the performance was still very unstable. Seemingly, when using an FF-NN to represent the policy, learning even simple tasks becomes difficult, in fact even more difficult than using an FF-NN for the value function. In particular, all the gradient descent/ascent algorithms for the policy representation are likely to end up in a local minimum, which might be the reason for the bad performance. Adding some restrictions to the FF-NN, like certain smoothness criteria or limiting the norm of the weights, may help to overcome this problem, but this was not investigated further.
For the cart-pole task, the SRV algorithm experiments were only done for the RBF actor, with either an RBF or an FF-NN critic. In the tests with the RBF critic, 4000 episodes were used, as in the Q-Function learning experiments. The SRV approach could not confirm the promising results from the pendulum task: while the algorithm could outperform the Q-Learning methods for the pendulum task, its performance is now significantly worse than that of the Q-Learning algorithm. This means that the SRV algorithm is more difficult to scale up to more complex tasks than other algorithms; a potential problem is the choice of the noise signal for more complex tasks.

Figure 8.29: (a) Performance of the SRV learning algorithm using an FF-NN actor and an FF-NN critic. (b) Learning curves of the same algorithm, using the best determined configuration. The algorithm managed to learn the task in two out of 10 trials. The thick line represents the average learning curve.

Figure 8.30: (a) Performance of the SRV learning algorithm using an FF-NN actor and an FF-NN critic. (b) Learning curves of the same algorithm, using the best determined configuration. The algorithm managed to learn the task in two out of 10 trials. The thick line represents the average learning curve.
The experiments with the FF-NN critic did not lead to any good results for any noise setting, even though one learning trial had 10000 episodes in order to allow a comparison with the PGAC algorithm. A fair comparison with the V-Planning approaches is therefore not possible, because the V-Planning method needed over 50000 episodes to learn the task.
Figure 8.31: Performance of the SRV learning algorithm using an RBF actor and an RBF critic on the cart-pole task. One trial lasted for 4000 episodes and the plots are averaged over 5 trials.
8.4.3 Policy Gradient Actor-Critic Learning
In this section we test the newly proposed Policy Gradient Actor-Critic (PGAC) algorithm. This algorithm also uses a continuous valued policy, but unlike the SRV algorithm, it requires knowledge of the system dynamics for the policy updates. We compare this algorithm to the SRV algorithm, because it is also an Actor-Critic algorithm, and to the V-Planning policy, because there the system dynamics are also used. Again, we tested the algorithm for different combinations of representations (FF-NNs or RBF networks) for the actor and the critic. The actor is, once more, a sigmoidal policy; the random noise controller was disabled in these experiments. Each actor-critic configuration was tested for different prediction and backwards horizons. For the critic, the most efficient algorithms and parameters were always used.
For the pendulum task, one trial took 50 episodes if only RBF networks were used, and 3000 episodes if one or more FF-NNs were used (the same trial length as for V-Planning). All plots are averaged over 10 trials for RBF networks and over 5 trials when an FF-NN was used. The results with a forward prediction horizon are shown in figure 8.32(a) and the results with a backwards horizon in figure 8.32(b). Learning with an RBF actor was successful, but using a bigger prediction or backwards horizon only slightly improved the performance, at least with the FF-NN as critic. With the RBF network as critic, using a bigger time interval for the updates did not seem to result in any improvement for the pendulum task; the task seems to be too simple to benefit from bigger time windows for computing the gradient of the policy. Compared to the SRV algorithm, the performance is almost the same; for an FF-NN critic, the SRV algorithm even outperforms the PGAC algorithm slightly. We think this is due to the optimized noise level in the SRV experiments. The PGAC algorithm can also compete with the V-Planning approach for both kinds of critics. The use of an FF-NN as actor is also critical with the PGAC algorithm; it does not work properly even for the pendulum task, probably for the same reasons as for the SRV algorithm.
Figure 8.32: Performance of the PGAC algorithm for the pendulum task using (a) a forward prediction horizon or (b) a backwards view horizon.
We ran the same tests for the cart-pole task; the results are illustrated in figure 8.33. For the RBF critic, one trial lasted for 4000 episodes, for the FF-NN critic 10000 episodes. For this task we can already see a considerable difference between the different time intervals used for the update: with a time interval of 1.0, neither the RBF critic nor the FF-NN critic can learn the task, but the performance can be drastically improved by using a larger time interval for the updates. The RBF critic approach learns the cart-pole task quite well, and outperforms the SRV for a time interval length of two. The algorithm has a performance comparable to the Q-Learning approaches, with the benefit of a continuous valued control. The PGAC algorithm could not compete with the discrete time V-Planning approach, which gives particularly good solutions with a higher prediction horizon. In particular, the learned policy of the V-Planning approach is very efficient, which could not be achieved with the PGAC algorithm, because such sharp decision boundaries cannot be expressed with the RBF actor.
For the FF-NN critic, we did experiments with a large backwards horizon (7 was used) to illustrate the strength of the PGAC approach; the learning curve of one parameter setting is illustrated in figure 8.34. The algorithm managed to learn the task after 10000 episodes with a stable performance, which is a considerable improvement over the standard V-Learning approach. Again, the RBF actor managed to stabilize the FF-NN critic. In comparison with the SRV algorithm, this result also shows the performance advantage of the PGAC algorithm over the SRV algorithm. A disadvantage of the PGAC approach is the long learning time when using a large backwards or forwards horizon (one trial with 20000 episodes and a backwards horizon of 7 took 120000 seconds) due to the complex gradient calculation; further optimization of the implementation is needed.
The prediction horizon was expected to outperform the backwards horizon approach, because the actor is updated with future information. But both approaches lead to almost the same results for the pendulum and the cart-pole task, so the backwards horizon approach should be preferred due to its lower computational cost (no state prediction is needed).
Figure 8.33: Performance of the PGAC algorithm for the cart-pole task using (a) a forward prediction horizon or (b) a backwards view horizon.

Figure 8.34: Learning curve of the PGAC algorithm for the cart-pole task using an FF-NN critic. A backwards horizon of 7 was used. One trial lasted for 20000 episodes. The plots are averaged over 2 trials.
The PGAC algorithm outperforms the SRV algorithm, and almost reaches the performance of the V-Planning
algorithm if we choose larger time intervals for the actor updates. Compared with the V-Planning algorithm,
the PGAC algorithm has the significant advantage that no planning is needed for action selection, so action
selection is very fast. Using a higher prediction (or backwards) horizon for the weight updates is also possible,
which can drastically improve the performance, as it does for V-Planning; but unlike V-Planning, whose
computation time grows exponentially with the prediction horizon (O(|A|^n)), the PGAC algorithm scales
only linearly (O(n)). In our experiments for the pendulum task, a forward prediction horizon of five time
steps took about twice as long as using no prediction horizon at all. In comparison, V-Planning with a search
depth of five needed approximately 120 times longer, which is a considerable saving of computation time.
For the cart-pole task, a similar benefit in computation time could be observed, but in this case the policy
gradient calculation is more complex, resulting in four times as much computation time. For a prediction
horizon of five, the V-Planning approach took approximately 80 times as long as the standard approach with
a search depth of one. For the PGAC algorithm, even higher prediction or backwards horizons can be used,
particularly at the beginning of learning, due to the linear computation time.
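To make this difference concrete, here is a rough back-of-the-envelope illustration; the action count is an assumed example value, not one taken from the experiments. With |A| = 3 discrete actions and a horizon of n = 5, exhaustive V-Planning has to consider every action sequence,

\[
|A|^{n} = 3^{5} = 243 \quad \text{sequences per decision (and a comparable number of model evaluations),}
\]

whereas the PGAC gradient only has to be propagated through the n = 5 predicted (or stored) transitions, a cost proportional to n.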
8.4.4 Conclusion
Actor-Critic algorithms are very promising and, in our opinion, have great potential. The discrete action
algorithms compete with Q-Learning, and almost reach the performance of V-Planning approaches without
requiring knowledge of the system dynamics. The continuous action algorithms both perform quite well;
the SRV algorithm performs well for the simple pendulum task, but for the cart-pole task an exact tuning
of the noise signal is necessary for an acceptable performance, so scaling it to more complex tasks is likely
to be quite difficult. In this context, the PGAC algorithm is more promising, because its learning ability is
easily scalable by extending the size of the time interval used for the updates. The PGAC algorithm could
even considerably outperform the 1-step V-Planning approach for difficult constellations, using an FF-NN as
critic for the cart-pole task. This is a very promising result, indicating the power of this approach. Another
interesting aspect is that, because the two algorithms use different information to update the policy but may
use the same representation of the policy, it is possible to combine the SRV and the PGAC updates. Whether
this leads to superior performance has yet to be investigated.
8.5 Comparison of the algorithms
In this section, we compare the results of the individual algorithms to each other. We always use the best
configuration of each algorithm, and the plots show the averaged learning curves. First, we make the
comparison for the RBF network. In figure 8.35(a), we can see the comparison of the methods that use the
system dynamics (V-Planning, value-gradient-based policy, PGAC). The results show a slightly slower
learning process for the value-gradient policy; the PGAC algorithm (a forward horizon of five steps is used)
performs as well as the V-Planning approach. For the cart-pole task, figure 8.36(a), these results can be seen
even more clearly. The performance of the value-gradient-based policy falls off considerably; the PGAC
approach (a forward horizon of five was used) has the same learning speed as the V-Planning approach, but
cannot reach the quality of the learned policy. This can be explained by the use of the real-valued RBF actor,
which cannot represent hard decision boundaries as the V-Planning approach can. In this plot, we can also
see the power of the V-Planning approach if we plan for more than one step. Using five steps for the prediction,
the algorithm finds an optimal solution almost four times as fast as with the standard one-step prediction.
The quality of the learned solution is also considerably better.
In figure 8.35(b), we can see the results for the model-free approaches (Q-Learning, Advantage Learning,
Actor-Critic Learning, SRV). All three Actor-Critic approaches outperform the Q-Learning approach.
Advantage Learning learns as fast as Q-Learning, but the quality of the learned policy does not seem to be as
good as for the other algorithms. The results for the cart-pole task can be seen in figure 8.36(b). Q-Learning
and the discrete Actor-Critic learning algorithms have the best performance in this case. Surprisingly, the
quality of the learned policy for Q-Learning is even better than for Actor-Critic learning. The SRV algorithm
struggles with the complexity of this task and, on average, does not manage to learn a good policy within
4000 episodes.
(a) Pendulum (b) Pendulum
Figure 8.35: (a) Learning Curves of the V-Planning (one step prediction), the value gradient policy, and
the PGAC algorithm. (b) Comparison of Q-Learning, Advantage Learning, the two Actor-Critic Learning
approaches and the SRV algorithm. All algorithms use the RBF network. Plots are averaged over 10 trials.
(a) Cart-Pole (b) Cart-Pole
Figure 8.36: (a) Learning Curves of the V-Planning (one and five step prediction), the value gradient policy,
and the PGAC algorithm. (b) Comparison of Q-Learning, Advantage Learning, Actor-Critic Learning and
the SRV algorithm. All algorithms use the RBF network, plots are averaged over 5 trials.
Figure 8.37: Learning Curves of the V-Learning (one step prediction), V-Planning (five step prediction), the
PGAC and the SRV algorithm using FF-NNs for the pendulum task, plots are averaged over 10 trials.
8.6 Policy Gradient Algorithm
This section covers the policy gradient algorithms included in this thesis: CONJPOMDP and two variants
of the PEGASUS algorithm using gradient ascent, namely the numerical and the analytical policy gradient
calculation. Due to a lack of time, the tests in this section are unfortunately not as extensive as for the other
algorithms.
8.6.1 GPOMDP
The GPOMDP algorithm was tested with the RBF network and also with the FF-NN for the pendulum
task. Due to the huge learning time needed for this task, learning was not tried on more complex tasks. A
stochastic policy with a soft-max distribution (soft-max parameter set to 20) is used to represent the policy.
We only tested the CONJPOMDP algorithm with the original setting, so GSEARCH was used to determine
the optimum learning rate. Different numbers of episodes (5, 20, 50 and 100) were tried for the gradient
estimation; the results are plotted in figure 8.38. The GSEARCH algorithm always uses 1/5 of these episodes
for its own gradient estimate, which does not need to be that accurate. We limited the learning rate calculated
by the GSEARCH algorithm to [0.1, 160] for the RBF policy and to [0.005, 5.0] for the FF-NN policy. The
start learning rates used were 10.0 and 0.5.
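For reference, the soft-max (Gibbs) policy has the usual form; the feature-based parameterization below is only a sketch of the standard construction, writing the soft-max parameter (set to 20 above) as \beta, the policy parameters as \theta and the state features as \phi(s), and is not a quote of the Toolbox implementation:

\[
\pi_{\theta}(a \mid s) \;=\; \frac{\exp\!\big(\beta\, \theta_a^{\top} \phi(s)\big)}{\sum_{b \in A} \exp\!\big(\beta\, \theta_b^{\top} \phi(s)\big)} .
\]

Larger values of \beta make the distribution greedier with respect to the action preferences \theta_a^{\top}\phi(s), so a value of 20 yields an almost deterministic policy once the preferences separate.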
In figure 8.38(b), we can see the learning curves for the RBF network, using different numbers of episodes
for the gradient estimation. After each weight update, the average height over 20 episodes was recorded. As
we can see, the CONJPOMDP algorithm needs a huge number of gradient updates to converge. Learning
was only successful when using 100 episodes per gradient estimation, and the algorithm needed approximately
3000 weight update steps, resulting in about 80 million (!) learn steps, or over 40000 seconds, for a task which
can also be learned in 10 episodes of 200 steps, i.e. about 5 seconds, when the same RBF network is used as
function approximator for the value function. The huge variance of the performance estimate also shows that
learning is quite unstable. This is partly a consequence of the value of 0.95 used for the bias-variance
parameter of the GPOMDP algorithm: high values of this parameter give a gradient estimate with a large
variance but a small bias with respect to the true gradient. Learning with the FF-NN did not lead to any
success within 5000 weight updates, so continuing the learning was not considered to be useful.
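As a reminder of where this parameter enters, the GPOMDP estimator of Baxter and Bartlett [9] accumulates an eligibility trace of the policy's log-likelihood gradients. Writing the bias-variance parameter as \beta (0.95 in our runs), one common form of the update is

\[
z_{t+1} = \beta\, z_t + \nabla_{\theta} \log \pi_{\theta}(a_t \mid s_t), \qquad
\Delta_{t+1} = \Delta_t + \frac{1}{t+1}\big( r_{t+1}\, z_{t+1} - \Delta_t \big),
\]

so that \Delta_T approximates the policy gradient after T steps. As \beta approaches 1, the bias of the estimate shrinks but its variance grows, which matches the instability observed above.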
(a) Pendulum (b) Pendulum
Figure 8.38: (a) Performance of the GPOMDP algorithm using an RBF network or an FF-NN (b) Learning
curve for the RBF network with different numbers of gradient estimation episodes.
These results resemble those reported by Baxter [11], where a vast number of trials was needed to learn
to navigate a puck on a plateau. Perhaps a different value of the bias-variance parameter (0.95 was used)
might have improved the performance of the algorithm, but due to the poor performance and the long
learning time, this algorithm was not investigated any further.
8.6.2 The PEGASUS algorithm
The PEGASUS algorithm was tested with an FF-NN or the standard RBF network representing the continuous
policy. Again, a sigmoidal policy was used to limit the control values, and the noise controller was disabled.
We tested two approaches for calculating the optimal learning rate in this case: the GSEARCH algorithm
and the standard line search algorithm discussed in section 6.2.5. Again, we tested the algorithms with
different numbers of episodes used for the gradient estimation.
The analytical algorithm must calculate the gradients of the policy and of the transfer function, which is done
numerically; the step size of the three-point method used for this was 0.005. The numerical algorithm, on the
other hand, differentiates the value of the policy directly with respect to the weights and used a differentiation
step size of 0.05. Both differentiation step sizes (for the numerical and the analytical algorithm) were chosen
empirically and only roughly optimized.
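Assuming the three-point method refers to the standard central-difference rule (the formula is not spelled out here), each partial derivative is approximated from two perturbed evaluations:

\[
\frac{\partial f}{\partial w_i} \;\approx\; \frac{f(w + h\, e_i) - f(w - h\, e_i)}{2h},
\]

where e_i is the i-th unit vector, with h = 0.005 when the analytical variant differentiates the policy and transfer function, and h = 0.05 when the numerical variant perturbs the policy weights directly. The error of this rule is of order h^2, which is why it is usually preferred over the simple forward difference.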
We tested the algorithm for the pendulum task with the following parameters. For the RBF network, the
initial learning rate of the GSEARCH algorithm was set to 10, and the learning rates were limited to
[0.1, 160]. For the value-based line search algorithm, we used the learning rates [0.1, 1.0, 5.0, 10.0, 30.0,
60.0, 120.0, 240.0] as search points. Learning was done for 50 weight updates. Using the FF-NN, the start
learning rate was set to 0.1 and limited to [1/160, 160], and the values [0.001, 0.005, 0.01, 0.05, 0.1, 0.5,
1.0, 5.0] were used as search points for the value-based line search. For each value evaluation of the line
search algorithm, 50 episodes were simulated, using the same initial states at each evaluation. For the
FF-NN, learning was done for 1000 weight updates.
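The value-based line search can be summarized by the following stand-alone sketch. The function and variable names are hypothetical and do not correspond to the Toolbox API; the sketch only encodes what is stated in the text: each candidate learning rate is applied to a copy of the weights, and every candidate is scored on the same fixed set of start states (PEGASUS-style scenarios), so the comparison is not disturbed by simulation noise.

#include <cstddef>
#include <functional>
#include <limits>
#include <vector>

// Hypothetical value-based line search over fixed learning-rate candidates.
// evaluatePolicy is expected to simulate the policy defined by the given
// weight vector on the fixed set of scenarios and return the average return.
double lineSearchLearningRate(
    const std::vector<double>& weights,
    const std::vector<double>& gradient,
    const std::vector<double>& candidates,   // e.g. {0.1, 1.0, 5.0, 10.0, ...}
    const std::function<double(const std::vector<double>&)>& evaluatePolicy)
{
    double bestRate  = candidates.front();
    double bestValue = -std::numeric_limits<double>::infinity();

    for (double rate : candidates) {
        std::vector<double> trial(weights);          // copy of the weights
        for (std::size_t i = 0; i < trial.size(); ++i)
            trial[i] += rate * gradient[i];          // tentative gradient step

        const double value = evaluatePolicy(trial);  // same scenarios each time
        if (value > bestValue) {
            bestValue = value;
            bestRate  = rate;
        }
    }
    return bestRate;   // the step size that is actually applied
}

With eight candidate rates and 50 episodes per evaluation, one line search costs 400 simulated episodes per weight update.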
The results for the RBF network, using different numbers of gradient estimation episodes, are shown in
figure 8.39(a). All algorithms managed to learn the RBF policy, even with only five gradient estimation
episodes. Usually, learning was already successful after 10 weight updates, which is an immense improvement over the
GPOMDP algorithm. The line search algorithm for estimating the optimal learning rate seems to be more
robust than the GSEARCH algorithm in this case: the PEGASUS framework eliminates most of the noise in
the value estimates, so the GSEARCH algorithm's advantage of being insensitive to noise no longer matters.
The numerical solution falls off in performance; it only managed to learn the task in 7 out of 10 cases. When
it does learn the solution correctly, the results are as good as for the other two algorithms. Optimizing the
numeric step size would probably have solved this problem, but this was not done due to the high learning
time required by the numeric solution. For the pendulum task, the numeric solution took about 800 seconds
of computation time, whereas the analytical solution needed just 30 seconds, which is a considerable
improvement. Especially for policies with many weights (like a policy based on an RBF network), the
numerical solution has a huge speed disadvantage.
The results for the FF-NN are, unfortunately, not encouraging at all. Learning was not successful in a single
trial after 1000 weight updates for any of the three tested algorithms. Many different magnitudes for the
interval of the optimal learning rate were tried without success. The exact reasons for this have yet to be
investigated, as unfortunately there was no time left for exhaustive tests of the policy gradient algorithms;
intuitively, the gradient approach seems to get stuck in a local optimum very quickly. In general, we can say
that the analytical version of the gradient estimation works very efficiently and also more accurately than
the numerical solution. Complementing this gradient method with other optimization methods such as
genetic algorithms or simulated annealing, which are able to escape local optima, would be a promising
approach, e.g. for learning the policy with an FF-NN.
(a) Pendulum (b) Pendulum
Figure 8.39: (a) Performance of the PEGASUS algorithm using an RBF network (b) Learning curve for the
RBF network with the different PEGASUS approaches.
8.6.3 Conclusions
Our tests of the policy gradient algorithms were unfortunately not as exhaustive as for the other learning
algorithms. The GPOMDP algorithm is mainly of theoretical interest; its poor performance makes it
impractical. However, several extensions and related algorithms of GPOMDP exist which are supposed to
have good performance, but these were not tested in this thesis. The PEGASUS algorithm with the analytical
gradient estimation is more promising, as it is able to calculate the gradient quite accurately while having
good speed. Policy gradient algorithms are also often used with
open-loop control, for example for controlling the gait of a four-legged robot [24]. It would be interesting to
see how the analytical gradient estimation approach works for such policies with a low-dimensional parameter
space. The analytical gradient algorithm does basically the same calculations as the PGAC algorithm, but
without the use of the value function (it can be seen as a special case of the PGAC algorithm with a very long
time interval for the weight updates). In order to find good learning rates and exact gradients, the policy
gradient approach usually needs more simulation steps to learn the task than value-based methods; thus, the
PGAC algorithm should be preferred if the complexity of the learning task allows learning a value function.
8.7 Conclusions
RL is still quite tricky to use for continuous optimal control tasks. While it is very promising for small
toy examples like the pendulum swing-up task, it does not scale up well to tasks with more dimensions and
greater complexity, such as the acrobot task. The effort required to apply RL to these fairly small benchmark
tasks was drastically underestimated, resulting in a considerable delay in finishing this thesis. Some
experiments with the benchmark problems used here, or with new benchmark problems, could not be done
due to the lack of time. Nevertheless, this thesis hopefully gives a good overview of the use of RL algorithms
for optimal control, their strengths and weaknesses, and possible applications. To our knowledge, it is also
the most extensive collection of comparative benchmark data for the different RL algorithms.

In many areas, more extensive tests would be needed in order to draw more detailed conclusions, and the
implementation and comparison of other function approximation schemes (locally weighted learning, NG-nets,
echo-state networks) and a few more algorithms, such as additional policy gradient or Actor-Critic algorithms,
would be interesting. The RL Toolbox already provides a very good framework for additional experiments
with RL; it is in use by approximately 20 researchers all over the world and will hopefully attract more users
in the future.

Generally we can say that the pure RL algorithms used to learn the value function or to estimate the policy
gradient are already quite sophisticated, but nearly all of them lack the ability to scale up to more complex
tasks. This is a consequence of the function approximators used: RBF networks scale badly due to the curse
of dimensionality and are difficult to apply if a highly accurate value function is needed, while FF-NNs and
GS-NNs have comparatively poor learning performance even when good parameter settings have been found,
which makes them very difficult to use for more complex tasks. A sophisticated learning system which uses
RL for complex, high-dimensional tasks would need to combine the benefits of the different algorithms
presented in this thesis, and obviously also solve some of the other open problems, such as autonomous
sub-goal detection or finding good representations for the value function. Many approaches introduced in
this thesis, like directed exploration, the use of planning, or adding a hierarchic structure to the task, showed
very promising results. Further development of these ideas, and combining them appropriately, will hopefully
help us to cope with at least a few of the problems occurring in RL.
Appendix A
List of Abbreviations
RBF Radial Basis Function
DP Dynamic Programming
EM Expectation Maximization
ESN Echo State Network
E-Traces Eligibility Traces
FA Function Approximator
FF-NN Feed Forward Neural Network
GSBFN Gaussian Soft-Max Basis Function Network
GS-NN Gauss Sigmoid Neural Network
HAM Hierarchy of Abstract Machines
LMS Least Mean Square
LQR Linear Quadratic Regulator
LWL Locally Weighted Learning
MDP Markov Decision Process
MSE Mean Squared Error
NG-net Normalized Gaussian networks with linear regression
PEGASUS Policy Evaluation of Goodness and Search Using Scenarios
PGAC Policy Gradient Actor Critic Algorithm
POMDP Partially Observable Markov Decision Process
PS Prioritized Sweeping
Q State Action Value
RARS Robot Auto Racing Simulator
RL Reinforcement Learning
RLT Reinforcement Learning Toolbox
SARSA State Action Reward State Action learning
SMDP Semi-Markov Decision Process
SRV Stochastic Real Valued Algorithm
STL Standard Template Library
TBU Truck-Backer-Upper Task
TD Temporal Difference
V State Value
Appendix B
List of Notations
< >                Tuple
[...]              Vector
A                  Set of all actions
A_s                Set of actions available in state s
a, a_t             Action
β                  Controls the softness of the soft-max policy
β                  Weighting factor for the residual algorithm
β                  Bias-variance trade-off factor in the GPOMDP algorithm
β                  Termination condition of an option
C(s)               Number of visits of state s
                   Exploration measure
critique           TD coming from the critic
d(s)               Probability of initial state s
D                  Set of all initial states
Δ                  Change in a certain value
e(s_i)             E-Trace for state s_i
e(w_i)             E-Trace for weight w_i
E                  Error function
E[·]               Expectation operator
η                  Learning rate
f(s, a)            State transition function
g(s, a, p)         Deterministic simulative model
γ                  Discount factor
H                  Hamiltonian
κ                  Selective attention factor
λ                  E-Trace attenuation factor
∇                  Gradient with respect to the weights (w or θ)
∇_β V              GPOMDP estimate of the policy gradient
p(s)               Action value in actor-critic learning
π                  Policy, state to action mapping
π(s)               Deterministic policy
π(s, a)            Stochastic policy
φ_i                Activation function of feature i
Φ                  Activation vector of all features
o                  Option
O                  Set of all options
P(x = X)           Probability that the random variable X has the value x
P(x = X | y = Y)   Conditional probability that x = X if y = Y is already known
(s), (s, a)        Exploration (action) value function
Q̃_w                Approximated action value function
Q^π(s, a)          Action value when taking action a in state s and then following π
Q*                 Optimal Q-function
r, r_t             Reward
r(s, a, s')        Reward function
r^o_s              Option reward
residual(s, s')    Error of the Bellman equation
σ(s)               Sigmoidal squashing function (logsig)
σ                  Variance, variance of the noise
S                  Set of all states
s, s_t             State
s_0                Initial state
s'                 Successor state
                   Continuous time discount factor
                   Continuous time
td                 Temporal difference
θ                  Parameter vector of the policy
Δt                 Used time step
u                  Continuous control vector
U                  Set of all control values
Ṽ_w                Approximated value function
V^π_A(t), V^π_A(s_t)   Value (future average reward) in time step t when following π
V_A(π)             Expected future average reward beginning in a typical initial state
V^π(t), V^π(s_t)   Value (future discounted reward) in time step t when following π
V(π)               Expected future discounted reward beginning in a typical initial state
V*                 Optimal value function
V_k, Q_k           (Action) value function at the k-th iteration of DP
w                  Weight vector of the value function
w_D                Direct Gradient weight update
w_R                Residual weight update
w_RG               Residual Gradient weight update
W_D                Epoch-wise Direct Gradient weight update
W_R                Epoch-wise Residual weight update
W_RG               Epoch-wise Residual Gradient weight update
Appendix C
Bibliography
[1] P. Absil and R. Sepulchre. A hybrid control scheme for swing-up acrobatics. European Conference on Control (ECC), 2001.
[2] D. Andre, N. Friedman, and R. Parr. Generalized prioritized sweeping. NIPS 97, 1997.
[3] C. Atkeson, A. Moore, and S. Schaal. Locally weighted learning. Artificial Intelligence Review, 11(1-5):75–113, 1997.
[4] C. Atkeson, A. Moore, and S. Schaal. Locally weighted learning for control. Artificial Intelligence Review, 11(1-5):11–73, 1997.
[5] L. Baird. Reinforcement learning in continuous time: Advantage updating. In International Conference on Neural Networks, June 1994.
[6] L. Baird. Reinforcement Learning Through Gradient Descent. PhD thesis, School of Computer Science, Carnegie Mellon University, Pittsburgh, 1999.
[7] A. Barron. Universal approximation bounds for superpositions of sigmoidal functions. IEEE Transactions on Information Theory, 1993.
[8] A. Barto and R. Sutton. Neuron-like adaptive elements that can solve difficult learning control problems. In IEEE Transactions on Systems, Man, and Cybernetics, 1983.
[9] J. Baxter and P. Bartlett. Direct gradient-based reinforcement learning: I. Gradient estimation algorithms. Technical report, CSL, Australian National University, 1999.
[10] J. Baxter, A. Tridgell, and L. Weaver. KnightCap: A chess program that learns by combining TD(λ) with game-tree search. In Proc. 15th International Conf. on Machine Learning, pages 28–36. Morgan Kaufmann, San Francisco, CA, 1998.
[11] J. Baxter and L. Weaver. Direct gradient-based reinforcement learning: II. Gradient ascent algorithms and experiments. Technical report, CSL, Australian National University, 1999.
[12] D. Bertsekas and J. Tsitsiklis. Neuro-Dynamic Programming. Athena Scientific, 1996.
[13] G. Boone. Minimum-time control of the acrobot. International Conference on Robotics and Automation, 1997.
[14] S. Brown and K. Passino. Intelligent control for an acrobot. Journal of Intelligent and Robotic Systems, 1996.
[15] R. Coulom. Reinforcement Learning using Neural Networks. PhD thesis, Institut National Polytechnique de Grenoble, 2002.
[16] T. Dietterich. Hierarchical reinforcement learning with the MAXQ value function decomposition. 1998 International Conference on Machine Learning, 1998.
[17] K. Doya. Reinforcement learning in continuous time and space. Neural Computation, 12, 1999.
[18] P. Fidelman and P. Stone. Learning ball acquisition on a physical robot. In 2004 International Symposium on Robotics and Automation (ISRA), 2004.
[19] V. Gullapalli. Reinforcement Learning and its Application to Control. PhD thesis, Graduate School of the University of Massachusetts, 1992.
[20] J. Izawa, T. Kondo, and K. Ito. Biological arm motion through reinforcement learning. In 2002 IEEE International Conference on Robotics and Automation (ICRA'02), 2002.
[21] H. Jaeger. The "echo state" approach to analysing and training recurrent neural networks. GMD Report 148, 2001.
[22] H. Jaeger. A tutorial on training recurrent neural networks, covering BPTT, RTRL, EKF and the "echo state network" approach. 2002.
[23] S. Kakade. A natural policy gradient. In NIPS, Advances in Neural Information Processing Systems, 2000.
[24] N. Kohl and P. Stone. Policy gradient reinforcement learning for fast quadrupedal locomotion. In The Nineteenth National Conference on Artificial Intelligence, pages 611–616, July 2004.
[25] P. Lancaster and M. Tismenetsky. The Theory of Matrices, with Applications. Academic Press, San Diego, 1984.
[26] Yann LeCun, Léon Bottou, Genevieve B. Orr, and Klaus-Robert Müller. Efficient backprop. In Neural Networks: Tricks of the Trade, Lecture Notes in Computer Science 1524. Springer Verlag, 1998.
[27] R. Makar, S. Mahadevan, and M. Ghavamzadeh. Hierarchical multi-agent reinforcement learning. In AGENTS '01: Proceedings of the Fifth International Conference on Autonomous Agents, pages 246–253, New York, NY, USA, 2001. ACM Press.
[28] H. Miyamoto, J. Morimoto, K. Doya, and M. Kawato. Reinforcement learning with via-point representation. Neural Networks, 17, 2004.
[29] J. Morimoto and K. Doya. Hierarchical reinforcement learning of low-dimensional subgoals and high-dimensional trajectories. In The 5th International Conference on Neural Information Processing, volume 2, pages 850–853, 1998.
[30] J. Morimoto and K. Doya. Reinforcement learning of dynamic motor sequence: Learning to stand up. In IEEE/RSJ International Conference on Intelligent Robots and Systems, volume 3, pages 1721–1726, 1998.
[31] J. Morimoto and K. Doya. Robust reinforcement learning. Advances in Neural Information Processing Systems 13, pages 1061–1067, 2001.
[32] A. Ng and A. Coates. Autonomous inverted helicopter flight via reinforcement learning. In International Symposium on Experimental Robotics, 2004.
[33] A. Ng and M. Jordan. PEGASUS: A policy search method for large MDPs and POMDPs. In Uncertainty in Artificial Intelligence, Proceedings of the Sixteenth Conference, 2000.
[34] M. Nishimura, J. Yoshimoto, and S. Ishii. Acrobot control by learning the switching of multiple controllers. Volume 2, Ninth International Symposium on Artificial Life and Robotics, 2004.
[35] R. Olfati-Saber. Fixed point controllers and stabilization of the cart-pole system and the rotating pendulum. 1999.
[36] R. Parr and S. Russell. Reinforcement learning with hierarchies of machines. In Advances in Neural Information Processing Systems, volume 10. The MIT Press, 1997.
[37] J. Peters, S. Vijayakumar, and S. Schaal. Reinforcement learning for humanoid robotics. Third IEEE-RAS International Conference on Humanoid Robots, 2003.
[38] M. Pfeiffer. Machine learning applications in computer games. Master's thesis, Institute of Computer Science, TU-Graz, 2003.
[39] J. Randlov. Solving Complex Problems with Reinforcement Learning. PhD thesis, Niels Bohr Institute, University of Copenhagen, 2001.
[40] R. Sutton, D. Precup, and S. Singh. Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning. Artificial Intelligence, 112:181–211, 1999.
[41] J. Schaeffer, M. Hlynka, and V. Jussila. Temporal difference learning applied to a high-performance game-playing program. International Joint Conference on Artificial Intelligence (IJCAI), pages 529–534, 2001.
[42] K. Shibata, M. Sugisaka, and K. Ito. Hand reaching movement acquired through reinforcement learning. Proc. of 2000 KACC (Korea Automatic Control Conference), 2000.
[43] J. Si and Y. Wang. On-line learning control by association and reinforcement. Volume 3, IEEE-INNS-ENNS International Joint Conference on Neural Networks (IJCNN'00), 2000.
[44] S. Singh and R. Sutton. Reinforcement learning with replacing eligibility traces. Machine Learning, 22, 1996.
[45] W. Smart. Making Reinforcement Learning Work on Real Robots. PhD thesis, Department of Computer Science, Brown University, 2002.
[46] W. Smart and L. Kaelbling. Practical reinforcement learning in continuous spaces. In Proc. 17th International Conf. on Machine Learning, pages 903–910. Morgan Kaufmann, San Francisco, CA, 2000.
[47] W. Smart and L. Kaelbling. Reinforcement learning for robot control. Mobile Robots XVI, 2001.
[48] R. Sutton. Integrated architectures for learning, planning, and reacting based on approximating dynamic programming. In Proceedings of the Seventh International Conference on Machine Learning, pages 216–224, 1990.
[49] R. Sutton and A. Barto. Reinforcement Learning: An Introduction. MIT Press, 1998.
[50] G. Tesauro. Temporal difference learning and TD-Gammon. Communications of the ACM, 38(3), 1995.
[51] S. Thrun. Selective exploration. Technical Report CMU-CS-92-102, Carnegie Mellon University, Computer Science Department, Pittsburgh, 1992.
[52] J. Tsitsiklis and B. Van Roy. An analysis of temporal-difference learning with function approximation. Technical Report LIDS-P-2322, 1996.
[53] H. Vollbrecht. Hierarchic task composition in reinforcement learning for continuous control problems. In ICANN 98. Neural Information Processing Department, University of Ulm, 1999.
[54] R. Williams. A class of gradient-estimating algorithms for reinforcement learning in neural networks. Proceedings of the IEEE First Annual International Conference on Neural Networks, 1987.
[55] J. Wyatt. Exploration and Inference in Learning from Reinforcement. PhD thesis, Department of Artificial Intelligence, University of Edinburgh, 1997.
[56] T. Yonemura and M. Yamakita. Swing up control of acrobot. SICE Annual Conference in Sapporo, 2004.
[57] J. Yoshimoto and S. Ishii. Application of reinforcement learning to balancing of acrobot. In 1999 IEEE International Conference on Systems, Man and Cybernetics, pages 516–521, 1999.