
Mapping the Design Space of Reinforcement Learning Problems - a Case Study

Katharina Tluk v. Toschanowitz, Neuroinformatics Group, University of Bielefeld, ktoschan@techfak.uni-bielefeld.de
Barbara Hammer, AG LNM, University of Osnabrück, hammer@informatik.uni-osnabrueck.de

Helge Ritter, Neuroinformatics Group, University of Bielefeld, helge@techfak.uni-bielefeld.de

Abstract
This paper reports on a case study motivated by a typical reinforcement learning problem in robotics: an overall goal which decomposes into several subgoals has to be reached in a large, discrete state space. For simplicity, we model this problem in a standard gridworld setting and perform an extensive comparison of different parameter and design choices. In doing so, we focus on the central role of the representation of the state space. We examine three fundamentally different representations with counterparts in real-life robotics. We investigate their behaviour with respect to (i) the size and properties of the state space, (ii) different exploration strategies, including the recent proposal of multi-step actions, and (iii) the type and parameters of the reward function.

1 Introduction

Reinforcement learning (RL) provides an elegant framework for modeling biological and technical reward-based learning [7, 18]. In contrast to supervised learning, which uses explicit teacher information, RL algorithms only require a scalar reinforcement signal. They are therefore ideally suited for complex real-world applications like self-exploring robots where the optimal solution is not known beforehand. Numerous applications of RL to various complex tasks in the area of robotics over the past years demonstrate this fact, including applications to robot soccer and the synthesis of non-trivial motor behaviour like standing up [1, 9, 12, 17]. The experimental results can be put on a mathematical foundation by linking these learning strategies to dynamic programming paradigms [3].

The construction of reinforcement learning algorithms and their application to real-world problems involves many non-trivial design choices, ranging from the representation of the problem itself to the choice of the optimal parameter values. Since real-world problems, especially those encountered in human-inspired robotic systems, tend to be very complex and high-dimensional, these design choices are crucial for the successful operation and fast performance of the learning algorithm. RL is well-founded and guaranteed to converge if the underlying observation space is discrete and full information about the process, which must fulfil the Markov property, is available [3]. Real-world processes, however, usually rely on very large or even continuous state spaces, so that full exploration is no longer feasible. As an example, consider the task of learning different grasping strategies with a human-like pneumatic five-fingered hand, which is our ultimate target application for RL: this process is characterised by twelve degrees of freedom. The question we have to answer is how to explore and map this high-dimensional space of possibilities in order to create and apply an efficient RL system to learn optimal control strategies. On the one hand, we could use function approximation such as neural networks to directly deal with the real-valued state space (see e.g. [6, 14]). In that case, the convergence of the standard RL algorithms is no longer guaranteed and the learning process might diverge [2, 3]. On the other hand, we could approximate the process by a finite number of discrete values and learn a strategy in terms of the discrete state space. However, even a small number of intervals per variable yields a high-dimensional state space. Therefore, additional information must be incorporated to shape the state space and the search process. This information might include the decomposition of the task into subgoals (for grasping, possible subgoals could be reaching a contact point or the sequential closing of the fingers around the object) or a specific exploration strategy (e.g. imitating a human). In addition, prior knowledge can be used to map the state space into lower dimensions by ignoring information which is irrelevant for an optimal strategy: in robot grasping, the exact position of an object is irrelevant, whereas the direction in which the effector should move would be sufficient information to find an optimal strategy.

In this paper, we are interested in the possibilities of shaping a high-dimensional discrete state space for reinforcement learning. Our main focus lies on the particularly crucial choice of the representation of the state space. In a simple case study, an artificial gridworld problem which has been loosely motivated by the aforementioned task of robot grasping, we examine, by way of example, the effects of three different representation schemes on the efficiency and convergence properties of RL. Note that a compressed state space representation introduces perceptual aliasing [23]: during mapping, states from the original problem become indistinguishable and the problem loses its Markov property. Consequently, the state space representation has a considerable effect on learning: on the one hand, the structure of the state space affects the exploration of the learner and might thus require a specific exploration strategy for optimal convergence. On the other hand, the learned control strategy is formulated in terms of the current state space, with the result that the choice of its representation determines the existence, uniqueness, and generalisation abilities of the resulting strategy.
During our experiments, we put a special focus on the following aspects of the problem: different modes of representation of the state-action space are investigated with respect to (i) their size and specific properties, (ii) exploration versus exploitation, in particular speeding up the exploration by incorporating multi-step actions, and (iii) the choice of the reward function, in particular the incorporation of subgoals. A rather compact representation in combination with advanced exploration strategies will allow us to achieve robust control strategies which are also potential candidates for alternative settings. We first introduce the general scenario and briefly recall Q-learning. Afterwards, we introduce and discuss three different representation schemes for the state space. We investigate the behaviour of these representations under different exploration strategies and reward types. We conclude with a discussion and a set of open questions for further research.

2 The Scenario
Our scenario is a simple artificial gridworld problem which can be seen as a slightly modified version of the rooms gridworld [19] with just one room and several (sub-)goal states that have to be reached in a certain order. The state space consists of a two-dimensional grid with n distinct goal states {g_1, ..., g_n}. The objective of the agent is to successively reach all of the goal states in the specified order 1, 2, ..., n with a minimum total number of steps. If the agent reaches the final goal state g_n after having visited the goal states g_i in the correct order, the attempt is counted as a success; otherwise it is regarded as a failure.

Figure 1. Left: A 10×10 grid with one actor and a set of goal states that have to be reached in a certain order. The agent can perform four different movements. Right: The ad hoc representation using the (x,y)-values (3,3) in the compressed version of the state space (see text), the distance representation (3,3,2) and the direction representation (right, below, left). The grid shown here is smaller than the one used for the experiments.

The possible actions are the primitive deterministic one-square movements within the von Neumann neighbourhood (see figure 1 (left)). The selection of this scenario as a test problem was mainly motivated by two important aspects: on the one hand, it is a simple scenario, so that we can focus on the aspect of representation and avoid additional difficulties like noise or imprecision of the computation. On the other hand, this setting already incorporates important aspects of real-life applications like grasping with a five-fingered robotic hand: we can scale the problem by increasing the grid size, which mirrors the possibility of discretising real-valued scenarios using differently sized meshes. The task of sequentially visiting the states g_1 to g_n causes a high dimensionality and complexity of the problem. In addition, this decomposition into several subgoals loosely mimics various sequential aspects of grasping, such as the successive closing of the different fingers around an object or the transitions between the different phases of a grasping attempt, e.g. the closing behaviour of the fingers before and after the first contact with the object [13, 16]. We set the immediate reward to -0.1 for each step that does not end on one of the goal states, 1 for reaching one of the (sub-)goal states {g_1, ..., g_{n-1}} in the correct order, and 10 for reaching the final goal state g_n. The aim is to learn a strategy which maximises the overall discounted reward at time point t, Σ_i γ^i r_{t+i}, where r_{t+i} is the reward at time t+i and γ < 1 is the discount factor. Standard one-step Q-learning [20, 21] finds an optimal strategy by means of the following iterative update: Q(s_t, a_t) ← Q(s_t, a_t) + α [r_{t+1} + γ max_a Q(s_{t+1}, a) - Q(s_t, a_t)], where s_t and a_t are the state and action at time t, α is the learning rate and γ is the discount factor. In the limit, Q estimates the expected overall reward when taking action a_t in state s_t. An optimal strategy always chooses action a_t = argmax_a Q(s_t, a). Q-learning is guaranteed to converge in a Markovian setting if all state-action pairs are adapted infinitely often with an appropriate learning rate [3], but it is unclear what happens in scenarios that do not possess the Markov property.
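As a minimal illustration of this update rule, the following Python sketch stores the Q-matrix as a dictionary and combines the one-step update with an ε-greedy action choice. The helper names (`ACTIONS`, `q_update`, `epsilon_greedy`) are ours rather than part of the original implementation, and the default parameter values simply mirror those reported for the experiments below.

```python
import random
from collections import defaultdict

ACTIONS = ["up", "down", "left", "right"]   # one-square moves in the von Neumann neighbourhood

def q_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.9):
    """One-step Q-learning: Q(s,a) += alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))."""
    best_next = max(Q[(s_next, a2)] for a2 in ACTIONS)
    Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])

def epsilon_greedy(Q, s, epsilon=0.1):
    """Choose a random action with probability epsilon, otherwise the greedy one."""
    if random.random() < epsilon:
        return random.choice(ACTIONS)
    return max(ACTIONS, key=lambda a: Q[(s, a)])

# The Q-matrix is zero-initialised, as in the experiments described below.
Q = defaultdict(float)
```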

3 Experiments and Results


At the beginning of the first learning trial, the starting state and the n distinct goal states are selected randomly. The agent is then positioned in the starting state and starts choosing and performing actions, thereby changing its position in the state space. The trial terminates when the agent reaches the final goal state after having visited all the (sub-)goal states in the correct order or when the total number of steps exceeds a specified maximum value. After one success (or failure due to the step limit), the agent starts a new trial with the same starting point and the same goal states, but it remembers the already partially adapted Q-matrix from the last trial. This procedure is repeated 100 times with the same starting point and goal states and a successively enhanced Q-matrix, so that the performance of the agent typically improves drastically with the increasing number of trials (see figure 2). We use n = 3 goal states in a 10×10 grid and a maximum number of 10,000 steps per trial as default values for all further experiments. In addition, we use periodic boundary conditions to avoid any impact of edge effects on our results. Unless otherwise mentioned, the action selection is performed using an ε-greedy strategy with ε = 0.1. We use a discount rate of γ = 0.9, a learning rate of α = 0.1 and a zero-initialised Q-matrix. After 100 trials with one set of goal states and starting point and one Q-matrix, a new set of starting conditions is chosen randomly, the Q-matrix is initialised with zeroes and the agent begins another 100-trial period. In total, 500 trial periods of 100 trials each are performed for each parameter set. As an evaluation measure, we report the average number of steps needed to reach the final goal state over the course of 100 trials.
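A compact sketch of this experimental protocol, reusing the helpers from the previous sketch and a hypothetical GridWorld-style environment object (not part of the paper), could look as follows.

```python
from collections import defaultdict

def run_period(make_env, n_trials=100, max_steps=10_000, epsilon=0.1):
    """One trial period: start and goal states stay fixed, the Q-matrix is kept across trials."""
    env = make_env()                       # randomly chosen start and three goal states, 10x10 grid
    Q = defaultdict(float)                 # zero-initialised Q-matrix
    steps_per_trial = []
    for trial in range(n_trials):
        s = env.reset()                    # same start and goal configuration in every trial
        for step in range(max_steps):
            a = epsilon_greedy(Q, s, epsilon)
            s_next, r, done = env.step(a)  # -0.1 per step, 1 per subgoal, 10 at the final goal
            q_update(Q, s, a, r, s_next)
            s = s_next
            if done:
                break
        steps_per_trial.append(step + 1)
    return steps_per_trial                 # averaged over 500 such periods in the experiments
```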

3.1 Representation of the State-Action Space

Most of the scenarios where reinforcement learning is applied, and especially those in robotics, have an underlying continuous state-action space rather than the grid-based state space of the current example. If these problems are to be solved with an algorithm working on a discrete state-action space, like simple Q-learning, the discrete representation of the state-action space plays a pivotal role for the success and the speed of the learning algorithm. One of the main objectives when choosing a representation is a small resulting state-action space, to reduce the necessary amount of exploration and thus to shorten the learning time. In addition, the algorithm should be able to generalise: the learned optimal policy should be applicable to a range of different settings, including a scaled version of the basic problem. The main characteristics of the three representation possibilities that will be discussed in this section are summarised in table 1.

A direct representation of our scenario is achieved through the use of the x- and y-coordinates of the agent and of each of the n goal states plus a success counter s ranging from 0 to n. Since this yields a deterministic Markovian scenario, the convergence of the single-step Q-learning algorithm is guaranteed for appropriate parameter values [3]. A disadvantage of this representation is the exponential increase of the size of the state space with the number of subgoals n to be visited. This size can be reduced considerably if the x- and y-coordinates of the n goal states are omitted and just the x- and y-coordinates of the agent and the success counter s are used (see figure 1 (right)). This representation still provides a unique characterisation of each state if one fixed setting is considered. However, the resulting strategy generalises neither to different positions of the subgoals nor to different grid sizes. We refer to this setting, a characterisation of the state by the x- and y-coordinates of the agent and a counter for the already reached subgoals, as the ad hoc representation. It might be interesting, for example, to model the situation that a fixed grasp of one fixed object in a well-defined position is to be learned.

As an alternative, we consider representations that generalise to different positions of the subgoals or even to different grid sizes. Naturally, they should result in a small state space comparable to the ad hoc representation. An interesting representation in the context of robotics is to maintain the distances of the agent to all (sub-)goals which are still to be visited and a success counter (see figure 1 (right)).
                            ad hoc    distance    direction
situation information         ++         +            0
perceptual aliasing            0         +           ++
implementability               -         0            +
size of the state space       ++         0            +
generalisation capability      0         +           ++

Table 1. Comparison of the three different representations, including the implementability on a real robot (not discussed in the text).

                       ad hoc               distance                  direction
scaling behaviour      dx·dy·n              (dx/2 + dy/2)^n · n       5^n · n
grid size              10×10    20×20       10×10       20×20         any
n = 1                    100      400          10           20            5
n = 2                    200      800         200          800           50
n = 3                    300    1,200       3,000       24,000          375
n = 4                    400    1,600      40,000      640,000        2,500
n = 5                    500    2,000     500,000   16,000,000       15,625

Table 2. The dimension of the state space for the different representations. n is the number of goal states to be visited, dx × dy is the dimension of the grid. The state space for an ad hoc representation with the positions of all n subgoals scales with (dx·dy)^n · n.

This encoding mirrors the assumption that a robot only needs to know when it moves closer to or away from a goal. A representation in terms of distances can be expected to generalise to different goal positions within the grid and also, partially, to differently sized grids. The state space scales exponentially with the number of states to be visited; however, it is considerably smaller than a direct representation of all positions of the goals and, in addition, it can be expected to offer better generalisation capabilities. Unfortunately, this representation does not differentiate between all the states of the underlying system and thus need not yield a Markovian process. In addition, it is unclear whether an optimal strategy can be formulated in terms of this representation: if the current position of the agent is (5,5), the different goal positions (2,3) and (9,6) result in the same distance from the agent to the goal (5) but require different moves.

A third representation that conserves only a small portion of the original information is the use of the directions from the agent to each of the n goal states plus a success counter s (see figure 1 (right)). Speaking in terms of robotics, the agent only needs to know the next grasping direction in order to find a successful strategy. For simplification, the number of different directions is limited to five (above, below, right, left and here) in the current scenario. The direction representation results in a small state space which is independent of the underlying grid size. It can be expected to facilitate generalisation not only between scenarios with the same grid size but also between differently sized grids. Another advantage is the fact that an optimal strategy can clearly be formulated in terms of this state space (move in the direction of the current goal until you reach it). However, this representation introduces severe perceptual aliasing and need not maintain the Markov property of the process. If the current position of the agent is (5,5), for example, the different goal positions (7,7) and (8,3) result in the same direction from the agent to the goal (right) and are thus indistinguishable.

As mentioned previously, a central aspect of the different representations that has a great influence on the learning speed and behaviour of the algorithm is the size of the state space, which is given in table 2. As seen in this section, the use of the direction and the distance representations can cause problems because the resulting state space is only partially observable. This phenomenon is known as perceptual aliasing [22, 23]. Consequently, the question arises whether the optimal strategy can be found in these representations using the classical learning algorithm, whose convergence is only guaranteed when the Markov property holds.
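To make the three encodings concrete, the following sketch shows one possible way to compute them from the agent position, the ordered list of goal positions, and the success counter s. The function names, the use of Manhattan distances on the periodic grid, and the tie-breaking rule for choosing one of the five directions are our own assumptions for illustration and are not prescribed by the text.

```python
def ad_hoc_state(agent, s):
    """Simplified ad hoc representation: agent coordinates plus success counter."""
    x, y = agent
    return (x, y, s)

def torus_distance(agent, goal, width=10, height=10):
    """Manhattan distance on a grid with periodic boundary conditions."""
    dx = abs(agent[0] - goal[0])
    dy = abs(agent[1] - goal[1])
    return min(dx, width - dx) + min(dy, height - dy)

def distance_state(agent, goals, s, width=10, height=10):
    """Distance representation: distances to the goals still to be visited, plus the counter."""
    return tuple(torus_distance(agent, g, width, height) for g in goals[s:]) + (s,)

def direction_state(agent, goals, s):
    """Direction representation: one of five symbols (here/left/right/above/below) per goal."""
    def direction(goal):
        dx, dy = goal[0] - agent[0], goal[1] - agent[1]
        if dx == 0 and dy == 0:
            return "here"
        if abs(dx) >= abs(dy):               # assumed tie-break: prefer the horizontal axis
            return "right" if dx > 0 else "left"
        return "below" if dy > 0 else "above"
    return tuple(direction(g) for g in goals) + (s,)
```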
                        ad hoc    distance    direction
convergence speed         ++         0           ++
solution quality          ++         0           ++
need for exploration      +          ++          ++

Table 3. Results for the different representations.

Figure 2. First results: two exemplary learning curves using the ad hoc (left) and the direction (right) representation. A linear decay of ε with the parameter k is used (see section 3.2). Depending on the choice of parameter values, the direction representation either leads to a slower convergence of the learning algorithm or to a high number of outliers (trials with a very high number of steps until the final goal is reached).

3.1.1 First Results

Table 3 provides a condensed map of the overall characteristics of the different experiments performed so far. These qualitative results mirror the quantitative behaviour of the algorithm. As expected, the distance representation needs a long learning time, which is mainly due to the size of its state space, which is about 10 times larger than in the other two representations (see table 2). In addition, it did not even succeed in learning the optimal policy in every set of trials, which confirms our assumption that the optimal policy is not representable in terms of this reduced state-action space. We therefore discarded this representation and use only the ad hoc and the direction representation in our further, more detailed, experiments. The (simplified) ad hoc representation and the direction representation both produce good results. Using the ad hoc representation, the learning of a near-optimal policy usually took place within about 40 trials and the resulting policy was very close to the global optimum (see figure 2). Depending on the choice of the learning parameters, the direction representation either needed a greater number of trials to learn an optimal policy, or the learning speed was almost the same but with a greater number of outliers (trials with a very large number of steps until success, or even with no success at all; see figure 2). This effect can be explained by perceptual aliasing and the resulting special structure of the Q-matrix: since many different state-action pairs are mapped to the same entry, the algorithm needs a greater amount of random exploration until it has reached the goal state often enough to backpropagate its positive value. However, since the direction representation offers much more potential regarding generalisation ability (as explained in section 3.1), it is certainly worth further investigation, even if the first results seem to show some advantages of the (simplified) ad hoc representation.

3.2 Exploration vs. Exploitation


As seen in the last section, exploration is extremely important for the performance of the learning algorithm due to the phenomenon of perceptual aliasing. During learning, the amounts of exploration and exploitation have to be carefully balanced. Since the learning algorithm does not know anything about its surroundings at the beginning, the first goal must be to explore a high number of possible state-action combinations in order to gather as much information as possible about the system. Once a sufficient amount of information has been accumulated, it is more advantageous to at least partly exploit this information and to restrict the exploration to those areas of the state-action space that seem to be relevant for the solution of the current problem (this is known as the exploration-exploitation dilemma, see e.g. [18]). One way of achieving the desired behaviour is to use an ε-greedy strategy: the best action according to the current Q-matrix is chosen with probability (1 - ε), a random action is chosen with probability ε. We experimented with several different ε-greedy strategies with a decaying ε, where τ ∈ {0, 1, ..., 100} is the number of the current trial: (i) a linear decrease with decay factor k ∈ [0.01, 0.1]: ε = max(0.9 - k·τ, 0), (ii) an exponential decay with parameter l ∈ [1, 10]: ε = 0.9 · l^(-τ), and (iii) a sigmoidal decay with p, q ∈ N: ε = 1/(1 + e^(-(q-τ)/p)).

Figure 3. Exploration results: ad hoc (left) and direction representation (right).

3.2.1 Results

During the following experiments, we first determined the optimal parameter values for the decay of ε and afterwards compared the behaviour of the different scenarios using these optimal values. The use of sigmoidal decay generally resulted in considerably slower learning behaviour than the other exploration-exploitation strategies. Consequently, this approach was soon abandoned and all further experiments were focused on the linear and the exponential decay of ε. In the ad hoc representation scenario, k = 0.03 was optimal for a linear decay of ε and l = 2 was optimal for an exponential decay (see figure 3). The latter shows a somewhat faster learning behaviour. In the direction representation scenario, k = 0.02 was optimal for a linear decay and l = 1.05 was optimal for an exponential decay. The latter again shows a slightly faster learning behaviour. A comparison of the optimal parameters shows a much greater need for exploration in the direction representation than in the ad hoc representation, which tallies with the results from section 3.1.1. This is caused by the high amount of perceptual aliasing and the resulting special structure of the Q-matrix. The amount of exploration therefore proved to be crucial for the performance of the learning algorithm.
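The three decay schedules are simple enough to state directly in code. In the sketch below, tau is the trial index within a period; the parameter defaults for the linear and exponential schedules are the values reported as optimal for the ad hoc representation, while the sigmoid parameters p and q are purely illustrative since the text only states p, q ∈ N.

```python
import math

def epsilon_linear(tau, k=0.03):
    """Linear decay: eps = max(0.9 - k * tau, 0)."""
    return max(0.9 - k * tau, 0.0)

def epsilon_exponential(tau, l=2.0):
    """Exponential decay: eps = 0.9 * l**(-tau)."""
    return 0.9 * l ** (-tau)

def epsilon_sigmoid(tau, p=5, q=50):
    """Sigmoidal decay: eps = 1 / (1 + exp(-(q - tau) / p)); p and q are hypothetical values."""
    return 1.0 / (1.0 + math.exp(-(q - tau) / p))
```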

3.3 Speeding up Exploration


In order to speed up the exploration process, we experimented with multi-step actions (MSAs) of a fixed length [15]. An MSA corresponds to p successive executions of the same primitive action. The motivation behind MSAs is the fact that the immediate repetition of the same action is often part of an optimal strategy in many technical or biological processes. In grasping, for example, it is reasonable to move the gripper more than one step in the same direction before making a new decision, especially if the object is still far away. Thus, the incorporation of MSAs shapes the search space in these settings so that promising action sequences are explored first. In our case, a set of MSAs consists of four multi-step actions corresponding to the four primitive actions. The parameters that need to be chosen are the number m of MSA sets that are to be added to the regular action set plus the repeat count p_i for each of the sets.
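A minimal sketch of how one such MSA could be executed and its discounted reward accumulated is given below; the environment interface and the function name are assumed for illustration, and the accumulation anticipates the expansion of the reinforcement function described after figure 4.

```python
def execute_msa(env, s, primitive_action, p=3, gamma=0.9):
    """Execute the same primitive action p times and accumulate the discounted reward.

    Returns the resulting state, the discounted MSA reward and the visited transitions,
    so that an MSA-Q-learning style algorithm can also update the primitive actions [15].
    """
    total_reward = 0.0
    transitions = []
    for j in range(p):
        s_next, r, done = env.step(primitive_action)
        total_reward += (gamma ** j) * r          # r_MSA = sum_j gamma^j * r_j
        transitions.append((s, primitive_action, r, s_next))
        s = s_next
        if done:                                  # stop early if the final goal is reached
            break
    return s, total_reward, transitions
```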

Figure 4. Speeding up exploration via multi-step actions: in addition to the four primitive actions, there are now four multi-step actions of length 3 in the action set. The ad hoc representation is shown on the left, the direction representation on the right.

The use of MSAs requires an expansion of the reinforcement function: the value of a multi-step action is set to the total (properly discounted) value of the corresponding single-step actions, r_MSA_i = Σ_{j=0}^{p_i} γ^j r_i, where r_i is the reward for the primitive action i. In addition, we used the MSA-Q-learning algorithm [15], which propagates the rewards received for performing an MSA to the corresponding primitive actions in order to best exploit the gained information.

3.3.1 Results

The best results were achieved by adding just one set of MSAs of length 3 to the action set. As in section 3.2.1, we first determined the optimal parameter values for the decay of ε within each scenario. The general shape of the resulting learning curves is very similar to the one seen in section 3.2.1, but there are two notable differences (see figure 4): firstly, the speed of learning has improved, so that the algorithm needs about 10 trials less to find a near-optimal policy. Secondly, the optimal parameters for the decay of ε have changed drastically: from k = 0.03 (linear decay) and l = 2 (exponential decay) to k = 0.07 and l = 10 in the ad hoc representation, and from k = 0.02 and l = 1.05 to k = 0.05 and l = 5 in the direction representation. This shows that the algorithm now needs a substantially smaller amount of random exploration to learn the optimal policy because the exploration is biased towards promising directions by the use of MSAs. This effect is especially pronounced in the direction representation, which concurs with the results from sections 3.1.1 and 3.2.1. Consequently, the use of multi-step actions to speed up exploration can be seen as a good first step in bringing the reinforcement learning algorithm closer to applicability in a real-world problem where exploration is expensive.

3.4 Choice of Rewards


A second important aspect we investigated is the optimal choice of the reward function. In the default scenario, we gave a reward of 1 for reaching a sub-goal state, a reward of 10 for reaching the final goal state and a reward of -0.1 for all other steps (called many rewards in this section). We tested two additional scenarios: one with just the rewards for the sub-goal states and the final goal state, but no negative reward (called each success), and one where the reward is given only for reaching the final goal state (called end reward). The results can be seen in figure 5. As expected, the strategy each success does not encourage short solutions and thus yields only suboptimal strategies. However, since no negative rewards are added to the Q-values, the Q-matrix and therefore also the graphs are very smooth. When using only end rewards, the convergence takes twice as long as with many rewards using the ad hoc representation (see figure 5) because no positive feedback is given until the final goal state has been reached.

Figure 5. Choice of rewards: ad hoc representation; a linear decay of ε with k = 0.07 is shown on the left, an exponential decay of ε with l = 10 on the right.

Surprisingly, end reward eventually yields a nearly optimal strategy even though short solutions are not encouraged by a punishment of intermediate steps. This can be explained by the fact that, since exploration is so difficult for this task, only short solutions, which can be explored in a short time, survive. The many rewards approach offers the highest amount of information and thus leads to the fastest learning behaviour. This behaviour is even more pronounced if the compressed direction representation is used: Q-learning now does not succeed at all if combined with end rewards or each success. This is caused by the lack of sufficient structure of the state space, which inhibits successful exploration of the complex task without guidance by intermediate rewards. Consequently, the division of the task into subgoals and the punishment of a large step number turn out to be crucial for the direction representation.
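The three reward schemes compared here can be summarised in a single function; the argument and scheme names below are ours, while the numerical values are the ones stated in the text.

```python
def reward(reached_subgoal, reached_final_goal, scheme="many rewards"):
    """Immediate reward for one step under the three schemes discussed in section 3.4."""
    if scheme == "many rewards":        # default scenario
        if reached_final_goal:
            return 10.0
        if reached_subgoal:
            return 1.0
        return -0.1                     # punish every ordinary step
    if scheme == "each success":        # sub-goal and final rewards, no step punishment
        if reached_final_goal:
            return 10.0
        if reached_subgoal:
            return 1.0
        return 0.0
    if scheme == "end reward":          # only the final goal is rewarded
        return 10.0 if reached_final_goal else 0.0
    raise ValueError(f"unknown scheme: {scheme}")
```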

3.5 Scaling Behaviour


An aspect that is of particular importance, especially with regard to a possibly continuous underlying state space, is the behaviour of the algorithm in a scaled version of the basic problem. We conduct several experiments with differently detailed discrete approximations of the same state space and measure the number of steps necessary for 100 successes (see table 4). We use only the primitive one-step actions and no multi-step actions in this case because the latter would blur the differences between the different grid sizes. The step numbers are adjusted so that they refer to steps of equal length throughout the differently fine grids. As expected, a larger state space leads to a considerably higher step number due to the need for more exploration with an increasing grid size. There seems to be no significant difference between the ad hoc and the direction representation. Note, however, that the direction representation offers a huge advantage: the resulting Q-matrix is independent of the grid size. Thus, the Q-matrix learned in a coarse discretisation (e.g. 10×10) can be directly applied to a finer discretisation (e.g. 30×30). Further experiments to determine the different qualities of Q-matrices learned with differently sized grids and their applicability to different settings are currently being conducted. This property would allow us to design efficient iterative learning schemes for the direction representation, moving from a very coarse discretisation to a fine approximation of the continuous state space while saving a considerable amount of learning time.
                ad hoc (k = 0.03)                direction (k = 0.02)
             n = 1     n = 2     n = 3         n = 1     n = 2     n = 3
10×10        2,623     5,736     8,969         3,429     7,559    11,473
20×20        8,793    20,607    31,491         7,366    23,610    33,953
30×30       17,394    36,852    55,858        15,304    29,689    60,324

Table 4. Number of steps needed for 100 successes using different grid sizes.

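Because the direction representation never refers to absolute coordinates, a Q-matrix keyed by direction states can be reused unchanged on a finer grid. The sketch below merely illustrates this reuse; the environment object, the dictionary-based Q-matrix and the greedy policy are the same assumptions as in the earlier sketches.

```python
def reuse_on_finer_grid(Q_coarse, env_fine, actions, max_steps=10_000):
    """Run a greedy policy on a finer grid with a Q-matrix learned on a coarse grid.

    This only makes sense for grid-size independent encodings such as the direction
    representation, where the state keys (directions plus success counter) are
    identical on both grids.
    """
    s = env_fine.reset()
    for step in range(max_steps):
        a = max(actions, key=lambda act: Q_coarse.get((s, act), 0.0))
        s, r, done = env_fine.step(a)
        if done:
            return step + 1      # number of steps needed on the fine grid
    return None                  # no success within the step limit
```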

4 Discussion
In this paper, we investigated different design choices of a reinforcement learning application to an artificial gridworld problem which serves as an idea filter for real-life target applications. As demonstrated, different representations cause major differences with respect to the basic aspects of the problem, notably the size of the state space and the mathematical properties of the process due to perceptual aliasing; consequently, they show a different robustness and a different need for shaping and exploration. We demonstrated that exploration is particularly relevant if the state space representation hides a part of the structure of the problem, as the direction representation does. In this setting, advanced exploration strategies such as multi-step actions prove to be particularly valuable. In addition, the decomposition of the problem into subgoals is crucial. Even in this simplified setting we found some remarkable and unexpected characteristics: the distance encoding of the state space is inferior to an encoding in terms of directions even though the latter provides less information about the underlying state. Multi-step actions improve the convergence behaviour, but simple multi-step actions which repeat each action thrice provide better performance than a larger action spectrum including MSAs of different repeat counts. The many rewards reinforcement function proved superior to the two alternative reinforcement signals given by end rewards or each success. However, unlike each success, end rewards converged to short solutions although no punishment of unnecessary steps was incorporated in this setting. These findings stress the importance of creating a map of simplified situations to allow a thorough investigation of reinforcement paradigms. The results found in these reduced settings can guide design choices and heuristics for more complex settings in real-life reinforcement scenarios.

An important issue closely connected to the scaling behaviour of learning algorithms is their generalisation ability. The design of the state space defines the representation of the value function and thus widely determines the generalisation ability of a reinforcement learner. So far, we have not yet explicitly addressed the generalisation ability of the various representations considered in this article. It can be expected that the direction representation facilitates generalisation to different settings within the same grid and, moreover, to grid structures of different sizes. The latter property is particularly striking if an underlying continuous problem is investigated. As demonstrated, smaller scenarios require less exploration, so that a good strategy can be learned in a short time for small grids. Appropriate generalisation behaviour would then allow us to immediately transfer the trained Q-matrix to larger grids. This could be combined with automatic grid adaptation strategies as proposed e.g. in [10]. Since a successful generalisation to larger grid spaces can only be expected if large parts of the Q-matrix have been updated, further reduction techniques which drop irrelevant attributes, as proposed e.g. in [4] for supervised learning, could also be valuable. Another innovative possibility of achieving inherent generalisation is to expand the capacity of the action primitives. Instead of simple one-step moves, restricted though powerful actions which depend on the current perception of the agent could be introduced.
Locally linear control models constitute particularly promising candidates for such a design because of their inherent generalisation capability combined with an impressive, though linearly restricted, capacity [5, 8, 11]. So far we have tackled the above problem as an artificial setting which mirrors important structures of real-life problems such as differently sized grids, subgoals, and a large state space. The incorporation of powerful action primitives, however, turns this simple strategy learned by reinforcement learning into a powerful control strategy applicable to real-life scenarios like grasping. Local finger movements can be reliably controlled by basic actions such as locally linear controllers, where the applicability of a single controller is limited to a linear regime of the process. A global control of these local moves shares important characteristics with our setting: there is a comparably small number of possible basic actions which has to be coordinated. Thus, our setting can be interpreted as a high-level reinforcement learner on top of basic independent actions. Consequently, a transfer of multi-step actions to this domain seems particularly promising: the iteration of the same action might correspond to the choice of the same local linear controller until its local application domain has been left. An interesting further direction within this context is to design perception-dependent stop criteria for the durations of actions.

References
[1] C. G. Atkeson and S. Schaal. Robot learning from demonstration. In Proc. 14th ICML, pages 12-20. Morgan Kaufmann, 1997.
[2] L. C. Baird. Residual algorithms: Reinforcement learning with function approximation. In Proc. 12th ICML, pages 30-37. Morgan Kaufmann, 1995.
[3] D. P. Bertsekas and J. N. Tsitsiklis. Neuro-dynamic programming. Athena Scientific, Belmont, MA, 1996.
[4] T. Bojer, B. Hammer, D. Schunk, and K. Tluk von Toschanowitz. Relevance determination in learning vector quantization. In M. Verleysen, editor, ESANN 2001, pages 271-276. D-facto publications, 2001.
[5] T. W. Cacciatore and S. J. Nowlan. Mixtures of controllers for jump linear and non-linear plants. In J. D. Cowan, G. Tesauro, and J. Alspector, editors, Advances in NIPS, volume 6, pages 719-726. Morgan Kaufmann Publishers, Inc., 1994.
[6] C. Gaskett, D. Wettergreen, and A. Zelinsky. Q-learning in continuous state and action spaces. In Australian Joint Conference on Artificial Intelligence, pages 417-428, 1999.
[7] L. P. Kaelbling, M. L. Littman, and A. P. Moore. Reinforcement learning: A survey. Journal of Artificial Intelligence Research, 4:237-285, 1996.
[8] Z. Kalmár, C. Szepesvári, and A. Lőrincz. Module-based reinforcement learning: Experiments with a real robot. Machine Learning, 31(1-3):55-85, April 1997.
[9] J. Morimoto and K. Doya. Acquisition of stand-up behavior by a real robot using hierarchical reinforcement learning. Robotics and Autonomous Systems, 36:37-51, 2001.
[10] S. Pareigis. Adaptive choice of grid and time in reinforcement learning. Advances in Neural Information Processing Systems, 10:1036-1042, 1998.
[11] J. Randløv, A. G. Barto, and M. T. Rosenstein. Combining reinforcement learning with a local control algorithm. In Proc. 17th ICML, pages 775-782, 2000.
[12] M. Riedmiller, A. Merke, D. Meier, A. Hoffmann, A. Sinner, O. Thate, and R. Ehrmann. Karlsruhe Brainstormers - a reinforcement learning approach to robotic soccer. In P. Stone, T. Balch, and G. Kraetzschmar, editors, RoboCup-2000: Robot Soccer World Cup IV, pages 367-372. Springer, Berlin, 2001.
[13] H. Ritter, J. Steil, C. Nölker, F. Röthling, and P. McGuire. Neural architectures for robotic intelligence. Reviews in the Neurosciences, 14(1-2):121-143, 2003.
[14] J. C. Santamaría, R. S. Sutton, and A. Ram. Experiments with reinforcement learning in problems with continuous state and action spaces. Adaptive Behavior, 6(2), 1998.
[15] R. Schoknecht and M. Riedmiller. Reinforcement learning on explicitly specified time scales. Neural Computing & Applications Journal, 12(2):61-80, 2003.
[16] J. Steil, F. Röthling, R. Haschke, and H. Ritter. Learning issues in a multi-modal robot-instruction scenario. In Proc. IROS, Workshop on Robot Programming Through Demonstration, Oct 2003.
[17] P. Stone and R. S. Sutton. Scaling reinforcement learning toward RoboCup soccer. In Proc. 18th ICML, pages 537-544, San Francisco, CA, 2001. Morgan Kaufmann.
[18] R. S. Sutton and A. G. Barto. Reinforcement Learning: An Introduction. MIT Press, Cambridge, 1998.
[19] R. S. Sutton, D. Precup, and S. P. Singh. Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning. Artificial Intelligence, 112(1-2):181-211, 1999.
[20] C. Watkins. Learning from Delayed Rewards. PhD thesis, Cambridge University, 1989.
[21] C. Watkins and P. Dayan. Q-learning. Machine Learning, 8:279-292, 1992.
[22] T. Wengerek. Reinforcement-Lernen in der Robotik. PhD thesis, Technische Fakultät, Universität Bielefeld, 1995.
[23] S. D. Whitehead and D. H. Ballard. Active perception and reinforcement learning. In Proc. 10th ICML, pages 179-188. Morgan Kaufmann, 1990.
