
DAMAS-Rescue 2006

Stephane Ross, Sebastien Paquet, and Brahim Chaib-draa


DAMAS laboratory, Laval University, Quebec (Qc), Canada
{ross,spaquet,chaib}@damas.ift.ulaval.ca,
http://www.damas.ift.ulaval.ca

Abstract. In this paper, we describe DAMAS-Rescue, a team of agents participating in the RoboCupRescue simulation competition. In the following, we explain the strategies of all our agents that will be used at the 2006 world competition in Bremen, Germany. In short, FireBrigade agents choose the best fire to extinguish based on the knowledge they have acquired with a selective perception learning method. AmbulanceTeams always rescue the same civilian, based on the messages received from the AmbulanceCenter. This center uses a scheduling algorithm to order the civilians in a way that maximizes the number of civilians rescued. PoliceForces use an online POMDP algorithm to choose the best roads to visit at each turn.

1 Introduction

It is not an easy task to develop a multiagent system acting in the RoboCupRescue simulation environment, since it is a complex environment with many challenges. To share our work on this project, we explain in this article how the DAMAS-Rescue [1] team works by describing the strategies that we have implemented for the 2006 world competition. The first method we have developed is a selective perception learning method that enables the FireBrigade agents to learn their effectiveness when they are extinguishing fires. By evaluating their capacity to extinguish a fire and the utility of doing so, they are able to coordinate their choices on the most important fires to extinguish. For the PoliceForce agents, we have developed an online POMDP algorithm that enables them to evaluate different paths and choose the best one based on their current belief state about the state of the roads. This algorithm gives them flexibility in their decisions about which roads to clear and, at the same time, it enables them to coordinate themselves. We have designed the reward function so that each PoliceForce agent repulses the other PoliceForce agents, thus preventing them from working on the same part of the city. For the AmbulanceTeam agents, we have adapted a scheduling algorithm enabling agents to order the civilians in a way that maximizes the number of civilians that can be saved before their death. Even though we consider that most readers of this article already know the RoboCupRescue simulation environment, we describe it briefly in the next section so that everyone can follow. Afterwards, we describe the strategies that we have developed for the 2006 RoboCupRescue world competition. We begin by describing the FireStation and the FireBrigades, then the PoliceForces and the PoliceOffice and, finally, the AmbulanceTeams and the AmbulanceCenter.

2 The RoboCupRescue Environment

The RoboCupRescue simulation project aims to simulate rescue teams acting in large urban disasters [2]. More precisely, this project takes the form of an annual competition in which participants design rescue agents trying to minimize the damage caused by a big earthquake, such as buried civilians, buildings on fire and blocked roads. In the simulation, participants have approximately 30 to 40 agents of six different types to manage:
FireBrigade There are 0 to 15 agents of this type. Their task is to extinguish
fires. Each FireBrigade agent is in contact by radio with all other FireBrigade
agents as well as with the FireStation.
PoliceForce There are 0 to 15 agents of this type. Their task is to clear roads
to enable agents to circulate. Each PoliceForce agent is in contact by radio
with all other PoliceForce agents as well as with the PoliceOffice.
AmbulanceTeam There are 0 to 8 agents of this type. Their task is to search
in shattered buildings for buried civilians and to transport injured agents to
hospitals. Each AmbulanceTeam agent is in contact by radio with all other
AmbulanceTeam agents as well as with the AmbulanceCenter.
Center agents There are three types of center agents: FireStation, PoliceOffice
and AmbulanceCenter. These agents can only send and receive messages.
They are in contact by radio with all their mobile agents as well as with the
other center agents. A center agent can read more messages than a mobile
agent, so center agents can serve as information centers and coordinators for
their mobile agents.
In the simulation, each individual agent receives visual information about only the region surrounding it. Thus, no agent has complete knowledge of the global state of the environment. This uncertainty complicates the problem greatly because agents have to explore the environment and they also have to communicate to help each other acquire a better knowledge of the situation.

3 FireBrigade and FireStation Agents

In this section, we focus on the FireBrigade and the FireStation agents. As mentioned before, the task of the FireBrigade agents is to extinguish fires. Therefore, at each time step, each FireBrigade agent has to choose which building on fire to extinguish. However, in order to be effective, FireBrigade agents have to coordinate themselves on the same buildings on fire, because more than one agent is often needed to extinguish a burning building. The main problem is that they do not know how many agents are needed for each particular building on fire. To learn this, agents use a selective perception reinforcement learning algorithm, which is described in the next section.

3.1 Selective Perception

To learn the expected reward of extinguishing one building on fire or one fire zone, we have used a selective perception technique [3], because our state description is too large. With this technique, an agent learns by itself to reduce the number of possible states. The algorithm uses a tree structure similar to a decision tree. By building the tree, the agent learns by itself to reduce the number of possible task descriptions. In fact, the agent regroups all similar tasks together and it does not distinguish between tasks of the same group. It considers tasks to be similar if they have similar expected rewards, because the agent would take the same decision in all of those situations and therefore does not have to distinguish them. In this application, instances used in the tree contain the number of agents accomplishing a task, $n$, and a task description, $d$, which consists of: the intensity of the fire (3 possible values), the building's composition (3 possible values), the building's size (continuous value), the building's damage (4 possible values) and the number of adjacent buildings on fire (continuous value). Therefore, the number of possible instances is quite large. In fact, with the continuous attributes building size and number of adjacent buildings on fire, there is an infinite number of instances. However, with our learning algorithm, the number of abstract task descriptions considered is kept small.
3.2 Recording of the Agents' Experiences

At each time step $t$, the agent records its experience captured as an instance that contains the task it tried ($d_t$), the number of agents that tried the same task ($n_t$) and the reward it obtained ($r_t$). Each instance also has a link to the preceding instance and the next one, thus forming a chain of instances. In our case, we have one chain for each task that an agent chooses to accomplish. A chain contains all instances from the time an agent chooses to accomplish a task until it changes to another task. Therefore, during its execution, the agent records many instances organized in many instance chains. It keeps all those instances until the end of a trial.
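As an illustration, the instances and chains described above could be represented as follows; this is a minimal Python sketch and the class and field names are ours, not the team's actual code.

from dataclasses import dataclass
from typing import List, Optional

@dataclass
class TaskDescription:
    # Attributes listed in Section 3.1 (the encodings are illustrative).
    fire_intensity: int       # 3 possible values
    composition: str          # wood, steel frame or reinforced concrete
    size: float               # continuous
    damage: int               # 4 possible values
    adjacent_fires: float     # continuous

@dataclass
class Instance:
    d: TaskDescription            # task tried at time t
    n: int                        # number of agents that tried the same task
    r: float                      # reward obtained
    prev: Optional["Instance"] = None
    next: Optional["Instance"] = None

def record(chain: List[Instance], d: TaskDescription, n: int, r: float) -> Instance:
    """Append a new experience to the current chain and link it to its predecessor."""
    inst = Instance(d, n, r)
    if chain:
        inst.prev = chain[-1]
        chain[-1].next = inst
    chain.append(inst)
    return inst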
3.3 Tree Structure

To learn how to classify the instances, we use a tree structure similar to a decision tree. The tree divides the instances into clusters depending on their expected reward. The objective here is to regroup all instances having similar expected rewards. The algorithm presented here is an instance-based algorithm in which a tree is used to store all instances, which are kept in the leaves of the tree. The other nodes of the tree, called center nodes, are used to divide the instances with a test on a specific attribute. Each leaf of the tree also contains a Q-value indicating the expected reward if a task that belongs to this leaf is chosen. In our approach, a leaf $l$ of the tree is considered to be a task description (a state) for the learning algorithm.

Fig. 1. Structure of a tree. [In this example, center nodes test the building composition (Wood, Steel Frame, Reinforced Concrete), the fire intensity (Weak, Moderate, Strong), the building size (threshold $t_1$) and the number of agents (threshold $t_2$); the oval nodes (LN) are leaf nodes.]

Algorithm 1 Algorithm used to update the tree.

Procedure Update-Tree(Instances)
Input: Instances: all instances to add to the tree.
Static: Tree: the tree.
for all i in Instances do
  Add-Instance(Tree, i)
end for
Update-Q-Values(Tree)
Expand(Tree)
Update-Q-Values(Tree)

An example of a tree is shown in Figure 1. Each rectangular node represents a test on the specified attribute. The words on the links represent possible values for discrete variables. The tree also contains a center node testing a continuous attribute, the building size. A test on a continuous attribute always has two possible outcomes: the value is either less than or equal to the threshold, or greater than the threshold. The oval nodes (LN) are the leaf nodes of the tree where the instances and the Q-values are stored.
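The tree itself could be represented along the following lines; this sketch assumes the Instance structure shown earlier, and the names (LeafNode, CenterNode, find_leaf) are illustrative rather than the team's actual implementation.

from dataclasses import dataclass, field
from typing import Dict, List, Optional, Union

@dataclass
class LeafNode:
    instances: List = field(default_factory=list)   # instances stored in this leaf
    q_value: float = 0.0                             # expected reward for tasks in this leaf

@dataclass
class CenterNode:
    attribute: str                                   # attribute tested, e.g. "composition" or "n"
    threshold: Optional[float] = None                # set only for continuous attributes
    children: Dict[object, "Node"] = field(default_factory=dict)

Node = Union[LeafNode, CenterNode]

def find_leaf(node: Node, instance) -> LeafNode:
    """Descend from the root, following the test result at each center node
    (this is also the lookup used by the Add-Instance step of Algorithm 1)."""
    while isinstance(node, CenterNode):
        if node.attribute == "n":                    # test on the number of agents
            value = instance.n
        else:                                        # test on a task-description attribute
            value = getattr(instance.d, node.attribute)
        if node.threshold is not None:               # continuous test: <= threshold or > threshold
            key = value <= node.threshold
        else:                                        # discrete test: branch on the value itself
            key = value
        node = node.children[key]
    return node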
3.4 Update of the Tree

This section presents the algorithm used to update the tree using all the newly recorded instances. Algorithm 1 shows an abstract version of the algorithm and the following subsections present each function used in more detail.
Add Instances After a trial, all agents put their new instances together. This set of new instances is then used to update the tree. Thus, the first step of the algorithm is simply to add all instances to the leaves they belong to. To find those leaves, we simply start at the root of the tree and head down the tree, choosing at each center node the branch indicated by the result of the test on the instance's attributes, which could be one of the attributes of the task description $d$ or the number of agents $n$.

Update Q-values The second step updates the Q-values of each leaf node to
take into consideration the new instances which were just added. The updates
are done with the following equation:
$$Q'(l) = R(l) + \sum_{l'} T(l, l')\, Q(l') \qquad (1)$$

where $Q(l)$ is the expected reward if the agent tries to accomplish a task belonging to the leaf $l$, $R(l)$ is the estimated immediate reward if a task that belongs to the leaf $l$ is chosen, and $T(l, l')$ is the estimated probability that the next instance will be stored in leaf $l'$ given that the current instance is stored in leaf $l$. Those values are calculated directly from the recorded instances: $R(l)$ is the average reward obtained when a task in $l$ was chosen and $T(l, l')$ is the proportion of next instances that are in leaf $l'$:
$$R(l) = \frac{\sum_{i_t \in I_l} r_t}{|I_l|} \qquad (2)$$

$$T(l, l') = \frac{|\{\, i_t \mid i_t \in I_l \wedge L(i_{t+1}) = l' \,\}|}{|I_l|} \qquad (3)$$

where $L(i)$ is a function returning the leaf $l$ of an instance $i$, $I_l$ represents the set of all instances stored in leaf $l$, $|I_l|$ is the number of instances in leaf $l$ and $r_t$ is the reward obtained at time $t$ when $n_t$ agents were trying to accomplish the task $d_t$.
To update the Q-values, equation 1 is applied iteratively until the average squared error is less than a small specified threshold. The error is calculated using the following equation, which is the average squared difference between the new and the old Q-values:

$$E = \frac{\sum_{l} \left( Q'(l) - Q(l) \right)^2}{n_l} \qquad (4)$$
where $n_l$ is the number of leaf nodes in the tree.
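A small sketch of this iterative update, assuming the estimates $R(l)$ and $T(l, l')$ of equations 2 and 3 have already been computed from the stored instances; the function and variable names are ours.

def update_q_values(leaves, R, T, epsilon=1e-4):
    """Apply equation (1) repeatedly until the average squared change (equation 4)
    falls below epsilon.

    leaves: list of leaf identifiers
    R[l]:   average reward of the instances stored in leaf l (equation 2)
    T[(l, lp)]: estimated probability of moving from leaf l to leaf lp (equation 3)
    """
    Q = {l: 0.0 for l in leaves}
    while True:
        new_Q = {l: R[l] + sum(T.get((l, lp), 0.0) * Q[lp] for lp in leaves)
                 for l in leaves}
        error = sum((new_Q[l] - Q[l]) ** 2 for l in leaves) / len(leaves)
        Q = new_Q
        if error < epsilon:
            return Q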
Expand the Tree After the Q-values have been updated, the third step checks all leaf nodes to see if it would be useful to expand a leaf and replace it with a new center node, thus dividing the instances more finely and refining the agents' representation of the task description space.
To find the best test to divide the instances, we try all possible tests, i.e. we try to divide the instances according to each attribute describing a task or the number of agents. After all attributes have been tested, we choose the attribute that maximizes the error reduction, as shown in equation 5 [4].
The error measure considered is the standard deviation ($sd(I_l)$) of the instances' expected rewards. Therefore, a test is chosen if, by splitting the instances, it ends up reducing the standard deviation of the expected rewards. The expected error reduction obtained when dividing the instances $I_l$ of leaf $l$ is calculated using the following equation, where $I_k$ denotes the subset of instances in $I_l$ that have the $k$-th outcome for the potential test:
$$Error = sd(I_l) - \sum_{k} \frac{|I_k|}{|I_l|}\, sd(I_k) \qquad (5)$$

The standard deviation is calculated on the expected reward of each instance, which is defined as:

$$Q_I(i_t) = r_t + T(L(i_t), L(i_{t+1}))\, Q(L(i_{t+1})) \qquad (6)$$

where $T(L(i_t), L(i_{t+1}))$ is calculated using equation 3 and $Q(L(i_{t+1}))$ using equation 1. The Q-values of the newly generated nodes are estimated by doing a local update of the Q-values only on those new nodes.
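As a sketch, the expected error reduction of a candidate test (equation 5) could be computed as follows, given the expected reward $Q_I(i_t)$ of every instance in the leaf (equation 6) and the outcome each instance gets for the test; the leaf would then be expanded with the test yielding the largest reduction. The helper names are ours.

from statistics import pstdev

def error_reduction(expected_rewards, outcomes):
    """Expected error reduction of a candidate test (equation 5).

    expected_rewards: Q_I(i_t) of every instance in the leaf (equation 6)
    outcomes: the test outcome of each instance, aligned with expected_rewards
    """
    groups = {}
    for q, k in zip(expected_rewards, outcomes):
        groups.setdefault(k, []).append(q)
    n = len(expected_rewards)
    return pstdev(expected_rewards) - sum(len(g) / n * pstdev(g) for g in groups.values())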
3.5 Use of the Tree

During the execution, each agent uses this tree to estimate the number of partners that are needed to accomplish a task. Since the number of agents is considered as an attribute when the tree is learned, there are center nodes in the tree testing the number of agents. Therefore, if different numbers of agents are tested, different leaves and thus different rewards may be found, even with the same task description. Consequently, to find the number of agents needed for a particular task, the agent can test different numbers of agents and look at the expected rewards returned by the tree.
Algorithm 2 presents the function used to estimate the number of agents needed for a given task. In this algorithm, the function Expected-Reward returns the expected reward if $n$ agents are trying to accomplish the task $d$. To do so, the agent finds the corresponding leaf in the tree, considering the task $d$ and the number of agents $n$, and records the expected reward for this task that is stored in the leaf found.
This function is called for all possible numbers of agents until the expected reward returned by the tree is greater than a specified threshold. If the expected reward is greater than the threshold, it means that the current number of agents should be enough to accomplish the task. If the expected reward stays under the threshold, even with the maximum number of agents, the function returns ∞, meaning that the task is considered impossible with the available number of agents.
3.6 Agents Coordination

During the simulation, the agents use the tree created offline to choose the best fire zone and the best building on fire to extinguish. Since the FireStation agent has a better global view of the situation, it is its responsibility to suggest fire zones to FireBrigade agents. Those agents, however, have a better local view, so they choose which particular building on fire to extinguish in the given zone. By doing so, we can take advantage of the better global view of the FireStation agent and the better local view of the FireBrigade agents at the same time.

Algorithm 2 Algorithm used to find the number of agents needed for a given task description d.

Function Number-Agents-Needed(d)
Input: d: a task description.
Statics: Tree: the tree learned.
  N: the number of available agents.
  Threshold: the limit to surpass.
for n = 1 to N do
  expReward ← Expected-Reward(Tree, d, n)
  if expReward ≥ Threshold then return n
end for
return ∞
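In Python, and reusing the illustrative find_leaf and Instance structures sketched earlier, Expected-Reward and Algorithm 2 could look roughly as follows; the infinity return value marks the task as impossible with the available agents.

import math

def expected_reward(tree, d, n):
    """Return the Q-value stored in the leaf reached with task description d and n agents."""
    probe = Instance(d=d, n=n, r=0.0)          # the reward field is irrelevant for the lookup
    return find_leaf(tree, probe).q_value

def number_agents_needed(tree, d, available_agents, threshold):
    """Smallest n whose expected reward reaches the threshold, or infinity if none does."""
    for n in range(1, available_agents + 1):
        if expected_reward(tree, d, n) >= threshold:
            return n
    return math.inf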

Fire Zones Delimitation At the beginning of the simulation, we assign a fire zone to each building by regrouping buildings that are immediately adjacent into the same fire zone. The roads on the map are used as fire zone delimiters. This delimitation is useful to take into account the way fire spreads in the city: adjacent buildings in the same fire zone are more likely to catch fire quickly than buildings that are in adjacent fire zones, since the heat must travel a greater distance to reach them.
Fire Zones Allocation The tree learned is used to get an estimate of the number of agents that are needed to extinguish a fire. To allocate the fire zones, the FireStation agent has a list of all fire zones containing at least one building on fire. For each fire zone, it has to estimate the number of agents that are needed to extinguish this zone. To do so, it makes a list of all the buildings on fire and finds the number of agents that are needed to extinguish each of them, using Algorithm 2. It then estimates the number of agents needed to extinguish the fire zone as the maximum number of agents returned for one building in the zone. The FireStation agent does the same thing with all fire zones, ending up with a number of agents for each zone. The FireStation also checks whether the agents can reach each fire zone using an unblocked route. With this information, it then chooses the reachable zone for which the needed FireBrigade agents are the closest. If no fire zone is reachable, we simply send the agents to the closest fire zone. Afterwards, it removes this zone and the assigned agents from its lists and continues the process with the remaining agents and the remaining fire zones until there is no agent or fire zone left. When a zone is chosen, the assigned agents are those that are closest to this zone. At the end, the FireStation agent sends to each FireBrigade agent its assigned fire zone.
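A hedged sketch of this greedy allocation loop is given below; the agents_needed, reachable and distance helpers stand in for the estimates described above and are assumptions on our part.

import math

def allocate_fire_zones(zones, brigades, agents_needed, reachable, distance):
    """Repeatedly pick the reachable zone whose required agents are closest,
    assign those agents to it, and continue with what remains.

    agents_needed(zone):   estimated number of agents for the zone (possibly math.inf)
    reachable(zone):       True if an unblocked route to the zone is known
    distance(agent, zone): travel estimate used to pick the closest agents
    """
    assignment, zones, brigades = {}, list(zones), list(brigades)
    while zones and brigades:
        candidates = [z for z in zones if reachable(z)] or zones   # fall back to any zone

        def cost(zone):
            # Average distance of the closest agents that would be sent to this zone.
            k = int(min(agents_needed(zone), len(brigades)))
            closest = sorted(brigades, key=lambda a: distance(a, zone))[:k]
            return sum(distance(a, zone) for a in closest) / max(len(closest), 1)

        zone = min(candidates, key=cost)
        k = int(min(agents_needed(zone), len(brigades)))
        chosen = sorted(brigades, key=lambda a: distance(a, zone))[:k]
        assignment[zone] = chosen
        brigades = [a for a in brigades if a not in chosen]
        zones.remove(zone)
    return assignment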
Choice of Buildings on Fire When choosing a building to extinguish, a FireBrigade agent has a list of the buildings on fire it knows about in the specified fire zone. More precisely, this list is a sorted list of buildings on fire in which each building at position $i$ is the closest building, not already on the list, to the building at position $i-1$. The first building is a reference building given by the FireStation when it assigns the fire zone.
All FireBrigade agents have approximately the same list of buildings on fire. To choose their building on fire, they go through the list, one building at a time. For each building, they use the tree to find the expected number of agents needed to extinguish the fire, by using Algorithm 2. With this information, each agent assigns a building to itself by considering that the other agents choose their building according to the same information. If they actually have the same information, they should be well coordinated on the buildings to extinguish.
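The implicit coordination can be sketched as follows: every agent runs the same deterministic assignment over the shared sorted list, assuming a common ordering of the FireBrigade identifiers. The function below is an illustration, not the team's exact code.

import math

def choose_building(my_id, agent_ids, buildings, agents_needed):
    """Pick a building for my_id, assuming all FireBrigades share the same sorted
    building list and the same set of agent identifiers.

    agents_needed(b): estimated number of agents required for building b (may be math.inf)
    """
    remaining = sorted(agent_ids)
    for b in buildings:
        needed = agents_needed(b)
        k = len(remaining) if math.isinf(needed) else min(int(needed), len(remaining))
        group, remaining = remaining[:k], remaining[k:]
        if my_id in group:
            return b
        if not remaining:
            break
    return buildings[0] if buildings else None   # fall back to the reference building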
Fire Prevention In order to prevent the fire from spreading to other buildings, we use one FireBrigade in each assigned fire zone to send water on buildings not on fire, to do fire prevention. Our experience with the fire simulator showed that prevention on buildings immediately adjacent to buildings on fire is rarely efficient. Therefore, we adopt a different strategy in which we try to prevent the fire from spreading to other fire zones that do not already contain a building on fire.
The FireBrigade in charge of fire prevention uses a heuristic function to approximate which building is most likely to catch fire in the adjacent fire zones. This heuristic returns a probability for the building to catch fire, computed from different parameters, such as the number of adjacent buildings on fire within a certain radius, the distance separating the building from the other buildings on fire, the building's size and composition and the quantity of water already sent on the building. The FireBrigade does fire prevention on the building having the highest probability of catching fire, if this probability is higher than a certain threshold; otherwise, it extinguishes the buildings already on fire.
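To make the idea concrete, a heuristic of this kind could combine the listed parameters as a weighted score; the weights, scales and the composition_factor attribute below are purely our assumptions, not the team's actual function.

import math

def euclidean(p, q):
    """Distance between two (x, y) positions."""
    return math.hypot(p[0] - q[0], p[1] - q[1])

def ignition_probability(building, fires, water_sent, radius=50000.0,
                         weights=(0.4, 0.3, 0.2, 0.1)):
    """Rough estimate of the probability that `building` catches fire.

    building:   object with position, size and composition_factor attributes (hypothetical)
    fires:      burning buildings, each with a position attribute
    water_sent: quantity of water already sent on the building
    """
    nearby = [f for f in fires if euclidean(building.position, f.position) <= radius]
    if not nearby:
        return 0.0
    closeness = 1.0 - min(euclidean(building.position, f.position) for f in nearby) / radius
    density = min(len(nearby) / 10.0, 1.0)               # adjacent buildings on fire
    flammability = min(building.composition_factor * building.size / 1000.0, 1.0)
    dryness = 1.0 / (1.0 + water_sent)                   # water already sent lowers the risk
    w1, w2, w3, w4 = weights
    return min(w1 * density + w2 * closeness + w3 * flammability + w4 * dryness, 1.0)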

4 PoliceForce and PoliceOffice Agents

PoliceForce agents play a key role in the rescue operation by clearing the roads, thus enabling all agents to circulate. Without them, some actions would be impossible because agents would be indefinitely blocked by road blockades. Therefore, it is really important for them to be fast and efficient.
For the police agents, we developed an online POMDP approach based on a look-ahead search to find the best action to execute at each cycle in the environment. Since we need a fast online algorithm, we opted for a factored POMDP representation and a branch-and-bound strategy based on a limited depth-first search instead of classical dynamic programming. The tradeoff obtained between the solution quality and the computing time is very interesting.
4.1 Belief State Value Approximation

The main idea of our online POMDP approach is to estimate the value of a belief state by constructing a tree where the nodes are belief states and where the branches are combinations of actions and observations. To do so, we have defined a new function $V : \mathcal{B} \times \mathbb{N} \rightarrow \mathbb{R}$ which is based on a depth-first search. The function takes as parameters a belief state $b$ and a remaining depth $d$ and returns an estimate of the value of $b$ by performing a search of depth $d$. For the first call, $d$ is initialized to $D$, the maximum depth allowed for the search.

$$V(b, d) = \begin{cases} U(b) & \text{if } d = 0 \\ \max_{a} \left[ R(b, a) + \gamma \sum_{o} P(o \mid b, a)\, V(\tau(b, a, o),\, d - 1) \right] & \text{if } d > 0 \end{cases} \qquad (7)$$

When $d = 0$, we are at the bottom of the search tree. In this situation, we need a utility function $U(b)$ that gives an estimate of the real value of this belief state. When $d > 0$, the value of a belief state at a depth of $D - d$ is the immediate reward for being in this belief state added to the maximum discounted reward of the subtrees underneath this belief state, where $\tau(b, a, o)$ denotes the belief state obtained after performing action $a$ in $b$ and perceiving observation $o$.
Finally, the agent's policy $\pi$, which returns the action the agent should do in a certain belief state, is defined as:

$$\pi(b, D) = \operatorname*{arg\,max}_{a} \left[ R(b, a) + \gamma \sum_{o} P(o \mid b, a)\, V(\tau(b, a, o),\, D - 1) \right] \qquad (8)$$

4.2 RTBSS Algorithm

We have developed an algorithm, called RTBSS (see Algorithm 3), that is used to construct the search tree and to find the best action. Since it is an online algorithm, it must be applied each time the agent has to make a decision.
To speed up the search, our algorithm uses a branch-and-bound strategy to prune some subtrees. The algorithm first explores a child node in the tree and computes its value, which becomes a lower bound on the maximal expected value of the current node. Afterwards, for each other child node, the algorithm can evaluate with a heuristic function whether it is possible to improve the lower bound by pursuing the search (at line 8). The heuristic function must be defined for each problem and it must always overestimate the true value. Moreover, the purpose of sorting the actions at line 5 is to try the most promising actions first, because this generates more pruning early in the search tree.
With RTBSS, the agent finds at each turn the action that has the maximal expected value up to a certain horizon D. As a matter of fact, the performance of the algorithm strongly depends on the maximal depth D of the search.
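A rough Python transcription of this search (Algorithm 3, shown below) follows; the model interface (actions, reward, observations, obs_prob, update, heuristic, utility, gamma) is our own assumption about how the POMDP components could be exposed, not part of the team's code.

def rtbss(model, b, d, D):
    """Depth-limited branch-and-bound search; returns (value, best_action_at_the_root)."""
    if d == 0:
        return model.utility(b), None
    best_action, max_value = None, -float("inf")
    # Try the most promising actions first so that pruning occurs early.
    actions = sorted(model.actions(b), key=lambda a: model.reward(b, a), reverse=True)
    for a in actions:
        # Prune the subtree if even an overestimate cannot beat the current lower bound.
        if model.reward(b, a) + model.heuristic(b, a, d) <= max_value:
            continue
        value = model.reward(b, a) + model.gamma * sum(
            model.obs_prob(o, b, a) * rtbss(model, model.update(b, a, o), d - 1, D)[0]
            for o in model.observations(b, a)
        )
        if value > max_value:
            max_value = value
            if d == D:
                best_action = a
    return max_value, best_action

The agent would call rtbss(model, current_belief, D, D) at each cycle and execute the returned action (or, in our RoboCupRescue adaptation, the whole best branch).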
4.3 RoboCupRescue Viewed as a POMDP

We now present how we modelled the RoboCupRescue problem as a POMDP, from the point of view of a policeman agent. The different actions an agent can do are: North, South, East, West and Clear. A state of the system can be described by approximately 1500 random variables, depending on the simulation.
Roads There are approximately 800 roads in a simulation and each road can either be blocked or cleared. This makes $2^{800}$ possible configurations.


Algorithm 3 The RTBSS algorithm.

1: Function RTBSS(b, d) returns the estimated value of b.
   Inputs: b: the current belief state.
           d: the current depth.
   Statics: D: the maximal search depth.
            action: the best action.
2: if d = 0 then return U(b)
3: actionList ← Sort(b, A)
4: max ← −∞
5: for all a ∈ actionList do
6:   if R(b, a) + Heuristic(b, a, d) > max then
7:     curReward ← R(b, a) + γ Σ_o P(o|a, b) RTBSS(τ(b, a, o), d − 1)
8:     if curReward > max then
9:       max ← curReward
10:      if d = D then action ← a
11:    end if
12:  end if
13: end for
14: return max

Buildings There are approximately 700 buildings in a simulation. We consider that a building can be on fire or not.
Agents' positions An agent can be on any of the 800 roads and there are usually 30 to 40 agents. This makes $800^{30}$ different possibilities.
If we estimate the number of states, we obtain $2^{800} \cdot 2^{700} \cdot 800^{30}$ states. However, a strong majority of them are not possible and will never be reached. The state space of RoboCupRescue is far too large to even consider applying offline algorithms. We must therefore adopt an online method that allows finding a good solution very quickly.
In RoboCupRescue, the online search in the belief state space represents a
search in the possible paths that an agent can take. In the tree, the probability
to go from one belief state to another depends on the probability that the road
used is blocked. One specificity of this problem is that we have to return a path
to the simulator, thus the RTBSS algorithm has been modified to return the best
branch of the tree instead of only the first action. Moreover, we have defined a
dynamic reward function that gives a reward for clearing a road that depends on
the position of the fires and the other agents. This enables the agent to efficiently
compute its estimated rewards based on its current belief state without having
to explicitly store all rewards for all states.
A policeman agent needs to assign a reward to each road in the city; roads are represented as nodes in a graph (see Figure 2). The reward values change over time based on the positions of the agents and the fires, so the agent needs to recalculate them at each turn.
Fig. 2. Reward functions graph. [Each node is a road; the labels inside a node are the reward contributions from the nearby sources (FF: FireFighter, F: Fire, P: Policeman) and the number above a node is the total reward.]

To calculate the reward values, the agent propagates rewards over the graph, starting from the rewarding roads, which are the positions of the agents and the fires. For example, if a fire fighter agent is on road r1, then this road would receive a reward of 5, the roads adjacent to r1 in the graph would receive a reward of 4, and so on. We also add rewards to all roads within a certain perimeter around a fire.
What is interesting with this reward function is that it can be used to coordinate the policeman agents. Coordination is necessary because we do not want all agents to go to the same road. To achieve it, each agent propagates negative rewards around the other policemen, so that they repulse each other.
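A sketch of this propagation over the road graph is given below; the source values (5 for a fire fighter, 3 for a fire, -3 for another policeman) and the decay of one unit per road follow the example of Figure 2 but are otherwise illustrative.

from collections import deque

def propagate(road_graph, source, value, step=1):
    """Breadth-first propagation: the source road gets `value` and each further
    road gets a reward one `step` closer to zero, stopping when zero is reached.

    road_graph: dict mapping a road id to the list of adjacent road ids
    """
    rewards, queue = {source: value}, deque([source])
    while queue:
        road = queue.popleft()
        r = rewards[road]
        next_r = r - step if r > 0 else r + step
        if (r > 0 and next_r <= 0) or (r < 0 and next_r >= 0):
            continue
        for neighbour in road_graph[road]:
            if neighbour not in rewards:
                rewards[neighbour] = next_r
                queue.append(neighbour)
    return rewards

def road_rewards(road_graph, firefighter_roads, fire_roads, police_roads):
    """Sum the contributions of every source: positive around FireBrigades and
    fires, negative around the other PoliceForce agents."""
    total = {}
    sources = ([(r, 5) for r in firefighter_roads] +
               [(r, 3) for r in fire_roads] +
               [(r, -3) for r in police_roads])
    for road, value in sources:
        for k, v in propagate(road_graph, road, value).items():
            total[k] = total.get(k, 0) + v
    return total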
Figure 2 shows an example of a reward graph. The nodes represent the roads and the reward sources are identified in each node. The big number over a node is the total reward, which is the sum of all rewards identified in the node. As we can see, roads around the fire fighter agent receive positive rewards, while roads around the policeman agent receive negative rewards. Therefore, the agent would want to go to roads near the fire and not necessarily go to the fire fighter, because there is already a policeman agent near it. Consequently, agents coordinate themselves simply by propagating negative rewards. This is a nice way to coordinate agents in an online multiagent POMDP.

5 AmbulanceTeam and AmbulanceCenter Agents

In this section, we present the AmbulanceTeam and the AmbulanceCenter agents. In their case, the center has a lot of responsibilities, since it is the one making all the decisions about which civilians to rescue and in which order.
5.1 AmbulanceCenter

At each turn, the AmbulanceCenter agent sends the ordered list of civilians to rescue to the AmbulanceTeams. Those agents are described later, but for now it is worth mentioning that all AmbulanceTeams rescue the same agent at the same time in order to reduce the rescuing time.


First of all, the center agent sends to the ambulances the order to save the other agents of the rescue team (i.e. FireBrigades, PoliceForces or other AmbulanceTeams), if some of them are buried. When all our agents have been saved, the center agent calculates which civilians to save and in which order. To do so, it uses a task scheduling algorithm that tries to maximize the number of civilians that could be saved. Each task, corresponding to saving a civilian, has a duration giving the time necessary to save the civilian, taking into account the travel time to reach the civilian's location, the rescuing time and the time to transport the civilian to a refuge. Each task also has a deadline representing the expected death time of the civilian. We also prioritize civilians that are near buildings on fire, since they are more likely to die quickly if their building catches fire.
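The team's exact scheduling algorithm is not detailed here, so the following is only an earliest-deadline-first style illustration of the idea, with the duration, deadline and near_fire estimates passed in as assumed helper functions.

def schedule_civilians(civilians, duration, deadline, near_fire, start_time=0):
    """Order civilians so that as many as possible are rescued before their deadline.

    duration(c):  estimated travel + rescue + transport time for civilian c
    deadline(c):  expected death time of civilian c
    near_fire(c): True if c is close to a building on fire (prioritized)
    """
    schedule, t = [], start_time
    # Endangered civilians first, then the others, each group sorted by deadline.
    ordered = sorted(civilians, key=lambda c: (not near_fire(c), deadline(c)))
    for c in ordered:
        if t + duration(c) <= deadline(c):   # keep only civilians we can still save in time
            schedule.append(c)
            t += duration(c)
    return schedule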
5.2 AmbulanceTeam

As we have said before, AmbulanceTeams rescue civilians according to the order given by the AmbulanceCenter. In each message from the center, the ambulances receive two civilians to rescue. Since all AmbulanceTeams receive the same messages, they are always rescuing the same agent. This reduces the time necessary to rescue an agent. When there are no civilians to rescue, AmbulanceTeams search for buried civilians. If they find some buried civilians, they send a message to the AmbulanceCenter indicating the position and the health status of each civilian.

6 Conclusion

This paper has presented the strategies and algorithms used by all the agents of the DAMAS-Rescue team. To summarize, FireBrigade agents choose the best fire to extinguish based on the utility of a building on fire and their capacity to extinguish it. This capacity is learned by the agents with a selective perception learning method. AmbulanceTeams always rescue the same civilian, based on the messages received from the AmbulanceCenter. This center uses a scheduling algorithm to order the civilians in a way that maximizes the number of civilians rescued. PoliceForces use an online POMDP algorithm to choose the best path to clear at each turn.

References
1. Paquet, S., Ross, S.: DAMAS-Rescue web page (2004)
2. Kitano, H.: RoboCup Rescue: A grand challenge for multi-agent systems. In: Proceedings of ICMAS 2000, Boston, MA (2000)
3. McCallum, A.: Reinforcement Learning with Selective Perception and Hidden State. PhD thesis, University of Rochester, Rochester, New York (1996)
4. Quinlan, J.: Combining instance-based and model-based learning. In: Proceedings of the Tenth International Conference on Machine Learning, Amherst, Massachusetts, Morgan Kaufmann (1993) 236-243
