1 Introduction
Developing a multiagent system for the RoboCupRescue simulation environment is not an easy task, since it is a complex environment that raises many challenges. To share our work on this project, this article explains how the DAMAS-Rescue [1] team works by describing the strategies we implemented for the 2006 world competition. The first method we developed
is a selective perception learning method to enable the FireBrigade agents to
learn their effectiveness when they are extinguishing fires. By evaluating their
capacity and the utility to extinguish a fire, they are able to coordinate their fire
choices on the most important fires to extinguish. For the PoliceForce agents, we
have developed an online POMDP algorithm to enable them to evaluate different
paths and choose the best one based on their current belief state about the state
of the roads. This algorithm gives them flexibility in their decisions about which
roads to clear and at the same time it enables them to coordinate themselves.
We designed the reward function so that each PoliceForce agent repels the other PoliceForce agents, thus preventing them from working on the same
part of the city. For the AmbulanceTeam agents, we have adapted a scheduling
algorithm enabling agents to order the civilians in a way that maximizes the
number of civilians that can be saved before they die. Even though we expect that most readers already know the RoboCupRescue simulation environment, we describe it briefly in the next section so that everyone can follow. Afterwards, we describe the strategies that we
have developed for the 2006 RoboCupRescue world competition. We begin by
describing the FireStation and the FireBrigades, then the PoliceForces and the
PoliceOffice and finally, the AmbulanceTeams and AmbulanceCenter.
2 The RoboCupRescue Simulation Environment
The RoboCupRescue simulation project aims to simulate rescue teams acting in large urban disasters [2]. More precisely, the project takes the form of an annual competition in which participants design rescue agents that try to minimize the damage caused by a big earthquake: buried civilians, buildings on fire and blocked roads. In the simulation, participants have approximately 30 to 40 agents of six different types to manage:
FireBrigade There are 0 to 15 agents of this type. Their task is to extinguish
fires. Each FireBrigade agent is in contact by radio with all other FireBrigade
agents as well as with the FireStation.
PoliceForce There are 0 to 15 agents of this type. Their task is to clear roads
to enable agents to circulate. Each PoliceForce agent is in contact by radio
with all other PoliceForce agents as well as with the PoliceOffice.
AmbulanceTeam There are 0 to 8 agents of this type. Their task is to search
in shattered buildings for buried civilians and to transport injured agents to
hospitals. Each AmbulanceTeam agent is in contact by radio with all other
AmbulanceTeam agents as well as with the AmbulanceCenter.
Center agents There are three types of center agents: FireStation, PoliceOffice
and AmbulanceCenter. These agents can only send and receive messages.
They are in contact by radio with all their mobile agents as well as with the
other center agents. A center agent can read more messages than a mobile
agent, so center agents can serve as information centers and coordinators for
their mobile agents.
In the simulation, each agent receives visual information about only the region surrounding it. Thus, no agent has complete knowledge of the global state of the environment. This uncertainty complicates the problem greatly: agents have to explore the environment, and they also have to communicate to build a better shared picture of the situation.
3 FireBrigade and FireStation Agents
In this section, we focus on the FireBrigade and the FireStation agents. As mentioned before, the task of the FireBrigade agents is to extinguish fires. Therefore,
at each step in time, each FireBrigade agent has to choose which building on
fire to extinguish. However, in order to be effective, FireBrigade agents have
to coordinate themselves on the same buildings on fire, because more than one
agent is often needed to extinguish a building on fire. The main problem is that
they do not know how many agents are needed for each particular building on
fire. To learn this, agents use a selective perception reinforcement learning algorithm, which is described in the next section.
3.1 Selective Perception
To learn the expected reward of extinguishing one building on fire or one fire zone, we used a selective perception technique [3], because our state descriptions are too large. With this technique, an agent learns by itself to reduce the number of possible states. The algorithm uses a tree structure similar to a decision tree. By building the tree, the agent learns to reduce the number of possible task descriptions: it groups all similar tasks together and does not distinguish between tasks of the same group. Tasks are considered similar if they have similar expected rewards, because the agent would take the same decision in all of those situations and therefore does not need to tell them apart. In this application, instances used in the tree contain
the number of agents accomplishing a task, n, and a task description d, which consists of: the intensity of the fire (3 possible values), the building's composition (3 possible values), the building's size (a continuous value), the building's damage (4 possible values) and the number of adjacent buildings on fire (a continuous value). The number of possible instances is therefore very large; in fact, with the two continuous attributes, there is an infinite number of instances. With our learning algorithm, however, the number of abstract task descriptions considered is kept small.
3.2 Instances
At each time step t, the agent records its experience captured as an instance
that contains the task it tried (dt), the number of agents that tried the same
task (nt) and the reward it obtained (rt). Each instance also has a link to the
preceding instance and the next one, thus making a chain of instances. In our
case, we have one chain for each task that an agent chooses to accomplish. A
chain contains all instances from the time an agent chooses to accomplish a
task until it changes to another task. Therefore, during its execution, the agent
records many instances organized in many instance chains. It keeps all those
instances until the end of a trial.
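The instance chains described above can be sketched as a doubly linked record, for example (a Python sketch; the class and field names are ours, not the team's):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Instance:
    """One recorded experience: the task tried, the agents on it, the reward."""
    task: dict                          # task description d_t (intensity, size, ...)
    n_agents: int                       # number of agents n_t that tried the task
    reward: float                       # reward r_t obtained
    prev: Optional["Instance"] = None   # link to the preceding instance
    next: Optional["Instance"] = None   # link to the following instance

def record(chain_tail: Optional[Instance], task: dict, n: int, r: float) -> Instance:
    """Append a new instance to the current chain and return the new tail."""
    inst = Instance(task, n, r, prev=chain_tail)
    if chain_tail is not None:
        chain_tail.next = inst
    return inst

# Two steps spent on the same task form one chain.
tail = record(None, {"intensity": 2}, 3, 1.5)
tail = record(tail, {"intensity": 2}, 4, 2.0)
```

A new chain simply starts from `record(None, ...)` whenever the agent switches to another task.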
3.3 Tree Structure
To learn how to classify the instances, we use a tree structure similar to a decision
tree. The tree divides the instances into clusters according to their expected
reward. The objective here is to regroup all instances having similar expected
rewards. The algorithm presented here is an instance-based algorithm in which
a tree is used to store all instances which are kept in the leaves of the tree. The
other nodes of the tree, called center nodes, are used to divide the instances
with a test on a specific attribute. Each leaf of the tree also contains a Q-value
indicating the expected reward if a task that belongs to this leaf is chosen. In
our approach, a leaf l of the tree is considered to be a task description (a state)
for the learning algorithm.
[Figure 1: Example of a learned tree. Center nodes test the building composition (Wood, Reinforced Concrete, Steel Frame), the fire intensity (Weak, Moderate, Strong), the building size (≤ t1, > t1) and the number of agents (≤ t2, > t2); the leaf nodes (LN) hold the Q-values.]
3.4 Updating the Tree
This section presents the algorithm used to update the tree with all the newly recorded instances. Algorithm 1 shows an abstract version of the algorithm, and the following subsections present each function in more detail.
Add Instances After a trial, all agents put their new instances together. This
set of new instances is then used to update the tree. Thus, the first step of the
algorithm is simply to add all instances to the leaves they belong to. To find those
leaves, we simply start at the root of the tree and head down, choosing at each center node the branch indicated by the result of the test on the instance's attributes, which could be one of the attributes of the task description d or the number of agents n.
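The descent described above can be sketched as follows (an illustrative Python sketch; the node interface and attribute names are assumptions):

```python
class Leaf:
    """Leaf: holds the instances of one abstract task description and its Q-value."""
    def __init__(self):
        self.instances = []
        self.q = 0.0

class Node:
    """Center node: tests one attribute and routes an instance to a child."""
    def __init__(self, attribute, threshold=None, children=None):
        self.attribute = attribute      # e.g. "composition", "size" or "n_agents"
        self.threshold = threshold      # None means a categorical test
        self.children = children or {}  # branch label -> Node or Leaf

    def branch(self, instance):
        value = instance[self.attribute]
        if self.threshold is None:                      # categorical attribute
            return self.children[value]
        return self.children[value <= self.threshold]   # continuous: True/False branch

def add_instance(root, instance):
    """Head down from the root, following each center node's test, and
    store the instance in the leaf it reaches."""
    node = root
    while isinstance(node, Node):
        node = node.branch(instance)
    node.instances.append(instance)
    return node

# Toy tree with one continuous test on the building size.
small, large = Leaf(), Leaf()
root = Node("size", threshold=100, children={True: small, False: large})
add_instance(root, {"size": 80, "n_agents": 3})
```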
Update Q-values The second step updates the Q-values of each leaf node to take into consideration the new instances which were just added. The updates are done with the following equation:

Q'(l) = R(l) + γ Σ_{l'} T(l, l') Q(l')    (1)

where Q(l) is the expected reward if the agent tries to accomplish a task of leaf l and γ is the discount factor. Those values are calculated directly from the recorded instances: R(l) is the average reward obtained when a task in l was chosen and T(l, l') is the proportion of next instances that are in leaf l', where I_l denotes the set of instances stored in leaf l and L(i) denotes the leaf containing instance i:

R(l) = ( Σ_{i_t ∈ I_l} r_t ) / |I_l|    (2)

T(l, l') = |{ i_t | i_t ∈ I_l ∧ L(i_{t+1}) = l' }| / |I_l|    (3)

The error reduction of a potential test on a leaf's instances is calculated using the following equation, where I_k denotes the subset of instances in I_l that have the k-th outcome for the potential test and sd(I) is the standard deviation of the rewards of the instances in I:

Error = sd(I_l) − Σ_k ( |I_k| / |I_l| ) sd(I_k)    (5)
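Equations 1 to 3 can be sketched in code, assuming each stored instance keeps its reward and a link to the leaf of the next instance in its chain (a Python sketch with illustrative names; the repeated-sweep fixed point is one simple way of applying Eq. 1):

```python
from collections import defaultdict

class Leaf:
    def __init__(self, instances):
        self.instances = instances   # dicts with "reward" and "next_leaf"
        self.q = 0.0

def update_q_values(leaves, gamma=0.9, sweeps=100):
    """Recompute R(l) and T(l, l') from the stored instances (Eqs. 2-3),
    then iterate Q'(l) = R(l) + gamma * sum_l' T(l, l') Q(l') (Eq. 1)."""
    R, T = {}, {}
    for leaf in leaves:
        n = len(leaf.instances)
        R[leaf] = sum(i["reward"] for i in leaf.instances) / n
        counts = defaultdict(int)
        for i in leaf.instances:
            if i["next_leaf"] is not None:   # last instance of a chain: no successor
                counts[i["next_leaf"]] += 1
        T[leaf] = {l2: c / n for l2, c in counts.items()}
    for _ in range(sweeps):
        new_q = {l: R[l] + gamma * sum(p * l2.q for l2, p in T[l].items())
                 for l in leaves}
        for l in leaves:
            l.q = new_q[l]

# Leaf a always transitions to leaf b; leaf b ends its chains.
b = Leaf([{"reward": 2.0, "next_leaf": None}])
a = Leaf([{"reward": 1.0, "next_leaf": b}, {"reward": 1.0, "next_leaf": b}])
update_q_values([a, b])
# a.q -> 1 + 0.9 * 2 = 2.8, b.q -> 2.0
```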
3.5 Estimating the Number of Agents Needed
During the execution, each agent uses this tree to estimate the number of partners needed to accomplish a task. Since the number of agents is considered as an attribute when the tree is learned, there are center nodes in the
tree testing on the number of agents. Therefore, if different numbers of agents
are tested, different leaves and thus different rewards may be found, even with
the same task description. Consequently, to find the number of agents needed for a particular task, the agent can test different numbers of agents and look at the expected rewards returned by the tree.
Algorithm 2 presents the function used to estimate the number of agents
needed for a given task. In this algorithm, the function Expected-Reward
returns the expected reward if n agents are trying to accomplish the task d. To
do so, the agent finds the corresponding leaf in the tree, considering the task d and the number of agents n, and reads the expected reward for this task that is stored in that leaf.
This function is called for increasing numbers of agents until the expected reward returned by the tree is greater than a specified threshold. If the expected reward is greater than the threshold, it means that the current number of agents should be enough to accomplish the task. If the expected reward stays under the threshold, even with the maximum number of agents, the function returns ∞, meaning that the task is considered impossible with the available number of agents.
3.6 Agent Coordination
During the simulation, the agents use the tree created offline to choose the best fire zone and the best building on fire to extinguish. Since the FireStation agent has a better global view of the situation, it is its responsibility to suggest fire zones to FireBrigade agents. Those agents however have a better local view, so they choose which particular building on fire to extinguish in the given zone. By doing so, we take advantage of the better global view of the FireStation agent and the better local view of the FireBrigade agents at the same time.
Algorithm 2 Algorithm used to find the number of agents needed for a given task description d.

Function Number-Agents-Needed(d)
Input: d: a task description.
Statics: Tree: the learned tree.
         N: the number of available agents.
         Threshold: the limit to surpass.
for n = 1 to N do
    expReward ← Expected-Reward(Tree, d, n)
    if expReward ≥ Threshold then return n
end for
return ∞
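A direct Python transcription of Algorithm 2 could look like this (the `expected_reward` callback stands in for the tree lookup done by Expected-Reward, and returning None stands in for the pseudocode's "impossible" value):

```python
def number_agents_needed(expected_reward, d, n_available, threshold):
    """Try n = 1..N agents and return the first n whose expected reward,
    read from the tree, reaches the threshold; None means the task is
    considered impossible with the available agents."""
    for n in range(1, n_available + 1):
        if expected_reward(d, n) >= threshold:
            return n
    return None

# Toy stand-in for Expected-Reward: more agents, more reward.
toy_reward = lambda d, n: 0.2 * n
number_agents_needed(toy_reward, {"intensity": 2}, 5, 0.5)   # -> 3
```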
each building at position i is the closest building, not already on the list, to the building at position i − 1. The first building is a reference building given by the FireStation when it assigns the fire zone.
All FireBrigade agents have approximately the same list of buildings on fire. To choose their building, they go through the list one building at a time. For each building, they use the tree to find the expected number of agents needed to extinguish the fire, using Algorithm 2. With this information, each agent assigns a building to itself, assuming that the other agents choose their buildings according to the same information. If they actually have the same information, they should be well coordinated on the buildings to extinguish.
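The self-assignment rule described above can be sketched as follows (a Python sketch; ranking agents by identifier is our assumption about how agents break ties consistently):

```python
def choose_building(my_id, agent_ids, buildings, agents_needed):
    """Deterministic self-assignment: every agent walks the shared building
    list, reserves agents_needed(b) slots per building, and takes the
    building its own rank falls into. agent_ids is the shared set of all
    FireBrigade agents; agents_needed returns None for impossible fires."""
    my_rank = sorted(agent_ids).index(my_id)
    first_free = 0
    for b in buildings:
        need = agents_needed(b)
        if need is None:
            continue                      # judged impossible: nobody is assigned
        if my_rank < first_free + need:
            return b
        first_free += need
    return buildings[0] if buildings else None   # leftover agents join the first fire

# Two buildings needing 2 and 3 agents, five agents in total.
needs = {"b1": 2, "b2": 3}.get
assignments = [choose_building(i, range(5), ["b1", "b2"], needs) for i in range(5)]
# -> ["b1", "b1", "b2", "b2", "b2"]
```

Because every agent evaluates the same list with the same tree, no messages are needed for this step: the coordination emerges from the shared computation.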
Fire Prevention In order to prevent the fire from spreading to other buildings, we use one FireBrigade in each assigned fire zone to send water on buildings not yet on fire. Our experience with the fire simulator showed that prevention on buildings immediately adjacent to buildings on fire is rarely effective, so we adopted a different strategy in which we try to prevent the fire from spreading to other fire zones that do not already contain a building on fire.
The FireBrigade in charge of fire prevention uses a heuristic function to approximate which building is most likely to catch fire in the adjacent fire zones. This heuristic returns a probability for a building to catch fire by considering different parameters, such as the number of buildings on fire within a certain radius, the distance separating the building from the other buildings on fire, the building's size and composition, and the quantity of water already sent on the building. The FireBrigade will do fire prevention on the building having the highest probability of catching fire, if this probability is higher than a certain threshold; otherwise it will extinguish the buildings already on fire.
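The text does not give the heuristic's exact form, so the sketch below is only an illustrative stand-in that combines the listed cues with arbitrary weights:

```python
import math

def catch_fire_score(building, burning_positions, water_sent, radius=50.0):
    """Illustrative stand-in for the team's heuristic (the exact weights are
    not given in the text): combines nearby fires, their distance, the
    building's size and flammability, and the water already sent, into a
    score in [0, 1)."""
    bx, by = building["pos"]
    close = [math.hypot(bx - fx, by - fy)
             for fx, fy in burning_positions
             if math.hypot(bx - fx, by - fy) <= radius]
    if not close:
        return 0.0
    proximity = sum(1.0 / (1.0 + d) for d in close)         # closer fires weigh more
    exposure = proximity * building["flammability"] * building["size"]
    return 1.0 - math.exp(-exposure / (1.0 + water_sent))   # water dampens the risk

wooden = {"pos": (0.0, 0.0), "size": 1.0, "flammability": 1.0}
dry = catch_fire_score(wooden, [(10.0, 0.0)], water_sent=0.0)
wet = catch_fire_score(wooden, [(10.0, 0.0)], water_sent=5.0)
# dry > wet: water already sent lowers the estimated risk
```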
4 PoliceForce and PoliceOffice Agents
PoliceForce agents play a key role in the rescue operation by clearing the roads, thus enabling all agents to circulate. Without them, some actions would be impossible because agents would be blocked indefinitely by road blockades. Therefore, it is really important for them to be fast and efficient.
For the police agents, we developed an online POMDP approach based on a
look-ahead search to find the best action to execute at each cycle in the environment. Since we need a fast online algorithm, we opted for a factored POMDP
representation and a branch and bound strategy based on a limited depth first
search instead of classical dynamic programming. The tradeoff obtained between
the solution quality and the computing time is very interesting.
4.1 Look-Ahead Search
The main idea of our online POMDP approach is to estimate the value of a belief state by constructing a tree where the nodes are belief states and where the value of a node at remaining depth d is given by:

V(b, d) = U(b)                                                          if d = 0
V(b, d) = max_a [ R(b, a) + γ Σ_o P(o | b, a) V(τ(b, a, o), d − 1) ]    if d > 0    (7)

where U(b) is the utility estimate used at the leaves, R(b, a) is the immediate reward of action a in belief state b, and τ(b, a, o) is the belief state resulting from executing a in b and observing o.
4.2 RTBSS Algorithm
We developed an algorithm, called RTBSS (see Algorithm 3), that constructs the search tree and finds the best action. Since it is an online algorithm, it is applied each time the agent has to make a decision.
To speed up the search, our algorithm uses a branch and bound strategy to prune some subtrees. The algorithm first explores a child node in the tree and computes its value, which becomes a lower bound on the maximal expected value of the current node. Afterwards, for each other child node, the algorithm uses a heuristic function to evaluate whether it is possible to improve the lower bound by pursuing the search (at line 8). The heuristic function must be defined for each problem and it must always overestimate the true value. Moreover, the purpose of sorting the actions at line 5 is to try the most promising actions first, because that generates more pruning early in the search tree.
With RTBSS, the agent finds at each turn the action with the maximal expected value up to a certain horizon D. As a matter of fact, the performance of the algorithm strongly depends on the maximal search depth D.
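Equation 7 together with the pruning rule can be sketched as follows (a Python sketch, not the team's implementation; the model interface, the heuristic signature and the discount default are assumptions):

```python
def rtbss(b, d, D, model, utility, heuristic, gamma=0.95):
    """Sketch of the RTBSS look-ahead (Eq. 7) with branch-and-bound pruning.
    model is assumed to provide actions(b), reward(b, a),
    observations(b, a) -> [(o, P(o|b,a)), ...] and update(b, a, o) -> b'.
    heuristic(b, a, d) must overestimate the true value so pruning is safe.
    Returns (value, best_action); best_action is filled at the root (d == D)."""
    if d == 0:
        return utility(b), None
    best, best_action = float("-inf"), None
    # Most promising actions first: a good early lower bound prunes more.
    for a in sorted(model.actions(b), key=lambda a: heuristic(b, a, d), reverse=True):
        if heuristic(b, a, d) <= best:
            continue                      # cannot beat the lower bound: prune
        value = model.reward(b, a)
        for o, p in model.observations(b, a):
            value += gamma * p * rtbss(model.update(b, a, o), d - 1, D,
                                       model, utility, heuristic, gamma)[0]
        if value > best:
            best, best_action = value, a
    return best, best_action
```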
Algorithm 3 RTBSS (fragment of the listing):
8:      if curReward > max then
9:          max ← curReward
10:         if d = D then action ← a
11:      end if
12:  end if
13: end for
14: return max

4.3 Reward Function
[Figure 2: Example of a reward graph. Each node is a road; the reward sources are a FireFighter (FF), a Fire (F) and a Policeman (P), and the large number over a node is the total reward, the sum of all rewards identified in the node.]
The reward function propagates rewards over the road graph, starting from the rewarding roads, which are the positions of the agents and the fires. For example, if a fire fighter agent is on road r1, then this road receives a reward of 5, the roads adjacent to r1 in the graph receive a reward of 4, and so on. We also add rewards for all roads within a certain perimeter around a fire.
What is interesting with this reward function is that it can also be used to coordinate the policeman agents. Coordination is necessary because we do not want all agents to go to the same road. To achieve it, each agent propagates negative rewards around the other policemen, so that they repel each other.
Figure 2 shows an example of a reward graph. The nodes represent the roads and the reward source is identified in each node. The large number over a node is the total reward, which is the sum of all rewards identified in the node. As we can see, roads around the fire fighter agent receive positive rewards, while roads around the policeman agent receive negative rewards. Therefore, the agent would want to go to roads near the fire, but not necessarily next to the fire fighter, because there is already a policeman agent near it. Consequently, agents coordinate themselves simply by propagating negative rewards, which is a convenient way to coordinate agents in an online multiagent POMDP.
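The propagation described above can be sketched as a breadth-first traversal with a reward that shrinks by one per edge (a Python sketch; the decay-by-one rule follows the r1 example, and the stopping rule at zero is our assumption):

```python
from collections import deque

def propagate_rewards(adjacency, sources):
    """Breadth-first reward propagation over the road graph. Each source is
    (road, reward); the reward shrinks by 1 per edge toward 0 and the
    contributions of all sources are summed per road. Negative sources
    (the other policemen) repel, positive ones (fires, agents) attract."""
    total = {road: 0 for road in adjacency}
    for start, reward in sources:
        seen, queue = {start}, deque([(start, reward)])
        while queue:
            road, r = queue.popleft()
            total[road] += r
            if abs(r) <= 1:
                continue                  # reward fully decayed: stop spreading
            step = 1 if r > 0 else -1
            for neighbour in adjacency[road]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    queue.append((neighbour, r - step))
    return total

# A fire fighter on r1 (reward 5) and another policeman on r3 (reward -3).
roads = {"r1": ["r2"], "r2": ["r1", "r3"], "r3": ["r2"]}
propagate_rewards(roads, [("r1", 5), ("r3", -3)])
# -> {"r1": 4, "r2": 2, "r3": 0}
```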
5 AmbulanceTeam and AmbulanceCenter Agents

5.1 AmbulanceCenter
At each turn, the AmbulanceCenter agent sends the ordered list of civilians to rescue to the AmbulanceTeams. Those agents are described later, but for now it is worth mentioning that all AmbulanceTeams rescue the same agent at a time, to reduce the rescuing time.
First of all, the center agent sends to the ambulances the order to save the other agents of the rescue team (i.e. FireBrigades, PoliceForces or other AmbulanceTeams), if some of them are buried. When all our agents have been saved, the center agent calculates which civilians to save and in which order. To do so, it uses a task scheduling algorithm to try to maximize the number of civilians that can be saved. Each task, corresponding to saving one civilian, has a length giving the time necessary to save that civilian, taking into account the travel time to the civilian's location, the rescuing time and the time to transport the civilian to a refuge. Each task also has a deadline representing the expected death time of the civilian. We also prioritize civilians that are near buildings on fire, since they are more likely to die quickly if their building catches fire.
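The text does not name the scheduling algorithm, so the sketch below uses the classical Moore-Hodgson rule, which maximizes the number of jobs finished before their deadlines on a single sequential resource (here the whole ambulance group, since all AmbulanceTeams rescue the same civilian):

```python
def schedule_civilians(civilians, now=0):
    """Order civilians by deadline; whenever a deadline would be missed,
    drop the scheduled civilian with the longest rescue. A civilian is
    (name, rescue_time, deadline): rescue_time bundles travel, digging and
    transport to the refuge, and the deadline is the expected death time."""
    scheduled, t = [], now
    for civ in sorted(civilians, key=lambda c: c[2]):   # earliest deadline first
        scheduled.append(civ)
        t += civ[1]
        if t > civ[2]:                   # deadline missed: drop the longest rescue
            longest = max(scheduled, key=lambda c: c[1])
            scheduled.remove(longest)
            t -= longest[1]
    return [name for name, _, _ in scheduled]

civilians = [("anna", 4, 5), ("bob", 3, 6), ("carl", 2, 7)]
schedule_civilians(civilians)   # -> ["bob", "carl"]: two of the three can be saved
```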
5.2 AmbulanceTeam
6 Conclusion
This paper has presented the strategies and algorithms used by all the agents of the DAMAS-Rescue team. To summarize, FireBrigade agents choose the best fire to extinguish based on the utility of a building on fire and on their capacity to extinguish it; this capacity is learned with a selective perception learning method. AmbulanceTeams always rescue the same civilian, based on the messages received from the AmbulanceCenter; this center uses a scheduling algorithm to order the civilians in a way that maximizes the number of civilians rescued. PoliceForce agents use an online POMDP algorithm to choose the best path to clear at each turn.
References
1. Paquet, S., S., R.: DAMAS-Rescue web page (2004)
2. Kitano, H.: RoboCup Rescue: A grand challenge for multi-agent systems. In: Proceedings of ICMAS 2000, Boston, MA (2000)
3. McCallum, A.: Reinforcement Learning with Selective Perception and Hidden State. PhD thesis, University of Rochester, Rochester, New York (1996)
4. Quinlan, J.: Combining instance-based and model-based learning. In: Proceedings of the Tenth International Conference on Machine Learning, Amherst, Massachusetts, Morgan Kaufmann (1993) 236-243