Abstract: The aim of this paper is to solve the trajectory-tracking control problem of Autonomous Underwater Vehicles (AUVs) by applying and improving deep reinforcement learning (DRL). The deep reinforcement learning underwater motion control system is composed of two neural networks: one network selects the action and the other evaluates whether the selected action is accurate, and both update themselves through a deep deterministic policy gradient (DDPG). These two neural networks are made up of multiple fully connected layers. Theoretical analysis and simulations show that, to a given precision, this algorithm tracks complex curved AUV trajectories more accurately than traditional PID control.
Key Words: Autonomous Underwater Vehicles, Optimal Control System, Deep Reinforcement Learning
is the discount factor (0 < \gamma < 1), which is used to weaken the influence of possible future states [10]. Therefore, this problem can be described as the following mathematical problem.

Problem 1: find \arg\min_{s_t, a_t} \{ R(s_t, a_t) \} with the constraint conditions:

s.t.  a_{min} \le a_t \le a_{max},   s_{min} \le s_t \le s_{max}

3.3 Critic function [22]
In order to solve Problem 1, the critic function is defined as:

Q(s_t, a_t) = R(s_t, a_t) = \int_t^{\infty} \gamma^{\tau - t}\, r(s_\tau, a_\tau)\, d\tau    (12)

Discretize (12):

R(s_t, a_t) = \sum_{i=t}^{\infty} \gamma^{i - t}\, r(s_i, a_i)    (13)
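As a small illustration of the discretized return in (13), the following sketch (a minimal example, not taken from the paper; the reward values and discount factor are made up) accumulates gamma-discounted rewards over a finite horizon:

# Minimal illustration of (13): R(s_t, a_t) = sum_{i=t}^{T} gamma^(i-t) * r(s_i, a_i).
# The reward values below are invented for demonstration only.
def discounted_return(rewards, gamma=0.99):
    total = 0.0
    for k, r in enumerate(rewards):  # k plays the role of i - t
        total += (gamma ** k) * r
    return total

if __name__ == "__main__":
    sample_rewards = [-0.5, -0.3, -0.1, 0.0, 0.2]  # hypothetical tracking rewards
    print(discounted_return(sample_rewards, gamma=0.9))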
J = Q(s_t, \pi(s_t|\theta) \,|\, \omega)    (22)

Differentiate (22):

\nabla_\theta J = \frac{\partial Q(s_t, \pi(s_t|\theta)|\omega)}{\partial a_t} \cdot \frac{\partial \pi(s_t|\theta)}{\partial \theta}    (23)

Use ADAM [12] to update the network:

m_{t+1} = \beta_1 m_t + (1 - \beta_1)\, \nabla_\theta J
v_{t+1} = \beta_2 v_t + (1 - \beta_2)\, (\nabla_\theta J)^2
\hat{m}_t = \frac{m_t}{1 - \beta_1^{t}}
\hat{v}_t = \frac{v_t}{1 - \beta_2^{t}}
\theta_{t+1} = \theta_t - \eta\, \frac{\hat{m}_t}{\sqrt{\hat{v}_t}}

where \alpha and \eta are the learning rates.
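A minimal numerical sketch of this ADAM step (using NumPy; the gradient values, \beta_1, \beta_2, step size and parameter dimension below are illustrative assumptions, not the paper's settings):

import numpy as np

def adam_step(theta, grad_J, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    # One ADAM update, following the recursions above.
    m = beta1 * m + (1 - beta1) * grad_J          # first-moment estimate m_t
    v = beta2 * v + (1 - beta2) * grad_J ** 2     # second-moment estimate v_t
    m_hat = m / (1 - beta1 ** t)                  # bias corrections
    v_hat = v / (1 - beta2 ** t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

theta, m, v = np.zeros(4), np.zeros(4), np.zeros(4)
for t in range(1, 6):
    grad = np.random.randn(4)   # stand-in for the policy gradient in (23)
    theta, m, v = adam_step(theta, grad, m, v, t)
print(theta)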
To make both the actor and critic neural networks update their weights in a softer way, we create a copy of the actor and critic networks, named Q′(s_t, a_t|\omega′) and \pi′(s_t|\theta′) [8]. The whole algorithm is given as Algorithm 1. The algorithm stops only when \sigma < \sigma_r, where \sigma is the standard deviation of the total reward function over each 100 episodes and \sigma_r is the threshold.
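In sketch form, the soft copy update used for these target networks is a Polyak average (NumPy; the weight shapes and the value of \tau are illustrative assumptions):

import numpy as np

def soft_update(target_weights, online_weights, tau=0.01):
    # w' <- tau * w + (1 - tau) * w', applied weight tensor by weight tensor.
    return [tau * w + (1.0 - tau) * w_t
            for w, w_t in zip(online_weights, target_weights)]

online = [np.ones((3, 3)), np.ones(3)]
target = [np.zeros((3, 3)), np.zeros(3)]
target = soft_update(target, online, tau=0.1)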
Algorithm 1: DRL algorithm
Initialize the networks Q(s_t, a_t|\omega) and \pi(s_t|\theta) with weights \omega and \theta.
Initialize the copy networks Q′(s_t, a_t|\omega′) and \pi′(s_t|\theta′) with weights \omega′ and \theta′.
Initialize the replay buffer R.
while \sqrt{ \frac{ \sum_{j=M-100}^{M} \sum_{i=1}^{T} [r_j(s_i, a_i) - \bar{r}(s_t, a_t)]^2 }{100T} } > \sigma_r do    (M is the current training episode)
    Initialize the state s_0
    for t = 1, T do
        Choose action a_t = \pi(s_t|\theta)
        Get the state s_{t+1} according to the environment
        Compute r(s_t, a_t)
        Store the transition (s_t, a_t, r_t, s_{t+1}) in R
        Randomly select N arrays from R
        Compute y_i = r_i + \gamma Q′(s_i, a_i|\omega′)
        Compute Loss = \frac{1}{N} \sum_{i=1}^{N} (y_i - Q(s_i, a_i|\omega))^2
        Compute \nabla_\omega Loss = \frac{1}{N} \sum_{i=1}^{N} (y_i - Q(s_i, a_i|\omega)) \, \frac{\partial Q(s_i, a_i|\omega)}{\partial \omega}
        Update weight \omega_{t+1} = \omega_t + \alpha \, \nabla_\omega Loss
        Compute \nabla_\theta J = \frac{1}{N} \sum_{i=1}^{N} \frac{\partial Q(s_i, a_i|\omega)}{\partial a_i}\Big|_{a_i = \pi(s_i|\theta)} \cdot \frac{\partial \pi(s_i|\theta)}{\partial \theta}
        Compute m_t = \beta_1 m_{t-1} + (1 - \beta_1) \, \nabla_\theta J
        Compute v_t = \beta_2 v_{t-1} + (1 - \beta_2) \, (\nabla_\theta J)^2
        Compute \hat{m}_t = \frac{m_t}{1 - \beta_1^{t}}
        Compute \hat{v}_t = \frac{v_t}{1 - \beta_2^{t}}
        Update weight \theta_{t+1} = \theta_t - \eta \, \frac{\hat{m}_t}{\sqrt{\hat{v}_t}}
        Update weight \omega′ = \tau\omega + (1 - \tau)\omega′
        Update weight \theta′ = \tau\theta + (1 - \tau)\theta′    (\tau is the learning rate)
    end for
end while
end
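A compact sketch of how one update of Algorithm 1 can be organised in code is given below (PyTorch is assumed; the network sizes, learning rates, \tau, batch size and the random stand-in transitions are illustrative assumptions, not the paper's implementation; unlike the y_i line of Algorithm 1, the target value here uses the next state and the target actor, as in standard DDPG):

import random
from collections import deque
import torch
import torch.nn as nn

class MLP(nn.Module):
    # Fully connected network, as used for both the actor and the critic.
    def __init__(self, in_dim, out_dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, out_dim))
    def forward(self, x):
        return self.net(x)

state_dim, action_dim = 6, 2              # illustrative dimensions
actor = MLP(state_dim, action_dim)        # pi(s|theta)
critic = MLP(state_dim + action_dim, 1)   # Q(s, a|omega)
actor_target = MLP(state_dim, action_dim)
critic_target = MLP(state_dim + action_dim, 1)
actor_target.load_state_dict(actor.state_dict())
critic_target.load_state_dict(critic.state_dict())

actor_opt = torch.optim.Adam(actor.parameters(), lr=1e-4)    # ADAM, as in the text
critic_opt = torch.optim.Adam(critic.parameters(), lr=1e-3)
buffer, gamma, tau = deque(maxlen=100000), 0.99, 0.01

def update(batch_size=64):
    # One critic update, one actor update, then the soft update of the copy networks.
    batch = random.sample(list(buffer), batch_size)
    s, a, r, s2 = (torch.as_tensor(x, dtype=torch.float32)
                   for x in map(list, zip(*batch)))
    r = r.unsqueeze(1)
    with torch.no_grad():                                   # target value y_i, no gradient
        y = r + gamma * critic_target(torch.cat([s2, actor_target(s2)], dim=1))
    critic_loss = nn.functional.mse_loss(critic(torch.cat([s, a], dim=1)), y)
    critic_opt.zero_grad(); critic_loss.backward(); critic_opt.step()
    actor_loss = -critic(torch.cat([s, actor(s)], dim=1)).mean()   # deterministic policy gradient
    actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()
    for net, net_t in ((actor, actor_target), (critic, critic_target)):
        for p, p_t in zip(net.parameters(), net_t.parameters()):
            p_t.data.mul_(1 - tau).add_(tau * p.data)       # w' <- tau*w + (1-tau)*w'

# Fill the buffer with random stand-in transitions and run one update.
for _ in range(256):
    buffer.append((torch.randn(state_dim).tolist(),
                   torch.randn(action_dim).tolist(),
                   float(torch.randn(())),
                   torch.randn(state_dim).tolist()))
update()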
4 Analysis of the Stability of the Control System
Assumption IV: There should be the following numerical relationships: when s_t - s_t^* \ge 0, a_t \le 0; when s_t - s_t^* < 0, a_t > 0.
It can easily be found that the AUV will travel backward when the boat is moving away from the target position, and it will travel forward when the boat is approaching the target position.
The analysis of the stability is as follows. Combining (1), (7) and (9):

\pi(s_t|\theta) = D(s_t)\,\dot{s}_t + m\,s_t + g(\eta) + C(s_t)\,s_t    (25-1)

It is easily known that D(s_t) is not a singular matrix. Applying transposition to (25-1), we can get:

\dot{s}_t = \frac{\pi(s_t|\theta) - m\,s_t - C(s_t)\,s_t - g(\eta)}{D(s_t)}    (25-2)

According to Assumption I, g(\eta) = Const. Taking the derivative with respect to t on both sides of (25-2), we get:

\frac{d\dot{s}_t}{dt} = \frac{D(s_t)\,K_2\,\dot{s}_t - K_1\,\frac{\partial D(s_t)}{\partial s_t}\,\dot{s}_t}{[D(s_t)]^2}    (26)

Among which,

K_1 = \pi(s_t|\theta) - m\,s_t - C(s_t)\,s_t    (27)

K_2 = \frac{\partial \pi(s_t|\theta)}{\partial s_t} - m - \frac{\partial \left( C(s_t)\,s_t \right)}{\partial s_t}    (28)

From Assumption III, we know that \frac{\partial D(s_t)}{\partial s_t} can be ignored. After simplifying the above equations, we can get:

\frac{d\dot{s}_t}{dt} = \frac{K_2\,\dot{s}_t}{D(s_t)}    (29)

Now we apply the Lyapunov function:

L(t) = \frac{1}{2}\,(s_t - s_t^*)^2    (30)

Among which, s_t^* means the desired value.
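The excerpt stops after (30); as a sketch of the intended next step (assuming a constant desired value s_t^*, and not taken from the paper), differentiating (30) along the trajectory gives

\dot{L}(t) = \frac{d}{dt}\left[ \frac{1}{2}(s_t - s_t^*)^2 \right] = (s_t - s_t^*)\,\dot{s}_t

so the sign conditions of Assumption IV, which force the action a_t (and hence the commanded velocity) to oppose the tracking error s_t - s_t^*, are what would drive \dot{L}(t) \le 0 and keep the closed-loop tracking error from growing.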
Fig. 3: The trajectory of DRL (training episode 1000) and PID controller
Fig. 6: The total error of DRL in different training episodes
Fig. 5: The total error of DRL (training episode 1000) and PID controller

Figure 7 indicates the change of the average reward in relation to the number of training episodes. The reward value converges dramatically and becomes stable after 200 episodes.
In conclusion, after carrying out sufficient trials in reinforcement learning, the trajectory tracking of the DRL shows better performance than that of the PID controller, with higher accuracy and more stability.

x_d(t) = t

The new PID parameters are set at: K_p = 3.01, K_I = 4.96, K_d = 0.0005. We set \eta = [0, 0, 0]^T and v = [0, 0, 0]^T at the beginning of the simulation. The simulation results are as follows:
Figure 8 shows the trajectory of the DRL (training episode 3000) and PID controller, and Figure 9 is the trajectory of the DRL in different training episodes. Figure 10 shows the total error of the DRL (training episode 3000) and PID controller, and Figure 11 shows the total error of the DRL in different training episodes. It can be seen from the figures that more episodes result in better accuracy in the practical trajectory. The performance of the DRL (episode 3000) is more stable and robust than that of the PID controller.
Figure 12 illustrates the change of the average reward in relation to the number of training episodes. It can be seen that the reward value shows an upward trend with significant fluctuations.
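For reference, the PID baseline used in these comparisons can be sketched as a one-dimensional discrete controller with the gains quoted above (the sampling time and the error sequence are illustrative assumptions, not taken from the paper):

class PID:
    # Discrete PID controller: u = Kp*e + Ki*integral(e) + Kd*de/dt.
    def __init__(self, kp=3.01, ki=4.96, kd=0.0005, dt=0.1):
        self.kp, self.ki, self.kd, self.dt = kp, ki, kd, dt
        self.integral = 0.0
        self.prev_error = 0.0

    def control(self, error):
        self.integral += error * self.dt
        derivative = (error - self.prev_error) / self.dt
        self.prev_error = error
        return self.kp * error + self.ki * self.integral + self.kd * derivative

pid = PID()
for e in [0.15, 0.12, 0.08, 0.05, 0.02]:   # e_t = x_d(t) - x(t), made-up values
    print(pid.control(e))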
Fig. 8: The trajectory of DRL (training episode 3000) and PID controller
Fig. 11: The total error of DRL in different training episodes