Abstract: The aim of this paper is to solve the trajectory-tracking control problem of Autonomous Underwater Vehicles (AUVs) by applying and improving deep reinforcement learning (DRL). The deep reinforcement learning underwater motion control system is composed of two neural networks: one network selects the action and the other evaluates whether the selected action is accurate, and both update themselves through a deep deterministic policy gradient (DDPG). These two neural networks are made up of multiple fully connected layers. Theoretical analysis and simulations show that, to a given precision, this algorithm tracks complex curved AUV trajectories more accurately than traditional PID control.
Key Words: Autonomous Underwater Vehicles, Optimal Control System, Deep Reinforcement Learning
is the discount factor (0 < \gamma < 1), which is used to weaken the influence of possible future states [10]. Therefore, this problem can be described as the following mathematical problem.

Problem 1: find \arg\min_{s_t, a_t} \{ R(s_t, a_t) \} with the constraint conditions:

s.t.  a_{min} \le a_t \le a_{max},   s_{min} \le s_t \le s_{max}

3.3 Critic function [22]
In order to solve Problem 1, the critic function is defined as:

Q(s_t, a_t) = R(s_t, a_t) = \int_t^{\infty} \gamma^{\tau - t}\, r(s_\tau, a_\tau)\, d\tau    (12)

Discretize (12):

R(s_t, a_t) = \sum_{i=t}^{\infty} \gamma^{i - t}\, r(s_i, a_i)    (13)
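As a small illustration of the discretized return in (13), the following sketch (a minimal example, not taken from the paper; the reward values and discount factor are made up) accumulates gamma-discounted rewards over a finite horizon:

# Minimal illustration of (13): R(s_t, a_t) = sum_{i=t}^{T} gamma^(i-t) * r(s_i, a_i).
# The reward values below are invented for demonstration only.
def discounted_return(rewards, gamma=0.99):
    total = 0.0
    for k, r in enumerate(rewards):  # k plays the role of i - t
        total += (gamma ** k) * r
    return total

if __name__ == "__main__":
    sample_rewards = [-0.5, -0.3, -0.1, 0.0, 0.2]  # hypothetical tracking rewards
    print(discounted_return(sample_rewards, gamma=0.9))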
J = Q(s_t, \pi(s_t|\theta) \,|\, \omega)    (22)

Differentiate (22):

\nabla_\theta J = \frac{\partial Q(s_t, \pi(s_t|\theta)|\omega)}{\partial a_t} \cdot \frac{\partial \pi(s_t|\theta)}{\partial \theta}    (23)

Use ADAM [12] to update the network:

m_{t+1} = \beta_1 m_t + (1 - \beta_1)\, \nabla_\theta J
v_{t+1} = \beta_2 v_t + (1 - \beta_2)\, (\nabla_\theta J)^2
\hat{m}_t = \frac{m_t}{1 - \beta_1^{t}}
\hat{v}_t = \frac{v_t}{1 - \beta_2^{t}}
\theta_{t+1} = \theta_t - \eta\, \frac{\hat{m}_t}{\sqrt{\hat{v}_t}}

where \alpha and \eta are the learning rates.
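A minimal numerical sketch of this ADAM step (using NumPy; the gradient values, \beta_1, \beta_2, step size and parameter dimension below are illustrative assumptions, not the paper's settings):

import numpy as np

def adam_step(theta, grad_J, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    # One ADAM update, following the recursions above.
    m = beta1 * m + (1 - beta1) * grad_J          # first-moment estimate m_t
    v = beta2 * v + (1 - beta2) * grad_J ** 2     # second-moment estimate v_t
    m_hat = m / (1 - beta1 ** t)                  # bias corrections
    v_hat = v / (1 - beta2 ** t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

theta, m, v = np.zeros(4), np.zeros(4), np.zeros(4)
for t in range(1, 6):
    grad = np.random.randn(4)   # stand-in for the policy gradient in (23)
    theta, m, v = adam_step(theta, grad, m, v, t)
print(theta)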
To make both the actor and critic neural networks update their weights in a softer way, we create a copy of the actor and critic networks, named Q′(s_t, a_t|\omega′) and \pi′(s_t|\theta′) [8]. The whole algorithm is given as Algorithm 1. The algorithm stops only when \sigma < \sigma_r, where \sigma is the standard deviation of the total reward function over each 100 episodes and \sigma_r is the threshold.
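In sketch form, the soft copy update used for these target networks is a Polyak average (NumPy; the weight shapes and the value of \tau are illustrative assumptions):

import numpy as np

def soft_update(target_weights, online_weights, tau=0.01):
    # w' <- tau * w + (1 - tau) * w', applied weight tensor by weight tensor.
    return [tau * w + (1.0 - tau) * w_t
            for w, w_t in zip(online_weights, target_weights)]

online = [np.ones((3, 3)), np.ones(3)]
target = [np.zeros((3, 3)), np.zeros(3)]
target = soft_update(target, online, tau=0.1)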
Algorithm 1: DRL algorithm
Initialize the networks Q(s_t, a_t|\omega) and \pi(s_t|\theta) with weights \omega and \theta.
Initialize the copy networks Q′(s_t, a_t|\omega′) and \pi′(s_t|\theta′) with weights \omega′ and \theta′.
Initialize the replay buffer R.
while \sqrt{ \frac{ \sum_{j=M-100}^{M} \sum_{i=1}^{T} [r_j(s_i, a_i) - \bar{r}(s_t, a_t)]^2 }{100T} } > \sigma_r do    (M is the current training episode)
    Initialize the state s_0
    for t = 1, T do
        Choose action a_t = \pi(s_t|\theta)
        Get the state s_{t+1} according to the environment
        Compute r(s_t, a_t)
        Store the transition (s_t, a_t, r_t, s_{t+1}) in R
        Randomly select N arrays from R
        Compute y_i = r_i + \gamma Q′(s_i, a_i|\omega′)
        Compute Loss = \frac{1}{N} \sum_{i=1}^{N} (y_i - Q(s_i, a_i|\omega))^2
        Compute \nabla_\omega Loss = \frac{1}{N} \sum_{i=1}^{N} (y_i - Q(s_i, a_i|\omega)) \, \frac{\partial Q(s_i, a_i|\omega)}{\partial \omega}
        Update weight \omega_{t+1} = \omega_t + \alpha \, \nabla_\omega Loss
        Compute \nabla_\theta J = \frac{1}{N} \sum_{i=1}^{N} \frac{\partial Q(s_i, a_i|\omega)}{\partial a_i}\Big|_{a_i = \pi(s_i|\theta)} \cdot \frac{\partial \pi(s_i|\theta)}{\partial \theta}
        Compute m_t = \beta_1 m_{t-1} + (1 - \beta_1) \, \nabla_\theta J
        Compute v_t = \beta_2 v_{t-1} + (1 - \beta_2) \, (\nabla_\theta J)^2
        Compute \hat{m}_t = \frac{m_t}{1 - \beta_1^{t}}
        Compute \hat{v}_t = \frac{v_t}{1 - \beta_2^{t}}
        Update weight \theta_{t+1} = \theta_t - \eta \, \frac{\hat{m}_t}{\sqrt{\hat{v}_t}}
        Update weight \omega′ = \tau\omega + (1 - \tau)\omega′
        Update weight \theta′ = \tau\theta + (1 - \tau)\theta′    (\tau is the learning rate)
    end for
end while
end
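A compact sketch of how one update of Algorithm 1 can be organised in code is given below (PyTorch is assumed; the network sizes, learning rates, \tau, batch size and the random stand-in transitions are illustrative assumptions, not the paper's implementation; unlike the y_i line of Algorithm 1, the target value here uses the next state and the target actor, as in standard DDPG):

import random
from collections import deque
import torch
import torch.nn as nn

class MLP(nn.Module):
    # Fully connected network, as used for both the actor and the critic.
    def __init__(self, in_dim, out_dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, out_dim))
    def forward(self, x):
        return self.net(x)

state_dim, action_dim = 6, 2              # illustrative dimensions
actor = MLP(state_dim, action_dim)        # pi(s|theta)
critic = MLP(state_dim + action_dim, 1)   # Q(s, a|omega)
actor_target = MLP(state_dim, action_dim)
critic_target = MLP(state_dim + action_dim, 1)
actor_target.load_state_dict(actor.state_dict())
critic_target.load_state_dict(critic.state_dict())

actor_opt = torch.optim.Adam(actor.parameters(), lr=1e-4)    # ADAM, as in the text
critic_opt = torch.optim.Adam(critic.parameters(), lr=1e-3)
buffer, gamma, tau = deque(maxlen=100000), 0.99, 0.01

def update(batch_size=64):
    # One critic update, one actor update, then the soft update of the copy networks.
    batch = random.sample(list(buffer), batch_size)
    s, a, r, s2 = (torch.as_tensor(x, dtype=torch.float32)
                   for x in map(list, zip(*batch)))
    r = r.unsqueeze(1)
    with torch.no_grad():                                   # target value y_i, no gradient
        y = r + gamma * critic_target(torch.cat([s2, actor_target(s2)], dim=1))
    critic_loss = nn.functional.mse_loss(critic(torch.cat([s, a], dim=1)), y)
    critic_opt.zero_grad(); critic_loss.backward(); critic_opt.step()
    actor_loss = -critic(torch.cat([s, actor(s)], dim=1)).mean()   # deterministic policy gradient
    actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()
    for net, net_t in ((actor, actor_target), (critic, critic_target)):
        for p, p_t in zip(net.parameters(), net_t.parameters()):
            p_t.data.mul_(1 - tau).add_(tau * p.data)       # w' <- tau*w + (1-tau)*w'

# Fill the buffer with random stand-in transitions and run one update.
for _ in range(256):
    buffer.append((torch.randn(state_dim).tolist(),
                   torch.randn(action_dim).tolist(),
                   float(torch.randn(())),
                   torch.randn(state_dim).tolist()))
update()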
4 Analysis of the Stability of the Control System
Assumption IV: There should be the following numerical relationships: when s_t - s_t^* \ge 0, a_t \le 0; when s_t - s_t^* < 0, a_t > 0.
It can easily be found that the AUV will travel backward when the boat is moving away from the target position, and it will travel forward when the boat is approaching the target position.
The analysis of the stability is as follows. Combining (1), (7) and (9):

\pi(s_t|\theta) = D(s_t)\,\dot{s}_t + m\,s_t + g(\eta) + C(s_t)\,s_t    (25-1)

It is easily known that D(s_t) is not a singular matrix. Applying transposition to (25-1), we can get:

\dot{s}_t = \frac{\pi(s_t|\theta) - m\,s_t - C(s_t)\,s_t - g(\eta)}{D(s_t)}    (25-2)

According to Assumption I, g(\eta) = Const. Taking the derivative with respect to t on both sides of (25-2), we get:

\frac{d\dot{s}_t}{dt} = \frac{D(s_t)\,K_2\,\dot{s}_t - K_1\,\frac{\partial D(s_t)}{\partial s_t}\,\dot{s}_t}{[D(s_t)]^2}    (26)

Among which,

K_1 = \pi(s_t|\theta) - m\,s_t - C(s_t)\,s_t    (27)

K_2 = \frac{\partial \pi(s_t|\theta)}{\partial s_t} - m - \frac{\partial \left( C(s_t)\,s_t \right)}{\partial s_t}    (28)

From Assumption III, we know that \frac{\partial D(s_t)}{\partial s_t} can be ignored. After simplifying the above equations, we can get:

\frac{d\dot{s}_t}{dt} = \frac{K_2\,\dot{s}_t}{D(s_t)}    (29)

Now we apply the Lyapunov function:

L(t) = \frac{1}{2}\,(s_t - s_t^*)^2    (30)

Among which, s_t^* means the desired value.
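The excerpt stops after (30); as a sketch of the intended next step (assuming a constant desired value s_t^*, and not taken from the paper), differentiating (30) along the trajectory gives

\dot{L}(t) = \frac{d}{dt}\left[ \frac{1}{2}(s_t - s_t^*)^2 \right] = (s_t - s_t^*)\,\dot{s}_t

so the sign conditions of Assumption IV, which force the action a_t (and hence the commanded velocity) to oppose the tracking error s_t - s_t^*, are what would drive \dot{L}(t) \le 0 and keep the closed-loop tracking error from growing.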
Fig. 3: The trajectory of DRL (training episode 1000) and PID controller
Fig. 6: The total error of DRL in different training episodes
Fig. 5: The total error of DRL (training episode 1000) and PID controller

Figure 7 indicates the change of the average reward in relation to the number of training episodes. The reward value converges dramatically and becomes stable after 200 episodes.
In conclusion, after carrying out sufficient trials in reinforcement learning, the trajectory tracking of the DRL shows better performance than that of the PID controller, with higher accuracy and more stability.

x_d(t) = t

The new PID parameters are set at: K_p = 3.01, K_I = 4.96, K_d = 0.0005. We set \eta = [0, 0, 0]^T and v = [0, 0, 0]^T at the beginning of the simulation. The simulation results are as follows:
Figure 8 shows the trajectory of the DRL (training episode 3000) and PID controller, and Figure 9 is the trajectory of the DRL in different training episodes. Figure 10 shows the total error of the DRL (training episode 3000) and PID controller, and Figure 11 shows the total error of the DRL in different training episodes. It can be seen from the figures that more episodes result in better accuracy in the practical trajectory. The performance of the DRL (episode 3000) is more stable and robust than that of the PID controller.
Figure 12 illustrates the change of the average reward in relation to the number of training episodes. It can be seen that the reward value shows an upward trend with significant fluctuations.
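For reference, the PID baseline used in these comparisons can be sketched as a one-dimensional discrete controller with the gains quoted above (the sampling time and the error sequence are illustrative assumptions, not taken from the paper):

class PID:
    # Discrete PID controller: u = Kp*e + Ki*integral(e) + Kd*de/dt.
    def __init__(self, kp=3.01, ki=4.96, kd=0.0005, dt=0.1):
        self.kp, self.ki, self.kd, self.dt = kp, ki, kd, dt
        self.integral = 0.0
        self.prev_error = 0.0

    def control(self, error):
        self.integral += error * self.dt
        derivative = (error - self.prev_error) / self.dt
        self.prev_error = error
        return self.kp * error + self.ki * self.integral + self.kd * derivative

pid = PID()
for e in [0.15, 0.12, 0.08, 0.05, 0.02]:   # e_t = x_d(t) - x(t), made-up values
    print(pid.control(e))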
Fig. 8: The trajectory of DRL (training episode 3000) and PID controller
Fig. 11: The total error of DRL in different training episodes