### Introduction

Unlike the A* algorithm, which simply finds the shortest distance (Lee et al., 2019; Wang et al., 2019), DQN has the advantage of generating routes that satisfy sea conditions and navigation regulations through rewards. With these advantages, reinforcement-learning-based DQN is being studied as one of the powerful methods for generating ship routes. Chen et al. (2019) proposed a ship path generation method based on Q-learning to achieve autonomous navigation of ships without relying on experience. Guo et al. (2021) compared the BUG2 algorithm, the artificial potential field algorithm, the A* algorithm, and the DQN algorithm, and found that DQN performed best for generating ship routes. Therefore, this study adopted DQN as the algorithm for route generation.

### Dynamic ship model

The ship's 6 DOF equation of motion can be written as *M**ν̇* + *C*(*ν*)*ν* + *D*(*ν*)*ν* + *g*(*η*) = *τ*, where *ν* = [*u*, *v*, *w*, *p*, *q*, *r*]^{T}, (*u*, *v*, *w*) are the linear velocities along the three axes of the ship's body-fixed coordinate system, and (*p*, *q*, *r*) are the angular velocities about those three axes. The inertia matrix *M* is represented by *M*_{RB} + *M*_{A}, the sum of the rigid-body and hydrodynamic (added) inertia matrices, and the Coriolis and centripetal force matrix *C*(*ν*) is represented by *C*_{RB}(*ν*) + *C*_{A}(*ν*), the sum of the rigid-body and hydrodynamic Coriolis and centripetal force matrices. The damping matrix *D*(*ν*) is *D* + *D*_{n}(*ν*), the sum of the linear damping matrix and the nonlinear damping matrix. *g*(*η*) is the term associated with the restoring force, and *τ* is the force and moment term, which includes the thrust and rudder used to control the ship.

For route following, the ship's motion is restricted to the horizontal plane: translational motion along the *x* and *y* axes and rotational motion about the *z* axis. To derive the 3 DOF equation of motion, the following conditions must be satisfied.

### 2.1 Ship's 3 DOF equation of motion

*m* and *u* are the mass and forward speed of the ship, respectively, and *X*_{u̇} and *X*_{|u|u} are the added mass and drag coefficient for the surge direction of the ship, respectively. *t* and *T* represent the thrust deduction factor and the propeller thrust. The steering model of the ship is represented by Equation (3) (Davidson, 1946).
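
As a minimal sketch, assuming the standard 1 DOF surge model (*m* − *X*_{u̇})*u̇* − *X*_{|u|u}|*u*|*u* = (1 − *t*)*T* built from the terms defined above (the numeric coefficient values below are illustrative, not from this paper):

```python
def surge_acceleration(u, m, X_udot, X_uu, t, T):
    """Surge acceleration from the standard surge model:

        (m - X_udot) * u_dot - X_uu * |u| * u = (1 - t) * T

    X_udot (added mass) and X_uu (drag coefficient) are conventionally
    negative, so drag opposes the direction of motion.
    """
    return (X_uu * abs(u) * u + (1.0 - t) * T) / (m - X_udot)

# Illustrative (hypothetical) values: 5 m/s ahead, modest thrust.
du = surge_acceleration(u=5.0, m=1.0e5, X_udot=-5.0e3, X_uu=-2.0e2, t=0.2, T=1.0e4)
```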

*ν* = [*v*, *r*]^{T}, and *δ* is the ship's rudder angle as a control input. The inertia matrix *M*_{m}, the matrix *N*_{m}(*u*) (the sum of the Coriolis and centripetal force matrix and the damping matrix), and the input matrix *b*_{m} can be expressed as Equation (4). *x*_{G} is the *x*-axis coordinate of the ship's center of gravity, and *I*_{z} is the moment of inertia about the *z* axis. *Y*_{v̇}, *Y*_{ṙ}, *Y*_{v}, *Y*_{δ}, *N*_{ṙ}, *N*_{r}, and *N*_{δ} are the hydrodynamic coefficients for sway and yaw. From the above equations, the ship's 3 DOF equation of motion is derived.

*x* and *y* are the positions of the ship along the *x* and *y* axes, and *ψ* is the ship's heading angle.
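
The 3 DOF model above can be sketched as a single Euler integration step; the matrices *M*_{m}, *N*_{m}(*u*), and *b*_{m} are passed in as plain 2×2/2×1 arrays because the actual values of Equation (4) are not reproduced here, and the forward speed *u* is held constant:

```python
import math

def step_3dof(state, delta, u, M, N, b, dt):
    """One Euler step of the sway-yaw steering model plus kinematics.

    state = (x, y, psi, v, r); delta = rudder angle [rad]; u = forward speed.
    M, N are 2x2 nested lists standing in for M_m and N_m(u); b is a
    length-2 list standing in for b_m (illustrative, not Equation (4)).
    """
    x, y, psi, v, r = state
    # Right-hand side of M @ nu_dot = b * delta - N @ [v, r].
    f0 = b[0] * delta - (N[0][0] * v + N[0][1] * r)
    f1 = b[1] * delta - (N[1][0] * v + N[1][1] * r)
    # Solve the 2x2 system by Cramer's rule.
    det = M[0][0] * M[1][1] - M[0][1] * M[1][0]
    v += dt * (f0 * M[1][1] - f1 * M[0][1]) / det
    r += dt * (M[0][0] * f1 - M[1][0] * f0) / det
    # Kinematics: rotate body velocities (u, v) into the earth frame.
    x += dt * (u * math.cos(psi) - v * math.sin(psi))
    y += dt * (u * math.sin(psi) + v * math.cos(psi))
    psi += dt * r
    return (x, y, psi, v, r)
```

With zero rudder and zero initial sway/yaw, the sketch reproduces straight-line motion at speed *u*, which is a quick sanity check on the kinematics.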

### 2.2 Specifications of the ship

*L* is the ship length, *L*_{pp} is the length between perpendiculars, *T* is the draft, *B* is the maximum beam, and ∇ is the displacement.

### Design of the route following controller

The fuzzy controller takes the heading error *ψ*_{e}(*k*) at the current sampling time *k*, the velocity *ψ*_{r}(*k*) of *ψ*_{e}(*k*), and the acceleration *ψ*_{a}(*k*) of *ψ*_{e}(*k*) as inputs, and determines the control output through fuzzy rules and inference in real time. The control output is the control increment *dδ*_{c}(*k*) of the rudder that controls the actual ship. *dδ*_{c}(*k*) is then added to the rudder angle *δ*_{c}(*k* − 1) generated at the previous sampling time, resulting in the final rudder angle *δ*_{c}(*k*), which is applied as the ship's control input.

*GE*(*k*), *GA*(*k*), and *GR*(*k*) are the input scale parameters for normalizing the three inputs, and *GU*(*k*) is the scale parameter for the fuzzy output.

### 3.1 Fuzzification algorithm

The output *dδ*_{1}(*k*) for fuzzy control block 1 has three members corresponding to Output Positive (OP), Output Zero (OZ), and Output Negative (ON), as shown in Fig. 2(b). Moreover, the output *dδ*_{2}(*k*) for fuzzy control block 2 has two members corresponding to Output Positive Middle (OPM) and Output Negative Middle (ONM), as shown in Fig. 2(c).

### 3.2 Fuzzy control rules

In the control rules (*R*1)_{1} ∼ (*R*4)_{1} for fuzzy control block 1 and the control rules (*R*1)_{2} ∼ (*R*4)_{2} for fuzzy control block 2, Zadeh's AND logic is applied: the firing strength of each rule's consequent is the MIN of the membership degrees of its two antecedent conditions.
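
Zadeh's AND (MIN) operation for a single rule can be sketched as follows; the triangular membership functions and the "Positive"/"Negative" ranges are hypothetical stand-ins for the actual memberships in Fig. 2:

```python
def tri(x, a, b, c):
    """Triangular membership function with feet at a, c and peak at b."""
    if x <= a or x >= c:
        return 0.0
    return (x - a) / (b - a) if x <= b else (c - x) / (c - b)

def rule_firing_strength(e, de, mu_e, mu_de):
    """Zadeh's AND: the rule fires with the MIN of its antecedent grades."""
    return min(mu_e(e), mu_de(de))

# Hypothetical rule: IF error is Positive AND error rate is Negative THEN ...
strength = rule_firing_strength(
    0.4, -0.2,
    mu_e=lambda x: tri(x, 0.0, 1.0, 2.0),     # "Positive" (hypothetical)
    mu_de=lambda x: tri(x, -2.0, -1.0, 0.0),  # "Negative" (hypothetical)
)
```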

### 3.3 Defuzzification algorithms and control increment

The outputs *dδ*_{1}(*k*) and *dδ*_{2}(*k*) of fuzzy control blocks 1 and 2, obtained through defuzzification, are summed. The sum is then multiplied by the output scale parameter *GU*(*k*) to finally generate the control increment *dδ*_{c}(*k*). Organizing *dδ*_{c}(*k*) according to the fuzzy control rules yields a very simple PID form, as shown in Equation (6). *K*_{i}(*k*), *K*_{p}(*k*), and *K*_{d}(*k*) are as shown in Equation (7).
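
The incremental structure described above (a control increment added to the previous rudder angle, then saturated) can be sketched as follows; the constant gains stand in for the time-varying *K*_{p}(*k*), *K*_{i}(*k*), *K*_{d}(*k*) of Equation (7), whose expressions are not reproduced here:

```python
def incremental_pid(psi_e, psi_e_prev, psi_e_prev2, delta_prev,
                    Kp, Ki, Kd, delta_max):
    """Incremental PID: the rudder increment d_delta(k) is built from the
    last three heading errors, added to the previous rudder angle, and
    saturated to the physical rudder limit (the paper uses +/- 20 deg).
    """
    d_delta = (Kp * (psi_e - psi_e_prev)
               + Ki * psi_e
               + Kd * (psi_e - 2.0 * psi_e_prev + psi_e_prev2))
    delta = delta_prev + d_delta
    return max(-delta_max, min(delta_max, delta))
```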

### Route generation of the ship

### 4.1 Deep Q-Network algorithm

The agent selects an action *A*_{t} in the current state *S*_{t} through the optimal policy determined by the Q-network, and receives the next state *S*_{t}′ and reward *R*_{t}′ from the environment. At this time, the transition (*S*_{t}, *A*_{t}, *R*_{t}′, *S*_{t}′) is stored in the replay buffer, and the Q-network is continuously updated based on the stored information. The neural network in DQN is trained to minimize the loss function, which is given by Equation (8).

*L*(*θ*) is the loss function, *θ* is the parameters of the Q-network, and *θ*^{−} is the parameters of the target network. The target network is a neural network with the same parameter values as the Q-network. The Q-network cannot learn toward a clear goal if the network supplying its targets is itself constantly updated during the course of finding the optimal behavior. To solve this problem, the parameter values of the target network are kept fixed while training toward the desired target, and the updated parameters *θ* are then periodically copied back to *θ*^{−}.
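
A minimal numeric sketch of the DQN loss (mean squared TD error of the form *R*′ + *γ* max_{a′} *Q*(*S*′, *a*′; *θ*^{−}) − *Q*(*S*, *A*; *θ*)), with Q-values given as plain lookup tables rather than neural networks:

```python
def dqn_loss(batch, q, q_target, gamma):
    """Mean squared TD error over a replay-buffer batch.

    batch: list of (s, a, r, s_next) transitions.
    q, q_target: dicts mapping state -> list of action values, standing in
    for the Q-network (theta) and the frozen target network (theta-minus).
    """
    total = 0.0
    for s, a, r, s_next in batch:
        # Bootstrap the target from the *fixed* target network.
        target = r + gamma * max(q_target[s_next])
        td_error = target - q[s][a]
        total += td_error ** 2
    return total / len(batch)
```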

### 4.2 Douglas-Peucker algorithm
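
The Douglas-Peucker algorithm simplifies a dense polyline by recursively keeping only the points that lie farther than a tolerance from the chord between segment endpoints; applied to a generated route, it thins the cell-by-cell path down to a small set of turning points. A textbook sketch (not necessarily this paper's exact implementation; the tolerance used in practice would depend on the grid size):

```python
import math

def perpendicular_distance(p, a, b):
    """Distance from point p to the line through a and b."""
    (px, py), (ax, ay), (bx, by) = p, a, b
    dx, dy = bx - ax, by - ay
    norm = math.hypot(dx, dy)
    if norm == 0.0:
        return math.hypot(px - ax, py - ay)
    return abs(dy * (px - ax) - dx * (py - ay)) / norm

def douglas_peucker(points, epsilon):
    """Recursively drop points closer than epsilon to the chord."""
    if len(points) < 3:
        return list(points)
    # Find the interior point farthest from the endpoint-to-endpoint chord.
    dists = [perpendicular_distance(p, points[0], points[-1])
             for p in points[1:-1]]
    i = max(range(len(dists)), key=dists.__getitem__) + 1
    if dists[i - 1] > epsilon:
        left = douglas_peucker(points[:i + 1], epsilon)
        right = douglas_peucker(points[i:], epsilon)
        return left[:-1] + right  # avoid duplicating the split point
    return [points[0], points[-1]]
```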

### 4.3 Constructing an experimental environment

The grid size of the experimental environment was set in m based on the reference (Lee et al., 2019), and GEBCO's bathymetric data were used to reflect the sea depth information. GEBCO's bathymetric data consist of 400 m × 400 m per pixel; if a cell's value is greater than 0 m, it is classified as island or land, and if it is less than 0 m, as sea. However, by applying the UKC (under-keel clearance) of the ship, areas deeper than 11 m were selected as navigable and areas shallower than 11 m as non-navigable.
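
The thresholding described above can be sketched as follows; GEBCO-style elevation values are negative at sea, so a cell is navigable when its value is below −11 m (depth greater than 11 m). The sample grid values are made up:

```python
def navigable_mask(elevation_grid, ukc_depth=11.0):
    """Classify each cell: True = navigable (depth greater than ukc_depth).

    elevation_grid holds GEBCO-style elevations in metres:
    >= 0 means land/island, < 0 means sea (the value is minus the depth).
    """
    return [[cell < -ukc_depth for cell in row] for row in elevation_grid]

# Made-up 2x3 elevation sample: land, shallow sea, deep sea.
grid = [[ 5.0,  -3.0, -20.0],
        [ 0.0, -11.0, -50.0]]
mask = navigable_mask(grid)
```

Note the strict inequality: a cell at exactly 11 m depth is treated as non-navigable, matching "deeper than 11 m" above.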

### 4.4 Simulation on the route generation

The *∊*-greedy exploration rate was set to 0.2, the learning rate to 0.002, and the discount factor to 0.9. The discount factor determines how much a future reward is valued at present. The replay memory and batch size were set to 2000 and 64, respectively. The activation function of the DQN neural network was ReLU.
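
The stated hyperparameters and the ∊-greedy action selection can be sketched as follows (the discrete action count is hypothetical; the network itself is omitted):

```python
import random

# Hyperparameters as stated in the text.
EPSILON = 0.2          # epsilon-greedy exploration rate
LEARNING_RATE = 0.002
GAMMA = 0.9            # discount factor
REPLAY_SIZE = 2000
BATCH_SIZE = 64

def epsilon_greedy(q_values, epsilon=EPSILON, rng=random):
    """With probability epsilon pick a random action; otherwise be greedy."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=q_values.__getitem__)
```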

### Route following control

The ship's speed was set in m/s, and the maximum angle of the rudder was limited to ± 20° to secure the stability of the ship.

*ψ*_{d}(*k*) is the reference heading angle, *x*_{d} and *y*_{d} are the coordinates of the next waypoint, and *x*(*k*) and *y*(*k*) are the current position of the ship.