
COMPARISON OF MONOLITHIC AND HYBRID CONTROLLERS FOR MULTI-OBJECTIVE SIM-TO-REAL LEARNING

Master of Science Thesis
Faculty of Information Technology and Communication Sciences
Examiners: Prof. Joni-Kristian Kämäräinen, Dr. Nataliya Strokina
April 2021


ABSTRACT

Atakan Dag: Comparison of monolithic and hybrid controllers for multi-objective sim-to-real learning

Master of Science Thesis
Tampere University
Data Engineering and Machine Learning
April 2021

Simulation to real (Sim-to-Real) is a way to accomplish a robotic task by first training in a simulator and then transferring the learned model to the real environment. It is one of the most popular ways to solve a robotic task, since it is easy to implement and test compared to solving the trajectories analytically. Recent successful sim-to-real works have solved single-objective tasks such as "take the object". However, in reality, problems consist of multiple objectives such as

"avoid humans while reaching the target". The most standard reinforcement learning (RL) solution for multi-objective problems is to train a single (monolithic) controller by using a complicated prob- lem specific reward function. The problem with this approach is the difficulty of shaping the reward function (reward engineering). Recently, a hybrid controller solution was proposed, which uses multiple pre-trained single-objective controllers and switches between them according to some conditions.

In this research, the two approaches are compared on a problem where a robotic manipulator tries to reach a target while avoiding collisions with an obstacle. For both methods, the controllers are trained and tested in a simulator environment and then verified on a real set-up. Our results show that the hybrid controller has better performance in terms of success and failure rates. Training the hybrid controller is also easier than training the monolithic controller, due to the ease of shaping the reward functions of the single-objective controllers, and it is less time-consuming.

Keywords: Sim-to-Real, Reinforcement Learning, Deep Reinforcement Learning, Manipulator, Robotic Safety, Collision Avoidance

The originality of this thesis has been checked using the Turnitin OriginalityCheck service.


PREFACE

First of all, I would like to thank my supervisors Professor Joni Kämäräinen and Dr. Nataliya Strokina for giving me the opportunity to work on this project. I was not certain about starting this project at first since I was unfamiliar with the subject, but Joni and Nataliya were always supportive during the research and guided me whenever I needed it.

I would like to thank Wenyan Yang for helping me with the theory of RL and for his great advice on the development of our research. I would like to thank Alexandre Angleraud for helping me in the robotics lab to accomplish my research. Without him, it would have been very difficult for a guy like me who is not very familiar with robotics.

I would also like to thank my dearest Turkish friends in Finland; Duygu, Mehmet, Mert, Kubra and Baran for their friendship and all the adventures that we have experienced in this new country. Without them, I would be in so much trouble.

I would like to thank my friends in Turkey; Ogulcan, Ekin, Mert, Elif, Beste, Sina and Ceylan for the great fun we had through our chats. I know our friendship will last forever even if there are thousands of miles between us.

Lastly, I would like to thank my dear parents Figen and Murat for their support through all my life. Without their care and love, I would not be the person who I am.

Tampere, 14th April 2021

Atakan Dag


CONTENTS

1. Introduction . . . 1

1.1 Objective of the Thesis . . . 2

1.2 Structure of the Thesis . . . 2

2. Related Work . . . 4

3. Theoretical Background . . . 6

3.1 Deep Neural Network . . . 6

3.1.1 Neuron . . . 7

3.1.2 Activation Functions . . . 7

3.1.3 Training the Networks . . . 9

3.2 Reinforcement Learning . . . 9

3.2.1 Value Based Methods . . . 12

3.2.2 Policy Based Methods . . . 14

3.2.3 Deep Deterministic Policy Gradient . . . 16

4. Simulator and Methods . . . 20

4.1 Monolithic Controller . . . 21

4.2 Hybrid Controller . . . 23

4.2.1 Avoiding Sub-Controller . . . 25

4.2.2 Reaching Sub-Controller . . . 25

4.3 Using the Simulator . . . 27

4.3.1 CoppeliaSim . . . 27

4.3.2 PyRep . . . 28

5. Experimental Setup . . . 32

5.1 Vision . . . 34

5.2 Robot Control . . . 35

6. Experiments . . . 37

6.1 Training in the Simulator . . . 37

6.2 Simulation Comparison . . . 43

6.3 Real Robot Test Results . . . 47

7. Conclusion and Future Work . . . 50

References . . . 52


LIST OF FIGURES

1.1 Interaction between agent and environment adapted from [4]. . . 1

3.1 Example of DNN. . . 6

3.2 Illustration of how neuron output is calculated. . . 7

3.3 Popular activation functions. . . 8

3.4 Main branches of machine learning. . . 10

3.5 Calculation of the loss function in DQN. . . 13

3.6 Calculation of the loss functions in DDPG. . . 18

4.1 Screenshot from the simulator. . . 20

4.2 Panda’s joints and links [40]. . . 21

4.3 Overview of monolithic controller training. . . 24

4.4 Overview of hybrid controller. . . 26

4.5 Example of collision detection. . . 28

4.6 Example of minimum distance measurement. . . 29

5.1 Real Environment Set-up. . . 32

5.2 Overall System Architecture. . . 33

5.3 AprilTag markers placed at the center of the objects and on the table. . . . 34

6.1 Example of per episode rewards for monolithic controller (po = 4, ps = 4, τo = 100mm). . . 38

6.2 Convergence of the monolithic controllers with the same setting (po = 4, ps = 4, τo = 50mm). . . 39

6.3 Training results of monolithic controllers with different τo values (ps = 4, po = 4). . . 40

6.4 Per episode rewards of the reaching sub-controllers. . . 41

6.5 Per episode rewards of the avoiding sub-controllers. . . 42

6.6 Average minimum distance per episode during training. Top: Monolithic, Middle: Reaching, Bottom: Avoiding. . . 43

6.7 Simulation test results for different controllers. . . 44

6.8 Experimented scenarios to compare simulation and real performance. . . . 45

6.9 Distance performance of the controllers in 5 different scenarios in simulation. . . 46
6.10 Video frames from the Sim-to-Real experiments. . . 49


LIST OF TABLES

5.1 Joint velocity limits for Panda [48]. . . 36
6.1 Success and collision rates for different combinations of po and ps with τo = 100mm. . . 44
6.2 The success and collision rates of the monolithic and hybrid controllers experimented on 5 different scenarios. . . 48


LIST OF ALGORITHMS

1 DDPG Algorithm (Lillicrap et al. [1]) . . . 18

2 Hybrid controller - one control step . . . 25

3 Training Algorithm with PyRep . . . 30

4 Testing Performance with PyRep . . . 30

5 Calculation of the minimum distance in real environment. . . 35


LIST OF SYMBOLS AND ABBREVIATIONS

At        Action at time step t
α         Learning rate
ANN       Artificial Neural Networks
e         Cartesian coordinates of the end effector
o1        Cartesian coordinates of the target object
o2        Cartesian coordinates of the obstacle
dmin      Minimum distance between robot and obstacle
dT        Minimum distance between end-effector and target
DDPG      Deep Deterministic Policy Gradient
DNN       Deep Neural Network
DQN       Deep Q Network
DRL       Deep Reinforcement Learning
γi        Intrinsic joint positions
γ̇i        Joint velocities
γ         Discount factor
µ1        Policy of avoiding sub-controller
µ2        Policy of reaching sub-controller
µ(s)      Deterministic policy, actor function
MDP       Markov Decision Process
MSE       Mean square error
N         Batch size
Nt        Exploration noise at time step t
π(s)      Deterministic policy function
po        Penalty for getting closer to the obstacle
ps        Reward for reaching the target
π(a|s)    Stochastic policy function
Lcritic   Critic loss
Q(s, a)   Action-value function, expected return of a pair of state and action, critic function
Gt        Estimated return, future reward
Rt        Reward at time step t
ReLU      Rectified Linear Unit
RL        Reinforcement Learning
ROS       Robot Operating System
St        State at time step t
τhyb      Threshold distance for the switching mechanism
τo        Threshold distance between robot and obstacle
τs        Threshold distance between robot and target
τ         Soft replacement factor
TD3       Twin Delayed Deep Deterministic Policy Gradient
TRPO      Trust Region Policy Optimization
V(s)      State value function
θµ        Weights of the actor network
θQ        Weights of the critic network


1. INTRODUCTION

Reinforcement Learning (RL) is becoming increasingly popular in the robotics area with the advances in "Sim-to-Real" approaches [2], [3]. The objective is to train the robot in a simulated environment and then transfer the learned behavior to the real environment.

Recent advances in simulators allow researchers to test their control algorithms with realistic dynamic properties of the robot itself and of the outside world. Since the training time is not limited in the simulated environment, the recent developments in RL algorithms [1], [4]–[6] have become very useful for training a model.

In RL algorithms, a reward is given for each action or at the end of a sequence of actions (an episode), so that the model (agent) eventually learns which action is the best for each specific observation of the environment, which we call the state. That is why it is important to be able to run the model many times and observe the states, actions and rewards in order to come up with a decision-making system for the robot which can collect the optimum rewards until the end of the episode. The interaction between the agent and the environment is illustrated in Figure 1.1. The success of these RL algorithms in a Sim-to-Real set-up depends on how accurately they can perform the learned behaviour in the real environment.

Figure 1.1. Interaction between agent and environment, adapted from [4].

There have been many successful Sim-to-Real controllers [6]–[10] applied to object manipulation tasks with one main objective, such as "pick the object". However, in real-world applications, the problems might have multiple objectives at the same time, such as "reaching the target" and "avoiding obstacles". Those multi-objective tasks are especially important for working environments where the robots need to complete their tasks and at the same time avoid humans to provide safe operation around people [11]–[13].

The most straightforward RL solution to cope with multi-objective problems is to define a reward function according to the objectives, so that proper rewards and punishments are given after each action, and to train a monolithic controller. The difficulty of this method is to find and tune the correct, problem-specific reward function that represents all the objectives. This tuning or shaping of the reward function is called "reward engineering".

A demonstration of a monolithic controller to solve the problem of reaching a target and avoiding an obstacle is given by Pham et al. [14]. In their approach, they gave different rewards according to the distances between the robot, the target and the obstacle.

An alternative solution for a multi-objective problem is to have a separate model for each objective, namely different models for reaching and avoiding, which is called a hybrid controller.

This idea was proposed by Sangiovanni et al. [15]. They applied a switching rule between the models according to the distance between the robot and the obstacle. The problem with this method is to find a proper switching rule, and the authors did not verify the solution on a real robot.

1.1 Objective of the Thesis

The objective of this thesis is to compare the two approaches, monolithic and hybrid controllers, on a multi-objective RL problem where a robotic manipulator tries to reach a target object while trying to avoid collisions with another object in a Sim-to-Real set-up. For both methods, the controllers are trained in a simulator and then verified in a real environment. The main objectives are:

1. Implementation of the monolithic and hybrid controllers for the "reaching a target and avoiding an obstacle" problem in a simulated environment by using reinforcement learning.

2. Comparison of the two controllers in terms of their training process.

3. Comparison of the test results of the two controllers in the simulated environment in terms of their success and collision rates and their behaviours.

4. Verification of the simulation results in the real environment by transferring the trained models to the real Franka Emika Panda manipulator.

1.2 Structure of the Thesis

The rest of the thesis is structured as follows. Chapter 2 presents work related to our research. Chapter 3 explains the theoretical background behind our RL algorithm. Chapter 4 explains the methods that we used in our RL algorithm and simulation. Chapter 5 explains the real robotic system that we used to test our algorithm. Chapter 6 shows the results obtained both in the simulated and in the real environment. Chapter 7 concludes the thesis by discussing the results.


2. RELATED WORK

Sim-to-Real research is becoming increasingly popular with the great improvements in simulators. The main challenge, the transfer of the trained controller to a real environment, has become easier with the advances in robotics. Some of the most popular Sim-to-Real tasks are picking and placing an object and reaching a static or dynamic target object while static or dynamic obstacles are around. To solve these tasks, reinforcement-learning-based approaches are used. Some of them have shown great success in simulators [15]–[19], while some have additionally shown success in transferring the trained controller to the real environment [6]–[9], [14], [20]. All these works use different RL algorithms or different reward functions. The manipulator types, sensor inputs and object recognition methods also differ between the works.

Some works concentrate more on the problems which occur due to the differences between the real and simulated environments. Matas et al. [3] worked on deformable objects such as clothes, which are more difficult to recognize, pick and place compared to rigid ones. Golemo et al. [21] proposed Neural-Augmented Simulation (NAS) to transfer the learned policies better to the real environment and close the gap between the real and simulated environments. Peng et al. [22] demonstrated that by adding randomness to the dynamics of the simulator during training, it is possible to close the gap between the real and simulated worlds, since that gap is generally caused by noise and modeling errors in the real environment.

The works most similar to ours are Wen et al. [16] and Pham et al. [14], which are monolithic controller approaches, and Sangiovanni et al. [15], which uses the hybrid controller approach. All of these works focus on the multi-objective task of reaching a target and avoiding an obstacle by using RL. Wen et al. [16] used Deep Deterministic Policy Gradient (DDPG) to train the model. They defined a dense reward which depends on the distance to the target and a negative penalty when the robot is closer to the obstacle than a predefined threshold. However, they experimented with a single scenario in the simulated environment and did not verify the solution in the real environment. Pham et al. [14] built their RL algorithm on Trust Region Policy Optimization (TRPO) and defined their reward functions as sparse rewards according to the distance to the target, the distance to the obstacle and collisions. They also proposed "OptLayer", a neural network that takes unsafe actions as input and outputs the closest safe action. They verified their work in a real environment with a UR5 manipulator. Sangiovanni et al. [17] first proposed a reward function which has multiple terms with different weights according to the distance to the target, the distance to the obstacle and the magnitude of the actions. They further developed their approach in [15] by separating the model into a dual mode for reaching and avoiding, which we call a hybrid controller. For the reaching mode, they used a motion planner working in the operative space, which basically tracks a reference trajectory to move the end-effector to the target. For the collision avoiding mode, they used a Deep Reinforcement Learning (DRL) approach. However, they did not verify their work in a real environment. In our work, we compare the monolithic controller and the hybrid controller, which are both trained by using the DDPG algorithm with dense and sparse rewards.

There is also another approach that uses multiple policies, which solves the task by approximating a set of Pareto optimal policies [23]. Xu et al. [24] proposed an evolutionary algorithm to find those Pareto set approximations. They have shown their results only in a simulated environment. This approach is not yet practical, since the number of parameters needed to build the optimal sets increases with the complexity of the task.


3. THEORETICAL BACKGROUND

This chapter provides the background information necessary to understand the models that are used to implement the solution: Deep Neural Networks (DNN) and Reinforcement Learning (RL). In the DNN part, we explain the terminology that is used in DRL algorithms. In the RL part, we explain the RL background required for the DDPG algorithm.

3.1 Deep Neural Network

DNNs are a sub-type of Artificial Neural Networks (ANN), which are inspired by the human brain. The aim is to find a non-linear mapping from the input information to the desired output by minimizing a defined loss. ANNs consist of input, hidden and output layers, where each layer is constructed from neurons and their connections to the neurons of the following layer. DNNs have multiple hidden layers between the input and output layers [25]. A simple example of a DNN can be seen in Figure 3.1.

Figure 3.1. Example of a DNN with an input layer (ℝ³), two hidden layers (ℝ⁵) and an output layer (ℝ¹).


3.1.1 Neuron

Neurons are the most basic units of neural networks. Each neuron in the hidden and output layers takes the values of the neurons in the previous layer, multiplies each of them with a weight specific to that connection, and sums them. A bias term can also be added to that summation to adjust the output. After that, the calculated value goes through a function called an "activation function" to produce the output value of that neuron. The illustration of this operation, which is also called a perceptron, can be seen in Figure 3.2.
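Written compactly with the notation of Figure 3.2 (inputs x_1, …, x_n, weights w_1, …, w_n, bias b and activation function f), the operation described above is

y = f\Big(\sum_{i=1}^{n} w_i x_i + b\Big)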

Figure 3.2. Illustration of how the neuron output is calculated from the inputs x1, …, xn, the weights w1, …, wn, the bias and the activation function.

3.1.2 Activation Functions

Activation functions are important because they define the output value of each neuron. By applying nonlinear functions to the neurons' outputs, we ensure the non-linear transformation which enables a nonlinear mapping between the input and the output, which is the basic aim of these networks.

Rectified Linear Unit. The Rectified Linear Unit (ReLU) is the most common activation function for hidden-layer neurons because of its nonlinearity and computational efficiency. It activates only positive values, simply setting negative values to zero. This simplicity helps the network to compute faster. By using many ReLU functions, we can fit our data to any non-linear function. Also, ReLU is very close to a linear function, which makes it useful with gradient-based methods [26]. The ReLU function is defined as

f(x) = max(0, x)    (3.1)

The shape of the ReLU function can be seen in Figure 3.3a.

Figure 3.3. Popular activation functions: (a) the ReLU function, (b) the tanh function.

Hyperbolic Tangent Function. The other activation function is the hyperbolic tangent function (tanh). This function is generally used in output layers and is useful for scaling the output to a desired range. Normally, the output range of tanh is [−1, 1]; by multiplying this output with a constant a, one can obtain an output range of [−a, a]. The tanh function is defined as

f(x) = \frac{e^x − e^{−x}}{e^x + e^{−x}}    (3.2)

The shape of the tanh function can be seen in Figure 3.3b.

3.1.3 Training the Networks

The aim of the training process is to update the weights so that the mapping from the inputs to the outputs becomes as similar as possible to the desired outputs. The difference between the predicted and desired output is measured with a loss function [26], and training minimizes this loss. The loss function we use in our algorithms is the Mean Square Error (MSE), defined as

MSE = \frac{1}{n} \sum_{i=1}^{n} (y_i − \hat{y}_i)^2    (3.3)

where y is the desired output, ŷ is the predicted output and n is the number of elements in y. MSE is a popular choice for problems where the task is to approximate a function, which is also called regression [27]. After calculating this loss value, the aim is to decrease it by updating the weights in the direction in which the loss decreases; this method is called gradient descent. To update the weights in all of the layers, the gradients calculated at the last layer are used in the gradient calculation of the previous layer, and this continues until the input layer; this is called back-propagation [28]. The training can be done on batches of training data rather than on each training sample separately. In this way, the average of the gradients for each weight is calculated and then the weights are updated, which improves the speed of the training process.
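As a small, self-contained illustration of one MSE evaluation and one gradient-descent step (the single linear neuron, the toy data and the learning rate below are only illustrative and are not the networks used in this thesis):

import numpy as np

# Toy regression data: 32 samples with 3 features and a known linear desired output.
rng = np.random.default_rng(0)
X = rng.normal(size=(32, 3))
y = X @ np.array([0.5, -1.0, 2.0])

w = np.zeros(3)      # weights of a single linear neuron (bias omitted for brevity)
lr = 0.1             # learning rate

for step in range(100):
    y_hat = X @ w                          # predicted output
    error = y_hat - y
    mse = np.mean(error ** 2)              # loss of Equation 3.3
    grad = (2.0 / len(y)) * X.T @ error    # gradient of the MSE with respect to w
    w -= lr * grad                         # gradient-descent update of the weights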

It is important to make sure that the learned function does not fit the given data too precisely, which would lead to wrong predictions for unseen data. This is called overfitting, and we can also face it in RL algorithms.

3.2 Reinforcement Learning

Reinforcement learning (RL) is one of the three branches of machine learning; the others are supervised learning and unsupervised learning. In supervised learning, input-output pairs are given to the model to learn the mapping between the input and the output; namely, the model is supervised by data telling it how good the prediction is. In unsupervised learning, the given data is unlabeled; there is no supervision with input-output pairs, and the model itself needs to learn how to group and differentiate the given data. RL is different from both of them. In RL, the aim is to optimize the cumulative reward collected from the environment by taking actions. Although RL can be seen as similar to unsupervised learning, since there is no labeled data, it does not try to learn the structure of the input data; it tries to maximize the reward obtained [4]. Similarly, RL can be thought of as similar to a supervised learning solution, since we are supervising the predictions with rewards. Therefore, we can put RL in a third, different branch which has common areas with supervised and unsupervised learning. The main branches of machine learning can be seen in Figure 3.4.

Figure 3.4. Main branches of machine learning: supervised learning, unsupervised learning and reinforcement learning.

In the RL framework, the problem is modeled as a Markov Decision Process (MDP), which was illustrated in Figure 1.1. In the following, the basic terms of MDPs and RL are explained.

In an MDP, the agent is our robot, which is responsible for taking actions (a ∈ A) according to the states (s ∈ S). States are the observations from the environment. If the observations describe the full state, the problem is an MDP; if the observations describe the state only partially, it is a Partially Observed MDP (POMDP). States can consist of both the internal information of the agent and the objects in the environment. Both states and actions can be continuous or discrete depending on the problem. Rewards (r ∈ R) tell the agent how good the taken action was for that state by giving a feedback value. During the agent's interaction with the environment, it first observes the state and decides on the action. Then it receives a feedback reward and observes the next state after performing the chosen action in the environment. Then it updates its policy to choose better actions. Each iteration (decision) is called a step. The sequence of states from the beginning step to the last step is called an episode.

In RL, the term model is used for the predictions of the next states after performing an action; namely, models are used for planning the future when choosing the actions [4]. If we learn or have a model of the environment, then our method is model-based; otherwise, it is model-free. In our work, we deal with model-free methods.


Policy. The policy π is the function that maps states to actions. It is also called the behavior of the agent. It can be stochastic or deterministic. If it is stochastic, it outputs the probabilities of the actions and is denoted π(a|s). If it is deterministic, it outputs a specific action and is denoted π(s).

Value Functions. Value functions show how good a state is in terms of the total reward earned until the end of the episode after that state. However, this future reward, which is also called the return (G_t), is not directly calculated by summing the rewards. The summation uses a discount factor (γ ∈ [0, 1]) for the upcoming rewards, to make the rewards from the nearer steps more important. The future reward after time step t is shown in Equation 3.4. Another reason for discounting is to deal with episodes with an extremely high number of steps, whose later rewards have little effect on the first steps. The smaller the discount factor, the less importance is given to the further states. The state value function is the expected future reward from state s when choosing the actions using policy π, and it is shown in Equation 3.5.

G_t = \sum_{i=0}^{\infty} γ^i R_{t+i+1}    (3.4)

V_π(s) = E_π[G_t | S_t = s]    (3.5)

A function similar to the state-value function is the Q value, which is also called the action-value function. This function shows the expected future reward after being in state s and choosing action a in that state; namely, it shows how good it is to choose that action in that state. It is defined as

Q_π(s, a) = E_π[G_t | S_t = s, A_t = a]    (3.6)

Temporal Difference Learning. To predict these value functions, Temporal Difference (TD) learning was proposed by Sutton et al. [29]; it is a model-free algorithm. By predicting the value function, one can find the optimal policy. The simplest form of TD learning is

V(S_t) = V(S_t) + α[G_t − V(S_t)]    (3.7)

We can write G_t in terms of R_t and V(S_t) by using Equation 3.4. Thus, the general form of TD learning is

V(S_t) = V(S_t) + α[R_{t+1} + γV(S_{t+1}) − V(S_t)]    (3.8)

where α is the learning rate. This equation can be thought of as a milestone in RL and is used in many of the algorithms.
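As a small numerical illustration of Equation 3.8 (the numbers are arbitrary and chosen only for demonstration), suppose α = 0.1, γ = 0.9, V(S_t) = 2, R_{t+1} = 1 and V(S_{t+1}) = 3. Then

V(S_t) ← 2 + 0.1 [1 + 0.9 · 3 − 2] = 2.17

so the estimate of V(S_t) moves a small step toward the bootstrapped target R_{t+1} + γV(S_{t+1}) = 3.7.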

We will explain some of the model-free algorithms which are important for us. The algorithms can be grouped into three families: value-based, policy-based and actor-critic methods. In value-based methods, the value function is updated to find the optimal one; the optimal policy then becomes the one that follows the optimal value function. In policy-based methods, the policy function is updated in order to find the optimal policy. In actor-critic methods, both the policy and the value function are updated; we will mention the basic Actor-Critic algorithm among the policy-based methods and present the DDPG algorithm in a separate section, since it combines both kinds of methods.

3.2.1 Value Based Methods

There are two types of value-based methods. On-policy methods learn only from new observations collected with the current policy. Off-policy methods can also use past experiences that were collected with different policies. In this part, two off-policy value-based methods are explained: Q-learning and the Deep Q Network.

Q-learning. Q-learning [30] is an off-policy TD control algorithm. The aim is to update the Q values of all state-action pairs, which together form the Q table. The Q value of the state S_t and action A_t pair is updated as

Q(S_t, A_t) = Q(S_t, A_t) + α[R_{t+1} + γ \max_a Q(S_{t+1}, a) − Q(S_t, A_t)]    (3.9)

Here, the action A_t is generally chosen by a method called ε-greedy: the action with the best Q value is chosen (1 − ε) percent of the time, and a random action is chosen ε percent of the time. The update in Equation 3.9 is done at every step of each episode until the terminal state, in which the episode is over. This iteration is repeated until all the values of the Q table converge to their optimal values. Ideally, convergence is reached when

[R_{t+1} + γ \max_a Q(S_{t+1}, a) − Q(S_t, A_t)] = 0    (3.10)

which means

Q(S_t, A_t) = R_{t+1} + γ \max_a Q(S_{t+1}, a).    (3.11)

Q-learning is especially useful for simple problems with a discrete state and action space. For example, in the taxi environment of OpenAI Gym [31], there are a total of 500 states and 6 actions, and it is feasible to fill such a Q table. However, for problems with a continuous state space it is not possible to fill a table. Therefore, we need an approximation method that predicts the values from the states rather than storing a table entry for each pair. As we explained in Section 3.1, DNNs are the best candidates to make those approximations, which leads us to a method called the Deep Q Network.
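A minimal tabular sketch of the update in Equation 3.9 with ε-greedy action selection is shown below. It assumes the classic OpenAI Gym interface for discrete environments such as the taxi environment mentioned above (env.reset() returning the state, env.step() returning (state, reward, done, info)); the hyperparameter values are illustrative only.

import numpy as np

def q_learning(env, episodes=500, alpha=0.1, gamma=0.99, epsilon=0.1):
    # Q table: one row per discrete state, one column per discrete action.
    Q = np.zeros((env.observation_space.n, env.action_space.n))
    for _ in range(episodes):
        state = env.reset()
        done = False
        while not done:
            # epsilon-greedy action selection.
            if np.random.rand() < epsilon:
                action = env.action_space.sample()
            else:
                action = int(np.argmax(Q[state]))
            next_state, reward, done, _ = env.step(action)
            # Tabular update of Equation 3.9; the future term is dropped at the terminal state.
            target = reward + gamma * np.max(Q[next_state]) * (not done)
            Q[state, action] += alpha * (target - Q[state, action])
            state = next_state
    return Q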

Deep Q Network. Mnih et al. [32] proposed the Deep Q Network (DQN) to approximate Q values for environments with a very large state space. They proposed this method for Atari games, where the states are frames taken from the game and the actions are discrete commands, but the states can also consist of continuous values. In this method, the Q function represented by a DNN is called the Q local network. It takes the state as input and predicts the Q values of the actions; the value of the chosen action, Q(S_t, A_t), is also called Q expected. To update this network, Equation 3.9 and Equation 3.11 are used. We said that the Q value converges to the optimal one when Q(S_t, A_t) gets closer to R_{t+1} + γ \max_a Q(S_{t+1}, a), which is called the Q target. Therefore, the loss function of the Q network that needs to be minimized can be defined as

Loss = MSE(R_{t+1} + γ \max_a Q(S_{t+1}, a), \; Q(S_t, A_t))    (3.12)

Here, it is difficult to train the network because the target term Q(S_{t+1}, a) changes at each step, which makes the training unstable. Therefore, another DNN called the Q target network is used to predict \max_a Q(S_{t+1}, a), which is also called the Q target. As this network can predict a Q value for each action in that state, it can output the maximum of these Q values. The weights of this network are fixed for k steps and then updated by exactly copying the weights of the Q local network every k steps.

Another important issue is how to train the Q local network with batches of data. The solution is a method called Experience Replay [33], which is a buffer that keeps the experiences as (S_t, A_t, R_t, S_{t+1}, done). The term done is a binary variable which indicates whether the episode is finished. If we are in the last state, the Q target has no effect on the loss function, since there is no expected future reward. The diagram in Figure 3.5 illustrates the calculation of this loss function.

Figure 3.5. Calculation of the loss function in DQN (the experience replay provides states, actions, rewards, next states and done flags to the Q local and Q target networks).

This loss function is used to train the Q local network at each step. At the same time, the Q local network continues to produce actions by taking the present state as input and outputting the action which is predicted to have the highest Q value, with an ε-greedy selection policy. Those actions are performed by the agent to produce experiences by interacting with the environment. This iteration continues until the predicted Q values converge.

This algorithm is feasible for problems with a discrete and small action space, because the Q local network needs to predict the Q value of each action in the present state in order to choose the action, and the Q target network needs to predict the Q value of each action in the next state during the calculation of the loss function. This is feasible for environments with discrete actions, but it is not possible to predict a Q value for each action when the action space is continuous and large. Therefore, DQN is not a good choice for autonomous driving or robot control problems, where the actions are generally continuous values such as motor velocities.
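The following sketch shows how the loss of Equation 3.12 is typically computed on a replay batch. PyTorch is used here purely for illustration; the thesis does not state which deep learning framework was used, and the network objects q_local and q_target are assumed to map a state batch to one Q value per discrete action.

import torch
import torch.nn.functional as F

def dqn_loss(q_local, q_target, batch, gamma=0.99):
    # batch: tensors sampled from the experience replay.
    states, actions, rewards, next_states, dones = batch
    # Q expected: the Q value of the action actually taken, predicted by the local network.
    q_expected = q_local(states).gather(1, actions.long().unsqueeze(1)).squeeze(1)
    # Q target: bootstrapped target from the frozen target network (Equation 3.12);
    # the done flag removes the future term at the end of an episode.
    with torch.no_grad():
        q_next = q_target(next_states).max(dim=1).values
        target = rewards + gamma * q_next * (1.0 - dones)
    return F.mse_loss(q_expected, target)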

3.2.2 Policy Based Methods

Policy-based methods are good choices for dealing with problems with very large state and action spaces. Policy gradient methods optimize the policy function instead of the value function. By moving the parameters of the policy in the direction in which the gradient of the expected return ascends, we can find the optimal policy with the maximum return. This approach was first proposed in the REINFORCE [34] and Actor-Critic [35] algorithms.

REINFORCE. The REINFORCE algorithm uses the return (G_t) estimated with Monte-Carlo (MC) methods at each time step of the trajectory. MC methods are very similar to TD-learning methods; the only difference is that MC methods use the whole episode trajectory to learn, whereas TD learning can use incomplete episodes. The update of the policy parameters in the REINFORCE algorithm is written as

θ = θ + α γ^t G_t ∇_θ ln π_θ(A_t | S_t)    (3.13)

Actor-Critic. In the Actor-Critic algorithm, the value function is also updated besides the policy function. The critic is responsible for updating the parameters of the value function (w), and the actor is responsible for updating the parameters of the policy function (θ). The update of θ is done as

θ = θ + α_θ Q_w(s, a) ∇_θ ln π_θ(a | s)    (3.14)

The update of w is done by using the TD error at every time step (δ_t):

δ_t = r_t + γ Q_w(s_{t+1}, a_{t+1}) − Q_w(s_t, a_t)
w = w + α_w δ_t ∇_w Q_w(s, a)    (3.15)

where α_θ and α_w are the learning rates used to update the policy function parameters θ and the value function parameters w, respectively. This actor-critic framework is widely used in recent DRL algorithms.

Policy Gradient Theorem. A more generalized approach to policy gradients was proposed by Sutton et al. [36]. Before diving into the algorithms we used in this work, it is better to understand the policy gradient theorem [36]. The expected return (reward or objective function) in a continuous space is defined as

J(θ) = \sum_{s∈S} d^π(s) V^{π_θ}(s) = \sum_{s∈S} d^π(s) \sum_{a∈A} π(a | s, θ) Q^π(s, a)    (3.16)

where d^π(s) is the stationary distribution of states under the policy π. The aim is to update the parameters of the policy (θ) as

θ = θ + α ∇_θ J(π_θ)    (3.17)

where α is the learning rate. The gradient of the objective function is given by Equation 3.18, which is called the Policy Gradient Theorem [36].

∇_θ J(θ) = \sum_{s∈S} d^π(s) \sum_{a∈A} Q^π(s, a) ∇_θ π_θ(a | s)    (3.18)

Deterministic Policy Gradient. In the previous methods, the policy gradient is used for stochastic policies. Silver et al. [37] proposed the Deterministic Policy Gradient (DPG), which is the policy gradient for deterministic policies. The objective function for continuous action values is defined as

J(θ) = \int_S p^µ(s) Q(s, µ_θ(s)) \, ds    (3.19)

where p^µ(s) is the state distribution, defined similarly to the stochastic case in the Policy Gradient Theorem. The Deterministic Policy Gradient Theorem [37] defines the gradient of this objective function as

∇_θ J(θ) = \int_S p^µ(s) ∇_θ µ_θ(s) ∇_a Q^µ(s, a)|_{a=µ_θ(s)} \, ds    (3.20)

which is found by simply applying the chain rule. This gradient is used in algorithms which update the deterministic policy parameters in the direction this gradient suggests. One of these algorithms is DDPG, which is the method that we use in our research.

3.2.3 Deep Deterministic Policy Gradient

Lillicrap et al. [1] proposed an algorithm called Deep Deterministic Policy Gradient (DDPG). It is a model-free, off-policy algorithm which combines the DQN and DPG algorithms. It uses an actor-critic framework with local and target networks. As we mentioned in Section 3.2.1, DQN can map states to discrete actions; DDPG can map states to a continuous action space, which makes it one of the best candidates for complex problems where the actions are continuous, such as robotics tasks.

Since it is an off-policy method, we use an experience replay (explained in Section 3.2.1) which consists of (states, actions, rewards, next states). In this method, the actor network (µ) maps states to action values; this network represents the deterministic policy. The critic network (Q) predicts the Q value of a state-action pair. Both of these networks also have target networks, similar to their usage in DQN; we call the original networks local. The actor local network takes a state from the experience replay as input and predicts the continuous action values. The critic local network takes states from the experience replay together with the actions predicted by the actor local network and predicts the Q value of each pair. The actor's aim is to maximize that Q value; therefore, we can use the negative of this predicted Q as the loss function of the actor local network. Namely, the actor tries to minimize the negative of the Q value predicted by the critic local network. This update of the actor network was shown in Equation 3.20. The critic local network also predicts the Q value of the state-action pairs taken from the experience replay; this prediction (q) is used in the update of the critic network in Equation 3.21. All of these updates are done for batches of experiences to speed up the process, which is one of the main reasons for using an experience replay.

On the other hand, the actor target network takes the next states from the experience replay and predicts the next actions. The critic target network takes these next actions together with the next states as input and predicts the Q value of each pair. Similar to DQN, the critic's aim is to minimize the TD error, which is defined as

TD_{error} = MSE(q, q_{target})    (3.21)

where q is the output of the critic local network and q_{target} is calculated by using the output of the critic target network (q_) as

q_{target} = r_t + γ · q_    (3.22)

After that, we can use the TD error as the loss function of the critic local network, since we want to minimize this error. To update the target network parameters, a soft replacement method is used. The update equations for the parameters are

θ^{Q'} ← τ θ^Q + (1 − τ) θ^{Q'}
θ^{µ'} ← τ θ^µ + (1 − τ) θ^{µ'}    (3.23)

where θ^Q, θ^{Q'}, θ^µ and θ^{µ'} are the parameters of the critic local, critic target, actor local and actor target networks respectively, and τ is the soft replacement factor. The diagram that shows the calculation of the loss functions for the actor and critic networks can be seen in Figure 3.6. Red arrows show the path to update the critic, and blue arrows show the path to update the actor.

It is important to mention how the experiences are collected. First, the randomly initialized actor local network observes the state from the environment and outputs an action. We add some Gaussian noise to that action to have better exploration, in other words, to observe a wider variety of states. Until the experience replay buffer becomes full, the experiences are collected by this randomly initialized actor with added noise. This action selection process is formulated as

a_t = µ(s_t | θ^µ) + N_t    (3.24)

After the replay is full, the networks start to update themselves according to the diagram in Figure 3.6, and we gradually decrease the randomness of the noise we add to the actions, because we also want to exploit what we have already found. This balance is very important in RL algorithms and is called the exploitation vs. exploration trade-off. If the agent explores too much, it cannot perform the actions which return the optimum reward. However, if it exploits too much, it cannot discover a broad range of actions in the action space and cannot find the optimum point which returns the maximum reward.

Figure 3.6. Calculation of the loss functions in DDPG (the experience replay provides states, actions, rewards and next states to the actor/critic local and target networks; the actor update maximizes q, the critic update minimizes the critic loss).

The whole sequence explained above continues until the total reward obtained per episode converges. The complete DDPG algorithm can be seen in Algorithm 1.

Algorithm 1: DDPG Algorithm (Lillicrap et al. [1])

Randomly initialize the actor and critic network parameters;
Initialize the experience replay buffer;
for episode = 1, M do
    Initialize the Gaussian noise for action selection;
    Observe the initial state s1;
    for t = 1, T do
        Select action at by using Equation 3.24;
        Perform at and observe the reward rt and the next state st+1;
        Store the experience (st, at, rt, st+1) in the experience replay;
        Sample a batch of N experiences (si, ai, ri, si+1) from the replay;
        Compute the targets yi by using Equation 3.22;
        Update the critic by minimizing the loss L = (1/N) \sum_i MSE(yi, Q(si, ai | θ^Q));
        Update the actor by using the policy gradient of Equation 3.20;
        Update the parameters of the target networks by using Equation 3.23;
    end
end
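As a compact illustration of the update steps inside Algorithm 1 (Equations 3.20-3.23), the sketch below performs one DDPG update on a sampled batch. PyTorch is used only for illustration, the network and optimizer objects are assumed to exist, and the hyperparameter values are illustrative rather than the ones used in the thesis.

import torch
import torch.nn.functional as F

def ddpg_update(actor, actor_target, critic, critic_target,
                actor_opt, critic_opt, batch, gamma=0.99, tau=0.005):
    states, actions, rewards, next_states = batch

    # Critic update: minimize the TD error of Equations 3.21-3.22.
    with torch.no_grad():
        next_actions = actor_target(next_states)
        q_next = critic_target(next_states, next_actions)   # q_ in the text
        q_target_val = rewards + gamma * q_next              # Equation 3.22
    q = critic(states, actions)
    critic_loss = F.mse_loss(q, q_target_val)                # Equation 3.21
    critic_opt.zero_grad()
    critic_loss.backward()
    critic_opt.step()

    # Actor update: maximize the critic's Q value, i.e. minimize its negative (Equation 3.20).
    actor_loss = -critic(states, actor(states)).mean()
    actor_opt.zero_grad()
    actor_loss.backward()
    actor_opt.step()

    # Soft replacement of the target network parameters (Equation 3.23).
    for target_net, local_net in ((critic_target, critic), (actor_target, actor)):
        for tp, lp in zip(target_net.parameters(), local_net.parameters()):
            tp.data.copy_(tau * lp.data + (1.0 - tau) * tp.data)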

DDPG is very appropriate for robotics tasks, where the actions are the velocities, torques or positions of the joints of the robotic arm, which are all controlled by continuous values. Also, the state space consists of these joint positions, actions and the positions of objects in 3D space, which are all continuous. Therefore, an algorithm that is capable of mapping continuous states to continuous actions is needed. There are extensions of DDPG such as Hindsight Experience Replay (HER) [38], which especially deals with sparse rewards. The idea of this method is to pretend that the last terminal state was the goal state even though it is not. The experience is then relabeled with that new trajectory and a sparse reward is obtained. This approach increases the learning performance in environments with sparse rewards, since without HER it would not be possible to learn from the experiences that could not reach the goal state. There are also more advanced algorithms based on DDPG, such as Twin Delayed DDPG (TD3) [39], which is a more stabilized version of DDPG. It uses two Q functions (a twin) instead of one and always takes the smaller Q value during the calculation of the loss. It also delays the actor update so that it is performed only once for every two critic updates. In our research, we only use DDPG as our basic algorithm for the sake of simplicity, and because it is the most common algorithm used for these kinds of robotics applications.


4. SIMULATOR AND METHODS

This chapter describes the two main controllers that are examined throughout the research and the software that is used to train the controllers in simulation. Before diving into the controllers, the following explains the environment that we are working with. A screenshot from the simulator (explained in Section 4.3.1) is shown in Figure 4.1. The objective is to reach the target object (o1 : (x1, y1, z1)) with the end-effector (tooltip) of the manipulator (e : (x3, y3, z3)) and to avoid the rectangular obstacle (o2 : (x2, y2, z2)). In this task, the target object is randomly placed in the shaded dark gray area at the beginning of the episode, and the obstacle moves randomly in the light gray area (which includes the dark gray area) in Figure 4.1. The robotic manipulator used for this problem is the Franka Emika Panda.

Figure 4.1. Screenshot from the simulator.


In this work, the controllers are based on DDPG. Their input is the observation from the environment and their output is the action to control the robot. The actions are the velocity commands γ̇i of each joint of the manipulator. Joints are the movable parts of the robot. Links are the parts that are connected to the joints. By sending commands to the motors of the joints, adjacent links move relative to each other, and therefore the whole manipulator makes its movement. There are a total of 7 joints in the Franka Emika Panda. Their positions and the lengths (in meters) of the adjacent links can be seen in Figure 4.2.

Figure 4.2. Panda's joints and links [40].

4.1 Monolithic Controller

The monolithic controller uses only one controller, trained by the DDPG algorithm, to accomplish the multi-objective task. That is why it is called monolithic. Wen et al. [16] used DDPG in their work to solve the same multi-objective task. The difference in our controller is the reward function.

Reward Function. One of the most important things to consider during the training of DDPG-based controllers is the selection of a reward function to accomplish the task. Finding a proper reward function for a multi-objective task is challenging and requires a lot of time spent on reward engineering. In addition, most of these reward functions are environment specific, which means the function should be adjusted for other kinds of robot environments. Our reward function for the multi-objective task consists of both dense and sparse rewards. Dense rewards are given at every step according to some formula, whereas sparse rewards are given only if a specific condition is met after that action. These kinds of reward functions are used in some of the previous works [14], [16].

The reward function for the monolithic controller is a composition of three different terms

r = R_t + R_o + R_s    (4.1)

where R_t is a dense reward responsible for moving the end-effector closer to the target object, R_o is a sparse reward responsible for avoiding getting closer to the obstacle, and R_s is a sparse reward for successful actions, namely reaching the target object without colliding with the obstacle.

R_t is defined as the negative distance in meters from the end-effector to the center of the target object:

R_t = −d_T    (4.2)

With this dense reward given at every step, the controller will try to maximize R_t, which means minimizing d_T. In this way, the reaching task is accomplished.

R_o is defined as a sparse reward (penalty) for getting any part of the robot closer to the obstacle. If the minimum distance between any part of the robot and any part of the obstacle, d_min, is smaller than a fixed threshold τ_o, then a negative reward (penalty) p_o is given:

R_o = \begin{cases} -p_o, & \text{if } d_{min} < τ_o \\ 0, & \text{otherwise} \end{cases}    (4.3)

With this reward term, the controller will eventually try to find actions which both maximize R_t and avoid getting penalties from R_o.

R_s is defined as a sparse positive reward p_s given if the distance between the end-effector and the target object, d_T, is smaller than a threshold τ_s and no collision occurred during that step. It is defined as

R_s = \begin{cases} p_s, & \text{if } d_T < τ_s \text{ and } col = 0 \\ 0, & \text{otherwise} \end{cases}    (4.4)

where col is a binary flag which is 1 if a collision occurred and 0 if not. With this reward term, the controller will try to stay in the target region as long as it does not collide with the obstacle. Since the obstacle might come near the target after the manipulator has reached it, this reward helps the controller to take avoidance decisions in that state. There is no need to adjust separate weights for the reward terms, since adjusting the penalty values does that. One can shape the controller to avoid more by increasing p_o and decreasing p_s, or to reach more by increasing p_s and decreasing p_o. The challenge is to find a balance between the terms so that the controller can perform optimally for both objectives.

We set p_o = 4.0 and p_s = 4.0 for this controller. The results for different penalties are summarized in Section 6.2.
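A minimal Python sketch of the composite reward in Equations 4.1-4.4 is given below. The function and argument names are illustrative, not taken from the thesis code, and the default thresholds are assumptions (a τ_o of 50-100 mm appears in the experiments; τ_s is not restated in this section).

def monolithic_reward(d_t, d_min, collided, p_o=4.0, p_s=4.0, tau_o=0.10, tau_s=0.05):
    # d_t: distance from the end-effector to the target centre (m)
    # d_min: minimum distance between the robot and the obstacle (m)
    # collided: True if a collision occurred during this step
    r_t = -d_t                                             # dense reaching term (Equation 4.2)
    r_o = -p_o if d_min < tau_o else 0.0                   # sparse obstacle penalty (Equation 4.3)
    r_s = p_s if (d_t < tau_s and not collided) else 0.0   # sparse success reward (Equation 4.4)
    return r_t + r_o + r_s                                 # Equation 4.1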

State Vector. Another important issue is to define the state vector so that the actor and critic networks can do the mapping with the information given in the state vector as input. Therefore, the state vector needs information about both the intrinsic values of the manipulator and the positions of the objects. To be more clear, the state vector is defined as

s = (γ_i, γ̇_i, eo_1, eo_2)    (4.5)

where γ_i and γ̇_i are the intrinsic positions and the velocities of the joints. With this information, the controller knows how its joints are located and oriented and what their current velocities are. The vectors eo_1 and eo_2 are the vectors from the end-effector to the center of the target object and to the center of the obstacle, respectively. With this information, the controller knows where the objects are located with respect to the end-effector and can relate the given reward to those vectors. The total length of the state vector is 20 (7 for γ_i, 7 for γ̇_i, 3 for eo_1, 3 for eo_2).
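The 20-dimensional state vector of Equation 4.5 can be assembled as follows (a sketch with illustrative names; the quantities themselves come from the simulator, as described in Section 4.3):

import numpy as np

def build_state(joint_positions, joint_velocities, ee_pos, target_pos, obstacle_pos):
    # Vectors from the end-effector to the target and to the obstacle (3 values each).
    eo1 = np.asarray(target_pos) - np.asarray(ee_pos)
    eo2 = np.asarray(obstacle_pos) - np.asarray(ee_pos)
    # 7 joint positions + 7 joint velocities + 3 + 3 = 20 values (Equation 4.5).
    return np.concatenate([joint_positions, joint_velocities, eo1, eo2])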

The overview of how the monolithic controller is trained and performs actions can be seen in Figure 4.3. The controller takes the state vector as input and predicts joint velocities. Then the manipulator moves with those velocities. After the movement, the next state and the reward are given to the controller. This iteration continues until the training of the DDPG algorithm is done.

After the training is done, namely when the total reward obtained per episode converges, the parameters of both the actor and critic networks are saved. Then, by simply using the actor network, one can obtain the actions by providing the state vector to the actor. At the end, we have a decision-making controller which can decide on the joint velocity values of the manipulator.

Figure 4.3. Overview of monolithic controller training.

4.2 Hybrid Controller

Our hybrid controller uses the principle proposed by Sangiovanni et al. [15]. The difference is that they use a motion planner in the operative space for the reaching task. In the operative space, the trajectory is planned in a task-relevant coordinate system instead of the generalized coordinate system, which gives the full configuration of the robot. The motion planner tracks the trajectory to move the end-effector to the target. We instead use DDPG for the reaching task, similar to the monolithic controller. Also, our reward function for the avoiding sub-controller is different from the work of Sangiovanni et al. The hybrid controller aims to reduce the effort spent on reward engineering by simply using one controller per objective and switching between them according to a rule. In our work, we separate the controller into two: the 'Avoiding Sub-Controller' with policy µ1 and the 'Reaching Sub-Controller' with policy µ2. Details of these sub-controllers are explained in Section 4.2.1 and Section 4.2.2.

The switching rule between the sub-controllers depends on d_min, the minimum distance between the manipulator and the obstacle. If d_min is below a fixed threshold τ_hyb, the controller switches to the avoiding sub-controller and the actions are determined by it. If d_min is above τ_hyb, the controller switches to the reaching sub-controller. In this method, the sub-controllers are trained separately, each like a monolithic controller; there is no additional training for the hybrid controller itself. It chooses between the already trained sub-controllers, and the chosen sub-controller's actor network (policy) decides on the action. The working principle of the hybrid controller is summarized in Algorithm 2.


Algorithm 2: Hybrid controller - one control step

compute the distance from the tool tip to the target dt;
compute the minimum distance between the robot and the obstacle dmin;
if dmin < τo then
    halt;    /* collision */
end
if dt < τs then
    halt;    /* goal reached */
end
if dmin < τhyb then
    step µ1;    /* avoid */
else
    step µ2;    /* reach */
end

4.2.1 Avoiding Sub-Controller

This controller is based on the same DDPG as the monolithic one. Its reward function r1 and its state vector are also the same as the monolithic controller's, shown in Equation 4.1 and Equation 4.5. The only difference is the penalty values in the reward function: the penalties are higher than for the monolithic controller. We set p_o = 10 and p_s = 10 for this controller. This simply makes the agent get away from the obstacle, because it will eventually learn that actions which lead the manipulator close to the obstacle region are penalized heavily.

Another option could be to use a dense reward equal to the positive distance to the obstacle, so that the controller would try to increase that distance. However, we did not choose such a function because we also wanted the manipulator to stay close to the target region while avoiding the obstacle. We wanted our controller to avoid towards a region close to the target, so that the reaching operation would be faster and easier after the switch.

4.2.2 Reaching Sub-Controller

This controller is based on DDPG, similar to the monolithic one. The only differences are the reward function and the state inputs. It only aims to reach the target, so we do not need a penalty for getting closer to the obstacle. The reward function r2 is defined as

r_2 = R_t + R_s    (4.6)

where R_t and R_s are the same as in Equation 4.1. We set p_s = 4 for this controller. The state vector of the reaching sub-controller is defined as

s_2 = (γ_i, γ̇_i, eo_1)    (4.7)

The training of this reaching sub-controller is not challenging, since the R_t term itself makes the agent reach the target easily by simply minimizing the distance to the target, without considering the obstacles around it. One could simply define the reward functions for the sub-controllers as r_1 = R_o and r_2 = R_s, but we found that it is better to construct correlated reward functions by mixing the terms. Also, it is better to have R_t in both reward functions, since we want our agent to explore the target region more and collect experiences there. Solutions that keep the manipulator far from the target region are not useful for us, since we want to work near the target object.

It should be noted that the state vectors are different for the two sub-controllers. The avoiding sub-controller takes a state vector of length 20, whereas the reaching sub-controller takes a state vector of length 17. Therefore, the appropriate state vector has to be chosen depending on which sub-controller we want to run, according to the value of d_min. The switching mechanism chooses the state vector and sends it to the sub-controller. Then it takes the joint velocity values from the sub-controller and sends them to the robot control. The overview of the hybrid controller can be seen in Figure 4.4, and a minimal code sketch of one switching step is given after the figure.

Figure 4.4. Overview of the hybrid controller (the controller switch selects between the avoiding and reaching sub-controllers).
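A minimal Python sketch of one control step of the hybrid controller (Algorithm 2), including the choice of the state vector, is given below. The actor callables, helper names and threshold values are illustrative assumptions, and d_t is approximated here by the distance to the target centre.

import numpy as np

TAU_O, TAU_S, TAU_HYB = 0.10, 0.05, 0.15   # illustrative threshold values

def hybrid_control_step(joint_pos, joint_vel, ee_pos, target_pos, obstacle_pos,
                        d_min, avoid_actor, reach_actor):
    # avoid_actor expects the 20-dimensional state of Equation 4.5,
    # reach_actor expects the 17-dimensional state of Equation 4.7.
    eo1 = np.asarray(target_pos) - np.asarray(ee_pos)
    eo2 = np.asarray(obstacle_pos) - np.asarray(ee_pos)
    d_t = np.linalg.norm(eo1)                 # distance from the tool tip to the target centre

    if d_min < TAU_O:                         # collision: halt
        return None
    if d_t < TAU_S:                           # goal reached: halt
        return None
    if d_min < TAU_HYB:                       # avoid: use the avoiding sub-controller
        state = np.concatenate([joint_pos, joint_vel, eo1, eo2])
        return avoid_actor(state)
    state = np.concatenate([joint_pos, joint_vel, eo1])   # reach: 17-dimensional state
    return reach_actor(state)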


4.3 Using the Simulator

This section describes the software tools that were used to train the controllers in simulation. Our aim is to have a controller that can predict actions in the real environment by observing states similar to those in the simulated environment. The simulation program is CoppeliaSim [41], previously known as V-REP. We implemented the DDPG algorithm in a Python script, and this script communicates with the simulator through a toolkit called PyRep [42].

4.3.1 CoppeliaSim

CoppeliaSim is a versatile and scalable simulation framework [41]. It is currently a popular simulator for industry and research, and the wide range of available objects and robots makes it very useful. The features of the simulator [41] which are important for our work are:

• Physics Engine

• Remote API

• Collision Detection

• Minimum Distance Calculation

Physics Engine. CoppeliaSim has 4 different physics engines available: Bullet, ODE, Newton and Vortex Dynamics. These physics engines enable the simulator to simulate real-world dynamics and interactions between objects. This is also important for works in which manipulators need to grasp and place objects, but we do not deal with those tasks. The Bullet [43] physics engine is used in our work. The simulation time step is set to 50 ms, which is the time after which the main script is executed again.

Each scene has one main script which contains the code necessary to run the simulation. For the dynamic movement of the obstacle in our scene, a threaded child script written in the Lua language is run for that specific object. Inside that script, a random target location is determined for the obstacle in the light gray area in Figure 4.1. Then, the obstacle moves to that location with a random velocity in the range [0, 0.04] m/s. Note that the position of the obstacle is not reset at the end of an episode; it continues this random movement independently of our episodes. The manipulator is located at the origin of the scene, i.e. the origin of our Cartesian coordinate system is the base of the manipulator.

Remote API. The Remote API enables the simulator to communicate with external applications such as Matlab, Python and C++. Our simulator communicates with a Python script, but we use PyRep (explained in Section 4.3.2) instead of the Remote API.

Collision Detection Module. The Collision Detection Module is used to detect collisions between objects. We use this module to detect collisions between the manipulator and the obstacle, and it is used as a flag in our training algorithm. An example of this detection is shown in Figure 4.5, where the manipulator turns red after a collision.

Figure 4.5. Example of collision detection.

Distance Calculation Module. The Distance Calculation Module is used to measure the minimum Euclidean distance between any parts of two objects. We use this module to measure the minimum distance for the manipulator-obstacle and end-effector-target object pairs. An example of this measurement is shown in Figure 4.6, where the distance vectors are shown as thick black lines with the distance values in meters next to them.

4.3.2 PyRep

PyRep is a toolkit built on top of CoppeliaSim which enables us to launch the simulation from a Python script and execute Python code simultaneously with the simulation [42]. Remote API calls are not fast enough to run our RL algorithms, so we use this toolkit, which is faster than the Remote API. PyRep retrieves information about the environment from the simulator at every step, which is needed for the state vector and reward calculation during DDPG training and testing. Those information items are:

• Target object Cartesian position

• Manipulator joint velocities and intrinsic positions

• End-effector Cartesian position

• Minimum distance between the manipulator and the obstacle

• Minimum distance between the end-effector and the target object

• Collision detection between the manipulator and the obstacle

Figure 4.6. Example of minimum distance measurement.

Our Python script constructs the state vector from the given information and then predicts the action values (joint velocities) by using the state vector. Then, PyRep sends the joint velocities predicted by the DDPG to the simulator and commands it to step the physics simulation with those joint velocities. After the step is executed with the given joint velocities, PyRep retrieves the new information from the simulator to build the new state vector and the reward for the executed action. Note that the movement of the dynamic obstacle runs in a child script, independent of the PyRep commands. Also, at the end of an episode, PyRep resets the environment by randomly placing the target object in the dark gray area and resetting the positions and orientations of the joints of the manipulator (Figure 4.1).

This whole iteration is summarized in Algorithm 3.

By using Algorithm 3, we can easily train our controller by implementing DDPG in a Python script and communicating with the simulator through PyRep. After the training, we can save the parameters of the DDPG networks, i.e. save our controller's behaviour. Later, we can restore those parameters to test our controller. We can easily test how it performs by simply predicting actions for the states obtained through PyRep again. Algorithm 4 shows the testing iteration.


Algorithm 3: Training Algorithm with PyRep

Initialize the DDPG networks;
Launch the simulation scene through PyRep and initialize the objects;
for episode = 1, M do
    Reset the environment by randomly placing the target object and setting the manipulator to the initial configuration through PyRep;
    Get information from the simulator through PyRep and obtain the initial state s1;
    for t = 1, T do
        Select action at by using the actor network;
        Step the physics simulation by setting the joint velocities of the manipulator given in at through PyRep;
        Get information from the simulator through PyRep, and obtain the next state st+1 and the reward rt;
        Store the experience (st, at, rt, st+1) in the experience replay;
        if the experience replay is full then
            Update the networks in DDPG;
        end
        Set st = st+1;
    end
end
Save the parameters of the networks;
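A minimal PyRep interaction sketch corresponding to the inner loop of Algorithm 3 is shown below. The scene file name and the object name 'target' are assumptions about the scene, and the DDPG actor is replaced by small random joint velocities to keep the example self-contained; only the launch/step/shutdown pattern and the state readings are the point here.

import numpy as np
from pyrep import PyRep
from pyrep.robots.arms.panda import Panda
from pyrep.objects.shape import Shape

SCENE_FILE = 'reach_avoid_scene.ttt'        # hypothetical scene file name

pr = PyRep()
pr.launch(SCENE_FILE, headless=True)        # start CoppeliaSim with the scene
pr.start()                                  # start the physics simulation

panda = Panda()                             # the Franka Emika Panda arm in the scene
target = Shape('target')                    # hypothetical object name in the scene

for t in range(100):
    # Read the quantities needed for the state vector of Equation 4.5.
    joint_pos = np.array(panda.get_joint_positions())
    joint_vel = np.array(panda.get_joint_velocities())
    ee_pos = np.array(panda.get_tip().get_position())
    eo1 = np.array(target.get_position()) - ee_pos

    # Here the DDPG actor would map the state to joint velocities;
    # small random velocities are used as a stand-in.
    action = np.random.uniform(-0.1, 0.1, size=len(joint_pos))
    panda.set_joint_target_velocities(action.tolist())

    pr.step()                               # advance the simulation by one time step (50 ms)

pr.stop()
pr.shutdown()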

Algorithm 4: Testing Performance with PyRep

Restore the DDPG networks of the controller;
Launch the simulation scene through PyRep and initialize the objects;
for episode = 1, M do
    Reset the environment by randomly placing the target object and setting the manipulator to the initial configuration through PyRep;
    Get information from the simulator through PyRep and obtain the initial state s1;
    for t = 1, T do
        Select action at by using the actor network;
        Step the physics simulation by setting the joint velocities of the manipulator given in at through PyRep;
        Get information from the simulator through PyRep and obtain the next state st+1;
        Obtain the information about successful actions and collisions;
        if the action has been successful for 30 consecutive time steps then
            End the episode with success;
        end
    end
end
Save the statistics of the test;


In Algorithm 4, a successful action is determined with the same condition as R_s in Equation 4.4. We decided to wait for a number of consecutive successful actions to ensure that the manipulator stays in the target region without collision for some time. Note that an episode can finish with both a success and a failure; namely, it can collide in the earlier steps and then finish with success. At the end, the statistics showing the success rate and collision rate are obtained. Those results are shown in Section 6.2.

Note that the action selection is different for the hybrid controller, which is summarized in Algorithm 2. Algorithm 4 is a summary of how to test a trained DDPG model such as the monolithic controller, the reaching sub-controller or the avoiding sub-controller.
