
Jaakko Seppälä

AIRCRAFT CONTROL WITH DEEP REINFORCEMENT LEARNING IN REAL-TIME SIMULATIONS

Master of Science Thesis

Faculty of Engineering and Natural Sciences

Hannu Koivisto

Risto Ritala

January 2021


Jaakko Seppälä: Aircraft Control with Deep Reinforcement Learning in Real-time Simulations
Master of Science Thesis
Tampere University
Automation Engineering
January 2021

In this thesis, reinforcement learning (RL) with deep neural networks is applied to controlling a simulated aircraft. The aim of the control is to maneuver the aircraft to a given target while minimizing input changes, fuel consumption and non-level flight. Multiple models with various parameter counts and layer counts are tested for both control and real-time performance.

The first question is how well the models learn to control the aircraft, i.e., the control performance.

The models' control performance is then compared to a model predictive control themed optimization method that is used as an upper-bound reference for the performance. This is done by running various benchmark scenarios and comparing the total costs. Also, the benchmark scenarios are run with varying simulation parameters to see how well the models generalize to similar systems.

The second question is how well the neural network models perform in real time. Computation times are measured for all the model candidates on a reference test platform. Then, an estimate for the maximum number of aircraft controlled simultaneously in real-time simulations is calculated as the result.

For constructing the RL environment, an aircraft physics model was provided by the thesis commissioner, Insta DefSec Oy. The simulation model is lightweight, so it is easily implemented into an RL system, but realistic enough, featuring nonlinear and nondifferentiable dynamics. The RL environment is implemented with Python using the OpenAI Gym interface for compatibility with machine learning libraries such as Keras and TensorFlow.

The selected RL algorithm is deep deterministic policy gradient (DDPG), which is used in a three-phase optimization scheme for the learning rate, critic, and actor models. The results indicated that the deepest and largest networks in terms of parameters worked best as the critic model. However, all the actor model candidates achieved quite similar performance with the optimal critic model, with the best performer having around 100 000 parameters.

Overall, the control performance of the best-performing actor model was almost as good as the model optimization in fine-tuning scenarios and better than the model optimization result in larger-distance scenarios. However, the model optimization was run with a rather short prediction horizon of 21 seconds, which is likely the cause of the worse performance in longer scenarios. Also, the actor performed well when the simulation parameters were changed, achieving low performance only with extremely unrealistic parameters.

For the model with the best control performance, the real-time measurements showed that the maximum number of aircraft controlled with the central processing unit (CPU) is about 50 for 100 Hz simulations. Using the batching optimizations raised this number to thousands for the CPU and to tens of thousands by also using a high-end graphics card.

Keywords: Aircraft, control, reinforcement learning, neural network, deep deterministic policy gradient, DDPG, machine learning, model predictive control, MPC, real-time simulation

The originality of this thesis has been checked using the Turnitin OriginalityCheck service.


Jaakko Seppälä: Aircraft Control with Deep Reinforcement Learning in Real-time Simulations
Master of Science Thesis
Tampere University
Automation Engineering
January 2021

This thesis applies reinforcement learning with deep neural network models to controlling a simulated aircraft. The goal of the control is to take the aircraft to a target while minimizing control-value changes and fuel costs and maximizing level flight. Models of various sizes and layer counts are tested for both control performance and real-time capability.

The first question is how well the neural network models learn to control the aircraft, i.e., the control performance. The performance is compared to a model predictive control style model optimization, which is treated in this work as an upper bound for the performance. Certain test scenarios are run with both techniques and the costs are compared. In addition, the test scenarios are run with varying simulation parameters to see how well the models generalize to similar systems.

The second question is the real-time capability of the neural network models. For this, the computation times are measured for all the model candidates on a specific test platform. From these, an estimate is derived for the maximum number of aircraft that could be controlled simultaneously in a real-time simulation application.

For the reinforcement learning environment, the commissioner of the work, Insta DefSec Oy, provided a physics model of an aircraft for my use. The simulation model is lightweight and thus easy to implement into a reinforcement learning environment, yet realistic enough, containing nonlinear and nondifferentiable dynamics. The reinforcement learning environment is implemented in the Python programming language using the OpenAI Gym interface. This provides compatibility with the machine learning libraries used, Keras and TensorFlow.

Deep deterministic policy gradient was chosen as the learning algorithm, and it is applied in a three-phase optimization to select the optimal learning rate, critic model, and actor model.

The results showed that the deepest and largest models are best suited as the critic model. With the actor models, all candidates produced very similar results, but the best results were obtained with a model containing about 100 000 parameters.

All in all, the control performance of the actor models was good. The best-performing actor model was almost as good as the model optimization in fine-tuning scenarios and even better in longer-distance scenarios. However, the weaker performance of the model optimization over larger distances is explained by its relatively short optimization horizon of 21 seconds. In addition, the actor model also performed well with different simulation parameters, and poor performance followed only with extremely unrealistic values.

For the model with the best control performance, the computation-time measurements showed that the maximum number of controllable aircraft in a 100 Hz simulation is about 50 when using the central processing unit. With an optimized simulation implementation the number likely rises to the thousands, and by also using a powerful graphics processing unit the order of magnitude is in the tens of thousands.

Keywords: aircraft, control, reinforcement learning, neural network, deep deterministic policy gradient, DDPG, machine learning, model predictive control, MPC, real-time simulation

The originality of this publication has been checked using the Turnitin OriginalityCheck service.


I want to thank Insta DefSec Oy for commissioning such an interesting topic and Markus Kettunen for supervising the thesis and helping to brainstorm the topic.

Also, thanks to Hannu Koivisto and Risto Ritala from Tampere University for supervising and examining the thesis.

Lastly, thanks to my significant other, friends, and family for support.

Tampere, 31.1.2021

Jaakko Seppälä


1. INTRODUCTION
2. REINFORCEMENT LEARNING
   2.1 Traditional Reinforcement Learning
      2.1.1 Key Elements
      2.1.2 Formalization
      2.1.3 Key Functions
      2.1.4 Dynamic Programming
      2.1.5 Approximate Dynamic Programming
      2.1.6 Q-learning
      2.1.7 Gradient Methods
      2.1.8 Policy Gradient Methods
   2.2 Deep Reinforcement Learning
      2.2.1 Neural Networks
      2.2.2 Deep Learning
      2.2.3 Fitting Neural Networks
      2.2.4 Gradient Methods for Neural Networks
      2.2.5 Deep Q-learning
      2.2.6 Deep Deterministic Policy Gradient
3. MODEL PREDICTIVE CONTROL
   3.1.1 Core Strategy
   3.1.2 Model Optimization
4. DEEP REINFORCEMENT LEARNING FOR CONTROL
   4.1.1 Discrete Classic Cartpole and Quantum Cartpole
   4.1.2 Continuous Locomotive and Hierarchical Problems
   4.1.3 Optimal Control of Space Heating
5. MODEL ENVIRONMENT
   5.1 Physics Model
      5.1.1 Physics Model Parameters and Constants
      5.1.2 State variables and Transitions
   5.2 Control Model
   5.3 Target Model
   5.4 Pilot Model
      5.4.1 Pilot Model Reward Function
6. METHODS
   6.1 Deep Reinforcement Learning
      6.1.1 Models
      6.1.2 State Observation Space
      6.1.3 Training Scenarios
      6.1.4 Training Parameters
      6.1.5 Training Iterations
      6.1.6 Generalizability
   6.2 Model Optimizer Control
   6.3 Performance
      6.3.1 Benchmark Scenarios
7.1 Reinforcement Learning Environment
7.2 Model Optimizer
7.3 Test Platform
8. RESULTS
   8.1 Deep Reinforcement Learning
      8.1.1 First Iteration
      8.1.2 Second Iteration
      8.1.3 Final Iteration
   8.2 Model Optimizer Reference
   8.3 Real-time Performance
   8.4 Analysis
9. CONCLUSIONS
SOURCES


AI Artificial intelligence

API Application programming interface

AVG Average

CPU Central processing unit

DDPG Deep deterministic policy gradient

DNN Deep neural network

DQN Deep Q-network

DRL Deep reinforcement learning

GPU Graphics processing unit

MDP Markov Decision Process

MPC Model predictive control

NN Neural network

RL Reinforcement learning

SGD Stochastic gradient descent

STDEV Standard deviation


1. INTRODUCTION

This thesis is commissioned by Insta DefSec Oy. The motive for the thesis was to develop, study and experiment with an application based on artificial intelligence (AI) methods for simulation and training systems. Since no machine learning data was available for the purpose of this thesis, the methods are based on reinforcement learning (RL) with multilayer neural network models, which is called deep reinforcement learning (DRL).

The chosen application was aircraft control, as it can be experimented with rather easily using a simulation model. Also, its results can be compared to a reference performance achieved with a more traditional control technique, instead of relying on subjective reviews of performance. Another interesting result is how the control performance varies when the system changes drastically from the system used in training, i.e., how well the resulting models generalize to similar environments.

In addition to the control performance, we are also interested in the computational performance of the resulting neural network models. The second aim of this thesis is to evaluate and estimate how many aircraft can be controlled simultaneously with the models in real-time applications with different fidelity levels.

First, this thesis presents the theoretical background of RL and of model predictive control (MPC), which is used as the control performance reference for the RL results. Then, a few practical applications and results of DRL in control problems are surveyed. After that, the methods and their implementation in this thesis are explained, followed by the results and conclusions.


2. REINFORCEMENT LEARNING

Buşoniu et al. (2018) state in their review article “Reinforcement learning for control: Performance, stability, and deep approximators” that reinforcement learning (RL) is a very broad field with many researchers from different backgrounds, such as AI, control, robotics etc. (Buşoniu, de Bruin et al. 2018). In this thesis, the focus will be on the AI perspective and methods, such as deep reinforcement learning (DRL), shown in Figure 1 as a subset of RL.

Figure 1. Taxonomy of deep reinforcement learning in a Venn diagram. Adapted from (Buşoniu, de Bruin et al. 2018).

DRL is a rather new subset of RL, using deep neural network models to learn a policy or a value function for a given system. DRL has enabled RL methods to tackle decision-making problems that were unattainable with previous RL techniques, such as playing Atari 2600 games well (Mnih, Kavukcuoglu et al. 2015) and beating human champions at the game of Go (Silver, Huang et al. 2016).

In this thesis, DRL will be applied to a control problem of manoeuvring an aircraft with fuel consumption in mind. Since the aim of the thesis is to experiment with AI methods, different approaches are not evaluated. However, as a reference to the DRL results, a traditional control approach known as model predictive control (MPC) will also be applied to this problem.

2.1 Traditional Reinforcement Learning

Summarizing from the industry-standard textbook for RL, “Reinforcement Learning: An Introduction” (Sutton, Barto 2018): RL, in its essence, is mapping situations to actions so as to maximize a certain reward value in a system. Learning these mappings is usually difficult since actions may affect not only the immediate reward, but also subsequent rewards. Thus, an RL agent should explore the system with a trial-and-error search and calculate which actions map to the highest, possibly delayed, overall reward. (Sutton, Barto 2018)


2.1.1 Key Elements

The core elements of RL are the learning agent and the system, also called an environment: at each time step the agent chooses an action and one time step later receives a reward signal and a perceived state of the system as output, as illustrated in Figure 2. Figure 2 also lists the key elements of the agent: the policy, the value function and the optional model, all of which are explained below. (Sutton, Barto 2018)

Figure 2. Key elements and interactions in reinforcement learning. The agent chooses an action and one time step later receives a reward and a new state. Adapted from (Sutton, Barto 2018).

The reward signal defines the goal for the system. Transitioning into favourable states gives the highest rewards, and unfavourable states should yield low or even negative rewards, effectively acting as a punishment for the agent. (Sutton, Barto 2018) Thus, the optimal policy for a system will choose actions transitioning the system to favourable states while avoiding the punishing states.

The policy is the core of the RL agent as it defines the learned behaviour, mapping states to actions. Methods using a policy to select actions are called on-policy methods and respectively methods that do not strictly follow a policy are called off-policy methods. (Sutton, Barto 2018)

The reward signal determines what is optimal immediately, but to find out what is optimal in the long-term, a value function is used. Sutton, Barto (2018) effectively summarize: “the value of a state is the total amount of reward an agent can expect to accumulate over the future, starting from that state” (Sutton, Barto 2018).

A model may be used by the agent, either learned or predetermined, to mimic how the system will behave. The agent can then effectively plan out actions instead of relying on pure trial-and-error searching. Methods using models of the system are called model-based and methods that do not use models of the system are referred to as model-free. (Sutton, Barto 2018)


2.1.2 Formalization

For the purposes of this thesis, the reinforcement learning problem can be formalized as a discrete decision-making problem: at simulation time step $t$ the learning agent perceives a state $s_t \in S$, where $S$ is the possible state space, and based on its policy $\pi$, selects an action $a_t \in A$, where $A$ is the possible action space. The agent then observes the system to receive the updated state $s_{t+1} \in S$ and the reward associated with the state transition, $r_{t+1}$.

In the literature, reinforcement learning problems are usually defined as a finite Markov Decision Process (MDP). So, defining that the updated state $s_{t+1}$ and received reward $r_{t+1}$ depend only on the previous state $s_t$ and selected action $a_t$ satisfies the Markov property and makes this problem a finite MDP. An MDP is fully defined as a 5-tuple $\langle S, A, R, P, p_0 \rangle$, where $S$ is the possible state space, $A$ is the possible action space, $R$ is the reward function $r_{t+1} = R(s_t, a_t, s_{t+1})$, $P$ is the state transition probability function with $P(s' \mid s, a)$ being the probability of transitioning into state $s'$ starting at state $s$ and taking action $a$, and $p_0$ is the starting state distribution. (Sutton, Barto 2018)

Solving this problem means finding the optimal policy $\pi_*$ that selects the action with the highest expected cumulative reward for all states in the state space. For defining the expected cumulative reward there are two options: for episodic problems, the cumulative reward $G_t$ can be defined as the sum of the rest of the episode's rewards

$$G_t = r_{t+1} + r_{t+2} + r_{t+3} + \dots + r_T, \qquad (1)$$

where $T$ is the final step. For ongoing problems, where an episode cannot be defined, the cumulative reward can be defined as an infinite discounted sum

$$G_t = r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \dots = \sum_{k=0}^{\infty} \gamma^k r_{t+k+1}, \qquad (2)$$

where $\gamma$ is the discount factor ranging from 0 to 1, which determines how much future rewards should be weighted relative to instant rewards. (Sutton, Barto 2018)
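As a concrete illustration of the discounted return in (2), a minimal Python sketch (not part of the thesis implementation) that accumulates a finite reward sequence backwards:

```python
def discounted_return(rewards, gamma=0.99):
    """Compute G_t = r_{t+1} + gamma*r_{t+2} + ... for a finite reward sequence."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

# Example: rewards (1, 0, 2) with gamma = 0.9 give 1 + 0.9*0 + 0.81*2 = 2.62.
print(discounted_return([1.0, 0.0, 2.0], gamma=0.9))
```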

2.1.3 Key Functions

Policies map states to actions, effectively performing the action selection logic for the agent. Policies can be defined as stochastic policies by calculating probabilities for actions from a state with $\pi(a \mid s)$, or they can be deterministic, mapping a certain state to a certain action, $a = \pi(s)$. A deterministic policy can be obtained from a stochastic policy by selecting the action with the highest probability. (Sutton, Barto 2018)

Usually reinforcement learning methods search for the optimal policy by estimating state values. One way to calculate the state value $v_\pi$ is to calculate the expected cumulative reward (1, 2) following a policy $\pi$ with the state-value function

$$v_\pi(s) = E_\pi[G_t \mid s_t = s], \qquad (3)$$


where $E_\pi$ denotes the expected value of a random variable at any time step $t$. Extending this to include the selected action results in the action-value function

$$q_\pi(s, a) = E_\pi[G_t \mid s_t = s, a_t = a], \qquad (4)$$

calculating the expected cumulative reward for state $s$ when choosing action $a$. Then, following the optimal policy, the optimal state-value function $v_*$ would be

$$v_*(s) = \max_\pi E_\pi[G_t \mid s_t = s] \qquad (5)$$

and the optimal action-value function $q_*$ would be

$$q_*(s, a) = \max_\pi E_\pi[G_t \mid s_t = s, a_t = a]. \qquad (6)$$

(Sutton, Barto 2018)

Thinking recursively and using the state transition probability function to average the successor state values weighted with their probability of occurring forms the Bellman equation for the state-value function

$$v_\pi(s) = E_\pi[G_t \mid s_t = s] = E_\pi[R_{t+1} + \gamma G_{t+1} \mid s_t = s]. \qquad (7)$$

Basically, this means that the state value should be the expected reward plus the discounted value of the next state. The same can be applied to the action-value function to obtain

$$q_\pi(s, a) = E_\pi[R_{t+1} + \gamma G_{t+1} \mid s_t = s, a_t = a]. \qquad (8)$$

(Sutton, Barto 2018)

Applying this to the optimal policy variants (5, 6) gives the Bellman optimality equations for the value and action-value functions. Solving the Bellman optimality equation provides the optimal value or action-value function, from which the optimal policy $\pi_*$ can be obtained with

$$\pi_*(s) = \underset{a}{\operatorname{argmax}}\, q_*(s, a). \qquad (9)$$

(Sutton, Barto 2018)

2.1.4 Dynamic Programming

Methods that try to solve the Bellman optimality equations for all the states are referred to as dynamic programming algorithms. These methods usually involve finding the optimal policy by improving the policy and value function iteratively until it converges to the optimal one.

(Sutton, Barto 2018)

By looping over all the states iteratively, the value function for a given policy can be estimated with (7). Then, again looping over the states iteratively, calculating the action-value function with (8) can be used to compare the action chosen by the policy and a possible new action. If the action-value function yields a better value, then the policy is changed to choose this action with a greater probability, obtaining a better policy. (Sutton, Barto 2018)


The value function is then estimated again and the policy improved, repeatedly. Each iteration always leads to a better policy until it converges to the optimal policy. (Sutton, Barto 2018)
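To make the evaluate-improve loop concrete, here is a compact policy iteration sketch for a small tabular MDP; the transition array P and reward array R are illustrative placeholders rather than anything from this thesis.

```python
import numpy as np

def policy_iteration(P, R, gamma=0.9, tol=1e-6):
    """Policy iteration for a tabular MDP: P[s, a, s'] = transition probability,
    R[s, a] = expected reward for taking action a in state s."""
    n_states, n_actions = R.shape
    policy = np.zeros(n_states, dtype=int)
    v = np.zeros(n_states)
    while True:
        # Policy evaluation: iterate the Bellman equation (7) for the current policy.
        while True:
            v_new = np.array([R[s, policy[s]] + gamma * P[s, policy[s]] @ v
                              for s in range(n_states)])
            if np.max(np.abs(v_new - v)) < tol:
                v = v_new
                break
            v = v_new
        # Policy improvement: act greedily on the action values, as in (8) and (9).
        q = R + gamma * np.einsum("sap,p->sa", P, v)
        new_policy = q.argmax(axis=1)
        if np.array_equal(new_policy, policy):
            return policy, v
        policy = new_policy
```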

2.1.5 Approximate Dynamic Programming

Even if an accurate model of the system is known, it is sometimes practically impossible to compute the optimal policy by solving the Bellman optimality equation due to computational or memory limitations. As the state space grows bigger, the computational effort grows dramatically. (Sutton, Barto 2018)

The available memory is also an important constraint: for small state spaces, all the possible states and actions could be tabulated into a state-value or action-value table. But for bigger state spaces this is not feasible, and some sort of approximator should be used for the value functions. (Sutton, Barto 2018)

The approximate value function can be formalized with a weight vector $\mathbf{w} = (w_1, w_2, \dots, w_k)^T$, consisting of a fixed number of real-valued components, where $k$ is the number of weights. The weight values could represent anything from decision tree values to polynomial function coefficients, forming the actual approximate value function $\hat{v}(s, \mathbf{w}) \approx v_\pi(s)$. (Sutton, Barto 2018)

However, unlike in the tabular case, where an update to the value function only affected a single state, updating the weight vector now affects multiple states. Assuming the number of weights is far less than the number of states, this makes it impossible to get the function value exactly right, and making one state's value more accurate makes the others' less accurate. For calculating the errors in the approximated value function $\hat{v}(s, \mathbf{w})$, the mean squared value error $\overline{VE}$ is used:

$$\overline{VE}(\mathbf{w}) = \sum_{s \in S} \mu(s) \left[v_\pi(s) - \hat{v}(s, \mathbf{w})\right]^2, \qquad (10)$$

where $\mu$ is the state error weight function, $\mu(s) > 0$, $\sum_{s \in S} \mu(s) = 1$, representing which states' errors are more significant. This principle can also be applied to approximating the action-value function $\hat{q}(s, a, \mathbf{w})$, also parametrized with a weight vector $\mathbf{w}$. (Sutton, Barto 2018)

2.1.6 Q-learning

Q-learning is a popular example of a reinforcement learning algorithm and it is considered one of the early breakthroughs in reinforcement learning (Sutton, Barto 2018). Q-learning was first proposed by Watkins (1989), learning the action-value function $q(s_t, a_t)$, also known as the q-function, iteratively with a simple update rule

$$q(s_t, a_t) \leftarrow (1 - \alpha)\, q(s_t, a_t) + \alpha \left(r_{t+1} + \gamma \max_a q(s_{t+1}, a)\right), \qquad (11)$$

where $\alpha$ is the learning rate and $\gamma$ is the reward discount factor (Watkins 1989).

As can be seen from (11), Q-learning takes advantage of a previous learning result in its updates. This technique is called temporal-difference (TD) learning. In contrast, methods such as Monte Carlo methods wait for the final outcome. The key difference between these approaches is how the differences are calculated, shown in Figure 3. Using TD learning, computational effort can be saved, and it is possible to apply it to an ongoing problem, where the resulting outcome cannot be calculated by just finishing the episode. (Sutton, Barto 2018)

Figure 3. Prediction differences (red) in Monte Carlo methods (left) and temporal-difference methods (right) in predicting travel time home. Adapted from (Sutton, Barto 2018).

Using the simple update rule (11), the action-value function converges to the optimal action-value function $q_*$. The agent is free to choose its actions, making Q-learning an off-policy method. But for it to converge to the optimal action-value function, the agent must try each action in each state many times. This only works for discrete action and state spaces $A$ and $S$, where each combination can be tried and learned. (Watkins 1989) Similarly to (9), using the learned action-value function, a deterministic greedy policy can be formed by choosing the best possible action based on the action-value function with

$$\pi(s) = \underset{a}{\operatorname{argmax}}\, q(s, a). \qquad (12)$$

If the problem being learned has a small and finite action and state space, an optimal action- value function can be found with Q-learning and thus also an optimal policy is found. (Watkins 1989)

Exhausting the action and state space completely with just random actions can be computationally inefficient, as some states are rarely visited by random chance. However, learning the optimal policy can be sped up by selecting the actions based on a near-greedy policy, also called an ε-greedy policy: with probability 1 – ε the best action according to the current policy is selected, and otherwise a random action is selected, where $\varepsilon \in\ ]0,1[$ is the exploration factor. This effectively exploits the current learning result to find relevant states more often and explore them thoroughly. Pseudocode for applying this to Q-learning is shown below in Algorithm 1. (Sutton, Barto 2018)

Algorithm 1. Pseudocode implementation for Q-learning. Adapted from (Sutton, Barto 2018).
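Since Algorithm 1 is reproduced here only as a caption, the following is a minimal tabular sketch of the same idea, assuming a classic OpenAI Gym style environment with discrete observation and action spaces (the reset/step signatures follow the older Gym API):

```python
import numpy as np

def q_learning(env, episodes=500, alpha=0.1, gamma=0.99, epsilon=0.1):
    """Tabular Q-learning with epsilon-greedy exploration, following update rule (11)."""
    q = np.zeros((env.observation_space.n, env.action_space.n))
    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            # Epsilon-greedy: explore with probability epsilon, otherwise act greedily.
            if np.random.rand() < epsilon:
                a = env.action_space.sample()
            else:
                a = int(np.argmax(q[s]))
            s_next, r, done, _ = env.step(a)
            # Update rule (11): mix the old estimate with the bootstrapped target.
            target = r if done else r + gamma * np.max(q[s_next])
            q[s, a] = (1 - alpha) * q[s, a] + alpha * target
            s = s_next
    return q
```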

2.1.7 Gradient Methods

The approximated value function $\hat{v}(s, \mathbf{w})$ is a differentiable function of $\mathbf{w}$ for all $s \in S$, and a good strategy is to apply a gradient-based optimization method to minimize the approximation error $\overline{VE}$ (10) on observed samples. Therefore, the most widely used methods for updating the weights are based on stochastic gradient descent (SGD). (Sutton, Barto 2018)

Assuming an equal importance for each state, effectively discarding $\mu(s)$ from the value error formula (10), SGD methods adjust the weight vector after each sample state by a small amount in the direction that would most reduce the value error $\overline{VE}$, with the updated weight vector $\mathbf{w}_{t+1}$ being calculated from the current weights $\mathbf{w}_t$ and sample state $s_t$ with a gradient step

$$\mathbf{w}_{t+1} = \mathbf{w}_t - \tfrac{1}{2} \alpha \nabla \left[v_\pi(s_t) - \hat{v}(s_t, \mathbf{w}_t)\right]^2 = \mathbf{w}_t + \alpha \left[v_\pi(s_t) - \hat{v}(s_t, \mathbf{w}_t)\right] \nabla \hat{v}(s_t, \mathbf{w}_t), \qquad (13)$$

where $\alpha$ is a positively valued learning rate and $\nabla$ means taking the gradient with respect to $\mathbf{w}$.

(Sutton, Barto 2018)

However, the actual value of $v_\pi(s_t)$ might not be known in the gradient step (13), so it should be replaced with some target value $U_t$. Using the Bellman equations, one approach is to use the resulting reward $r_{t+1}$ and the discounted value approximation of the resulting state $s_{t+1}$ as the target, $U_t = r_{t+1} + \gamma \hat{v}(s_{t+1}, \mathbf{w}_t)$. This is also known as taking the gradient of the mean squared Bellman error, and these approaches are called semi-gradient methods. (Sutton, Barto 2018)

Applying this to the approximate action-value function $\hat{q}(s, a, \mathbf{w})$ results in the following semi-gradient update rule:

$$\mathbf{w}_{t+1} = \mathbf{w}_t + \alpha \left[r_{t+1} + \gamma \hat{q}(s_{t+1}, a_{t+1}, \mathbf{w}_t) - \hat{q}(s_t, a_t, \mathbf{w}_t)\right] \nabla \hat{q}(s_t, a_t, \mathbf{w}_t). \qquad (14)$$

Using this rule, an optimal action-value function, and thus an optimal policy (9), can be estimated. Shown below in Algorithm 2 is a pseudocode implementation for estimating $\hat{q} \approx q_*$ with a semi-gradient method. (Sutton, Barto 2018)

Algorithm 2. Pseudocode implementation for estimating value function for a policy with a semi-gradient method. Adapted from (Sutton, Barto 2018).
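As with Algorithm 1, the pseudocode itself is a figure; the core of the semi-gradient update (14) can be sketched for the simple case of a linear approximator, where the gradient of the approximate q-function with respect to the weights is just the feature vector. The features function below is a hypothetical placeholder.

```python
import numpy as np

def semi_gradient_update(w, features, s, a, r, s_next, a_next, alpha=0.01, gamma=0.99):
    """One semi-gradient step of rule (14) for a linear action-value approximation:
    q_hat(s, a, w) = w . features(s, a), so grad_w q_hat = features(s, a)."""
    x = features(s, a)
    td_error = r + gamma * np.dot(w, features(s_next, a_next)) - np.dot(w, x)
    return w + alpha * td_error * x
```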

2.1.8 Policy Gradient Methods

Gradient methods can also be applied to an approximated policy $\pi(a \mid s, \boldsymbol{\theta})$, where $\boldsymbol{\theta}$ is the weight vector parametrizing the policy. These methods are referred to as policy gradient methods. (Sutton, Barto 2018)

Instead of the value error, the optimized function is the value of a state following the policy.

This is usually denoted as the performance $J(\boldsymbol{\theta}) = v_{\pi_{\boldsymbol{\theta}}}(s)$. Using the policy gradient theorem, gradient update rules can also be formed for updating $\boldsymbol{\theta}$ to approximate the optimal policy $\pi_*$ for both episodic and continuing problems. Contrary to (13) and (14), where the value error was reduced, the performance $J$ should be increased, and therefore these methods apply gradient ascent steps, effectively climbing the gradient uphill instead of downhill. (Sutton, Barto 2018)

Policy gradient methods can simultaneously make use of a value approximation and a policy approximation. These are called actor-critic methods. Actor-critic methods use the learned value function as a baseline for the policy gradient updates. This is similar to baseline methods, where an arbitrary baseline value is used as the target instead of a learned value function. The actor-critic setup is illustrated in Figure 4. (Sutton, Barto 2018)

Figure 4. Stepping with actor-critic methods. The actor selects an action based on the state and the policy, and a new state and a reward are observed. The critic calculates a TD error to update itself and the actor. Adapted from (Arulkumaran, Deisenroth et al. 2017).

Interestingly, Silver et al. (2014) showed that policy gradients can also be used with a deterministic policy $\pi(s, \boldsymbol{\theta})$, where $\boldsymbol{\theta}$ is the weight vector parametrizing the policy. They also established a deterministic policy framework for policy gradient methods and showed that the deterministic policy variants significantly outperformed their stochastic counterparts in many problems. (Silver, Lever et al. 2014)

2.2 Deep Reinforcement Learning

One approach to approximate RL is using neural networks as the function approximator.

Especially the use of deep neural network models has achieved various breakthroughs in many difficult problems lately (LeCun, Bengio et al. 2015). Using deep neural network models for RL has formed a new field of methods and algorithms known as deep reinforcement learning (DRL).

2.2.1 Neural Networks

Neural networks (NNs) are mathematical models, often referred to as artificial neural networks, that transform some input into an output. For example, an image classifier model transforms an image into a linearly separable classification data output.

There are various neural network model types, but the most popular type is the multi-layered perceptron model. This model is loosely based on the brain, consisting of multiple connected nodelike units, each having multiple parameters. Tuning the parameters to form a desired output is called fitting the neural network, often referred to as training.

Each unit, often referred to as a neuron, is parametrized with a real-valued weight vector $\mathbf{w} = (w_1, w_2, \dots, w_k)^T$, where $k$ is an integer determining the input size of the unit, a scalar bias $b$ and an activation function $F$. The whole neural network model usually consists of multiple layers of these units, where each layer is fully connected to the subsequent layer. (Forsyth 2019)

The unit takes an input vector $\mathbf{x} = (x_1, x_2, \dots, x_k)^T$, where $k$ is the same integer as in $\mathbf{w}$, and outputs a single scalar value $y$ calculated with $y = F(\mathbf{w}^T \mathbf{x} + b)$. The activation function can be any arbitrary function, but the current best practice is to use a rectified linear unit, where $F(x) = \max(0, x)$. (Forsyth 2019)

In neural network models the units are stacked in layers, where each layer's outputs are vectorized and fed as the input vector to all the units in the next layer, as illustrated in Figure 5. For transforming the input data into a desired output, the model must have an input layer and an output layer. The input data vector's dimension specifies the number of nodes in the input layer and the dimension of the desired output specifies the number of nodes in the output layer. The layers between the input and output layers are called hidden layers. Having hidden layers adds complexity and approximation power to the model, and NNs that have hidden layers are considered deep neural networks (DNNs). (Forsyth 2019)


Figure 5. Example neural network model with $m + 1$ layers, each having $n$ node units except for the last layer. For the whole model, the input $\mathbf{x}$ is a vector of length $n$ and the output $\mathbf{y}$ is a vector of length 2. Each node is densely connected to all of the neighbouring layers' nodes.

Typically, DNNs have a lot of parameters that need to be learned (Forsyth 2019). As can be seen from the definition, a single unit has $k$ parameters for the weight vector $\mathbf{w}$ and one for the bias $b$. Here $k$ is the previous layer's output length. A single layer has $n$ units, where each unit outputs a single scalar value, forming an output vector $\mathbf{y}$ of length $n$. In total, a single layer therefore has $nk + n$ parameters. So, for multiple layers with big $n$ and $k$, the total parameter count can easily reach millions.
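To make the $nk + n$ count concrete, a small Keras sketch with arbitrary example layer sizes (not one of the thesis models):

```python
import tensorflow as tf

# A dense layer with k = 7 inputs and n = 64 units has 7*64 + 64 = 512 parameters.
model = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation="relu", input_shape=(7,)),  # 7*64 + 64 = 512
    tf.keras.layers.Dense(64, activation="relu"),                    # 64*64 + 64 = 4160
    tf.keras.layers.Dense(4),                                        # 64*4 + 4 = 260
])
model.summary()  # prints a per-layer breakdown; 4 932 parameters in total
```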

2.2.2 Deep Learning

Deep learning in general means approximating some function with a deep neural network.

A key aspect of deep learning is that the connections between the layers are not hand-engineered but rather learned from data. For many difficult high-dimensional problems, such as speech recognition, object detection and language translation, deep learning has improved the state-of-the-art results significantly. (LeCun, Bengio et al. 2015)

Applying deep learning to RL forms the reinforcement learning framework known as DRL.

Traditional RL methods have had multiple successes in various problems, but ultimately, they lacked scalability and were limited to low-dimensional problems due to memory, computational and sample complexity (Arulkumaran, Deisenroth et al. 2017).

2.2.3 Fitting Neural Networks

Fitting, often referred to as training, the neural network model is a very hard problem due to the complexity of the objective function being optimized. For adjusting the network's parameters, i.e., fitting the network, the single most effective approach is to use gradient-based methods such as stochastic gradient descent (SGD). (Forsyth 2019)

However, these methods often require lots of data and iterations to fit the models. Due to the powerful approximation capability of DNNs, overfitting to the training data is a common problem, and usually this leads to poor generalization ability for the model. Therefore, some of the training data should be reserved strictly for testing and validating the model, to know when the fitting process has reached the model's peak performance and to prevent overfitting.

(Forsyth 2019)

Other generally used tricks to prevent overfitting are regularization and dropout. During training, regularization applies penalties to layer parameters, which are added to the objective function being minimized, usually called the loss function. The dropout technique means ignoring random connections between the units during training, effectively making sure that a unit's output is not dominated by a single or a few previous units. (Forsyth 2019)

Also, instead of randomly selecting one example from the training data set, training NNs is commonly performed with the gradient descent applied to a batch of data selected uniformly at random. This is known as minibatch training. Using a minibatch is just as fast to compute, but it provides a better gradient estimate, resulting in better performance for the fitted model.

(Forsyth 2019)

In addition to minibatch training, input data normalization, gradient scaling and momentum are often used in neural network training. Normalization is used when the dataset features have different ranges. Gradient scaling and momentum are used to achieve better results with the SGD optimization of the loss function: gradients might be too large or too small, resulting in overshooting or undershooting with the gradient steps and even diverging the training. Usually, the gradients are scaled with a factor known as the learning rate, usually denoted by $\alpha$. Also, the gradients might be very noisy or change sign with each step, so adding a momentum term to the steps helps average that out. (Forsyth 2019)
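A sketch of how these ideas (minibatch size, learning rate, momentum, L2 regularization, dropout and a held-out validation split) typically appear together in Keras; the data and hyperparameter values below are placeholders, not those used in this thesis:

```python
import numpy as np
import tensorflow as tf

# Placeholder data; in a real application this comes from the problem at hand.
x_train = np.random.rand(1000, 10).astype("float32")
y_train = np.random.rand(1000, 1).astype("float32")

model = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation="relu",
                          kernel_regularizer=tf.keras.regularizers.l2(1e-4)),
    tf.keras.layers.Dropout(0.2),   # randomly zero 20 % of the layer outputs in training
    tf.keras.layers.Dense(1),
])
# SGD with a learning rate and a momentum term, minimizing a mean squared error loss.
model.compile(optimizer=tf.keras.optimizers.SGD(learning_rate=1e-3, momentum=0.9),
              loss="mse")
# Minibatch training: each gradient step uses 32 randomly selected examples,
# and 20 % of the data is held out for validation to monitor overfitting.
model.fit(x_train, y_train, batch_size=32, epochs=10, validation_split=0.2)
```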

2.2.4 Gradient Methods for Neural Networks

But how should the gradients used in the gradient descent be calculated for this mess of parameters? The current best answer is an algorithm called backpropagation.

Backpropagation was first introduced in the 1960s, but it was popularized by Rumelhart et al. (1986) in their article “Learning representations by back-propagating errors” (Rumelhart, Hinton et al. 1986).

Making a forward pass for the neural network model means evaluating the output of the model from input $\mathbf{x}$, i.e., making a prediction resulting in an output $\mathbf{s}$. If data is available that says that the model's output for this input should be $\mathbf{y}$, then we can perform a backward pass with backpropagation, minimizing a certain cost function $C(\mathbf{s}, \mathbf{y})$ and effectively shifting the prediction $\mathbf{s}$ closer to the expected output $\mathbf{y}$. (Rumelhart, Hinton et al. 1986)

The cost function 𝐶 is usually referred to as the loss function. The loss function can be any function, such as a simple mean squared error or a more complex one such as cross-entropy.

(Forsyth 2019)

In backpropagation, the gradients of the final layer's parameters are calculated first, moving on to the previous layers until the gradients of the first layer are calculated last; hence the name backward pass, with the error information flowing backwards. Calculating the gradient of the cost function with respect to each weight parameter with the chain rule, starting from the end, is computationally very efficient and makes it feasible to fit multilayer networks. (Rumelhart, Hinton et al. 1986)
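In practice, libraries such as TensorFlow perform backpropagation through automatic differentiation; a minimal sketch of a single forward and backward pass (the model, data and learning rate are placeholders):

```python
import tensorflow as tf

model = tf.keras.Sequential([tf.keras.layers.Dense(16, activation="relu"),
                             tf.keras.layers.Dense(1)])
loss_fn = tf.keras.losses.MeanSquaredError()
optimizer = tf.keras.optimizers.SGD(learning_rate=1e-2)

x = tf.random.uniform((32, 4))   # placeholder minibatch of inputs
y = tf.random.uniform((32, 1))   # placeholder expected outputs

with tf.GradientTape() as tape:
    s = model(x, training=True)   # forward pass: prediction s
    loss = loss_fn(y, s)          # cost C(s, y)
# Backward pass: gradients of the cost with respect to every weight via the chain rule.
grads = tape.gradient(loss, model.trainable_variables)
optimizer.apply_gradients(zip(grads, model.trainable_variables))
```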

2.2.5 Deep Q-learning

Q-learning with a neural network model as the approximator of the action-value function, referred to as the q-function, was first proposed by Lin (1993) in his technical report “Reinforcement Learning for Robots Using Neural Networks”. Lin also used a technique called experience replay, where the state transition samples are collected in a replay buffer and sampled randomly to smooth the training data distribution over past behaviours. (Lin 1993)

Using DNNs as the q-function approximator was popularized by Mnih et al. (2013) by training DNNs to play several Atari 2600 games with results that outperformed all previous RL algorithms. To train the deep neural network, they used a minibatch-training, experience-replaying variant of Q-learning and fitted the model with SGD. (Mnih, Kavukcuoglu et al. 2013)

Generally, this algorithm is called deep Q-learning, also known as the deep Q-network (DQN), and it uses the semi-gradient update rule from (14) to approximate the q-function. However, since the q-function is also used to compute the target $U_t$ in (14), the update is prone to diverge. This is prevented by taking a copy of the q-function to calculate the targets, effectively functioning as a separate constant target network. The target network is then periodically refreshed to update the learned improvements into the targets. (Mnih, Kavukcuoglu et al. 2013)

Using the q-function, a policy can be found by taking the best possible action with (12). As with regular Q-learning, this algorithm works with a discrete action space $A$, such as Atari 2600 controller inputs. A generalized pseudocode implementation of this algorithm using the ε-greedy exploration technique is shown below in Algorithm 3.


Algorithm 3. Pseudocode implementation for DQN. Adapted from (Mnih, Kavukcuoglu et al. 2013).
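The Algorithm 3 pseudocode is again a figure in the original; the following condensed sketch shows only the core minibatch update with a replay buffer and a separate target network. It assumes Keras models q_net and target_net with identical architectures and a replay buffer stored as a list of (s, a, r, s_next, done) tuples; these names are illustrative, not from the thesis implementation.

```python
import random
import numpy as np
import tensorflow as tf

def dqn_train_step(q_net, target_net, replay_buffer, optimizer,
                   batch_size=32, gamma=0.99):
    """One minibatch update of deep Q-learning with a separate target network."""
    batch = random.sample(replay_buffer, batch_size)
    s, a, r, s_next, done = map(np.array, zip(*batch))

    # Targets U_t = r + gamma * max_a' q_target(s', a'), with no bootstrap at episode end.
    q_next = target_net.predict(s_next, verbose=0)
    targets = (r + gamma * (1.0 - done) * q_next.max(axis=1)).astype(np.float32)

    with tf.GradientTape() as tape:
        q_values = q_net(s)                                    # q(s, .) for the whole batch
        a_onehot = tf.one_hot(a, q_values.shape[-1])
        q_taken = tf.reduce_sum(q_values * a_onehot, axis=1)   # q(s, a) of the taken actions
        loss = tf.reduce_mean(tf.square(targets - q_taken))    # semi-gradient squared error
    grads = tape.gradient(loss, q_net.trainable_variables)
    optimizer.apply_gradients(zip(grads, q_net.trainable_variables))
    return float(loss)
```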

2.2.6 Deep Deterministic Policy Gradient

Continuing from the deterministic policy gradient work of Silver et al. (2014) (Silver, Lever et al. 2014) and the DQN algorithm (Mnih, Kavukcuoglu et al. 2013), a policy gradient version of deep Q-learning called deep deterministic policy gradient (DDPG) was proposed by Lillicrap et al. (2015), effectively adapting the deterministic policy gradient algorithm to the DRL setting (Lillicrap, Hunt et al. 2015).

The key difference between DQN and DDPG is the continuous action space $A$. Applying Q-learning to a continuous action space is very inefficient, because the greedy policy (12) requires a potentially slow solve for the best action $a$ in the continuous $A$. Therefore, DDPG uses an actor-critic setup, where the actor learns the policy based on the q-function learned by the critic. Both are approximated with a neural network model, where the policy weights are parametrized with $\boldsymbol{\theta}$ and the q-function weights are parametrized with $\mathbf{w}$. (Lillicrap, Hunt et al. 2015)

Other major differences concern the exploration during training and the target network updates: instead of selecting a totally random action based on some probability ε, DDPG adds noise to the policy output while learning. To provide stability to the learning, DDPG also makes use of two target networks, and they are updated softly every iteration with $\boldsymbol{\theta}^T = \tau \boldsymbol{\theta} + (1 - \tau)\boldsymbol{\theta}^T$ and $\mathbf{w}^T = \tau \mathbf{w} + (1 - \tau)\mathbf{w}^T$, where $\boldsymbol{\theta}^T$ are the target policy parameters, $\mathbf{w}^T$ are the target q-function parameters and $\tau$ is a small constant. (Lillicrap, Hunt et al. 2015)

Because the action space is continuous, the q-function is differentiable with respect to the action, and the policy parameters $\boldsymbol{\theta}$ are optimized with gradient ascent solving $\max_{\boldsymbol{\theta}} \hat{q}(s, \pi(s, \boldsymbol{\theta}), \mathbf{w})$, where the q-function's parameters are held constant. Adding this and the soft updates to the DQN algorithm results in the DDPG algorithm shown in Algorithm 4 and visualized in Figure 6. (Lillicrap, Hunt et al. 2015)
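A brief sketch of two DDPG-specific pieces described above, assuming Keras actor and critic models where the critic takes the list [state, action] as its input; this is an illustrative fragment under those assumptions, not the implementation used in this thesis.

```python
import tensorflow as tf

def soft_update(target_model, model, tau=0.005):
    """Soft target update: theta_T <- tau*theta + (1 - tau)*theta_T."""
    new_weights = [tau * w + (1.0 - tau) * w_t
                   for w, w_t in zip(model.get_weights(), target_model.get_weights())]
    target_model.set_weights(new_weights)

def ddpg_actor_step(actor, critic, states, actor_optimizer):
    """Gradient ascent on q_hat(s, pi(s, theta), w) with the critic weights held fixed."""
    with tf.GradientTape() as tape:
        actions = actor(states, training=True)
        # Minimizing -q is equivalent to maximizing q with respect to the actor weights.
        actor_loss = -tf.reduce_mean(critic([states, actions], training=True))
    grads = tape.gradient(actor_loss, actor.trainable_variables)
    actor_optimizer.apply_gradients(zip(grads, actor.trainable_variables))
    return float(actor_loss)
```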

Algorithm 4. Pseudocode implementation for deep deterministic policy gradient. Adapted from (Lillicrap, Hunt et al. 2015).

Figure 6. Deep deterministic policy gradient architecture.


3. MODEL PREDICTIVE CONTROL

In this thesis, a reference for the DRL performance will be formed with the use of a more traditional control scheme. The upper bound performance reference is to be obtained by using a model optimization approach inspired by MPC control techniques, for its simplicity and general applicability.

MPC methods have been developed since the 1970s, and MPC is a widely used advanced method of process control. The main advantages of MPC methods are the intuitive concept, the applicability to a great variety of processes, the good handling of multivariable cases and constraints, and the ease of implementation. The main disadvantages are the large computation required at every sampling time for complex constrained nonlinear models and the need for a model of the system being controlled. (Camacho, Bordons 2007)

For MPC to work, a model that captures the system dynamics accurately enough is required, and different MPC algorithms mainly differ in how the model is obtained and in the cost function used in the optimization (Camacho, Bordons 2007). For the purposes of this thesis, obtaining a model via mathematical modelling, system identification or any other means is not necessary nor important, since the model described in Chapter 5 is used directly.

3.1.1 Core Strategy

The key elements in MPC are, as the name suggests, a model of the system, the system itself and an optimizer. The MPC family is a wide assortment of methods, but they are all characterized by the following strategy: the model is used to predict the future system outputs based on current and past values. The optimizer then selects the control outputs for the next $N$ time steps based on the model predictions, a cost function and input constraints. Here $N$ is an integer greater than one, also known as the prediction horizon. These elements and interactions are illustrated in Figure 7. (Camacho, Bordons 2007)


Figure 7. Key elements and interactions in model predictive control. Adapted from (Camacho, Bordons 2007).

The control is characterized by minimizing the cost function over the prediction horizon. For example, a popular cost function is the quadratic cost

$$J = \sum_{i=1}^{N} w_{x_i} (r_i - x_i)^2 + \sum_{i=1}^{N} w_{u_i} \Delta u_i^2,$$

where $x_i$ is the $i$th controlled variable, $r_i$ is the target value for the $i$th controlled variable, $u_i$ is the $i$th manipulated control variable, $w_{x_i}$ is a coefficient reflecting the relative importance of $x_i$ and $w_{u_i}$ is a coefficient penalizing relatively big changes in $u_i$.

At every sampling time step $t$, the control output $\mathbf{u} = (u_1, u_2, \dots, u_N)^T$ is calculated by the optimizer. This optimization is based on the model predictions, the reference trajectory, the input constraints and a cost function defining the desired output. Then the first control value $u_1$ is input to the real system and held constant for the time step. Repeating this every time step effectively results in a discrete control with zero-order hold for the system that is optimal over the given time horizon $\mathbf{t} = (t_1, t_2, \dots, t_N)^T$. (Camacho, Bordons 2007)

The optimizer minimizes the cost function within the given input constraints. The cost function is usually defined by penalizing the errors to hold the system at a reference trajectory and penalizing input changes to avoid constant drastic changes in the control input.
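The receding-horizon idea can be sketched in a few lines; here simulate(x, u) is a hypothetical one-step scalar model, the cost follows the quadratic form above, and scipy's general-purpose optimizer stands in for whatever solver an actual MPC implementation would use.

```python
import numpy as np
from scipy.optimize import minimize

def mpc_step(x0, simulate, reference, n_horizon=10, w_x=1.0, w_u=0.1,
             u_min=-1.0, u_max=1.0, u_prev=0.0):
    """Pick the next control input by minimizing a quadratic cost over the horizon."""
    def cost(u_seq):
        x, u_last, total = x0, u_prev, 0.0
        for u in u_seq:
            x = simulate(x, u)                     # model prediction for one time step
            total += w_x * (reference - x) ** 2 + w_u * (u - u_last) ** 2
            u_last = u
        return total

    bounds = [(u_min, u_max)] * n_horizon          # input constraints
    result = minimize(cost, np.full(n_horizon, u_prev), bounds=bounds)
    return result.x[0]                             # apply only the first control value

# Example with a toy first-order model x' = 0.9*x + 0.5*u driven toward a reference of 1.0.
u = mpc_step(x0=0.0, simulate=lambda x, u: 0.9 * x + 0.5 * u, reference=1.0)
```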

3.1.2 Model Optimization

Minimizing the cost function with the model is a fundamental part of the MPC strategy.

Usually linear models are used, since they can be easily identified with step or impulse responses, and minimizing the cost function with linear models is very easy. However, if the real system has severe nonlinear dynamics, using a linear approximation or an assumed linearization around the current operating point as the model will not give good results. This requires the use of a nonlinear model. (Camacho, Bordons 2007)

Developing or constructing nonlinear models for systems is significantly harder. Also, the optimization for nonlinear models must be done numerically and iteratively. Combining a nonlinear model with the multivariable case and input constraints makes the optimization problem orders of magnitude more difficult and requires more computational effort. (Camacho, Bordons 2007)


4. DEEP REINFORCEMENT LEARNING FOR CONTROL

In addition to Atari and Go, DRL has been applied to various control-themed applications with good results. For example, the OpenAI RL system application programming interface (API) known as OpenAI Gym features lots of benchmark problems to test and compare different RL algorithms (OpenAI 2017). A few of these problems are control theory problems from classic RL literature, such as the cartpole, the pendulum and the mountain car problem.

In addition to the classics, the Gym also contains control-themed robotics problems, such as a bipedal walker and manipulating a robotic arm or a robot hand with various goals.

4.1.1 Discrete Classic Cartpole and Quantum Cartpole

Wang et al. (2019) applied DRL to the classic cartpole problem and to a more challenging quantum cartpole system. In the classic problem, a 2D cart is controlled to prevent an inverted pendulum on top of the cart from falling down. The quantum version is similar, but it features quantum-mechanical behaviour in the measurements, with the control applying forces to a particle to keep it at the top of a potential. (Wang, Ashida et al. 2019)

They used DQN since both problems used discrete action spaces: move the cart left or right, or apply one of 21 equispaced forces from the interval $[-F_{max}, F_{max}]$ to the particle, where $F_{max}$ was the maximum applicable force. Using DQN, they achieved performance comparable to a linear-quadratic-Gaussian controller in the classic cartpole and superior performance in the quantum version. (Wang, Ashida et al. 2019)

4.1.2 Continuous Locomotive and Hierarchical Problems

Duan et al. (2016) performed a wide benchmark of many different algorithms on continuous control problems featuring locomotion tasks and extended these with hierarchical goals. The locomotion tasks featured manoeuvring different 3D entities, ranging from a swimming worm to a walking humanoid, forward as quickly as possible. The more difficult hierarchical problems required manoeuvring the entity in tasks ranging from collecting food while dodging bombs to completing a maze. (Duan, Chen et al. 2016)

They found that the truncated natural policy gradient, trust region policy optimization (Schulman, Levine et al. 2017) and DDPG algorithms performed well on the locomotion tasks and were effective ways to train deep neural network policies. However, all evaluated algorithms performed poorly on the difficult hierarchical tasks. (Duan, Chen et al. 2016)


4.1.3 Optimal Control of Space Heating

Nagy et al. (2018) set up a simulated RL environment involving space heating by controlling a heat pump thermostat in a room. The RL reward was formed from two elements: the first was the occupant comfort element, penalizing too high and too low temperatures, and the second featured the energy consumption costs. For calculating the energy consumption costs, they featured three scenarios: flat electricity pricing with a constant price, dual pricing with a cheaper price during nights, and real-time pricing where the price varied by the hour. (Nagy, Kazmi et al. 2018)

As a baseline providing the lower bound for performance, they used a simple on-off rule-based controller of the kind found in many buildings today. As an upper bound for the performance, they used a model-based MPC controller that assumed full knowledge of the system's dynamics. The first RL approach was to use a model-based method to learn the model of the system using a neural network and to use the model to plan out the heating. The second RL approach was a model-free method called double deep neural fitted Q-iteration, where the heating signal is learned from the measurements. (Nagy, Kazmi et al. 2018)

Overall, both RL results were much better than the rule-based controller, with an energy cost reduction of 7–18 % for the model-based one and 6–10 % for the model-free one. In some scenarios, the model-based method approached performance comparable to the perfect-information MPC. The model-based methods proved to be superior for this problem, but the model-free methods offer other benefits, such as much faster compute times and reliable operation when unexpected changes to the environment take place. (Nagy, Kazmi et al. 2018)


5. MODEL ENVIRONMENT

To apply reinforcement learning, a system to apply it to is of course needed. For this purpose, a reinforcement learning environment is set up, using a lightweight mathematical physics model of aircraft flight provided by Insta DefSec Oy. Since the physics model is given for this thesis, no reasoning or further background is provided for the functions and variables described for it. The purpose of this chapter is to describe the mathematical model environment, while Chapter 7 has the actual implementation details.

The basic idea for the RL environment is to have an aircraft controlled by the RL agent using the physics model and a control model, which are thoroughly explained below in 5.1 and 5.2.

The objective for the agent is to manoeuvre to and follow a moving target position model, described in 5.3. Then, in 5.4, these models are combined into a single model to represent the goal, which is referred to as the pilot model.

5.1 Physics Model

For the purposes of this thesis, the physics model takes the control values, pull, throttle, roll and rudder, and aircraft’s current spatial information as input and outputs the updated spatial information. The spatial information contains the aircraft’s latitude, longitude, altitude, yaw, pitch, and roll values. The model is illustrated in Figure 8. This model has no delays and satisfies the Markov property: each updated state depends only on the previous state and selected action.

Figure 8. The physics model takes the aircraft state 𝒔𝒂 containing the current spatial information and control values 𝒖 as input and outputs an updated aircraft state 𝒔𝒂′.

For this model, the latitude, longitude and altitude are simply values in a 3D xyz coordinate system, where x is the longitude, y is the latitude and z is the altitude. Effectively, flying at a constant altitude results in level flight over the xy plane.


5.1.1 Physics Model Parameters and Constants

The physics model is parametrized with various parameters and constants, referred to here simply as parameters. The parameters, their symbols, their units, and their default values are presented in Table 1 and explained below.

Table 1. Simulation parameters, symbols, units, and default values.

Symbol Parameter Default Value Unit

m Mass 13 000 kg

deq-% Equipment drag percentage 0 %

vmin_sl Minimum speed at sea level 60 m/s

vmax_sl Maximum speed at sea level 480 m/s

vmax Maximum speed 532 m/s

vmax_altitude Maximum speed at max altitude 309 m/s

vmach Mach speed 340 m/s

zmax Maximum altitude 15 000 m

zmax_speed Max speed altitude 12 000 m

Tmil Military thrust 97 800 N

Tfull Full thrust 156 600 N

cmil Specific military consumption 82.6 kg/kN/h

cfull Specific full consumption 177 kg/kN/h

nmax Max load factor 7.5 g

nmax_negative Max negative load factor -2 g

u3max Max absolute rudder 1 °/s

Δt Simulation tick 1 s

g Gravity acceleration 9.81 m/s²

I came up with reasonable default values, which are used when not stated otherwise. The aircraft-related parameters are my estimates for a modern fighter aircraft. The simulation tick of 1 second is a solid compromise between simulation accuracy, i.e., fidelity, and the computational performance of the applied methods. Contrary to the real world, gravity and the Mach speed are fixed constants for ease of implementation and because of their relatively small effect on the fidelity of the simulation.

The first parameters are the mass $m$, stating how heavy the aircraft is, and the equipment drag percentage $d_{eq-\%}$, stating the additional drag forces due to external equipment. These values vary a lot depending on the current loadout and fuel level of the aircraft, so ideally the learned flying policy should be very robust to changes in these variables.

The second set of parameters contains the speed parameters: the stalling speed at sea level $v_{min\_sl}$, the maximum speed at sea level $v_{max\_sl}$, the maximum speed $v_{max}$, the maximum speed at maximum altitude $v_{max\_altitude}$ and the Mach speed, which in this model is a constant $v_{mach}$. These speed parameters are used in combination with the altitude parameters, the maximum altitude $z_{max}$ and the maximum speed altitude $z_{max\_speed}$, to calculate the maximum and minimum speeds at various altitudes.

The third set of parameters is related to calculating fuel consumption: the military thrust $T_{mil}$, the full thrust $T_{full}$, the specific military consumption $c_{mil}$ and the specific full consumption $c_{full}$. Military thrust defines the maximum force generated with the engines operating at full thrust without lighting the afterburners. Respectively, full thrust is achieved with full throttle with the afterburners.

The fourth set has some physical constraints: the max load factor $n_{max}$, the max negative load factor $n_{max\_negative}$ and the max absolute rudder $u_{3max}$. The maximum load factors measure the maximum stress that the aircraft can be subjected to. The load factor is technically unitless, but since it is defined by dividing lift by weight, it is effectively measured in g-forces.

Last are the simulation parameters: the simulation tick time $\Delta t$ and the gravity acceleration $g$. The simulation tick measures the time between two timesteps $t_n$ and $t_{n+1}$, where $n$ is some integer. For performance reasons its default value is set to one second.

In addition to these parameters, the equations described below in 5.1.2 use the air density at altitude, $\rho(z)$. In this model the air density is looked up and interpolated from the values found in Table 2.

Table 2. Air density value interpolation datapoints.

Altitude (m)    Air Density (kg/m3)
0               1.22500
1 000           1.11200
2 000           1.00700
3 000           0.90930
4 000           0.81940
5 000           0.73640
6 000           0.66010
7 000           0.59000
8 000           0.52580
9 000           0.46710
10 000          0.41350
11 000          0.36976
12 000          0.32602
13 000          0.28228
14 000          0.23854
15 000          0.19480
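The interpolation can be done, for example, with numpy's linear interpolation; a short sketch (not necessarily the thesis implementation) using the Table 2 datapoints:

```python
import numpy as np

# Altitude datapoints (m) and the corresponding air densities (kg/m3) from Table 2.
altitudes = np.arange(0, 16000, 1000)
densities = np.array([1.22500, 1.11200, 1.00700, 0.90930, 0.81940, 0.73640,
                      0.66010, 0.59000, 0.52580, 0.46710, 0.41350, 0.36976,
                      0.32602, 0.28228, 0.23854, 0.19480])

def air_density(z):
    """Linearly interpolated air density rho(z) at altitude z in meters."""
    return np.interp(z, altitudes, densities)

print(air_density(2500.0))  # halfway between 1.00700 and 0.90930 -> 0.95815
```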


5.1.2 State variables and Transitions

This model effectively forms an MDP because the updated state only depends on the previous state and selected action. This model is also fully deterministic, as can be seen from the equations below.

In this model the aircraft's flight is constrained to non-inverted flight by constraining the pitch and roll to a maximum of 90° and a minimum of -90°. This results in some inverted flying manoeuvres, such as inverted diving, being impossible. The heading is also constrained, with values at and over 360° looping back starting from 0° and vice versa.

The action in this environment is effectively the control vector $\mathbf{u} = (u_0, u_1, u_2, u_3)^T$, where $u_0$ is the pull value, $u_1$ is the throttle setting, $u_2$ is the roll value and $u_3$ is the rudder value. The input variables, units and constraints are listed in Table 3.

Table 3. Aircraft control inputs, units, and constraints.

Symbol    Input Variable    Unit    Constraints
u0        Pull              g       [nmax_negative, nmax]
u1        Throttle                  [0, 2]
u2        Roll              °       [-90, 90]
u3        Rudder            °/s     [-u3max, u3max]

Pull is effectively the desired load factor, where a value of 1 means flying with lift equalling the aircraft's weight. A throttle setting between 0 and 1 means using the engines normally, and the range 1 to 2 means afterburner usage. The roll value is simply the desired roll in degrees. The rudder value is the desired rudder correction in degrees per second.

For this system, the state consists of the spatial information: the longitude $x_a$, latitude $y_a$ and altitude $z_a$, all measured in meters. The speed $v_a$ is measured in meters per second. Finally, the heading $h_a$, pitch $p_a$ and roll $r_a$ are measured in degrees. So the physics model is fully defined with the state vector $\mathbf{s} = (x_a, y_a, z_a, v_a, h_a, p_a, r_a)^T$. The physics model state variables, units and constraints are listed in Table 4.

Table 4. Aircraft state variables, units, and constraints.

Symbol State Variable Unit Constraints

xa Longitude m

ya Latitude m

za Altitude m

va Speed m/s

ha Heading ° [0, 360[ looping

pa Pitch ° [-90, 90]

ra Roll ° [-90, 90]
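As a minimal sketch of how the control inputs of Table 3 and the state of Table 4 could map onto the OpenAI Gym interface mentioned earlier: the bounds follow the tables and the default parameters of Table 1, while the class name, the unbounded position limits and the omitted dynamics are illustrative assumptions, not the thesis implementation.

```python
import numpy as np
import gym
from gym import spaces

class AircraftEnvSketch(gym.Env):
    """Skeleton only: spaces for u = (pull, throttle, roll, rudder) and
    s = (x, y, z, v, heading, pitch, roll) as listed in Tables 3 and 4."""

    def __init__(self, n_max=7.5, n_max_negative=-2.0, u3_max=1.0):
        # Control constraints from Table 3 (pull in g, throttle, roll in deg, rudder in deg/s).
        self.action_space = spaces.Box(
            low=np.array([n_max_negative, 0.0, -90.0, -u3_max], dtype=np.float32),
            high=np.array([n_max, 2.0, 90.0, u3_max], dtype=np.float32))
        # State constraints from Table 4; position left unbounded here.
        self.observation_space = spaces.Box(
            low=np.array([-np.inf, -np.inf, -np.inf, 0.0, 0.0, -90.0, -90.0], dtype=np.float32),
            high=np.array([np.inf, np.inf, np.inf, np.inf, 360.0, 90.0, 90.0], dtype=np.float32))

    def reset(self):
        raise NotImplementedError   # the actual state transitions are described in 5.1.2

    def step(self, action):
        raise NotImplementedError
```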
