
APPLICATIONS OF MACHINE LEARNING IN CELLULAR BASE STATION SUPERVISION

Master of Science Thesis
Faculty of Information Technology and Communication Sciences
Examiners: Assoc. Prof. Sergey Andreev, Dr. Alexander Pyattaev
December 2020


ABSTRACT

Henrik Talarmo: Applications of machine learning in cellular base station supervision
Master of Science Thesis

Tampere University

Master’s Programme in Information Technology

Examiners: Assoc. Prof. Sergey Andreev (Tampere University), Dr. Alexander Pyattaev (Tampere University)

Supervisors: Alexander Pyattaev (Tampere University), Olga Galinina (Tampere University), Sergey Andreev (Tampere University), Johan Torsner (Ericsson), Felipe Del Carpio (Ericsson), Fedor Chernogorov (Ericsson)

December 2020

Existing research on cellular network optimization using machine learning can generally be divided into two different methodologies: offline learning and online learning. Offline learning trains a machine learning agent off the network, using simulations or data gathered from real base stations, to determine the optimal set of parameters. The resulting parameters are then applied to the network, achieving performance greater than that of common heuristic approaches. Online learning methods, on the other hand, train the agent on the network itself while the network is operational. The agent will then, over time, find an optimal set of parameters specific to that network to ensure high performance.

This thesis proposes an under-researched hybrid approach that makes use of both methodologies. In it, an agent is trained offline to search the parameter space of the network and to automatically discover the configuration that results in the highest performance. By doing this, the agent is capable of finding the global maximum of performance in the parameter space faster than online methods, making it more responsive and better able to handle changing environments.

In this thesis, this hybrid method is further explored and tested. Neuroevolution is used to create an agent to navigate a simplified model of a cellular network parameter space to optimize the throughput of the network. The results highlight the need for a specialized approach that can operate despite the lack of information about the parameter space while the system is online.

Keywords: Cellular networks, Machine learning, Neuroevolution, Optimization, NEAT

The originality of this thesis has been checked using the Turnitin OriginalityCheck service.


TIIVISTELMÄ (ABSTRACT IN FINNISH)

Henrik Talarmo: Applications of machine learning in cellular base station supervision
Master of Science Thesis

Tampere University
Master's Programme in Information Technology

Examiners: Assoc. Prof. Sergey Andreev (Tampere University), Dr. Alexander Pyattaev (Tampere University)

Supervisors: Alexander Pyattaev (Tampere University), Olga Galinina (Tampere University), Sergey Andreev (Tampere University), Johan Torsner (Ericsson), Felipe Del Carpio (Ericsson), Fedor Chernogorov (Ericsson)

December 2020

Existing research on optimizing cellular networks with machine learning can generally be divided into two methodologies: offline and online learning. In offline learning, a machine learning agent is trained outside the network, using simulations or data gathered from real base stations, to find the optimal settings. The parameters obtained this way are then applied to the cellular network, achieving higher performance than what can be reached with common heuristic methods. In online learning, on the other hand, the agent is trained on the network itself while the network is operational. Over time, the agent then finds the optimal parameters specifically for that network, ensuring its high performance.

This thesis proposes the existence of a third, little-researched hybrid method that makes use of both methodologies. In it, the agent is trained outside the network to automatically find, from the network's parameter space, the settings that enable the highest possible performance. With this, the agent is able to find the global maximum of performance in the parameter space faster than online learning methods, which makes the agent more responsive and better able to react to changes in the environment.

In this work, this third, hybrid method is explored and tested in more depth. Neuroevolution is used to create an agent that navigates through a simplified model of the network's parameter space with the aim of optimizing the network's throughput. The results show that the hybrid method requires a more specialized approach, one that can adapt to the scarcity of information available about the parameter space while the system is operational.

Keywords: Cellular networks, Machine learning, Neuroevolution, Optimization, NEAT

The originality of this publication has been checked using the Turnitin OriginalityCheck service.


PREFACE

This thesis work was completed during the 2020 COVID-19 pandemic, which caused a significant amount of disruption in the lives of everyone worldwide. It was done for Ericsson as part of the Business Finland-funded 5G-FORCE project. This thesis would not have been possible without the assistance of several people I want to acknowledge here.

I want to thank Johan Torsner, Felipe Del Carpio, and Fedor Chernogorov from Ericsson for their feedback on the thesis and their expertise in wireless and 5G technologies during the project. I also want to thank Assoc. Prof. Sergey Andreev and Dr. Alexander Pyattaev for supervising my thesis work and providing feedback on the writing. Furthermore, I want to thank Alexander for his guidance during the entire project, as he was my first line of support during it. Finally, I want to thank Olga Galinina for her feedback and knowledge on the machine learning side of the thesis.

Tampere, 14th December 2020 Henrik Talarmo


CONTENTS

1 Introduction . . . 1

2 Integrating machine learning into the cellular network . . . 5

2.1 Outlining base station parameter options . . . 7

2.2 Cellular network optimization goals . . . 8

2.3 Formal definition . . . 10

2.4 Choosing a suitable machine learning approach . . . 11

2.4.1 Particle Filters . . . 11

2.4.2 Reinforcement Learning . . . 12

2.4.3 Neuroevolution . . . 14

3 Training methodology . . . 16

3.1 Simulator and scheduling . . . 20

3.2 Training environment . . . 23

3.3 Agent training process . . . 25

3.3.1 Comparing different fitness functions . . . 27

3.3.2 Fitness function evaluation . . . 28

4 Results and analysis . . . 33

4.1 Results . . . 33

4.2 Analysis . . . 41

4.3 Discussion . . . 43

5 Conclusions . . . 47

References . . . 49

Appendix A NEAT parameters . . . 52


LIST OF SYMBOLS AND ABBREVIATIONS

BWsys      System bandwidth
Di         Change in a parameter over time
G          Scheduler greediness
Hν         Hotspot location in parameter space
NsymDL     Number of downlink symbols
NRB,ω      Number of resource blocks used in the past ω frames
Nmax       Number of maximums
P          Vector of parameter values
PLdB       Path loss
P*         Point of global maximum in the parameter space
Ptx        Transmit power
Q          Q-function of Q-learning
S          User's SNR
SNRt       SNR target
T          Throughput
TH         Throughput value from a hotspot in the training model
Tk         Throughput of the base station after the kth agent step
Vi         Value range of a parameter
W          Vector of all the weights of all the users being allocated by a scheduler
X          Parameter used in an example
∆t         Change in time
Ω          Event when the network environment changes suddenly
Φ          Agent fitness
ℵ          Event when the agent is attached to the online network
α          Scheduler greediness SNR weight
β          Scheduler greediness RB weight
λ          Variable for adjusting the weighted average fitness function (3.10)
R          Set of all real numbers
µ          DDPG policy estimating the correct action for a given state
ν          Number of parameters, dimensionality
ω          Scheduler window size in frames
θ          Example function
ξ          Function describing base station performance
at         RL action taken at a given time
c          Speed of light
d          Distance
f          Frequency
n          Path loss exponent
pi         Parameter value
rm         Distance from P to Hν
st         RL state chosen at time t
t          Time
wm         Weight of a hotspot
wu         Weight of a user in the scheduler
wmax       Highest weight at a given point in the parameter space

5G NR      5G New Radio
ANN        Artificial Neural Network
bps        Bits per second
BS         Base Station
BSP        Binary Space Partitioning
CCO        Coverage and Capacity Optimization
DDPG       Deep Deterministic Policy Gradient (in machine learning)
DL         Downlink
NEAT       Neuroevolution of Augmenting Topologies (in machine learning)
OFDM       Orthogonal Frequency-Division Multiplexing
PC         Power Control
RB         Resource Block
RL         Reinforcement Learning
RNN        Recurrent Neural Network
SCS        Subcarrier spacing
SNR        Signal-to-noise ratio
SON        Self-Organizing Network
UE         User Equipment
UL         Uplink


1 INTRODUCTION

There has been a significant amount of research into using machine learning for cellular network optimization [1]. The existing research can generally be divided into two distinct methodologies: Offline learning and online learning.

Offline learning methods are approaches where machine learning is used to find an optimized set of parameters for a cellular network as part of the offline process of cellular network planning. For training, the data is either generated through simulation or gathered from real-world networks over time. The solutions obtained through these methodologies can then result in better performance in a specific cellular network compared to conventional heuristic methods.

These offline methods have been researched widely in many aspects of cellular network optimization. Offline solutions have been used to minimize the number of base stations during the early planning phase while ensuring that coverage and capacity are optimized [2]. Offline methods have also been researched for reducing coverage issues by tuning base station transmit power and downtilt [3], and for increasing cell edge throughput by improving link allocation algorithms [4]. Beyond the network planning phase, improving the performance of existing networks has also been researched [5].

A common aspect of offline methods is that they produce static solutions. The solution can be highly optimized, as there are few resource and time constraints for calculating it, but its static nature means that it cannot adapt over time. This can be a problem, as the environment can experience changes in both the long-term and short-term.

In the long-term, the surrounding landscape might change due to new buildings being built and old ones being demolished. In the short-term, the weather, traffic patterns, and user locations can affect the performance of the network. To ensure that the network performs optimally in these situations, a new solution must be calculated each time for the new environment and the solution must then be applied to the network.

In online learning methods, an agent is instead trained while the network is online. An agent is a machine learning entity that autonomously makes decisions on how to modify the network's parameters. The agent is placed in the network without any prior training and, while the network is operational, the agent learns the optimal parameter configuration for the network. Unlike the offline methods, this approach is capable of adjusting to changes in the environment by re-adjusting the machine learning algorithm that it is running over time.

The predominant method for creating these kinds of online learning agents is through the use of Reinforcement Learning (RL). For example, Q-learning [6] has been used extensively to create agents that can improve network throughput in femtocells [1, 7, 8] and on the edges of cells [9]. Improving spectral efficiency by adjusting base station downtilt has also been explored through fuzzy Q-learning [10, 11]. There has also been interest in improving Self-Organizing Networks (SON) by applying Q-learning to their functions [11, 12]. The use of methods other than Q-learning has also been explored. For example, a cyclic multi-armed bandit was used to tune the number of sub-bands in a cell to achieve improved throughput in picocells [13].

Online agents do not come without downsides. As the agent has not yet been trained in how to optimize the network when the network is turned on, it takes some amount of time before optimal operation is reached. This amount of time may in some cases be several minutes or even hours [10]. As a result, there is some delay between a change in the environment and the algorithm finding a suitable solution for the new environment. If the environment changes rapidly, the agent will not have enough time to converge on the optimal solution. However, if the environment does not change much, or if there is enough time between the changes, the online agent will eventually find the optimal solution. This makes it adaptable, unlike an offline solution. An online agent is also likely unable to reach the same level of performance as an offline solution, because the online agent is more restricted in the time and resources it can spend on learning the environment.

The common feature between the online and offline learning methods is that the algorithm learns what the optimal configuration for the network is and outputs that as the solution.

Instead of this, the solution can be found by teaching the agent to search for this optimal configuration. For example, instead of the agent spending time to learn that on a particular base station setting the transmit power to 14.5 dBm results in optimal performance, the agent is taught to adjust the transmit power in small increments over time. When placed in the online network, the agent can then adjust the transmit power over time until it finds the setting that results in optimal performance. This kind of hybrid agent has the advantage of being highly adaptable. If the environment changes, the agent will immediately start to search for the configuration of high performance in the network. A well-trained hybrid agent thus has the potential of being a more responsive solution for a network than an online agent.

The differences in performance over time between the three types of methods are shown in Figure 1.1. In it, the offline training processes of the offline machine learning method and the hybrid agent are shown until the event ℵ, when the solution is applied to the online network. The offline solution immediately yields high performance. At the same time, the online agent begins to learn what the optimal configuration for the network is, and the hybrid agent immediately begins to search for the optimal configuration.

Figure 1.1. Example visualizing the differences between the three types of methods. Event ℵ represents the point in time where the network goes online. At Ω a change happens in the environment, resulting in reduced performance across all methods.

At event Ω, an abrupt change happens in the environment. As a result of the event, the old configuration of the network yields worse performance than before. Because the solution from the offline method was not optimized for this new environment, its performance drops significantly. As it is also static, it stays that way until a new solution is calculated. As the old solution from the online agent is no longer sufficient, it will have to spend some time re-learning what the optimal configuration for the network is. Meanwhile, the hybrid agent will be able to recover from the disturbance much faster, as it already has a policy for how to improve the performance of the network by observing the network state. As it did before Ω, the online agent will eventually outperform the hybrid agent, but if events like Ω happen frequently enough, it is possible that the online agent will never reach the same level of performance as the hybrid agent. This demonstrates how the hybrid agent can be more responsive and adaptable than the online or offline agents.

The hybrid method has not been researched as much as the other two methods. As a notable example, one group studied the use of genetic programming to improve scheduler performance, and then used a heuristic hill-climbing algorithm to optimize individual parameters for a set of base stations [14]. In this study, the parameter optimization was done via a heuristic algorithm, and the scheduler was static during the online phase. Thus, it did not create a truly hybrid agent.

In this thesis, hybrid agents are further explored by designing a machine learning agent that can observe the state of the network and propose adjustments to the configuration parameters to improve performance. In Section 2, a framework is proposed for integrating one such agent into a cellular network by suggesting a suitable location for the agent in the system. Suitable parameters are then chosen for the hybrid agent to adjust during the online phase, and throughput is chosen as the metric measuring network performance. Neuroevolution is then chosen as the most suitable machine learning approach for the framework. The proposed framework is then tested.

In Section 3, data is first obtained on how the performance of a cellular network changes based on both the chosen adjustable parameters and environmental effects. This data was obtained through system-level network simulation runs and gathered to form an understanding of how the throughput of the network changes with various parameter values. Based on this data, a training environment is created that generates simplified data. The environment is created to ensure that, during the offline training process, the environment is of the desired level of complexity. Finally, the chapter concludes by describing the training process in detail.

The training results presented and analyzed in Section 4.1 highlight that the approach required for a hybrid agent needs to be more specialized than what is suggested in Chapter 2. An approach where the agent directly controls base station parameters requires a considerably more complicated network than what is easily achievable through neuroevolution. Based on the results, several approaches are proposed in Section 4.3 as possible alternative methods for creating a hybrid agent.


2 INTEGRATING MACHINE LEARNING INTO THE CELLULAR NETWORK

The way the agent is integrated into the cellular network affects how the agent can interact with the network itself. In the following, two different ways to integrate the agent are discussed: a distributed model and a centralized model.

In the distributed model, the agent is placed in each of the base stations in the network as shown in Figure 2.1. By integrating the agent into each base station, the agent has direct control over the parameters of the base station and can adjust them to fit that base station accurately. Having an individual agent in each of the base stations can, however, also cause adversarial behavior between the agents. This can be caused by one agent improving the performance of its own base station in a manner that reduces the performance of another base station. The change can then lead to the other agent adjusting one of its parameters to counteract the drop in performance, which can result in a feedback loop between two or more base stations. To avoid feedback loops, training the agent in an environment with other similar agents is encouraged.

The distributed model has been used in online learning methods [8, 13], and a variation of it, where the algorithm is applied to the user devices instead of the base stations has also been researched [9].

The centralized model shown in Figure 2.2 has the agent placed outside of the base stations in an external location. In consequence, the agent has control over several base stations at once. This has the advantage of avoiding any of the issues created by multiple agents causing feedback loops within the network. If a change to one base station causes a drop in the performance of another one, the agent can react to it accordingly without causing a feedback loop.

Because the agent is in control of several base stations, the number of parameters across all of them will also be larger. As each base station has its own set of parameters, the number of observable parameter combinations grows exponentially. For the agent to be able to control an expanded number of parameters, it will have to be more complex, resulting in higher CPU usage on the base station. This can be undesirable for online approaches, where resource and time constraints are a limiting factor. To avoid the total number of parameters growing so large that the performance of the agent begins to suffer significantly, the agent can control only a limited number of parameters from each base station. This reduces the degree to which the agent can optimize the base stations.

Figure 2.1. Example of a distributed machine learning module in a network.

Figure 2.2. Example of a centralized machine learning module in a network.

Centralized methods are common among offline learning approaches [2, 3, 5, 14]. This is because the increased resources and additional time permitted by the offline setting allow the method to absorb much of the impact of optimizing several base stations at once.

Because the distributed model allows for more individually adjustable parameters per base station, it has the potential to optimize the base station further than the centralized model. For this reason, it is of more interest for this thesis. To avoid the negative effects of adversarial behavior, training can be done initially by having only one agent placed in a single base station in the center of a cluster. Its effects can then be evaluated without having to worry about other agents interfering with it. It also simplifies the system and makes the relationship between the adjustable parameters and the performance of the system more direct.

The way the agent interacts with the base station must also be discussed. For this, two approaches are studied. The first one is a single-shot method where the agent observes the network around it and, based on known patterns, chooses parameter values that result in the highest performance for the base station. After this configuration is completed, the agent makes no further adjustments to the base station. The agent can also learn from previous times the base station has been active whether its solution was correct, and it can keep learning what the perfect set of parameter values is for that base station alone. This kind of agent can also be used as part of the network design process.

The main downside of this approach is that the base station configuration is static over time. This means that changes in the environment or data patterns are ignored.

The second approach is to have the agent adjust the base station continuously over time. This online approach is achieved by first setting the base station parameters to some known values. After this, the base station invokes the agent regularly, giving it the current state of the base station as input. The agent processes the input and outputs corrections to the parameter values to improve base station performance. Unlike the static approach, this has the significant advantage of being capable of reacting to changes in the environment over time. The downside is that the training process can be more complicated, because the dynamic behavior lends itself to many more possible situations that the agent must handle.
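The continuous-adjustment interaction described above can be sketched as a simple control loop. Every name here (supervise, base_station.apply, current_state, agent.propose) is hypothetical and introduced only for illustration; no concrete base station API is defined at this point in the thesis.

```python
# Illustrative sketch only: a hypothetical interface for the dynamic
# (continuous adjustment) approach. None of these names come from the thesis.

def supervise(base_station, agent, initial_params, steps=1000):
    params = dict(initial_params)             # start from known settings
    base_station.apply(params)
    for _ in range(steps):
        state = base_station.current_state()  # e.g. recent throughput, current params
        corrections = agent.propose(state)    # small per-parameter adjustments
        for name, delta in corrections.items():
            params[name] = params[name] + delta
        base_station.apply(params)
    return params                             # in practice the loop never ends;
                                              # steps only bounds this sketch
```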

Despite the more complicated training process, the dynamic approach is the only practical choice for a hybrid agent. The single-shot method behaves much like offline methods in the short term, except that the learning happens while the network is online. Because of this, the base station cannot easily react to changes in the environment, making it less responsive, which in the scope of this thesis is undesirable.

2.1 Outlining base station parameter options

As the purpose of the agent is to adjust the parameters of the base station, a set of parameters must be chosen for that purpose. These parameters need to be accessible by the agent and something that can be changed on a software level. Downtilt is a parameter that has been used for network optimization [10, 11], but it may be unavailable if the base station does not mechanically allow for it. Conversely, the shape of the antenna itself is dependent on the physical characteristics of the antenna and thus cannot be readily adjusted by the agent. It can, however, be optimized for in the offline phase of cellular network design [2].

The first parameters for the agent to adjust come from Power Control (PC). In this thesis, PC for downlink is considered to be static, and uplink PC is achieved through the use of a signal-to-noise ratio (SNR) target value SNRt. This allows the agent to fine-tune the downlink transmit power as a single value. The opposite arrangement is possible [9], but doing so is more complicated. The complexity arises from the agent having to control an arbitrary number of users' transmit powers individually, greatly inflating the number of controllable parameters.

SNRt is the desired SNR value for a given user. It is used along with the known path loss of the user, as well as the noise, to calculate what the transmit power of the user needs to be in order for the SNRt to be achieved. The agent can adjust the exact SNRt value, thereby adjusting how low a signal quality the base station is willing to accept.

Another feature of the base station that the agent can adjust is the scheduler [4, 7, 8].

The scheduler is responsible for deciding which user sends and receives data at any given moment in the upcoming frames. The exact way the scheduler is implemented on the base station varies, however. Because of this, the exact parameters for the agent to adjust cannot be firmly defined in advance.

Accounting for these considerations gives the following list of parameters that the agent can adjust in the base station. The exact values and the implementation of the scheduler parameters and SNRt are further detailed in Section 3.

• Base station downlink transmit power

• Scheduler parameters

• SNRt

2.2 Cellular network optimization goals

Optimizing a system means improving the efficiency or performance of that system as much as possible. To determine how well the system is optimized, a metric that is suitable for measuring the performance of the system must be chosen. To make the training process of the agent less complicated, the metric will be chosen based on what suits a machine learning approach the best. In this section, several candidate metrics and their pros and cons are discussed from the point of view of machine learning. These are latency, spectral efficiency, and throughput.

Latency measures the delay in communication between two devices. In cellular networks, this is commonly measured as the amount of time it takes for a message to travel from one device to another and back. The lower the latency, the faster one device can respond to messages from the other device. Faster response times facilitate the use of time-critical applications where a message being late for even a short moment can result in accidents.

For example, in the application of self-driving vehicles, the sooner the message about an upcoming collision reaches the other vehicle, the sooner that vehicle can start taking actions to avoid the collision.

However, from the point of view of machine learning, latency may not be a suitable optimization metric. This is because in machine learning it is preferable that the effects of the actions the agent takes are consistent. Because of things like Radio Resource Control state transitions and environmental factors, latency can have high variability. From time to time, latency might have spikes and large swings caused by external factors that are outside the agent's control, which makes training the agent more complicated, as occasional massive drops in performance might be interpreted as if they were caused by the agent. The unappealing nature of latency as a direct metric of performance for machine learning applications is also supported by the lack of its usage in the literature [1].

Instead of latency, network data throughput can be, and has been, used as a metric of performance [8, 9, 13]. Throughput is usually measured as the number of data bits sent and received within the network in a given amount of time. It is more universally useful than latency as higher uplink (UL) and downlink (DL) throughput is desirable in most wireless applications. It is also simple to measure by simply counting how many bits of data have been sent or received.

In order to measure throughput, network traffic must be observed for some amount of time. This limits how often the performance of the network can be assessed. Throughput is also dependent on user activity. For example, during night-time the amount of traffic generated by users is going to be lower than during the morning. Because of this, the absolute bitrate of the network or of an individual base station is not by itself a suitable indicator of performance. In some situations, the highest possible bitrate for the base station can be lower than what the agent assumes is the maximum. The agent can then be at the global maximum throughput value and still predict that the throughput can somehow be improved, in which case it can start making unnecessary adjustments to the base station that result in reduced performance.

Spectral efficiency is a measure of bitrate across a given bandwidth. If measured for a single link, it is called link spectral efficiency. If it is instead calculated across multiple users within an area or a cell, it is called system spectral efficiency [15]. Of these, system spectral efficiency is of course better for estimating the performance of an entire network.

The primary downside of using spectral efficiency is that it has all the downsides of using network throughput, plus extra complexity. To calculate the spectral efficiency of the network based on the aforementioned definition, one must first calculate the throughput of the network. Throughput is then divided by the amount of bandwidth available on the network's channel. If the bandwidth of the channel does not change, the resulting metric will behave the same way as throughput does. Despite this, spectral efficiency has been used as a performance metric in some studies [10, 11], showing that its usage has some merit.

As latency can vary significantly and suddenly without agent intervention, and because spectral efficiency does not have any notable advantages over throughput from the point of view of this study, throughput was chosen to be the metric used in this thesis. As throughput is also very straightforward to measure and fairly stable over time, it is likely to give the best chances of training a successful agent. It is also universally useful, making it the most generic choice of the three options.

2.3 Formal definition

Let pi represent a parameter on a base station. Each parameter has its own finite set of allowed real values Vi, so that pi ∈ Vi. There are ν parameters in the base station, and they form a vector P = {p1, p2, p3, ..., pν}. An instance of P thus represents one configuration of the base station. The performance of the base station is dependent on the current parameter configuration. Performance also varies over time depending on factors that are outside the control of the agent. Thus, we can define a function that represents the network's performance with a configuration P at time t:

ξ(P, t) = T. (2.1)

The behavior of the function over time is unknown and non-deterministic, as we cannot know how the environment is going to change over time in advance. The environment can change for example because of the weather or local events.

The function ξ returns the value T as output. T ∈ R, and it has some maximum and minimum values at any given t. Because each parameter's corresponding set of possible values is finite, the domain of ξ is also finite. However, because each parameter has potentially hundreds or thousands of possible values and because there are several of them, observing the domain of ξ(P, t) in its entirety for any given t is typically infeasible.

Several further details are known about the behavior of the performance function ξ. The performance of the base station is not expected to change suddenly with only a very small change in the parameters. Because of this, it can be determined that most partial derivatives of ξ are small. It is also likely that there exist wide ranges of possible values for some parameters, at some points in time, over which the performance of the base station does not change much. As such, the function ξ may be insensitive to some configurations of P at some values of t.

As it takes some amount of time to determine the current performance of the base station, observing ξ(P, t) instantaneously is impossible. Instead, one must observe it for an amount of time ∆t. As the exact value of ξ(P, t) changes over time, the exact result of the observation may differ when repeated, even if P is the same.

Although ξ does change over time, it is known to be stationary in the long term. It is also likely that the change is constrained by some values (D1i, D2i), because the environment does not usually experience changes that are dramatic enough to affect the performance of the base station very rapidly. This constraint on the change of ξ can be described with

δ(δξ/δpi)/δt = Di,    Di ∈ R : (D1i, D2i).    (2.2)

It is also possible to estimate Di for all parameters ahead of time.

The problem that the machine learning agent has to solve is to find an algorithm that is capable of finding the global maximum at any given point in time. That is, it must find P* such that (∀P) ξ(P*, t) ≥ ξ(P, t). The agent must also find P* with only a few observations and then track it over time.
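Stated compactly, and using a hat for the time-averaged observation (both the averaging notation and the argmax form below are added here for illustration; they are consistent with, but not copied from, the definitions above):

T̂(P, t) = average of ξ(P, τ) over τ ∈ [t, t + ∆t],
P*(t) = argmax of ξ(P, t) over P ∈ V1 × V2 × ... × Vν.

The agent must track the moving target P*(t) while each evaluation of T̂ costs ∆t of network time, which rules out methods that require many observations per decision.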

2.4 Choosing a suitable machine learning approach

Not all machine learning approaches are suitable for the task presented in Section 2.3. For this reason, different machine learning approaches must be considered, and the one most likely to succeed in finding a solution must be chosen.

Three approaches are considered for use in this thesis for network optimization: particle filters, Reinforcement Learning, and Neuroevolution. All of them are discussed in detail in the following sections and their suitability for use in the agent described earlier in the section is studied.

2.4.1 Particle Filters

Particle filters are a robust machine learning approach. To find a global maximum of a function, a particle filter first makes many observations of the parameter space, called particles. It then filters out the particles that perform the worst in the parameter space.

After this, it makes a new set of observations near the remaining particles. By repeating this loop, the particle filter can rapidly approach the global maximum of the function. If the function has multiple equal global maximums, it is capable of discovering the location of each of them. [16, 17]


Figure 2.3. Basic graph depicting the way the base station and agent affect each other.

The reason why particle filters are not suitable for this problem in particular is the fact that they need to make hundreds of observations for each epoch. Because the behavior of ξ(P, t) changes over time, and because each observation takes ∆t time, by the time the observations have been completed the behavior of the function has almost certainly changed considerably. Particle filters are thus too slow for this purpose and must be discarded.

2.4.2 Reinforcement Learning

RL is an unsupervised machine learning method where an agent learns how to map states into actions to maximize reward. A state st is provided to the agent, which uses a policy to choose an action at. This action is then applied to the environment, and an immediate reward rt is given to the agent based on the action. Beneficial actions are rewarded and detrimental actions may be penalized. The agent uses the reward to adjust its policy, which makes the agent better at choosing the correct action for a given state. [18, p. 2]

The way that the agent interacts with the base station has similarities with the RL training loop. As shown in Figure 2.3, the agent receives the state of the network from the base station. It then performs an action based on the state of the base station. By including a reward along with the state, the interaction loop is similar to that of a reinforcement learning training loop. This makes RL an appealing method choice for this task.

Q-learning [6] is a common RL algorithm, and it has been applied to cellular network optimization in many ways [1]. Q-learning maintains a state-action function Q(st, at), known as the Q-function, that it uses to estimate the quality of state-action pairs [18, p. 157-159]. At the start of training, the Q-function is arbitrarily initialized for all state-action pairs. When the agent receives st, it uses the Q-function to choose the optimal action.

This action is then applied to the environment, and the next state st+1 as well as the reward rt+1 are given to the agent. The agent then uses its learning rate αQ and discount factor γ to adjust Q(st, at) using

Q(st, at) ← Q(st, at) + αQ [ rt+1 + γ max_a Q(st+1, a) − Q(st, at) ].    (2.3)

The discount factor γ adjusts how much the algorithm favors future rewards from the actions that the agent can take in the new state.

A notable feature of Q-learning is the term max_a Q(st+1, a). To calculate this term, the entire action space of the environment must be evaluated. This restricts Q-learning in its basic form to problems where evaluating the entire action space is feasible. Similarly, as Q-learning is fundamentally trained by evaluating state-action pairs, the training process must visit each state. If a state was never visited during training, then the behavior of the agent in that state is undefined. Consequently, to ensure that the agent behaves in a known manner in all states, the state space must be small enough for it to be feasible for the agent to evaluate each of the possible states. Because of these restrictions, continuous action spaces must be discretized to make Q-learning viable.
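As a concrete reference point, a minimal tabular Q-learning loop implementing the update in (2.3) is sketched below. The environment interface (reset and step), the ε-greedy exploration, and the hyperparameter values are illustrative assumptions rather than anything specified in the thesis.

```python
import numpy as np

# Minimal tabular Q-learning sketch of equation (2.3). The assumed environment
# returns an integer state from reset() and (next_state, reward, done) from step().

def q_learning(env, n_states, n_actions, episodes=500,
               alpha_q=0.1, gamma=0.9, epsilon=0.1, seed=0):
    rng = np.random.default_rng(seed)
    Q = np.zeros((n_states, n_actions))          # arbitrary initialization
    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            # epsilon-greedy choice over the discretized action space
            if rng.random() < epsilon:
                a = int(rng.integers(n_actions))
            else:
                a = int(np.argmax(Q[s]))
            s_next, r, done = env.step(a)
            # equation (2.3): move Q(s, a) towards r + gamma * max_a' Q(s', a')
            Q[s, a] += alpha_q * (r + gamma * np.max(Q[s_next]) - Q[s, a])
            s = s_next
    return Q
```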

To use Q-learning for continuous action spaces, Q-learning must be expanded upon in some way. One way to do this is to change how the Q-function works so that max_a Q(st, a) can be evaluated. Deep Deterministic Policy Gradient (DDPG) [19] does this by assuming that Q(st, at) is differentiable. By exploiting this assumption, DDPG creates a policy µ(st) to find the optimal action for a given state. This allows max_a Q(st, a) to be approximated with Q(st, µ(st)). DDPG has been used successfully to reduce under- and overcoverage in offline cellular network optimization [3].

Another method for expanding on Q-learning to account for continuous action spaces is to use fuzzy logic to create distinct states and actions from the continuous action space [20]. Fuzzy Q-learning has been used in online methods to improve spectral efficiency by adjusting downtilt [10, 21] and to make self-organized femtocells more dynamically capable by adjusting their resource allocation policy [7].

For the purposes of finding a global maximum in a parameter space that changes over time, the immediate rewarding of the agent's actions in RL presents a challenge. For example, let the throughput behave as the example function θ shown in Figure 2.4, with the current parameter value p = 3. The system throughput is at the local maximum of θ at p = 3, with the global maximum at p = 7. Assuming that the agent can only make adjustments to the parameter in steps of 1, the step towards the global maximum will reduce system performance, as θ(4) < θ(3). Even though continuing this action for 2 more steps results in an overall improvement in the system, in reinforcement learning the agent is rewarded for its actions at the moment of the action itself. If the agent were rewarded later, when multiple consecutive actions had yielded a positive result as a whole, the agent could only interpret it as the latest action being the beneficial one, ignoring the effects of the rest.

Figure 2.4. Graph of an example function θ.

2.4.3 Neuroevolution

Neuroevolution is a machine learning approach where evolutionary algorithms are leveraged to generate Artificial Neural Networks (ANNs) [22]. Evolutionary algorithms are stochastic methods that mutate a population of genomes over several generations. By controlling which genomes survive and which are eliminated each generation, the population can be guided towards a desired behavior. In neuroevolution, this is applied to ANNs. Neuroevolution has been shown to converge towards a solution faster than backpropagation in certain studies [23, 24].

Neuroevolution can be approached in different ways. A basic approach to this is to only adjust the weights of a static ANN. This approach is similar to backpropagation in the sense that the topology of the network stays the same and only the weights are adjusted, but it does not directly utilize the output data to adjust the network. Another approach is to instead adjust the topology of the network itself, adding and removing nodes and connections as required [25].

Another possibility is to adjust both the weights and topology simultaneously. This offers a wider spectrum of possibilities in terms of what kind of ANN can emerge from training.

A popular algorithm for this that has been used for both academic [26] and recreational purposes [27, 28] is Neuroevolution of Augmenting Topologies (NEAT) [29], published by Kenneth Stanley in 2002. It mutates a population of genomes over generations by simultaneously adjusting the weights of all connections and by adding or removing nodes in the ANNs stochastically. It also creates or removes connections between the nodes and adjusts their biases.

A third approach that has been studied is to split the ANN into subcomponents. Each of these components is evolved as its own part of the ANN, and topological adjustments in the ANN as a whole are only done within each subcomponent [30, 31]. As the subcomponents must evolve cooperatively to achieve a common goal, this kind of neuroevolution is often referred to as coevolution. Coevolution, however, does have downsides. Namely, as the ANN must be split into subcomponents, determining how many subcomponents result in optimal performance for a given task can be challenging. This becomes increasingly difficult when those subcomponents can be split down even further into smaller subcomponents. Furthermore, determining the contribution of each component in the ANN is difficult and requires some amount of analysis of the ANN in question [31]. Considering these challenges, and that NEAT has been shown to be both faster and able to create more sophisticated ANNs than fixed-topology coevolution algorithms [32], NEAT quickly becomes a preferred approach.


3 TRAINING METHODOLOGY

To gather data via the simulator runs, a set of parameters had to be defined for the simulator. These parameters were given sets of values, and all possible combinations of these values were run in separate simulations to gather information on how the performance of the network changes with different sets of parameters.

First, a set of parameters that the agent itself cannot control, but which affect the performance of the network, were chosen. The first of these was chosen to be the cell radius in meters, defining how far apart the different base stations are from each other. The values for this were chosen to be between 200 m and 400 m to accommodate differently spaced networks. The second parameter was chosen to be the UL/DL traffic. This was defined in UL and DL bits per second (bps) separately, and the values were chosen to have a high traffic value of 10·10^6 bps and a low traffic value of 5·10^6 bps.

The third was a parameter related to the path loss calculations in the simulator. Path loss in the simulator was calculated based on the Friis transmission equation as shown in (3.1), where c is the speed of light, f the frequency of the transmission, and d the distance. n, referred to in this thesis as propagation n, is the path loss exponent. If this value is 2, the path loss is that of free-space path loss. Values higher than 2 represent features in the environment that result in higher path loss. As the simulation area was chosen to represent an area with few obstacles, the values of n were chosen to be close to 2.

PLdB = (c / (4πfd))^n    (3.1)
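A small helper evaluating (3.1) as written is shown below; the function names and the decibel conversion are illustrative assumptions, not the simulator's own implementation.

```python
import math

# Sketch of the path model of equation (3.1): (c / (4 * pi * f * d)) ** n.
# n = 2 corresponds to free-space propagation.

SPEED_OF_LIGHT = 299_792_458.0  # m/s

def path_gain(frequency_hz: float, distance_m: float, n: float = 2.0) -> float:
    """Linear path gain per (3.1)."""
    return (SPEED_OF_LIGHT / (4.0 * math.pi * frequency_hz * distance_m)) ** n

def path_loss_db(frequency_hz: float, distance_m: float, n: float = 2.0) -> float:
    """The same quantity expressed in decibels (positive value = loss)."""
    return -10.0 * math.log10(path_gain(frequency_hz, distance_m, n))

# Example: a 2 GHz carrier at 200 m with the free-space exponent
print(round(path_loss_db(2e9, 200.0), 1))  # ~84.5 dB
```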

The parameters described in Section 2.1 were used to devise a further list of parameters for the simulator. The values given for each of the parameters are shown in Table 3.1. These parameters are the parameters that the agent can control directly. Of these parameters, two had to be further specified for the simulator itself, as no direct counterparts existed in the simulator. These are the scheduler parameters and the SNR target.

In Section 2.1, the scheduler parameters were left undefined because the exact parameters available depend on how the scheduler in the base station is implemented. In the case of the simulator used for this thesis, there were two significant parameters available: scheduler greediness and scheduler DL symbols. In other systems, the appropriate parameters related to the scheduler may be different.

Parameter name   Values                          Description
BS Ptx           30 dBm, 10 dBm and 0 dBm        Base station downlink transmit power
Cell radius      200 m, 300 m and 400 m          Radius of each cell
DL bps           5·10^6 bps and 10·10^6 bps      Downlink traffic
n                2, 2.2, 2.4 and 2.5             Path loss exponent values
G                0.9, 0.5 and 0.1                Scheduler greediness
NsymbDL          0, 4, 7, 10 and 14              Number of downlink symbols in a slot
SNRt,80dB        0 dB, 3 dB, 10 dB and 15 dB     SNR curve fit target value at 80 dB path loss
SNRt,90dB        0 dB, 3 dB, 10 dB and 15 dB     SNR curve fit target value at 90 dB path loss
UL bps           5·10^6 bps and 10·10^6 bps      Uplink traffic

Table 3.1. All values of the parameters used in the data gathering simulation.

Scheduler greediness G is a parameter used in the scheduler to adjust how it prioritizes users. Scheduler greediness is defined as G ∈ R : [0, 1]. Based on G, the values α = G and β = 1 − α are calculated. These values are then used in (3.2) to calculate a weight for each user. In (3.2), S is the user's SNR, NRB,ω is the amount of resource blocks the user has been given in the last ω frames, and NRB is the number of resource blocks requested by the user. The weight wu of the user is later used to calculate how many resource blocks that user is allowed to use in the next frame. In effect, as G → 1, users with high SNR are prioritized by the scheduler. As G → 0, the scheduler is more inclined to give each user an equal amount of resource blocks each frame:

wu = NRB · S^α / NRB,ω^β.    (3.2)
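A small numeric sketch of (3.2) is given below. The function name and the sample SNR and resource block figures are made up for illustration; the normalization printed at the end anticipates the scheduling stage described in Section 3.1.

```python
import numpy as np

# Sketch of the scheduler weight of equation (3.2) with alpha = G, beta = 1 - G.
# Sample values are illustrative only.

def user_weight(G, snr, rb_requested, rb_past_window):
    """w_u = N_RB * S**alpha / N_RB,omega**beta."""
    alpha = G
    beta = 1.0 - alpha
    return rb_requested * (snr ** alpha) / (rb_past_window ** beta)

snrs = np.array([20.0, 5.0, 1.0])       # linear SNR of three users
rb_req = np.array([50, 50, 50])         # resource blocks requested
rb_past = np.array([10.0, 40.0, 25.0])  # resource blocks received in the last omega frames

for G in (0.1, 0.5, 0.9):
    w = user_weight(G, snrs, rb_req, rb_past)
    # As G -> 1 high-SNR users dominate; as G -> 0 users that already received
    # many resource blocks are de-prioritized, equalizing the allocation.
    print(G, np.round(w / w.sum(), 2))
```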

To explain what scheduler DL symbols NsymbDL represents, a short description of the structure of 5G NR frames is required.

In 5G NR, a single frame lasts for 10 ms and is divided into slots that contain 14 Orthogonal Frequency-Division Multiplexed (OFDM) symbols each. The number of slots depends on the numerology µ of the cell. Numerology is an integer variable that defines the subcarrier spacing (SCS) and symbol length of a 5G NR network. The base value for numerology is µ = 0. At µ = 0, SCS is 15 kHz and a single slot of 14 symbols lasts for 1 ms. Increasing numerology by 1 halves the symbol length and doubles the SCS, as is shown in Figure 3.2.

Figure 3.1. Symbols in a slot with 5 downlink symbols and 9 uplink symbols.

Each of the symbols within a single slot can be individually classified to be either uplink, flexible, or downlink [33]. Flexible symbols can be used for both uplink and downlink, but for the sake of simplicity, the simulator only uses uplink and downlink symbols.

5G NR has numerous pre-set configurations of symbol patterns for the slots [34]. The symbols are generally restricted to be in the order of downlink-flexible-uplink in a slot.

These numerous configurations are simplified in the simulator to NsymbDL, which describes how many downlink symbols there are in a slot. As the number of symbols in a slot is restricted to 14, the number of UL symbols can be calculated with

NsymbUL = 14 − NsymbDL.    (3.3)

In Figure 3.1, an example of how the symbols in a slot are positioned is shown.
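These relations, the SCS and slot duration implied by the numerology and the DL/UL symbol split of (3.3), can be captured in a few lines. The helper names below are illustrative assumptions, not simulator code.

```python
# Numerology-derived quantities of the simplified 5G NR slot model.

def scs_khz(mu: int) -> float:
    """Subcarrier spacing doubles with each numerology step: 15 kHz * 2**mu."""
    return 15.0 * (2 ** mu)

def slot_duration_ms(mu: int) -> float:
    """A 14-symbol slot lasts 1 ms at mu = 0 and halves with each step."""
    return 1.0 / (2 ** mu)

def uplink_symbols(n_dl: int) -> int:
    """Equation (3.3): the remaining symbols of the 14-symbol slot are uplink."""
    if not 0 <= n_dl <= 14:
        raise ValueError("a slot holds 14 symbols")
    return 14 - n_dl

# The simulation used mu = 1: 30 kHz SCS, 0.5 ms slots; e.g. 7 DL and 7 UL symbols.
print(scs_khz(1), slot_duration_ms(1), uplink_symbols(7))
```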

In the simulator, the SNR target for each user is defined by a curve. This curve maps the user's path loss to the desired SNR target for the user. To construct the curve, the simulator takes in two values, SNRt,80dB and SNRt,90dB. These represent the desired SNR targets at 80 dB and 90 dB path loss, respectively. In addition to these values, the simulator was also set to define SNRt,70dB = 0 and SNRt,100dB = 0. Using these values, the simulator uses an algorithm to fit a series of Bézier curves to construct a curve that follows the set points. In Figure 3.3 an example of one such curve is shown.


Figure 3.2. Figure showing the relation of SCS and slot duration with different numerologies in 5G NR.

Figure 3.3. Example of a Bézier curve fit for the values SNRt,80dB = 10 dB and SNRt,90dB = 3 dB.


Figure 3.4. Example of the frame allocation in the scheduler showing the two bins dividing the frame and the number of DL and UL symbols per slot in each.

3.1 Simulator and scheduling

The simulations were completed using WINTERsim [35]. WINTERsim has been previously used to research offloading cellular traffic onto device-to-device connections [36].

The scheduler used in the simulation was a binned binary space partitioning scheduler. It operates in two stages: the binning stage and the scheduling stage. In the binning stage, each user that has pending DL or UL data requests is assigned to a bin based on their DL/UL ratio. Each bin contains a scheduler with a set DL/UL ratio determined by NsymbDL. The goal of the binning is to ensure that the users are allocated according to their needs. If a user has high UL requirements but low DL requirements, the user is allocated in a part of the frame where the slots contain more UL symbols than DL symbols.

Once the users have been binned, the bins are allocated in the frame. Each scheduler is given a number of slots based on how much data its users have requested. This is done for efficiency. In Figure 3.4, the binning of a frame is visualized.

With the schedulers allocated, the amount of space allowed for each user is calculated. For this purpose, scheduler greediness G is used to assign each user a weight wu according to (3.2). The scheduler normalizes the weights with

wu,norm = wu / ΣW,    (3.4)

where W is a vector of all the weights of the users being scheduled. This normalized weight wu,norm then represents the ratio of the resource space that the user is entitled to. This is converted to an amount of resource blocks in relation to how many resource blocks are available for the scheduler.

Figure 3.5. Example of the frame allocation showing individual users scheduled in different parts of the frame. The color indicates how accurately the user was allocated, with blue representing that the user was over-allocated and red representing that the user was under-allocated.

Once the amount of allowed resource blocks for each user has been calculated, the scheduler performs binary space partitioning (BSP) on the resource space. Using it, the scheduler aims to allocate each user as small an amount of space as possible without under-allocating them. With BSP, the scheduler divides the space in half repeatedly, alternating between dividing it along the time and frequency axis. Once it has found an appropriate space for the user, it assigns that user into that space and continues to the next user. The scheduler repeats this until it either runs out of space or runs out of users.

An example result of scheduling is shown in Figure 3.5. On it, the color represents over- (blue) and under-allocation (red) of each user.
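A heavily simplified sketch of the halving logic for a single user is shown below. The Region type, the splitting policy, and the example numbers are assumptions made for illustration; the WINTERsim scheduler additionally tracks the remaining free space across users, bins, and per-slot symbol patterns.

```python
from dataclasses import dataclass

# Simplified sketch of the BSP idea: halve the resource space, alternating
# between the time and frequency axes, as long as the smaller half still
# covers the user's resource block budget.

@dataclass
class Region:
    slots: int   # extent along the time axis
    rbs: int     # extent along the frequency axis

    @property
    def size(self) -> int:
        return self.slots * self.rbs

def allocate(region: Region, demand: int, split_time: bool = True) -> Region:
    while region.size > demand:
        if split_time and region.slots > 1:
            half = Region(region.slots // 2, region.rbs)
        elif region.rbs > 1:
            half = Region(region.slots, region.rbs // 2)
        else:
            break                         # cannot split further
        if half.size < demand:
            break                         # halving again would under-allocate
        region = half
        split_time = not split_time       # alternate time/frequency splits
    return region

# Example: a user entitled to 40 resource elements in a 10-slot by 20-RB space
print(allocate(Region(10, 20), demand=40))   # Region(slots=5, rbs=10)
```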

The parameters used for the simulation are listed in Table 3.1. The simulation was run on a cluster of seven base stations lifted 10 m above the users. The simulations ran for 60 seconds for each of the 34 560 parameter combinations. The users walked randomly around the simulation area. Every simulation second, the simulator checked that each user was within the cell. If a user had walked outside of the cell, a random point near the closest base station was chosen as the user's new destination and they were directed towards it for the next few seconds. As the simulation ran only for 60 seconds and the user speeds were low, the users generally did not walk far from the base stations, so the effects of users walking outside of the cell were minor.

Parameter name            Value(s)           Parameter description
BS h                      10 m               Height of each base station
BS θ                      10                 Downtilt of each base station's antenna
BWsys                     40.32 MHz          System bandwidth
Cluster size              7                  Number of cells in the cluster
Deployment size           50 m               UE deployment distance from each base station
fcarrier                  2 GHz              Base carrier frequency
µ                         1                  Numerology
Packet size               1000 b             Size of each sent packet
Scheduler bins            1                  Number of bins in the scheduler
SCS                       30 kHz             Subcarrier spacing
Sectors                   3                  Number of sectors per base station
Sim duration              60 s               Duration of the simulation
UE h                      0 m                Height of travel of all users
UE collision check rate   10 s^-1            How often the simulation checked that users do not collide with each other
UE max Ptx                30 dBm             User maximum transmit power
UE min Ptx                −70 dBm            User minimum transmit power
UE per cell               5                  Number of users per cell
UE velocity               1 m/s to 2 m/s     Randomized user speed range
UE position check rate    1 s^-1             How often the users' positions were checked to ensure that they do not wander outside the simulation area

Table 3.3. List of static simulation parameters.


Figure 3.6. Throughput data from the simulation with varying scheduler greediness and DL transmit power.

Figure 3.7. Throughput data from the simulation with varying DL transmit power and SNRt,80dB.

3.2 Training environment

To construct the training environment for the agent, the data from the simulation runs was examined. In Figures 3.6 and 3.7, the data is shown in 2-dimensional slices. From the figures, it can be observed that the data contains one or several high points or areas, and the further the parameters are tuned away from those areas, the lower the throughput is.

Using this, a simple model was devised that takes these characteristics into account. In the model, each parameter is simplified into a single real number between 0 and 1, where 0 corresponds to the smallest allowed value for the parameter and 1 to the largest. The parameters used in the model are referred to as the dimensions of the model. This creates a ν-dimensional cube where each point inside it represents a specific configuration of parameters. This bounded area is henceforth referred to as the parameter space of the environment. For simplicity, during training the value of ν was largely either 1 or 2. This was to make it easier to visualize the internal state of the environment.

Within the parameter space there are Nmax points Hν called hotspots. Hotspots represent local maximums in throughput in the parameter space. Each hotspot has a weight wm ∈ R : [0, 1] associated with it, representing the throughput at the hotspot.

For the rest of the points in the parameter space, the throughput is calculated using a throughput function. A throughput function uses the distance rm from a given point to a hotspot to calculate the throughput TH at that point. This throughput value is calculated with respect to each hotspot in the environment, and the highest value is chosen as the point's throughput. For the model, three different throughput functions were made to allow for customization: the linear, parabolic, and inverse square throughput functions.

The linear throughput function (3.5) calculates the throughput linearly based on the distance to the hotspot. The function normalizes rm by √ν to scale the distance to a value within [0, 1], which clamps the output of the function within [0, wm]:

TH = wm − wm · rm / √ν.    (3.5)

If the linear throughput function is plotted in ν = 1, as shown in Figure 3.8, it is trivial to see that the hotspot forms a sharp angle, a discontinuity in the derivative. This happens across all values of ν. To have a function that behaves similarly but does not have such a discontinuity at the point of the hotspot, a parabolic throughput function (3.6) was created. Notable about it is that the derivative of the function becomes positive after rm = 5/3. The maximum value of rm in a ν-dimensional parameter space is √ν. Because of this, for ν > 2 the throughput function can result in the throughput increasing instead of decreasing over distance. As all training runs were completed with ν ≤ 2, this has no effect on the behavior:

TH = wm − wm · rm^2 + (wm / 2.5) · rm^3.    (3.6)

The third throughput function used in the tests was the inverse-square throughput function (3.7). It was made to emulate the inverse-square law, with the maximum value of w_m at r_m = 0. In it, λ is an adjustable exponent that can be tuned to fit the parameter space

T_H = \frac{w_m}{(r_m + 1)^{\lambda}}.    (3.7)
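To make the three functions concrete, the sketch below implements (3.5)–(3.7) together with the max-over-hotspots rule described above. The function and variable names are chosen here for illustration only and do not come from the actual model implementation.

```python
import numpy as np

def linear_tp(w, r, nu):
    # Linear throughput function (3.5): throughput falls off linearly with
    # distance, normalized by the maximum possible distance sqrt(nu).
    return w - w * r / np.sqrt(nu)

def parabolic_tp(w, r):
    # Parabolic throughput function (3.6): smooth at the hotspot; its
    # derivative turns positive after r = 5/3, which is harmless for nu <= 2.
    return w - w * r**2 + (w / 2.5) * r**3

def inverse_square_tp(w, r, lam=2.0):
    # Inverse-square throughput function (3.7) with a tunable exponent lambda.
    return w / (r + 1.0) ** lam

def throughput_at(point, hotspots, weights, tp_func):
    # Evaluate the chosen throughput function against every hotspot and
    # keep the highest value as the throughput of the point.
    r = np.linalg.norm(hotspots - point, axis=1)
    return float(np.max(tp_func(weights, r)))

# Example: a 2-dimensional parameter space with two hotspots.
hotspots = np.array([[0.2, 0.8], [0.7, 0.3]])
weights = np.array([1.0, 0.6])
point = np.array([0.5, 0.5])
print(throughput_at(point, hotspots, weights, lambda w, r: linear_tp(w, r, nu=2)))
```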

Figure 3.8. Example of a sharp angle at x = 0.4 in the linear throughput function (3.5).

By placing hotspots in the environment and adjusting their weights, the training environment can create a simplified representation of the real data. An example of this is shown in Figure 3.10, which was created to resemble the real data shown in Figure 3.9.

3.3 Agent training process

For training, neuroevolution of augmenting topologies (NEAT) [29] was chosen as the algorithm. NEAT is a genetic algorithm that generates a population of ANNs, called genomes, and improves them over several generations. In each generation, the genomes are evaluated to test their ability to solve the problem in question. Genomes that perform poorly are eliminated, and the genomes that perform well are used to mutate a new population of genomes. Over many generations, this leads to the population becoming increasingly capable of solving the problem. The specific implementation of NEAT used for the training process was a Python-based library called NEAT-python, the code for which can be found on GitHub [37].
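For orientation, the snippet below shows the general NEAT-python workflow of loading a configuration file, building a population, and running it with an evaluation callback. The configuration file name and the toy fitness assignment are placeholders for illustration only; they are not the training setup used in this work.

```python
import neat  # NEAT-python library

def eval_genomes(genomes, config):
    # NEAT-python calls this once per generation with the whole population.
    for genome_id, genome in genomes:
        net = neat.nn.FeedForwardNetwork.create(genome, config)
        # Placeholder objective: reward outputs close to 0.5 for a fixed input.
        output = net.activate([0.3, 0.7])
        genome.fitness = -abs(output[0] - 0.5)

# Assumes a config file ("neat_config.ini" here is a placeholder) that defines
# the population size, number of inputs/outputs, mutation rates, and so on.
config = neat.Config(neat.DefaultGenome, neat.DefaultReproduction,
                     neat.DefaultSpeciesSet, neat.DefaultStagnation,
                     "neat_config.ini")
population = neat.Population(config)
winner = population.run(eval_genomes, 300)  # stop after at most 300 generations
```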

The training process started by generating a population of 200 agents. Depending on the run, these agents were either generated randomly by NEAT-python or created by taking an already existing agent and mutating it repeatedly until the full population was filled. This produced the first generation of agents, which was then tested.


Figure 3.9. Data from the simulation.

Figure 3.10. Visualization of the training environment replicating Figure 3.9. Red points are hotspots. By adjusting the weights and locations of hotspots, the training environment can be tuned to replicate the real data to a high degree.


To test an agent, it is placed in a training environment at a random position in the parameter space. The agent is then given its current position and current throughput value as input. Its output is a ν-dimensional vector representing the step to take in the parameter space. The action is applied to the environment, resulting in a new position in the parameter space. This process is repeated for 50 steps, after which the run ends.
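A single test run can be sketched as in the code below, where the policy receives the current position and throughput and returns a ν-dimensional step that is applied to the position. The interfaces and the toy random-walk policy are illustrative assumptions, not the thesis implementation.

```python
import numpy as np

def run_episode(policy, throughput_fn, start, steps=50):
    # One evaluation run: record the throughput at each visited position.
    position = np.asarray(start, dtype=float)
    throughputs = []
    for _ in range(steps):
        tp = throughput_fn(position)
        throughputs.append(tp)
        step = np.asarray(policy(position, tp))         # nu-dimensional action
        position = np.clip(position + step, 0.0, 1.0)   # stay inside the unit cube
    return throughputs

# Toy usage: a random-walk policy in nu = 2 with a single hotspot at (0.5, 0.5).
tp_fn = lambda p: max(0.0, 1.0 - float(np.linalg.norm(p - 0.5)))
policy = lambda pos, tp: np.random.uniform(-0.05, 0.05, size=2)
print(run_episode(policy, tp_fn, start=np.random.uniform(0.0, 1.0, size=2))[-5:])
```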

To diminish the effects of randomness, each agent was evaluated in 50 different environments in each generation. If the agents were tested in only one environment, the global maximum might simply happen to be in a location that is particularly easy for the agent to find. To further ensure that no agent gains a significant benefit over another by receiving a set of easier environments, every agent was given the same set of 50 environments in each generation.

After the agent has completed its 50 environment runs, each of the runs is evaluated using a fitness function. The fitness function takes the vector of all the throughput values that the agent achieved during the run and returns a fitness value Φ representing the agent's performance in that environment. In effect, the fitness function in NEAT-python is the objective function. The details of choosing a fitness function for the runs are discussed further in Subsection 3.3.1.

Once each of the agents had completed its 50 environment runs and the run results had been evaluated, the results were aggregated by taking the average of the fitness function values; this average is the agent's true fitness value. The fitness values of the different agents were then given to NEAT-python to process. NEAT-python used those values to keep the best 20% of the agents and eliminate the rest, after which it mutated the remaining agents, creating new ones until it had a new generation of 200 agents.
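Combined, the per-generation evaluation could look roughly like the following sketch: one shared set of 50 environments is drawn, every agent is run in all of them, and the mean of the per-run fitness values becomes the agent's fitness. The helper callables are stand-ins for the pieces described above and are not the actual code.

```python
import numpy as np

def evaluate_population(policies, make_environment, run_in_env, fitness_of_run,
                        n_envs=50):
    # Draw one shared set of environments so that no agent gets an easier set.
    shared_envs = [make_environment() for _ in range(n_envs)]
    fitness = {}
    for agent_id, policy in policies.items():
        per_run = [fitness_of_run(run_in_env(policy, env)) for env in shared_envs]
        fitness[agent_id] = float(np.mean(per_run))  # the agent's true fitness
    return fitness

# Toy usage with stand-in helpers that return random throughput traces.
result = evaluate_population(
    policies={"agent_a": None, "agent_b": None},
    make_environment=lambda: None,
    run_in_env=lambda policy, env: list(np.random.uniform(0.0, 1.0, 50)),
    fitness_of_run=lambda tps: float(np.mean(tps)),
)
print(result)
```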

This entire process is then repeated. There are two ways the repetition can end. The first is that the fitness values of the population reach a certain, configurable threshold. The second is that the algorithm reaches a configured number of generations, at which point it simply stops. When the algorithm completes, it returns the agent with the highest fitness value as the winning agent, and the training process is complete.

3.3.1 Comparing different fitness functions

The simplest kind of fitness function for the problem is (3.8), which takes the vector of k throughput values T and calculates their mean. This fitness function always has a value range of [0, w_max], where w_max is the highest weight of any hotspot. That point is also the global maximum of the parameter space. This is an intuitive way to observe the performance of the agent, as it puts the performance on a simple linear scale whose maximum value is easy to see from the hotspot weights. The downside of the function, however, is that it can yield high fitness values simply because the agent happens to spawn close to the global maximum. It also does not measure how effective the agent is at seeking the global hotspot, only how high the achieved throughput was on average

\Phi = \frac{1}{k} \sum_{n=1}^{k} T_n.    (3.8)
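A minimal sketch of (3.8), assuming the run's throughput values are available as a Python list, could be:

```python
def mean_fitness(throughputs):
    # Fitness (3.8): plain average of the k throughput values of the run.
    return sum(throughputs) / len(throughputs)
```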

To measure how well the agent seeks the global maximum, the total improvement in the system can be measured instead. For this purpose, (3.9) calculates how much the agent has improved the environment as a whole. In it, the numerator T_k − T_1 represents the total improvement the agent made to the system over the course of the run, calculated as the throughput difference between the first and last step. The denominator w_max − T_1, on the other hand, represents the maximum amount of improvement the agent could have achieved in that run. By calculating the ratio of these two values, issues that arise from agents spawning at different distances from the maximum are diminished. For example, an agent that spawns relatively close to the global maximum can only improve the throughput by a limited amount even with perfect behavior, whereas an agent that spawns far away from the maximum can improve the throughput significantly more, even if both agents are equally capable of finding the global maximum

\Phi = \frac{T_k - T_1}{w_{max} - T_1}.    (3.9)
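A corresponding sketch of (3.9) is shown below; note that a real implementation would also need to handle the corner case where the agent spawns exactly at the global maximum (T_1 = w_max).

```python
def improvement_fitness(throughputs, w_max):
    # Fitness (3.9): achieved improvement (T_k - T_1) divided by the maximum
    # improvement that was possible from the starting point (w_max - T_1).
    return (throughputs[-1] - throughputs[0]) / (w_max - throughputs[0])
```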

One issue with (3.9) is that it only takes the last step into account. An agent could take very erratic steps and only end up near the global maximum on its last step; in such situations, the fitness function does not penalize the agent for the other steps. Function (3.10) exists as a middle ground between the two: it calculates a sum across all of the throughput values but places a higher weight on the most recent ones. This way an agent has some time to get close to the global maximum before its performance starts to have a strong impact on its fitness. This makes staying near the global maximum desirable, while also making the effects of unfavorable starting positions less significant

\Phi = \frac{1}{k} \sum_{n=1}^{k} T_n \left(\frac{n}{k}\right)^{1.5}.    (3.10)
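The weighted average of (3.10) can be sketched in the same style:

```python
def weighted_fitness(throughputs):
    # Fitness (3.10): average of the throughput values weighted by (n/k)^1.5,
    # so that later steps of the run contribute more to the fitness.
    k = len(throughputs)
    return sum(t * (n / k) ** 1.5 for n, t in enumerate(throughputs, start=1)) / k
```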

3.3.2 Fitness function evaluation

The purpose of the fitness function is to tell which agents are performing well in the task. To gain a better understanding of how each of the fitness functions measures
