
Computational Engineering Technomathematics

Vilma Mäkitalo

PREDICTING IMBALANCE POWER PRICE

Master’s Thesis

Examiners: D.Sc. (Tech.) Matylda Jablonska-Sabuka and M.Sc. (Tech.) Ari Suutari

Supervisors: M.Sc. (Tech.) Tommi Siponen, D.Sc. (Tech.) Toni Kuronen, D.Sc. (Tech.) Matylda Jablonska-Sabuka, Professor Lasse Lensu and Ari Rasimus


Lappeenranta-Lahti University of Technology LUT School of Engineering Science

Computational Engineering Technomathematics

Vilma Mäkitalo

PREDICTING IMBALANCE POWER PRICE

Master’s Thesis 2021

58 pages, 31 figures, 7 tables, 2 appendices.

Examiners: D.Sc. (Tech.) Matylda Jablonska-Sabuka and M.Sc. (Tech.) Ari Suutari

Keywords: predicting, electricity market, machine learning, SARIMAX, deep learning

The electricity market is a complex multi-layer system that is designed to balance power consumption and production at every single moment. This thesis focuses on the lowest level of the Finnish electricity market, called the reserve, and more precisely on its imbalance power. The objective was to utilize a machine learning method that accurately predicts the price of the imbalance power in the short term. The method was chosen based on a literature review of recent neural network studies. Several different model settings and hyperparameters were experimented with, and the method was compared with a traditional time series model with exogenous variables representing both the electricity market itself and the weather. The machine learning method performed better than the traditional method. The results showed that even though the chosen machine learning approach resulted in smaller errors than the benchmark approach, none of the models was able to predict the imbalance power price accurately. The inaccurate predictions may have been a result of missing relevant exogenous factors or of the modeled phenomenon being too stochastic.


Lappeenranta-Lahti University of Technology LUT, School of Engineering Science

Computational Engineering, Technomathematics

Vilma Mäkitalo

Predicting the price of consumption imbalance power

Master's Thesis 2021

58 pages, 31 figures, 7 tables, 2 appendices.

Examiners: D.Sc. (Tech.) Matylda Jablonska-Sabuka and M.Sc. (Tech.) Ari Suutari

Keywords: forecasting, electricity market, machine learning, SARIMAX, deep neural network

The electricity market is a complex, multi-layered system designed to continuously balance energy consumption and production. This thesis focuses on the lowest layer of the Finnish electricity market, the reserve, and in particular on consumption imbalance power. The objective was to deploy a machine learning method that can accurately predict the price of consumption imbalance power in the short term. The method was chosen based on a literature review of recent neural network studies. Several different model settings and hyperparameters were examined, and the method was compared with a traditional time-series model. Numerous electricity market and weather variables were used as exogenous inputs to the models. The machine learning method performed better than the traditional method. The results show that although the chosen machine learning method produced smaller errors than the traditional one, none of the models was able to predict the price of consumption imbalance power accurately. The inaccurate predictions may have resulted from missing relevant exogenous variables or from the inherent randomness of the price behaviour.


I want to thank Ari Rasimus, Ari Suutari, Lasse Lensu, Matylda Jablonska-Sabuka, Tommi Siponen and Toni Kuronen for helping with this thesis. Also thanks to Syncron Tech Oy for giving me the opportunity to make this thesis a job and providing the supportive and interesting workplace I have been happy to be part of.

Thanks to my guild Lateksii and to the student union LTKY for providing so much joy and opportunities to learn and grow during these years.

But most importantly, thanks to me. Because of my superpower and greatest weakness of being able to push myself even when unmotivated or tired, this thesis is complete and precisely on time.

Lappeenranta, May 27, 2021

Vilma Mäkitalo


CONTENTS

1 INTRODUCTION 9

1.1 Background . . . 9

1.2 Objectives and delimitations . . . 10

1.3 Structure of the thesis . . . 10

2 FINNISH ELECTRICITY MARKET 11

2.1 Market platform . . . 11

2.2 Day-ahead market . . . 11

2.3 Intraday market . . . 13

2.4 Imbalance power reserve . . . 13

3 TIME-SERIES FORECASTING 15

3.1 Methods . . . 15

3.2 Comparison . . . 18

4 IMPLEMENTED METHODS 20

4.1 SARIMAX . . . 20

4.2 DeepAR . . . 22

5 EXPERIMENTS 25

5.1 Data . . . 25

5.2 Evaluation criteria . . . 31

5.3 Description of experiments . . . 32

5.4 Results . . . 37

5.4.1 SARIMAX validation results . . . 37

5.4.2 DeepAR validation results . . . 41

5.4.3 Test results . . . 45

5.4.4 Convergence of training . . . 48

6 DISCUSSION 50

6.1 Current study . . . 50

6.2 Future work . . . 51

7 CONCLUSION 52

REFERENCES 53

APPENDICES

Appendix 1: Plots of the test period


Appendix 2: Factors used in the models


LIST OF ABBREVIATIONS

aFRR automatic frequency restoration reserve
AIC Akaike Information Criterion
ANN Artificial Neural Network
API Application Programming Interface
AR Autoregressive
CNN Convolutional Neural Network
CRS Competitive Random Search
DA-RNN Dual-stage Attention-based RNN
DeepAR Autoregressive Recurrent Network
DeepTCN Deep Temporal Convolutional Network
DLM Dynamic Linear Model
DNN Dense Neural Network
DSANet Dual Self-Attention Network
DTW Dynamic Time Warping
EA-LSTM Evolutionary Attention-based LSTM
FCR-D frequency controlled disturbance reserve
FCR-N frequency controlled normal reserve
FFNN Feed-Forward Neural Network
GRU Gated Recurrent Unit
LASSO Least Absolute Shrinkage and Selection Operator
lightGBM Gradient Boosting Tree Method
LSTM Long Short-Term Memory
LSTNet Long- and Short-Term Time Series Network
MA Moving Average
MAE Mean Absolute Error
mFRR manual frequency restoration reserve
MI-LSTM attention-based Multi-input LSTM
MLP Multilayer Perceptron
MLR Multivariate Linear Regression
MSE Mean Squared Error
NN Neural Network
RF Random Forest Algorithm
RNN Recurrent Neural Network
SARIMA Seasonal Autoregressive Integrated Moving Average
SARIMAX Seasonal Autoregressive Integrated Moving Average with Exogenous factors
SVR Support Vector Regression


TPA Temporal Pattern Attention Mechanism
TSO Transmission System Operator


1 INTRODUCTION

1.1 Background

Most parts of the world are divided into electricity market areas. The purpose of the markets is to balance the production and consumption of electricity so that there is never a surplus or a deficit. The ratio between production and consumption also determines the frequency of the electricity grid. The physical electricity distribution network cannot handle excessively large changes in the frequency, which is another reason for the markets to exist.

The electricity markets resemble stock markets. Electricity markets around the world share the same main principles, although there are some differences due to, for example, varying production types and climates. This thesis focuses on the Finnish electricity markets operated by Nord Pool and the reserve markets operated by Fingrid. [1]

The market has different levels and mechanisms to cope with the volatile and unpredictable nature of electricity consumption and production. The unpredictability is caused, for example, by the weather. The weather affects consumption: when it is hot, apartments and facilities need air conditioning, and when it is cold, they need heating.

Moreover, production is affected mostly through renewable sources such as wind and solar, whose output depends on the weather. The volatility of the electricity market has been increasing as the amount of production from renewable sources has grown [2].

The last and most dynamic level of the markets is the reserve market, which has multiple products. It allows significant balancing of production or consumption within seconds. For example, unplanned downtime in large production plants or factories can stop their production and consumption almost instantly and cause a major surplus or deficit. The reserve market is usually needed during these kinds of situations, and therefore the price of electricity in this market is very volatile. The reserve market can also offer smaller amounts of balancing power, for example in situations where the weather changes rapidly compared with earlier forecasts. [3]

If the balancing price could be predicted, the participants that offer their production or consumption to the reserve market could be better prepared for major changes and place their bids more accurately. This would increase their production efficiency and give them economic benefits. On the other hand, the predictions may suggest when the consumption and production forecasts should be made with extra care.


Not many modern machine learning methods have been used for predicting the balancing price, even though such methods have performed well on similarly volatile problems compared with traditional time-series methods. For example, the day-ahead price of the market is quite well studied, and multiple methods, both modern and traditional, are used to predict it. Examples of day-ahead price forecasting can be found in [2], [4] and [5].

1.2 Objectives and delimitations

The main objective of this thesis is to utilize a modern machine learning method to accurately predict the price of the consumption imbalance electricity. The secondary objectives are to study the importance of factors affecting the price and to compare the performance of the machine learning method with a traditional benchmark method.

There are a couple of delimitations. Firstly, I want to implement a complete open-source prediction model that does not need major modifications and is compatible with existing Application Programming Interfaces (APIs). Secondly, the method must have a license that allows commercial usage so that the predictions can be a part of a company’s product.

1.3 Structure of the thesis

Chapter 2 gives a brief description of how the electricity market of Nord Pool functions.

Different methods of forecasting time-series are briefly presented and compared in Chapter 3. The methods chosen as the benchmark method and the main method are introduced in detail in Chapter 4. Chapter 5 presents the data, evaluation criteria, experiments, and results. Chapter 6 evaluates the results and discusses possible future work. Finally, conclusions are presented in Chapter 7.


2 FINNISH ELECTRICITY MARKET

2.1 Market platform

The majority of Finland’s electricity trades are made on the electricity markets of Nord Pool, but Finnish producers and consumers can also participate in the day-ahead and intraday markets operated by EPEX SPOT [6]. This thesis focuses only on the markets of Nord Pool. Nord Pool provides an electricity exchange for the Nordic, Baltic, and Central Western European countries and the UK [1]. The market is divided into a day-ahead market and an intraday market, which both serve their purposes in balancing electricity production and consumption.

The whole market area of Nord Pool is divided into smaller areas, each of which has its own Transmission System Operator (TSO) that owns the main power grid of the area.

Finland’s TSO is Fingrid Oy [3]. The TSO has the main responsibility for ensuring that consumption and production meet in its area. The TSO can split its area into smaller parts that each have a balancing coordinator, who is responsible for balancing the consumption and production of that part. A balancing coordinator can be, for example, the owner of a large power plant.

In Finland, there is another electricity market, held by Fingrid Oy, called the reserve market [3]. There are multiple products in the reserve market, but all of them serve to maintain the balance between electricity production and consumption when the regular electricity market cannot handle it. Examples of the products are the frequency containment, fast frequency, automatic frequency restoration, and imbalance power reserves. This thesis focuses on the imbalance power reserve, which is also called the manual frequency restoration reserve (mFRR). Nord Pool’s and Fingrid Oy’s markets are connected, since the imbalance power price is based on the day-ahead price.

2.2 Day-ahead market

The day-ahead market is open every day of the year [1]. The participants, both electricity producers and consumers, can place their blind bids for the next day’s hours before 12:00 CET on the preceding day. There are multiple bidding types that can affect one or more hours, but all the types have maximum or minimum prices and capacities depending on whether the participant is selling or buying electricity.


Each TSO can split their area into bidding areas, but it is not mandatory [1]. Finland, for example, is a single bidding area. After the day-ahead market for the next day is closed, an area price for each hour is calculated individually for each bidding area based on the bids of the area. An area can have a surplus of electricity and a low price, or a deficit of electricity and a high price. Electricity is moved from a surplus area to a deficit area and the area prices level out, as in Figure 1. If there is enough transmission capacity between the areas, the prices become equal and the areas form one price area.

Figure 1. The left figure represents the price calculation of a surplus area and the right figure the same for a deficit area. If there is enough transmission capacity, the prices will be equal. [1]

After the area prices are calculated, one system price is calculated for the whole market area [1]. This price is calculated assuming infinite transmission capacities. Figure 2 shows an example of area prices and the system price. There are a couple of joined price areas; one of them is formed by Finland, Latvia, Lithuania, and Estonia.


Figure 2. An example of area prices and the system price on the 4th of February 2021. Screenshot from [1].

2.3 Intraday market

Since electricity consumption and production forecasts are not perfect, the day-ahead trades do not ensure the balance of electricity [1]. The intraday market is open around the clock every day of the year. There are different types of orders, similar to the day-ahead market, but in the intraday market the shortest orders last only 15 minutes. Intraday trades are not blind, and they usually close at least an hour before the delivery hour, except in Finland, where trades can be made until the delivery hour. The intraday market can adjust to small and rather slow changes caused, for example, by weather changes that affect the production or consumption of electricity.

2.4 Imbalance power reserve

Even the intraday market cannot cope with fast changes that often result from unplanned downtime. Electricity producers or consumers that can increase their production or consumption very fast can join the reserve maintained by Fingrid Oy [3]. Joining the reserve means that the participant offers a specific amount of their consumption or production to the balancing market. They are paid continuously for being part of the reserve and, if they are needed for balancing, for the production or consumption they provide.

The imbalance power reserve participants place their bids for upregulation or downregulation for each hour they are available, at least 45 minutes before the delivery hour. Upregulation means that the participant is ready to increase their production or decrease their consumption. Downregulation means the opposite: the participant is ready to decrease their production or increase their consumption. In the case of upregulation, the lowest suitable upregulation bids are used, and for downregulation, the highest suitable downregulation bids are used. When the necessary balancing power has been obtained, the price of upregulation or downregulation is set based on the highest or lowest used bid, and every participant whose bid is used gets that same price. However, the upregulation price must be at least the day-ahead price of the hour in that area, and the downregulation price must be at most the day-ahead price [3]. The entity that has made the forecasting error that needs to be balanced must pay the upregulation or downregulation price. The imbalance power price is the activated upregulation or downregulation price or, if there is no regulation, the day-ahead price.
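As a sketch, the pricing rule described above can be written as a small Python function; the function and variable names are illustrative only and not taken from any market API.

```python
def imbalance_price(day_ahead, up_bids=(), down_bids=()):
    """Toy sketch of the mFRR pricing rule described above.

    up_bids / down_bids are the prices (EUR/MWh) of the activated
    upregulation / downregulation bids for one delivery hour.
    """
    if up_bids:
        # Marginal pricing: every activated upregulation bid is paid the
        # highest activated bid, floored at the day-ahead price.
        return max(max(up_bids), day_ahead)
    if down_bids:
        # Downregulation: lowest activated bid, capped at the day-ahead price.
        return min(min(down_bids), day_ahead)
    # No regulation activated: the imbalance price is the day-ahead price.
    return day_ahead

# Upregulation hour: highest activated bid 62 EUR/MWh, day-ahead 50
assert imbalance_price(50.0, up_bids=[55.0, 62.0]) == 62.0
# Downregulation hour: lowest activated bid 45 EUR/MWh
assert imbalance_price(50.0, down_bids=[45.0, 48.0]) == 45.0
# No regulation: the day-ahead price applies
assert imbalance_price(50.0) == 50.0
```

Note how the floor and cap implement the rule that the upregulation price is at least, and the downregulation price at most, the day-ahead price.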


3 TIME-SERIES FORECASTING

3.1 Methods

The forecasting methods are usually divided into two groups: machine learning and statistical methods. The statistical methods are more traditional, and often they consider only linear or predefined nonlinear relations. They are widely used and researched. A Multivariate Linear Regression (MLR) and a Least Absolute Shrinkage and Selection Operator (LASSO) are examples of statistical methods. Machine learning is a more recent technique, and most of its methods can handle nonlinear relations and complex systems. Artificial Neural Networks (ANNs), Recurrent Neural Networks (RNNs), Long Short-Term Memory (LSTM) networks and Gated Recurrent Units (GRUs) are widely used machine learning methods.

RNNs use sequences such as time-series as inputs. Their main ability is to remember information from the previous timesteps so they can detect seasonal relations. RNNs use a hyperbolic tangent function to compute the output from the input and the output of the previous hidden layer. In [2], it was claimed that the main flaw of RNNs is that they suffer from exploding and vanishing gradients.

In [2], it was explained that a LSTM is based on a RNN and was created to overcome the problems of a RNN. Besides a hyperbolic tangent function, a LSTM uses three gates, called a forget gate, an input gate, and an output gate, to decide what information should be kept and what can be forgotten. The gates use a sigmoid as the activation function. First, the forget gate evaluates the significance of the input data and the output of the previous layer. Then the input gate determines how much the cell state will be updated and, finally, the output gate normalizes the cell state and forgets parts of the memory in order to produce the prediction of that timestep. In [2], a LSTM was used to predict day-ahead prices, intraday prices, and the gap between them. It was compared against a LASSO, a Support Vector Regression (SVR) and a Random Forest Algorithm (RF). The LSTM performed the best, but the LASSO was almost as good.
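The gating scheme described above can be sketched in a few lines of numpy. The weights here are random stand-ins for learned parameters, and the layout (one weight matrix per gate, acting on the concatenated input and previous hidden state) is a common but illustrative choice, not the exact formulation of [2].

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_h = 3, 4          # input and hidden sizes, chosen arbitrarily

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# One weight matrix per gate (f, i, o) plus the candidate cell update (c).
W = {k: rng.normal(scale=0.1, size=(n_h, n_in + n_h)) for k in "fioc"}
b = {k: np.zeros(n_h) for k in "fioc"}

def lstm_step(x, h_prev, c_prev):
    z = np.concatenate([x, h_prev])
    f = sigmoid(W["f"] @ z + b["f"])        # forget gate: what to drop
    i = sigmoid(W["i"] @ z + b["i"])        # input gate: how much to update
    o = sigmoid(W["o"] @ z + b["o"])        # output gate
    c_tilde = np.tanh(W["c"] @ z + b["c"])  # candidate cell state
    c = f * c_prev + i * c_tilde            # new cell state
    h = o * np.tanh(c)                      # new hidden state / output
    return h, c

h, c = np.zeros(n_h), np.zeros(n_h)
for x in rng.normal(size=(5, n_in)):        # run over a short sequence
    h, c = lstm_step(x, h, c)
assert h.shape == (n_h,) and np.all(np.abs(h) < 1.0)
```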

A GRU is also a variation of a RNN, but it contains only two gates, a reset gate and an update gate. The reset gate is responsible for defining how much of the previous values are kept; thus it corresponds to the forget and input gates of a LSTM. The update gate combines the memory values and the inputs. In [4], it was suggested that a GRU requires less computational resources than a LSTM, but it might not reveal relations as complex as the LSTM does. It was tested using the electricity prices of the Turkish markets and compared against multiple statistical methods, an ANN, a Convolutional Neural Network (CNN) and a LSTM.

In [7], an Evolutionary Attention-based LSTM (EA-LSTM) was presented, with an attention layer added to a LSTM model. In neural networks, attention layers are used to improve the information processing of the network by input selection and encoding. The weights of the attention layer are calculated with a Competitive Random Search (CRS), which is a form of evolutionary computation that takes the prediction errors as its inputs. In a CRS, the subspaces of the weights are first encoded as binary. Then the prediction errors are used to find the champion subspace, which is used to rebuild the weight space. Finally, the new weights are decoded to the original space and used in the attention layer. An EA-LSTM gives more attention to the most significant time periods when making the prediction and therefore reduces the attention distraction that can make a LSTM insufficient. The performance of an EA-LSTM was compared against a couple of statistical methods, a few variations of RNNs and a Dual-stage Attention-based RNN (DA-RNN).

An attention-based Multi-input LSTM (MI-LSTM) was proposed in [8]. It also adds an attention layer to a LSTM model. It includes new input gates that control how much the mainstream factor and the exogenous factors affect the cell state. The mainstream factor is used to control the weights of the exogenous factors, which causes the more important factors to have greater weights than the less important ones. This makes it better than a simple LSTM, since insignificant factors can make the predictions of a LSTM inaccurate.

In [8], a MI-LSTM was evaluated by forecasting stock prices. It was compared with multiple variations of a LSTM and a DA-RNN, and it outperformed them all.

In [5], a Dense Neural Network (DNN) model with embedded calendar information as part of the input was presented. Embedding calendar information means that, for example, hours and weekdays are presented as one-hot-encoded multidimensional arrays whose dimensions are then reduced. Cross-features are also used, since months differ in the amount of daylight, and weekdays and weekends have different hourly patterns of consumption. When forecasting short-term spot prices, a DNN model with calendar embedding seemed to outperform a LSTM. The logic of a DNN model with respect to the calendar information can be observed, which increases its reliability in practice.
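The calendar embedding idea can be illustrated as follows. The embedding matrices here are random stand-ins for matrices learned jointly with the network, the dimensions are arbitrary, and the weekend flag is only one simple example of a cross-feature.

```python
import numpy as np

rng = np.random.default_rng(1)

def one_hot(index, size):
    v = np.zeros(size)
    v[index] = 1.0
    return v

# Embedding matrices would be learned with the network; random stand-ins here.
hour_embed = rng.normal(size=(24, 4))       # 24 hours -> 4 dimensions
wday_embed = rng.normal(size=(7, 2))        # 7 weekdays -> 2 dimensions

def calendar_features(hour, weekday):
    h = one_hot(hour, 24) @ hour_embed      # embedded hour of day
    w = one_hot(weekday, 7) @ wday_embed    # embedded day of week
    weekend = 1.0 if weekday >= 5 else 0.0  # simple cross-feature-style flag
    return np.concatenate([h, w, [weekend]])

x = calendar_features(hour=17, weekday=5)   # Saturday, 17:00
assert x.shape == (4 + 2 + 1,) and x[-1] == 1.0
```

The one-hot vectors have 24 and 7 dimensions, but the features fed to the network have only 4 + 2 + 1, which is the dimension reduction described above.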

A ForecastNet [9] is a neural network that consists of multiple hidden cells that have their own inputs, hidden layers, and outputs. The output of each cell is a prediction for one timestep and a part of the input of the next cell. A ForecastNet can be modified by changing the model of the cells, for example to a DNN or to a CNN. It can also produce either a linear or a mixture density output. A ForecastNet was tested using ten different datasets that included one synthesized dataset and one electricity demand dataset. It was compared with different LSTMs, a CNN, a Seasonal Autoregressive Integrated Moving Average (SARIMA), a Multilayer Perceptron (MLP) and a Dynamic Linear Model (DLM). In eight out of ten cases, the ForecastNet performed the best. The greatest advantage of a ForecastNet is that it has a different weight for each timestep, so it adapts when forecasting different time periods.

A Deep Temporal Convolutional Network (DeepTCN) handles multivariate data with dilated causal convolutions and residual neural networks [10]. The architecture of the model includes an encoder, which has stacked dilated causal convolutions, a decoder, which has a residual neural network, and a dense layer, which forms the forecast. The model is flexible, since it can be used for pointwise and probabilistic forecasts, with or without parameters, and it can utilise exogenous variables. Using five different datasets, including an online retailer’s demands and shipments, electricity consumption, car lane occupancy and the demand of spare car parts, nonparametric and parametric versions of a DeepTCN were compared against a SARIMA, a Gradient Boosting Tree Method (lightGBM), an Autoregressive Recurrent Network (DeepAR), a DeepState and a SQF-RNN [10]. The retailer datasets were used with the SARIMA and the lightGBM, and the other datasets with the other models. Both DeepTCN models were superior on the retailer datasets. With the other datasets the results were not as clear, but the DeepTCN can be seen as a competent option among the presented machine learning models.

A Long- and Short-Term Time Series Network (LSTNet) tries to capture local patterns with CNNs, long patterns with GRUs and very long patterns with RNN-skip or attention layers [11]. It also contains a classic autoregression component as the last component of the model. RNN-skip and attention-layer versions of a LSTNet were compared against multiple variations of autoregressive models, a variation of a MLP model and a GRU model. The comparison was performed using four datasets containing data on road occupancy rates, solar power production, electricity consumption, and exchange rates. With the exchange rate dataset, the autoregressive models seemed to perform the best, but the LSTNets also had fairly good results. With the traffic and electricity data, the RNN-skip version outperformed the other models, while with the solar energy data, the attention-layer version was the most effective.

A Dual Self-Attention Network (DSANet) is divided into two convolutional components.


One of them, a local temporal convolution, is used to perceive the short-term patterns, and the other, a global temporal convolution, to perceive the long-term patterns [12]. In order to successfully catch the patterns, the global temporal convolution uses multiple filters. This allows parallel computing and modeling of long sequences, which is hard to achieve with RNNs. The local temporal convolution is the same as the global one, but it uses shorter filter lengths and a max-pooling layer. Both components also have a self-attention layer to advance the feature extraction. Moreover, similarly to a LSTNet, a DSANet also has an autoregressive model. A DSANet was utilized with a dataset of daily revenues of gas stations, and it was compared against several variations of autoregressive models, a GRU, two variations of a LSTNet and a Temporal Pattern Attention Mechanism (TPA) [12]. It outperformed the other models with all tested forecasting windows, but the results were quite close to each other.

A DeepAR is based on a LSTM [13]. It has multiple LSTMs and a likelihood model that produces the forecasts. The method of the likelihood model can be chosen based on the data, which gives the model flexibility. The model was tested using five datasets. The datasets contained data on electricity consumption, car lane occupancy, the demand of spare car parts, and two separate sets of an online retailer’s sales. The DeepAR was compared with a Croston, an ETS, a Snyder, an ISSM, and a couple of RNN variants when the parts and the retailer’s sales datasets were used. With the electricity consumption and the car lane occupancy datasets, the DeepAR was compared with a matrix factorization technique. The DeepAR outperformed the other methods on all the datasets.

3.2 Comparison

There are not many articles about forecasting intraday prices. In [14], a comparison was presented between a MLR, a LASSO, an ANN, a RNN, a LSTM, a GRU and a naive method when forecasting the intraday prices of the Turkish electricity market. The naive method assumed that the intraday price is the same as the corresponding day-ahead price. The inputs were a day-ahead price, a balancing market price, a forecast of the proportional renewable generation, a forecast of the ratio of demand and supply, and a trade value. The results suggested that the machine learning methods were more efficient than the statistical methods. Among the machine learning methods, the GRU performed the best. [14]

On the other hand, all of those methods are univariate. Multivariate models such as a DeepTCN, a DeepAR, a DSANet and a LSTNet were compared against a univariate Feed-Forward Neural Network (FFNN) model and two naive models in [15]. The comparisons were conducted using electricity consumption data from multiple customers and a mixed set of open power system data including electricity consumption, market prices and renewable power production. The comparison implied that the multivariate models, especially the DeepAR and the DeepTCN, had some advantage in short forecasting windows and with heterogeneous data. With a longer forecasting window or simpler data, they performed as well as or worse than the univariate models.

Even though a DeepAR was very computationally heavy in [15], it was chosen as the model to be used for the prediction, since it has good results also with electricity-related datasets and has complete implementations available. I have also used a calendar embedding similar to the one presented in [5].

A Seasonal Autoregressive Integrated Moving Average with Exogenous factors (SARIMAX) has been a very common benchmark method for time-series forecasting. It, or the SARIMA method, has been used for example in [4], [9] and [10]. Therefore, I have used it as my traditional benchmark method.


4 IMPLEMENTED METHODS

4.1 SARIMAX

SARIMAX is formed by combining Autoregressive (AR) and Moving Average (MA) models with differencing and exogenous factors [16]. All of its components are linear, so it is a linear regression model and thus quite light computationally.

The AR model captures the regression between the predicted variable and its previous values [16]. The prediction is made with a linear combination of the previous values. The previous values are also called lagged values or lags. The model can be written as

y_t = c + \phi_1 y_{t-1} + \phi_2 y_{t-2} + \cdots + \phi_p y_{t-p} + \varepsilon_t, (1)

where y_t is the prediction, the other y values are the lags, c and the \phi_i are parameters, p is the order of the model and \varepsilon_t is white noise, or the error. This model is usually referred to as the AR(p) model.

The MA model makes the prediction based on the errors of the lagged values [16]. This is also a linear combination model, but not a regression model in the usual sense, since the errors are calculated, not observed. The model can be written as

y_t = c + \varepsilon_t + \theta_1 \varepsilon_{t-1} + \theta_2 \varepsilon_{t-2} + \cdots + \theta_q \varepsilon_{t-q}, (2)

where the \theta_i are parameters, the \varepsilon_i are the errors and q is the order of the model. This model is commonly called the MA(q) model.

In this context, differencing is the act of making non-stationary data stationary [16]. This is useful since AR and MA models work best with stationary data. In the name SARIMAX, differencing is represented by the integrated (I) part, which is actually the reverse of differencing. The differencing can be done multiple times if first-order differencing is not enough to make the data stationary. The differences can be presented with the lag notation

L^i y_t = y_{t-i}, (3)

where i is the order of the difference. The notation is used since it can be treated using ordinary algebraic rules. The second-order difference, for example, can be calculated with

y''_t = y_t - 2 y_{t-1} + y_{t-2} = (1 - 2L + L^2) y_t = (1 - L)^2 y_t. (4)

Differencing, AR and MA models can be combined to form the ARIMA(p, d, q) model, where p is the order of the AR model, d is the degree of differencing and q is the order of the MA model. The ARIMA(p, 1, q) model can be written as

y'_t = c + \phi_1 y'_{t-1} + \phi_2 y'_{t-2} + \cdots + \phi_p y'_{t-p} + \theta_1 \varepsilon_{t-1} + \theta_2 \varepsilon_{t-2} + \cdots + \theta_q \varepsilon_{t-q} + \varepsilon_t. (5)

The general form of the model can be presented more easily with the lag notation of Equation (3) as

(1 - \phi_1 L - \cdots - \phi_p L^p)(1 - L)^d y_t = c + (1 + \theta_1 L + \cdots + \theta_q L^q) \varepsilon_t, (6)

where the first parenthesis represents the AR(p) model, (1 - L)^d y_t are the differences and the right-hand side is the MA(q) model.

Seasonality is added to the ARIMA model via an additional seasonal part, which is highly similar to the ARIMA model [16]. The only difference is that in the seasonal part the lags are not directly the previous values but the corresponding values from the previous sea- sons. The SARIMA model is usually presented with SARIMA(p, d, q)(P, D, Q)s where thep,d andqare for the ARIMA part, theP, DandQare for the seasonal part and the s represents the number of observations in a season. The seasonal part can be added to Equation (6) by simply multiplying the existing terms with the seasonal terms as follows

$$(1 - \phi_1 L - \cdots - \phi_p L^p)(1 - \Phi_1 L^s - \cdots - \Phi_P L^{sP})(1 - L)^d (1 - L^s)^D y_t$$
$$= c + (1 + \theta_1 L + \cdots + \theta_q L^q)(1 + \Theta_1 L^s + \cdots + \Theta_Q L^{sQ})\varepsilon_t. \qquad (7)$$

Finally, the exogenous factors are added to the model as

$$Y_t = y_t + \sum_{i=1}^{n} \sum_{j=1}^{r} \omega_{i,t+1-j}\, w_{i,t+1-j}, \qquad (8)$$

where $Y_t$ is the final prediction, $y_t$ is the result of the SARIMA model, $n$ is the number of exogenous factors, $r$ is the number of lags, the $\omega$s are constants and $w_{i,t+1-j}$ is the $i$th factor's value at time $t+1-j$ [17].


4.2 DeepAR

An ANN has one or more layers [4]. Each layer has at least one computational unit called a neuron. Every neuron can have multiple inputs and outputs, but at least one of each. In the traditional architecture, every neuron is connected to every neuron of the previous and the next layer; this is called a fully connected network.

Neurons can be represented with the function

$$Y = f\Big(\sum_{i}^{\text{Inputs}} (W_i x_i + b_i)\Big), \qquad (9)$$

where $Y$ is the neuron's output, $x$ is the neuron's input, $W$ is the weight vector of the inputs, $b$ is the bias and $f$ is the activation function that allows the inclusion of non-linearity.
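Equation (9) can be sketched directly in code. The following is an illustrative scalar implementation, not part of any library used in this work; the weights and inputs are arbitrary example values:

```python
import math

def neuron(x, w, b, f=math.tanh):
    """Single neuron as in Equation (9): Y = f(sum_i (w_i * x_i + b_i)).

    x, w, b are equal-length sequences; f is the activation function
    (here tanh, one common choice for introducing non-linearity).
    """
    return f(sum(wi * xi + bi for wi, xi, bi in zip(w, x, b)))

out = neuron(x=[0.5, -1.0], w=[0.8, 0.3], b=[0.1, 0.0])
```

With these example values the weighted sum is $0.8 \cdot 0.5 + 0.1 + 0.3 \cdot (-1.0) = 0.2$, so the output is $\tanh(0.2)$.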

An RNN is a Neural Network (NN) designed to handle time-dependent data [4]. This is achieved with loops that retain information between timesteps. The loop is in fact a hidden state that is updated after every timestep and affects the output of the neuron.

The general function for the hidden state $h_t$ is

$$h_t = \begin{cases} 0, & \text{if } t = 0 \\ \phi(h_{t-1}, x_t), & \text{otherwise}, \end{cases} \qquad (10)$$

where $\phi$ is a non-linear function. The update of the hidden state can be written as

$$h_t = g(W x_t + U h_{t-1}), \qquad (11)$$

where $g$ is a hyperbolic tangent function and $U$ is the recurrent weight matrix.

The LSTM solves the vanishing and exploding gradient problem by also storing the old hidden states of the neuron as memory content [4]. One LSTM neuron contains an input gate $i_t$, a forget gate $f_t$ and an output gate $o_t$. The LSTM's main output is the hidden state $h_t$, which is computed with

$$h_t = o_t \tanh(c_t), \qquad (12)$$

where $c_t$ is the new cell state. The output gate is defined as

$$o_t = \sigma(W_o x_t + U_o h_{t-1} + V_o c_t), \qquad (13)$$

where $\sigma$ is a logistic sigmoid function and $V_o$ is a diagonal matrix. $c_t$ is updated with

$$c_t = f_t c_{t-1} + i_t \tilde{c}_t, \qquad (14)$$


where $\tilde{c}_t$ is the memory content of the neuron, obtained with a hyperbolic tangent function as

$$\tilde{c}_t = \tanh(W_c x_t + U_c h_{t-1}). \qquad (15)$$

The forget gate $f_t$ and the input gate $i_t$ are responsible for defining how much old information is lost and how much new information is taken in during every timestep. The gates are defined as

$$f_t = \sigma(W_f x_t + U_f h_{t-1} + V_f c_{t-1}), \qquad i_t = \sigma(W_i x_t + U_i h_{t-1} + V_i c_{t-1}). \qquad (16)$$

The whole system is illustrated in Figure 3.

Figure 3. Illustration of an LSTM neuron. $f$ is the forget gate, $i$ is the input gate, $\tilde{c}$ is the memory content, $c$ is the new cell state and $o$ is the output gate. [4]
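Equations (12)-(16) can be traced with a single scalar LSTM step. The sketch below is a simplified scalar illustration (real implementations use vectors and matrices); the weight values are arbitrary, chosen only to make the computation concrete:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lstm_step(x, h_prev, c_prev, p):
    """One scalar LSTM step following Equations (12)-(16).

    p maps weight names to scalars; V* are the peephole terms
    (diagonal matrices in the vector form).
    """
    f = sigmoid(p["Wf"] * x + p["Uf"] * h_prev + p["Vf"] * c_prev)  # forget gate, Eq. (16)
    i = sigmoid(p["Wi"] * x + p["Ui"] * h_prev + p["Vi"] * c_prev)  # input gate, Eq. (16)
    c_tilde = math.tanh(p["Wc"] * x + p["Uc"] * h_prev)             # memory content, Eq. (15)
    c = f * c_prev + i * c_tilde                                    # new cell state, Eq. (14)
    o = sigmoid(p["Wo"] * x + p["Uo"] * h_prev + p["Vo"] * c)       # output gate, Eq. (13)
    h = o * math.tanh(c)                                            # hidden state, Eq. (12)
    return h, c

params = {k: 0.5 for k in ("Wf", "Uf", "Vf", "Wi", "Ui", "Vi", "Wc", "Uc", "Wo", "Uo", "Vo")}
h, c = lstm_step(x=1.0, h_prev=0.0, c_prev=0.0, p=params)
```

Because every gate passes through a sigmoid and the cell through a tanh, the returned $h$ and $c$ stay in a bounded range, which is what keeps gradients from exploding across many timesteps.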

DeepAR is an autoregressive RNN that is trained as a global model, making predictions based on the historical data of every time series in the dataset [13]. LSTM components are used to form the DeepAR model. The model is divided into two parts: an encoder and a decoder. The encoder processes the known period, called the conditioning range, and the decoder processes the prediction part, called the prediction range. Figure 4 shows the DeepAR architecture.

Figure 4. DeepAR model. The left side presents the encoder and the right side the decoder. The model is from [13]


The goal of the model is to model the conditional distribution

$$P(Z_{i,t_0:T} \mid Z_{i,1:t_0-1}, X_{i,1:T}), \qquad (17)$$

where $Z_{i,t}$ is the $i$th time series's value at time $t$, the conditioning range is $[1, t_0-1]$, the prediction range is $[t_0, T]$ and $X_{i,t}$ is the $i$th covariate at time $t$ [13]. Covariates are formed from the values of the exogenous factors and they must be known or manually estimated for every timestep. During training, the $Z$ values for both the conditioning and the prediction ranges must be given. When making predictions, however, the $Z$ values for the prediction range are unknown.

The model's distribution is assumed to be

$$Q_\Theta(Z_{i,t_0:T} \mid Z_{i,1:t_0-1}, X_{i,1:T}) = \prod_{t=t_0}^{T} Q_\Theta(z_{i,t} \mid Z_{i,1:t-1}, X_{i,1:T}) = \prod_{t=t_0}^{T} p(z_{i,t} \mid \theta(h_{i,t}, \Theta)), \qquad (18)$$

where $\Theta$ is a vector of parameters, $\theta$ is a function computing the parameters of the likelihood $p(z \mid \theta)$ and $h_{i,t}$ is the output of an autoregressive RNN

$$h_{i,t} = H(h_{i,t-1}, z_{i,t-1}, X_{i,t}, \Theta), \qquad (19)$$

where $H$ is a multi-layer RNN with LSTM cells [13]. The model is autoregressive, since Equation (19) uses the observation $z$ and the output $h$ of the previous timestep. The likelihood $p(z \mid \theta)$ should be chosen based on the statistical properties of the data. Examples of commonly employed models are Gaussian, negative-binomial, beta and Bernoulli likelihoods.

When training the model, the parameters are learned by maximizing the log-likelihood

$$\mathcal{L} = \sum_{i=1}^{N} \sum_{t=t_0}^{T} \log p(z_{i,t} \mid \theta(h_{i,t})), \qquad (20)$$

where $N$ is the number of time series and $T$ is the combined length of the conditioning and prediction ranges [13]. Equation (20) can be directly optimized by computing gradients with respect to $\Theta$, since all required values are observed.


5 EXPERIMENTS

5.1 Data

The data used in these experiments is from Fingrid's [18] and ENTSO-E's [19] open data APIs. All of the data can be fetched continuously and is free to use for commercial purposes. The data is hourly time series data from the beginning of 2018 until the end of 2020, with a total of 26301 samples. The first 15700 samples are used as the training set, the next 5300 samples as the validation set and the rest as the test set. Some of the time series contain missing values; these are replaced with the previous value of the time series.
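The missing-value handling described above is a simple forward fill. A minimal sketch (for illustration only; this is not the actual preprocessing code of the experiments):

```python
def forward_fill(series):
    """Replace each missing value (None) with the most recent observed value."""
    filled, last = [], None
    for v in series:
        if v is None:
            v = last
        filled.append(v)
        last = v
    return filled

print(forward_fill([1.2, None, None, 3.4, None]))  # [1.2, 1.2, 1.2, 3.4, 3.4]
```

Note that a gap at the very start of a series has no previous value to copy and would remain missing under this rule.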

All of the factors are listed in Appendix 2. Factors from Fingrid have ID numbers, while the factor from ENTSO-E has text as the ID. The most important factor is 92, since it is the imbalance power price to be predicted. There is a total of 44 factors, varying from temperatures to production volumes and regulations in different reserves. All of the factors are continuous except factor 209, which represents the state of the power system on a scale of 1-5.

In Figure 5, the whole three-year period of the imbalance power price is shown. The price has a couple of major peaks during this period; otherwise, it has many smaller spikes and some volatility. No obvious periodicity can be seen, and it is very rare for the price to be negative. In Figure 6 each of the three years of the price data is shown separately, and signs of minor annual seasonality can be observed. The end of May and the start of June seem to be quite volatile, as do the end of November and the start of December. July and August, on the other hand, are rather steady. In Figure 7 the data is divided monthly. It supports the observation that the seasons differ somewhat in volatility, but it also shows clear differences between the times of the day. In Figure 8 a total of 30 weeks is plotted in three separate plots. The start of 2020 seems to be much more volatile than the other years. The plots show that weekends and nights are much steadier than the daytime of weekdays.


Figure 5. The imbalance power price of the whole three-year period that is used in the experi- ments.

Figure 6.Separated three years of the imbalance power price.


Figure 7.The imbalance power price monthly separated.

Figure 8.The first 10 whole weeks of each year in the time period.


Two ways of handling the timestamps are examined. The first is to embed the timestamps as 39 binary factors. Hour information is divided into 6 factors, day-of-week information into 8 factors, one of which marks national holidays, and month information into 3 factors. In addition, 10 factors were generated for the combination of hour and month and 12 factors for the combination of day of week and hour. These generated factors are supposed to capture the changes in sunlight hours between seasons and the changes in human behavior between weekdays and weekends. Table 1 shows an example of the binary factors.

Table 1.Example of dividing the timestamp into binary factors. The used timestamp is 25.07.2019 16:00.

Factor Value Factor Value Factor Value

hour0-3 0 holiday 0 MH10 0

hour4-7 0 month1-4 0 WH1 0

hour8-11 0 month5-8 1 WH2 0

hour12-15 0 month9-12 0 WH3 0

hour16-19 1 MH1 0 WH4 0

hour20-23 0 MH2 0 WH5 1

monday 0 MH3 0 WH6 0

tuesday 0 MH4 0 WH7 0

wednesday 0 MH5 1 WH8 0

thursday 1 MH6 0 WH9 0

friday 0 MH7 0 WH10 0

saturday 0 MH8 0 WH11 0

sunday 0 MH9 0 WH12 0

The second way to handle the timestamps is to use one factor for the hour, one for the month, one for the day of the week, one for the combination of hour and month, and one for the combination of day of week and hour. In this way, the hour factor takes values 0-23, the month factor 1-12, the day-of-week factor 0-7, the hour-month combination factor 1-288, and the day-of-week-hour combination factor 1-192.
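The second encoding can be sketched as follows. The index formulas for the combined factors are assumptions chosen to reproduce the stated ranges (1-288 and 1-192), since the exact formulas are not spelled out:

```python
from datetime import datetime

def time_factors(ts, is_holiday=False):
    """Compact time encoding in the spirit of the second scheme.

    Returns (hour, month, day_of_week, month_hour, dow_hour).
    The combined-index formulas below are illustrative assumptions.
    """
    hour = ts.hour                            # 0-23
    month = ts.month                          # 1-12
    dow = 7 if is_holiday else ts.weekday()   # 0-6; holidays mapped to 7
    month_hour = (month - 1) * 24 + hour + 1  # 1-288 (12 months x 24 hours)
    dow_hour = dow * 24 + hour + 1            # 1-192 (8 day types x 24 hours)
    return hour, month, dow, month_hour, dow_hour

print(time_factors(datetime(2019, 7, 25, 16)))  # (16, 7, 3, 161, 89)
```

For the same timestamp as in Table 1 (25.07.2019 16:00, a Thursday), the encoding gives hour 16, month 7 and day-of-week 3.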

Factor 92 is shifted backward by three steps. This is done because most of the data is available with a delay of two hours; the shift ensures that in practice all of the data is available.

When validating and testing the models, the data set is divided into smaller sets with lengths corresponding to the prediction length. Because the future values of the exogenous factors are not available in practice, all exogenous data beyond the first timestep of the prediction period is shifted. The shifted data is retrieved from the corresponding hour of the previous day that is a weekday if the shifted day is a weekday, or a weekend day if the shifted day is a weekend day. For example, if Saturday 14:00 is predicted, the data is retrieved from the previous Sunday at 14:00.
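The day-type-matching shift rule can be sketched as a small search backwards in time (an illustrative reading of the rule, not the actual experiment code; holidays are ignored here):

```python
from datetime import datetime, timedelta

def shifted_source(ts):
    """Find the corresponding hour of the most recent earlier day whose
    weekday/weekend type matches that of ts."""
    is_weekend = ts.weekday() >= 5
    candidate = ts - timedelta(days=1)
    while (candidate.weekday() >= 5) != is_weekend:
        candidate -= timedelta(days=1)
    return candidate

# Saturday 14:00 falls back to the previous Sunday 14:00.
print(shifted_source(datetime(2020, 1, 11, 14)))  # 2020-01-05 14:00:00
```

A weekday simply falls back to the previous weekday, so a Wednesday uses Tuesday's values, while a Saturday skips the whole working week back to Sunday.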

The correlation matrices are calculated with the Kendall tau correlation coefficient. The coefficient is based on the concordance and discordance of sample pairs of the observed variables $X$ and $Y$ [20]. Concordance means that a pair follows the rule $x_i < x_j$ and $y_i < y_j$, or the rule $x_i > x_j$ and $y_i > y_j$. Discordance is the opposite: the pair follows the rule $x_i > x_j$ and $y_i < y_j$, or the rule $x_i < x_j$ and $y_i > y_j$. Ties are pairs that are neither concordant nor discordant, which means either $x_i = x_j$ or $y_i = y_j$. The coefficient is calculated with

$$\tau_n = \frac{c - d}{\sqrt{n(n-1)/2 - \sum t(t-1)/2}\,\sqrt{n(n-1)/2 - \sum u(u-1)/2}}, \qquad (21)$$

where $n$ is the number of observations, $c$ is the number of concordant pairs, $d$ is the number of discordant pairs, $t$ is the number of tied observations where $x_i = x_j$, and $u$ is the number of tied observations where $y_i = y_j$.
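Equation (21) can be implemented by direct pair counting. The sketch below is an $O(n^2)$ illustration (production code would use an optimized routine such as `scipy.stats.kendalltau`); counting tied pairs directly is equivalent to the $\sum t(t-1)/2$ and $\sum u(u-1)/2$ terms:

```python
from math import sqrt

def kendall_tau_b(x, y):
    """Kendall tau per Equation (21), by direct pair counting."""
    n = len(x)
    c = d = tx = ty = 0
    for i in range(n):
        for j in range(i + 1, n):
            dx, dy = x[i] - x[j], y[i] - y[j]
            if dx == 0:
                tx += 1          # pair tied in x
            if dy == 0:
                ty += 1          # pair tied in y
            if dx * dy > 0:
                c += 1           # concordant pair
            elif dx * dy < 0:
                d += 1           # discordant pair
    n0 = n * (n - 1) / 2
    return (c - d) / sqrt((n0 - tx) * (n0 - ty))

print(kendall_tau_b([1, 2, 3], [1, 3, 2]))  # 1 concordance excess out of 3 pairs
```

For the tiny example above there are two concordant pairs and one discordant pair and no ties, giving $\tau = 1/3$.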

In Figure 9, there is a correlation matrix of the factors other than the time factors. Factors 178, 182, 196, and 185 represent the temperatures in Helsinki, Jyväskylä, Oulu, and Rovaniemi, respectively. Considering the Finnish climate, it is obvious that they have a very strong positive correlation. They also have strong negative correlations with factors 192, 242, 241, 193, 166, and 165, which are production and consumption forecasts and yields. Overall, the temperatures correlate strongly with many of the other factors. Strong correlations can also be seen among factors 1, 52, 54, 2, 51, and 53, which are all related to the automatic frequency restoration reserve (aFRR). A strong negative correlation is also observed between factors 123 and 177, which are the activated frequency controlled normal reserve (FCR-N) and the frequency of the electricity grid. Furthermore, factor 191, the hydropower production, has a strong negative correlation with factors 105 and 243, which are the sums of down-regulation and up-regulation bids in the imbalance power reserve.


Figure 9. The Kendall Tau correlations between the other than the time factors. Factor 92 is the price of the imbalance power.

In Figure 10, there is a correlation matrix between the time factors and the other factors.

The month factors correlate strongly with the temperature factors, which in turn leads to strong correlations with the factors that the temperatures correlate with. Factor 248, the forecast of solar power production, seems to correlate with many different time factors, including the hour factors and the month-hour combination factors. The hour factors show many correlations. Moreover, the distinction between weekdays and weekends seems significant, since the weekday factors appear very uncorrelated in comparison with the weekend factors.


Figure 10.Kendall Tau correlations between the time and the other factors. Factor 92 is the price of the imbalance power.

The correlation matrices in Figures 9 and 10 suggest that the imbalance power price has positive correlations with factors 106, 244, 202, and the day-ahead price. Factor 106 is the down-regulation price of the imbalance power, factor 244 is the up-regulation price of the imbalance power, and factor 202 is the industry's power production. In contrast, the price appears to have negative correlations with factors 81, 82, 181, 245, and 194, which are the price of activated frequency controlled disturbance reserve (FCR-D), the volume of activated FCR-D, the wind power production, the forecast of the wind power production, and the net import or export of electricity.

5.2 Evaluation criteria

When optimizing the hyperparameters $p$, $q$, $P$, $Q$ and $s$ of the SARIMAX model, the performance of the models is compared using the Akaike Information Criterion (AIC).

The AIC is calculated with

$$\mathrm{AIC} = -2\ln(L) + 2k, \qquad (22)$$


where $L$ is the value of the model's likelihood and $k$ is the number of estimated parameters [21]. A smaller AIC value indicates a better trade-off between goodness of fit and model complexity.
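Equation (22) can be illustrated with a small numeric example. The log-likelihood values below are made up for illustration and are not from the fitted models; they only show how the parameter penalty can flip the ranking of two models:

```python
def aic(log_likelihood, k):
    """Akaike Information Criterion, Equation (22): AIC = -2 ln(L) + 2k."""
    return -2.0 * log_likelihood + 2.0 * k

# Model B fits slightly better (higher log-likelihood) but uses
# three extra parameters, so AIC prefers the simpler model A.
print(aic(-83900.0, 9))   # 167818.0
print(aic(-83898.5, 12))  # 167821.0
```

This is exactly the trade-off exploited in the hyperparameter search of Section 5.3: a more complex SARIMAX order must improve the likelihood enough to pay for its extra parameters.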

To choose the best performing model among the SARIMAX models with different $d$ and $D$ values and the DeepAR models, the Mean Absolute Error (MAE) and Mean Squared Error (MSE) metrics are used. MAE is calculated with

$$\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n} |x_i - \hat{x}_i|, \qquad (23)$$

where $n$ is the number of observations, $x$ is the observed value and $\hat{x}$ is the predicted value. MSE is calculated with

$$\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n} (x_i - \hat{x}_i)^2. \qquad (24)$$
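Equations (23) and (24) translate directly into code (a minimal sketch with made-up observation and prediction values):

```python
def mae(x, x_hat):
    """Mean Absolute Error, Equation (23)."""
    return sum(abs(a - b) for a, b in zip(x, x_hat)) / len(x)

def mse(x, x_hat):
    """Mean Squared Error, Equation (24)."""
    return sum((a - b) ** 2 for a, b in zip(x, x_hat)) / len(x)

obs, pred = [10.0, 12.0, 9.0], [11.0, 9.0, 9.0]
print(mae(obs, pred), mse(obs, pred))  # 4/3 and 10/3
```

Note how the squaring in MSE makes the single 3-unit miss dominate the score, which is why MSE is the more spike-sensitive of the two metrics.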

5.3 Description of experiments

For the SARIMAX models, the open-source Statsmodels library [22] was used, and for the DeepAR models, the open-source GluonTS library [23]. Both libraries have complete implementations of the models and allow changing various hyperparameters. GluonTS uses the Student's t-distribution as the DeepAR likelihood model.

First, the hyperparameters $p$, $q$, $P$, $Q$ and $s$ of the SARIMAX model were experimented with. To decide which values should be tested, the autocorrelation and partial autocorrelation plots in Figure 11 were examined. Based on these plots, values from 0 to 7 were tried for $p$ and $P$, and values from 0 to 5 for $q$ and $Q$. Based on Figure 8, the values 12 and 24 were tried for $s$, since the daily seasonality seems to be the strongest. Then, for the best performing of these models, the $d$ and $D$ values 0 and 1 were experimented with. Since these values would create a large number of combinations, the $p$ and $q$ values were experimented with first, then the $P$ and $Q$ values, next the $s$ value, and finally the $d$ and $D$ values. During the experiments, the hyperparameters not yet examined were set to 0, except $s$, which was set to 12.

Many different combinations of hyperparameters for the DeepAR model were examined; the combinations are listed in Table 2. Moreover, the effects of using a different number of exogenous factors, resampled data, and recursive predictions were examined.

Figure 11. The autocorrelation and partial autocorrelation plots of the imbalance power price.

When validating or testing the models, 100 samples are drawn from the trained model to make the final estimation. Here is a list of the examined hyperparameters and their descriptions:

• Number of layers is the number of hidden layers in the LSTM components of the DeepAR model.

• Number of cells is the number of hidden cells in one hidden layer of the LSTM.

• Dropout rate is the regularisation parameter for training the model.

• Batch size is the number of samples in each batch.

• Context length is the length of the encoder part of the model.

• Time categories indicates with an X that the timestamps are embedded into binary categories.

• Patience determines how easily the learning rate is reduced.

• Weight decay adds a penalty to the objective for large weights.


Table 2. Hyperparameter settings for each trained DeepAR model. The "-"-symbol means that the value is the same as in the above row. The learning rate is 0.001, the number of epochs is 500 and the number of batches per epoch is 400 for each model. The reduced factors means that none of the factors that are related to the reserves were used.

Name | Nr of layers | Nr of cells | Dropout rate | Batch size | Context length | Time categories | Reduced factors | Patience | Weight decay

Net 1 3 40 0.1 64 33 X 10 10−8

Net 3 4 100 - - 168 X - -

Net 4 3 50 0.2 - - X - -

Net 5 - 40 - 128 - X - -

Net 6 - - 0.1 64 - X X - -

Net 8 - - - X - -

Net 9 - - - -

Net 10 - - - - 36 - -

Net 11 - - 0.2 - 5 - -

Net 12 - - - - 10 - -

Net 13 - - - - 24 - -

Net 15 - - - - 5 X - -

Net 16 - - - - 10 X - -

Net 17 - - - - 24 X - -

Net 18 - - - - 48 X - -

Net 19 - - - - 72 X - -

Net 20 - - - - 100 X - -

Net 21 - - - - 48 - -

Net 22 - - - - 72 - -

Net 23 - - - - 100 - -

Net 24 - - 0.1 - 33 X 5 -

Net 25 - - - X 15 -

Net 26 - - - X 10 10−9

Net 27 - - - X - 10−7

Net 29 2 - - - 168 X - 10−8

Net 33 5 150 - - 48 X - -

Net 34 8 200 - - - X - -


For the resampled-data nets, ResNet 1 and ResNet 2, the training data is resampled in a way that emphasizes periods with spikes, even though this does not fully balance the data. The other parameters for these nets are: 3 layers, 40 cells, a dropout rate of 0.2, a batch size of 64, a context length of 48, a patience of 10, a weight decay of $10^{-8}$, and timestamps embedded as binary categories. Figure 12 shows the training data for ResNet 1 and Figure 13 the training data for ResNet 2.

Figure 12. The resampled training data of ResNet 1. The green, red and yellow parts in the end are copied two times with noise from the corresponding parts of the original training data.

Figure 13. The resampled training data of ResNet 2. The green, red and yellow parts in the end are copied six times with noise from the corresponding parts of the original training data.


The recursive net, RecNet, is trained to predict only one timestep at a time. When making predictions for the validation data, the prediction length of 36 timesteps is achieved by shifting the exogenous factors as with the other nets and by using the previous timestep's price prediction as the observed value for the next prediction in the same period. The other parameters for this net are: 3 layers, 40 cells, a dropout rate of 0.2, a batch size of 64, a context length of 48, a patience of 10, a weight decay of $10^{-8}$, and timestamps embedded as binary categories.
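The recursive prediction scheme can be sketched as a simple feedback loop. The `model` below is a stand-in callable (not the actual RecNet); the sketch only shows how each one-step prediction becomes the "observed" input for the next step:

```python
def recursive_forecast(model, history, exog, horizon):
    """Roll a one-step model forward over `horizon` steps.

    model:   callable (past_values, exog_t) -> next value (a stand-in here)
    history: list of observed values up to the forecast origin
    exog:    per-step exogenous inputs for the prediction period
    """
    values = list(history)
    preds = []
    for t in range(horizon):
        y_next = model(values, exog[t])
        preds.append(y_next)
        values.append(y_next)  # feed the prediction back in as if observed
    return preds

# Dummy one-step model: last value plus the exogenous signal.
toy_model = lambda past, x: past[-1] + x
print(recursive_forecast(toy_model, [5.0], [1.0, -2.0, 0.5], 3))  # [6.0, 4.0, 4.5]
```

The feedback loop also illustrates the scheme's main risk: any error made early in the period is carried into all later steps.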

Table 3 lists the DeepAR models that have a reduced number of factors. The factors were added cumulatively, so that each model has every factor the previous model had plus one more. These models' hyperparameters are set as follows: a dropout rate of 0.1, 40 layers, 40 cells, a batch size of 64, and a context length of 48. The timestamps were embedded in the simpler way, without binary categories.

Table 3. The DeepAR models with a different number of exogenous factors. The factors were added cumulatively so that the next net always had every factor that the previous net had.

Name Added factor Name Added factor

ReduNet 1 DAY.AHEAD.PRICE ReduNet 22 191

ReduNet 2 106 ReduNet 23 181

ReduNet 3 177 ReduNet 24 245

ReduNet 4 214 ReduNet 25 188

ReduNet 5 213 ReduNet 26 202

ReduNet 6 105 ReduNet 27 183

ReduNet 7 243 ReduNet 28 201

ReduNet 8 244 ReduNet 29 248

ReduNet 9 34 ReduNet 30 198

ReduNet 10 33 ReduNet 31 205

ReduNet 11 1 ReduNet 32 194

ReduNet 12 52 ReduNet 33 192

ReduNet 13 54 ReduNet 34 242

ReduNet 14 2 ReduNet 35 241

ReduNet 15 51 ReduNet 36 193

ReduNet 16 53 ReduNet 37 166

ReduNet 17 123 ReduNet 38 165

ReduNet 18 80 ReduNet 39 178

ReduNet 19 79 ReduNet 40 182

ReduNet 20 81 ReduNet 41 196

ReduNet 21 82 ReduNet 42 185


5.4 Results

5.4.1 SARIMAX validation results

Table 4 shows the AIC values of SARIMAX models with different $p$ and $q$ values. The smallest value is 167727, with $p = 7$ and $q = 1$. If either the $p$ or the $q$ value is small, the other does not seem to have a great impact on the AIC value. On the other hand, if both are large, the AIC value appears to grow.

Table 4. AIC metrics for different p and q values in SARIMAX(p,0,q,0,0,0,0) models. The smallest and therefore the best value is bolded.

p \ q |      0 |      1 |      2 |      3 |      4 |      5
0     |      - | 168353 | 167867 | 167849 | 167923 | 168039
1     | 168342 | 167986 | 167829 | 167825 | 167912 | 176082
2     | 167849 | 167843 | 167833 | 168016 | 167951 | 168245
3     | 167911 | 168037 | 167868 | 167981 | 168317 | 167890
4     | 168138 | 167766 | 167883 | 170631 | 168248 | 180390
5     | 168081 | 167881 | 168043 | 180147 | 174724 | 184533
6     | 168260 | 167745 | 168071 | 202351 | 185069 | 184164
7     | 167739 | 167727 | 167939 | 216371 | 182604 | 183930

Table 5 shows the AIC values for different $P$ and $Q$ values. The experiments were interrupted, since they were becoming computationally very heavy and the results were not improving. The best AIC value is 163972, with $P = 0$ and $Q = 2$. The $P$ and $Q$ values seem to have very little impact on the AIC value.

Table 5. AIC metrics for different P and Qvalues in SARIMAX(7,0,1,P,0,Q,0) models. The smallest value and therefore the best value is bolded.

P \ Q |      0 |      1 |      2 |      3 |      4 |      5
0     |      - | 163974 | 163972 | 163974 | 163974 | 163976
1     | 163979 | 163975 | 163974 | 163976 | 163976 | 163978
2     | 163977 | 163974 | 163976 | 163978 | 163978 | 163980
3     | 163979 | 163976 | 163978 | 163980 | 163980 | 163982
4     | 163978 | 163975 | 163977 | 163979 | 163982 | 163984
5     | 163980 | 163977 |      - |      - |      - |      -

The AIC value for $s = 12$ is 163972 and for $s = 24$ it is 163966. This means that $s = 24$ gives the better result and should be used for the rest of the experiments, even though the difference is very small.

Table 6 shows the results for different $d$ and $D$ values in the SARIMAX model. The MAE and MSE values appear to rank the models similarly.

Table 6. MAE and MSE metrics for different d and D values in SARIMAX(7,d,1,0,D,2,24) models. These are results of the predictions with validation data.

Name (d, D) MAE MSE

SARIMAX 1 (0, 0) 21.25 6700
SARIMAX 2 (1, 0) 33.34 24843
SARIMAX 3 (0, 1) 21.90 7853
SARIMAX 4 (1, 1) 77.04 30211

Figure 14 shows a part of the validation period for the SARIMAX 1 model, and Figure 15 a part for the SARIMAX 3 model. The predictions look quite similar, although SARIMAX 1 oscillates a bit more. There are a few moments where SARIMAX 3 makes better predictions, but overall SARIMAX 1 appears to be more accurate.

Figures 16 and 17 show the whole validation periods of both models. These figures show that neither model can predict the medium or major spikes. There is also a very interesting moment between the two major spikes where SARIMAX 3 has predicted large negative spikes and SARIMAX 1 some medium positive spikes, even though the observed values are actually quite stable and low.


Figure 14. A part of the SARIMAX 1 model’s validation period. The part is from 18.12.2019 to 30.12.2019.

Figure 15. A part of the SARIMAX 3 model’s validation period. The part is from 18.12.2019 to 30.12.2019.


Figure 16.The whole plot of the SARIMAX 1 model’s validation period.

Figure 17.The whole plot of the SARIMAX 3 model’s validation period.


5.4.2 DeepAR validation results

Figure 18 shows the results for the models defined in Table 2. The MAE and MSE values stay mostly at the same level, although there is much variation between Net 10 and Net 23. Those models have different context lengths and either the binary time factors or the simpler ones. The context length seems to have some effect on the model's predictions, but the effect is clearly not linear. Moreover, the way of handling the time factors appears to have a non-linear relation to the prediction. The other hyperparameters do not seem to have much effect on the prediction.

Figure 18.The MAE and MSE results of DeepAR models with different hyperparameter values.

Figure 19 shows the results for the models that have a varying number of exogenous factors. After six factors, the MAE and MSE values appear to get worse as more factors are added to the model. However, the deterioration is not completely consistent, which suggests that some factors make the prediction more accurate. Nevertheless, in general, an increasing number of factors appears to lead to worsening predictions.


Figure 19.The MAE and MSE results of DeepAR models with reduced factors.

Figure 20 shows a part of ReduNet 4's validation period, and Figure 21 a part of ReduNet 6's. The predictions of ReduNet 4 tend to be lower than those of ReduNet 6 and have a wider confidence interval. Neither model can predict the positive or negative spikes. At the end of the plotted period there are two moments where the price is quite low; the first falls at the end of a prediction period and the second at the start of one. Neither model reacts to the first drop, but both react to the second, even though the predictions are not that close to the observations.

Figures 22 and 23 show the whole validation periods of ReduNet 4 and ReduNet 6. Overall, both predictions are rather flat and fail to capture the medium or large spikes. The only spikes in either plot appear after the second major spike. ReduNet 4 seems to have a little more variation than ReduNet 6.


Figure 20.A part of the ReduNet 4’s validation period.

Figure 21.A part of the ReduNet 6’s validation period.


Figure 22.The whole plot of the ReduNet 4’s validation period.

Figure 23.The whole plot of the ReduNet 6’s validation period.


5.4.3 Test results

SARIMAX 1 was chosen as the best SARIMAX model. Figure 24 shows the whole test period predicted with it. The MAE value is 32.25 and the MSE value is 4405.

Figure 24.The test period of SARIMAX 1.

ReduNet 6 was chosen as the best DeepAR model. Figure 25 shows the whole test period predicted with it. The MAE value is 22.32 and the MSE value is 3719.


Figure 25.The test period of ReduNet 6.

Table 7 shows the MAE and MSE results for SARIMAX 1 and ReduNet 6 with both the validation and the test data. The validation values are better than the test values, which is normal and suggests that the models are not overfitted.

Table 7.MAE and MSE metrics for SARIMAX 1 and ReduNet 6 with validation and test data.

Validation Test

Name MAE MSE MAE MSE

SARIMAX 1 21.25 6700 32.25 4405
ReduNet 6 17.11 6699 22.32 3719

Compared with the training and validation periods, the test period has many more close-to-zero prices, especially in the first couple of months. SARIMAX 1 performs very poorly there, since its predictions are much higher. On the other hand, ReduNet 6 adjusts to the lower values quite well. ReduNet 6 also appears to follow the higher spikes more precisely than SARIMAX 1, even though neither model can predict the spikes very well. Appendix 1 contains close-up plots of the test period for both models.

In Figure 26, there are the residuals of SARIMAX 1's test period and, in Figure 27, the residuals of ReduNet 6's test period. The largest residuals are almost the same for both models. SARIMAX 1 also appears to have greater negative values and somewhat more variation than ReduNet 6. These observations are supported by the residual histograms in Figures 28 and 29.

Figure 26.The residuals of SARIMAX 1’s test period.

Figure 27.The residuals of ReduNet 6’s test period.


Figure 28.Histogram of the residuals of SARIMAX 1’s test period.

Figure 29.Histogram of the residuals of ReduNet 6’s test period.

5.4.4 Convergence of training

Figure 30 shows the training loss of ReduNet 6. There is a sharp drop at the beginning, which is very common, but after that the error decreases only slowly and never reaches a low level.


Figure 30.The training loss of ReduNet 6.

The validation loss of ReduNet 6 is plotted in Figure 31. It oscillates somewhat and has a descending trend. However, like the training error, the validation error does not reach very low values, and the descent is quite slow for most of the training. These losses suggest that the model does not properly learn the behavior of the price.

Figure 31.The validation loss of ReduNet 6.


6 DISCUSSION

6.1 Current study

The SARIMAX model appears to have difficulty adjusting to a general level of the imbalance power price different from that in the training data. This can be seen in the test data: at the start, when the prices are low, the SARIMAX predictions are continuously too high, and at the end, when the prices are higher, the predictions are better if still somewhat too low. The DeepAR model seems to handle this change of level much better.

Since both the $d$ and $D$ parameters of SARIMAX 1 are zero, this suggests that the used data is stationary, which is quite surprising considering the plots of the data. This may be the reason why the SARIMAX model cannot adjust its order of magnitude to the test data. The seven previous lag values, the error of the previous lag, and the errors of the two previous seasonal lags affect the prediction of SARIMAX 1. It is interesting that the actual seasonal lags do not seem to influence the current value, since the seasonality is quite clear in the weekly plot of the price data in Figure 8.

The first couple of hours of every prediction period in the test data are predicted quite precisely by the DeepAR model. Otherwise, the DeepAR model's predictions are very stable and unable to follow the numerous spikes. Since the price has some daily seasonality, the SARIMAX model is better at following the regular spikes, even if its predictions are not that accurate.

In the test data, the DeepAR model has an MAE of 22.32 and an MSE of 3719, while the SARIMAX model has the corresponding values 32.25 and 4405. These values are better for the DeepAR than for the SARIMAX, which suggests that the DeepAR performs better. Moreover, the fact that the DeepAR can better adjust to changes in the price's order of magnitude supports this statement. On the other hand, it cannot be stated that either of the models makes accurate predictions of the imbalance power price, since the MAE values are relatively high and the plots show poor predictions.

The main result of this study is that neither the SARIMAX nor the DeepAR models learned the behavior of the price. Multiple combinations of hyperparameters, resampling of the data, a recursive model, and reduced factors were experimented with, but only the reduced factors seemed to affect the model performance. Unfortunately, the reduced factors were tested last; the other experiments could have given better results with the reduced factors. They were tested last because it was assumed that the DeepAR model would be able to drop the unnecessary factors during training.

There are multiple possible reasons, why the models do not perform very well. The sim- plest reason is that the imbalance power price is stochastic and mostly affected by human actions like mispredictions, that it is impossible to predict it with these methods. The second reason could be that the chosen exogenous factors do not include the real factors that affect the price. The correlation matrices support this claim since there are only a few factors that seem to correlate with the price and none of the correlations are that strong.

Moreover, the validation results of the DeepAR models with different numbers of exogenous factors suggest that a smaller number of factors is better, and therefore that some factors carry no information about the price. Another possibility is that the chosen hyperparameters of the DeepAR model are unsuitable for this problem, although this is somewhat unlikely since [13] observes that DeepAR should work with only a little hyperparameter tuning.

6.2 Future work

In the future, it could be beneficial to study more thoroughly the effect of the different factors and factor combinations that were used in this study, as well as some that were not. For example, other weather factors and weather forecasts could prove useful. In addition, hyperparameter tuning for ReduNet 6, which was the best DeepAR model, could improve the learning of the model and produce more accurate predictions.

Since the DeepAR models appear to have problems forecasting the fast changes of the imbalance power price, it could be beneficial to combine the DeepAR with a Dynamic Time Warping (DTW) based method. One example is presented in [24], which reports quite promising results against a Euclidean loss when producing time-series predictions, especially when there are sharp changes in the data.
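To illustrate why a DTW-based comparison is attractive for spiky series, the sketch below implements the classic dynamic-programming DTW distance (not the differentiable loss of [24], which is more involved) and compares it with a pointwise L1 error on a forecast that captures a spike's shape but lags it by one step. The series values are invented for illustration.

```python
import numpy as np

def dtw_distance(a, b):
    """Classic DTW distance between two 1-D series via dynamic programming."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            # Best of: insertion, deletion, or match of the previous cells
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

# A forecast with the right spike shape, but lagged by one time step
actual = [0.0, 0.0, 10.0, 0.0, 0.0]
lagged = [0.0, 0.0, 0.0, 10.0, 0.0]

print(dtw_distance(actual, lagged))                             # 0.0
print(np.sum(np.abs(np.array(actual) - np.array(lagged))))      # 20.0
```

The pointwise error punishes the one-step lag twice (once at the missed spike, once at the spurious one), while DTW aligns the shifted spike at zero cost. A training loss with this alignment property could reward a model for predicting a spike of the right shape even when its timing is slightly off.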
