
Lappeenranta University of Technology
LUT School of Business and Management
Industrial Engineering and Management
Business Analytics

Mikko Ritala

Detection and data-driven root cause analysis of paper machine drive anomalies

Author: Mikko Ritala

Examiners: Professor Pasi Luukka

Post Doctoral Researcher Jan Stoklasa

Supervisors: Director of Analytics & Applications Development Arttu-Matti Matinlauri


ABSTRACT

Lappeenranta University of Technology
LUT School of Business and Management
Industrial Engineering and Management
Business Analytics

Mikko Ritala

Detection and data-driven root cause analysis of paper machine drive anomalies

Master's thesis

2019

81 pages, 34 figures, 13 tables, and 10 appendices

Examiners: Professor Pasi Luukka, Post Doctoral Researcher Jan Stoklasa

Keywords: Industrial Internet, predictive maintenance, machine learning, artificial intelligence, anomaly detection, root cause analysis, data preprocessing, model interpretation

The Industrial Internet has increased interest in the collection and utilization of data. The latter has become easier due to increased computing power and the development of analytical methods. The goal of this thesis is to develop methods for detecting anomalies and identifying their root causes. Machine learning (ML) is used to create regression models for the electricity consumption of paper machine (PM) drives. ML enables a way of creating models describing the behavior of equipment. In the empirical part of the thesis, ML models are compared to the physics-based model. Creating a physics-based model requires considerable knowledge of the process. Considering all the process features that affect electricity consumption is a time-consuming task. The ML model, on the other hand, learns the effects of process features from historical data.

Anomalies are identified by comparing model output and measured process value. Time periods when the difference between model output and measured process value is significant are considered anomalous. During an anomalous time period, there may have been an undesired change in the process that could lead to equipment damage. Process features that can explain anomalies are sought from the data using the Pearson correlation.

Knowing what caused the anomalies can help to prevent machine failures.

An application is created during the thesis that is used in the case study in which machine failures are studied. A model is created to find anomalous periods, and the application is used to identify the root causes of anomalies. The application can explain anomalies, but root causes for machine failures cannot be identified. In future research, other methods for root cause identification besides correlation could be studied.


TIIVISTELMÄ (ABSTRACT IN FINNISH)

Lappeenranta University of Technology
LUT School of Business and Management
Industrial Engineering and Management
Business Analytics

Mikko Ritala

Detection and data-driven root cause analysis of paper machine drive anomalies

Master's thesis

2019

81 pages, 34 figures, 13 tables, and 10 appendices

Examiners: Professor Pasi Luukka, Post Doctoral Researcher Jan Stoklasa

Keywords: Industrial Internet, predictive maintenance, machine learning, artificial intelligence, anomaly detection, root cause analysis, data preprocessing, model interpretation

The Industrial Internet has increased interest in collecting and utilizing data. Advances in computing power and analytical methods have made data easier to utilize. The goal of this thesis is to develop methods for detecting anomalies and identifying their root causes. Regression models of the electricity consumption of paper machine drives are built using machine learning. Machine learning makes it possible to create models that describe the behavior of different kinds of equipment. In the empirical part of the thesis, machine learning models are compared with a physics-based model. Building a physics-based model requires considerable process expertise, and accounting for all the factors that affect electricity consumption takes a great deal of time. A machine learning model, by contrast, learns the interactions between process variables from historical data.

Anomalies are identified by comparing the model output with the measured process value. Time periods are considered anomalous when the difference between the model output and the measured process value is significant. An anomalous period may be caused by an undesired change in the process, which in the worst case can lead to equipment damage. Explanatory factors for anomalies are sought from the data using the Pearson correlation. Knowing what caused an anomaly can help prevent failures.

An application was built during the thesis and is used in studying failure situations. First, a model is built to find anomalies, after which root causes are sought with the application. The tool can be used to explain anomalous situations, but root causes for the failures were not found. In future work, methods other than correlation could be studied for finding root causes.


ACKNOWLEDGMENTS

First and foremost, I would like to express my greatest gratitude to Valmet for giving me this opportunity. Special recognition goes to my supervisor Arttu-Matti Matinlauri. He has done a wonderful job of creating a unique working environment within the DevOps team. The support from the whole team and other Valmeteers helped me to get through the challenges that I faced during the thesis. Thanks to Jari Kääriäinen, who always helped with papermaking-related issues. I would also like to thank Miska Valkonen, who had time to discuss ML-related topics.

I am very glad that I have reached the end of my studies at the Lappeenranta-Lahti University of Technology. At the same time, I feel sad, because this chapter of my life is coming to an end. There was a point in my studies when I had a hard time finding things that interested me and offered the right challenges. Without LUT and its business analytics master's program, I would not be where I am now. I therefore wish to thank the whole university and all my classmates for inspiring me and making my time during those years unforgettable. My thesis supervisor, Pasi Luukka, has also earned a special acknowledgment for his valuable feedback and guidance throughout the thesis. Finally, I would like to thank my family for encouraging and supporting me from day one.


TABLE OF CONTENTS

1 Introduction ... 1

1.1 Background ... 1

1.2 Objective and scope ... 2

1.3 Execution of research ... 3

1.4 Structure of the thesis ... 3

2 Electric drives ... 5

3 Valmet Industrial Internet ... 8

4 Methodology ... 11

4.1 Data preprocessing ... 13

4.1.1 Data normalization ... 13

4.1.2 Dealing with missing values ... 14

4.1.3 Dealing with noise ... 16

4.1.4 Methods for feature selection ... 17

4.2 Machine learning models ... 23

4.2.1 Linear regression ... 23

4.2.2 Multilayer perceptron ... 24

4.2.3 Bagging ... 26

4.2.4 Boosting ... 27

4.3 Model interpretation ... 28

4.4 Root cause analysis ... 32

5 Proof of concept... 34

5.1 Data collection ... 34

5.2 Data processing ... 35

5.3 Feature selection ... 39

5.4 Model selection ... 40


5.5 Hyperparameter tuning ... 43

5.6 Model interpretation with SHAP ... 45

5.7 Data-driven RCA ... 46

6 Application ... 50

6.1 AWS ... 50

6.2 Snowflake data storage ... 50

6.3 Tableau dashboard ... 51

6.4 Case study ... 52

7 Conclusions ... 58

8 Summary ... 60

References ... 62

Appendix I – SQL query ... 66

Appendix II – Train set for physics-based model ... 67

Appendix III – Grid search results ... 68

Appendix IV – Cross-correlation results ... 70

Appendix V – Cross-correlation results with time shifts ... 73

Appendix VI – Case study: Hyperparameter grid ... 74

Appendix VII – Case study: Grid search results ... 75

Appendix VIII – Case study: Summary plot ... 76

Appendix IX – Case study: Dependence plot ... 77

Appendix X – Case study: Correlating tags of study periods ... 78


List of symbols and abbreviations

ANFIS – Adaptive network-based fuzzy inference system
ANN – Artificial neural network
API – Application programming interface
AWS – Amazon Web Services
CBM – Condition-based maintenance
CM – Condition monitoring
CFS – Correlation-based feature selection
CI/CD – Continuous integration and continuous deployment
DAE – Deep auto-encoder
DL – Deep learning
EM – Expectation-maximization
ETL – Extract, transfer, load
IDE – Integrated development environment
II – Industrial Internet
IoT – Internet of Things
KNN – K-nearest neighbor
MAR – Missing at random
MAJ – Majority vote
MCAR – Missing completely at random
MCMC – Markov Chain Monte Carlo
MCS – Multiple classifier systems
MICE – Multivariate imputation by chained equations
MNAR – Missing not at random
ML – Machine learning
MLPR – Multilayer perceptron regressor
MV – Missing value
LOCF – Last observation carried forward
PM – Paper machine
PoC – Proof of concept
UI – User interface
RBAC – Role-based access control
RCA – Root cause analysis
ReLU – Rectified Linear Unit
SHAP – SHapley Additive exPlanations
SF – Snowflake
SME – Subject matter expert
SQL – Structured query language
VII – Valmet Industrial Internet


Table of Figures

Figure 2-1 Dryer group 3 ... 6

Figure 2-2 Wire section ... 7

Figure 3-1 The most common services used in the VII platform. ... 8

Figure 3-2 Valmet Industrial Internet platform architecture ... 9

Figure 4-1 Measurement with anomalous observations. ... 11

Figure 4-2 Tabular data with MVs. ... 14

Figure 4-3 Safe, borderline, and noisy observations ... 16

Figure 4-4 Three features f1 (relevant), f2 (redundant) and f3 (irrelevant) ... 18

Figure 4-5 RReliefF algorithm ... 20

Figure 4-6 Pseudocode for RFE ... 22

Figure 4-7 Three-layer neural network ... 24

Figure 4-8 Single neuron ... 25

Figure 4-9 ANN training process. ... 26

Figure 4-10 Adaboost algorithm ... 27

Figure 4-11 Force plot (Feature contribution in a single prediction) ... 30

Figure 4-12 SHAP values of features ... 30

Figure 4-13 Dependence plot (SHAP values of a single feature) ... 31

Figure 4-14 SHAP summary plot... 32

Figure 5-1 Missing values in function of time. ... 36

Figure 5-2 Missing values in function of time after elimination. ... 37

Figure 5-3 Comparison of imputation methods. ... 38

Figure 5-4 Test set of physics-based model. ... 42

Figure 5-5 Test set of gradient boosting. ... 42

Figure 5-6 Test set of linear regression. ... 43

Figure 5-7 Tuned gradient boosting model during the testing period. ... 44

Figure 5-8 SHAP summary plot. ... 45

Figure 5-9 SHAP dependence plot. ... 46

Figure 5-10 Anomalous period found with the gradient boosting method. ... 47

Figure 5-11 Total power of dryer group 3 (a) and correlating tags (b) during anomaly. ... 49

Figure 6-1 Tableau dashboard ... 51

Figure 6-2 Training and testing periods. ... 53

Figure 6-3 Test period results of the wire section model. ... 54

Figure 6-4 Study periods. ... 55

Figure 6-5 Study period 3 results. ... 56


Table of Tables

Table 2-1 Electricity consumption of a modern PM. ... 5

Table 5-1 Properties of the dataset. ... 35

Table 5-2 Properties of the dataset after elimination. ... 37

Table 5-3 R-squared and MSE of the test and train set with different imputation methods .. 38

Table 5-4 Train and test sets, and feature selection set. ... 39

Table 5-5 R-squared and MSE of the test and train set with different subsets... 40

Table 5-6 Train and test sets. ... 40

Table 5-7 Performance of different models. ... 41

Table 5-8 Hyperparameter grid. ... 44

Table 5-9 Eleven most cross-correlating tags. ... 48

Table 6-1 Failures at the wire section. ... 53

Table 6-2 Maintenance action at the PM. ... 54

Table 6-3 Study periods. ... 55


1 Introduction

The term Industrial Internet (II) was coined by General Electric. The term offers an industrial perspective on the Internet of Things (IoT), which is the broader concept. The IoT encompasses four different IoT strategies: enterprise, commercial, consumer, and industrial.

The IoT refers to an interaction between devices that are connected to the Internet. Data produced by these devices can be used to gain insight and improve the performance of the devices. (Alasdair 2016, 1-4)

Companies are investing increasingly in II applications every year. II applications aim to increase competitive advantage and sustainability. Condition-based maintenance (CBM) is an aspect of II that has considerable potential for development in the coming years. The goal of CBM is to identify incipient equipment failures and avoid unnecessary maintenance. Maintenance decisions in CBM are made based on information collected by condition monitoring (CM). (Kumar et al. 2018)

The amount of data is continuously increasing, and utilizing this asset has raised the need for big data analytics. Working with big data is more challenging because of its three main features: large volume, vast variety, and high velocity. These features make data processing crucial when doing data analysis. Big data analytics makes it possible to obtain insight from massive amounts of data, which was previously impossible. ML algorithms can learn the optimal behavior of a machine from historical data gathered by CM. (Lei et al. 2018; Wolfgang et al. 2017)

CM is commonly utilized in industrial processes. Due to the vast amounts of data and measurements, it is often hard to locate where the problem is when equipment is behaving abnormally. To keep production running, the time for a thorough inspection of parts is minimal. There is thus no opportunity to conduct a thorough root cause analysis (RCA).

Abnormal behavior is therefore often disregarded. Maintenance is performed at fixed intervals, or when something fails. Data-driven approaches to RCA may help to avoid accidents without compromising productivity.

1.1 Background

Valmet launched its Industrial Internet services in 2017. The Valmet Industrial Internet (VII) framework is under heavy development, and developers are continually creating new applications. Many of the applications aim to improve performance, end-product quality, cost-effectiveness, and productivity. Anomaly detection combined with data-driven RCA responds to many of these targets. Anomaly detection is not a new subject at Valmet, but it has not been widely utilized, because building physics-based models takes time. The challenges arising from data are very customer-specific, and ML algorithms are easier to apply to varied data. Anomaly detection with data-driven RCA is faster to implement for new customers than current solutions. It could increase the number of prevented failures without requiring much additional work from PM operators. (Valmet internal 2019)

Valmet's industrial history goes back more than 220 years. Valmet now operates throughout the world in process technologies, automation, and services for the pulp, paper, and energy industries. Significant investments in research and development have led Valmet to its current position. In 2018, Valmet spent 66 million euros on R&D. In that year, net sales were 3.3 billion euros, and it had more than 12,000 employees. The Industrial Internet and digitalization have been listed as Valmet's major growth accelerators. The Industrial Internet is Valmet's way of achieving one of its Must-Wins, which is customer excellence. A customer-oriented II approach guarantees that there will be a demand for II products and services. (Valmet 2018)

At Valmet, models are usually physics-based. There are many physics-based models for different parts of a PM. One use case for a physics-based model is to compare model output to measured values. If the model is sufficiently accurate, deviations of measured values from the model output may indicate degradation of parts. Physics-based models are useful, but the potential of ML algorithms remains largely unexplored. A lot of data is collected from PMs, but most of it remains unused. (Valmet internal 2019)

The reader of the thesis is expected to know the basics of data analytics and databases and the principles of statistical computing. Prior knowledge of the papermaking process is not required, but it may be helpful in interpreting the results.

1.2 Objective and scope

The main goal of the thesis is to explore ML model-based anomaly detection methods and identify causes of anomalies. An application is built for internal use and Valmet’s customers, based on the best methods found in the empirical part of the thesis.

The main goal is achieved by completing the following objectives:

• Using data preprocessing to remove any undesired properties in data


• Utilizing feature selection methods to find an optimal set of features which help to explain the electricity consumption of drives

• Building a model for anomaly detection that can be interpreted

• Utilizing the Pearson correlation in the identification of the root causes of anomalies

The scope of this thesis covers a complete data analysis process, from data gathering to application deployment. The research focuses on data preprocessing, ML, model interpretation, and RCA. Some limitations arise from the data security of Valmet and its customers, as well as the scarcity of literature available on similar approaches.

1.3 Execution of research

The project starts by becoming familiar with the existing methods found in the literature.

Valmet’s physics-based models work as a benchmark and as an information source for the functions of a PM. Additionally, meetings with subject matter experts (SME) were arranged to obtain a comprehensive insight into PM drives. Information from Valmet’s internal sources and methods found from the literature are presented in the theoretical framework. Some of the information from internal sources is classified and cannot be presented in this thesis.

Methods introduced in the literature review are applied in the empirical part of the thesis. The VII framework is used to acquire data. SMEs help to explain the results acquired by the methods used in the thesis. Finally, the project culminates in the case study. An application is created for SMEs for studying anomalies. The application is built based on the theoretical and empirical frameworks.

1.4 Structure of the thesis

The thesis is divided into seven chapters. The chapters are presented in the natural order to ensure that the theoretical framework provides a basis for the empirical part of the thesis.

Theoretical framework:

• Electric drives. A short summary of drives, and the factors that influence the electricity consumption of drives in different sections at a PM. The sections where models are created are also presented.

• Valmet Industrial Internet. How data is stored, transferred, and utilized in analytics and applications at Valmet.

• Methodology. Review of methods that are used in the empirical part of the thesis for data preprocessing, ML, model interpretation, and RCA.

Empirical part:

• Proof of concept. Methods introduced in the theoretical framework are applied and compared.

• Application. How the best approaches are compiled into the application and used in the case study.

The structure of the thesis follows the project's chronological order, and the first chapters support an understanding of the following ones. After the empirical part, conclusions are drawn based on the results. The last chapter is a summary that briefly describes the sections of the thesis.


2 Electric drives

Drives are electric motors that create the rotational power to rotate the large cylinders (or rolls) of a PM. Nowadays, drives are powered by induction motors. Induction motors are the simplest, cheapest, and most reliable type of electric motor. These features make them the most commonly used electric motor for producing rotary power. (Karjalainen 1999)

Electricity consumption per produced tonne of a modern PM, by section, is presented in Table 2-1. The measurements were taken by Valmet from an SC PM. The total electricity used by drives is 100 kWh/t, which is 29 percent of the total electricity consumption. In general, the speed and the stretching of fabrics affect the electricity consumption of drives in every section. Paper type and basis weight determine how fast the PM can be driven. At higher speeds, the time available for dewatering decreases, so the underpressure of the vacuum units has to be increased. This creates additional friction between the fabrics and the vacuum units. Viscosity also increases the electricity consumption of drives as driving speed increases. (Valmet internal 2019; Karjalainen 1999)

Table 2-1 Electricity consumption of a modern PM.

SECTION ELECTRICITY CONSUMPTION

DRIVES AT WIRE 31 kWh/t

DRIVES AT PRESS 48 kWh/t

DRIVES AT DRYER 19 kWh/t

REEL 2 kWh/t

VACUUM SYSTEM 67 kWh/t

SHORT CIRCULATION 79 kWh/t

AIR CONDITIONING 34 kWh/t

POST-PROCESSING 64 kWh/t

TOTAL 344 kWh/t

Table 2-1 shows the electricity consumption of the drives in different sections. The wire section consumes almost a third of the electricity of all the drives in the PM. The share of electricity consumption at the wire can often be higher. Most of the friction at the wire section comes from the vacuum units, which remove water from the paper web. (Karjalainen 1999) The press section uses almost half the electricity used by the drives, but it may not always be this high. Dewatering, nip pressure, and internal energy losses from bending compensated rolls create most of the electricity consumption in the press section. (Karjalainen 1999)


During a normal run, the electricity consumption of drives in the drying section is not especially significant. Most of the power is required during acceleration, due to the large moment of inertia of the dryer section. During a normal run, most of the power is used to overcome the friction between rolls and scrapers. Condensed water in the dryer cylinders can also significantly increase electricity consumption. (Karjalainen 1999)

Power at the reel is not significant compared with other sections. The power required at the reel comes from the tension required in the paper web when reeling it onto a roll. (Karjalainen 1999)

In the empirical part of the thesis, models are created for the drives of one dryer group in the dryer section and for the wire section. Side views of the dryer group and the wire section are presented in Figures 2-1 and 2-2 respectively. The dryer section consists of numerous groups of cylinders referred to as "dryer groups". Dryer groups are heated with steam to increase the dry content of the paper web. Every dryer group is controlled as a unit, and the common power output is more important than the power output of a single drive within a dryer group. The PM operator can change the division of produced power between drives as he/she sees fit. A change in the division does not necessarily change the total power output. (Valmet internal 2019)

Figure 2-1 Dryer group 3 (Valmet internal 2019).

Dryer group 3, presented in Figure 2-1, has 2 fabrics (top and bottom), 15 guide rolls (green), 9 heated rolls (red, 9-17), 1 vacuum roll (blue, V9), and 4 drives, 3 in the guide rolls and 1 in the vacuum roll. The electricity consumption of the drives is not determined only within the group. The power created by adjacent groups also has an impact, which must be considered when creating a model. The following dryer groups pull the paper web, which may decrease the electricity consumption of dryer group 3. The dryer groups prior to dryer group 3 create drag, which may increase the electricity consumption of the group. (Valmet internal 2019)

Figure 2-2 Wire section (Valmet internal 2019).

The wire section has only two drives, one for the top and one for the bottom fabric. The green objects are rolls, and the yellow objects are vacuum units. The stock suspension comes from the headbox to the wire section. The main purpose of the wire section is to remove water from the paper web. The wire section must also ensure that the desired structural properties of the paper are met. (Valmet internal 2019)


3 Valmet Industrial Internet

This chapter consists of a short summary of the VII platform and the different services used to create value from data. The chapter's objective is to introduce the reader to the services and concepts without going into unnecessary detail. The main purpose of building the VII platform is to centralize the storage and processing of the data in one place, where it can be standardized – in other words, to help the daily lives of everyone working with data. The services used in the platform are Amazon Web Services (AWS), Snowflake (SF), Matillion, Tableau, Python, Bitbucket, and Jenkins. Figure 3-1 shows the most commonly used services of the VII. (Valmet internal 2019)

Figure 3-1 The most commonly used services in the VII platform.

• AWS provides various services, such as cloud storage (S3) and serverless cloud computing (Lambda). Lambda functions can be scheduled to run at user-specified intervals. They can also be run whenever they are invoked by an event – for example, when new data is uploaded to S3.

• SF is a scalable cloud Structured Query Language (SQL) data warehouse, and it is used as a central data storage at Valmet.

• Matillion is a tool to extract, transfer, and load (ETL) data. Matillion is used to standardize customer data.

• Tableau is a dashboarding tool for creating dashboards/user interfaces (UI) for applications. Tableau can access data from cloud data storage. It is possible to do simple calculations within Tableau, but more advanced analytics must be performed with other tools, such as Python.

• Python is a widely known programming language and the language most used by data scientists (Hayes 2019). The major advantages of Python are its simple syntax and comprehensive libraries.

• Bitbucket works as a code version control repository in which Python and other code are saved. Users can clone the source code, make changes, and push changes back to Bitbucket. The changes made are tracked in the repository, and it is always possible to go back to an earlier version.

• Jenkins is a continuous integration and continuous deployment (CI/CD) tool. Jenkins listens to a repository, such as Bitbucket, and when changes are made to the source code, Jenkins deploys changes to production. Bug fixes and new features are thus implemented to existing applications, without any additional effort.

• Other advanced analysis tools used are Alteryx and R.

Figure 3-2 shows the high-level representation of the Valmet Industrial Internet.

Figure 3-2 Valmet Industrial Internet platform architecture (Valmet internal 2019).

The data pipeline is the process by which data is gathered, processed, and made available for applications. Data is first gathered by sensors at the customers' machines and saved to their local data storage, from which it is uploaded to S3. Uploading is done with R scripts, which are scheduled to upload data at fixed intervals. The data lands in S3 as batch files, where it is still quite difficult to use for analytics. Services like Matillion and Lambda are utilized to transfer and format the data to SF. The data in SF is in a conventional form. Simple analyses can be conducted in the SF UI with SQL, or the data in SF can be accessed in Tableau. SF can also be accessed with an application programming interface (API) provided by SF, which enables SF usage with Python or other programming languages. (Valmet internal 2019)

Lambdas are also used when creating online applications that require analysis which is not possible in Tableau or SF. For example, a VII application that compares measured values to a model output to discover whether the process is working normally can operate as follows:

1. Lambda function is triggered whenever new data is uploaded to SF

2. Lambda function downloads data from SF then calculates the model output and compares it to the measured value

3. Lambda function uploads results to SF

4. Results are visualized in Tableau dashboard

The Tableau dashboard is visible in Valmet's customer portal to authorized Valmet employees and customers. When many customers are using the same application, access to certain data is limited by role-based access control (RBAC). (Valmet internal 2019)

The VII platform enables fast and secure development work. Many new ideas can be brought to life quickly without using many resources. The VII platform saves time and money while creating considerable value for its users and Valmet's customers. (Valmet internal 2019)


4 Methodology

This chapter consists of a literature review of existing research in the field of anomaly detection, model interpretation, and the methods utilized in data-driven RCA. The aim is to cover some of the methods from the entire process, starting with data preprocessing, continuing to model building and interpretation, and finally identifying the root causes of anomalies. There are different approaches to anomaly detection, such as distance-based and clustering-based approaches. In this thesis, the focus is on model-based approaches.

Anomaly detection is not a new research area; indeed, it has been studied for more than a hundred years. The development has been rapid in recent decades, due to increases in computational power and advances in data mining. Anomalies are present in numerous disciplines. Anomalies are unexpected deviations from the norm. To detect anomalies, one must know the normal behavior of the data. Anomalies can be detected by separating anomalous observations from normal ones. This may seem an obvious classification task, but it is anything but. Anomalies also often differ from each other. This makes it difficult for classification algorithms to distinguish anomalous from non-anomalous observations.

Furthermore, anomalous observations make up only a tiny fraction of the data compared to non-anomalous ones. (Mehrotra et al. 2017, 1-6)

In Figure 4-1, the anomalous behavior of the data can be seen in the marked area (red circle). The measurement receives considerably higher values than in the normal state.

Figure 4-1 Measurement with anomalous observations.


Mehrotra et al. (2017, 21-22) emphasize that there are three cases to consider when assessing anomaly detection algorithms.

1. Correct Detection: Detected anomalies correspond to actual anomalies in the process.

2. False Positives: Unexpected observations appear in the data even though the process is normal. This may be due to noise.

3. False Negatives: The process deviates from the normal state, but it is not recognized as an anomaly, because the data signal is not sufficiently significant compared to the noise.

The observations in the marked area of Figure 4-1 can be classified as an anomaly. It may also be the case that the process is working as it should, but such behavior was simply unobserved in the historical data. The data in Figure 4-1 exhibits low and high spikes, considered as outliers. These outliers are often caused by noise and are seen as anomalies even when the process is behaving normally. The third case may be present when noise filtering is too aggressive, and anomalies are disregarded as noise.

Underlying processes can often be described by models. The process's features have functional relationships. Since process features affect each other, it is possible to approximate one feature with a function of the other features. Anomaly detection with models can be done with two different approaches. In the first approach, the focus is on the parameters in the model and an assessment of how one model parameter affects the model output. The second approach is to compare measured data points with model outputs. The difference between the model output and a measured value is referred to as the "anomaly score". (Mehrotra et al. 2017, 57-58)

The second approach is successfully utilized in the study of Zhao et al. (2018). Their goal was the early fault detection of wind turbine components. They used a deep auto-encoder (DAE) network to model various parameters from the wind turbine. Anomalies were identified by the residuals of these parameters. They were able to detect failures more than 14 hours before actual failure. Since they modeled various variables, they were also able to locate faulty components from the residuals.

In this thesis, anomalies are studied using the second approach. The anomaly score used in the empirical part considers only moments when the measured value is higher than the model output. During those moments, there may be additional friction in the process. The anomaly score is set to zero when the measured value is lower than the model output.
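The one-sided score can be written down in a few lines. A minimal sketch, assuming measured and model_output are time-aligned NumPy arrays (the names and values below are purely illustrative):

```python
import numpy as np

def anomaly_score(measured, model_output):
    """One-sided residual: positive only when the measured value exceeds
    the model output; zero otherwise."""
    residual = np.asarray(measured, dtype=float) - np.asarray(model_output, dtype=float)
    return np.maximum(residual, 0.0)

# Only the third time step gets a non-zero score in this toy example.
measured = np.array([10.0, 9.5, 14.0, 10.2])
predicted = np.array([10.1, 9.8, 10.5, 10.4])
print(anomaly_score(measured, predicted))   # approximately [0.  0.  3.5 0. ]
```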


4.1 Data preprocessing

Data preprocessing has a huge impact on prediction models, since the quality of data is usually not perfect. For example, data might have missing values, outliers, noise, or redundant features, or its dimensionality may be too high to be utilized effectively. This chapter is a review of basic data preprocessing methods that make data useful for ML algorithms.

4.1.1 Data normalization

Data normalization is usually a mandatory step before ML techniques can be applied. This is because many ML techniques are based on distances. Features with a larger distance between their minimum and maximum values have more weight in prediction. To make features equal, they must be normalized. The most common normalization method in the literature is min-max normalization, in which all features are scaled to a fixed interval, usually [0,1] or [−1,1]. (García et al. 2015, 46-47)

Observation v of feature A is min-max normalized to the range [a, b] as follows:

$v' = \frac{v - \min(A)}{\max(A) - \min(A)}\,(b - a) + a$, (1)

where min(A) and max(A) are the minimum and maximum of the observed values of feature A, respectively.

Another commonly used normalization method is z-score normalization. This is particularly useful when the dataset is expected to contain outliers. Outliers can bias min-max normalization, because values are scaled between the minimum and maximum values. By applying z-score normalization, the new feature values have a mean of 0 and a standard deviation of 1. There is a variation of z-score normalization that is even more robust to outliers: it simply replaces the standard deviation with the mean absolute deviation. (García et al. 2015, 47-48)

Z-score normalization is applied to observation v of feature A as follows:

$v' = \frac{v - \bar{A}}{std(A)}$, (2)

where std(A) is the standard deviation and Ā is the mean of the observed values of feature A.
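For illustration, equations (1) and (2) can be applied with NumPy as follows; this is only a sketch, and the example feature values are made up:

```python
import numpy as np

def min_max_normalize(v, a=0.0, b=1.0):
    """Equation (1): scale observed values of a feature to the range [a, b]."""
    v = np.asarray(v, dtype=float)
    return (v - v.min()) / (v.max() - v.min()) * (b - a) + a

def z_score_normalize(v):
    """Equation (2): subtract the mean and divide by the standard deviation."""
    v = np.asarray(v, dtype=float)
    return (v - v.mean()) / v.std()

power = np.array([120.0, 135.0, 150.0, 400.0])   # hypothetical drive power readings
print(min_max_normalize(power))   # the outlier 400 compresses the other values
print(z_score_normalize(power))
```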

4.1.2 Dealing with missing values

In the industrial environment, data is usually incomplete, noisy, and inconsistent. It therefore requires processing before it can be used in further analysis. Missing sensor data is very common in the industrial environment, and it happens for various reasons. Data may be missing because of unreliable sensors, network communication errors, synchronization problems, and different kinds of equipment failure. An example of an incomplete dataset can be seen in Figure 4-2. (García et al. 2015, 40; Guzel et al. 2019)

Figure 4-2 Tabular data with MVs. Reproduced from García et al. (2015, 61).

García et al. (2015, 60-61) identify three common approaches for dealing with missing data:

• The simplest way is to discard observations containing missing values (MV).

However, this is not usually possible if the number of MVs is substantial. Another concern is that there may be a pattern behind missing values. Important information may be lost if observations with MVs are discarded.

• Another approach is to apply maximum likelihood procedures. A model is built with a complete part of the dataset, and imputation is conducted in the form of sampling.

• The third approach is to use imputation methods, in which MVs are filled with estimated ones. The features are not usually independent of each other. MVs can, therefore, be estimated by identifying relationships between features.

There are different assumptions about missing data. Methods for imputation should be selected based on these assumptions. Common assumptions about missing data are:


• Missing at random (MAR) assumes that the probability that an observation has a missing value for a feature depends on other features rather than the values of the feature itself

• Missing completely at random (MCAR) assumes that the probability that an

observation has a missing value for a feature does not depend on the values of the feature itself, nor on other features

• Missing not at random (MNAR) assumes that the probability that an observation has a missing value depends on the feature itself, as well as other features

Numerous imputation methods are available, and the imputing of missing data may be the focus of a thesis in its own right. Due to the scope of this thesis, only a short review of some imputation methods is presented.

In the study conducted by Steiner et al. (2016), MAR was assumed for each dataset. The study compared straightforward imputation methods, such as mean/median imputation, the last observation carried forward (LOCF) method, and simple random imputation, with the expectation-maximization (EM) algorithm and multiple imputation (MI) using Markov Chain Monte Carlo (MCMC) simulation. The study concludes that better prediction results were achieved when EM and MCMC were applied to fill the MVs in the data.

The article by Guzel et al. (2019) attempts to tackle missing sensor data problems by utilizing Deep Learning (DL) and the Adaptive Network-based Fuzzy Inference System (ANFIS). The study concludes that DL and ANFIS outperform the non-linear models used in the study in terms of root-mean-square error. According to Guzel et al. (2019), ML methods are becoming popular in missing data estimation. K-nearest neighbor (KNN) is one of the most commonly used algorithms for missing data problems, despite the fact that it was originally introduced as a classification algorithm. Tutz et al. (2015) showed that nearest neighbor methods performed well in a high-dimensional setting in which the number of features was high compared to the number of observations.

Multivariate Imputation methods, such as multivariate imputation by chained equations (MICE), are easy to apply through libraries built for R and Python. MICE take into account the process that created the missing data and preserve the relations within the data and the uncertainty about these relations. MICE work under the assumptions of MAR and MNAR.

However, in the case of MNAR, additional modeling assumptions are required which affect the produced imputations. (Van Buuren et al. 2011)

In the empirical part, mean imputation, LOCF, and MICE are utilized to estimate the missing values in the dataset. MICE is used under the assumption of MAR.
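A hedged sketch of the three approaches with pandas and scikit-learn; scikit-learn's IterativeImputer is used here as a MICE-style chained-equations imputer, and the column names and values are hypothetical:

```python
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

df = pd.DataFrame({
    "speed": [1200.0, 1210.0, np.nan, 1215.0, 1220.0],
    "power": [310.0, np.nan, 305.0, 300.0, np.nan],
})

mean_imputed = df.fillna(df.mean())      # mean imputation
locf_imputed = df.ffill()                # last observation carried forward
mice_imputed = pd.DataFrame(             # chained-equations style imputation
    IterativeImputer(random_state=0).fit_transform(df),
    columns=df.columns,
)
print(mice_imputed)
```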

4.1.3 Dealing with noise

Another common problem with raw data is noise. Noise can be defined as unwanted data items, irrelevant features, or data points that are not in line with the rest of the records. There are various causes of noise. For example, measuring devices may be malfunctioning, or errors may occur when sending/retrieving data to/from data storage. Noise can reduce system performance in terms of accuracy, model-building time, size, and interpretability. (Zhu et al. 2004; Rathi 2018)

The classification task can be demanding even without noise. Sometimes classes form small disjuncts inside other classes. Classes can also have similar characteristics, which leads to overlapping and reduced classification performance. When noise is present in the data, it may lead to extreme overlapping, due to irrelevant noisy observations. In Figure 4-3, observations are divided into safe, borderline, and noisy examples. Safe examples are clearly separate from the decision boundary and belong to their own class. Borderline examples are near the decision boundary and are therefore easily misclassified. Noisy examples fall inside the wrong class and cannot be classified correctly. (García et al. 2015, 109)

Figure 4-3 Safe, borderline, and noisy observations. Reproduced from García et al. (2015, 110).


There are two types of noise according to García et al. (2015, 110-111):

• Class noise is incorrectly labeled classes, due to data entry errors or lack of knowledge when labeling observations. Class noise can be divided into contradictory examples and misclassifications. Contradictory examples are duplicate examples with different class labels. Misclassifications are observations labeled in the wrong class.

• Feature noise is considered to be invalid feature values and MVs.

Noise can be handled by multiple classifier systems (MCS). MCS aim to gain noise robustness by combining multiple classifiers. MCS reduce the individual problems of each classifier caused by noise. MCS can also be utilized in regression problems. Instead of choosing the best label, the final output is averaged among all models in the MCS. In this thesis, ensemble methods like bagging and boosting are utilized to reduce the influence of noise.

MCS is a parallel approach, which means that all available classifiers are given the same input. Outputs are merged with a voting scheme to acquire a final prediction. Sáez et al. (2013) introduce various voting schemes used in classification problems. Two of the methods which can also be applied in regression problems are:

• The majority vote (MAJ) approach assigns an observation to the class that receives most of the votes among all classifiers.

• A weighted majority vote is a similar approach to MAJ. Labels assigned by each classifier are weighted according to the accuracy of the model in the training phase.
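As a small illustration of the regression variant, the weighted majority vote reduces to a weighted average of the individual model outputs; the sketch below assumes the weights come from training-phase accuracy and uses made-up numbers:

```python
import numpy as np

def weighted_vote(predictions, weights):
    """Combine the outputs of several regression models as a weighted average,
    the regression analogue of the weighted majority vote."""
    predictions = np.asarray(predictions, dtype=float)
    weights = np.asarray(weights, dtype=float)
    return weights @ predictions / weights.sum()

# Three models predict the same drive power; weights reflect training accuracy.
print(weighted_vote([310.0, 305.0, 320.0], weights=[0.5, 0.3, 0.2]))   # 310.5
```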

4.1.4 Methods for feature selection

Big data presents new challenges in terms of feature selection, because the number of features in the data can be enormous. Finding the best feature subset from thousands of features can be exhausting. Dimensionality is a serious problem for many ML algorithms.

The term “the curse of dimensionality” often appears in the literature. Dimensionality increases computational complexity, which increases training time and decreases model performance (Li et al. 2016). The article by Li et al. (2016) offers an example of relevant, redundant, and irrelevant features. An example can be found in Figure 4-4.


Figure 4-4 Three features f1 (relevant), f2 (redundant) and f3 (irrelevant). Reproduced from Li et al. (2016).

In Figure 4-4, the first feature, f1, is relevant, because it can be used to classify the data into two classes, blue and red. f2 is a redundant feature, because it is strongly correlated with f1 and thus has no additional value in the classification task at hand. f3 is an irrelevant feature, because both classes exhibit similar behavior regarding f3.

Feature selection can be divided into filter, wrapper, and embedded methods. Filter methods are independent of the learning algorithm. They can be used in any situation, but selected features may not be optimal, because there is no learning algorithm guiding the selection of features. Wrapper methods are computationally intensive, because features are evaluated by their contribution to the learning algorithm’s predictive performance. Embedded methods are a compromise between filter and wrapper methods. Embedded methods interact with the underlying model. They are more efficient than wrapper methods, because they do not need to iterate through every feature subset. (Li et al. 2016)

Li J. et al. (2016) have made a comprehensive review of feature selection methods for conventional data. Methods are divided into four main categories: similarity-based, information theoretical-based, sparse learning-based, and statistical-based methods.

Similarity-based methods assess feature importance by their ability to approximate similarity within data. Supervised feature selection methods utilize observation labels to assess similarity. Unsupervised methods use various distance metrics. Methods in this family are independent of the learning algorithms. A drawback of these methods is that most of these algorithms cannot handle feature redundancy. It may lead to a subset of highly correlated features. (Li J. et al. 2016)

Information theoretical-based methods aim to minimize redundancy and maximize the relevance of features. Most of these algorithms are supervised, because feature relevance is often assessed by its correlation to class labels. In addition, these algorithms often work only with discrete data. (Li J. et al. 2016)


Sparse learning-based methods have received attention in recent years due to their performance and interpretability. The feature selection of these methods is embedded in the learning algorithms. It can lead to very good performance in a specific learning algorithm. Features thus chosen are not guaranteed to perform well with other learning algorithms. (Li J. et al. 2016)

Statistical-based feature selection methods are often used as filtering methods. They utilize different statistical measures instead of learning algorithms. These methods often analyze features individually, meaning feature redundancy is ignored. Statistical-based feature selection methods are often used in data preprocessing. (Li J. et al. 2016)

Additionally, there are hybrid feature selection, deep learning, and reconstruction-based methods. These methods cannot be classified into the categories mentioned above. The idea in hybrid feature selection methods is to generate subsets of features via different feature selection methods and choose the best features from each of these subsets. Feature selection is usually embedded in the model in deep learning feature selection methods.

Relevant features are chosen between the input layer and the first hidden layer.

Reconstruction-based methods define a feature’s relevance by its ability to describe original data with the reconstruction function. (Li J. et al. 2016)

Some methods that are suitable for regression problems are described below and used in the empirical section of the thesis. These methods are Regression ReliefF (RReliefF), the Least Absolute Shrinkage and Selection Operator (Lasso), Correlation-based feature selection (CFS), low variance filtering, and Recursive Feature Elimination (RFE).

RReliefF

Robnik-Sikonja et al. (2003) propose two algorithms, ReliefF for classification and RReliefF for regression. Both are supervised similarity-based filter methods and extensions of the original Relief algorithm, which works in a supervised fashion and only for binary classification problems. The quality of features is calculated as follows:

$W[A] = P(\text{different value of } A \mid \text{nearest observation from different class}) - P(\text{different value of } A \mid \text{nearest observation from same class})$ (3)

The algorithm estimates the quality of features by their ability to separate observations that are near to each other. Robnik-Sikonja et al. (2003) state that ReliefF and RReliefF work in the presence of noise and MVs. The ReliefF algorithm works by randomly selecting an observation R_i and searching for the nearest neighbors from the same class (nearest hits) and the nearest neighbors from other classes (nearest misses). The quality of features is then based on the feature values and the nearest hits and misses. In regression problems, the nearest hits and misses cannot be calculated, so they are replaced in RReliefF as follows:

$P_{diffA} = P(\text{different value of } A \mid \text{nearest instances})$ (4)

$P_{diffC} = P(\text{different prediction} \mid \text{nearest instances})$ (5)

and

$P_{diffC|diffA} = P(\text{different prediction} \mid \text{different value of } A \text{ and nearest instances})$ (6)

so that $W[A]$ for the regression task is calculated using Bayes' rule:

$W[A] = \frac{P_{diffC|diffA}\, P_{diffA}}{P_{diffC}} - \frac{(1 - P_{diffC|diffA})\, P_{diffA}}{1 - P_{diffC}}$ (7)

The pseudocode of RReliefF by Robnik-Sikonja et al. (2003) is presented in Figure 4-5. The inputs are training observations 𝑥 and the target value (𝜏(𝑥)). The output is a vector 𝑊 that gives quality for every feature. In the empirical part, all the features which receive 𝑊[𝐴] > 0 are selected.

1. set all N_dC, N_dA[A], N_dC&dA[A], W[A] to 0;
2. for i := 1 to m do begin
3.   randomly select observation R_i;
4.   select k instances I_j nearest to R_i;
5.   for j := 1 to k do begin
6.     N_dC := N_dC + diff(τ(I_j), R_i, I_j) · d(i, j);
7.     for A := 1 to a do begin
8.       N_dA[A] := N_dA[A] + diff(A, R_i, I_j) · d(i, j);
9.       N_dC&dA[A] := N_dC&dA[A] + diff(τ(I_j), R_i, I_j) ·
10.                    diff(A, R_i, I_j) · d(i, j);
11.     end;
12.   end;
13. end;
14. for A := 1 to a do
15.   W[A] := N_dC&dA[A] / N_dC − (N_dA[A] − N_dC&dA[A]) / (m − N_dC);

Figure 4-5 RReliefF algorithm. Reproduced from Robnik-Sikonja et al. (2003).

In Figure 4-5, N_dC, N_dA[A], and N_dC&dA[A] are the weights for different target values τ(I_j) (line 6), different features (line 8), and different predictions and different features (lines 9 and 10), respectively. m is a user-defined parameter that determines how many times the process is repeated. The term d(i, j) in Figure 4-5 (lines 6, 8 and 10) is:

$d(i, j) = \frac{d_1(i, j)}{\sum_{l=1}^{k} d_1(i, l)}$ (8)

and

$d_1(i, j) = e^{-\left(\frac{rank(R_i, I_j)}{\sigma}\right)^2}$, (9)

where rank(R_i, I_j) is the rank of the observation I_j in a sequence of observations ordered by their distance from R_i, and σ is a user-defined parameter that controls the influence of the distance.
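The core of the algorithm can be sketched in NumPy. The version below is a simplified illustration, not the exact Figure 4-5 procedure: it uses uniform neighbour weights 1/k in place of the rank-based weights d(i, j) of equations (8) and (9), and range-normalized absolute differences as the diff function.

```python
import numpy as np

def rrelieff(X, y, m=100, k=10, random_state=0):
    """Simplified RReliefF feature weights W[A] (uniform neighbour weights)."""
    rng = np.random.default_rng(random_state)
    n, a = X.shape
    span = X.max(axis=0) - X.min(axis=0) + 1e-12       # feature value ranges
    y_span = y.max() - y.min() + 1e-12
    N_dC, N_dA, N_dCdA = 0.0, np.zeros(a), np.zeros(a)
    for _ in range(m):
        i = rng.integers(n)                            # randomly selected R_i
        dist = np.linalg.norm((X - X[i]) / span, axis=1)
        for j in np.argsort(dist)[1:k + 1]:            # k nearest instances I_j
            w = 1.0 / k
            diff_c = abs(y[i] - y[j]) / y_span         # diff in the target
            diff_a = np.abs(X[i] - X[j]) / span        # diff per feature
            N_dC += diff_c * w
            N_dA += diff_a * w
            N_dCdA += diff_c * diff_a * w
    return N_dCdA / N_dC - (N_dA - N_dCdA) / (m - N_dC)

# Features with W[A] > 0 would be kept, as in the empirical part.
```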

Lasso

Lasso is a sparse learning-based embedded method, proposed by Tibshirani (1996). Lasso utilizes l1-regularization, which limits the size of each coefficient. Some coefficients in the model can be shrunk to exactly zero, and the corresponding features can therefore be removed. Tibshirani (1996) defines the lasso estimate (α̂, β̂) as follows:

$(\hat{\alpha}, \hat{\beta}) = \arg\min \left\{ \sum_{i=1}^{N} \left( y_i - \alpha - \sum_{j} \beta_j x_{ij} \right)^2 \right\} \text{ subject to } \sum_{j} |\beta_j| \le t$, (10)

where x_i = (x_{i1}, ..., x_{ip})^T is the feature vector of the i-th observation, y_i is the corresponding target, N is the number of observations, t is the tuning parameter, β̂ = (β̂_1, ..., β̂_p)^T, and the estimate of α is α̂ = ȳ. In the empirical part, all features assigned a non-zero coefficient are selected for the "optimal" feature subset.
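A minimal feature-selection sketch with scikit-learn's Lasso, keeping the features whose coefficients are not shrunk to zero; the data, the regularization strength alpha, and the variable names are illustrative only:

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))                     # placeholder feature matrix
y = 3 * X[:, 0] - 2 * X[:, 3] + rng.normal(scale=0.1, size=200)

X_scaled = StandardScaler().fit_transform(X)       # normalize before fitting
lasso = Lasso(alpha=0.1).fit(X_scaled, y)

selected = np.flatnonzero(lasso.coef_ != 0)        # features with non-zero coefficients
print(selected)                                    # expected to include columns 0 and 3
```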

CFS

CFS is a supervised statistical-based filter method. CFS uses correlation-based heuristics in the evaluation of a feature subset. CFS attempts to maximize the correlation between the target feature and the feature subset while minimizing the correlation between features in the feature subset. Finding the optimal feature subset this way is computationally challenging.

CFS tackles this issue by calculating the utility of each feature. It considers feature-target and feature-feature correlation. It then starts with an empty set and expands it one feature at a time. Addition order for features is determined by utility. The addition continues until some stopping criteria are met. (Li J. et al. 2016)

The feature subset is evaluated using the following function, first introduced by Ghiselli (1964):

$CFS\_score(S) = \frac{k \, \bar{r}_{cf}}{\sqrt{k + k(k - 1)\, \bar{r}_{ff}}}$, (11)

where the CFS score describes the quality of the feature subset S with k features, r̄_cf is the average target-feature correlation, and r̄_ff is the average feature-feature correlation within the feature subset S. The numerator can be seen as a measure of how well S describes the target, and the denominator as a measure of redundancy within S. (Hall et al. 1999)
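The merit score of equation (11) and the greedy forward search described above can be sketched as follows; this is an illustrative implementation using absolute Pearson correlations, not a reference one:

```python
import numpy as np

def cfs_score(X, y, subset):
    """Equation (11): merit of the feature subset given by column indices."""
    k = len(subset)
    r_cf = np.mean([abs(np.corrcoef(X[:, f], y)[0, 1]) for f in subset])
    pairs = [(f, g) for f in subset for g in subset if f < g]
    r_ff = np.mean([abs(np.corrcoef(X[:, f], X[:, g])[0, 1]) for f, g in pairs]) if pairs else 0.0
    return k * r_cf / np.sqrt(k + k * (k - 1) * r_ff)

def cfs_forward_selection(X, y):
    """Start from an empty set and add the feature that most improves the merit."""
    remaining, chosen, best = set(range(X.shape[1])), [], -np.inf
    while remaining:
        scores = {f: cfs_score(X, y, chosen + [f]) for f in remaining}
        f, score = max(scores.items(), key=lambda kv: kv[1])
        if score <= best:
            break                      # stop when the merit no longer improves
        chosen.append(f)
        remaining.remove(f)
        best = score
    return chosen
```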

Low variance

Low variance is a statistical-based filter method. Low-variance features contain less information than features with higher variance. With this method, all features whose variance falls below a predefined threshold are eliminated. All features with zero variance should be removed, because they do not contain any information. A low variance method is commonly used as a preprocessing step rather than as an actual feature selection method. (Li J. et al. 2016)
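With scikit-learn, the same preprocessing step is available as VarianceThreshold; a small sketch with made-up data:

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold

X = np.array([[0.0, 2.1, 5.0],
              [0.0, 2.0, 4.9],
              [0.0, 1.9, 5.1]])                  # the first column has zero variance

selector = VarianceThreshold(threshold=0.0)      # default: drop zero-variance features
X_reduced = selector.fit_transform(X)
print(selector.get_support())                    # [False  True  True]
```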

RFE

RFE is a supervised wrapper method. The ranking of features differs, depending on the learning algorithm in use. In the scikit-learn package, features are ranked by the coefficients of features or feature importance metric. In this thesis, RFE ranks features based on their coefficients, because the learning algorithm used in the feature selection is linear regression.

A higher coefficient value indicates greater importance of that feature. RFE returns the user-specified number of highest-ranked features. One must iterate through different numbers of features to find the feature subset that produces the best accuracy. (Scikit-learn 2019; Guyon et al. 2002)

The steps through RFE are described in the pseudocode in Figure 4-6.

1. divide data into training and testing sets;

2. for 𝑖:= 1 to the maximum number of features do

a. train model with the training set containing all the features;

b. select 𝑖 features with largest coefficients or feature importance;

c. save selected feature subset;

d. save accuracy with the testing set;

3. choose feature subset which produced the best accuracy with the testing set;

4. end;

Figure 4-6 Pseudocode for RFE
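A hedged scikit-learn sketch of the procedure in Figure 4-6, using linear regression as the ranking model; in practice one would loop over different values of n_features_to_select (or use RFECV) and keep the subset with the best test-set accuracy. The synthetic dataset is illustrative only.

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression

X, y = make_regression(n_samples=300, n_features=10, n_informative=4, random_state=0)

# Rank features by the coefficients of a linear model and keep the top 4.
rfe = RFE(estimator=LinearRegression(), n_features_to_select=4)
rfe.fit(X, y)
print(rfe.support_)    # boolean mask of the selected features
print(rfe.ranking_)    # rank 1 marks a selected feature
```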


4.2 Machine learning models

In this chapter, some ML techniques for regression problems are briefly reviewed.

Traditionally, models are physics-based. This means that the relationships between features are explained by the laws of physics. This approach requires extensive knowledge of the process in hand. Processes may have so many features that deriving an accurate model is very complicated. ML techniques are one way to overcome this problem if a lot of data is available. Learning algorithms can learn relationships between features by fitting a curve to the training data. It is an iterative process, which aims to minimize the error between the fitted curve and data points. (Mehrotra et al. 2017, 57-58)

4.2.1 Linear regression

Linear regression can be used to model continuous features, such as electricity consumption. The method assumes that features have linear relationships. When the term "linear regression" is used in the literature, it usually encompasses multiple linear regression as well. (Ryan 2009, 146)

The function of linear regression is

$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_m X_m$, (12)

where Y is the model output, X_i, i = 1, ..., m, are the independent features, and β_i, i = 0, ..., m, are the corresponding coefficients. The goal is to minimize the difference between model outputs and observed values by optimizing the coefficients, known as the least squares estimates.

Ryan (2009, 133-135) illustrates how matrix algebra can be applied to regression. The least squares estimates for the function

$Y = X\beta + \varepsilon$ (13)

can be obtained using

$\hat{\beta} = (X'X)^{-1} X'Y$, (14)

where, in the single-feature case,

$X'X = \begin{bmatrix} n & \sum X \\ \sum X & \sum X^2 \end{bmatrix}$, (15)

and

$X'Y = \begin{bmatrix} \sum Y \\ \sum XY \end{bmatrix}$. (16)


Linear regression may also be solved with a gradient descent method. Many algorithms work iteratively to find these optimal coefficients. These processes are usually gradient solvers. The gradient descent methods work by changing the coefficients on every iteration toward a better fit. Coefficients are changed until the average error between observed values and predicted values no longer changes, or the maximum number of iterations is reached. (Rebala et al. 2019, 27-36)
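Equations (12)-(14) can be checked numerically with NumPy; the synthetic data below is only for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))                      # three placeholder process features
y = 2.0 + 1.5 * X[:, 0] - 0.7 * X[:, 2] + rng.normal(scale=0.1, size=100)

# Least squares estimate of equation (14): beta_hat = (X'X)^-1 X'Y,
# with a column of ones prepended for the intercept beta_0.
X_design = np.column_stack([np.ones(len(X)), X])
beta_hat = np.linalg.solve(X_design.T @ X_design, X_design.T @ y)
print(beta_hat)    # approximately [2.0, 1.5, 0.0, -0.7]
```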

4.2.2 Multilayer perceptron

A multilayer perceptron is an artificial neural network. It can be used for classification and regression problems. A three-layer neural network can be seen in Figure 4-7.

Figure 4-7 Three-layer neural network. Reproduced from (Krawczak 2013, 3).

The network consists of neurons, which are the individual processing units of their inputs.

The neurons are linked by connections, and each connection has a weight. Neurons are organized in layers, and information moves through the network from one layer to the next. Each neuron of a layer receives information from every neuron of the previous layer. The first layer receives the features as inputs, and the layers after the first receive their inputs from the previous layer. The final layer produces the outputs of the neural network. The operation of a single neuron is illustrated in Figure 4-8. (Krawczak 2013, 1-3)


Figure 4-8 Single neuron. Reproduced from (Krawczak 2013, 3).

In Figure 4-8, external signals are denoted by x_i, where i = 1, 2, ..., N; x_i can be an input to the network or the output of a neuron from the previous layer. The weights of the connections are denoted by w_ij, where i = 1, 2, ..., N is the index of the incoming signal and j is the index of the considered neuron. At the point where the weights and θ_j are combined, the following calculation is performed:

$net_j = w_{1j} x_1 + w_{2j} x_2 + \dots + w_{Nj} x_N + \theta_j$, (17)

where θ_j is a bias weight. The activation function determines the value passed to the neurons of the next layer. Rectified activation functions, known as Rectified Linear Units (ReLUs), are now a commonly used choice. ReLUs are simple and fast to execute, alleviate the vanishing gradient problem, and induce sparseness. The rectified linear function is defined as:

$f_{ReL}(x_i) = \max(0, x_i)$. (18)

An issue with ReLUs is that negative inputs are always set to zero, which means negative gradient values cannot get past that neuron during back-propagation. Back-propagation is the training method of the neural network: the weights of the connections are adjusted to minimize the error between the network output and the actual value. (Godin et al. 2017; Krawczak 2013, 3-4; Rumelhart et al. 1986)
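The following few lines sketch equations (17) and (18) for a single neuron; the signal values, weights, and bias are arbitrary example numbers.

```python
import numpy as np

def relu(net):
    # Rectified linear activation, equation (18): f(net) = max(0, net).
    return np.maximum(0.0, net)

x = np.array([0.5, -1.2, 2.0])      # incoming signals x_1, ..., x_N
w_j = np.array([0.8, 0.1, -0.4])    # connection weights w_1j, ..., w_Nj
theta_j = 0.05                      # bias weight of neuron j

net_j = np.dot(w_j, x) + theta_j    # weighted sum, equation (17)
output_j = relu(net_j)              # value passed to the next layer
print(net_j, output_j)
```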


Figure 4-9 ANN training process.

The ANN training process is illustrated in Figure 4-9. Stochastic gradient descent is the process in which the observations of the training set are fed to the ANN one at a time, and the weights of the connections are updated after each iteration. Repeating the process for the whole training set is referred to as an epoch. The required number of epochs depends on the size of the network, the learning rate, and the size of the training data. The learning rate defines how much the weights are updated after each iteration. (Nielsen 2015, 15-24; 40-50)
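As a hedged illustration, the sketch below trains a small MLP regressor with stochastic gradient descent using scikit-learn; the network size, learning rate, number of epochs, and synthetic data are illustrative choices, not the configuration used for the PM drive models.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 4))
y = X @ np.array([1.0, -0.5, 2.0, 0.3]) + rng.normal(scale=0.1, size=1000)

mlp = MLPRegressor(
    hidden_layer_sizes=(16, 16),   # two hidden layers of 16 neurons each
    activation="relu",             # rectified linear activation, equation (18)
    solver="sgd",                  # stochastic gradient descent training
    learning_rate_init=0.01,       # how much the weights change per update
    max_iter=500,                  # maximum number of epochs
    random_state=0,
).fit(X, y)

print(mlp.score(X, y))             # R^2 of the fitted network on the training data
```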

4.2.3 Bagging

Bagging methods work by building several estimators for the same prediction task on random bootstrap subsets of the original dataset. The word “bagging” is an acronym of the words “bootstrap aggregating”. More about the bootstrap can be read in Efron et al. (1994). Bagging is used to reduce the variance of a base estimator; for example, the base estimator can be a decision tree. (Breiman 1996)

Bagging can be used for regression and classification problems. In classification problems, the final prediction is decided using a voting scheme. In regression problems, the final prediction is the average of all individual estimators. Bagging is a relatively easy method to increase the accuracy of a single learning algorithm. The only downside of this procedure is that interpretability decreases. (Breiman 1996)
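A minimal sketch of bagging with decision-tree base estimators is given below; the number of estimators and the synthetic data are illustrative assumptions.

```python
import numpy as np
from sklearn.ensemble import BaggingRegressor
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(500, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.2, size=500)

# Each tree is trained on a bootstrap sample of the data; the final prediction
# is the average of the individual trees, which reduces variance.
bagging = BaggingRegressor(
    estimator=DecisionTreeRegressor(),   # "base_estimator" in older scikit-learn versions
    n_estimators=50,
    random_state=0,
).fit(X, y)

print(bagging.predict([[0.5]]))
```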

4.2.4 Boosting

Boosting is seen as one of the most powerful recently discovered learning ideas. It was originally designed for classification problems but was later extended to regression problems. In boosting, an ensemble of weak learners is built sequentially, rather than in parallel as in bagging. For example, the AdaBoost.M1 algorithm starts by fitting a decision tree to the data with all observations given equal weight. After each successive iteration, the observations that are most wrongly estimated (regression) or misclassified (classification) receive higher weights, which forces the following iteration to focus on these observations. More weight is given to the accurate learners in the ensemble. A visualized example of the AdaBoost.M1 algorithm can be seen in Figure 4-10. (Hastie et al. 2009, 337-338)

Figure 4-10 Adaboost algorithm. Reproduced from Hastie et al. (2008, 338).

In Figure 4-10, the weak learners $G_m(x)$, $m = 1, 2, \ldots, M$, form an ensemble $G(x)$ in which each $G_m(x)$ is weighted by $\alpha_1, \alpha_2, \ldots, \alpha_M$. The gradient-boosting algorithm works similarly to the AdaBoost.M1 algorithm. While AdaBoost.M1 identifies wrongly classified or estimated observations by using weights, gradient boosting uses the gradients of the loss function. The loss function is a measure of how well the model fits the data, and learning algorithms always seek to minimize it. (Singh 2018)
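For illustration, the sketch below fits scikit-learn's AdaBoost regressor (the AdaBoost.R2 variant, which adapts the AdaBoost idea described above to regression); the weak learner depth, ensemble size, and data are illustrative assumptions.

```python
import numpy as np
from sklearn.ensemble import AdaBoostRegressor
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(500, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.2, size=500)

# Trees are fitted sequentially: poorly estimated observations receive higher
# sample weights, and more accurate trees get larger weights in the ensemble.
ada = AdaBoostRegressor(
    estimator=DecisionTreeRegressor(max_depth=3),   # "base_estimator" in older versions
    n_estimators=100,
    random_state=0,
).fit(X, y)

print(ada.predict([[0.5]]))
```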

The hyperparameters of the gradient-boosting algorithm can be tuned to improve the performance of the model. Hyperparameters are parameters that affect the learning process of an ML algorithm. Unlike the model parameters, which are optimized during training, hyperparameters are independent of the data. In this thesis, the focus is on the hyperparameters that have the most influence on model performance: the learning rate (“learning_rate”), the number of learners (“n_estimators”), the maximum depth of a tree (“max_depth”), and the minimum number of samples required to split an internal node (“min_samples_split”). These and other hyperparameters are documented in Scikit-learn (2019).

The effects of each hyperparameter tuned in this thesis are described below; a short code sketch follows the list:

• learning_rate: The contribution of each successive weak learner is decreased by the learning rate. Increasing the learning rate gives more influence to the weak learners trained at the beginning of the iteration, and vice versa.

• n_estimators: The number of learners trained. Usually, when increasing the number of trees, the learning rate is decreased.

• max_depth: The maximum depth of the individual weak tree learner. The best value depends on the interaction between the features.

• min_samples_split: Increasing the number of samples required to split an internal node may help to reduce overfitting.
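The sketch below shows how these four hyperparameters map onto scikit-learn's gradient-boosting regressor; the concrete values and the synthetic data are illustrative placeholders, not the values selected in this thesis.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
y = X[:, 0] ** 2 + 2.0 * X[:, 1] + rng.normal(scale=0.1, size=1000)

gbr = GradientBoostingRegressor(
    learning_rate=0.05,        # shrinks the contribution of each successive tree
    n_estimators=500,          # number of weak learners in the ensemble
    max_depth=3,               # depth of the individual tree learners
    min_samples_split=10,      # samples required to split an internal node
    random_state=0,
).fit(X, y)

print(gbr.score(X, y))         # R^2 on the training data
```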

4.3 Model interpretation

It can be extremely difficult to interpret the decision making of complex ML models. However, interpretation is still sometimes necessary, for example for legal reasons. Complex models are often avoided because simpler models are more interpretable, even when the accuracy of the complex model is higher than that of the simple model. To address this problem, Lundberg et al. (2017) proposed the SHAP (SHapley Additive exPlanations) method. SHAP uses an explanation model, which is an interpretable approximation of the original complex model.

The basic idea behind additive feature attribution methods is to explain the prediction of the model $f$ with an explanation model $g$. Explanation models often use simplified inputs $x'$ in place of the original inputs $x$. The original inputs are recovered through the mapping function

$x = h_x(x')$. (19)
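As a hedged example, the snippet below uses the shap package to build an explanation model for a tree ensemble; the model and data are the illustrative ones from the earlier sketches, and the exact API may vary between shap versions.

```python
import numpy as np
import shap
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
y = X[:, 0] ** 2 + 2.0 * X[:, 1] + rng.normal(scale=0.1, size=1000)
model = GradientBoostingRegressor(random_state=0).fit(X, y)

# TreeExplainer plays the role of the explanation model g for the complex
# model f; shap_values gives each feature's contribution to each prediction.
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X[:100])
print(shap_values.shape)        # (100, 5): one attribution per feature and row
```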
