
Fault Detection of Wind Turbines Using Deep Learning

Muratcan Kilic

Master’s Thesis

School of Computing University of Eastern Finland

31 May 2021


UNIVERSITY OF EASTERN FINLAND, Faculty of Science and Forestry, Joensuu School of Computing

International Master’s Programme in Information Technology

Kilic, Muratcan: Fault Detection of Wind Turbines Using Deep Learning. Master's Thesis

Supervisors: Dr. Xiao-Zhi Gao and Dr. Xiaoshu Lu (University of Vaasa)
July 2021

Abstract: In Finland, the wind energy industry is growing gradually, and as the wind power capacity increases over time, the need for fault detection systems becomes crucial. This work investigates several deep learning models on a limited dataset with limited computational power, and compares several representations of the data. The instances of the dataset are wind turbine sounds. A data synthesis technique is applied to the small training dataset that was gathered. In the fault detection experiments, this work classifies the instances into three categories: working, problematic, and not working. The deep learning models are convolutional neural networks, recurrent neural networks, and long short term memory models. The experimental process includes three representations of the audio data: Mel Frequency Cepstrum Coefficients (MFCC), spectrogram, and Mel spectrogram. The results show that using a small dataset and relatively shallow structures, forced by real-world limitations, is not feasible in deep learning applications.

Keywords: Fault Detection and Diagnosis, Deep Learning, Convolutional Neural Networks, Recurrent Neural Network, Long Short Term Memory, Mel Frequency Cepstrum Coefficients, Spectrogram


Preface

This thesis has been a work of hardships, given the limited resources and time I had during my studies at the University of Eastern Finland. Hence, I give my thanks to Dr. Xiao-Zhi Gao and Dr. Xiaoshu Lu for their support in writing this thesis. Also, thanks to the company Wärtsilä for providing me their wind turbine audio data. I am grateful for your support. And finally, I express my thanks to Oili Kohonen for her assistance and guidance during my studies over two years. Even though I could not achieve the expected results, I hope this work can be helpful to others.

Muratcan Kilic


Contents

1 Introduction
2 Related Work
  2.1 Fault Detection and Diagnosis
  2.2 Wind Turbines and Their Noise
  2.3 Deep Learning
3 Background Information
  3.1 Machine Learning
  3.2 Artificial Neural Networks
  3.3 Deep Learning
    3.3.1 Convolutional Neural Networks
    3.3.2 Recurrent Neural Networks
  3.4 Representation of Audio Data
    3.4.1 Analog and Digital Signals
    3.4.2 Discrete and Fast Fourier Transform
    3.4.3 Spectrogram
    3.4.4 Mel-Frequency Cepstrum Coefficients
4 Implementation
  4.1 Data Collection
    4.1.1 SMOTE
    4.1.2 Data Representation and Preprocessing
  4.2 Training Models
    4.2.1 Practical Methods
    4.2.2 Model Definitions
5 Results and Discussion
  5.1 Working as a Binary Classification Problem
  5.2 Detailed Evaluation of Models
6 Conclusion
References


1. Introduction

Wind turbines are energy sources that convert the energy of the wind into electrical energy. They are valuable because wind power is considered a renewable energy source. Wind turbines are placed in groups called wind farms, and a wind farm is usually located onshore.

Finland has increased its cumulative wind power capacity gradually since 1991; according to a recent report, Finland has a capacity of 2586 MW of energy production circa 2020 [1]. In the future, the cumulative wind power capacity is assumed to increase at a higher pace, and this trend may require the reassessment of fault detection systems.

A fault detection system is a diagnosis scheme that detects the time and pattern of a fault that occurs in a given mechanism. Such systems help reduce maintenance costs because of their automated manner [2]. They are also predictive in nature, assisting in predicting the time of the next fault, so the system operators can be more prepared. The fault detection system applied here is a deep learning technique, which is a subfield of neural networks. The model is a supervised learning method. For the detection of failures in mechanisms and the prediction of the next failure, a dataset of wind turbine sounds is used.

Fault detection systems have improved over the years, and nowadays they are mostly facilitated by applications of artificial intelligence. One of the most efficient techniques, deep learning, has had a rising popularity since the start of the 2010s, and it is applied in many fields. It has been a major step for many artificial intelligence problems, and many fields such as speech recognition, image processing, computational linguistics, and sound generation use it as a reliable and efficient method [3].

The model given in this work classifies the sound data into three main categories, listed as working, problematic, and not working. The fault detection system given here also predicts the possible times of occurrence of failures. One kind of structure used for this purpose is the recurrent neural network (RNN), which is a deep learning model [4]. Also, convolutional neural networks, which handle the data in a two-dimensional structure, have been used as a technique in fault detection and diagnosis [5].

The aim of this work is to create a typical fault detection system for the three categories given above. The objectives of this work are acquiring the audio data of wind turbines, preprocessing the data into a feasible format, and feeding it into several different types of neural networks and comparing the results.

In Chapter 2, the related work on deep learning, recurrent neural networks, wind turbines, and fault detection systems is surveyed. Chapter 3 provides the theoretical background on machine learning basics, artificial neural networks, deep learning, recurrent neural networks, and the processing of audio data, respectively. Chapter 4 introduces the research model used in this work and its outlines. Chapter 5 presents the experimental results of the models, and Chapter 6 discusses the results and concludes the work.


2. Related Work

2.1 Fault Detection and Diagnosis

Fault detection and diagnosis have been helpful in performing routine tasks that were previously performed by humans. Many tasks that used to be done by hand are now carried out in an automated manner [6]. Tasks under purely human control have led to many disasters and accidents, such as the Piper Alpha incident. Piper Alpha was an oil platform operated in the North Sea. The disaster resulted in the deaths of more than 100 people due to explosions [7].

A fault detection system should be quick in diagnosis. However, this might reduce performance, so the control engineers should find a balance between performance and speed [8]. Also, the isolation of faults is an essential aspect for those systems. That is, the system should be able to recognize different types of failures; thus, an implementation of early-stage fault isolation with clear real-time data is convenient [9].

A fault detection system should be robust, since the sensors stay under high amounts of stress and in the presence of noise [10]. To increase the robustness, every system has different kinds of measures. For example, in a work published in 2018 [11], a special case of the Kalman filter is used. Kalman filters help to approximate the uncertainties in a given system and increase its robustness [12]. A fault detection system should also be able to handle multiple faults at once. This is usually an overlooked feature for these systems [13] and leads to reduced robustness.

As a subtopic of fault detection, novelty detection is also common. Novelty detection means that, instead of known patterns of a fault, the system observes a completely new class of fault. These kinds of failures can be detected with methods such as the negative selection algorithm [14], support vector machines [15], Gaussian processes [16], and autoencoders [17].

With the passing of time, new technologies and updates on frameworks might render a fault detection system useless. Hence, building adaptable systems is essential when producing them commercially. One such case is building a self-adaptable system with a Gaussian estimate [18]. An adaptable framework has the ability to adapt to different kinds of energy systems; however, the computational load might be stringent in a more general and adaptable model, which might give rise to challenging issues. One has to compromise some features to balance adaptability and computational sufficiency [19].

2.2 Wind Turbines and Their Noise

Wind turbines are devices that convert wind energy to electrical energy. They can be onshore or offshore. A wind turbine has several main components: a turbine rotor, a gearbox, a generator, a power electronic system, and a transformer for grid connection [20]. Turbine rotors generate sound up to a magnitude of 103 dB depending on the wind current [21]. The generated sound can be converted to digital data and used in the analysis. If there is a fault in the structure of a wind turbine, the generated turbine sound will be different than usual. Even though an average human can notice the noise of a faulty wind turbine by observation, building a fault detection system is important due to its constant readiness and a classification capacity that can surpass humans.

2.3 Deep Learning

Modern fault detection systems can be implemented with deep learning methods, which have been popular since the 2010s. The related work includes autoencoder neural networks [22], recurrent neural networks [23] such as long short term memory (LSTM) models [24], and convolutional neural networks [25]. Fault detection is considered an implementation of anomaly detection techniques. Anomalies are outlying conditions that are unfamiliar to the usual flow of the system; hence, anomalies also include faults. The model for the anomaly detection system can be supervised or unsupervised. Moreover, there are semi-supervised approaches such as implementing a deep belief network [26]. An unsupervised approach is not feasible for tracking the cause of a fault; however, it can be used to make a quick prediagnosis [27] and might be a subsystem of a larger system.

Since this work uses recurrent neural networks, knowledge about past recurrent neural network research is needed. There are many types of recurrent neural networks that are applied in fault detection and diagnosis in control systems, such as Elman neural networks [28], which consist of internal states, the LSTM models mentioned above, Hopfield networks [29], an earlier model in the literature, and its variants such as bidirectional associative memories [30], and echo states [31].

LSTM models are used for anomaly detection, and thus LSTM is also applied in this work. The vanishing gradient problem, which causes the model to halt prematurely, can be avoided with this model; the problem was the central topic of Hochreiter's research and led to the invention of LSTMs [32].


Fault Detection Method Reference

Kalman Filter [11]

Negative Selection Algorithm [14]

Support Vector Machines [15]

Gaussian Processes [16]

Autoencoders [17][22]

Long Short Term Memory [24]

Recurrent Neural Networks [23]

Convolutional Neural Networks [25]

Deep Belief Networks [26]

Elman Neural Networks [28]

Hopfield Networks [29]

Bidirectional Associative Memories [30]

Echo States [31]

Table 2.1: List of Fault Detection Methods

Table 2.1 shows some of the fault detection methods found in the literature and their references, summarizing the discussion given above.


3. Background Information

3.1 Machine Learning

A computational system has the ability to learn, even though it is methodically different from a human being. The invention of machine learning has roots as early as the 1950s, and the first mention of the term appears in 1959 [33]. The first paper that coins the term "machine learning" discusses neural networks and applies rote learning on minimax trees, where the system computes all the possibilities of a checkers game from a few steps ahead. However, this approach is poor and expensive, and in some conditions impossible. Searle mentioned that, theoretically, one can design a "Chinese room" that can answer all possible questions that the computer has learned in a rote way [34]. As the reader may expect, since there is a limit both in terms of memory and time, designing systems that are non-deterministic (but stochastic) has been the more widespread approach in the field of artificial intelligence. That is, there is always a fraction of randomness in the algorithm that cannot be determined.

Machine learning is a stochastic process that helps the computational system to improve itself using pre-existing data, and also to generate, classify, cluster, and predict new data. There is a vast number of learning models, and each one has its own advantages and disadvantages. In a general sense, those models are classified into three distinct categories: unsupervised learning, supervised learning, and reinforcement learning. Furthermore, there is semi-supervised learning [35], which is a hybrid of the unsupervised and supervised categories.

Unsupervised learning is the categorization of data and the creation of clusters, without dependence on labeled training data. Even though unsupervised learning techniques help the machine to distinguish differences in patterns, the machine cannot have an idea of what the patterns are. Unsupervised learning techniques include K-Means clustering, which has been popular due to its simplicity.

Unlike the unsupervised approach, supervised learning trains the given model and predicts the outcome of future data. Hence, in supervised learning, training and testing are needed. Training data has some properties and labels assigned to them by a human supervisor. Labels may be continuous or discrete. For example, in regression, labels are chosen from the real numbers or a subset of them. Test data may or may not hold labels; if the test data is labeled, it is not used to fit the model, but for the comparison between the prediction and the reality, for the purpose of evaluating the model. Properly fitting the model is important and has been a part of past research, especially to overcome the problem of overfitting. Overfitting happens when the learning model fits the given training data without notable issues but performs poorly in prediction. There are no formally defined criteria for overfitting, nor any universal methods to solve it; there are, however, widely accepted rules of thumb. One method is cross-validation. In cross-validation, as illustrated in Figure 3.1, the data is folded into k parts, and for each iteration, k - 1 parts are used for training purposes, whereas 1 part is used for testing. Later, among the k models that are fit differently, the model with the better performance in terms of accuracy ratio is used for further applications [36].

Figure 3.1: An illustration of cross-validation where k = 4, k_i represents the fold number and p_i represents the accuracy of the model.
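To make the cross-validation procedure concrete, the following sketch shows a k = 4 split with scikit-learn; the data, the classifier, and all parameter values are only illustrative assumptions, not the setup used later in this thesis.

```python
# A minimal k-fold cross-validation sketch (k = 4), mirroring Figure 3.1.
import numpy as np
from sklearn.model_selection import KFold
from sklearn.linear_model import LogisticRegression

X = np.random.randn(200, 16)            # 200 hypothetical instances with 16 features
y = np.random.randint(0, 2, size=200)   # hypothetical binary labels

accuracies = []
for train_idx, test_idx in KFold(n_splits=4, shuffle=True).split(X):
    model = LogisticRegression(max_iter=1000)
    model.fit(X[train_idx], y[train_idx])                      # fit on k - 1 folds
    accuracies.append(model.score(X[test_idx], y[test_idx]))   # accuracy p_i on the held-out fold

print(accuracies)   # one accuracy per fold; the best-performing model would be kept
```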

A reinforcement learning model operates in an active environment and acts on a reward system. Generally, the model consists of a set of states S and a set of possible actions A, and there is a reward function f that the model optimizes; no initial labeling is required. There are many different reinforcement learning methods, where S and A can be either continuous or discrete. Monte Carlo and Q-Learning are two well-known reinforcement learning methods.


Figure 3.2: An illustration of a simple neural network with an input layer, 3 hidden layers and an output layer. For clarity, the reader should be aware that b_i is added for each node of a layer. f represents the activation function.

3.2 Artificial Neural Networks

One of the extensively studied fields in machine learning is artificial neural networks, which are directed and weighted graph structures. This graph structure is layered and is fully connected between consecutive layers. A typical artificial neural network consists of an input layer, multiple hidden layers and an output layer. Also, for each layer there is a bias, which is added to each node of the layer. A simple neural network is represented as in Figure 3.2.

Since the model of this work is an artificial neural network, formal definitions will be convenient. Let A = {L, W, B} represent an artificial neural network, where L = {L_1, L_2, \ldots, L_r} is the set of layers, r is the layer count, and each L_i = {x_{i1}, x_{i2}, \ldots, x_{im_i}}, where the x_{ij} are nodes and m_i is the node count of the given layer. W = {w_{11}^{(1)}, \ldots, w_{(r-1)im_i}} is the set of weights, and B = {b_1, b_2, \ldots, b_{r-1}} is the set of biases. The values of the nodes are determined as

x_{(i+1)j} = f\left( b_i + \sum_{k=1}^{m_i} x_{ik} w^{(i)}_{jk} \right)    (3.1)

where f is an activation function and w^{(i)}_{jk} represents the weight between nodes x_{ik} and x_{(i+1)j}. Note that the values of the input layer are already initialized.
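The layer-by-layer computation of Equation 3.1 can be written directly in NumPy. The sketch below is only an illustration: the network shape, the random weights, and the choice of sigmoid as f are assumptions, not values from this thesis.

```python
# Forward pass of a small fully-connected network, applying Equation 3.1 per layer.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, weights, biases, f=sigmoid):
    # weights[i] maps layer i to layer i+1; biases[i] is added to every node of layer i+1
    for W, b in zip(weights, biases):
        x = f(b + x @ W)
    return x

rng = np.random.default_rng(0)
weights = [rng.normal(size=(4, 3)), rng.normal(size=(3, 2))]   # a 4 -> 3 -> 2 network
biases = [rng.normal(size=3), rng.normal(size=2)]
print(forward(rng.normal(size=4), weights, biases))
```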

Activation functions are important when determining the output of an artificial neural network system. The activation function f can have any range, including unbounded ranges, and can be either linear or nonlinear, continuous or discrete. Some well-known functions include:


1. Identity function, which does not change the output, hence f(x) = x. However, the identity function is practically unfeasible, because the derivative of f(x) is f'(x) = 1, and as will be clarified later, that renders the backpropagation training algorithm useless, because the algorithm is based on derivatives of functions.

2. Sigmoid, an efficient non-linear, differentiable activation function, is defined as

   f(x) = \frac{1}{1 + e^{-x}}    (3.2)

   Sigmoid is special because it normalizes the output into the range (0, 1). Sigmoid functions are prone to the vanishing gradient problem, hence they need additional optimization [37].

3. As an alternative to sigmoid, the hyperbolic tangent (tanh), which is defined as

   f(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}} = \frac{e^{2x} - 1}{e^{2x} + 1}    (3.3)

   gives output in the range (-1, 1). Actually, tanh is a rescaled version of the sigmoid function from the range (0, 1) to (-1, 1). Let \sigma be the sigmoid function. Then

   \tanh(x) = 2\sigma(2x) - 1,    (3.4)

   since

   2\sigma(2x) - 1 = \frac{2}{1 + e^{-2x}} - 1 = \frac{2 - (1 + e^{-2x})}{1 + e^{-2x}} = \frac{1 - e^{-2x}}{1 + e^{-2x}} = \frac{e^{2x} - 1}{e^{2x} + 1} = \tanh(x).    (3.5)

In practice, a hyperbolic tangent activation function is useful in applications with strictly polarized outputs, such as sentiment classification[38], where the outputs can be negative, neutral, or positive.

4. Even though the Rectified Linear Unit had been used without a given name before [39], it was popularized in 2012 with AlexNet [40], which also became a milestone in deep learning. The rectified linear unit, ReLU, is defined as

   f(x) = \begin{cases} 0, & x \leq 0 \\ x, & x > 0 \end{cases}    (3.6)


5. Unfortunately, due to a practical problem that leads to convergence to zero during training with the ReLU activation function, the function was modified to

   f(x) = \begin{cases} \alpha x, & x \leq 0 \\ x, & x > 0 \end{cases}    (3.7)

   where \alpha < 1 is a very small constant [41]. This variant of the ReLU activation function is called the leaky ReLU activation function.

The summary of the activation functions can be found in Table 3.1.
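For reference, the activation functions above can be expressed in a few lines of NumPy; the α value for leaky ReLU is an arbitrary illustrative choice.

```python
# The activation functions of Table 3.1 as NumPy functions.
import numpy as np

def identity(x):
    return x

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))          # Equation 3.2

def relu(x):
    return np.where(x > 0, x, 0.0)           # Equation 3.6

def leaky_relu(x, alpha=0.01):
    return np.where(x > 0, x, alpha * x)     # Equation 3.7, alpha is a small constant

x = np.linspace(-3, 3, 7)
print(np.allclose(np.tanh(x), 2 * sigmoid(2 * x) - 1))   # True, verifying Equation 3.4
```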

Since artificial neural networks are supervised structures, they need a means of training. There are many training algorithms for this purpose. One such algorithm is the perceptron training algorithm, which is specialized for one particular class of artificial neural networks. The perceptron training algorithm is designed for an ANN with an input layer with one or multiple neurons and an output layer with one neuron. It does not include any hidden layers, and the activation function on its output layer is the threshold function, which is

f(x) = \begin{cases} -1, & x \leq 0 \\ 1, & x > 0 \end{cases}    (3.8)

For a randomly initialized ANN, let the weights be

W = (w_0, w_1, \ldots, w_n)^T    (3.9)

and let the input values of the nodes be

X = (1, x_1, \ldots, x_n)^T    (3.10)

and simply write the output as

y = f(W^T \cdot X) = f(w_0 \cdot 1 + w_1 x_1 + \cdots + w_n x_n),    (3.11)

which is a convenient way to write Equation 3.1, where f is the threshold function and the bias is represented with the value 1 and the weight w_0. For each weight w_i, the update step equation is w_i = w_i + \Delta w_i, and the difference is \Delta w_i = \eta (t - y) x_i, where \eta is the learning rate with 0 < \eta \leq 1 and t represents the output value from the training dataset. This process repeats until the system reaches convergence, so given enough time, the number of iterations is finite [42].

Function | Equation | Range
Identity | f(x) = x | (-\infty, \infty)
Sigmoid | f(x) = 1/(1 + e^{-x}) | (0, 1)
tanh | f(x) = (e^{x} - e^{-x})/(e^{x} + e^{-x}) | (-1, 1)
ReLU | f(x) = 0 for x \leq 0, x for x > 0 | [0, \infty)
Leaky ReLU | f(x) = \alpha x for x \leq 0, x for x > 0 | (-\infty, \infty)

Table 3.1: List of common activation functions and their ranges (the plots are omitted here).
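As a concrete illustration of the perceptron training algorithm and the update rule Δw_i = η(t − y)x_i, the following sketch learns the logical AND function, which, unlike XOR, is linearly separable; the data and the learning rate are illustrative assumptions.

```python
# Perceptron training: w_i <- w_i + eta (t - y) x_i with a threshold activation.
import numpy as np

def threshold(z):
    return np.where(z > 0, 1, -1)                 # Equation 3.8

X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
t = np.array([-1, -1, -1, 1])                     # logical AND targets
Xb = np.hstack([np.ones((len(X), 1)), X])         # prepend 1 so that w_0 acts as the bias

w = np.zeros(3)
eta = 0.1
for _ in range(100):                              # finite for linearly separable data
    for x_i, t_i in zip(Xb, t):
        y_i = threshold(w @ x_i)
        w += eta * (t_i - y_i) * x_i

print(w, threshold(Xb @ w))                       # learned weights and their predictions
```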


Figure 3.3: An illustration of the failure of a perceptron that is designed with the intention of recognizing the XOR function. There is no line x_2 = a x_1 + b that separates the black dots, which represent the output value 1, from the white dots, which represent the output value -1. The dotted line is placed to visualize the impossibility.

Unfortunately, after discovering that the XOR function failed in the given model due to a phenomenon called linear inseparability, which is shown in Figure 3.3, the research on artificial neural networks ceased until the popularization of the backpropagation algorithm in 1986 [43]. The backpropagation algorithm is designed for differentiable activation functions with the assistance of stochastic gradient descent, and holds a basis for deep learning. Now, the representation of the weights for the jth layer is

W^{(j)} = \begin{pmatrix}
w^{(j)}_{01} & w^{(j)}_{02} & \cdots & w^{(j)}_{0m_i} \\
w^{(j)}_{11} & w^{(j)}_{12} & \cdots & w^{(j)}_{1m_i} \\
w^{(j)}_{21} & w^{(j)}_{22} & \cdots & w^{(j)}_{2m_i} \\
\vdots & \vdots & & \vdots \\
w^{(j)}_{i1} & w^{(j)}_{i2} & \cdots & w^{(j)}_{im_i}
\end{pmatrix},    (3.12)

where all w_{0i} are representations of the bias, and for each i, k, w_{0i} = w_{0k}. The input values of the nodes for the jth layer are represented as

X_j = (1, x_{j1}, \ldots, x_{jm_i})^T.    (3.13)


The intended output values can be computed through Equation 3.11 for each layer j. The output notation is y^{(j)}. It is important to see that y^{(j)} might not be a scalar value, but a vector value.

For a given error function E, the function must be differentiable to apply the backpropagation algorithm. For f in layer j, the gradient of E is

\frac{\partial E}{\partial y^{(j)}} = \left[ 0, \frac{\partial E}{\partial y^{(j)}_1}, \frac{\partial E}{\partial y^{(j)}_2}, \ldots \right]    (3.14)

There are different types of error equations; however, a common choice for the error function in backpropagation algorithms is

E = \frac{1}{2} \sum_j \left( y^{(j)} - t^{(j)} \right)^2,    (3.15)

where t^{(j)} is the output value from the training dataset for the jth layer. E and t^{(j)} can be vector values.

For the sigmoid function f_\sigma, its derivative is

f_\sigma'(x) = f_\sigma(x)\left(1 - f_\sigma(x)\right)    (3.16)

and for tanh, its derivative is

\tanh'(x) = 1 - \tanh^2(x).    (3.17)

The change for a given weight w^{(j)}_{ik} is

\Delta w^{(j)}_{ik} = -\eta \frac{\partial E}{\partial w^{(j)}_{ik}}, [44]    (3.18)

where 0 < \eta \leq 1 is the learning rate.

Using the chain rule, for each weight, write the gradient

\frac{\partial E}{\partial w^{(j)}_{ik}} = \frac{\partial E}{\partial y^{(j)}_i} \frac{\partial y^{(j)}_i}{\partial z^{(j)}_i} \frac{\partial z^{(j)}_i}{\partial w^{(j)}_{ik}}    (3.19)

where z^{(j)}_i = f^{-1}(y^{(j)}_i). It can be seen that

\frac{\partial z^{(j)}_i}{\partial w^{(j)}_{ik}} = y^{(j)}_i    (3.20)

since z^{(j)}_i represents the weighted sum of node input values. Also, the property \frac{\partial y^{(j)}_i}{\partial z^{(j)}_i} = f'(y_{ij}) holds. And, finally,

\frac{\partial E}{\partial y^{(j)}_i} = t^{(j)}_i - y^{(j)}_i,    (3.21)

which is the partial derivative of E with respect to the term y^{(j)}_i. Then the final equation is

\frac{\partial E}{\partial w^{(j)}_{ik}} = y^{(j)}_i \left( t^{(j)}_i - y^{(j)}_i \right) f'(y_{ij}).    (3.22)

For each epoch, i.e. step, the system updates the weights until a condition is met. This condition can be a limit on the number of steps, or the error function having a value E \leq \epsilon for a particular number \epsilon. The update is

w^{(j)}_{ik} = w^{(j)}_{ik} + \Delta w^{(j)}_{ik}.    (3.23)
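The update rules above can be illustrated with a single sigmoid output layer trained on the squared error of Equation 3.15. The shapes, data, and learning rate below are arbitrary illustrative assumptions rather than the procedure used later in this thesis.

```python
# One gradient-descent update of an output layer, following Equations 3.18-3.19.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(1)
x = rng.normal(size=4)           # activations coming from the previous layer
t = np.array([0.0, 1.0])         # target outputs
W = rng.normal(size=(4, 2))      # weights into the two output nodes
eta = 0.5

y = sigmoid(x @ W)                               # forward pass
grad = np.outer(x, (y - t) * y * (1 - y))        # dE/dW by the chain rule
W -= eta * grad                                  # Delta W = -eta dE/dW
print(0.5 * np.sum((sigmoid(x @ W) - t) ** 2))   # squared error after the update
```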

3.3 Deep Learning

Deep learning is an informal term for an artificial neural network that has many hidden layers. A deep learning structure can be feedforward or recurrent. Feedforward neural networks take the output of the previous layer and use it as the input of the next layer. On the other hand, recurrent neural networks have the ability to loop through the same layers under given conditions. There are different types of deep learning models, such as convolutional neural networks (CNN), generative adversarial networks (GAN), recurrent neural networks (RNN), and deep belief networks (DBN), which are some of the well-known models.

3.3.1 Convolutional Neural Networks

A convolutional neural network [45] is a type of artificial neural network that processes multidimensional data. Convolutional neural networks apply a mathematical operation called convolution. For a given multidimensional array, a kernel, which is usually smaller than the array, is applied. For a two-dimensional array A = X × Y, the convolution operation at the given coordinates (x, y) of A with a kernel K is formally defined as

(A * K)(x, y) = \sum_i \sum_j A(i, j) K(x - i, y - j),    (3.24)

where * is the convolution operator. In practical applications of two-dimensional models, the kernel sizes are chosen as 3×3, 5×5, or 7×7. In the common approach, the values of each kernel are randomized for each layer during the training phase.

Moreover, practically, the same input is exposed to different kernels at the same layer. Assuming there are k kernels for a given layer, there are k different outputs of array A, combined into one tensor of dimensions (X×Y ×k).
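A direct (and deliberately unoptimized) NumPy version of Equation 3.24 makes the size reduction explicit; the array contents and the averaging kernel are illustrative only.

```python
# 2-D convolution with a "valid" output; the kernel is flipped, as in the definition.
import numpy as np

def conv2d(A, K):
    kh, kw = K.shape
    H, W = A.shape[0] - kh + 1, A.shape[1] - kw + 1
    out = np.zeros((H, W))
    for x in range(H):
        for y in range(W):
            out[x, y] = np.sum(A[x:x + kh, y:y + kw] * K[::-1, ::-1])
    return out

A = np.arange(36, dtype=float).reshape(6, 6)
K = np.ones((3, 3)) / 9.0                 # a simple averaging kernel
print(conv2d(A, K).shape)                 # (4, 4): the edges are excluded
```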

Another feature of convolutional neural networks is a method called pooling. Pooling decreases the size of a given multidimensional array into a smaller one. For example, a two-dimensional array is reduced from size i × j to i' × j', where i' < i and j' < j. Pooling functions may differ between applications; however, a common pooling function is max pooling. Max pooling divides an array into strides of size n × n × · · · × n, depending on the dimensionality, and eliminates all elements in each stride except the maximum value. Figure 3.4 illustrates the max pooling operation.

Figure 3.4: Max pooling example for a two-dimensional array. For the upper-left block, the maximum element, which is 4, is preserved, and the other elements are collapsed into the maximum element.
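Max pooling over non-overlapping blocks can be written compactly in NumPy; the example reproduces the idea of Figure 3.4 with made-up values and assumes the array size is divisible by the block size.

```python
# Max pooling with n x n blocks: only the maximum of each block is kept.
import numpy as np

def max_pool(A, n=2):
    h, w = A.shape
    return A.reshape(h // n, n, w // n, n).max(axis=(1, 3))

A = np.array([[1, 3, 2, 1],
              [4, 2, 0, 1],
              [5, 1, 2, 2],
              [0, 1, 3, 4]], dtype=float)
print(max_pool(A))   # [[4. 2.] [5. 4.]]
```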

In convolutional neural networks, the common activation function is ReLU, except for the output layer. The output layer represents the number of possible classes for the classification. Before the output layer, a CNN should be converted in a manner similar to a multilayer ANN, which is referred to as flattening. In flattening, an m-dimensional array of size n_1 × n_2 × · · · × n_m (which depends on the dimensionality of the application of the model) is translated into a one-dimensional array of size n_1 n_2 · · · n_m. After flattening, a fully-connected layer follows, which does not take multidimensional input, but one-dimensional input instead. Then, the output layer uses a special activation function called softmax. The softmax function is defined as

f_i(x) = \frac{e^{x_i}}{\sum_j e^{x_j}}    (3.25)

where each x_i represents the output weights in the fully-connected layer before activation. With this, the sum of all the values of the output layer will be 1, and this represents the probabilities for each given class.


The training method of convolutional neural networks is backpropagation. During backpropagation, the derivative of softmax should be computed, which is

\frac{\partial f_i(x)}{\partial x_i} = f_i(x)\left(1 - f_i(x)\right)    (3.26)

and

\frac{\partial f_i(x)}{\partial x_j} = -f_i(x) f_j(x)    (3.27)

where i \neq j [46].

There is a loss function which is found more suitable for CNNs, called cross-entropy, which is closely related to softmax. The formal definition of the cross-entropy loss is

E = -\sum_j t_j \ln(x_j),    (3.28)

where t_j is the target probability of class j and x_j is the predicted probability; the loss is non-negative [47]. Like softmax, cross-entropy depends on the probability distribution. Cross-entropy loss is intuitively more suitable when classifying categorical data, which is usually the case with the architecture of convolutional neural networks.
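Softmax and the cross-entropy loss are short enough to show directly; the logits and the one-hot target below are illustrative values, and the max-subtraction is the usual numerical-stability trick rather than part of Equation 3.25.

```python
# Softmax output probabilities and the cross-entropy loss against a one-hot target.
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / np.sum(e)

def cross_entropy(p, t):
    return -np.sum(t * np.log(p + 1e-12))   # small constant avoids log(0)

logits = np.array([2.0, 0.5, -1.0])          # outputs of the fully-connected layer
p = softmax(logits)
t = np.array([1.0, 0.0, 0.0])                # true class, e.g. "working"
print(p, p.sum(), cross_entropy(p, t))       # probabilities sum to 1
```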

Wrapping up all the given information, a typical structure of a CNN can be seen in Figure 3.5. The example here is a small-scale convolutional neural network, which classifies two different objects. Creating datasets is easier in the "object a or not" approach, because there would be two categories of datasets, one for the given object a, and the other for several random objects other than object a.

3.3.2 Recurrent Neural Networks

Recurrent neural networks (RNN) are structures that are designed for learning sequential data, in which one input is considered at a time and memorized [43]; hence, in an RNN, the output of the last of the hidden layers is fed into the first of the hidden layers depending on the given criteria, creating a recurrent approach. For each step, the output of the hidden layer is also sent to the output layer, creating outputs Y = {y(t) | t ∈ T}, where t is the given time in the time series T, and y(t) is the output. The initial output at the start is y(0), and the output of the hidden layer for time t is h(t). Assuming the initial output of the hidden layer is h(0), for each iteration t ∈ T, the hidden layer gets the inputs h(t-1) and x(t). This process is simplified in Figure 3.6. Mathematically, the process can be formulated as

h(t) = f_1(W_h h(t-1) + W_x x(t)),    (3.29)


Figure 3.5: An example of a CNN structure. Until flattening, the data is convolved with 5×5 kernels, and then the activation function ReLU is applied. After activation, pooling reduces the size of the data and passes it to the next layer at half size. Finally, the fully-connected layer works like a typical ANN with ReLU, and the output part uses softmax as activation. There are 2 classes, since there are 2 outputs.


Figure 3.6: A simple diagram of an RNN.

where W_h is the weight matrix of the hidden layer, W_x is the weight matrix of the input layer, and f_1 is the activation function. Then the overall output is

y(t) = f_2(W_y h(t)),    (3.30)

where W_y is the weight matrix of the output layer, and f_2 is the activation function. It is assumed that the bias is not included. In practice, f_1 is applied as tanh and f_2 is applied as the softmax function.
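The recurrence of Equations 3.29 and 3.30 is easy to unroll by hand; the layer sizes, sequence length, and random weights below are assumptions for illustration only.

```python
# Unrolling a simple RNN over a sequence, with tanh and softmax as in the text.
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / np.sum(e)

rng = np.random.default_rng(2)
W_h = rng.normal(size=(8, 8))    # hidden-to-hidden weights
W_x = rng.normal(size=(8, 5))    # input-to-hidden weights
W_y = rng.normal(size=(3, 8))    # hidden-to-output weights

h = np.zeros(8)                  # h(0)
for x_t in rng.normal(size=(10, 5)):       # a sequence of 10 input vectors
    h = np.tanh(W_h @ h + W_x @ x_t)       # Equation 3.29
    y = softmax(W_y @ h)                   # Equation 3.30
print(y)                                   # class probabilities at the last time step
```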

However, applications with typical RNNs have a common problem called the vanishing gradient problem. That means that, during the back-propagation phase, the partial derivative of the error function with respect to the output shrinks at a fast pace, and this prevents the model from learning, leading to a premature halt. This happens due to the computations at each time step t ∈ T, which diminish the gradient dramatically. Hence, for longer sequences, the network forgets the earlier time steps and the model becomes less effective.

The solution for the vanishing gradient problem is implementing long short term memory (LSTM) models, which have the ability to retain long-term memory [48], unlike the typical RNN model. In an LSTM, there are three different types of gates, which are referred to as the external input, forget, and output gates. The LSTM also includes a layer of self-loop, which is an inner loop that is different from the outer loop of a typical RNN. Those loops are retained in different cells. Calling the set of cells C, the weight of the forget gate f_i(t) at cell i at the given time t is formulated by

f_i(t) = \sigma\left( \sum_{i \in C} V^{(f)}_i x_i(t) + W^{(f)}_i h(t-1) \right)    (3.31)

where x_i(t) is the input for the given time t at cell i, h(t) is the hidden layer value for the given time t, and V^{(f)}_i is the matrix of input weights for the forget gate for cell i, whereas W^{(f)}_i is the matrix of weights of recurrence for the forget gate for cell i. The activation function \sigma is the sigmoid function. It is assumed that there is no bias included.

Let s_i(t) be the inner loop, namely, the internal state at cell i at the given time t. Then the internal state is computed by

s_i(t) = f_i(t) s_i(t-1) + e_i(t) \sigma\left( \sum_{i \in C} V_i x_i(t) + W_i h(t-1) \right)    (3.32)

where e_i(t) is the value of the external input gate for the given time t at cell i, V_i is the matrix of input weights for the LSTM layer for cell i, whereas W_i is the matrix of weights of recurrence for the LSTM layer for cell i.

The computation of the external input gate is similar. For e_i(t), it is given by the equation

e_i(t) = \sigma\left( \sum_{i \in C} V^{(e)}_i x_i(t) + W^{(e)}_i h(t-1) \right)    (3.33)

where V^{(e)}_i, W^{(e)}_i are the input weights and the weights of recurrence for the external input gate for cell i.

The output h_i(t) for cell i at the given time t is controlled by the output gate o_i(t), which is computed as follows,

h_i(t) = g(s_i(t)) o_i(t),    (3.34)

where g is the activation function, and in a similar fashion to the other gates,

o_i(t) = \sigma\left( \sum_{i \in C} V^{(o)}_i x_i(t) + W^{(o)}_i h(t-1) \right),    (3.35)

where V^{(o)}_i, W^{(o)}_i are the input weights and the weights of recurrence for the output gate for cell i [49].

The brief architecture of a given cell i of an LSTM model is given in Figure 3.7.

Note that in some applications, Equation 3.32 uses the sigmoid activation function.
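One time step of the LSTM equations can be sketched in NumPy as below. The weight shapes and values are illustrative, biases are omitted as in the text, and tanh is used for g and for the inner nonlinearity of Equation 3.32 (the sigmoid variant mentioned above would simply swap that call).

```python
# A single LSTM time step following Equations 3.31-3.35, vectorized over the cells.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(3)
n_in, n_cells = 5, 8
V_f, V_e, V_o, V = (rng.normal(size=(n_cells, n_in)) for _ in range(4))      # input weights
W_f, W_e, W_o, W = (rng.normal(size=(n_cells, n_cells)) for _ in range(4))   # recurrence weights

x_t = rng.normal(size=n_in)
h_prev, s_prev = np.zeros(n_cells), np.zeros(n_cells)

f = sigmoid(V_f @ x_t + W_f @ h_prev)                 # forget gate, Eq. 3.31
e = sigmoid(V_e @ x_t + W_e @ h_prev)                 # external input gate, Eq. 3.33
o = sigmoid(V_o @ x_t + W_o @ h_prev)                 # output gate, Eq. 3.35
s = f * s_prev + e * np.tanh(V @ x_t + W @ h_prev)    # internal state, Eq. 3.32
h = np.tanh(s) * o                                    # output, Eq. 3.34 with g = tanh
print(h)
```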

3.4 Representation of Audio Data

3.4.1 Analog and Digital Signals

Digital audio data is a time series of n-bit values of amplitude. Digital audio data is derived from the analog audio signal and sampled into discrete values of amplitude. Sampling means the derivation of data from some points of the analog signal (the chosen points of the analog signal are usually equidistant) and the creation of a discrete array of amplitude values. Let x be an analog signal, and x(t) be the analog signal at a time t. For the analog signal, t ∈ R, since it is continuous.


Figure 3.7: The architecture of an LSTM. For inputs x(t) and h(t-1), sigmoid functions are applied for the different gates f_i(t), e_i(t), and o_i(t) and also to the LSTM layer, which is an illustration of Equation 3.32.

Let x_d be the digital signal with x_d[s] denoting the signal at the sample s. For the digital signal, s ∈ Z, since it is discrete. The relationship between an analog and a digital signal is

x_d[s] = x(s / f_s),    (3.36)

for a sample s, where f_s is the sampling rate, which is denoted in 1/second or hertz (Hz).

A practical consideration when sampling values is the bit depth. The amplitudes of an analog signal vary in a range [a, b] and can take any real-number value in between; a digital signal, however, is limited by the bit depth. That is, for a discrete signal with a bit depth of n bits, the count of possible amplitude values is 2^n. For a chosen range of [a, b], the difference between two adjacent possible values is (b - a)/(2^n - 1). The conversion between signals is illustrated in Figure 3.8.

Another practical problem that arises in applications is aliasing. Aliasing means that the signal is undersampled. When aliasing happens, the signal frequency is altered. Aliasing prevents the signal from being reconstructed in an adequate way. The problem of aliasing is shown in Figure 3.9. By the Nyquist-Shannon sampling theorem, a digital signal can be fully reconstructed if the analog frequency of the signal has a value f ≤ f_s/2 [50]. The value f_s/2 is called the Nyquist frequency. In the real world, this is the reason that music players use a 44100 Hz sampling rate, since humans can perceive frequencies between 20 and 22000 Hz.
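Equation 3.36 can be demonstrated by sampling a known sinusoid; the tone frequency, duration, and 8-bit quantization below are illustrative choices, not values used in this thesis.

```python
# Sampling an analog sinusoid: x_d[s] = x(s / f_s), then quantizing to 8 bits.
import numpy as np

f_signal = 5.0      # analog frequency in Hz
f_s = 44100         # sampling rate; the Nyquist frequency is f_s / 2 = 22050 Hz
duration = 2.0      # seconds

s = np.arange(int(duration * f_s))             # sample indices
x_d = np.sin(2 * np.pi * f_signal * s / f_s)   # evaluate x at t = s / f_s

x_8bit = np.round(x_d * 127).astype(np.int8)   # 2**8 possible amplitude values
print(x_d.shape, x_8bit.min(), x_8bit.max())
```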


Figure 3.8: The digital equivalent of the analog signal x(t) = sin t, with a sampling rate of 2 Hz. Hence, there are two instances of data per second.

3.4.2 Discrete and Fast Fourier Transform

The Fourier transform is a powerful technique to separate the frequencies of audio signals and rebuild them. For a discrete signal, the Fourier transform is formulated as

x[k] = \sum_{j=0}^{N-1} x[j] e^{\frac{2\pi k j i}{N}},    (3.37)

where i = \sqrt{-1} and N is the size of the signal.

However, the computational complexity of this process is not practical, since it is O(N^2). Fortunately, the algorithm of the fast Fourier transform (FFT) gives a solution with complexity O(N \log N) [51].

In the FFT, first, the sequence is separated into even and odd subsequences. Let n = 2r for an even index and n = 2r + 1 for an odd index, and write

x[k] = \sum_j x[j] e^{\frac{2\pi k j i}{N}}    (3.38)
     = \sum_{r=0}^{(N-1)/2} x[2r] e^{\frac{2\pi k r i}{N/2}} + e^{\frac{2\pi k i}{N}} \sum_{r=0}^{(N-1)/2} x[2r+1] e^{\frac{2\pi k r i}{N/2}}    (3.39)
     = x[2r] + e^{\frac{2\pi k i}{N}} x[2r+1],    (3.40)

creating a subproblem. Now, treating each of x[2r] and x[2r+1] in the same manner, the recursion is carried out until the problem is reduced to a very small size, and this cutoff is determined manually [52].

Figure 3.9: Comparison of analog and digital signals for x(t) = sin 4t (4 Hz) with a sampling frequency of 2 Hz. The digital signal is not reconstructed properly; hence, as audio data, it will sound distorted. To prevent aliasing, the sampling rate here must be at least 8 Hz.
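In practice the FFT is taken from a library rather than implemented recursively by hand. The snippet below uses NumPy on an illustrative 3 kHz tone; only the 22100 Hz sampling rate echoes the data described later in Chapter 4.

```python
# Finding the dominant frequency of a test tone with the fast Fourier transform.
import numpy as np

f_s = 22100
t = np.arange(f_s) / f_s                    # one second of signal
x = np.sin(2 * np.pi * 3000 * t)            # a 3 kHz sinusoid

X = np.fft.rfft(x)                          # O(N log N) FFT of a real signal
freqs = np.fft.rfftfreq(len(x), d=1 / f_s)  # frequency of each FFT bin
print(freqs[np.argmax(np.abs(X))])          # approximately 3000 Hz
```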

3.4.3 Spectrogram

The audio signals can be visualized to observe the time-variance of audio frequencies.

The spectrogram of audio data is computed with a variation of the Fourier transform, using a sliding window. This variation is called the short-time Fourier transform (STFT), which is derived from Equation 3.37. Sliding window means that a short-time segment of the signal is selected and combined with the given window for each step, as shown in Equation 3.41:

X[m, k] = \sum_{j=-\infty}^{\infty} x[j] w[j - m] e^{\frac{2\pi k j i}{N}},    (3.41)

where w is the sliding window and m is the position of the window. The sliding window used is usually the Tukey window, which is defined as

w[n] = \frac{1}{2}\left[ 1 - \cos\left( \frac{2\pi n}{\alpha N} \right) \right],  0 \leq n < \frac{\alpha N}{2},
w[n] = 1,  \frac{\alpha N}{2} \leq n \leq \frac{N}{2},
w[N - n] = w[n],  0 \leq n \leq \frac{N}{2},    (3.42)

where \alpha is the rectangularity parameter; as \alpha goes to 0, the window becomes more rectangular. The value is chosen as 0 \leq \alpha \leq 1, and N is the window size. Figure 3.10 shows Tukey windows for different values of \alpha.

Figure 3.10: Tukey window with different values of α: (a) α = 1, (b) α = 0.5, (c) α = 0.

A spectrogram graph is two-dimensional. While the x axis represents the time dimension, the y axis represents the frequencies. The values are represented with colors denoting the amplitude. Figure 3.11 shows spectrograms of a sinusoid, an FM signal, speech, and a wind turbine with α values of 0.25, 0.5, and 1.

One can observe that the use of different values for the Tukey window slightly changes the spectrogram representation. However, the overall big picture does not change compared to the smaller details, making spectrograms feasible for the representation of audio data.
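As a sketch, SciPy computes such a spectrogram directly from the STFT; the Tukey window with α = 0.25, the segment length of 256, and the hop of 32 samples mirror the settings reported in Section 4.1.2, while the test signal itself is only an illustration.

```python
# Spectrogram of a test tone with a Tukey window (alpha = 0.25).
import numpy as np
from scipy.signal import spectrogram

f_s = 22100
t = np.arange(f_s) / f_s
x = np.sin(2 * np.pi * 3000 * t)                       # a 3 kHz test tone

freqs, times, Sxx = spectrogram(x, fs=f_s, window=("tukey", 0.25),
                                nperseg=256, noverlap=256 - 32)
print(Sxx.shape)   # (129 frequency bins, number of time frames)
```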

3.4.4 Mel-Frequency Cepstrum Coefficients

The detection of distinguishing features is essential when feeding an audio representation to an artificial neural network model. Hence, especially in speech recognition [53], a procedure called Mel-frequency cepstrum coefficients (MFCC) is applied. The technique of MFCC was first proposed in a paper published in 1980 to recognize monosyllabic words [54], but the applications are not limited to speech recognition. The motivation of MFCC is observing several computational features to distinguish between sounds, because from spectrogram data alone, the inference of unique features is not always possible.

With the Fourier transform, the frequencies of the signal can be derived. However, before the Fourier transform, the signal should be windowed. The window that is used for this purpose is called the Hamming window. The Hamming window is defined as

w[n] = a_0 - (1 - a_0)\cos\left( \frac{2\pi n}{N} \right)    (3.43)

where a_0 = 0.54 and N is the window size. The Hamming window is shown in Figure 3.12.

Figure 3.12: A Hamming window with the window size of N = 512.

Now, after defining the window, apply the windowing to the digital signal x[n] using the equation

x[m] = \sum_{j=-\infty}^{\infty} x[j] w[j - m],    (3.44)

and

X[m, k] = F(x[m])    (3.45)

where F is the Fourier transform of the signal.

For the next step, there is a new definition called the Mel scale. Since the frequency scale is nonlinear, the difference in sound between lower frequencies is more noticeable than the difference in sound between higher frequencies; the Mel scale is proposed to overcome this problem, and it is defined as

M(f) = 2595 \cdot \log_{10}\left( 1 + \frac{f}{700} \right)    (3.46)

and visualized in Figure 3.13.

Figure 3.13: Frequency vs Mel graph.

Now, for the extraction of features, n different filter banks are chosen. In practical applications, usually n = 26 or n = 40 is the case. The filter bank is computed in the given steps:

(1) Determine the lower bound and upper bound of the frequencies, i.e., take f_min = min(X[m, k]) and f_max = max(X[m, k]).

(2) Apply M(f_min) and M(f_max).

(3) For n filter banks, there are n + 2 points. Place the lower bound and upper bound of the Mel values at the start and the end of an array of size n + 2, respectively.

(4) Fill the empty values in the array by increasing the value with a constant step, so that the n + 2 points are equally spaced between M(f_min) and M(f_max). Define the resulting array as A[i].

(5) Apply M^{-1}(A[i]) for each i ∈ A in increasing order, and define the resulting array as B.

(6) To B, apply the equation

C = \lfloor (N + 1) \cdot B / f_s \rfloor,    (3.47)

where \lfloor x \rfloor is the floor function.

(7) For the kth filter bank, compute the filter graph with the given formula,

H(k) = \begin{cases}
0, & k < C(n-1) \\
\frac{k - C(n-1)}{C(n) - C(n-1)}, & C(n-1) \leq k \leq C(n) \\
\frac{C(n+1) - k}{C(n+1) - C(n)}, & C(n) \leq k \leq C(n+1) \\
0, & k > C(n+1)
\end{cases}    (3.48)

which will create n different filter banks [55].

Now, since the values for the filter banks are obtained, take the logarithm of every H(i), that is, H(i) = \log_{10} H(i). And finally, apply

c_k = \sum_{n=1}^{M} H(n) \cos\left[ \frac{\pi}{M}\left( n + \frac{1}{2} \right) k \right].    (3.49)

Every c_k generated here is a coefficient. In practical applications, the largest half of the coefficients are removed and the remaining part is the output. In Figures 3.14 and 3.15, example cases of MFCC can be seen.
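In practice, libraries such as librosa wrap the whole pipeline above (windowing, FFT, Mel filter banks, logarithm, and the final cosine transform). The call below echoes the parameters reported later in Section 4.1.2; the file name is hypothetical.

```python
# Extracting 32 MFCC coefficients from a mono audio file with librosa.
import librosa

y, sr = librosa.load("turbine_sample.wav", sr=22100, mono=True)   # hypothetical file
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=32,
                            n_fft=1024, n_mels=64, window="hamming",
                            win_length=int(0.0125 * sr),   # 12.5 ms window
                            hop_length=int(0.005 * sr))    # 5 ms between windows
print(mfcc.shape)   # (32 coefficients, number of frames)
```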


Figure 3.11: Different types of sounds with different values of the Tukey window; from above, the values are α = 0.25, 0.5, 1, respectively. (a) Sinusoid defined by x(t) = sin(2π · 3·10^3 t). (b) Sinusoid defined by y(t) = x(t + cos(2π · 0.25 t)). (c) Male speaker saying "Test" into a microphone. (d) Wind turbine sound at 720 rotations per minute.


Figure 3.14: MFCC application withn = 26 on a male speaker saying ”Test,” which is the same signal as Figure 3.11(c).

Figure 3.15: MFCC application with n= 26 on the signal given in Figure 3.11(b).


4. Implementation

In this chapter, the practical models are presented and discussed. There are three different structures for each of the three representations of the audio data, which are MFCC, spectrogram, and Mel spectrogram, respectively. All given structures are deep learning neural network models with 3 output classes. The given classes are listed as working, problematic, and not working.

4.1 Data Collection

The data was collected from various sources. The working class part and sections of the other classes of the training data are disclosed data from onshore wind farm facilities of Wärtsilä. For the problematic part, apart from the data of Wärtsilä, supporting data from the open source MIMII dataset is used for abnormal fan sounds [56]. For the not working part, apart from the data of Wärtsilä, several instances of audio data shared on YouTube and open source audio instances from royalty-free media sharing environments were gathered. For the training data, the working part was gathered from wind farm recordings found on YouTube, and the problematic part was gathered from the MIMII dataset. For the not working part, again, YouTube and royalty-free media sharing environments were used. However, there is a data distribution bias problem. The working and problematic instances of data have a size of 15 minutes for each category and for the training and testing groups, whereas the not working instances of training data have a size of 5 minutes 27 seconds, and for the testing data, the size is only 27 seconds.

The sampling rate for the audio is 22100 Hz. The audio is mono, i.e., there is only one channel of audio data for each instance. The given audio instances are divided into chunks of arrays of size 22100, which represents 1 second of sound.

4.1.1 SMOTE

The synthetic minority over-sampling technique (SMOTE) is a technique that was developed in 2002 to increase the amount of minority data [57]. The SMOTE algorithm is a randomized algorithm that adapts the k-nearest neighbours algorithm and creates new synthetic instances to increase the amount of data.


Class | Training | Training (SMOTE) | Testing | Testing (SMOTE)
Working | 900 | 900 | 899 | 900
Problematic | 900 | 900 | 900 | 900
Not Working | 327 | 900 | 26 | 900

Table 4.1: Number of instances for each class, without and with the adaptation of SMOTE for k = 5.

With SMOTE, the class imbalance problem of the data is fixed. Table 4.1 gives the number of instances of data for each class for the training and testing groups for k = 5. As intended, the majority classes are not synthesized, but the minority class is synthesized to close the gap with the majority classes, creating a balanced dataset.
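The balancing step can be reproduced with the SMOTE implementation in the imbalanced-learn package; the feature matrix below is a placeholder standing in for the flattened audio representations.

```python
# Oversampling the minority classes with SMOTE (k = 5 nearest neighbours).
import numpy as np
from imblearn.over_sampling import SMOTE

X = np.random.randn(2127, 198 * 32)              # e.g. flattened MFCC arrays (placeholder)
y = np.array([0] * 900 + [1] * 900 + [2] * 327)  # working / problematic / not working

X_res, y_res = SMOTE(k_neighbors=5).fit_resample(X, y)
print(np.bincount(y_res))                        # every class now has 900 instances
```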

4.1.2 Data Representation and Preprocessing

The sampled form of each instance of the data is a sequence of 22100 values in the range [-128, 127], which is equivalent to signed 8-bit integers. However, this sequence should be preprocessed into a format representable for machine learning.

There are three different representations of the data. These data representations are MFCC, spectrogram, and Mel spectrogram, as mentioned.

For the MFCC application, there are N = 1024 fast Fourier transform parameters, with a Hamming window with a length of 12.5 milliseconds. Between each window, there is a 5 millisecond difference; hence, the windows overlap. There are n = 64 chosen filter banks for the model, and n/2 = 32 of them are actually used for the feature extraction. For each instance, the MFCC algorithm creates an array of 198 × 32 values. Figure 4.1 shows examples of the MFCC representations of several instances.

For the spectrogram application, the window used is the Tukey window with the parameter α = 0.25. There are N = 256 fast Fourier transform parameters. The size of the window is 11 milliseconds (this is equivalent to 256 time frames of a signal with f_s = 22100). Between each window, there is a difference of 11/8 milliseconds, or 32 time frames. For each instance, the spectrogram algorithm creates an array of 129 × 98 values. Figure 4.2 shows examples of the spectrogram representations of several instances.

For the Mel spectrogram application, the window used is the Hann window. The Hann window is a variation of the Hamming window with the parameter a_0 = 0.5. There are N = 2048 fast Fourier transform parameters. The size of the window is 92 milliseconds, or 2048 time frames. Between each window, there is a difference of 23 milliseconds, or 512 time frames. After generating the spectrogram representation, the values are transformed into their corresponding Mel values according to Equation 3.46. For each instance, the Mel spectrogram algorithm creates an array of 128 × 44 values.


Figure 4.1: MFCC representations of instances from the dataset: (a) a working data instance, (b) a problematic data instance, (c) a failed data instance.

Figure 4.3 shows examples of the Mel spectrogram representations of several instances.

Finally, the generated values should be translated to fit in a range, since the values are distributed differently in the MFCC, spectrogram, and Mel spectrogram representations. To make the data fit the present models, the range is set by minimum-maximum scaling, according to the equation

x[t] = 2\left( \frac{x[t] - \min(x)}{\max(x) - \min(x)} \right) - 1,    (4.1)

where min(x) is the minimum value of the array x, and max(x) is the maximum value of the array x. The given equation fits the data into the [-1, 1] range. To fit the data into the [0, 1] range, one can apply

x[t] = \frac{x[t] - \min(x)}{\max(x) - \min(x)}.    (4.2)

In CNN and RNN applications, [−1,1] range is applied, whereas [0,1] range is applied in LSTM applications in the experimental setup.
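Equations 4.1 and 4.2 translate directly into NumPy; the example values are arbitrary.

```python
# Minimum-maximum scaling into the [-1, 1] and [0, 1] ranges.
import numpy as np

def scale_minus1_1(x):
    return 2 * (x - x.min()) / (x.max() - x.min()) - 1   # Equation 4.1

def scale_0_1(x):
    return (x - x.min()) / (x.max() - x.min())           # Equation 4.2

x = np.array([3.0, 7.0, 11.0])
print(scale_minus1_1(x), scale_0_1(x))   # [-1. 0. 1.] and [0. 0.5 1.]
```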


Figure 4.2: Spectrogram representations of instances from the dataset: (a) a working data instance, (b) a problematic data instance, (c) a failed data instance.


Figure 4.3: Mel spectrogram representations of instances from the dataset: (a) a working data instance, (b) a problematic data instance, (c) a failed data instance.

4.2 Training Models

The training models are based on three main architectures: CNN, RNN, and LSTM. Each main model is designed differently to suit the three representations of the data given above. For practical purposes, additional methods are applied to the given models.


Figure 4.4: Applying a 3×3 kernel into a 6×6 array. There are a total of 4×4 operations, leaving the edges unchanged, so they are eliminated.

Before defining the models, several practical methods are introduced.

4.2.1 Practical Methods

Padding and Kernel Initialization

CNN models have k different kernels for each convolutional layer. With padding, the edges of a two-dimensional array of size m × n are pruned and the array is converted into an array of size (m - k + 1) × (n - k + 1), because the convolution operation for each kernel of size k × k will exclude the edges, as shown in Figure 4.4.

The kernel initialization denotes the initial weight values of the kernel. The values are randomized for each iteration by using a method called Glorot uniform initialization [58]. The random values are distributed in a range [-L, L], where

L = \sqrt{\frac{6}{i + o}},    (4.3)

where i denotes the number of input units and o denotes the number of output units.

Kernel Regularization

A regularizer is a penalty value given to the weights of a layer of the artificial neural network model. The value of regularization is added into the overall loss.


The applied regularization algorithm is L2 regularization [59], which is computed as

R = \alpha \cdot w_s,    (4.4)

where 0 \leq \alpha < 1 is the arbitrary L2 value and w_s is the sum of the squares of all weights of the given layer, that is, w_s = \sum_{ij} w_{ij}^2. After determining R, it is added to the loss L, i.e., L = L + R.

Adam Optimization

The adaptive moment estimation (Adam) is an algorithm that was developed in 2014 to improve stochastic gradient descent [60]. The method consists of four parameters: \alpha, \beta_1, \beta_2, and \epsilon. First, for each iteration, the gradient of the error E is computed, which will be called g_t for iteration step t. Then the momentums \mu and \nu are computed,

\mu_t = \beta_1 \mu_{t-1} + (1 - \beta_1) g_t,    (4.5)
\nu_t = \beta_2 \nu_{t-1} + (1 - \beta_2) g_t^2,    (4.6)

then the computed momentums are normalized with the formulas

\hat{\mu}_t = \frac{\mu_t}{1 - \beta_1^t},    (4.7)
\hat{\nu}_t = \frac{\nu_t}{1 - \beta_2^t},    (4.8)

and finally the weights are updated with the formula

w^{(t+1)} = w^{(t)} - \alpha \frac{\hat{\mu}_t}{\sqrt{\hat{\nu}_t} + \epsilon}.    (4.9)

Adam optimization is a different approach from the application of stochastic gradient descent in the backpropagation algorithm, and it is known to be more effective and faster in convergence.
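The Adam update can be traced step by step on a toy quadratic loss; the parameter values below are the commonly used defaults, and the loss is an illustrative assumption.

```python
# Adam updates following Equations 4.5-4.9 on the loss E(w) = ||w||^2.
import numpy as np

alpha, beta1, beta2, eps = 0.001, 0.9, 0.999, 1e-8
w = np.array([1.0, -2.0])
mu = np.zeros_like(w)       # first momentum
nu = np.zeros_like(w)       # second momentum

for t in range(1, 101):
    g = 2 * w                                  # gradient of E(w)
    mu = beta1 * mu + (1 - beta1) * g          # Eq. 4.5
    nu = beta2 * nu + (1 - beta2) * g ** 2     # Eq. 4.6
    mu_hat = mu / (1 - beta1 ** t)             # Eq. 4.7
    nu_hat = nu / (1 - beta2 ** t)             # Eq. 4.8
    w = w - alpha * mu_hat / (np.sqrt(nu_hat) + eps)   # Eq. 4.9
print(w)   # slowly approaches the minimum at the origin
```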

Spatial Dropout

In a general sense, dropout is a regularization technique for the weights to overcome the problem of overfitting [61]. For a three-dimensional matrix of weights W^{(i)} for layer i in a convolutional neural network, a ratio 0 < \alpha < 1 of their values are dropped to 0. After dropping the random weights in the array to 0, each of the remaining weights is scaled with the value \frac{1}{1 - \alpha}.


The two-dimensional implementation of this technique is called two-dimensional spatial dropout. Instead of dropping the values of weights individually, a percentage α of the two-dimensional feature maps are dropped. This technique is illustrated in Figure 4.5.

Figure 4.5: Examples of dropout and spatial dropout: (a) dropout with ratio α = 0.4, (b) two-dimensional spatial dropout with ratio α = 0.2. Nodes denoted with red are dropped to 0.

Batch Normalization

The batch normalization layer is a practical layer to increase the speed of the training [62] and to reduce the computational burden. The algorithm creates a mini-batch B = {w_1, w_2, \ldots, w_m}, where the values w_i are the weights. The algorithm normalizes the inputs from B. First, the mean \mu_B and the variance \sigma_B^2 are computed, respectively,

\mu_B = \frac{1}{m} \sum_{i=1}^{m} w_i,    (4.10)

\sigma_B^2 = \frac{1}{m} \sum_{i=1}^{m} (w_i - \mu_B)^2,    (4.11)

then the values w_i are normalized, with an arbitrary value \epsilon,

\hat{w}_i = \frac{w_i - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}}    (4.12)

and scaled and shifted with arbitrary values \gamma and \beta, respectively,

\tilde{w}_i = \gamma \hat{w}_i + \beta.    (4.13)

Batch normalization assumes stability of the distribution, rendering the optimization part of the model quicker than usual. Hence, the model saves time where the resources are limited due to practical issues.

4.2.2 Model Definitions

In this section, the experimental models are defined. There are 9 different models; for each of the three methods, one model is built for each representation of the data. The methods are CNN, RNN, and LSTM. The RNN and LSTM models are relatively simple compared to the CNN models. All models are optimized with Adam, and the loss function is cross-entropy.

CNN

Model 1 is loosely based on research from 2015 on the classification of environmental sounds [63]. The model is based on the MFCC representation of the data.

The components of the model are defined in Figure 4.6.

The Model 2 is based on the spectrogram representation of the data. The components of the model are defined in Figure 4.7.

Figure 4.7: CNN model for spectrogram representation of data.

Model 3 is based on the Mel spectrogram representation of the data. The components of the model are defined in Figure 4.8.


Figure 4.6: CNN model for MFCC representation of data.


Figure 4.8: CNN model for Mel spectrogram representation of data.

RNN

The Models 4, 5, and 6 are similar to each other. The models are based on the MFCC, spectrogram, and Mel spectrogram representations of the data, respectively.

The components of the models are defined in Figure 4.9. Inside the RNN cell, the activation function tanh is used.

(a) RNN model for MFCC representation of data.

(b) RNN model for spectrogram representation of data.

(c) RNN model for Mel spectrogram representation of data.

Figure 4.9: RNN models for different representations of data.


LSTM

The Models 7, 8, and 9 are similar to each other. The models are based on the MFCC, spectrogram, and Mel spectrogram representations of the data, respectively.

The components of the models are defined in Figure 4.10. Inside the LSTM cell, the activation function sigmoid is used.

(a) LSTM model for MFCC representation of data.

(b) LSTM model for spectrogram representation of data.

(c) LSTM model for Mel spectrogram representation of data.

Figure 4.10: LSTM models for different representations of data.

Table 4.2 summarizes the learning models given above with their methods, number of layers, and their representation of data.


Model Method Number of Layers Data Representation

Model 1 CNN 10 MFCC

Model 2 CNN 6 Spectrogram

Model 3 CNN 6 Mel Spectrogram

Model 4 RNN 2 MFCC

Model 5 RNN 2 Spectrogram

Model 6 RNN 2 Mel Spectrogram

Model 7 LSTM 3 MFCC

Model 8 LSTM 2 Spectrogram

Model 9 LSTM 2 Mel Spectrogram

Table 4.2: Summary of the learning models and their properties.
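The exact layer stacks of Models 1-9 are given in Figures 4.6-4.10 and are not reproduced in this text, so the snippet below is only a hypothetical Keras sketch. It combines the ingredients described in Section 4.2.1 (Glorot uniform initialization, L2 kernel regularization, spatial dropout, batch normalization, Adam, and cross-entropy) for the 198 × 32 MFCC input and the three output classes; the filter counts and layer order are assumptions, not the thesis' actual architecture.

```python
# A hypothetical CNN in the spirit of Model 1 (MFCC input, three classes).
from tensorflow import keras
from tensorflow.keras import layers, regularizers

model = keras.Sequential([
    layers.Input(shape=(198, 32, 1)),
    layers.Conv2D(16, (3, 3), activation="relu",
                  kernel_initializer="glorot_uniform",
                  kernel_regularizer=regularizers.l2(1e-3)),
    layers.BatchNormalization(),
    layers.SpatialDropout2D(0.2),
    layers.MaxPooling2D((2, 2)),
    layers.Conv2D(32, (3, 3), activation="relu",
                  kernel_regularizer=regularizers.l2(1e-3)),
    layers.MaxPooling2D((2, 2)),
    layers.Flatten(),
    layers.Dense(64, activation="relu"),
    layers.Dense(3, activation="softmax"),   # working / problematic / not working
])
model.compile(optimizer=keras.optimizers.Adam(),
              loss="categorical_crossentropy", metrics=["accuracy"])
model.summary()
```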


5. Results and Discussion

The experimental results are below expectations. This is due to the representation of the data, the limitation of time and computational resources, and the inability to determine the correct hyperparameters. Table 5.1 gives the list of accuracies and the number of fitting steps (epochs) of the training and the test data for each model.

Model Training Data Accuracy Test Data Accuracy Epochs

Model 1 97.70% 71.37% 5

Model 2 100% 29.81% 30

Model 3 100% 62.88% 30

Model 4 52.63% 40.37% 47

Model 5 33.52% 33.33% 10

Model 6 32.30% 33.37% 30

Model 7 97.30% 39.89% 200

Model 8 98.48% 34.30% 30

Model 9 99.74% 35.70% 140

Table 5.1: The rates of success for each model, and their epoch count.

At this step, it can be observed that the epoch counts for the models differ. The training for a model is stopped when the loss value of the model begins to increase, and the model is rolled back to the previous epoch. For example, in the case of Model 1, after the 5th epoch the loss increased, and the model was stopped early. The only exception is Model 7, where the exact epoch limit for training was reached and it did not stop early.

Models 2, 7, 8, and 9 suffered from overfitting. Models 4, 5, and 6 suffered from underfitting. The only relatively satisfactory results were obtained from Models 1 and 3.

5.1 Working as a Binary Classification Problem

The classification results of the given 9 models are investigated in detail using confusion matrices. A confusion matrix is a comparison table of the actual observations and the predictions of the given model. Statistical metrics other than accuracy are also investigated. To achieve this, the models are transformed into binary classification problems: two of the three classes, Problematic and Not Working, are merged, since they are similar conditions in practical environments. Moreover, a problematic system will eventually lead to a halt; in other words, they are in a causal relation.

Table 5.2: Ternary and binary confusion matrices. The ternary matrix compares the actual and predicted labels over the classes Working, Problematic, and Not Working; the binary matrix records positive predictions that are true or false (TP, FP) and negative predictions that are true or false (TN, FN).

The binary confusion matrix (shown in Table 5.2) consists of four different values:

1. True positive (TP), where an actual Working instance is classified as a Working instance.

2. True negative (TN), where an actual Problematic or Not Working instance is classified as a Problematic or Not Working instance.

3. False positive (FP), where an actual Problematic or Not Working instance is classified as a Working instance.

4. False negative (FN), where an actual Working instance is classified as a Problematic or Not Working instance.

Conversion of a three-class problem into a binary classification problem gives the advantage of computing statistical metrics other than the accuracy. In the binary classification analysis of this work, the defined metrics are (a small computational sketch follows this list):

1. Recall, which is the ratio of true positives to the sum of true positives and false negatives, denoted as R = TP / (TP + FN),

2. Precision, which is the ratio of true positives to the sum of true positives and false positives, denoted as P = TP / (TP + FP),

3. Specificity, which is the ratio of true negatives to the sum of true negatives and false positives, denoted as S = TN / (TN + FP).
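A minimal sketch of these metrics, computed from hypothetical confusion-matrix counts rather than the thesis results:

```python
# Recall, precision, and specificity from binary confusion-matrix counts.
def binary_metrics(tp, fp, tn, fn):
    recall = tp / (tp + fn)        # R = TP / (TP + FN)
    precision = tp / (tp + fp)     # P = TP / (TP + FP)
    specificity = tn / (tn + fp)   # S = TN / (TN + FP)
    return recall, precision, specificity

print(binary_metrics(tp=750, fp=120, tn=830, fn=150))
```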
