
Tommi Nikkanen

APPLYING MACHINE LEARNING ALGORITHMS TO PSYCHIATRIC PATIENT DATA

Faculty of Information Technology and Communication Sciences
M.Sc. Thesis
November 2021


Table of Contents

1 INTRODUCTION
2 MACHINE LEARNING PARADIGMS
3 MACHINE LEARNING ALGORITHMS
3.1 K-NEAREST NEIGHBOUR (K-NN)
3.1.1 Standardization and normalization
3.1.2 The curse of dimensionality
3.2 NAÏVE BAYES
3.3 DISCRIMINANT ANALYSIS
3.3.1 Linear and Quadratic Discriminant Analysis
3.3.2 Parameter estimation
3.3.3 Fisher's LDA
3.4 DECISION TREES
3.4.1 How to build the tree?
3.4.2 Entropy
3.4.3 Information Gain
3.4.4 ID3 algorithm
3.5 ENSEMBLE LEARNING
3.5.1 Bootstrapping and aggregation
3.5.2 Random Forest
3.5.3 Evaluation of Random Forest
3.5.4 Bias and variance error
4 MODEL EVALUATION
4.1 EVALUATION MEASURES FOR BALANCED DATASETS
4.2 EVALUATION MEASURES FOR IMBALANCED DATASETS
4.3 OTHER METHODS
4.3.1 Holdout method
4.3.2 K-fold cross validation
5 SMOTE (SYNTHETIC MINORITY OVER-SAMPLING TECHNIQUE)
6 THE PROJECT
6.1 THE DATASET
6.2 PREPROCESSING
6.3 CLASSIFICATION
6.4 PROGRAMMING ENVIRONMENT
7 RESULTS, CLASSIFICATION OF THE CLASS ACCIDENT CATEGORY
7.1 RANDOM FOREST
7.2 RANDOM FOREST AND SMOTE
7.3 NAÏVE BAYES
7.4 DISCRIMINANT ANALYSIS
7.5 K-NN
7.6 THE COMBINED RESULTS
8 RESULTS, CLASSIFICATION OF CLASS COMPENSATION DECISION
9 CONCLUSION
10 REFERENCES


ABSTRACT

Tommi Nikkanen: Applying Machine Learning Algorithms to Psychiatric Patient Data
Tampere University
Faculty of Information Technology and Communication Sciences
M.Sc. Thesis
November 2021

The purpose of this work is to classify data in the field of psychiatry and neurology by applying different supervised machine learning algorithms.

The work is divided into two parts. The methodological part presents all the methods used in the project part. Evaluation measures are given for balanced and imbalanced datasets to compare how well the different models perform. Also, the SMOTE (Synthetic Minority Over-sampling Technique) algorithm is utilized to help with imbalanced datasets. The project part consists of two different classification tasks: a classification of applications into six predefined categories, and a binary classification of the applications for compensation into two groups - accepted / declined.

Different supervised machine learning algorithms were applied to the data.

Random forest gave slightly better results than the other classifiers in the first classification task. Random forest was used in the second classification task. The results improved in both classification tasks when the SMOTE algorithm was used to generate synthetic samples to balance the different categories in the dataset.

Keywords: machine learning, k-nearest neighbour, discriminant analysis, decision tree, ensemble learning, random forest.

The originality of this thesis has been checked using the Turnitin OriginalityCheck service.


1 Introduction

Machine learning has a variety of different applications today, for example, classification, associative learning, image processing and speech recognition. It is a very wide field of research on its own, but it has applications in other fields of research as well, e.g., in the field of medicine to separate cancerous tissue from healthy.

This master's thesis is done as part of the research project of classification and incidence of patient injuries in the Faculty of Medicine and Health Technology at Tampere University.

There are two different classification problems. The first one is a classification of applications for compensation into six predefined categories and the second one is a binary classification problem with two categories: applications for compensation accepted / declined.

Different supervised machine learning algorithms were used in the first classification task, such as K-nearest neighbour, decision trees and ensemble learning. Random forest was used in the second classification task. The SMOTE (Synthetic Minority Over-sampling Technique) algorithm was used in both classification tasks to help with imbalanced datasets.

Chapter 2 gives a brief introduction to machine learning in general. The machine learning methods are explained in Chapter 3. The methods covered in this thesis are k-nearest neighbour, naïve Bayes classifier, discriminant analysis and ensemble learning. For ensemble learning, Breiman's random forest algorithm in particular is presented. Chapter 4 gives different measures to evaluate machine learning models. Evaluation measures are given for balanced and imbalanced datasets. In Chapter 5, the Synthetic Minority Over-sampling Technique is introduced to deal with imbalanced datasets, and it is also utilized in the project part. The project part and the results are covered in Chapters 6-9.

I want to give thanks to the project group: Olli Kampman, M.D.; Martti Juhola, Ph.D.; and Juho Niemi, B.M. Juho Niemi also helped in the translation of the medical lexicon from Finnish into English. Also, I want to give special thanks to Martti Juhola for supervising and examining the thesis and to Henry Joutsijoki, Ph.D., for giving feedback and examining the thesis.


2 Machine Learning Paradigms

There are two popular approaches to machine learning (ML): supervised and unsupervised learning. In supervised learning the task is to find a function that maps the attribute values of an instance to a target attribute value. A machine learning algorithm tries to find the best such function. However, there are so many different possible functions from input to output that not all of them can be tested. For this reason, machine learning algorithms are designed to look for a certain kind of function; this tendency is called the learning bias. It is a challenge to find the algorithm that best suits a specific dataset, and many different algorithms should be tested to find the most suitable one. The supervised machine learning algorithms explained in this thesis are k-nearest neighbour, naïve Bayes, linear discriminant analysis, decision trees and random forest (Chapter 3). [Kelleher and Tierney, 2018]

In unsupervised learning there is no target attribute. The goal is to find regularities in the data. The most important methods in unsupervised learning are cluster analysis and principal component analysis (PCA). PCA is used, for example, in dimensionality reduction (discussed in Section 3.1.2) and in finding patterns in high-dimensional data. [Alpaydin, 2016].

One use case of clustering is customer segmentation, in which a company wants to divide its customers into segments to see what kinds of customers it has. In clustering, the data is divided into groups so that every object in one group is more similar in a certain way to the other objects in its group than to those in other groups. Clustering algorithms usually start by guessing the initial groups and then iteratively update the groups so that the similarity inside the groups and the dissimilarity between the groups are maximized. One problem in clustering is how to measure similarity. The Euclidean distance measure can be used if the values are numerical, but for varying feature types it is more difficult. Clustering is utilized, for example, in recommender systems and in anomaly detection in various fields. [Alpaydin, 2016; Kelleher and Tierney, 2018].


3 Machine Learning Algorithms

3.1 K-nearest neighbour (k-NN)

The k-NN classifier is based on similarity learning. The class label for a new n-dimensional case is determined based on the nearby objects in the n-dimensional space. The k-NN algorithm is given in Figure 1. The algorithm takes the datapoints as the input and searches for the K closest cases in the training data. This is achieved by computing the distance from the current case to each case in the training data. Finally, the class is set to be the most common class among the K closest neighbours. If the chosen K value is too small (e.g., with K=1 only the nearest neighbour is considered), the algorithm becomes sensitive to noise. If the chosen K value is too big, the accuracy decreases because datapoints further away, unrelated to the current case, are also considered. The best K value can be found by testing different K values systematically and checking which value gives the best error rate. [Marsland, 2009; Marsland, 2014]. The algorithm is very sensitive to the curse of dimensionality, which is explained in Section 3.1.2.

Input:
T = training data vector, each element corresponds to one case of the training data
K = number of the nearest neighbours
dist = distance function
X = vector of the current case
Y = vector of the class labels for each case in T
Output:
X_class := class label for the vector X
---
1. M := vector for storing the distance to every case in T
2. # Calculate the distance to each case
3. for i := 1 .. T
4.     M(i) := dist(X, T(i))  # Store the calculated distance from X to the case i
5. idx_array := the indices of the K nearest neighbours of X calculated from M
6. # Calculate the most common class among the nearest neighbours
7. X_class := mode(Y(idx_array))

Figure 1: k-NN algorithm [Marsland, 2014]
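The programming environment of the project is described in Section 6.4; purely as an illustration, a minimal NumPy sketch of the pseudocode in Figure 1 could look as follows (the function name knn_classify and the default Euclidean distance are assumptions of this sketch, not part of the figure):

import numpy as np
from collections import Counter

def knn_classify(T, Y, X, K, dist=None):
    """Classify the case X with the k-NN rule of Figure 1.

    T    : (n_cases, n_features) training data matrix (NumPy array)
    Y    : class labels for the rows of T (NumPy array)
    X    : the current case (a feature vector)
    K    : number of nearest neighbours
    dist : distance function; defaults to the Euclidean distance (Eq. 1)
    """
    if dist is None:
        dist = lambda a, b: np.sqrt(np.sum((a - b) ** 2))
    M = np.array([dist(X, t) for t in T])              # distance from X to every case in T
    idx_array = np.argsort(M)[:K]                      # indices of the K nearest neighbours
    return Counter(Y[idx_array]).most_common(1)[0][0]  # the most common class among them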


The algorithm assumes that each attribute is equally important. If this is not the case for the data, attribute weighting can be used together with the algorithm. There are also modifications to the algorithm that allow pruning noisy data. The algorithm utilizes distance metrics when searching for the nearest neighbours. The simplest one is the Euclidean distance, which can be thought of as the length of a straight line between two points. In the same manner, the Manhattan distance is the length of the path along the grid lines between two points. Both the Euclidean distance and the Manhattan distance are defined for real-valued vectors. The Hamming distance metric can be used for binary data. There are also many other distance metrics, but they are not covered here. [Marsland, 2014; Han et al., 2012]

The Euclidean distance between two points in an n-dimensional space is calculated as given in Eq. 1. [Han et al., 2012]:

Euclidean_dist(x, y) = \sqrt{(x_1 - y_1)^2 + (x_2 - y_2)^2 + \ldots + (x_n - y_n)^2} \quad (1)

The Manhattan distance between two points in an n-dimensional space is calculated as given in Eq. 2. [Han et al., 2012]:

Manhattan_dist(x, y) = |x_1 - y_1| + |x_2 - y_2| + \ldots + |x_n - y_n| \quad (2)

These two equations are special cases of the Minkowski metric. The function for the Minkowski distance is given in Eq. 3. The function takes a parameter p, which is the order of the norm. [Han et al., 2012]:

Minkowski(x, y, p) = \left( |x_1 - y_1|^p + |x_2 - y_2|^p + \ldots + |x_n - y_n|^p \right)^{1/p} \quad (3)

such that p ≥ 1.
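As a hedged illustration of Eqs. 1-3 (not code from the thesis project), all three distances can be computed with one small NumPy function, since the Euclidean and Manhattan distances are the special cases p = 2 and p = 1:

import numpy as np

def minkowski(x, y, p):
    """Minkowski distance of Eq. 3; p = 2 gives Eq. 1 and p = 1 gives Eq. 2."""
    return np.sum(np.abs(x - y) ** p) ** (1.0 / p)

x = np.array([1.0, 2.0, 3.0])
y = np.array([4.0, 0.0, 3.0])
print(minkowski(x, y, 2))   # Euclidean distance, sqrt(13) ≈ 3.61
print(minkowski(x, y, 1))   # Manhattan distance, 5.0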

3.1.1 Standardization and normalization

Some algorithms require scaling of continuous variables to make the data comparable, e.g., distance-based algorithms such as k-NN, and PCA with multiple features. Otherwise, features with a bigger scale would dominate the other features.

Attributes with different ranges can cause attributes with a bigger scale to outweigh those with a smaller scale, for example, a binary value in [0, 1] compared to a large integer value. Min-max normalization can be used to prevent this problem by transforming the data to a range from 0 to 1. The formula is given in Eq. 4: the difference between the value to be normalized and the minimum value of the attribute is divided by the difference between the maximum and the minimum attribute values. [Marsland, 2014].

new_val = \frac{val - min_{att}}{max_{att} - min_{att}} \quad (4)

Min-max normalization is not good at handling outliers. Therefore, it is essential to have accurate data when using normalization. Z-score standardization is a method that can handle outliers. Unlike min-max normalization, the Z-score does not guarantee that all the features will have the same scale; it tells how much the current value differs relative to the others. The formula for Z-score standardization is given in Eq. 5: the mean of the distribution is subtracted from the current value and the result is divided by the standard deviation of the distribution. [Urdan, 2010].

Z\text{-}score = \frac{cur\_val - mean}{standard\ deviation} \quad (5)
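As an illustrative sketch of Eqs. 4 and 5 (the column of values is made up for the example):

import numpy as np

def min_max_normalize(col):
    """Eq. 4: scale an attribute column to the range [0, 1]."""
    return (col - col.min()) / (col.max() - col.min())

def z_score(col):
    """Eq. 5: subtract the mean and divide by the standard deviation."""
    return (col - col.mean()) / col.std()

ages = np.array([23.0, 35.0, 47.0, 61.0])
print(min_max_normalize(ages))   # [0.    0.316 0.632 1.   ]
print(z_score(ages))             # values centred on 0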

3.1.2 The curse of dimensionality

The number of attributes in the dataset corresponds to the number of dimensions. The curse of dimensionality describes the problem of the datapoints getting too far away from each other in a high-dimensional space. The high-dimensional space gets very sparse when the number of dimensions is high, and if all the datapoints are very far away from each other, the model may not generalize well. Generalization means how well the model predicts values for unseen data.

When the number of dimensions (attributes) is increased, more training cases are needed so that all the combinations of attribute values are represented in the dataset. A higher number of new cases is needed for each new dimension. [Flach, 2014; Marsland, 2014]

The dimensionality can be reduced by feature selection. Redundant or irrelevant features can have a negative impact on the performance of the algorithms. If there are features that have a high correlation with other features, it is likely that they only contain duplicate information and should be left out. Having too many features included could make the algorithm find irrelevant patterns in the data. It is not always an easy task to find the best set of features, and often the only way is to build the model with different subsets of the data and evaluate their performance. [Kelleher et al., 2018].

Feature extraction is another way to deal with the curse of dimensionality. For example, in PCA new components are formed from the original features and the components which have the least variation are removed from the dataset. [Flach, 2014; Marsland, 2014]


3.2 Naïve Bayes

The naïve Bayes classifier is a probabilistic classifier. The word naïve in the name means that the features are assumed to be independent of each other; however, this is not always the case with real-world data. The presumption of feature independence makes naïve Bayes computationally very fast. Also, naïve Bayes does not require a big dataset, but if an attribute value never occurs together with a class label in the training data, its conditional probability will be zero, which also cancels every other probability in the product of the posterior probability. [Barber, 2012; Han et al., 2012].

Naïve Bayes classification is based on Bayes' theorem, which is given in Eq. 6. In the equation, P(A | B) is the probability that the event A happens given that the event B holds. It is also known as the posterior probability of A given B. P(A) is called the prior probability of A. [Han et al., 2012].

P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)} \quad (6)

In the classification task, Bayes' theorem is utilized in calculating the probability that the case X belongs to the class C. The probability is calculated this way for each class, and the class with the highest probability is the predicted class. Let A_n be the attribute, where n is the ordinal number of the attribute. Let X = (x_1, x_2, x_3, …, x_n) be a training case where n corresponds to the attribute number. Let m be the number of classes in the data and let C = {C_1, C_2, …, C_m} be the set of classes. Then, the case X belongs to the class C_i if and only if P(C_i | X) > P(C_j | X) for 1 ≤ j ≤ m, j ≠ i. [Han et al., 2012].

According to Bayes' theorem (Eq. 6), the probability P(C_i | X) can be calculated as given in Eq. 7. By the naïve presumption, P(C_i | X) becomes simple and computationally fast to calculate. Because P(X) is the same for each class, and if the classes are additionally presumed to be equally likely, P(X | C_i) is maximized to get the output class. [Han et al., 2012].

P(C_i \mid X) = \frac{P(X \mid C_i)\,P(C_i)}{P(X)} \quad (7)

Without the equal-priors assumption, P(X | C_i)P(C_i) would need to be maximized instead. If there are, for example, three classes in the dataset, C_1 would be the result of the classification if the inequalities P(C_1 | X) > P(C_2 | X) and P(C_1 | X) > P(C_3 | X) hold. How the probability P(X | C_i) is calculated over the attributes is given in Eq. 8. [Han et al., 2012].

P(X \mid C_i) = \prod_{k=1}^{n} P(x_k \mid C_i) \quad (8)

The calculation of P(x_k | C_i) is different for categorical and continuous attributes. In the case of a categorical attribute, the probability is very simple to calculate: it is the number of cases in class C_i having the attribute value x_k divided by the total number of cases in C_i. The equation for a continuous attribute is given in Eq. 9. A Gaussian distribution is assumed for the attribute A_n, and the standard deviation σ and mean μ are calculated from the values of the attribute A_n for the cases having the class C_i. [Han et al., 2012].

g(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{(x - \mu)^2}{2\sigma^2}} \quad (9)

Therefore, the rule by which X's predicted class label C_i is chosen is given in Eq. 10. [Han et al., 2012].

P(X \mid C_i)\,P(C_i) > P(X \mid C_j)\,P(C_j) \quad \text{for } 1 \le j \le m,\; j \ne i \quad (10)

Having any zero value in the product in Eq. 8 makes the whole product zero. There is a technique called the Laplacian correction (Laplace estimator, Laplace smoothing) to avoid this problem. In the technique, one imaginary training case is added for each value of the attribute. For example, if there are zero cases with the value x_k of class C_i among 1000 training cases of that class, and the attribute has two different values in total, we would get the probability P(x_k | C_i) = 1/1002 ≈ 0.001 instead of 0, so that the total probability stays positive. This way the zero values are prevented and the new corrected probabilities are very close to the original values.
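A minimal sketch of Eqs. 7, 8 and 10 with the Laplacian correction for categorical attributes might look as follows; the function names and data layout are assumptions of this example, not taken from the thesis:

import numpy as np

def train_naive_bayes(X, y):
    """Estimate P(C_i) and the Laplace-smoothed P(x_k | C_i) for categorical data."""
    classes = np.unique(y)
    priors = {c: np.mean(y == c) for c in classes}
    likelihoods = {}                        # (class, attribute index, value) -> probability
    for c in classes:
        Xc = X[y == c]
        for k in range(X.shape[1]):
            values = np.unique(X[:, k])     # all values the attribute takes in the data
            for v in values:
                # Laplacian correction: one imaginary case per attribute value
                likelihoods[(c, k, v)] = (np.sum(Xc[:, k] == v) + 1) / (len(Xc) + len(values))
    return classes, priors, likelihoods

def predict(x, classes, priors, likelihoods):
    """Pick the class maximizing P(X | C_i) P(C_i), i.e. Eqs. 8 and 10."""
    scores = {c: priors[c] * np.prod([likelihoods[(c, k, v)] for k, v in enumerate(x)])
              for c in classes}
    return max(scores, key=scores.get)

X = np.array([["sunny", "hot"], ["rainy", "cool"], ["sunny", "cool"]])
y = np.array(["go out", "stay in", "go out"])
print(predict(np.array(["rainy", "hot"]), *train_naive_bayes(X, y)))   # prints 'go out'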


3.3 Discriminant Analysis

Discriminant analysis can be used for dimension reduction (which was covered in Section 3.1.2) and classification. Linear Discriminant Analysis (LDA) is used to find the features or the linear combination of the features that discriminate between two or more groups best. One way to represent a two-class classifier is in terms of a discriminant function g(x). The equation g(x) = 0 defines the decision boundary which is a line that separates the samples of two classes.

A new unseen sample is categorized to the first class if g(x) > 0, and otherwise to the second class. This decision boundary is a hyperplane when g(x) is linear. [Bishop, 2006; Marsland, 2014]

There are two different approaches to LDA by two different researchers. For two-class classification Fisher's and Welch's LDA are identical; only their approaches are different. Welch's LDA tries to minimize the total probability of misclassification, which depends on the features' properties, such as the normal distribution of the data and the class probabilities. Fisher's approach tries to find a linear combination of the features such that there is a large separation between the projected class means while the variance within each class cluster is small. Fisher's LDA also does not require the data to be normally distributed, and it can be generalized to more than two classes with small changes to the covariance matrices. [Kuhn, 2013]. Welch's LDA is explained in Section 3.3.1, parameter estimation for Welch's LDA in Section 3.3.2, and Fisher's LDA in Section 3.3.3.

3.3.1 Linear and Quadratic Discriminant Analysis

For the basis of classification, we use Bayes' theorem to get Eq. 11. This is the same equation that was used in the context of the naïve Bayes classifier, with the difference that there the features are presumed to be independent of each other. The density function f_k for a class k is given in Eq. 12, where the parameters are: \mu_k is the mean vector of dimension p, \Sigma_k is the p × p covariance matrix, and \pi_k is the prior probability of the class k. According to Bayes' theorem, and presuming that all the parameters are known, P(C_k | x) gives the probability that the case/vector x belongs to the class k. Moreover, \pi_k is the prior probability of the class k, defined as the number of cases of the class k divided by the total number of cases in the training data. [Hastie et al., 2008]. For example, for a three-class problem, if the prior probabilities are \pi_1 = 0.8, \pi_2 = 0.1 and \pi_3 = 0.1 and the densities are f_1 = 0.3, f_2 = 0.3 and f_3 = 0.4, we get the following probabilities for the case x for the three classes:

P(C_1 \mid x) = \frac{0.8 \cdot 0.3}{(0.8 \cdot 0.3) + (0.1 \cdot 0.3) + (0.1 \cdot 0.4)} \approx 0.77, \quad P(C_2 \mid x) \approx 0.10, \quad P(C_3 \mid x) \approx 0.13.

P(C_k \mid x) = \frac{f_k(x)\,\pi_k}{\sum_{l=1}^{m} f_l(x)\,\pi_l}, \quad k = 1, 2, \ldots, m \quad (11)

f_k(x) = \frac{1}{(2\pi)^{p/2}\,|\Sigma_k|^{1/2}}\, e^{-\frac{1}{2}(x - \mu_k)^T \Sigma_k^{-1} (x - \mu_k)}, \quad k = 1, 2, \ldots, m \quad (12)

When P(C_k | x) is defined, we can find the class that has the maximum posterior probability for the case/vector x as given in Eq. 13. The denominator from Eq. 11 can be left out because it is common for each k, so we can maximize f_k(x)\pi_k equivalently (lines 1-2). Also, we can maximize equivalently f_k(x)\pi_k and \log(f_k(x)\pi_k), because log is a strictly increasing function (lines 2-3). In the last line, the density function f_k is inserted into the equation.

\hat{G}(x) = \arg\max_k P(C_k \mid x) = \arg\max_k f_k(x)\,\pi_k = \arg\max_k \log(f_k(x)\,\pi_k)
           = \arg\max_k \left[ -\tfrac{1}{2}(x - \mu_k)^T \Sigma^{-1} (x - \mu_k) + \log(\pi_k) \right] \quad (13)

Quadratic discriminant analysis (QDA) is a generalized version of LDA. The only difference from LDA is that the covariances \Sigma_k are not assumed to be equal for each class, so the covariances need to be estimated for each class separately. Therefore, we maximize Eq. 14 instead of Eq. 13. QDA creates a quadratic decision surface, whereas LDA creates a linear decision surface. [Hastie et al., 2008].

\hat{G}(x) = \arg\max_k \left[ -\tfrac{1}{2}\log|\Sigma_k| - \tfrac{1}{2}(x - \mu_k)^T \Sigma_k^{-1} (x - \mu_k) + \log(\pi_k) \right], \quad k = 1, 2, \ldots, m \quad (14)
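For illustration only, scikit-learn provides implementations of both classifiers; the small example below on the iris dataset is a usage sketch, not the setup used in the project:

from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import (LinearDiscriminantAnalysis,
                                            QuadraticDiscriminantAnalysis)

X, y = load_iris(return_X_y=True)

lda = LinearDiscriminantAnalysis().fit(X, y)     # common covariance matrix, cf. Eq. 13
qda = QuadraticDiscriminantAnalysis().fit(X, y)  # class-specific covariances, cf. Eq. 14

print(lda.priors_)           # the estimated prior probabilities pi_k
print(lda.predict(X[:3]))    # predicted classes for the first three cases
print(qda.predict(X[:3]))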


3.3.2 Parameter estimation

The parameters of the multivariate Gaussian distributions need to be estimated from the training data, as most of the time they are unknown. The parameters \hat{\pi}_k, \hat{\mu}_k and \hat{\Sigma} are explained next.

\hat{\pi}_k is the maximum likelihood estimate of the prior probability of the class k. It is defined as N_k divided by N, where N_k is the number of cases of the class k and N is the total number of cases in the training data. [Hastie et al., 2008].

• \hat{\pi}_k = \frac{N_k}{N}

\hat{\mu}_k is the mean vector for the class k. It is defined as the sum of the vectors x_i belonging to the class k divided by the number of cases of the class k. [Hastie et al., 2008].

• \hat{\mu}_k = \frac{\sum_{g_i = k} x_i}{N_k}

\hat{\Sigma} is the common covariance matrix for all the classes. In the inner sum, for each class k we multiply the difference between the i'th case and the class mean by its transpose. The result is normalized by the scalar quantity N − K, where N is the total number of cases and K is the total number of classes. The outer sum then runs over all the classes. [Hastie et al., 2008].

• \hat{\Sigma} = \sum_{k=1}^{K} \sum_{g_i = k} \frac{(x_i - \hat{\mu}_k)(x_i - \hat{\mu}_k)^T}{N - K}
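A small NumPy sketch of these estimators (assuming a numeric data matrix X and a label vector y; not the project's own code) could be:

import numpy as np

def estimate_lda_parameters(X, y):
    """Estimate pi_k, mu_k and the pooled covariance matrix of Section 3.3.2."""
    classes = np.unique(y)
    N, K, p = len(y), len(np.unique(y)), X.shape[1]
    priors = {k: np.mean(y == k) for k in classes}         # pi_k = N_k / N
    means = {k: X[y == k].mean(axis=0) for k in classes}   # mu_k
    sigma = np.zeros((p, p))
    for k in classes:                                      # pooled within-class scatter
        centred = X[y == k] - means[k]
        sigma += centred.T @ centred
    return priors, means, sigma / (N - K)                  # normalize by N - K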

3.3.3 Fisher’s LDA

In a two-class problem, we want to find the linear combination (Eq. 15) of the features such that there is a large separation between the projected class means (between-class variance) and a small variance within each class (within-class variance). This is illustrated in Figure 2: there are cases of two classes, blue and red, their corresponding histograms, and the green line onto which the dots/cases are projected. The class separation is better than in Welch's LDA, in which the projection line is drawn between the class means. [Hastie et al., 2008; Li et al., 2014].

y = \theta^T X \quad (15)


Figure 2: Fisher’s LDA [Li et al., 2014]

We define the objective function in Eq. 16, in which (\hat{m}_1 - \hat{m}_2)^2 is the between-class variance and s_1^2 + s_2^2 is the within-class variance.

J(\theta) = \frac{(\hat{m}_1 - \hat{m}_2)^2}{s_1^2 + s_2^2} \quad (16)

Let's define the mean vector m_k for the class k in x-space and the projected mean \hat{m}_k for the class k in y-space. In addition, we define the variance s_k^2 for the class k by utilizing \hat{m}_k, so that s_k^2 is the sum of the squared differences between the projected samples and the projected mean of the class k. The difference \hat{m}_1 - \hat{m}_2 in y-space gives an estimate of the separability between the classes 1 and 2. [Li et al., 2014]

• m_k = \frac{1}{N_k} \sum_{x_i \in C_k} x_i, where k = 1, 2.

• \hat{m}_k = \frac{1}{N_k} \sum_{x_i \in C_k} y_i = \frac{1}{N_k} \sum_{x_i \in C_k} \theta^T x_i = \theta^T m_k, where k = 1, 2.

• s_k^2 = \sum_{x_i \in C_k} (\theta^T x_i - \hat{m}_k)^2, where k = 1, 2.

We want to transform Eq. 16 into a more convenient quadratic form to make it easy to solve. (\hat{m}_1 - \hat{m}_2)^2 can be expressed as \theta^T S_B \theta by inserting \hat{m}_k = \theta^T m_k into the equation. The matrix S_B, which is also called the between-class scatter matrix, is of the form (m_1 - m_2)(m_1 - m_2)^T. The sum s_1^2 + s_2^2 can also be rewritten as \theta^T S_W \theta. The matrix S_W is known as the within-class scatter matrix and is of the following form:

• S_W = \sum_{k=1,2} \sum_{x_i \in C_k} (x_i - m_k)(x_i - m_k)^T

Finally, we obtain Eq. 17, where S_B and S_W are d × d matrices and d is the dimension of \theta. Maximizing the function gives us the projection where the within-class scatter is as small as possible while the between-class scatter is as large as possible. This means that the classes are clustered tightly together and their class means are as far apart as possible. In the general case, finding the \theta that maximizes J leads to a generalized eigenvalue problem. For the special case where S_W is invertible, \theta can be calculated directly as \theta = S_W^{-1}(m_1 - m_2). [Li et al., 2014; Marsland, 2014]

\arg\max_\theta J(\theta) = \arg\max_\theta \frac{\theta^T S_B \theta}{\theta^T S_W \theta} \quad (17)
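As an illustration of the special-case solution (not the thesis's implementation), the projection direction can be computed directly with NumPy, where X1 and X2 are assumed data matrices for the two classes:

import numpy as np

def fisher_direction(X1, X2):
    """theta = S_W^{-1} (m_1 - m_2) for the two-class special case."""
    m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
    # Within-class scatter matrix summed over both classes
    S_W = (X1 - m1).T @ (X1 - m1) + (X2 - m2).T @ (X2 - m2)
    return np.linalg.solve(S_W, m1 - m2)

# A case x is then projected onto the direction with y = theta @ x (Eq. 15).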

3.4 Decision Trees

There are two reasons why decision trees are a good choice as a machine learning algorithm.

First, a decision tree is based on a binary tree structure, and the computational cost of querying it is low, as is the cost of building the tree. Second, decision trees can be followed: it is easy to understand how the logic works, compared to hard-to-interpret models.

The most popular decision tree algorithm is ID3 (Iterative Dichotomiser 3), which uses information gain as the attribute selection measure. Information gain is presented in Section 3.4.3 and the ID3 algorithm in Section 3.4.4. Other popular algorithms are ID3's successor C4.5 and Classification and Regression Tree (CART). Decision trees are also used in ensemble learning, where the random forest algorithm uses them as the base classifier.

Ensemble learning and random forest are explained in Section 3.5. [Han et al., 2012; Kelleher et al., 2018]

Decision trees work well when the input data is mostly nominal or ordinal. If the input data is numerical, a new branch would be created for each numerical value, which would make the tree enormous. It is possible to convert numerical data into ordinal data by defining intervals, but this can be difficult. Decision trees are sensitive to noise, because the tree is split after each decision, which lowers the number of cases left in each branch. If there are only a few cases left in a branch, noise can have a big influence. For this reason, decision trees should be kept as shallow as possible. [Marsland, 2014; Kelleher et al., 2018].

3.4.1 How to build the tree?

Decision tree algorithms are used for producing the tree. Decision trees can be constructed in a top-down or bottom-up fashion, but top-down is the more common way. If the tree is built in a top-down manner, the tree starts from the root node, which represents a feature, and is split into two or more branches depending on whether a binary or non-binary tree is produced. A simple decision tree is given in Figure 3. There are three features: sunny, windy, and rainy. If a new case were to be classified, we would follow the tree starting from the root node "Sunny?". If the feature "Sunny?" has "Yes" as its value, the decision tree gives "Go out" as the result of the classification. If the answer is "No", we move downward to the node "Windy?" and do a similar comparison. This process is continued until the decision tree gives an output. [Marsland, 2014]

A decision tree can also be expressed in program code as a chain of if-statements. Figure 3 could be turned into the following if-statements:

• If it is sunny then go out

• If it is not sunny and it is windy then stay in

• If it is not sunny and not windy and rainy then stay in

• Otherwise go out

[Marsland, 2014]


Figure 3: An example of a decision tree

It is important to choose the features that give the most information in order to obtain an efficient and compact tree. Entropy and information gain are utilized in calculating which features to choose. They are explained in the following sections. [Marsland, 2014]

3.4.2 Entropy

Entropy is a metric for measuring the homogeneity, or purity, of the data. The entropy function is given in Eq. 18. The entropy is defined as the negative sum over all the classes of p_i \log_2 p_i, where p_i is the probability of the class i in the data. It describes the uncertainty of the random variable p. C is the number of classes in the data. The logarithm has base 2 because the entropy is measured in bits. Furthermore, 0 \log 0 is defined to be 0, so that classes with zero probability contribute nothing to the sum. If all the cases share the same output class, the feature does not give any information and the result is therefore 0. The amount of entropy is between 0 and 1 for two classes. For more classes the maximum value depends on the size and the number of different values in the set, and the value is at its maximum when all the values in the set are different. [Marsland, 2014; Kelleher and Tierney, 2018; Cover et al., 2006]

Entropy(p) = -\sum_{i=1}^{C} p_i \log_2 p_i \quad (18)


3.4.3 Information Gain

We define the information gain function by using the above-mentioned entropy function. The ID3 algorithm utilizes this function to find the biggest information gain when choosing the best feature: the feature with the highest information gain is chosen as the splitting attribute. The information gain function is given in Eq. 19. It is defined as the difference between the entropy of the set T and the weighted entropy after the set is split by the feature A, and it tells how much information would be gained if we split the data by the feature A. If we split by the feature A, there are v different branches from the feature A; in the case of a binary feature, there are two branches, the left branch and the right branch. To calculate the information gain for a split, we calculate the entropy of the root node's child branches as a weighted average based on the number of examples in them. The information gain is then the entropy of the root node minus the weighted entropy of the children. [Marsland, 2014; Han et al., 2012]

InformationGain(T, A) = Entropy(T) - \sum_{x \in A} \frac{|x|}{|T|}\, Entropy(x) \quad (19)

The set T contains all the cases. |x|/|T| is the weight for a branch, where |T| is the total number of cases in T and |x| is the number of cases that have the value x of the feature A. The sum runs over the v different values of the feature A, i.e., over the branches. The total sum of all the weights is 1.
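As a hedged illustration of Eqs. 18 and 19 (the weather labels below are made up to mirror Figure 3):

import numpy as np
from collections import Counter

def entropy(labels):
    """Eq. 18: -sum_i p_i log2 p_i over the class labels in the set."""
    counts = np.array(list(Counter(labels).values()), dtype=float)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))     # 0 log 0 never occurs: only present labels are counted

def information_gain(labels, feature_values):
    """Eq. 19: entropy of the whole set minus the weighted entropy of the branches."""
    gain = entropy(labels)
    n = len(labels)
    for v in set(feature_values):
        branch = [lab for lab, f in zip(labels, feature_values) if f == v]
        gain -= len(branch) / n * entropy(branch)
    return gain

labels = ['go out', 'go out', 'stay in', 'stay in']
sunny  = ['yes',    'yes',    'no',      'no']
print(information_gain(labels, sunny))   # 1.0 bit: this split separates the classes perfectly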

3.4.4 ID3 algorithm

The ID3 algorithm is given in Figure 4. The idea is to build the tree recursively by finding the best feature at each point until the tree is complete. The training data, the set of features and the information gain function are given as the input to the function. The information gain function is used to calculate the information gain for each feature in the set S. The root node is created from the best feature F, and a branch from the node is created for every value in F. Then, in each branch, the function is called recursively with the features excluding the one already used. The stopping criterion is reached either when there are no features left (a leaf node is created with the most common class label) or when there is only one class left in the data (a leaf is created with that class). As a result, the complete decision tree is returned.

Input
T = training data
S = the set of features
p = information gain function
Output
Decision tree
---
1. if all the cases have the same label:
2.     - Create a leaf node with the label
3. else if there are no features left:
4.     - Create a leaf with the most common label
5. else:
6.     - Apply the information gain function p to choose the best feature F from the set of features S. Create a node from this feature.
7.     - Create a branch for each of the different values in F
8.     - For each branch:
9.         - Remove the chosen feature F from the set S
10.        - Make a recursive call with the rest of the features in S

Figure 4: ID3 algorithm [Marsland, 2014]
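For comparison only: scikit-learn does not implement ID3 itself (its trees are CART-style binary trees), but setting criterion="entropy" uses the same information-gain idea. The toy weather data below is a made-up encoding of Figure 3:

from sklearn.tree import DecisionTreeClassifier, export_text

# Columns are sunny, windy, rainy (1 = yes, 0 = no)
X = [[1, 0, 0], [1, 1, 1], [0, 1, 0], [0, 0, 1], [0, 0, 0]]
y = ["go out", "go out", "stay in", "stay in", "go out"]

tree = DecisionTreeClassifier(criterion="entropy").fit(X, y)
print(export_text(tree, feature_names=["sunny", "windy", "rainy"]))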

3.5 Ensemble Learning

When multiple different machine learning models are combined, it is called ensemble learning.

However, it is common to use only one type of algorithm at a time, e.g., decision trees with random forest, but it is also possible to combine different kinds of models. The idea of ensemble learning is that multiple learners, each specializing in its own area, give a better result when combined (aggregation) than a single learner, provided there is enough variance between them. The downside is that the complexity of the algorithms and models is increased. The most popular ensemble learning methods include boosting, bagging, and stacking. Random forest and the underlying techniques, bootstrapping and aggregation, are explained in the following sections. [Flach, 2012; Marsland, 2014]

3.5.1 Bootstrapping and aggregation

Bootstrapping is a statistical method which is utilized in ensemble learning. A bootstrap sample is chosen randomly with replacement, meaning that the same case can be chosen multiple times. The bootstrap sample represents the population. The bootstrap sample is usually the same size as the original dataset and it is taken multiple times, at least 50 times according to Marsland (2014). Each bootstrap sample is then used to train one model in the ensemble, and the combination of all the trained models is called an ensemble of trees. The benefit of bootstrapping is to have many diverse decision trees in the ensemble. After the ensemble of trees is ready, there are several options for aggregating them. The most common way is to take a plurality vote of all the classifiers in the ensemble in a classification problem, or the average in a regression problem. [Kuhn, 2013; Marsland, 2014]
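A minimal sketch of bootstrapping and plurality-vote aggregation, assuming NumPy arrays and scikit-learn decision trees (this is an illustration, not the project's implementation):

import numpy as np
from collections import Counter
from sklearn.tree import DecisionTreeClassifier

def bagging_predict(X_train, y_train, x_new, n_models=50, seed=0):
    """Train one tree per bootstrap sample and combine them by a plurality vote."""
    rng = np.random.default_rng(seed)
    n = len(y_train)
    votes = []
    for _ in range(n_models):
        idx = rng.integers(0, n, size=n)            # bootstrap: sample n cases with replacement
        tree = DecisionTreeClassifier().fit(X_train[idx], y_train[idx])
        votes.append(tree.predict([x_new])[0])
    return Counter(votes).most_common(1)[0][0]      # aggregation: plurality vote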

3.5.2 Random Forest

Breiman's random forest algorithm has gained popularity in recent years. It is a bagging algorithm based on bootstrapping and aggregation, and it uses the decision tree algorithm as the base classifier. The algorithm is given in Figure 5. The key idea is to train diverse decision trees with randomly created bootstrap samples, so that there are multiple learners with the same distribution in each. In addition to bootstrapping, the set of candidate features is also randomly selected for each node in the decision tree. Bootstrap aggregation and the random feature selection help in reducing the variance while leaving the bias unaffected. Also, no pruning of the trees is needed. According to Breiman (2001), the recommended value for the number of random features is \log_2 f + 1, where f is the number of all the features in the dataset. [Breiman, 2001; Seni and Elder, 2010; Marsland, 2014]


Input
T = training data
N = the number of trees in the ensemble
f = the number of features selected at random at each node
Output
Ensemble of trees
---
Do for each of the N trees:
1. Create a bootstrap sample from the dataset
2. Create a decision tree based on the bootstrap sample
3. At each node, select f features at random from all the features. Then calculate the information gain on this set to choose the splitting feature.

Figure 5: Random forest algorithm [Marsland, 2014]

3.5.3 Evaluation of Random Forest

It is easy to calculate the accuracy of a bagging algorithm. For each bootstrap sample, the cases that were not chosen into the bootstrap sample are called out-of-bag (OOB) examples. They can be used as novel test data in the validation. The OOB score is calculated from the OOB samples as the number of correct predictions divided by the number of OOB samples; the OOB error is its complement, the number of incorrect predictions divided by the number of OOB samples. The OOB error and cross-validation produce a similar kind of error estimate, but the OOB validation can be performed while the model is being trained. Therefore, training all the trees is not always even necessary, and the training can be stopped when the OOB error stabilizes. [Hastie et al., 2008; Marsland, 2014; Kuhn et al., 2013].
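As an illustrative usage sketch, scikit-learn's random forest exposes both the per-split feature sampling and the OOB estimate; note that max_features="log2" draws log2(f) candidate features, which is close to, but not exactly, Breiman's log2(f) + 1 recommendation:

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

X, y = load_breast_cancer(return_X_y=True)

# oob_score=True evaluates the forest on its out-of-bag samples during training.
forest = RandomForestClassifier(n_estimators=200, max_features="log2",
                                oob_score=True, random_state=0).fit(X, y)
print(forest.oob_score_)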

3.5.4 Bias and variance error

Bias and variance describe two different kinds of errors machine learning algorithms can make.

Bias describes how big the error of a model is on average, in other words, how much the predicted values of a model differ from the real values of the data. Variance tells how much the predictions differ between the current model and future models when the model is trained with different data. For example, too small a dataset can increase the variance of the results. The so-called bias-variance trade-off describes the connection between bias and variance: an increase in bias typically decreases variance and vice versa. Careful balancing between minimizing the bias error and the variance error is required to achieve a low total error.

Linear algorithms have high bias, meaning they perform worse on complex problems than non-linear algorithms. Linear algorithms also have low variance, which means that there are only small changes to the model when the training data is changed. [Kuhn et al., 2013; Flach, 2012; Marsland, 2009]


4 Model evaluation

It is up to the data scientist to choose the set of machine learning algorithms to be used in a classification problem. After the ML model is built, its performance should be tested. A test set that is independent of the actual training data is used for testing the accuracy. If the same data were used both for training the model and for testing it, the results would be biased.

The dataset can be split into three different parts: a training set, a validation set, and a test set. Common splits are 50:20:30 and 40:20:40. After each ML model is trained with the training data, the validation set can be used to evaluate their performance and to compare which algorithm produces the best model. After that, the validation data can be joined with the training data to form one larger training set, and the test set can then be used to evaluate how the chosen model performs on unseen data. [Han et al., 2012; Kelleher and Tierney, 2018].

According to Kelleher and Tierney (2018), in most data science projects it is common that the first evaluation reveals problems in the data. A data scientist might wonder why a model works suspiciously well or badly, and going back to preprocessing might reveal problems in the data. It is therefore common that the model building and the preprocessing phases are repeated multiple times. [Kelleher and Tierney, 2018]

Evaluation measures are given for balanced and imbalanced datasets. In addition, the holdout method and k-fold cross-validation, with its special case leave-one-out cross-validation, are presented in the following sections.

4.1 Evaluation measures for balanced datasets

A confusion matrix is a good tool for analyzing the effectiveness of a model. It gives a good overall view of how the classifier is performing. The confusion matrix for binary classification is given in Figure 6. The confusion matrix consists of true positives, true negatives, false positives, and false negatives. P is the number of positive tuples that are of the class 'yes' and N is the number of negative tuples that are of the class 'no'. TP is the number of positive tuples that were correctly classified, and TN is the number of negative tuples that were correctly classified. Moreover, FP is the number of negative tuples that were incorrectly classified, and FN is the number of positive tuples that were incorrectly classified. [Han et al., 2012]. The true positive rate and other measures can be calculated from these figures as given in Section 4.2.

The confusion matrix for multiclass classification with n classes (where n > 2) is more complex: the size of the confusion matrix depends on the number of classes, and for n classes the size is n × n.

                        Predicted class
                        yes                      no
Actual class   yes      TP (true positives)      FN (false negatives)     P (positives)
               no       FP (false positives)     TN (true negatives)      N (negatives)
                        P'                       N'

Figure 6: An example of a confusion matrix

There are different metrics for evaluating the accuracy of a model, i.e., how well the model can predict future cases, e.g., accuracy, sensitivity, specificity, and precision. The model should be evaluated with a different set of tuples than was used for building it, to avoid overfitting to the data. [Han et al., 2012].

The accuracy is defined in Eq. 20 as the sum of true positives and true negatives over all the cases. This is a very intuitive way to measure the accuracy of a model: it is simply the percentage of how well the model classifies the tuples into correct classes. [Han et al., 2012].

accuracy(model) = \frac{TP + TN}{P + N} \quad (20)

The error rate is defined in Eq. 21 as the sum of false positives and false negatives over all the cases. The error rate is the complement of accuracy, i.e., 1 − accuracy. [Han et al., 2012].

error_rate(model) = \frac{FP + FN}{P + N} \quad (21)


There are also other ways to measure classifiers than using accuracy-based measures. The following aspects can be compared: speed, robustness, scalability, and interpretability. [Han et al., 2012].

4.2 Evaluation measures for imbalanced datasets

For imbalanced datasets, where the class distribution is skewed, the previously mentioned accuracy measure can give a good score even though most of the minority class cases are classified incorrectly. In this case, the sensitivity and specificity measures can give a more reliable result. [Han et al., 2012].

Sensitivity and specificity are also called the true positive rate (TPR) and the true negative rate (TNR). Sensitivity is defined as the number of true positives divided by the number of positives, and specificity as the number of true negatives divided by the number of negatives (Eqs. 22-23). For example, if there were 300 positive tuples and 90 of them were classified correctly as positive, the sensitivity of the classifier would be 90/300 = 30%. [Han et al., 2012].

Precision is defined in Eq. 24 as the number of true positives over the sum of true positives and false positives. The precision score tells what proportion of the tuples classified as positive are actually correct, whereas recall (also known as sensitivity) tells what proportion of the actual positive tuples were classified correctly. Precision and recall are related so that an increase in precision typically lowers the recall and vice versa. Precision and recall are usually used together, either by comparing the values of one at a fixed value of the other, or by combining them into a single measure called the F1-score, which is given in Eq. 25.

The false positive rate and false negative rate measures are given in Eqs. 26-27. [Han et al., 2012].

Sensitivity, recall, true positive rate (TPR) = \frac{TP}{P} = \frac{TP}{TP + FN} \quad (22)

Specificity, true negative rate (TNR) = \frac{TN}{N} \quad (23)

Precision = \frac{TP}{TP + FP} \quad (24)

F1\text{-}score = \frac{2 \times precision \times recall}{precision + recall} \quad (25)

False positive rate (FPR) = \frac{FP}{N} \quad (26)

False negative rate (FNR) = \frac{FN}{P} \quad (27)
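A small sketch of the measures in Eqs. 20-25 on a made-up binary prediction vector, using scikit-learn's metric functions:

from sklearn.metrics import (confusion_matrix, accuracy_score, recall_score,
                             precision_score, f1_score)

y_true = [1, 1, 1, 0, 0, 0, 0, 0]   # 1 = positive ('yes'), 0 = negative ('no')
y_pred = [1, 1, 0, 0, 0, 0, 1, 0]

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(tp, fn, fp, tn)                   # 2 1 1 4
print(accuracy_score(y_true, y_pred))   # (TP + TN) / (P + N) = 6/8
print(recall_score(y_true, y_pred))     # sensitivity = TP / P = 2/3
print(precision_score(y_true, y_pred))  # TP / (TP + FP) = 2/3
print(f1_score(y_true, y_pred))         # 2 * precision * recall / (precision + recall) = 2/3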

A ROC curve (receiver operating characteristic curve) is a useful tool for visualizing the performance of different classifiers and comparing them. It is used with imbalanced datasets and when the error costs are unequal. It is a graph of sensitivity (TPR) against the false positive rate (FPR). Two example ROC curves, denoted as blue and orange lines, are given in Figure 7. A perfect classifier would follow the green dashed line from the point (0,0) to the point (0,1) and from there to (1,1). A classifier following the black dashed diagonal would be an ineffective classifier. The closer the curve is to the upper left corner, the better the classifier performs. Both classifiers have a meaningful performance because the curves are located between the baseline (the diagonal) and the upper left corner. However, it can easily be observed that the orange classifier performs better.

An AUC (Area Under the Curve) score is a single measure that can be calculated for a ROC curve by computing the area under the curve. The value is between 0 and 1, and it tells how well the model can distinguish between the different classes. From Figure 7 we can observe that the scores of both classifiers fall between 0.5 (the baseline) and 1.0 (a perfect classifier). [Chawla et al., 2002; Kuhn and Johnson, 2013].

Figure 7: An example of two ROC curves

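As an illustration with made-up scores (not data from the project), the ROC curve and the AUC score can be computed with scikit-learn:

from sklearn.metrics import roc_curve, roc_auc_score

y_true   = [0, 0, 1, 1, 0, 1, 0, 1]
y_scores = [0.1, 0.4, 0.35, 0.8, 0.2, 0.7, 0.6, 0.9]   # predicted probabilities of the positive class

fpr, tpr, thresholds = roc_curve(y_true, y_scores)      # points of the ROC curve
print(roc_auc_score(y_true, y_scores))                  # area under the curve = 0.875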


4.3 Other methods

4.3.1 Holdout method

In the holdout method the data is randomly partitioned into a training set and a test set. The training set is used to build the model and the test set to evaluate it. It is important that the test set is not used in training the model, because that could create a biased model. According to Han et al. (2012), usually the test set should be one-third of the data and the training set two-thirds of the data, but other possible splits are also mentioned in the literature.

The holdout method works if the dataset is sufficiently large. When this is not the case, finding the best split is problematic: the bigger the training set, the better the resulting classifier, whereas the bigger the test set, the more reliable the error estimate. In the worst case, training samples from a certain class are missing entirely, resulting in a biased model. Witten et al. (2011) mention a stratified holdout method which ensures that all the classes are equally represented in the test and training sets. The methods in the following sections give more reliable estimates than the holdout method.

Random subsampling is a common way to handle the possible bias caused by the holdout method. In random subsampling the split into a training set and a test set is done randomly and multiple times. Unlike in the holdout method, the accuracy is calculated as the average over the iterations. Also, stratification can be utilized when the test set is chosen in each iteration. [Han et al., 2012; Witten et al., 2011; Kelleher et al., 2018].

4.3.2 K-fold cross validation

K-fold cross-validation is used when the dataset is not sufficiently large. In k-fold cross-validation the training data is partitioned into k subsets of equal size called folds. For example, in 3-fold cross-validation three subsets of equal size are formed. Each of the three subsets is then used once for evaluating the model while the remaining subsets are used for building the model.

The accuracy for the k-fold cross-validation is calculated as the average of k iterations. [Witten et al., 2011].

In the leave-one-out (LOO) method each tuple in the training set is used once as the test set and the remaining tuples are used to build the model. The process is repeated until every tuple has been used once as the test set. The accuracy is calculated in the same way as in k-fold cross-validation. [Witten et al., 2011].

Compared to the other methods LOO is deterministic: the same result is obtained every time, because there is no randomness involved in selecting the training set. Another advantage of LOO is that, because all the data except the one case left out as a test case is used for building the model, the classification accuracy is presumably maximized. [Witten et al., 2011].

Stratification cannot be used in the leave-one-out method, because there is only one tuple in the test set, and a single tuple cannot have the same class distribution as the training set. [Witten et al., 2011].
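An illustrative sketch of the three validation schemes with scikit-learn (the model and dataset below are placeholders, not the ones used in the project):

from sklearn.datasets import load_iris
from sklearn.model_selection import LeaveOneOut, cross_val_score, train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
model = KNeighborsClassifier(n_neighbors=5)

# Stratified holdout: one-third of the data is kept as the test set
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=1/3, stratify=y, random_state=0)
print(model.fit(X_tr, y_tr).score(X_te, y_te))

# 10-fold cross-validation: the accuracy is the average over the folds
print(cross_val_score(model, X, y, cv=10).mean())

# Leave-one-out: each case is held out once as the test set
print(cross_val_score(model, X, y, cv=LeaveOneOut()).mean())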


5 SMOTE (Synthetic Minority Over-sampling Technique)

A dataset is imbalanced when the class distribution is not uniform. This is a problem in machine learning: an algorithm could simply classify all the cases as the dominating class and still get a very high accuracy. There are different ways to deal with imbalanced datasets. One way is to associate a cost with a certain type of incorrect prediction. Another way is to either over-sample the minority class or under-sample the majority class in the training data.

SMOTE is a mix of these two techniques: under-sampling the majority class and over-sampling the minority class. [Chawla et al., 2002].

SMOTE is one of the most popular systematic algorithms for generating synthetic samples.

However, it is not much discussed in the literature. The SMOTE algorithm is given in Figure 8. In this approach, extra training data is generated by creating synthetic samples from the minority class. The number of new samples depends on the amount of over-sampling needed. For example, if the chosen sampling rate is 200%, two nearest neighbours of the minority sample are chosen and one new sample is generated in the direction of each, at a random point between the minority sample and the nearest neighbour. Therefore, a SMOTE percentage of 200% generates two synthetic samples for each original minority class sample. [Chawla et al., 2002].


Input:
S = number of minority class samples
k = number of nearest neighbours
n = SMOTE % as integer
Output:
Array of generated minority class examples
---
n := rate of SMOTE as integer            # for example 2 = 200%
k := number of neighbours                # for example 5
nr_of_attrs := number of attributes
last_idx := 0                            # running index to the last generated sample
sample[][]                               # array of the original minority class samples
synthetic_sample[][]                     # array of the new generated samples

for i := 1 .. S
    idx_array := indices of the k nearest neighbours of i
    Populate(n, i, idx_array)

# Function for generating the new samples.
# For example, if n = 5, the body of the function is run 5 times.
Populate(n, i, idx_array)
    while n != 0
        rand_nearest_neighbour := get_random_int[1, k]
        for attr := 1 .. nr_of_attrs
            diff := sample[idx_array[rand_nearest_neighbour]][attr] - sample[i][attr]
            noise := get_random_float[0, 1]   # a random number between 0 and 1
            synthetic_sample[last_idx][attr] := sample[i][attr] + noise * diff
        last_idx := last_idx + 1
        n := n - 1

Figure 8: SMOTE algorithm [Chawla et al., 2002]
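The project's own use of SMOTE is described in Chapters 6-8; as an illustration only, the imbalanced-learn package provides an implementation of the algorithm (assuming that package is available):

from collections import Counter
from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification

# An artificial imbalanced two-class dataset (roughly 10% minority class)
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)
print(Counter(y))

# k_neighbors corresponds to the number of nearest neighbours k in Figure 8
X_res, y_res = SMOTE(k_neighbors=5, random_state=0).fit_resample(X, y)
print(Counter(y_res))   # the minority class has been over-sampled to balance the classes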


6 The project

The project is part of the research project of classification and incidence of patient injuries in the field of psychiatry in the Medical Faculty at Tampere University. The task is to classify data in the field of psychiatry and neurology by applying machine learning algorithms.

The project part consists of two different classification tasks: a classification of applications for compensation into six predefined categories and a binary classification problem into two categories - if the application for compensation was accepted or not. The data is offered by The Patient Insurance Centre of Finland.

6.1 The dataset

The data consists of the reasons for the decisions and an expert's opinions on the applications for compensation (N ≈ 350). The data is complemented by the actual applications for compensation. The applications for compensation include the following information: medical field (psychiatry/neurology); applicant's age; sex; university hospital district (specialised areas of responsibility) where treatment was obtained; year of the application; year of the decision; and decision (positive/negative). The applications were manually classified into six different classes in the pilot phase (the classes are given in Table 1). The objective of the classification was to find those phrases in the decisions and statements that would lead each case to a certain class.

The phrases chosen from the applications include, e.g., diagnoses, descriptions of symptoms and other significant phrases. Some of the phrases (translated from Finnish) are, for example, ”inappropriate medical treatment”, ”appropriate care during hospitalization”, ”anxiety”, and ”medication discontinuation”, which were categorized into the groups ”nursing”, ”hospital care”, ”depression”, and ”drugs and medication (not psychosis)”. This classification is then repeated with the machine learning algorithms and the results are compared. [Kampman, 2021]


Class nr   Class label
1          Psychosis, involuntary treatment. Care or medication deemed unwarranted or harmful in the complaint.
2          A complaint about a suicide attempt or completed suicide. Care is deemed insufficient or faulty.
3          A complaint about a diagnostic error or a prolonged diagnostic process.
4          Harm due to medication or another form of biological treatment OR incorrect medication (not related to psychosis).
5          Harm due to some other aspect of treatment (e.g. therapy, problems in communication).
6          Incidents during hospitalization (e.g. falling down, errors in administering medication).

Table 1: The classes in the pilot phase

6.2 Preprocessing

In the data preprocessing phase, the data was studied and cleaned. The dataset consisted of 328 cases before preprocessing. Some cases were dropped because of insufficient data. There are 3-15 phrases for each case in the multivalue attribute 'phrases' in the dataset. Some of the phrases had spelling mistakes in them. There were phrases that were opposite in meaning: ”use is not entirely appropriate” and ”use was properly instructed”. Some phrases could be split into parts, e.g., ”falling serious concussion” into ”falling” and ”serious concussion”. Some of the phrases were quite similar, e.g., ”clinical research and treatment procedure” and ”clinical research or treatment procedure”. Deciding how these phrases were split up or categorized was done with the help of the project group. After the preprocessing phase there were 308 cases left in the dataset.

There would not have been enough cases for all the individual phrases for the classification, so a list of 35 distinct abstract groups was created for this purpose. All the 1591 phrases were categorized into these groups with the help of the project group. The list of groups and the classification criteria for each of them is given in Table 2. As an example, the phrase group ”hospital care” (translated from Finnish) is given in Table 3. Some of the words, for example ”hospitalization”, occur multiple times because of the declension of Finnish nouns: ”osastohoito” and ”osastohoidon” are both translated into ”hospitalization”. Also, the words ”osastohoito” and ”sairaalahoito” are both translated as ”hospitalization” in the table. The observation matrix is of size 308 × 36, in which the first 35 columns are the numerical attributes formed from the created phrase groups and the 36th column is the class label for each case.

In addition, min-max normalization was performed on the data (explained in Section 3.1.1), meaning every attribute was transformed to a range between 0 and 1. This is essential especially for the k-NN classifier, which otherwise would not treat every attribute equally.


Phrase group   Classification criteria                                            Number of phrases
1              The patient's demeanor/state                                       200
2              Psychosis, delusions                                               12
3              ADHD and other neurological diseases                               52
4              The patient's behaviour                                            39
5              Interaction in a treatment setting                                 22
6              Brain tumors + other organic neurological diseases and symptoms    23
7              Intoxicants                                                        22
8              Bipolar disorder                                                   19
9              Other organic diseases, symptoms                                   17
10             Electroconvulsive therapy                                          154
11             Neuroleptics and neuroleptic treatment                             20
12             Suicide                                                            38
13             Hospitalization                                                    40
14             Suicidality                                                        44
15             Therapy                                                            16
16             Imaging                                                            21
17             Monitoring                                                         9
18             Tests and examination                                              21
19             Tests and treatment together                                       31
20             Diagnostics                                                        43
21             Medicines and medication (not psychosis)                           64
22             Other psychiatric diagnoses and symptoms                           216
23             Depression                                                         54
24             Death, decease                                                     26
25             Anxiety, anxiousness                                               18
26             Treatment                                                          12
27             Involuntary                                                        156
28             Patient harm                                                       53
29             Procedure                                                          23
30             Adverse effects                                                    14
31             Accident                                                           25
32             Medicine in general                                                20
33             Other, unclassified                                                26
34             Compensation, damages                                              30
35             Otherwise related to patient treatment                             11

Table 2: The number of phrases in each phrase group.
