4.3.1 Holdout method

In the holdout method, the data is randomly partitioned into a training set and a test set. The training set is used to build the model and the test set to evaluate it. It is important that the test set is not used in training the model, because that would produce an optimistically biased accuracy estimate. According to Han et al. (2012), typically one-third of the data is used as the test set and two-thirds as the training set, but other splits are also mentioned in the literature.

The holdout method works if the data set is sufficiently large. When this is not the case, finding the best split is problematic: the larger the training set, the better the learned classifier, whereas the larger the test set, the more reliable the accuracy estimate. In the worst case, training samples from some class are missing entirely, resulting in a biased model. Witten et al. (2011) mention a stratified holdout method, which ensures that all classes are represented in the same proportions in the training and test sets. The methods in the following sections give more reliable estimates than the holdout method.

Random subsampling is a common way to handle the possible bias caused by the holdout method. In random subsampling, the split into a training set and a test set is done randomly and repeated multiple times. Unlike in the plain holdout method, the accuracy is calculated as the average over the iterations. Stratification can also be applied when the test set is chosen in each iteration. [Han et al., 2012; Witten et al., 2011; Kelleher et al., 2018].
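As an illustration, the following is a minimal sketch of a stratified holdout split and of random subsampling. It assumes scikit-learn and uses a decision tree as a stand-in classifier; neither choice is prescribed by the sources above.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Illustrative imbalanced data: 80 % majority class, 20 % minority class.
X, y = make_classification(n_samples=300, weights=[0.8, 0.2], random_state=0)

# Stratified holdout: one-third test set, class proportions preserved.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=1/3, stratify=y, random_state=0)
model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
print("holdout accuracy:", accuracy_score(y_test, model.predict(X_test)))

# Random subsampling: repeat the split with different seeds and
# report the average accuracy over the iterations.
accuracies = []
for seed in range(10):
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=1/3, stratify=y, random_state=seed)
    m = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
    accuracies.append(accuracy_score(y_te, m.predict(X_te)))
print("random subsampling accuracy:", np.mean(accuracies))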

4.3.2 K-fold cross-validation

K-fold cross-validation is used when the dataset is not sufficiently large for a single holdout split. In k-fold cross-validation the data is partitioned into k subsets of equal size, called folds. For example, in 3-fold cross-validation three equal-sized subsets of the data are formed. Each of the three subsets is then used once for evaluating the model, while the remaining two are used for building it. The accuracy for k-fold cross-validation is calculated as the average over the k iterations. [Witten et al., 2011].
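A minimal sketch of 3-fold cross-validation, again assuming scikit-learn; the fold mechanics, not the library, are the point.

from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
# Each of the 3 folds serves once as the test set; the remaining
# two folds are used to build the model.
scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=3)
print("per-fold accuracies:", scores)
print("3-fold CV accuracy:", scores.mean())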

In the leave-one-out (LOO) method, each tuple in the data set is used once as the test set, and the remaining tuples are used to build the model. The process is repeated until every tuple has served once as the test set. The accuracy is calculated in the same way as in k-fold cross-validation. [Witten et al., 2011].

Compared to the other methods, LOO is deterministic: the same result is obtained every time, because there is no randomness involved in selecting the training set. Another advantage of LOO is that all the data, except the single case left out for testing, is used for building the model, so as much data as possible is available for training. [Witten et al., 2011].

Stratification cannot be used in the leave-one-out method, because the test set contains only a single tuple and can therefore never reflect the class distribution of the data. [Witten et al., 2011].
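Leave-one-out evaluation can be sketched in the same way, again assuming scikit-learn:

from sklearn.datasets import load_iris
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
# Each tuple serves once as the single-item test set, so each score
# is 0 or 1; the mean over all tuples is the LOO accuracy.
scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y,
                         cv=LeaveOneOut())
print("LOO accuracy:", scores.mean())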

5 SMOTE (Synthetic Minority Over-sampling Technique)

A dataset is imbalanced when the class distribution is not uniform. This is a problem in machine learning: an algorithm could simply classify all cases as the dominating class and still achieve very high accuracy. For example, if 99 % of the cases belong to one class, a classifier that always predicts that class is 99 % accurate yet never detects the minority class. There are different ways to deal with imbalanced datasets. One way is to associate a cost with certain incorrect predictions. Another is to either over-sample the minority class or under-sample the majority class in the training data.

SMOTE is a combination of these two techniques: under-sampling the majority class and over-sampling the minority class. [Chawla et al., 2002].
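A small sketch of the two families of approaches mentioned above, assuming scikit-learn. The class_weight argument implements the cost-based idea, and the oversample helper (an illustrative name, not a library function) shows naive over-sampling by duplication:

import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Cost-based approach: penalize errors on the minority class more heavily.
model = DecisionTreeClassifier(class_weight={0: 1, 1: 10})

# Naive over-sampling: duplicate minority rows until the classes balance.
def oversample(X, y, minority_label=1, rng=None):
    rng = rng if rng is not None else np.random.default_rng(0)
    minority = np.flatnonzero(y == minority_label)
    majority = np.flatnonzero(y != minority_label)
    extra = rng.choice(minority, size=len(majority) - len(minority))
    keep = np.concatenate([np.arange(len(y)), extra])
    return X[keep], y[keep]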

SMOTE is one of the most popular systematic algorithms for generating synthetic samples. However, it is not discussed much in the literature. The SMOTE algorithm is given in Figure 8. In this approach, extra training data is generated by creating synthetic samples from the minority class. The number of new samples depends on the amount of over-sampling needed. For example, if the chosen over-sampling rate is 200 %, two nearest neighbours of each minority sample are chosen, and one new sample is generated towards each of them, at a random point on the line segment between the minority sample and the neighbour. A SMOTE percentage of 200 % therefore generates twice as many synthetic samples as there are original minority samples. [Chawla et al., 2002].

Input:
    S = number of minority class samples
    k = number of nearest neighbours
    n = SMOTE % as integer
Output:
    Array of generated minority class examples
---
n := rate of SMOTE as integer      # for example 2 = 200 %
k := number of neighbours          # for example 5
nr_of_attrs := number of attributes
last_idx := 0                      # running index to the last generated sample
sample[][]                         # array of the original minority class samples
synthetic_sample[][]               # array of the new generated samples

for i := 1 .. S
    idx_array := indices of the k nearest neighbours of i
    Populate(n, i, idx_array)

# Function for generating the new samples.
# For example, if n = 5, the body of the while loop is run 5 times.
Populate(n, i, idx_array)
    while n != 0
        rand_nearest_neighbour := get_random_int[1, k]
        for attr := 1 .. nr_of_attrs
            diff := sample[idx_array[rand_nearest_neighbour]][attr] - sample[i][attr]
            noise := get_random_float[0, 1]    # a random fraction of the gap; an integer would only copy an endpoint
            synthetic_sample[last_idx][attr] := sample[i][attr] + noise * diff
        last_idx := last_idx + 1
        n := n - 1

Figure 8: SMOTE algorithm [Chawla et al., 2002]
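To make Figure 8 concrete, the following is a minimal NumPy sketch of the same procedure. The function name smote and its signature are illustrative; this is not the reference implementation of Chawla et al. (2002).

import numpy as np
from sklearn.neighbors import NearestNeighbors

def smote(samples, n, k, rng=None):
    """Generate n synthetic samples per minority sample (n = 2 means 200 %)."""
    rng = rng if rng is not None else np.random.default_rng(0)
    # k + 1 neighbours, because each sample is its own nearest neighbour.
    nn = NearestNeighbors(n_neighbors=k + 1).fit(samples)
    _, idx = nn.kneighbors(samples)
    synthetic = []
    for i in range(len(samples)):
        for _ in range(n):
            neighbour = samples[rng.choice(idx[i][1:])]
            # Random point on the line segment between the minority
            # sample and the chosen neighbour.
            noise = rng.random()
            synthetic.append(samples[i] + noise * (neighbour - samples[i]))
    return np.array(synthetic)

minority = np.array([[1.0, 2.0], [2.0, 1.0], [2.0, 3.0], [3.0, 2.0]])
print(smote(minority, n=2, k=3))   # 200 % gives 8 synthetic samples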

6 The project

The project is part of a research project on the classification and incidence of patient injuries in the field of psychiatry at the Medical Faculty of Tampere University. The task is to classify data from the fields of psychiatry and neurology by applying machine learning algorithms.

This part of the project consists of two different classification tasks: a classification of applications for compensation into six predefined categories, and a binary classification of whether the application for compensation was accepted or not. The data is provided by the Patient Insurance Centre of Finland.