
When multiple machine learning models are combined, it is called ensemble learning. It is common to use only one type of algorithm at a time, e.g., decision trees in a random forest, but it is also possible to combine different kinds of models. The idea of ensemble learning is that multiple learners, each specializing in its own area, give a better result when combined (aggregation) than a single learner, assuming there is enough variance between them. The downside is the increased complexity of the algorithms and models. The most popular ensemble learning methods include boosting, bagging, and stacking. Random forest and its underlying techniques, bootstrapping and aggregation, are explained in the following sections. [Flach, 2012; Marsland, 2014]

3.5.1 Bootstrapping and aggregation

Bootstrapping is a statistical method utilized in ensemble learning. A bootstrap sample is drawn randomly with replacement, meaning that the same case can be chosen multiple times. The bootstrap sample represents the population and is usually the same size as the original dataset. Sampling is repeated many times, at least 50 times according to Marsland (2014). Each bootstrap sample is then used to train one model in the ensemble, and the combination of all the trained models is called an ensemble; with decision trees as base learners it is an ensemble of trees. The benefit of bootstrapping is to obtain many diverse decision trees in the ensemble. Once the ensemble is ready, there are several options for aggregating its outputs. The most common way is to take the plurality vote of all the classifiers in the ensemble in a classification problem, or the average in a regression problem. [Kuhn et al., 2013; Marsland, 2014]
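
To make the procedure concrete, below is a minimal NumPy sketch of one bootstrap draw and of the two aggregation rules. The dataset, the hypothetical ensemble of 50 models, and all variable names are illustrative placeholders, not taken from the cited sources.

import numpy as np

rng = np.random.default_rng(42)
data = np.arange(10)  # stand-in for a dataset of 10 cases

# One bootstrap sample: the same size as the dataset, drawn with
# replacement, so the same case can appear several times.
sample = rng.choice(data, size=data.size, replace=True)

# Aggregation for classification: plurality vote over the ensemble.
# `predictions` stands in for the class each of 50 trained models predicts.
predictions = rng.integers(0, 3, size=50)      # hypothetical labels 0, 1, 2
majority_class = np.bincount(predictions).argmax()

# Aggregation for regression: the average of the models' outputs.
outputs = rng.normal(size=50)                  # hypothetical predictions
mean_prediction = outputs.mean()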

3.5.2 Random Forest

Breiman’s random forest algorithm has gained popularity in recent years. It is a bagging algorithm based on bootstrapping and aggregation, and it uses the decision tree algorithm as the base classifier. The algorithm is given in Figure 5. The key idea is to train diverse decision trees on randomly created bootstrap samples, so that the ensemble contains multiple learners drawn from the same distribution. In addition to bootstrapping, a subset of features is randomly selected for each node in the decision tree. Bootstrap aggregation and random feature selection help in reducing the variance while leaving the bias unaffected. Also, no pruning of the trees is needed. According to Breiman (2001), the recommended number of random features is $\log_2 f + 1$, where f is the number of all the features in the dataset. [Breiman, 2001; Seni and Elder, 2010; Marsland, 2014]

Input
T = The training data
N = The number of trees in the ensemble
f = The number of features

Output
Ensemble of trees

Do for each of the N trees:
1. Create a bootstrap sample from the dataset.
2. Create a decision tree based on the bootstrap sample.
3. Select f features at random from all the features for each node, then calculate the information gain on the set to choose the splitting feature.

Figure 5: Random forest algorithm [Marsland, 2014]
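
As an illustration of Figure 5, a compact sketch of the procedure is given below, assuming NumPy and scikit-learn are available and that class labels are non-negative integers. The function names train_forest and predict_forest are illustrative, not from the cited sources.

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def train_forest(X, y, n_trees):
    """Train an ensemble of trees as in Figure 5."""
    rng = np.random.default_rng(0)
    n_samples, n_features = X.shape
    # Breiman's recommendation: log2(f) + 1 random features per node.
    k = int(np.log2(n_features)) + 1
    forest = []
    for _ in range(n_trees):
        # Step 1: bootstrap sample, same size as the data, with replacement.
        idx = rng.integers(0, n_samples, size=n_samples)
        # Steps 2-3: grow an unpruned tree; max_features makes the tree
        # consider k random features at each node, and criterion="entropy"
        # splits on information gain.
        tree = DecisionTreeClassifier(criterion="entropy", max_features=k)
        forest.append(tree.fit(X[idx], y[idx]))
    return forest

def predict_forest(forest, X):
    # Aggregate the trees' predictions by plurality vote.
    votes = np.stack([tree.predict(X) for tree in forest]).astype(int)
    return np.array([np.bincount(col).argmax() for col in votes.T])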

3.5.3 Evaluation of Random Forest

It is easy to estimate the accuracy of a bagging algorithm. For each bootstrap sample, the cases that were not chosen into the sample are called out-of-bag (OOB) examples, and they can be used as novel test data in validation. The OOB score is the number of correct predictions divided by the number of OOB samples, and the OOB error is its complement: the number of incorrect predictions divided by the number of OOB samples. The OOB error and cross-validation produce a similar kind of error estimate, but OOB validation can be performed at the same time as the model is being trained. Therefore, training all the trees is not always necessary, and training can be stopped once the OOB error stabilizes. [Hastie et al., 2008; Marsland, 2014; Kuhn et al., 2013]
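
For example, scikit-learn's RandomForestClassifier can compute the OOB score during training. A brief sketch, where the Iris dataset serves only as a stand-in:

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)
# With oob_score=True each case is scored using only the trees whose
# bootstrap sample did not contain it.
forest = RandomForestClassifier(n_estimators=100, oob_score=True,
                                random_state=0).fit(X, y)
print("OOB score:", forest.oob_score_)        # correct / all OOB predictions
print("OOB error:", 1.0 - forest.oob_score_)  # incorrect / all OOB predictions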

3.5.4 Bias and variance error

Bias and variance describe two different kinds of errors machine learning algorithms can make. Bias describes how large the error of a model is on average, in other words, how much the predicted values of a model differ from the real values of the data. Variance describes how much the model's predictions change when it is trained on different, previously unseen data. For example, a dataset that is too small can increase the variance of the result. The so-called bias-variance trade-off describes the connection between bias and variance: a decrease in bias tends to increase variance and vice versa. Careful balancing between minimizing the bias error and the variance error is required to achieve a low total error.
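
One standard way to make the trade-off concrete, given here as a textbook identity rather than a formula from the cited sources, is the decomposition of the expected squared error of a model $\hat{f}$ trained to approximate $f$, with irreducible noise variance $\sigma^2$:

\[
\mathbb{E}\big[(y - \hat{f}(x))^2\big]
= \underbrace{\big(\mathbb{E}[\hat{f}(x)] - f(x)\big)^2}_{\text{bias}^2}
+ \underbrace{\mathbb{E}\big[(\hat{f}(x) - \mathbb{E}[\hat{f}(x)])^2\big]}_{\text{variance}}
+ \sigma^2
\]

Minimizing the total error therefore means trading the two learnable terms against each other.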

Linear algorithms have high bias, meaning they perform worse on complex problems than non-linear algorithms. Linear algorithms also have low variance, which means that the model changes little when the training data is changed. [Kuhn et al., 2013; Flach, 2012; Marsland, 2009]

4 Model evaluation

It is up to the data scientist to choose the set of machine learning algorithms to be used in a classification problem. After a ML model is built, its performance should be tested. A test set that is independent of the actual training data is used to test the accuracy: if the same data were used both to train the model and to test it, the results would be biased.

The dataset can be split into three parts: a training set, a validation set, and a test set. Common splits are 50:20:30 and 40:20:40. After each ML model is trained with the training data, the validation set can be used to evaluate their performance and to compare which algorithm produces the best model. After that, the validation data can be joined with the training data to form one larger training set, and the independent test set can then be used to evaluate how the chosen model performs on unseen data. [Han et al., 2012; Kelleher and Tierney, 2018]
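
A brief sketch of such a 50:20:30 split with scikit-learn's train_test_split is shown below; the Iris dataset and the variable names are placeholders.

import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# First hold out the 30 % test set, then split the remaining 70 % so that
# 50 % and 20 % of the original data become training and validation sets.
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.30, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.20 / 0.70, random_state=0)

# After model selection on the validation set, the training and validation
# data can be joined for the final fit before scoring on the test set.
X_final = np.vstack([X_train, X_val])
y_final = np.concatenate([y_train, y_val])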

According to Kelleher and Tierney (2018), in most data science projects the first evaluation reveals problems in the data. A data scientist might wonder why a model works suspiciously well or badly, and going back to preprocessing might reveal problems in the data. It is therefore not uncommon for the model building and preprocessing phases to be repeated multiple times. [Kelleher and Tierney, 2018]

Evaluation measures for balanced and imbalanced datasets are presented in the following sections, along with the holdout method, k-fold cross-validation, and its special case, leave-one-out cross-validation.