
3. Data Analysis

3.2. Multivariate Analysis

In order to estimate the target variables, multiple machine learning algorithms can be used. This section introduces the basic idea behind the machine learning algorithms used in this thesis, along with the hyperparameters that need to be fine-tuned to obtain the best predictive capability from the models.


3.2.1. Support Vector Machines

In 1992, Vapnik et al. introduced support vector machines (SVMs) as a classification algorithm84; five years later, Drucker et al. extended the algorithm's capabilities to regression tasks85. The aim of SVMs is to find the optimal hyperplane dividing the hyperspace between the different classes, where optimal means having the maximum possible distance between the hyperplane and the training data points closest to it, which are termed support vectors.

SVM defines a cost parameter C, which sets the penalty for misclassified training points in linearly non-separable cases. High values of C generate complex decision boundaries in order to misclassify as few training points as possible86.

For classes that are not linearly separable in the original feature space, SVM applies an implicit transformation of the input variables using a kernel function (e.g., polynomial, radial basis function, sigmoid), which allows the algorithm to separate the classes with a linear hyperplane in the transformed space87.

Table 1 lists the most important hyperparameters of the support vector machine regression algorithm. The list is not meant to be in-depth or exhaustive; for more information, please refer to the scikit-learn documentation of the function "SVR"88,89.

SVMs are effective in high-dimensional spaces, especially when the number of features exceeds the number of observations; however, their effectiveness is greatly reduced in the opposite case. Other limitations include sensitivity to noise in the observations and poor suitability for large datasets. On the other hand, SVMs are considered a memory-efficient algorithm.


Table 1: Most Important Hyperparameters of SVMs

Parameter | Range | Explanation
Kernel | Linear, Polynomial, Radial basis function, Sigmoid | Kernel type to be used in the algorithm.
Degree | Integer value | Degree of the polynomial kernel function, if used.
Gamma | 1 / (number of features), or 1 / (number of features * variance of X) | Kernel coefficient, except for the linear kernel.
C | Float value | Regularization parameter.
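To illustrate how the parameters in Table 1 map onto the scikit-learn interface mentioned above, a minimal sketch is given below; the values chosen are placeholders rather than tuned settings.

```python
# A minimal sketch of the Table 1 hyperparameters in scikit-learn's SVR;
# the values below are placeholders, not tuned settings.
from sklearn.svm import SVR

# gamma="auto" corresponds to 1 / n_features,
# gamma="scale" corresponds to 1 / (n_features * X.var())
model = SVR(kernel="rbf",    # Kernel: "linear", "poly", "rbf" or "sigmoid"
            degree=3,        # Degree: used only by the polynomial kernel
            gamma="scale",   # Gamma: kernel coefficient (ignored by the linear kernel)
            C=1.0)           # C: regularization parameter
# model.fit(X_train, y_train) and model.predict(X_test) would then be used
# with the training and test sets described in Section 3.2.4.
```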

3.2.2. Random Forests

The first work on Random Forests goes back to Tin Kam Ho, who introduced random decision forests in 199590. Leo Breiman further developed the idea by combining bagging91 with random selection of features92, and later, together with Adele Cutler, registered the name Random Forests as a trademark93.

Random forest is an ensemble learning algorithm of the bagging family. The main idea behind random forests is that the aggregation of multiple uncorrelated base estimators (which individually suffer from high variance) results in an estimator with low variance and low bias. The base estimator used is the decision tree, hence the name Random Forest.


A decision tree (DT) is a cascade of conditions starting from a root node; for each node (condition), the right child node represents the "True" outcome, while the left represents the "False" one. The purpose of the decision tree is to route the inquiry (prediction) through its nodes to the last level (leaf), where a classification or regression verdict is given, as shown in Figure 7.

Decision trees boast a plethora of advantages, including ease of interpretability, requiring little effort for data pre-processing, and the ability to work with both numerical and categorical data as well as to capture nonlinear relations.

However, DTs suffer from overfitting when the trees grow too deep and from high variance, since small variations in the training data can result in a completely different tree being generated; furthermore, the greedy tree-building procedure does not guarantee that the globally optimal tree is found.

Luckily, the Random Forests algorithm was designed to mitigate the aforementioned limitations of decision trees. The underlying notion is that the mean of multiple uncorrelated DTs will, in general, outperform the best single tree.

The algorithm ensures that the different DTs are uncorrelated by training each tree on a random sample of the observations drawn with replacement, which is known as bootstrapping, and by considering only a random subset of the original features at each node. The final prediction is calculated by averaging the predictions of the individual DTs in the forest, which is known as aggregation. Together, bootstrapping and aggregation are called bagging.
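As a brief illustration of bootstrapping and aggregation, the following sketch builds a small forest by hand from scikit-learn decision trees; the synthetic data and the settings used are purely illustrative.

```python
# A hand-rolled illustration of bagging: each tree is fitted on a bootstrap
# sample of the observations, and the forest prediction is the average of the
# individual tree predictions. The data are synthetic and for illustration only.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=200, n_features=10, noise=5.0, random_state=0)

rng = np.random.default_rng(0)
trees = []
for _ in range(100):
    # Bootstrapping: draw a sample of the observations with replacement
    idx = rng.integers(0, len(X), size=len(X))
    # A random subset of the features is considered at every split
    tree = DecisionTreeRegressor(max_features="sqrt")
    tree.fit(X[idx], y[idx])
    trees.append(tree)

# Aggregation: the forest prediction is the mean of the individual tree predictions
y_pred = np.mean([tree.predict(X) for tree in trees], axis=0)
```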

While random forests address the drawbacks of DTs, they suffer from a few disadvantages of their own, the biggest of which is slow training and prediction when a large number of trees is used.

Figure 7: Decision Tree Hierarchy

Figure 8: Random Forest


Table 2 presents the most important hyperparameters of the Random Forests algorithm. The list is not meant to be in-depth or exhaustive; for more information, please refer to the scikit-learn documentation of the function "RandomForestRegressor"89,94.

Table 2: Most Important Hyperparameters of RF

Parameter | Explanation
Number of Estimators | The number of trees in the forest.
Maximum Depth | The maximum depth of the tree.
Maximum Number of Features | The number of features considered at each node.
Minimum Samples for Split | The minimum number of samples required to split an internal node.
Minimum Samples in a Leaf | The minimum number of samples required to be at a leaf node.
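As with the SVR example, the sketch below shows how the Table 2 hyperparameters appear in the scikit-learn RandomForestRegressor interface; the values are placeholders, not tuned settings.

```python
# A minimal sketch of the Table 2 hyperparameters with scikit-learn's
# RandomForestRegressor; the values below are placeholders, not tuned settings.
from sklearn.ensemble import RandomForestRegressor

model = RandomForestRegressor(
    n_estimators=500,      # Number of Estimators: trees in the forest
    max_depth=None,        # Maximum Depth: None lets trees grow until the leaves are pure
    max_features="sqrt",   # Maximum Number of Features considered at each node
    min_samples_split=2,   # Minimum Samples for Split
    min_samples_leaf=1,    # Minimum Samples in a Leaf
    n_jobs=-1,             # train the trees in parallel
)
# model.fit(X_train, y_train) would then be called on the training data.
```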

3.2.3. Partial Least Squares Regression

The development of Partial Least Squares is attributed to Herman Wold, who developed it in the late 1960s95. Partial least squares regression (PLSR) is a supervised learning algorithm that is based, in great part, on principal component analysis (PCA). In PCA, the directions of principal variance of the input features are used to project the data points into a hyperspace where they can be classified or regressed. Principal component regression (PCR) models utilize these principal variance directions to predict test data. It should be noted that in PCA and PCR, only the covariance of the training input feature matrix (i.e., XᵀX) is used. PLSR, on the other hand, utilizes the principal variance directions of the input features together with their target values; in other words, PLSR estimates the target variables from the covariance structure of the input feature and target matrices (i.e., XᵀY).

Therefore, PLSR can be described as a supervised learning technique, while PCR is an unsupervised one. The latter predicts using the input features only, while the former utilizes both the input features and the target data for predictions.96

PLSR has only one hyperparameter to configure, which is the number of principal components used for prediction. The greater the number of components used, the less information is lost; on the other hand, if too many components are used, noise in the data will be passed to the model.
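A minimal sketch of selecting this single hyperparameter by cross-validation is shown below, using the scikit-learn PLSRegression implementation on synthetic data; the range of components tried is illustrative only.

```python
# A minimal sketch of choosing the number of PLSR components by cross-validation;
# the data and the range of components tried are purely illustrative.
from sklearn.datasets import make_regression
from sklearn.cross_decomposition import PLSRegression
from sklearn.model_selection import cross_val_score

X, y = make_regression(n_samples=150, n_features=50, noise=5.0, random_state=0)

scores = {}
for n_components in range(1, 16):
    pls = PLSRegression(n_components=n_components)
    scores[n_components] = cross_val_score(pls, X, y, cv=5, scoring="r2").mean()

best_n = max(scores, key=scores.get)   # component count with the best mean R^2
```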


PLSR has many benefits, including robust handling of multicollinearity in the input features, high predictive accuracy, and a very low probability of chance correlation.

The key shortcomings are a higher risk of ignoring 'real' correlations and sensitivity to the relative scaling of the input features97.

3.2.4. Model Training and Validation

Using machine learning algorithms, we can develop models capable of producing data-driven predictions. However, to make accurate predictions of the target parameters in the real world, these mathematical models have to be fitted to collected data. This stage of model development is called model training, and the dataset used for fitting is named the training dataset. In addition, we need to evaluate how well the model will perform on unseen data in order to assess its predictive power. For this we use another dataset, called the test dataset, which is simply the part of the collected data that was not involved in fitting the model.

In most cases, the model's hyperparameters need to be fine-tuned to enhance its predictive capability. We then need to try different hyperparameters, evaluate the model's performance on the test dataset, and compare which hyperparameter set resulted in the best metrics. The problem with this approach is that the test dataset is now used to influence the model, so the resulting metrics are no longer representative of performance on truly unseen data.

To solve that problem, we divide the training dataset into K folds, use (K-1) folds for training and the remaining fold for testing; in this context the held-out fold is called the validation fold, as it is an intermediate step in developing the model. To ensure that the choice of folds does not influence the model's metrics, and only the model hyperparameters do, the training-validation step is repeated over all folds, each time leaving a different fold out. A schematic showing the fold assignments for K = 5 is presented in Figure 9.
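The sketch below illustrates this workflow with scikit-learn, assuming an SVR model and a purely illustrative parameter grid: the hyperparameters are tuned by 5-fold cross-validation on the training set, while the held-out test set is used only once for the final evaluation.

```python
# Illustrative only: hyperparameters are tuned by 5-fold cross-validation on the
# training folds; the held-out test set is used once for the final evaluation.
from sklearn.datasets import make_regression      # synthetic data for the sketch
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.svm import SVR

X, y = make_regression(n_samples=200, n_features=20, noise=5.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

param_grid = {"C": [0.1, 1, 10], "gamma": ["scale", "auto"], "kernel": ["rbf", "linear"]}
search = GridSearchCV(SVR(), param_grid, cv=5, scoring="r2")
search.fit(X_train, y_train)                  # the K-fold loop happens inside fit
print(search.best_params_, search.score(X_test, y_test))
```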


3.2.5. Feature Importance

Feature importance scores provide insights beyond the models' prediction capabilities. Calculating the feature importance scores of the developed models has three main advantages. First, they provide insight into the correlations between each of the input features and the target observations. Second, studying these correlations offers a deeper interpretation of the significant physical/biological relations between the features and the targets. Finally, building on the two previous points can lead to better feature selection, which can improve the developed models' complexity, prediction, and generalization capabilities.

Feature importance for SVM models is calculated from the coefficients of the support vectors in the decision function together with the support vectors themselves; for PLSR models, the feature importances are the coefficients of the linear model that maps the spectra to the target variables.

Lastly, for Random Forests, feature importance is computed from the mean decrease in impurity across all forest trees.
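The sketch below indicates where these quantities are exposed in the scikit-learn estimators referenced above; the data and models are synthetic and untuned, coefficient magnitudes are used only as a simple stand-in for importance, and the SVR coefficients are available only for the linear kernel.

```python
# Where the importance-related quantities described above are exposed in
# scikit-learn; the data are synthetic and the untuned models only serve to
# show the relevant attributes.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.svm import SVR
from sklearn.cross_decomposition import PLSRegression
from sklearn.ensemble import RandomForestRegressor

X, y = make_regression(n_samples=200, n_features=10, noise=5.0, random_state=0)

svr = SVR(kernel="linear").fit(X, y)        # coef_ is exposed only for the linear kernel
pls = PLSRegression(n_components=3).fit(X, y)
rf = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)

svm_importance = np.abs(svr.coef_).ravel()  # weights of the linear decision function
pls_importance = np.abs(pls.coef_).ravel()  # linear mapping from features to target
rf_importance = rf.feature_importances_     # mean decrease in impurity per feature
```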

Figure 9: 5-Fold Cross-Validation
