
Model selection is the process of choosing appropriate machine learning techniques and parameter values for given data. The usual workflow starts with selecting a set of candidate models. Next, a search strategy and an evaluation criterion are chosen.

The candidate models are compared by building the models according to the search strategy and calculating the evaluation criterion value for each model. Finally, the best-performing model is selected. Subsection 2.2.1 discusses grid search as an example of a search strategy, and Subsection 2.2.2 discusses cross-validation as an example of an evaluation criterion.

In the model selection task, the complexity of the candidate models is tuned so that the models generalize, i.e., capture the essence of the data, as well as possible. Excess complexity can mean, for example, describing the data with an equation that has more parameters than are actually needed. When the model is too complex, overfitting usually occurs. Overfitting is one of the central problems in machine learning.


Figure 2.3: Example of model selection in regression. Left: the model is too simple and underfitting occurs. Center: a well-generalized model capturing the essence of the signal. Right: the model is too complex and overfitting occurs.

An overfitted model describes noise or random error instead of the underlying signal in the data. Conversely, underfitting occurs when the model fails to capture the underlying signal; an underfitted model is usually too simple. Both underfitted and overfitted models lead to poor predictions on unseen data. Figure 2.3 illustrates an underfitted model, a well-generalized model, and an overfitted model.
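The effect is easy to reproduce with polynomial regression, as in Figure 2.3. The following is a minimal sketch, assuming synthetic sine-signal data and NumPy; the specific degrees and noise level are arbitrary illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: a sine signal with additive Gaussian noise.
x = np.linspace(0, 1, 20)
y = np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=x.size)

# Fit polynomials of increasing degree (increasing model complexity).
for degree in (1, 3, 9):
    coeffs = np.polyfit(x, y, degree)        # least-squares fit
    residuals = y - np.polyval(coeffs, x)    # error on the training data
    rmse = np.sqrt(np.mean(residuals ** 2))
    print(f"degree {degree}: training RMSE = {rmse:.3f}")

# Degree 1 underfits (high training error), while degree 9 reaches the
# lowest training error by chasing the noise, i.e., it overfits.
```

Note that the training error alone keeps decreasing with complexity; detecting the overfitting requires evaluating on data not used for fitting, which motivates cross-validation in Subsection 2.2.2.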

Bias and variance are important statistical properties describing the quality of a model. Their nature is easy to explain with an example: think of an archer shooting arrows at a target. Bias describes how much the archer systematically misses the bullseye in the same direction, while variance describes how scattered the arrows are. An underfitted model typically has high bias and low variance, a well-generalized model has low bias and low variance, and an overfitted model has low bias and high variance.

The medieval monk William of Occam stated in the 14th century: "Entities should not be multiplied unnecessarily." The principle became known as Occam's razor. It proposes that a problem should be stated in its basic and simplest terms. Applied to model selection, it means that the simplest model that fits the data is also the most plausible [25].

2.2.1 Grid Search

Grid search is a method for parameter optimization. It starts by selecting reasonable, usually exponentially growing sequences of candidate values for each parameter (e.g., $10^{-5}, 10^{-4}, \ldots, 10^{4}, 10^{5}$ or $2^{-10}, 2^{-9}, \ldots, 2^{9}, 2^{10}$). All combinations of the parameter values are used one after another to build a new model, and performance metric values are calculated for each model. Finally, the combination of parameter values with the best score is chosen.
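The procedure is straightforward to sketch. Below is a minimal illustration for a hypothetical two-parameter setting; `score_fn` is a placeholder for any routine that builds a model with the given parameter values and returns its performance metric:

```python
import itertools

# Exponentially growing candidate sequences, as suggested above.
grid_a = [2.0 ** k for k in range(-10, 11)]    # 2^-10, ..., 2^10
grid_b = [10.0 ** k for k in range(-5, 6)]     # 10^-5, ..., 10^5

def grid_search(score_fn, grid_a, grid_b):
    """Evaluate every combination of two parameter grids and return
    the best pair together with its score (higher is better)."""
    best_params, best_score = None, float("-inf")
    for a, b in itertools.product(grid_a, grid_b):
        score = score_fn(a, b)       # build a model and evaluate it
        if score > best_score:
            best_params, best_score = (a, b), score
    return best_params, best_score
```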

The downside of grid search is the computational time required to find suitable values for the parameters. One way to speed it up is to perform a coarse search first in order to identify a "better" region of the grid; a finer grid search is then performed within this smaller region.
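A coarse-to-fine pass might look like the following sketch, reusing the hypothetical `grid_search` above; the grid spacings and refinement factor are arbitrary illustrative choices:

```python
def coarse_to_fine(score_fn):
    # Coarse pass: sample only every fourth exponent of the base-2 grid.
    coarse = [2.0 ** k for k in range(-10, 11, 4)]
    (a, b), _ = grid_search(score_fn, coarse, coarse)

    # Fine pass: a denser grid (half-exponent steps) around the best
    # point found by the coarse pass.
    fine_a = [a * 2.0 ** (k / 2) for k in range(-4, 5)]
    fine_b = [b * 2.0 ** (k / 2) for k in range(-4, 5)]
    return grid_search(score_fn, fine_a, fine_b)
```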

2.2.2 Cross-Validation

Cross-validation (CV) is a model evaluation method. It is essential for ensuring the quality of a model and for indicating how well the model will predict on unseen data. There are many types of CV techniques, but all of them share the same basic idea. First, the data is split into multiple parts; in Figure 2.4, which illustrates cross-validation, the data is divided into 5 parts. Then, one part of the data (the training set, parts 1, 2, 4, and 5 in Figure 2.4) is used for training a model, and the part that was not used in training (the validation set, part 3 in Figure 2.4) is used for testing the model. Testing means measuring performance, i.e., the CV error. If the same data were used for training and testing, the classification results would be biased; it would be the same as giving away the answers when arranging an exam. Using an independent data set indicates the level of generalization of the model and helps to avoid overfitting.

Figure 2.4: Illustration of cross-validation. The data is divided into five parts; parts 1, 2, 4, and 5 form the training set and part 3 forms the validation set.

In $k$-fold cross-validation, the data is randomly split into $k$ parts of equal size. One of the parts is reserved for testing and the remaining $k-1$ parts are used for training. The CV process is repeated $k$ times (the folds) so that on every iteration a different part is used for testing. The final model selected is the one that produces the best performance averaged over all $k$ folds.
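A minimal $k$-fold loop in plain NumPy is sketched below; `train` and `error` are placeholders for any fitting routine and performance metric:

```python
import numpy as np

def k_fold_cv_error(X, y, train, error, k=5, seed=0):
    """Average validation error of a model over k folds."""
    indices = np.random.default_rng(seed).permutation(len(y))
    folds = np.array_split(indices, k)   # k parts of (nearly) equal size

    fold_errors = []
    for i in range(k):
        val = folds[i]                   # fold i is the validation set
        trn = np.concatenate([folds[j] for j in range(k) if j != i])
        model = train(X[trn], y[trn])    # fit on the other k-1 parts
        fold_errors.append(error(model, X[val], y[val]))
    return np.mean(fold_errors)          # the CV error
```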

Leave-one-out cross-validation (LOO-CV) is the same as $k$-fold cross-validation with the exception that $k = n$, where $n$ is the total number of examples. The benefits of LOO-CV are avoiding random sampling and making maximum use of the data. Its disadvantage is the high computational cost, because $n$ different models are trained, each on all the data except one example.
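In terms of the $k$-fold sketch above, LOO-CV is simply the special case $k = n$, reusing the same hypothetical `train` and `error` routines:

```python
# Leave-one-out: every example is, in turn, a validation set of size one.
loo_error = k_fold_cv_error(X, y, train, error, k=len(y))
```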
