
2.3 Performance evaluation, model selection, and algorithm selection

2.3.2 Model selection

Model selection is one of the key problems in machine learning. Usually, the term model is used to refer to a hypothesis space F. Model selection is the process of determining which model best explains the data.

The problem can be illustrated using the following commonly adopted example from [6]. Figure 2.3 shows four cases of regression by polynomials of degrees m = 0, 1, 3, and 9, fitted by minimizing the sum of squared residuals. As can be seen, the constant and linear functions exhibit a poor fit to the data. This phenomenon, known as underfitting, arises whenever the hypothesis space is too restricted and does not include predictors able to approximate complex dependencies. At the other extreme, regression by a polynomial of degree 9 is a perfect fit in the sense that the graph of the polynomial goes through all the data points. However, as in the previous case, this predictor will also do a poor job on previously unseen data.

This phenomenon is commonly known as overfitting. Finally, the polynomial of degree 3 keeps the balance between being expressive enough to explain the pattern in the training set and being able to generalize well to unseen data. Hence, model selection is a procedure for finding a trade-off between the complexity of a predictor and the goodness of its fit.

In this example, the hypothesis space can be parameterized by a single number: the degree of the polynomial, m. That is, model selection reduces to finding a suitable value of m.
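To make the example concrete, the following sketch (an illustrative construction, not taken from [6]) fits polynomials of degrees 0, 1, 3, and 9 to a small noisy sample of a sine curve and compares training and test errors; the data-generating function and sample sizes are assumptions made only for illustration.

```python
# Illustrative sketch: fitting polynomials of degree m = 0, 1, 3, 9 to noisy
# samples of a sine curve, as in Figure 2.3. The training error keeps dropping
# with m, while the test error reveals underfitting (m too small) and
# overfitting (m too large).
import numpy as np

rng = np.random.default_rng(0)

def make_data(n):
    x = rng.uniform(0.0, 1.0, size=n)
    y = np.sin(2.0 * np.pi * x) + rng.normal(scale=0.3, size=n)
    return x, y

x_train, y_train = make_data(10)    # small training set: degree 9 interpolates it
x_test, y_test = make_data(200)     # large held-out set approximates the true risk

for m in (0, 1, 3, 9):
    coeffs = np.polyfit(x_train, y_train, deg=m)      # least-squares fit
    train_mse = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_mse = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    print(f"m={m}: train MSE {train_mse:.3f}, test MSE {test_mse:.3f}")
```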

The following decomposition of the excess risk [80], defined as the distance between the expected loss of f ∈ F and the Bayes risk R^{*}, may provide an intuition behind the issue:

R(f) - R^{*} = \Big( R(f) - \inf_{g \in F} R(g) \Big) + \Big( \inf_{g \in F} R(g) - R^{*} \Big).

The first term is known as the estimation error and the second term as the approximation error. The approximation error depends only on the choice of F. It measures the performance degradation incurred by imposing restrictions on F. The estimation error expresses how well the function f performs relative to the best possible f in the set F. This error decreases as the training set size increases because, by the law of large numbers [19], the empirical risk approaches the true risk.

By allowing a larger set F, the approximation error can be reduced at the cost of incurring a larger estimation error, because a richer hypothesis space increases the chance of overfitting. On the other hand, if F is small, the approximation error will be large, but the estimation error tends to be small. Therefore, model selection is a balancing act to ensure that F is large enough to make the approximation error small and, at the same time, small enough to keep the estimation error small.

Model selection is related to Occam’s razor, a philosophical principle of “parsimony of explanations” stating that “among two hypotheses explaining data equally well one should prefer the simplest or the one making fewer assumptions” [81]. In connection to Figure 2.3, this principle suggests choosing a polynomial of the smallest degree that approximates the data reasonably well.

The general procedure for model selection consists of the following steps:

1. Define a collection of models {F_1, ..., F_K}.

Figure 2.3: Polynomial regression models of degrees 0, 1, 3, and 9.

2. For each model, find the corresponding predictor f_k = A(D; F_k) using a learning algorithm A.

3. Select the model F_{k∗}, where k∗ = arg min_k J(f_k).

In the final step, J(f) denotes either an unbiased estimate or an upper bound of the expected loss of a decision rule f. In the former case, one of the aforementioned model evaluation techniques can be employed. For instance, if the hold-out method is used, the dataset must be divided into two non-overlapping subsets: one is used by the learning algorithm in Step 2, and the other is used to estimate the expected loss in Step 3. In the latter case, if the upper bound does not depend on the data, the full dataset can be used in Step 2.
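The following minimal sketch illustrates the three steps with the hold-out method, using polynomial hypothesis spaces indexed by degree; the dataset, the split, and the squared loss are assumptions chosen for illustration rather than prescribed by the text.

```python
# A minimal sketch of the three-step model selection procedure with hold-out:
# the models F_1..F_K are polynomial hypothesis spaces indexed by degree,
# A is least-squares fitting, and J(f_k) is the mean squared loss on the
# held-out split.
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(0.0, 1.0, size=40)
y = np.sin(2.0 * np.pi * x) + rng.normal(scale=0.3, size=40)

# Split the data into non-overlapping training and validation parts.
x_tr, y_tr = x[:30], y[:30]
x_val, y_val = x[30:], y[30:]

degrees = range(0, 10)                      # Step 1: the collection {F_1, ..., F_K}
predictors = {m: np.polyfit(x_tr, y_tr, m)  # Step 2: f_k = A(D_train; F_k)
              for m in degrees}

def J(coeffs):                              # hold-out estimate of the expected loss
    return np.mean((np.polyval(coeffs, x_val) - y_val) ** 2)

best_m = min(degrees, key=lambda m: J(predictors[m]))   # Step 3: arg min_k J(f_k)
print("selected degree:", best_m)
```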

Frequently, a collection of models is defined as a (usually infinite) structure of nested function classes: F_1 ⊂ F_2 ⊂ F_3 ⊂ ..., as illustrated in Figure 2.4.

Figure 2.4: Hierarchy of models with increasing complexity. The predictors f_k = A(D; F_k) are found using a learning algorithm A for each model.

An example of a natural hierarchy of models with increasing complexity is given by polynomials, where F_k may correspond to all polynomials of degree k. Another way to specify such a hierarchy of nested subsets is to associate models with level sets of a non-negative function ψ(f): F_τ = {f | ψ(f) ≤ τ}.

In the context of empirical risk minimization (Section 2.1.1), this idea leads to the following constrained optimization problem:

\min_f R_{\mathrm{emp}}(f) \quad \text{s.t.} \quad \psi(f) \le \tau. \qquad (2.82)

By changing the non-negative constant τ one defines a continuum of nested models.

In fact, it can be shown that a solution of this problem is also a solution of the following unconstrained optimization problem (but not vice versa)

\min_f R_{\mathrm{emp}}(f) + \lambda\, \psi(f) \qquad (2.83)

for some non-negative constant λ. The second term is known as a regularizer or a penalty functional. It is usually chosen to penalize overly ‘complex’ functions f. Therefore, the value of λ defines a trade-off between the goodness of fit, measured by the first term, and model complexity, measured by the regularization term. In (2.82), the attained empirical risk decreases as τ increases because the feasible set {f | ψ(f) ≤ τ} expands, whereas in (2.83) it decreases as λ decreases because the second term is outweighed by the first term in the minimization. Hence, λ in (2.83) is inversely related to τ in (2.82). Because unconstrained optimization problems are easier to solve, the unconstrained formulation is commonly adopted and is known as regularized risk minimization.
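As a concrete, assumed instance of (2.82)–(2.83), the sketch below uses squared loss and ψ(θ) = ‖θ‖²/2 (ridge regression), for which the regularized problem has a closed-form solution; it only illustrates that increasing λ shrinks ‖θ‖, i.e., corresponds to a smaller radius τ in the constrained formulation.

```python
# Regularized risk minimization (2.83) with squared loss and
# psi(theta) = ||theta||^2 / 2: ridge regression on synthetic data.
# As lambda grows, the norm of the minimizer shrinks, which corresponds
# to a smaller radius tau in the constrained form (2.82).
import numpy as np

rng = np.random.default_rng(2)
N, d = 50, 5
X = rng.normal(size=(N, d))
theta_true = np.array([2.0, -1.0, 0.5, 0.0, 3.0])
y = X @ theta_true + rng.normal(scale=0.1, size=N)

for lam in (0.0, 0.1, 1.0, 10.0, 100.0):
    # Closed-form minimizer of (1/N)||y - X theta||^2 + (lam/2)||theta||^2
    theta = np.linalg.solve(2.0 / N * X.T @ X + lam * np.eye(d),
                            2.0 / N * X.T @ y)
    print(f"lambda={lam:6.1f}  ||theta|| = {np.linalg.norm(theta):.3f}")
```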

Typically, f belongs to some family of functions parametrized by a vector of parameters θ. A common strategy for defining a sequence of nested function classes is to constrain the (e.g., Euclidean) norm of the vector θ, that is, ψ(θ) = ‖θ‖. This strategy is adopted for training binary classifiers with a linear decision function:

f(x) = sign(θ^T x). A popular classifier of this category is the support vector machine, trained by minimizing the average hinge loss (2.14) over all training samples plus a regularizer:

\min_{\theta} \frac{1}{N} \sum_{i=1}^{N} \max\big(0,\, 1 - y_i \theta^{T} x_i\big) + \frac{\lambda}{2} \|\theta\|^{2}, \qquad (2.84)

where y_i ∈ {−1, +1} denotes the class label. In the constrained formulation (2.82), this form of regularizer would define a sequence of concentric hyperspheres with a common center at the origin. This formulation makes it easier to see that the squared-norm regularizer constrains the radius of a norm ball in the parameter space, defined by an inequality such as ‖θ‖² ≤ τ². Hence, model selection reduces to the selection of the radius of this ball, τ.
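The sketch below is one possible way (an assumption, not the thesis' own implementation) to minimize (2.84) numerically: a Pegasos-style subgradient descent on synthetic two-dimensional data.

```python
# Minimizing the regularized hinge loss (2.84) for a linear SVM with
# subgradient descent. Labels are in {-1, +1}; a constant feature could be
# appended to x to absorb a bias term.
import numpy as np

rng = np.random.default_rng(3)

# Toy linearly separable data: two Gaussian blobs.
X_pos = rng.normal(loc=+2.0, size=(50, 2))
X_neg = rng.normal(loc=-2.0, size=(50, 2))
X = np.vstack([X_pos, X_neg])
y = np.concatenate([np.ones(50), -np.ones(50)])

lam = 0.1
theta = np.zeros(2)
for t in range(1, 2001):
    margins = y * (X @ theta)
    active = margins < 1.0                       # samples with non-zero hinge loss
    # Subgradient of (1/N) sum max(0, 1 - y_i theta^T x_i) + (lam/2)||theta||^2
    grad = -(y[active, None] * X[active]).sum(axis=0) / len(y) + lam * theta
    theta -= (1.0 / (lam * t)) * grad            # Pegasos-style decaying step size

hinge = np.maximum(0.0, 1.0 - y * (X @ theta)).mean()
print("objective:", hinge + 0.5 * lam * theta @ theta)
print("training accuracy:", (np.sign(X @ theta) == y).mean())
```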

One widely known instance of the aforementioned model selection procedure is structural risk minimization (SRM) [17]. In the first step, SRM suggests dividing the class of functions into a hierarchy of nested subsets in order of increasing capacity:

F_1 ⊂ F_2 ⊂ F_3 ⊂ .... Second, it performs empirical risk minimization on each subset: f_k = arg min_{f∈F_k} R_emp(f). Finally, a model is selected by minimizing an upper bound on the risk of the form

R(f_k) \le R_{\mathrm{emp}}(f_k) + C(F_k, N, \delta), \qquad (2.85)

which holds with probability 1 − δ with respect to the repeated sampling of training sets. The second term penalizes the complexity of the hypothesis space and does not depend on f. It increases in k and decreases in N, the size of the training set.

If F_k is fixed, the bound can be lowered by minimizing the first term, leading to empirical risk minimization as a natural method of finding f_k.

In the context of probabilistic learning, based on estimating a probability density function of the unknown data distribution, the same procedure can be applied.

However, the probabilistic approach provides additional criteria for model selection.

One example is the logarithmic predictive density [82, 83] that the model assigns to a held-out data point. Given the training set D = {z_1, ..., z_N}, it can be computed in a hold-out [84] or cross-validation fashion. A particular case, known as the leave-one-out log predictive density, or pseudo-likelihood, is defined as

\sum_{i=1}^{N} \log p(z_i \mid D_{\neg i}), \qquad (2.86)

where D_{¬i} denotes the dataset with the i-th data point removed.
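A small sketch of (2.86) for a univariate Gaussian model is given below; as a simplifying assumption, the predictive density p(z_i | D_{¬i}) is approximated by plugging in maximum likelihood estimates fitted on D_{¬i} instead of integrating over a posterior.

```python
# Leave-one-out log predictive density (2.86) for a univariate Gaussian model,
# using plug-in ML estimates as an approximation to p(z_i | D_{-i}).
import numpy as np

rng = np.random.default_rng(4)
z = rng.normal(loc=1.0, scale=2.0, size=100)     # training set D = {z_1, ..., z_N}

def gaussian_logpdf(x, mu, sigma):
    return -0.5 * np.log(2.0 * np.pi * sigma**2) - (x - mu) ** 2 / (2.0 * sigma**2)

loo_lpd = 0.0
for i in range(len(z)):
    z_rest = np.delete(z, i)                     # D_{-i}: the dataset without z_i
    mu, sigma = z_rest.mean(), z_rest.std()      # ML estimates on D_{-i}
    loo_lpd += gaussian_logpdf(z[i], mu, sigma)

print("leave-one-out log predictive density:", loo_lpd)
```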

Another useful criterion for model selection can be obtained by adopting a Bayesian approach to probability. In the Bayesian framework, a model refers not only to a class of density functions p(·|θ) parametrized by a set of parameters θ but also to a prior distribution over the parameters, p(θ). That is, a model is defined as M = {(p(·|θ), p(θ)) | θ ∈ Θ}. Unlike in the frequentist approach, there is no model selection per se, since learning does not involve searching for optimal parameter values, but averaging over a posterior distribution of models. Assume that there is a set of candidate models {M_1, ..., M_K} and their prior probabilities p(M_1), ..., p(M_K), along with a training set D. The predictive density for a new data point z can then be found by summing (integrating) over all the models,

p(z \mid D) = \sum_{k=1}^{K} p(z \mid D, M_k)\, p(M_k \mid D), \qquad (2.87)

a procedure called model averaging, where the posterior probability of a model M_k can be found using Bayes' rule:

p(M_k \mid D) = \frac{p(D \mid M_k)\, p(M_k)}{p(D)}. \qquad (2.88)

As can be seen, there is no selection of a single model, but rather a summation over multiple models weighted by their posterior probabilities. If there is a continuum of models, the sum is replaced by an integral. However, this sum (or integral) is typically intractable or extremely costly to compute. Therefore, assuming that the posterior probability of the most probable model M∗ dominates the others, the sum (integral) can be approximated as

p(z \mid D) \approx p(z \mid D, M^{*}) = \int p(z \mid \theta, M^{*})\, p(\theta \mid D, M^{*})\, d\theta. \qquad (2.89)

In other words, one finds a point estimate of the model, reducing model averaging to model selection. If there are no prior beliefs in favor of one of the models (i.e., the prior distribution over models is uniform), the maximum likelihood estimate (also known as the type-II maximum likelihood estimate [6]) is used:

M^{*} = \arg\max_{M}\, p(D \mid M) = \arg\max_{M} \int p(D \mid \theta, M)\, p(\theta \mid M)\, d\theta. \qquad (2.90)

The quantity p(D|M), known as the marginal likelihood (marginal data density) or model evidence, has an interesting interpretation: it can be viewed as a density function that defines a distribution over all possible datasets of size N. A simple model corresponds to a density function with its mass concentrated within a small region of the dataset space. For example, assuming a sufficiently large N, it is very unlikely that a set of real numbers whose histogram has multiple separated peaks was generated from a univariate Gaussian distribution. In contrast, the marginal density of a very complex model has its mass spread over large regions of the space of datasets. Because of the normalization property of a density function, the marginal likelihood has a built-in penalty for model complexity, which automatically enables Occam's razor. This effect is demonstrated in Figure 2.5.
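The following toy example (an assumption for illustration, not from the text) shows this Occam effect numerically for two models of N coin flips: M1 fixes the heads probability at 0.5, while M2 places a uniform prior on it, so its evidence is spread over all possible head counts.

```python
# Built-in Occam's razor of the marginal likelihood for two coin models:
#   M1: a fair coin, p = 0.5 (no free parameters),
#   M2: unknown bias p with a uniform prior over [0, 1].
# M1 concentrates its mass on "balanced" datasets; M2 spreads it out, so
# M1 wins on balanced data and loses on clearly skewed data.
import math

def log_evidence_fair(heads, tails):
    # p(D | M1) = 0.5^N for any particular sequence of outcomes
    return (heads + tails) * math.log(0.5)

def log_evidence_uniform(heads, tails):
    # p(D | M2) = integral of p^h (1-p)^t dp = B(h+1, t+1) = h! t! / (N+1)!
    return (math.lgamma(heads + 1) + math.lgamma(tails + 1)
            - math.lgamma(heads + tails + 2))

for heads, tails in [(10, 10), (18, 2)]:
    log_bf = log_evidence_fair(heads, tails) - log_evidence_uniform(heads, tails)
    print(f"{heads} heads / {tails} tails: log Bayes factor (M1 vs M2) = {log_bf:.2f}")
```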

Finally, the marginal likelihood can be decomposed into a product of predictive densities, p(D|M) = ∏_{i=1}^{N} p(z_i | z_1, ..., z_{i−1}, M), which, similarly to (2.86), make predictions for unseen data points.

The computation of the marginal likelihood is a challenging task for most non-trivial models, as it involves multi-dimensional integration over the parameter space. Hence, in practice, one resorts to various approximations such as the Bayesian Information Criterion (BIC), defined as

\log p(D \mid M) \approx \mathrm{BIC} = \log p(D \mid \theta, M) - \frac{R}{2} \log N, \qquad (2.92)

where θ is the maximum likelihood estimate of the model parameters and R is the total number of parameters. Another quantity used in place of the marginal likelihood is the so-called evidence lower bound (ELBO), which plays a key role in the variational inference discussed in Section 2.4. As its name suggests, the ELBO lower-bounds the marginal likelihood, but if the bound is tight, the ELBO can be used as an approximation to it. Both BIC and ELBO are widely used in speaker recognition and speaker diarization. Examples include using the ELBO to approximate marginal likelihood ratios in speaker verification [69, 85], and using the ELBO and BIC as criteria for speaker change point detection [2, 86, 87] and for determining the number of speakers [42, 76, 88] in speaker diarization.
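As an assumed illustration of (2.92), the sketch below uses BIC to select the polynomial degree for the regression example of Figure 2.3, with a Gaussian noise model whose maximized log-likelihood has a closed form.

```python
# Using BIC (2.92) to select the polynomial degree. Under a Gaussian noise
# model with ML noise variance, the maximized log-likelihood equals
# -N/2 * (log(2*pi*var) + 1); R counts the m+1 coefficients plus the variance.
import numpy as np

rng = np.random.default_rng(5)
N = 30
x = rng.uniform(0.0, 1.0, size=N)
y = np.sin(2.0 * np.pi * x) + rng.normal(scale=0.3, size=N)

def bic(m):
    coeffs = np.polyfit(x, y, deg=m)
    var = np.mean((np.polyval(coeffs, x) - y) ** 2)   # ML estimate of noise variance
    log_lik = -0.5 * N * (np.log(2.0 * np.pi * var) + 1.0)
    R = (m + 1) + 1                                    # polynomial coefficients + variance
    return log_lik - 0.5 * R * np.log(N)

degrees = range(0, 10)
best = max(degrees, key=bic)                           # maximize the approximation (2.92)
print("BIC-selected degree:", best)
```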

Figure 2.5: Schematic illustration of the distribution of datasets of size N for three models of different complexity. Note that the distributions p(D|M) are normalized. Here, the horizontal axis is a one-dimensional representation of the space of all possible datasets for a regression problem with scalar input and scalar output. A few specific datasets are shown at the bottom. The models M_1, M_2, and M_3 correspond to polynomial regressors of degrees 1, 2, and 3, respectively.