
2.1 Machine learning

2.1.1 Supervised learning

Given an input object represented by a d-dimensional vector of features (or attributes) x = (x^(1), x^(2), ..., x^(d))^T ∈ X, the aim of supervised learning is to approximate an unknown relationship between the input (observation) x ∈ X and the output (target) y ∈ Y in order to build an accurate predictor that outputs y = f(x). The function f is known as a decision rule (or hypothesis) and it is estimated (learned) using a set consisting of input-output pairs D = {(x_1, y_1), (x_2, y_2), ..., (x_N, y_N)} known as the training set. To optimize the decision rule using D, one needs a way to discern how close a prediction of the function, ŷ, is to a desired output or ground truth y. To this end, one defines a loss function ℓ(y, ŷ). Here, the basic elements of a supervised learning task are formally specified:

• Input space X

• Output space Y

• The space of decision rules (or hypothesis space) F, such that f : X → Y and f ∈ F

• Loss function ℓ : Y × Y → R_+ such that ℓ(y, y) = 0, i.e., no loss is incurred for correct predictions

• Training set D = {(x_i, y_i)}_{i=1}^N, where a data point z_i = (x_i, y_i) belongs to the product space Z = X × Y, known as the input-output space

Additionally, it is assumed that data points (x, y) are drawn independently from the set Z according to an unknown probability measure P over Z. In the following, it will be assumed that the probability measure P is defined by some probability density function, denoted by p. Further, all distributions will be characterized by probability density functions or probability mass functions.

A learning algorithm is a way to select a decision rule f ∈ F. It is defined as a mapping A that maps a training sample D ∈ Z^N = Z × Z × ... × Z to a decision rule f. That is, different learning algorithms may output different functions f. A common assumption is that the learning algorithm ignores the ordering of elements in the training set D, that is, A(z_1, z_2, ..., z_N) = A(z_{π(1)}, z_{π(2)}, ..., z_{π(N)}), where z_i ∈ X × Y is an element of D and π is a permutation, i.e., a bijection of the set {1, ..., N} onto itself.

Supervised learning tasks are categorized by the type of output variables. In classification, the task is to predict a categorical, discrete class label. Assuming that each word in a vocabulary is a separate class, automatic speech recognition can be viewed as an instance of a classification task. Another learning task, regression, focuses on predicting a continuous quantity such as temperature, price or road traffic. Formally, if the cardinality of the output spaceY is finite, the learning task is referred to as classification. Otherwise, it is referred to as regression.

If the elements of Y are not scalar (discrete or real) values, then the learning task is referred to as structured output learning or structured prediction [10]. An example of structured prediction is learning to rank [11], where the goal is to predict a permutation of a fixed list of items. Web search engines are a familiar application of ranking algorithms: their goal is to find a permutation so that the most relevant web pages for a given query are placed at the top of the list.

Finally, there are some specialized tasks that are usually considered separately in the taxonomy of supervised learning tasks even if, formally, they can be placed under one of the aforementioned categories. For instance, ordinal ranking [12, 13] is the task of predicting the value of an ordinal variable, that is, a variable whose value exists on an arbitrary scale where only the relative ordering between different values matters.

Such values can be encoded as positive integers; that is, two different elements of Y can be compared by the ‘<’ operation. These values are usually referred to as ranks to distinguish them from the labels in the classification task. Examples of ordinal variables include age measured in years, level of agreement from “strongly disagree” to “strongly agree”, and the size of an object. Even if predicting ordinal data can be cast as a classification problem, taking into account the natural order within categories allows more accurate and interpretable predictions to be made.

The goal of supervised learning is to find a function f ∈ F such that the expected loss functional R, or risk, associated with f, defined as the expectation of the loss function, is minimized:

R(f) \triangleq \mathbb{E}_{p(x,y)}[\ell(y, f(x))] = \iint \ell(y, f(x))\, p(x, y)\, dy\, dx \to \min_{f \in \mathcal{F}}. \quad (2.1)

Using the relation between joint and conditional density functions known as the general multiplication rule [14], p(x, y) = p(y|x) p(x), the risk functional can be rewritten as

R(f) = \int R_{\mathrm{cond}}(f|x)\, p(x)\, dx, \quad (2.2)

R_{\mathrm{cond}}(f|x) \triangleq \mathbb{E}_{p(y|x)}[\ell(y, f(x))] = \int \ell(y, f(x))\, p(y|x)\, dy, \quad (2.3)

where R_cond(f|x) is known as the conditional risk. A decision rule that minimizes the conditional risk for each input x is referred to as the Bayes decision rule [15],

f_{\mathrm{Bayes}}(x) \triangleq \arg\min_{f \in \mathcal{F}} R_{\mathrm{cond}}(f|x) = \arg\min_{f \in \mathcal{F}} \mathbb{E}_{p(y|x)}[\ell(y, f(x))], \quad (2.4)

also known as the optimal Bayes classifier in the context of classification. By construction it minimizes the overall risk. The resulting minimum value, the Bayes risk, R_Bayes = R(f_Bayes), is the smallest achievable risk, assuming that the distribution p(x, y) is known. This makes the Bayes risk a useful theoretical construct for supervised learning tasks.

It should be noted that f_Bayes may not be in the set of possible decision rules, F. Therefore, the choice of the family of predictors has an influence on how close the obtained predictor is to the Bayes optimal solution.

The choice of the loss function also has a major impact on the Bayes decision.

Ideally, the loss function should be selected in a way that reflects the nature of the specific problem at hand. When the loss function cannot be determined due to limited knowledge about the problem, one option is to resort to classic losses used in statistics such as the quadratic loss

\ell(y, \hat{y}) = (y - \hat{y})^2, \quad (2.5)

or the absolute loss

\ell(y, \hat{y}) = |y - \hat{y}|, \quad (2.6)

both of which are suited to real-valued (univariate) outputs. For the quadratic loss, the Bayes-optimal decision corresponds to the expected value of the output,

f_{\mathrm{Bayes}}(x) = \arg\min_{f \in \mathcal{F}} \mathbb{E}_{p(y|x)}[(y - f(x))^2] = \mathbb{E}_{p(y|x)}[y], \quad (2.7)

and for the absolute loss to the median [16]. For categorical outputs, in turn, a common choice is the zero-one loss, defined as

\ell(y, \hat{y}) = \begin{cases} 0, & \text{if } y = \hat{y} \\ 1, & \text{otherwise,} \end{cases} \quad (2.8)

which assigns the same penalty for any incorrect decision. The Bayes optimal classifier with this loss selects the output with the largest probability [16]:

f_{\mathrm{Bayes}}(x) = \arg\max_{y} p(y|x). \quad (2.9)
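These Bayes-optimal decisions can be checked numerically. The sketch below is a minimal illustration with made-up values (not part of the thesis): it estimates the conditional risk of candidate predictions under the quadratic, absolute and zero-one losses, and confirms that the minimizers are the mean, the median and the most probable class, respectively.

```python
import statistics

# Made-up samples standing in for draws from p(y | x) at a fixed input x.
ys = [1.0, 2.0, 2.5, 3.0, 10.0]

def risk(loss, pred):
    # Empirical estimate of the conditional risk E_{p(y|x)}[loss(y, pred)].
    return sum(loss(y, pred) for y in ys) / len(ys)

# Search a grid of real-valued candidate predictions.
candidates = [i / 100 for i in range(0, 1101)]
best_quad = min(candidates, key=lambda p: risk(lambda y, c: (y - c) ** 2, p))
best_abs = min(candidates, key=lambda p: risk(lambda y, c: abs(y - c), p))

print(best_quad, statistics.mean(ys))    # quadratic loss: minimizer is the mean, 3.7
print(best_abs, statistics.median(ys))   # absolute loss: minimizer is the median, 2.5

# Zero-one loss over a discrete posterior: predicting y costs 1 - p(y | x),
# so the Bayes decision is the most probable class, as in (2.9).
posterior = {"car": 0.2, "bus": 0.5, "truck": 0.3}   # hypothetical p(y | x)
zero_one_risk = {y: 1.0 - p for y, p in posterior.items()}
print(min(zero_one_risk, key=zero_one_risk.get))     # bus
```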

The problem of selecting an appropriate family of decision rules and loss function is discussed further in Section 2.3.

There are two commonly adopted approaches, used in a large number of applications, to find a decision rule f [17]:

• Empirical risk minimization

• Estimation of probability density function

The former is used if one adopts the direct form of the predictor, that is, explicitly specifies the set of decision rules F and its elements. In contrast, the latter approach constructs a decision rule by building a probabilistic model of the data and applying Bayesian decision theory [15] to obtain a decision rule. These approaches are described in the following sections.

Empirical risk minimization

In practice, the joint probability density function of the inputs and outputs, p(x, y), is unknown. But if one assumes that the training set D was drawn from this distribution, it is possible to use Monte-Carlo methods [18], a class of numerical algorithms based on stochastic simulation, to approximate the risk functional. By the law of large numbers [19], the expected value of a random variable can be approximated by the empirical mean of independent samples of the variable. This allows numerical computation of integrals of the form

\mathbb{E}_{q(x)}[h(x)] = \int h(x)\, q(x)\, dx \approx \frac{1}{N} \sum_{i=1}^{N} h(x_i), \quad (2.10)

where h is some function and the x_i are independently and identically distributed (i.i.d.) samples from a distribution defined by the probability density function q. With the help of (2.10), one may then replace the true risk R in (2.1) by the empirical risk:

R_{\mathrm{emp}}(f) = \frac{1}{N} \sum_{i=1}^{N} \ell(y_i, f(x_i)). \quad (2.11)
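As a small illustration of the Monte-Carlo approximation (2.10) (a sketch with made-up quantities, not from the thesis), the expectation of h(x) = x² under a standard normal distribution, whose exact value is 1, can be estimated from i.i.d. samples:

```python
import random

random.seed(0)  # fixed seed for reproducibility

# Estimate E[h(x)] for h(x) = x**2 with x ~ N(0, 1); the exact value is 1,
# the variance of a standard normal random variable.
N = 100_000
samples = [random.gauss(0.0, 1.0) for _ in range(N)]
estimate = sum(x * x for x in samples) / N

print(estimate)  # close to 1.0
```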

The empirical risk minimization (ERM) principle [17] suggests picking a decision rule f that minimizes the empirical risk.

In practice, a decision rule f belongs to a parametric family parametrized by a fixed number of parameters represented as a vector of real numbers θ = (θ_1, ..., θ_d)^T. That is, the empirical risk is a function of θ and can be minimized using one of several numerical optimization methods for finding a local minimum of a function, such as gradient descent or Newton's method [20]. If the loss function is convex, then the empirical risk is also a convex function, so any local minimum is a global minimum.
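As a concrete sketch of this procedure (with toy data and a hypothetical one-parameter model, not taken from the thesis), gradient descent can be applied to the empirical risk under the quadratic loss for a linear predictor f(x) = θx:

```python
# Toy (x, y) pairs; the model is f(x) = theta * x with a single parameter.
data = [(0.0, 0.1), (1.0, 2.1), (2.0, 3.9), (3.0, 6.2)]

def emp_risk(theta):
    # R_emp(theta) = (1/N) * sum_i (y_i - theta * x_i)^2, as in (2.11)
    return sum((y - theta * x) ** 2 for x, y in data) / len(data)

def grad(theta):
    # Derivative of the empirical risk with respect to theta.
    return sum(-2.0 * x * (y - theta * x) for x, y in data) / len(data)

theta = 0.0
for _ in range(200):
    theta -= 0.05 * grad(theta)   # gradient descent step

# The quadratic loss is convex, so gradient descent reaches the global
# minimizer, which for this model equals sum(x*y) / sum(x*x).
closed_form = sum(x * y for x, y in data) / sum(x * x for x, y in data)
print(theta, closed_form)
```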

Some instances of empirical risk minimization involve non-convex loss functions.

Examples include training of a binary linear classifier f(x) = sign(θ^T x) using the zero-one loss function

\ell_{0\text{-}1}(s) = \mathbb{1}[s \le 0]. \quad (2.12)

Here, \mathbb{1}[s \le 0] denotes the indicator function that equals one if s ≤ 0 and zero otherwise.

This results in a piecewise constant objective function:

R_{\mathrm{emp}}(\theta) = \frac{1}{N} \sum_{i=1}^{N} \ell_{0\text{-}1}(y_i \theta^{\mathrm{T}} x_i), \quad (2.13)

where y_i ∈ {−1, +1} denotes a class label. Instead of direct optimization of (2.13), a commonly adopted remedy is to use a proxy for the loss, called a surrogate loss function. For computational reasons it is usually a convex function that upper bounds the original loss function. The rationale for this is that by pushing down an upper bound one pushes down the original loss as well. Two broadly used surrogates for the zero-one loss are the hinge loss and the logistic loss [21]:

\ell_{\mathrm{hinge}}(s) = \max(0, 1 - s), \quad (2.14)

\ell_{\mathrm{logistic}}(s) = \log(1 + \exp(-s)). \quad (2.15)

Even if the form of the decision rule is the same in both cases, the former approach and its associated learning algorithm is known as a support vector machine (SVM), while the latter is known as logistic regression [22]. Convexity of the resulting objective functions leads to tractable optimization problems, making these techniques attractive for practical use. However, in replacing the original loss function by a surrogate, one raises the natural question of whether minimizing the new risk leads to a function that also minimizes the original risk. The answer turns out to be affirmative for so-called consistent surrogate losses [21]. Formally, a surrogate loss ℓ_s is consistent if, for any distribution p(x, y) and for any sequence of functions f_n : X → Y such that

\lim_{n \to \infty} \mathbb{E}_{p(x,y)}[\ell_s(y, f_n(x))] = \min_{f} \mathbb{E}_{p(x,y)}[\ell_s(y, f(x))], \quad (2.16)

the following equality holds:

\lim_{n \to \infty} \mathbb{E}_{p(x,y)}[\ell(y, f_n(x))] = \min_{f} \mathbb{E}_{p(x,y)}[\ell(y, f(x))]. \quad (2.17)

In other words, minimization of the expected surrogate loss guarantees minimization of the original risk (2.1). It has been shown that both the hinge and the logistic losses are consistent with respect to the zero-one loss for binary classification [21].
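The upper-bounding property motivating surrogate losses is easy to check numerically. In the illustrative sketch below (not from the thesis), note that the hinge loss bounds the zero-one loss directly, while the logistic loss does so only after rescaling by 1/log 2, since ℓ_logistic(0) = log 2 < 1:

```python
import math

def zero_one(s):
    # Zero-one loss as a function of the margin s = y * theta^T x, as in (2.12).
    return 1.0 if s <= 0 else 0.0

def hinge(s):
    return max(0.0, 1.0 - s)

def logistic(s):
    return math.log(1.0 + math.exp(-s))

# Verify the upper bounds on a grid of margin values in [-4, 4].
for i in range(-400, 401):
    s = i / 100
    assert hinge(s) >= zero_one(s)
    assert logistic(s) / math.log(2.0) >= zero_one(s)

print("bounds hold on the grid")
```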

A classifier for speaker verification proposed in Publication II can be seen as a modification of a conventional SVM with an aim to compensate for domain mismatch, a common problem in predictive modeling occurring when the distributions of the training and test sets differ [23].

Advantages and disadvantages of the empirical risk minimization approach are summarized in Table 2.1.

Table 2.1: Advantages and disadvantages of empirical risk minimization.

Benefits:

• In the limit, the empirical distribution tends to the correct distribution.

• The decision rule is found on the basis of minimal expected loss (risk), which is the quantity we are ultimately interested in.

Drawbacks:

• The decision rule must be re-trained if the loss function changes.

• There are no inherent ways to estimate the confidence intervals of the predictions.

• There is no straightforward way to incorporate unlabeled data.

Density estimation

The main characteristic of the empirical risk minimization approach discussed in the previous section is that the functional form of a decision rule is explicitly defined. Typically, a decision rule f : X → Y belongs to some parametric family of functions; therefore, ERM selects a member of that family by solving an optimization problem.

In contrast to ERM, the approach based on density estimation does not define an explicit form of a decision rule. This approach aims at estimating the conditional density function p(y|x) from the data and using the Bayes decision rule (2.4) to make predictions. There are two different types of probabilistic models to define p(y|x):

• discriminative models, and

• generative models.

Discriminative models directly approximate the conditional distribution p(y|x). Logistic regression [6] is an example of a discriminative model suitable for binary classification tasks, that is, when Y = {+1, −1}. It defines the probability of an input vector x being an instance of the class y = 1 as follows:

p(y = 1 \mid x) = \sigma(\theta^{\mathrm{T}} x) = \frac{1}{1 + \exp(-\theta^{\mathrm{T}} x)}, \quad (2.18)

where σ(·) denotes the sigmoid function¹ and θ is a vector of parameters to be learned.
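A minimal sketch of prediction with such a model, using hypothetical already-learned parameter values (not from the thesis):

```python
import math

def sigmoid(s):
    return 1.0 / (1.0 + math.exp(-s))

theta = [1.5, -0.8]   # hypothetical learned parameters for a 2-dimensional input

def p_y1(x):
    # p(y = +1 | x) = 1 / (1 + exp(-theta^T x)), the right-hand side of (2.18).
    return sigmoid(sum(t * xi for t, xi in zip(theta, x)))

def predict(x):
    # Decide y = +1 whenever the posterior probability exceeds 1/2.
    return +1 if p_y1(x) >= 0.5 else -1

print(predict([1.0, 0.5]))   # theta^T x = 1.1 > 0, so +1
print(predict([0.0, 1.0]))   # theta^T x = -0.8 < 0, so -1
```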

In contrast, generative models approximate the joint distribution of inputs and outputs p(x, y). The conditional density function, p(y|x), can be obtained as follows:

p(y|x) = \frac{p(x, y)}{p(x)} = \frac{p(x, y)}{\int p(x, y)\, dy} \propto p(x, y). \quad (2.19)

¹ σ(s) = 1/(1 + exp(−s))

Often the joint density is defined as a product of a likelihood and a prior distribution (often simply called the prior): p(x, y) = p(x|y) p(y). The prior distribution p(y) expresses one's a priori beliefs about the value of y before some evidence, x, is taken into account. The likelihood p(x|y), in turn, is the conditional density of the observation x given the value of the target variable y. In the context of this factorization, p(y|x) is usually referred to as the posterior distribution because it expresses a posteriori beliefs about the value of y, based on empirical evidence combined with prior beliefs.

One of the simplest examples of a generative model is the Gaussian classifier (see Section 4.2 in [6]). Its name originates from the assumption that the class-conditional distributions are Gaussian, that is,

p(x|y) = \mathcal{N}(x \mid \mu_y, \Sigma_y) = \frac{1}{(2\pi)^{d/2} |\Sigma_y|^{1/2}} \exp\left( -\frac{1}{2} (x - \mu_y)^{\mathrm{T}} \Sigma_y^{-1} (x - \mu_y) \right). \quad (2.20)

Denoting the prior probability of the k-th class as p(y = k) = π_k, the posterior probability is found using Bayes' theorem:

p(y = k \mid x) = \frac{p(x \mid y = k)\, p(y = k)}{p(x)} = \frac{\pi_k \mathcal{N}(x \mid \mu_k, \Sigma_k)}{\sum_{l=1}^{K} \pi_l \mathcal{N}(x \mid \mu_l, \Sigma_l)}, \quad (2.21)

where K denotes the number of classes. A slightly more sophisticated version of this classifier is used in Publications III and IV to detect audio segments containing speech. More precisely, in that case the class-dependent densities p(x|y) are defined not as single Gaussian densities but as weighted sums of Gaussian densities.
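The sketch below implements a one-dimensional, two-class version of such a Gaussian classifier with hand-picked, hypothetical parameters (not from the thesis), computing the posterior (2.21) from class priors and Gaussian likelihoods:

```python
import math

def gauss_pdf(x, mu, var):
    # Density of N(mu, var) evaluated at x; the 1-D case of (2.20).
    return math.exp(-0.5 * (x - mu) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

# Hypothetical class priors pi_k and class-conditional Gaussian parameters.
classes = {
    0: {"prior": 0.5, "mu": -1.0, "var": 1.0},
    1: {"prior": 0.5, "mu": 2.0, "var": 1.0},
}

def posterior(x):
    # Bayes' theorem: p(y = k | x) is proportional to pi_k * N(x | mu_k, var_k).
    joint = {k: c["prior"] * gauss_pdf(x, c["mu"], c["var"])
             for k, c in classes.items()}
    z = sum(joint.values())          # the evidence p(x)
    return {k: v / z for k, v in joint.items()}

post = posterior(0.0)
print(max(post, key=post.get))   # x = 0 is closer to mu = -1, so class 0
```

The same structure extends directly to mixtures of Gaussians by replacing `gauss_pdf` with a weighted sum of Gaussian densities per class.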

One can see that, in contrast to empirical risk minimization, a loss function is not used in this approach to find a decision rule. Advantages and disadvantages of the density estimation approach are summarized in Table 2.2.

Table 2.2: Advantages and disadvantages of density estimation.

Benefits:

• If the estimated density is the “true” model of the data, then this approach is optimal.

• Training is separated from prediction. There is no need to re-train the model if the loss function changes.

Drawbacks:

• If the model is poorly fitted, the prediction can be highly inaccurate.