
The better the signal analysis and data modeling have been done, the easier the job of the classifier is. At its simplest, a classifier is a comparison function which assigns a class label to an input sample by comparing two numbers against each other. In detection, this means comparing the likelihood value obtained for the target category to a threshold value that decides the detection. In classification, the decision is made for the class whose likelihood value is the largest among the classes. The likelihood values for the classes are usually obtained using class models that have been trained with examples from the categories involved in the task. In Section 4.2.1 I present some data models popularly used in classification. In Section 4.2.2 I discuss functions for class discrimination in which class models are not utilized, but a classification function of the data features provides the classification directly.

4.2.1 Modeling phenomena

Numerical models of real-world phenomena are at the core of machine learning. Features are models of low-level phenomena or of general data appearance in a specific measurement modality. In this section I discuss models intended for modeling high-level phenomena, such as a face, human gait, a car, an airplane, a musical genre, or a spoken language. In the literature, data models are categorized as either generative or discriminative. A generative model structure expresses the properties of the phenomenon as faithfully as possible, and it is useful for many machine intelligence tasks encompassing that particular phenomenon. Each category-specific model is learned separately from data representing that phenomenon. A discriminative model, on the other hand, is specific to the task it is intended for. The discriminative character of the model is trained and utilized for segregating the phenomenon from the other phenomena present in the specific task.

Thus I associate discriminative models with classifiers, which are discussed in the next section. Here I consider only generative models.

Probably the most widely used technique to acquire a static generative data model is the Gaussian mixture model (GMM). A GMM is a model for the probability distribution of feature vectors of samples representing a certain phenomenon. The distribution function is a sum of multiple ($N$) multivariate normal distributions $\mathcal{N}(x; \mu_n, \Sigma_n)$ for $n = 1, \ldots, N$, each with a unique mean $\mu_n$, covariance matrix $\Sigma_n$ and magnitude $w_n$. The likelihood of a sample $x$ to represent the modeled class $c$ is given by

$$l(x) = \sum_{n=1}^{N} w_n \, \mathcal{N}(x; \mu_n, \Sigma_n) = \sum_{n=1}^{N} w_n \frac{1}{\sqrt{(2\pi)^d |\Sigma_n|}} \, e^{-\frac{1}{2}(x-\mu_n)' \Sigma_n^{-1} (x-\mu_n)}, \qquad (4.7)$$

where $d$ denotes the dimensionality of the feature space, $x \in \mathbb{R}^d$. An example of a GMM probability distribution is illustrated in Figure 4.3. The model parameters $N$, $w_n$, $\mu_n$ and $\Sigma_n$ for $n = 1, \ldots, N$ are learned from representative training data.



Figure 4.3: A Gaussian mixture model of three bivariate Gaussian distributions for modeling the distribution of two-element feature vectors that represent a certain phenomenon.

The model weights $w_n$ are restricted to be positive and to sum up to unity, that is, $w_n \geq 0$, $n = 1, \ldots, N$ and $\sum_{n=1}^{N} w_n = 1$, for the GMM to represent a probability distribution. Often the covariance matrices $\Sigma_n$ are approximated by only their diagonals, that is, using an assumption of independence among the feature vector elements. A suitable algorithm for GMM parameter estimation is the EM algorithm introduced by Dempster et al. (1977). GMMs have been heavily utilized in automatic speech recognition for modeling the characteristics of different phonemes in speech.
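As a concrete illustration, below is a minimal numpy sketch of evaluating the mixture density of (4.7); the component weights, means and covariances are toy values chosen for the example, not parameters estimated from any real data.

    import numpy as np

    def gmm_likelihood(x, weights, means, covs):
        """Evaluate the GMM density of Eq. (4.7) at a single feature vector x."""
        d = x.shape[0]
        total = 0.0
        for w, mu, cov in zip(weights, means, covs):
            diff = x - mu
            norm = np.sqrt((2 * np.pi) ** d * np.linalg.det(cov))
            expo = -0.5 * diff @ np.linalg.solve(cov, diff)
            total += w * np.exp(expo) / norm
        return total

    # Toy two-component mixture in a two-dimensional feature space.
    weights = np.array([0.6, 0.4])                       # w_n >= 0, summing to one
    means = [np.array([0.0, 0.0]), np.array([3.0, 3.0])]
    covs = [np.eye(2), 0.5 * np.eye(2)]
    print(gmm_likelihood(np.array([1.0, 1.0]), weights, means, covs))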

The statistical GMM models are often used for Bayesian decision making, where the category decision, i.e. classification, is made by comparing the posterior probabilities

$$P(c \mid x) = \frac{P(x \mid c)\,P(c)}{P(x)} \qquad (4.8)$$

of the classes, where $P(c)$ and $P(x)$ are the prior probabilities of the category $c$ and the data vector $x$, respectively, and the GMM likelihood $l_c(x)$ (4.7) of a class $c$ is used for providing the conditional probability $P(x \mid c)$.
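As a sketch of such a Bayesian decision rule, the snippet below trains one GMM per class with scikit-learn's EM implementation and compares the log posteriors of (4.8); the two-class toy data, the number of mixture components and the equal priors are assumptions made for this example only.

    import numpy as np
    from sklearn.mixture import GaussianMixture

    rng = np.random.default_rng(0)
    # Toy training data for two classes in a two-dimensional feature space.
    X0 = rng.normal(loc=[0.0, 0.0], scale=1.0, size=(200, 2))
    X1 = rng.normal(loc=[3.0, 3.0], scale=1.0, size=(200, 2))

    # One generative GMM per class, parameters estimated with the EM algorithm.
    gmm0 = GaussianMixture(n_components=3, covariance_type="diag", random_state=0).fit(X0)
    gmm1 = GaussianMixture(n_components=3, covariance_type="diag", random_state=0).fit(X1)

    log_prior = np.log([0.5, 0.5])                       # class priors P(c), assumed equal

    def classify(x):
        # Compare log posteriors; the common denominator P(x) of (4.8) cancels out.
        log_lik = np.array([gmm0.score_samples(x[None, :])[0],
                            gmm1.score_samples(x[None, :])[0]])
        return int(np.argmax(log_lik + log_prior))

    print(classify(np.array([2.5, 2.8])))                # expected to pick class 1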

More expressive models than the static models presented above are needed for modeling events which happen in time with varying speed, e.g. speech, musical progression, human gait and gestures, or weather. That is, to model the feature vector progression in time, dynamic models must be utilized.

A popular generative dynamic data model for modeling sequences of data samples is the hidden Markov model (HMM) (Duda et al. (2012)). A hidden Markov model consists of a number of hidden states $s_i$, $i = 1, \ldots, S$. Each data vector is assumed to represent one of the states. The states are designed specifically for the application so that each of them represents some phase of the episode to be modeled, e.g. phonemes in speech. The number of states depends on the length and variability of the episode; e.g. in automatic speech recognition, long and complex words are modeled with HMMs with more states than short and simple words. Within a traditional ASR framework, presented in Section 6.1, an HMM state is modeled using a GMM model of MFCC features of audio frames (Rabiner and Juang (1993)). To model the dependence of consecutive data vectors on each other, a transition probability $a_{ij} = P(s_j \mid s_i)$ is assigned to each pair of states $s_i, s_j$ within the HMM. In an HMM, the probability of the system being in a certain state $s_j$ depends only on the system state at the previous step and on the fit $P(x \mid s)$ of the data feature vector to the feature distributions of the HMM states. The probability of the HMM transiting from state $s_i$ to $s_j$, given that the HMM emits $x$ at state $s_j$, is

$$P(s_j \mid s_i, x) = \frac{a_{ij}\,P(x \mid s_j)}{\sum_{k=1}^{S} a_{ik}\,P(x \mid s_k)}. \qquad (4.9)$$

An example of an HMM for speech recognition is depicted in Figure 4.4. An underlying assumption within the HMM, called the Markov property, is that the state progression depends only on one-step state transition probabilities. This is not always realistic. However, due to its mathematical tractability, this model is extensively used within many disciplines.
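A minimal numpy sketch of the state transition posterior (4.9) is given below; the three-state left-to-right transition matrix and the per-state emission likelihoods are illustrative toy values, not quantities from any trained model.

    import numpy as np

    def transition_posterior(a_i, emission_lik):
        # Eq. (4.9): probability of moving from state s_i to each state s_j,
        # given the emission likelihoods P(x | s_j) of the current feature vector.
        unnorm = a_i * emission_lik
        return unnorm / unnorm.sum()

    # Toy three-state left-to-right HMM; the numbers are illustrative only.
    A = np.array([[0.7, 0.3, 0.0],
                  [0.0, 0.8, 0.2],
                  [0.0, 0.0, 1.0]])
    emission_lik = np.array([0.05, 0.60, 0.35])          # P(x | s_j), e.g. from per-state GMMs
    print(transition_posterior(A[0], emission_lik))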

4.2.2 Classification functions beyond value comparison

Beyond the comparison of class probabilities according to data compatibility with the models discussed above, the simplest function for binary classification among classes $c \in \{-1, 1\}$ is a linear function of the features $x$ of the sample, thresholded with the signum function as

$$\hat{c} = \mathrm{sign}(f(x)), \quad \text{where} \quad f(x) = w'x + b. \qquad (4.10)$$

There are many algorithms for learning the feature mapping vector $w$ and the bias $b$ (Duda et al. (2012)).

Fisher’s discriminant and linear discriminant analysis (LDA) find $w$ and $b$ via statistics of the data. That is, for $c \in \{0, 1\}$, they utilize the means $\mu_c$ and covariance matrices $\Sigma_c$ of the two classes. Fisher’s solution is $w = (\Sigma_0 + \Sigma_1)^{-1}(\mu_1 - \mu_0)$ (Bishop (2006)), while LDA reaches the same solution via Bayesian optimization assuming the classes to be homoscedastic, that is, $\Sigma = \Sigma_0 = \Sigma_1$. The threshold parameter $b$ is left to be set by some other means. It may be set according to the midpoint between the class centers as $b = -\frac{1}{2}(\mu_0 + \mu_1)'w$.
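The following numpy sketch computes Fisher's discriminant direction and the midpoint threshold from class statistics; the two Gaussian toy classes are assumptions made for the example.

    import numpy as np

    rng = np.random.default_rng(1)
    X0 = rng.multivariate_normal([0, 0], np.eye(2), size=300)
    X1 = rng.multivariate_normal([2, 1], np.eye(2), size=300)

    mu0, mu1 = X0.mean(axis=0), X1.mean(axis=0)
    S0 = np.cov(X0, rowvar=False)
    S1 = np.cov(X1, rowvar=False)

    # Fisher's solution: w proportional to (Sigma_0 + Sigma_1)^{-1} (mu_1 - mu_0).
    w = np.linalg.solve(S0 + S1, mu1 - mu0)
    # Threshold set at the midpoint between the class centers.
    b = -0.5 * (mu0 + mu1) @ w

    def classify(x):
        return 1 if x @ w + b > 0 else 0

    print(classify(np.array([1.8, 0.9])), classify(np.array([0.1, -0.2])))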

Figure 4.4: An illustration of how a hidden Markov model is utilized for modeling sequences of feature vectors in terms of their phoneme contents in the context of automatic speech recognition. The HMM states are shown as circles, each with a bi-phoneme label enclosed, e.g. ’tr’. The arrows indicate the possibilities of state progression, and each weight $a_{s_1 s_2}$ indicates the probability of transition from state $s_1$ to state $s_2$. The thin lines indicate how the feature vectors $x_t$ of the stream of audio frames might be aligned with the HMM states.

Regression algorithms are used for solving $w$ and $b$ of (4.10) by setting the class labels $1$ and $-1$ to be the regression model target values. Ordinary least squares (OLS) regression (Duda et al. (2012)) finds the parameters $w$ and $b$ by minimizing the modeling error $\sum_{n=1}^{N}(f(x_n) - c_n)^2$ for $N$ training examples $(x_n, c_n)$, with the closed-form solution

$$[w, b]' = (X'X)^{-1} X' c, \qquad (4.11)$$

where the matrix $X$ contains the vectors $X_n = [x_n', 1]$ as its rows and $c = [c_1, c_2, \ldots, c_N]'$ contains the corresponding class labels $1$ or $-1$ as its elements. A nonlinear decision boundary may be produced with regression analysis if the vectors $X_n$ are built using nonlinear functions of the features in $x_n$.
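A short numpy sketch of the closed-form OLS solution (4.11) follows; the two-class toy data is fabricated for illustration only.

    import numpy as np

    rng = np.random.default_rng(2)
    X_feat = np.vstack([rng.normal(-1, 1, size=(100, 2)),
                        rng.normal(+1, 1, size=(100, 2))])
    c = np.hstack([-np.ones(100), np.ones(100)])         # class labels as regression targets

    # Augment each feature vector with a constant 1 so that b is solved jointly with w.
    X = np.hstack([X_feat, np.ones((X_feat.shape[0], 1))])
    wb = np.linalg.solve(X.T @ X, X.T @ c)               # closed form of (4.11)
    w, b = wb[:-1], wb[-1]

    pred = np.sign(X_feat @ w + b)
    print((pred == c).mean())                            # training accuracy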

Partial least squares (PLS) regression, or, as it is better called, projection to latent structures, finds the parameters $w$ and $b$ via an iterative procedure. It picks, one by one, projections of the feature vectors via which the regression target scores are pursued. The procedure allows regularization of the number of projections used and avoids computing the inverse of $X'X$, thus bypassing the possible problem of collinearity of the features. There are multiple algorithms for computing the PLS parameters; a comparison of them is presented in Andersson (2009).
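As an illustration, the sketch below uses scikit-learn's PLSRegression on toy data with deliberately collinear features; the number of latent components is an arbitrary choice made for the example.

    import numpy as np
    from sklearn.cross_decomposition import PLSRegression

    rng = np.random.default_rng(3)
    # Deliberately collinear features: the third column nearly copies the first.
    X = rng.normal(size=(200, 3))
    X[:, 2] = X[:, 0] + 0.01 * rng.normal(size=200)
    c = np.where(X[:, 0] + X[:, 1] > 0, 1.0, -1.0)       # targets set to +/-1

    pls = PLSRegression(n_components=2)                  # number of latent projections
    pls.fit(X, c)
    pred = np.sign(pls.predict(X).ravel())
    print((pred == c).mean())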

Logistic regression (Cox (1958)) utilizes the sigmoid function $\sigma(y) = 1/(1 + e^{-y})$ on the linear class estimate $y = f(x)$ when looking for suitable parameters $w$ and $b$ for linear classification with (4.10). The error of classification in logistic regression is given by $e_{-1}(y) = \sigma(y)$ for samples from class $-1$, and $e_{+1}(y) = 1 - \sigma(y)$ for samples from class $1$. Now the error is close to zero for every sample falling clearly on the side of its own class on the regression line. This error function is naturally much more faithful to the task of finding a regression model for binary classification than the squared error used by OLS. The cost function to be minimized is formulated logarithmically as $C = \sum_{n=1}^{N} \log\left(1/(1 - e_{c_n}(y_n))\right)$ to make it differentiable. The parameters $w$ and $b$ are then obtained using an iterative maximum likelihood estimation method, for example the Newton-Raphson algorithm (Atkinson (1979)).
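A minimal numpy sketch of fitting logistic regression with Newton-Raphson (iteratively reweighted least squares) updates is shown below. Here the targets are coded as 0/1 and the standard cross-entropy form of the cost is minimized, which is equivalent to the cost $C$ above; the toy data and the fixed number of iterations are assumptions of the example.

    import numpy as np

    def sigmoid(y):
        return 1.0 / (1.0 + np.exp(-y))

    rng = np.random.default_rng(4)
    X_feat = np.vstack([rng.normal(-1, 1, size=(100, 2)),
                        rng.normal(+1, 1, size=(100, 2))])
    t = np.hstack([np.zeros(100), np.ones(100)])         # targets coded here as 0/1
    X = np.hstack([X_feat, np.ones((len(t), 1))])        # bias folded into the weight vector

    wb = np.zeros(X.shape[1])
    for _ in range(10):                                  # Newton-Raphson / IRLS updates
        p = sigmoid(X @ wb)
        R = p * (1 - p)                                  # diagonal of the weighting matrix
        H = X.T @ (X * R[:, None])                       # Hessian of the negative log-likelihood
        g = X.T @ (t - p)                                # gradient of the log-likelihood
        wb = wb + np.linalg.solve(H, g)

    w, b = wb[:-1], wb[-1]
    print(((sigmoid(X_feat @ w + b) > 0.5) == t).mean())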

Regularization – To obtain either a sparse or a smooth distribution of values in $w$ of (4.10), a technique called regularization may be used in regression analysis. For $L_P$ regularization on $w$, the cost function to be minimized is

$$C = \frac{1}{N}\sum_{n=1}^{N} e(x_n, t_n) + \lambda \|w\|_P^P, \qquad (4.12)$$

where $e(x_n, t_n)$ is the error of the regression value $f(x_n)$ with respect to the target value $t_n$, and $\lambda$ is a trade-off parameter between the regression error and the regularization error. Similarly to classification function training with the other regression techniques, the targets $t_n$ are usually set to be $t_n \in \{-1, 1\}$ or $t_n \in \{0, 1\}$. Regression using $L_1$ regularization promotes sparsity of $w$ and is called LASSO. Regression with $L_2$ regularization alleviates possible problems due to collinearity of the data and promotes a regression vector $w$ with a small Euclidean length. It is called Tikhonov regularization, or more commonly ridge regression, and has a closed-form solution similar to the OLS solution (4.11), namely $[w, b]' = (X'X + \lambda\Gamma)^{-1}X'c$, where $\Gamma$ is an identity matrix. The so-called elastic net regression combines both the $L_1$ and $L_2$ regularizers in the cost function.
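A short numpy sketch of the ridge (Tikhonov) closed-form solution on toy data follows; the value of $\lambda$ is arbitrary, and for brevity the bias term is regularized together with $w$, although in practice the bias is often left unregularized.

    import numpy as np

    rng = np.random.default_rng(5)
    X_feat = np.vstack([rng.normal(-1, 1, size=(100, 2)),
                        rng.normal(+1, 1, size=(100, 2))])
    c = np.hstack([-np.ones(100), np.ones(100)])

    X = np.hstack([X_feat, np.ones((len(c), 1))])
    lam = 1.0                                            # trade-off parameter lambda
    G = np.eye(X.shape[1])                               # Gamma: identity regularizer
    wb = np.linalg.solve(X.T @ X + lam * G, X.T @ c)     # ridge / Tikhonov solution

    w, b = wb[:-1], wb[-1]
    print((np.sign(X_feat @ w + b) == c).mean())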

In the work of Publications II and I, I have utilized OLS with Tikhonov regularization as well as PLS regression for estimating speech state likelihoods within an automatic speech recognition framework.

The support vector machine (SVM) is a successful and abundantly utilized method for binary classification (Steinwart and Christmann (2008)). The SVM formulation may be used to find parameters for the linear model (4.10) as well as for more expressive models which use so-called kernel transformations for the distance between the decision boundary and the data points (Bishop (2006)). The SVM algorithm sets the decision boundary parameters to maximize the margin between the classes. That is, to find the best decision boundary parameters, SVM considers specifically those samples that reside close to a potential boundary. The samples that are classified correctly anyhow are regarded as redundant and are left out of the calculations. A versatile and often utilized kernel function for an SVM classifier, with which it is possible to produce highly complex partitions of the input space, is the radial basis function (RBF). An SVM with an RBF kernel utilizes e.g. a Gaussian distance function $d(p, x) = \exp(-\|x - p\|^2 / (2\sigma^2))$ to express the deviation of a point $p$ from a feature vector $x$. The parameter $\sigma$ affects the smoothness of the resulting decision boundary and must be set via cross validation. Some of the data vectors are selected as support vectors $v_i$, $i = 1, \ldots, V$, each with a learned weight $\alpha_i \geq 0$, and the class decision for a new data vector $x$ is then given by $c(x) = \mathrm{sign}\left(\sum_{i=1}^{V} \alpha_i c_i \, d(x, v_i)\right)$, where $c_i \in \{-1, 1\}$ depending on the class label of each support vector.
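For illustration, the sketch below trains scikit-learn's RBF-kernel SVM on a toy problem that is not linearly separable; the C and gamma values (gamma corresponding to $1/(2\sigma^2)$) are arbitrary here and would normally be set via cross validation.

    import numpy as np
    from sklearn.svm import SVC

    rng = np.random.default_rng(6)
    # Two classes that are not linearly separable: an inner blob and a surrounding ring.
    r_in = rng.normal(0.5, 0.2, size=200)
    r_out = rng.normal(2.0, 0.2, size=200)
    ang = rng.uniform(0, 2 * np.pi, size=400)
    X = np.column_stack([np.hstack([r_in, r_out]) * np.cos(ang),
                         np.hstack([r_in, r_out]) * np.sin(ang)])
    y = np.hstack([-np.ones(200), np.ones(200)])

    clf = SVC(kernel="rbf", C=1.0, gamma=0.5).fit(X, y)
    print(clf.score(X, y), "support vectors per class:", clf.n_support_)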

Multifaceted, piecewise linear decision boundaries for separating two or more categories are realized with the k nearest neighbors (k-NN) algorithm (Duda et al. (2012)) and with different decision tree algorithms. Similarly to SVM, k-NN provides the class label by computing distances to known data samples. However, k-NN requires all the training data to be stored and used for the classification of a new sample, while SVM utilizes only the samples selected as support vectors. Although this is efficient neither computationally nor from a storage point of view, k-NN is fairly popular due to its simplicity.
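A minimal scikit-learn k-NN sketch on toy data is given below; the choice k = 5 is arbitrary.

    import numpy as np
    from sklearn.neighbors import KNeighborsClassifier

    rng = np.random.default_rng(7)
    X = np.vstack([rng.normal(-1, 1, size=(100, 2)),
                   rng.normal(+1, 1, size=(100, 2))])
    y = np.hstack([np.zeros(100), np.ones(100)])

    # The classifier simply stores the training data; each query is labelled by a
    # majority vote over its k nearest stored samples.
    knn = KNeighborsClassifier(n_neighbors=5).fit(X, y)
    print(knn.predict(np.array([[0.8, 1.2], [-1.5, -0.3]])))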

A decision tree classifier is a flowchart-like structure which has a root node as a starting point, multiple decision nodes, where intermediate decisions about which node to enter next are made, and a number of leaf nodes, which finally assign a class label to the input.

In most decision tree building algorithms, each decision node is designed to be a function of only one variable. The univariate decision makers of a binary decision tree are selected such that the node function divides the training data into subsets that are as homogeneous as possible. Selecting a univariate decision maker means selecting an input feature to be evaluated within the node. In the case of continuous-valued features, a threshold value for the feature must also be set to make a binary decision. Highly popular algorithms for decision tree learning are, for example, ID3 (Quinlan (1986)), C4.5 (Quinlan (1993)) and CART (Breiman et al. (1984)). The ID3 and C4.5 algorithms utilize information gain based on data subset entropies for selecting the node data partition functions. The ID3 algorithm has been developed for nominal data, where each input attribute may only have one of a set of predefined labels. C4.5 is a successor of ID3 with the capability of utilizing continuous-valued data. The CART algorithm (Breiman et al. (1984)) also has the capability of working with continuous-valued data. It utilizes the Gini impurity or the twoing criterion to decide about the attributes used in the nodes of the tree.
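A short scikit-learn sketch of a CART-style tree with the Gini impurity criterion on toy data follows; the depth limit is an arbitrary choice made for the example.

    import numpy as np
    from sklearn.tree import DecisionTreeClassifier

    rng = np.random.default_rng(8)
    X = np.vstack([rng.normal(-1, 1, size=(100, 2)),
                   rng.normal(+1, 1, size=(100, 2))])
    y = np.hstack([np.zeros(100), np.ones(100)])

    # CART-style tree: each node thresholds a single feature, chosen by Gini impurity.
    tree = DecisionTreeClassifier(criterion="gini", max_depth=3).fit(X, y)
    print(tree.score(X, y))
    print(tree.tree_.feature[:3], tree.tree_.threshold[:3])  # first nodes' features and thresholds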

Neural networks are the most versatile classifier functions. The multitude of non-linear intermediate functions within an NN makes the mathematical analysis of the entire NN function intractable. Current research on DNNs pursues an understanding of the transformations performed by a complex DNN. The research community is furiously, through trial and error, trying to establish guidelines for determining effective DNN structures for each problem, as well as creating new training algorithms for them.

A discriminatively trained DNN is effectively an integrated feature extractor and classifier.
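As a small sketch of this idea, the snippet below trains a scikit-learn multilayer perceptron on toy data; the hidden layer sizes and the iteration limit are arbitrary choices for the example. The hidden layers learn a task-specific feature transformation, while the output layer acts as a linear classifier on top of it.

    import numpy as np
    from sklearn.neural_network import MLPClassifier

    rng = np.random.default_rng(9)
    X = np.vstack([rng.normal(-1, 1, size=(200, 2)),
                   rng.normal(+1, 1, size=(200, 2))])
    y = np.hstack([np.zeros(200), np.ones(200)])

    # Hidden layers = learned feature extractor; output layer = linear classifier.
    net = MLPClassifier(hidden_layer_sizes=(16, 16), max_iter=1000, random_state=0).fit(X, y)
    print(net.score(X, y))                               # accuracy of the integrated model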
