
2.2 Modeling complex densities

2.2.3 Parameter estimation

Once one has chosen a way to represent an unknown distribution of interest, either by selecting a family of density functions (explicit density models) or by specifying a procedure to draw samples from a distribution (implicit models), the next step is estimating parameters to determine the best-fitting model. A few widely used estimation methods for explicit models will be described in the following paragraphs.

Let D = {z1, ..., zN} be a training set consisting of independent samples from some unknown distribution. The goal of parametric density estimation [55] is to estimate parameters, θ, so that the corresponding density p(z|θ), from a pre-defined parametric family, would be close (in some sense) to the true (unknown) density.

Maximum likelihood estimation

One of the most popular methods to estimate the parameters θ of a probability density function p(z|θ) is maximum likelihood estimation (MLE) [56]. The method of maximum likelihood finds the values of the model parameters that maximize the likelihood function, which is the joint density of the training set viewed as a function of θ. That is, MLE boils down to solving the following optimization problem:

\[
p(\mathcal{D}|\theta) = p(z_1, z_2, \ldots, z_N|\theta) \rightarrow \max_{\theta}. \tag{2.57}
\]

In practice, it is more convenient to replace the likelihood function with its logarithm. If the data points in the training set are assumed to be independent and identically distributed, the resulting log-likelihood function decomposes as follows:

\[
\mathcal{L}(\theta|\mathcal{D}) = \log p(z_1, z_2, \ldots, z_N|\theta) = \sum_{i=1}^{N} \log p(z_i|\theta). \tag{2.58}
\]

In the discriminative approach, the training samples zi correspond to outputs, for example, class labels, and the model is defined by the conditional distribution p(y|x,θ) of the output y given the input x. The logarithm of the likelihood function (2.58) is then given by

\[
\mathcal{L}(\theta|\mathcal{D}) = \log p(\mathcal{D}|\theta) = \sum_{i=1}^{N} \log p(y_i|x_i,\theta). \tag{2.59}
\]

In turn, a generative model is defined by specifying the joint distribution p(x,y|θ) of the input vector x and the class label y. Therefore, zi in (2.58) represents input-output pairs (xi, yi), yielding the following objective function for the MLE:

\[
\mathcal{L}(\theta|\mathcal{D}) = \log p(\mathcal{D}|\theta) = \sum_{i=1}^{N} \log p(x_i, y_i|\theta). \tag{2.60}
\]

In classification tasks, so-called discriminative training has been proposed to improve the predictive performance of generative models [57, 58]. It involves maximizing (2.59), where the conditional likelihood is defined as

\[
p(y_i|x_i,\theta) = \frac{p(x_i, y_i|\theta)}{\sum_{y} p(x_i, y|\theta)}. \tag{2.61}
\]

This method, known as conditional maximum likelihood estimation [59], is a special case of another discriminative training method, maximum mutual information estimation [60], which is widely used in speech recognition [61–63].
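As a concrete illustration of (2.57)-(2.58), the following minimal sketch fits a univariate Gaussian by numerically maximizing the log-likelihood. The Gaussian family, the synthetic data, and the helper names are illustrative assumptions, not part of the thesis; for this family the closed-form estimates (sample mean and standard deviation) make the numerical result easy to verify.

```python
import numpy as np
from scipy.optimize import minimize

# Synthetic training set D = {z_1, ..., z_N} (hypothetical example data).
rng = np.random.default_rng(0)
data = rng.normal(loc=2.0, scale=1.5, size=1000)

def negative_log_likelihood(theta, z):
    """Negative Gaussian log-likelihood, i.e. the negative of eq. (2.58)."""
    mu, log_sigma = theta              # optimize log(sigma) to keep sigma > 0
    sigma = np.exp(log_sigma)
    return -np.sum(-0.5 * np.log(2 * np.pi * sigma**2)
                   - 0.5 * ((z - mu) / sigma)**2)

# MLE: maximize the log-likelihood by minimizing its negative.
result = minimize(negative_log_likelihood, x0=np.array([0.0, 0.0]), args=(data,))
mu_hat, sigma_hat = result.x[0], np.exp(result.x[1])
print(mu_hat, sigma_hat)   # close to the closed-form estimates data.mean(), data.std()
```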

EM-algorithm

However, maximum likelihood estimation is challenging for latent variable models if the integral p(z|θ) = ∫ p(z|h,θ) p(h) dh does not have a closed-form expression.

In this case, the objective function (2.58) cannot be computed in practice. A commonly adopted remedy is the expectation–maximization (EM) algorithm [6, 64], which itself is a special case of the MM-algorithm [65, 66], where MM stands for either Majorization-Minimization or Minorization-Maximization, depending on whether the desired optimization is a minimization or a maximization, respectively. The MM-algorithm is an iterative optimization method to find the local extrema of a given objective function. Its formal description for the case of maximization problems is as follows. Let f(x) be the objective function to be maximized with respect to x. At step k of the algorithm, an auxiliary function g(x; xk) is selected so that it forms a lower bound of the original objective function. Specifically, the following conditions must hold:

\[
g(x; x_k) \leq f(x), \quad \forall x, \tag{2.62}
\]
\[
g(x_k; x_k) = f(x_k). \tag{2.63}
\]

Then the lower bound g(x; xk) is maximized to obtain the next approximate solution xk+1:

\[
x_{k+1} = \arg\max_{x}\, g(x; x_k). \tag{2.64}
\]

This process is illustrated in Figure 2.2. The algorithm makes progress if it can find xk+1 such that g(xk+1; xk) > g(xk; xk). Then the effect of step k is

\[
f(x_{k+1}) = g(x_{k+1}; x_{k+1}) \geq g(x_{k+1}; x_k) > g(x_k; x_k) = f(x_k). \tag{2.65}
\]

As a result, the described iterative method yields a local maximum of the objective function f(x).
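To make the surrogate idea tangible, here is a toy sketch of the MM principle in its majorization-minimization form: minimizing f(m) = Σi |zi − m|, whose minimizer is the sample median, via the quadratic majorizer obtained from the bound |r| ≤ r²/(2|rk|) + |rk|/2, which touches |r| at r = rk. The data, initialization, and tolerances are illustrative assumptions.

```python
import numpy as np

# Toy MM example: minimize f(m) = sum_i |z_i - m| (the sample median).
# The quadratic surrogate at m_k is sum_i (z_i - m)^2 / (2|z_i - m_k|) + const,
# whose minimizer is a weighted mean of the data.
rng = np.random.default_rng(0)
z = rng.standard_cauchy(size=1001)       # hypothetical heavy-tailed data

m = z.mean()                             # initial guess m_0
for k in range(100):
    w = 1.0 / np.maximum(np.abs(z - m), 1e-12)   # weights defining the majorizer at m_k
    m_next = np.sum(w * z) / np.sum(w)           # minimizer of the quadratic surrogate
    if abs(m_next - m) < 1e-10:
        break
    m = m_next

print(m, np.median(z))   # the MM iterate converges to the sample median
```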

Typically, the lower bounding surrogate objective is found individually for a given problem, but there are a few general tactics regarding how to define it. One approach uses Jensen's inequality [49], which leads to the well-known EM-algorithm.

Figure 2.2: The MM algorithm. The objective function is denoted as f(x), while the auxiliary function on the k-th iteration is denoted as g(x; xk).

The EM-algorithm is used to find maximum likelihood estimates of parameters of probabilistic models, that is, to maximize objectives of the form:

\[
\log p(\mathcal{D}|\theta) = \sum_{i=1}^{N} \log p(z_i|\theta) = \sum_{i=1}^{N} \log \int p(z_i|h_i,\theta)\, p(h_i)\, dh_i. \tag{2.66}
\]

The lower bound to this objective is obtained by introducing an auxiliary density q(h) and using Jensen's inequality:

\[
\log p(z) = \log \int p(z|h)\, p(h)\, dh = \log \int q(h)\, \frac{p(z|h)\, p(h)}{q(h)}\, dh \tag{2.67}
\]
\[
\geq \int q(h) \log \frac{p(z|h)\, p(h)}{q(h)}\, dh = \mathbb{E}_{q(h)}[\log p(z|h)] - \mathrm{KL}(q(h)\,\|\,p(h)) = \mathcal{L}(q,\theta), \tag{2.68}
\]

where KL(q||p) is the Kullback-Leibler (KL) divergence [67] (also called relative entropy), a non-negative function that measures the distance between two distributions represented by probability density functions q and p. One can see that L(q,θ) depends on the auxiliary density function q(h); therefore, there are multiple possible lower bounds. In order to obtain a valid (i.e. satisfying conditions (2.62)-(2.63)) auxiliary function, the following decomposition can be used:

\[
\log p(z|\theta) = \mathcal{L}(q,\theta) + \mathrm{KL}(q(h)\,\|\,p(h|z,\theta)). \tag{2.69}
\]

Using the property of the KL-divergence that KL(q||p) = 0 if and only if q ≡ p, one can define an auxiliary function for the EM-algorithm by setting q(h) = p(h|z,θk) in the lower bound L(q,θ):

\[
g(\theta; \theta_k) = \mathcal{L}(p(h|z,\theta_k), \theta), \tag{2.70}
\]

where the posterior distribution p(h|z,θk) can be found using Bayes' theorem:

\[
p(h|z,\theta_k) = \frac{p(z|h,\theta_k)\, p(h)}{p(z|\theta_k)}. \tag{2.71}
\]

Such a choice of q(h) drives the second term in the decomposition (2.69) to zero and, therefore, delivers a valid auxiliary function (2.70) that satisfies conditions (2.62)-(2.63):

\[
\mathcal{L}(q_k,\theta) \leq \log p(z|\theta), \quad \forall \theta, \tag{2.72}
\]
\[
\mathcal{L}(q_k,\theta_k) = \log p(z|\theta_k), \tag{2.73}
\]

where qk(h) = p(h|z,θk). Thus, an iteration of the EM-algorithm consists of, first, computing the posterior distributions (2.71) of the latent variables and, then, maximizing the auxiliary function (2.70) with respect to the parameters θ. Interestingly, the first step can also be cast as an optimization problem:

\[
p(h|z,\theta_k) = \arg\min_{q(h)} \mathrm{KL}(q(h)\,\|\,p(h|z,\theta_k)), \tag{2.74}
\]

whose solution is obtained using the property that KL(q||p) = 0 if and only if q ≡ p. From (2.69) one can see that minimizing the second term is equivalent to maximizing the first term with respect to q(h), because the left-hand side does not depend on it. As a result, the EM-algorithm can be seen as a coordinate ascent on L(q,θ), in which the first step (E-step) maximizes it with respect to q(h) and the second step (M-step) maximizes it with respect to θ.
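To make the E- and M-steps concrete, the following sketch runs the EM-algorithm on a two-component univariate Gaussian mixture; the data, initialization, and component count are illustrative assumptions rather than a prescribed setup.

```python
import numpy as np
from scipy.stats import norm

# Synthetic 1-D data drawn from two Gaussians (hypothetical example).
rng = np.random.default_rng(1)
z = np.concatenate([rng.normal(-2.0, 1.0, 300), rng.normal(3.0, 1.5, 700)])

# Initial parameter guesses theta_0 = (weights, means, standard deviations).
w = np.array([0.5, 0.5])
mu = np.array([-1.0, 1.0])
sigma = np.array([1.0, 1.0])

for iteration in range(200):
    # E-step: posterior responsibilities p(h|z, theta_k), cf. eq. (2.71).
    joint = w * norm.pdf(z[:, None], mu, sigma)        # shape (N, 2)
    resp = joint / joint.sum(axis=1, keepdims=True)

    # M-step: maximize the auxiliary function with respect to theta.
    Nk = resp.sum(axis=0)
    w = Nk / len(z)
    mu = (resp * z[:, None]).sum(axis=0) / Nk
    sigma = np.sqrt((resp * (z[:, None] - mu) ** 2).sum(axis=0) / Nk)

print(w, mu, sigma)   # the log-likelihood is non-decreasing over the iterations
```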

There is a particular case where the M-step can be further decomposed into two independent steps. This is the case if the prior distribution over the hidden variables also depends on the parameters to be estimated, that is, p(z,h|θ) = p(z|h,θ1) p(h|θ2), where θ = (θ1, θ2). This decomposition of the model parameters into two components allows breaking the maximization of the auxiliary function

\[
\mathcal{L}(q,\theta) = \mathbb{E}_{q(h)}[\log p(z|h,\theta_1)] - \mathrm{KL}(q(h)\,\|\,p(h|\theta_2)) \tag{2.75}
\]

into two independent problems: the first term is maximized with respect to θ1 and the second term is minimized with respect to θ2.

This observation is particularly useful for models allowing over-parametrization, such as factor analysis. Let us assume that the model has parameters in the form θ = (θ1, θ2), and that this parametrization is redundant in the sense that different values of the parameters define the same model: p(z|θ1, θ2) ≡ p(z|θ1′, θ2′). For instance, in factor analysis (2.49) one can formulate the canonical model by replacing the standard prior distribution, p(h) = N(h|0,I), with an arbitrary Gaussian, e.g. p(h) = N(h|µ,Σ). This model is equivalent to the one with the standard prior and the likelihood with parameters m_new = m + Vµ and V_new = V chol(Σ), where chol(·) denotes the Cholesky decomposition [68]. Therefore, after minimizing the second term in (2.75), one can convert the model parameters back to the canonical form. This procedure is known as a minimum divergence (MD) step within the EM-algorithm applied to over-parametrized models. It is routinely used to estimate parameters of FA-based models in speaker recognition studies [69], since it helps to achieve faster convergence and produces better estimates [70, 71].
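As a sketch of why this reparametrization holds, assume (2.49) has the common factor analysis form z = m + Vh + ε (the exact notation of (2.49) may differ); an arbitrary Gaussian prior on h can then be absorbed into the likelihood parameters:
\[
h \sim \mathcal{N}(\mu, \Sigma) \;\Longleftrightarrow\; h = \mu + \operatorname{chol}(\Sigma)\, h', \quad h' \sim \mathcal{N}(0, I),
\]
\[
z = m + Vh + \varepsilon = \underbrace{(m + V\mu)}_{m_{\mathrm{new}}} + \underbrace{V \operatorname{chol}(\Sigma)}_{V_{\mathrm{new}}}\, h' + \varepsilon,
\]
so the over-parametrized model coincides with the canonical one after the substitution used in the MD step.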

The EM-algorithm is a standard method to estimate parameters of many types of latent variable models, including the mixture models and probabilistic subspace models that are part of the standard speaker recognition pipeline. Among these are the Gaussian mixture model, factor analysis, and probabilistic linear discriminant analysis. In particular, the EM-algorithm was used to estimate the parameters of the speech activity detector in Publication IV.

Noise-contrastive estimation

Maximum likelihood estimation is a standard method for the estimation of normalized probabilistic models. Unfortunately, it is not applicable to unnormalized models because the resulting optimization problem becomes intractable due to the intractability of the partition function (normalizing constant). Therefore, alternative estimation methods are required. One of the popular methods is noise-contrastive estimation (NCE) [72], where the key idea is to transform an unsupervised estimation problem into a supervised learning problem. The method consists of solving a binary classification problem to distinguish between samples of true data, Ddata = {z1, ..., zN}, and artificial noise data, Dnoise = {zN+1, ..., zN+M}, drawn from some known distribution with density pnoise(z). Specifically, NCE aims at estimating the parameters θ of an unnormalized density p(z|θ) = u(z|θ)/Z ∝ u(z|θ) by training a probabilistic classifier on the training set consisting of the union {Ddata, Dnoise} and the class labels {y1, ..., yN+M}, where y = +1 corresponds to the real data and y = −1 corresponds to the artificial noise data. The conditional class probabilities can be calculated using Bayes' theorem:

\[
p(y=+1|z) = \frac{p(z|\theta)\, p(y=+1)}{p(z|\theta)\, p(y=+1) + p_{\mathrm{noise}}(z)\, p(y=-1)} \tag{2.76}
\]
\[
= \frac{1}{1 + \exp\!\left(-\log\dfrac{p(z|\theta)}{p_{\mathrm{noise}}(z)}\right)}
= \frac{1}{1 + \exp\!\left(-\log\dfrac{u(z|\theta)}{p_{\mathrm{noise}}(z)} + \log Z\right)} \tag{2.77}
\]

if the normalizing constant Z is known. Since this is not the case, NCE treats it as an additional model parameter to be estimated. As a result, the classifier can be seen as a non-linear logistic regression (2.18) and trained via the maximum likelihood method. One important property of NCE is that the estimation error becomes independent of the choice of the noise distribution pnoise(z) as the number of noise samples tends towards infinity [73]. However, since infinite data or noise samples are not available in practice, the choice of pnoise(z) will affect the performance of the estimator. Theoretically, the only requirement on the density function of the noise distribution pnoise(z) is that it must be nonzero wherever the data distribution p(z|θ) is nonzero [74]. In addition, from a practical point of view, it should be computationally easy to evaluate. Since these mild constraints are met by many different distributions, the noise distribution is a matter of design choice. Intuitively, it should be selected to resemble the data distribution, since no learning is required if the data and noise are easily distinguishable. NCE is implemented in the popular machine learning toolbox TensorFlow [75] and can be used to learn, for instance, word embeddings with the word2vec model [30].
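The following sketch illustrates NCE on a toy unnormalized Gaussian, u(z|θ) = exp(−(z − µ)²/(2s²)), with log Z treated as an extra parameter. The noise distribution, data, and optimizer are illustrative assumptions, and equal numbers of data and noise samples are used so that the class priors cancel as in (2.76)-(2.77).

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

# Hypothetical unnormalized model u(z|theta) with log Z as a free parameter c.
rng = np.random.default_rng(2)
data = rng.normal(1.0, 0.7, size=5000)            # "true" samples (synthetic)
noise = rng.normal(0.0, 2.0, size=5000)           # noise samples, M = N
log_pnoise = lambda z: norm.logpdf(z, 0.0, 2.0)   # known noise density

def log_sigmoid(x):
    """Numerically stable log of the logistic sigmoid."""
    return -np.logaddexp(0.0, -x)

def neg_nce_objective(params):
    mu, log_s, c = params
    log_u = lambda z: -0.5 * ((z - mu) / np.exp(log_s)) ** 2
    # G(z) = log u(z|theta) - c - log p_noise(z);  P(y=+1|z) = sigmoid(G(z)).
    g_data = log_u(data) - c - log_pnoise(data)
    g_noise = log_u(noise) - c - log_pnoise(noise)
    # Binary cross-entropy: data labelled +1, noise labelled -1.
    return -(np.sum(log_sigmoid(g_data)) + np.sum(log_sigmoid(-g_noise)))

res = minimize(neg_nce_objective, x0=np.zeros(3))
mu_hat, s_hat, logZ_hat = res.x[0], np.exp(res.x[1]), res.x[2]
# For this Gaussian example the true normalizer is sqrt(2*pi)*s.
print(mu_hat, s_hat, logZ_hat, np.log(np.sqrt(2 * np.pi) * s_hat))
```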

Bayesian estimation

In contrast to a frequentist interpretation of probability, which defines the probability of an event as the limit of its relative frequency in a large number of trials, a Bayesian approach interprets probability as a state of knowledge or a quantification of a personal belief [50]. To this end, one introduces a prior distribution p(θ) that encodes one's a priori beliefs about the values of the unknown parameters θ. Then, the predictive distribution p(z|D) for a new observation z can be found by using

the Bayes formula,
\[
p(\theta|\mathcal{D}) = \frac{p(\mathcal{D}|\theta)\, p(\theta)}{\int p(\mathcal{D}|\theta)\, p(\theta)\, d\theta}, \tag{2.78}
\]
\[
p(z|\mathcal{D}) = \int p(z|\theta)\, p(\theta|\mathcal{D})\, d\theta, \tag{2.79}
\]
where D represents the training data. If the posterior density is replaced by a point estimate θ̂, that is, p(θ|D) = δ(θ − θ̂), the predictive distribution is found using (2.33). When the posterior distribution is found, it becomes the new prior, enabling a natural mechanism for sequential learning. As a result, Bayesian estimation (learning) and prediction can be formulated as sequentially updating prior beliefs with new evidence.
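As a minimal illustration of this sequential updating, the following sketch uses a hypothetical Beta-Bernoulli model, chosen because the posterior and the predictive distribution are available in closed form; the data and prior are illustrative assumptions.

```python
import numpy as np

# Bernoulli likelihood with a conjugate Beta(a, b) prior. The posterior after
# each observation becomes the prior for the next one, illustrating the
# sequential-learning view of eqs. (2.78)-(2.79).
rng = np.random.default_rng(3)
observations = rng.binomial(1, 0.7, size=100)   # synthetic coin flips, z in {0, 1}

a, b = 1.0, 1.0                                  # Beta(1, 1) prior (uniform)
for z in observations:
    a, b = a + z, b + (1 - z)                    # posterior update: Beta(a + z, b + 1 - z)

# Predictive probability p(z_new = 1 | D) = a / (a + b), cf. eq. (2.79).
print(a / (a + b))   # approaches the true parameter 0.7 as data accumulate
```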

One advantage of the Bayesian approach over point estimation, e.g. MLE, is a built-in mechanism for determining the model complexity (or structure), a task also known as model selection (see Section 2.3 for a detailed discussion). For instance, a commonly known problem with MLE is that the likelihood function is generally higher for more complex model structures. As a result, given a sufficiently flexible model, MLE yields models erroneously fitted to the non-relevant aspects of the data (noise) instead of the relevant patterns. Hence, predictions made by the best-fit model become suboptimal. The Bayesian approach overcomes this problem by averaging the likelihoods one would obtain using different settings of the model parameters θ, leading to the marginal likelihood,

\[
p(\mathcal{D}) = \int p(\mathcal{D}|\theta)\, p(\theta)\, d\theta, \tag{2.80}
\]

which is often used as a criterion for model selection. A few relevant examples of model selection tasks include testing whether or not two samples originate from the same distribution (speaker verification) and selecting the number of components in a mixture model (speaker diarization). In particular, in [42, 76] the problem of determining the number of speakers in a given speech file was formulated as one of determining the number of components in a mixture distribution in which each mixture component is a speaker model.
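As a toy analogue of the same-source question mentioned above, the following sketch compares the marginal likelihoods (2.80) of a "same distribution" model and a "different distributions" model for two Bernoulli samples under Beta priors, where the integral is available in closed form; the data and priors are illustrative assumptions.

```python
import numpy as np
from scipy.special import betaln

def log_marginal(k, n, a=1.0, b=1.0):
    """Log marginal likelihood of n Bernoulli trials with k successes under a
    Beta(a, b) prior: the integral of p(D|theta) p(theta) dtheta, cf. (2.80)."""
    return betaln(a + k, b + n - k) - betaln(a, b)

# Two hypothetical samples; the question is whether they share one source.
rng = np.random.default_rng(4)
s1 = rng.binomial(1, 0.6, size=200)
s2 = rng.binomial(1, 0.6, size=200)   # change 0.6 to e.g. 0.3 for the "different" case

# Model "same": a single parameter generated both samples.
log_ev_same = log_marginal(s1.sum() + s2.sum(), s1.size + s2.size)
# Model "different": each sample has its own parameter.
log_ev_diff = log_marginal(s1.sum(), s1.size) + log_marginal(s2.sum(), s2.size)

print(log_ev_same - log_ev_diff)   # positive log Bayes factor favours the "same" model
```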

2.3 PERFORMANCE EVALUATION, MODEL SELECTION, AND