
2.2 Modeling complex densities

2.2.1 Explicit density models

A parametric statistical model is a family of non-negative density functions {p(·|θ) | θ ∈ Θ}, each of which is indexed by a finite-dimensional parameter vector θ ∈ Θ ⊆ R^d. By definition, given the parameters, a new data point, z, is independent of the training data, D = {z1, ..., zN}:

p(z|θ, D) = p(z|θ). (2.33)

Therefore, θ captures all of the relevant information about the training set, and the complexity of a parametric model remains bounded even if the amount of training data is unbounded.

In contrast, non-parametric models assume that the data distribution cannot be defined by a finite set of parameters. That is, the amount of information that the model can capture about the data can grow as the total amount of data increases.

This allows the complexity of the model to be determined automatically. This ability of non-parametric models is often exploited, for instance, to determine the number of groups in a clustering task. In Publication I, a non-parametric model is used for clustering speech segments when the number of clusters is unknown a priori.

Probabilistic models are either normalized or unnormalized. A statistical model is said to be normalized if the condition

∫ p(z|θ) dz = 1, (2.34)

holds for all θ. In contrast, a model {u(·|θ) | θ ∈ Θ} is said to be unnormalized if the above integral is finite but not necessarily equal to one. Hence, for normalized models the functions p(·|θ) are probability density functions, whereas for unnormalized models the functions u(·|θ) need not be.

In theory, any non-negative function u(z) associated with an unnormalized model can be converted to a probability density function by dividing it by its integral (or sum), referred to as a partition function:

Z(θ) = ∫ u(z|θ) dz. (2.35)

The corresponding normalized model is then

p(z|θ) = u(z|θ) / Z(θ). (2.36)

However, this conversion is rarely possible in practice, because the integral defining the partition function is generally intractable, that is, it lacks a closed-form expression.
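As a concrete illustration of Eqs. (2.35) and (2.36), the sketch below normalizes a toy one-dimensional density by numerical quadrature. The choice u(z|θ) = exp(−θ z^4) and the use of scipy.integrate.quad are illustrative assumptions; for realistic, high-dimensional models the integral is rarely this easy to approximate.

```python
import numpy as np
from scipy import integrate

# Toy unnormalized density u(z|theta) = exp(-theta * z**4) and its partition
# function Z(theta), approximated numerically as in Eqs. (2.35)-(2.36).
def u(z, theta=1.0):
    return np.exp(-theta * z**4)

theta = 1.0
Z, _ = integrate.quad(lambda z: u(z, theta), -np.inf, np.inf)  # partition function
p = lambda z: u(z, theta) / Z                                  # normalized density

# Sanity check: the normalized density integrates to (approximately) one.
total, _ = integrate.quad(p, -np.inf, np.inf)
print(Z, total)
```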

Explicit density models define an unnormalized model by explicitly constructing the function u(z) ∝ p(z). In some cases this unnormalized model can be converted to a normalized model with a corresponding probability density function p(z). If one is able to find a closed-form expression for the normalized density function, the model is called tractable; otherwise it is called intractable.

Tractable explicit density models can be constructed in several ways [47]:

• as a product of known conditional densities,

• by applying a nonlinear transform, or

• by introducing latent variables.

In the first case, a joint probability density function is represented as a product of several conditional densities in the following way:

p(z) = p(z1) ∏_{j=2}^d p(zj|z1, ..., zj−1). (2.37)

Given such a factorization, each of the factors, in turn, can be constructed in one of the three aforementioned ways. This representation of the density function is used in WaveNet [48], a probabilistic model of raw audio waveforms, where each conditional density is defined using a convolutional neural network.
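To make the factorization in Eq. (2.37) concrete, the sketch below evaluates the log-density of a short sequence under a toy autoregressive model in which each conditional is a Gaussian whose mean is computed from the preceding samples. The linear prediction rule ar_mean and its coefficient 0.8 are illustrative assumptions; WaveNet replaces such a rule with a convolutional neural network.

```python
import numpy as np
from scipy.stats import norm

# Joint density of a sequence z = (z1, ..., zd) factorized as in Eq. (2.37),
# with Gaussian conditionals p(zj|z1, ..., zj-1) whose mean depends on the past.
def ar_mean(history):
    """Toy predictor of the next sample's mean from the history."""
    return 0.0 if len(history) == 0 else 0.8 * history[-1]

def log_density(z, sigma=1.0):
    """log p(z) = log p(z1) + sum_j log p(zj | z1, ..., zj-1)."""
    return sum(norm.logpdf(z[j], loc=ar_mean(z[:j]), scale=sigma)
               for j in range(len(z)))

z = np.array([0.1, 0.3, -0.2, 0.05])
print(log_density(z))
```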

Another way to construct a probability density function is to use a continuous invertible transformation g between two vector spaces. If h ∼ q(h) is a vector of latent random variables, then a distribution over z can be defined as follows [47]:

p(z) = q(g⁻¹(z)) |det J_{g⁻¹}(z)|, (2.38)

where J_{g⁻¹}(z) denotes the Jacobian matrix of g⁻¹ evaluated at z. The density p(z) is tractable if the density q(h) and the determinant of the Jacobian of g⁻¹ are both tractable. Unfortunately, this is rarely the case. In addition, the invertibility of g requires that the latent variable h has the same dimensionality as z.
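The change-of-variables rule in Eq. (2.38) can be verified on a simple example. Assuming the invertible map g(h) = exp(h) with h ∼ N(0, 1) (an illustrative choice), g⁻¹(z) = log z and |det J_{g⁻¹}(z)| = 1/z, so p(z) should equal the standard log-normal density:

```python
import numpy as np
from scipy.stats import norm, lognorm

# p(z) = q(g^{-1}(z)) * |det J_{g^{-1}}(z)| for g(h) = exp(h), q = N(0, 1).
def p_change_of_vars(z):
    return norm.pdf(np.log(z)) * (1.0 / z)

z = np.linspace(0.1, 5.0, 50)
assert np.allclose(p_change_of_vars(z), lognorm.pdf(z, s=1.0))  # matches log-normal
```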

Finally, latent variable models can be seen as a middle ground between the two previous approaches to constructing densities. As suggested by its name, this approach is based on introducing a vector of latent random variables h ∼ p(h). However, in contrast to the previous approach, instead of a deterministic transformation one defines a conditional density, p(z|h). This can be seen as a generalization of a deterministic transformation, since the equality z = g(h) is equivalent to p(z|h) = δ(z − g(h)), where δ(·) is the Dirac delta function. Assuming p(z|h) = δ(z − g(h)), the previous approach is recovered by using the substitution rule for integrals (see Theorem 7.26 in [49]). This allows a joint density to be defined for the observed and unobserved (latent) variables as a product of given density functions: p(z,h) = p(z|h)p(h). Finally, one can obtain p(z) by marginalizing out the latent variables:

p(z) = ∫ p(z,h) dh = ∫ p(z|h) p(h) dh. (2.39)

Since many probabilistic models in speaker recognition, including those proposed in Publications I-IV, are instances of latent variable models, this topic deserves a more detailed discussion.

Latent variable models

While the modeling capacity of basic distributions, such as the Gaussian, Gamma, Beta, Chi-squared, Dirichlet, or Poisson distributions, is limited, one can build complex distributions from simpler building blocks. To model a complex probability density over the observed variable z, one introduces the hidden (or latent) variable h, which may be discrete or continuous. The density of interest p(z) is expressed as the marginalization of the joint density p(z,h) so that:

p(z) = ∫ p(z,h) dh. (2.40)

Usually the joint density p(z,h) is defined as a product of the conditional density p(z|h) and the marginal density of the hidden variable, p(h), such that both factors are represented by simple distributions. In turn, the density p(h) can itself be defined by means of hidden variables, leading to so-called hierarchical models [50].

While, in general, a latent variable model may not be tractable (that is, there is no closed-form expression for ∫ p(z,h) dh), the most popular models are tractable.
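When the integral in Eq. (2.40) lacks a closed form, the marginal density can still be approximated numerically. The sketch below uses plain Monte Carlo for a toy model in which the latent variable controls the scale of the observation; the model and the sample size are illustrative assumptions.

```python
import numpy as np
from scipy.stats import norm

# Toy latent variable model: h ~ N(0, 1), z|h ~ N(0, exp(h)^2). The marginal
# p(z) = ∫ p(z|h) p(h) dh has no convenient closed form, so approximate it by
# p(z) ≈ (1/S) * sum_s p(z|h_s) with h_s drawn from p(h).
rng = np.random.default_rng(0)

def p_marginal(z, n_samples=100_000):
    h = rng.standard_normal(n_samples)               # samples from p(h)
    return norm.pdf(z, loc=0.0, scale=np.exp(h)).mean()

print(p_marginal(0.5))   # Monte Carlo estimate of the marginal density at z = 0.5
```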

In the next section, two relevant classes of tractable models, namely mixture models and subspace models, will be described in more detail. Both of these models are frequently used to represent the class-conditional densities in generative classifiers.

Figure 2.1: Gaussian mixture model with two components having different component weights: π1 = 0.7 and π2 = 0.3.

Mixture models

One of the simplest ways to construct a complex distribution is to take a convex combination of simple probability densities pk(x):

p(x) = ∑_{k=1}^K πk pk(x), (2.41)

where πk is the weight of the k-th mixture component and K is the number of components in the mixture.

It can be shown that a mixture model can be obtained within the latent variable framework described earlier. Any mixture model can be expressed as a marginalization of the joint density between an observed data point x and a discrete latent variable h ∈ {1, ..., K}:

p(x) = ∑_{k=1}^K p(h=k) p(x|h=k). (2.42)

For instance, a Gaussian mixture model (GMM) is defined as follows:

p(h) = Cat(π)
p(x|h) = N(x|µh, Σh), (2.43)

where Cat(π) denotes the categorical (or discrete) distribution, so that p(h=k) = πk. The density of the GMM can be recovered by summing over h:

p(x) = ∑_{k=1}^K πk N(x|µk, Σk) = GMM(x|{πk, µk, Σk}_{k=1}^K). (2.44)

An example of a GMM with two components is shown in Figure 2.1.
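The sketch below evaluates the GMM density of Eq. (2.44) for the two-component example of Figure 2.1 (π1 = 0.7, π2 = 0.3); the means and standard deviations are illustrative assumptions, since the figure caption does not state them.

```python
import numpy as np
from scipy import integrate
from scipy.stats import norm

# Two-component GMM, Eq. (2.44): p(x) = sum_k pi_k * N(x | mu_k, sigma_k^2).
weights = np.array([0.7, 0.3])
means = np.array([-3.0, 3.0])
stds = np.array([1.0, 1.5])

def gmm_pdf(x):
    return sum(w * norm.pdf(x, loc=m, scale=s)
               for w, m, s in zip(weights, means, stds))

total, _ = integrate.quad(gmm_pdf, -np.inf, np.inf)
print(gmm_pdf(0.0), total)   # total ≈ 1, a quick normalization check
```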

The value of the hidden variable h has a clear interpretation in the case of mixture models: it determines to which of the components the data point belongs. This interpretation makes mixture models a natural tool for solving problems that involve partitioning data points into groups, e.g. clustering. Since a mixture model associates each observation with a single cluster, solving a clustering problem can be seen as finding the most probable values of the latent variables indicating cluster membership. However, in general, latent variables may not have any reasonable interpretation in the domain of the problem at hand.

There is a useful notation based on so-called one-hot encoding that is used to define the conditional density p(x|h). One-hot encoding is an alternative representation of the elements of a finite set {1, ..., K} by K-dimensional binary vectors. The k-th element of the set is represented by the vector with a single non-zero element in the k-th position, that is, the k-th vector of the standard basis. This encoding can be used to represent cluster labels of data points in a clustering task. For example, if K = 4 and the cluster label equals 3, then its one-hot encoding is the K-dimensional vector h = (0, 0, 1, 0). One-hot encoding allows p(x|h) to be rewritten in the form:

p(x|h) = ∏_{k=1}^K pk(x)^{hk}, (2.45)

which is commonly used to define mixture models [6]. Given a dataset X = {x1, ..., xN}, any configuration of the latent variables H = {h1, ..., hN} defines a partition of the data points into non-overlapping groups. Using one-hot encoding, the likelihood function of the partition H can be written as

p(X|H) = ∏_{k=1}^K ∏_{i=1}^N pk(xi)^{hik}. (2.46)

Therefore, a clustering task (e.g. speaker diarization) can be formulated as a discrete optimization problem of maximizing the likelihood function

Ĥ = arg max_H p(X|H). (2.47)

If some partitions are a priori believed to be more probable than others, then the clustering task can be formulated as a maximization of the posterior probability of the partition:

Ĥ = arg max_H p(H|X) = arg max_H p(X|H) p(H), (2.48)

where p(H) is the prior distribution over partitions. Since these discrete optimization problems have a huge number of possible solutions, different approximate techniques are commonly adopted.
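The following sketch makes Eqs. (2.46) and (2.47) concrete for a tiny one-dimensional example: each point is assigned to one of two fixed Gaussian components (the integer labels carry the same information as the one-hot vectors hi), and the partition likelihood is maximized by exhaustive search. The component parameters are illustrative assumptions, and brute force over all K^N assignments is exactly the kind of search that approximate techniques replace in practice.

```python
import itertools
import numpy as np
from scipy.stats import norm

# Hard-assignment likelihood of Eq. (2.46) and its maximization, Eq. (2.47),
# for a toy 1-D dataset and two fixed Gaussian components.
X = np.array([-2.1, -1.8, 2.0, 2.3])
components = [norm(loc=-2.0, scale=1.0), norm(loc=2.0, scale=1.0)]
K = len(components)

def log_likelihood(assignment):
    """log p(X|H) for a hard assignment of each point to one component."""
    return sum(components[k].logpdf(x) for x, k in zip(X, assignment))

# Exhaustive arg max over all K**N possible assignments.
best = max(itertools.product(range(K), repeat=len(X)), key=log_likelihood)
print(best)   # expected: (0, 0, 1, 1)
```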

Mixture models are widely used in many speech analysis tasks, including speech activity detection, speaker recognition, and speaker diarization. In Publication IV, GMMs are used to approximate class-specific distributions to construct a probabilistic speech activity detector. In Publication IV, a mixture-like model was proposed to cluster speech segments according to speaker identity. Moreover, viewing mixture models as latent variable models provides a simple way to formulate general speaker partitioning problems that include speaker recognition and diarization as special cases.

Subspace models

Another powerful model that can be constructed within the latent variable framework is factor analysis (FA) [6]. Factor analysis can be seen as a compromise between a multivariate Gaussian distribution with a diagonal covariance matrix (which assumes that all random variables are statistically independent) and one with a full covariance matrix (which assumes no independence among any group of random variables).

Specifically, the probability density function of a factor analyzer is given by:

p(x) = N(x|µ, VVᵀ + Ψ) = FA(x|µ, V, Ψ), (2.49)

where the covariance matrix VVᵀ + Ψ contains two terms. The first term, VVᵀ, corresponds to the full covariance over the subspace spanned by the columns of V, which is called the loading matrix. The second term, the noise matrix Ψ, is a diagonal matrix containing the variances of the individual components of the feature vector x. This density can also be obtained by integrating out the latent variable h ∈ R^m in the following latent variable model:

p(h) = N(h|0, I) (2.50)
p(x|h) = N(x|µ + Vh, Ψ) (2.51)
p(x) = ∫ p(x|h) p(h) dh = N(x|µ, VVᵀ + Ψ), (2.52)

where the dimensionality of the latent vector h is lower than the dimensionality of the input vector x. Commonly, factor analysis is used as a method for dimensionality reduction: the latent vector is a compact representation of the higher-dimensional input. The model assumes that the input can be represented as a linear combination of the columns of the matrix V (with the elements of the vector h as coefficients) plus a vector of Gaussian noise e ∼ N(0, Ψ) and the constant offset µ:

x = µ + Vh + e. (2.53)
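To illustrate Eqs. (2.50)-(2.53), the sketch below samples from the factor analysis generative model and checks that the empirical covariance of x approaches VVᵀ + Ψ, as in Eq. (2.52). The dimensionalities and parameter values are arbitrary illustrative choices.

```python
import numpy as np

# Factor analysis as a generative model: h ~ N(0, I), e ~ N(0, Psi),
# x = mu + V h + e, so that Cov[x] = V V^T + Psi.
rng = np.random.default_rng(0)
D, M, N = 5, 2, 200_000               # observed dim, latent dim (M < D), samples

mu = rng.normal(size=D)
V = rng.normal(size=(D, M))           # loading matrix
psi = rng.uniform(0.1, 1.0, size=D)   # diagonal of the noise matrix Psi

h = rng.standard_normal((N, M))                  # Eq. (2.50)
e = rng.standard_normal((N, D)) * np.sqrt(psi)   # noise with diagonal covariance
x = mu + h @ V.T + e                             # Eq. (2.53)

model_cov = V @ V.T + np.diag(psi)
print(np.abs(np.cov(x, rowvar=False) - model_cov).max())   # small for large N
```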

Since the columns of the matrix V span a linear subspace of the input space, FA belongs to the class of so-called subspace models (see Chapter 12 in [6] and the overview of linear Gaussian models [51]). In fact, FA can be seen as a generalization of another subspace model, probabilistic principal component analysis (PPCA) [41, 52], a probabilistic counterpart of traditional PCA. PPCA is recovered if the noise matrix Ψ is constrained to be isotropic, that is, proportional to the identity matrix. Another instance of a probabilistic subspace model closely related to FA is probabilistic linear discriminant analysis (PLDA) [34], which has been widely studied in speaker recognition. PLDA can be seen as FA in which different subgroups of latent variables have different interpretations in the context of the application. Finally, one of the most popular features in speaker recognition, the i-vector, is also extracted using an FA-like model.