

5.3.2 Discriminative models

Unlike generative models, which are defined by the likelihood function p(X|M), discriminative models directly define the posterior distribution over the hypotheses, p(M|X). In the speaker verification problem there are only two possible outcomes, so one only needs to compute the probability p(Mtar|X) for a given trial.

Discriminative models can be constructed from scratch or derived from generative models. The latter approach is adopted to derive a discriminative counterpart to the PLDA introduced in [35]. This model can be constructed by observing that the posterior probability p(Mtar|X) of a pair of feature vectors X = {xe, xt} belonging to the same speaker can be expressed in terms of the log-likelihood ratio (LLR) of the PLDA model. This can be shown by writing Bayes' theorem in the odds form,

\[
\frac{p(M_{\mathrm{tar}}|X)}{p(M_{\mathrm{non}}|X)} = \frac{p(X|M_{\mathrm{tar}})}{p(X|M_{\mathrm{non}})}\,\frac{p(M_{\mathrm{tar}})}{p(M_{\mathrm{non}})}, \tag{5.63}
\]

and applying the logarithm to both sides of the equation to arrive at

\[
\log\frac{p(M_{\mathrm{tar}}|X)}{1 - p(M_{\mathrm{tar}}|X)} = \log\frac{p(X|M_{\mathrm{tar}})}{p(X|M_{\mathrm{non}})} + \log\frac{p(M_{\mathrm{tar}})}{1 - p(M_{\mathrm{tar}})}. \tag{5.64}
\]

Then, after applying the sigmoid function to both sides of the equation, one obtains the expression for the posterior probability of the target hypothesis as a function of the log-likelihood ratio:

\[
p(M_{\mathrm{tar}}|X) = \sigma\bigl(\mathrm{LLR}(x_e, x_t) + \mathrm{logit}(\pi)\bigr), \tag{5.65}
\]

where σ(·) is the sigmoid function and

\[
\mathrm{LLR}(x_e, x_t) = \log\frac{p(x_e, x_t|M_{\mathrm{tar}})}{p(x_e, x_t|M_{\mathrm{non}})}. \tag{5.66}
\]

Further, to keep the derivations simple, let us consider the following special case of PLDA, known as the two-covariance model [3]:

\[
p(y) = \mathcal{N}(y\,|\,0, B) \tag{5.67}
\]
\[
p(x_i\,|\,y) = \mathcal{N}(x_i\,|\,m + y, \Sigma). \tag{5.68}
\]

This model is parameterized by the mean vector and two covariance matrices, θ = {m, B, Σ}. Nevertheless, the following analysis is valid for any PLDA model as well [180].
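To make the two-covariance model concrete, here is a minimal sketch that samples feature vectors according to (5.67)-(5.68); the dimensionality, the covariance construction and the variable names are illustrative assumptions, not taken from the cited works.

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 4                                   # toy feature dimensionality (hypothetical)
m = rng.normal(size=dim)                  # global mean
A = rng.normal(size=(dim, dim))
B = A @ A.T + np.eye(dim)                 # between-speaker covariance: y ~ N(0, B)
W = 0.5 * rng.normal(size=(dim, dim))
Sigma = W @ W.T + 0.5 * np.eye(dim)       # within-speaker (residual) covariance

def sample_speaker(n_utts):
    """Draw one latent identity y ~ N(0, B), then n_utts observations x_i = m + y + eps_i."""
    y = rng.multivariate_normal(np.zeros(dim), B)
    eps = rng.multivariate_normal(np.zeros(dim), Sigma, size=n_utts)
    return m + y + eps

same_speaker_pair = sample_speaker(2)                                   # a target trial
diff_speaker_pair = np.vstack([sample_speaker(1), sample_speaker(1)])   # a non-target trial
```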

Under the target speaker hypothesis, Mtar, one assumes that xe and xt share the same latent identity variable y. Then, the target likelihood p(xe, xt|Mtar) can be found by integrating out y from the joint density of all variables corresponding to the target hypothesis. Namely,

\[
p(x_e, x_t|M_{\mathrm{tar}}) = \int p(x_e, x_t, y)\, dy \tag{5.69}
\]
\[
= \int p(x_e, x_t|y)\, p(y)\, dy = \int p(x_e|y)\, p(x_t|y)\, p(y)\, dy. \tag{5.70}
\]

By substituting the expressions for p(y) and p(x|y), it can be shown [34] that the target likelihood is a Gaussian density over the stacked pair of vectors,

\[
p(x_e, x_t|M_{\mathrm{tar}}) = \mathcal{N}\!\left(\begin{bmatrix} x_e \\ x_t \end{bmatrix} \middle|\; \begin{bmatrix} m \\ m \end{bmatrix}, \begin{bmatrix} B+\Sigma & B \\ B & B+\Sigma \end{bmatrix}\right). \tag{5.71}
\]

On the contrary, under the hypothesis Mnon one assumes that the feature vectors xe and xt correspond to different latent identity variables, say, ye and yt, respectively. In this case we integrate out both:

\[
p(x_e, x_t|M_{\mathrm{non}}) = \int p(x_e|y_e)\, p(y_e)\, dy_e \int p(x_t|y_t)\, p(y_t)\, dy_t = p(x_e)\, p(x_t). \tag{5.72}
\]

This reflects the assumption of statistical independence of utterances originating from different speakers, implying the factorization (5.11) of the likelihood over speakers. Therefore, the likelihood of the nontarget hypothesis is

\[
p(x_e, x_t|M_{\mathrm{non}}) = \mathcal{N}\!\left(\begin{bmatrix} x_e \\ x_t \end{bmatrix} \middle|\; \begin{bmatrix} m \\ m \end{bmatrix}, \begin{bmatrix} B+\Sigma & 0 \\ 0 & B+\Sigma \end{bmatrix}\right). \tag{5.73}
\]

After combining the two likelihoods, the log-likelihood ratio becomes

\[
\mathrm{LLR}(x_e, x_t) = \log \frac{\mathcal{N}\!\left(\begin{bmatrix} x_e \\ x_t \end{bmatrix} \middle|\; \begin{bmatrix} m \\ m \end{bmatrix}, \begin{bmatrix} B+\Sigma & B \\ B & B+\Sigma \end{bmatrix}\right)}{\mathcal{N}\!\left(\begin{bmatrix} x_e \\ x_t \end{bmatrix} \middle|\; \begin{bmatrix} m \\ m \end{bmatrix}, \begin{bmatrix} B+\Sigma & 0 \\ 0 & B+\Sigma \end{bmatrix}\right)}. \tag{5.74}
\]

Further, by expanding the log of Gaussian distributions using the identity for matrix inversion in block form [6, p. 87] and simplifying the expressions, the LLR can be written as the following quadratic form,

\[
\mathrm{LLR}(x_e, x_t) = x_e^T P x_t + x_t^T P x_e + x_e^T Q x_e + x_t^T Q x_t + c^T(x_t + x_e) + k, \tag{5.75}
\]

where the vector c = −2(P+Q)m, the scalar

\[
k = \frac{1}{2}\left(\log|B+\Sigma| - \log\left|B+\Sigma - B(B+\Sigma)^{-1}B\right|\right) - c^T m, \tag{5.76}
\]

and the symmetric matrices P and Q are determined by the PLDA model parameters [181]:

\[
Q = \frac{1}{2}\left((B+\Sigma)^{-1} - \left(B+\Sigma - B(B+\Sigma)^{-1}B\right)^{-1}\right) \tag{5.77}
\]
\[
P = \frac{1}{2}(B+\Sigma)^{-1}B\left(B+\Sigma - B(B+\Sigma)^{-1}B\right)^{-1}. \tag{5.78}
\]
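As a numerical sanity check of (5.75)-(5.78), the sketch below computes P, Q, c and k from the two-covariance parameters (for instance the toy m, B, Sigma sampled above) and compares the quadratic form against the direct log-ratio of the joint Gaussians (5.71) and (5.73); it is an illustration of the formulas, not scoring code from the cited works.

```python
import numpy as np
from scipy.stats import multivariate_normal as mvn

def plda_llr_params(m, B, Sigma):
    """P, Q, c and k of the quadratic-form LLR (5.75)-(5.78) for the two-covariance model."""
    tot = B + Sigma                              # marginal covariance of a single vector
    tot_inv = np.linalg.inv(tot)
    schur = tot - B @ tot_inv @ B                # B + Sigma - B (B + Sigma)^{-1} B
    schur_inv = np.linalg.inv(schur)
    Q = 0.5 * (tot_inv - schur_inv)
    P = 0.5 * tot_inv @ B @ schur_inv
    c = -2.0 * (P + Q) @ m
    k = 0.5 * (np.linalg.slogdet(tot)[1] - np.linalg.slogdet(schur)[1]) - c @ m
    return P, Q, c, k

def llr_quadratic(xe, xt, P, Q, c, k):
    """Evaluate the quadratic form (5.75)."""
    return xe @ P @ xt + xt @ P @ xe + xe @ Q @ xe + xt @ Q @ xt + c @ (xt + xe) + k

def llr_direct(xe, xt, m, B, Sigma):
    """Evaluate the log-ratio of the joint Gaussians under Mtar and Mnon, cf. (5.74)."""
    z, mu = np.concatenate([xe, xt]), np.concatenate([m, m])
    zero = np.zeros_like(B)
    cov_tar = np.block([[B + Sigma, B], [B, B + Sigma]])
    cov_non = np.block([[B + Sigma, zero], [zero, B + Sigma]])
    return mvn.logpdf(z, mu, cov_tar) - mvn.logpdf(z, mu, cov_non)

# For any pair (xe, xt), llr_quadratic and llr_direct agree up to numerical precision.
```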

One may notice a similarity between (5.75) and the squared Mahalanobis distance, defined as:

\[
d_M^2(x_e, x_t) \triangleq (x_e - x_t)^T M (x_e - x_t) = x_e^T M x_e + x_t^T M x_t - x_e^T M x_t - x_t^T M x_e, \tag{5.79}
\]

where M is some positive semi-definite matrix. The LLR of the PLDA model reduces to a (negative) Mahalanobis distance if P = −Q. Straightforward algebra leads to the condition Σ = 0. This corresponds to a model with no residual term (no uncertainty in the within-person variation) and leads to a decision rule that assigns two feature vectors xe and xt to the same person only if xe = xt holds exactly. As this generally cannot hold, the LLR of the PLDA model does not define a Mahalanobis-style distance.

It can be shown [35, 182] that the quadratic form (5.75) can be re-written in an inner product form:

\[
\mathrm{LLR}(x_e, x_t) = \begin{bmatrix} \mathrm{vec}(P) \\ \mathrm{vec}(Q) \\ c \\ k \end{bmatrix}^T \begin{bmatrix} \mathrm{vec}(x_e x_t^T + x_t x_e^T) \\ \mathrm{vec}(x_e x_e^T + x_t x_t^T) \\ x_e + x_t \\ 1 \end{bmatrix}, \tag{5.80}
\]

where vec(·) is the column stacking operator and P, Q, c and k can be expressed in terms of the PLDA model parameters m, B and Σ. That is, the log-likelihood ratio of the PLDA model can be expressed as a dot product between a vector of parameters, ϑ, and a vector representing a trial, ϕ(X):

\[
\log \frac{p(X|M_{\mathrm{tar}})}{p(X|M_{\mathrm{non}})} = \vartheta^T \varphi(X), \tag{5.81}
\]

which is a linear classifier in the space of vectors ϕ(X). This yields the well-known linear logistic regression (2.18) classifier that can be trained on a collection of labeled target and non-target trials:

\[
p(M_{\mathrm{tar}}|X) = p(M_{\mathrm{tar}}|X, \theta) = \sigma\bigl(\theta^T \phi(X)\bigr) = \frac{1}{1 + \exp\bigl(-\theta^T \phi(X)\bigr)}, \tag{5.82}
\]

where θ = [ϑ^T β]^T, β = logit(π) and φ(X) = [ϕ(X)^T 1]^T. Given a trained model, the Bayes rule (5.15) suggests selecting the target hypothesis if

\[
\theta^T \phi(X) > \log \frac{C_{\mathrm{fa}}}{C_{\mathrm{miss}}}. \tag{5.83}
\]
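A brief sketch tying (5.80)-(5.83) together: the trial is expanded into a vector in which the PLDA score is linear, the PLDA-derived weights (from the sketch above) are stacked with the prior log-odds folded into the bias, and the Bayes decision (5.83) is a simple threshold. The exact ordering of the stacked blocks is an assumption consistent with the decomposition (5.80), not code from [35] or [182].

```python
import numpy as np

def expand_trial(xe, xt):
    """phi(X): the quadratic LLR (5.75) is a dot product with this vector, cf. (5.80)-(5.81)."""
    return np.concatenate([
        (np.outer(xe, xt) + np.outer(xt, xe)).ravel(),   # pairs with vec(P)
        (np.outer(xe, xe) + np.outer(xt, xt)).ravel(),   # pairs with vec(Q)
        xe + xt,                                         # pairs with c
        [1.0],                                           # pairs with the constant term
    ])

def stack_params(P, Q, c, k, prior=0.5):
    """Weight vector theta of (5.82); beta = logit(prior) is folded into the bias entry."""
    beta = np.log(prior / (1.0 - prior))
    return np.concatenate([P.ravel(), Q.ravel(), c, [k + beta]])

def accept_target(xe, xt, theta, c_fa=1.0, c_miss=1.0):
    """Bayes decision (5.83): accept the target hypothesis if theta^T phi(X) > log(Cfa / Cmiss)."""
    return theta @ expand_trial(xe, xt) > np.log(c_fa / c_miss)
```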

Since estimating the parameters of logistic regression with the maximum likelihood method coincides with training a binary linear classifier by minimizing the empirical risk with the logistic loss (2.15), alternative loss functions can be considered.

For instance, [36] found that using the hinge loss (2.14) to train the linear classifier (5.81) leads to improved performance. By analogy with the conventional support vector machine (SVM), the classifier proposed in [36] and further studied in [182] is often referred to as a pairwise support vector machine (PSVM), because it classifies pairs of feature vectors.
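To illustrate how the linear trial representation enables such discriminative training, the sketch below fits both a logistic-loss classifier and a hinge-loss (PSVM-style) classifier on expanded trial vectors, reusing expand_trial from the sketch above; the toy data generator and the hyperparameters are illustrative assumptions, not the experimental setup of [36] or [182].

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC

rng = np.random.default_rng(1)
dim, n_spk, n_utt = 4, 40, 4

# Toy speaker-structured data: per-speaker offsets plus within-speaker noise (hypothetical).
spk_means = rng.normal(size=(n_spk, dim))
utts = spk_means[:, None, :] + 0.3 * rng.normal(size=(n_spk, n_utt, dim))

trials, labels = [], []
for s in range(n_spk):
    for t in range(s, n_spk):
        xe, xt = utts[s, rng.integers(n_utt)], utts[t, rng.integers(n_utt)]
        trials.append(expand_trial(xe, xt))        # trial expansion from the previous sketch
        labels.append(1 if s == t else -1)         # target vs. non-target trial
trials, labels = np.asarray(trials), np.asarray(labels)

clf_logistic = LogisticRegression(max_iter=5000).fit(trials, labels)   # logistic loss, cf. (5.82)
clf_psvm = LinearSVC(C=1.0, max_iter=20000).fit(trials, labels)        # hinge loss, cf. the PSVM of [36]
```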

Figure 5.3: The distributions of projections of the 600-dimensional i-vectors from the Switchboard corpus. The projection matrix was obtained by PCA applied to the entire dataset. Colors correspond to subsets of the Switchboard corpus.

In Publication II, the author of this thesis proposes a new training method that can learn across multiple datasets by extending the conventional PSVM according to the multi-task learning (MTL) [183] and multi-domain learning (MDL) [184, 185] paradigms. Multi-task learning aims at leveraging useful information from multiple related learning tasks, which can originate from supervised, unsupervised or reinforcement learning, to improve the generalization performance of all the tasks. It is based on the observation that learning multiple related tasks jointly can sometimes lead to a performance gain over learning them individually.

In turn, the MDL setting arises when the same task is solved in different but related domains, i.e. datasets with differing statistical properties but a shared feature space. That is, MDL refers to learning the same task across different domains, while MTL addresses learning different tasks in the same domain. Therefore, the MDL scenario assumes that a predictive model from one domain can be directly applied to another domain, while in MTL a model designed for one task cannot be meaningfully applied to another task because of the different output spaces. Since the domain/task distinction is subtle in some cases, and some methods designed for MDL can be applied to MTL and vice versa, the two settings are sometimes used loosely and interchangeably.

In the context of speaker recognition, domains might correspond to different languages or transmission channels (e.g. telephone and microphone). For instance, Figure 5.3 demonstrates the domain mismatch between the i-vector distributions of two parts of the Switchboard (SWB) corpus [186]. This illustration was produced by, first, estimating a two-dimensional subspace using principal component analysis on the whole SWB dataset and, then, projecting the i-vectors extracted from two subsets defined according to the different LDC distributions [186].

Figure 5.4: Illustration of the intuition behind the mt-SVM. The parameters of the task-specific classifier θt are represented as a sum of the task-independent term θ0 and the residual vt.
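The sketch below mimics the visualization procedure used for Figure 5.3: fit a two-dimensional PCA on the pooled data and project each subset separately. The randomly generated arrays are only stand-ins for the Switchboard i-vectors.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(2)
# Stand-ins for 600-dimensional i-vectors from two corpus subsets with shifted statistics.
domain_a = rng.normal(loc=0.0, scale=1.0, size=(500, 600))
domain_b = rng.normal(loc=0.4, scale=1.2, size=(500, 600))

pca = PCA(n_components=2).fit(np.vstack([domain_a, domain_b]))   # subspace from the pooled dataset
proj_a, proj_b = pca.transform(domain_a), pca.transform(domain_b)
# Scatter-plotting proj_a and proj_b in different colors reproduces a Figure 5.3-style view.
```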

The training method proposed in Publication II follows the multi-task extension of the SVM (mt-SVM) introduced in [187]. In this formulation, we are given T datasets that share the same feature and label spaces but whose distributions differ. Accordingly, the input data available for training is

\[
D = \{(X_1, Y_1), (X_2, Y_2), \ldots, (X_T, Y_T)\},
\]

where Xt is a set of feature vectors and Yt represents the corresponding class labels, with xit ∈ R^n and yit ∈ {−1, 1} for i running over the t-th subset. This setup corresponds to MDL with T different domains; however, similar to [187], in the following text the term "task" is used to refer to the domain.

In this approach, the decision rule is different for each task (domain). That is, there are T linear binary classifiers jointly trained on D. To take into account the relatedness of the tasks, the mt-SVM assumes that the parameter vector of the t-th classifier is decomposed as

\[
\theta_t = \theta_0 + v_t. \tag{5.84}
\]

Intuitively, if the tasks are similar, then the task-specific models θt should be close to each other. That is, θt should not deviate much from the shared "task-independent" model θ0, which means that ‖vt‖² should be small. Otherwise, if the tasks are weakly related, then ‖θ0‖² should be small. In other words, the task-specific model θt adapts to a particular task while θ0 shares common information across the tasks. Figure 5.4 provides a conceptual illustration of the mt-SVM in the two-dimensional case.

In [187], the parameter vectors θt and θ0 are estimated jointly by solving the following optimization problem:

\[
\min_{\theta_0, \{v_t\}} \left\{ \frac{\lambda_0}{2}\, \|\theta_0\|^2 + \sum_{t} \frac{\lambda_t}{2}\, \|v_t\|^2 + \sum_{t} \sum_{i} \max\bigl(0,\; 1 - y_{it}\, x_{it}^T (\theta_0 + v_t)\bigr) \right\}.
\]

Here, yit ∈ Yt denotes the label of the i-th training vector xit ∈ Xt from the t-th subset. The last term in the objective function is a sum of the T unregularized objective functions of conventional SVMs [188], with the parameters decomposed in the specific way (5.84). Thus, the problem is analogous to SVMs used for learning single tasks, but it additionally includes a task-coupling mechanism specified through regularization. The regularization parameters λt determine the trade-off between the optimality of these models for the corresponding tasks and the closeness of each task-specific model to the shared model. In the extreme case, if λt → +∞, then vt = 0 and all the tasks share the same decision rule with parameters θ0. In the other extreme, if λ0 → +∞, then θ0 = 0 and all tasks are decoupled.
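A compact sketch of this joint objective, optimized with plain subgradient descent; the solver, learning rate and data layout are assumptions made for illustration, not the optimization procedure of [187] or Publication II.

```python
import numpy as np

def train_mt_svm(tasks, lam0=1.0, lam_t=1.0, lr=1e-3, epochs=500):
    """Subgradient descent on the mt-SVM objective: shared theta0 plus per-task residuals v_t.

    tasks: list of (X_t, y_t) pairs, where X_t has shape (n_t, d) and y_t contains labels in {-1, +1}.
    """
    d = tasks[0][0].shape[1]
    theta0, v = np.zeros(d), [np.zeros(d) for _ in tasks]
    for _ in range(epochs):
        g0 = lam0 * theta0                          # gradient of (lambda_0 / 2) ||theta0||^2
        gv = [lam_t * vt for vt in v]               # gradients of (lambda_t / 2) ||v_t||^2
        for t, (X, y) in enumerate(tasks):
            margins = y * (X @ (theta0 + v[t]))
            active = margins < 1.0                  # samples with nonzero hinge loss
            g_hinge = -(y[active, None] * X[active]).sum(axis=0)
            g0 += g_hinge                           # theta0 is shared by all tasks
            gv[t] += g_hinge                        # v_t is updated only by its own task
        theta0 -= lr * g0
        for t in range(len(v)):
            v[t] -= lr * gv[t]
    return theta0, v
```

Making lam_t very large drives each vt towards zero so that all tasks share a single decision rule, while making lam0 very large decouples the tasks, mirroring the two extremes discussed above.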

Using the observation that the LLR (5.81) can be written in the form of (5.80), a multi-task formulation of the PSVM (mt-PSVM) was proposed by the author of this thesis in Publication II by replacing the feature vectors x in (5.3.2) by ϕ(X) from (5.80). Similar to the mt-SVM, the proposed training method can be used to obtain a classifier that is robust to inter-domain variability. Assuming that the shared parameters capture the domain-independent structure of the data, this can be done by using the vector θ0 to construct a classifier.

The multi-task PSVM can be seen as a generalization of the multi-task metric learning of [189], where the multi-task learning paradigm was adopted for learning a Mahalanobis metric (5.79). As in the mt-SVM, the proposed method assumes that the parameters of each task-specific metric d²t(·,·) are decomposed in a way similar to (5.84):

\[
d_t^2(x_e, x_t) = (x_e - x_t)^T M_t\, (x_e - x_t) = (x_e - x_t)^T (M_0 + V_t)\, (x_e - x_t),
\]

where the matrix Vt is task-specific and the matrix M0 is shared across the tasks. Thus, the metric defined by M0 discovers general similarities between feature vectors, and Vt specializes the metric further for the task t.
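Evaluating the decomposed task-specific metric is then a one-liner; the matrices M0 and Vt are assumed to be given, for example by a multi-task metric-learning procedure in the spirit of [189].

```python
import numpy as np

def mt_mahalanobis_sq(xe, xt, M0, Vt):
    """Task-specific squared distance d_t^2(xe, xt) = (xe - xt)^T (M0 + Vt) (xe - xt)."""
    diff = xe - xt
    return diff @ (M0 + Vt) @ diff
```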

5.4 SPEAKER DIARIZATION

Speaker diarization is one of the most general and difficult instances of the speaker partitioning problem. The reason is the astronomical number of possible hypotheses (partitions), even in the most constrained scenario of a dialogue conversation (when the number of speakers is K = 2). In this case the number of possible partitions equals L = 2^(N−1), which is still prohibitively large for enumerating all the candidate solutions. Therefore, in practice, different heuristic (e.g. greedy) approaches are adopted. Moreover, since speaker diarization is essentially a clustering problem, many general-purpose clustering methods (including k-means) are directly applicable.

In the sequel, I will focus mostly on probabilistic models for clustering used in speaker diarization. Clustering methods based on probabilistic models have a number of advantages over classical clustering methods such as k-means and hierarchical clustering. The probabilistic approach allows prior knowledge on cluster structure, assignment constraints, and the number of clusters to be incorporated in a principled way. For instance, one can design a prior for partitions [190], [191] that encourages observations at adjacent temporal points to belong to the same cluster. This approach also provides probabilistic assignments to clusters.

As in the classification task, probabilistic models for clustering can be either generative [6] or discriminative [192–194]. In the former case, one needs to specify the likelihood function for partitions, L(Z|X) ∝ p(X|Z), and the prior distribution over possible partitions, p(Z). In the latter case, the posterior distribution p(Z|X) is defined directly. Since a cost-sensitive setup is rare in clustering problems, the solution is found by maximizing the unnormalized posterior distribution

\[
\hat{Z} = \arg\max_{Z}\, p(Z|X) = \arg\max_{Z}\, p(X|Z)\, p(Z). \tag{5.85}
\]

However, to the author's knowledge, discriminative models for clustering have not been studied in the context of speaker diarization.

There are two widely used approaches to speaker diarization that are based on generative probabilistic models. One uses generative models within the iterative scheme of agglomerative hierarchical clustering (AHC) [195], [15]. Agglomerative hierarchical clustering is one of the most widely used clustering methods in speaker diarization. It iteratively generates clusters by a sequence of merge operations. Starting from clusters each consisting of a single data point, on each iteration it selects a pair of the closest (in some sense) clusters and merges them. This process continues until a stopping criterion is met.

In model-based AHC, the elements of the k-th cluster, Xk = {x1, ..., xT}, are assumed to be exchangeable, and the cluster is represented by their joint density function, pk(X) = pk(x1, ..., xT). On each iteration the algorithm makes a decision on merging two clusters, X1 and X2, by performing a binary hypothesis test that is equivalent to the speaker verification problem. A positive decision means that the elements of both clusters belong to the same speaker. To this end, the likelihood p(X1, X2|Mtar) = p(X1, X2) is compared to p(X1, X2|Mnon) = p(X1)p(X2). If p(X1, X2|Mtar) is greater than p(X1, X2|Mnon), then the latter term is replaced with the former in the factorization of the overall likelihood p(X|Z) = ∏k p(Xk). Hence, the merging process in AHC can be seen as a greedy maximization of the likelihood p(X|Z) with respect to the partition Z, or a maximization of the posterior in (5.85) when the prior is uniform. Typically, clusters are modeled by GMMs and the Bayesian information criterion is used as a proxy for the (marginal) likelihoods of the hypotheses Mtar and Mnon [196], [197], [198], [199].
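The greedy model-based AHC loop described above can be sketched as follows; the per-cluster marginal likelihood log_ml is left as a user-supplied function (for example a BIC approximation or a PLDA-based marginal likelihood), and the stopping threshold is an assumed free parameter rather than a value from the cited works.

```python
import numpy as np

def model_based_ahc(segments, log_ml, threshold=0.0):
    """Greedy agglomerative clustering driven by marginal-likelihood ratios.

    segments: list of per-segment feature arrays.
    log_ml:   callable mapping a list of segments to log p(X) under a single-speaker model.
    Clusters X1 and X2 are merged while log p(X1, X2) - log p(X1) - log p(X2) exceeds the threshold.
    """
    clusters = [[s] for s in segments]
    while len(clusters) > 1:
        best_gain, best_pair = -np.inf, None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                gain = log_ml(clusters[i] + clusters[j]) - log_ml(clusters[i]) - log_ml(clusters[j])
                if gain > best_gain:
                    best_gain, best_pair = gain, (i, j)
        if best_gain <= threshold:      # stopping criterion: no merge improves the likelihood enough
            break
        i, j = best_pair
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters
```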

Another approach directly defines a generative model for the input data. If the input data is represented by a sequence of feature vectors (e.g. i-vectors), then mixture models become a natural choice. Examples include a mixture of Gaussians [43], a mixture of mixtures of Gaussians (i.e. speaker-specific distributions are represented by GMMs) [148], and a mixture of mixtures of Gaussians with eigenvoice priors on the speaker-specific mean supervectors [42, 200]. In the last example the prior for the speaker-specific distributions is learned on a large independent dataset of speakers.

Alternatively, the input data can be represented by a similarity (affinity) matrix, the elements of which are real-valued similarity scores from pairwise comparisons between objects. Speaker diarization systems based on pairwise similarity scores were studied in Publication I and in [201], [138], [202].

In Publication I the author of this thesis proposed a new clustering method suitable for input data represented by a similarity matrix. The proposed method is based on a generative probabilistic model of similarity matrices which approximates the distributions of scores by Gaussians. This model can be seen as a modification of the stochastic block model from [203], designed to detect communities in networks represented by undirected graphs. The stochastic block model [204, 205] is a model for network data in which each node of the graph is a member of a single cluster (community), and the probability of an edge occurring between two individual nodes depends solely on their cluster membership. Given a graph, the problem of community detection is to recover the cluster labels and also to determine the number of clusters. The authors of [203] addressed this problem by modifying the infinite relational model (IRM) [206] so that the following assumption would be satisfied: the number of edges connecting nodes of the same cluster should be relatively larger than the number of edges connecting nodes of different clusters. Similar to the IRM, the model in [203] assumes a Chinese restaurant process (CRP) [164] prior over partitions of nodes to automatically infer the number of clusters.

The model proposed in Publication I can be seen as an alternative to the one in [203] for real-valued data. While in [203] the input data is the adjacency matrix of an undirected graph with binary elements modeled by Bernoulli distributions, in Publication I the input data is a similarity matrix whose elements (scores) are modeled by Gaussian distributions.

To describe the model in Publication I, let us assume that the input data is represented by a similarity matrix D of size N×N with entries {dij}, i, j = 1, ..., N, where dij denotes a similarity score for the i-th and j-th segments. Assuming K non-overlapping clusters in the underlying but unobserved objects, this matrix includes K(K−1)/2 groups of between-cluster (non-target) scores and K groups of within-cluster (target) scores.

If one rearranges the rows and columns of D according to the cluster labels, the similarity matrix has a block structure in which the block (l, m) contains scores for all pairs of objects belonging to the l-th and m-th clusters. The diagonal blocks then contain the target scores, and the off-diagonal blocks contain the non-target scores.

In the proposed model, the distribution for the elements of the similarity matrix is defined as

\[
p(d_{ij}\,|\,Z, \theta) = \prod_{l \le m} p(d_{ij}\,|\,\theta_{lm})^{\,z_i^l z_j^m + z_i^m z_j^l}, \tag{5.86}
\]

where θlm are the parameters of the score distribution for block (l, m) and zi denotes the one-hot encoding of the cluster label of the i-th segment. The parameters θkk correspond to the diagonal blocks, that is, to the target scores, and the parameters θlm with l ≠ m correspond to the non-target scores. Since the similarity matrix is assumed to be symmetric, the exponent z_i^l z_j^m + z_i^m z_j^l reflects the two possibilities for the score dij to be in the (l, m)-th block: either the i-th segment belongs to the l-th cluster and the j-th segment belongs to the m-th cluster, or vice versa. In principle, this model does not restrict the possible form of the score distributions, but, since it is designed for real-valued PLDA scores, the Gaussian distribution is the most natural choice. For instance, the parameters θlm = {µlm, σ²lm} include the mean µlm and the variance σ²lm of the Gaussian representing the distribution of scores for block (l, m).

In summary, there is an individual distribution for each pair of clusters and one distribution corresponding to each block of within-class scores. Thus, the model can be viewed as a GMM with K(K+1)/2 = K(K−1)/2 + K components. However, Publication I suggests using a simplified model where the pooled target scores are modeled by a single Gaussian. As can be seen in Figure 5.5, the same assumption for the non-target scores is unlikely to be correct.
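A hedged sketch of evaluating the log-likelihood log p(D|Z) under this simplified Gaussian block model for a given labelling: the pooled target scores share one Gaussian, every pair of distinct clusters keeps its own non-target Gaussian, and the Gaussian parameters are plugged in as maximum-likelihood estimates. Publication I instead places prior distributions on these parameters, so this is an illustration of the model structure only.

```python
import numpy as np
from scipy.stats import norm

def block_model_loglik(D, labels):
    """Log-likelihood of a symmetric similarity matrix D given cluster labels (one label per segment)."""
    labels = np.asarray(labels)
    n = len(labels)
    iu = np.triu_indices(n, k=1)                       # use each score d_ij once (i < j)
    scores, li, lj = D[iu], labels[iu[0]], labels[iu[1]]

    loglik = 0.0
    target = scores[li == lj]                          # pooled within-cluster (target) scores
    if len(target) > 1:
        loglik += norm.logpdf(target, target.mean(), target.std() + 1e-6).sum()
    for a in np.unique(labels):                        # one Gaussian per pair of distinct clusters
        for b in np.unique(labels):
            if a < b:
                blk = scores[((li == a) & (lj == b)) | ((li == b) & (lj == a))]
                if len(blk) > 1:                       # blocks with a single score are skipped here
                    loglik += norm.logpdf(blk, blk.mean(), blk.std() + 1e-6).sum()
    return loglik
```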

Figure 5.5: An example of a similarity matrix extracted from a conversation between four speakers. The (i, j)-th entry of the matrix corresponds to a similarity score between the i-th and the j-th segments. Each segment has a length of 0.5 seconds. The speech segments are sorted according to the ground-truth speaker labels so that the diagonal blocks correspond to target scores.

Similar to [203], the key assumption behind the proposed model is that the within-cluster scores should, on average, have larger values than the between-cluster scores. This assumption can be encoded by introducing the following hard constraints on the Gaussian means: ∀m: µ_mk = µ_km < µ_k. However, the soft constraints imposed by introducing appropriate prior distributions on the parameters of the Gaussians were found to be sufficient (see Publication I for details).

Finally, since the number of clusters is unknown in general, Publication I adopts Dirichlet processes (DPs) [207] to allow a theoretically unbounded number of clusters. The DP is a distribution over discrete probability measures; that is, distributions drawn from a DP are countable convex combinations of point masses. Since mixture models can be parameterized by discrete measures, the DP enables a natural way to construct an "infinite" alternative to the finite mixture model [208]. Generally, discrete random measures induce a distribution on partitions and can be characterized in terms of the exchangeable random partitions they imply. Let Z denote an exchangeable random partition of the set {1, ..., N}, that is, a collection of disjoint non-empty blocks {B1, ..., BK} with ∑_{k=1}^{K} |Bk| = N. Then, the probability p(Z = {B1, ..., BK}) depends only on the number and sizes of the blocks Bk, regardless of their order. Therefore, there exists a function ρ that is symmetric in its arguments such that, for any specific partition Z = {B1, ..., BK}, one has p(Z = {B1, ..., BK}) = ρ(|B1|, ..., |BK|) = ... = ρ(|BK|, ..., |B1|). The function ρ is called an exchangeable partition probability function (EPPF). For Dirichlet processes, the EPPF is available in closed form. Specifically, the Chinese restaurant process [164] with scalar parameter α is the EPPF for a DP parameterized by a concentration parameter α. In [207], it was shown that a DP induces a distribution over partitions such that

\[
p(Z = \{B_1, \ldots, B_K\}) = \frac{\alpha^K\, \Gamma(\alpha)}{\Gamma(\alpha + N)} \prod_{k=1}^{K} \Gamma(|B_k|), \tag{5.87}
\]

where Γ(·) denotes the Gamma function [209].
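In code, the log of this partition prior reduces to a few log-gamma evaluations over the cluster sizes; a small sketch using SciPy:

```python
import numpy as np
from scipy.special import gammaln

def log_crp_prior(labels, alpha=1.0):
    """Log of the DP-induced partition prior (5.87) for a labelling of N segments into K clusters."""
    sizes = np.unique(np.asarray(labels), return_counts=True)[1]
    K, N = len(sizes), len(labels)
    return K * np.log(alpha) + gammaln(alpha) - gammaln(alpha + N) + gammaln(sizes).sum()
```

Combined with a likelihood such as the block-model sketch above, this gives the unnormalized log-posterior log p(D|Z) + log p(Z) that is maximized in (5.88) below.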

In Publication I, this distribution was used as the prior for partitions of speakers in the speaker diarization problem. The optimal partition was estimated as in (5.85):

\[
\hat{Z} = \arg\max_{Z}\, p(Z|D) = \arg\max_{Z}\, p(D|Z)\, p(Z). \tag{5.88}
\]

One of the commonly adopted strategies (e.g. [42]) for solving this discrete optimization problem is based on the mean-field approximation and variational inference (see Section 2.4). The idea is to replace the joint posterior distribution p(Z, θ|D) with a family of factorized distributions

\[
p(Z, \theta|D) \approx q(Z, \theta) = q(Z)\, q(\theta) = q(\theta) \prod_{i=1}^{N} q(z_i). \tag{5.89}
\]

This allows the original objective function to be approximated by another one that