

4.2.2 Speaker diarization

There are two major classes of speaker diarization methods that differ in their back-ends [107]. Methods from the first class consist of two steps: (1) speaker change point detection (CPD) and (2) speaker clustering. As suggested by its name, the first step splits the conversation into smaller segments at the points where the speaker changes.

These segments are assumed to contain the speech of only one speaker since they are bounded by speaker change points. The second step, speaker clustering, assigns labels to these segments according to speaker identity.

It should be noted that the segments bounded by change points generally have different lengths depending on the sensitivity of the detector. A high false alarm rate, i.e. when a change point detector splits speech of the same speaker into several parts, would lead to a large number of short segments. However, if a segment is too short, speaker characteristics cannot be extracted accurately, which would yield unstable results after the clustering step. Therefore, in practice, there is a trade-off between two objectives: producing speaker-homogeneous segments that are as long as possible (which simplifies the subsequent clustering task), and avoiding segments that contain more than one speaker since this leads to incorrect results. As one can see, the latter objective is more important: the presence of unnecessary splits makes the clustering task more difficult, while missing a speaker change point leads to inevitable mistakes.

The traditional approach to speaker change detection is formulated in terms of a binary hypothesis testing framework. Hypothesis testing is performed on a pair of consecutive segments across the whole audio stream in a sliding-window fashion and returns a sequence of speaker change points. For each hypothetical change point, there are two possible hypotheses. The first states that both segments come from the same speaker, and the second states that the segments come from two different speakers. Assuming that each speaker is characterized by a probability distribution in a feature space, the test can be formulated as deciding whether two sets of feature vectors originate from the same or from two different distributions.

The most popular speaker change detection methods are based on the Bayesian Information Criterion (BIC) [87, 143, 144]. The standard Bayesian solution to a binary hypothesis testing problem is to represent both hypotheses as parametric probability models and to compute the Bayes factor (or the ratio of marginal likelihoods) for one against the other. In practice, the BIC score or its modifications are used to approximate the marginal likelihoods in the Bayes factor. Popular alternatives to the BIC score include the generalized likelihood ratio (GLR) [145], the Kullback–Leibler (KL) divergence [146], and VQ distortion [147].
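To make the BIC-based test concrete, the following is a minimal sketch of a ΔBIC score for a single candidate change point, assuming full-covariance Gaussian models for the two hypotheses. It is not necessarily the exact formulation used in [87, 143, 144]; the function name, the penalty weight `lam`, and the decision rule "ΔBIC > 0 favours a change" are illustrative assumptions.

```python
import numpy as np

def delta_bic(X, Y, lam=1.0):
    """Approximate Bayes-factor test for a speaker change between feature
    matrices X and Y (frames x dims), using full-covariance Gaussians and
    the BIC penalty. Positive values favour a speaker change."""
    Z = np.vstack([X, Y])
    n_x, n_y, n_z = len(X), len(Y), len(Z)
    d = Z.shape[1]

    def logdet_cov(A):
        # log-determinant of the ML covariance estimate of A
        cov = np.cov(A, rowvar=False, bias=True)
        return np.linalg.slogdet(cov)[1]

    # Log-likelihood gain of modelling X and Y separately rather than jointly
    gain = 0.5 * (n_z * logdet_cov(Z)
                  - n_x * logdet_cov(X)
                  - n_y * logdet_cov(Y))
    # BIC penalty for the extra Gaussian (mean plus full covariance)
    penalty = 0.5 * lam * (d + 0.5 * d * (d + 1)) * np.log(n_z)
    return gain - penalty

# Sliding-window usage: test each hypothetical change point t inside a
# window of frame-level features `feats` (frames x dims).
feats = np.random.randn(200, 13)          # placeholder MFCC-like features
scores = [delta_bic(feats[:t], feats[t:]) for t in range(50, 150)]
```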

Methods from the second class, known as fixed-length segmentation [107], do not use a change-point detector. Rather, they operate on short segments of equal length, typically ranging from 50 ms to 500 ms. This approach is based on the observation that the duration of a phrase in human speech is relatively large with respect to the segment length [148]. Hence, the majority of segments will contain the speech of only one speaker. As before, however, segments that are too short do not allow speaker characteristics to be extracted accurately. On the other hand, increasing the segment length would result in a higher fraction of segments containing speech of more than one speaker.

Unlike the aforementioned two-step approach based on CPD, methods in this class include only one step: speaker clustering. Thus, this approach can be seen as a special case of the CPD-based scheme in which there are many falsely detected change points. In addition, each short segment is often represented by a single feature vector, e.g. by an i-vector in [43, 149, 150] or by averaged frame-level feature vectors [148]. This is also the case for the speaker diarization system proposed in Publication I.

Finally, given feature vectors corresponding to presumably speaker-homogeneous segments, a clustering algorithm is used to partition these vectors into groups according to speaker identity.
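As an illustration, the following is a minimal sketch of such a fixed-length segmentation and clustering pipeline, not the exact system of Publication I: frame-level features are cut into equal-length segments, each segment is represented by its averaged frame vector, and the segment representations are grouped with agglomerative hierarchical clustering. The segment length, distance metric, and clustering threshold are illustrative assumptions.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

def diarize_fixed_length(feats, seg_len=50, distance_threshold=0.3):
    """Cluster fixed-length segments of a (frames x dims) feature matrix.

    Each segment is represented by its mean frame vector; segments are
    grouped with average-linkage agglomerative clustering on cosine
    distances. Returns one speaker label per segment."""
    n_segs = len(feats) // seg_len
    segments = feats[: n_segs * seg_len].reshape(n_segs, seg_len, -1)
    embeddings = segments.mean(axis=1)            # one vector per segment

    Z = linkage(embeddings, method="average", metric="cosine")
    labels = fcluster(Z, t=distance_threshold, criterion="distance")
    return labels

# Usage with placeholder features (e.g. MFCCs at 100 frames per second,
# so seg_len=50 corresponds to 0.5-second segments).
feats = np.random.randn(3000, 20)
labels = diarize_fixed_length(feats)
```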

4.3 PERFORMANCE EVALUATION

4.3.1 Speaker verification

Since an automatic speaker verification (ASV) system is basically a binary classifier intended to either accept or reject an identity claim, misses and false alarms are the two types of errors that characterize a system. A miss occurs if a target identity is mistakenly rejected, and a false alarm occurs if a non-target identity is mistakenly accepted. In order to quantify the performance of such a recognizer, one needs a set of labeled verification trials. In general, a verification trial is composed of two sets of speech segments, such that all segments in each set belong to a single speaker. If each trial is associated with a binary label corresponding to the target or the non-target hypothesis, then it is possible to estimate the probabilities of misses and false alarms. The probability of miss, P_miss, is estimated as the ratio of the number of falsely rejected target speakers to the total number of target trials. Similarly, the probability of false alarm, P_fa, is estimated as the ratio of the number of falsely accepted impostor speakers to the total number of non-target trials. Typically, a hard decision of the recognizer is made by comparing a continuous similarity score against a decision threshold. By moving the threshold, the miss and false alarm probabilities change in opposing directions.
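For concreteness, the following is a minimal sketch of estimating P_miss and P_fa from a list of trial scores and binary target labels at a fixed threshold; the array names and the example threshold are illustrative assumptions.

```python
import numpy as np

def miss_and_fa_rates(scores, is_target, threshold):
    """Empirical P_miss and P_fa for hard decisions score >= threshold.

    scores    -- similarity score of each trial
    is_target -- True for target trials, False for non-target trials
    """
    scores = np.asarray(scores)
    is_target = np.asarray(is_target, dtype=bool)
    accept = scores >= threshold

    p_miss = np.mean(~accept[is_target])   # rejected target trials
    p_fa = np.mean(accept[~is_target])     # accepted non-target trials
    return p_miss, p_fa

# Example: toy scores for 4 target and 4 non-target trials.
scores = [2.1, 1.7, 0.4, 3.0, -0.5, 0.9, -1.2, 0.1]
labels = [True, True, True, True, False, False, False, False]
print(miss_and_fa_rates(scores, labels, threshold=1.0))  # (0.25, 0.0)
```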

Given a fixed decision threshold, τ, the so-called detection cost function (DCF) [151] can be used as a single performance measure of a speaker verification system.

It is defined as a weighted sum of the miss and false alarm probabilities:

DCF(τ) = π C_miss P_miss(τ) + (1 − π) C_fa P_fa(τ),    (4.1)

where C_miss and C_fa are the relative costs of detection errors and π is the prior probability of the target hypothesis. As one can see, the DCF is essentially an empirical estimate of the expected loss (2.1) of a binary classifier with a loss function determined by the values of C_miss and C_fa. The DCF was endorsed by the National Institute of Standards and Technology (NIST) as a primary metric for evaluating ASV systems [152]. Another performance measure used by NIST is the minimum possible DCF, found by setting the decision threshold optimally for a given evaluation set:

minDCF = min_τ DCF(τ).    (4.2)

To make the DCF independent of the choice of absolute values of C_miss and C_fa, it is normalized by C_norm, the best a priori cost that can be obtained by accepting or rejecting all trials, whichever is smaller: C_norm = min(π C_miss, (1 − π) C_fa).

The equal error rate (EER) is another widely used criterion to compare the performance of speaker verification systems. It corresponds to the specific decision threshold at which the false alarm probability equals the miss probability; the EER is this common value of the two error rates. Formally, it can be defined as

EER = P_miss(τ) = P_fa(τ),    (4.3)

where

τ = arg min_τ |P_miss(τ) − P_fa(τ)|.    (4.4)

In addition to scalar measures such as the DCF or the EER, the detection error trade-off (DET) curve [153] is often used to illustrate the discriminative ability of ASV systems. The DET curve is the curve of the miss probability, P_miss, plotted against the false alarm probability, P_fa, across different decision thresholds. The EER can be found at the intersection of the DET curve and the diagonal defined by the equation P_miss = P_fa.
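The following is a minimal sketch of how the EER and the minimum (normalized) DCF can be estimated from a set of scored trials by sweeping the decision threshold over the observed scores; the cost and prior values (C_miss = C_fa = 1, π = 0.05) are illustrative assumptions, not the NIST-specified ones.

```python
import numpy as np

def error_curves(scores, is_target):
    """P_miss and P_fa as functions of a threshold swept over the scores."""
    scores = np.asarray(scores, dtype=float)
    is_target = np.asarray(is_target, dtype=bool)
    thresholds = np.sort(np.unique(scores))
    p_miss = np.array([np.mean(scores[is_target] < t) for t in thresholds])
    p_fa = np.array([np.mean(scores[~is_target] >= t) for t in thresholds])
    return thresholds, p_miss, p_fa

def eer_and_min_dcf(scores, is_target, c_miss=1.0, c_fa=1.0, prior=0.05):
    thresholds, p_miss, p_fa = error_curves(scores, is_target)
    # EER: operating point where the two error rates are (nearly) equal
    i = np.argmin(np.abs(p_miss - p_fa))
    eer = 0.5 * (p_miss[i] + p_fa[i])
    # Minimum DCF over all thresholds, normalized by C_norm
    dcf = prior * c_miss * p_miss + (1.0 - prior) * c_fa * p_fa
    c_norm = min(prior * c_miss, (1.0 - prior) * c_fa)
    return eer, dcf.min() / c_norm

# Toy usage: well-separated target and non-target score distributions.
scores = np.concatenate([np.random.randn(1000) + 2.0,   # target trials
                         np.random.randn(1000)])         # non-target trials
labels = np.concatenate([np.ones(1000, bool), np.zeros(1000, bool)])
print(eer_and_min_dcf(scores, labels))
```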

4.3.2 Speaker diarization

The commonly adopted definition of the speaker diarization task includes both speech activity detection and speaker segmentation as subtasks. Formally, a speaker diarization system aims at annotating an input audio signal with information that attributes temporal regions of the signal to their respective labels, including the non-speech label. For speech regions, the diarization system assigns relative speaker labels to each homogeneous segment of speech. As a result, the primary performance measure for speaker diarization systems, the diarization error rate (DER), is defined as a sum of errors coming from different sources. These errors include:

• E_miss (missed speech): the percentage of time that a speech segment is labeled as “non-speech”.

• E_fa (false alarm speech): the percentage of time that a speaker ID is assigned to a non-speech region.

• E_spk (speaker error): the percentage of time that a speaker ID is assigned to the wrong speaker.

Given these errors, the DER is defined as

DER = E_miss + E_fa + E_spk.    (4.5)

Depending on how the one-to-one mapping between reference and hypothesized speakers is defined, the DER can be computed in either an optimal or a greedy manner [154]. The optimal version uses the so-called Hungarian algorithm [155] (also known as Munkres' assignment algorithm) to compute the mapping that minimizes the speaker error term, E_spk, while the greedy version refines the mapping iteratively, processing reference–hypothesis speaker pairs in order of decreasing co-occurrence duration. The greedy version is much faster than the optimal one, but it may slightly over-estimate the value of the DER.
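As an illustration of the optimal mapping, the following is a minimal frame-level sketch of DER computation (ignoring the forgiveness collar discussed below) in which the Hungarian algorithm, via scipy's linear_sum_assignment, finds the speaker mapping that minimizes E_spk. The label conventions are illustrative assumptions, and the errors are normalized by the total reference speech duration, which is a common convention.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

NON_SPEECH = -1  # frame label used for non-speech in both annotations

def frame_level_der(ref, hyp):
    """DER from per-frame labels: speaker IDs are >= 0, non-speech is -1."""
    ref, hyp = np.asarray(ref), np.asarray(hyp)
    total_speech = np.sum(ref != NON_SPEECH)

    e_miss = np.sum((ref != NON_SPEECH) & (hyp == NON_SPEECH)) / total_speech
    e_fa = np.sum((ref == NON_SPEECH) & (hyp != NON_SPEECH)) / total_speech

    # Co-occurrence durations between reference and hypothesized speakers
    ref_ids = sorted(set(ref[ref != NON_SPEECH]))
    hyp_ids = sorted(set(hyp[hyp != NON_SPEECH]))
    counts = np.zeros((len(ref_ids), len(hyp_ids)))
    for i, r in enumerate(ref_ids):
        for j, h in enumerate(hyp_ids):
            counts[i, j] = np.sum((ref == r) & (hyp == h))

    # Optimal one-to-one mapping via the Hungarian (Munkres) algorithm
    rows, cols = linear_sum_assignment(-counts)
    correct = counts[rows, cols].sum()
    both_speech = np.sum((ref != NON_SPEECH) & (hyp != NON_SPEECH))
    e_spk = (both_speech - correct) / total_speech

    return e_miss + e_fa + e_spk

ref = [0, 0, 0, 1, 1, -1, -1, 1]   # reference speaker per frame
hyp = [5, 5, 7, 7, 7, 7, -1, 7]    # hypothesized speaker per frame
print(frame_level_der(ref, hyp))   # 0 miss + 1/6 fa + 1/6 spk ≈ 0.333
```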

In practice, when evaluating performance, a forgiveness collar is set before and after each reference boundary. It accounts for the inexact nature of the labeling process performed by human annotators. A collar of 0.25 sec, suggested by NIST in the rich transcription (RT) evaluations [156], is typically used to compute DER in speaker diarization experiments.

The speech activity detection error and the speaker clustering error are not individually distinguishable given a value of the DER. Therefore, many speaker diarization systems address these two tasks separately. First, SAD is used to obtain the boundaries of speech segments. Second, these segments are split into homogeneous segments according to the speaker identity. This is the case for all diarization systems described in this thesis. Therefore, the term “speaker diarization” will be used to refer to speaker segmentation, assuming that speech and non-speech regions have already been resolved by a SAD.

5 SPEAKER RECOGNITION AND DIARIZATION — A MODERN MACHINE LEARNING PERSPECTIVE

5.1 I-VECTOR FEATURES

During the past decade, i-vectors have been the most widely used features in speaker recognition research and application. The i-vector extractor [130] is a generative probabilistic model for sequences of frame-level feature vectors, X = {x_1, ..., x_N}, x_i ∈ R^d, representing a single speech utterance. According to the i-vector model, every sequence is assumed to have been generated independently by a different Gaussian mixture model (GMM) with K components:

p(x) = ∑_{k=1}^{K} π_k N(x | µ_k, Σ_k) = GMM(x | {π_k, µ_k, Σ_k}_{k=1}^{K}),    (5.1)

where π_k, µ_k and Σ_k denote the mixing weight, the d-dimensional mean vector and the d × d covariance matrix of the k-th Gaussian component, respectively. That is, feature sequences corresponding to different utterances are generated by GMMs with different mean vectors (but shared weights and covariances), parameterized by the R-dimensional vector φ, known as an i-vector, in the following way:

µ_k = m_k + T_k φ.    (5.2)

Here, the vectors {m_k}_{k=1}^{K} and the matrices {T_k}_{k=1}^{K} are parameters of the model.

They are trained off-line on a large speech corpus and held fixed when an i-vector is extracted from a new utterance. A toy illustration of the i-vector model is presented in Figure 5.1.
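To make the generative story concrete, the following is a minimal numpy sketch of sampling frame-level features from the i-vector model of (5.1)–(5.2), in the spirit of Figure 5.1; the dimensions and parameter values are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

K, d, R, N = 3, 2, 1, 500    # components, feature dim, i-vector dim, frames

# Shared (utterance-independent) model parameters
weights = np.full(K, 1.0 / K)              # π_k, shared mixing weights
m = rng.normal(size=(K, d))                # m_k, component offsets
T = rng.normal(size=(K, d, R))             # T_k, subspace matrices
cov = np.stack([0.1 * np.eye(d)] * K)      # Σ_k, shared covariances

def sample_utterance(n_frames):
    """Draw an i-vector, form the utterance GMM, and sample frames from it."""
    phi = rng.normal(size=R)                   # φ ~ N(0, I)
    means = m + T @ phi                        # µ_k = m_k + T_k φ, shape (K, d)
    comps = rng.choice(K, size=n_frames, p=weights)    # latent components z_i
    frames = np.array([rng.multivariate_normal(means[k], cov[k]) for k in comps])
    return phi, frames

phi, X = sample_utterance(N)    # X plays the role of {x_1, ..., x_N}
```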

The i-vector extractor computes a maximum a posteriori (MAP) estimate of the latent vector φ by specifying the standard normal prior on the i-vectors, p(φ) = N(φ | 0, I):

φ = arg max_φ p(X | φ) p(φ)    (5.3)

  = arg max_φ N(φ | 0, I) ∏_{i=1}^{N} GMM(x_i | {π_k, m_k + T_k φ, Σ_k}_{k=1}^{K})    (5.4)

  = arg max_φ [ log N(φ | 0, I) + ∑_{i=1}^{N} log ∑_{k=1}^{K} π_k N(x_i | m_k + T_k φ, Σ_k) ].    (5.5)

Unfortunately, this optimization problem lacks a closed-form solution and requires an iterative, and therefore slow, algorithm (e.g. gradient descent) to find a local maximum. Therefore, in practice, an alternative approach is usually adopted. This approach is based on formulating the GMM as a latent variable model and using variational Bayesian inference to approximate the posterior distribution of the latent variables.
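As a toy illustration of this direct (and in practice avoided) optimization, the following sketch maximizes the log-posterior (5.5) numerically with scipy for a small synthetic model; the parameter shapes, the random stand-in data, and the use of L-BFGS are illustrative assumptions.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import multivariate_normal

rng = np.random.default_rng(1)
K, d, R, N = 3, 2, 1, 200

weights = np.full(K, 1.0 / K)
m = rng.normal(size=(K, d))
T = rng.normal(size=(K, d, R))
cov = np.stack([0.1 * np.eye(d)] * K)
X = rng.normal(size=(N, d))            # stand-in for real frame-level features

def neg_log_posterior(phi):
    """Negative of the objective in (5.5), up to additive constants."""
    means = m + T @ phi                          # µ_k = m_k + T_k φ
    # log-sum-exp over components for every frame
    comp_ll = np.stack([multivariate_normal.logpdf(X, means[k], cov[k])
                        + np.log(weights[k]) for k in range(K)])
    frame_ll = np.logaddexp.reduce(comp_ll, axis=0).sum()
    log_prior = -0.5 * phi @ phi                 # log N(φ | 0, I) + const
    return -(log_prior + frame_ll)

# Iterative MAP estimate; slow compared with the standard non-iterative
# extractor discussed below.
phi_map = minimize(neg_log_posterior, x0=np.zeros(R), method="L-BFGS-B").x
```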

Figure 5.1: Illustration of the i-vector model. The process of generating d-dimensional observations, x_i, involves, first, drawing an R-dimensional vector φ (i-vector) from the standard multivariate normal distribution (first row), then using the i-vector to compute the means of the utterance-specific GMM with shared mixture weights and covariances, and, finally, obtaining samples from the resulting GMM (last row). Here, i-vectors of dimension R = 1 are used to define a GMM with K = 3 components. Thus, each mean vector belongs to a one-dimensional affine subspace (visualized by straight lines) of the (d = 2)-dimensional observation space. These subspaces are determined by the parameters of the i-vector model, namely m_k and T_k. As a result, different i-vectors define different distributions of the observed feature vectors. Here, dotted ellipses represent Gaussian components of GMMs with means m_k, while solid ellipses correspond to Gaussians with means µ_k.

Following (2.43), let us re-write the i-vector model as

p(φ) = N(φ | 0, I)    (5.6)

p(z_i) = Cat(π)    (5.7)

p(x_i | z_i, φ) = ∏_{k=1}^{K} N(x_i | m_k + T_k φ, Σ_k)^{z_ik},    (5.8)

where z_i is a one-hot vector representing the (unknown) assignment of the observed feature vector x_i to one of the K GMM components. Let the whole sequence of component labels be denoted as Z = {z_1, ..., z_N}. Then, the so-called mean-field variational approximation [93] to the posterior distribution,

p(φ, Z | X) ≈ q(φ) q(Z) = q(φ) ∏_{i=1}^{N} q(z_i),    (5.9)

can be obtained by an iterative scheme alternating between the following updates [157]:

log q(Z) = const + ∫ q(φ) log p(X, Z, φ) dφ,

log q(φ) = const + ∑_Z q(Z) log p(X, Z, φ),

until convergence. Finally, an estimate of the i-vector can be obtained as

φ = arg max_φ q(φ). In practice, however, even this algorithm is not used due to the vastly increased computational overhead relative to the theoretical benefit in overall performance of a speaker recognition system. The commonly adopted i-vector extractor [130] approximates this scheme by using a non-iterative shortcut:

1. log q(Z) = const + log p(X, Z, φ = 0),

2. log q(φ) = const + ∑_Z q(Z) log p(X, Z, φ),

3. φ = arg max_φ q(φ).

This i-vector extractor is trained on a large set of unlabeled speech utterances in an unsupervised fashion using the EM-algorithm. While the principled approach suggests estimating all of the parameters jointly, the strategy adopted in practice is to, first, train the GMM and, then, keep its parameters fixed and train the matrices T_k. One can see that this model reduces to the traditional factor analysis (2.49) if the GMM consists of a single Gaussian with a diagonal covariance matrix.
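For reference, the following is a minimal numpy sketch of the closed-form point estimate commonly associated with the above shortcut: the i-vector is computed from zero- and first-order (Baum–Welch) statistics collected against the fixed GMM. The derivation is not given here, so treat this as an illustrative implementation under those assumptions, with diagonal covariances and hypothetical variable names.

```python
import numpy as np

def extract_ivector(N_stats, F_stats, m, T, sigma2):
    """Point estimate of the i-vector from per-component statistics.

    N_stats -- zero-order statistics, shape (K,): soft frame counts
    F_stats -- first-order statistics, shape (K, d): soft sums of frames
    m       -- GMM means m_k, shape (K, d)
    T       -- subspace matrices T_k stacked, shape (K, d, R)
    sigma2  -- diagonal covariances of the GMM, shape (K, d)
    """
    K, d, R = T.shape
    precision = np.eye(R)                 # prior term from N(φ | 0, I)
    rhs = np.zeros(R)
    for k in range(K):
        Tk_scaled = T[k] / sigma2[k][:, None]          # Σ_k^{-1} T_k
        precision += N_stats[k] * T[k].T @ Tk_scaled   # + N_k T_k' Σ_k^{-1} T_k
        f_centered = F_stats[k] - N_stats[k] * m[k]    # centered first-order stats
        rhs += Tk_scaled.T @ f_centered                # + T_k' Σ_k^{-1} (F_k - N_k m_k)
    return np.linalg.solve(precision, rhs)             # posterior mean of φ

# Toy usage with random statistics (in practice they come from aligning the
# frames of one utterance against the GMM).
rng = np.random.default_rng(2)
K, d, R = 8, 20, 4
phi = extract_ivector(N_stats=rng.uniform(1, 50, size=K),
                      F_stats=rng.normal(size=(K, d)),
                      m=rng.normal(size=(K, d)),
                      T=rng.normal(size=(K, d, R)),
                      sigma2=np.ones((K, d)))
```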

At first, i-vector features were proposed in the context of the speaker recognition problem to represent long utterances [130]. Later, they were found to be useful for speaker diarization [149] and speech activity detection [158] as well, where one operates with short segments, typically of a length less than 1 second. In particular, this approach is adopted in the speaker diarization system proposed in Publication I.