
The simplest template model is no model at all [93, 47]. In other words, the features extracted in the training phase serve as the template for the speaker. Although this retains the largest amount of information, it can lead to excessive matching times and to overfitting. For this reason, it is common to reduce the number of test vectors by clustering, such as K-means [129]. An even simpler approach is to represent the speaker by a single mean vector [139].

In the following, the test template is denoted as X = {x_1, . . . , x_T} and the reference template as R = {r_1, . . . , r_K}. The theory of vector quantization (VQ) [69] can be applied in template matching. The average quantization distortion of X, using R as the quantizer, is defined as

D_Q(X,R) = \frac{1}{T} \sum_{t=1}^{T} \min_{1 \le k \le K} d(x_t, r_k),   (2.2)

where d(·,·) is a distance measure for vectors, e.g. the Euclidean distance or some measure tailored for a certain type of features (see [178]). In [36], the nearest neighbor distance is replaced by the minimum distance to the projection between all vector pairs, and an improvement was obtained, especially for small template sizes. Soft quantization (or fuzzy VQ) has also been used [208, 207].
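To make the computation concrete, the following is a minimal NumPy sketch of (2.2), assuming the Euclidean distance for d(·,·); the function name avg_quantization_distortion is ours, not taken from the cited works.

```python
import numpy as np

def avg_quantization_distortion(X, R):
    """Average quantization distortion D_Q(X, R) of Eq. (2.2).

    X : (T, d) array of test vectors.
    R : (K, d) array of reference (template) vectors.
    The Euclidean distance is used as d(., .).
    """
    # Pairwise distances between all test and reference vectors, shape (T, K).
    dists = np.linalg.norm(X[:, None, :] - R[None, :, :], axis=2)
    # For each test vector, keep the distance to its nearest reference vector.
    return dists.min(axis=1).mean()
```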

For the vector distance d(·,·), weighted distance measures of the following form are commonly used:

d_W^2(x, y) = (x - y)^T W (x - y),   (2.3)

in which W is a weighting matrix used for variance normalization or for emphasizing discriminative features. The Euclidean distance is the special case in which W is the identity matrix. The Mahalanobis distance [50] is obtained from (2.3) when W is the inverse covariance matrix. The covariance matrix can be the same for all speakers or it can be speaker-dependent. In [180], the covariance matrix is partition-dependent. Diagonal covariance matrices are typically used for numerical reasons.
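As an illustration, a small sketch of (2.3) with three common choices of W follows; the variable names and the random placeholder feature matrix are ours.

```python
import numpy as np

def weighted_distance_sq(x, y, W):
    """Squared weighted distance d_W^2(x, y) = (x - y)^T W (x - y) of Eq. (2.3)."""
    diff = x - y
    return float(diff @ W @ diff)

# Common choices for the weighting matrix W, estimated from training features:
feats = np.random.randn(500, 12)                     # placeholder feature matrix
W_euc  = np.eye(feats.shape[1])                      # identity -> Euclidean distance
W_diag = np.diag(1.0 / feats.var(axis=0))            # diagonal variance normalization
W_maha = np.linalg.inv(np.cov(feats, rowvar=False))  # Mahalanobis distance
```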

2.6.1 Properties of D_Q

The dissimilarity measure (2.2) is intuitively reasonable: for each test vector, the nearest template vector is found and the minimum distances are summed. Thus, if most of the test vectors are close to reference vectors, the distance will be small, indicating high similarity. It is easy to show that D_Q(X,R) = 0 if and only if X ⊆ R, given that d is a distance function [107]. However, D_Q is not symmetric, because in general D_Q(X,R) ≠ D_Q(R,X), which raises the question of which template should be quantized with which.

Symmetrization of (2.2) was recently proposed in [107] by computing the asymmetric measures D_Q(X,R) and D_Q(R,X), and combining them using the sum, max, min, and product operators. The maximum and sum are the most attractive ones since they define a distance function. However, according to the experiments in [107], neither one could beat the nonsymmetric measure (2.2), which raises the suspicion of whether symmetrization is needed after all.
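The combination operators can be sketched as follows, reusing the avg_quantization_distortion function from the earlier sketch; this mirrors the combinations studied in [107], but the implementation details are ours.

```python
def symmetrized_distortion(X, R, op="max"):
    """Combine the two asymmetric distortions of Eq. (2.2), as studied in [107]."""
    d_xr = avg_quantization_distortion(X, R)   # X quantized with R
    d_rx = avg_quantization_distortion(R, X)   # R quantized with X
    return {
        "sum": d_xr + d_rx,
        "max": max(d_xr, d_rx),
        "min": min(d_xr, d_rx),
        "product": d_xr * d_rx,
    }[op]
```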

Our answer is conditional. In principle, the measure should intuitively be symmetric. However, due to imperfections in the measurement process, the features are not free from context, but contain mixed information about the speaker, the text, and other factors. In text-independent recognition, the asymmetry might be advantageous because of mismatched texts. However, there is experimental evidence in favor of symmetrization. Bimbot et al. [23] studied symmetrization procedures for monogaussian speaker modeling, and in the case of limited data for either modeling or matching, symmetrization was found to be useful. In [107], rather long training and test segments were used, which might explain the difference. Symmetrization deserves more attention.

2.6.2 Alternative Measures

Higgins et al. [93] have proposed the following dissimilarity measure:

D_H(X,R) = \frac{1}{T} \sum_{t=1}^{T} \min_{1 \le k \le K} d(x_t, r_k)^2 + \frac{1}{K} \sum_{k=1}^{K} \min_{1 \le t \le T} d(x_t, r_k)^2
           - \frac{1}{T} \sum_{t=1}^{T} \min_{k \ne t} d(x_t, x_k)^2 - \frac{1}{K} \sum_{k=1}^{K} \min_{t \ne k} d(r_t, r_k)^2,   (2.4)

in which d(·,·)^2 is the squared Euclidean distance. They also show that, under certain assumptions, the expected value of D_H is proportional to the divergence between the continuous probability distributions. Divergence is the total average information for discriminating one class from another, and can be considered as a "distance" between two probability distributions [41]. The first two sum terms in (2.4) correspond to cross-entropies and the last two terms to self-entropies.
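A direct NumPy transcription of (2.4) with squared Euclidean distances might look as follows; the function name and the vectorized layout are ours.

```python
import numpy as np

def higgins_distance(X, R):
    """Dissimilarity D_H(X, R) of Eq. (2.4), using squared Euclidean distances."""
    def sq_dists(A, B):
        # Matrix of squared Euclidean distances, shape (len(A), len(B)).
        return ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=2)

    cross  = sq_dists(X, R)            # d(x_t, r_k)^2
    self_x = sq_dists(X, X)            # d(x_t, x_k)^2
    self_r = sq_dists(R, R)            # d(r_t, r_k)^2
    np.fill_diagonal(self_x, np.inf)   # exclude k = t
    np.fill_diagonal(self_r, np.inf)   # exclude t = k

    return (cross.min(axis=1).mean()     # cross-entropy terms
            + cross.min(axis=0).mean()
            - self_x.min(axis=1).mean()  # self-entropy terms
            - self_r.min(axis=1).mean())
```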

Several other heuristic distance and similarity measures have been proposed [143, 91, 10, 111, 116]. Matsui and Furui [143] eliminate outliers and perform the matching in the intersecting region of X and R to increase robustness. In [91], the discrimination power of individual vectors is utilized. Each vector is matched against other speakers using a linear discriminant designed in the training phase to separate these two speakers. The discriminant values are then converted into votes, and the number of votes for the target serves as the match score.

Heuristic weighting utilizing the discriminatory information of the reference vectors was proposed in [111, 116]. In the training phase, a weight is determined for each reference vector, signifying its distance from the other speakers' vectors. Vectors that lie far from the other classes are given a higher contribution in the matching phase, where the weight of the nearest neighbor is retrieved and used in the dissimilarity [111] or similarity [116] measure.
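One possible realization of this idea is sketched below; the weight definition and the way the weight enters the dissimilarity are simplified illustrations and do not reproduce the exact formulas of [111] or [116].

```python
import numpy as np

def train_vector_weights(codebook, other_speaker_vectors):
    """Assign each code vector a weight based on its distance to the nearest
    vector of any other speaker (an illustrative stand-in for [111, 116])."""
    others = np.vstack(other_speaker_vectors)
    d = np.linalg.norm(codebook[:, None, :] - others[None, :, :], axis=2)
    return d.min(axis=1)   # far from the other classes -> large weight

def weighted_dissimilarity(X, codebook, weights):
    """Nearest-neighbor distances scaled by the retrieved weights, so that
    discriminative code vectors contribute more to the score."""
    d = np.linalg.norm(X[:, None, :] - codebook[None, :, :], axis=2)
    nearest = d.argmin(axis=1)
    return (weights[nearest] * d[np.arange(len(X)), nearest]).mean()
```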

2.6.3 Clustering

The size of the speaker template can be reduced by clustering [200]. The result of clustering is a codebook C of K code vectors, denoted as C = {c_1, . . . , c_K}. There are two design issues in codebook generation: (1) the method for generating the codebook, and (2) the size of the codebook.

A general and unsurprising result is that increasing the codebook size reduces recognition error rates [200, 58, 83, 116][P1]. A rule of thumb is to use a codebook of size 64–512 to model spectral parameters of dimensionality 10–50. If the codebook size is set too high, the model overfits the training data and the errors increase [202][P5]. Larger codebooks also increase the matching time. Usually the codebooks of all speakers are of equal size, but the sizes can also be optimized for each speaker [60].

The most well-known codebook generation algorithm is the generalized Lloyd algorithm (GLA) [129], also known as the Linde-Buzo-Gray (LBG) or the K-means algorithm depending on the context; the names will be used here interchangeably.

The algorithm locally minimizes the mean square error by starting from an initial codebook, which is iteratively refined in two successive steps until the codebook no longer changes. The codebook is initialized by selecting K disjoint random vectors from the training set.
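A compact sketch of this procedure with Euclidean distances is given below; the convergence test and the handling of empty partitions are our simplifications.

```python
import numpy as np

def train_codebook(train_vectors, K, max_iters=100, seed=0):
    """Generalized Lloyd / LBG / K-means codebook training (a sketch)."""
    rng = np.random.default_rng(seed)
    # Initialize with K distinct random training vectors.
    codebook = train_vectors[rng.choice(len(train_vectors), K, replace=False)].copy()
    for _ in range(max_iters):
        # Step 1: map each training vector to its nearest code vector.
        d = np.linalg.norm(train_vectors[:, None, :] - codebook[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # Step 2: replace each code vector by the centroid of its partition.
        new_codebook = np.array([
            train_vectors[labels == k].mean(axis=0) if np.any(labels == k)
            else codebook[k]                         # keep empty cells unchanged
            for k in range(K)
        ])
        if np.allclose(new_codebook, codebook):      # codebook no longer changes
            break
        codebook = new_codebook
    return codebook
```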

He et al. [83] proposed a discriminative codebook training algorithm. In this method, the codebooks are first initialized by the LBG algorithm, and the code vectors are then fine-tuned using the learning vector quantization (LVQ) principle [120]. In LVQ, individual vectors are classified using template vectors, and the template vectors are moved either towards (correct classification) or away from (misclassification) the tuning set vectors. In speaker recognition, the task is to classify a sequence of vectors rather than individual vectors. For this reason, He et al. modified LVQ so that a group of vectors is classified (using the average quantization distortion), and they call their method group vector quantization (GVQ). The code vectors are tuned as in standard LVQ. The GVQ method, when combined with the partition-normalized distance measure [180], was reported to give the best results among several VQ-based methods compared in [56].
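The following is a rough sketch of the GVQ idea under our own simplifying assumptions (a single tuning pass, a fixed learning rate, and an LVQ1-style update of the winning codebook only); the exact update rule and schedule in [83] may differ. It reuses avg_quantization_distortion and train_codebook from the earlier sketches.

```python
import numpy as np

def gvq_tune(codebooks, labeled_groups, alpha=0.05):
    """One pass of group vector quantization (GVQ) style tuning (a sketch).

    codebooks      : dict speaker_id -> (K, d) codebook, e.g. from train_codebook.
    labeled_groups : list of (true_speaker_id, (G, d) array) tuning groups.
    alpha          : learning rate (the schedule used in [83] may differ).
    """
    for true_spk, group in labeled_groups:
        # Classify the whole group by average quantization distortion.
        scores = {spk: avg_quantization_distortion(group, cb)
                  for spk, cb in codebooks.items()}
        winner = min(scores, key=scores.get)
        # LVQ-style update: move the nearest code vectors of the winning
        # codebook towards the group (correct) or away from it (incorrect).
        cb = codebooks[winner]
        sign = 1.0 if winner == true_spk else -1.0
        d = np.linalg.norm(group[:, None, :] - cb[None, :, :], axis=2)
        nearest = d.argmin(axis=1)
        for x, k in zip(group, nearest):
            cb[k] += sign * alpha * (x - cb[k])
    return codebooks
```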