Linear Prediction - Optimizing spectral feature based text-Independent speaker recognition

3.4.1 Linear Model of Speech Production

Linear prediction (LP) [137] is an alternative to the DFT for spectrum estimation.

LP can be considered a rough formulation of the source-filter theory of speech production [57]. The “filter” of the LP model is the transfer function of an all-pole model, consisting of a set of spectral peaks that are more or less related to the resonance structure of the vocal tract, as well as to the spectral properties of the excitation signal. The prediction residual signal represents temporal properties of the signal that are not captured by the all-pole model.

The linear speech production model is given in the time domain by the following equation [178]:

$$s[n] = \sum_{k=1}^{p} a_k\, s[n-k] + G\, u[n], \tag{3.2}$$

where s[n] is the observed signal, a_k are the predictor coefficients, u[n] is the source signal and G is the gain. The predictor equation of LP is given as follows:

$$\tilde{s}[n] = \sum_{k=1}^{p} a_k\, s[n-k]. \tag{3.3}$$

Equation (3.3) states that the current speech sample can be predicted from a linear combination of the past p samples, which is an intuitively reasonable assumption in the short term (within the analysis frame). The predictor coefficients a_k are determined so that the squared error is minimized:

$$\min_{(a_1,\ldots,a_p)} \left( s[n] - \sum_{k=1}^{p} a_k\, s[n-k] \right)^{2} \tag{3.4}$$

The coefficients are typically solved using the Levinson-Durbin algorithm [178, 94, 80]. A frequency-domain interpretation of the model (3.2) is obtained by taking Z-transforms of both sides of (3.2) and solving for the filter transfer function:

$$H(z) = \frac{S(z)}{U(z)} = \frac{G}{1 - \sum_{k=1}^{p} a_k z^{-k}}, \tag{3.5}$$

where S(z) and U(z) are the Z-transforms of s[n] and u[n], respectively. This is the transfer function of an all-pole filter. The poles are the roots of the denominator, and they correspond to local maxima in the spectrum. Examples of all-pole spectra (bold line) are shown in Fig. 3.5. The DFT spectrum (thin line) is shown for comparison.
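The Levinson-Durbin recursion mentioned above can be sketched as follows. This is a minimal illustration, assuming the autocorrelation method (the source does not say which variant is used) and the sign convention of (3.3); the function names `autocorrelation` and `levinson_durbin` are hypothetical, and sign conventions for the reflection coefficients differ between textbooks.

```python
def autocorrelation(frame, max_lag):
    """Unnormalized autocorrelation r[0..max_lag] of one analysis frame."""
    n = len(frame)
    return [sum(frame[i] * frame[i + lag] for i in range(n - lag))
            for lag in range(max_lag + 1)]


def levinson_durbin(r, p):
    """Solve the LP normal equations for the predictor coefficients.

    Assumes the convention of Eq. (3.3): s~[n] = sum_k a_k s[n-k].
    Returns (a, reflection, error): `a` is [a_1, ..., a_p], `reflection`
    holds one coefficient per recursion step, and `error` is the final
    prediction error energy.
    """
    a = []
    reflection = []
    error = r[0]
    for i in range(1, p + 1):
        # Part of r[i] that the order-(i-1) predictor does not explain.
        acc = r[i] - sum(a[k] * r[i - 1 - k] for k in range(i - 1))
        k_i = acc / error
        # Update the order-(i-1) coefficients and append the new one.
        a = [a[k] - k_i * a[i - 2 - k] for k in range(i - 1)] + [k_i]
        reflection.append(k_i)
        error *= 1.0 - k_i * k_i
    return a, reflection, error


# Autocorrelation of a first-order process, r[k] = 0.5**k: the second
# predictor coefficient should vanish.
a, reflection, error = levinson_durbin([1.0, 0.5, 0.25], 2)
# a == [0.5, 0.0], error == 0.75
```

In practice `r` would come from a windowed speech frame, for example `autocorrelation(frame, p)`. The recursion needs on the order of p² operations, and its intermediate values are reused by several LP-based features.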

[Figure 3.5: three panels showing LP orders p = 3, p = 10 and p = 30; x-axis: Frequency [Hz], 0 to 3000; y-axis: Magnitude [dB], −60 to 0; legend: FFT spectrum, all-pole spectrum.]

Figure 3.5: FFT versus all-pole spectrum estimation.
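An all-pole curve of the kind shown in Fig. 3.5 can be obtained by evaluating (3.5) on the unit circle. The sketch below is illustrative only: the function name `allpole_spectrum_db`, the sampling rate of 8000 Hz and the second-order coefficients are assumptions made for the example, not values from the source.

```python
import cmath
import math


def allpole_spectrum_db(a, gain, freqs_hz, fs):
    """Magnitude in dB of H(z) = G / (1 - sum_k a_k z^-k) at z = exp(jw)."""
    out = []
    for f in freqs_hz:
        w = 2.0 * math.pi * f / fs
        denom = 1.0 - sum(a_k * cmath.exp(-1j * w * (k + 1))
                          for k, a_k in enumerate(a))
        out.append(20.0 * math.log10(gain / abs(denom)))
    return out


# Hypothetical example: one pole pair at radius 0.95 and an angle
# corresponding to 1000 Hz, with an assumed sampling rate of 8000 Hz.
fs = 8000.0
r, theta = 0.95, 2.0 * math.pi * 1000.0 / fs
a = [2.0 * r * math.cos(theta), -r * r]
freqs = [10.0 * i for i in range(401)]           # 0 ... 4000 Hz
spectrum = allpole_spectrum_db(a, 1.0, freqs, fs)
peak_hz = freqs[spectrum.index(max(spectrum))]   # close to 1000 Hz
```

With a higher order p, the same function traces more peaks, which is the behaviour the three panels of Fig. 3.5 illustrate.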

It is interesting to note that when the actual process that generated the speech signal is close to (3.2), the prediction residual e[n] = s[n] − s̃[n] should be close to the scaled excitation signal G u[n]. Thus, the residual signal can be used for extracting voice-source related features. In speech recognition, the residual signal is considered noise, but it has been shown to contain some speaker-related information [206, 61, 175]. For instance, in [175], the fine structure of the glottal waveform is estimated by finding the parameters of a parametric glottal flow model.
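The relation between the residual and the excitation can be checked on synthetic data. The following sketch generates a signal from (3.2) with known coefficients and then inverse-filters it. The helper names `synthesize` and `lp_residual`, the impulse-train excitation and the zero initial conditions are all assumptions made for illustration.

```python
def synthesize(u, a, gain):
    """Generate s[n] from Eq. (3.2), assuming zero initial conditions."""
    s = []
    for n in range(len(u)):
        past = sum(a[k] * s[n - 1 - k] for k in range(min(len(a), n)))
        s.append(past + gain * u[n])
    return s


def lp_residual(s, a):
    """Prediction residual e[n] = s[n] - sum_k a_k s[n-k].

    Samples before the start of `s` are treated as zero (an assumption).
    """
    e = []
    for n in range(len(s)):
        pred = sum(a[k] * s[n - 1 - k] for k in range(min(len(a), n)))
        e.append(s[n] - pred)
    return e


# Hypothetical stable second-order model driven by an impulse train.
a = [1.0, -0.5]
u = [1.0 if n % 8 == 0 else 0.0 for n in range(32)]
s = synthesize(u, a, gain=2.0)
e = lp_residual(s, a)   # approximately 2.0 * u[n]
```

For real speech the coefficients are estimated rather than known, so the residual only approximates G u[n].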

Selection of the correct analysis order is crucial [145]. For a low-order analysis, say p = 4, ..., 10, the LP envelope represents mainly linguistic information and, due to the low dimensionality, the discrimination between speakers is low. For higher orders, say p > 15, the LP spectrum represents a mixture of linguistic and speaker information. Although increasing the order makes speaker differences more apparent, for too high an order the LP model starts to capture individual harmonic peaks, and the model becomes closer to the Fourier spectrum.

3.4.2 LP-Based Features

In addition to the predictor coefficients, the Levinson-Durbin algorithm produces intermediate variables called reflection coefficients k[i], i = 1, ..., p, as a side product. These are interpreted as the reflection coefficients between the tubes in the lossless tube model of the vocal tract [45]. From the reflection coefficients, log area ratios (LAR) or arcus sine reflection coefficients [27] can also be computed. Formant frequencies and bandwidths can be estimated from the poles z₁, ..., z_p of the transfer function as follows [45]:

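$$\hat{F}_i = \frac{F_s}{2\pi} \tan^{-1}\!\left( \frac{\operatorname{Im} z_i}{\operatorname{Re} z_i} \right) \tag{3.6}$$

$$\hat{B}_i = -\frac{F_s}{\pi} \ln |z_i|. \tag{3.7}$$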
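Equations (3.6) and (3.7) can be applied pole by pole, as in the sketch below. It assumes that F_s denotes the sampling rate in Hz and uses `atan2` in place of a plain arctangent so that poles with a negative real part land in the correct quadrant. The function names and the example values (500 Hz, 80 Hz, 8000 Hz) are hypothetical.

```python
import cmath
import math


def formant_from_pole(z, fs):
    """Formant frequency and bandwidth in Hz from one pole, Eqs. (3.6)-(3.7)."""
    freq = fs / (2.0 * math.pi) * math.atan2(z.imag, z.real)
    bandwidth = -fs / math.pi * math.log(abs(z))
    return freq, bandwidth


def poles_order2(a1, a2):
    """Poles of 1 / (1 - a1 z^-1 - a2 z^-2), i.e. roots of z^2 - a1 z - a2."""
    root = cmath.sqrt(a1 * a1 + 4.0 * a2)
    return (a1 + root) / 2.0, (a1 - root) / 2.0


# Hypothetical resonance: 500 Hz centre frequency, 80 Hz bandwidth,
# assumed sampling rate fs = 8000 Hz.
fs = 8000.0
radius = math.exp(-math.pi * 80.0 / fs)
angle = 2.0 * math.pi * 500.0 / fs
a1, a2 = 2.0 * radius * math.cos(angle), -radius * radius

z_upper, z_lower = poles_order2(a1, a2)
freq, bandwidth = formant_from_pole(z_upper, fs)   # about 500 Hz and 80 Hz
```

For orders above two, a numerical root finder such as `numpy.roots` would be needed to obtain the poles. Only poles with a positive imaginary part yield distinct formant candidates, since complex poles come in conjugate pairs.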

Among a large number of parameters, LPC-derived formant frequencies were experimentally studied in [110]. Formants were observed to perform slightly worse than other spectral features, but they are nevertheless an interesting feature set. The filterbank and cepstral features are continuous parameters describing the distribution of amplitudes over all frequencies. Formants, on the other hand, are a discrete parameter set that picks discrete feature points from the spectrum: the locations of resonances, and not their amplitudes.

Given the predictor coefficients a_k, linear predictive cepstral coefficients (LPCC) can be computed as follows [94]:

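$$c_n = \begin{cases} a_n + \displaystyle\sum_{k=1}^{n-1} \frac{k}{n}\, c_k\, a_{n-k}, & 1 \le n \le p \\[2ex] \displaystyle\sum_{k=n-p}^{n-1} \frac{k}{n}\, c_k\, a_{n-k}, & n > p. \end{cases} \tag{3.8}$$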
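The recursion (3.8) translates almost line by line into code. The sketch below is a minimal illustration; the function name `lpcc` and the argument `n_ceps` (the number of cepstral coefficients to compute) are hypothetical, and the gain term is ignored.

```python
def lpcc(a, n_ceps):
    """Cepstral coefficients c_1..c_{n_ceps} from a_1..a_p via Eq. (3.8).

    `a` is [a_1, ..., a_p]; the result is [c_1, ..., c_{n_ceps}].
    """
    p = len(a)
    c = []
    for n in range(1, n_ceps + 1):
        # The lower summation limit switches from 1 to n - p once n > p.
        k_start = 1 if n <= p else n - p
        acc = sum((k / n) * c[k - 1] * a[n - k - 1]
                  for k in range(k_start, n))
        c.append((a[n - 1] if n <= p else 0.0) + acc)
    return c


# Single pole at z = 0.5: the cepstrum of 1 / (1 - 0.5 z^-1) is 0.5**n / n.
c = lpcc([0.5], 4)
# c == [0.5, 0.125, 0.041666..., 0.015625]
```

The second branch of (3.8) means that more cepstral coefficients than the LP order p can be computed, although they are still fully determined by the p predictor coefficients.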

An equivalent representation of the predictor coefficients is the set of so-called line spectral frequencies (LSF) [45, 94, 68]. Unlike the other LP-based features listed here, LSFs have the special property of being ordered according to frequency. In other words, LSFs are not fullband features, and some LSFs can still be usable if some frequency bands are contaminated by noise. LSFs have been applied to speaker recognition in [132, 131, 27, 226, 152, 110].

Perceptual linear prediction (PLP) [88] exploits three psychoacoustic principles, namely critical band analysis (Bark), equal loudness pre-emphasis, and the intensity-loudness relationship. PLP and its variants have been used successfully in speaker recognition [220, 159, 181, 211, 79]. Scanning the literature, it seems that conventional features like MFCC can outperform PLP in clean environments, but PLP gives better results in noisy and mismatched conditions.

[Figure 3.6: three panels showing c[1], Δ-c[1] and ΔΔ-c[1]; x-axis: Frame number, 0 to 300.]

Figure 3.6: Time trajectories of the first MFCC coefficient (c₁) and its delta and double-delta coefficients.

Atal [13] compared the performance of the LPCC parameters with the following parameters for speaker recognition: LPC coefficients, the impulse response of the filter specified by the LPC coefficients, the autocorrelation function, and the area function. Of these features, the LPC cepstral coefficients performed the best. Unfortunately, Atal’s data consists of only 10 speakers. In [110], a large number of DFT- and LP-derived spectral features were experimentally compared using two corpora of 110 speakers (Finnish) and 100 speakers (English). Cube root compressed filterbank cepstral coefficients, LPCC, LSF and arcus sine reflection coefficients performed the best when modeled using vector quantization.

In document Optimizing spectral feature based text-Independent speaker recognition (pages 40-44)