Feature extraction is necessary for several reasons. First, speech is a highly complex signal which carries several features mixed together [189]. In speaker recognition we are interested in the features that correlate with the physiological and behavioral characteristics of the speaker. Other information sources are considered as undesirable noise whose effect must be minimized. The second reason is a mathematical one, relating to the phenomenon known as the curse of dimensionality [25, 101, 102], which implies that the number of training vectors needed increases exponentially with the dimensionality. Furthermore, low-dimensional representations lead to computational and storage savings.

2.2.1 Criteria for Feature Selection

In [216, 189], desired properties for an ideal feature for speaker recognition are listed.

The ideal feature should

• have large between-speaker and small within-speaker variability

• be difficult to impersonate/mimic

• not be affected by the speaker's health or long-term variations in voice

• occur frequently and naturally in speech

• be robust against noise and distortion

It is unlikely that a single feature would fulfill all the listed requirements. Fortunately, due to the complexity of speech signals, a large number of complementary features can be extracted and combined to improve accuracy. For instance, short-term spectral features are highly discriminative and, in general, can be reliably measured from short segments (1-5 seconds) [151], but are easily corrupted when transmitted over a noisy channel. In contrast, F0 statistics are robust against technical mismatches but require rather long speech segments and are not as discriminative. Formant frequencies are also rather noise-robust, and formant ratios, relating to the relative sizes of the resonant cavities, are expected to be largely outside the speaker's voluntary control. The selection of features depends largely on the application (co-operative/non-co-operative speakers, desired security/convenience balance, database size, amount of environmental noise).

2.2.2 Types of Features

A vast number of features have been proposed for speaker recognition. We divide them into the following classes:

• Spectral features

• Dynamic features

• Source features

• Suprasegmental features

• High-level features

Table 2.1 shows examples from each class. Spectral features are descriptors of the short-term speech spectrum, and they reflect more or less the physical characteristics of the vocal tract. Dynamic features relate to the time evolution of spectral (and other) features. Source features refer to features of the glottal voice source. Suprasegmental features span over several segments. Finally, high-level features refer to symbolic information, such as characteristic word usage.

Table 2.1: Examples of features for speaker recognition.

Feature type              Examples
------------------------------------------------------------
Spectral features         MFCC, LPCC, LSF
                          Long-term average spectrum (LTAS)
                          Formant frequencies and bandwidths
Dynamic features          Delta features
                          Modulation frequencies
                          Vector autoregressive coefficients
Source features           F0 mean
                          Glottal pulse shape
Suprasegmental features   F0 contours
                          Intensity contours
                          Microprosody
High-level features       Idiosyncratic word usage
                          Pronunciation
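As an illustration of the dynamic features listed in Table 2.1, delta coefficients are commonly computed as a regression over neighboring frames. A minimal sketch in plain Python follows; the half-window length K = 2 is an assumed, typical choice, not a value prescribed by the text:

```python
# Delta (dynamic) features: first-order regression over neighboring frames.
# feats: list of per-frame feature vectors; K: regression half-window (assumed K=2).
def delta(feats, K=2):
    n = len(feats)
    dim = len(feats[0])
    denom = 2 * sum(k * k for k in range(1, K + 1))
    out = []
    for t in range(n):
        d = [0.0] * dim
        for k in range(1, K + 1):
            prev = feats[max(t - k, 0)]      # clamp at the segment edges
            nxt = feats[min(t + k, n - 1)]
            for i in range(dim):
                d[i] += k * (nxt[i] - prev[i]) / denom
        out.append(d)
    return out
```

On a feature trajectory that changes linearly over time, the interior delta values recover the slope of the trajectory, which is what makes them useful descriptors of spectral dynamics.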

An alternative classification of features is the phonetic-computational dichotomy. Phonetic features are based on acoustic-phonetic knowledge and often have a direct physical meaning (such as the vibration frequency of the vocal folds or the resonances of the vocal tract). In contrast, by computational features we refer to features that aim at finding a good representation in terms of small correlations and/or high discrimination between speakers. These do not necessarily have any physical meaning, but for automatic recognition this does not matter.

2.2.3 Dimension Reduction by Feature Mapping

By feature mapping we refer to any function producing a linear or nonlinear combination of the original features. Well-known linear feature mapping methods include principal component analysis (PCA) [50], independent component analysis (ICA) [97] and linear discriminant analysis (LDA) [65, 50]. An example of a nonlinear method is the multilayer perceptron (MLP) [25].

PCA finds the directions of largest variance and can be used to eliminate (linear) correlations between the features. ICA goes further by aiming at finding statistically independent components. LDA utilizes class labels and finds the directions along which linear separability is maximized.
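A minimal PCA sketch via eigendecomposition of the sample covariance matrix illustrates the decorrelation property; the function name and interface are illustrative:

```python
import numpy as np

# PCA by eigendecomposition of the sample covariance matrix.
# X: (n_samples, n_features) data matrix; d: number of components to keep.
def pca(X, d):
    Xc = X - X.mean(axis=0)                # center the data
    cov = Xc.T @ Xc / (len(X) - 1)         # sample covariance
    eigvals, eigvecs = np.linalg.eigh(cov)
    order = np.argsort(eigvals)[::-1]      # largest-variance directions first
    W = eigvecs[:, order[:d]]              # projection matrix
    return Xc @ W                          # decorrelated, reduced features
```

Because the projection directions are orthogonal eigenvectors of the covariance matrix, the covariance of the projected features is diagonal, i.e. the linear correlations between features are removed.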

ICA can be used when the observed vector can be assumed to be a linear mixture of some underlying sources. This is the basic assumption in the source-filter theory of speech production [57], in which the spectrum of the observed signal is assumed to be the product of the spectra of the excitation source, the vocal tract filter and the lip radiation. In the cepstral domain, these components are additive, which motivated the authors of [104] to apply ICA to cepstral features. ICA-derived basis functions have also been proposed as an alternative to the discrete Fourier transform in feature extraction [103].
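The multiplicative source-filter structure becoming additive under the logarithm can be verified numerically; the spectra below are toy synthetic shapes for illustration only, not real speech components:

```python
import numpy as np

# Source-filter model: observed spectrum = source * vocal tract * lip radiation.
# Taking the logarithm makes these contributions additive, which is the
# property exploited by cepstral analysis.
freqs = np.linspace(1.0, 4000.0, 64)
source = 1.0 / freqs                                  # toy excitation roll-off
tract = 1.0 + np.exp(-((freqs - 500.0) / 200.0) ** 2) # toy formant resonance
lips = freqs / 4000.0                                 # toy lip-radiation high-pass

observed = source * tract * lips                      # multiplicative in spectrum
log_obs = np.log(observed)
log_sum = np.log(source) + np.log(tract) + np.log(lips)
# additive in the log-spectral (and hence cepstral) domain
```

Since the cepstrum is a linear transform of the log spectrum, the same additivity carries over to cepstral features, matching the linear-mixture assumption of ICA.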

An MLP can be used as a feature extractor when trained for an autoassociation task. This means that the desired output vector is the same as the input vector, and the network is trained to learn the reconstruction mapping through nonlinear hidden layer(s) with a small number of neurons. In this way, the high-dimensional input space is represented using a small number of hidden units performing nonlinear PCA. Neural networks can also be used as an integrated feature extractor and speaker model [86].
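The autoassociative setup can be sketched with a tiny NumPy network; the layer sizes, learning rate and toy data are illustrative assumptions, and biases are omitted for brevity:

```python
import numpy as np

# Autoassociative ("bottleneck") MLP: the target output equals the input, and a
# small nonlinear hidden layer learns a compressed feature representation.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))
X[:, 3] = X[:, 0] + X[:, 1]              # redundant dimension to compress away

W1 = rng.normal(scale=0.1, size=(4, 2))  # input -> 2-unit nonlinear bottleneck
W2 = rng.normal(scale=0.1, size=(2, 4))  # bottleneck -> reconstruction

def forward(X):
    H = np.tanh(X @ W1)                  # hidden activations = extracted features
    return H, H @ W2                     # reconstruction of the input

_, Xhat = forward(X)
mse_init = np.mean((Xhat - X) ** 2)

lr = 0.05
for _ in range(2000):
    H, Xhat = forward(X)
    err = (Xhat - X) / len(X)            # gradient of the reconstruction error
    gW2 = H.T @ err
    gH = (err @ W2.T) * (1.0 - H ** 2)   # backpropagate through tanh
    gW1 = X.T @ gH
    W2 -= lr * gW2
    W1 -= lr * gW1

features, Xhat = forward(X)              # 2-D features for the 4-D input
mse_final = np.mean((Xhat - X) ** 2)
```

After training, the hidden activations serve as the low-dimensional features; the bottleneck forces the network to capture the dominant structure of the input, performing nonlinear PCA as described above.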

2.2.4 Dimension Reduction by Feature Selection

An alternative to feature mapping is feature selection [102], which was introduced to speaker recognition in the 1970s [39, 191]. The difference from feature mapping is that in feature selection, the selected features are a subset, not a combination, of the original features. The subset is selected to maximize a separability criterion; see [27] for a detailed discussion.

In addition to the optimization criterion, the search algorithm needs to be specified, and several methods exist for this. Naive selection takes the individually best-performing features. Better approaches include bottom-up and top-down search algorithms, dynamic programming, and genetic algorithms. For a general overview and comparison, refer to [102]; for a comparison in speaker recognition, see [32].
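A bottom-up (sequential forward) search of the kind mentioned above can be sketched as follows; the separability criterion J is a placeholder supplied by the caller, and the function name is illustrative:

```python
# Sequential forward (bottom-up) feature selection: greedily grow the subset,
# at each step adding the feature that most improves the criterion J.
# J: callable mapping a tuple of feature indices to a separability score.
def forward_selection(n_features, d, J):
    selected = []
    while len(selected) < d:
        best_feat, best_score = None, float("-inf")
        for f in range(n_features):
            if f in selected:
                continue
            score = J(tuple(selected + [f]))
            if score > best_score:
                best_feat, best_score = f, score
        selected.append(best_feat)
    return selected
```

Unlike naive selection, each candidate is evaluated jointly with the features already chosen, so the greedy search can account for redundancy between features, though it is still not guaranteed to find the globally optimal subset.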

In [32], it is noted that feature selection can be considered a special case of weighting the features in the matching phase with binary weights {0, 1} (0 = feature not selected, 1 = feature selected). A natural extension is thus to consider weights from a continuous set. The authors applied a genetic algorithm to optimize the weights, but obtained only a minor improvement over feature selection.

In an interesting approach presented in [166] and later applied in [46, 40], personal features are selected for each speaker. This allows efficient exploitation of features that may be poor speaker discriminators on average, but discriminative for a particular individual.