
Presentation attack detection in automatic speaker verification with deep learning

Juhani Seppälä

Master’s Thesis

School of Computing
Computer Science

April 2019


UNIVERSITY OF EASTERN FINLAND (Itä-Suomen yliopisto), Faculty of Science and Forestry, Joensuu School of Computing

Computer Science

Student, Juhani Seppälä: Replay attack detection in speaker verification with deep learning

Master's Thesis, 74 p.

Supervisors of the Master's Thesis: PhD Tomi Kinnunen. April 2019

Abstract: In the context of information security, traditional methods of user authentication are based either on knowledge or on a physical object. There are, however, various situations in which the traditional authentication methods are either not sufficient on their own or cannot be applied at all. The need for alternative, biometrics-based authentication methods has been growing continuously, and today's users demand both security and usability from their systems.

Businesses and authorities need tools for curbing fraud and abuse. Automatic speaker verification (ASV) is a biometric authentication method that utilises speech. For businesses, ASV offers a means of fraud prevention, and for the authorities it provides new tools, for example, for forensic investigation. As voice-operated intelligent systems become more common, the need for voice-based authentication also grows. The ISO/IEC 30107-1:2016 standard defines a so-called presentation attack on biometric systems. The presentation attack is a problem for all types of biometric authentication systems.

One way to carry out such an attack is to replay recorded speech of the target person to the biometric authentication system. Multiple independent studies have found that the performance of these systems can be degraded with replayed recordings. The most advanced current systems utilise so-called i-vectors, whereas classical ASV systems were based on Gaussian mixture models and acoustic features.

In this work, we investigate so-called deep learning methods for presentation attack detection.

Keywords: biometrics, automatic speaker recognition, spoofing, spoofing detection, replay attack, ASVspoof, DNN, CNN

CCS concepts (ACM Computing Classification System, 2012 version): Security and privacy → Biometrics, Computing methodologies → Neural networks, Computing methodologies → Supervised learning by classification


UNIVERSITY OF EASTERN FINLAND, Faculty of Science and Forestry, Joensuu School of Computing

Computer Science

Student, Juhani Seppälä: Replay attack detection in speaker verification with deep learning

Master's Thesis, 74 p.

Supervisors of the Master's Thesis: PhD Tomi Kinnunen. April 2019

Abstract: In the context of information security, the traditional means of user authentication involves either knowledge (password) or a physical token (badge or key). There are situations, however, where these are either not applicable or insufficient alone. The demand for alternate forms of authentication based on biometrics has been increasing and today's users demand both security and convenience. Businesses and governments demand tools to combat fraud and abuse. Automatic speaker verification (ASV) is a biometric authentication method utilising speech data. For businesses ASV allows for early fraud detection, while for law-enforcement, techniques from ASV may be of use in forensics. And, as more voice-operated, intelligent systems become mainstream in society, the need for voice-based authentication increases. The ISO/IEC 30107-1:2016 standard defines a so-called presentation attack for biometric systems. Presentation attacks present a problem for all biometric systems. One method to perform a presentation attack against an ASV system is by replaying a recording of the target speaker's speech to the biometric authentication system. Multiple independent studies have identified that ASV system performance can be degraded when replay samples are introduced. Current state-of-the-art systems for ASV utilise the so-called i-vectors, while the classical systems were based on Gaussian mixture modelling of acoustic speech features. In this work we investigate so-called deep learning approaches to replay attack detection.

Keywords: biometrics, speaker verification, spoofing, anti-spoofing, replay-attack, ASVspoof, DNN, CNN

CCS concepts (ACM Computing Classification System, 2012 version): Security and privacy → Biometrics, Computing methodologies → Neural networks, Computing methodologies → Supervised learning by classification


Acronyms and abbreviations

ADC Analog-to-digital conversion
ASR Automatic speaker recognition
ASV Automatic speaker verification
CNN Convolutional neural network
CQCC Constant-Q cepstral coefficients
CQT Constant-Q transform
DCT Discrete cosine transform
DNN Deep neural network
DFT Discrete Fourier transform
GMM Gaussian mixture model
EER Equal error rate
EM Expectation-maximisation
FFT Fast Fourier transform
HFCC High frequency cepstral coefficients
LCNN Light convolutional neural network
LDA Linear discriminant analysis
LFCC Linear frequency cepstral coefficients
LPC Linear prediction coefficients
LPCC Linear prediction cepstrum coefficients
LVM Latent variable model
MAP Maximum a posteriori
MFCC Mel-frequency cepstral coefficients
MFM Max-feature-map
MLE Maximum likelihood estimate/estimation
MLP Multi-layer perceptron
MSE Mean squared error
PLDA Probabilistic linear discriminant analysis
RASTA Relative spectral processing
ReLU Rectified linear unit
STFT Short-time Fourier transform
SVM Support vector machine
TVM Total variability model
UBM Universal background model


Mathematical notation

a The vector a.

M The matrix M.

p(x) Probability density on x.

N(µ, Σ) Multivariate normal density with mean µ and covariance Σ.

I The identity or unit matrix.


Contents

1 Introduction 1

2 Machine learning 4

2.1 Supervised learning . . . 5

2.2 Unsupervised learning . . . 6

2.3 Some properties of machine learning . . . 6

2.4 Latent variable models and the Gaussian mixture model . . . 8

2.4.1 Discussion . . . 13

2.5 Factor analysis . . . 14

2.6 Deep learning . . . 14

2.6.1 The classic multi-layer perceptron . . . 15

2.6.2 Implications of (32) and discussion . . . 21

2.6.3 Convolutional neural networks . . . 24

3 Speaker recognition and verification 29

3.1 Biometrics . . . 29

3.1.1 Authentication and biometrics . . . 31

3.1.2 Speaker verification . . . 32

3.2 Speech processing and feature extraction. . . 32

3.2.1 Mel-frequency cepstral coefficients (MFCCs) . . . 34

3.2.2 Constant Q cepstral coefficients (CQCCs) . . . 41

3.2.3 The spectrogram . . . 43

3.2.4 Other features . . . 45

3.3 Speaker modelling and classification . . . 45

3.3.1 Gaussian mixture model approaches to ASV . . . 46

3.3.2 Linear statistical model approaches . . . 48

3.3.3 Deep learning approaches to ASV . . . 51

3.4 Vulnerabilities and countermeasures . . . 52

3.5 Deep learning in replay attack detection . . . 52

3.6 System measurement and evaluation . . . 56

4 Experimental set-up and results 58

4.1 The ASVspoof 2017 v2 data . . . 58

4.2 The results and discussion . . . 60


5 Conclusion 64


1 Introduction

The expectation for high usability and the demand for alternative forms of user authentication have been ever increasing and today's users expect seamless and hassle-free access to various services. Similarly, due to the increasing demand for services requiring a high degree of security traditionally handled on-site, such as banking and various governmental services, service providers have long been interested in the ability to enhance and, in some cases, simplify their user authentication schemes. Automatic speaker verification (ASV), as an application of automatic speaker recognition (ASR), allows for the possibility of using a relatively easily and, more importantly, non-intrusively collectable biometric: human speech. The field of ASV is still under rapid change, with recent developments pointing towards an interesting future from the perspective of the so-called deep learning evolution of machine learning, where deep neural architectures are studied for modelling and inference.

Key use-cases for ASV can be found in off-site electronic services, business customer service, as well as in law-enforcement and forensics [1]. Untapped potential remains in biometric authentication (including speech), considering all the modern day personal computing devices. Similarly, voice-based interfaces, such as Apple Siri, have been rising in popularity. Many modern smart devices already include biometrics, either in the form of facial recognition, or in the form of fingerprint detection. ASV adds another possible layer of security to these devices.

The threat of replay attacks on ASV systems has been identified in separate studies [2,3,4,5]. A replay attack against an ASV system is initiated by playing a recording of a person's speech while interfacing with the target ASV system. Figure 1 demonstrates the relative ease of obtaining high-quality samples of the speaker of interest in the age of the smartphone. Figure 2 illustrates the relative ease of launching a replay attack. In [3], it was shown that systems do not fare well under the possibility of high-quality impostor attacks, where even the state-of-the-art systems were heavily impacted by replay data. A replay attack can be thought to be of special interest due to the low requirements of performing this kind of an attack. Other types of attacks are noted to require either high-level technical sophistication or special skills in impersonation.

Speech synthesis and voice conversion for the purpose of fooling ASV systems have been studied more extensively in comparison to the relatively simple replay attack, but the recent ASVspoof 2017 challenge [2] has refocused the interest of researchers in


Figure 1: A mock scenario where a high-quality recording of the target's speech is obtained. The attacker (right) lures the victim into a (preferably) lengthy conversation, during which they collect speech data for later use during an attack. Disclaimer: the depicted situation is unrelated to the ASVspoof data collection process.

Figure 2: Two off-the-shelf devices are needed to launch a replay attack against a telephone-based ASV system. The recording and playback device (left) is used to play back pre-recorded audio from the target speaker, while the device on the right is used to make the call.


this area, underlining the criticality of the problem.

Classical ASV systems [6,7] are based on generative models known as Gaussian mixture models (GMMs), and acoustic features, the most common of which are the Mel-frequency cepstral coefficients (MFCCs) [8]. More recently, so-called Constant-Q cepstral coefficients (CQCCs) [9] have been shown to be successful in ASV replay attack detection. The current state-of-the-art system in ASV is the factor analysis based i-vector system [10], in which the idea of the GMM supervector is utilised. This work is ultimately motivated by some results obtained from the ASVspoof 2017 challenge.

Several so-called deep learning approaches were presented for ASV replay attack detection [11,12]. In the former, a so-called light convolutional neural network (LCNN) [13] was used to directly distinguish replay samples from genuine samples, while in the latter, the focus was on combining a support vector machine (SVM) [14] classifier with a CNN feature extractor.

This thesis is structured as follows. Section 2 covers some of the most important machine learning topics at the heart of ASV. Section 3 offers a contextualisation of the ASV problem within automatic speaker recognition (ASR) and discusses some of the popular speech processing, feature extraction and speaker modelling techniques. Section 4 describes the system, including the parameters and the overall set-up of the experiments. Finally, section 5 concludes.


2 Machine learning

This section serves the purpose of an introduction to the key concepts and terminology behind the methods used in automatic speaker recognition. The reasoning here is to steer the speaker recognition and verification discussion towards certain important innovations that led to the forming of particular methods and to focus on the differences, strengths and weaknesses of these methods. This section starts with the general ideas and terminology around machine learning. Subsection 2.4 covers the necessary background on a specific unsupervised model, known as the Gaussian mixture model, while subsection 2.5 covers the background on factor analysis. Subsection 2.6 covers deep learning. These three background subsections each constitute the necessary background for one of the three major approaches to speaker recognition and spoofing detection, each of which will be discussed in section 3.

Machine learning refers to a broad class of methods and principles for the purpose of making future predictions based on observed data and for the purpose of uncovering underlying structures within data [15]. There are several different types of machine learning problems [15]. In supervised learning, the task is to learn a mapping function between the input space and the output space, while in unsupervised learning the learning task is performed without any mapping (or label) information. Some methods may combine aspects of both (semi-supervised learning). Finally, there is reinforcement learning, in which the learning involves reward and punishment mechanisms that guide the model towards the wanted response determined by the designer. In the context of this thesis, the two relevant classes of machine learning are supervised and unsupervised learning, both of which we will review in the following two subsections.

Figure 3: Conceptual illustration of supervised (left) and unsupervised (right) learning (based on similar ideas in multiple sources, including [16])


2.1 Supervised learning

Supervised learning tasks can be broadly split into regression and classification tasks.

In the case of classification, given the inputs x_i ∈ X, i ∈ {1, ..., N}, drawn from a training set X, and training outputs y_i ∈ {1, ..., C}, the task is to learn a mapping between the inputs and the outputs [15]. Here, C denotes the number of classes and N the number of training observations. In the case that the training outputs are continuous, the task becomes a regression problem. An individual x_i is also referred to as the features, or as a feature vector, with the dimensionality of the training data often exceeding 1.

The dimensionality of the training data depends on the learning task and availability of suitable data, ranging from single-digit to hundreds or even thousands. As an example of low-dimensional training data, consider that we would record the height, weight and the shoe size of a person. On the other hand, a high-dimensional feature vector could contain, for example, the raw pixel values of a digital image. The desired outputs are known as class (or label) information. The number of classes determines whether a classification problem is a binary classification or multi-class classification problem.

Additionally, if an input may be classified into multiple classes, the task is known as multi-label classification. As an example of a binary classification problem, consider a situation where the goal is to classify a speech sample into either pre-recorded (spoofed) speech or live speech, which is also the learning task this thesis focuses on.

An example of multi-class classification could be a situation where the goal is to predict the labels of digital images.

Figure 4: An imaginary image labelling scenario, where the most probable labellings for the two example images are shown.

Both supervised and unsupervised learning have been widely utilised in speaker recognition, and subsequently, in spoofing detection, and often we will see these used together in some manner to build the actual classifier that solves the problem of interest.

The general speaker recognition problem implies a multi-class classification task, given that the goal is to tie an incoming speech sample to a single class, i.e., the most likely speaker. Spoofing detection, in turn, is a binary classification problem, as the task is to determine whether a given speech sample is pre-recorded speech, played from an artificial audio device, or given by an actual human. In this work, spoofing detection is discussed as an important sub-problem within the wider context of speaker recognition and, as we will see, the methods for spoofing detection largely rely on the findings from work done on the general speaker recognition task. A key distinction between these two problems in terms of the classification task itself is the way in which the label information is used: in the general task, labels are used to tie samples to individuals, whereas in spoofing detection labels are used to differentiate samples based on the emitting source (artificial device or human speaker).

2.2 Unsupervised learning

Unsupervised learning differs fundamentally from traditional supervised learning in that the label information may be completely omitted from the process [16]. Instead of thinking in terms of pairs of training samples and the corresponding target values or classes, in unsupervised learning we are interested in the underlying structure of the data (clustering) or the hypothesised statistical process that is thought to be related to the observed data (density estimation) [17]. Dimensionality reduction is also an important field utilising the concept of unsupervised learning, where the idea is to find a lower-dimensionality representation for the data that still retains the important variance within the data [16]. Examples of unsupervised learning include the classic clustering techniques such as K-means, factor analysis and mixture models such as the Gaussian mixture model, which is discussed later.

2.3 Some properties of machine learning

While some aspects of machine learning can be difficult to categorise, certain attributes can be assigned for each learning problem at hand. In [17], it is noted that each learning problem may be described in terms of three main attributes: 1) the type of the learning


task, 2) the metric that is used to evaluate the performance of the model, and 3) the nature of the learning process: supervised or unsupervised. We have already discussed two of these: the task can be anything from predicting a single real number (regression) to density estimation, where we try to approximate an unknown probability density, and the nature of the learning, or type of experience [17] used for the learning, the latter of which is often categorised into either supervised or unsupervised learning.

Finally, we have the metric, which in classification is usually related to the accuracy of the model predictions, and for real-valued predictions in regression-like tasks may be some sort of distance metric. This is related to the empirical risk minimisation (ERM) principle discussed below.

Most supervised learning is based on empirical risk minimisation (ERM) [18,19], in which the idea is to minimise the expected loss given our model predictions and the training data [18]:

E(w) = \frac{1}{N} \sum_{i=1}^{N} L(f(x_i, w), y_i),   (1)

where L is a loss function, f(x_i, w) the model's prediction for a particular input training sample, given the model parameters w, and y_i the training label associated with that training sample. Under the principle, we should choose a model for which (1) is lowest. The loss function, often also referred to as the objective function or cost function, is a mathematical expression of the wrongness we want to associate with a particular training sample and label pair, and the form of the loss depends on the learning task.
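As an illustration only (not part of the thesis), the following NumPy sketch evaluates the empirical risk (1) of a toy linear model under a squared-error loss; the function names, data and parameter values are assumptions made for the example.

import numpy as np

def empirical_risk(f, w, X, y, loss):
    # Average loss of model f with parameters w over the training pairs (x_i, y_i), as in (1).
    return np.mean([loss(f(x, w), t) for x, t in zip(X, y)])

# Toy example: a linear model and a squared-error loss.
f = lambda x, w: np.dot(w, x)
squared_loss = lambda pred, target: (pred - target) ** 2

X = np.array([[1.0, 2.0], [0.5, -1.0], [3.0, 0.0]])
y = np.array([2.5, 0.0, 3.5])
w = np.array([1.0, 0.5])

print(empirical_risk(f, w, X, y, squared_loss))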

The key issue that machine learning, and especially supervised learning, is confronted with is the issue of generalisation [17]. Specifically, we have a set of training samples that give us a glimpse into some unknown statistical process, and now we want to build a model that is able to give reasonable guesses when confronted with completely new data. The measure that gives us an idea of the goodness of our model in this setting is the generalisation error (or test error), which is measured in terms of a disjoint test set.

In most machine learning settings, the development of a model involves first using the training data to reduce the training error towards zero and then testing the model on the separate test set. We want both to be as small as possible, and the two errors should follow each other in a good model [17]. In the case that the training error is clearly lower than the test error, we have a model that overfits, and when both errors remain high, we have a model that underfits.


2.4 Latent variable models and the Gaussian mixture model

In a latent variable model (LVM), the observed data is assumed to be affected by one or more unobservable or latent variables or factors [20, 15]. Normally, in LVMs and the related topic of factor analysis, the goal is to explain the variance of the observed data with an arbitrary number of unobserved variables, which are found by the means of factor analysis. The ideas behind LVMs were originally developed in the social sciences for behavioural modelling [20,21], but have also found widespread use within machine learning. LVMs and factor analysis will be discussed in more detail in subsection 2.5.

The Gaussian mixture model (GMM), or mixture of Gaussians [15, 16], consists of multiple multivariate Gaussian distributions, each with a mean µ_k and a covariance matrix Σ_k. A GMM is formalised as a weighted sum of its mixture components:

p(x_i \mid \theta) = \sum_{k=1}^{K} \pi_k \, \mathcal{N}(x_i \mid \mu_k, \Sigma_k),   (2)

where N is the probability density function (pdf) of the multivariate Gaussian (Normal) distribution (MVN):

\mathcal{N}(x \mid \mu, \Sigma) \triangleq \frac{1}{(2\pi)^{D/2} |\Sigma|^{1/2}} \exp\!\left( -\frac{1}{2} (x - \mu)^{T} \Sigma^{-1} (x - \mu) \right).   (3)

In the MVN, the mean vector, µ, defines the centre point of the distribution within the D-dimensional space that it lies in, and the covariance matrix, Σ ∈ R^{D×D}, defines how the probability density behaves around the mean, i.e., how the probability mass is distributed. |Σ| is the determinant of Σ. The covariance of a multivariate random variable is given by

\mathrm{cov}[x] = \mathbb{E}\!\left[ (x - \mathbb{E}[x])(x - \mathbb{E}[x])^{T} \right],   (4)

where for the Gaussian distribution we have E[x] = µ. The covariance can be either general, diagonal or isotropic (or spherical) [16], of which the first one is the least restricted in terms of the number of parameters. The general covariance matrix is as defined in (4); however, we may restrict it to be the diagonal matrix diag(σ_i^2), so that we have a matrix with only the individual variances of each dimension of x. Lastly, in the most restrictive option, we have Σ = σ^2 I, so that the covariance is simply the identity matrix scaled by a variance parameter. The differences between these are


illustrated in Figure 5. The restricted covariances are useful due to the requirement of computing the inverse of the covariance inside the exponent of (3), which is also known as the precision matrix [15]. In (2), K denotes the number of mixture components and

Figure 5: Three multivariate normal distributions in 3D and 2D plots. The leftmost distribution is one with the general case covariance matrix, while the central and rightmost distributions have spherical and diagonal covariances, respectively. We see that the general case covariance may have an arbitrary angle for the direction of the highest variance, while the diagonal one is restricted to follow one of the axes for the shape of its variance. Lastly, the spherical covariance has a clearly circular shape. Plots generated with GNU Octave (https://www.gnu.org/software/octave/) mesh and contour plots.

D the dimensionality of the vectors (x_i's). Further, the π_k's are known as the mixing weights or component priors. Two probabilistic constraints, 0 ≤ π_k ≤ 1 and \sum_{k=1}^{K} \pi_k = 1, must be satisfied. The π_k represent the Bayesian prior knowledge we have regarding the training data. The model encapsulates a 1-of-K coded latent variable z_k, also known as a one-hot vector, where one of the vector values is one and the rest zeroes. This one-hot vector can be thought of as tying each observation to a mixture component, with the index of the one in the vector pointing to the kth mixture component. The constraint


\sum_k z_k = 1 must hold. For the mixing weights, π_k, we have

p(z_k = 1) = \pi_k.   (5)
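To make (2), (3) and (5) concrete, the following NumPy sketch (illustrative only; the component parameters and evaluation points are assumed values, not from the thesis) evaluates the density of a two-component GMM at a batch of points.

import numpy as np

def mvn_pdf(X, mu, Sigma):
    # Multivariate normal density (3) evaluated at each row of X.
    D = mu.shape[0]
    diff = X - mu
    inv = np.linalg.inv(Sigma)
    norm = 1.0 / np.sqrt((2 * np.pi) ** D * np.linalg.det(Sigma))
    return norm * np.exp(-0.5 * np.sum(diff @ inv * diff, axis=1))

def gmm_pdf(X, pis, mus, Sigmas):
    # Weighted sum of component densities, as in (2).
    return sum(pi * mvn_pdf(X, mu, Sigma) for pi, mu, Sigma in zip(pis, mus, Sigmas))

# Two 2-D components with mixing weights summing to one, as required by the constraints on pi_k.
pis = [0.4, 0.6]
mus = [np.array([0.0, 0.0]), np.array([3.0, 3.0])]
Sigmas = [np.eye(2), np.array([[1.0, 0.3], [0.3, 2.0]])]

X = np.array([[0.0, 0.0], [3.0, 2.5], [1.5, 1.5]])
print(gmm_pdf(X, pis, mus, Sigmas))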

The parameters of a GMM can be trained with an iterative algorithm known as the expectation-maximisation (EM) algorithm [22, 15, 16]. The algorithm consists of two steps, computed at each iteration. The E-step (expectation) comes first in the iteration. Here, data inference is done based on the current model parameter values. What inference means with a generative model such as the GMM is that we use the current parameters of the distribution to sample an arbitrary number of new samples. The second, maximisation, or M-step, is an optimisation of the parameters given the previously updated data. The purpose of the EM algorithm is to maximise the log-likelihood of the

Figure 6: GMM training process via the EM-algorithm: here, starting from a random initial position, the mixture component parameters are shifted towards the clusters. In this toy example, we conveniently have three data clusters and three mixture components.

observed data x_i, and the missing or hidden data z_i, given the parameters θ. The likelihood is a function that indicates how reasonable our model is given the data we observe (our training data), where a higher likelihood is in favour of our model, and vice versa, and it is

\ell(\theta) = \sum_{i=1}^{N} \log p(x_i \mid \theta) = \sum_{i=1}^{N} \log\!\left[ \sum_{z_i} p(x_i, z_i \mid \theta) \right]   (6)


As noted in [15], due to the logarithm in front of the sum, the complete data log-likelihood is used:

\ell_c(\theta) = \sum_{i=1}^{N} \log p(x_i, z_i \mid \theta)   (7)

The complete data is the set of random variables {X, Z}, where X is the observed data and Z the unobserved data (x_i ∈ X and z_i ∈ Z). While Z is not directly observable, it is assumed that there is exactly one Gaussian component in the mixture that generated a particular observed sample in X. This data-generating process is encoded in Z. As the z_i are latent and therefore unknown, this log-likelihood cannot be directly evaluated.

The EM algorithm solves this problem such that, instead of evaluating (7), we evaluate its expectation [15]:

Q(\theta, \theta^{t-1}) \triangleq \mathbb{E}\!\left[ \ell_c(\theta) \mid X, \theta^{t-1} \right] = \mathbb{E}\!\left[ \sum_i \log p(x_i, z_i \mid \theta) \right] = \sum_i \sum_k r_{ik} \log \pi_k + \sum_i \sum_k r_{ik} \log p(x_i \mid \theta_k).   (8)

Note the introduction of the variable z_i for the assumed unobserved data. Instead of evaluating a likelihood, in the EM algorithm we evaluate the expectation of the joint distribution of the observed data (the x_i's) and the unobserved data, represented by the z_i's. We see that the sufficient statistics, the means and covariances, appear only in the right-hand sum. This is utilised in the steps of the algorithm. Here, t refers to the current iteration of the algorithm and θ^{t-1} refers to the parameters of the previous iteration. Q is a new function known as the auxiliary function. The parameters θ^t in the M-step are optimised via the following:

\theta^{t} = \arg\max_{\theta} Q(\theta, \theta^{t-1}),   (9)

that is, we want to find the model parameters that maximise the expected log-likelihood in (8). In the EM algorithm, maximum likelihood estimation (MLE) [23] is used to find the parameters. The basic principle of MLE is that we take the log-likelihood function and the data, and obtain an estimate (the maximum likelihood estimate) for each parameter of the model. The estimator for a given parameter is obtained by setting the derivative of the likelihood function with respect to that parameter to zero and solving. This estimator can then be used to compute the estimate for the parameter of interest.


Now, the E-step in the EM iteration, based on (8), is done via [15]:

r_{ik} = \frac{\pi_k \, \mathcal{N}(x_i \mid \mu_k^{t-1}, \Sigma_k^{t-1})}{\sum_{k'=1}^{K} \pi_{k'} \, \mathcal{N}(x_i \mid \mu_{k'}^{t-1}, \Sigma_{k'}^{t-1})}.   (10)

That is, for each observed data point i, the responsibility that the kth component takes for this data point is obtained with the Bayes rule. The responsibility computation can be seen as a soft clustering of the data points, in contrast to hard clustering in the K-means algorithm [16]. Thus, if we wanted to use the EM-GMM framework for clustering purposes, we could assign each data point to the component with the largest responsibility value to obtain cluster memberships. We will discuss this similarity briefly later. The M-step is done in multiple phases. First, the mixing weight for the kth component is set to be the proportion of the observed data that has been assigned to this component [15]:

\pi_k = \frac{1}{N} \sum_i r_{ik} = \frac{r_k}{N}   (11)

The new mean vector for the kth component is obtained by taking the mean over all data points weighted by the responsibility of the kth component for each data point [15]:

\mu_k = \frac{\sum_i r_{ik} x_i}{r_k}   (12)

The updated covariance matrix is obtained by following the MLE reasoning for the covariance parameter, in which we set the derivative of the log-likelihood, \ell(x_i \mid \pi_k, \mu_k, \Sigma_k), with respect to the parameter of interest (here Σ_k) to zero and solve for that parameter. This results in the form [15]:

\Sigma_k = \frac{\sum_i r_{ik} (x_i - \mu_k)(x_i - \mu_k)^{T}}{r_k} = \frac{\sum_i r_{ik} x_i x_i^{T}}{r_k} - \mu_k \mu_k^{T}   (13)

Assuming that the convergence threshold is set appropriately, the algorithm is guaranteed to end up in a local maximum in terms of the log-likelihood [15, 16]. For the initialisation of the model in terms of the means and covariances that describe the mixture components, it is possible to use the K-means algorithm [24, 15, 16]. Other approaches include random initialisation and farthest point clustering [15].


Algorithm 1 The naive EM algorithm for GMMs

1: Initialise each µ_k, Σ_k, π_k, the convergence threshold t and the update variable u, and evaluate the initial log-likelihood.
2: while u ≥ t do
3:   Evaluate the r_ik's.   ▷ E-step
4:   Using the new r_ik's, update π_k (11), µ_k (12) and Σ_k (13).   ▷ M-step
5:   Re-evaluate the log-likelihood for the updated model and set the increase in log-likelihood to be u.
6: end while
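The following NumPy sketch is one possible, simplified realisation of Algorithm 1 using the updates (10)-(13); the random initialisation, the small regularisation added to the covariances, the convergence threshold and the synthetic data are all assumptions made for illustration and not part of the thesis.

import numpy as np

def mvn_pdf(X, mu, Sigma):
    # Multivariate normal density (3) at each row of X.
    D = mu.shape[0]
    diff = X - mu
    norm = 1.0 / np.sqrt((2 * np.pi) ** D * np.linalg.det(Sigma))
    return norm * np.exp(-0.5 * np.sum(diff @ np.linalg.inv(Sigma) * diff, axis=1))

def em_gmm(X, K, n_iter=100, tol=1e-6, seed=0):
    # Naive EM for a GMM (Algorithm 1).
    rng = np.random.default_rng(seed)
    N, D = X.shape
    mus = X[rng.choice(N, K, replace=False)]                       # random initialisation
    Sigmas = np.array([np.cov(X.T) + 1e-6 * np.eye(D) for _ in range(K)])
    pis = np.full(K, 1.0 / K)
    prev_ll = -np.inf
    for _ in range(n_iter):
        # E-step: responsibilities r_ik, equation (10).
        dens = np.stack([pi * mvn_pdf(X, mu, S) for pi, mu, S in zip(pis, mus, Sigmas)], axis=1)
        r = dens / dens.sum(axis=1, keepdims=True)
        # M-step: equations (11)-(13).
        rk = r.sum(axis=0)
        pis = rk / N
        mus = (r.T @ X) / rk[:, None]
        Sigmas = np.array([
            (r[:, k, None] * (X - mus[k])).T @ (X - mus[k]) / rk[k] + 1e-6 * np.eye(D)
            for k in range(K)
        ])
        ll = np.log(dens.sum(axis=1)).sum()                        # log-likelihood, as in (6)
        if ll - prev_ll < tol:                                     # stop when the increase u falls below t
            break
        prev_ll = ll
    return pis, mus, Sigmas

# Synthetic two-cluster data, purely for illustration.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (100, 2)), rng.normal(4, 1, (100, 2))])
pis, mus, Sigmas = em_gmm(X, K=2)
print(pis, mus, sep="\n")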

2.4.1 Discussion

The EM-GMM scheme described earlier can be seen as a more robust way of doing clustering that is conceptually close to the K-means algorithm (in fact, [16] notes that K-means is a special case of GMM-EM). In the case of GMMs, we get a probabilistic alignment (soft alignment) of each data point belonging to each mixture component, whereas in the case of K-means, each data point either belongs to a cluster or does not (hard alignment). These are often called soft clustering and hard clustering, respectively.

In K-means clustering [25, 26], we first select k points at random and then iterate two steps: in the first, we assign data points to their nearest cluster centre according to some distance metric (usually the Euclidean metric, or L2 metric), and in the second step we set each cluster centre to be the mean of all the points assigned to it. The first step can be formalised as follows [16]:

r_{ik} = \begin{cases} 1 & \text{if } k = \arg\min_j \lVert x_i - \mu_j \rVert^2 \\ 0 & \text{otherwise,} \end{cases}   (14)

where we get a one-hot encoding for the cluster assignments, analogously to the responsibilities in EM-GMM. The updated cluster means are obtained such that [16]

\mu_k = \frac{\sum_i r_{ik} x_i}{\sum_i r_{ik}}.   (15)

Both (14) and (15) result from the objective or loss function [16]:

J = \arg\min_{r_{ik},\, \mu_k} \sum_{i=1}^{N} \sum_{k=1}^{K} r_{ik} \lVert x_i - \mu_k \rVert^2,   (16)


with (15) being the MLE for µ_k. Now, suppose that in the EM-GMM computation we set each Gaussian component's covariance matrix to be σ^2 I_D, use a constant π_k = 1/K, and use an indicator function to set the responsibilities in the E-step such that for a given data point, the component responsibility ends up either being one for the most likely component, or zero otherwise [15]. The result is the same clustering that is obtained from K-means, as only the means of the components are relevant in the E-step due to the way we defined the covariances. EM-GMM can thus be seen as a more robust way of modelling in comparison to K-means due to the possibility of modelling the individual component variances. Another important difference arising from the probabilistic responsibility assignments in EM-GMM is that it captures the uncertainty of each component responsibility, obtained via 1 − max_k(r_ik) [15], i.e., we can look at the relative magnitude of the highest assigned component responsibility to obtain a measure of the uncertainty related to the component (or cluster) assignment.
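For comparison with the EM-GMM sketch above, here is a compact K-means sketch implementing the hard assignment step (14) and the mean update step (15); the data, the number of clusters and the initialisation strategy are illustrative assumptions.

import numpy as np

def kmeans(X, K, n_iter=100, seed=0):
    # Plain K-means: assignment step (14) followed by mean update step (15).
    rng = np.random.default_rng(seed)
    mus = X[rng.choice(len(X), K, replace=False)]      # K random data points as initial centres
    for _ in range(n_iter):
        # Assignment step: index of the nearest centre for each point.
        dists = np.linalg.norm(X[:, None, :] - mus[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: each centre becomes the mean of the points assigned to it.
        new_mus = np.array([X[labels == k].mean(axis=0) if np.any(labels == k) else mus[k]
                            for k in range(K)])
        if np.allclose(new_mus, mus):
            break
        mus = new_mus
    return labels, mus

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.5, (50, 2)), rng.normal(3, 0.5, (50, 2))])
labels, centres = kmeans(X, K=2)
print(centres)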

2.5 Factor analysis

Factor analysis is a field of statistics where the goal is to understand the effect of unobserved (latent) variables on the observed data and to estimate these latent variables or factors. The origins of the field are in the behavioural sciences, specifically in the study of human intelligence [21]. Factor analysis is a popular statistical tool in fields where there are hard-to-define and hard-to-obtain quantities, such as marketing and economics, but its core ideas have also seen wider adoption due to the relation to the dimensionality reduction technique known as principal component analysis. Factor analysis models are discussed in more detail in Section 3.3.2.

2.6 Deep learning

Many classical machine learning techniques require the use of often complex preprocessing of data in the form of feature extraction, before any classification or regression task becomes feasible. Deep learning is a branch of machine learning, where the aim is to move away from complex feature extraction processes into what is known as representation learning (with multiple layers of representations), where the learning architecture can be thought of as performing feature extraction on the data [17]. These architectures usually combine together arbitrarily many non-linear components


in a layered structure, each layer thought of as corresponding to a different level of abstraction [17]. We note, however, that this kind of description can be seen as rather idealised, considering the difficulty of actually interpreting the learned parameters of a deep neural network. Ideally, this kind of thinking would reduce the need for in-depth domain knowledge in adopting a deep architecture for some learning problem.

Deep learning, however, presents an entirely new set of challenges, from the interpretation of the models to computational feasibility due to high parameter counts. Finally, classical machine learning techniques are still utilised in many deep learning systems.

The fundamental ideas behind deep learning can be traced back to the original ideas of neural networks [27,28,29,30] and the computational modelling of neurons. The most widely known deep learning model is the classic multi-layer perceptron, which utilises the simple model neuron, the perceptron. While the ideas for training such models were studied in the 1970s and 1980s, they did not gain wide interest in the machine learning and pattern recognition communities until relatively recently, because in the 1990s it was widely thought that the training of such models would be difficult in practice. For this thesis, two highly related deep learning models are presented: the classical perceptron-based model (discussed next), together with its training process, as well as the so-called convolutional neural network.

2.6.1 The classic multi-layer perceptron

The multi-layer perceptron [27, 16] (deep feedforward network, feedforward neural network), often abbreviated MLP, can be described in terms of non-linear function approximation, where arbitrarily many layers of non-linear vector-valued functions, each with modifiable (to be trained) parameters, are combined into a composite. A well-known property of MLP networks is related to the universal approximation theorem [31,32,17], which states that these kinds of models can be used to model arbitrary processes, given enough parameters. Of note, however, is the fact that this idea does not consider the optimisation of such models [17], which is still an active area of research.

The first layer of an MLP can be written as [16]:

\sigma(a_j) = \sigma\!\left( \sum_{i=1}^{D} w_{ij}^{(1)} x_i + b_{j0}^{(1)} \right),   (17)

(23)

where a_j is the jth activation of this layer. D denotes the dimensionality of the input layer, while w_{ij} is the weight term for the ith input layer component connected to the jth hidden layer component. b_{j0} is the jth bias term for the first (and only) layer of this model. Finally, σ denotes the activation function. The neural network nomenclature for this kind of model becomes apparent when the weights are considered to be edges connecting the previous layer to the second and each jth component a neuron. Moreover, the weighted sum with the bias term and the activation is the perceptron, with the important distinction that in the MLP we use a differentiable non-linearity: z = \sigma\!\left( \sum_{j=1}^{D} w_{ij} x_j + b \right). Thus, the basic component of this kind of a model is a linear combination, with adjustable parameters, followed by a non-linearity. (17) can be extended to include multiple layers as follows [16]:

y_k(x, w) = \sigma\!\left( \sum_{j=1}^{M} w_{kj}^{(2)} \, h\!\left( \sum_{i=1}^{D} w_{ji}^{(1)} x_i + b_{j0}^{(1)} \right) + b_{k0}^{(2)} \right),   (18)

where M denotes the number of components on the added second layer and the activation function h is the activation function used for the first hidden layer. The bias terms can be merged into the input vectors such that an input x becomes x = (1, x_1, ..., x_d)^T [16], and this simplification is used in the following discussion. We could equivalently and more simply write our model in the matrix form:

G(x) = \sigma(h(x W_1 + b_1) W_2 + b_2),   (19)

where W_1 is a weight matrix containing the weights for the first layer and b_1 the bias vector for that layer. The activation functions are not necessarily chosen to be the

Figure 7: A high level view of an MLP network. Here we see the input layer, with dimensionality D, an arbitrary number of hidden layers, and finally the output layer, with size T.

same, as seen here. Traditionally, the logistic sigmoid function has been presented as a suitable activation function for the hidden layers:

\sigma(x) = \frac{1}{1 + \exp(-x)}.   (20)

However, the sigmoid function can lead to a problem known as the vanishing/exploding gradient [33, 34] during the training process, where the error signal used for the parameter update tends to either become too small for sensible updates or explode entirely. This issue was noted to be especially significant for deeper networks, such as so-called recurrent neural networks (RNNs). The hyperbolic tangent, tanh, is another commonly used activation function. Generally, the requirements for the hidden layer activation functions are such that the function must be differentiable and monotonically increasing [16]. In the context of modern deep learning, a widely accepted replacement for the activation function is the so-called rectified linear unit (ReLU) [35,17] and its variants:

a(x) = max(0, x). (21)

ReLU is a piecewise linear function with two sections. This property, coupled with the non-linearity of the function, makes it a desirable activation function from the perspective of parameter optimisation [17]. In classification networks, the activation function at the output layer is typically the softmax function [16]:

y_i(x) = \frac{e^{a_i}}{\sum_j e^{a_j}},   (22)

where a_i is the ith activation at the output layer, and the sum is taken over all the activations at the output layer. The softmax activation constrains the output layer predictions into probabilities, such that 0 ≤ y_i ≤ 1 and \sum_i y_i = 1, which is what we want to have in a classification network.
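The sketch below puts (17)-(19), (21) and (22) together as a forward pass through a two-layer network with a ReLU hidden layer and a softmax output; the layer sizes and the random weights are placeholders chosen for illustration, not trained values from the experiments.

import numpy as np

def relu(a):
    return np.maximum(0.0, a)                        # equation (21)

def softmax(a):
    e = np.exp(a - a.max(axis=-1, keepdims=True))    # shifted for numerical stability
    return e / e.sum(axis=-1, keepdims=True)         # equation (22)

def forward(x, W1, b1, W2, b2):
    # Two-layer MLP in the matrix form of (19), with h = ReLU and sigma = softmax.
    hidden = relu(x @ W1 + b1)
    return softmax(hidden @ W2 + b2)

rng = np.random.default_rng(0)
D, M, C = 5, 16, 3                                   # input dimension, hidden units, classes
W1, b1 = rng.normal(0, 0.1, (D, M)), np.zeros(M)
W2, b2 = rng.normal(0, 0.1, (M, C)), np.zeros(C)

x = rng.normal(size=(2, D))                          # a mini-batch of two inputs
y = forward(x, W1, b1, W2, b2)
print(y, y.sum(axis=1))                              # each output row sums to one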

The parameters of an MLP network can be trained with various strategies utilising gradient descent and back-propagation. In MLP training using these two techniques, the parameters are adjusted in small steps towards the negative gradient [36, 37, 30]. Back-propagation is used to evaluate the errors for each network parameter, which are then used to adjust the parameters accordingly. The gradient descent parameter optimisation can be formalised as follows [16]:

w^{(\tau+1)} = w^{(\tau)} - \eta \nabla E(w^{(\tau)}),   (23)


Figure 8: The rectified linear unit (ReLU) activation function

where w^{(τ)} is the flattened parameter vector of the current time-step, where each layer's weights have been concatenated into a single vector of parameters. η is known as the learning rate and ∇E is the vector of partial derivatives (a gradient) of the error (or objective or loss) function E with respect to all the network parameters, calculated with back-propagation. The precise formulation of the error function depends on the activation function used at the output layer of the network as well as on the classification task. For regression, the mean squared error (MSE) can be used [16]:

E(w) = \frac{1}{2} \sum_{i=1}^{N} \lVert y_i - t_i \rVert^2,   (24)

where w is the vector of weight and bias parameters, y_i the predicted output and t_i the corresponding label or target vector. The error or cost functions presented here for neural networks are based on the principle of maximum likelihood [17]. In the case that our labelling information is in the form of one-hot-coded vectors, and the task is to perform binary or multi-class classification, binary or categorical cross-entropy can be used. Binary cross-entropy can be written as follows [16]:

E(w) = -\sum_{i=1}^{N} \left( t_i \log(y_i) + (1 - t_i) \log(1 - y_i) \right).   (25)

Categorical cross-entropy, as presented in [16], can be written as

E(w) = -\sum_{i=1}^{N} \sum_{c=1}^{M} t_{i,c} \log(y_{i,c}),   (26)

where M denotes the number of predicted classes, t_{i,c} the target label and y_{i,c} the predicted probability of the ith sample belonging to the cth class. The cross-entropy losses are motivated by information theory, and cause the network to bring the model predictions closer to the training data, which represents the underlying true distribution we want to generalise for.
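As a small numerical illustration of the error functions (24)-(26), the following sketch evaluates them on made-up predictions and one-hot targets (the values are arbitrary and not from the experiments).

import numpy as np

def mse(y, t):
    return 0.5 * np.sum((y - t) ** 2)                          # equation (24)

def binary_cross_entropy(y, t):
    return -np.sum(t * np.log(y) + (1 - t) * np.log(1 - y))    # equation (25)

def categorical_cross_entropy(y, t):
    return -np.sum(t * np.log(y))                              # equation (26)

# Two samples, three classes; rows of y are softmax outputs, rows of t are one-hot targets.
y = np.array([[0.7, 0.2, 0.1],
              [0.1, 0.8, 0.1]])
t = np.array([[1.0, 0.0, 0.0],
              [0.0, 0.0, 1.0]])
print(categorical_cross_entropy(y, t))    # the confident miss on the second sample dominates
print(mse(y, t))

# Binary case: scalar probabilities and 0/1 labels.
print(binary_cross_entropy(np.array([0.9, 0.2]), np.array([1.0, 0.0])))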

In the back-propagation technique, the partial derivative of the error function for the nth input vector, E_n(θ), with respect to a weight parameter w_{ji}, is obtained by utilising the chain rule for partial derivatives [16]:

\frac{\partial E_n}{\partial w_{ji}} = \frac{\partial E_n}{\partial a_j} \frac{\partial a_j}{\partial w_{ji}},   (27)

where w_{ji} is the ith weight parameter connected to the jth activation, a_j, of some layer of our network. The a_j is the weighted sum \sum_i w_{ji} z_i, where z_i is the ith output coming from a previous layer of the network to this component. Let

\delta_j \equiv \frac{\partial E_n}{\partial a_j}.   (28)

The δ_j's are known as the errors [16]. In (28) the error for each network activation is defined as the gradient of the objective function for the current (nth) sample with respect to the jth activation at some layer of the network. The errors at the output layer of the network are simply the differences between the activations at the output layer and the values of the target vector, or δ_k = y_k − t_k. It is now possible to write the partial derivative of the jth activation with respect to the ith weight connected to that activation as follows:

\frac{\partial a_j}{\partial w_{ji}} = z_i.   (29)

Following the derivation in [16] and substituting the previous two into (27), we get

\frac{\partial E_n}{\partial w_{ji}} = \delta_j z_i.   (30)

The errors for a unit of a hidden layer of the network are obtained as follows [16]:

\delta_j \equiv \frac{\partial E_n}{\partial a_j} = \sum_k \frac{\partial E_n}{\partial a_k} \frac{\partial a_k}{\partial a_j}.   (31)

The sum is evaluated over all hidden layer units that the kth output unit has a connection


Figure 9: Components of MLP error calculation (based on the illustration in [16]). The green arrow signifies the direction of activation calculation, often referred to as the "forward pass". The red arrow shows the direction of the error back-propagation computation, starting from the output layer.

with. Finally, we can follow [16], and substitute (28) into (31) and utilise the definition of the activation, a_j = \sum_i w_{ji} z_i and z_j = h(a_j), to obtain the general form of the back-propagation for a unit on a hidden layer of the network [16]:

\delta_j = h'(a_j) \sum_k w_{kj} \delta_k,   (32)

where h' is the derivative of the hidden activation function of the layer this unit resides at. From (32) we see that, after a forward pass through the network, one may iterate backwards through the network using the immediately previously computed errors at some layer to compute the errors for the next layer of the backwards pass through the network. Importantly, there is no complex dependency between the errors of units within the same layer of the network, as only the previous layer error values need to be taken into account. Of note is also the fact that the computations in an MLP network can be represented as a computational graph [17], as it can be easily seen that there are no cycles in the forward pass nor in the back-propagation step, and most modern software tools utilise some kind of a graph.

The algorithm for naive gradient descent and back-propagation over the whole data is described below. While our basic algorithm utilises the entire dataset before an update, the size of the batch does not necessarily need to be set in this manner. One strategy is, for example, to select some batch size smaller than N to either partition the data or sample


Algorithm 2 Gradient descent and back-propagation

1: while i < N do
2:   Forward-propagate an input vector x_i through the network, calculating the activations of each hidden and output unit of the network.
3:   Evaluate the output-layer errors using δ_k = y_k − t_k.   ▷ forward pass
4:   Using the output-layer errors, δ_k's, back-propagate through the network by utilising (31) and (30), summing and maintaining the errors for each parameter.   ▷ back-propagation
5:   Adjust the network parameters using (23).   ▷ gradient descent
6: end while

randomly from it (mini-batch gradient descent). The other extreme is an on-line [16] version of the algorithm, which calculates the gradient in terms of a single input vector and adjusts the network parameters each time, instead of going through the entire data before adjustments (sequential/stochastic gradient descent). The mini-batch approach is noted to be a popular compromise between using all of the data and a single data point during error computation [17], where it is noted also that in [38] it was shown that in terms of the generalisation performance using a single sample is desirable, but as noted in [17] this is often problematic from an optimisation perspective. Additionally, the choice of the mini-batch size may be affected by hardware and parallelisation considerations [17].
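To make Algorithm 2 and equations (23) and (27)-(32) concrete, here is a minimal sketch of a single gradient-descent step for a one-hidden-layer network with a sigmoid hidden layer, softmax output and cross-entropy loss (so that δ_k = y_k − t_k at the output); the layer sizes, learning rate and random data are illustrative assumptions, not values used in the thesis experiments.

import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def softmax(a):
    e = np.exp(a - a.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def gradient_step(x, t, W1, b1, W2, b2, eta=0.1):
    # One forward pass, back-propagation (30)-(32), and a gradient-descent update (23).
    a1 = x @ W1 + b1                                  # forward pass
    z1 = sigmoid(a1)
    a2 = z1 @ W2 + b2
    y = softmax(a2)
    delta2 = y - t                                    # output-layer errors: delta_k = y_k - t_k
    # Hidden-layer errors, equation (32): h'(a_j) * sum_k w_kj delta_k, with h = sigmoid.
    delta1 = sigmoid(a1) * (1 - sigmoid(a1)) * (delta2 @ W2.T)
    # Gradients, equation (30): dE/dw_ji = delta_j * z_i, summed over the batch.
    gW2, gb2 = z1.T @ delta2, delta2.sum(axis=0)
    gW1, gb1 = x.T @ delta1, delta1.sum(axis=0)
    # Gradient descent, equation (23).
    return W1 - eta * gW1, b1 - eta * gb1, W2 - eta * gW2, b2 - eta * gb2

rng = np.random.default_rng(0)
D, M, C, N = 4, 8, 3, 16
W1, b1 = rng.normal(0, 0.1, (D, M)), np.zeros(M)
W2, b2 = rng.normal(0, 0.1, (M, C)), np.zeros(C)
x = rng.normal(size=(N, D))
t = np.eye(C)[rng.integers(0, C, N)]                  # random one-hot targets for the example
W1, b1, W2, b2 = gradient_step(x, t, W1, b1, W2, b2)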

2.6.2 Implications of (32) and discussion

From (32), we can see that the iterative computation of the errors in these kinds of networks can be achieved merely by knowing the derivative of the chosen activation function(s). However, as noted in [16], due to the multiple nonlinearities found within neural networks (the activation functions), the overall error function is non-convex.

This implies that only local optima of the error function can be found by utilising the gradient descent parameter update in (23). For the generalisation performance of the model, i.e., when the model is tested on unknown new data beyond the training data, the global optimum may not be desirable [16].

It is noted in [39] that stochastic gradient descent (SGD) has been widely popular as an optimisation technique for neural networks, even as the fine details of the techniques used in practice have seen refinement in recent years. These include the ideas of momentum [40], Nesterov momentum or the Nesterov accelerated gradient (NAG) [41], RMSprop [42], ADAM [43], Adamax [43], Adagrad [44], and Adadelta [45], to name a few. For now, let us assume that we are using the mini-batch variant of back-propagation and gradient descent. The update rule [39] using momentum becomes

v_t = \gamma v_{t-1} + \eta \nabla E(w), \qquad w = w - v_t,   (33)

where γ is an adjustable hyperparameter that controls the amount of the previous update values used in computing the new update. Here we omitted the time-step index for the parameters w for simplicity and conciseness. The intuition behind momentum is that it helps the parameter optimisation process in getting over valleys and hills in the gradient "landscape" by taking into account the history of past update values during the update. Many of the variants of SGD discussed here include something similar in nature.

In the Nesterov variant (34), the previous momentum values are used together with the current parameters to obtain an estimate of the future parameter values, such that the historical direction of the parameter updates is taken into account during the gradient computation itself:

v_t = \gamma v_{t-1} + \eta \nabla E(w - \gamma v_{t-1}), \qquad w = w - v_t.   (34)

Practical challenges in getting standard SGD to converge to desirable results have further led to the development of so-called adaptive versions of SGD, where usually each parameter is updated individually based on some statistic of said parameter. The update rule for the adaptive method Adagrad is the following [39]:

w_{t+1,i} = w_{t,i} - \frac{\eta}{\sqrt{G_{t,ii} + \epsilon}} \cdot g_{t,i},   (35)

where w_{t,i} corresponds to the parameter i at time-step t and g_{t,i} to the derivative of the parameter. G_t ∈ R^{d×d} is a diagonal matrix of the sum of squares of the past gradients. Finally, ε is added to the denominator for numerical stability. The intuition of (35) is that we scale the learning rate parameter η according to the size of the past errors, which allows for the optimisation to adapt to the size of the errors on a per-parameter


basis. One motivation for this scaling is suggested [39] to be the problem of sparse data, e.g., high-dimensional data where the variability is concentrated on particular dimensions, which is common in many real-world situations. While adaptive methods have been found to be successful empirically in certain tasks, such as in training Generative Adversarial Networks (GANs) [46], in [47] it was suggested that adaptive methods may not be well-justified as drop-in replacements for standard SGD, as these were shown to give widely differing results from standard SGD.
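The following sketch applies the momentum update (33) and the per-parameter Adagrad update (35) to an arbitrary gradient function (the Nesterov variant (34) is omitted for brevity); the quadratic toy objective, hyperparameter values and iteration counts are assumptions chosen purely for illustration.

import numpy as np

def sgd_momentum(w, grad_fn, eta=0.1, gamma=0.9, steps=100):
    # Momentum update, equation (33).
    v = np.zeros_like(w)
    for _ in range(steps):
        v = gamma * v + eta * grad_fn(w)
        w = w - v
    return w

def adagrad(w, grad_fn, eta=0.5, eps=1e-8, steps=100):
    # Adagrad update, equation (35), keeping only the diagonal of G_t.
    G = np.zeros_like(w)                    # running sum of squared gradients
    for _ in range(steps):
        g = grad_fn(w)
        G += g ** 2
        w = w - eta / np.sqrt(G + eps) * g
    return w

# Toy objective E(w) = 0.5 * w^T A w, whose gradient is A w.
A = np.diag([10.0, 1.0])
grad_fn = lambda w: A @ w

w0 = np.array([5.0, 5.0])
print(sgd_momentum(w0, grad_fn))
print(adagrad(w0, grad_fn))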

Beyond the choice of the update rule during optimisation, a strategy for the initial values of the network parameters must be decided upon. Common approaches include drawing from normal and uniform distributions, with some heuristics depending on the choice of non-linearity in the model. One of these heuristic random initialisation methods was presented in [48], where the initial parameter weights for a particular layer are drawn such that

W \sim U\!\left( -\frac{\sqrt{6}}{\sqrt{n_j + n_{j+1}}},\; \frac{\sqrt{6}}{\sqrt{n_j + n_{j+1}}} \right),   (36)

where n_j is the number of incoming connections ("fan in") to this layer and n_{j+1} the number of outgoing connections ("fan out"). The normal variant of this method draws the parameter values such that

W \sim N(0, \sigma),   (37)

where

\sigma = \sqrt{\frac{2}{n_j + n_{j+1}}}.   (38)

The above initialisation is known as Xavier or Glorot initialisation, according to one of the authors, and has been found to be effective with deep networks.
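A short sketch of the uniform and normal Glorot/Xavier draws in (36)-(38) for a single layer; the layer sizes are arbitrary example values.

import numpy as np

def glorot_uniform(n_in, n_out, rng):
    # Uniform variant, equation (36).
    limit = np.sqrt(6.0) / np.sqrt(n_in + n_out)
    return rng.uniform(-limit, limit, size=(n_in, n_out))

def glorot_normal(n_in, n_out, rng):
    # Normal variant, equations (37)-(38).
    sigma = np.sqrt(2.0 / (n_in + n_out))
    return rng.normal(0.0, sigma, size=(n_in, n_out))

rng = np.random.default_rng(0)
W = glorot_uniform(256, 128, rng)
print(W.shape, W.min(), W.max())    # drawn values stay within the limits given by (36)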

MLP networks are particularly prone to a common problem in machine learning known as overfitting, where the model's neurons memorise some parts of the training set. The result is degraded inference performance when the model is introduced to previously unseen data. Strategies for guarding against overfitting vary, and include parameter weight decay [17], data augmentation [17], noise injection [17], and dropout [49,17]. In weight decay strategies, the norm of all of the model parameters is constrained (L1, L2) via the use of a regularisation term in the error function. As overfitting may often arise from insufficient data, data augmentation may sometimes be used to generate new training samples by applying suitable transformations on the real samples. In noise injection, we add random noise to the input data for the purpose of making the model more robust to small disturbances in the input. Dropout is a method for applying noise to the weights of the network, where we randomly "drop" connections in the network, i.e., we set some of the activations at a particular layer to zero at random. Finally, we note that controlling the capacity of the network, in terms of the number of layers as well as the size of the layers themselves, is an important part of regularisation.
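As a sketch of dropout as described above, the following function zeroes activations at random during training; the keep probability and the inverted scaling of the surviving activations are common conventions assumed here, not details taken from the thesis.

import numpy as np

def dropout(z, keep_prob, rng, training=True):
    # Randomly zero activations; scale the survivors so the expected value is unchanged.
    if not training:
        return z
    mask = rng.random(z.shape) < keep_prob
    return z * mask / keep_prob

rng = np.random.default_rng(0)
z = np.ones((2, 8))                       # activations of some hidden layer
print(dropout(z, keep_prob=0.5, rng=rng))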

2.6.3 Convolutional neural networks

Convolutional neural networks (CNNs) [50] constitute a class of neural architectures for the purpose of allowing for translational invariance [17] for structures in the input data (especially in the visual domain, such invariance is critically important since objects may change their location within an image, but they still need to be recognised by the model). A CNN is a neural network that utilises an operation known as convolution in its discrete form at some point of the architecture [17]. The discrete convolution operation is defined as follows:

s[t] = (x * w)[t] = \sum_{a=-\infty}^{\infty} x[a] \, w[t - a].   (39)

For a 2D array-like input, such as a grayscale (single layer) digital image, convolution becomes [17]

S(i, j) = (K * D)(i, j) = \sum_m \sum_n D(i - m, j - n) K(m, n),   (40)

where D is the input matrix and K a two-dimensional kernel that is convolved with the input. Taken through an activation function, such as a ReLU, the output of this operation is often called a feature map. [17] notes that, in practice, cross-correlation is often utilised:

S(i, j) = (K * D)(i, j) = \sum_m \sum_n D(i + m, j + n) K(m, n).   (41)

Typically, convolutional neural architectures dealing with digital images have to account for the three channels of an RGB image; however, for our purpose of adapting such a model for speech data in the form of a spectrogram, the above form of convolution


Figure 10: Convolution with a 2x2 kernel (Adapted from [17])

applies. Figure 10 is a simplification of the convolution operation.

Figure 11: Zero-padded input

In practice, two design parameters are used for convolution in CNNs: stride and padding [17]. When the kernel is convolved with the input, the kernel has its centre at the current input value. In order to make the operation feasible at the borders, with varying input sizes, zeroes are added to the input (Figure 11). The stride parameter determines how many input values the kernel centre moves at each step of the computation in each direction (width and height) (Figure 12). The dimensions of the output are given by the following [51]:

\text{output width} = \frac{W - F_w + 2P}{S_w} + 1,   (42)

and

\text{output height} = \frac{W - F_h + 2P}{S_h} + 1,   (43)


where W is the width and height of the input, F_w and F_h the width and height of the kernel respectively, and P the amount of zero-padding used.
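The sketch below implements the cross-correlation form (41) over a zero-padded input with a stride, and its output size matches (42)-(43); the example input, kernel, stride and padding values are arbitrary choices for illustration.

import numpy as np

def conv2d(D, K, stride=1, padding=0):
    # Cross-correlation (41) over a zero-padded input, moving the kernel by `stride`.
    D = np.pad(D, padding)
    kh, kw = K.shape
    out_h = (D.shape[0] - kh) // stride + 1
    out_w = (D.shape[1] - kw) // stride + 1
    S = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            patch = D[i * stride:i * stride + kh, j * stride:j * stride + kw]
            S[i, j] = np.sum(patch * K)          # dot product of kernel and input patch
    return S

D = np.arange(25, dtype=float).reshape(5, 5)     # a toy 5x5 "image"
K = np.array([[1.0, 0.0, -1.0],
              [1.0, 0.0, -1.0],
              [1.0, 0.0, -1.0]])                 # a 3x3 kernel
S = conv2d(D, K, stride=2, padding=1)
# Output size matches (42)-(43): (5 - 3 + 2*1)/2 + 1 = 3.
print(S.shape, S, sep="\n")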

Figure 12: Height and width stride of 2 with 3x3 kernel and zero-padding

CNNs contain three desirable properties: sparse interactions (or sparse connectivity or sparse weights), parameter sharing and equivariant representations [17], the last of which we already hinted at earlier. In a traditional neural network, assuming that no dropout (where some of the connections are dropped) is utilised, each hidden node is connected to every node in the previous and next layer. In a CNN, because the convolution kernel used at each network layer is smaller in size than the input, the parameter count is greatly reduced. Parameter sharing is achieved, similarly, due to the use of the kernel. Generally speaking, parameter sharing implies the use of the same parameter for multiple functions within the model. The intuition in the case of the CNN is that the kernel of each network level ties multiple positions of each input to its values, while in an MLP network, each position in the input is only tied to a specific network node. Finally, we have the equivariance property. A function f is said to be equivariant with respect to the function g if the following holds:

f(g(x)) = g(f(x)).   (44)

The convolutional layer of a CNN has equivariance with respect to translation, where,


as the input data is shifted wholly into a certain direction, the result of the convolution operation changes in a predictable manner.

In addition to the convolutional layers, CNNs normally contain another distinct component: the pooling function or layer. A pooling layer can be thought of as performing downsampling on the input, approximately representing it with a smaller number of values. A common type of a pooling layer is the max-pool layer, which has some similarity to our convolution operation. Unlike in the normal convolution operation, where essentially a dot product is computed between the kernel and a section of the input, in max-pooling, a local maximum is simply taken around a section of the input determined by the kernel and the stride parameter (Figure 13). Other pooling functions have been studied, such as taking the average, the L2-norm, or the average distance from the kernel centre position [17].

Figure 13: Max-pool with 2x2 kernel and stride 1 (Adapted from [51])

While an architectural description of a CNN may separate these into distinct network layers, a convolutional layer of a CNN can be thought of as consisting of three distinguishable stages [17], each of which is illustrated in Figure 14 below. The final layers of a CNN are typically so-called fully connected (FC) layers, which are exactly like the traditional neural network layers discussed in Section 2.6.1. A complete CNN architecture is illustrated in Figure 14. Both CNNs and MLPs (the two types are often mixed in modern networks) are commonly optimised with some variant of the SGD discussed earlier.
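The earlier remarks on sparse connectivity and parameter sharing can be made concrete with a rough parameter count; the sizes below are hypothetical example figures of ours, comparing a single convolutional layer against a fully connected layer producing an output of the same spatial size.

# Hypothetical example: a 100x100 single-channel input mapped to 16 output maps.
in_h, in_w, in_ch, out_ch, k = 100, 100, 1, 16, 3

# Convolutional layer: one k x k kernel (plus bias) per output channel.
conv_params = out_ch * (k * k * in_ch + 1)
# Fully connected layer producing a 100x100x16 output: one weight per
# (input position, output position) pair, biases omitted for simplicity.
fc_params = (in_h * in_w * in_ch) * (in_h * in_w * out_ch)

print(conv_params)  # 160
print(fc_params)    # 1600000000 (1.6 billion)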

In recent years, a plethora of different higher-level frameworks for developing and training such models has seen a rise, including TensorFlow (1), Keras (2), PyTorch (3), Caffe (4),

(1) tensorflow.org

(2) keras.io

(3) pytorch.org


Figure 14: The basic building blocks of a convolutional neural network. The convolution layers compute local activations of the input, while the pooling layers summarise the input at various stages. The max-pooling layer displayed here reduces the size of the channel outputs but retains the number of channels.

and Microsoft Cognitive Toolkit (5). The recent explosion of neural network research can be largely attributed to the rise of practical and affordable parallel computation of backpropagation and SGD via the use of graphics processing units (GPUs), the most popular lower-level framework being NVIDIA's CUDA platform (6). Moreover, certain higher-level frameworks, such as the historically more research-oriented PyTorch, make the development and deployment of various kinds of networks straightforward, with standardised implementations for many layer types, loss functions and optimisation methods; a minimal sketch of such a network is given below.

(4) caffe.berkeleyvision.org

(5) microsoft.com/en-us/cognitive-toolkit

(6) developer.nvidia.com/cuda-zone
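As an example of how little code such a model requires in one of the frameworks above, the following is a minimal PyTorch sketch of the conv-pool-FC pattern of Figure 14; the layer sizes, the assumed 1x64x64 spectrogram-like input and the two-class output are illustrative choices of ours, not the architecture used later in this work.

import torch
import torch.nn as nn

class SmallCNN(nn.Module):
    """Toy CNN: two convolution + ReLU + max-pool stages followed by FC layers."""
    def __init__(self, num_classes=2):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1),   # 1x64x64 -> 16x64x64
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=2, stride=2),        # -> 16x32x32
            nn.Conv2d(16, 32, kernel_size=3, padding=1),  # -> 32x32x32
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=2, stride=2),        # -> 32x16x16
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(32 * 16 * 16, 64),
            nn.ReLU(),
            nn.Linear(64, num_classes),
        )

    def forward(self, x):
        return self.classifier(self.features(x))

model = SmallCNN()
x = torch.randn(4, 1, 64, 64)   # a batch of four 64x64 "spectrogram" patches
print(model(x).shape)           # torch.Size([4, 2])

# Training would typically use a variant of SGD, e.g.:
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)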


3 Speaker recognition and verification

Speaker recognition refers to the study of a number of tasks that utilise speech data with the aim of tying the speech data to individuals in some manner [52]. These involve speaker identification, speaker verification, speaker or event classification, speaker segmentation, speaker tracking and speaker detection. Automatic speaker verification (ASV), as the focus of this work, refers to a process where the identity of a user of a system is verified using speech data, while identification refers to the process of identifying a person from an audio stream, possibly containing speech from multiple persons. Speaker/event classification includes problems such as speaker age classification or, for example in the case of event classification, classifying an audio event as relating to music (e.g., singing) or to a car (not speech). Speaker segmentation involves the problem of separating different speakers within an audio sample. Finally, we have speaker detection and speaker tracking. The former involves detecting the presence of a particular speaker in an audio source, while the latter involves tracking a person across multiple audio sources. A distinction should be made with automatic speech recognition (ASR), which is a broader and older problem and refers to linguistic analysis of speech, where the task is to predict the message that was spoken. Machine language translation falls into the purview of speech recognition. Voice recognition has historically been used as a synonym for both speech recognition and speaker recognition. In this thesis, unless otherwise mentioned, we focus on ASV.

ASV can be further split into two different problems depending on what kind of information is used in the recognition task. In text-dependent speaker recognition, additional linguistic information is used in conjunction with the speech data, whereas in text-independent speaker recognition, the speech data may be used to some degree independently of any linguistic information. [52]

3.1 Biometrics

We have seen that speaker recognition is by itself a wide area of study. This subsection serves to contextualise the problem of interest in this work, namely the detection of a spoofed audio sample within a speaker verification system. Speaker verification is one of many different forms of biometric authentication.


Biometrics is the science of identifying or verifying persons based on physiological or behavioural characteristics, while the word biometric refers to a specific mode of biometrics (e.g., the fingerprint is a biometric) [53]. The most well-known biometric is the fingerprint, which has until recently been somewhat of a synonym for biometrics in general. Examples of behavioural characteristics include handwritten signatures and indeed certain properties of voice (we will later discuss how the human voice actually contains both physiological and behavioural traits). The key difference between these two types of characteristics is that physiological characteristics are, at least to some degree, directly measurable and less dependent on human mental functioning, while behavioural characteristics are more complex and a function of a person's life experience, development over time and situational setting. Consider, for example, how languages and ways of speaking are learned. Certain physiological characteristics, such as the fingerprint, may potentially contain substantially more information for the purpose of biometrics in terms of a single measurement. In contrast, a behavioural characteristic, such as voice, may contain less information in terms of a single measurement, implying the need for measurements over a length of time (this will be a common theme in the following sections).

The desiderata of a particular biometric can be evaluated by looking at the following criteria [54, 53, 55]:

• Universality – The biometric characteristic should be measurable in the case of every individual.

• Uniqueness – Persons should be uniquely identifiable by the characteristic.

• Permanence – The characteristic should stay similar even as time passes.

• Collectability – There should exist some way to measure the characteristic.

• Acceptability – Measuring and processing the biometric should be acceptable to humans, i.e., not humiliating, inconvenient or dangerous.

• Performance – The biometric should provide accuracy and consistency in terms of the measurement.

While the overall quality of a biometric is some kind of a function of all of the above criteria, it would be impossible to cover all of them perfectly. Consider, for example, that certain obvious trade-offs exist between the criteria. An easily collectable biometric, such
