PUBLICATIONS OF THE UNIVERSITY OF EASTERN FINLAND
Dissertations in Forestry and Natural Sciences, No 324

ALEXEY SHOLOKHOV

IMPROVING MACHINE LEARNING METHODS FOR SPEAKER RECOGNITION AND SEGMENTATION

This dissertation develops and advances machine learning methods for text-independent speaker recognition and speaker segmentation. It focuses on enhancing both front-end and back-end components of speaker recognition and segmentation systems with the goal of improving robustness of recognition.

ISBN 978-952-61-2976-1
ISSN 1798-5668


PUBLICATIONS OF THE UNIVERSITY OF EASTERN FINLAND DISSERTATIONS IN FORESTRY AND NATURAL SCIENCES

N:o 324

Alexey Sholokhov

IMPROVING MACHINE LEARNING METHODS FOR SPEAKER RECOGNITION AND SEGMENTATION

ACADEMIC DISSERTATION

To be presented, by permission of the Faculty of Science and Forestry, for public examination in the Louhela auditorium of Joensuu Science Park, Länsikatu 15, University of Eastern Finland, Joensuu, on December 17th, 2018, at 12 o'clock.

University of Eastern Finland Faculty of Science and Forestry

Joensuu 2018


Grano Oy Jyväskylä, 2018

Editors: Pertti Pasanen, Matti Tedre, Jukka Tuomela, and Matti Vornanen

Distribution:

University of Eastern Finland Library / Sales of publications julkaisumyynti@uef.fi

http://www.uef.fi/kirjasto

ISBN: 978-952-61-2976-1 (print)
ISSNL: 1798-5668
ISSN: 1798-5668

ISBN: 978-952-61-2977-8 (pdf)
ISSNL: 1798-5668
ISSN: 1798-5676


Author’s address:
University of Eastern Finland, School of Computing
P.O.Box 111, 80101 JOENSUU, FINLAND
email: sholok@cs.uef.fi

Supervisors:
Associate professor Tomi H. Kinnunen, Ph.D
University of Eastern Finland, School of Computing
P.O.Box 111, 80101 JOENSUU, FINLAND
email: tomi.kinnunen@uef.fi

Associate professor Timur Pekhovsky, Ph.D
ITMO University, Department of Speech Information Systems
49 Kronverksky pr., 197101 SAINT PETERSBURG, RUSSIAN FEDERATION
email: tim@speechpro.com

Reviewers:
Associate professor Nicholas Evans, Ph.D
EURECOM, Campus SophiaTech
450 Route des Chappes, 06410 Biot, FRANCE
email: evans@eurecom.fr

Associate professor Man-Wai Mak, Ph.D
The Hong Kong Polytechnic University, Department of Electronic and Information Engineering
Core E, 6/F, Hung Hom, Kowloon, HONG KONG
email: enmwmak@polyu.edu.hk

Opponent:
Associate professor Mikko Kurimo, Ph.D
Aalto University, Department of Information and Computer Science
P.O.Box 15400, FI-00076 AALTO, FINLAND
email: mikko.kurimo@aalto.fi


Alexey Sholokhov

Improving machine learning methods for speaker recognition and segmentation.
Joensuu: University of Eastern Finland, 2018

Publications of the University of Eastern Finland Dissertations in Forestry and Natural Sciences

ABSTRACT

The accuracy of modern automatic speaker recognition systems is strongly dependent on the acoustic properties of the environment in which they operate, and on the compatibility of their training data with that of the operational environment.

By leveraging the methods of modern machine learning, this dissertation focuses on enhancing both front-end and back-end components of speaker recognition (or diarization) systems with the goal of improving robustness of recognition.

In the front-end, the work focuses on speech activity detection (SAD), an essential component of speech processing systems, intended for detecting whether or not a given sound segment contains speech. Since the overall performance of downstream recognition systems strongly depends on the quality and amount of the input data, constructing accurate SAD is an important task in speech technology. One approach relies on using powerful general-purpose approximators, such as deep neural networks, trained on a very large and diverse set of off-line audio recordings containing SAD labels. Unfortunately, this requires both a large dataset and SAD labels, potentially involving substantial human intervention, and therefore does not suit all applications. There is as of yet no universal speech activity detection algorithm that works reliably for all possible background noises and transmission channels. Current SAD solutions are sensitive to mismatched conditions, leading to unacceptable performance in previously unencountered acoustic environments. An alternative approach, therefore, is to construct SAD based on the given recording only, using a limited amount of supervision or none at all. This work contributes to such an approach with an application to automatic speaker verification (ASV). Specifically, the author proposes a semi-supervised learning strategy to tackle this problem.

SAD, as such, is a building block of more complex speech processing systems and, in principle, independent of the back-end classifier architecture. Nonetheless, as a binary classifier, SAD makes detection errors. This raises an important question regarding their relative importance to the downstream recognizer. As a binary classifier, any SAD is characterized by two inescapable error types: speech misses (speech segments mistakenly declared as non-speech) and false alarms (non-speech segments mistakenly declared as speech). This work addresses the impact of SAD misses and false alarms on the performance of a speaker verification system.

Concerning the back-end, that is, speaker modeling, this thesis contributes advances in both speaker diarization (or speaker segmentation) and ASV. Specifically, the author develops a new clustering algorithm based on pairwise similarity comparisons between short-term speech segments. The intention is to devise a general clustering algorithm that can accurately estimate the number of clusters and benefit from rich prior information available from large, off-line datasets. These desirable properties are obtained, respectively, by adopting a Bayesian approach for the automatic determination of model complexity and by using a learnable similarity score. This proposal can be seen as an alternative to agglomerative hierarchical clustering, one of the widely adopted heuristic speaker diarization techniques.

Finally, the author studied an alternative training strategy for the pairwise support vector machine classifier based on the methodology of multi-task learning. The proposed training method aims at compensating for dataset bias and improving the generalization performance of the resulting classifier by learning jointly over multiple source datasets.

Universal Decimal Classification: 004.85, 004.934

INSPEC Thesaurus: learning (artificial intelligence); speech processing; speaker recognition; modelling; pattern clustering; pattern classification; Bayes methods

Yleinen suomalainen asiasanasto: tekoäly; koneoppiminen; puheteknologia; puhujantunnistus; mallintaminen; bayesilainen menetelmä


ACKNOWLEDGEMENTS

This research work was carried out at the University of Eastern Finland (Finland) and ITMO University (Russian Federation). The work was funded by the Academy of Finland (projects 253120, 283256 and 309629) and the Government of the Russian Federation (grant 074-U01). I wish to thank Prof. Pasi Fränti and Prof. Yuri Matveev for giving me the opportunity to study within the double-degree doctoral programme between the two universities.

I would like to express my sincere gratitude to my supervisors Assoc. Prof. Tomi Kinnunen and Assoc. Prof. Timur Pekhovsky who introduced me to speech technology and provided enormous support and guidance during my studies.

I also wish to thank Prof. Mikko Kurimo for agreeing to act as the opponent in my public defense. I am thankful to the reviewers of this work, Prof. Nicholas Evans and Prof. Man-Wai Mak, for their valuable comments.

Finally, I am especially grateful to my colleagues and co-authors in the publications who made this work possible.

Joensuu, November 22, 2018 Alexey Sholokhov


LIST OF ABBREVIATIONS

AHC    Agglomerative hierarchical clustering
ASR    Automatic speech recognition
ASV    Automatic speaker verification
BIC    Bayesian information criterion
BLFA   Bilinear factor analysis
CMVN   Cepstral mean and variance normalization
CPD    Change point detection
CPU    Central processing unit
CRP    Chinese restaurant process
DAC    Domain Adaptation Challenge
DCF    Detection cost function
DER    Diarization error rate
DNN    Deep neural network
DSP    Digital signal processing
EER    Equal error rate
ELBO   Evidence lower bound
EM     Expectation-maximization
ERM    Empirical risk minimization
FA     Factor analysis
FAR    False acceptance rate
FRR    False rejection rate
GLR    Generalized likelihood ratio
GMM    Gaussian mixture model
IDVC   Inter-dataset variability compensation
JFA    Joint factor analysis
LDC    Linguistic Data Consortium
LLR    Log-likelihood ratio
LP     Linear predictive
LPCC   Linear predictive cepstral coefficient
MAP    Maximum a posteriori
MFCC   Mel-frequency cepstral coefficient
MLE    Maximum likelihood estimation
MDL    Multi-domain learning
MTL    Multi-task learning
NCE    Noise-contrastive estimation
NIST   National Institute of Standards and Technology
PCA    Principal component analysis
PLDA   Probabilistic linear discriminant analysis
PLP    Perceptual linear predictive
PPCA   Probabilistic principal component analysis
PSVM   Pairwise support vector machine
RASTA  Relative spectral analysis
SAD    Speech activity detection
SD     Speaker diarization
SID    Speaker identification
SNR    Signal-to-noise ratio
SR     Speaker recognition
SRE    Speaker recognition evaluation
STFT   Short-time Fourier transform
SSL    Semi-supervised learning
SVM    Support vector machine
TVAE   Tied variational autoencoder
UBM    Universal background model
VAE    Variational autoencoder
VQ     Vector quantization
WCCN   Within-class covariance normalization
ZCR    Zero-crossing rate


TABLE OF NOTATION

Notation                               Description

$\mathbb{R}$                           the set of real numbers
$\mathbb{R}_{+}$                       the set of non-negative real numbers
$\mathcal{A}, \mathcal{B}, \mathcal{C}, \ldots$    arbitrary sets
$\mathcal{A} \times \mathcal{B}$       Cartesian product of two sets, $\mathcal{A}$ and $\mathcal{B}$
$a, b, c, \ldots$                      arbitrary scalars
$\mathbf{a}, \mathbf{b}, \mathbf{c}, \ldots$       arbitrary vectors of real numbers
$\mathbf{A}, \mathbf{B}, \mathbf{C}, \ldots$       arbitrary matrices over the reals
$\mathbb{E}_p[\cdot]$                  the expectation with respect to the probability distribution $p$
$\mathbf{x} \sim p$                    $\mathbf{x}$ is generated from the distribution $p$
$\mathcal{N}(\boldsymbol{\mu}, \boldsymbol{\Sigma})$   normal (Gaussian) distribution with mean $\boldsymbol{\mu}$ and covariance matrix $\boldsymbol{\Sigma}$
$\mathrm{Cat}(\boldsymbol{\pi})$       discrete (categorical) distribution with parameters $\boldsymbol{\pi}$
$\mathrm{Dir}(\boldsymbol{\alpha})$    Dirichlet distribution with parameters $\boldsymbol{\alpha}$
$\log$                                 the natural logarithm
$\exp$                                 the exponential function
$\mathbb{1}_{\mathcal{A}}$             indicator function of the set $\mathcal{A}$; $\mathbb{1}_{\mathcal{A}}(x)$ equals 1 if $x \in \mathcal{A}$ and 0 otherwise
$a_i$                                  $i$-th entry of a vector $\mathbf{a}$
$A_{ij}$                               $(i,j)$-th entry of a matrix $\mathbf{A}$
$\mathrm{diag}(\mathbf{A})$            vector consisting of the diagonal of the matrix $\mathbf{A}$
$\mathrm{diag}(\mathbf{a})$            diagonal matrix $\mathbf{A}$ satisfying $A_{ii} = a_i$
$\mathbf{1}$                           vector whose entries are all 1
$\mathbf{I}$                           identity matrix, $\mathbf{I} = \mathrm{diag}(\mathbf{1})$
$\|\mathbf{a}\|$                       the Euclidean norm of the vector $\mathbf{a}$
$\partial f / \partial x$              the partial derivative of the function $f$ with respect to $x$
$\nabla f$                             gradient of a scalar function $f$
$\mathbf{J}_f$                         Jacobian of a vector-valued function $f$
$x \propto y$                          proportionality, that is, there is a non-zero constant $k$ such that $y = kx$


LIST OF PUBLICATIONS

This thesis consists of an overview of the author’s work in the field of speech technology and the following selection of the author’s publications:

I   A. Sholokhov, T. Pekhovsky, O. Kudashev, A. Shulipa and T. Kinnunen, "Bayesian analysis of similarity matrices for speaker diarization," Proc. ICASSP, 106–110 (2014).

II  A. Sholokhov, T. Kinnunen and S. Cumani, "Discriminative multi-domain PLDA for speaker verification," Proc. ICASSP, 5030–5034 (2016).

III T. Kinnunen, A. Sholokhov, E. Khoury, D. Thomsen, M. Sahidullah and Z.-H. Tan, "HAPPY Team Entry to NIST OpenSAD Challenge: A Fusion of Short-Term Unsupervised and Segment i-Vector Based Speech Activity Detectors," Proc. Interspeech, 2992–2996 (2016).

IV  A. Sholokhov, M. Sahidullah and T. Kinnunen, "Semi-Supervised Speech Activity Detection with an Application to Automatic Speaker Verification," Computer Speech & Language, 47: 132–156 (2018).

Throughout the overview, these papers will be referred to by Roman numerals. The next section summarizes the author’s contributions.

AUTHOR’S CONTRIBUTION

In I, the author of this dissertation develops a new probabilistic model and the corresponding inference algorithm for speaker diarization. The author implemented and evaluated the new diarization method on data prepared by his co-authors. He planned and wrote the paper while his co-authors provided comments. In II, the author proposes a new training method for the discriminative formulation of probabilistic linear discriminant analysis to compensate for inter-dataset variability. The author was responsible for implementing the proposed method, carrying out all the experiments, and writing the paper. The co-authors provided suggestions and minor text edits. In III, the author carried out experiments on enhancing and optimizing unsupervised speech activity detectors for the submission of the “HAPPY” team to the NIST OpenSAD evaluation. In IV, the author extended the unsupervised speech activity detection method from III. The author was responsible for implementing the proposed method, carrying out experiments on the stand-alone evaluation of the proposed method, and writing the major parts of the article. In all papers, the order of authors reflects the contribution to preparing the papers; the first author was the principal author responsible for editing the text.


TABLE OF CONTENTS

1 INTRODUCTION

2 FUNDAMENTALS OF MACHINE LEARNING
  2.1 Machine learning
    2.1.1 Supervised learning
    2.1.2 Unsupervised learning
    2.1.3 Semi-supervised learning
  2.2 Modeling complex densities
    2.2.1 Explicit density models
    2.2.2 Implicit models
    2.2.3 Parameter estimation
  2.3 Performance evaluation, model selection, and algorithm selection
    2.3.1 Performance evaluation
    2.3.2 Model selection
    2.3.3 Algorithm selection
  2.4 Approximate inference and model selection

3 SPEECH ACTIVITY DETECTION
  3.1 SAD overview
  3.2 Semi-supervised SAD

4 AN OVERVIEW OF SPEAKER RECOGNITION AND SEGMENTATION
  4.1 Front-end
  4.2 Back-end
    4.2.1 Speaker verification
    4.2.2 Speaker diarization
  4.3 Performance evaluation
    4.3.1 Speaker verification
    4.3.2 Speaker diarization

5 SPEAKER RECOGNITION AND DIARIZATION — A MODERN MACHINE LEARNING PERSPECTIVE
  5.1 I-vector features
  5.2 Speaker partitioning problem
  5.3 Speaker verification
    5.3.1 Generative models
    5.3.2 Discriminative models
  5.4 Speaker diarization

6 SUMMARY OF PUBLICATIONS AND RESULTS
  6.1 Summary of publications
  6.2 Summary of results

7 CONCLUSIONS

BIBLIOGRAPHY


1 INTRODUCTION

Speech is the primary method of communication between humans. Speech is not only used for direct and instant interactions in our daily life, but also as a way to share information stored in the form of recorded speech. This includes lectures, podcasts, meetings, personal video and audio recordings, and other spoken documents. During the past decades we have observed rapid growth in the quantity of audio and video data being produced and archived, enabled by the availability of inexpensive and efficient storage. This massive amount of multimedia data has created a need for technologies to facilitate efficient access to and automatic organization of the information present in this data. As an example, it has been estimated that YouTube¹ alone has more than 400 hours of video uploaded every minute².

¹ www.youtube.com
² http://www.everysecond.io/youtube

Audio data contains different types of information. Some of this information, such as the lexical content (what is being spoken) and the speaker identity (who is speaking), depends on the intrinsic properties of the speaker, while other information, such as background sounds, depends on extrinsic factors. Therefore, different speech processing technologies are needed for the automatic extraction of information from raw audio recordings. Automatic speech recognition (ASR) [1] aims at extracting spoken words from audio, regardless of who is speaking. In contrast to ASR, automatic speaker recognition (SR) is concerned with the identity of a speaker, regardless of what was said. For instance, an SR system may aim to verify whether voices in a pair of speech utterances originate from the same speaker or not. This setup, commonly known in SR as automatic speaker verification (ASV), finds applications in forensic investigation, where ASV systems are used to determine whether or not a specific individual (suspected speaker) is the source of a voice in a given recording (trace). Other applications of ASV include logical and physical access control (e.g. controlling access to a bank account or a physical space) and personalization (e.g. personalized dialogue systems). Another scenario, speaker identification (SID), involves comparing an unknown utterance against a collection of previously registered speakers so as to determine the identity of the unknown speaker — or to report that no match was found. The latter case is called open-set SID. Speaker identification can be particularly useful in searching for matches to given speaker(s) from large multimedia databases or a forensic voice register.

The usual assumption in both ASV and SID is that any given recording contains speech from one speaker only. If an audio recording, however, contains speech from more than one speaker, the task of segmenting it according to speakers’ identities is known as speaker diarization (SD) [2].

Both SR and SD aim at determining an unobserved equivalence relation for a set of given speech utterances, with equivalence classes corresponding to distinct speakers (identities). Since the set of equivalence classes defines a partition of the input collection of utterances into disjoint subsets, both SR and SD can be seen as instances of a more general speaker partitioning problem [3]. A computer-aided solution to this problem requires an algorithm that takes a set of utterances as input and outputs a partition such that the subsets are formed according to speaker identity.

Solutions to algorithmic problems can be roughly classified into two broad categories: rule-based and data-driven. In rule-based solutions, the rules are constructed by human experts based on knowledge of their domain. This domain knowledge is often the result of long-term human experience and learning. Therefore, this approach could alternatively be described as human learning: the knowledge acquired by a human expert is converted into a set of rules, and these rules are then used for solving a new problem.

A demonstrative example of constructing a rule-based decision rule is that of speech activity detection (SAD), the problem of detecting human speech in audio signals. Some of the earliest SAD methods [4] used the zero-crossing rate (ZCR), which measures how often the signal amplitude changes its sign within a given time interval, to detect speech segments. It was observed that audio segments containing speech have lower ZCRs, on average, than non-speech segments. The decision rule was based on comparing the ZCR against a pre-defined decision threshold. Such a rule is based on domain knowledge about the properties of voiced speech, acquired by humans through the research process. Nowadays, however, SADs are almost exclusively based on purely data-driven or hybrid decision rules.
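To make the rule concrete, the following is a minimal sketch of such a ZCR-based detector (an illustration only, not a method from this thesis; the frame length, hop size, and threshold value are arbitrary assumptions):

```python
import numpy as np

def zero_crossing_rate(frame: np.ndarray) -> float:
    """Fraction of consecutive samples whose amplitude changes sign."""
    signs = np.sign(frame)
    signs[signs == 0] = 1              # treat exact zeros as positive
    return float(np.mean(signs[1:] != signs[:-1]))

def zcr_sad(signal: np.ndarray, frame_len: int = 400, hop: int = 160,
            threshold: float = 0.1) -> np.ndarray:
    """Label a frame as speech (True) when its ZCR falls below a fixed threshold."""
    n_frames = 1 + max(0, (len(signal) - frame_len) // hop)
    decisions = np.zeros(n_frames, dtype=bool)
    for i in range(n_frames):
        frame = signal[i * hop: i * hop + frame_len]
        decisions[i] = zero_crossing_rate(frame) < threshold
    return decisions

# Toy usage: a low-frequency sinusoid ("voiced speech") followed by white noise.
fs = 16000
t = np.arange(fs) / fs
voiced = 0.5 * np.sin(2 * np.pi * 120 * t)      # low ZCR
noise = 0.5 * np.random.randn(fs)               # high ZCR
labels = zcr_sad(np.concatenate([voiced, noise]))
print(labels[:5], labels[-5:])                  # mostly True, then mostly False
```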

While the rule-based approach often has the benefit of leading to interpretable and simple methods, many real-world problems cannot be accurately solved using such an approach. Whenever a large number of factors influence the answer, the rules will easily depend on too many factors and will need to be tuned very finely. It then becomes difficult or even impossible for a human to accurately code the rules.

This is the case for many speech analysis problems, including speech activity detection, speaker recognition, and diarization. This is mainly due to the enormous variability of speech signals caused by several factors. One major factor, intra-speaker variability, includes a speaker’s emotional or health state, speaking rate, vocal effort, and various other effects. Another major factor, known as channel variability, includes distortions introduced by the transmission channel (e.g. speech recorded with two different smartphones). As a result, modern speaker recognition (and diarization) systems are almost exclusively data-driven. In contrast to the rule-based approach, data-driven algorithms are based on knowledge or rules that are automatically extracted, or learned, from data. Therefore, this approach is also known as machine learning. Machine learning [5], [6] techniques make it possible to construct computer systems with the ability to learn (e.g., progressively improve performance on a specific task) using data, without being explicitly programmed.

In the speech field, the datasets used by researchers typically contain on the order of thousands of hours of speech data. The practical solutions pursued by major technology companies probably use several orders of magnitude more data (though such numbers are rarely reported in detail). However, despite large amounts of data and decades of research, many problems in speech technology remain challenging due to the high variability of speech signals and adverse acoustic conditions, most notably background noise, channel distortions, and reverberation. While the intrinsic variability of speech is theoretically confined to the set of possible sounds that could be produced by the human vocal organs, background noise can be, in principle, arbitrary. As a result, it is highly unlikely that a dataset used for designing a predictive model would accurately represent all possible acoustic conditions, which leads to poor generalization to unseen conditions. This makes speech analysis an extremely challenging problem for even the most powerful machine learning models. Collecting and labeling more diverse datasets is part of the solution. Another part is to develop machine learning methods that exploit limited labeled data in more efficient ways.

This dissertation contributes to machine learning based solutions to several problems in speech analysis: speech activity detection, speaker verification, and speaker diarization. The main focus is on improving the generalization performance of predictive models by using machine learning methodologies such as semi-supervised learning and multi-task learning, both of which are based on the idea of using information sources beyond the labeled data. The former employs unlabeled data along with a small amount of labeled data, while the latter leverages information from related learning tasks or datasets.

The rest of this article-based dissertation is organized as follows:

• Chapter 2 contains theoretical background on relevant topics in machine learning with a special emphasis on probabilistic models.

• Chapter 3 discusses the speech activity detection task and the proposed semi-supervised SAD.

• Chapter 4 provides background for the speaker verification and speaker diarization tasks.

• Chapter 5 discusses the speaker verification and diarization tasks from a probabilistic perspective and reviews the proposed methods.

• Chapter 6 summarizes contributions from the publications included in this dissertation.

• Chapter 7 concludes the dissertation and outlines possible directions of future work.

The original research papers are attached at the end as appendices. To promote reproducible research, the author has made selected code originating from the work in this dissertation publicly available³.

³ http://cs.uef.fi/~sholok/


2 FUNDAMENTALS OF MACHINE LEARNING

2.1 MACHINE LEARNING

Machine learning is a subfield of computer science focused on constructing algorithms that can make data-driven decisions rather than relying on explicitly programmed instructions. One definition of the algorithms studied in the field of machine learning is provided in [5]: “A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P if its performance at tasks in T, as measured by P, improves with experience E”. The definition suggests that such algorithms should generalize their experience to handle previously unencountered data. Handwritten digit recognition is a representative example: the task, T, is to classify a given image of a digit into 10 categories (or classes): “0”, “1”, ..., “9”. The performance measure, P, could be the percentage of digits that were categorized correctly. Finally, the experience, E, is a set of images with known labels. Elements of E are called training examples. Another example, in the domain of speech technology, is speaker verification. The task, T, is to decide whether or not a pair of speech utterances were spoken by the same person. As in the previous example, the performance measure, P, could be the percentage of pairs that were categorized correctly, and the experience, E, is a set of audio recordings with speaker labels.

Machine learning tasks (and the corresponding techniques) can be divided into four broad classes: supervised, unsupervised, semi-supervised, and reinforcement learning.

• In supervised learning [7], training examples are pairs of inputs and desired outputs, for example, class labels. Thus, training data is often referred to as labeled data. The goal of a learning algorithm is to find a mapping from the set of all possible inputs to the set of all possible outputs such that it would predict correct outputs for previously unseen inputs.

• In unsupervised learning [7] the goal is to reveal a structure that underlies the data. In this case only elements of the input space are available. Therefore, such data is called unlabeled.

• Semi-supervised learning [8] can be seen as a compromise between supervised and unsupervised learning tasks. Specifically, only part of the training inputs has the corresponding outputs available. The overall goal may be the same as in either supervised or unsupervised learning. More often, though, semi-supervised learning is seen as supervised learning supplied with an additional set of unlabeled inputs.

• Reinforcement learning [9] consists of interacting with an environment and receiving feedback for the actions taken, in the form of either reward or punishment. The goal is to find a policy, a mapping from the set of environmental states to the set of actions, that maximizes some notion of cumulative reward.


The focus of this thesis will be on (semi-)supervised and unsupervised learning tasks. The next two sub-sections provide more formal definitions of these learning tasks.

2.1.1 Supervised learning

Given an input object represented by a $d$-dimensional vector of features (or attributes) $\mathbf{x} = (x^{(1)}, x^{(2)}, \ldots, x^{(d)})^{\mathsf{T}} \in \mathcal{X}$, the aim of supervised learning is to approximate an unknown relationship between the input (observation) $\mathbf{x} \in \mathcal{X}$ and the output (target) $y \in \mathcal{Y}$ in order to build an accurate predictor that outputs $y = f(\mathbf{x})$. The function $f$ is known as a decision rule (or hypothesis) and it is estimated (learned) using a set consisting of input-output pairs $\mathcal{D} = \{(\mathbf{x}_1, y_1), (\mathbf{x}_2, y_2), \ldots, (\mathbf{x}_N, y_N)\}$ known as the training set. To optimize the decision rule using $\mathcal{D}$, one needs a way to measure how close a prediction of the function, $\hat{y}$, is to the desired output or ground truth $y$. To this end, one defines a loss function $\ell(y, \hat{y})$. Here, the basic elements of a supervised learning task are formally specified:

• Input space $\mathcal{X}$

• Output space $\mathcal{Y}$

• The space of decision rules (or hypothesis space) $\mathcal{F}$, such that $f : \mathcal{X} \to \mathcal{Y}$ and $f \in \mathcal{F}$

• Loss function $\ell : \mathcal{Y} \times \mathcal{Y} \to \mathbb{R}_{+}$ such that $\ell(y, y) = 0$, i.e. no loss is incurred for correct predictions.

• Training set $\mathcal{D} = \{(\mathbf{x}_i, y_i)\}_{i=1}^{N}$, where a data point $\mathbf{z}_i = (\mathbf{x}_i, y_i)$ belongs to the product space $\mathcal{Z} = \mathcal{X} \times \mathcal{Y}$, known as the input-output space.

Additionally, it is assumed that data points $(\mathbf{x}, y)$ are drawn independently from the set $\mathcal{Z}$ according to an unknown probability measure $P$ over $\mathcal{Z}$. In the following, it will be assumed that the probability measure $P$ is defined by some probability density function, denoted by $p$. Further, all distributions will be characterized by probability density functions or probability mass functions.

A learning algorithm is a way to select a decision rule $f \in \mathcal{F}$. It is defined as a mapping $\mathcal{A}$ that maps a training sample $\mathcal{D} \in \mathcal{Z}^N = \mathcal{Z} \times \mathcal{Z} \times \ldots \times \mathcal{Z}$ to a decision rule $f$. That is, different learning algorithms may output different functions $f$. A common assumption is that the learning algorithm ignores the ordering of elements in the training set $\mathcal{D}$, that is, $\mathcal{A}(\mathbf{z}_1, \mathbf{z}_2, \ldots, \mathbf{z}_N) = \mathcal{A}(\mathbf{z}_{\pi_1}, \mathbf{z}_{\pi_2}, \ldots, \mathbf{z}_{\pi_N})$, where $\mathbf{z}_i \in \mathcal{X} \times \mathcal{Y}$ is an element of $\mathcal{D}$ and $\pi$ is a permutation, i.e. a bijection of the set $\{1, \ldots, N\}$ onto itself.

Supervised learning tasks are categorized by the type of the output variable. In classification, the task is to predict a categorical, discrete class label. Assuming that each word in a vocabulary is a separate class, automatic speech recognition can be viewed as an instance of a classification task. Another learning task, regression, focuses on predicting a continuous quantity such as temperature, price, or road traffic. Formally, if the cardinality of the output space $\mathcal{Y}$ is finite, the learning task is referred to as classification. Otherwise, it is referred to as regression.

If the elements of $\mathcal{Y}$ are not scalar (discrete or real) values, then the learning task is referred to as structured output learning or structured prediction [10]. An example of structured learning is learning to rank [11], where the goal is to predict a permutation of a fixed list of items. Ranking algorithms are used, for example, in web search engines, whose goal is to find a permutation that places the most relevant web pages for a given query at the top of the list.

Finally, there are some specialized tasks that are usually considered separately in the taxonomy of supervised learning tasks even if, formally, they can be placed under one of the afore-mentioned categories. For instance, ordinal ranking [12, 13] is the task of predicting the value of an ordinal variable, that is, a variable whose value exists on an arbitrary scale where only the relative ordering between different values matters. Such values can be encoded as positive integers; that is, two different elements of $\mathcal{Y}$ can be compared with the ‘<’ operator. They are usually referred to as ranks to distinguish them from the labels in the classification task. Examples of ordinal variables include age measured in years, level of agreement from “strongly disagree” to “strongly agree”, and the size of an object. Even if predicting ordinal data can be cast as a classification problem, taking into account the natural order within categories allows more accurate and interpretable predictions to be made.

The goal of supervised learning is to find a function $f \in \mathcal{F}$ such that the expected loss functional $R$, or risk, associated with $f$, defined as the expectation of the loss function, is minimized:
$$R(f) \triangleq \mathbb{E}_{p(\mathbf{x},y)}[\ell(y, f(\mathbf{x}))] = \int\!\!\int \ell(y, f(\mathbf{x}))\, p(\mathbf{x}, y)\, dy\, d\mathbf{x} \;\to\; \min_{f \in \mathcal{F}}. \tag{2.1}$$
Using the relation between joint and conditional density functions known as the general multiplication rule [14], $p(\mathbf{x}, y) = p(y|\mathbf{x})\, p(\mathbf{x})$, the risk functional can be rewritten as
$$\mathbb{E}_{p(\mathbf{x})}\!\big[\mathbb{E}_{p(y|\mathbf{x})}[\ell(y, f(\mathbf{x}))]\big] = \int\!\!\int \ell(y, f(\mathbf{x}))\, p(y|\mathbf{x})\, p(\mathbf{x})\, dy\, d\mathbf{x} \tag{2.2}$$
$$= \int \underbrace{\left[\int \ell(y, f(\mathbf{x}))\, p(y|\mathbf{x})\, dy\right]}_{R_{\mathrm{cond}}} p(\mathbf{x})\, d\mathbf{x} = \mathbb{E}_{p(\mathbf{x})}[R_{\mathrm{cond}}(f|\mathbf{x})], \tag{2.3}$$
where $R_{\mathrm{cond}}(f|\mathbf{x})$ is known as the conditional risk. A decision rule that minimizes the conditional risk for each input $\mathbf{x}$ is referred to as the Bayes decision rule [15],
$$f_{\mathrm{Bayes}}(\mathbf{x}) \triangleq \arg\min_{f \in \mathcal{F}} R_{\mathrm{cond}}(f|\mathbf{x}) = \arg\min_{f \in \mathcal{F}} \mathbb{E}_{p(y|\mathbf{x})}[\ell(y, f(\mathbf{x}))], \tag{2.4}$$
also known as the optimal Bayes classifier in the context of classification. By construction it minimizes the overall risk. The resulting minimum value, the Bayes risk $R_{\mathrm{Bayes}} = R(f_{\mathrm{Bayes}})$, is the smallest achievable risk, assuming that the distribution $p(\mathbf{x}, y)$ is known. This makes the Bayes risk a useful theoretical construct for supervised learning tasks.

It should be noted that $f_{\mathrm{Bayes}}$ may not be in the set of possible decision rules, $\mathcal{F}$. Therefore, the choice of the family of predictors has an influence on how close the obtained predictor is to the Bayes optimal solution.

The choice of the loss function also has a major impact on the Bayes decision. Ideally, the loss function should be selected so that it reflects the nature of the specific problem at hand. When the loss function cannot be determined due to limited knowledge about the problem, one option is to resort to classic losses used in statistics, such as the quadratic loss
$$\ell(y, \hat{y}) = (y - \hat{y})^2, \tag{2.5}$$
or the absolute loss
$$\ell(y, \hat{y}) = |y - \hat{y}|, \tag{2.6}$$
both of which are suited to real-valued (univariate) outputs. For the quadratic loss, the Bayes-optimal decision corresponds to the expected value of the output,
$$f_{\mathrm{Bayes}}(\mathbf{x}) = \arg\min_{f \in \mathcal{F}} \mathbb{E}_{p(y|\mathbf{x})}[(y - f(\mathbf{x}))^2] = \mathbb{E}_{p(y|\mathbf{x})}[y], \tag{2.7}$$
and for the absolute loss to the median [16]. For categorical outputs, in turn, a common choice is the zero-one loss, defined as
$$\ell(y, \hat{y}) = \begin{cases} 0, & \text{if } y = \hat{y} \\ 1, & \text{otherwise,} \end{cases} \tag{2.8}$$
which assigns the same penalty to any incorrect decision. The Bayes optimal classifier with this loss selects the output with the largest probability [16]:
$$f_{\mathrm{Bayes}}(\mathbf{x}) = \arg\max_{y}\, p(y|\mathbf{x}). \tag{2.9}$$
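As a toy numerical check of these statements (not an example from the thesis; the discrete conditional distribution below is invented), one can verify that the risk-minimizing predictions under the quadratic, absolute, and zero-one losses are the conditional mean, median, and mode, respectively:

```python
import numpy as np

# A toy conditional distribution p(y | x) over a few numeric outputs.
values = np.array([0.0, 1.0, 2.0, 10.0])
probs = np.array([0.3, 0.4, 0.2, 0.1])           # sums to 1

def bayes_prediction(loss, candidates):
    """Candidate prediction minimizing the conditional risk E_p(y|x)[loss(y, yhat)]."""
    risks = [np.sum(probs * loss(values, yhat)) for yhat in candidates]
    return candidates[int(np.argmin(risks))]

candidates = np.linspace(0.0, 10.0, 1001)         # dense grid of possible predictions
quadratic = bayes_prediction(lambda y, yhat: (y - yhat) ** 2, candidates)
absolute = bayes_prediction(lambda y, yhat: np.abs(y - yhat), candidates)
zero_one = values[np.argmax(probs)]               # mode maximizes p(y|x)

print(quadratic)   # 1.8  = conditional mean, optimal for the quadratic loss (2.5)
print(absolute)    # 1.0  = conditional median, optimal for the absolute loss (2.6)
print(zero_one)    # 1.0  = mode, optimal for the zero-one loss (2.8)
```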

The problem of selecting an appropriate family of decision rules and loss function is discussed further in Section 2.3.

There are two commonly adopted approaches, used in a large number of applications, to find a decision rule $f$ [17]:

• Empirical risk minimization

• Estimation of a probability density function

The former is used if one adopts a direct form of the predictor, that is, explicitly specifies the set of decision rules $\mathcal{F}$ and its elements. In contrast, the latter approach constructs a decision rule by building a probabilistic model of the data and applying Bayesian decision theory [15]. These approaches are described in the following sections.

Empirical risk minimization

In practice, the joint probability density function of the inputs and outputs, $p(\mathbf{x}, y)$, is unknown. But if one assumes that the training set $\mathcal{D}$ was drawn from this distribution, it is possible to use Monte Carlo methods [18], a class of numerical algorithms based on stochastic simulation, to approximate the risk functional. By the law of large numbers [19], the expected value of a random variable can be approximated by taking the empirical mean of independent samples of the variable. This allows numerical computation of integrals of the form
$$\mathbb{E}_{q(\mathbf{x})}[h(\mathbf{x})] = \int h(\mathbf{x})\, q(\mathbf{x})\, d\mathbf{x} \approx \frac{1}{N} \sum_{i=1}^{N} h(\mathbf{x}_i), \tag{2.10}$$
where $h$ is some function and the $\mathbf{x}_i$ are independently and identically distributed (i.i.d.) samples from the distribution defined by the probability density function $q$. With the help of (2.10), one may then replace the true risk $R$ in (2.1) by the empirical risk:
$$R_{\mathrm{emp}}(f) = \frac{1}{N} \sum_{i=1}^{N} \ell(y_i, f(\mathbf{x}_i)). \tag{2.11}$$
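A minimal numerical illustration of the approximation in (2.10), under the assumption $x \sim \mathcal{N}(0, 1)$ and $h(x) = x^2$ so that the true expectation equals 1 (a toy sketch, not code from the thesis):

```python
import numpy as np

rng = np.random.default_rng(0)

def mc_expectation(h, sampler, n):
    """Approximate E_q[h(x)] by the empirical mean of n i.i.d. samples from q, as in (2.10)."""
    samples = sampler(n)
    return np.mean(h(samples))

h = lambda x: x ** 2                       # E[x^2] = 1 for x ~ N(0, 1)
sampler = lambda n: rng.standard_normal(n)

for n in (10, 1000, 100000):
    print(n, mc_expectation(h, sampler, n))   # estimates converge towards 1.0 as n grows
```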


The empirical risk minimization (ERM) principle [17] suggests picking a decision rule $f$ that minimizes the empirical risk.

In practice, a decision rule $f$ belongs to a parametric family parametrized by a fixed number of parameters represented as a vector of real numbers $\boldsymbol{\theta} = (\theta_1, \ldots, \theta_d)^{\mathsf{T}}$. That is, the empirical risk is a function of $\boldsymbol{\theta}$ and can be minimized using one of several numerical optimization methods for finding a local minimum of a function, such as gradient descent or Newton's method [20]. If the loss function is convex, then the empirical risk is also a convex function, so any local minimum is a global minimum.

Some instances of empirical risk minimization involve non-convex loss functions. Examples include training a binary linear classifier $f(\mathbf{x}) = \mathrm{sign}(\boldsymbol{\theta}^{\mathsf{T}}\mathbf{x})$ using the zero-one loss function
$$\ell_{0\text{-}1}(s) = \mathbb{1}_{[s \leq 0]}(s). \tag{2.12}$$
Here, $\mathbb{1}_{[s \leq 0]}$ denotes the indicator function that is one if $s \leq 0$ and zero otherwise. This results in a piecewise constant objective function:
$$R_{\mathrm{emp}}(\boldsymbol{\theta}) = \frac{1}{N}\sum_{i=1}^{N} \ell_{0\text{-}1}(y_i\, \boldsymbol{\theta}^{\mathsf{T}}\mathbf{x}_i), \tag{2.13}$$
where $y_i \in \{-1, +1\}$ denotes a class label. Instead of direct optimization of (2.13), a commonly adopted remedy is to use a proxy to the loss, called a surrogate loss function. For computational reasons it is usually a convex function that upper bounds the original loss function. The rationale is that by pushing down an upper bound one pushes down the original loss as well. Two broadly used surrogates to the zero-one loss are the hinge loss and the logistic loss [21]:
$$\ell_{\mathrm{hinge}}(s) = \max(0,\, 1 - s), \tag{2.14}$$
$$\ell_{\mathrm{logistic}}(s) = \log(1 + \exp(-s)). \tag{2.15}$$
Even if the form of the decision rule is the same in both cases, the former approach and its associated learning algorithm are known as the support vector machine (SVM), while the latter is known as logistic regression [22]. Convexity of the resulting objective functions leads to tractable optimization problems, making these techniques attractive for practical use. However, replacing the original loss function by a surrogate raises the natural question of whether minimizing the new risk leads to a function that also minimizes the original risk. The answer turns out to be affirmative for so-called consistent surrogate losses [21]. Formally, a surrogate loss $\ell_s$ is consistent if, for any distribution $p(\mathbf{x}, y)$ and for any sequence of functions $f_n : \mathcal{X} \to \mathcal{Y}$ such that
$$\lim_{n \to \infty} \mathbb{E}_{p(\mathbf{x},y)}[\ell_s(y, f_n(\mathbf{x}))] = \min_{f}\, \mathbb{E}_{p(\mathbf{x},y)}[\ell_s(y, f(\mathbf{x}))], \tag{2.16}$$
the following equality holds:
$$\lim_{n \to \infty} \mathbb{E}_{p(\mathbf{x},y)}[\ell(y, f_n(\mathbf{x}))] = \min_{f}\, \mathbb{E}_{p(\mathbf{x},y)}[\ell(y, f(\mathbf{x}))]. \tag{2.17}$$
In other words, minimization of the expected surrogate loss guarantees minimization of the original risk (2.1). It has been shown that both the hinge and the logistic losses are consistent with respect to the zero-one loss for binary classification [21].
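The sketch below illustrates this machinery on synthetic two-class data (an illustration under invented data, not the classifier of Publication II): the empirical risk with the logistic surrogate (2.15) is minimized by plain gradient descent, and the zero-one risk (2.13) of the learned linear classifier is reported alongside it.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy data: two Gaussian blobs in 2-D with labels in {-1, +1}.
n = 200
X = np.vstack([rng.normal(-1.0, 1.0, (n // 2, 2)), rng.normal(+1.0, 1.0, (n // 2, 2))])
y = np.concatenate([-np.ones(n // 2), np.ones(n // 2)])

def logistic_risk(theta):
    """Empirical risk with the logistic surrogate loss (2.15)."""
    margins = y * (X @ theta)
    return np.mean(np.log1p(np.exp(-margins)))

def logistic_risk_grad(theta):
    """Gradient of the logistic empirical risk with respect to theta."""
    margins = y * (X @ theta)
    weights = -1.0 / (1.0 + np.exp(margins))          # derivative of log(1 + exp(-m))
    return (X * (weights * y)[:, None]).mean(axis=0)

def zero_one_risk(theta):
    """Empirical risk with the original zero-one loss (2.12)."""
    return np.mean(y * (X @ theta) <= 0)

theta = np.zeros(2)
for _ in range(1000):                                  # plain gradient descent on the surrogate
    theta -= 0.5 * logistic_risk_grad(theta)

print(logistic_risk(theta), zero_one_risk(theta))      # small surrogate risk, small zero-one risk
```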

The classifier for speaker verification proposed in Publication II can be seen as a modification of a conventional SVM with the aim of compensating for domain mismatch, a common problem in predictive modeling that occurs when the distributions of the training and test sets differ [23].

Advantages and disadvantages of the empirical risk minimization approach are summarized in Table 2.1.

Table 2.1: Advantages and disadvantages of empirical risk minimization.

Benefits:
• In the limit, the empirical distribution tends to the correct distribution.
• The decision rule is found on the basis of minimal expected loss (risk), which is the quantity we are ultimately interested in.

Drawbacks:
• The decision rule must be retrained if the loss function changes.
• There are no inherent ways to estimate the confidence intervals of the predictions.
• There is no straightforward way to incorporate unlabeled data.

Density estimation

The main characteristic of the empirical risk minimization approach discussed in the previous section is that the functional form of the decision rule is explicitly defined. Typically, a decision rule $f : \mathcal{X} \to \mathcal{Y}$ belongs to some parametric family of functions; ERM therefore selects a member of that family by solving an optimization problem.

In contrast to ERM, the approach based on density estimation does not define an explicit form of the decision rule. This approach aims at estimating the conditional density function $p(y|\mathbf{x})$ from the data and using the Bayes decision rule (2.4) to make predictions. There are two different types of probabilistic models to define $p(y|\mathbf{x})$:

• discriminative models, and

• generative models.

Discriminative models directly approximate the conditional distribution $p(y|\mathbf{x})$. Logistic regression [6] is an example of a discriminative model suitable for binary classification tasks, that is, when $\mathcal{Y} = \{+1, -1\}$. It defines the probability of an input vector $\mathbf{x}$ being an instance of the class $y = 1$ as follows:
$$p(y = 1|\mathbf{x}) = \sigma(\boldsymbol{\theta}^{\mathsf{T}}\mathbf{x}) = \frac{1}{1 + \exp(-\boldsymbol{\theta}^{\mathsf{T}}\mathbf{x})}, \tag{2.18}$$
where $\sigma(\cdot)$ denotes the sigmoid function¹ and $\boldsymbol{\theta}$ is the vector of parameters to be learned.

In contrast, generative models approximate the joint distribution of inputs and outputs, $p(\mathbf{x}, y)$. The conditional density function $p(y|\mathbf{x})$ can be obtained as follows:
$$p(y|\mathbf{x}) = \frac{p(\mathbf{x}, y)}{p(\mathbf{x})} = \frac{p(\mathbf{x}, y)}{\int p(\mathbf{x}, y)\, dy} \propto p(\mathbf{x}, y). \tag{2.19}$$

¹ $\sigma(s) = 1/(1 + \exp(-s))$


Often the joint density is defined as a product of a likelihood and a prior distribution (often simply called the prior): $p(\mathbf{x}, y) = p(\mathbf{x}|y)\, p(y)$. The prior distribution $p(y)$ expresses one’s a priori beliefs about the value of $y$ before some evidence, $\mathbf{x}$, is taken into account. The likelihood $p(\mathbf{x}|y)$, in turn, is the conditional density of the observation $\mathbf{x}$ given the value of the target variable $y$. In the context of this factorization, $p(y|\mathbf{x})$ is usually referred to as the posterior distribution because it expresses a posteriori beliefs about the value of $y$, based on empirical evidence combined with prior beliefs.

One of the simplest examples of a generative model is the Gaussian classifier (see Section 4.2 in [6]). Its name originates from the assumption that the class-conditional distributions are Gaussian, that is,
$$p(\mathbf{x}|y) = \mathcal{N}(\mathbf{x}|\boldsymbol{\mu}_y, \boldsymbol{\Sigma}_y) = \frac{1}{(2\pi)^{d/2}\,|\boldsymbol{\Sigma}_y|^{1/2}} \exp\!\left(-\frac{1}{2}(\mathbf{x} - \boldsymbol{\mu}_y)^{\mathsf{T}}\boldsymbol{\Sigma}_y^{-1}(\mathbf{x} - \boldsymbol{\mu}_y)\right). \tag{2.20}$$
Denoting the prior probability of the $k$-th class as $p(y = k) = \pi_k$, the posterior probability is found using Bayes' theorem:
$$p(y = k|\mathbf{x}) = \frac{p(\mathbf{x}|y = k)\, p(y = k)}{p(\mathbf{x})} = \frac{\pi_k\, \mathcal{N}(\mathbf{x}|\boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)}{\sum_{l=1}^{K} \pi_l\, \mathcal{N}(\mathbf{x}|\boldsymbol{\mu}_l, \boldsymbol{\Sigma}_l)}. \tag{2.21}$$
A slightly more sophisticated version of this classifier is used in Publications III and IV to detect audio segments containing speech. More precisely, in that case the class-dependent densities $p(\mathbf{x}|y)$ are defined not as single Gaussian densities but as weighted sums of Gaussian densities.
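A minimal sketch of such a generative Gaussian classifier, with invented class parameters rather than ones estimated from speech data as in Publications III and IV, shows how (2.20) and (2.21) combine to yield posterior class probabilities:

```python
import numpy as np

def gaussian_pdf(x, mu, Sigma):
    """Multivariate normal density N(x | mu, Sigma) as in (2.20)."""
    d = len(mu)
    diff = x - mu
    norm_const = (2 * np.pi) ** (d / 2) * np.sqrt(np.linalg.det(Sigma))
    return np.exp(-0.5 * diff @ np.linalg.solve(Sigma, diff)) / norm_const

# Invented parameters of a two-class Gaussian classifier (e.g. non-speech vs. speech features).
priors = np.array([0.4, 0.6])                                   # pi_k = p(y = k)
means = [np.array([0.0, 0.0]), np.array([2.0, 1.0])]            # mu_k
covs = [np.eye(2), np.array([[1.5, 0.3], [0.3, 0.8]])]          # Sigma_k

def posterior(x):
    """Posterior class probabilities p(y = k | x) via Bayes' theorem (2.21)."""
    likelihoods = np.array([gaussian_pdf(x, m, c) for m, c in zip(means, covs)])
    joint = priors * likelihoods                                # pi_k * N(x | mu_k, Sigma_k)
    return joint / joint.sum()                                  # divide by p(x)

x = np.array([1.5, 0.5])
post = posterior(x)
print(post, post.argmax())    # posterior probabilities and the class chosen under the 0-1 loss
```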

One can see that, in contrast to empirical risk minimization, a loss function is not used in this approach to find a decision rule. Advantages and disadvantages of the density estimation approach are summarized in Table 2.2.

Table 2.2: Advantages and disadvantages of density estimation.

Benefits:
• If the estimated density is the “true” model of the data, then this approach is optimal.
• Training is separated from prediction. There is no need to retrain the model if the loss function changes.

Drawbacks:
• If the model is poorly fitted, the prediction can be highly inaccurate.

2.1.2 Unsupervised learning

In contrast to supervised learning, unsupervised learning aims at discovering hidden structure in the data based on some a priori assumptions, but without access to any labels or response variables. The two most commonly known unsupervised learning tasks are dimensionality reduction and clustering. The former aims at finding a compact representation of data points while retaining the most relevant information; therefore, it can be viewed as a form of lossy compression. The clustering task, in turn, aims at discovering groups of similar objects in a dataset.


Dimensionality reduction

Given an input object represented by a $d$-dimensional vector $\mathbf{x} = (x^{(1)}, x^{(2)}, \ldots, x^{(d)})^{\mathsf{T}}$, dimensionality reduction aims at finding a vector of lower dimensionality $\mathbf{s} = (s^{(1)}, s^{(2)}, \ldots, s^{(m)})^{\mathsf{T}}$, where $m < d$, so that as much information as possible about the original data is preserved. This aim can be formalized within an encoder-decoder framework [24]. An encoder is a mapping from the original feature space to some lower-dimensional space that assigns a compact representation $\mathbf{s} = \mathrm{encoder}(\mathbf{x})$ to any given feature vector $\mathbf{x}$. The decoder is used to obtain a reconstruction $\mathbf{x}_{\mathrm{rec}} = \mathrm{decoder}(\mathbf{s})$ from the encoding $\mathbf{s}$ of the original feature vector $\mathbf{x}$. These mappings are found by minimizing some loss function $\ell(\cdot, \cdot)$ that measures the dissimilarity between a feature vector $\mathbf{x}$ and its reconstruction $\mathbf{x}_{\mathrm{rec}}$:
$$\sum_{i=1}^{N} \ell\big(\mathbf{x}_i,\, \mathrm{decoder}_{\boldsymbol{\theta}}(\mathrm{encoder}_{\boldsymbol{\phi}}(\mathbf{x}_i))\big) \to \min_{\boldsymbol{\phi},\, \boldsymbol{\theta}}, \tag{2.22}$$
where $\boldsymbol{\phi}$ and $\boldsymbol{\theta}$ are the parameters of the encoder and decoder, respectively. In other words, the mapping $\mathrm{decoder}(\mathrm{encoder}(\cdot))$, usually referred to as an autoencoder [25] in the context of artificial neural network models, approximates the identity mapping.

Dimensionality reduction is used to combat the curse of dimensionality [26], a problem that occurs when the number of features, $d$, is large relative to the training set size. As the dimensionality increases, one needs larger datasets to achieve satisfactory generalization ability of a learning algorithm, as some feature combinations may never occur in the training set (a similar argument holds for real-valued data).

Perhaps the most well known dimensionality reduction technique is principal component analysis (PCA) [27]. It seeks a matrix $\mathbf{U} \in \mathbb{R}^{d \times m}$ with orthonormal columns to define the encoding and decoding mappings as $\mathbf{s} = \mathrm{encoder}(\mathbf{x}) = \mathbf{U}^{\mathsf{T}}\mathbf{x}$ and $\mathbf{x}_{\mathrm{rec}} = \mathrm{decoder}(\mathbf{s}) = \mathbf{U}\mathbf{s}$, respectively. This is achieved by minimizing the mean squared error, $\ell_{\mathrm{mse}}(\mathbf{x}, \hat{\mathbf{x}}) = \|\mathbf{x} - \hat{\mathbf{x}}\|^2$:
$$\sum_{i=1}^{N} \|\mathbf{x}_i - \mathbf{U}\mathbf{U}^{\mathsf{T}}\mathbf{x}_i\|^2 \to \min_{\mathbf{U}} \tag{2.23}$$
$$\text{s.t. } \mathbf{U}^{\mathsf{T}}\mathbf{U} = \mathbf{I}. \tag{2.24}$$
Assuming that the data is centered, that is, the empirical mean of the training set equals zero, the solution to this problem is the matrix having as columns the eigenvectors corresponding to the $m$ largest eigenvalues of the empirical covariance matrix $\mathbf{C} = \sum_{i=1}^{N} \mathbf{x}_i\mathbf{x}_i^{\mathsf{T}}$. PCA falls into the category of linear dimensionality reduction methods because the encoding mapping is linear [28].
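The following sketch (a toy illustration on synthetic data, not code from this dissertation) implements this eigenvector solution of (2.23)–(2.24) and reports the resulting reconstruction error:

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.standard_normal((500, 5)) @ rng.standard_normal((5, 5))   # toy correlated data, N x d

def pca_fit(X, m):
    """Return the d x m matrix whose columns are the m leading eigenvectors of the covariance."""
    Xc = X - X.mean(axis=0)                     # centering, as assumed in (2.23)
    C = Xc.T @ Xc                               # empirical covariance (up to a constant factor)
    eigvals, eigvecs = np.linalg.eigh(C)        # ascending eigenvalues for a symmetric matrix
    return eigvecs[:, ::-1][:, :m]              # keep the m leading eigenvectors

U = pca_fit(X, m=2)
Xc = X - X.mean(axis=0)
S = Xc @ U                                      # encoder: s = U^T x (applied row-wise)
X_rec = S @ U.T                                 # decoder: x_rec = U s
print(np.mean(np.sum((Xc - X_rec) ** 2, axis=1)))   # mean squared reconstruction error of (2.23)
```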

Dimensionality reduction can also be seen as part of representation learning [29], a class of techniques that aim at finding representations of the data that make it easier to extract useful information for a learning algorithm. Typically, this is achieved by learning a mapping (encoder) from raw measurements or low-level features to some fixed-dimensional vector space. Elements of this space, known as embeddings, are then used either for visualization or as an input to an algorithm that solves a separate learning task. More often, however, the term embedding is used specifically when similar input objects are expected to have similar representations in the embedding space. One example is word embeddings [30], where a vector of real numbers represents each word from a fixed vocabulary. Other examples include face [31] and speaker [32] embeddings, where face images and speech utterances, respectively, are embedded into a vector space such that any two representations of inputs corresponding to the same person are characterized by a small distance and representations of different individuals are characterized by a large distance.

One can point out that learning such a mapping requires labeled data, so learning embeddings could be categorized as a supervised learning technique. But since learning representations is rarely the final goal, but rather an intermediate step towards solving another applied problem, learning embeddings is usually categorized as an unsupervised learning task.

One broadly used technique to learn embeddings is linear discriminant analysis (LDA) [28]. It aims at projecting the data into a lower-dimensional space in such a way that the separation between classes is maximized. The first step of LDA computes the between-class, $\boldsymbol{\Sigma}_b$, and within-class, $\boldsymbol{\Sigma}_w$, covariance matrices, defined as follows:
$$\boldsymbol{\Sigma}_b = \sum_{k=1}^{K} (\boldsymbol{\mu}_k - \boldsymbol{\mu})(\boldsymbol{\mu}_k - \boldsymbol{\mu})^{\mathsf{T}} \tag{2.25}$$
$$\boldsymbol{\Sigma}_w = \sum_{i=1}^{N} (\mathbf{x}_i - \boldsymbol{\mu}_k)(\mathbf{x}_i - \boldsymbol{\mu}_k)^{\mathsf{T}}, \tag{2.26}$$
where $\boldsymbol{\mu}$ is the global mean, $\boldsymbol{\mu}_k$ is the mean of the class to which $\mathbf{x}_i$ belongs, and $K$ is the number of classes. Then, LDA seeks the projection $\mathbf{U}$ that maximizes the between-class variability, $\mathrm{tr}(\mathbf{U}^{\mathsf{T}}\boldsymbol{\Sigma}_b\mathbf{U})$, while minimizing the within-class variability, $\mathrm{tr}(\mathbf{U}^{\mathsf{T}}\boldsymbol{\Sigma}_w\mathbf{U})$, leading to the optimization problem
$$\frac{\mathrm{tr}(\mathbf{U}^{\mathsf{T}}\boldsymbol{\Sigma}_b\mathbf{U})}{\mathrm{tr}(\mathbf{U}^{\mathsf{T}}\boldsymbol{\Sigma}_w\mathbf{U})} \to \max_{\mathbf{U}} \tag{2.27}$$
$$\text{s.t. } \mathbf{U}^{\mathsf{T}}\mathbf{U} = \mathbf{I}. \tag{2.28}$$
This optimization problem is non-convex, so it is hard to solve directly. Conventionally, the objective function of the original problem is replaced by an alternative objective, $\mathrm{tr}\big((\mathbf{U}^{\mathsf{T}}\boldsymbol{\Sigma}_w\mathbf{U})^{-1}\mathbf{U}^{\mathsf{T}}\boldsymbol{\Sigma}_b\mathbf{U}\big)$, leading to a closed-form solution obtained by selecting the top eigenvectors of the matrix $\boldsymbol{\Sigma}_w^{-1}\boldsymbol{\Sigma}_b$ [33].
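A short sketch of this closed-form recipe on synthetic labeled data (illustrative only; the data and class structure are invented) computes (2.25)–(2.26) and takes the leading eigenvectors of $\boldsymbol{\Sigma}_w^{-1}\boldsymbol{\Sigma}_b$:

```python
import numpy as np

rng = np.random.default_rng(3)
# Synthetic 3-class data in 4 dimensions.
K, d = 3, 4
X = np.vstack([rng.normal(loc=k, scale=1.0, size=(100, d)) for k in range(K)])
labels = np.repeat(np.arange(K), 100)

mu = X.mean(axis=0)                                           # global mean
Sigma_b = np.zeros((d, d))
Sigma_w = np.zeros((d, d))
for k in range(K):
    Xk = X[labels == k]
    mu_k = Xk.mean(axis=0)
    diff = (mu_k - mu)[:, None]
    Sigma_b += diff @ diff.T                                  # between-class scatter (2.25)
    Sigma_w += (Xk - mu_k).T @ (Xk - mu_k)                    # within-class scatter (2.26)

# Leading eigenvectors of Sigma_w^{-1} Sigma_b give the projection (at most K-1 useful directions).
eigvals, eigvecs = np.linalg.eig(np.linalg.solve(Sigma_w, Sigma_b))
order = np.argsort(eigvals.real)[::-1]
U = eigvecs[:, order[:K - 1]].real
print(U.shape)                                                # (4, 2) projection matrix
```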

LDA has a probabilistic extension known as probabilistic linear discriminant analysis (PLDA) that is extensively used for face and speaker recognition [34]. In contrast with classical LDA, typically used as a tool for dimensionality reduction, PLDA can be reformulated as a probabilistic classifier to discriminate between a finite set of hypotheses describing the relationship among a set of biometric templates. Further, its discriminative formulation leads to a classifier known as the pairwise support vector machine (PSVM) in the speaker recognition community [35, 36], which in turn forms a basis for an alternative learning algorithm proposed in Publication II.

Clustering

Another major category of unsupervised learning is clustering, which aims to identify groups of similar objects (clusters) in a dataset. The underlying assumption is that the dataset consists of a few groups of objects such that objects in the same group are more similar to each other than to those in different groups. For instance, one of the most well known clustering algorithms, k-means [37], can be seen as a method to approximately solve the following optimization problem:
$$\sum_{i=1}^{N}\sum_{k=1}^{K} z_{ik}\, \|\mathbf{x}_i - \boldsymbol{\mu}_k\|^2 \to \min_{\{z_{ik}\},\, \{\boldsymbol{\mu}_k\}}, \tag{2.29}$$
$$\text{s.t. } \sum_{k=1}^{K} z_{ik} = 1, \tag{2.30}$$
where $K$ is the number of clusters, assumed to be known or fixed in advance, and $z_{ik} \in \{0, 1\}$ is an indicator variable that equals 1 if the $i$-th data point belongs to the $k$-th cluster and 0 otherwise. Finally, $\boldsymbol{\mu}_k$ is the centroid of the $k$-th cluster. The $k$-means algorithm is a block-coordinate descent (alternating minimization) method for this problem: it alternates between updating the $z_{ik}$ while keeping the $\boldsymbol{\mu}_k$ fixed, and vice versa, until convergence.
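A compact sketch of this alternating scheme (illustrative; the random initialization and stopping rule are arbitrary choices, not those used in the thesis):

```python
import numpy as np

def kmeans(X, K, n_iter=100, seed=0):
    """Block-coordinate descent on (2.29)-(2.30): alternate assignments and centroid updates."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=K, replace=False)]      # random initial centroids
    for _ in range(n_iter):
        # Update assignments z_ik with centroids fixed: pick the nearest centroid per point.
        dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        assign = dists.argmin(axis=1)
        # Update centroids mu_k with assignments fixed: mean of the assigned points.
        new_centroids = np.array([X[assign == k].mean(axis=0) if np.any(assign == k)
                                  else centroids[k] for k in range(K)])
        if np.allclose(new_centroids, centroids):                 # converged
            break
        centroids = new_centroids
    return assign, centroids

rng = np.random.default_rng(4)
X = np.vstack([rng.normal(c, 0.3, size=(50, 2)) for c in ((0, 0), (3, 0), (0, 3))])
assign, centroids = kmeans(X, K=3)
print(centroids.round(2))      # centroids close to (0,0), (3,0), (0,3) up to permutation
```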

Interestingly, k-means can be viewed as an instance of the encoder-decoder framework. In this case, the encoder maps the input vector to the index of the cluster with the closest centroid and the decoder outputs that centroid. That is, each vector in the input space is approximated by one element of a fixed dictionary (or codebook) of vectors. Therefore, encoding-decoding can be seen as vector quantization (VQ), a generalization of scalar quantization to multi-dimensional data. Vector quantization based modeling has a long history of applications to automatic speaker recognition and related tasks, beginning in the 1980s [38]. In [39], so-called centroid models based on VQ were used to design a lightweight speaker verification system for mobile devices with limited hardware resources such as CPU speed and memory. In Publication III and [40], codebook models are used for speech activity detection.

Many unsupervised learning tasks, including those reviewed above, can be formulated as estimation of a probability density function $p(\mathbf{x})$ given a training set of i.i.d. samples $\mathbf{x}_i \sim p(\mathbf{x})$. For instance, in a probabilistic formulation of PCA [41] one seeks a set of parameters $\{\boldsymbol{\mu}, \mathbf{U}, \sigma\}$ to fit a Gaussian density function of the form $p(\mathbf{x}) = \mathcal{N}(\mathbf{x}|\boldsymbol{\mu}, \mathbf{U}\mathbf{U}^{\mathsf{T}} + \sigma^2\mathbf{I})$ to a given data set. It was shown in [41] that, similarly to classical PCA, the columns of the matrix $\mathbf{U}$ obtained via maximum likelihood estimation (discussed in Section 2.2.3) are scaled eigenvectors of the empirical covariance matrix. In turn, some clustering methods can be seen as estimating the parameters of so-called mixture models [6], which are discussed in the following sections. Once estimated, the parameters of a mixture model can be used to find assignments of the data points to groups discovered in the dataset. Speaker diarization is an example of a task for which mixture models are adopted (e.g. [42], [43]) to cluster speech segments according to speaker identity.

2.1.3 Semi-supervised learning

Semi-supervised learning [8] is based on the expectation that using unlabeled data in conjunction with labeled data can lead to improvement in prediction accuracy in supervised learning tasks. The acquisition of labeled data for a learning task often requires a physical experiment or a human expert (e.g. to transcribe an audio recording). Typically, the high cost associated with the labeling process does not allow a fully labeled training set to be obtained. In contrast, the acquisition of unlabeled data is relatively inexpensive. In many cases, therefore, using semi-supervised learning techniques may become a practical alternative to pure supervised learning.


As in the supervised learning framework, one has a set of labeled training examples $\{(\mathbf{x}_1, y_1), \ldots, (\mathbf{x}_N, y_N)\}$. However, there is also a set of unlabeled examples $\{\mathbf{x}_{N+1}, \ldots, \mathbf{x}_{N+M}\}$. Semi-supervised learning aims at using this combined information to improve on the predictive performance that could be obtained by solving a supervised learning task with the labeled data alone.

The key idea behind many semi-supervised learning methods is to treat the unknown labels $\mathcal{H} = \{h_{N+1}, \ldots, h_{N+M}\}$ as hidden variables. In the context of empirical risk minimization this leads to a joint optimization over both the decision function $f$ and the unknown labels $\mathcal{H}$:
$$\frac{1}{N}\sum_{i=1}^{N} \ell(y_i, f(\mathbf{x}_i)) + \frac{1}{M}\sum_{i=N+1}^{N+M} \ell(h_i, f(\mathbf{x}_i)) \to \min_{f,\, \mathcal{H}}. \tag{2.31}$$
This problem is an instance of mixed-integer nonlinear programming [44] and therefore lacks scalability. Since finding the exact optimal solution becomes impractical for large $M$, one has to resort to heuristic or approximate approaches [45]. For instance, in semi-supervised support vector machines (S3VM) [46] a similar objective function is optimized using branch-and-bound techniques.

Generative probabilistic models provide a principled way to exploit unlabeled data. In this case, unknown labels are viewed as hidden (latent) variables and marginalized out to define a distribution of the unlabeled feature vectors. Given the class-conditional densities $p(\mathbf{x}|h)$ and the prior distribution over classes $p(h)$, the unconditional density of the observations is found as
$$p(\mathbf{x}) = \sum_{h} p(\mathbf{x}, h) = \sum_{k} p(\mathbf{x}|h = k)\, p(h = k). \tag{2.32}$$
It is formed as a convex combination of class-conditional densities with coefficients defined by the prior probabilities. In general, distributions of this form are known as mixture models and will be discussed in Section 2.2.1. The mixture model in (2.32) can be used to define the joint density of labeled and unlabeled data points, which makes it possible to incorporate the unlabeled data into the training process. This approach is adopted to construct a semi-supervised probabilistic classifier for speech activity detection in Publication IV.
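As a toy illustration of (2.32) with invented one-dimensional parameters (not the classifier of Publication IV), the marginal density of each unlabeled point is the prior-weighted sum of class-conditional densities, and its logarithm summed over the unlabeled set is the term a semi-supervised model can add to the labeled-data objective:

```python
import numpy as np

# Invented one-dimensional two-class model: p(x | h = k) = N(x | mu_k, sigma_k^2), p(h = k) = pi_k.
pis = np.array([0.3, 0.7])
mus = np.array([-1.0, 2.0])
sigmas = np.array([1.0, 0.5])

def normal_pdf(x, mu, sigma):
    """Univariate normal density."""
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

def marginal_density(x):
    """p(x) = sum_k p(x | h = k) p(h = k), the mixture in (2.32)."""
    return np.sum(pis * normal_pdf(x, mus, sigmas))

x_unlabeled = np.array([-0.5, 1.8, 3.0])                  # unlabeled observations
log_lik = sum(np.log(marginal_density(x)) for x in x_unlabeled)
print(log_lik)    # unlabeled-data log-likelihood that a semi-supervised model can maximize
```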

2.2 MODELING COMPLEX DENSITIES

In order to solve supervised or unsupervised learning tasks one may need to obtain accurate estimates of the data distribution. This can be done using either explicit or implicit representations of a probability density function.

2.2.1 Explicit density models

A parametric statistical model is a family of non-negative density functions $\{p(\cdot|\boldsymbol{\theta}) \mid \boldsymbol{\theta} \in \Theta\}$, each of which is indexed by a finite-dimensional parameter vector $\boldsymbol{\theta} \in \Theta \subseteq \mathbb{R}^d$. By definition, given the parameters, a new data point, $\mathbf{z}$, is independent of the training data, $\mathcal{D} = \{\mathbf{z}_1, \ldots, \mathbf{z}_N\}$:
$$p(\mathbf{z}|\boldsymbol{\theta}, \mathcal{D}) = p(\mathbf{z}|\boldsymbol{\theta}). \tag{2.33}$$


Therefore, $\boldsymbol{\theta}$ captures all of the relevant information about the training set. Thus, the complexity of a parametric model is bounded even if the amount of training data is unbounded.

In contrast, non-parametric models assume that the data distribution cannot be defined by a finite set of parameters. That is, the amount of information that the model can capture about the data can grow as the total amount of data increases. This allows one to automatically determine the complexity of the model. This ability of nonparametric models is often employed, for instance, to determine the number of groups in a clustering task. In Publication I, a non-parametric model is used for clustering speech segments when the number of clusters is unknown a priori.

Probabilistic models are either normalized or unnormalized. A statistical model is said to be normalized if the condition
$$\int p(\mathbf{z}|\boldsymbol{\theta})\, d\mathbf{z} = 1 \tag{2.34}$$
holds for all $\boldsymbol{\theta}$. In contrast, a model $\{u(\cdot|\boldsymbol{\theta}) \mid \boldsymbol{\theta} \in \Theta\}$ is said to be unnormalized if the above integral is finite but not necessarily equal to 1. Hence, for normalized models, the $p(\mathbf{z})$ are probability density functions, but for unnormalized models, the $u(\mathbf{z})$ are not.

In theory, any non-negative function $u(\mathbf{z})$ associated with an unnormalized model can be converted to a probability density function by dividing it by its integral (or sum), referred to as the partition function:
$$Z(\boldsymbol{\theta}) = \int u(\mathbf{z}|\boldsymbol{\theta})\, d\mathbf{z}. \tag{2.35}$$
The corresponding normalized model is then
$$p(\mathbf{z}|\boldsymbol{\theta}) = \frac{u(\mathbf{z}|\boldsymbol{\theta})}{Z(\boldsymbol{\theta})}. \tag{2.36}$$
However, this is rarely possible in practice, because the integral in the partition function is generally intractable, that is, it lacks an analytic expression.
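In low dimensions the conversion (2.35)–(2.36) can nonetheless be carried out numerically; the sketch below (an arbitrary toy density, not an example from the thesis) approximates the partition function on a grid and checks that the normalized density integrates to one:

```python
import numpy as np

def u(z, theta=2.0):
    """An arbitrary non-negative, unnormalized density u(z | theta)."""
    return np.exp(-np.abs(z) ** theta)

# Partition function Z(theta) = integral of u(z | theta) dz, approximated on a dense grid (2.35).
grid = np.linspace(-10.0, 10.0, 200001)
dz = grid[1] - grid[0]
Z = np.sum(u(grid)) * dz

def p(z):
    """Normalized density p(z | theta) = u(z | theta) / Z(theta), as in (2.36)."""
    return u(z) / Z

print(Z)                        # for theta = 2 this approaches sqrt(pi), about 1.7725
print(np.sum(p(grid)) * dz)     # about 1.0, so p integrates to one
```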

Explicit density models define an unnormalized model by explicitly constructing the function $u(\mathbf{z}) \propto p(\mathbf{z})$. In some cases this unnormalized model can be converted to a normalized model with a corresponding probability density function $p(\mathbf{z})$. If one is able to find a closed-form expression for the (unnormalized) density function, the model is called tractable; otherwise it is called intractable.

Tractable explicit density models can be constructed in several ways [47]:

• as a product of known conditional densities,

• by applying a nonlinear transform, or

• by introducing latent variables.

In the first case, a joint probability density function is represented as a product of several conditional densities in the following way:
$$p(\mathbf{z}) = p(z_1) \prod_{j=2}^{d} p(z_j | z_1, \ldots, z_{j-1}). \tag{2.37}$$
Given such a factorization, each of the factors, in turn, can be constructed in one of the three aforementioned ways. This representation of the density function is used in WaveNet [48], a probabilistic model of raw audio waveforms, where each conditional density is defined using a convolutional neural network.

Another way to construct a probability density function is to use a continuous invertible transformation $g$ between two vector spaces. If $\mathbf{h} \sim q(\mathbf{h})$ is a vector of latent random variables, then a distribution over $\mathbf{z}$ can be defined as follows [47]:
$$p(\mathbf{z}) = q(g^{-1}(\mathbf{z}))\, \big|\det \mathbf{J}_{g^{-1}}(\mathbf{z})\big|, \tag{2.38}$$
where $\mathbf{J}_{g^{-1}}(\mathbf{z})$ denotes the Jacobian of $g^{-1}$ evaluated at $\mathbf{z}$. The density $p(\mathbf{z})$ is tractable if the density $q(\mathbf{h})$ and the determinant of the Jacobian of $g^{-1}$ are both tractable. Unfortunately, this is rarely the case. In addition, the invertibility of $g$ requires that the latent variable $\mathbf{h}$ have the same dimensionality as $\mathbf{z}$.

Finally, latent variable models can be seen as a middle ground between the two previous approaches to constructing densities. As suggested by the name, this approach is based on introducing a vector of latent random variables $\mathbf{h} \sim p(\mathbf{h})$. However, in contrast to the previous approach, instead of a deterministic transformation one defines a conditional density, $p(\mathbf{z}|\mathbf{h})$. This can be seen as a generalization of a deterministic transformation, since the equality $\mathbf{z} = g(\mathbf{h})$ is equivalent to $p(\mathbf{z}|\mathbf{h}) = \delta(\mathbf{z} - g(\mathbf{h}))$, where $\delta(\cdot)$ is the Dirac delta function. Assuming $p(\mathbf{z}|\mathbf{h}) = \delta(\mathbf{z} - g(\mathbf{h}))$, the previous approach is recovered by using the substitution rule for integrals (see Theorem 7.26 in [49]). This allows a joint density to be defined for the observed and unobserved (latent) variables as a product of given density functions: $p(\mathbf{z}, \mathbf{h}) = p(\mathbf{z}|\mathbf{h})\, p(\mathbf{h})$. Finally, one can obtain $p(\mathbf{z})$ by marginalizing out the latent variables:
$$p(\mathbf{z}) = \int p(\mathbf{z}, \mathbf{h})\, d\mathbf{h} = \int p(\mathbf{z}|\mathbf{h})\, p(\mathbf{h})\, d\mathbf{h}. \tag{2.39}$$
Since many probabilistic models in speaker recognition, including those proposed in Publications I–IV, are instances of latent variable models, this topic deserves a more detailed discussion.

Latent variable models

While the modeling capacities of basic distributions, such as the Gaussian, Gamma, Beta, Chi-squared, Dirichlet, or Poisson distributions, are limited, one can build complex distributions from simpler building blocks. To model a complex probability density over the observed variable $\mathbf{z}$, one introduces the hidden (or latent) variable $\mathbf{h}$, which may be discrete or continuous. The density of interest $p(\mathbf{z})$ is expressed as the marginalization of the joint density $p(\mathbf{z}, \mathbf{h})$ so that
$$p(\mathbf{z}) = \int p(\mathbf{z}, \mathbf{h})\, d\mathbf{h}. \tag{2.40}$$

Usually the joint density $p(\mathbf{z}, \mathbf{h})$ is defined as a product of the conditional density $p(\mathbf{z}|\mathbf{h})$ and the marginal density of the hidden variable $p(\mathbf{h})$ such that both factors are represented by simple distributions. In turn, the density $p(\mathbf{h})$ can also be defined by means of hidden variables, leading to so-called hierarchical models [50].

While, in general, a latent variable model may not be tractable — that is, there is no closed-form expression for $\int p(\mathbf{z}, \mathbf{h})\, d\mathbf{h}$ — the most popular models are tractable.

In the next section two relevant classes of tractable models, namely mixture models and subspace models, will be described in more detail. Both of these models are frequently used to represent the class-conditional densities in generative classifiers.
