
In Publication I, we evaluated the proposed speaker diarization system using the NIST SRE 2008 [160] telephone data. Since the goal of the experiment was to study the properties of speaker clustering algorithms, rather than comparing complete

1 Members of the “HAPPY” team are the authors of Publication III.

Table 6.1: Summary of the speech corpora. SRE: speaker recognition evaluation, DAC: domain adaptation challenge, SAD: speech activity detection.

SRE’08 [160] (used in Publication I): The data used in our experiments is a subset of the NIST SRE 2008 corpus. It consists of 2,215 telephone conversations, each involving two speakers and approximately 5 minutes in duration. The total duration is approximately 200 hours.

DAC [216] (used in Publication II): The experimental setup of DAC involves two separate training sets: in-domain (SRE) and out-of-domain (SWB). The in-domain SRE set consists of telephone calls from 3,790 speakers (male and female) and 36,470 speech cuts taken from the SRE 04, 05, 06, and 08 collections. The out-of-domain SWB set consists of telephone calls from 3,114 speakers (male and female) and 33,039 speech cuts taken from the Switchboard-I and II corpora [186]. The SRE’10 telephone data is used as the enrollment and test sets. The average duration of the audio recordings in both the training and the test data is 5 minutes. Note: this dataset contains no actual speech data; all recordings are represented by i-vectors.

OpenSAD [120] (used in Publications III, IV): The OpenSAD data originates from one of the DARPA RATS (Robust Automatic Transcription of Speech) evaluation sets [217]. It consists of highly degraded recordings obtained by transmitting the source audio over several different noisy radio communication channels. We included data from the dev-2 subset of the official development part in our evaluation set, resulting in 661 audio recordings with an average duration of 10 minutes. The total duration is approximately 100 hours.

SRE’10 [151] (used in Publication IV): We used the NIST SRE 2010 dataset to evaluate the i-vector based speaker verification system using different SAD methods. Specifically, we used a subset of the male trials from the normal vocal effort telephone speech condition (CC5), consisting of telephone calls of approximately 5 minutes in duration.

RSR2015 [218] (used in Publication IV): The RSR2015 corpus consists of microphone speech collected in clean environmental conditions using a set of six portable devices, including smartphones and tablets. The corpus comprises over 71 hours of speech recorded from English speakers in Singapore. The pool of speakers consists of 300 participants (157 male and 143 female speakers). Each participant recorded 9 sessions consisting of 30 short sentences.

RedDots [219] (used in Publication IV): In our experiments we used the quarter four (Q4) release of the RedDots corpus. The corpus is collected from speakers using different smartphones in different environmental conditions. The pool of speakers consists of 62 participants (49 male and 13 female speakers) from 21 countries. The total number of sessions in the Q4 release is 572 (473 male and 99 female sessions).

speaker diarization systems, we used oracle speech activity boundaries to ignore errors caused by a speech activity detector. As a result, two of the three terms of the DER metric, namely the missed speech rate and the false alarm speech rate, vanish.

This allows one to focus only on the speaker confusion error, which relates to the performance of speaker clustering. Evaluation of the proposed system yielded DERs varying from 1.74% to 2.35%, while the baseline eigenvoice based system [42], under comparable conditions, has a lower DER of 1.29%. However, [42] requires the number of speakers to be known in advance, whereas in the proposed system it is determined automatically; experimentally, the proposed system detected the correct number of speakers in more than 90% of conversations.
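For reference, in the standard NIST formulation the diarization error rate decomposes as

```latex
\mathrm{DER} \;=\; \frac{T_{\mathrm{miss}} + T_{\mathrm{fa}} + T_{\mathrm{conf}}}{T_{\mathrm{total}}},
```

where $T_{\mathrm{miss}}$, $T_{\mathrm{fa}}$, and $T_{\mathrm{conf}}$ are the durations of missed speech, false alarm speech, and speaker confusion, and $T_{\mathrm{total}}$ is the total scored speech time. With oracle speech activity boundaries, $T_{\mathrm{miss}} = T_{\mathrm{fa}} = 0$, so the DER reduces to the speaker confusion term alone.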

In Publication II, we evaluated the proposed classifier, mt-PSVM, on the data from the Domain Adaptation Challenge (DAC) [216] organized by Johns Hopkins University. In the experiments by the author, mt-PSVM reduced the EER from 6.88% to 4.18% in a gender-independent setup when trained on the out-of-domain Switchboard [186] dataset and tested on the in-domain SRE 2010 [151] dataset. Although the new proposal, mt-PSVM, did not outperform a competing technique, inter-dataset variability compensation (IDVC) [220], which achieved an EER of 3.08% in the same setup, it represents a different way to compensate for inter-dataset variability in a discriminative formulation of PLDA. The other existing techniques, including IDVC, are designed for the generative formulation of PLDA. As discussed in [221], the shared weight vector in mt-SVM does not necessarily yield an accurate classifier because it is not constrained to perform well on any task and is only used to share information across tasks. Thus, devising alternative objective functions or parameter sharing strategies might be a research direction worthy of further exploration.
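To make the shared-weight remark concrete, one common multi-task SVM parameterization (in the spirit of regularized multi-task learning; the notation here is illustrative and not necessarily the exact objective of Publication II) writes the weight vector of task $t$ as a shared part plus a task-specific offset, $\mathbf{w}_t = \mathbf{w}_0 + \mathbf{v}_t$, and minimizes

```latex
\min_{\mathbf{w}_0,\,\{\mathbf{v}_t\}} \;
\sum_{t=1}^{T}\sum_{i=1}^{n_t} \max\!\bigl(0,\; 1 - y_{ti}\,(\mathbf{w}_0+\mathbf{v}_t)^{\top}\mathbf{x}_{ti}\bigr)
\;+\; \lambda_1 \sum_{t=1}^{T} \lVert\mathbf{v}_t\rVert^2
\;+\; \lambda_2 \lVert\mathbf{w}_0\rVert^2 .
```

Here $\mathbf{w}_0$ couples the tasks only through the regularization terms; no term forces $\mathbf{w}_0$ itself to classify any single task well, which is exactly the limitation noted in [221].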

In Publications III and IV, the author and his co-authors evaluated the proposed SAD method using the dataset released as a part of the NIST OpenSAD challenge [120]. We observed increased accuracy when unlabeled data was used to estimate the speech and non-speech statistical models. Further, we conducted extensive speaker verification experiments with the NIST SRE 2010 [151], RSR2015 [218], and RedDots [219] corpora. The results showed benefits of the proposed method especially on long speech recordings containing non-stationary noise. Finally, we studied how integration of the proposed SAD method into a downstream recognition application, here speaker verification, impacts recognition errors. We analyzed how the trade-off between SAD misses and false alarms affects the EER of the i-vector based ASV system. We observed that a higher miss rate is less of a problem than a higher false alarm rate; therefore, false alarms should be penalized roughly 4 to 5 times more than misses to obtain the best performance. That is, a speaker model estimated from a smaller portion of pure speech is preferable to a noisy model.
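The asymmetric penalty can be illustrated with a small sketch (the scores, labels, and weighting below are hypothetical; the actual tuning was done against a full i-vector ASV pipeline): sweep candidate SAD decision thresholds and keep the one minimizing a cost in which a false alarm weighs several times as much as a miss.

```python
import numpy as np

def weighted_sad_cost(miss_rate, fa_rate, fa_weight=4.0):
    """Weighted detection cost: false alarms penalized fa_weight times more than misses."""
    return miss_rate + fa_weight * fa_rate

def pick_threshold(scores, labels, fa_weight=4.0):
    """Sweep every distinct score as a candidate threshold and return the one
    minimizing the weighted cost. labels: 1 = speech frame, 0 = non-speech frame."""
    scores, labels = np.asarray(scores), np.asarray(labels)
    best_t, best_c = None, np.inf
    for t in np.unique(scores):
        decisions = scores >= t                   # frames accepted as speech
        miss = np.mean(~decisions[labels == 1])   # speech rejected as non-speech
        fa = np.mean(decisions[labels == 0])      # non-speech accepted as speech
        c = weighted_sad_cost(miss, fa, fa_weight)
        if c < best_c:
            best_t, best_c = t, c
    return best_t, best_c
```

With `fa_weight` around 4 to 5, the sweep favors conservative thresholds: it tolerates discarding some genuine speech if that keeps noise out of the speaker model, matching the observation above.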

A summary of the main contributions and achievements is provided in Table 6.2.

Table 6.2: Summary of the main results in publications. ASV: automatic speaker verification, SAD: speech activity detection, SD: speaker diarization, EER: equal error rate, DCF: detection cost function.

I (SD): Diarization with an unknown number of speakers. Dataset: SRE’08. Result: new clustering method to accurately detect the number of speakers.

II (ASV): Compensation of inter-dataset variability via multi-task learning. Dataset: DAC. Result: EER reduced from 6.88% to 4.18%.

III (SAD): Introduction of unsupervised GMM-based SAD. Dataset: OpenSAD. Result: new method that contributes to fusion of SADs.

IV (SAD): Improvement of the SAD method from III to enable using all frames by adopting semi-supervised learning. Dataset: OpenSAD. Result: up to 5% reduction in DCF compared to Publication III.

IV (SAD & ASV): Study of the impact of the SAD method on ASV performance. Datasets: SRE’10, RSR2015, RedDots. Result: the proposed SAD method is best suited for long and noisy data conditions.

IV (SAD & ASV): Determination of the optimal SAD miss and false alarm trade-off for ASV. Dataset: SRE’10. Result: SAD false alarms should be penalized 4 to 5 times more than misses.

7 CONCLUSIONS

The work done in this thesis contributes to the improvement of machine learning techniques for automatic text-independent speaker recognition and speech segmentation. First, it addresses the speaker diarization problem in its most general formulation, that is, when the number of speakers in a recorded conversation is unknown a priori. A probabilistic generative model for similarity (affinity) matrices was proposed and objectively evaluated on the NIST SRE 2008 telephone data using the commonly adopted protocol. The experiment indicated that the fully Bayesian treatment for determining model complexity allows the correct number of speakers to be accurately estimated. The downside is compromised segmentation accuracy compared to the best performing algorithm with a known (oracle) number of speakers.

One distinctive feature of the new model is that it can be naturally extended to the multi-view setting (when multiple similarity matrices are available) by using multivariate distributions. Exploring the potential of combining multiple sources of information in the multi-view scenario would be an interesting direction for future work.

Second, a multi-task extension of the pairwise support vector machine was studied with the aim of reducing the effect of domain mismatch between training and evaluation data in speaker verification. The evaluation results indicate that the proposed discriminative classifier reduces the performance gap caused by domain mismatch, though it struggles to achieve accuracy similar to its generative counterparts.

The author believes that the proposed method can be improved by altering the objective function and parameter sharing strategy.

Finally, the author studied a simple and general-purpose probabilistic speech activity detection method in which the classes are modeled by Gaussian mixture models. The author proposed a new training method based on the semi-supervised learning paradigm that provides a methodology for estimating more accurate models by incorporating unlabeled data into the training process. We observed increased accuracy of the stand-alone SAD system on the challenging dataset from the recent NIST OpenSAD evaluation. Our extensive automatic speaker verification (ASV) evaluation, including both text-independent and text-dependent experiments, suggested benefits of the GMM-based SAD method for long speech segments.
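The idea of incorporating unlabeled frames can be sketched as a self-training loop. The sketch below uses single diagonal-covariance Gaussians instead of full GMMs for brevity, and a simple relabel-and-refit iteration; it illustrates the semi-supervised principle rather than the exact algorithm of Publications III and IV.

```python
import numpy as np

def fit_gauss(x):
    """Maximum-likelihood fit of a diagonal-covariance Gaussian to frames x (N, D)."""
    return x.mean(0), x.var(0) + 1e-6  # small floor to keep variances positive

def log_lik(x, mu, var):
    """Per-frame Gaussian log-likelihood under a diagonal covariance."""
    return -0.5 * (np.log(2 * np.pi * var) + (x - mu) ** 2 / var).sum(1)

def semi_supervised_sad(labeled_x, labeled_y, unlabeled_x, n_iter=5):
    """Bootstrap speech/non-speech models from labeled frames, then repeatedly
    relabel the unlabeled frames by likelihood ratio and refit on the union.
    labeled_y: 1 = speech, 0 = non-speech."""
    speech = fit_gauss(labeled_x[labeled_y == 1])
    nonspeech = fit_gauss(labeled_x[labeled_y == 0])
    for _ in range(n_iter):
        is_speech = log_lik(unlabeled_x, *speech) > log_lik(unlabeled_x, *nonspeech)
        speech = fit_gauss(np.vstack([labeled_x[labeled_y == 1],
                                      unlabeled_x[is_speech]]))
        nonspeech = fit_gauss(np.vstack([labeled_x[labeled_y == 0],
                                         unlabeled_x[~is_speech]]))
    return speech, nonspeech, is_speech
```

When the unlabeled data comes from the same conditions as the test material, the refitted models absorb channel and noise characteristics that a small labeled set alone cannot capture, which is the effect observed in the OpenSAD experiments.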

Also, the evaluation results indicate ASV performance comparable, on average, with classic unsupervised SAD methods, which are designed specifically for this task and exploit information about the nature of the speech signal. In contrast, the proposed SAD method makes no strong assumptions about the input data and is not tied, in principle, to any particular features. This leaves room for potential improvement by either constraining the model to take into account some specifics of the input data, or by adopting alternative features with higher discriminative ability.

In fact, output scores from other SADs can be used as features, enabling fusion of different SAD methods. Further exploration of this idea is left for future work.

BIBLIOGRAPHY

[1] D. Jurafsky and J. H. Martin, Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition, 2nd ed. (Prentice Hall, Pearson Education International, 2009).

[2] S. E. Tranter and D. A. Reynolds, “An overview of automatic speaker diarization systems,” IEEE Transactions on Audio, Speech & Language Processing 14, 1557–1565 (2006).

[3] N. Brümmer and E. de Villiers, “The speaker partitioning problem,” in Odyssey (2010).

[4] L. Rabiner and R. Schafer, Digital Processing of Speech Signals (Prentice Hall, Englewood Cliffs, 1978).

[5] T. M. Mitchell, Machine Learning, 1st ed. (McGraw-Hill, Inc., New York, NY, USA, 1997).

[6] C. M. Bishop, Pattern Recognition and Machine Learning (Information Science and Statistics) (Springer-Verlag, Berlin, Heidelberg, 2006).

[7] T. Hastie, R. Tibshirani, and J. Friedman, The Elements of Statistical Learning: Data Mining, Inference and Prediction, 2nd ed. (Springer, 2009).

[8] O. Chapelle, B. Scholkopf, and A. Zien, Semi-Supervised Learning, 1st ed. (The MIT Press, Cambridge, MA, 2010).

[9] R. S. Sutton and A. G. Barto, Introduction to Reinforcement Learning, 1st ed. (MIT Press, Cambridge, MA, USA, 1998).

[10] S. Nowozin, P. V. Gehler, J. Jancsary, and C. H. Lampert, Advanced Structured Prediction (The MIT Press, 2014).

[11] T.-Y. Liu, “Learning to Rank for Information Retrieval,” Found. Trends Inf. Retr. 3, 225–331 (2009).

[12] E. Frank and M. Hall, “A Simple Approach to Ordinal Classification,” in Proceedings of the 12th European Conference on Machine Learning, EMCL ’01 (2001), pp. 145–156.

[13] J. S. Cardoso and J. F. Pinto da Costa, “Learning to Classify Ordinal Data: The Data Replication Method,” Journal of Machine Learning Research 8, 1393–1429 (2007).

[14] J. Pitman, Probability (Springer Texts in Statistics) (Springer, 1999).

[15] R. O. Duda, P. E. Hart, and D. G. Stork, Pattern Classification, 2nd ed. (Wiley-Interscience, New York, NY, USA, 2000).

[16] C. Robert, The Bayesian Choice: From Decision-Theoretic Foundations to Computational Implementation (Springer, New York, 2007).

[17] V. N. Vapnik, The Nature of Statistical Learning Theory (Springer-Verlag, Berlin, Heidelberg, 1995).

[18] C. P. Robert and G. Casella, Monte Carlo Statistical Methods (Springer Texts in Statistics) (Springer-Verlag, Berlin, Heidelberg, 2005).

[19] M. Loeve, Probability Theory I (Springer, 1977).

[20] J. Nocedal and S. J. Wright, Numerical Optimization, 2nd ed. (Springer, New York, 2006).

[21] P. L. Bartlett, M. I. Jordan, and J. D. McAuliffe, “Convexity, classification, and risk bounds,” Journal of the American Statistical Association 101, 138–156 (2006).

[22] M. Mohri, A. Rostamizadeh, and A. Talwalkar, Foundations of Machine Learning (The MIT Press, 2012).

[23] J. Quionero-Candela, M. Sugiyama, A. Schwaighofer, and N. D. Lawrence, Dataset Shift in Machine Learning (The MIT Press, 2009).

[24] M. Ranzato, Y.-L. Boureau, S. Chopra, and Y. LeCun, “A Unified Energy-Based Framework for Unsupervised Learning,” in Proceedings of the Eleventh International Conference on Artificial Intelligence and Statistics, Vol. 2, Proceedings of Machine Learning Research (2007), pp. 371–379.

[25] P. Vincent, H. Larochelle, Y. Bengio, and P.-A. Manzagol, “Extracting and Composing Robust Features with Denoising Autoencoders,” in Proceedings of the 25th International Conference on Machine Learning (2008), pp. 1096–1103.

[26] D. L. Donoho, “High-dimensional data analysis: The curses and blessings of dimensionality,” in American Mathematical Society Conf. Math Challenges of the 21st Century (2000).

[27] I. Jolliffe, Principal Component Analysis (Springer Verlag, 1986).

[28] J. P. Cunningham and Z. Ghahramani, “Linear Dimensionality Reduction: Survey, Insights, and Generalizations,” Journal of Machine Learning Research 16, 2859–2900 (2015).

[29] Y. Bengio, A. Courville, and P. Vincent, “Representation Learning: A Review and New Perspectives,” IEEE Trans. Pattern Anal. Mach. Intell. 35, 1798–1828 (2013).

[30] T. Mikolov, K. Chen, G. Corrado, and J. Dean, “Efficient Estimation of Word Representations in Vector Space,” https://arxiv.org/abs/1301.3781 (2013).

[31] F. Schroff, D. Kalenichenko, and J. Philbin, “FaceNet: A unified embedding for face recognition and clustering,” in CVPR (2015), pp. 815–823.

[32] D. Snyder, P. Ghahremani, D. Povey, D. Garcia-Romero, Y. Carmiel, and S. Khudanpur, “Deep neural network-based speaker embeddings for end-to-end speaker verification,” in 2016 IEEE Spoken Language Technology Workshop (SLT) (2016), pp. 165–170.

[33] C. Shen, H. Li, and M. J. Brooks, “A Convex Programming Approach to the Trace Quotient Problem,” in ACCV (2), Vol. 4844, Lecture Notes in Computer Science (2007), pp. 227–235.

[34] S. J. D. Prince and J. H. Elder, “Probabilistic Linear Discriminant Analysis for Inferences About Identity,” in ICCV (2007), pp. 1–8.

[35] L. Burget, O. Plchot, S. Cumani, O. Glembek, P. Matejka, and N. Brümmer, “Discriminatively trained Probabilistic Linear Discriminant Analysis for speaker verification,” in Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2011, May 22-27, 2011, Prague Congress Center, Prague, Czech Republic (2011), pp. 4832–4835.

[36] S. Cumani, N. Brümmer, L. Burget, and P. Laface, “Fast discriminative speaker verification in the i-vector space,” in Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2011, May 22-27, 2011, Prague Congress Center, Prague, Czech Republic (2011), pp. 4852–4855.

[37] Y. Linde, A. Buzo, and R. M. Gray, “An Algorithm for Vector Quantizer Design,” IEEE Transactions on Communications 28, 84–95 (1980).

[38] F. K. Soong, A. E. Rosenberg, B.-H. Juang, and L. R. Rabiner, “Report: A Vector Quantization Approach to Speaker Recognition,” AT&T Technical Journal 66, 14–26 (1987).

[39] T. Kinnunen, E. Karpov, and P. Fränti, “Real-time speaker identification and verification,” IEEE Trans. Audio, Speech & Language Processing 14, 277–288 (2006).

[40] T. Kinnunen and P. Rajan, “A practical, self-adaptive voice activity detector for speaker verification with noisy telephone and microphone data,” in Proc. ICASSP (2013), pp. 7229–7233.

[41] M. E. Tipping and C. M. Bishop, “Probabilistic Principal Component Analysis,” Journal of the Royal Statistical Society, Series B 61, 611–622 (1999).

[42] P. Kenny, D. A. Reynolds, and F. Castaldo, “Diarization of Telephone Conversations Using Factor Analysis,” J. Sel. Topics Signal Processing 4, 1059–1070 (2010).

[43] S. Shum, N. Dehak, R. Dehak, and J. R. Glass, “Unsupervised Methods for Speaker Diarization: An Integrated and Iterative Approach,” IEEE Trans. Audio, Speech & Language Processing 21, 2015–2028 (2013).

[44] C. A. Floudas, Nonlinear and Mixed-Integer Optimization: Fundamentals and Applications (Topics in Chemical Engineering) (Oxford University Press, New York, 1995).

[45] O. Chapelle, V. Sindhwani, and S. S. Keerthi, “Optimization Techniques for Semi-Supervised Support Vector Machines,” Journal of Machine Learning Research 9, 203–233 (2008).

[46] K. P. Bennett and A. Demiriz, “Semi-supervised Support Vector Machines,” in Proceedings of the 1998 Conference on Advances in Neural Information Processing Systems II (1999), pp. 368–374.

[47] I. J. Goodfellow, “NIPS 2016 Tutorial: Generative Adversarial Networks,” https://arxiv.org/abs/1701.00160 (2016).

[48] A. van den Oord, S. Dieleman, H. Zen, K. Simonyan, O. Vinyals, A. Graves, N. Kalchbrenner, A. W. Senior, and K. Kavukcuoglu, “WaveNet: A Generative Model for Raw Audio,” https://arxiv.org/abs/1609.03499 (2016).

[49] W. Rudin, Real and Complex Analysis, 3rd ed. (McGraw-Hill, Inc., New York, NY, USA, 1987).

[50] A. Gelman, J. Carlin, H. Stern, D. Dunson, A. Vehtari, and D. Rubin, Bayesian Data Analysis, 3rd ed. (Chapman and Hall/CRC, London, 2013).

[51] S. T. Roweis and Z. Ghahramani, “A Unifying Review of Linear Gaussian Models,” Neural Computation 11, 305–345 (1999).

[52] C. M. Bishop, “Bayesian PCA,” in Advances in Neural Information Processing Systems 11 (1998), pp. 382–388.

[53] S. Mohamed and B. Lakshminarayanan, “Learning in Implicit Generative Models,” https://arxiv.org/abs/1610.03483 (2016).

[54] L. Devroye, Non-Uniform Random Variate Generation (Springer-Verlag, 1986).

[55] B. W. Silverman, Density Estimation for Statistics and Data Analysis (Chapman & Hall, London, 1986).

[56] E. L. Lehmann and G. Casella, Theory of Point Estimation, 2nd ed. (Springer-Verlag, New York, NY, USA, 1998).

[57] F. Sha and L. K. Saul, “Comparison of Large Margin Training to Other Discriminative Methods for Phonetic Recognition by Hidden Markov Models,” in Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2007, Honolulu, Hawaii, USA, April 15-20, 2007 (2007), pp. 313–316.

[58] J. A. Lasserre, C. M. Bishop, and T. P. Minka, “Principled Hybrids of Generative and Discriminative Models,” in 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 2006), 17-22 June 2006, New York, NY, USA (2006), pp. 87–94.

[59] A. Nadas, “A decision theoretic formulation of a training problem in speech recognition and a comparison of training by unconditional versus conditional maximum likelihood,” IEEE Transactions on Acoustics, Speech and Signal Processing 31, 814–817 (1983).

[60] L. R. Bahl, P. F. Brown, P. V. de Souza, and R. L. Mercer, “Maximum mutual information estimation of hidden Markov model parameters for speech recognition,” in Proceedings IEEE International Conference on Acoustics, Speech and Signal Processing (1986), pp. 49–52.

[61] H. Hadian, H. Sameti, D. Povey, and S. Khudanpur, “End-to-end Speech Recognition Using Lattice-free MMI,” in Interspeech (2018), pp. 12–16.

[62] K. Veselý, A. Ghoshal, L. Burget, and D. Povey, “Sequence-discriminative training of deep neural networks,” in Interspeech (2013), pp. 2345–2349.

[63] P. C. Woodland and D. Povey, “Large scale discriminative training of hidden Markov models for speech recognition,” Computer Speech & Language 16, 25–47 (2002).

[64] A. P. Dempster, N. M. Laird, and D. B. Rubin, “Maximum likelihood from incomplete data via the EM algorithm,” Journal of the Royal Statistical Society, Series B 39, 1–38 (1977).

[65] Y. Sun, P. Babu, and D. P. Palomar, “Majorization-Minimization Algorithms in Signal Processing, Communications, and Machine Learning,” IEEE Trans. Signal Processing 65, 794–816 (2017).

[66] K. Lange, MM Optimization Algorithms (SIAM-Society for Industrial and Applied Mathematics, USA, 2016).

[67] S. Kullback and R. A. Leibler, “On Information and Sufficiency,” Ann. Math. Statist. 22, 79–86 (1951).

[68] G. H. Golub and C. F. Van Loan, Matrix Computations, 3rd ed. (The Johns Hopkins University Press, 1996).

[69] P. Kenny, “Bayesian Speaker Verification with Heavy-Tailed Priors,” in Odyssey (2010).

[70] P. Kenny, G. Boulianne, P. Ouellet, and P. Dumouchel, “Speaker and Session Variability in GMM-Based Speaker Verification,” IEEE Trans. Audio, Speech & Language Processing 15, 1448–1460 (2007).

[71] P. Kenny, P. Ouellet, N. Dehak, V. Gupta, and P. Dumouchel, “A Study of Interspeaker Variability in Speaker Verification,” IEEE Trans. Audio, Speech & Language Processing 16, 980–988 (2008).

[72] M. Gutmann and A. Hyvärinen, “Noise-contrastive estimation: A new estimation principle for unnormalized statistical models,” in Proc. Int. Conf. on Artificial Intelligence and Statistics (AISTATS), Vol. 9, JMLR W&CP (2010), pp. 297–304.

[73] M. Gutmann and A. Hyvärinen, “Estimation of Unnormalized Statistical Models without Numerical Integration,” in Proc. Workshop on Information Theoretic Methods in Science and Engineering (2013).

[74] M. U. Gutmann and A. Hyvärinen, “Noise-contrastive Estimation of Unnormalized Statistical Models, with Applications to Natural Image Statistics,” Journal of Machine Learning Research 13, 307–361 (2012).

[75] M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin, S. Ghemawat, G. Irving, M. Isard, M. Kudlur, J. Levenberg, R. Monga, S. Moore, D. G. Murray, B. Steiner, P. Tucker, V. Vasudevan, P. Warden, M. Wicke, Y. Yu, and X. Zheng, “TensorFlow: A System for Large-scale Machine Learning,” in Proceedings of the 12th USENIX Conference on Operating Systems Design and Implementation, OSDI’16 (2016), pp. 265–283.

[76] F. Valente and C. Wellekens, “Variational Bayesian Methods for Audio Indexing,” in Machine Learning for Multimodal Interaction, Second International Workshop, MLMI 2005, Edinburgh, UK, July 11-13, 2005, Revised Selected Papers (2005), pp. 307–319.

[77] L. P. Devroye and T. J. Wagner, “Distribution-free inequalities for the deleted and holdout error estimates,” IEEE Transactions on Information Theory 25, 202–207 (1979).

[78] S. Arlot, A. Celisse, et al., “A survey of cross-validation procedures for model selection,” Statistics Surveys 4, 40–79 (2010).

[79] S. Geisser, “The predictive sample reuse method with applications,” Journal of the American Statistical Association 70 (1975).

[80] P. L. Bartlett, M. I. Jordan, and J. D. McAuliffe, “Convexity, Classification, and Risk Bounds,” (2003).

[81] D. J. MacKay, “Bayesian Interpolation,” Neural Computation 4, 415–447 (1992).

[82] S. Geisser and W. F. Eddy, “A Predictive Approach to Model Selection,” Journal of the American Statistical Association 74, 153–160 (1979).

[83] J. M. Bernardo and A. F. M. Smith, Bayesian Theory (John Wiley & Sons, New York, 1994).

[84] H. M. Wallach, I. Murray, R. Salakhutdinov, and D. Mimno, “Evaluation Methods for Topic Models,” in Proceedings of the 26th Annual International Conference on Machine Learning (2009), pp. 1105–1112.

[85] Y. Lei, L. Burget, and N. Scheffer, “Bilinear Factor Analysis for iVector Based Speaker Verification,” in Interspeech (2012), pp. 1588–1591.

[86] F. Valente and C. Wellekens, “Variational Bayesian speaker change detection,” in Interspeech (2005), pp. 693–696.

[87] D. A. van Leeuwen and M. Huijbregts, “The AMI Speaker Diarization System for NIST RT06s Meeting Data,” in Proceedings of the Third International Conference on Machine Learning for Multimodal Interaction, MLMI’06 (2006), pp. 371–384.

[88] F. Valente, P. Motlícek, and D. Vijayasenan, “Variational Bayesian speaker diarization of meeting recordings,” in Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2010, 14-19 March 2010, Sheraton Dallas Hotel, Dallas, Texas, USA (2010), pp. 4954–4957.

[89] A. Kucukelbir, D. Tran, R. Ranganath, A. Gelman, and D. M. Blei, “Automatic Differentiation Variational Inference,” Journal of Machine Learning Research 18, 1–45 (2017).

[90] A. B. Dieng, D. Tran, R. Ranganath, J. W. Paisley, and D. M. Blei, “Variational Inference via χ-Upper Bound Minimization,” in Advances in Neural Information Processing Systems 30 (2017).