
In this chapter, a series of experiments was conducted to understand the impact of class imbalance on an end-to-end deep learning approach to the language identification task. We presented a comprehensive framework to tackle the challenge. Our results show that a deep neural network, as a nonlinear model, can be trained on an imbalanced dataset with regularization and an appropriate objective function. Furthermore, the mismatch between the training distribution and the predicted distribution, together with the improvement after score calibration, suggests that many examples of spa-car are non-representative and noisy, which introduces bias into the training process.

Hence, a good training set selection technique can remove non-representative examples and improve the system.
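The score calibration mentioned above can be illustrated with a small sketch. The snippet below is a hypothetical, minimal illustration rather than the calibration procedure used in the experiments: it fits a single temperature parameter on held-out scores so that overconfident posteriors are softened.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over the class axis."""
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def fit_temperature(logits, labels, grid=np.linspace(0.5, 5.0, 46)):
    """Pick the temperature T that minimizes the negative log-likelihood
    of softmax(logits / T) on a held-out calibration set."""
    def nll(T):
        p = softmax(logits / T)
        return -np.mean(np.log(p[np.arange(len(labels)), labels] + 1e-12))
    return min(grid, key=nll)
```

A grid search over a single scalar is deliberately simple; richer calibrations (e.g. an affine transform of the scores) follow the same held-out fitting idea.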

CHAPTER 6. TACKLING IMBALANCED DATASET FOR END-TO-END NETWORKS

CHAPTER 7

Conclusions

In this work, we investigated a comprehensive deep learning approach to end-to-end automatic language identification (LID). Motivated by the recent success of DNNs in speech recognition, we explored a combination of the most advanced network architectures, including CNN, RNN, and FNN, to replace the pipeline of handcrafted features with BNF and i-vectors.

Our architecture takes into account computational issues and regularization effects in order to construct deeper networks that address the large-scale LID task. Additionally, we presented an integrated framework to effectively tackle the challenge of class imbalance in the dataset.

Our results show that a deep neural network, as a nonlinear model, can be trained on an imbalanced dataset and achieves competitive results for LID tasks. Even though our proposed architecture has not surpassed the recent state-of-the-art BNF i-vector system, the trained model shows promising results when combining multiple architectures for LID. Our Deep Language system outperforms the shallow and single-architecture approaches. The network was also iteratively improved by including batch normalization, a Bayesian cross-entropy objective, and careful calibration of the final scores. An initially good result on the validation set, combined with moderate performance on the evaluation data, suggests that the network was able to capture the long-term temporal dependencies of speech utterances.
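The Bayesian cross-entropy objective mentioned above can be sketched as a prior-weighted loss. The function below is a simplified NumPy illustration under assumed conventions, not the exact objective from this work: each example is re-weighted by the inverse empirical prior of its class so that minority languages are not drowned out by majority ones.

```python
import numpy as np

def prior_weighted_cross_entropy(probs, labels, priors):
    """Cross-entropy where each example is weighted by the inverse
    empirical prior of its class, counteracting class imbalance.

    probs  : (n, k) predicted class probabilities
    labels : (n,)   integer class labels
    priors : (k,)   empirical class frequencies on the training set
    """
    weights = 1.0 / priors[labels]       # inverse-prior weight per example
    weights = weights / weights.sum()    # normalize to keep the scale comparable
    log_lik = np.log(probs[np.arange(len(labels)), labels] + 1e-12)
    return -np.sum(weights * log_lik)
```

With uniform priors the expression reduces to the ordinary mean negative log-likelihood, so the weighting only changes behavior when the classes are actually imbalanced.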

On the other hand, the degradation of the system on the evaluation corpus leaves plenty of room for further improvement. The mismatch between the training distribution and the predicted distribution suggests that many examples of spa-car are non-representative and noisy. Hence, a good training set selection technique can remove non-representative examples and improve the system. It is also notable that the BNF of the baseline approach was trained using an external dataset (i.e., the Switchboard corpus). Subsequently, we plan to investigate ways to enhance network performance by pre-training it with an external corpus, so that it can compete against the BNF system more fairly. [49] also suggests leveraging the large amount of evaluation data by proportionally fitting the model on pseudo-labeled test data.
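The pseudo-labeling idea attributed to [49] can be sketched as a simple confidence filter. The snippet below is an assumed minimal recipe, not a procedure tested in this work: only evaluation utterances whose top posterior clears a threshold receive hard labels for the next training round.

```python
import numpy as np

def select_pseudo_labeled(probs, threshold=0.9):
    """Return indices and hard labels of examples whose maximum
    posterior exceeds the confidence threshold."""
    confident = probs.max(axis=1) >= threshold
    idx = np.where(confident)[0]
    return idx, probs.argmax(axis=1)[idx]
```

The selected examples would then be mixed into the training set, typically with a reduced weight, so that confident evaluation data nudges the model toward the test distribution without overwhelming the labeled data.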

Bibliography

[1] Convolutional tutorial (http://deeplearning.net/tutorial/lenet.html), 1997.

[2] Switchboard-1 corpus (https://catalog.ldc.upenn.edu/ldc97s62), 1997.

[3] Recent Advances in Deep Learning for Speech Research at Microsoft. IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), May 2013.

[4] Categorical distribution (https://en.wikipedia.org/wiki/categorical_distribution), 2016.

[5] Chain rules (calculus) (https://en.wikipedia.org/wiki/chain_rule), 2016.

[6] Convolution (https://en.wikipedia.org/wiki/convolution), 2016.

[7] Cross entropy (https://en.wikipedia.org/wiki/cross_entropy), 2016.

[8] Gaussian mixture model (https://en.wikipedia.org/wiki/mixture_model), 2016.

[9] Hidden markov model (https://en.wikipedia.org/wiki/hidden_markov_model), 2016.

[10] Markov chain (https://en.wikipedia.org/wiki/markov_chain), 2016.

[11] Mean squared error (https://en.wikipedia.org/wiki/mean_squared_error), 2016.

[12] One hot (https://en.wikipedia.org/wiki/one-hot), 2016.

[13] Phoneme (https://en.wikipedia.org/wiki/phoneme), 2016.

[14] Preemphasis improvement (https://en.wikipedia.org/wiki/preemphasis_improvement), 2016.

[15] Sampling (signal processing) (https://en.wikipedia.org/wiki/sampling_(signal_processing)), 2016.


[16] Softmax function (https://en.wikipedia.org/wiki/softmax_function), 2016.

[17] Speech (https://en.wikipedia.org/wiki/speech), 2016.

[18] Stationary process (https://en.wikipedia.org/wiki/stationary_process), 2016.

[19] Transfer function (https://en.wikipedia.org/wiki/artificial_neuron), 2016.

[20] Voice frequency (https://en.wikipedia.org/wiki/voice_frequency), 2016.

[21] Windowing function (https://en.wikipedia.org/wiki/window_function), 2016.

[22] O. Abdel-Hamid, A. Mohamed, Hui Jiang, and G. Penn. Applying convolutional neural network concepts to hybrid NN-HMM model for speech recognition. In ICASSP, pages 4277–4280, March 2012.

[23] Dario Amodei, Rishita Anubhai, Eric Battenberg, et al. Deep speech 2: End-to-end speech recognition in English and Mandarin. CoRR, abs/1512.02595, 2015.

[24] Dzmitry Bahdanau, Jan Chorowski, Dmitriy Serdyuk, Philemon Brakel, and Yoshua Bengio. End-to-end attention-based large vocabulary speech recognition. CoRR, abs/1508.04395, 2015.

[25] H. Behravan, V. Hautamaki, S. M. Siniscalchi, T. Kinnunen, and Chin-Hui Lee. i-vector modeling of speech attributes for automatic foreign accent recognition. Audio, Speech, and Language Processing, IEEE/ACM Transactions on, 24(1):29–41, January 2016.

[26] Hamid Behravan, Ville Hautamäki, and Tomi Kinnunen. Factors affecting i-vector based foreign accent recognition. Speech Commun., 66(C):118–129, February 2015.

[27] Ratnajit Bhattacharjee. Non-stationary nature of speech signal (online course). http://iitg.vlab.co.in, 2016.

[28] Sara Bongartz, Yucheng Jin, Fabio Paternò, Joerg Rett, Carmen Santoro, and Lucio Davide Spano. Ambient Intelligence: Third International Joint Conference, AmI 2012, Pisa, Italy, November 13-15, 2012. Proceedings, chapter Adaptive User Interfaces for Smart Environments with the Support of Model-Based Languages, pages 33–48. Springer Berlin Heidelberg, Berlin, Heidelberg, 2012.

[29] E. Oran Brigham. The Fast Fourier Transform and Its Applications. Prentice-Hall, Inc., Upper Saddle River, NJ, USA, 1988.

[30] J. Chung, C. Gulcehre, K. Cho, and Y. Bengio. Empirical Evaluation of Gated Recurrent Neural Networks on Sequence Modeling. ArXiv e-prints, December 2014.

[31] D. Hubel and T. Wiesel. Receptive fields and functional architecture of monkey striate cortex. Journal of Physiology (London), 195:215–243, 1968.

[32] Alexandre Dalyac, Murray Shanahan, and Jack Kelly. Tackling class imbalance with deep convolutional neural networks. Thesis, Imperial College London, 2014.

[33] N. Dehak, P. J. Kenny, R. Dehak, P. Dumouchel, and P. Ouellet. Front-end factor analysis for speaker verification. Trans. Audio, Speech and Lang. Proc., 19(4):788–798, May 2011.

[34] Najim Dehak, Pedro A. Torres-Carrasquillo, Douglas A. Reynolds, and Reda Dehak. Language recognition via i-vectors and dimensionality reduction. In Interspeech, pages 857–860. Citeseer, 2011.

[35] V. Dumoulin and F. Visin. A guide to convolution arithmetic for deep learning. ArXiv e-prints, March 2016.

[36] R. Eldan and O. Shamir. The Power of Depth for Feedforward Neural Networks. ArXiv, December 2015.

[37] S. Furui. Digital Speech Processing, Synthesis, and Recognition, Second Edition. Signal Processing and Communications. Taylor & Francis, 2000.

[38] Mark Gales and Steve Young. The application of hidden markov models in speech recognition. Found. Trends Signal Process., 1(3):195–304, January 2007.

[39] A. L. Gibbs and F. E. Su. On Choosing and Bounding Probability Metrics. Interdisciplinary Science Reviews, 70:419–435, December 2002.

[40] H. Gish. A probabilistic approach to the understanding and training of neural network classifiers. In Acoustics, Speech, and Signal Processing, 1990. ICASSP-90., 1990 International Conference on, pages 1361–1364 vol.3, Apr 1990.

[41] Ondrej Glembek, Pavel Matejka, Lukas Burget, and Tomas Mikolov. Advances in phonotactic language recognition. In INTERSPEECH, pages 743–746, 2008.


[42] Xavier Glorot and Yoshua Bengio. Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the International Conference on Artificial Intelligence and Statistics (AISTATS'10). Society for Artificial Intelligence and Statistics, 2010.

[43] A. Graves, N. Jaitly, and A.-R. Mohamed. Hybrid speech recognition with deep bidirectional LSTM. In Automatic Speech Recognition and Understanding (ASRU), 2013 IEEE Workshop on, pages 273–278, December 2013.

[44] Alex Graves, Abdel-rahman Mohamed, and Geoffrey E. Hinton. Speech recognition with deep recurrent neural networks. ICASSP, 2013.

[45] Alex Graves, Greg Wayne, and Ivo Danihelka. Neural turing machines. CoRR, abs/1410.5401, 2014.

[46] Klaus Greff, Rupesh Kumar Srivastava, Jan Koutník, Bas R. Steunebrink, and Jürgen Schmidhuber. LSTM: A search space odyssey. CoRR, abs/1503.04069, 2015.

[47] Georg Heigold, Ignacio Moreno, Samy Bengio, and Noam Shazeer. End-to-end text-dependent speaker verification. CoRR, abs/1509.08062, 2015.

[48] Paulina Hensman and David Masko. The impact of imbalanced training data for convolutional neural networks. Degree Project in Computer Science, KTH Royal Institute of Technology, 2015.

[49] G. Hinton, O. Vinyals, and J. Dean. Distilling the Knowledge in a Neural Network. ArXiv, March 2015.

[50] Geoffrey Hinton, Li Deng, Dong Yu, Abdel-rahman Mohamed, Navdeep Jaitly, Andrew Senior, Vincent Vanhoucke, Patrick Nguyen, George Dahl, Tara Sainath, and Brian Kingsbury. Deep neural networks for acoustic modeling in speech recognition. IEEE Signal Processing Magazine, 29(6):82–97, November 2012.

[51] Geoffrey E. Hinton, Li Deng, Dong Yu, George E. Dahl, Abdel-rahman Mohamed, Navdeep Jaitly, Andrew Senior, Vincent Vanhoucke, Patrick Nguyen, Tara N. Sainath, and Brian Kingsbury. Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups. IEEE Signal Process. Mag., 29(6):82–97, 2012.

[52] Geoffrey E. Hinton, Nitish Srivastava, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Improving neural networks by preventing co-adaptation of feature detectors. CoRR, abs/1207.0580, 2012.

[53] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural Comput., 9(8):1735–1780, November 1997.

[54] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep learning. Book in preparation for MIT Press, 2016.

[55] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. CoRR, abs/1502.03167, 2015.

[56] Bing Jiang, Yan Song, Si Wei, Jun-Hua Liu, Ian Vince McLoughlin, and Li-Rong Dai. Deep bottleneck features for spoken language identification. PLoS ONE, 9(7):e100795, 07 2014.

[57] Bing Jiang, Yan Song, Si Wei, Jun-Hua Liu, Ian Vince McLoughlin, and Li-Rong Dai. Deep bottleneck features for spoken language identification. PLoS ONE, 9(7):e100795, 2014.

[58] Kristiina Jokinen, Trung Ngo Trong, and Ville Hautamäki. Variation in spoken North Sami language. In Interspeech 2016, pages 3299–3303, 2016.

[59] Dan Jurafsky. CS 224S / LINGUIST 285: Spoken Language Processing. Stanford course, Spring 2014.

[60] F. Kanaya and S. Miyake. Bayes statistical behavior and valid generalization of pattern classifying neural networks. IEEE Transactions on Neural Networks, 2(4):471–475, Jul 1991.

[61] Abbas Khosravani, Mohammad Mehdi Homayounpour, Dijana Petrovska-Delacrétaz, and Gérard Chollet. A plda approach for language and text independent speaker recog-nition. In Odyssey 2016: The Speaker and Language Recognition Workshop, pages 264–269, Bilbao, Spain, June 21-24 2016.

[62] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. CoRR, abs/1412.6980, 2014.

[63] Tomi Kinnunen and Haizhou Li. An overview of text-independent speaker recogni-tion: From features to supervectors. Speech Communication, 52(1):12 – 40, 2010.


[64] D. B. Koch, T. J. McGee, A. R. Bradlow, and N. Kraus. Acoustic-phonetic approach toward understanding neural processes and speech perception. Department of Communication Sciences and Disorders, Northwestern University, Evanston, 1999.

[65] Kong Aik Lee, Haizhou Li, Li Deng, Ville Hautamaki, et al. The 2015 NIST language recognition evaluation: the shared view of I2R, Fantastic4 and SingaMS. Interspeech, 2016.

[66] C. Laurent, G. Pereyra, P. Brakel, Y. Zhang, and Y. Bengio. Batch Normalized Recurrent Neural Networks. ArXiv e-prints, October 2015.

[67] Steve Lawrence, Ian Burns, Andrew D. Back, Ah Chung Tsoi, and C. Lee Giles. Neural network classification and prior class probabilities. In Neural Networks: Tricks of the Trade, This Book is an Outgrowth of a 1996 NIPS Workshop, pages 299–313, London, UK, 1998. Springer-Verlag.

[68] Yann LeCun and Yoshua Bengio. The handbook of brain theory and neural networks, chapter Convolutional Networks for Images, Speech, and Time Series, pages 255–258. MIT Press, Cambridge, MA, USA, 1998.

[69] Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. Deep learning. Nature, 521(7553):436–444, 05 2015.

[70] Yann LeCun, Leon Bottou, Genevieve B. Orr, and Klaus-Robert Müller. Efficient backprop. In Genevieve B. Orr and Klaus-Robert Müller, editors, Neural Networks: Tricks of the Trade, volume 1524 of Lecture Notes in Computer Science, pages 9–50. Springer Berlin Heidelberg, 1998.

[71] Haizhou Li, Bin Ma, and Kong Aik Lee. Spoken language recognition: From fundamentals to practice. Proceedings of the IEEE, 101(5):1136–1159, May 2013.

[72] Zachary Chase Lipton. A critical review of recurrent neural networks for sequence learning. CoRR, abs/1506.00019, 2015.

[73] Ignacio Lopez-Moreno, Javier Gonzalez-Dominguez, and Oldrich Plchot. Automatic language identification using deep neural networks. In Proc. ICASSP, 2014.

[74] D. C. Lyu, E. S. Chng, and H. Li. Language diarization for code-switch conversational speech. In 2013 IEEE International Conference on Acoustics, Speech and Signal Processing, pages 7314–7318, May 2013.

[75] Warren S. McCulloch and Walter Pitts. A logical calculus of the ideas immanent in nervous activity. The bulletin of mathematical biophysics, 5(4):115–133, 1943.

[76] Marvin Minsky and Seymour A. Papert. Perceptrons. MIT Press, Cambridge, 1969.

[77] Sirko Molau. Normalization in the acoustic feature space for improved speech recog-nition. PhD thesis, Bibliothek der RWTH Aachen, 2003.

[78] NIST. The 2015 nist language recognition evaluation plan, 2015.

[79] Maxime Oquab, Leon Bottou, Ivan Laptev, and Josef Sivic. Learning and transferring mid-level image representations using convolutional neural networks. In Proceedings of the 2014 IEEE Conference on Computer Vision and Pattern Recognition, CVPR '14, pages 1717–1724, Washington, DC, USA, 2014. IEEE Computer Society.

[80] Ossama Abdel-Hamid, Abdel-rahman Mohamed, Hui Jiang, Li Deng, Gerald Penn, and Dong Yu. Convolutional neural networks for speech recognition. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 22:1533–1545, October 2014.

[81] Razvan Pascanu, Çaglar Gülçehre, Kyunghyun Cho, and Yoshua Bengio. How to construct deep recurrent neural networks. CoRR, abs/1312.6026, 2013.

[82] Razvan Pascanu, Tomas Mikolov, and Yoshua Bengio. On the difficulty of training recurrent neural networks. JMLR, 2013.

[83] Autor Práce. Phonotatic and Acoustic Language Recognition. PhD thesis, Faculty of Electrical Engineering and Communication Department of Radio Electronics, Brno University of Technology, 2008.

[84] Lutz Prechelt. Neural Networks: Tricks of the Trade: Second Edition, chapter Early Stopping — But When?, pages 53–67. Springer Berlin Heidelberg, Berlin, Heidel-berg, 2012.

[85] Alec Radford, Luke Metz, and Soumith Chintala. Unsupervised representation learn-ing with deep convolutional generative adversarial networks. CoRR, abs/1511.06434, 2015.

[86] Fred Richardson, Douglas A. Reynolds, and Najim Dehak. A unified deep neural network for speaker and language recognition. CoRR, abs/1504.00923, 2015.


[87] Tony Robinson, Mike Hochberg, and Steve Renals. Automatic Speech and Speaker Recognition: Advanced Topics, chapter The Use of Recurrent Neural Networks in Continuous Speech Recognition, pages 233–258. Springer US, Boston, MA, 1996.

[88] F. Rosenblatt. The perceptron: A probabilistic model for information storage and organization in the brain. Psychological Review, pages 65–386, 1958.

[89] Tara N. Sainath, Brian Kingsbury, George Saon, Hagen Soltau, Abdel-rahman Mohamed, George Dahl, and Bhuvana Ramabhadran. Deep Convolutional Neural Networks for Large-scale Speech Tasks. Neural Networks, pages 1–10, November 2014.

[90] Tara N. Sainath, Ron J. Weiss, Andrew W. Senior, Kevin W. Wilson, and Oriol Vinyals. Learning the speech front-end with raw waveform CLDNNs. In INTERSPEECH, pages 1–5. ISCA, 2015.

[91] T. N. Sainath, O. Vinyals, A. Senior, and H. Sak. Convolutional, long short-term memory, fully connected deep neural networks. In ICASSP, pages 4580–4584, April 2015.

[92] H. Sak, A. Senior, K. Rao, and F. Beaufays. Fast and Accurate Recurrent Neural Network Acoustic Models for Speech Recognition. ArXiv, July 2015.

[93] Hasim Sak, Andrew W. Senior, and Françoise Beaufays. Long short-term memory based recurrent neural network architectures for large vocabulary speech recognition. CoRR, abs/1402.1128, 2014.

[94] Andrew M. Saxe, James L. McClelland, and Surya Ganguli. Exact solutions to the nonlinear dynamics of learning in deep linear neural networks. CoRR, abs/1312.6120, 2013.

[95] J. Schmidhuber. Deep learning in neural networks: An overview. Neural Networks, 61:85–117, 2015. Published online 2014; based on TR arXiv:1404.7828 [cs.NE].

[96] Aleksandr Sizov, Kong Aik Lee, and Tomi Kinnunen. Discriminating languages in a probabilistic latent subspace. Accepted for Odyssey: the Speaker and Language Recognition Workshop, 2016.

[97] Aleksandr Sizov, Kong Aik Lee, and Tomi Kinnunen. Discriminating languages in a probabilistic latent subspace. Accepted for Odyssey: the Speaker and Language Recognition Workshop, 2016.

[98] Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: A simple way to prevent neural networks from overfitting. JMLR, 15:1929–1958, 2014.

[99] S. S. Stevens, J. Volkmann, and E. B. Newman. A Scale for the Measurement of the Psychological Magnitude Pitch. The Journal of the Acoustical Society of America, 8(3):185–190, January 1937.

[100] Ilya Sutskever. Training recurrent neural networks. PhD thesis, Graduate Department of Computer Science, University of Toronto, 2013.

[101] Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. Sequence to sequence learning with neural networks. NIPS, 2014.

[102] T. Tieleman and G. Hinton. Neural networks for machine learning, lecture 6.5: rmsprop. 2012, https://www.coursera.org/course/neuralnets.

[103] Pedro A. Torres-Carrasquillo, Elliot Singer, Mary A. Kohler, Richard J. Greene, Douglas A. Reynolds, and John R. Deller Jr. Approaches to language identification using Gaussian mixture models and shifted delta cepstral features. pages 89–92, 2002.

[104] Trung Ngo Trong, Ville Hautamaki, and Kong Aik Lee. Deep language: a comprehensive deep learning approach to end-to-end language recognition. Accepted for Odyssey: the Speaker and Language Recognition Workshop, 2016.

[105] Aäron van den Oord, Sander Dieleman, Heiga Zen, Karen Simonyan, Oriol Vinyals, Alex Graves, Nal Kalchbrenner, Andrew W. Senior, and Koray Kavukcuoglu. Wavenet: A generative model for raw audio. CoRR, abs/1609.03499, 2016.

[106] Bing Xu, Naiyan Wang, Tianqi Chen, and Mu Li. Empirical evaluation of rectified activations in convolutional network. CoRR, abs/1505.00853, 2015.

[107] Neha Yadav, Anupam Yadav, and Manoj Kumar. An introduction to neural network methods for differential equations. SpringerBriefs in Applied Sciences and Technology, 2015.

[108] Matthew D. Zeiler. ADADELTA: an adaptive learning rate method. CoRR, abs/1212.5701, 2012.

[109] Matthew D. Zeiler and Rob Fergus. Visualizing and understanding convolutional net-works. CoRR, abs/1311.2901, 2013.

[110] Fang Zheng, Guoliang Zhang, and Zhanjiang Song. Comparison of different implementations of MFCC. Journal of Computer Science and Technology, 16(6):582–589, 2001.