
The "direct" approach: an end-to-end system

The “direct” method replaces the middle four blocks, or the whole pipeline, with a DNN (Fig. 4.1).

As an end-to-end system, deep learning removes the burden of hand-crafting the feature extraction, which is the conventional approach in the LID task. This versatility is achieved by integrating the feature extractor and the classifier into a single algorithm, and training the algorithm to learn distributed representations of speech features with multiple levels of abstraction that explicitly benefit the task. This learning strategy allows the network to be optimized to handle a wide range of speech diversity, including ambient noise, speaker variation, and channel effects. In [73], it was found that a deep learning system surpassed i-vector based approaches with a lower number of parameters when a large amount of training data was available. However, that paper only uses a single network architecture. Conversely, it was reported in [91] that a combination of many deep architectures outperforms the conventional deep learning approach to ASR.

As a result, the main subject of our study is adapting the most recent advances in DNN architectures to the LID task. The key to our approach is the recurrent architecture of the DNN, a model that has recently been shown to outperform state-of-the-art DNN systems for acoustic modeling in the speech domain [23, 92, 93, 44]. The central idea behind the recurrent neural network (RNN) is its feedback connection, which creates an internal state to model the temporal dependencies in data that are essential in speech. The difference between traditional DNN approaches and ours is the way our network processes the acoustic features, which is illustrated in Fig. 4.4 (from left to right):

• Our approach is an RNN which encodes a sequence of speech frames into its hidden states. In this case, the temporal pattern of the signal is preserved and learnt by the network.

• Multiple frames are stacked into a “super” vector. Then, a feed-forward network or a convolutional neural network is trained using the stacked features. This approach ignores the time axis and its specific order; hence, learning any temporal dependency is more difficult. (A code sketch of both strategies is given after Fig. 4.4.)

[Figure 4.4 diagram: an audio file is split into overlapping frames over a given segment length; MFCC or filter-bank features x_1, x_2, ..., x_n are either fed frame by frame into an RNN (hidden states h_1, ..., h_n) or flattened into one big feature vector for an FNN; a softmax layer then outputs the target-language probability distribution.]

Figure 4.4: Two different strategies (RNN: left, FNN: right) for applying end-to-end deep learning to language identification
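
To make the two strategies of Fig. 4.4 concrete, the sketch below contrasts an RNN that consumes the frame sequence in order with an FNN trained on flattened, stacked frames. It is only a minimal PyTorch illustration under assumed dimensions (39-dimensional features, 200 frames per segment, 20 target languages, 512 hidden units), not the exact configuration used in this thesis.

```python
# Minimal sketch of the two end-to-end strategies in Fig. 4.4 (assumed sizes).
import torch
import torch.nn as nn

n_feats, n_frames, n_langs = 39, 200, 20   # assumed feature dim, frames, languages

# Strategy 1 (left): an RNN reads the frames sequentially, so the time axis is kept.
class RNNClassifier(nn.Module):
    def __init__(self):
        super().__init__()
        self.rnn = nn.LSTM(input_size=n_feats, hidden_size=512, batch_first=True)
        self.out = nn.Linear(512, n_langs)

    def forward(self, x):                   # x: (batch, n_frames, n_feats)
        _, (h, _) = self.rnn(x)             # h[-1]: last hidden state, (batch, 512)
        return torch.softmax(self.out(h[-1]), dim=-1)

# Strategy 2 (right): frames are stacked/flattened into one long vector,
# discarding the explicit temporal order, and fed to a feed-forward network.
class FNNClassifier(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Flatten(),                    # (batch, n_frames * n_feats)
            nn.Linear(n_frames * n_feats, 512), nn.ReLU(),
            nn.Linear(512, n_langs), nn.Softmax(dim=-1))

    def forward(self, x):
        return self.net(x)

x = torch.randn(8, n_frames, n_feats)        # a dummy batch of 8 segments
print(RNNClassifier()(x).shape, FNNClassifier()(x).shape)  # both (8, 20)
```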

Finally, a softmax activation function projects the hidden states of the RNN into an interpretable probability vector over the target languages [54],

\[
\varphi(y)_j = \frac{e^{y_j}}{\sum_{k=1}^{K} e^{y_k}}, \qquad (4.1)
\]

where K is the total number of classes, the vector y is the affine transform of the activations from the last hidden layer, and φ(y) is the posterior distribution over the target languages. A hard decision can be made by selecting the most probable class. Moreover, another approach may transform the posterior probability into a log-likelihood ratio (LLR), which allows a more flexible decision-making process.
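
The following NumPy sketch spells out Eq. (4.1) and one common posterior-to-LLR conversion. The flat-prior LLR formulation is our assumption for illustration, not necessarily the exact transform used in this thesis.

```python
# Small sketch of Eq. (4.1) and a posterior-to-LLR conversion (flat prior assumed).
import numpy as np

def softmax(y):
    """phi(y)_j = exp(y_j) / sum_k exp(y_k), computed in a numerically stable way."""
    e = np.exp(y - y.max())
    return e / e.sum()

def posterior_to_llr(p):
    """LLR of each class against the average of the competing classes,
    assuming a flat prior over the K target languages."""
    K = len(p)
    eps = 1e-12                      # guard against log(0)
    others = (1.0 - p) / (K - 1)     # average posterior of the other classes
    return np.log((p + eps) / (others + eps))

y = np.array([2.1, 0.3, -1.0, 0.5])  # affine outputs of the last hidden layer
p = softmax(y)
print(p, p.argmax())                 # posterior and the hard decision
print(posterior_to_llr(p))           # LLR scores for threshold-based decisions
```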

CHAPTER 5

Experiments in networks design

5.1 Speech corpus for LID

Spoken language recognition evaluation (LRE) campaigns, routinely conducted by the National Institute of Standards and Technology (NIST), have had a major effect on advancing research in LID [65]. The newest LRE corpus was released by NIST in 2015 and is considered a great challenge to the community due to its differences in certain key aspects [65].

The corpus contains ≈ 796 hours of speech. The task requires identifying languages in the more ambiguous context of closely related languages within specific language clusters.

There are six different language clusters, which together comprise twenty languages. Fig. 5.1 also indicates a heterogeneous distribution among languages, clusters, and even datasets, which emphasizes the importance of prior information in NIST LRE’15. There is not only an unbalanced distribution among languages but also a mismatch in the prior distribution between the development set and the evaluation set. The competitors in NIST LRE’15 also have to deal with diverse audio conditions and qualities, from a phone conversation with ambient noise to a formal interview. Furthermore, our inspection of three random speech utterances from three clusters (Fig. 5.2) suggests diverse structures in the audio clips: some utterances contain very long silences mixed with noise, and the speech activities are occasionally short and meaningless [104]. This observation implies the importance of robustness in learning speech utterance representations; a distributed representation obtained by a deep network can represent many intermediate concepts that are useful for capturing the statistical dependencies between the input signal and the output language.


[Figure 5.1 panels: training set distribution (top); evaluation set distribution (bottom).]

Figure 5.1: Language and cluster distribution in the NIST LRE’15 corpus (the total length in minutes of each language is shown on the horizontal axis)

NIST evaluates system performance on the LRE’15 corpus using a closed-set scenario, where the set of non-target languages is limited to the other languages in the same cluster [65]. The output from each system is a (20-dimensional) vector of log-likelihood-ratio (LLR) scores for each test segment. The objective of LRE’15 is to minimize the criterion in Eq. (5.1), which applies to each cluster and all of its target/non-target language pairs (L_T, L_N),

\[
C_{avg} = \frac{1}{N_L}\left\{ \left[ C_{miss}\, P_{Target} \sum_{L_T} P_{Miss}(L_T) \right] + \frac{1}{N_L - 1}\left[ C_{FA}\,(1 - P_{Target}) \sum_{L_T}\sum_{L_N} P_{FA}(L_T, L_N) \right] \right\}, \qquad (5.1)
\]

where N_L is the number of languages in the cluster, and C_miss, C_FA and P_Target are application-specific parameters representing the weights of the detection miss and false alarm probabilities. For LRE’15, the application parameters are: C_miss = C_FA = 1, and P_Target = 0.5. This objective is used throughout all experiments in this thesis as the criterion to judge the final performance of each system. Furthermore, our intention in this work is to construct the most applicable LID system given a corpus; hence, the training data for any algorithm is limited to the set illustrated in Fig. 5.1. Since the primary cost function of NIST LRE’15 is applied separately for each language cluster [78], we decided to train a different network for each cluster, and the final C_avg is an average of all clusters’ scores.
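
As an illustration of Eq. (5.1) under the LRE’15 settings (C_miss = C_FA = 1, P_Target = 0.5), the sketch below computes the per-cluster cost; the miss and false-alarm rates are made-up numbers for a hypothetical three-language cluster, not results from this thesis.

```python
# Illustrative computation of the per-cluster C_avg of Eq. (5.1).
import numpy as np

def c_avg(p_miss, p_fa, c_miss=1.0, c_fa=1.0, p_target=0.5):
    """p_miss: vector of P_Miss(L_T) for the N_L languages in the cluster.
    p_fa: N_L x N_L matrix of P_FA(L_T, L_N); the diagonal is ignored."""
    n_l = len(p_miss)
    off_diag = p_fa.sum() - np.trace(p_fa)          # sum over all L_T != L_N pairs
    miss_term = c_miss * p_target * np.sum(p_miss)
    fa_term = c_fa * (1.0 - p_target) * off_diag / (n_l - 1)
    return (miss_term + fa_term) / n_l

p_miss = np.array([0.10, 0.05, 0.20])               # hypothetical miss rates
p_fa = np.array([[0.00, 0.04, 0.02],
                 [0.03, 0.00, 0.05],
                 [0.01, 0.06, 0.00]])               # hypothetical false-alarm rates
print(c_avg(p_miss, p_fa))                          # per-cluster score
# The metric reported in this thesis is the average of this score over all clusters.
```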


Figure 5.2: Waveforms of three randomly sampled audio files, indicating very long silences between speech segments.