

'target' class samples being rare. In this case, for a decision cascade to be computationally efficient, it should specifically be able to make early decisions to the prevalent 'clutter' class. A one-sided BOA cascade that can make early decisions to the 'clutter' class is obtained by utilizing only a single conjunction, which combines functions of all the target likelihoods according to the conjunction list $z_1 = (1, 2, 3, \ldots, S)$. On the other hand, a one-sided BOA cascade capable of making early decisions to the 'target' class is obtained if the conjunction list of every single target class model is included in the set of conjunction lists of the BOA, i.e. $\{(1), (2), (3), \ldots, (S)\} \subseteq \{z_q \mid q = 1 \ldots Q\}$. A symmetrical BOA cascade is obtained using $Q = S$ conjunction lists $z_1 = (1)$, $z_2 = (1, 2)$, $z_3 = (1, 2, 3)$, ..., $z_Q = (1, 2, 3, \ldots, S)$.
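The role of the conjunction lists can be made concrete with a short sketch. The following Python snippet (the function and variable names are illustrative, not from the Publications) evaluates a generic BOA combination of thresholded detector outputs, so the one-sided and symmetrical cascades above correspond simply to different choices of the conjunction lists.

```python
from typing import Dict, Sequence

def boa_decision(detections: Dict[int, bool],
                 conjunction_lists: Sequence[Sequence[int]]) -> bool:
    """Evaluate a Boolean OR of ANDs (BOA) combination.

    detections        -- thresholded outputs d_s(x) of the S target detectors,
                         indexed by detector number 1..S
    conjunction_lists -- the lists z_q; each list names the detectors whose
                         decisions are ANDed, and the conjunctions are ORed.
    """
    return any(all(detections[s] for s in z) for z in conjunction_lists)

S = 4
# One-sided cascade with early 'clutter' decisions: one conjunction over all detectors.
clutter_sided = [tuple(range(1, S + 1))]            # z1 = (1, 2, ..., S)
# One-sided cascade with early 'target' decisions: every single-detector conjunction included.
target_sided = [(s,) for s in range(1, S + 1)]      # {(1), (2), ..., (S)}
# Symmetrical cascade: nested conjunctions z1 = (1), z2 = (1, 2), ..., zS = (1, ..., S).
symmetric = [tuple(range(1, q + 1)) for q in range(1, S + 1)]

d = {1: True, 2: False, 3: True, 4: True}           # example thresholded detector outputs
print(boa_decision(d, clutter_sided), boa_decision(d, target_sided), boa_decision(d, symmetric))
```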

5.4 Laughter detection with a BOA cascade

In Publication VI I have utilized a BOA cascade framework for classifying video clips of the MAHNOB Laughter dataset (Petridis et al. 2015) into those that contain laughter and those that do not. I have used two models – or detectors – of laughter for the task. The detectors are built to mimic, as closely as possible, those used by Petridis et al. (2015) for the same task. The first detector operates on the audio stream of the video and provides a laughter likelihood score $l_1$ for the video clip: it computes MFCC features from the audio stream and evaluates a single-output feed-forward NN. The second detector provides a laughter likelihood score $l_2$ based on the image stream of the video: it finds 20 face points with the algorithm of Zhu and Ramanan (2012), reduces the feature dimensionality with PCA, and evaluates a feed-forward NN. Details of the two detectors can be found in Publication VI.
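As a rough structural sketch of the audio-based detector (the exact feature configuration and network architecture are given in Publication VI; the sizes and the untrained weights below are placeholders), the likelihood score $l_1$ could be produced along these lines:

```python
import numpy as np
import librosa  # assumed available; any MFCC implementation would do

def laughter_likelihood_audio(wav_path: str, rng=np.random.default_rng(0)) -> float:
    """Toy stand-in for the audio detector: MFCCs -> feed-forward NN -> likelihood l1.

    The network weights are random placeholders; in Publication VI the network
    is trained on labelled laughter/speech clips.
    """
    y, sr = librosa.load(wav_path, sr=16000)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)            # (13, n_frames)
    feat = np.concatenate([mfcc.mean(axis=1), mfcc.std(axis=1)])  # clip-level statistics

    # Single hidden layer feed-forward NN with a sigmoid output unit.
    W1, b1 = rng.normal(size=(32, feat.size)), np.zeros(32)
    W2, b2 = rng.normal(size=(1, 32)), np.zeros(1)
    h = np.tanh(W1 @ feat + b1)
    l1 = 1.0 / (1.0 + np.exp(-(W2 @ h + b2)))
    return float(l1[0])
```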

The class distribution in this task is nearly balanced, so any cascade type is likely to reduce the computational load. Thus, BOA combinations with all possible conjunction list configurations have been evaluated in the experiments for Publication VI.

The best results were obtained with a BOA cascade

$$
B_C(x; \alpha) = d_1(x; \theta_1^1) \vee \bigvee_{n=1}^{N} \left[ d_1(x; \theta_1^{2,n}) \wedge d_2(x; \theta_2^{2,n}) \right] \qquad (5.6)
$$

which is built with $z_1 = [1]$ and $z_2 = [1, 2]$, and with $N$ selected by the BOATS algorithm.

Table 5.2: Results obtained with the BOA cascade $B_C$ of (5.6) in comparison to results found in the literature in laughter vs. speech classification on the MAHNOB Laughter data. The used measures of classifier performance are the overall accuracy, the $F_1$-scores for both speech (F1sp) and laughter (F1lg), and the percentage of computed visual features (v.f.). The BOA detectors are used at the operating point $\alpha$ with the highest accuracy. *) The classifier of Petridis et al. (2015) has been trained with another dataset. **) The results of Rao et al. (2015) are with 15 speakers, while the other authors use 22 speakers in their tests.

                                          acc.   F1sp   F1lg   v.f. %
BOA cascade of B_C, (5.6), N = 1          96.0   .966   .958    11%
BOA cascade of B_C, (5.6), N by BOATS     96.9   .972   .955    33%
Rudovic et al. (2013)                     92.7   .943   .905   100%
Petridis et al. (2015)*                   91.7   .932   .893   100%
Rao et al. (2015)**                       96.9   .973   .963   100%

The decisions at the two stages of the resulting cascade are made according to

$$
\begin{aligned}
B_1^{\text{laughter}} &= d_1(x; \theta_1^1) \\
B_1^{\text{speech}} &= \neg d_1(x; \theta_1^1) \wedge \bigwedge_{n=1}^{N} \neg d_1(x; \theta_1^{2,n}) \\
B_2^{\text{laughter}} &= \bigvee_{n=1}^{N} \left[ d_1(x; \theta_1^{2,n}) \wedge d_2(x; \theta_2^{2,n}) \right] \\
B_2^{\text{speech}} &= \bigvee_{i_1=1}^{2} \bigvee_{i_2=1}^{2} \cdots \bigvee_{i_N=1}^{2} \left[ \neg d_1(x; \theta_1^1) \wedge \neg d_{z_2(i_1)}(x; \theta_{z_2(i_1)}^{2,1}) \wedge \neg d_{z_2(i_2)}(x; \theta_{z_2(i_2)}^{2,2}) \wedge \cdots \wedge \neg d_{z_2(i_N)}(x; \theta_{z_2(i_N)}^{2,N}) \right]
\end{aligned}
\qquad (5.7)
$$
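In code form, the cascade of (5.6) and (5.7) amounts to evaluating the cheap audio detector first and computing the visual detector only when no early decision can be made. The sketch below is illustrative (the names are hypothetical, and each detector $d_s(x;\theta)$ is assumed to be a threshold $\theta$ on the likelihood score $l_s(x)$), but it follows the stage decisions above.

```python
from typing import Callable, Sequence

def boa_cascade_classify(x,
                         l1: Callable[[object], float],   # audio laughter likelihood (cheap)
                         l2: Callable[[object], float],   # visual laughter likelihood (expensive)
                         theta1_1: float,                 # threshold theta_1^1
                         theta1_2: Sequence[float],       # thresholds theta_1^{2,n}, n = 1..N
                         theta2_2: Sequence[float]) -> str:  # thresholds theta_2^{2,n}, n = 1..N
    """Two-stage BOA cascade of (5.6): laughter vs. speech with early decisions."""
    a = l1(x)                                   # stage 1: audio features only

    # B_1^laughter: the single-detector conjunction fires.
    if a > theta1_1:
        return "laughter"
    # B_1^speech: no stage-2 conjunction can fire, since each needs d_1(x; theta_1^{2,n}).
    if all(a <= t for t in theta1_2):
        return "speech"

    v = l2(x)                                   # stage 2: visual features are now computed
    # B_2^laughter: some conjunction d_1 AND d_2 fires.
    if any(a > t1 and v > t2 for t1, t2 in zip(theta1_2, theta2_2)):
        return "laughter"
    # B_2^speech: the whole BOA expression is false.
    return "speech"
```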

The classification performance of the BOA cascade $B_C$ was evaluated in terms of the resulting classification accuracy and computational load. The main results of Publication VI are reproduced in Table 5.2. The results show that the BOA cascade $B_C$, trained with the developed BOATS algorithm described in Section 4.4, outperforms the reference classifiers of Rudovic et al. (2013) and Petridis et al. (2015) while utilizing far less computational resources. The best classification accuracy was achieved with $B_C$ with $N$ selected by the BOATS algorithm.

6 Automatic Speech Recognition

Automatic speech recognition (ASR) has proven to be an extremely difficult task, as shown by the long history of research on the problem. Only in recent years has the capability of the algorithms reached a performance satisfactory for general use. In conditions of low background noise, small-vocabulary systems such as digit recognition frameworks or user interfaces with a few command words have performed satisfactorily for quite some time. Dictation software, which transcribes spoken sentences into text, has also been used successfully in quiet environments, e.g. a medical doctor's appointment room. Automatic speech recognition performance reaching the human level has recently been reported by Xiong et al. (2017) using DNNs.

The difficulty of the ASR task is due to the large variability inherent in the speech signal. There are multiple sources of uncertainty in the ASR task of decoding the spoken words from an acoustic signal. Different languages form entirely different probability distributions of the phone sequences that speech may consist of. Every person has a unique voice, and there is great variation in the ways a given sentence may be pronounced. The acoustic characteristics of the speaking environment vary hugely, and there is often background noise on top of the speech to be recognized.

In terms of Bayesian decision making, the estimated word sequence $\hat{w}$ from among all the possible word sequences $W$ is given by the conditional probability

$$
\hat{w} = \arg\max_{w \in W} P(w \mid X) = \arg\max_{w \in W} P(w)\, P(X \mid w) \qquad (6.1)
$$

with respect to the speech signal $X$. The standard procedure is to represent the probability $P(w \mid X)$ in terms of two models of the speech phenomenon, one producing the probability of the word sequence, $P(w)$, and the other the probability of the signal given the sequence, $P(X \mid w)$. The success of an ASR framework thus depends on how well these models manage their function. However, since the probability distributions of different word sequences are broad and overlapping, even the smallest possible Bayes error is notable for ASR.
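In practice the maximization in (6.1) is carried out in the log domain, with the language model supplying $\log P(w)$ and the acoustic model $\log P(X \mid w)$. A minimal sketch of the decision rule over a hypothetical, explicitly enumerated candidate set (real systems search over word lattices instead) could look like this:

```python
def decode(X, candidate_sentences, log_language_model, log_acoustic_model):
    """Pick w_hat = argmax_w P(w) P(X|w), evaluated as a sum of log probabilities.

    log_language_model(w) is assumed to return log P(w) and
    log_acoustic_model(X, w) to return log P(X|w).
    """
    return max(candidate_sentences,
               key=lambda w: log_language_model(w) + log_acoustic_model(X, w))
```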

At the core of an ASR framework there is a language model producing $P(w)$. The language model incorporates information about the vocabulary and grammar of the language that the ASR framework is built for. It also includes knowledge about the different possibilities for pronouncing words and phrases in the language. The language model, usually an n-gram model and mostly realized as an HMM, is trained using written texts and pronunciation information of the language. Errors due to this model are of two kinds. First, it is very unlikely that a learned n-gram model incorporates all the possible use cases of the language. Second, the more flexible the model is, the smaller its capability to discriminate among different word sequences.
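The first kind of error can be made concrete with a toy example: an unsmoothed bigram (n = 2) model, sketched below (not the models of the Publications), assigns probability zero to any word sequence containing a word pair never seen in the training texts, which is why practical models need smoothing and back-off at the cost of discrimination capability.

```python
from collections import Counter

def train_bigram(sentences):
    """Estimate bigram probabilities P(w_i | w_{i-1}) from tokenized training sentences."""
    unigrams, bigrams = Counter(), Counter()
    for s in sentences:
        tokens = ["<s>"] + s + ["</s>"]
        unigrams.update(tokens[:-1])
        bigrams.update(zip(tokens[:-1], tokens[1:]))
    return lambda prev, w: bigrams[(prev, w)] / unigrams[prev] if unigrams[prev] else 0.0

def sentence_probability(p_bigram, sentence):
    """P(w) for a word sequence as a product of bigram probabilities."""
    tokens = ["<s>"] + sentence + ["</s>"]
    prob = 1.0
    for prev, w in zip(tokens[:-1], tokens[1:]):
        prob *= p_bigram(prev, w)
    return prob

p = train_bigram([["turn", "the", "lights", "on"], ["turn", "the", "radio", "on"]])
print(sentence_probability(p, ["turn", "the", "lights", "on"]))   # seen sequence, nonzero
print(sentence_probability(p, ["turn", "lights", "on"]))          # unseen bigram, zero
```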


The probability $P(X \mid w)$ within an ASR framework is given by an acoustical model. It informs the system about the acoustic characteristics of each phone of speech. The acoustic model should handle the variability of the signal due to different voices, speech emphasis and acoustic conditions. The model is trained using recorded speech audio, and it has been shown by Raj et al. (2012) that if the training audio matches the acoustic conditions of the ASR system's use cases, the system performs notably better than if there is a mismatch between the two.

ASR systems are usually evaluated in terms of the word error rate (WER)

$$
\mathrm{WER} = \frac{S + D + I}{N}, \qquad (6.2)
$$

where $N$ is the number of uttered words within the audio material, $S$ is the number of misinterpreted words, $D$ is the number of words unnoticed by the system, and $I$ is the number of extra words within the transcription. The recognition accuracy, $\mathrm{acc.} = 1 - \mathrm{WER}$, is also an often-used performance indicator.
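The counts $S$, $D$ and $I$ in (6.2) are obtained by aligning the recognized word sequence against the reference transcript, typically with the Levenshtein (minimum edit distance) algorithm. A minimal sketch of such an evaluation (standard dynamic programming, not code from the Publications):

```python
def word_error_rate(reference, hypothesis):
    """WER = (S + D + I) / N via Levenshtein alignment of word lists."""
    ref, hyp = reference.split(), hypothesis.split()
    n, m = len(ref), len(hyp)
    # d[i][j] = minimum number of edits to turn ref[:i] into hyp[:j]
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        d[i][0] = i                      # deletions
    for j in range(m + 1):
        d[0][j] = j                      # insertions
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])    # substitution (or match)
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)  # deletion, insertion
    return d[n][m] / n

print(word_error_rate("turn the lights on", "turn lights off please"))  # (1 sub + 1 del + 1 ins) / 4 = 0.75
```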

When some words are more important to recognize than others, weighted error rates may be used. Binary weighting of the words is used for reporting the key word error rate or accuracy. In Publications I, II and V, the performance of the proposed ASR frameworks is reported in terms of the key word accuracy.

6.1 The traditional ASR framework

Figure 6.1: A general framework, which is the basis of traditional-style implementations of ASR. The framework operates on MFCC feature vectors computed from cut audio frames. It utilizes GMMs to model the distributions of the feature vectors representative of each HMM state and buffers the resulting state likelihoods. The vocabulary and grammar are modeled with the transition probabilities between HMM states, and the best hypothesis for the speech transcription is found using the Viterbi algorithm.

A traditional ASR system, which is based on MFCC features, Gaussian mixture models of the phonemes, and hidden Markov n-gram models modeling the language structure, i.e. capturing the vocabulary and grammar, is depicted in Figure 6.1.
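As a sketch of the decoding stage in Figure 6.1 (schematic only; the state topology, the GMM likelihood computation and any pruning are omitted), the Viterbi recursion over the buffered per-frame state log-likelihoods can be written as:

```python
import numpy as np

def viterbi(log_likelihoods, log_trans, log_init):
    """Most probable HMM state sequence given per-frame state log-likelihoods.

    log_likelihoods -- (T, K) array of log p(x_t | state k), e.g. from the GMMs
    log_trans       -- (K, K) array of log transition probabilities
    log_init        -- (K,)  array of log initial state probabilities
    """
    T, K = log_likelihoods.shape
    delta = np.full((T, K), -np.inf)     # best log score of a path ending in state k at time t
    psi = np.zeros((T, K), dtype=int)    # backpointers to the best previous state
    delta[0] = log_init + log_likelihoods[0]
    for t in range(1, T):
        scores = delta[t - 1][:, None] + log_trans        # (K, K): from-state x to-state
        psi[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0) + log_likelihoods[t]
    # Backtrack from the best final state.
    path = np.empty(T, dtype=int)
    path[-1] = delta[-1].argmax()
    for t in range(T - 2, -1, -1):
        path[t] = psi[t + 1, path[t + 1]]
    return path
```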

The audio signal is processed in frames, as explained in Chapter 2. The length of a smoothed frame is usually around 5-20 ms, the consecutive frames generally overlapping
