

on the reverberation time $T_{60}$ among other system parameters. A frequency dependent decay rate $\upsilon(f)$ has been utilized in Habets (2004). Alternatively, the delayed auto-regressive model (3.7) is also widely utilized for reverberation PSD estimation as $\hat{\sigma}^2_R(n, f) = \hat{R}^2(n, f)$. This LP-based approach has been utilized in Publication III, where we found that utilizing only one predictive term, i.e. $L = D$, was the most effective setup, which relates the approach to that of Habets (2004). When multiple channels are available, the reverberation PSD $\sigma^2_R$ may be estimated using a residual $R_1(n, f) = X_1(n, f) - \hat{S}_1(n, f)$ based on the multi-channel estimate $\hat{S}_1(n, f)$, or alternatively using the non-coherence among the time aligned microphone signals.

In the case of a multi-channel signal, spectral enhancement is often utilized after beamforming, on the obtained single-channel signal $\hat{s}$ (or $\hat{S}$). The spectral subtraction type post-filtering after beamforming is then done for improved dereverberation and noise reduction.

3.3.5 Dereverberation using neural networks

Neural networks have been used for the dereverberation task, e.g. by Xiao et al. (2014), as for many other signal processing tasks. To train a neural network to output an estimate of clean audio, the NN must be trained using target audio material for which the clean audio is available. The training material should naturally represent the anticipated NN usage scenario as closely as possible, thus making this approach dependent on relevant training material. If the AIRs of the space in which the dereverberation is to be used can be measured, simulation is an efficient means for generating parallel training data for dereverberation NN training. The simulation is done by convolving clean source material with measured or approximated AIRs according to (3.1) and adding different kinds of noise signals to the reverberant sound.
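As an illustration, a parallel training pair can be generated along these lines. The following is a minimal NumPy sketch; the function interface and the SNR-based noise scaling are assumptions of the example, not specifics of any cited work.

```python
import numpy as np

def simulate_reverberant(clean, air, noise, snr_db):
    """Generate a (reverberant + noise, clean) training pair per (3.1).

    clean : 1-D array, dry source signal
    air   : 1-D array, measured or approximated acoustic impulse response
    noise : 1-D array, noise recording at least as long as the result
    snr_db: desired signal-to-noise ratio in dB (this sketch's convention)
    """
    # Reverberant signal: convolution of the dry source with the AIR.
    reverberant = np.convolve(clean, air)
    # Scale the noise to the requested SNR relative to the reverberant signal.
    noise = noise[:len(reverberant)]
    sig_pow = np.mean(reverberant ** 2)
    noise_pow = np.mean(noise ** 2) + 1e-12
    gain = np.sqrt(sig_pow / (noise_pow * 10.0 ** (snr_db / 10.0)))
    noisy = reverberant + gain * noise
    # Pad the clean target to the same length for frame-aligned training.
    target = np.pad(clean, (0, len(noisy) - len(clean)))
    return noisy, target
```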

An audio signal is a form of time series, rather than consisting of independent samples. In order for a neural network to estimate the reverberation and noise free sample value or STFT frame, the context of each input sample or STFT frame should be encoded in the NN input. In Han et al. (2015), a feed-forward DNN for estimating the log spectral magnitudes $\log(|S(n, f)|)$ of clean speech has been tried out. The log spectral magnitude frames $\log(|X(n+m)|)$, $m = -5 \ldots +5$, i.e. five frames anterior and posterior to frame $n$, are utilized as the input to the DNN to encode the context around $S(n)$.
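Such a context window can be assembled by stacking the neighbouring frames into one input vector. Below is a minimal NumPy sketch, assuming a frames-by-bins matrix of log magnitudes; the edge padding choice is this example's assumption, not a detail from Han et al. (2015).

```python
import numpy as np

def stack_context(log_mag, radius=5):
    """Stack +-radius neighbouring STFT frames into each input vector.

    log_mag: array of shape (n_frames, n_bins) with log spectral magnitudes
    returns: array of shape (n_frames, (2*radius + 1) * n_bins)
    """
    # Repeat the edge frames so every frame has a full context window.
    padded = np.pad(log_mag, ((radius, radius), (0, 0)), mode="edge")
    windows = [padded[i:i + len(log_mag)] for i in range(2 * radius + 1)]
    # Concatenate the 2*radius+1 shifted copies along the feature axis.
    return np.concatenate(windows, axis=1)
```

For the $n$-th output row the vector then contains frames $n-5 \ldots n+5$, i.e. the context encoded in the DNN input.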

In Weninger et al. (2014) a recurrent neural network (RNN), specifically a long short-term memory network (LSTM) (Hochreiter and Schmidhuber (1997)), has been utilized for reverberant feature enhancement for ASR. A recurrent-type neural network implicitly holds information about the past samples, and thus single-frame features are used as the LSTM input. However, it is not obvious how the dereverberated audio would be extracted based on the enhanced Mel-spectral features.

3.4 Results in blind dereverberation of music

The work of Publication III pursues enhancing music recordings subject to reverberation and dynamic range compression (DRC). We produced clean music from MIDI representations and then applied reverberation and dynamic range compression to the signal to synthesize the needed data. For reverberation suppression, we use no external knowledge about the acoustic conditions of the recording environment. Thus the solution of choice is a blind dereverberation method, namely the method proposed in Furuya and Kataoka (2007) for speech dereverberation, which is applied to music signals in our work.

The aim was to test whether this method performs well with music material distorted by DRC processing.

We estimate the reverberation within the STFT signal representation using LP analysis with the signal model

$$|X(n, f)| = |S(n, f)| + |R(n, f)| \approx |S(n, f)| + \sum_{t=1}^{L} a_f(t) \cdot |X(n-t, f)|, \qquad (3.17)$$

where $|X(n, f)|$, $|S(n, f)|$ and $|R(n, f)|$ denote respectively the spectral magnitudes of the reverberant signal, the clean signal and the reverberation in frequency band $f$ in signal frame $n$, and $L$ is the length of the LP filter with frequency band specific coefficients $a_f(t)$, $t = 1 \ldots L$. The frequency band specific model parameters $\mathbf{a}_f = [a_f(1), a_f(2), \ldots, a_f(L)]'$ were estimated based on all the magnitude spectra $|X(n)|$ of the recording at hand. The standard least squares solution

$$\mathbf{a}_f = (\mathbf{V}_f' \mathbf{V}_f)^{-1} \mathbf{V}_f' \, \mathbf{v}_f(L+1), \quad \text{where} \qquad (3.18)$$
$$\mathbf{V}_f = [\mathbf{v}_f(L),\ \mathbf{v}_f(L-1),\ \ldots,\ \mathbf{v}_f(1)] \quad \text{and}$$
$$\mathbf{v}_f(n) = [\,|X(n, f)|,\ |X(n+1, f)|,\ \ldots,\ |X(n+N-L-1, f)|\,]',$$

was used, where $N$ is the number of audio frames in the recording. The dereverberated STFT magnitude spectrum was then obtained as

$$|\hat{S}(n, f)| = |X(n, f)| - \beta_f \cdot \sum_{t=1}^{L} a_f(t) \cdot |X(n-t, f)|, \qquad (3.19)$$

where the frequency dependent weights $\beta_f$ are used to limit the amount of dereverberation.

The dereverberated time domain signal was obtained by utilizing the phase of $X(n, f)$ as $\hat{S}(n, f) = |\hat{S}(n, f)| \cdot X(n, f)/|X(n, f)|$, performing an inverse fast Fourier transform (IFFT) for all $\hat{S}(n)$, $n = 1 \ldots N$, and combining consecutive audio frames by windowed overlap-add processing.
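The whole procedure of (3.17)-(3.19) fits in a short sketch. Below is a minimal Python/NumPy illustration, assuming the STFT and its inverse are provided elsewhere; flooring the subtracted magnitudes at zero is this sketch's choice rather than a detail stated above.

```python
import numpy as np

def dereverberate_magnitudes(X, L=1, beta=0.25):
    """Magnitude-domain LP dereverberation per (3.17)-(3.19).

    X    : complex STFT, shape (n_frames, n_bins); assumes n_frames > L
    L    : LP model length (L = 1 often sufficed in Publication III)
    beta : constant subtraction weight (0.2-0.3 worked best on average)
    """
    mag = np.abs(X)
    n_frames, n_bins = mag.shape
    S_hat = mag.copy()
    for f in range(n_bins):
        x = mag[:, f]
        # Columns v_f(L), ..., v_f(1) of V_f: past magnitudes as predictors.
        V = np.column_stack([x[L - d:n_frames - d] for d in range(1, L + 1)])
        v = x[L:]                                   # v_f(L+1), the targets
        a, *_ = np.linalg.lstsq(V, v, rcond=None)   # least squares (3.18)
        # Spectral subtraction of the predicted reverberation (3.19),
        # floored at zero to keep magnitudes valid (this sketch's choice).
        S_hat[L:, f] = np.maximum(v - beta * (V @ a), 0.0)
    # Re-use the reverberant phase for resynthesis.
    return S_hat * np.exp(1j * np.angle(X))
```

With audio frames longer than 50 ms and $L = 1$, the per-band model reduces to a single coefficient $a_f(1)$, which matches the best performing configuration reported below.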

The numerical values of the audio frame length used for the frame spectral representations $X(n)$, the LP model length $L$ and the weighting function $\beta_f$ were set using validation data.

Audio frame lengths from 20 ms to 160 ms were tested with different linear prediction lengths $L$. According to the improvements in signal-to-distortion ratio (SDR) within the validation data, the longer the audio frame, the better the dereverberation quality appeared. In all cases, LP model lengths larger than $L = 3$ did not give improvements.

Often, the model length $L = 1$ gave the best results. This is reasonable if we compare the used signal model to the reverberation estimation according to (3.16). The model (3.16) assumes that all the information of the reverberation is present in the frame located around 50 ms, the limiting time between early reflections and late reverberation, after the frame under consideration. Thus for audio frame lengths longer than 50 ms, an LP model length of $L = 1$ should be enough. Our experiments confirmed that this is true.

Our tests with different weighting functions revealed that the weights $\beta_f$ for the low frequency bins dominate the performance. Thus we finally resorted to using a constant weighting $\beta_f = \beta$. The value of $\beta$ that gave the best results on average was between 0.2 and 0.3, depending on whether the distortion by dynamic range compression was present or not. A slightly larger value of $\beta$ seemed to be best for dynamically compressed signals. Using the best performing system parameters, the signal-to-distortion ratio of the reverberant signals was improved from 6.1 dB to 6.4 dB, and the SDR of the signals also suffering DRC distortion improved from 5.2 dB to 5.6 dB.

Thus in Publication III we showed that this dereverberation framework performs well with music signals. Another finding was that dynamic range compression of the audio does not deteriorate the dereverberation performance. On the contrary, the dereverberation performance with dynamic range compressed audio appeared even better than with the non-compressed audio.

4 Classifying independent samples

In artificial intelligence (AI), computational methods are used to identify different phenomena from a signal, e.g. an image or a sound clip. After the preliminary analysis of the signal, described in Chapter 2, an AI system makes decisions according to internal models and action rules, which are utilized for providing the solution. The basic building blocks of an AI system, which are responsible for interpreting the signal, are called classifiers.

They provide phenomenon-level information for making the decisions on actions. Thus classification, or categorization, is at the core of almost every intelligent application.

To make the problem of automatic categorization, i.e. classification, of a signal tractable, usually a small set of possible categories, called classes, is predefined, instead of evaluating among all the feasible taxonomies. The signal is analyzed to solve whether it represents one of the predefined categories of phenomena. In the most fundamental form of classification, binary classification, the sample is assigned to one of only two categories. Binary classification may also be considered as detection, where the categories are simply the ’target’ class, which represents the inquired phenomenon, and the ’non-target’ class, which is associated with everything else. Via the utilization of multiple binary classifiers or detectors, these methods may also be utilized to produce classification decisions among more than two categories. In that case multiple binary classifiers or detectors, at least one for spotting each class, are utilized, and their outputs are combined to perform the final classification, as sketched below.
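As a minimal sketch of such a combination, assuming each per-class detector emits a ’target’ score, one simple rule picks the most confident detector; the score matrix and the optional rejection threshold are assumptions of this example (Publication IV proposes a more elaborate combination function, discussed in Section 4.4).

```python
import numpy as np

def one_vs_rest_classify(scores, threshold=None):
    """Combine per-class binary detector scores into a single decision.

    scores   : (n_samples, n_classes) array; scores[i, c] is detector c's
               'target' score for sample i.
    threshold: optional detection threshold; if no detector exceeds it,
               the sample is rejected (labelled -1).
    """
    winners = np.argmax(scores, axis=1)          # most confident detector
    if threshold is not None:
        best = scores[np.arange(len(scores)), winners]
        winners = np.where(best >= threshold, winners, -1)
    return winners
```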

In this chapter I first discuss how the performance of a classification framework is evaluated, in Section 4.1. Then, in Section 4.2, I discuss different methodologies for computational classification, and in Section 4.3 I present different ways to combine different methodologies into a single classification framework. In Section 4.4 I present a binary classifier combination function named BOA, which is proposed in Publication IV and elaborated in Publication VI.

4.1 Classification result evaluation

To evaluate the classification result of a computational framework, the true categorical class memberships of the signal samples must be known. The success is evaluated against the knowledge about this true class membership. The class label given by a classifier can only be correct or wrong. To get statistical information about the performance of a classifier, a set of signal samples must be classified with the framework, and statistics about the correct and incorrect classifications must be collected. The fundamental statistics, which are also used to define the other evaluation metrics for binary classification, are the counts of

true positives (tp), i.e. test samples correctly classified as ’target’


true negatives (tn), i.e. test samples correctly classified as ’non-target’

false positives (fp), i.e. test samples incorrectly classified as ’target’

false negatives (fn), i.e. test samples incorrectly classified as ’non-target’

In the case of multi-category classification, the counts of true positives ($tp_c$), false positives ($fp_c$) and false negatives ($fn_c$) are computed separately for each ’target’ class $c$, considering all the other classes as ’non-target’. Statistics about the decisions and errors between the different classes in terms of $tp_c$ and $fp_c$ for each class are often presented in the form of a confusion matrix, shown in Figure 4.1.
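For concreteness, such a matrix can be tallied directly from the true and predicted label sequences; the following is a minimal NumPy sketch with hypothetical label arrays.

```python
import numpy as np

def confusion_matrix(y_true, y_pred, n_classes):
    """Tally a confusion matrix: rows are true classes, columns predictions."""
    cm = np.zeros((n_classes, n_classes), dtype=int)
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1
    return cm

# tp_c sits on the diagonal; row sums minus the diagonal give fn_c,
# and column sums minus the diagonal give fp_c:
# cm = confusion_matrix(y_true, y_pred, 5)
# tp = np.diag(cm); fn = cm.sum(axis=1) - tp; fp = cm.sum(axis=0) - tp
```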

The simplest statistic often reported as the value for the general success of a classification framework is the accuracy

$$ACC = \frac{tp + tn}{N} = \frac{1}{N} \sum_c tp_c, \qquad (4.1)$$

where $N$ is the total number of classified samples. Closely related to accuracy, often reported statistics for evaluating binary classification are the true positive rate ($tpr$) and the true negative rate ($tnr$)

$$tpr = \frac{tp}{tp + fn}, \qquad tnr = \frac{tn}{tn + fp}. \qquad (4.2)$$

The terms sensitivity and specificity are also used alternatively for the $tpr$ and $tnr$, respectively. In multi-category classification frameworks the $tpr_c$ may be computed separately for each class $c$. In binary classification the $tpr$ is also called recall. The recall $R$ is usually reported in conjunction with the precision $P$

$$P = \frac{tp}{tp + fp}. \qquad (4.3)$$

Precision is about how precisely the framework is able to distinguish the class in question from the other class or classes.

A measure combining the information from the precision $P$ and the recall $R$ is the $F_\beta$-score

$$F_\beta = \frac{(1 + \beta^2) P R}{\beta^2 P + R} = \frac{(1 + \beta^2)\, tp}{(1 + \beta^2)\, tp + \beta^2 fn + fp}. \qquad (4.4)$$

$F_\beta$ may be adjusted with $\beta^2$ to take into account the true class distribution and the cost of incorrect classification. Using a value $\beta^2 > 1$ penalizes more for not detecting samples of the ’target’ class. Thus a big $\beta^2$ is justifiable if the cost of a false negative is high or the ’target’ class forms a minority of the test samples. Again, $\beta^2 < 1$ is a rational choice if the ’target’ class is prevalent, or if the cost of a false positive is high. For an even class distribution or equal error costs, $F_1 = 2PR/(P + R) = 2tp/(2tp + fn + fp)$ is used. In addition to $F_1$, which is the harmonic mean of $P$ and $R$, the geometric mean $G = \sqrt{PR}$ of $P$ and $R$ is sometimes used as an evaluation metric called the G-score.
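The metrics of (4.2)-(4.4) follow directly from the four fundamental counts. A small sketch for reference, assuming non-degenerate counts (no zero denominators):

```python
def binary_metrics(tp, tn, fp, fn, beta2=1.0):
    """Precision, recall, specificity and F-beta from the four counts.

    beta2 is beta squared; beta2 > 1 weights recall (missed targets)
    more heavily, beta2 < 1 weights precision more heavily.
    """
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)          # tpr / sensitivity
    tnr = tn / (tn + fp)             # specificity
    f_beta = (1 + beta2) * tp / ((1 + beta2) * tp + beta2 * fn + fp)
    return precision, recall, tnr, f_beta
```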

Often a classification framework is parametrized such that the predisposition of the system to assign a certain label can be tuned. In binary classification such a parametrization amounts to changing the prior probability of the framework to assign one label over the other. In this kind of situation the equal error rate (EER) is commonly reported as a measure of the overall performance of a system. Using the false positive rate and the false negative rate

$$fpr = \frac{fp}{fp + tn}, \qquad fnr = \frac{fn}{fn + tp}, \qquad (4.5)$$


Figure 4.1: Confusion matrix of a classification result of data from 5 classes. The numbers of samples from classes 1, 2, 3, 4 and 5 are 100, 50, 80, 30 and 60, respectively. The $tp$-counts of correctly classified samples can be seen in bold in the diagonal bins of the matrix. The bold numbers in the off-diagonal bins denote the $fn$-counts of incorrect classifications. The percentages of the respective $tp$- and $fp$-counts with respect to all the 320 samples are shown below each count. The true positive rate ($tpr$) (or recall) and the false negative rate ($fnr$) for each class are shown in the bottom bins as percentages. The precision $P$ and $1 - P$ of the classification result with respect to each class are given in the rightmost bins.

the system is tuned to function at an operating point where $fpr = fnr$, which then becomes the value of the EER.

With different parameter settings of a binary classification framework, curves of different pairs of performance scores, which express different aspects of the performance, may be outlined. Commonly used curves of classifier performance, depicted in Figure 4.2, are the precision-recall (P-R) curve and the receiver operating characteristics (ROC) curve. The P-R curve plots the scores of precision $P$ versus the scores of recall $R$ at each possible operating point, that is, parametrization, of the system. The ROC curve plots the true positive rate ($tpr$) against the false positive rate ($fpr$) at the different operating points

Figure 4.2: Precision vs. recall curves (left) and ROC curves (right) with classifier $C$, using data $X_B$, which has an even, 1:1, class distribution, and data $X_U$, which has a class distribution of 1:5. It is to be noted how the precision of the classification seen in the P-R curve is affected by the class distribution, due to the increased count of false positives ($fp$) with respect to the count of true positives ($tp$).

of the framework. To evaluate a classification framework in terms of its whole space of operating points, an area under curve (AUC) measure is commonly used. It may be defined in terms of the P-R or the ROC curve, respectively, as

$$AUC_{\text{P-R}} = \int_0^1 P \; dR, \qquad \text{or} \qquad AUC_{\text{ROC}} = \int_0^1 tpr \; d(fpr). \qquad (4.6)$$
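In practice the curves, the EER and the AUC of (4.6) are computed from a finite set of scored test samples. The following sketch sweeps the decision threshold over hypothetical detector scores and approximates the ROC integral with the trapezoidal rule; the score and label arrays are assumptions of the example.

```python
import numpy as np

def roc_auc_eer(scores, labels):
    """ROC curve, AUC_ROC and EER from detector scores.

    scores: (n,) array of 'target' scores, higher means more target-like
    labels: (n,) array of 0/1 ground truth (1 = 'target');
            assumes both classes are present and scores are untied.
    """
    order = np.argsort(-scores)                  # sweep threshold high -> low
    labels = np.asarray(labels)[order]
    tp = np.cumsum(labels)                       # targets accepted so far
    fp = np.cumsum(1 - labels)                   # non-targets accepted so far
    tpr = np.r_[0.0, tp / tp[-1]]                # curve starts at the origin
    fpr = np.r_[0.0, fp / fp[-1]]
    auc = np.trapz(tpr, fpr)                     # trapezoidal estimate of (4.6)
    fnr = 1.0 - tpr
    eer_idx = np.argmin(np.abs(fpr - fnr))       # operating point fpr ~ fnr
    eer = (fpr[eer_idx] + fnr[eer_idx]) / 2.0
    return fpr, tpr, auc, eer
```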