
constraints with λ1 ≥ 0 and λ2 ≥ 0.

2.6 Features using neural networks

Recently, deep neural networks (DNNs) (LeCun et al. (2015)) have been used for nearly every machine learning task, feature extraction being no exception. Once the problems in learning DNN weights had been solved, DNNs turned out to be very efficient non-linear feature extraction functions.

A deep belief net (DBN) is a type of DNN used for learning feature extraction functions, as proposed by Hinton and Salakhutdinov (2006). An efficient algorithm for training DBN weights is presented by Hinton et al. (2006). A DBN is built by stacking multiple one-hidden-layer restricted Boltzmann machines (RBMs) (Salakhutdinov et al. (2007)). Each RBM layer is trained generatively as a stochastic network, where an energy function defined by the network weights is associated with each training data vector. The RBM weights are set such that the energy function is minimized over the training data, and thus the likelihood of the data is maximized. An efficient learning algorithm for obtaining RBM weights is the contrastive divergence (CD) algorithm developed by Hinton (2002). To train multiple RBMs for a DBN, the outputs of the previous-layer RBM, used in a deterministic rather than stochastic way, serve as training data for the next-layer RBM. The trained DBN feature extractor is eventually utilized deterministically as a feed-forward neural net. It has been found that by adding new layers to the DBN, increasingly higher-level features can be obtained (Hinton (2014)).
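As an illustrative sketch of the ideas above, the following minimal NumPy example trains a single binary RBM layer with one-step contrastive divergence (CD-1) and then uses it deterministically as a feed-forward feature extractor. The dimensions, learning rate, and random toy data are hypothetical choices for illustration, not the setup of the cited works.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_step(v0, W, b_vis, b_hid, lr=0.1):
    """One CD-1 weight update for a binary RBM (toy version)."""
    # Positive phase: hidden probabilities and a stochastic sample given data.
    p_h0 = sigmoid(v0 @ W + b_hid)
    h0 = (rng.random(p_h0.shape) < p_h0).astype(float)
    # Negative phase: one Gibbs step back to a reconstruction.
    p_v1 = sigmoid(h0 @ W.T + b_vis)
    p_h1 = sigmoid(p_v1 @ W + b_hid)
    # Gradient approximation: data correlations minus reconstruction correlations.
    W += lr * (np.outer(v0, p_h0) - np.outer(p_v1, p_h1))
    b_vis += lr * (v0 - p_v1)
    b_hid += lr * (p_h0 - p_h1)
    return W, b_vis, b_hid

n_vis, n_hid = 6, 3
W = 0.01 * rng.standard_normal((n_vis, n_hid))
b_vis = np.zeros(n_vis)
b_hid = np.zeros(n_hid)
data = (rng.random((20, n_vis)) < 0.5).astype(float)

for epoch in range(50):
    for v in data:
        W, b_vis, b_hid = cd1_step(v, W, b_vis, b_hid)

# Deterministic feed-forward use of the trained layer: the hidden
# probabilities serve as the extracted features (and would be the
# training data for the next RBM in a stacked DBN).
features = sigmoid(data @ W + b_hid)
```

In a full DBN, the `features` matrix would simply become the input data for training the next RBM layer in the stack.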

An autoencoder is a special type of DNN used to obtain so-called bottleneck features (Goodfellow et al. (2016)). Autoencoders are effectively feed-forward nets, which may be trained with any back-propagation type algorithm. An autoencoder consists of multiple feed-forward layers of neurons, the middle layer being as small as the new feature vector is desired to be. This small layer forms a bottleneck for the information. With the small bottleneck layer, an autoencoder is thus forced to learn a low-dimensional representation of the input, the short representation discarding as little as possible of the information present in the original vector. The vectors obtained from the bottleneck layer of a trained autoencoder thus provide the new features.
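A minimal sketch of the bottleneck principle, under toy assumptions: the example below trains a two-layer linear autoencoder with plain gradient descent on data that actually lies on a 2-dimensional subspace, so a 2-unit bottleneck can represent it. Real autoencoders use non-linear layers and proper back-propagation frameworks; the data, sizes, and learning rate here are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy data: 50 vectors of dimension 8 generated from 2 latent factors,
# so a 2-unit bottleneck suffices to reconstruct them.
latent = rng.standard_normal((50, 2))
basis = rng.standard_normal((2, 8))
X = latent @ basis

d_in, d_bottleneck = 8, 2
W_enc = 0.1 * rng.standard_normal((d_in, d_bottleneck))
W_dec = 0.1 * rng.standard_normal((d_bottleneck, d_in))

lr = 0.02
for step in range(3000):
    Z = X @ W_enc          # bottleneck activations (the new features)
    X_hat = Z @ W_dec      # reconstruction from the short code
    err = X_hat - X
    # Gradient descent on the mean squared reconstruction error.
    W_dec -= lr * Z.T @ err / len(X)
    W_enc -= lr * X.T @ (err @ W_dec.T) / len(X)

features = X @ W_enc  # 2-dimensional bottleneck features
mse = float(np.mean((X - features @ W_dec) ** 2))
```

The reconstruction error drops well below the energy of the input, confirming that the bottleneck retains most of the information in the original vectors.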

Convolutive neural networks (CNNs) are feed-forward networks which are trained discriminatively for classification from the beginning, but whose structure is specifically designed from the feature extraction point of view. Layers of a CNN are restricted in comparison to a general feed-forward network structure such that not every input of each layer is connected to every neuron of the layer. It is up to the algorithm developer to decide which inputs are connected to which neurons. A CNN consists of multiple pairs of layers providing higher and higher level representations of the input in a position-invariant way. The neurons of the first layer of each pair compose the basis for the convolutive property of the net, and the layer is thus called a convolutive layer.

The convolutive layer contains many similar neurons sharing the same connection weights but connected to different sets of the layer inputs. The second layer of each pair, called a pooling layer, has restricted connectivity as well. The neurons of this layer are designed to provide the net with the property of position invariance. Each neuron is connected to those convolutive layer outputs which share the same weights, i.e. implement a certain type of filter. Usually a max-pooling strategy is used, such that the neuron outputs the largest value among its inputs. CNNs have been utilized in computer vision with huge popularity and success at least since the publication of Krizhevsky et al. (2012). Lately they have also been successfully tuned for audio analysis, e.g. by Zhang et al. (2017).
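The max-pooling operation described above can be sketched in a few lines of NumPy. This is a generic non-overlapping 2-D pooling over a toy feature map, not the implementation of any particular CNN library.

```python
import numpy as np

def max_pool_2d(feature_map, size=2):
    """Non-overlapping max pooling over a 2-D feature map.

    Each output neuron returns the largest activation within its
    size-by-size block of convolutive layer outputs.
    """
    h, w = feature_map.shape
    h2, w2 = h // size, w // size
    trimmed = feature_map[:h2 * size, :w2 * size]
    # Reshape into blocks and take the maximum within each block.
    blocks = trimmed.reshape(h2, size, w2, size)
    return blocks.max(axis=(1, 3))

# Toy 4x4 activation map from a convolutive layer.
fmap = np.array([[1, 3, 2, 0],
                 [4, 2, 1, 5],
                 [0, 6, 1, 2],
                 [3, 1, 0, 4]], dtype=float)

pooled = max_pool_2d(fmap)
# Each 2x2 block collapses to its maximum:
# [[4, 5],
#  [6, 4]]
```

Shifting an activation by one position within a pooling block leaves the pooled output unchanged, which is exactly the position invariance the pooling layer is designed to provide.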

3 Dereverberation

Dereverberation is one of the signal processing tasks dealing with audio signals. Specifically, dereverberation is an audio signal enhancement task to reduce the detrimental effect of reverberation within a sound. In the following, characteristics and measures of reverberation are discussed in Section 3.1. The evaluation metrics used for dereverberation tasks are presented in Section 3.2, and an overview of dereverberation methods proposed in the literature is given in Section 3.3. My own work in blind dereverberation of music is explained in Section 3.4.

3.1 About reverberation

The term reverberation denotes the sound energy carried by the sound waves of a sound source that reach the listener after reflecting from surrounding obstacles and objects.

In many environments, reverberation is considered to improve the sound quality. Specifically, concert halls are designed to produce optimally pleasant reverberation for the performed music. Pleasant reverberation is described in Beranek (2012) with terms like acoustical fullness, warmth, softness, definition/clarity, liveness, intimacy/presence, spaciousness, and timbre/tone color. A space with no or very little reverberation is called dry or dead, and people generally consider overly dry acoustics unpleasant. However, too much reverberation is considered to deteriorate the sound. With excess reverberation, sounds smudge, i.e. the clarity of the sound degenerates. This causes speech intelligibility to reduce and music to become more like broadband noise.

Properties of reverberation in a space are determined by the acoustical characteristics of the materials of all the barriers and objects in the space (Kuttruff (2016)). A hard surface reflects most of the sound energy hitting it. On the contrary, a material with a porous surface and non-rigid structure absorbs a good amount of the sound energy which encounters it. Some materials, which are fluffy enough, are acoustically transparent and have no effect on sound propagation. In addition to the overall reflective/absorbing capability, every material treats different wave frequencies differently. Most materials attenuate high-frequency waves more than low-frequency waves.

All the above-mentioned aspects suggest that the behavior of sound waves in a bounded space is very complex. The soundscape also depends greatly on the positions of the sound sources. Within the soundscape, the perceived sound depends on the position of the listener.

For specified positions of a sound source and a listener, an acoustic impulse response (AIR), also called a room impulse response (RIR), h_SL can be measured (Kuttruff (2016)).

AIR captures acoustical properties of the sound path from the sound source to the position of the listener in the space. The sound wave induced by a sound source is heard



Figure 3.1: Acoustic impulse response of a lecture room from the Aachen Impulse Response Database (Jeub et al. (2009)). An AIR may be divided into the direct sound, the early reflections during the following 50-80 ms, and the late reverberation after that.

by the listener as a reverberant sound, which builds up as a convolution of the original sound and the AIR as

$x(t) = [s \circledast h_{SL}](t) = \sum_{\tau=0}^{\infty} s(t-\tau)\, h_{SL}(\tau)$, (3.1)

where s denotes the sound source signal, $\circledast$ is the mathematical convolution operator, and x(t) denotes the heard reverberant sound. Figure 3.1 shows an example of a room impulse response. The first peak of the AIR corresponds to the direct sound path from the source to the listener. The following distinguishable peaks up to 50-80 ms correspond to early reflections, i.e. the sound paths with just a few reflections on the way from the source to the listener. The rest of the AIR corresponds to late reverberation, which usually forms a diffuse sound field, where the direction of sound is indistinguishable.
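The convolution in Eq. (3.1) can be demonstrated with a short NumPy example. The signals here are hypothetical toy sequences: a dry source with two impulses and an AIR consisting of a direct path plus two attenuated reflections.

```python
import numpy as np

# Toy dry source signal s(t) and toy acoustic impulse response h_SL(t):
# a unit direct path followed by reflections of gains 0.3 and 0.1.
s = np.array([1.0, 0.0, 0.5, 0.0, 0.0])
h_SL = np.array([1.0, 0.0, 0.3, 0.1])

# Eq. (3.1): the reverberant sound is the convolution of s and h_SL.
x = np.convolve(s, h_SL)
# Each tap of h_SL adds a delayed, attenuated copy of the source,
# so every impulse in s is smeared over the length of the AIR.
```

The output has length len(s) + len(h_SL) - 1, and every reflection in the AIR contributes a scaled, delayed echo of each source impulse.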

Since the exact AIR depends on the position of the sound source as well as the position of the listener, some more general measures of reverberation are often used. Gross measures of reverberation are e.g. the reverberation time (RT) and the early decay time (EDT). Both of them are defined based on the instantaneous sound pressure level SPL = 20 log10(P_RMS / 20 µPa) dB, with P_RMS being the root mean square (RMS) sound pressure. Instantaneous SPL levels for RT and EDT may be defined via the AIR, or by measuring the response of the space to an abruptly ending sound source. The reverberation time is denoted as RT60 or RT30 depending on the manner of defining the time needed for the SPL to decay by 60 dB: RT60 measures the full 60 dB decay, whereas RT30 extrapolates it from a 30 dB decay range. These SPL decay times may be defined for a broadband signal as well as separately for different frequency bands. EDT is defined as six times the time needed for the SPL to decay by 10 dB.
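A common way to obtain the decay curve from an AIR is Schroeder backward integration of the squared impulse response. The sketch below estimates RT60 and EDT from a synthetic exponentially decaying AIR whose energy decays 60 dB in exactly 0.5 s; the sample rate and decay constant are hypothetical toy values.

```python
import numpy as np

def decay_time(h, fs, drop_db):
    """Time for the Schroeder decay curve of AIR h to fall by drop_db dB."""
    # Schroeder backward integration: remaining energy from each instant on.
    edc = np.cumsum(h[::-1] ** 2)[::-1]
    edc_db = 10.0 * np.log10(edc / edc[0])
    # First sample at which the decay curve has dropped by drop_db dB.
    below = np.flatnonzero(edc_db <= -drop_db)
    return below[0] / fs

fs = 1000                      # toy sample rate (Hz)
t = np.arange(fs) / fs         # 1 second of time axis
# Synthetic AIR: amplitude decays so that energy falls 60 dB at t = 0.5 s.
h = 10.0 ** (-3.0 * t / 0.5)

rt60 = decay_time(h, fs, 60.0)       # time for a full 60 dB decay
edt = 6.0 * decay_time(h, fs, 10.0)  # early decay time: 6x the 10 dB decay
```

For this ideal exponential decay, RT60 and EDT agree (both about 0.5 s); in real rooms the early part of the decay often differs from the late part, which is exactly why EDT is reported separately.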

A highly important property of sound perceived by a listener, and one heavily affected by reverberation, is the clarity of the sound. The clarity of a sound field is often measured as the early-to-reverberant ratio (ERR), i.e. the ratio between the direct sound and early reflections part of the sound and the late reverberation part of the sound. The definition of clarity as ERR corresponds to the definition of the signal-to-noise ratio (SNR), considering the direct sound and early reflections before the mixing time, T_early = 50...80 ms, as the good quality signal and the late reverberation after T_early as noise. Sound clarity as ERR is given by

$\mathrm{ERR}_\tau = 10 \log_{10}\!\left(\frac{E_{0 \ldots T_{early}}}{E_{T_{early} \ldots \infty}}\right)$ dB with $E_{a \ldots b} = \sum_{t=a}^{b} h_{SL}^2(t)$, (3.2)

where E_{0...T_early} is the energy of the direct sound and the early reflections up to time T_early in the AIR, and E_{T_early...∞} is the energy of the late reverberation after that. Clarity is a commonly used measure of reverberation conditions, and an estimated ERR is used within some dereverberation algorithms. For speech, the clarity C50 with T_early = 50 ms is used as a standard measure. For music, C80 with T_early = 80 ms is commonly used.
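Eq. (3.2) amounts to splitting the AIR energy at T_early and taking the ratio in decibels. The toy example below computes C50 and C80 for the same synthetic exponentially decaying AIR used above; the sample rate and decay constant are hypothetical.

```python
import numpy as np

def clarity_db(h, fs, t_early):
    """Early-to-reverberant ratio of Eq. (3.2) for a given mixing time."""
    n_early = int(round(t_early * fs))
    e_early = np.sum(h[:n_early] ** 2)  # direct sound + early reflections
    e_late = np.sum(h[n_early:] ** 2)   # late reverberation
    return 10.0 * np.log10(e_early / e_late)

fs = 1000                       # toy sample rate (Hz)
t = np.arange(fs) / fs
# Synthetic AIR whose energy decays 60 dB in 0.5 s.
h = 10.0 ** (-3.0 * t / 0.5)

c50 = clarity_db(h, fs, 0.050)  # standard clarity measure for speech
c80 = clarity_db(h, fs, 0.080)  # standard clarity measure for music
```

Since the longer 80 ms window counts more of the AIR energy as "early", C80 is always larger than C50 for the same impulse response.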