
3.3 Dereverberation methods within literature

Research on algorithms for removing reverberation from audio signals concentrates mostly on speech (Naylor and Gaubitch (2010)), with the aim of improving its intelligibility.

This is well justified, as the most essential applications facilitate communication via mobile devices and serve the hearing impaired for an improved quality of life. However, there is also research focusing on music dereverberation, where the aesthetic viewpoint is more important. Dereverberation is also often used as a pre-processing step for audio content analysis algorithms (Watanabe et al. (2018)).

A variety of methods is available for dereverberation, depending on what is assumed to be known about the recording conditions, e.g. the number of microphones, the room impulse responses, the reverberation time, the noise statistics etc. The most efficient methods utilize multiple microphones, but many single-channel dereverberation algorithms also exist. The majority of methods operate in the time-frequency domain, as time-frequency processing provides efficient means for robustness against changes in the acoustic channel.

In the following, some popular processing principles utilized for audio dereverberation are presented. The methods are dealt with under six topics. First, it is discussed how the human vocalization model is utilized for dereverberation. Then the variety of ways in which linear predictive (LP) analysis is utilized within dereverberation algorithms is explained. The third topic concerns the methodology used for rigorous reverberation cancellation via AIR estimation and deconvolution filtering. The fourth topic discusses multi-microphone spatial filtering approaches to dereverberation. Then the introduction turns to statistically inspired dereverberation methods, which use weight-based suppression of DTFT magnitudes. Finally, dereverberation implementations by neural networks are addressed.

In the discussion below, the reverberant signal x(t) is represented in terms of the parts s(t), the clean source signal, r(t), the reverberation, and n(t), additional noise, as

x(t) = s(t) + r(t) + n(t). \qquad (3.4)

In the case of STFT-domain processing, the frame spectra X(n, f), S(n, f), R(n, f) and N(n, f) are assumed to sum up similarly as

X(n, f) = S(n, f) + R(n, f) + N(n, f). \qquad (3.5)

Speech dereverberation utilizing source-filter vocal-tract model

The first implementations of speech dereverberation in the 1970s (Allen (1974)) were based on the source-filter model of speech production. The speech signal s(t) was estimated directly, without modeling r(t) and n(t) at all. The source-filter model of speech, presented e.g. in Stevens (1998), contains an all-pole filter, which represents the vocal tract and gives the characteristics of each phone and of the human voice in general. The source signal of the model represents the glottal sound, which is assumed to be a time series of pulses in the case of voiced sounds, or white noise in the case of fricative phones.

Dereverberation of reverberant speech based on the source-filter model relies on the observation that the all-pole vocal tract filter coefficients are not much affected by reverberation, while most of the disturbance resides in the estimated glottal source signal (Gaubitch et al. (2006)). The vocal tract filter coefficients g(τ) are thus estimated from the reverberant signal x(t), using e.g. LP estimation, as

x(t) = \sum_{\tau=1}^{T} g(\tau)\, x(t-\tau) + e(t), \qquad (3.6)

where T is the prediction filter length and the error e(t) is to be minimized. The prediction error e(t) is then taken as the noisy glottal signal and, to obtain dereverberation, it is cleaned of noise. This approach, often called LP-residual processing, has been used e.g. by Gaubitch et al. (2003).
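As an illustration, the sketch below computes the prediction coefficients of (3.6) by ordinary least squares and extracts the LP residual. The function names, the solver choice and the filter order are assumptions of this sketch, not details from the cited works.

```python
# A minimal LP-residual sketch following (3.6); names and solver are
# illustrative, not taken from the cited works.
import numpy as np

def lp_coefficients(x, order):
    """Least squares estimate of the prediction coefficients g(tau) in (3.6)."""
    N = len(x)
    # Column tau-1 holds x(t - tau) for t = order .. N-1.
    A = np.column_stack([x[order - tau : N - tau] for tau in range(1, order + 1)])
    b = x[order:]
    g, *_ = np.linalg.lstsq(A, b, rcond=None)
    return g

def lp_residual(x, g):
    """Prediction error e(t) = x(t) - sum_tau g(tau) x(t - tau),
    taken as the noisy glottal source signal."""
    a = np.concatenate(([1.0], -g))       # prediction-error (FIR) filter
    return np.convolve(x, a)[: len(x)]
```

Dereverberation then amounts to denoising this residual and re-synthesizing the speech through the estimated all-pole filter.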

The characteristics of voiced vowels, particularly their harmonic structure following the fundamental frequency f0 of the glottal pulse signal, are utilized in the HERB (Harmonicity-based dEReverBeration) approach of Nakatani et al. (2007). The source-filter model of human speech production is utilized in conjunction with other processing principles presented below for speech dereverberation e.g. by Yoshioka et al. (2007) and Kinoshita et al. (2009).

3.3.1 Linear predictive analysis for reverberation estimation

The auto-regressive signal model is extensively utilized for estimating the reverberation part of the recorded signal (Naylor and Gaubitch (2010), Chapter 4). The prediction error is then considered as the clean source audio, which cannot be predicted based on previous observations. In the absence of noise, the signal model is thus, in the time domain and in the STFT domain respectively,

x(t) = s(t) + r(t) = s(t) + \sum_{\tau=d}^{T} g(\tau)\, x(t-\tau) \quad \text{and} \qquad (3.7)

X(n, f) = S(n, f) + R(n, f) = S(n, f) + \sum_{l=D}^{L} G^{*}(l, f)\, X(n-l, f), \qquad (3.8)

where ∗ denotes the complex conjugate, τ and l are the time and frame lag indices, T and L denote the lengths of the prediction filters, d and D denote the prediction delays, and g and G denote the linear prediction coefficients for x(t−τ) and X(n−l, f) in the time and DTFT domains, respectively. Dereverberation then actualizes as

\hat{s}(t) = x(t) - \hat{r}(t) \quad \text{or} \quad \hat{S}(n, f) = X(n, f) - \hat{R}(n, f). \qquad (3.9)

The parameters g or G of (3.7) and (3.8) are solved by minimizing the power of the estimation error e(t) = x(t) − r̂(t) or E(n, f) = X(n, f) − R̂(n, f), e.g. using LP analysis (Jackson (1989)).

The time-domain model is highly sensitive to changes in the AIR, but the DTFT-domain model has been shown to perform well already for single-channel signals, both for speech (Padaki et al. (2013)) and for music (Publication III).
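As an illustration of the single-channel STFT-domain model, a per-bin delayed linear prediction could be sketched as follows. The delay D and filter length L are illustrative values, and ordinary least squares stands in for whatever estimator a given method uses.

```python
# A minimal sketch of single-channel delayed linear prediction in the
# STFT domain, per (3.8)-(3.9); D and L are illustrative assumptions.
import numpy as np

def stft_lp_dereverb(X, D=2, L=10):
    """Estimate R(n,f) from delayed frames X(n-l,f), l = D..D+L-1, in each
    frequency bin, and subtract it as in (3.9)."""
    N, F = X.shape                     # frames x frequency bins
    n0 = D + L - 1                     # first frame with a full history
    S_hat = X.copy()
    for f in range(F):
        xf = X[:, f]
        A = np.column_stack([xf[n0 - l : N - l] for l in range(D, D + L)])
        b = xf[n0:]
        G, *_ = np.linalg.lstsq(A, b, rcond=None)  # prediction coefficients
        S_hat[n0:, f] = b - A @ G      # S_hat = X - R_hat
    return S_hat
```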

In the case of a multi-channel recording, the reverberation r_1 (R_1) in the first channel signal x_1 (X_1) is estimated using all the M channel signals x_m(t) (X_m(n, f)), m = 1...M, as

\hat{r}_1(t) = \sum_{m=2}^{M} \sum_{\tau=d}^{T} g_m(\tau)\, x_m(t-\tau) \quad \text{or} \quad \hat{R}_1(n, f) = \sum_{m=2}^{M} \sum_{l=D}^{L} G_m^{*}(l, f)\, X_m(n-l, f), \qquad (3.10)

where x_m or X_m denotes the signal in the m:th channel of the recording and g_m or G_m denote the corresponding regression coefficients. The parameters g_m(τ), τ = d...T, or G_m(l, f), l = D...L, for m = 1...M are, similarly to the above, solved by minimizing the estimation error e(t) = x_1(t) − r̂_1(t) or E(n, f) = X_1(n, f) − R̂_1(n, f). A solution minimizing the squared error e² (E²) is obtained using multi-channel linear prediction (MCLP) (Naylor and Gaubitch (2010), Chapter 5). The problem with the generally used least squares minimization of linear prediction analysis is that the error signal becomes white Gaussian. This assumption does not hold for most real-life audio source signals.

To address this, e.g. prewhitening of the signals (Triki and Slock (2005)) or increasing the prediction delay (Kinoshita et al. (2009)) have been proposed. The time-domain implementation of the MCLP method is also known as linear-predictive multi-input equalization (LIME), and the STFT-domain MCLP has also been referred to as the weighted prediction error (WPE) method. MCLP is extensively utilized in the STFT domain due to its robustness to changes in the AIR.
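A per-bin MCLP sketch follows. Note that, unlike (3.10) as printed, this sketch predicts from delayed frames of all channels including the first, a common MCLP variant; D, L and the names are illustrative assumptions.

```python
# A minimal MCLP sketch for one frequency bin, cf. (3.10); the use of all
# channels (including channel 1, delayed) is an assumption of this sketch.
import numpy as np

def mclp_bin(Xs, D=2, L=10):
    """Xs: list of the M per-channel frame sequences X_m(n, f) for one bin.
    Returns the prediction error E(n, f) = X_1 - R_hat_1, cf. (3.9)."""
    N = len(Xs[0])
    n0 = D + L - 1
    # Stack delayed frames X_m(n - l) for every channel m and lag l.
    A = np.column_stack(
        [xm[n0 - l : N - l] for xm in Xs for l in range(D, D + L)]
    )
    b = Xs[0][n0:]                               # X_1(n, f)
    g, *_ = np.linalg.lstsq(A, b, rcond=None)    # minimizes the squared error
    return b - A @ g
```

As the text notes, plain least squares drives the error toward white Gaussian; the prediction delay D is the usual guard against over-whitening the source.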

In addition to the better robustness, STFT-domain processing allows statistical models to be utilized for the signal, giving rise to a technique called variance-normalized delayed MCLP (NDLP), proposed in Nakatani et al. (2010). In NDLP, the complex-valued STFT coefficients S(n, f) of the source signal are modeled using a time-varying Gaussian (TVG) model. The usual TVG setup defines the complex Gaussian distributions of S(n, f) to be zero-mean, each with its own variance. The regressors G_m(l, f) of the MCLP and the variances σ²_{n,f} of the TVG model are then optimized in the maximum-likelihood (ML) sense with an expectation-maximization (EM) type alternating optimization algorithm. Due to the combination of the statistical TVG model with MCLP, NDLP is also referred to as Bayesian blind deconvolution.
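The alternating optimization can be sketched per bin as below, with A and b built as in the MCLP sketch above; the iteration count, flooring and regularization are illustrative assumptions of this sketch.

```python
# A minimal sketch of the EM-type alternation behind NDLP/WPE for one bin:
# update the TVG variances from the current source estimate, then re-solve
# a variance-weighted least squares problem for the regressors.
import numpy as np

def ndlp_bin(A, b, n_iter=5, eps=1e-8):
    s = b.copy()                                    # initial source estimate
    for _ in range(n_iter):
        w = 1.0 / np.maximum(np.abs(s) ** 2, eps)   # weights 1 / sigma^2(n,f)
        R = (A.conj().T * w) @ A                    # weighted normal equations
        p = (A.conj().T * w) @ b
        g = np.linalg.solve(R + eps * np.eye(R.shape[0]), p)
        s = b - A @ g                               # updated source estimate
    return s
```

The variance weighting is what distinguishes this from plain least squares: frames where the source is weak contribute more to the regression, so the error no longer has to be white.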

3.3.2 Blind system identification and deconvolution methods

Dereverberation algorithms called reverberation cancellation methods (Naylor and Gaubitch (2010)) utilize the audio channel characteristics, that is the AIR (or RIR), for the dereverberation solution. Usually the AIR h is unknown and must be estimated using some blind system identification (BSI) approach. The estimated AIR ĥ is then utilized for estimating the inverse (i.e. equalization, deconvolution) parameters g to cancel the effect of the AIR from the signal as

\hat{s}(t) = \sum_{\tau=0}^{T-1} g(\tau)\, x(t-\tau). \qquad (3.11)

These time-domain methods are mathematically and physically faithful to the true sound propagation laws encoded in (3.1), and thus they have the potential for very high fidelity dereverberation. Unfortunately, rigorous modeling is often accompanied by poor robustness against channel estimation errors and small changes in the AIR, which are serious issues when working with real-life signals.

For AIR estimation, a multi-channel recording is necessary. The AIRs may be estimated based on differences and similarities between the signals from different microphones. There are two mainstream BSI methods available for AIR estimation. One is based on minimizing the averaged cross-relation error over all pairs of the M microphone signals x_m, m = 1...M. The cross-relation error between microphone signals x_i and x_j is defined as e_{ij} = s ∗ h_i ∗ h_j − s ∗ h_j ∗ h_i = x_i ∗ h_j − x_j ∗ h_i, where ∗ denotes convolution and h_m denotes the AIR between the source and microphone m. All the M AIRs are then approximated by minimizing the combined error of the M(M−1)/2 cross-relation error equations e_{ij}, i = 1...M−1, j = i+1...M. The other mainstream time-domain BSI approach finds the AIRs from the null space {v | Rv = 0} of the multi-channel data correlation matrix R of size (MT) × (MT), which consists of M × M blocks of cross-correlation matrices ρ_{ij} between the channel signals x_i and x_j, each ρ_{ij} being of size T × T.
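The cross-relation identity is easy to state in code; the sketch below scores a pair of candidate AIRs against an observed microphone pair (names and usage are illustrative).

```python
# A minimal sketch of the cross-relation error used in BSI: in the
# noiseless case x_i * h_j == x_j * h_i for the true AIRs, so this energy
# measures how well candidate AIRs explain the observed pair.
import numpy as np

def cross_relation_error(x_i, x_j, h_i, h_j):
    """Energy of e_ij = x_i * h_j - x_j * h_i (convolutions)."""
    e = np.convolve(x_i, h_j) - np.convolve(x_j, h_i)
    return np.sum(np.abs(e) ** 2)

# BSI then searches for the M AIR estimates minimizing the sum of this
# error over all M(M-1)/2 microphone pairs (up to a common scaling).
```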

To estimate the equalization parameters g_m based on ĥ_m, more sophisticated methods than direct inversion are necessary. This is because the noise level in the estimated AIR parameters ĥ_m is generally high and the AIRs h_m are generally non-minimum phase (Habets and Naylor (2018)). For robust estimation of g_m, many solutions utilize the equalized impulse response

\mathrm{EIR}(t) = \sum_{m=1}^{M} \sum_{\tau=0}^{T-1} h_m(t-\tau)\, g_m(\tau), \qquad (3.12)

which ideally becomes an impulse function, possibly with a delay. The multiple-input/output inverse theorem (MINT) of Miyoshi and Kaneda (1988) provides a least squares solution forcing EIR(t) to become an impulse function. This solution only solves the problem of the AIR being non-minimum phase, but it is not robust against AIR estimation errors.
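A least squares MINT-style equalizer can be sketched by stacking per-channel convolution matrices and solving toward a (possibly delayed) unit impulse; the filter length Lg and the delay are illustrative assumptions of this sketch.

```python
# A minimal least squares sketch toward the MINT target of (3.12):
# find filters g_m so that sum_m h_m * g_m approximates a delayed impulse.
import numpy as np
from scipy.linalg import toeplitz

def convolution_matrix(h, Lg):
    """Toeplitz matrix C such that C @ g equals np.convolve(h, g)."""
    col = np.concatenate([h, np.zeros(Lg - 1)])
    row = np.zeros(Lg)
    row[0] = h[0]
    return toeplitz(col, row)

def mint_equalizers(h_list, Lg, delay=0):
    """Least squares filters g_m making EIR(t) a unit impulse at `delay`."""
    H = np.hstack([convolution_matrix(h, Lg) for h in h_list])
    d = np.zeros(H.shape[0])
    d[delay] = 1.0                    # target EIR: delayed unit impulse
    g, *_ = np.linalg.lstsq(H, d, rcond=None)
    return np.split(g, len(h_list))   # per-channel equalizers g_m
```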

Solutions for greater robustness against errors in ĥ implement different ways of relaxing the target of a perfect impulse EIR(t). The relaxation approaches include e.g. channel shortening in Zhang et al. (2010) and Lim et al. (2014), partial equalization in Kodrasi and Doclo (2012), and sparse optimization in Mertins et al. (2010).

3.3.3 Clean signal estimation using spatial filtering

Spatial filtering, i.e. beamforming, is dedicated to extracting a high quality signal from a certain geometrical direction by suppressing sounds and echoes coming from other directions. Spatial filtering requires a microphone array whose geometry is known. Also, the direction of arrival (DOA) of the source signal must be known or estimated for the spatial filter optimization. The minimum variance distortionless response (MVDR) design (Capon (1969)) provides a beamforming filter that steers the focus to a certain direction while minimizing the sound energy from other directions. An MVDR filter g_MVDR (or G_MVDR) may be used for dereverberation as

\hat{s}(t) = \sum_{m=1}^{M} \sum_{\tau=0}^{T-1} g_{\mathrm{MVDR}}(m, \tau)\, x_m(t-\tau) \quad \text{or} \quad \hat{S}(n, f) = \sum_{m=1}^{M} \sum_{l=0}^{L-1} G_{\mathrm{MVDR}}(m, l, f)\, X_m(n-l, f) \qquad (3.13)

for time-domain or STFT-domain processing, respectively. These equations perform a very similar operation to the MCLP presented above, as discussed in Dietzen et al. (2016), although the MVDR filter design process is based on an estimated DOA and signal statistics rather than on linear prediction analysis (Haykin (2008)). In the dereverberation overview of Habets and Naylor (2018), MVDR filtering is grouped as a signal independent spatial filtering method, although its beamforming optimality is based on the statistics of the data. This is likely done to contrast MVDR filtering with beamformers whose coefficients are further tuned, in terms of Wiener filtering, for the instantaneous signal contents. The methods implementing this kind of enhanced beamformer optimization are called signal dependent beamforming methods in Habets and Naylor (2018).
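Per frequency bin, the MVDR weights follow in closed form from a noise-plus-reverberation covariance matrix and a steering vector toward the estimated DOA; in the sketch below both inputs (Phi and d) are assumed given.

```python
# A minimal narrowband MVDR sketch for one STFT bin; Phi (covariance) and
# d (steering vector from array geometry and DOA) are assumed known.
import numpy as np

def mvdr_weights(Phi, d):
    """w = Phi^{-1} d / (d^H Phi^{-1} d): minimum output power under the
    distortionless constraint w^H d = 1 toward the steering direction."""
    Phi_inv_d = np.linalg.solve(Phi, d)
    return Phi_inv_d / (d.conj() @ Phi_inv_d)

# Per (3.13), the bin-wise output is then S_hat(n, f) = w^H X(n, f),
# with X(n, f) the stacked M-channel STFT observation.
```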

An alternative structure, which implements a beamformer similar to MVDR but provides insight as well as simplifies the beamformer implementation, is the generalized sidelobe canceler (GSC). The idea of the GSC is to use a fixed beamformer for steering the attention in the desired direction, and another filter for minimizing the noise and reverberation power. The fixed beamformer is entirely independent of the data and its statistics, and provides the main-lobe directivity similarly to an MVDR filter. The noise reducing filter, on the other hand, is optimized to minimize the noise power in the particular signal, a function which in MVDR is integrated within the steering beamformer coefficients. The GSC structure provides an opportunity to set a larger variety of constraints and design parameters on the noise reduction filter than is possible with the MVDR filter design methodology. For example, AIR characteristics are incorporated in the generalized sidelobe canceler design procedure in Gannot et al. (2001).
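The GSC decomposition can be sketched per bin as a fixed beamformer plus a blocked side chain whose Wiener filter removes whatever correlates with the noise references; the blocking-matrix construction and all names here are assumptions of this sketch.

```python
# A minimal GSC sketch for one STFT bin.
import numpy as np

def gsc_output(X, d):
    """X: M x N observations for one bin; d: steering vector of length M."""
    M, N = X.shape
    w_q = d / (d.conj() @ d)            # fixed beamformer toward the DOA
    # Blocking matrix: orthonormal basis orthogonal to d (target blocked).
    U_full, _, _ = np.linalg.svd(np.outer(d, d.conj()))
    B = U_full[:, 1:]
    y = w_q.conj() @ X                  # main-lobe signal
    U = B.conj().T @ X                  # noise/reverberation references
    # Wiener filter minimizing E|y - w_a^H u|^2 over the references.
    R_u = U @ U.conj().T / N
    p = U @ y.conj() / N
    w_a = np.linalg.solve(R_u, p)
    return y - w_a.conj() @ U
```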

3.3.4 Spectral enhancement and time-varying Gaussian signal model

The dereverberation algorithms called spectral enhancement, spectral subtraction or reverberation suppression methods operate in the time-spectral STFT domain. These methods assume that the reverberant signal X(n, f) = S(n, f) + R(n, f) + N(n, f) and all of its parts S(n, f), R(n, f) and N(n, f) come from complex zero-mean time-varying Gaussian distributions N_C{0, σ_t²}. Dereverberation with these methods is done by a multiplicative operation, instead of subtraction, on the power spectral densities in each signal frame as

|\hat{S}(n, f)|^2 = \max\left(\lambda,\; W(n, f)\,|X(n, f)|^2\right), \qquad (3.14)

where the spectral floor λ is set to diminish the musical noise problem often encountered with STFT processing, and W(n, f) is the gain, or suppression coefficient, for the time-spectral bin. Often the signal parts S, R and N are assumed to be uncorrelated, and thus the maximum likelihood gains W(n, f) for single-channel dereverberation are given by

W(n, f) = \frac{\sigma_S^2(n, f)}{\sigma_X^2(n, f)} = \frac{\sigma_X^2(n, f) - \sigma_R^2(n, f) - \sigma_N^2(n, f)}{\sigma_X^2(n, f)}, \qquad (3.15)

where σ²_X, σ²_S, σ²_R and σ²_N denote the power spectral densities (PSD) of the recorded signal and of the source, reverberation and noise signal parts, respectively. An estimate of the recorded signal PSD is obtained as σ̂²_X(n, f) = |X(n, f)|². In most of the algorithms, the noise PSD σ²_N is assumed known, which in practice means that it is estimated from frames X(n) of silence within the recording. For detecting the frames of silence, any voice activity detection (VAD) technique may be utilized. The reverberation PSD σ²_R of the reverberation part may be estimated based on the recording PSD, for example as

\sigma_R^2(n, f) = e^{-\upsilon}\, |X(n-D, f)|^2, \qquad (3.16)
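Reading υ as a decay constant and D as a frame delay (our assumption, since their defining text falls outside this excerpt), the suppression chain (3.14)-(3.16) can be sketched for a single channel as below; the floor λ, D and υ values are illustrative, and the noisy phase is reused for re-synthesis.

```python
# A minimal single-channel sketch of the suppression rule (3.14)-(3.16).
import numpy as np

def spectral_dereverb(X, sigma_N2, lam=1e-3, D=4, upsilon=1.0):
    """X: STFT (frames x bins); sigma_N2: known noise PSD per bin."""
    P_x = np.abs(X) ** 2                         # recorded-signal PSD estimate
    P_r = np.zeros_like(P_x)
    P_r[D:] = np.exp(-upsilon) * P_x[:-D]        # reverberation PSD, (3.16)
    # ML gain (3.15), clipped to [0, 1] for stability.
    W = np.clip((P_x - P_r - sigma_N2) / np.maximum(P_x, 1e-12), 0.0, 1.0)
    P_s = np.maximum(lam, W * P_x)               # floored suppression, (3.14)
    return np.sqrt(P_s) * np.exp(1j * np.angle(X))
```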
