Robust Direction Estimation with Convolutional Neural Networks-based
Steered Response Power
Pasi Pertilä, Emre Cakir
Audio Research Group, Laboratory of Signal Processing, Tampere University of Technology
Tampere, Finland
The 42nd IEEE International Conference on Acoustics,
Speech and Signal Processing, 2017
Outline
1 Introduction
2 Time-Frequency masking for speech enhancement
3 Sound direction of arrival (DOA) estimation
4 Results
5 Conclusions and Discussion
Introduction
Uses of sound DOA estimation:
  Spatial filtering: beamforming, speech enhancement
  Surveillance, automatic camera management
Speaker DOA estimates can be degraded by reverberation and everyday noise
  time-varying, e.g. household, cafeteria…
Time-Frequency (TF) masking
Removes undesired TF components from the observation
  Successfully applied to speech enhancement using deep learning [1][2]
Applied to Time Difference of Arrival estimation
  Equations based on insight, to deal with static noise [3]
  Regression, to deal with reverberation [4]
This work proposes deep-learning-based (CNN) TF masking for DOA estimation
  to deal with everyday noise and reverberation

[1] A. Narayanan and D. Wang, "Ideal ratio mask estimation using deep neural networks for robust speech recognition", ICASSP 2013
[2] Y. Wang, A. Narayanan, and D. Wang, "On training targets for supervised speech separation", IEEE/ACM Trans. Audio, Speech and Lang. Proc., vol. 22, no. 12, pp. 1849–1858, 2014
[3] F. Grondin and F. Michaud, "Time Difference of Arrival Estimation based on Binary Frequency Mask for Sound Source Localization on Mobile Robots", IROS 2015
[4] K. Wilson and T. Darrell, "Learning a precedence effect-like weighting function for the generalized cross-correlation framework", IEEE Trans. Audio, Speech and Lang. Proc., 2006
Why use convolutional neural networks (CNNs) for speech?
Speech signals contain significant local information in the spectral domain.
  Changing speaker and environmental conditions
  → shifts in spectral position
  CNNs have a translational shift-invariance property
  → a suitable option for our task
CNNs are discriminative classifiers that compute activations through shared weights over local receptive fields.
Array signal model
The i-th microphone signal is modeled as a mixture of reverberated source signals in the presence of noise:
  x_i(t, f) = Σ_n h_{m_i, r_n}(f) · s_n(t, f) + e_i(t, f),   (1)

where x_i(t, f) is the observation, h_{m_i, r_n}(f) · s_n(t, f) is the reverberated n-th source signal, and e_i(t, f) is noise;
f = 0, …, K − 1 is the discrete frequency index, t is the processing frame index,
h_{m_i, r_n}(f) is the room impulse response (RIR) between source position r_n ∈ R³ and microphone position m_i ∈ R³ in Cartesian coordinates, and
i = 0, …, M − 1, where M is the number of microphones.
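The mixture model of Eq. (1) can be sketched in the STFT domain with numpy. All shapes and the random placeholder RIR transfer functions below are hypothetical, chosen only to illustrate the per-frequency mixing:

```python
import numpy as np

rng = np.random.default_rng(1)
M, N, T, K = 4, 2, 10, 257          # mics, sources, frames, freq bins (hypothetical)

# Placeholder RIR transfer functions h_{m_i, r_n}(f); real RIRs would be measured
H = rng.standard_normal((M, N, K)) + 1j * rng.standard_normal((M, N, K))
# Source STFTs s_n(t, f) and sensor noise e_i(t, f)
S = rng.standard_normal((N, T, K)) + 1j * rng.standard_normal((N, T, K))
E = 0.1 * (rng.standard_normal((M, T, K)) + 1j * rng.standard_normal((M, T, K)))

# Eq. (1): x_i(t, f) = sum_n h_{m_i, r_n}(f) * s_n(t, f) + e_i(t, f)
X = np.einsum('mnk,ntk->mtk', H, S) + E
```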
TF masking
Enhancement is performed by applying a mask η_i(t, f) to the observed signal:

  ŝ_i(t, f) = x_i(t, f) · η_i(t, f)   (2)

Wiener filter:

  η(t, f) = |s(t, f)|² / ( |s(t, f)|² + |e(t, f)|² ),   (3)

used to enhance the direct-path signal in the presence of interference:
  s(t, f) contains the direct-path component of the source
  e(t, f) is a mixture of source reverberation, interference with reverberation, and noise
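Eqs. (2)–(3) amount to a few lines of numpy when the direct-path and disturbance spectrograms are known (an oracle setting used here for illustration; the paper estimates the mask with a CNN instead). Shapes are arbitrary:

```python
import numpy as np

def wiener_mask(S, E, eps=1e-12):
    """Oracle Wiener mask of Eq. (3): |s|^2 / (|s|^2 + |e|^2)."""
    Ps, Pe = np.abs(S) ** 2, np.abs(E) ** 2
    return Ps / (Ps + Pe + eps)

rng = np.random.default_rng(0)
S = rng.standard_normal((5, 8)) + 1j * rng.standard_normal((5, 8))   # direct path
E = rng.standard_normal((5, 8)) + 1j * rng.standard_normal((5, 8))   # disturbance
eta = wiener_mask(S, E)
S_hat = (S + E) * eta      # Eq. (2): mask applied to the observation x = s + e
```

Note that the mask is real-valued and bounded in [0, 1], which is why a sigmoid output layer is a natural fit for its estimation.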
CNN Training data generation process
TIMIT speech convolved with RWCP RIRs → array observations
  16-channel circular array, radius 15 cm
  Seven rooms, RT60 ∈ [0.3, 1.3] s
Mixed with interference and ambient noise
  Three classes of interference [1]: Household, Interior background, Printer
  Speech-to-interference ratios (SIRs): {+12, +6, 0, −6} dB
  Ambient noise SNR ∼ U[6, 12] dB
22440 training, 3000 validation, and 5160 test mixtures

[1] BBC sound effects library
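One step of the mixing above, scaling an interference signal to a target SIR, can be sketched as follows. The signals here are random placeholders standing in for TIMIT speech and BBC interference:

```python
import numpy as np

def scale_to_sir(speech, interf, sir_db):
    """Gain the interference so that 10*log10(P_speech / P_interf) = sir_db."""
    p_s = np.mean(speech ** 2)
    p_i = np.mean(interf ** 2)
    g = np.sqrt(p_s / (p_i * 10.0 ** (sir_db / 10.0)))
    return interf * g

rng = np.random.default_rng(0)
speech = rng.standard_normal(16000)     # 1 s at 16 kHz (placeholder)
interf = rng.standard_normal(16000)
mix = speech + scale_to_sir(speech, interf, 6.0)   # SIR = +6 dB mixture
```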
TF-mask learning with CNNs
The CNN maps features to targets: log(|x(t, f)|) ↦ η(t, f)
  Input/output block size: 32 frames × 172 frequency bins
  Window length 21.4 ms, 50 % overlap, 16 kHz data
Masks between microphones are highly similar
  Features and targets are averaged over microphones
  → single mask: computationally efficient CNN design
Four convolutional layers
  96 feature maps, ReLU activation functions, 11-by-5 (time-by-frequency) convolution
  Each followed by batch normalization and dropout (0.25)
  First three followed by max-pooling over 4 frequency bins
Output layer: feed-forward (sigmoid)
Hyper-parameters obtained with grid search
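The frequency-axis dimensions implied by the stated hyper-parameters can be checked with simple bookkeeping. Floor-mode pooling and 'same'-padded convolutions are assumptions here, not stated in the slides:

```python
def freq_bins_after_pooling(bins=172, pool=4, stages=3):
    """Track the frequency dimension through the three max-pooling stages."""
    for _ in range(stages):
        bins //= pool          # max-pooling over 4 frequency bins
    return bins

# 172 -> 43 -> 10 -> 2 frequency bins before the sigmoid output layer
out_bins = freq_bins_after_pooling()
```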
SRP-PHAT
Steered Response Power with PHAse Transform (SRP-PHAT):

  L(k, t) = Σ_{i,j} Σ_{f=0}^{K−1} [ x_i(t, f) · x_j(t, f)* / ( |x_i(t, f)| · |x_j(t, f)| ) ] · exp(ȷ τ_{i,j} ω_f),   (4)

where
  k ∈ R³ is the sound direction (unit vector),
  τ_{i,j} = kᵀ(m_i − m_j)/c is the time difference between microphones i and j,
  (·)* denotes complex conjugate, ȷ is the imaginary unit,
  ω_f = 2πf/K, and c is the speed of sound.
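Eq. (4) can be sketched in numpy and verified on a synthetic far-field plane wave. The 8-microphone geometry and noise-free synthetic signal below are illustrative (the paper's array has 16 channels); frequencies are taken in Hz so that the TDOA τ in seconds needs no conversion to samples, and the sign convention of the steering term is an assumption:

```python
import numpy as np

def srp_phat(X, mics, dirs, fs, c=343.0):
    """Frame-wise SRP-PHAT pseudo-likelihood L(k) of Eq. (4), over mic pairs.

    X    : (M, K) complex one-frame STFT, K = nfft // 2 + 1 bins
    mics : (M, 3) microphone positions [m]
    dirs : (D, 3) candidate unit DOA vectors k
    """
    M, K = X.shape
    f_hz = np.arange(K) * fs / (2.0 * (K - 1))       # bin centre frequencies
    scores = np.zeros(len(dirs))
    for d, k in enumerate(dirs):
        acc = 0.0
        for i in range(M):
            for j in range(i + 1, M):
                tau = k @ (mics[i] - mics[j]) / c    # candidate TDOA [s]
                cross = X[i] * np.conj(X[j])
                cross /= np.abs(cross) + 1e-12       # PHAT whitening
                acc += np.real(np.sum(cross * np.exp(-2j * np.pi * f_hz * tau)))
        scores[d] = acc
    return scores

# Synthetic check: plane wave from 60 deg azimuth on an 8-mic circular array
fs, K = 16000, 257
rng = np.random.default_rng(0)
az = np.deg2rad(60.0)
u = np.array([np.cos(az), np.sin(az), 0.0])          # true source direction
ang = 2 * np.pi * np.arange(8) / 8
mics = 0.15 * np.stack([np.cos(ang), np.sin(ang), np.zeros(8)], axis=1)
f_hz = np.arange(K) * fs / (2.0 * (K - 1))
S = rng.standard_normal(K) + 1j * rng.standard_normal(K)
arrival = -(mics @ u) / 343.0                        # arrival delay per mic [s]
X = S[None, :] * np.exp(-2j * np.pi * f_hz[None, :] * arrival[:, None])

grid = np.deg2rad(np.arange(0, 360, 2))              # in-plane candidate grid
dirs = np.stack([np.cos(grid), np.sin(grid), np.zeros_like(grid)], axis=1)
est_az = np.degrees(grid[np.argmax(srp_phat(X, mics, dirs, fs))])
```

At the true direction every pair's phase terms cancel for every frequency, so the score attains its maximum there; the estimated azimuth recovers 60 degrees.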
SRP-PHAT with TF masking
Based on weighted GCC-PHAT [1]
Weighted SRP-PHAT:

L(k, t) = \sum_{i,j} \sum_{f=0}^{K-1} \frac{\eta_i(t, f) x_i(t, f) \, (\eta_j(t, f) x_j(t, f))^*}{|x_i(t, f)| \, |x_j(t, f)|} \exp(j \tau_{i,j} \omega_f),   (4)

where k ∈ R^3 is the sound direction (unit vector),
τ_{i,j} = k^T (m_i − m_j)/c is the time difference between microphones i and j,
(·)^* denotes the complex conjugate, ω_f = 2πf/K, and c is the speed of sound.
Due to feature and target averaging, η_i(t, f) = η_j(t, f)
[1] F. Grondin and F. Michaud, "Time Difference of Arrival Estimation based on Binary Frequency Mask for …"
SRP-PHAT with TF masking
Based on weighted GCC-PHAT [1]
Weighted SRP-PHAT:

L(k, t) = \sum_{i,j} \sum_{f=0}^{K-1} \frac{\eta(t, f)^2 \, x_i(t, f) \, x_j^*(t, f)}{|x_i(t, f)| \, |x_j(t, f)|} \exp(j \tau_{i,j} \omega_f),   (4)

where k ∈ R^3 is the sound direction (unit vector),
τ_{i,j} = k^T (m_i − m_j)/c is the time difference between microphones i and j,
(·)^* denotes the complex conjugate, ω_f = 2πf/K, and c is the speed of sound.
Due to feature and target averaging, η_i(t, f) = η_j(t, f) ≡ η(t, f)
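Because the mask is shared across microphones, the weighting reduces to scaling each PHAT-normalized bin by η(t, f)². A NumPy sketch (same illustrative assumptions as before, not the authors' implementation):

```python
import numpy as np

def masked_srp_phat(X, eta, mics, dirs, fs, c=343.0):
    """Mask-weighted SRP-PHAT (Eq. 4 with the averaged mask eta).

    Since eta_i = eta_j = eta, the numerator eta^2 x_i x_j^* simply
    scales each PHAT-normalized bin by eta(t, f)^2.
    eta : (K,) real-valued mask in [0, 1] for this frame
    """
    M, K = X.shape
    omega = 2 * np.pi * np.arange(K) / K
    L = np.zeros(len(dirs))
    for i in range(M):
        for j in range(i + 1, M):
            cs = X[i] * np.conj(X[j])
            cs = eta ** 2 * cs / (np.abs(cs) + 1e-12)   # PHAT + mask weight
            for d, k in enumerate(dirs):
                tau = k @ (mics[i] - mics[j]) / c * fs
                L[d] += np.real(np.sum(cs * np.exp(1j * tau * omega)))
    return L
```

An all-ones mask recovers plain SRP-PHAT; an all-zeros mask zeroes the response.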
SRP-PHAT with TF masking
The point estimate for the DOA:

\hat{k}(t) = \arg\max_k L(k, t).   (5)

For each frame t separately, pick the maximum-response direction of the weighted SRP-PHAT L(k, t).
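In practice Eq. (5) is evaluated over a discrete set of candidate directions. A sketch, assuming a hypothetical 5° azimuth grid in the horizontal plane (the actual grid used by the authors is not stated here):

```python
import numpy as np

# Hypothetical search grid: unit vectors every 5 degrees in the horizontal plane
az = np.deg2rad(np.arange(0, 360, 5))
grid = np.stack([np.cos(az), np.sin(az), np.zeros_like(az)], axis=1)

def doa_estimate(L, grid):
    """Eq. 5: return the direction k maximizing L(k, t) for one frame."""
    return grid[np.argmax(L)]
```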
DOA performance analysis
Speech: recorded Japanese sentences (RWCP), mixed with reverberated interference signals
Same SIR levels as in training.
Different interference instances, different RIRs
1 Speech played back from a moving loudspeaker
4 different rooms (with different numbers of RIRs), total of 1440 mixtures
2 Speech played back from a static loudspeaker
5 different rooms (with different numbers of RIRs), total of 1650 mixtures
DOA performance analysis
Used performance measures
1 Portion of the SRP-PHAT likelihood within ±10° of the ground-truth direction, w.r.t. all directions.
2 Percentage of DOA estimates within ±10° of the ground-truth direction.
Only frames with detected speech are included.
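The two measures can be sketched as follows, assuming an azimuth grid and angles in degrees; the function name `doa_metrics` and the circular-distance formulation are illustrative assumptions, not the paper's evaluation code.

```python
import numpy as np

def doa_metrics(L, est_az, true_az, grid_az, tol=10.0):
    """The two measures above, over T speech-active frames.

    L       : (T, D) non-negative SRP-PHAT response over a D-point azimuth grid
    est_az  : (T,) point-estimate azimuths in degrees (Eq. 5)
    true_az : (T,) ground-truth azimuths in degrees
    grid_az : (D,) azimuth of each grid direction in degrees
    Returns (relative SRP mass within +-tol in %, % of correct estimates).
    """
    circ = lambda a: np.abs((a + 180.0) % 360.0 - 180.0)    # circular distance
    in_win = circ(grid_az[None, :] - true_az[:, None]) <= tol
    mass = 100.0 * np.sum(L * in_win) / np.sum(L)           # measure 1
    acc = 100.0 * np.mean(circ(est_az - true_az) <= tol)    # measure 2
    return mass, acc
```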
DOA performance analysis
Compared methods
1 SRP-PHAT (traditional approach)
2 SRP-PHAT weighted with the CNN-predicted mask
3 SRP-PHAT weighted with an interference-canceling mask (ICM)
ICM ← Wiener filter with access to the added interference and the original reverberated speech;
will not remove the reverberation of the target speaker.
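The oracle ICM can be sketched in Wiener-filter form; the slide only names it a Wiener filter, so the exact formulation below is an assumption.

```python
import numpy as np

def icm_mask(S, N, eps=1e-12):
    """Oracle interference-canceling mask in Wiener-filter form (assumed).

    S : STFT of the original reverberated target speech
    N : STFT of the added interference
    Since S is the *reverberated* target, the mask cancels interference
    but leaves the target's own reverberation in place.
    """
    ps, pn = np.abs(S) ** 2, np.abs(N) ** 2
    return ps / (ps + pn + eps)
```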
DOA performance analysis
[Figure: four panels of azimuth angle (deg) vs. time (frame), showing DOA estimates in active and inactive frames and the ground truth ±10° band]
a) SRP-PHAT without interference,
b) SRP-PHAT with added printer interference at +6 dB SIR,
c) SRP-PHAT with ICM weight for the mixture,
d) SRP-PHAT with CNN-predicted weight for the mixture.
Results
SRP-PHAT results [1] , static source
[Bar chart: relative SRP-PHAT mass within ±10° of ground truth (static source); SRP-PHAT, CNN-W-SRP-PHAT, and ICM-W-SRP-PHAT over House, Inter. bg., Print, and Avg. interference at SIR +12, +6, 0, −6 dB]
CNN weighting improves SRP-PHAT. At SIR +12 and +6 dB, CNN outperforms ICM.
Most likely caused by reduction of reverberation.
[1] Normalized with the SRP-PHAT result without interference.
Results
SRP-PHAT results, moving source
[Bar chart: relative SRP-PHAT mass within ±10° of ground truth (moving source); SRP-PHAT, CNN-W-SRP-PHAT, and ICM-W-SRP-PHAT over House, Inter. bg., Print, and Avg. interference at SIR +12, +6, 0, −6 dB]
Results
DOA point estimate, static source
[Bar chart: correct DOA estimates within ±10° of ground truth (static source); SRP-PHAT, CNN-W-SRP-PHAT, and ICM-W-SRP-PHAT over House, Inter. bg., Print, and Avg. interference at SIR +12, +6, 0, −6 dB]
CNN weighting improves SRP-PHAT
Results
DOA point estimate, moving source
[Bar chart: correct DOA estimates within ±10° of ground truth (moving source); SRP-PHAT, CNN-W-SRP-PHAT, and ICM-W-SRP-PHAT over House, Inter. bg., Print, and Avg. interference at SIR +12, +6, 0, −6 dB]
Conclusions and Discussion
A CNN-based, time-frequency-mask-weighted SRP-PHAT function was proposed.
Results showed a reduction of the detrimental effects caused by reverberation and time-varying interference.
Relative performance was unchanged by source motion; the CNN generalized to new speakers and to new instances of interference from unseen angles.
The CNN mask is obtained from log-magnitude features
→ should work on other arrays (without re-training).
SRP-PHAT weighting does not fully remove the effects of interference
→ phase errors are not affected by (real-valued) masking.