Robust Direction Estimation with Convolutional Neural Networks-based Steered Response Power


Pasi Pertilä, Emre Cakir
Audio Research Group, Laboratory of Signal Processing
Tampere University of Technology, Tampere, Finland

The 42nd IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2017

Outline

1 Introduction
2 Time-Frequency masking for speech enhancement
3 Sound direction of arrival (DOA) estimation
4 Results
5 Conclusions and Discussion


Introduction

Uses of sound DOA estimation:
- Spatial filtering: beamforming, speech enhancement
- Surveillance, automatic camera management

Speaker DOA estimates can be degraded by reverberation and everyday, time-varying noise (e.g., household, cafeteria).


Time-Frequency (TF) masking

- Removes undesired TF components from the observation.
- Successfully applied to speech enhancement using deep learning [1][2].
- Applied in Time Difference of Arrival estimation:
  - equations based on insight, to deal with static noise [3],
  - regression, to deal with reverberation [4].

This work proposes deep learning-based (CNN) TF masking for DOA estimation, to deal with everyday noise and reverberation.

[1] A. Narayanan and D. Wang, "Ideal ratio mask estimation using deep neural networks for robust speech recognition", ICASSP 2013.
[2] Y. Wang, A. Narayanan, and D. Wang, "On training targets for supervised speech separation", IEEE/ACM Trans. Audio, Speech and Lang. Proc., vol. 22, no. 12, pp. 1849-1858, 2014.
[3] F. Grondin and F. Michaud, "Time Difference of Arrival Estimation based on Binary Frequency Mask for Sound Source Localization on Mobile Robots", IROS 2015.
[4] K. Wilson and T. Darrell, "Learning a precedence effect-like weighting function for the generalized cross-correlation framework", IEEE Trans. Audio, Speech, and Lang. Proc., 2006.


Why use convolutional neural networks (CNNs) for speech?

- Speech signals contain significant local information in the spectral domain.
- Changing speaker and environmental conditions → shifts in spectral position.
- CNNs have a translational shift-invariance property → a suitable option for our task.
- CNNs are discriminative classifiers that compute activations through shared weights over local receptive fields.


Array signal model

The i-th microphone signal is modeled as the mixture of reverberated signals in the presence of noise:

\underbrace{x_i(t, f)}_{\text{observation}} = \sum_n \underbrace{h_{m_i, r_n}(f) \cdot s_n(t, f)}_{\text{reverberated } n\text{th source signal}} + \underbrace{e_i(t, f)}_{\text{noise}},  (1)

where
- f = 0, ..., K − 1 is the discrete frequency index and t is the processing frame index,
- h_{m_i, r_n}(f) is the room impulse response (RIR) between source position r_n ∈ R^3 and microphone position m_i ∈ R^3 in Cartesian coordinates,
- i = 0, ..., M − 1, where M is the number of microphones.
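To make the notation concrete, here is a minimal numpy sketch of Eq. (1) in the STFT domain, treating each RIR as a per-frequency transfer function (the usual narrowband approximation); all shapes and random data are illustrative, not from the paper:

```python
import numpy as np

# Illustrative sizes: M mics, N sources, T frames, K frequency bins.
M, N, T, K = 16, 2, 32, 172
rng = np.random.default_rng(0)

S = rng.normal(size=(N, T, K)) + 1j * rng.normal(size=(N, T, K))  # source STFTs s_n(t, f)
H = rng.normal(size=(M, N, K)) + 1j * rng.normal(size=(M, N, K))  # RIR transfer functions h_{m_i, r_n}(f)
E = 0.1 * (rng.normal(size=(M, T, K)) + 1j * rng.normal(size=(M, T, K)))  # noise e_i(t, f)

# Eq. (1): x_i(t, f) = sum_n h_{m_i, r_n}(f) * s_n(t, f) + e_i(t, f)
X = np.einsum('mnk,ntk->mtk', H, S) + E
```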

TF masking

Enhancement is performed by applying a mask η_i(t, f) to the observed signal:

\hat{s}_i(t, f) = x_i(t, f) \cdot \eta_i(t, f),  (2)

Wiener filter:

\eta(t, f) = \frac{|s(t, f)|^2}{|s(t, f)|^2 + |e(t, f)|^2},  (3)

To enhance the direct-path signal in the presence of interference:
- s(t, f) contains the direct-path component of the source,
- e(t, f) is a mixture of source reverberation, interference with reverberation, and noise.
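As a short numpy sketch (the function names are mine, not the paper's), the oracle Wiener mask of Eq. (3) and the masking step of Eq. (2) look like:

```python
import numpy as np

def wiener_mask(S_direct, E_resid, eps=1e-12):
    """Oracle Wiener mask, Eq. (3): |s|^2 / (|s|^2 + |e|^2).

    S_direct: STFT of the direct-path source, shape (T, K).
    E_resid:  STFT of everything else (reverberation, interference, noise).
    """
    p_s = np.abs(S_direct) ** 2
    return p_s / (p_s + np.abs(E_resid) ** 2 + eps)

def apply_mask(X, eta):
    """Eq. (2): element-wise masking of the observed STFT."""
    return X * eta
```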


CNN Training data generation process

- TIMIT speech convolved with RWCP RIRs → array observations
  - 16-channel circular array, radius 15 cm,
  - seven rooms, RT60 ∈ [0.3, 1.3] s.
- Mixed with interference and ambient noise (the mixing step is sketched below)
  - three classes of interference [1]: Household, Interior background, Printer,
  - speech-to-interference ratios (SIRs): {+12, +6, 0, −6} dB,
  - ambient noise SNR ∼ U[6, 12] dB.
- 22440 training, 3000 validation, and 5160 test mixtures.

[1] BBC sound effects library
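A minimal sketch of mixing at a target SIR (the helper name and the exact gain convention are assumptions; the paper may scale the signals differently):

```python
import numpy as np

def mix_at_sir(speech, interference, sir_db):
    """Scale the interference so the speech-to-interference ratio equals sir_db."""
    p_s = np.mean(speech ** 2)
    p_i = np.mean(interference ** 2) + 1e-12
    gain = np.sqrt(p_s / (p_i * 10.0 ** (sir_db / 10.0)))
    return speech + gain * interference

# e.g., draw sir_db from {+12, +6, 0, -6} and, analogously, an ambient-noise
# SNR from U[6, 12] dB for each generated mixture.
```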


TF-mask learning with CNNs

- The CNN maps features to targets: log(|x(t, f)|) ↦ η(t, f).
- Input/output block size is 32 frames × 172 frequency bins.
- Window length 21.4 ms, 50% overlap, 16 kHz data.
- Masks between microphones are highly similar, so features and targets are averaged over microphones
  → a single mask: a computationally efficient CNN design.
- Four convolutional layers:
  - 96 feature maps, ReLU activation functions, 11-by-5 (time-by-frequency) convolutions,
  - each followed by batch normalization and dropout (0.25),
  - the first three followed by max-pooling over 4 frequency bins.
- Output layer: feed-forward (sigmoid).
- Hyperparameters obtained with grid search (see the sketch after this list).
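A minimal Keras sketch of the network described above; the padding, optimizer, and loss are my assumptions where the slides are silent:

```python
import tensorflow as tf
from tensorflow.keras import layers

T, K = 32, 172  # input/output block: 32 frames x 172 frequency bins

model = tf.keras.Sequential([layers.Input(shape=(T, K, 1))])
for i in range(4):                                    # four convolutional layers
    model.add(layers.Conv2D(96, kernel_size=(11, 5),  # 11-by-5 (time x frequency)
                            padding='same', activation='relu'))
    model.add(layers.BatchNormalization())
    model.add(layers.Dropout(0.25))
    if i < 3:                                         # first three layers only
        model.add(layers.MaxPooling2D(pool_size=(1, 4)))  # pool over 4 freq. bins

model.add(layers.Flatten())
model.add(layers.Dense(T * K, activation='sigmoid'))  # feed-forward output layer
model.add(layers.Reshape((T, K)))                     # mask eta(t, f) in [0, 1]

model.compile(optimizer='adam', loss='mse')           # assumed training setup
```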



SRP-PHAT

Steered Response Power with PHAse Transform (SRP-PHAT):

L(k, t) = \sum_{i,j} \sum_{f=0}^{K-1} \frac{x_i(t, f) \cdot (x_j(t, f))^*}{|x_i(t, f)| \, |x_j(t, f)|} \exp(j \omega_f \tau_{i,j}),  (4)

where
- k ∈ R^3 is the sound direction (unit vector),
- τ_{i,j} = k^T (m_i − m_j)/c is the time difference between microphones,
- (·)^* denotes the complex conjugate, ω_f = 2πf/K, and c is the speed of sound.

A numpy sketch of Eq. (4), with an optional mask weight for the variant on the next slide, follows below.
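A compact numpy sketch of Eq. (4) for a single frame; the direction grid, bin-frequency convention, and function name are illustrative assumptions. The optional `mask` argument implements the CNN-weighted variant (with η_i = η_j = η):

```python
import numpy as np

def srp_phat(X, mics, dirs, omega, c=343.0, mask=None):
    """SRP-PHAT response L(k) for one frame, Eq. (4).

    X:     STFT frame, shape (M, K);  mics: (M, 3) positions in meters
    dirs:  (D, 3) candidate unit direction vectors k
    omega: (K,) angular frequency of each bin (rad/s)
    mask:  optional TF weight eta(f); applied per channel, so pairs get eta^2
    """
    Xn = X / (np.abs(X) + 1e-12)              # PHAT normalization
    if mask is not None:
        Xn = Xn * mask                        # eta_i = eta_j = eta
    tau = dirs @ mics.T / c                   # (D, M) per-microphone delays
    M = X.shape[0]
    L = np.zeros(len(dirs))
    for i in range(M):
        for j in range(i + 1, M):
            cc = Xn[i] * np.conj(Xn[j])       # normalized cross-spectrum
            dt = tau[:, i] - tau[:, j]        # (D,) pairwise TDOA tau_{i,j}
            L += np.real(np.exp(1j * np.outer(dt, omega)) @ cc)
    return L
```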

SRP-PHAT with TF masking

Based on weighted GCC-PHAT [1].

Weighted SRP-PHAT:

L(k, t) = \sum_{i,j} \sum_{f=0}^{K-1} \frac{\eta_i(t, f) x_i(t, f) \cdot (\eta_j(t, f) x_j(t, f))^*}{|x_i(t, f)| \, |x_j(t, f)|} \exp(j \omega_f \tau_{i,j}),  (4)

- k ∈ R^3 is the sound direction (unit vector),
- τ_{i,j} = k^T (m_i − m_j)/c is the time difference between microphones,
- (·)^* denotes the complex conjugate, ω_f = 2πf/K, and c is the speed of sound.

Due to feature and target averaging, η_i(t, f) = η_j(t, f) = η(t, f), so the pairwise weight reduces to η(t, f)^2.

[1] F. Grondin and F. Michaud, "Time Difference of Arrival Estimation based on Binary Frequency Mask for Sound Source Localization on Mobile Robots", IROS 2015.


SRP-PHAT with TF masking

The point estimate for the DOA:

\hat{k}(t) = \arg\max_k L(k, t).  (5)

For each frame t separately, pick the maximum-response direction of the weighted SRP-PHAT L(k, t).
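Using the `srp_phat` sketch above, Eq. (5) is a per-frame argmax over the direction grid (all names are the illustrative ones introduced earlier):

```python
# eta_t: CNN-predicted mask for frame t, shape (K,)
L = srp_phat(X_t, mics, dirs, omega, mask=eta_t)  # weighted SRP-PHAT, Eq. (4)
k_hat = dirs[np.argmax(L)]                        # DOA point estimate, Eq. (5)
```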

DOA performance analysis

- Speech: recorded Japanese sentences (RWCP).
- Mixed with reverberated interference signals:
  - same SIR levels as in training,
  - different interference instances, different RIRs.

1. Speech played back from a moving loudspeaker
   - 4 different rooms (with different numbers of RIRs)
   - total of 1440 mixtures
2. Speech played back from a static loudspeaker
   - 5 different rooms (with different numbers of RIRs)
   - total of 1650 mixtures


DOA performance analysis

Used performance measures (both sketched below):
1. Portion of the SRP-PHAT likelihood in the ground-truth direction (±10°) w.r.t. all directions.
2. Percentage of DOA estimates in the ground-truth direction (±10°).

Only frames with detected speech are included.
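A numpy sketch of both measures; the angular-distance convention and the clipping of negative response values to zero are my assumptions:

```python
import numpy as np

def likelihood_mass_near_truth(L, dirs, k_true, tol_deg=10.0):
    """Measure 1: fraction of SRP-PHAT response mass within tol_deg of truth."""
    L = np.clip(L, 0.0, None)                 # treat the response as a mass
    ang = np.degrees(np.arccos(np.clip(dirs @ k_true, -1.0, 1.0)))
    return L[ang <= tol_deg].sum() / (L.sum() + 1e-12)

def doa_accuracy(k_est, k_true, tol_deg=10.0):
    """Measure 2: percent of frame-wise estimates within tol_deg of truth."""
    cosang = np.clip(np.sum(k_est * k_true, axis=-1), -1.0, 1.0)
    return 100.0 * np.mean(np.degrees(np.arccos(cosang)) <= tol_deg)
```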


DOA performance analysis

Compared methods:
1. SRP-PHAT (traditional approach)
2. SRP-PHAT weighted with the CNN-predicted mask
3. SRP-PHAT weighted with an interference-canceling mask (ICM)
   - ICM: a Wiener filter with access to the added interference and to the original reverberated speech,
   - it will not remove reverberation of the target speaker.


DOA performance analysis

[Figure: azimuth angle (deg) vs. time (frame) for four panels; legend: DOA estimate (active frame), DOA estimate (inactive frame), ground truth ±10°.]
a) SRP-PHAT without interference,
b) SRP-PHAT with added printer interference at +6 dB SIR,
c) SRP-PHAT with the ICM weight for the mixture,
d) SRP-PHAT with the CNN-predicted weight for the mixture.


Results

SRP-PHAT results [1], static source

[Bar chart: relative SRP-PHAT mass within ±10° of ground truth (static source); performance (%) per interference class (House, Inter.bg., Print, Avg.) at SIR +12, +6, 0, and −6 dB; methods: SRP-PHAT, CNN-W-SRP-PHAT, ICM-W-SRP-PHAT.]

- CNN weighting improves SRP-PHAT.
- At SIR +12 and +6 dB, CNN outperforms ICM, most likely caused by the reduction of reverberation.

[1] Normalized with the SRP-PHAT result without interference.


Results

SRP-PHAT results, moving source

[Bar chart: relative SRP-PHAT mass within ±10° of ground truth (moving source); performance (%) per interference class (House, Inter.bg., Print, Avg.) at SIR +12, +6, 0, and −6 dB; methods: SRP-PHAT, CNN-W-SRP-PHAT, ICM-W-SRP-PHAT.]

Results

DOA point estimate, static source

[Bar chart: correct DOA estimates within ±10° of ground truth (static source); relative performance (%) per interference class (House, Inter.bg., Print, Avg.) at SIR +12, +6, 0, and −6 dB; methods: SRP-PHAT, CNN-W-SRP-PHAT, ICM-W-SRP-PHAT.]

- CNN weighting improves SRP-PHAT.

Results

DOA point estimate, moving source

[Bar chart: correct DOA estimates within ±10° of ground truth (moving source); relative performance (%) per interference class (House, Inter.bg., Print, Avg.) at SIR +12, +6, 0, and −6 dB; methods: SRP-PHAT, CNN-W-SRP-PHAT, ICM-W-SRP-PHAT.]


Conclusions and Discussion

- A CNN-based time-frequency-mask-weighted SRP-PHAT function was proposed.
- Results showed a reduction in the detrimental effects caused by reverberation and time-varying interference.
- Relative performance was unchanged by source motion.
- The CNN generalized to new speakers and to new instances of interference from unseen angles.
- The CNN mask is obtained from the log-magnitude spectrum
  → it should work on other arrays (without re-training).
- SRP-PHAT weighting does not fully remove the effects of interference
  → phase errors are not affected by (real-valued) masking.


Thank you for listening

Time for questions?
