Robust Direction Estimation with Convolutional Neural Networks-based
Steered Response Power
Pasi Pertilä, Emre Cakir
Audio Research Group, Laboratory of Signal Processing, Tampere University of Technology
Tampere, Finland
The 42nd IEEE International Conference on Acoustics,
Speech and Signal Processing, 2017
Outline
1 Introduction
2 Time-Frequency masking for speech enhancement
3 Sound direction of arrival (DOA) estimation
4 Results
5 Conclusions and Discussion
Introduction
Uses of sound DOA estimation:
  Spatial filtering: beamforming, speech enhancement
  Surveillance, automatic camera management
Speaker DOA estimates can be degraded by reverberation and everyday noise
  time-varying, e.g. household, cafeteria…
Time-Frequency (TF) masking
Removes undesired TF components from the observation
  Successfully applied to speech enhancement using deep learning [1][2]
Applied to Time Difference of Arrival estimation
  Equations based on insight, to deal with static noise [3]
  Regression, to deal with reverberation [4]
This work proposes deep-learning-based (CNN) TF masking for DOA estimation
  to deal with everyday noise and reverberation

[1] A. Narayanan and D. Wang, "Ideal ratio mask estimation using deep neural networks for robust speech recognition", ICASSP 2013
[2] Y. Wang, A. Narayanan, and D. Wang, "On training targets for supervised speech separation", IEEE/ACM Trans. Audio, Speech and Lang. Proc., vol. 22, no. 12, pp. 1849–1858, 2014
[3] F. Grondin and F. Michaud, "Time Difference of Arrival Estimation based on Binary Frequency Mask for Sound Source Localization on Mobile Robots", IROS 2015
[4] K. Wilson and T. Darrell, "Learning a precedence effect-like weighting function for the generalized cross-correlation framework", IEEE Trans. Audio, Speech and Lang. Proc., 2006
Why use convolutional neural networks (CNNs) for speech?
Speech signals contain significant local information in the spectral domain.
  Changing speaker and environmental conditions
  → shifts in spectral position
  CNNs have a translational shift-invariance property
  → a suitable option for our task
CNNs are discriminative classifiers that compute activations through shared weights over local receptive fields.
Array signal model
The i-th microphone signal is modeled as a mixture of reverberated source signals in the presence of noise:
  x_i(t, f) = Σ_n h_{m_i, r_n}(f) · s_n(t, f) + e_i(t, f),   (1)

where x_i(t, f) is the observation, h_{m_i, r_n}(f) · s_n(t, f) is the reverberated n-th source signal, and e_i(t, f) is noise;
f = 0, …, K − 1 is the discrete frequency index, t is the processing frame index,
h_{m_i, r_n}(f) is the room impulse response (RIR) between source position r_n ∈ R³ and microphone position m_i ∈ R³ in Cartesian coordinates, and
i = 0, …, M − 1, where M is the number of microphones.
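The mixture model of Eq. (1) can be sketched in the STFT domain with numpy. All shapes and the random placeholder RIR transfer functions below are hypothetical, chosen only to illustrate the per-frequency mixing:

```python
import numpy as np

rng = np.random.default_rng(1)
M, N, T, K = 4, 2, 10, 257          # mics, sources, frames, freq bins (hypothetical)

# Placeholder RIR transfer functions h_{m_i, r_n}(f); real RIRs would be measured
H = rng.standard_normal((M, N, K)) + 1j * rng.standard_normal((M, N, K))
# Source STFTs s_n(t, f) and sensor noise e_i(t, f)
S = rng.standard_normal((N, T, K)) + 1j * rng.standard_normal((N, T, K))
E = 0.1 * (rng.standard_normal((M, T, K)) + 1j * rng.standard_normal((M, T, K)))

# Eq. (1): x_i(t, f) = sum_n h_{m_i, r_n}(f) * s_n(t, f) + e_i(t, f)
X = np.einsum('mnk,ntk->mtk', H, S) + E
```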
TF masking
Enhancement is performed by applying a mask η_i(t, f) to the observed signal:

  ŝ_i(t, f) = x_i(t, f) · η_i(t, f)   (2)

Wiener filter:

  η(t, f) = |s(t, f)|² / ( |s(t, f)|² + |e(t, f)|² ),   (3)

used to enhance the direct-path signal in the presence of interference:
  s(t, f) contains the direct-path component of the source
  e(t, f) is a mixture of source reverberation, interference with reverberation, and noise
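Eqs. (2)–(3) amount to a few lines of numpy when the direct-path and disturbance spectrograms are known (an oracle setting used here for illustration; the paper estimates the mask with a CNN instead). Shapes are arbitrary:

```python
import numpy as np

def wiener_mask(S, E, eps=1e-12):
    """Oracle Wiener mask of Eq. (3): |s|^2 / (|s|^2 + |e|^2)."""
    Ps, Pe = np.abs(S) ** 2, np.abs(E) ** 2
    return Ps / (Ps + Pe + eps)

rng = np.random.default_rng(0)
S = rng.standard_normal((5, 8)) + 1j * rng.standard_normal((5, 8))   # direct path
E = rng.standard_normal((5, 8)) + 1j * rng.standard_normal((5, 8))   # disturbance
eta = wiener_mask(S, E)
S_hat = (S + E) * eta      # Eq. (2): mask applied to the observation x = s + e
```

Note that the mask is real-valued and bounded in [0, 1], which is why a sigmoid output layer is a natural fit for its estimation.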
CNN Training data generation process
TIMIT speech convolved with RWCP RIRs → array observations
  16-channel circular array, radius 15 cm
  Seven rooms, RT60 ∈ [0.3, 1.3] s
Mixed with interference and ambient noise
  Three classes of interference [1]: Household, Interior background, Printer
  Speech-to-interference ratios (SIRs): {+12, +6, 0, −6} dB
  Ambient noise SNR ∼ U[6, 12] dB
22440 training, 3000 validation, and 5160 test mixtures

[1] BBC sound effects library
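One step of the mixing above, scaling an interference signal to a target SIR, can be sketched as follows. The signals here are random placeholders standing in for TIMIT speech and BBC interference:

```python
import numpy as np

def scale_to_sir(speech, interf, sir_db):
    """Gain the interference so that 10*log10(P_speech / P_interf) = sir_db."""
    p_s = np.mean(speech ** 2)
    p_i = np.mean(interf ** 2)
    g = np.sqrt(p_s / (p_i * 10.0 ** (sir_db / 10.0)))
    return interf * g

rng = np.random.default_rng(0)
speech = rng.standard_normal(16000)     # 1 s at 16 kHz (placeholder)
interf = rng.standard_normal(16000)
mix = speech + scale_to_sir(speech, interf, 6.0)   # SIR = +6 dB mixture
```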
TF-mask learning with CNNs
The CNN maps features to targets: log(|x(t, f)|) ↦ η(t, f)
  Input/output block size: 32 frames × 172 frequency bins
  Window length 21.4 ms, 50 % overlap, 16 kHz data
Masks between microphones are highly similar
  Features and targets are averaged over microphones
  → single mask: computationally efficient CNN design
Four convolutional layers
  96 feature maps, ReLU activation functions, 11-by-5 (time-by-frequency) convolution
  Each followed by batch normalization and dropout (0.25)
  First three followed by max-pooling over 4 frequency bins
Output layer: feed-forward (sigmoid)
Hyper-parameters obtained with grid search
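The frequency-axis dimensions implied by the stated hyper-parameters can be checked with simple bookkeeping. Floor-mode pooling and 'same'-padded convolutions are assumptions here, not stated in the slides:

```python
def freq_bins_after_pooling(bins=172, pool=4, stages=3):
    """Track the frequency dimension through the three max-pooling stages."""
    for _ in range(stages):
        bins //= pool          # max-pooling over 4 frequency bins
    return bins

# 172 -> 43 -> 10 -> 2 frequency bins before the sigmoid output layer
out_bins = freq_bins_after_pooling()
```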
SRP-PHAT
Steered Response Power with PHAse Transform (SRP-PHAT):

  L(k, t) = Σ_{i,j} Σ_{f=0}^{K−1} [ x_i(t, f) · x_j(t, f)* / ( |x_i(t, f)| · |x_j(t, f)| ) ] · exp(ȷ τ_{i,j} ω_f),   (4)

where
  k ∈ R³ is the sound direction (unit vector),
  τ_{i,j} = kᵀ(m_i − m_j)/c is the time difference between microphones i and j,
  (·)* denotes complex conjugate, ȷ is the imaginary unit,
  ω_f = 2πf/K, and c is the speed of sound.
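Eq. (4) can be sketched in numpy and verified on a synthetic far-field plane wave. The 8-microphone geometry and noise-free synthetic signal below are illustrative (the paper's array has 16 channels); frequencies are taken in Hz so that the TDOA τ in seconds needs no conversion to samples, and the sign convention of the steering term is an assumption:

```python
import numpy as np

def srp_phat(X, mics, dirs, fs, c=343.0):
    """Frame-wise SRP-PHAT pseudo-likelihood L(k) of Eq. (4), over mic pairs.

    X    : (M, K) complex one-frame STFT, K = nfft // 2 + 1 bins
    mics : (M, 3) microphone positions [m]
    dirs : (D, 3) candidate unit DOA vectors k
    """
    M, K = X.shape
    f_hz = np.arange(K) * fs / (2.0 * (K - 1))       # bin centre frequencies
    scores = np.zeros(len(dirs))
    for d, k in enumerate(dirs):
        acc = 0.0
        for i in range(M):
            for j in range(i + 1, M):
                tau = k @ (mics[i] - mics[j]) / c    # candidate TDOA [s]
                cross = X[i] * np.conj(X[j])
                cross /= np.abs(cross) + 1e-12       # PHAT whitening
                acc += np.real(np.sum(cross * np.exp(-2j * np.pi * f_hz * tau)))
        scores[d] = acc
    return scores

# Synthetic check: plane wave from 60 deg azimuth on an 8-mic circular array
fs, K = 16000, 257
rng = np.random.default_rng(0)
az = np.deg2rad(60.0)
u = np.array([np.cos(az), np.sin(az), 0.0])          # true source direction
ang = 2 * np.pi * np.arange(8) / 8
mics = 0.15 * np.stack([np.cos(ang), np.sin(ang), np.zeros(8)], axis=1)
f_hz = np.arange(K) * fs / (2.0 * (K - 1))
S = rng.standard_normal(K) + 1j * rng.standard_normal(K)
arrival = -(mics @ u) / 343.0                        # arrival delay per mic [s]
X = S[None, :] * np.exp(-2j * np.pi * f_hz[None, :] * arrival[:, None])

grid = np.deg2rad(np.arange(0, 360, 2))              # in-plane candidate grid
dirs = np.stack([np.cos(grid), np.sin(grid), np.zeros_like(grid)], axis=1)
est_az = np.degrees(grid[np.argmax(srp_phat(X, mics, dirs, fs))])
```

At the true direction every pair's phase terms cancel for every frequency, so the score attains its maximum there; the estimated azimuth recovers 60 degrees.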
SRP-PHAT with TF masking
Based on weighted GCC-PHAT [1]
Weighted SRP-PHAT:

L(k, t) = \sum_{i,j} \sum_{f=0}^{K-1} \frac{\eta_i(t, f) x_i(t, f) \, (\eta_j(t, f) x_j(t, f))^*}{|x_i(t, f)| \, |x_j(t, f)|} \exp(j \tau_{i,j} \omega_f),   (4)

where k ∈ R^3 is the sound direction (unit vector),
τ_{i,j} = k^T (m_i − m_j)/c is the time difference between microphones i and j,
(·)^* denotes the complex conjugate, ω_f = 2πf/K, and c is the speed of sound.
Due to feature and target averaging, η_i(t, f) = η_j(t, f)
[1] F. Grondin and F. Michaud, "Time Difference of Arrival Estimation based on Binary Frequency Mask for …"
SRP-PHAT with TF masking
Based on weighted GCC-PHAT [1]
Weighted SRP-PHAT:

L(k, t) = \sum_{i,j} \sum_{f=0}^{K-1} \frac{\eta(t, f)^2 \, x_i(t, f) \, x_j^*(t, f)}{|x_i(t, f)| \, |x_j(t, f)|} \exp(j \tau_{i,j} \omega_f),   (4)

where k ∈ R^3 is the sound direction (unit vector),
τ_{i,j} = k^T (m_i − m_j)/c is the time difference between microphones i and j,
(·)^* denotes the complex conjugate, ω_f = 2πf/K, and c is the speed of sound.
Due to feature and target averaging, η_i(t, f) = η_j(t, f) ≡ η(t, f)
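Because the mask is shared across microphones, the weighting reduces to scaling each PHAT-normalized bin by η(t, f)². A NumPy sketch (same illustrative assumptions as before, not the authors' implementation):

```python
import numpy as np

def masked_srp_phat(X, eta, mics, dirs, fs, c=343.0):
    """Mask-weighted SRP-PHAT (Eq. 4 with the averaged mask eta).

    Since eta_i = eta_j = eta, the numerator eta^2 x_i x_j^* simply
    scales each PHAT-normalized bin by eta(t, f)^2.
    eta : (K,) real-valued mask in [0, 1] for this frame
    """
    M, K = X.shape
    omega = 2 * np.pi * np.arange(K) / K
    L = np.zeros(len(dirs))
    for i in range(M):
        for j in range(i + 1, M):
            cs = X[i] * np.conj(X[j])
            cs = eta ** 2 * cs / (np.abs(cs) + 1e-12)   # PHAT + mask weight
            for d, k in enumerate(dirs):
                tau = k @ (mics[i] - mics[j]) / c * fs
                L[d] += np.real(np.sum(cs * np.exp(1j * tau * omega)))
    return L
```

An all-ones mask recovers plain SRP-PHAT; an all-zeros mask zeroes the response.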
SRP-PHAT with TF masking
The point estimate for the DOA:

\hat{k}(t) = \arg\max_k L(k, t).   (5)

For each frame t separately, pick the maximum-response direction of the weighted SRP-PHAT L(k, t).
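In practice Eq. (5) is evaluated over a discrete set of candidate directions. A sketch, assuming a hypothetical 5° azimuth grid in the horizontal plane (the actual grid used by the authors is not stated here):

```python
import numpy as np

# Hypothetical search grid: unit vectors every 5 degrees in the horizontal plane
az = np.deg2rad(np.arange(0, 360, 5))
grid = np.stack([np.cos(az), np.sin(az), np.zeros_like(az)], axis=1)

def doa_estimate(L, grid):
    """Eq. 5: return the direction k maximizing L(k, t) for one frame."""
    return grid[np.argmax(L)]
```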
DOA performance analysis
Speech: recorded Japanese sentences (RWCP), mixed with reverberated interference signals
Same SIR levels as in training.
Different interference instances, different RIRs
1 Speech played back from a moving loudspeaker
4 different rooms (with different numbers of RIRs), total of 1440 mixtures
2 Speech played back from a static loudspeaker
5 different rooms (with different numbers of RIRs), total of 1650 mixtures
DOA performance analysis
Used performance measures
1 Portion of the SRP-PHAT likelihood within ±10° of the ground-truth direction, w.r.t. all directions.
2 Percentage of DOA estimates within ±10° of the ground-truth direction.
Only frames with detected speech are included.
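The two measures can be sketched as follows, assuming an azimuth grid and angles in degrees; the function name `doa_metrics` and the circular-distance formulation are illustrative assumptions, not the paper's evaluation code.

```python
import numpy as np

def doa_metrics(L, est_az, true_az, grid_az, tol=10.0):
    """The two measures above, over T speech-active frames.

    L       : (T, D) non-negative SRP-PHAT response over a D-point azimuth grid
    est_az  : (T,) point-estimate azimuths in degrees (Eq. 5)
    true_az : (T,) ground-truth azimuths in degrees
    grid_az : (D,) azimuth of each grid direction in degrees
    Returns (relative SRP mass within +-tol in %, % of correct estimates).
    """
    circ = lambda a: np.abs((a + 180.0) % 360.0 - 180.0)    # circular distance
    in_win = circ(grid_az[None, :] - true_az[:, None]) <= tol
    mass = 100.0 * np.sum(L * in_win) / np.sum(L)           # measure 1
    acc = 100.0 * np.mean(circ(est_az - true_az) <= tol)    # measure 2
    return mass, acc
```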
DOA performance analysis
Compared methods
1 SRP-PHAT (traditional approach)
2 SRP-PHAT weighted with the CNN-predicted mask
3 SRP-PHAT weighted with an interference-canceling mask (ICM)
ICM ← Wiener filter with access to the added interference and the original reverberated speech;
will not remove the reverberation of the target speaker.
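The oracle ICM can be sketched in Wiener-filter form; the slide only names it a Wiener filter, so the exact formulation below is an assumption.

```python
import numpy as np

def icm_mask(S, N, eps=1e-12):
    """Oracle interference-canceling mask in Wiener-filter form (assumed).

    S : STFT of the original reverberated target speech
    N : STFT of the added interference
    Since S is the *reverberated* target, the mask cancels interference
    but leaves the target's own reverberation in place.
    """
    ps, pn = np.abs(S) ** 2, np.abs(N) ** 2
    return ps / (ps + pn + eps)
```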
DOA performance analysis
[Figure: four panels of azimuth angle (deg) vs. time (frame), showing DOA estimates in active and inactive frames and the ground truth ±10° band]
a) SRP-PHAT without interference,
b) SRP-PHAT with added printer interference at +6 dB SIR,
c) SRP-PHAT with ICM weight for the mixture,
d) SRP-PHAT with CNN-predicted weight for the mixture.
Results
SRP-PHAT results [1] , static source
[Bar chart: relative SRP-PHAT mass within ±10° of ground truth (static source); SRP-PHAT, CNN-W-SRP-PHAT, and ICM-W-SRP-PHAT over House, Inter. bg., Print, and Avg. interference at SIR +12, +6, 0, −6 dB]
CNN weighting improves SRP-PHAT. At SIR +12 and +6 dB, CNN outperforms ICM.
Most likely caused by reduction of reverberation.
[1] Normalized with the SRP-PHAT result without interference.
Results
SRP-PHAT results, moving source
[Bar chart: relative SRP-PHAT mass within ±10° of ground truth (moving source); SRP-PHAT, CNN-W-SRP-PHAT, and ICM-W-SRP-PHAT over House, Inter. bg., Print, and Avg. interference at SIR +12, +6, 0, −6 dB]
Results
DOA point estimate, static source
[Bar chart: correct DOA estimates within ±10° of ground truth (static source); SRP-PHAT, CNN-W-SRP-PHAT, and ICM-W-SRP-PHAT over House, Inter. bg., Print, and Avg. interference at SIR +12, +6, 0, −6 dB]
CNN weighting improves SRP-PHAT
Results
DOA point estimate, moving source
[Bar chart: correct DOA estimates within ±10° of ground truth (moving source); SRP-PHAT, CNN-W-SRP-PHAT, and ICM-W-SRP-PHAT over House, Inter. bg., Print, and Avg. interference at SIR +12, +6, 0, −6 dB]
Conclusions and Discussion
A CNN-based, time-frequency-mask-weighted SRP-PHAT function was proposed.
Results showed a reduction of the detrimental effects caused by reverberation and time-varying interference.
Relative performance was unchanged by source motion; the CNN generalized to new speakers and to new instances of interference from unseen angles.
The CNN mask is obtained from log-magnitude features
→ should work on other arrays (without re-training).
SRP-PHAT weighting does not fully remove the effects of interference
→ phase errors are not affected by (real-valued) masking.