TY - GEN
T1 - Learning Multi-Target TDOA Features for Sound Event Localization and Detection
AU - Berg, Axel
AU - Engman, Johanna
AU - Gulin, Jens
AU - Åström, Kalle
AU - Oskarsson, Magnus
PY - 2024
Y1 - 2024
N2 - Sound event localization and detection (SELD) systems using audio recordings from a microphone array rely on spatial cues for determining the location of sound events. As a consequence, the localization performance of such systems is to a large extent determined by the quality of the audio features that are used as inputs to the system. We propose a new feature, based on neural generalized cross-correlations with phase-transform (NGCC-PHAT), that learns audio representations suitable for localization. Using permutation invariant training for the time-difference of arrival (TDOA) estimation problem enables NGCC-PHAT to learn TDOA features for multiple overlapping sound events. These features can be used as a drop-in replacement for GCC-PHAT inputs to a SELD-network. We test our method on the STARSS23 dataset and demonstrate improved localization performance compared to using standard GCC-PHAT or SALSA-Lite input features.
KW - sound event localization and detection
KW - time difference of arrival
KW - generalized cross-correlation
M3 - Paper in conference proceeding
SP - 16
EP - 20
BT - Proceedings of the Detection and Classification of Acoustic Scenes and Events 2024 Workshop (DCASE2024)
PB - Zenodo
T2 - Workshop on Detection and Classification of Acoustic Scenes and Events, DCASE 2024
Y2 - 23 October 2024 through 25 October 2024
ER -