Abstract
This technical report gives an overview of our submission to Task 3 of the DCASE 2024 challenge. We present a sound event localization and detection (SELD) system whose input features are based on trainable neural generalized cross-correlations with phase transform (NGCC-PHAT). Using these features together with spectrograms as input to a Transformer-based network, we achieve significant improvements over the baseline method. We also present an audio-visual version of our system, in which distance predictions are updated using depth maps from the panorama video frames.
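For readers unfamiliar with the underlying feature, the following is a minimal sketch of classical GCC-PHAT between two microphone signals. It is not the trainable NGCC-PHAT variant used in the report, which learns its filtering from data. The function name `gcc_phat`, the `max_lag` parameter, and the small regularization constant are assumptions made for illustration only.

```python
import numpy as np


def gcc_phat(x, y, max_lag):
    """Classical GCC-PHAT between two 1-D signals (illustrative sketch).

    Returns (lags, cc), where cc[i] is the correlation at lag lags[i].
    A peak at a positive lag means x is delayed relative to y.
    Assumes max_lag >= 1.
    """
    n = len(x) + len(y)              # zero-pad to avoid circular wrap-around
    X = np.fft.rfft(x, n=n)
    Y = np.fft.rfft(y, n=n)
    cross = X * np.conj(Y)           # cross-power spectrum
    cross /= np.abs(cross) + 1e-12   # phase transform: discard magnitude
    cc = np.fft.irfft(cross, n=n)
    # Reorder so that lags run from -max_lag to +max_lag.
    cc = np.concatenate((cc[-max_lag:], cc[:max_lag + 1]))
    lags = np.arange(-max_lag, max_lag + 1)
    return lags, cc


# Usage: y is white noise, x is the same signal delayed by 5 samples.
rng = np.random.default_rng(0)
y = rng.standard_normal(1024)
x = np.concatenate((np.zeros(5), y[:-5]))
lags, cc = gcc_phat(x, y, max_lag=20)
estimated_delay = lags[np.argmax(cc)]
```

In a SELD front end, correlation vectors like `cc` would typically be computed per frame and per microphone pair and stacked alongside the spectrogram channels; how the report combines them is described in the report itself, not here.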
Original language | English |
---|---|
Status | Published - 30 June 2024 |
Subject classification (UKÄ)
- Computer Vision and Robotics (Autonomous Systems)