In this paper we study two major challenges in few-shot bioacoustic event detection: variable event lengths and false positives. We use prototypical networks in which the embedding function is trained with a multi-label sound event detection model on the provided training dataset, rather than with episodic training as the proxy task. This is motivated by polyphonic sound events being present in the base training data. We propose a method to choose the embedding function based on the average event length of the few-shot examples and show that this makes the method more robust to variable event lengths. Further, we show that an ensemble of prototypical neural networks trained on different training and validation splits of time-frequency images with different loudness normalizations reduces false positives. In addition, we present an analysis of the effect that the studied loudness normalization techniques have on the performance of the prototypical network ensemble. Overall, per-channel energy normalization (PCEN) outperforms the standard log transform for this task. The method uses no data augmentation and no external data. The proposed approach achieves an F-score of 48.0% when evaluated on the hidden test set of the Detection and Classification of Acoustic Scenes and Events (DCASE) task 5.
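To make the comparison between the two loudness normalizations concrete, the following is a minimal NumPy sketch of PCEN in its commonly published form next to a plain log transform. It is not the implementation used in the paper: the function names, the parameter defaults (`s`, `alpha`, `delta`, `r`, `eps`), and the choice to initialise the smoother with the first frame are all illustrative assumptions.

```python
import numpy as np


def pcen(E, s=0.025, alpha=0.98, delta=2.0, r=0.5, eps=1e-6):
    """Per-channel energy normalization of a non-negative spectrogram.

    E has shape (n_frames, n_bands). All defaults are illustrative
    assumptions, not values taken from the paper.
    """
    E = np.asarray(E, dtype=float)
    # First-order IIR smoother along time, computed per frequency band:
    # M[t] = (1 - s) * M[t - 1] + s * E[t]
    M = np.empty_like(E)
    M[0] = E[0]  # assumption: initialise the smoother with the first frame
    for t in range(1, E.shape[0]):
        M[t] = (1.0 - s) * M[t - 1] + s * E[t]
    # Adaptive gain control followed by root compression.
    return (E / (eps + M) ** alpha + delta) ** r - delta ** r


def log_transform(E, eps=1e-6):
    """Standard log compression, shown for comparison."""
    return np.log(np.asarray(E, dtype=float) + eps)
```

In this sketch, scaling the input energy by a constant gain shifts the log-transformed output by the log of that gain, whereas the PCEN output changes only slightly because each band is divided by its own smoothed energy.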
|Title of host publication||Proceedings of the 7th Workshop on Detection and Classification of Acoustic Scenes and Events (DCASE 2022)|
|Status||Published - 2022|
|Event||7th Workshop on Detection and Classification of Acoustic Scenes and Events (DCASE 2022) - Nancy, France|
Duration: 3 Nov 2022 → 4 Nov 2022
|Conference||7th Workshop on Detection and Classification of Acoustic Scenes and Events (DCASE 2022)|
|Period||2022/11/03 → 2022/11/04|
- Probability Theory and Statistics