The Aalto system based on fine-tuned AudioSet features for DCASE2018 task2 - general purpose audio tagging

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › Scientific › peer-review




In this paper, we present a neural network system for DCASE 2018 Task 2, general-purpose audio tagging. We fine-tuned the Google AudioSet feature generation model with different settings for the given 41 classes on top of a fully connected layer with 100 units. We then used the fine-tuned models to generate 128-dimensional features for each 0.960 s audio segment. We tried different neural network structures, including LSTM and multi-level attention models; in our experiments, the multi-level attention model outperformed the others. Our experiments also use silence truncation, repeating and splitting the audio into fixed-length segments, pitch-shifting augmentation, and mixup. The proposed system achieved a MAP@3 score of 0.936, which outperforms the baseline result of 0.704 and places in the top 8% of the public leaderboard.
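As a rough illustration of two ideas named in the abstract, the sketch below computes MAP@3 and applies mixup to a batch of 128-dimensional features. It is not the authors' code. It assumes each clip has exactly one true label (so average precision reduces to 1/rank of the correct class within the top three guesses), and the function names, the `alpha=0.2` default, and the array shapes are hypothetical choices made for the example.

```python
import numpy as np


def map_at_3(scores, labels):
    """MAP@3 under the assumption of one true label per clip.

    scores: (n_clips, n_classes) array of class scores.
    labels: (n_clips,) array of integer class indices.
    """
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels)
    # Indices of the three highest-scoring classes per clip, best first.
    top3 = np.argsort(-scores, axis=1)[:, :3]
    # True where a top-3 guess matches the clip's label.
    hits = top3 == labels[:, None]
    # A hit at rank k (1-based) earns 1/k; a miss earns 0.
    weights = 1.0 / np.arange(1, top3.shape[1] + 1)
    return float((hits * weights).sum(axis=1).mean())


def mixup(x, y_onehot, alpha=0.2, rng=None):
    """Blend each example with a randomly chosen partner from the batch.

    alpha is a hypothetical default; the abstract does not state a value.
    """
    x = np.asarray(x, dtype=float)
    y_onehot = np.asarray(y_onehot, dtype=float)
    rng = np.random.default_rng() if rng is None else rng
    lam = rng.beta(alpha, alpha)      # mixing coefficient in [0, 1]
    perm = rng.permutation(len(x))    # partner index for each example
    x_mix = lam * x + (1.0 - lam) * x[perm]
    y_mix = lam * y_onehot + (1.0 - lam) * y_onehot[perm]
    return x_mix, y_mix


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    feats = rng.normal(size=(4, 128))          # four 128-dimensional feature vectors
    targets = np.eye(41)[[0, 5, 5, 40]]        # one-hot labels over 41 classes
    mixed_x, mixed_y = mixup(feats, targets, rng=rng)
    print(mixed_x.shape, mixed_y.sum(axis=1))  # label rows still sum to 1
    print(map_at_3(targets, [0, 5, 5, 40]))
```

Because mixup blends the one-hot targets with the same coefficient as the inputs, each mixed label row still sums to one and can be used directly with a cross-entropy loss.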


Original language: English
Title of host publication: Proceedings of the Detection and Classification of Acoustic Scenes and Events 2018 Workshop (DCASE2018)
Publication status: Published - Nov 2018
MoE publication type: A4 Article in a conference publication
Event: Detection and Classification of Acoustic Scenes and Events - Surrey, United Kingdom
Duration: 19 Nov 2018 → 20 Nov 2018


Workshop: Detection and Classification of Acoustic Scenes and Events
Abbreviated title: DCASE
Country: United Kingdom


ID: 30233108