Advancing Audio Emotion and Intent Recognition with Large Pre-Trained Models and Bayesian Inference

Research output: Chapter in book/conference proceedings; Conference article in proceedings; Scientific; peer-reviewed



Large pre-trained models are essential in paralinguistic systems, demonstrating effectiveness in tasks like emotion recognition and stuttering detection. In this paper, we employ large pre-trained models for the ACM Multimedia Computational Paralinguistics Challenge, addressing the Requests and Emotion Share tasks. We explore audio-only and hybrid solutions leveraging audio and text modalities. Our empirical results consistently show the superiority of the hybrid approaches over the audio-only models. Moreover, we introduce a Bayesian layer as an alternative to the standard linear output layer. The multimodal fusion approach achieves an 85.4% UAR on HC-Requests and 60.2% on HC-Complaints. The ensemble model for the Emotion Share task yields the best 𝜌 value of .614. The Bayesian wav2vec2 approach, explored in this study, allows us to easily build ensembles, at the cost of fine-tuning only one model. Moreover, we can have usable confidence values instead of the usual overconfident posterior probabilities.
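The abstract does not spell out how the Bayesian output layer is parameterised, so the following is only a minimal sketch under stated assumptions: a mean-field Gaussian layer in which every weight has its own mean and standard deviation, fed by a fixed feature vector standing in for a pooled wav2vec2 embedding. The class name `BayesianLinear`, its parameters, and the toy numbers are all hypothetical. The sketch illustrates the two properties the abstract mentions: each sampled set of weights acts as one ensemble member drawn from a single trained model, and averaging the sampled softmax outputs gives softer confidence values than a single deterministic layer.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


class BayesianLinear:
    """Hypothetical mean-field Gaussian output layer (an assumption,
    not the paper's confirmed design): each weight and bias is a
    Gaussian with its own mean and standard deviation."""

    def __init__(self, w_mu, w_sigma, b_mu, b_sigma, seed=0):
        self.w_mu, self.w_sigma = w_mu, w_sigma
        self.b_mu, self.b_sigma = b_mu, b_sigma
        self.rng = random.Random(seed)

    def sample_logits(self, x):
        """Draw one set of weights and return the resulting logits."""
        logits = []
        for k in range(len(self.b_mu)):
            z = self.rng.gauss(self.b_mu[k], self.b_sigma[k])
            for j, xj in enumerate(x):
                w = self.rng.gauss(self.w_mu[k][j], self.w_sigma[k][j])
                z += w * xj
            logits.append(z)
        return logits

    def predict(self, x, n_samples=50):
        """Average softmax outputs over n_samples weight draws,
        i.e. an ensemble built from one set of trained parameters."""
        probs = [0.0] * len(self.b_mu)
        for _ in range(n_samples):
            p = softmax(self.sample_logits(x))
            probs = [a + b / n_samples for a, b in zip(probs, p)]
        return probs


# Toy two-class example; the feature vector is a placeholder for an
# utterance-level embedding, and all parameter values are made up.
features = [0.8, -0.3]
w_mu = [[1.0, 0.0], [-1.0, 0.5]]
w_sigma = [[0.5, 0.5], [0.5, 0.5]]
b_mu, b_sigma = [0.0, 0.0], [0.1, 0.1]

layer = BayesianLinear(w_mu, w_sigma, b_mu, b_sigma)
probs = layer.predict(features)
```

Setting every standard deviation to zero reduces the layer to an ordinary linear output layer, which makes the relationship between the two alternatives easy to check.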
Title of host publication: MM '23: Proceedings of the 31st ACM International Conference on Multimedia
ISBN (electronic): 979-8-4007-0108-5
Publication status: Published - 27 Oct 2023
MoEC publication type: A4 Article in conference proceedings
Event: ACM International Conference on Multimedia - Ottawa, Canada
Duration: 29 Oct 2023 to 29 Oct 2023
Conference number: 31



