Fine-tuning of pre-trained models for classification of vocal intensity category from speech signals

Manila Kodali, Sudarsana Kadiri, Paavo Alku

Research output: Chapter in Book/Report/Conference proceeding › Conference article in proceedings › Scientific › peer-review


Abstract

Speakers regulate vocal intensity on many occasions, for example to be heard over a long distance or to express vocal emotions. Humans can regulate vocal intensity over a wide sound pressure level (SPL) range, and speech can therefore be categorized into different vocal intensity categories. Recent machine learning experiments have studied the classification of vocal intensity category from speech signals that were recorded without SPL information and are represented on arbitrary amplitude scales. By fine-tuning four pre-trained models (wav2vec2-BASE, wav2vec2-LARGE, HuBERT, audio speech transformers), this paper studies the classification of speech into four intensity categories (soft, normal, loud, very loud) when speech is presented on such an arbitrary amplitude scale. The fine-tuned model embeddings showed absolute improvements in accuracy of 5% and 10-12% over the baselines for the target intensity category label and the SPL-based intensity category label, respectively.
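The phrase "arbitrary amplitude scale" can be illustrated with a small sketch. It is not taken from the paper: the synthetic 200 Hz tone, the gain values, and the helper names `rms_db` and `peak_normalize` are all assumptions made for illustration. The sketch shows that once recordings are rescaled (here by peak normalization), the level difference between a quiet and a loud version of a signal disappears, so a classifier working on such data cannot rely on signal level alone.

```python
import math


def rms_db(samples):
    """Root-mean-square level in dB relative to a full-scale amplitude of 1.0."""
    mean_square = sum(x * x for x in samples) / len(samples)
    return 10.0 * math.log10(mean_square)


def peak_normalize(samples):
    """Rescale a signal so that its largest absolute sample equals 1.0."""
    peak = max(abs(x) for x in samples)
    return [x / peak for x in samples]


# Hypothetical test signal: 0.1 s of a 200 Hz tone sampled at 16 kHz.
sample_rate = 16000
tone = [math.sin(2 * math.pi * 200 * n / sample_rate) for n in range(1600)]

# Two copies of the same waveform at different (assumed) recording gains.
soft = [0.05 * x for x in tone]
loud = [0.8 * x for x in tone]

# Before normalization the two copies differ in level by 20*log10(0.8/0.05) dB.
level_gap_db = rms_db(loud) - rms_db(soft)

# After peak normalization the level difference is gone.
normalized_gap_db = rms_db(peak_normalize(loud)) - rms_db(peak_normalize(soft))

print(f"level gap before normalization: {level_gap_db:.2f} dB")
print(f"level gap after normalization:  {normalized_gap_db:.2f} dB")
```

A pure tone is used only because its level is easy to reason about; real speech recordings would be loaded from audio files instead.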
Original language: English
Title of host publication: Interspeech 2024
Publisher: International Speech Communication Association (ISCA)
Pages: 482-486
Number of pages: 5
DOIs
Publication status: Published - 2024
MoE publication type: A4 Conference publication
Event: Interspeech - Kos Island, Greece
Duration: 1 Sept 2024 to 5 Sept 2024

Publication series

Name: Interspeech
Publisher: International Speech Communication Association
ISSN (Electronic): 2958-1796

Conference

Conference: Interspeech
Country/Territory: Greece
City: Kos Island
Period: 01/09/2024 to 05/09/2024

Keywords

  • speech
  • audio speech transformers
  • HuBERT
  • sound pressure level
  • vocal intensity
  • wav2vec2
