Classification of vocal intensity category from speech using the wav2vec2 and whisper embeddings

Manila Kodali, Sudarsana Kadiri, Paavo Alku

Research output: Chapter in Book/Report/Conference proceeding › Conference article in proceedings › Scientific › peer-review

4 Citations (Scopus)
136 Downloads (Pure)

Abstract

In speech communication, talkers regulate vocal intensity, resulting in speech signals of different intensity categories (e.g., soft, loud). Intensity category carries important information about the speaker's health and emotions. However, many speech databases lack calibration information, and therefore sound pressure level cannot be measured from the recorded data. Machine learning can nevertheless be used to classify intensity category even when calibration information is not available. This study investigates pre-trained model embeddings (Wav2vec2 and Whisper) in the classification of vocal intensity category (soft, normal, loud, and very loud) from speech signals expressed using arbitrary amplitude scales. We use a new database consisting of two speaking tasks (sentence and paragraph). A support vector machine is used as the classifier. Our results show that the pre-trained model embeddings outperformed three baseline features, providing improvements of up to 7% (absolute) in accuracy.
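
The calibration problem mentioned in the abstract can be illustrated with a small sketch. It is not taken from the paper: the sine signals, amplitudes, and gain values below are hypothetical stand-ins chosen only to show that an unknown recording gain shifts the digital level by a constant number of decibels, so the level of an uncalibrated recording does not by itself reveal the intensity category.

```python
import math


def rms_dbfs(samples):
    """RMS level in dB relative to a digital full scale of 1.0."""
    mean_square = sum(s * s for s in samples) / len(samples)
    return 10.0 * math.log10(mean_square)


def sine(amplitude, n=1600, freq_hz=200.0, sample_rate=16000):
    """Synthetic stand-in for a voiced segment (hypothetical test signal)."""
    return [
        amplitude * math.sin(2.0 * math.pi * freq_hz * i / sample_rate)
        for i in range(n)
    ]


def apply_gain(samples, gain):
    """Model an unknown recording gain as a constant scale factor."""
    return [gain * s for s in samples]


# Hypothetical "soft" and "loud" productions at the microphone.
soft = sine(0.05)
loud = sine(0.40)

# Assumed (unknown in practice) gains applied by two recording setups.
soft_recorded = apply_gain(soft, 10.0)
loud_recorded = apply_gain(loud, 0.5)

# A gain g shifts the level by 20 * log10(g) dB, independent of the signal.
level_shift = rms_dbfs(soft_recorded) - rms_dbfs(soft)

# After recording, the soft production has the higher digital level.
soft_looks_louder = rms_dbfs(soft_recorded) > rms_dbfs(loud_recorded)
```

In this toy setting the recorded level alone would rank the two productions in the wrong order, which is one way to see why the study relies on learned representations rather than on measured level.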

Original language: English
Title of host publication: Proceedings of Interspeech'23
Publisher: International Speech Communication Association (ISCA)
Pages: 4134-4138
Number of pages: 5
Volume: 2023-August
Publication status: Published - 2023
MoE publication type: A4 Conference publication
Event: Interspeech - Dublin, Ireland
Duration: 20 Aug 2023 → 24 Aug 2023

Publication series

Name: Interspeech
Publisher: International Speech Communication Association
ISSN (Print): 1990-9772
ISSN (Electronic): 2308-457X

Conference

Conference: Interspeech
Country/Territory: Ireland
City: Dublin
Period: 20/08/2023 → 24/08/2023
