TY - GEN
T1 - Classification of vocal intensity category from speech using the Wav2vec2 and Whisper embeddings
AU - Kodali, Manila
AU - Kadiri, Sudarsana
AU - Alku, Paavo
PY - 2023
Y1 - 2023
N2 - In speech communication, talkers regulate vocal intensity, resulting in speech signals of different intensity categories (e.g., soft, loud). Intensity category carries important information about the speaker's health and emotions. However, many speech databases lack calibration information, and therefore sound pressure level cannot be measured from the recorded data. Machine learning can nevertheless be used for intensity category classification even when calibration information is not available. This study investigates pre-trained model embeddings (Wav2vec2 and Whisper) in the classification of vocal intensity category (soft, normal, loud, and very loud) from speech signals expressed on arbitrary amplitude scales. We use a new database consisting of two speaking tasks (sentence and paragraph). A support vector machine is used as the classifier. Our results show that the pre-trained model embeddings outperformed three baseline features, providing improvements of up to 7% (absolute) in accuracy.
UR - http://www.scopus.com/inward/record.url?scp=85171597337&partnerID=8YFLogxK
DO - 10.21437/Interspeech.2023-2038
M3 - Conference article in proceedings
VL - 2023-August
T3 - Interspeech
SP - 4134
EP - 4138
BT - Proceedings of Interspeech'23
PB - International Speech Communication Association (ISCA)
T2 - Interspeech
Y2 - 20 August 2023 through 24 August 2023
ER -