AVID: A speech database for machine learning studies on vocal intensity

Paavo Alku, Manila Kodali*, Laura Laaksonen, Sudarsana Kadiri

*Corresponding author for this work

Research output: Contribution to journal › Article › Scientific › peer-review

Abstract

Vocal intensity, which is typically quantified by the sound pressure level (SPL), is a key feature of speech. To measure SPL from speech recordings, a standard calibration tone (with a reference SPL of 94 dB or 114 dB) needs to be recorded together with the speech. However, most of the popular databases used in areas such as speech and speaker recognition have been recorded without calibration information, so the speech is expressed on arbitrary amplitude scales. Information about the vocal intensity of the recorded speech, including SPL, is therefore lost. In the current study, we introduce a new open and calibrated speech/electroglottography (EGG) database named Aalto Vocal Intensity Database (AVID). AVID includes speech and EGG produced by 50 speakers (25 males, 25 females) who varied their vocal intensity in four categories (soft, normal, loud and very loud). Recordings were conducted using a constant mouth-to-microphone distance and by recording a calibration tone. The speech data was labelled sentence-wise using a total of 19 labels that support the use of the data in machine learning (ML)-based studies of vocal intensity based on supervised learning. To demonstrate how the AVID data can be used to study vocal intensity, we investigated one multi-class classification task (classification of speech into soft, normal, loud and very loud intensity classes) and one regression task (prediction of the SPL of speech). In both tasks, we deliberately warped the level of the input speech by normalising the signal to have a maximum amplitude of 1.0; that is, we simulated a scenario that is prevalent in current speech databases. The results show that using the spectrogram feature with the support vector machine classifier gave an accuracy of 82% in the multi-class classification of the vocal intensity category. In the prediction of SPL, using the spectrogram feature with the support vector regressor gave a mean absolute error of about 2 dB and a coefficient of determination of 92%. We welcome researchers interested in classification and regression problems to utilise AVID in the study of vocal intensity, and we hope that the current results can serve as baselines for future ML studies on the topic.
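
As an illustration of the calibration idea described in the abstract, the following minimal sketch estimates SPL from the RMS ratio between a signal and a calibration tone recorded with the same gain, and then shows that peak normalisation to 1.0 removes the level difference between a soft and a loud signal. The function names (`rms`, `spl_db`, `peak_normalise`) and the synthetic sine signals are assumptions made for this example; they are not taken from AVID or from the authors' processing pipeline.

```python
import numpy as np


def rms(x):
    """Root-mean-square amplitude of a signal."""
    return float(np.sqrt(np.mean(np.square(x))))


def spl_db(signal, cal_tone, cal_spl_db=94.0):
    """Estimate SPL in dB from a calibration tone of known SPL.

    Assumes the signal and the tone were recorded with the same gain
    and microphone position, so their RMS ratio reflects the level ratio.
    """
    return cal_spl_db + 20.0 * np.log10(rms(signal) / rms(cal_tone))


def peak_normalise(x):
    """Scale a signal so that its maximum absolute amplitude is 1.0."""
    return x / np.max(np.abs(x))


# Synthetic stand-ins (hypothetical data, not AVID recordings).
fs = 16000
t = np.arange(fs) / fs
cal_tone = 0.5 * np.sin(2 * np.pi * 1000 * t)  # plays the role of the 94 dB tone
soft = 0.05 * np.sin(2 * np.pi * 150 * t)      # 20 dB below the tone
loud = 0.5 * np.sin(2 * np.pi * 150 * t)       # same level as the tone

soft_spl = spl_db(soft, cal_tone)
loud_spl = spl_db(loud, cal_tone)

# After peak normalisation both signals have the same RMS, so the
# 20 dB level difference can no longer be read from the amplitude.
soft_norm_rms = rms(peak_normalise(soft))
loud_norm_rms = rms(peak_normalise(loud))
```

In this toy case the two normalised signals become identical, which is the situation the classification and regression tasks simulate: any remaining cue to vocal intensity has to come from the shape of the signal (for example its spectrum) rather than from its amplitude.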
Original language: English
Article number: 103039
Number of pages: 11
Journal: Speech Communication
Volume: 157
Publication status: Published - Feb 2024
MoE publication type: A1 Journal article-refereed

Keywords

  • Vocal intensity
  • convolutional neural network
  • machine learning
  • sound pressure level
  • speech database
  • support vector machine
