Abstract
Differences in acoustic characteristics between children’s and adults’ speech degrade the performance of automatic speech recognition (ASR) systems when systems trained using adults’ speech are used to recognize children’s speech. This performance degradation is due to the acoustic mismatch between training and testing. One of the main sources of the acoustic mismatch is the difference in vocal tract resonances (formant frequencies) between adult and child speakers. The present study aims to reduce the mismatch in formant frequencies by modifying the formants of children’s speech to better correspond to the formants of adults’ speech. This is carried out by warping the linear prediction (LP) spectrum computed from children’s speech. The warped LP spectra, computed in a frame-based manner from children’s speech, are used with the corresponding LP residuals to synthesize speech whose formant structure is more similar to that of adults’ speech. When used in testing of an ASR system trained using adults’ speech, the warping reduces the spectral mismatch between training and testing and improves the system performance in recognition of children’s speech. Experiments were conducted using narrowband (8 kHz) and wideband (16 kHz) speech of adult and child speakers from the WSJCAM0 and PFSTAR databases, respectively, and by recognizing children’s speech using acoustic models trained with adults’ speech. For narrowband speech, the proposed method gave relative improvements of 24% and 11% for the DNN and TDNN acoustic models, respectively. For wideband speech, the technique gave relative improvements of 27% and 13% for the DNN and TDNN acoustic models, respectively. The performance of the proposed method was also compared to two speaker adaptation methods: vocal tract length normalization (VTLN) and speaking rate adaptation (SRA). This comparison showed the best recognition performance for the proposed method. We also combined the proposed method with VTLN and SRA, and found that the combined method gave a further reduction in word error rate (WER). Moreover, our experiments carried out for noisy speech using various types of additive noise and signal-to-noise ratios showed that the proposed method also performs well for degraded speech.
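The frame-level processing described in the abstract (LP analysis, warping of the LP spectrum, and resynthesis from the LP residual) can be illustrated with a minimal Python sketch. The abstract does not specify the warping function, so this sketch substitutes a simple stand-in: the angle of each complex pole of the LP filter is scaled by a factor `alpha`. The function names (`lp_coefficients`, `warp_lp`, `modify_formants`) and the parameters `order` and `alpha` are hypothetical and chosen for illustration only; this is not the authors' implementation.

```python
import numpy as np
from scipy.linalg import solve_toeplitz
from scipy.signal import lfilter


def lp_coefficients(frame, order):
    """Autocorrelation-method LP analysis; returns [1, a1, ..., ap] of A(z)."""
    windowed = frame * np.hamming(len(frame))
    full = np.correlate(windowed, windowed, mode="full")
    r = full[len(frame) - 1:len(frame) + order].copy()  # lags 0..order
    r[0] = r[0] * (1.0 + 1e-9) + 1e-12  # slight regularisation (assumption)
    # Solve the symmetric Toeplitz normal equations R a = -r
    tail = solve_toeplitz(r[:order], -r[1:order + 1])
    return np.concatenate(([1.0], tail))


def warp_lp(a, alpha):
    """Stand-in warping (assumption): scale the angle of each complex pole
    of 1/A(z) by alpha, keeping pole radii. Real poles are left unchanged."""
    poles = np.roots(a).astype(complex)
    paired = np.abs(poles.imag) > 1e-8  # complex-conjugate pairs only
    angles = np.clip(np.angle(poles[paired]) * alpha, -np.pi, np.pi)
    poles[paired] = np.abs(poles[paired]) * np.exp(1j * angles)
    return np.real(np.poly(poles))


def modify_formants(frame, order=12, alpha=0.85):
    """Process one frame: inverse-filter to get the LP residual, then
    resynthesize the residual through the warped all-pole filter."""
    a = lp_coefficients(frame, order)
    residual = lfilter(a, [1.0], frame)         # e[n] = A(z) x[n]
    a_warped = warp_lp(a, alpha)
    return lfilter([1.0], a_warped, residual)   # y[n] = e[n] / A_warped(z)
```

Because children's formants are generally higher than adults', a value of `alpha` below 1 moves resonances downward in this sketch. A complete system would also need framing with overlap-add and a principled choice of the warping factor, both of which are omitted here.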
Original language | English |
---|---|
Pages (from-to) | 98-106 |
Number of pages | 8 |
Journal | Speech Communication |
Volume | 136 |
Publication status | Published - Jan 2022 |
MoE publication type | A1 Journal article-refereed |
Keywords
- Children’s speech recognition
- formant modification
- noise
- DNN
- TDNN
Fingerprint
Dive into the research topics of 'A Formant Modification Method for Improved ASR of Children’s Speech'. Together they form a unique fingerprint.
Projects
- 2 Finished
- HEART: Speech-based biomarking of heart failure
Alku, P. (Principal investigator), Javanmardi, F. (Project Member), Mittapalle, K. (Project Member), Tirronen, S. (Project Member), Pohjalainen, H. (Project Member), Kodali, M. (Project Member), Yagnavajjula, M. (Project Member) & Kadiri, S. (Project Member)
01/09/2020 → 31/08/2024
Project: Academy of Finland: Other research funding
- Movie Making Finland: Finnish fiction films as audiovisual big data, 1907-2017
Kurimo, M. (Principal investigator), Virkkunen, A. (Project Member), Moisio, A. (Project Member), Porjazovski, D. (Project Member) & Kathania, H. (Project Member)
01/01/2020 → 31/12/2022
Project: Academy of Finland: Other research funding