A Formant Modification Method for Improved ASR of Children’s Speech

Research output: Journal article, Article, Scientific, peer-reviewed

Abstract

Differences in acoustic characteristics between children’s and adults’ speech degrade the performance of automatic speech recognition (ASR) systems when systems trained on adults’ speech are used to recognize children’s speech. This degradation is due to the acoustic mismatch between training and testing. One of the main sources of the mismatch is the difference in vocal tract resonances (formant frequencies) between adult and child speakers. The present study aims to reduce the mismatch in formant frequencies by modifying the formants of children’s speech to correspond better to those of adults’ speech. This is done by warping the linear prediction (LP) spectrum computed from children’s speech. The warped LP spectra, computed frame by frame, are used with the corresponding LP residuals to synthesize speech whose formant structure is more similar to that of adults’ speech. When applied at test time to an ASR system trained on adults’ speech, the warping reduces the spectral mismatch between training and testing and improves recognition of children’s speech. Experiments were conducted using narrowband (8 kHz) and wideband (16 kHz) speech of adult and child speakers from the WSJCAM0 and PFSTAR databases, respectively, by recognizing children’s speech with acoustic models trained on adults’ speech. For narrowband speech, the proposed method gave relative improvements of 24% and 11% for the DNN and TDNN acoustic models, respectively. For wideband speech, it gave relative improvements of 27% and 13% for the DNN and TDNN acoustic models, respectively. The proposed method was also compared with two speaker adaptation methods, vocal tract length normalization (VTLN) and speaking rate adaptation (SRA), and it showed the best recognition performance of the three. Combining the proposed method with VTLN and SRA gave a further reduction in WER. Moreover, experiments on noisy speech, using various types of additive noise and signal-to-noise ratios, showed that the proposed method also performs well for degraded speech.
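The analysis, warping, and resynthesis steps described in the abstract can be illustrated for a single frame with the minimal sketch below. It is not the authors' implementation: the abstract does not specify the warping function, so the sketch substitutes a simple scaling of the LP pole angles, and the function names, the LP order, and the factor `alpha` are all hypothetical. Framing, windowing, and overlap-add are omitted.

```python
import numpy as np


def lp_coefficients(frame, order):
    """Autocorrelation-method LP analysis; returns A(z) as [1, a1, ..., ap]."""
    n = len(frame)
    r = np.correlate(frame, frame, mode="full")[n - 1:n + order]
    R = np.array([[r[abs(i - j)] for j in range(order)] for i in range(order)])
    R = R + 1e-9 * r[0] * np.eye(order)  # small ridge for numerical safety
    a = np.linalg.solve(R, -r[1:order + 1])
    return np.concatenate(([1.0], a))


def warp_formants(a, alpha):
    """Assumed warping: scale the angle of each complex pole of 1/A(z).

    alpha < 1 moves resonances to lower frequencies. Pole radii are kept,
    so a stable filter stays stable. Real poles are left untouched.
    """
    poles = np.roots(a).astype(complex)
    is_complex = np.abs(poles.imag) > 1e-8
    warped = poles.copy()
    warped[is_complex] = np.abs(poles[is_complex]) * np.exp(
        1j * alpha * np.angle(poles[is_complex])
    )
    return np.real(np.poly(warped))


def all_pole_filter(excitation, a):
    """Synthesis filter 1/A(z) with zero initial state."""
    y = np.zeros(len(excitation))
    for n in range(len(excitation)):
        acc = excitation[n]
        for k in range(1, min(n, len(a) - 1) + 1):
            acc -= a[k] * y[n - k]
        y[n] = acc
    return y


def modify_frame(frame, order=10, alpha=0.9):
    """LP analysis, inverse filtering, pole warping, and resynthesis."""
    a = lp_coefficients(frame, order)
    residual = np.convolve(frame, a)[:len(frame)]  # LP residual via A(z)
    return all_pole_filter(residual, warp_formants(a, alpha))
```

With `alpha = 1.0` the frame is reconstructed unchanged, which is a convenient sanity check. A full system would apply `modify_frame` to overlapping windowed frames and tune the warping per speaker or per corpus, details the abstract leaves open.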
Original language: English
Pages: 98-106
Number of pages: 8
Journal: Speech Communication
Volume: 136
Status: Published - January 2022
MoE publication type: A1 Journal article, refereed
