A Formant Modification Method for Improved ASR of Children’s Speech

Hemant Kathania, Sudarsana Kadiri, Paavo Alku, Mikko Kurimo

    Research output: Contribution to journal › Article › Scientific › peer-review

    21 Citations (Scopus)
    200 Downloads (Pure)

    Abstract

    Differences in acoustic characteristics between children’s and adults’ speech degrade the performance of automatic speech recognition (ASR) systems when systems trained using adults’ speech are used to recognize children’s speech. This performance degradation is due to the acoustic mismatch between training and testing. One of the main sources of the acoustic mismatch is the difference in vocal tract resonances (formant frequencies) between adult and child speakers. The present study aims to reduce the mismatch in formant frequencies by modifying the formants of children’s speech to better correspond to the formants of adults’ speech. This is carried out by warping the linear prediction (LP) spectrum computed from children’s speech. The warped LP spectra, computed in a frame-based manner from children’s speech, are used with the corresponding LP residuals to synthesize speech whose formant structure is more similar to that of adults’ speech. When used in testing of an ASR system trained using adults’ speech, the warping reduces the spectral mismatch between training and testing and improves the system’s performance in recognizing children’s speech. Experiments were conducted using narrowband (8 kHz) and wideband (16 kHz) speech of adult and child speakers from the WSJCAM0 and PFSTAR databases, respectively, and by recognizing children’s speech using acoustic models trained with adults’ speech. For narrowband speech, the proposed method gave relative improvements of 24% and 11% for the DNN and TDNN acoustic models, respectively. For wideband speech, it gave relative improvements of 27% and 13% for the DNN and TDNN acoustic models, respectively. The performance of the proposed method was also compared to two speaker adaptation methods: vocal tract length normalization (VTLN) and speaking rate adaptation (SRA). In this comparison, the proposed method gave the best recognition performance. We also combined the proposed method with VTLN and SRA, and found that the combined method gave a further reduction in word error rate (WER). Moreover, our experiments on noisy speech, using various types of additive noise and signal-to-noise ratios, showed that the proposed method also performs well for degraded speech.
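
The abstract does not give the warping function itself, so the following is only a minimal sketch of the general idea under one simple stand-in assumption: the angle of each complex pole of the LP filter is scaled by a hypothetical factor `alpha` (values below 1 move formants down, toward adult-like values). The frame is inverse filtered with the original LP coefficients to obtain the residual, which is then passed through the warped all-pole filter. All function names, the toy pole positions, and `alpha = 0.9` are illustrative assumptions, not the authors' implementation or settings.

```python
import cmath


def poly_roots(coeffs, iters=300):
    """Roots of a monic polynomial (highest power first) via Durand-Kerner."""
    n = len(coeffs) - 1
    roots = [(0.4 + 0.9j) ** k for k in range(n)]  # customary distinct starting points
    for _ in range(iters):
        updated = []
        for i, z in enumerate(roots):
            value = 0j
            for c in coeffs:  # Horner evaluation of the polynomial at z
                value = value * z + c
            denom = 1 + 0j
            for j, w in enumerate(roots):
                if j != i:
                    denom *= z - w
            updated.append(z - value / denom)
        roots = updated
    return roots


def poly_from_roots(roots):
    """Real coefficients [1, a1, ..., ap] of prod(1 - r * z^-1)."""
    coeffs = [1 + 0j]
    for r in roots:
        coeffs = [c - r * prev for c, prev in zip(coeffs + [0j], [0j] + coeffs)]
    return [c.real for c in coeffs]


def warp_lp(a, alpha):
    """Scale the angle of every complex pole of 1/A(z) by alpha (radius kept).

    Assumes a[0] == 1. This pole-angle scaling is an illustrative stand-in,
    not necessarily the warping function used in the paper.
    """
    warped = []
    for pole in poly_roots(a):
        if abs(pole.imag) < 1e-9:  # real pole: carries no formant, leave as is
            warped.append(pole)
        else:
            warped.append(cmath.rect(abs(pole), alpha * cmath.phase(pole)))
    return poly_from_roots(warped)


def lp_residual(frame, a):
    """Inverse filter: e[n] = sum_k a[k] * s[n-k]."""
    return [sum(a[k] * frame[n - k] for k in range(len(a)) if n - k >= 0)
            for n in range(len(frame))]


def lp_synthesize(residual, a):
    """All-pole filter: s[n] = e[n] - sum_{k>=1} a[k] * s[n-k]."""
    out = []
    for n, e in enumerate(residual):
        out.append(e - sum(a[k] * out[n - k]
                           for k in range(1, len(a)) if n - k >= 0))
    return out


# Hypothetical frame-level LP model with two "formants" (conjugate pole pairs).
poles = [cmath.rect(0.95, 1.2), cmath.rect(0.95, -1.2),
         cmath.rect(0.90, 2.0), cmath.rect(0.90, -2.0)]
a_child = poly_from_roots(poles)
frame = lp_synthesize([1.0] + [0.0] * 159, a_child)  # toy frame: impulse response

a_warped = warp_lp(a_child, alpha=0.9)       # lower the formants
residual = lp_residual(frame, a_child)       # excitation of the original frame
modified = lp_synthesize(residual, a_warped)  # same excitation, warped envelope

# Pole angles (radians) of the warped model, upper half-plane only.
warped_angles = sorted(cmath.phase(p) for p in poly_roots(a_warped) if p.imag > 0)
```

A pole at angle ω corresponds to a frequency of ω·fs/(2π), so with `alpha = 0.9` the toy resonances at 1.2 and 2.0 rad move to 1.08 and 1.8 rad. In a real pipeline the LP coefficients would be estimated per frame (for example with the autocorrelation method) and the modified frames recombined by overlap-add; both steps are omitted here.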
    Original language: English
    Pages (from-to): 98-106
    Number of pages: 8
    Journal: Speech Communication
    Volume: 136
    DOIs
    Publication status: Published - Jan 2022
    MoE publication type: A1 Journal article-refereed

    Keywords

    • Children’s speech recognition
    • formant modification
    • noise
    • DNN
    • TDNN
