A Formant Modification Method for Improved ASR of Children’s Speech

Research output: Contribution to journal › Article › Scientific › peer-reviewed


Differences in acoustic characteristics between children’s and adults’ speech
degrade performance of automatic speech recognition systems when systems
trained using adults’ speech are used to recognize children’s speech. This
performance degradation is due to the acoustic mismatch between training and
testing. One of the main sources of the acoustic mismatch is the difference
in vocal tract resonances (formant frequencies) between adult and child
speakers. The present study aims to reduce the mismatch in formant frequencies
by modifying formants of children’s speech to better correspond to formants of
adults’ speech. This is carried out by warping the linear prediction (LP)
spectrum computed from children’s speech. The warped LP spectra computed in
a frame-based manner from children’s speech are used with the corresponding
LP residuals to synthesize speech whose formant structure is more similar to
that of adults’ speech. When used in testing of an ASR system trained using
adults’ speech, the warping reduces the spectral mismatch in speech between
training and testing and improves the system performance in recognition of
children’s speech. Experiments were conducted using narrowband (8 kHz) and
wideband (16 kHz) speech of adult and child speakers from the WSJCAM0 and
PFSTAR databases, respectively, and by recognizing children’s speech using
acoustic models trained with adults’ speech. The proposed method gave relative
improvements of 24% and 11% for the DNN and TDNN acoustic models,
respectively, for narrowband speech. For wideband speech, the technique gave
relative improvements of 27% and 13% for the DNN and TDNN acoustic models,
respectively. The performance of the proposed method was also compared
to two speaker adaptation methods: vocal tract length normalization (VTLN)
and speaking rate adaptation (SRA). This comparison showed the best
recognition performance for the proposed method. We also combined the proposed
method with VTLN and SRA, and found that the combined method gave a
further reduction in WER. Moreover, our experiments carried out for noisy speech
using various types of additive noise and signal-to-noise ratios showed that the
proposed method also performs well for degraded speech.
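
The analysis, warping and resynthesis steps described in the abstract can be illustrated with a minimal single-frame sketch. It is not the authors' implementation: the abstract does not state the warping function, the LP order, or the framing and overlap-add scheme, so the sketch assumes a simple warp that scales the angle of each complex LP pole by a factor `alpha` (values below 1 lower the formants) while keeping its radius. The function names, `order=10` and `alpha=0.85` are hypothetical choices made only for illustration.

```python
import numpy as np
from scipy.linalg import solve_toeplitz
from scipy.signal import lfilter


def lp_coefficients(frame, order):
    """Autocorrelation-method LP polynomial A(z) = [1, a1, ..., ap]."""
    frame = np.asarray(frame, dtype=float)
    n = len(frame)
    # Autocorrelation at lags 0..order.
    r = np.correlate(frame, frame, mode="full")[n - 1:n + order]
    # Tiny regularisation so near-silent frames stay solvable.
    r[0] = r[0] * (1.0 + 1e-9) + 1e-12
    # Solve the symmetric Toeplitz normal equations R a = -r[1:].
    a = solve_toeplitz(r[:order], -r[1:order + 1])
    return np.concatenate(([1.0], a))


def warp_formants(a, alpha):
    """Scale the angles of complex LP poles by alpha (assumed warp).

    Radii are kept, so a stable filter stays stable. Real poles are
    left untouched, which keeps the warped polynomial real-valued.
    """
    poles = np.roots(a).astype(complex)
    is_complex = np.abs(poles.imag) > 1e-8
    radii = np.abs(poles[is_complex])
    angles = np.angle(poles[is_complex])
    poles[is_complex] = radii * np.exp(1j * alpha * angles)
    return np.real(np.poly(poles))


def modify_frame(frame, order=10, alpha=0.85):
    """Inverse-filter one frame, then resynthesize it with warped LP."""
    frame = np.asarray(frame, dtype=float)
    a = lp_coefficients(frame, order)
    residual = lfilter(a, [1.0], frame)       # LP residual e[n]
    a_warped = warp_formants(a, alpha)        # modified spectral envelope
    return lfilter([1.0], a_warped, residual)  # synthesis filter 1/A'(z)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    frame = rng.standard_normal(400)          # stand-in for a speech frame
    out = modify_frame(frame)
    print(out.shape)
```

A full system would apply this to windowed, overlapping frames and recombine them, for example by overlap-add. With `alpha=1.0` the sketch returns the input frame up to numerical error, which is a convenient sanity check.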
Pages: 98-106
Journal: Speech Communication
Status: Published - January 2022
Ministry of Education and Culture (OKM) publication type: A1 Published article, refereed

