TY - JOUR
T1 - Creating speaker independent ASR system through prosody modification based data augmentation
AU - Shahnawazuddin, S.
AU - Adiga, Nagaraj
AU - Kathania, Hemant Kumar
AU - Sai, B. Tarun
PY - 2020/3/1
Y1 - 2020/3/1
AB - In this paper, the effect of prosody-modification-based data augmentation is explored in the context of automatic speech recognition (ASR). The primary motive is to develop ASR systems that are less affected by speaker-dependent acoustic variations. The two factors contributing to inter-speaker variability that this paper focuses on are pitch and speaking-rate variations. To simulate such an ASR task, we trained an ASR system on adults’ speech and tested it using speech data from adult as well as child speakers. Compared to the adults’ speech test case, the recognition rates are severely degraded when the test speech is from child speakers. The observed degradation is primarily due to large differences in pitch and speaking rate between adults’ and children’s speech. To overcome this problem, the pitch and speaking rate of the training speech are modified to create new versions of the data. The original and modified versions are then pooled together to capture greater acoustic variability. The ASR system trained on the augmented data is noted to be more robust to speaker-dependent variations. Relative improvements of 11.5% and 27.0% over the baseline are obtained on decoding the adults’ and children’s speech test sets, respectively.
UR - http://www.scopus.com/inward/record.url?scp=85077926895&partnerID=8YFLogxK
U2 - 10.1016/j.patrec.2019.12.019
DO - 10.1016/j.patrec.2019.12.019
M3 - Article
AN - SCOPUS:85077926895
SN - 0167-8655
VL - 131
SP - 213
EP - 218
JO - Pattern Recognition Letters
JF - Pattern Recognition Letters
ER -