Reducing mismatch in training of DNN-based glottal excitation models in a statistical parametric text-to-speech system

Lauri Juvela, Bajibabu Bollepalli, Junichi Yamagishi, Paavo Alku

Research output: Article in book/conference proceedings › Conference article in proceedings › Scientific › Peer-reviewed

5 Citations (Scopus)
178 Downloads (Pure)

Abstract

Neural network-based models that generate glottal excitation waveforms from acoustic features have been found to improve quality in statistical parametric speech synthesis. Until now, however, these models have been trained separately from the acoustic model. This creates a mismatch between training and synthesis, because the synthesized acoustic features used as excitation model input differ from the original inputs on which the model was trained. Furthermore, because of errors in predicting the vocal tract filter, the original excitation waveforms do not reconstruct the speech waveform perfectly, even if they are predicted without error. To address these issues and to make the excitation model more robust against errors in acoustic modeling, this paper proposes two modifications to the excitation model training scheme. First, the excitation model is trained in a connected manner, with inputs generated by the acoustic model. Second, the target glottal waveforms are re-estimated by performing glottal inverse filtering with the predicted vocal tract filters. The results show that both modifications improve performance as measured by MSE and MFCC distortion, and slightly improve the subjective quality of the synthetic speech.
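The second modification can be illustrated with a toy sketch. It is not the paper's implementation: it assumes a plain all-pole vocal tract filter, an impulse-train excitation, and made-up coefficient values (`true_a`, `predicted_a`), and all function names are hypothetical. It only shows why a target obtained by inverse filtering with the predicted filter reconstructs the waveform, while a target obtained with the original filter does not.

```python
def inverse_filter(signal, a):
    """Apply A(z) = 1 + a[0] z^-1 + a[1] z^-2 + ... to undo an all-pole filter."""
    out = []
    for n in range(len(signal)):
        acc = signal[n]
        for k, coef in enumerate(a, start=1):
            if n - k >= 0:
                acc += coef * signal[n - k]
        out.append(acc)
    return out


def synthesis_filter(excitation, a):
    """Apply the all-pole filter 1 / A(z) to an excitation signal."""
    out = []
    for n in range(len(excitation)):
        acc = excitation[n]
        for k, coef in enumerate(a, start=1):
            if n - k >= 0:
                acc -= coef * out[n - k]
        out.append(acc)
    return out


def mse(x, y):
    """Mean squared error between two equal-length sequences."""
    return sum((p - q) ** 2 for p, q in zip(x, y)) / len(x)


# Assumed values for illustration only (both filters are stable).
true_a = [-0.9, 0.5]       # stands in for the "original" vocal tract filter
predicted_a = [-0.8, 0.4]  # stands in for the acoustic model's prediction

# Toy "speech": an impulse train shaped by the original filter.
pulses = [1.0 if n % 20 == 0 else 0.0 for n in range(100)]
speech = synthesis_filter(pulses, true_a)

# Conventional target: excitation estimated with the original filter.
original_target = inverse_filter(speech, true_a)
# Re-estimated target: inverse filtering with the predicted filter.
reestimated_target = inverse_filter(speech, predicted_a)

# At synthesis time only the predicted filter is available.
mismatch_error = mse(synthesis_filter(original_target, predicted_a), speech)
matched_error = mse(synthesis_filter(reestimated_target, predicted_a), speech)
print("original target:", mismatch_error)
print("re-estimated target:", matched_error)
```

In this sketch the re-estimated target gives a reconstruction error that is zero up to floating-point rounding, whereas the original target leaves a residual error caused purely by the filter mismatch.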
Original language: English
Title of host publication: Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
Publisher: International Speech Communication Association (ISCA)
Pages: 1368-1372
Number of pages: 5
Volume: 2017-August
ISBN (print): 978-1-5108-4876-4
Status: Published - Aug 2017
OKM (Ministry of Education and Culture) publication type: A4 Article in conference proceedings
Event: Interspeech - Stockholm, Sweden
Duration: 20 Aug 2017 to 24 Aug 2017
Conference number: 18
http://www.interspeech2017.org/

Publication series

Name: Interspeech: Annual Conference of the International Speech Communication Association
ISSN (electronic): 1990-9772

Conference

Conference: Interspeech
Country/Territory: Sweden
City: Stockholm
Period: 20/08/2017 to 24/08/2017