Reducing mismatch in training of DNN-based glottal excitation models in a statistical parametric text-to-speech system

Lauri Juvela, Bajibabu Bollepalli, Junichi Yamagishi, Paavo Alku

Research output: Chapter in Book/Report/Conference proceeding › Conference article in proceedings › Scientific › peer-review

Abstract

Neural network-based models that generate glottal excitation waveforms from acoustic features have been found to give improved quality in statistical parametric speech synthesis. Until now, however, these models have been trained separately from the acoustic model. This creates a mismatch between training and synthesis, as the synthesized acoustic features used as the excitation model input differ from the original inputs on which the model was trained. Furthermore, due to errors in predicting the vocal tract filter, the original excitation waveforms do not provide perfect reconstruction of the speech waveform even if predicted without error. To address these issues and to make the excitation model more robust against errors in acoustic modeling, this paper proposes two modifications to the excitation model training scheme. First, the excitation model is trained in a connected manner, with inputs generated by the acoustic model. Second, the target glottal waveforms are re-estimated by performing glottal inverse filtering with the predicted vocal tract filters. The results show that both of these modifications improve performance as measured by MSE and MFCC distortion, and slightly improve the subjective quality of the synthetic speech.
Original language: English
Title of host publication: Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
Publisher: International Speech Communication Association (ISCA)
Pages: 1368-1372
Number of pages: 5
Volume: 2017-August
ISBN (Print): 978-1-5108-4876-4
Publication status: Published - Aug 2017
MoE publication type: A4 Conference publication
Event: Interspeech - Stockholm, Sweden
Duration: 20 Aug 2017 to 24 Aug 2017
Conference number: 18
http://www.interspeech2017.org/

Publication series

Name: Interspeech: Annual Conference of the International Speech Communication Association
ISSN (Electronic): 1990-9772

Conference

Conference: Interspeech
Country/Territory: Sweden
City: Stockholm
Period: 20/08/2017 to 24/08/2017

Keywords

  • statistical parametric speech synthesis
  • excitation modeling
