Reducing mismatch in training of DNN-based glottal excitation models in a statistical parametric text-to-speech system

Research output: Chapter in Book/Report/Conference proceeding, Conference contribution, Scientific, peer-review

Research units

  • National Institute of Informatics
  • University of Edinburgh

Abstract

Neural network-based models that generate glottal excitation waveforms from acoustic features have been found to give improved quality in statistical parametric speech synthesis. Until now, however, these models have been trained separately from the acoustic model. This creates a mismatch between training and synthesis, because the synthesized acoustic features used as the excitation model input differ from the original inputs on which the model was trained. Furthermore, due to errors in predicting the vocal tract filter, the original excitation waveforms do not provide perfect reconstruction of the speech waveform even if they are predicted without error. To address these issues and to make the excitation model more robust against errors in acoustic modeling, this paper proposes two modifications to the excitation model training scheme. First, the excitation model is trained in a connected manner, with inputs generated by the acoustic model. Second, the target glottal waveforms are re-estimated by performing glottal inverse filtering with the predicted vocal tract filters. The results show that both of these modifications improve performance as measured by mean squared error (MSE) and MFCC distortion, and slightly improve the subjective quality of the synthetic speech.
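The second modification can be illustrated with a toy example. The sketch below is not the paper's implementation: it assumes a simple all-pole vocal tract filter, uses white noise as a stand-in for the glottal excitation, and all names and coefficient values (`a_true`, `a_pred`, `inverse_filter`, `synthesis_filter`) are hypothetical. It shows only that an excitation target re-estimated by inverse filtering with the predicted filter reconstructs the speech signal, whereas the original excitation does not once the filter contains prediction error.

```python
import random


def inverse_filter(signal, a):
    """Apply the FIR filter A(z) = 1 + a[0] z^-1 + a[1] z^-2 + ... to a signal."""
    out = []
    for n in range(len(signal)):
        acc = signal[n]
        for k, ak in enumerate(a, start=1):
            if n - k >= 0:
                acc += ak * signal[n - k]
        out.append(acc)
    return out


def synthesis_filter(excitation, a):
    """Apply the all-pole filter 1 / A(z) to an excitation signal."""
    out = []
    for n in range(len(excitation)):
        acc = excitation[n]
        for k, ak in enumerate(a, start=1):
            if n - k >= 0:
                acc -= ak * out[n - k]
        out.append(acc)
    return out


def mse(x, y):
    """Mean squared error between two equal-length sequences."""
    return sum((p - q) ** 2 for p, q in zip(x, y)) / len(x)


random.seed(0)

# Hypothetical "true" vocal tract filter and an imperfect prediction of it.
a_true = [-0.9, 0.2]
a_pred = [-0.85, 0.25]

# Toy excitation (white noise, an assumption) and the resulting "speech".
excitation = [random.gauss(0.0, 1.0) for _ in range(200)]
speech = synthesis_filter(excitation, a_true)

# Original target: inverse filtering with the true filter.
original_target = inverse_filter(speech, a_true)
# Re-estimated target: inverse filtering with the predicted filter.
reestimated_target = inverse_filter(speech, a_pred)

# Reconstruct speech through the predicted filter, as happens at synthesis time.
err_original = mse(synthesis_filter(original_target, a_pred), speech)
err_reestimated = mse(synthesis_filter(reestimated_target, a_pred), speech)

print("MSE with original target:     ", err_original)
print("MSE with re-estimated target: ", err_reestimated)
```

In this toy setting the re-estimated target reconstructs the signal up to floating-point rounding, because the inverse and synthesis filters share the same coefficients. Real glottal inverse filtering separates the glottal source from the vocal tract with more elaborate methods than a single FIR pass.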

Details

Original language: English
Title of host publication: Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
Publication status: Published - Aug 2017
MoE publication type: A4 Article in a conference publication
Event: Interspeech - Stockholm, Sweden
Duration: 20 Aug 2017 to 24 Aug 2017
Conference number: 18
http://www.interspeech2017.org/

Publication series

Name: Interspeech: Annual Conference of the International Speech Communication Association
ISSN (Electronic): 1990-9772

Conference

Conference: Interspeech
Country: Sweden
City: Stockholm
Period: 20/08/2017 to 24/08/2017

Research areas

  • statistical parametric speech synthesis, excitation modeling

ID: 14239098