Using text and acoustic features in predicting glottal excitation waveforms for parametric speech synthesis with recurrent neural networks

Lauri Juvela, Xin Wang, Shinji Takaki, Manu Airaksinen, Junichi Yamagishi, Paavo Alku

Research output: Chapter in Book/Report/Conference proceeding › Conference article in proceedings › Scientific › peer-review

6 Citations (Scopus)

Abstract

This work studies the use of deep learning methods to directly model glottal excitation waveforms from context-dependent text features in a text-to-speech synthesis system. Glottal vocoding is integrated into a deep neural network-based text-to-speech framework in which text and acoustic features can be flexibly used as either network inputs or outputs. Long short-term memory recurrent neural networks are utilised in two stages: first, in mapping text features to acoustic features and second, in predicting glottal waveforms from the text and/or acoustic features. Results show that using the text features directly yields quality similar to predicting the excitation from acoustic features, with both approaches outperforming a baseline system that uses a fixed glottal pulse for excitation generation.
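The two-stage pipeline described in the abstract can be sketched as follows. This is an illustrative numpy toy with random, untrained weights and made-up feature dimensions (the paper's actual network sizes, feature sets, and training procedure differ): stage one maps text features to acoustic features with an LSTM, and stage two predicts glottal waveform frames from the concatenated text and acoustic features.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class LSTMCell:
    """Minimal single-layer LSTM (forward pass only, random weights)."""
    def __init__(self, input_dim, hidden_dim, seed=0):
        rng = np.random.default_rng(seed)
        scale = 1.0 / np.sqrt(input_dim + hidden_dim)
        # One stacked weight matrix for the four gates: input, forget, cell, output.
        self.W = rng.normal(0.0, scale, (4 * hidden_dim, input_dim + hidden_dim))
        self.b = np.zeros(4 * hidden_dim)
        self.hidden_dim = hidden_dim

    def run(self, xs):
        """Run the cell over a (T, input_dim) sequence; return (T, hidden_dim)."""
        H = self.hidden_dim
        h = np.zeros(H)
        c = np.zeros(H)
        outputs = []
        for x in xs:
            z = self.W @ np.concatenate([x, h]) + self.b
            i = sigmoid(z[:H])          # input gate
            f = sigmoid(z[H:2 * H])     # forget gate
            g = np.tanh(z[2 * H:3 * H]) # candidate cell state
            o = sigmoid(z[3 * H:])      # output gate
            c = f * c + i * g
            h = o * np.tanh(c)
            outputs.append(h)
        return np.stack(outputs)

# Hypothetical dimensions chosen for illustration only.
T, TEXT_DIM, AC_DIM, GLOT_DIM = 20, 8, 5, 16
text_feats = np.random.default_rng(1).normal(size=(T, TEXT_DIM))

# Stage 1: text features -> acoustic features.
stage1 = LSTMCell(TEXT_DIM, AC_DIM, seed=2)
acoustic_feats = stage1.run(text_feats)

# Stage 2: text + acoustic features -> glottal excitation frames.
stage2 = LSTMCell(TEXT_DIM + AC_DIM, GLOT_DIM, seed=3)
glottal_frames = stage2.run(np.concatenate([text_feats, acoustic_feats], axis=1))

print(glottal_frames.shape)  # (20, 16)
```

In the paper's comparison, the second stage is fed either the text features alone, the acoustic features alone, or both; the concatenation above corresponds to the combined case.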

Original language: English
Title of host publication: Proceedings of the Annual Conference of the International Speech Communication Association
Subtitle of host publication: Interspeech'16, San Francisco, USA, Sept. 8-12, 2016
Publisher: International Speech Communication Association (ISCA)
Pages: 2283-2287
Number of pages: 5
Volume: 08-12-September-2016
ISBN (Electronic): 978-1-5108-3313-5
DOIs
Publication status: Published - 2016
MoE publication type: A4 Conference publication
Event: Interspeech - San Francisco, United States
Duration: 8 Sept 2016 - 12 Sept 2016
Conference number: 17

Publication series

Name: Proceedings of the Annual Conference of the International Speech Communication Association
Publisher: International Speech Communications Association
ISSN (Print): 1990-9770
ISSN (Electronic): 2308-457X

Conference

Conference: Interspeech
Country/Territory: United States
City: San Francisco
Period: 08/09/2016 - 12/09/2016

Keywords

  • Excitation modelling
  • Glottal vocoding
  • LSTM
  • Parametric speech synthesis
