Abstract
This work studies the use of deep learning methods to directly model glottal excitation waveforms from context-dependent text features in a text-to-speech synthesis system. Glottal vocoding is integrated into a deep neural network-based text-to-speech framework in which text and acoustic features can be used flexibly as either network inputs or outputs. Long short-term memory recurrent neural networks are utilised in two stages: first, in mapping text features to acoustic features, and second, in predicting glottal waveforms from the text and/or acoustic features. Results show that using the text features directly yields quality similar to predicting the excitation from acoustic features, with both approaches outperforming a baseline system that uses a fixed glottal pulse for excitation generation.
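The two-stage architecture described in the abstract can be pictured as two sequence-to-sequence regressors: one LSTM maps frame-level text features to acoustic features, and a second LSTM predicts glottal excitation waveform samples from the text and/or acoustic features. The following is a minimal illustrative sketch in PyTorch, not the authors' implementation; all names and sizes (`LSTMRegressor`, `TEXT_DIM`, `ACOUSTIC_DIM`, `PULSE_LEN`, hidden dimensions) are assumptions chosen for illustration.

```python
# Hypothetical sketch of the two-stage LSTM pipeline; dimensions are illustrative.
import torch
import torch.nn as nn

class LSTMRegressor(nn.Module):
    """Generic sequence-to-sequence regressor used for both stages."""
    def __init__(self, in_dim, out_dim, hidden=256, layers=2):
        super().__init__()
        self.lstm = nn.LSTM(in_dim, hidden, num_layers=layers, batch_first=True)
        self.proj = nn.Linear(hidden, out_dim)

    def forward(self, x):               # x: (batch, frames, in_dim)
        h, _ = self.lstm(x)             # h: (batch, frames, hidden)
        return self.proj(h)             # (batch, frames, out_dim)

TEXT_DIM, ACOUSTIC_DIM, PULSE_LEN = 300, 60, 400   # assumed feature sizes

# Stage 1: context-dependent text features -> acoustic features.
text_to_acoustic = LSTMRegressor(TEXT_DIM, ACOUSTIC_DIM)

# Stage 2: text and/or acoustic features -> glottal excitation waveform
# per frame. Concatenating both inputs covers the "text + acoustic" case.
excitation_model = LSTMRegressor(TEXT_DIM + ACOUSTIC_DIM, PULSE_LEN)

text = torch.randn(8, 100, TEXT_DIM)               # dummy batch of utterances
acoustic = text_to_acoustic(text)
glottal = excitation_model(torch.cat([text, acoustic], dim=-1))
print(glottal.shape)                               # torch.Size([8, 100, 400])
```

Feeding `excitation_model` only `text` or only `acoustic` (with `in_dim` adjusted accordingly) would correspond to the input configurations the paper compares against each other and against the fixed-pulse baseline.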
Original language | English |
---|---|
Title of host publication | Proceedings of the Annual Conference of the International Speech Communication Association |
Subtitle of host publication | Interspeech'16, San Francisco, USA, Sept. 8-12, 2016 |
Publisher | International Speech Communication Association (ISCA) |
Pages | 2283-2287 |
Number of pages | 5 |
Volume | 08-12-September-2016 |
ISBN (Electronic) | 978-1-5108-3313-5 |
DOIs | |
Publication status | Published - 2016 |
MoE publication type | A4 Conference publication |
Event | Interspeech (Conference number: 17) - San Francisco, United States. Duration: 8 Sept 2016 → 12 Sept 2016 |
Publication series
Name | Proceedings of the Annual Conference of the International Speech Communication Association |
---|---|
Publisher | International Speech Communication Association |
ISSN (Print) | 1990-9770 |
ISSN (Electronic) | 2308-457X |
Conference
Conference | Interspeech |
---|---|
Country/Territory | United States |
City | San Francisco |
Period | 08/09/2016 → 12/09/2016 |
Keywords
- Excitation modelling
- Glottal vocoding
- LSTM
- Parametric speech synthesis