Abstract
This work studies the use of deep learning methods to directly model glottal excitation waveforms from context-dependent text features in a text-to-speech synthesis system. Glottal vocoding is integrated into a deep neural network-based text-to-speech framework in which text and acoustic features can be used flexibly as either network inputs or outputs. Long short-term memory recurrent neural networks are utilised in two stages: first, in mapping text features to acoustic features, and second, in predicting glottal waveforms from the text and/or acoustic features. Results show that using the text features directly yields quality similar to predicting the excitation from acoustic features, with both approaches outperforming a baseline system that uses a fixed glottal pulse for excitation generation.
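The two-stage pipeline in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the LSTM cell below is randomly initialised (no training), and all dimensions, weights, and names are illustrative assumptions. Stage 1 maps text features to acoustic features; stage 2 maps the concatenated text and acoustic features to glottal excitation samples.

```python
import numpy as np

def lstm_step(x, h, c, W, U, b):
    """One LSTM time step; gate pre-activations stacked as [input, forget, cell, output]."""
    z = W @ x + U @ h + b
    n = h.size
    i = 1 / (1 + np.exp(-z[:n]))          # input gate
    f = 1 / (1 + np.exp(-z[n:2 * n]))     # forget gate
    g = np.tanh(z[2 * n:3 * n])           # candidate cell state
    o = 1 / (1 + np.exp(-z[3 * n:]))      # output gate
    c = f * c + i * g
    h = o * np.tanh(c)
    return h, c

def run_lstm(X, in_dim, hid_dim, rng):
    """Run a randomly initialised (untrained) LSTM over a sequence X of shape (T, in_dim)."""
    W = rng.standard_normal((4 * hid_dim, in_dim)) * 0.1
    U = rng.standard_normal((4 * hid_dim, hid_dim)) * 0.1
    b = np.zeros(4 * hid_dim)
    h = np.zeros(hid_dim)
    c = np.zeros(hid_dim)
    out = []
    for x in X:
        h, c = lstm_step(x, h, c, W, U, b)
        out.append(h)
    return np.stack(out)

rng = np.random.default_rng(0)
T, text_dim, ac_dim, wave_dim = 10, 8, 5, 4      # toy sizes, purely illustrative

text_feats = rng.standard_normal((T, text_dim))  # context-dependent text features

# Stage 1: text features -> acoustic features
acoustic = run_lstm(text_feats, text_dim, ac_dim, rng)

# Stage 2: text + acoustic features -> glottal excitation samples
stage2_in = np.concatenate([text_feats, acoustic], axis=1)
glottal = run_lstm(stage2_in, text_dim + ac_dim, wave_dim, rng)

print(glottal.shape)  # one short excitation frame per time step
```

In the paper's setting the second stage may also be driven by the text features alone; here the concatenated variant is shown only to make the data flow between the two networks explicit.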
Original language | English |
---|---|
Title | Proceedings of the Annual Conference of the International Speech Communication Association |
Subtitle | Interspeech'16, San Francisco, USA, Sept. 8-12, 2016 |
Publisher | International Speech Communication Association (ISCA) |
Pages | 2283-2287 |
Number of pages | 5 |
Volume | 08-12-September-2016 |
ISBN (Electronic) | 978-1-5108-3313-5 |
DOI - permanent links | |
Status | Published - 2016 |
Publication type (OKM) | A4 Article in conference proceedings |
Event | Interspeech - San Francisco, United States Duration: 8 Sep 2016 → 12 Sep 2016 Conference number: 17 |
Publication series
Name | Proceedings of the Annual Conference of the International Speech Communication Association |
---|---|
Publisher | International Speech Communication Association |
ISSN (Print) | 1990-9770 |
ISSN (Electronic) | 2308-457X |
Conference
Conference | Interspeech |
---|---|
Country/Territory | United States |
City | San Francisco |
Period | 08/09/2016 → 12/09/2016 |