A Comparison of Recent Waveform Generation and Acoustic Modeling Methods for Neural-Network-Based Speech Synthesis

Xin Wang, Jaime Lorenzo-Trueba, Shinji Takaki, Lauri Juvela, Junichi Yamagishi

    Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › Scientific › peer-review

    49 Citations (Scopus)

    Abstract

    Recent advances in speech synthesis suggest that limitations such as the lossy nature of the amplitude spectrum with minimum phase approximation and the over-smoothing effect in acoustic modeling can be overcome by using advanced machine learning approaches. In this paper, we build a framework in which we can fairly compare new vocoding and acoustic modeling techniques with conventional approaches by means of a large-scale crowdsourced evaluation. Results on acoustic models showed that generative adversarial networks and an autoregressive (AR) model performed better than a normal recurrent network, with the AR model performing best. Evaluation of vocoders using the same AR acoustic model demonstrated that a Wavenet vocoder outperformed classical source-filter-based vocoders. In particular, speech waveforms generated by the combination of the AR acoustic model and the Wavenet vocoder achieved a speech quality score similar to that of vocoded speech.

    Original language: English
    Title of host publication: 2018 IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2018 - Proceedings
    Place of Publication: United States
    Publisher: IEEE
    Pages: 4804-4808
    Number of pages: 5
    Volume: 2018-April
    ISBN (Electronic): 978-1-5386-4658-8
    ISBN (Print): 978-1-5386-4659-5
    Publication status: Published - 10 Sept 2018
    MoE publication type: A4 Article in a conference publication
    Event: IEEE International Conference on Acoustics, Speech, and Signal Processing - Calgary, Canada
    Duration: 15 Apr 2018 to 20 Apr 2018
    https://2018.ieeeicassp.org/

    Publication series

    Name: Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing
    ISSN (Electronic): 2379-190X

    Conference

    Conference: IEEE International Conference on Acoustics, Speech, and Signal Processing
    Abbreviated title: ICASSP
    Country/Territory: Canada
    City: Calgary
    Period: 15/04/2018 to 20/04/2018

    Keywords

    • Autoregressive neural network
    • Deep learning
    • Generative adversarial network
    • Speech synthesis
    • Wavenet
