Speech synthesis, the artificial generation of speech from arbitrary text, has been one of the fundamental problems in speech communication technology. While early research on synthesis was driven by curiosity about human voice production, modern speech synthesis has found many applications, including screen readers, assistive devices and human-computer speech interfaces. With recent advances in statistical-model-based synthesis using neural networks, speech synthesis has reached an unprecedented level of naturalness and flexibility that will make many exciting future applications possible. A major contributor to the recent improvements has been the introduction of neural network waveform synthesis models, which take the role of the vocoder in a traditional speech synthesis system. However, a gap remains between the recent raw-waveform neural vocoders and the classical model-based signal processing vocoders, both in how well the algorithms are understood and in their computational efficiency. A central motivation of the present dissertation has been to combine the emerging generative neural network models with classical speech signal processing concepts for efficient, high-quality synthesis that retains a degree of interpretability. Specifically, this dissertation focuses on neural network modeling of the excitation signal in the source-filter model of human voice production. Since current signal processing techniques for modeling the spectral envelope of the vocal tract are highly developed, the spectral envelope can be parameterized and used directly as a part of neural vocoding schemes. The remaining task is then to develop neural network models for the residual excitation signal. This dissertation presents an improved framework for representing residual excitation waveforms in a pitch-synchronous format, and applies generative adversarial networks to synthesize these waveforms without a parametric aperiodicity model. Furthermore, it proposes an autoregressive WaveNet-based excitation model, which only explicitly uses a spectral envelope model during synthesis. Finally, the two approaches are combined into a parallel-inference-capable source-filter synthesizer, which is trainable in an end-to-end fashion.
| Translated title of the contribution | Puheen aaltomuotojen tuottaminen hermoverkoilla lähde-suodin mallissa |
| Publication status | Published - 2020 |
| MoE publication type | G5 Doctoral dissertation (article) |
- speech synthesis
- deep learning
- generative models