The aim of a text-to-speech (TTS) system is to transform a given input text into a speech waveform that sounds like human speech. TTS is a key component in many voice-interactive devices and applications, such as screen readers and virtual personal assistants (e.g. Siri by Apple). Recently, tremendous progress has been made in TTS, and synthetic speech at its best is difficult to distinguish from natural speech for short utterances. This progress stems from advances in deep learning training algorithms, from greater computational power and memory resources, and from better access to effective development tools and data. These advanced TTS systems are based on statistical models, which often require both large quantities of training data recorded in studio environments and large computational resources for training. However, collecting that much data is difficult for speaking styles that are hard to produce in large quantities. This thesis studies TTS by investigating new training algorithms to generate excitation waveforms in vocoders. Vocoders are key components in TTS that aim to represent speech signals in forms suitable for statistical modeling. Glottal vocoders are a specific type of vocoder that uses representations based on modeling the glottal flow, the true acoustic excitation of the human speech production mechanism. In glottal vocoders, accurate prediction of the glottal excitation is important for generating high-quality synthetic speech. This thesis studies both adversarial and auto-regressive training mechanisms to model the glottal waveform. Extensive subjective evaluations were conducted to compare the developed techniques with other vocoders widely used in TTS. The results show improvements in the naturalness of synthetic speech, and for some of the synthesized voices the glottal vocoders outperformed the other vocoders.
Furthermore, this thesis investigates adaptation techniques to synthesize the Lombard speaking style, which humans involuntarily produce in noisy surroundings, using just 500 utterances (approx. 20 minutes) of speech data. The adaptation techniques were integrated into both conventional statistical parametric speech synthesis (SPSS) systems and modern end-to-end TTS systems. The results suggest that a transfer learning-based adaptation technique can generate Lombard speech with the best quality.
|Translated title of the contribution||Improving the quality of text-to-speech (TTS) using deep learning -- Emphasis on vocoders and speaking style adaptation|
|Status||Published - 2020|
|MoEC publication type||G5 Doctoral dissertation (article)|