Voice source modelling techniques for statistical parametric speech synthesis

Tuomo Raitio

Research output: Thesis > Doctoral Thesis > Collection of Articles

Abstract

Speech is the most natural form of human communication, and designing a machine that imitates human speech has therefore long fascinated people. Only fairly recently has this goal become feasible, owing to the digitisation of speech and the increase in computing power. Although speech synthesis is used today in various applications, from human-computer interaction to assistive technologies, the performance of modern speech synthesisers remains far from the abilities of human speakers. The ultimate goal of text-to-speech (TTS) synthesis is to read any text and convert it to intelligible and natural-sounding speech with the desired contextual and speaker characteristics. Meeting all of these goals at once makes the task extremely difficult. Moreover, the quality of the speech signal cannot be compromised, since humans are very sensitive to even the slightest artefacts in a speech signal.

This thesis aims to improve both the naturalness and expressivity of speech synthesis by developing speech processing algorithms that utilise information from the speech production mechanism. One of the key algorithms in this work is glottal inverse filtering (GIF), which is used for estimating the voice source signal from recorded speech. The voice source is known to be the origin of several essential acoustic cues used in spoken communication, such as the fundamental frequency, and it is also related to the acoustic cues underlying voice quality, speaking style, and speaker identity, all of which contribute to the naturalness and expressivity of speech. Accurate modelling of the voice source is often overlooked in conventional speech processing algorithms, and this work aims to address this shortcoming in particular.

First, two new GIF methods are proposed for improved estimation of the voice source signal. Second, several novel voice source parameterisation and modelling methods are developed that can be used in statistical parametric speech synthesis (SPSS) to improve naturalness and expressivity. Third, using GIF and the voice source modelling methods in the context of SPSS, expressive voices are created that aim to cover the various human speaking styles used in everyday spoken communication. Finally, the created synthetic voices are assessed through extensive subjective evaluation in different listening conditions. The results of the evaluation show that the naturalness and expressivity of synthetic speech can be enhanced using the techniques proposed in this thesis, and that the resulting voices are perceived to be more suitable in various realistic contexts. Thus, the methods presented in this thesis offer considerable potential to enhance the naturalness, expressivity, and suitability of speech synthesis in various applications.
Translated title of the contribution: Puheen äänilähteen mallintaminen tilastollisessa parametrisessa puhesynteesissä
Original language: English
Qualification: Doctor's degree
Awarding Institution
  • Aalto University
Supervisors/Advisors
  • Alku, Paavo, Supervisor
  • Alku, Paavo, Advisor
Publisher
Print ISBNs: 978-952-60-6136-8
Electronic ISBNs: 978-952-60-6137-5
Publication status: Published - 2015
MoE publication type: G5 Doctoral dissertation (article)

Keywords

  • statistical parametric speech synthesis
  • voice source modelling
  • glottal inverse filtering
  • voice quality
  • expressive speech synthesis
