Building personalised speech technology systems with sparse, bad quality or out-of-domain data

Research output: Thesis › Doctoral Thesis › Collection of Articles


Automatic speech recognition and text-to-speech systems offer hands-free and eyes-free interfaces for applications on computers, telephones, and home and wearable electronics. The perceived quality and identity of a text-to-speech system's voice are essential to the user experience. The possibilities for different speaker identities are practically limitless if short or out-of-domain collections of speech can be used to transfer speaker identity to the synthetic voice. This thesis describes the background, methods and results of a group of experiments performed with statistical parametric speech synthesis and speech recognition, with a focus on speaker adaptation of the models and evaluation of the quality of the systems' output. All these systems rely on speech models that are trained on large collections of speech and text data. The speech data have been preprocessed into acoustic features using a vocoder. The amount and quality of available data are addressed in experiments on the effects of background noise in the adaptation data of speaker-adaptive HMM-GMM statistical parametric speech synthesis, on listener perception of speaker background in speaker-adapted speech synthesis with sparse, foreign-accented data, and on stacking group and speaker adaptations to improve the quality of speech synthesis for out-of-domain speakers. Cross-lingual adaptation is investigated in experiments on probabilistic cross-lingual speaker adaptation when a model for the source language is not available, and on bilingual speech synthesis with code-switching when source language data are not available for one of the languages. In all these studies, the speaker characteristics were successfully transferred to a synthetic voice even when the adaptation data were noisy, in another language, or very scarce. Cross-lingual adaptation was also investigated for automatic speech recognition of bilingual speakers and was found to improve recognition results.
Any system development relies on measuring the quality of the output, and this thesis also includes an overview of objective and subjective methods of quality evaluation for synthetic speech and natural foreign language speech, as well as an analysis of different objective measures for evaluating the quality of HMM-GMM based speech synthesis systems. Building on components of speech recognition and synthesis systems, this thesis also presents a system for evaluating and scoring the pronunciation quality of foreign language learners' utterances. Rating the pronunciation quality of single utterances is a difficult problem, but our system does so with a speed and reliability that are satisfactory for computer games used to study language learning.
Translated title of the contribution: Building personalised speech technology systems with sparse, bad quality or out-of-domain data
Original language: English
Qualification: Doctor's degree
Awarding Institution
  • Aalto University
  • Kurimo, Mikko, Supervisor
  • Kurimo, Mikko, Advisor
Print ISBNs: 978-952-60-8594-4
Electronic ISBNs: 978-952-60-8595-1
Publication status: Published - 2019
MoE publication type: G5 Doctoral dissertation (article)


  • statistical parametric speech synthesis
  • automatic speech recognition
  • computer assisted pronunciation training
  • speech synthesis quality evaluation
  • acoustic model adaptation

  • Equipment

  • Science-IT

    Mikko Hakala (Manager)

    School of Science

    Facility/equipment: Facility
