Speech technology is a field of research focusing on methods for processing spoken language. Work in the area has largely relied on a combination of domain-specific knowledge and digital signal processing (DSP) algorithms, often combined with statistical (parametric) models. In this context, machine learning (ML) has played a central role in estimating the parameters of such models. Recently, better access to large quantities of data has opened the door to advanced ML models that are less constrained by the assumptions necessary for the DSP models and are potentially capable of achieving higher performance. The goal of this thesis is to investigate the applicability of recent state-of-the-art (SoA) developments in ML to the modelling and processing of speech at the so-called suprasegmental level in order to tackle the following topical problems in speech research: 1) zero-resource speech processing (ZS), which aims to learn language patterns from speech without access to annotated datasets; 2) automatic word count estimation (WCE) and syllable count estimation (SCE), which focus on quantifying the amount of linguistic content in audio recordings; and 3) speaking style conversion (SSC), which deals with converting the speaking style of an utterance while retaining its linguistic content, speaker identity and quality. In contrast to the segmental level, which consists of elementary speech units known as phone(me)s, the suprasegmental level encodes more slowly varying characteristics of speech such as speaker identity, speaking style, prosody and emotion. The ML approaches used in the thesis are non-parametric Bayesian (NPB) models, which have a strong mathematical foundation based on Bayesian statistics, and artificial neural networks (NNs), which are universal function approximators capable of leveraging large quantities of training data.
The NN variants used include 1) end-to-end models, which are capable of learning complicated mapping functions without the need to explicitly model the intermediate steps, and 2) generative adversarial networks (GANs), which are trained as a minimax game between two competing NNs. In ZS, NPB clustering methods were investigated for the discovery of syllabic clusters from speech and were shown to eliminate the need for model selection. In the WCE/SCE task, a novel end-to-end model was developed for automatic and language-independent syllable counting from speech. The method improved syllable counting accuracy by approximately 10 percentage points over the previously published SoA method while relaxing the annotation requirements for the data used in model training. As for SSC, a new parametric approach was introduced for the task. Bayesian models were first studied with parallel data, followed by GAN-based solutions for non-parallel data. The GAN-based models were shown to achieve SoA performance in terms of both subjective and objective measures without access to parallel data. Augmented CycleGANs also enable manual control of the degree of style conversion achieved in the SSC task.
|Translated title of the publication||Machine learning methods for suprasegmental analysis and conversion in speech|
|Status||Published - 2020|
|Ministry of Education and Culture (OKM) publication type||G5 Doctoral dissertation (article)|