Speech carries information about, among other things, the linguistic message, speaker identity, speaking situation, speaking style and speaker-related characteristics. Feature extraction refers to the process of converting the digital speech signal into acoustic parameters that can be used to automatically uncover such information, especially by machine learning systems trained on speech data labeled with the target information. Such analyses are central to automatic speech recognition, speaker recognition, speech event detection and computational paralinguistic analysis. Each of these application categories is covered in this thesis.

With the increasing computational and storage capacity of communication technology, speech applications are becoming more widespread and are used in more challenging environments. Ambient noise, varying communication and recording channels, and large speaker-related variability tend to cause variation in acoustic feature statistics and thus mislead speech analysis systems. This study aims to improve the robustness of these systems through feature extraction, so that they better maintain their performance level under increased signal variability.

In short-time feature extraction, the focus is on robust spectrum analysis, especially using time-weighted linear predictive methods, in which different temporal locations of the signal are emphasized differently. These methods are found to improve additive-noise robustness in automatic speech, speaker and emotion recognition, and to improve fundamental-frequency and vocal-effort robustness in formant analysis and speaker recognition. In addition, emphasizing the spectral fine structure is found to improve the robust detection of shouted speech in ambient-noise conditions.
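To make the idea of time-weighted linear prediction concrete, the following is a minimal sketch of weighted LP analysis: the squared prediction error at each sample is weighted, here by the short-time energy (STE) of the preceding samples, which is one weighting choice discussed in the literature. The function name, the STE window length and the regularization constant are illustrative assumptions, not the thesis's exact configuration.

```python
import numpy as np

def wlp_coefficients(frame, order=10, ste_window=12):
    """Weighted linear prediction (illustrative sketch).

    Solves the normal equations of LP analysis with a per-sample
    weight w[k], here the short-time energy of the `ste_window`
    preceding samples. High-energy (often less noise-corrupted)
    regions of the frame then dominate the spectral fit.
    """
    n = len(frame)
    # STE weight for each sample; small constant avoids a zero weight
    w = np.array([np.sum(frame[max(0, k - ste_window):k] ** 2) + 1e-9
                  for k in range(n)])
    # Weighted covariance matrix C and cross-correlation vector c
    C = np.zeros((order, order))
    c = np.zeros(order)
    for k in range(order, n):
        past = frame[k - order:k][::-1]   # x[k-1], ..., x[k-order]
        C += w[k] * np.outer(past, past)
        c += w[k] * frame[k] * past
    # Predictor coefficients a_1..a_p minimizing sum_k w[k] e[k]^2
    return np.linalg.solve(C, c)
```

With a constant weight this reduces to conventional covariance-method LP; the all-pole spectrum 1/|A(f)|^2 obtained from the coefficients is what the recognizers build their short-time features on.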
In long-term feature processing, modulation filtering of short-time features over multiple time scales is used to emphasize the typical long-term modulation dynamics of a given speech signal class, applied to detecting emotions over a telephone channel in the presence of noise. Feature selection methods capable of handling high-dimensional data sets are developed and applied to find relevant utterance-level features for parametrizing speech in different paralinguistic tasks with considerable speaker-related variability.

The studies presented develop speech feature extraction methods that improve the robustness of various speech analysis systems by focusing on relevant information and de-emphasizing or ignoring irrelevant information. These general-purpose modeling methods are not constrained to any particular application or system structure and thus have many potential uses.
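The modulation-filtering step can be sketched as band-pass filtering each feature trajectory in the modulation (frame-rate) domain. The sketch below uses FFT masking for simplicity; the 1-16 Hz band is an illustrative choice covering typical syllabic modulation rates of speech, and the function name and frame rate are assumptions rather than the thesis's exact setup.

```python
import numpy as np

def modulation_bandpass(features, frame_rate=100.0, lo=1.0, hi=16.0):
    """Band-pass filter feature trajectories in the modulation domain.

    `features` has shape (n_frames, n_dims): one short-time feature
    vector per frame. Each dimension's trajectory is transformed with
    an FFT along the time axis, modulation-frequency bins outside
    [lo, hi] Hz are zeroed, and the trajectory is reconstructed.
    """
    n = features.shape[0]
    spec = np.fft.rfft(features, axis=0)
    freqs = np.fft.rfftfreq(n, d=1.0 / frame_rate)
    mask = (freqs >= lo) & (freqs <= hi)
    spec[~mask] = 0.0   # drop DC drift and fast, noise-like modulations
    return np.fft.irfft(spec, n=n, axis=0)
```

Zeroing the DC bin removes slowly varying channel offsets (a convolutive-channel effect in cepstral features), while the upper cutoff suppresses modulations faster than speech typically produces.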
Translated title of the contribution: Robusteja menetelmiä puheen piirrelaskentaan ("Robust methods for speech feature extraction")
Publication status: Published - 2014
MoE publication type: G5 Doctoral dissertation (article)
- speech processing
- machine learning
- robust features
- linear prediction