Analysis of dynamics of vocal tract system using Zero time windowing method

Activity: Acting as pre-examiner or opponent of a doctoral thesis, or membership in a doctoral training council


The speech signal is the output of a dynamic production mechanism. The articulators involved in the production process move continuously at different rates, giving rise to a time-varying vocal tract transfer function. The process of speech production is dictated by the linguistic and para-linguistic information being conveyed by the speaker. The resulting speech signal embodies the characteristics of this time-varying acoustic system, which humans perceive easily. Algorithmic processing of the speech signal attempts to study these characteristics, to uncover the behavior of the acoustic response of the vocal tract system during the production of different sounds. Analysis of the dynamics of the production mechanism from the speech signal is a difficult task due to the rapid movement of articulators and the complex interaction among different elements of production. The elements of speech production comprise static and dynamic articulators. Movement of the articulators can be voluntary, driven by the motor functions (such as the movement of the lips, tongue and jaw), to produce a desired sequence of sounds. Some involuntary or semi-voluntary movements of these articulators also take place, due to factors such as air pressure and muscular tension. Glottal opening and closing, and velar opening and closing, are two examples of such semi-voluntary movements; they occur during the production of voiced and nasal sounds, respectively. Changes in the acoustic response of the vocal tract system during these events are rapid, and tracking the dynamics of such responses is difficult. Assuming stationary or quasi-stationary behavior of the system limits our comprehension of the production process. A detailed study of the dynamic nature of the acoustic transfer function is therefore necessary to gain deeper insight into the speech production process.
Traditional methods to study the speech production mechanism assume stationarity of the vocal tract acoustic system over a short duration of 10–30 ms. The spectral characteristics of these segments are extracted using methods based on short-time analysis of speech. The spectral envelope obtained from the magnitude of the short-time Fourier transform (STFT) is interpreted as the frequency response of the acoustic system. The fine structure of the magnitude is attributed mostly to the excitation component of the production system. The gross and fine structures of the spectrum are exploited to develop models for the speech production system. Source-system modeling is a paradigm based on this principle. Analysis using a bank of bandpass filters is another method to study the spectral characteristics. The output energy of each filter within a short (10–30 ms) segment is attributed to the spectral energy in the band corresponding to the filter. There is an implicit quasi-stationarity assumption in the implementation of these methods. These methods do not bring out the underlying production mechanism well enough to explain the dynamics of the articulatory phenomena. The objective of this study is to examine changes in the vocal tract system to bring out the dynamic nature of the system during the production of different sounds. The glottal activity in voiced speech involves opening and closing of the vibrating vocal folds, which results in coupling and decoupling of the subglottal and supraglottal tracts, respectively. The production of nasal consonants and nasalized vowels also involves coupling and decoupling of the nasal tract due to velar movement. These changes occur continuously over a short interval of time, and hence are difficult to track using analysis based on a quasi-stationarity assumption. A method is proposed to track the dominant resonance characteristics of the vocal tract acoustic system continuously as a function of time.
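The conventional short-time analysis described above can be illustrated with a minimal sketch. It is not taken from the thesis: the function names, the 20 ms Hamming frame, the 10 ms hop and the 1024-point FFT are all assumptions chosen for illustration, and a synthetic tone stands in for speech.

```python
import numpy as np

def short_time_spectra(signal, fs, frame_ms=20.0, hop_ms=10.0, n_fft=1024):
    """Magnitude spectra of Hamming-windowed frames, shape (frames, bins).

    Each frame is treated as if the system were stationary within it,
    which is the quasi-stationarity assumption of short-time analysis.
    """
    frame_len = int(round(fs * frame_ms / 1000.0))
    hop = int(round(fs * hop_ms / 1000.0))
    window = np.hamming(frame_len)
    starts = range(0, len(signal) - frame_len + 1, hop)
    frames = np.stack([signal[s:s + frame_len] * window for s in starts])
    # rfft zero-pads each frame to n_fft samples before transforming
    return np.abs(np.fft.rfft(frames, n=n_fft, axis=1))

def peak_frequencies(spectra, fs, n_fft=1024):
    """Frequency (Hz) of the largest magnitude bin in each frame."""
    freqs = np.fft.rfftfreq(n_fft, d=1.0 / fs)
    return freqs[np.argmax(spectra, axis=1)]

# Synthetic stand-in for speech: one second of a 1 kHz tone at 8 kHz
fs = 8000
t = np.arange(fs) / fs
tone = np.sin(2 * np.pi * 1000.0 * t)
spectra = short_time_spectra(tone, fs)
peaks = peak_frequencies(spectra, fs)
```

In this sketch each frame yields a single spectrum, so any change that happens inside the 20 ms frame is averaged away; that is the limitation the abstract attributes to these methods.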
The method is useful to study the dynamics of the production system caused by the movement of the vocal folds, velum, and other articulators. In order to study the dynamic characteristics of the vocal tract system during speech production, a speech analysis method called zero time windowing (ZTW) is used. The ZTW method gives the spectral characteristics of the time-varying system at each instant of time, reflecting the instantaneous behavior of the production system. The spectral characteristics are derived using the Hilbert envelope of the numerator of the group delay (HNGD) function. The behavior of the time-varying system response is captured using the dominant resonance frequency (DRF) obtained from the HNGD spectrum. Changes in the production system due to coupling and decoupling of different cavities during glottal and velic activity are reflected as shifts in the locations of the DRFs. Other spectral parameters are derived from the instantaneous system response to study the dynamics of the vocal tract during the production of some voiced and unvoiced segments. Novel methods of speech signal processing are proposed to study the glottal and velic activities by tracking the changes in the vocal tract system. The DRF contour is a one-dimensional representation corresponding to the equivalent length of the overall vocal tract. This representation is used to demarcate glottal open and closed phases in continuous speech. The transitions in the DRF contour help to identify the regions of coupling and decoupling of the subglottal and supraglottal tracts during glottal activity. The coupling of the oral and nasal cavities, and the extent of the opening of the velopharyngeal port, are studied using the first and second dominant resonances obtained from the HNGD spectra. The study addresses two major problems, namely, identifying the duration and the extent of nasalization in a vowel.
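The chain from a short segment to a DRF, as described above, might be sketched as follows. This is a simplified illustration under stated assumptions, not the thesis implementation: the abstract gives no formulas, so the heavily decaying window shape, the double differencing of the spectrum, the FFT size and all function names are assumptions. The only standard identity used is that the numerator of the group delay is X_R·Y_R + X_I·Y_I, where Y is the transform of n·x[n].

```python
import numpy as np
from scipy.signal import hilbert

def decaying_window(length):
    """Heavily decaying window that emphasises samples near time zero.

    ASSUMPTION: this 1 / (4 sin^2(pi n / 2N)) shape is chosen for
    illustration; the abstract does not specify the window.
    """
    n = np.arange(length)
    w = np.zeros(length)
    w[1:] = 1.0 / (4.0 * np.sin(np.pi * n[1:] / (2.0 * length)) ** 2)
    return w

def hngd_spectrum(segment, n_fft=1024):
    """Hilbert envelope of the (twice differenced) group delay numerator."""
    n = np.arange(len(segment))
    windowed = segment * decaying_window(len(segment))
    X = np.fft.rfft(windowed, n=n_fft)
    Y = np.fft.rfft(n * windowed, n=n_fft)
    # Numerator of the group delay function (standard identity)
    ngd = X.real * Y.real + X.imag * Y.imag
    # ASSUMPTION: difference twice along frequency to sharpen the peaks
    ngd_dd = np.diff(ngd, n=2)
    # Envelope = magnitude of the analytic signal
    return np.abs(hilbert(ngd_dd))

def dominant_resonance_frequency(hngd, fs, n_fft=1024):
    """Frequency (Hz) of the largest HNGD peak."""
    # Double differencing drops the first and last frequency bins
    freqs = np.fft.rfftfreq(n_fft, d=1.0 / fs)[1:-1]
    return freqs[np.argmax(hngd)]

# Synthetic stand-in for speech: 5 ms of a damped 1 kHz resonance at 8 kHz
fs = 8000
k = np.arange(40)
segment = np.exp(-k / 20.0) * np.cos(2 * np.pi * 1000.0 * k / fs)
hngd = hngd_spectrum(segment)
drf = dominant_resonance_frequency(hngd, fs)
```

Repeating this at successive sample positions would give a DRF value per instant, which is the kind of contour the abstract uses to follow glottal and velic activity.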
Different vowel-nasal pairs are examined to study the contextual load of the nasal consonants on the front, mid and back vowels. These studies illustrate the dynamic behavior of the co-articulatory nasalization phenomena in continuous speech. The thesis also examines whether certain averaged features derived from HNGD spectra help in discriminating nasal and approximant consonants. For this, the low-frequency behavior is captured using the HNGD spectra around the glottal closure instants in speech, which correspond to high-SNR regions within a glottal cycle. The thesis also explores the possibility of identifying fricatives in speech using the aggregated behavior of the HNGD spectral features in the high-frequency region. Thus the objective of the thesis is to study the dynamics of the speech production system using the instantaneous acoustic response obtained by the ZTW method. The changes in the vocal tract system due to movement of the vocal folds and velum are captured using the characteristics of the dominant resonances derived from the HNGD spectrum. Representing the effects of glottal activity through the changes in the vocal tract is a major contribution of the thesis. It gives new insight into the different phases of glottal activity, which is otherwise difficult to obtain using conventional STFT methods of speech analysis. The study of nasalization in vowels is another major contribution of the thesis. Illustration of the nasalization phenomena using the first two dominant resonances will help improve our understanding of the production process of different voices. Discrimination of nasal and approximant segments is another useful contribution of the thesis. Identification of fricatives in continuous speech is also explored using the spectral features derived from HNGD. The proposed methods are therefore useful in studying the continuously varying characteristics of the speech production mechanism.
Because the ZTW method analyzes very short segments, the results of these studies are mostly valid only for clean speech. Any type of averaging to reduce the effects of noise will destroy the instantaneous characteristics of the dynamical system.

Period: 14 Jun 2019
Examinee: Ravi Shankar Prasad
Examination held at
  • International Institute of Information Technology Hyderabad
Degree of recognition: International