This dissertation focuses on perceptually motivated processing of audio in the time-frequency domain, and on spatial audio in particular. The topic takes into account sound physics, perception, and digital signal processing (DSP). Sound is emitted by a source in all directions with a frequency-dependent directivity pattern, and it arrives at the listener through a direct path as well as through reverberation. The sounds from a multitude of sources superimpose at the ear canals. Due to the acoustic effect of the head, torso, and pinnae, the inter-aural, spectro-temporal characteristics of each arriving wave are specific to the direction of arrival. Ears transform the audio waveform into neural signals with frequency selectivity. Among other features, human hearing includes processes for analyzing the level and timing differences in the signals at the ears, which is necessary for obtaining information about the locations of the sound sources. In perceptually motivated audio DSP, a key technique is to decompose the sound into frequency bands. Several perceptually relevant signal properties can be directly measured and processed in the frequency bands. Typically, a practical DSP design only approximates the mechanisms and resolution of hearing, while also optimizing other properties such as computational efficiency and latency. Several novel techniques were proposed as part of the dissertation work. The first is an optimized and versatile framework for frequency band processing of spatial audio. The method operates on the channel energies and the inter-channel dependencies, which are key features for controlling the spatial perception. The method translates the spatial sound characteristics while minimizing the squared difference between the produced waveform and a defined preferred waveform. The method also provides a means to apply decorrelated sound energy only to the minimum necessary extent.
The method was applied to perform spatial sound reproduction based on a compact set of microphones, and its benefit with respect to legacy methods was confirmed by listening tests and simulations. In another study, a frequency band reverberator was proposed that produces diffuse late reverberation with low computational complexity and high perceptual quality. Finally, a phase-adaptive, multi-channel downmixer was proposed that avoids the spectral artifacts that would otherwise occur if the input channels include non-aligned but coherent sounds. The downmixer has been selected as part of the reference model 0 (RM0) of the MPEG-H standard.
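The spectral artifact that motivates a phase-adaptive downmixer can be illustrated with a toy example. The sketch below is not the downmixer proposed in the dissertation; it is a minimal illustration under simplifying assumptions: two channels carrying the same impulse with a fixed circular delay, a single naive DFT frame, and a per-bin phase rotation chosen here purely for demonstration. All names (`dft`, `passive_downmix`, `phase_aligned_downmix`, `N`, `DELAY`) are hypothetical.

```python
import cmath


def dft(x):
    """Naive discrete Fourier transform (fine for a short illustrative frame)."""
    n = len(x)
    return [
        sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


def passive_downmix(bins_a, bins_b):
    """Plain average of two channels, bin by bin."""
    return [0.5 * (a + b) for a, b in zip(bins_a, bins_b)]


def phase_aligned_downmix(bins_a, bins_b):
    """Rotate channel B to the phase of channel A in each bin, then average.

    Assumption: a per-bin rotation derived from the cross-spectrum, with no
    smoothing over time or frequency. This is a toy scheme, not the
    dissertation's algorithm.
    """
    out = []
    for a, b in zip(bins_a, bins_b):
        cross = a * b.conjugate()
        rot = cross / abs(cross) if abs(cross) > 1e-12 else 1.0
        out.append(0.5 * (a + b * rot))
    return out


N, DELAY = 64, 4                      # frame length and inter-channel delay (samples)
left = [1.0] + [0.0] * (N - 1)        # impulse at sample 0
right = [0.0] * N
right[DELAY] = 1.0                    # the same impulse, delayed: coherent but not aligned

left_bins, right_bins = dft(left), dft(right)
passive_mag = [abs(v) for v in passive_downmix(left_bins, right_bins)]
aligned_mag = [abs(v) for v in phase_aligned_downmix(left_bins, right_bins)]

notch_bin = N // (2 * DELAY)          # first bin where the two channels are in antiphase
print("passive downmix magnitude at notch bin:", passive_mag[notch_bin])
print("aligned downmix magnitude at notch bin:", aligned_mag[notch_bin])
```

In the passive sum, the delayed copy cancels the original at the bins where the two channels are in antiphase, which produces a comb-filter notch; with the per-bin phase rotation the magnitude stays flat across the frame. A practical system would likely also need to handle partially coherent inputs and avoid abrupt phase changes between bins and frames, which this sketch does not attempt.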
|Translated title of the contribution|Havaintoperusteinen aika-taajuusalueen tilaäänenkäsittely|
|Publication status|Published - 2014|
|MoE publication type|G5 Doctoral dissertation (article)|
- spatial audio
- time-frequency transforms
- perceptual audio signal processing