Regarding the reproduction of recorded or synthesised spatial sound scenes, perhaps the most convenient and flexible approach is to employ the Ambisonics framework. The Ambisonics framework allows for linear and non-parametric storage, manipulation and reproduction of sound-fields, described using spherical harmonics up to a given order of expansion. Binaural Ambisonic reproduction can be realised by matching the spherical harmonic patterns to a set of binaural filters, in a manner that is frequency-dependent, linear and time-invariant. However, the perceptual performance of this approach is largely dependent on the spatial resolution of the input format. When employing lower-order material as input, perceptual deficiencies may easily occur, such as poor localisation accuracy and colouration. This is especially problematic, as the vast majority of existing Ambisonic recordings are available as first-order only. The detrimental effects associated with lower-order Ambisonics reproduction have been well studied and documented. To improve upon the perceived spatial accuracy of the method, the simplest solution is to increase the spherical harmonic order at the recording stage. However, microphone arrays capable of capturing higher-order components are generally much more expensive than first-order arrays, while more affordable options tend to offer higher-order components only over limited frequency ranges. Additionally, an increase in spherical harmonic order requires more channels and more storage and, in the case of transmission, more bandwidth. Furthermore, it is important to note that this solution does not aid in the reproduction of existing lower-order recordings. It is for these reasons that this work focuses on alternative methods which improve the reproduction of first-order material for headphone playback.
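As a rough illustration of the linear decoding described above, the sketch below counts the channels needed for a given order, (N+1)^2, and applies a 2-row decoding matrix to the spherical harmonic coefficients of a single frequency bin. The function names and the matrix values are hypothetical placeholders chosen for this example; they are not HRTF-derived filters and are not taken from the paper.

```python
def ambisonic_channels(order):
    """Number of spherical harmonic channels for a given Ambisonic order."""
    return (order + 1) ** 2


def decode_bin(sh_coeffs, decoder):
    """Linear binaural decode of one frequency bin.

    sh_coeffs: complex spherical harmonic coefficients for this bin.
    decoder:   two rows (left ear, right ear), each holding one complex
               gain per channel. In practice these gains would come from
               binaural filters; here they are placeholders.
    """
    if len(decoder) != 2 or any(len(row) != len(sh_coeffs) for row in decoder):
        raise ValueError("decoder must have 2 rows with one gain per channel")
    # Each ear signal is a fixed weighted sum of the input channels.
    return tuple(sum(g * c for g, c in zip(row, sh_coeffs)) for row in decoder)


# First-order input has 4 channels, third-order input has 16.
first_order = ambisonic_channels(1)
third_order = ambisonic_channels(3)

# Placeholder (not measured) decoder gains for a 4-channel bin.
left, right = decode_bin([1 + 0j, 0.5, 0, 0], [[1, 1, 0, 0], [1, -1, 0, 0]])
```

Because the decoder is a fixed matrix per frequency, the output cannot resolve more spatial detail than the input channels carry, which is why the channel count matters for this approach.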
For the task of binaural sound-field reproduction, an alternative is to employ a parametric approach, which divides the sound-field decoding into analysis and synthesis stages. Unlike Ambisonic reproduction, which operates via a linear combination of the input signals, parametric approaches operate in the time-frequency domain and rely on the extraction of spatial parameters during their analysis stage. These spatial parameters are then utilised to conduct a more informed reproduction in the synthesis stage. Parametric methods are capable of reproducing sounds at a spatial resolution that far exceeds that of their linear and time-invariant counterparts, as they are not bounded by the resolution of the input format. For example, they can elect to directly convolve the analysed source signals with the Head-Related Transfer Functions (HRTFs) that correspond to their analysed directions. An infinite order of spherical harmonic components would be required to attain the same resolution with a binaural Ambisonic decoder. The most well-known and established parametric reproduction method is Directional Audio Coding (DirAC), which employs a sound-field model consisting of one plane-wave and one diffuseness estimate per time-frequency tile. These parameters are derived from the active-intensity vector, in the case of first-order input. More recent formulations allow for multiple plane-wave and diffuseness estimates via spatially-localised active-intensity vectors, using higher-order input. Another parametric method is High Angular Resolution plane-wave Expansion (HARPEX), which extracts two plane-waves per frequency and is first-order only. The Sparse-Recovery method extracts up to half as many plane-waves as there are input channels, for input of arbitrary order.
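The DirAC-style analysis mentioned above can be sketched for a single frequency bin as follows. This is a minimal illustration under stated assumptions, not the reference implementation: the first-order signals are assumed to be an omnidirectional channel W plus three unit-gain dipoles X, Y, Z (so that a plane wave s from azimuth az and elevation el gives W = s, X = s cos(az) cos(el), Y = s sin(az) cos(el), Z = s sin(el)), and the helper names are hypothetical. Other channel normalisations would need different scaling.

```python
import math


def encode_plane_wave(s, az, el):
    """Hypothetical helper: first-order encoding under the assumed convention."""
    return (s,
            s * math.cos(az) * math.cos(el),
            s * math.sin(az) * math.cos(el),
            s * math.sin(el))


def dirac_parameters(frames):
    """Estimate direction and diffuseness for one frequency bin.

    frames: list of (W, X, Y, Z) complex coefficients over a few time frames.
    Returns (azimuth, elevation, diffuseness), angles in radians.
    """
    ix = iy = iz = energy = 0.0
    for w, x, y, z in frames:
        wc = complex(w).conjugate()
        # Active-intensity-like vector; points towards the source here.
        ix += (wc * x).real
        iy += (wc * y).real
        iz += (wc * z).real
        energy += 0.5 * (abs(w) ** 2 + abs(x) ** 2 + abs(y) ** 2 + abs(z) ** 2)
    if energy <= 0.0:
        return 0.0, 0.0, 1.0  # silent tile: no meaningful direction
    norm = math.sqrt(ix * ix + iy * iy + iz * iz)
    azimuth = math.atan2(iy, ix)
    elevation = math.atan2(iz, math.hypot(ix, iy))
    # 0 for a single plane wave, approaching 1 when intensities cancel out.
    diffuseness = min(1.0, max(0.0, 1.0 - norm / energy))
    return azimuth, elevation, diffuseness


# A single plane wave: direction is recovered and diffuseness is near zero.
single = [encode_plane_wave(0.3 + 0.4j, math.radians(60), math.radians(20))] * 4
az, el, psi = dirac_parameters(single)

# Equal-energy waves from opposite directions: the averaged intensity cancels.
opposing = [encode_plane_wave(1.0, 0.0, 0.0), encode_plane_wave(1.0, math.pi, 0.0)]
_, _, psi_opposing = dirac_parameters(opposing)
```

The second case hints at why a single intensity vector can be unreliable when several sources share a time-frequency tile: the direction estimate becomes an energy-weighted average rather than the direction of any one source.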
The COding and Multi-Parameterisation of Ambisonic Sound Scenes (COMPASS) method also extracts source components up to half the number of input channels, but employs an additional residual stream that encapsulates the remaining diffuse and ambient components in the scene. In this paper, a new binaural parametric decoder for first-order input is proposed. The method employs a sound-field model of one plane-wave and one diffuseness estimate per frequency, much like the DirAC model. However, the source component directions are identified via a plane-wave decomposition using a dense scanning grid and peak-finding, which is shown to be more robust than the active-intensity vector for multiple narrow-band sources. The source and ambient components per time-frequency tile are then segregated, and their relative energetic contributions are established, using the Cross-Pattern Coherence (CroPaC) spatial-filter. This approach is shown to be more robust than deriving this energy information from the active-intensity-based diffuseness estimates. A real-time audio plug-in implementation of the proposed approach is also described. A multiple-stimulus listening test was conducted to evaluate the perceived spatial accuracy and fidelity of the proposed method, alongside both first-order and third-order Ambisonics reproduction. The listening test results indicate that the proposed parametric decoder, using only first-order signals, is capable of delivering perceptual accuracy that matches or surpasses that of third-order Ambisonics decoding.
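The idea of locating a source by scanning a dense grid and picking the peak, as described in the abstract, can be sketched as below. This is only an assumption-laden illustration: it steers a first-order cardioid beamformer over a regular azimuth/elevation grid and returns the grid point with the highest output power, which may differ from the plane-wave decomposition used in the paper. It assumes the same W plus unit-gain dipole convention as a simple first-order encoder, and the function names and the 5 degree grid step are hypothetical choices.

```python
import math


def encode_plane_wave(s, az, el):
    """Hypothetical helper: W plus unit-gain dipoles X, Y, Z (assumed convention)."""
    return (s,
            s * math.cos(az) * math.cos(el),
            s * math.sin(az) * math.cos(el),
            s * math.sin(el))


def scan_peak_direction(frames, step_deg=5):
    """Scan a regular grid and return ((az_deg, el_deg), power) at the peak.

    frames: list of (W, X, Y, Z) complex coefficients for one frequency bin.
    """
    best_dir, best_power = None, -1.0
    for el_deg in range(-90, 91, step_deg):
        el = math.radians(el_deg)
        for az_deg in range(-180, 180, step_deg):
            az = math.radians(az_deg)
            ux = math.cos(az) * math.cos(el)
            uy = math.sin(az) * math.cos(el)
            uz = math.sin(el)
            # Cardioid steered towards (az, el); power summed over frames.
            power = sum(abs(0.5 * (w + ux * x + uy * y + uz * z)) ** 2
                        for w, x, y, z in frames)
            if power > best_power:
                best_dir, best_power = (az_deg, el_deg), power
    return best_dir, best_power


# One plane wave arriving from a direction that lies on the 5 degree grid.
frames = [encode_plane_wave(1.0, math.radians(40), math.radians(10))] * 4
direction, peak_power = scan_peak_direction(frames)
```

A denser grid or a sharper steering pattern would narrow the peak, and finding several local maxima instead of the single global one would be the natural extension for multiple sources.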
|Status||Published - September 2019|
|Event||EAA Spatial Audio Signal Processing Symposium - Sorbonne, France|
Duration: 6 September 2019 → 7 September 2019
|Conference||EAA Spatial Audio Signal Processing Symposium|
|Period||06/09/2019 → 07/09/2019|