Abstract
Vector quantization (VQ) is a classic signal processing technique that models the probability density function of a distribution with a set of representative vectors called the codebook, such that each codebook vector represents a subset of the distribution's samples. Deep neural networks (DNNs) are a branch of machine learning that has gained popularity in recent decades owing to their ability to solve complicated optimization problems. Since VQ provides an abstract, high-level discrete representation of a distribution, it has been widely used as a beneficial tool in many DNN-based applications, such as image generation, speech recognition, text-to-speech synthesis, and speech and video coding. Given VQ's broad utilization in DNN-based applications, even a small improvement in VQ can yield a considerable boost in the performance of many applications running on devices that handle different data types, such as speech, image, video, and text. This thesis focuses on improving various VQ methods within deep learning frameworks.

We first propose using vector quantization instead of scalar quantization in a speech coding framework. The experiments show that the decoded speech has higher perceptual quality, because VQ accounts for the correlation between different dimensions of the spectral envelopes.

As another contribution, we propose a new solution to the gradient collapse problem, called noise substitution in vector quantization (NSVQ), in which VQ is modeled as the addition of a noise vector to the input. Experiments show that NSVQ achieves faster convergence, more accurate gradients, and fewer hyperparameters to tune than two state-of-the-art solutions, the straight-through estimator and the exponential moving average. We further demonstrate that NSVQ can also optimize VQ variants that use multiple codebooks, such as product VQ, residual VQ, and additive VQ. Experimental results on speech coding, image compression, and approximate nearest neighbor search tasks show that the VQ variants optimized by NSVQ perform comparably to the baselines.

By incorporating space-filling curves into VQ, we introduce a novel quantization technique called space-filling vector quantization (SFVQ), which quantizes the input on a continuous piecewise linear curve. Because of the inherent order in the SFVQ codebook, adjacent codebook vectors refer to similar content. We use this property to interpret the underlying phonetic structure of the latent space of a voice conversion model. Moreover, we use SFVQ to interpret the intermediate latent spaces of the StyleGAN2 and BigGAN image generative models. SFVQ gives good control over generation: we find the mapping between the latent space and generative factors (e.g., gender and age) and discover interpretable directions for changing image attributes (e.g., smile and pose). In another work, we use SFVQ to cluster speaker embeddings in order to enhance speaker privacy in DNN-based speech processing tools.
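To make the noise substitution idea more concrete, the sketch below shows one possible training-time forward pass in which the quantized output is formed by adding to the input a random noise vector scaled to the magnitude of the actual quantization error. This is a minimal illustration based only on the abstract's description of NSVQ, not the thesis implementation; the function name, tensor shapes, and the exact noise scaling are assumptions.

```python
import torch

def nsvq_quantize(x, codebook):
    """Minimal sketch of noise substitution in vector quantization (NSVQ).

    x:        (batch, dim) input vectors
    codebook: (num_codes, dim) codebook vectors
    Returns the training-time substitute, the hard quantization,
    and the selected codebook indices.
    """
    # Hard nearest-neighbour assignment (what inference would use).
    distances = torch.cdist(x, codebook)             # (batch, num_codes)
    indices = distances.argmin(dim=1)
    hard_quantized = codebook[indices]

    # Training-time substitute: add random noise whose norm matches the
    # quantization error, keeping the input-to-output path differentiable.
    noise = torch.randn_like(x)
    noise = noise / noise.norm(dim=1, keepdim=True).clamp_min(1e-12)
    error_norm = (x - hard_quantized).norm(dim=1, keepdim=True)
    quantized_substitute = x + error_norm * noise
    return quantized_substitute, hard_quantized, indices

# Illustrative usage with random data.
x = torch.randn(8, 16)
codebook = torch.randn(64, 16, requires_grad=True)
z_q, z_hard, idx = nsvq_quantize(x, codebook)
```

Because the substitute output depends on the input and on the quantization-error norm rather than on the hard assignment alone, gradients can reach both the encoder and the codebook during backpropagation.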
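The space-filling idea can be read in a similarly simplified way: the ordered codebook vectors act as vertices of a continuous piecewise linear curve, and an input is quantized to its nearest point on that curve. The sketch below projects the input onto every segment of the curve and keeps the closest projection; the names and shapes are illustrative assumptions rather than the thesis implementation.

```python
import torch

def sfvq_quantize(x, curve_points):
    """Minimal sketch of space-filling vector quantization (SFVQ).

    x:            (batch, dim) input vectors
    curve_points: (num_points, dim) ordered codebook vectors, treated as
                  vertices of a piecewise linear curve
    Returns the closest point on the curve for each input vector.
    """
    starts, ends = curve_points[:-1], curve_points[1:]   # (K-1, dim) each
    segments = ends - starts

    # Projection coefficient of each input onto each segment, clamped so
    # the projection stays on the segment rather than its extension.
    diff = x.unsqueeze(1) - starts.unsqueeze(0)          # (batch, K-1, dim)
    t = (diff * segments).sum(-1) / (segments * segments).sum(-1).clamp_min(1e-12)
    t = t.clamp(0.0, 1.0)                                # (batch, K-1)

    projections = starts.unsqueeze(0) + t.unsqueeze(-1) * segments.unsqueeze(0)
    sq_dists = ((x.unsqueeze(1) - projections) ** 2).sum(-1)
    best = sq_dists.argmin(dim=1)                        # closest segment per input
    return projections[torch.arange(x.shape[0]), best]
```

The ordering of `curve_points` is what makes adjacent codebook vectors refer to similar content, which is the property the abstract exploits for interpreting latent spaces.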
Translated title of the contribution | Vector quantization in deep neural networks for speech and image processing |
---|---|
Original language | English |
Qualification | Doctor's degree |
Awarding institution | |
Supervisors/Advisors | |
Publisher | |
Print ISBN | 978-952-64-2361-6 |
Electronic ISBN | 978-952-64-2362-3 |
Status | Published - 2025 |
Publication type (OKM classification) | G5 Doctoral dissertation (article) |