Formant Tracking by Combining Deep Neural Network and Linear Prediction

Sudarsana Kadiri*, Kevin Huang, Christina Hagedorn, Dani Byrd, Paavo Alku*, Shrikanth Narayanan

*Corresponding author for this work

Research output: Contribution to journal › Article › Scientific › peer-review


Abstract

Formant tracking is an area of speech science that has recently undergone a technology shift from classical model-driven signal processing methods to modern data-driven deep learning methods. In this study, these two domains are combined in formant tracking by refining the formants estimated by a data-driven deep neural network (DNN) with formant estimates given by a model-driven linear prediction (LP) method. In the refinement process, the three lowest formants, initially estimated by the DNN-based method, are replaced frame-wise with local spectral peaks identified by the LP method. The LP-based refinement stage can be integrated seamlessly into the DNN-based tracker without any additional training. As the LP method, the study advocates the use of quasi-closed phase forward-backward (QCP-FB) analysis. Three spectral representations are compared as DNN inputs: mel-frequency cepstral coefficients (MFCCs), the spectrogram, and the complex spectrogram. Formant tracking performance was evaluated by comparing the proposed refined DNN tracker with seven reference trackers, which included both signal processing and deep learning-based methods. As evaluation data, ground-truth formants of the Vocal Tract Resonance (VTR) corpus were used. The results demonstrate that the refined DNN trackers outperformed all conventional trackers, with the best results obtained using the MFCC input for the DNN. The proposed MFCC-based refined tracker (MFCC-DNN_QCP-FB) reduced estimation errors by 0.8 Hz, 12.9 Hz, and 11.7 Hz for the first (F1), second (F2), and third (F3) formants, respectively, compared to the refined Deep Formants tracker (DeepF_QCP-FB). When compared to the model-driven KARMA tracking method, the proposed refinement reduced estimation errors by 2.3 Hz, 55.5 Hz, and 143.4 Hz for F1, F2, and F3, respectively. A detailed evaluation across phonetic categories and gender groups showed that the proposed hybrid refinement approach improves formant tracking performance across most test conditions.
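
The abstract describes a frame-wise refinement in which the three lowest DNN-estimated formants are replaced by nearby spectral peaks obtained from an LP (QCP-FB) analysis. The following minimal Python sketch illustrates only that replacement step, assuming DNN estimates and per-frame LP peak candidates are already available: each DNN estimate is snapped to the closest LP peak within a tolerance. The function name, the tolerance, and the nearest-peak selection rule are illustrative assumptions, not the paper's exact procedure, and the QCP-FB analysis itself is not implemented here.

```python
import numpy as np

def refine_formants(dnn_formants, lp_peaks, max_dev_hz=500.0):
    """Illustrative frame-wise refinement (not the paper's exact rule).

    dnn_formants : (num_frames, 3) array of DNN-estimated F1-F3 in Hz.
    lp_peaks     : list of per-frame arrays of LP spectral-peak
                   frequencies in Hz (length may vary per frame).
    max_dev_hz   : hypothetical tolerance for accepting an LP peak.
    """
    refined = dnn_formants.copy()
    for t, peaks in enumerate(lp_peaks):
        if len(peaks) == 0:
            continue  # no LP peaks in this frame: keep the DNN estimate
        peaks = np.asarray(peaks, dtype=float)
        for k in range(refined.shape[1]):  # F1, F2, F3
            idx = np.argmin(np.abs(peaks - refined[t, k]))
            if abs(peaks[idx] - refined[t, k]) <= max_dev_hz:
                refined[t, k] = peaks[idx]  # snap to the closest LP peak
    return refined

# Toy usage: two frames, three formants each (values in Hz).
dnn = np.array([[510.0, 1480.0, 2510.0],
                [505.0, 1500.0, 2490.0]])
peaks_per_frame = [np.array([495.0, 1450.0, 2550.0, 3500.0]),
                   np.array([500.0, 1520.0, 2470.0])]
print(refine_formants(dnn, peaks_per_frame))
```

Because the refinement only post-processes the DNN outputs, a stage of this kind can be attached to an already trained tracker, which is consistent with the abstract's statement that no additional training is required.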
Original language: English
Pages (from-to): 222-230
Number of pages: 9
Journal: IEEE Open Journal of Signal Processing
Volume: 6
DOIs
Publication status: Published - 2025
MoE publication type: A1 Journal article-refereed

Keywords

  • Formant tracking
  • MFCCs
  • deep learning
  • linear prediction
  • machine learning
  • spectrogram
