Abstract
Jitter and shimmer are voice-quality features that have been used successfully to detect voice pathologies and to classify speaking styles. In this paper, we investigate the usefulness of these voice-quality features in neural-network-based speaker verification systems. To combine the voice-quality features with mel-spectrogram features, the cosine distance scores estimated from the two feature sets are linearly weighted to obtain a single fused score, which is then used to accept or reject a given speaker. Experiments on the VoxCeleb1 dataset demonstrate that fusing the cosine distance scores extracted from the mel-spectrogram and voice-quality features provides a 15% relative improvement in Equal Error Rate (EER) over the baseline system, which is based only on mel-spectrogram features.
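The score-level fusion described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the fusion weight of 0.8, the decision threshold of 0.5, and the toy embeddings are all hypothetical, and cosine similarity stands in for the cosine distance score mentioned in the abstract. In practice, the weight and threshold would presumably be tuned on held-out data.

```python
import math


def cosine_score(a, b):
    """Cosine similarity between two embedding vectors (assumed scoring function)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def fused_score(mel_score, vq_score, weight=0.8):
    """Linearly weight the two scores; `weight` is a hypothetical value."""
    return weight * mel_score + (1.0 - weight) * vq_score


def accept(score, threshold=0.5):
    """Accept the trial if the fused score reaches a hypothetical threshold."""
    return score >= threshold


# Toy embeddings for one enrollment/test pair (illustrative values only).
mel_enroll, mel_test = [0.9, 0.1, 0.3], [0.8, 0.2, 0.4]
vq_enroll, vq_test = [0.2, 0.7], [0.3, 0.6]

score = fused_score(
    cosine_score(mel_enroll, mel_test),
    cosine_score(vq_enroll, vq_test),
)
print("fused score:", score, "accepted:", accept(score))
```

Fusing at the score level keeps the two feature streams independent, so either system can be retrained or replaced without touching the other.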
Original language | English |
---|---|
Title of host publication | 29th European Signal Processing Conference, EUSIPCO 2021 - Proceedings |
Publisher | IEEE |
Pages | 176-180 |
Number of pages | 5 |
ISBN (Electronic) | 978-9-0827-9706-0 |
Publication status | Published - 27 Aug 2021 |
MoE publication type | A4 Article in a conference publication |
Event | European Signal Processing Conference - Dublin, Ireland. Duration: 23 Aug 2021 → 27 Aug 2021. Conference number: 29 |
Publication series
Name | European Signal Processing Conference |
---|---|
ISSN (Print) | 2219-5491 |
ISSN (Electronic) | 2076-1465 |
Conference
Conference | European Signal Processing Conference |
---|---|
Abbreviated title | EUSIPCO |
Country/Territory | Ireland |
City | Dublin |
Period | 23/08/2021 → 27/08/2021 |
Keywords
- jitter
- mel-spectrogram
- fusion
- shimmer
- speech recognition