Releasing a toolkit and comparing the performance of language embeddings across various spoken language identification datasets

Matias Lindgren, Tommi Jauhiainen, Mikko Kurimo

Research output: Chapter in Book/Report/Conference proceeding › Conference article in proceedings › Scientific › peer-review


Abstract

In this paper, we propose a software toolkit for easier end-to-end training of deep learning based spoken language identification models across several speech datasets. We apply our toolkit to implement three baseline models, one speaker recognition model, and three x-vector architecture variations, which are trained on three datasets previously used in spoken language identification experiments. All models are trained separately on each dataset (closed task) and on a combination of all datasets (open task), after which we compare whether the open task training yields better language embeddings. We begin by training all models end-to-end as discriminative classifiers of spectral features, labeled by language. Then, we extract language embedding vectors from the trained end-to-end models, train separate Gaussian Naive Bayes classifiers on the vectors, and compare which model provides the best language embeddings for the back-end classifier. Our experiments show that the open task condition leads to improved language identification performance on only one of the datasets. In addition, we discovered that increasing the robustness of the x-vector model with random frequency channel dropout significantly reduces its end-to-end classification performance on the test set, while not affecting the back-end classification performance of its embeddings. Finally, we note that two baseline models consistently outperformed all other models.
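The two-stage pipeline described above (end-to-end training, then a Gaussian Naive Bayes back-end on extracted embeddings) can be illustrated with a minimal sketch. The snippet below is not taken from the paper's toolkit; it stands in random NumPy arrays for real language embedding vectors and uses scikit-learn's GaussianNB as the back-end classifier.

    # Minimal sketch of the embedding back-end described in the abstract.
    # The embedding arrays here are random placeholders; in the paper the
    # vectors come from a trained end-to-end model (e.g. the bottleneck
    # layer of an x-vector network).
    import numpy as np
    from sklearn.naive_bayes import GaussianNB

    rng = np.random.default_rng(0)
    num_languages, embedding_dim = 4, 512
    X_train = rng.normal(size=(1000, embedding_dim))      # placeholder embeddings
    y_train = rng.integers(num_languages, size=1000)      # language labels
    X_test = rng.normal(size=(200, embedding_dim))
    y_test = rng.integers(num_languages, size=200)

    # One Gaussian Naive Bayes classifier serves as the back-end on top of
    # the language embedding vectors.
    backend = GaussianNB()
    backend.fit(X_train, y_train)
    print("back-end accuracy:", backend.score(X_test, y_test))

Similarly, one plausible reading of the "random frequency channel dropout" used to increase x-vector robustness is to randomly zero out entire frequency channels of the input spectral features during training. The abstract does not spell out the implementation, so the function below is only an assumed variant, written with TensorFlow (one of the paper's keywords).

    import tensorflow as tf

    def random_freq_channel_dropout(spectrogram, drop_prob=0.1):
        """Randomly zero whole frequency channels of a (time, freq) feature matrix.

        Assumed interpretation of the augmentation named in the abstract,
        not the toolkit's actual implementation.
        """
        num_channels = tf.shape(spectrogram)[-1]
        keep_mask = tf.cast(
            tf.random.uniform([num_channels]) >= drop_prob, spectrogram.dtype)
        return spectrogram * keep_mask  # mask broadcasts over the time axis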

Original language: English
Title of host publication: Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
Publisher: International Speech Communication Association (ISCA)
Pages: 467-471
Number of pages: 5
Volume: 2020-October
Publication status: Published - 2020
MoE publication type: A4 Conference publication
Event: Interspeech - Shanghai, China
Duration: 25 Oct 2020 - 29 Oct 2020
Conference number: 21
http://www.interspeech2020.org/

Publication series

Name: Interspeech
Publisher: International Speech Communication Association
ISSN (Print): 2308-457X

Conference

Conference: Interspeech
Abbreviated title: INTERSPEECH
Country/Territory: China
City: Shanghai
Period: 25/10/2020 - 29/10/2020
Internet address: http://www.interspeech2020.org/

Keywords

  • Deep learning
  • Language embedding
  • Spoken language identification
  • TensorFlow
  • X-vector

