Transferring Neural Speech Waveform Synthesizers to Musical Instrument Sounds Generation

Yi Zhao, Xin Wang, Lauri Juvela, Junichi Yamagishi

Research output: Chapter in Book/Report/Conference proceeding; Conference article in proceedings; Scientific; peer-review

8 Citations (Scopus)

Abstract

Recent neural waveform synthesizers such as WaveNet, WaveGlow, and the neural-source-filter (NSF) model have shown good performance in speech synthesis despite their different methods of waveform generation. The similarity between speech and music audio synthesis techniques suggests interesting avenues to explore in terms of the best way to apply speech synthesizers in the music domain. This work compares three neural synthesizers used for musical instrument sounds generation under three scenarios: training from scratch on music data, zero-shot learning from the speech domain, and fine-tuning-based adaptation from the speech to the music domain. The results of a large-scale perceptual test demonstrated that the performance of the three synthesizers improved when they were pre-trained on speech data and fine-tuned on music data, which indicates the usefulness of knowledge from speech data for music audio generation. Among the synthesizers, WaveGlow showed the best potential in zero-shot learning, while NSF performed best in the other scenarios and could generate samples that were perceptually close to natural audio.

Original language: English
Title of host publication: 2020 IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2020 - Proceedings
Publisher: IEEE
Pages: 6269-6273
Number of pages: 5
ISBN (Electronic): 9781509066315
DOIs
Publication status: Published - May 2020
MoE publication type: A4 Conference publication
Event: IEEE International Conference on Acoustics, Speech, and Signal Processing - Virtual conference, Barcelona, Spain
Duration: 4 May 2020 to 8 May 2020
Conference number: 45

Publication series

Name: Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing
ISSN (Print): 1520-6149
ISSN (Electronic): 2379-190X

Conference

Conference: IEEE International Conference on Acoustics, Speech, and Signal Processing
Abbreviated title: ICASSP
Country/Territory: Spain
City: Barcelona
Period: 04/05/2020 to 08/05/2020
Other: Virtual conference

Keywords

  • fine-tuning
  • musical instrument sounds synthesis
  • neural waveform synthesizer
  • zero-shot adaptation
