High-pitched excitation generation for glottal vocoding in statistical parametric speech synthesis using a deep neural network

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › Scientific › peer-review

Standard

High-pitched excitation generation for glottal vocoding in statistical parametric speech synthesis using a deep neural network. / Juvela, Lauri; Bollepalli, Bajibabu; Airaksinen, Manu; Alku, Paavo.

IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2016: Proceedings. Vol. 2016-May. Institute of Electrical and Electronics Engineers, 2016. p. 5120-5124. 7472653. (Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing).


Harvard

Juvela, L, Bollepalli, B, Airaksinen, M & Alku, P 2016, High-pitched excitation generation for glottal vocoding in statistical parametric speech synthesis using a deep neural network. in IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2016: Proceedings. vol. 2016-May, 7472653, Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, Institute of Electrical and Electronics Engineers, pp. 5120-5124, IEEE International Conference on Acoustics, Speech, and Signal Processing, Shanghai, China, 20/03/2016. https://doi.org/10.1109/ICASSP.2016.7472653

APA

Juvela, L., Bollepalli, B., Airaksinen, M., & Alku, P. (2016). High-pitched excitation generation for glottal vocoding in statistical parametric speech synthesis using a deep neural network. In IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2016: Proceedings (Vol. 2016-May, pp. 5120-5124). [7472653] (Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing). Institute of Electrical and Electronics Engineers. https://doi.org/10.1109/ICASSP.2016.7472653

Vancouver

Juvela L, Bollepalli B, Airaksinen M, Alku P. High-pitched excitation generation for glottal vocoding in statistical parametric speech synthesis using a deep neural network. In IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2016: Proceedings. Vol. 2016-May. Institute of Electrical and Electronics Engineers. 2016. p. 5120-5124. 7472653. (Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing). https://doi.org/10.1109/ICASSP.2016.7472653

Author

Juvela, Lauri ; Bollepalli, Bajibabu ; Airaksinen, Manu ; Alku, Paavo. / High-pitched excitation generation for glottal vocoding in statistical parametric speech synthesis using a deep neural network. IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2016: Proceedings. Vol. 2016-May. Institute of Electrical and Electronics Engineers, 2016. pp. 5120-5124. (Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing).

BibTeX

@inproceedings{abbd661961fa46198e09e65e85c8fb25,
title = "High-pitched excitation generation for glottal vocoding in statistical parametric speech synthesis using a deep neural network",
abstract = "Achieving high quality and naturalness in statistical parametric synthesis of female voices remains difficult despite recent advances in the study area. Vocoding is a key element in all statistical speech synthesizers that is known to affect the synthesis quality and naturalness. The present study focuses on a special type of vocoding, glottal vocoders, which aim to parameterize speech based on modelling the real excitation of (voiced) speech, the glottal flow. More specifically, we compare three different glottal vocoders, aiming at improved synthesis naturalness of female voices. Two of the vocoders are previously known, both utilizing an older glottal inverse filtering (GIF) method in estimating the glottal flow. The third one, denoted as Quasi Closed Phase - Deep Neural Net (QCP-DNN), takes advantage of a recently proposed GIF method that shows improved accuracy in estimating the glottal flow from high-pitched speech. Subjective listening tests conducted on a US English female voice show that the proposed QCP-DNN method gives a significant improvement in synthetic naturalness compared to the two previously developed glottal vocoders.",
keywords = "Deep neural network, Glottal inverse filtering, Glottal vocoder, QCP, Statistical parametric speech synthesis",
author = "Lauri Juvela and Bajibabu Bollepalli and Manu Airaksinen and Paavo Alku",
year = "2016",
month = may,
day = "18",
doi = "10.1109/ICASSP.2016.7472653",
language = "English",
isbn = "9781479999880",
volume = "2016-May",
series = "Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing",
publisher = "Institute of Electrical and Electronics Engineers",
pages = "5120--5124",
booktitle = "IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2016",
address = "United States",

}

RIS

TY - GEN

T1 - High-pitched excitation generation for glottal vocoding in statistical parametric speech synthesis using a deep neural network

AU - Juvela, Lauri

AU - Bollepalli, Bajibabu

AU - Airaksinen, Manu

AU - Alku, Paavo

PY - 2016/5/18

Y1 - 2016/5/18

N2 - Achieving high quality and naturalness in statistical parametric synthesis of female voices remains difficult despite recent advances in the study area. Vocoding is a key element in all statistical speech synthesizers that is known to affect the synthesis quality and naturalness. The present study focuses on a special type of vocoding, glottal vocoders, which aim to parameterize speech based on modelling the real excitation of (voiced) speech, the glottal flow. More specifically, we compare three different glottal vocoders, aiming at improved synthesis naturalness of female voices. Two of the vocoders are previously known, both utilizing an older glottal inverse filtering (GIF) method in estimating the glottal flow. The third one, denoted as Quasi Closed Phase - Deep Neural Net (QCP-DNN), takes advantage of a recently proposed GIF method that shows improved accuracy in estimating the glottal flow from high-pitched speech. Subjective listening tests conducted on a US English female voice show that the proposed QCP-DNN method gives a significant improvement in synthetic naturalness compared to the two previously developed glottal vocoders.

AB - Achieving high quality and naturalness in statistical parametric synthesis of female voices remains difficult despite recent advances in the study area. Vocoding is a key element in all statistical speech synthesizers that is known to affect the synthesis quality and naturalness. The present study focuses on a special type of vocoding, glottal vocoders, which aim to parameterize speech based on modelling the real excitation of (voiced) speech, the glottal flow. More specifically, we compare three different glottal vocoders, aiming at improved synthesis naturalness of female voices. Two of the vocoders are previously known, both utilizing an older glottal inverse filtering (GIF) method in estimating the glottal flow. The third one, denoted as Quasi Closed Phase - Deep Neural Net (QCP-DNN), takes advantage of a recently proposed GIF method that shows improved accuracy in estimating the glottal flow from high-pitched speech. Subjective listening tests conducted on a US English female voice show that the proposed QCP-DNN method gives a significant improvement in synthetic naturalness compared to the two previously developed glottal vocoders.

KW - Deep neural network

KW - Glottal inverse filtering

KW - Glottal vocoder

KW - QCP

KW - Statistical parametric speech synthesis

UR - http://www.scopus.com/inward/record.url?scp=84973293681&partnerID=8YFLogxK

U2 - 10.1109/ICASSP.2016.7472653

DO - 10.1109/ICASSP.2016.7472653

M3 - Conference contribution

SN - 9781479999880

VL - 2016-May

T3 - Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing

SP - 5120

EP - 5124

BT - IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2016

PB - Institute of Electrical and Electronics Engineers

ER -

ID: 3263829