TY - GEN
T1 - Lombard speech synthesis using long short-term memory recurrent neural networks
AU - Bollepalli, Bajibabu
AU - Airaksinen, Manu
AU - Alku, Paavo
PY - 2017/6/16
AB - In statistical parametric speech synthesis (SPSS), a few studies have investigated the Lombard effect, specifically using hidden Markov model (HMM)-based systems. Recently, artificial neural networks have demonstrated promising results in SPSS, specifically long short-term memory recurrent neural networks (LSTMs). The Lombard effect, however, has not been studied in LSTM-based speech synthesis systems. In this study, we propose three methods for Lombard speech adaptation in LSTM-based speech synthesis. In particular, we (1) augment the linguistic input features with Lombard-specific information, (2) scale the hidden activations using the learning hidden unit contributions (LHUC) method, and (3) fine-tune the LSTMs trained on normal speech with a small amount of Lombard speech data. To investigate the effectiveness of the proposed methods, we carry out experiments using small (10 utterances) and large (500 utterances) Lombard speech data sets. Experimental results confirm the adaptability of the LSTMs, and similarity tests show that the LSTMs can achieve significantly better adaptation performance than the HMMs in both the small and large data conditions.
KW - adaptation
KW - Lombard speech synthesis
KW - LSTM-TTS
UR - http://www.scopus.com/inward/record.url?scp=85023763091&partnerID=8YFLogxK
DO - 10.1109/ICASSP.2017.7953209
M3 - Conference article in proceedings
AN - SCOPUS:85023763091
T3 - Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing
SP - 5505
EP - 5509
BT - 2017 IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2017 - Proceedings
PB - IEEE
T2 - IEEE International Conference on Acoustics, Speech, and Signal Processing
Y2 - 5 March 2017 through 9 March 2017
ER -