FastSpeech: Fast, Robust and Controllable Text to Speech

Neural network based end-to-end text to speech (TTS) has significantly improved the quality of synthesized speech. Prominent methods (e.g., Tacotron 2) usually first generate a mel-spectrogram from text, and then synthesize speech from the mel-spectrogram using a vocoder such as WaveNet. Compared with traditional concatenative and statistical parametric approaches, neural network based end-to-end models suffer from slow inference speed, and the synthesized speech is usually not robust (i.e., some words are skipped or repeated) and lacks controllability (voice speed or prosody control). In this work, we propose a novel feed-forward network based on Transformer to generate mel-spectrograms in parallel for TTS.
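The controllability mentioned above comes from FastSpeech's length regulator, which expands phoneme-level hidden states to frame level according to predicted durations; scaling those durations changes the voice speed. The sketch below is a minimal, illustrative version of that idea — the function name, inputs, and rounding policy are assumptions, not the authors' implementation.

```python
# Hypothetical sketch of FastSpeech-style length regulation.
# Names and data shapes here are illustrative assumptions.

def length_regulate(phoneme_hiddens, durations, speed=1.0):
    """Expand per-phoneme hidden states to mel-frame level.

    phoneme_hiddens: one hidden representation per phoneme.
    durations: predicted number of mel frames per phoneme.
    speed: duration scaling factor; >1.0 yields more frames
           (slower speech), <1.0 fewer frames (faster speech).
    """
    expanded = []
    for hidden, duration in zip(phoneme_hiddens, durations):
        # Scale the predicted duration, keeping at least one frame.
        n_frames = max(1, round(duration * speed))
        # Repeat the phoneme's hidden state for each of its frames.
        expanded.extend([hidden] * n_frames)
    return expanded

# Two phonemes with durations 2 and 3 expand to a 5-frame sequence.
frames = length_regulate(["h1", "h2"], [2, 3])
```

Because the whole frame-level sequence is produced in one pass (rather than autoregressively, frame by frame), the subsequent mel-spectrogram decoding can run in parallel, which is the source of the speedup the abstract claims.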

Data and Resources

Cite this as

Yi Ren, Yangjun Ruan, Xu Tan, Tao Qin, Sheng Zhao, Zhou Zhao, Tie-Yan Liu (2024). Dataset: FastSpeech: Fast, Robust and Controllable Text to Speech. https://doi.org/10.57702/jd8hw0cw

DOI retrieved: December 3, 2024

Additional Info

Created: December 3, 2024
Last update: December 3, 2024
Author: Yi Ren (Zhejiang University)
More authors: Yangjun Ruan (Zhejiang University); Xu Tan (Microsoft Research); Tao Qin (Microsoft Research); Sheng Zhao (Microsoft STC Asia); Zhou Zhao (Zhejiang University); Tie-Yan Liu (Microsoft Research)
Homepage: https://speechresearch.github.io/fastspeech/