FastSpeech: Fast, Robust and Controllable Text to Speech

doi:10.57702/jd8hw0cw

Neural network based end-to-end text to speech (TTS) has significantly improved the quality of synthesized speech. Prominent methods (e.g., Tacotron 2) usually first generate a mel-spectrogram from text and then synthesize speech from the mel-spectrogram using a vocoder such as WaveNet. Compared with traditional concatenative and statistical parametric approaches, neural network based end-to-end models suffer from slow inference, and the synthesized speech is often not robust (i.e., some words are skipped or repeated) and lacks controllability (e.g., over voice speed or prosody). In this work, we propose a novel feed-forward network based on Transformer to generate mel-spectrograms in parallel for TTS.

BibTeX:

@dataset{Yi_Ren_and_Zhejiang_University_and_Yangjun_Ruan_and_Zhejiang_University_and_Xu_Tan_and_Microsoft_Research_and_Tao_Qin_and_Microsoft_Research_and_Sheng_Zhao_and_Microsoft_STC_Asia_and_Zhou_Zhao_and_Zhejiang_University_and_Tie-Yan_Liu_and_Microsoft_Research_2024,
    abstract = {Neural network based end-to-end text to speech (TTS) has significantly improved the quality of synthesized speech. Prominent methods (e.g., Tacotron 2) usually first generate a mel-spectrogram from text and then synthesize speech from the mel-spectrogram using a vocoder such as WaveNet. Compared with traditional concatenative and statistical parametric approaches, neural network based end-to-end models suffer from slow inference, and the synthesized speech is often not robust (i.e., some words are skipped or repeated) and lacks controllability (e.g., over voice speed or prosody). In this work, we propose a novel feed-forward network based on Transformer to generate mel-spectrograms in parallel for TTS.},
    author = {Yi Ren and Yangjun Ruan and Xu Tan and Tao Qin and Sheng Zhao and Zhou Zhao and Tie-Yan Liu},
    note = {Affiliations: Yi Ren, Yangjun Ruan, Zhou Zhao (Zhejiang University); Xu Tan, Tao Qin, Tie-Yan Liu (Microsoft Research); Sheng Zhao (Microsoft STC Asia)},
    doi = {10.57702/jd8hw0cw},
    institution = {No Organization},
    keyword = {Transformer, parallel mel-spectrogram generation, speech synthesis, text to speech},
    month = {dec},
    publisher = {TIB},
    title = {FastSpeech: Fast, Robust and Controllable Text to Speech},
    url = {https://service.tib.eu/ldmservice/dataset/fastspeech--fast--robust-and-controllable-text-to-speech},
    year = {2024}
}