Mega-TTS: Zero-Shot Text-to-Speech at Scale with Intrinsic Inductive Bias

doi:10.57702/0yzgeweh

Scaling text-to-speech to large, in-the-wild datasets has proven highly effective for timbre and speech-style generalization, particularly in zero-shot TTS. However, previous work usually encodes speech into latents with an audio codec and generates them with autoregressive language models or diffusion models, which ignores the intrinsic nature of speech and may lead to inferior or uncontrollable results.

BibTeX:

@dataset{Ziyue_Jiang_and_Yi_Ren_and_Zhenhui_Ye_and_Jinglin_Liu_and_Chen_Zhang_and_Qian_Yang_and_Shengpeng_Ji_and_Rongjie_Huang_and_Chunfeng_Wang_and_Xiang_Yin_and_Zejun_Ma_and_Zhou_Zhao_2024,
    abstract = {Scaling text-to-speech to large, in-the-wild datasets has proven highly effective for timbre and speech-style generalization, particularly in zero-shot TTS. However, previous work usually encodes speech into latents with an audio codec and generates them with autoregressive language models or diffusion models, which ignores the intrinsic nature of speech and may lead to inferior or uncontrollable results.},
    author = {Ziyue Jiang and Yi Ren and Zhenhui Ye and Jinglin Liu and Chen Zhang and Qian Yang and Shengpeng Ji and Rongjie Huang and Chunfeng Wang and Xiang Yin and Zejun Ma and Zhou Zhao},
    doi = {10.57702/0yzgeweh},
    institution = {No Organization},
    keyword = {intrinsic inductive bias, speech style generalization, zero-shot text-to-speech},
    month = {dec},
    publisher = {TIB},
    title = {Mega-TTS: Zero-Shot Text-to-Speech at Scale with Intrinsic Inductive Bias},
    url = {https://service.tib.eu/ldmservice/dataset/mega-tts--zero-shot-text-to-speech-at-scale-with-intrinsic-inductive-bias},
    year = {2024}
}