Mega-TTS: Zero-Shot Text-to-Speech at Scale with Intrinsic Inductive Bias
Scaling text-to-speech to a large and wild dataset has been proven to be highly effective in achieving timbre and speech style generalization, particularly in zero-shot TTS...
Parallel WaveGAN-based waveform synthesis with voicing-aware conditional discriminators
This paper proposes voicing-aware conditional discriminators for Parallel WaveGAN-based waveform synthesis systems.
QS-TTS: A Semi-Supervised Text-to-Speech Framework
QS-TTS is a semi-supervised TTS framework based on vector-quantized self-supervised speech representation learning (VQ-S3RL), which effectively utilizes more unlabeled speech audio to improve TTS quality while reducing its requirements for...
BOFFIN TTS: Few-shot Speaker Adaptation by Bayesian Optimization
BOFFIN TTS is a novel approach for few-shot speaker adaptation. The task is to fine-tune a pre-trained TTS model to mimic a new speaker using a small corpus of target utterances.
FastSpeech: Fast, Robust and Controllable Text to Speech
Neural network based end-to-end text to speech (TTS) has significantly improved the quality of synthesized speech. Prominent methods (e.g., Tacotron 2) usually first generate...
Libri-Light
Libri-Light is a large-scale corpus of mostly unlabeled English speech derived from LibriVox audiobooks, designed for training ASR systems with limited or no supervision. The authors used this dataset to pre-train their proposed dual-mode ASR...
Guided-TTS 2
Guided-TTS 2 is a diffusion-based generative model for high-quality adaptive text-to-speech with untranscribed data.
SNIPER Training: Single-Shot Sparse Training for Text-to-Speech
Text-to-speech (TTS) models have achieved remarkable naturalness in recent years, yet like most deep neural models, they have more parameters than necessary. Sparse TTS models...
FastSpeech
FastSpeech is a non-autoregressive text-to-speech model, not a dataset; the original paper trains it on the LJSpeech dataset.
Non-Attentive Tacotron
Non-Attentive Tacotron is a neural text-to-speech model that combines a robust duration predictor with an autoregressive decoder.
Mega-TTS 2: Boosting Prompting Mechanisms for Zero-Shot Speech Synthesis
Zero-shot text-to-speech aims to synthesize voices with unseen speech prompts, which significantly reduces the data and computation requirements for voice cloning by skipping...
Style Tokens
Global Style Tokens (GSTs) are a recently-proposed method to learn latent disentangled representations of high-dimensional data. GSTs can be used within Tacotron, a...
Text-Predicted Global Style Tokens
Text-Predicted Global Style Tokens (TP-GST) extend GSTs by predicting the style embedding from the input text alone, so expressive speaking style can be synthesized without a reference audio signal at inference time.
FastSpeech 2: Fast and High-Quality End-to-End Text-to-Speech
FastSpeech 2 is a fast and high-quality text-to-speech system. It simplifies FastSpeech training by learning directly from ground-truth mel-spectrograms and conditioning on variance information (duration, pitch, and energy); its variant FastSpeech 2s synthesizes waveforms from text fully end-to-end.