-
Neural Codec Language Models
Neural codec language models are zero-shot text to speech synthesizers. -
Diffusion Models for Minimally-Supervised Speech Synthesis
Minimally-supervised speech synthesis method based on diffusion models with minimal supervision. Introduces the CTAP method as an intermediate semantic representation and uses... -
TACOTRON: TOWARDS END-TO-END SPEECH SYNTHESIS
A text-to-speech synthesis system typically consists of multiple stages, such as a text analysis frontend, an acoustic model and an audio synthesis module. -
FastSpeech: Fast, Robust and Controllable Text to Speech
Neural network based end-to-end text to speech (TTS) has significantly improved the quality of synthesized speech. Prominent methods (e.g., Tacotron 2) usually first generate... -
SNIPER Training: Single-Shot Sparse Training for Text-to-Speech
Text-to-speech (TTS) models have achieved remarkable naturalness in recent years, yet like most deep neural models, they have more parameters than necessary. Sparse TTS models... -
CSTR VCTK Corpus
The CSTR VCTK Corpus is a dataset of speech recordings of 109 speakers, each with 20 utterances. -
VCTK Dataset
The VCTK dataset is a large corpus of speech recordings, each containing a single speaker and a single sentence. -
LJSpeech Dataset
The LJSpeech dataset is a collection of audio recordings of a single female speaker reading aloud. -
FastDiff: A Fast Conditional Diffusion Model for High-Quality Speech Synthesis
FastDiff is a fast conditional diffusion model for high-quality speech synthesis. It employs a stack of time-aware location-variable convolutions with diverse receptive field... -
LJ Speech Dataset
The LJ speech dataset is a dataset of speech samples recorded from a single speaker reading passages from 7 non-fiction books. -
Hi-Fi Multi-Speaker English TTS dataset
The Hi-Fi Multi-Speaker English TTS dataset is used to generate training, validation and test inputs for the audio splicing detection and localization task. -
LibriSpeech dataset
The dataset used in the paper is the LibriSpeech dataset, which contains about 1,000 hours of English speech derived from audiobooks.