VCTK Corpus
The VCTK corpus is an English multi-speaker dataset with 44 hours of audio spoken by 109 native English speakers.
CSTR VCTK Corpus
The CSTR VCTK Corpus is a dataset of speech recordings of 109 speakers, each reading about 400 sentences.
Style Tokens
Global Style Tokens (GSTs) are a recently proposed method to learn latent disentangled representations of high-dimensional data. GSTs can be used within Tacotron, a state-of-the-art end-to-end text-to-speech synthesis system, to uncover expressive factors of variation in speaking style.
Text-Predicted Global Style Tokens
Global Style Tokens (GSTs) are a recently proposed method to learn latent disentangled representations of high-dimensional data. Text-Predicted GSTs (TP-GSTs) predict stylistic renderings from text alone, treating GST combination weights or style embeddings as "virtual" speaking-style labels within Tacotron, so no reference audio is needed at inference time.
VCTK Dataset
The VCTK dataset is a large corpus of speech recordings, each containing a single speaker reading a single sentence.
LJSpeech Dataset
The LJSpeech dataset is a collection of audio recordings of a single female speaker reading aloud.
FastDiff: A Fast Conditional Diffusion Model for High-Quality Speech Synthesis
FastDiff is a fast conditional diffusion model for high-quality speech synthesis. It employs a stack of time-aware location-variable convolutions with diverse receptive field patterns to efficiently model long-term time dependencies.
LJ Speech Dataset
The LJ Speech dataset consists of speech samples recorded from a single speaker reading passages from 7 non-fiction books.
LJSpeech and VCTK datasets
The LJSpeech dataset contains 13,100 22 kHz audio clips of a female speaker. The VCTK dataset consists of 108 native English speakers with various accents.
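For a quick sanity check, the figures quoted in these entries can be turned into derived statistics; a minimal sketch (the exact 22,050 Hz rate for LJSpeech is an assumption, since the entry above rounds it to 22 kHz; the 44 h / 109 speakers figures come from the VCTK entries):

```python
# Derived statistics from the corpus figures quoted in this section.

def samples_for_duration(seconds: float, sample_rate: int = 22_050) -> int:
    """Number of audio samples in a clip of the given length."""
    return round(seconds * sample_rate)

def minutes_per_speaker(total_hours: float, n_speakers: int) -> float:
    """Average amount of speech per speaker, in minutes."""
    return total_hours * 60 / n_speakers

# A 10-second LJSpeech clip holds 220,500 samples at 22,050 Hz.
print(samples_for_duration(10))             # 220500
# VCTK: 44 hours over 109 speakers is roughly 24 minutes each.
print(round(minutes_per_speaker(44, 109)))  # 24
```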
Hi-Fi Multi-Speaker English TTS dataset
The Hi-Fi Multi-Speaker English TTS (Hi-Fi TTS) dataset, a corpus of high-quality recordings from 10 speakers, is used to generate training, validation and test inputs for the audio splicing detection and localization task.
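A spliced input for such a task pairs audio with a boundary label; a minimal sketch of how one might be constructed (the `splice` helper and the use of plain sample lists are illustrative assumptions, not the dataset's actual pipeline):

```python
# Minimal sketch: build a spliced clip and its boundary label from two
# source clips. Real pipelines operate on waveform arrays (e.g. NumPy);
# plain lists keep the example self-contained.

def splice(clip_a, clip_b, cut_a, cut_b):
    """Concatenate clip_a[:cut_a] with clip_b[cut_b:].

    Returns the spliced samples and the index of the splice point,
    which serves as the localization label.
    """
    spliced = clip_a[:cut_a] + clip_b[cut_b:]
    return spliced, cut_a

# Two toy "waveforms" of 5 samples each.
a = [0.1, 0.2, 0.3, 0.4, 0.5]
b = [-0.1, -0.2, -0.3, -0.4, -0.5]

samples, boundary = splice(a, b, cut_a=3, cut_b=2)
print(samples)   # [0.1, 0.2, 0.3, -0.3, -0.4, -0.5]
print(boundary)  # 3
```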
LibriSpeech dataset
The LibriSpeech dataset, used in the paper, contains about 1,000 hours of read English speech derived from audiobooks.
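Corpus sizes like the 1,000-hour figure are sums of per-utterance durations; a minimal bookkeeping sketch (the duration-list "manifest" here is a hypothetical format, not LibriSpeech's actual metadata layout):

```python
# Sum per-utterance durations (seconds) into total corpus hours.

def total_hours(durations_s):
    """Total audio in hours for a list of utterance durations in seconds."""
    return sum(durations_s) / 3600

# Toy manifest: three utterances of 12.5, 7.5 and 16.0 seconds.
print(total_hours([12.5, 7.5, 16.0]))  # 0.01
```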