56 datasets found

Groups: Speech Synthesis
  • VCTK Corpus

    The VCTK corpus is an English multi-speaker dataset with 44 hours of audio spoken by 109 native English speakers (a loading sketch for this and the other torchaudio-packaged corpora below appears after this list).
  • CSTR VCTK Corpus

    The CSTR VCTK Corpus is a dataset of speech recordings from 109 speakers, each reading around 400 sentences.
  • Style Tokens

    Global Style Tokens (GSTs) are a recently proposed method to learn latent disentangled representations of high-dimensional data. GSTs can be used within Tacotron, a state-of-the-art end-to-end text-to-speech synthesis model, to uncover expressive factors of variation in speaking style. A minimal sketch of the GST attention step appears after this list.
  • Tacotron

    See Style Tokens above; this entry shares the same description.
  • Global Style Tokens

    See Style Tokens above; this entry shares the same description.
  • Text-Predicted Global Style Tokens

    See Style Tokens above; this entry shares the same description.
  • VCTK Dataset

    The VCTK dataset is a large corpus of speech recordings, each containing a single speaker reading a single sentence.
  • LJSpeech Dataset

    The LJSpeech dataset is a collection of audio recordings of a single female speaker reading aloud.
  • FastDiff: A Fast Conditional Diffusion Model for High-Quality Speech Synthesis

    FastDiff is a fast conditional diffusion model for high-quality speech synthesis. It employs a stack of time-aware location-variable convolutions with diverse receptive field patterns to model long-term time dependencies with adaptive conditions (a simplified sketch of a location-variable convolution appears after this list).
  • LJ Speech Dataset

    The LJ speech dataset is a dataset of speech samples recorded from a single speaker reading passages from 7 non-fiction books.
  • LJSpeech and VCTK datasets

    The LJSpeech dataset contains 13,100 audio clips of a female speaker, sampled at 22 kHz. The VCTK dataset consists of 108 native English speakers with various accents.
  • SC VALL-E

    The proposed SC VALL-E model uses a Korean grapheme-to-phoneme converter to extract phonemes from Korean text, which are then used for training (a G2P usage sketch appears after this list).
  • Hi-Fi Multi-Speaker English TTS dataset

    The Hi-Fi Multi-Speaker English TTS dataset is used to generate training, validation, and test inputs for the audio splicing detection and localization task (a sketch of how spliced examples can be constructed appears after this list).
  • VCTK

    Voice conversion (VC) is a technique that alters the voice of a source speaker to a target style, such as speaker identity, prosody, and emotion, while keeping the linguistic content unchanged.
  • LibriSpeech dataset

    The LibriSpeech dataset contains about 1,000 hours of English speech derived from audiobooks.
  • LibriTTS

    A popular text-based VC approach is to use an automatic speech recognition (ASR) model to extract a phonetic posteriorgram (PPG) as the content representation (a schematic PPG extraction sketch appears after this list).
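
Several corpora above (LJSpeech, VCTK, LibriSpeech, LibriTTS) ship as built-in torchaudio datasets, so they can be pulled down in a few lines. A minimal loading sketch, assuming a recent torchaudio and reachable download mirrors; "./data" is a placeholder root directory:

```python
import torchaudio

# Each loader downloads the corpus on first use and yields tuples
# beginning with (waveform, sample_rate, ...).
ljspeech = torchaudio.datasets.LJSPEECH("./data", download=True)
vctk = torchaudio.datasets.VCTK_092("./data", download=True)
librispeech = torchaudio.datasets.LIBRISPEECH("./data", url="train-clean-100", download=True)
libritts = torchaudio.datasets.LIBRITTS("./data", url="train-clean-100", download=True)

waveform, sample_rate, transcript, normalized_transcript = ljspeech[0]
print(waveform.shape, sample_rate)  # e.g. torch.Size([1, N]) 22050
```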
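The Global Style Tokens entries above describe a bank of learned embeddings that a reference encoder attends over to produce a single style embedding for the decoder. A minimal single-head PyTorch sketch of that attention step; the layer sizes are arbitrary, and the papers use a multi-head variant:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class StyleTokenLayer(nn.Module):
    """Single-head GST layer: attention weights over a learned token
    bank, computed from a reference embedding, yield a style embedding."""
    def __init__(self, num_tokens=10, token_dim=256, ref_dim=128):
        super().__init__()
        self.tokens = nn.Parameter(torch.randn(num_tokens, token_dim))
        self.query_proj = nn.Linear(ref_dim, token_dim)

    def forward(self, ref_embedding):
        # ref_embedding: (batch, ref_dim), e.g. from a reference encoder
        query = self.query_proj(ref_embedding)           # (batch, token_dim)
        keys = torch.tanh(self.tokens)                   # (num_tokens, token_dim)
        scores = query @ keys.T / keys.shape[-1] ** 0.5  # (batch, num_tokens)
        weights = F.softmax(scores, dim=-1)
        return weights @ keys                            # (batch, token_dim)

gst = StyleTokenLayer()
style = gst(torch.randn(4, 128))  # style embedding that conditions the TTS decoder
print(style.shape)                # torch.Size([4, 256])
```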
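The FastDiff entry hinges on location-variable convolutions: instead of one kernel shared across all time steps, kernels are predicted from the conditioning features and applied per segment. A much-simplified single-layer sketch; the dimensions and linear kernel predictor are illustrative, and FastDiff's actual blocks add diffusion-step conditioning, dilation, and gating:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LocationVariableConv(nn.Module):
    """Simplified location-variable convolution: one predicted
    (channels x channels x kernel_size) kernel per time segment."""
    def __init__(self, channels=8, cond_channels=16, kernel_size=3, segment_len=32):
        super().__init__()
        self.c, self.k, self.seg = channels, kernel_size, segment_len
        self.kernel_predictor = nn.Linear(cond_channels, channels * channels * kernel_size)

    def forward(self, x, cond):
        # x: (batch, channels, time); cond: (batch, n_segments, cond_channels)
        b, c, t = x.shape
        n_seg = t // self.seg
        kernels = self.kernel_predictor(cond).view(b, n_seg, c, c, self.k)
        x = x[:, :, :n_seg * self.seg]
        pad = (self.k - 1) // 2
        out = []
        for i in range(n_seg):
            seg = x[:, :, i * self.seg:(i + 1) * self.seg]
            seg = F.pad(seg, (pad, pad)).reshape(1, b * c, -1)
            w = kernels[:, i].reshape(b * c, c, self.k)
            # groups=b gives each batch element its own predicted kernel
            out.append(F.conv1d(seg, w, groups=b).view(b, c, self.seg))
        return torch.cat(out, dim=2)

lvc = LocationVariableConv()
y = lvc(torch.randn(2, 8, 128), torch.randn(2, 4, 16))
print(y.shape)  # torch.Size([2, 8, 128])
```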
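The SC VALL-E entry depends on a Korean grapheme-to-phoneme converter. A usage sketch assuming the open-source g2pk package, which is not necessarily the converter the paper used:

```python
# pip install g2pk
from g2pk import G2p

g2p = G2p()
# Converts Hangul text to its pronounced (phonemic) form.
print(g2p("어제는 날씨가 맑았는데, 오늘은 흐리다."))  # "Yesterday the weather was clear, but today it is overcast."
```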
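The Hi-Fi Multi-Speaker English TTS entry mentions generating inputs for audio splicing detection and localization. One plausible way to synthesize such training pairs is to concatenate segments from two utterances and keep the splice point as the label; a sketch with placeholder waveforms standing in for real recordings:

```python
import numpy as np

def make_spliced_example(wav_a: np.ndarray, wav_b: np.ndarray,
                         sample_rate: int, cut_sec: float = 2.0):
    """Join the head of wav_a to the tail of wav_b; the returned
    splice point (in samples) serves as the localization label."""
    cut = int(cut_sec * sample_rate)
    spliced = np.concatenate([wav_a[:cut], wav_b[cut:]])
    return spliced, cut

sr = 16000
a = np.random.randn(5 * sr).astype(np.float32)  # stand-ins for two utterances
b = np.random.randn(5 * sr).astype(np.float32)
x, splice_point = make_spliced_example(a, b, sr)
print(x.shape, splice_point)  # (80000,) 32000
```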
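Finally, the LibriTTS entry refers to phonetic posteriorgrams: the frame-level posterior distribution over phone classes produced by an ASR acoustic model, used as a largely speaker-independent content representation. A schematic sketch in which a hypothetical untrained model stands in for a real ASR encoder:

```python
import torch
import torch.nn as nn

N_MELS, N_PHONES = 80, 72  # illustrative sizes

# Hypothetical stand-in; a real system would load a trained ASR encoder.
acoustic_model = nn.Sequential(
    nn.Linear(N_MELS, 256), nn.ReLU(), nn.Linear(256, N_PHONES)
)

def extract_ppg(mel: torch.Tensor) -> torch.Tensor:
    """mel: (frames, n_mels) -> PPG: (frames, n_phones); rows sum to 1."""
    with torch.no_grad():
        return torch.softmax(acoustic_model(mel), dim=-1)

ppg = extract_ppg(torch.randn(200, N_MELS))
print(ppg.shape, float(ppg[0].sum()))  # torch.Size([200, 72]) 1.0
```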