56 datasets found

Groups: Speech Synthesis
  • VCTK Corpus

    The VCTK corpus is an English multi-speaker dataset with 44 hours of audio spoken by 109 native English speakers (a loading sketch for this and the other torchaudio-packaged corpora below appears after this list).
  • CSTR VCTK Corpus

    The CSTR VCTK Corpus is a dataset of speech recordings from 109 speakers, each reading around 400 sentences.
  • Style Tokens

    Global Style Tokens (GSTs) are a recently proposed method to learn latent disentangled representations of high-dimensional data. GSTs can be used within Tacotron, a state-of-the-art end-to-end text-to-speech synthesis model, to uncover expressive factors of variation in speaking style. A minimal sketch of the GST attention step appears after this list.
  • Tacotron

    See Style Tokens above; this entry shares the same description.
  • Global Style Tokens

    See Style Tokens above; this entry shares the same description.
  • Text-Predicted Global Style Tokens

    See Style Tokens above; this entry shares the same description.
  • VCTK Dataset

    The VCTK dataset is a large corpus of speech recordings, each containing a single speaker reading a single sentence.
  • LJSpeech Dataset

    The LJSpeech dataset is a collection of audio recordings of a single female speaker reading aloud.
  • FastDiff: A Fast Conditional Diffusion Model for High-Quality Speech Synthesis

    FastDiff is a fast conditional diffusion model for high-quality speech synthesis. It employs a stack of time-aware location-variable convolutions with diverse receptive field patterns to model long-term time dependencies with adaptive conditions (a simplified sketch of a location-variable convolution appears after this list).
  • LJ Speech Dataset

    The LJ speech dataset is a dataset of speech samples recorded from a single speaker reading passages from 7 non-fiction books.
  • LJSpeech and VCTK datasets

    The LJSpeech dataset contains 13,100 audio clips of a female speaker, sampled at 22 kHz. The VCTK dataset consists of 108 native English speakers with various accents.
  • SC VALL-E

    The proposed SC VALL-E model uses a Korean grapheme-to-phoneme converter to extract phonemes from Korean text, which are then used for training (a G2P usage sketch appears after this list).
  • Hi-Fi Multi-Speaker English TTS dataset

    The Hi-Fi Multi-Speaker English TTS dataset is used to generate training, validation, and test inputs for the audio splicing detection and localization task (a sketch of how spliced examples can be constructed appears after this list).
  • VCTK

    Voice conversion (VC) is a technique that alters the voice of a source speaker to a target style, such as speaker identity, prosody, and emotion, while keeping the linguistic content unchanged.
  • LibriSpeech dataset

    The LibriSpeech dataset contains about 1,000 hours of English speech derived from audiobooks.
  • LibriTTS

    A popular text-based VC approach is to use an automatic speech recognition (ASR) model to extract a phonetic posteriorgram (PPG) as the content representation (a schematic PPG extraction sketch appears after this list).
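
Several corpora above (LJSpeech, VCTK, LibriSpeech, LibriTTS) ship as built-in torchaudio datasets, so they can be pulled down in a few lines. A minimal loading sketch, assuming a recent torchaudio and reachable download mirrors; "./data" is a placeholder root directory:

```python
import torchaudio

# Each loader downloads the corpus on first use and yields tuples
# beginning with (waveform, sample_rate, ...).
ljspeech = torchaudio.datasets.LJSPEECH("./data", download=True)
vctk = torchaudio.datasets.VCTK_092("./data", download=True)
librispeech = torchaudio.datasets.LIBRISPEECH("./data", url="train-clean-100", download=True)
libritts = torchaudio.datasets.LIBRITTS("./data", url="train-clean-100", download=True)

waveform, sample_rate, transcript, normalized_transcript = ljspeech[0]
print(waveform.shape, sample_rate)  # e.g. torch.Size([1, N]) 22050
```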
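The Global Style Tokens entries above describe a bank of learned embeddings that a reference encoder attends over to produce a single style embedding for the decoder. A minimal single-head PyTorch sketch of that attention step; the layer sizes are arbitrary, and the papers use a multi-head variant:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class StyleTokenLayer(nn.Module):
    """Single-head GST layer: attention weights over a learned token
    bank, computed from a reference embedding, yield a style embedding."""
    def __init__(self, num_tokens=10, token_dim=256, ref_dim=128):
        super().__init__()
        self.tokens = nn.Parameter(torch.randn(num_tokens, token_dim))
        self.query_proj = nn.Linear(ref_dim, token_dim)

    def forward(self, ref_embedding):
        # ref_embedding: (batch, ref_dim), e.g. from a reference encoder
        query = self.query_proj(ref_embedding)           # (batch, token_dim)
        keys = torch.tanh(self.tokens)                   # (num_tokens, token_dim)
        scores = query @ keys.T / keys.shape[-1] ** 0.5  # (batch, num_tokens)
        weights = F.softmax(scores, dim=-1)
        return weights @ keys                            # (batch, token_dim)

gst = StyleTokenLayer()
style = gst(torch.randn(4, 128))  # style embedding that conditions the TTS decoder
print(style.shape)                # torch.Size([4, 256])
```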
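The FastDiff entry hinges on location-variable convolutions: instead of one kernel shared across all time steps, kernels are predicted from the conditioning features and applied per segment. A much-simplified single-layer sketch; the dimensions and linear kernel predictor are illustrative, and FastDiff's actual blocks add diffusion-step conditioning, dilation, and gating:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LocationVariableConv(nn.Module):
    """Simplified location-variable convolution: one predicted
    (channels x channels x kernel_size) kernel per time segment."""
    def __init__(self, channels=8, cond_channels=16, kernel_size=3, segment_len=32):
        super().__init__()
        self.c, self.k, self.seg = channels, kernel_size, segment_len
        self.kernel_predictor = nn.Linear(cond_channels, channels * channels * kernel_size)

    def forward(self, x, cond):
        # x: (batch, channels, time); cond: (batch, n_segments, cond_channels)
        b, c, t = x.shape
        n_seg = t // self.seg
        kernels = self.kernel_predictor(cond).view(b, n_seg, c, c, self.k)
        x = x[:, :, :n_seg * self.seg]
        pad = (self.k - 1) // 2
        out = []
        for i in range(n_seg):
            seg = x[:, :, i * self.seg:(i + 1) * self.seg]
            seg = F.pad(seg, (pad, pad)).reshape(1, b * c, -1)
            w = kernels[:, i].reshape(b * c, c, self.k)
            # groups=b gives each batch element its own predicted kernel
            out.append(F.conv1d(seg, w, groups=b).view(b, c, self.seg))
        return torch.cat(out, dim=2)

lvc = LocationVariableConv()
y = lvc(torch.randn(2, 8, 128), torch.randn(2, 4, 16))
print(y.shape)  # torch.Size([2, 8, 128])
```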
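The SC VALL-E entry depends on a Korean grapheme-to-phoneme converter. A usage sketch assuming the open-source g2pk package, which is not necessarily the converter the paper used:

```python
# pip install g2pk
from g2pk import G2p

g2p = G2p()
# Converts Hangul text to its pronounced (phonemic) form.
print(g2p("어제는 날씨가 맑았는데, 오늘은 흐리다."))  # "Yesterday the weather was clear, but today it is overcast."
```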
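The Hi-Fi Multi-Speaker English TTS entry mentions generating inputs for audio splicing detection and localization. One plausible way to synthesize such training pairs is to concatenate segments from two utterances and keep the splice point as the label; a sketch with placeholder waveforms standing in for real recordings:

```python
import numpy as np

def make_spliced_example(wav_a: np.ndarray, wav_b: np.ndarray,
                         sample_rate: int, cut_sec: float = 2.0):
    """Join the head of wav_a to the tail of wav_b; the returned
    splice point (in samples) serves as the localization label."""
    cut = int(cut_sec * sample_rate)
    spliced = np.concatenate([wav_a[:cut], wav_b[cut:]])
    return spliced, cut

sr = 16000
a = np.random.randn(5 * sr).astype(np.float32)  # stand-ins for two utterances
b = np.random.randn(5 * sr).astype(np.float32)
x, splice_point = make_spliced_example(a, b, sr)
print(x.shape, splice_point)  # (80000,) 32000
```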
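Finally, the LibriTTS entry refers to phonetic posteriorgrams: the frame-level posterior distribution over phone classes produced by an ASR acoustic model, used as a largely speaker-independent content representation. A schematic sketch in which a hypothetical untrained model stands in for a real ASR encoder:

```python
import torch
import torch.nn as nn

N_MELS, N_PHONES = 80, 72  # illustrative sizes

# Hypothetical stand-in; a real system would load a trained ASR encoder.
acoustic_model = nn.Sequential(
    nn.Linear(N_MELS, 256), nn.ReLU(), nn.Linear(256, N_PHONES)
)

def extract_ppg(mel: torch.Tensor) -> torch.Tensor:
    """mel: (frames, n_mels) -> PPG: (frames, n_phones); rows sum to 1."""
    with torch.no_grad():
        return torch.softmax(acoustic_model(mel), dim=-1)

ppg = extract_ppg(torch.randn(200, N_MELS))
print(ppg.shape, float(ppg[0].sum()))  # torch.Size([200, 72]) 1.0
```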