Dataset - LDM

WaveGrad 2: Iterative Refinement for Text-to-Speech Synthesis

WaveGrad 2 is a non-autoregressive generative model for text-to-speech synthesis. It is trained to estimate the gradient of the log conditional density of the waveform given a...
- Dataset
- JSON
ASVspoof 2021

The ASVspoof 2021 dataset is a large-scale public dataset for speaker verification and spoofing countermeasures. The dataset contains various types of audio files, including...
- Dataset
- JSON
LRS3

The LRS3 dataset is a large-scale dataset for visual speech recognition. It consists of thousands of spoken sentences from TED videos.
- Dataset
- JSON
BOFFIN TTS: Few-shot Speaker Adaptation by Bayesian Optimization

BOFFIN TTS is a novel approach for few-shot speaker adaptation. The task is to fine-tune a pre-trained TTS model to mimic a new speaker using a small corpus of target utterances.
- Dataset
- JSON
Tacotron: Towards End-to-End Speech Synthesis

A text-to-speech synthesis system typically consists of multiple stages, such as a text analysis frontend, an acoustic model and an audio synthesis module.
- Dataset
- JSON
ASVspoof 2019

The ASVspoof 2019 dataset is a large-scale public dataset for speaker verification and spoofing countermeasures. The dataset contains various types of audio files, including...
- Dataset
- JSON
FastSpeech2

FastSpeech2 is a text-to-speech model that uses a WaveGAN-based vocoder.
- Dataset
- JSON
SNIPER Training: Single-Shot Sparse Training for Text-to-Speech

Text-to-speech (TTS) models have achieved remarkable naturalness in recent years, yet like most deep neural models, they have more parameters than necessary. Sparse TTS models...
- Dataset
- JSON
Mega-TTS 2: Boosting Prompting Mechanisms for Zero-Shot Speech Synthesis

Zero-shot text-to-speech aims to synthesize voices with unseen speech prompts, which significantly reduces the data and computation requirements for voice cloning by skipping...
- Dataset
- JSON
FastSpeech 2: Fast and High-Quality End-to-End Text-to-Speech

FastSpeech 2 is a fast and high-quality end-to-end text-to-speech system. It uses a multi-task learning approach to learn the mapping between phonemes and waveforms.
- Dataset
- JSON
LibriTTS

A popular text-based voice conversion (VC) approach is to use an automatic speech recognition (ASR) model to extract phonetic posteriorgrams (PPGs) as the content representation.
- Dataset
- JSON

You can also access this registry using the API (see API Docs).

11 datasets found