Dataset - LDM

VioLA

Unified codec language models for speech recognition, synthesis, and translation.
- Dataset
- JSON
Neural Codec Language Models

Neural codec language models are zero-shot text to speech synthesizers.
- Dataset
- JSON
AudioPaLM

A large language model that can speak and listen.
- Dataset
- JSON
CodecFake

The CodecFake dataset, built using a comprehensive collection of contemporary codec models.
- Dataset
- JSON
Diffusion Models for Minimally-Supervised Speech Synthesis

A minimally-supervised speech synthesis method based on diffusion models. Introduces the CTAP method as an intermediate semantic representation and uses...
- Dataset
- JSON
Tacotron: Towards End-to-End Speech Synthesis

A text-to-speech synthesis system typically consists of multiple stages, such as a text analysis frontend, an acoustic model and an audio synthesis module.
- Dataset
- JSON
FastSpeech: Fast, Robust and Controllable Text to Speech

Neural network based end-to-end text to speech (TTS) has significantly improved the quality of synthesized speech. Prominent methods (e.g., Tacotron 2) usually first generate...
- Dataset
- JSON
SNIPER Training: Single-Shot Sparse Training for Text-to-Speech

Text-to-speech (TTS) models have achieved remarkable naturalness in recent years, yet like most deep neural models, they have more parameters than necessary. Sparse TTS models...
- Dataset
- JSON
CSTR VCTK Corpus

The CSTR VCTK Corpus is a dataset of speech recordings of 109 speakers, each with 20 utterances.
- Dataset
- JSON
VCTK Dataset

The VCTK dataset is a large corpus of speech recordings, each containing a single speaker and a single sentence.
- Dataset
- JSON
LJSpeech Dataset

The LJSpeech dataset is a collection of audio recordings of a single female speaker reading aloud.
- Dataset
- JSON
FastDiff: A Fast Conditional Diffusion Model for High-Quality Speech Synthesis

FastDiff is a fast conditional diffusion model for high-quality speech synthesis. It employs a stack of time-aware location-variable convolutions with diverse receptive field...
- Dataset
- JSON
LJ Speech Dataset

The LJ Speech dataset consists of speech samples recorded from a single speaker reading passages from 7 non-fiction books.
- Dataset
- JSON
Hi-Fi Multi-Speaker English TTS dataset

The Hi-Fi Multi-Speaker English TTS dataset is used to generate training, validation and test inputs for the audio splicing detection and localization task.
- Dataset
- JSON
LibriSpeech dataset

LibriSpeech contains about 1,000 hours of English speech derived from audiobooks.
- Dataset
- JSON
LibriTTS

A popular text-based VC approach is to use an automatic speech recognition (ASR) model to extract phonetic posteriorgram (PPG) as content representation.
- Dataset
- JSON

You can also access this registry using the API (see API Docs).

16 datasets found
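
The API access mentioned above could be scripted roughly as follows. This is a minimal sketch, not the registry's documented interface: it assumes the registry exposes a CKAN-style Action API with a `package_search` endpoint, and the base URL `https://registry.example.org` and the helper names are hypothetical placeholders. Check the API Docs for the actual endpoint and response layout.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Hypothetical base URL; replace with the registry's real address.
BASE_URL = "https://registry.example.org"


def build_search_url(base_url, query, rows=20):
    """Build a search URL, assuming a CKAN-style package_search endpoint."""
    params = urlencode({"q": query, "rows": rows})
    return f"{base_url.rstrip('/')}/api/3/action/package_search?{params}"


def extract_titles(payload):
    """Return (count, titles) from a CKAN-style package_search response."""
    if not payload.get("success"):
        raise ValueError("API reported failure")
    result = payload["result"]
    return result["count"], [item["title"] for item in result["results"]]


def search(base_url, query, rows=20):
    """Fetch one page of search results (performs a network call)."""
    with urlopen(build_search_url(base_url, query, rows), timeout=10) as resp:
        return extract_titles(json.load(resp))
```

Calling `search(BASE_URL, "LDM")` against a compatible registry would return the total match count and the titles on the first page. URL construction and response parsing are kept in separate functions so they can be checked without network access.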