40 datasets found

Tags: speech recognition

  • TIMIT dataset

    The dataset used in this paper is TIMIT, studied for its phonetically and phonologically conditioned allophonic distributions in English, where voiceless stops surface as aspirated...
  • VGGSound

    The VGGSound dataset is a large-scale audio-visual dataset containing roughly 200,000 10-second video clips, each with a corresponding audio track.
  • TED Gesture

    TED Gesture is a large-scale English-language dataset for co-speech gesture generation, composed of 1,766 TED videos covering a variety of topics and narrators.
  • TIMIT

    The TIMIT corpus is a widely used benchmark for speech recognition tasks. It contains 3,696 training utterances from 462 speakers, excluding the SA sentences. The core test set...
  • LRS2

    The LRS2 dataset consists of 48,164 video clips taken from BBC television programmes. Each video is accompanied by audio corresponding to a sentence of up to 100 characters.
  • Google Speech Command Dataset

    The Google Speech Command Dataset is a benchmark for keyword spotting, a task in speech recognition. It contains 12 classes, including 10 keywords and two extra...
  • wav2vec 2.0

    wav2vec 2.0 is a self-supervised learning framework for speech recognition: models are pre-trained on unlabeled speech audio and then fine-tuned on labeled data.
  • Switchboard Corpus

    The Switchboard corpus is a collection of approximately 2,400 two-sided telephone conversations among several hundred speakers from across the United States, widely used for conversational speech recognition.
  • TIMIT Acoustic-Phonetic Continuous Speech Corpus

    The TIMIT Acoustic-Phonetic Continuous Speech Corpus (CD-ROM) contains broadband recordings of 630 speakers of eight major dialects of American English, each reading ten phonetically rich sentences.
  • Speech Commands Dataset

    The keyword spotting model in this entry is trained on the Speech Commands Dataset together with ESC, the Dataset for Environmental Sound Classification.
  • LRS2 dataset

    The LRS2 dataset consists of news recordings from the BBC, featuring varied lighting, backgrounds, and face poses, and speakers of diverse origins.
  • GRID dataset

    The GRID dataset was introduced by [5] as a corpus for tasks such as speech perception and speech recognition. GRID contains 33 unique speakers, articulating 1000 word sequences...
  • Speech Commands

    The Speech Commands dataset consists of 105,829 one-second audio recordings of 35 spoken words, sampled at 16 kHz; a minimal loading sketch appears after this list. The raw Speech Commands dataset presents audio recordings as a...
  • Switchboard

    The Switchboard corpus consists of conversational telephone speech and exhibits a rich set of domain factors such as accent, syntactic and semantic variety, and acoustic environment.
  • ESC-50

    ESC-50 is a dataset for environmental sound classification. In this entry, the dataset used for training the CNN in cough detection is composed of various modified audio clips gathered from open-source online sources. Each of these audio files...
  • Librispeech

    LibriSpeech is a large-scale corpus of approximately 1,000 hours of read English speech derived from audiobooks, sampled at 16 kHz and widely used for training and evaluating speech recognition systems.
  • LibriLight

    Libri-Light is a large collection of roughly 60,000 hours of unlabelled English speech drawn from audiobooks, designed as a benchmark for speech recognition with limited or no supervision.
  • MSR-VTT

    The dataset used in the paper is MSR-VTT, a large video description dataset for bridging video and language. The dataset contains 10k video clips with length varying from 10 to...
  • VoxCeleb

    VoxCeleb is a large-scale speaker recognition dataset collected from interview videos on YouTube. Speaker verification systems experience significant performance degradation when tasked with short-duration trial recordings; to address this challenge, a multi-scale feature...
  • VCTK

    VCTK is a multi-speaker English speech corpus covering a range of accents. Voice conversion (VC) is a technique that alters the voice of a source speaker to a target style, such as speaker identity, prosody, and emotion, while keeping the linguistic...
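Below is a minimal sketch of loading the Speech Commands dataset mentioned above, assuming PyTorch and torchaudio are installed; the root path and subset choice are illustrative, not prescribed by this registry.

    # Minimal loading sketch for Speech Commands (v2) via torchaudio.
    import torchaudio

    dataset = torchaudio.datasets.SPEECHCOMMANDS(
        root="./data",      # hypothetical download/cache directory
        download=True,
        subset="training",  # also accepts "validation" or "testing"
    )

    # Each item is (waveform, sample_rate, label, speaker_id, utterance_number).
    waveform, sample_rate, label, speaker_id, utterance_number = dataset[0]
    print(waveform.shape, sample_rate, label)

Each item is a roughly one-second mono waveform at 16 kHz, matching the description in the entry above.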
You can also access this registry using the API (see API Docs).
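As a hypothetical illustration of the API access mentioned above, the sketch below queries the registry for datasets tagged "speech recognition" using Python's requests library; the base URL, query parameter, and response fields are assumptions, so consult the API Docs for the actual interface.

    import requests

    # Placeholder endpoint and parameters; the real ones are defined in the API Docs.
    resp = requests.get(
        "https://example.org/api/datasets",
        params={"tag": "speech recognition", "limit": 40},
        timeout=10,
    )
    resp.raise_for_status()

    # Assumes the API returns a JSON list of dataset entries with "name" and "description".
    for entry in resp.json():
        print(entry.get("name"), "-", (entry.get("description") or "")[:80])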