Dataset - LDM

CommonVoice

The sequence-to-sequence approach is now widely used in speech recognition (SR), and many research works are dedicated to showing that their capabilities relying on a single...
MuST-C

MuST-C is a multilingual speech translation dataset, which contains at least 385 hours of audio recordings from TED Talks, with their manual transcriptions and translations at...
Europarl-v7

Multilingual document classification task, where labeled data is available only for one language (e.g. English) while classification must be performed in a different language...
Dictation dataset

A dictation dataset spanning 39 locales and several scripts, including Latin (Albanian, Icelandic, Slovak), Arabic (Levant, Maghrebi), Cyrillic (Macedonian, Kazakh), and Devanagari (Nepali).
LJ Speech Dataset

The LJ Speech dataset consists of speech samples recorded from a single speaker reading passages from 7 non-fiction books.
Wikipedia Comparable Corpora

A multilingual dataset for topic modeling, built from aligned Wikipedia articles extracted from the Wikipedia Comparable Corpora.
VATEX

The dataset used in the paper is a video question answering dataset intended for large-scale video-language pre-training.
Librispeech

LibriSpeech is a large-scale corpus of approximately 1,000 hours of read English speech derived from LibriVox audiobooks.
LibriLight

The dataset used in this paper is a large-scale production ASR system, which includes multi-domain (MD) data sets in English. The MD data sets include medium-form (MF) and...

You can also access this registry using the API (see API Docs).
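
As a rough illustration of programmatic access, the sketch below builds a search request and extracts dataset titles from a JSON response. Everything specific in it is an assumption made for illustration: the base URL `https://registry.example.org/api`, the `/search` endpoint, the `q` and `rows` parameters, and the `results`/`title` response fields are hypothetical placeholders, not the registry's documented interface. Consult the API Docs for the real endpoint and schema.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Hypothetical placeholder: replace with the base URL given in the API Docs.
BASE_URL = "https://registry.example.org/api"


def build_search_url(query, rows=10, base_url=BASE_URL):
    """Build a URL for an assumed `/search` endpoint with `q` and `rows` parameters."""
    return f"{base_url}/search?{urlencode({'q': query, 'rows': rows})}"


def parse_titles(payload):
    """Extract titles from a JSON string, assuming a top-level `results` list
    whose items each carry a `title` field (a hypothetical schema)."""
    data = json.loads(payload)
    return [item["title"] for item in data.get("results", [])]


def search_titles(query, rows=10):
    """Fetch and parse search results; requires a real endpoint to succeed."""
    with urlopen(build_search_url(query, rows), timeout=10) as resp:
        return parse_titles(resp.read().decode("utf-8"))
```

Keeping URL construction and response parsing separate from the network call means both can be checked against a sample payload before pointing the code at the real service.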

49 datasets found