Dataset - LDM

TED-LIUM 2

Enhancing the TED-LIUM corpus with selected data for language modeling and more TED talks.
- Dataset
- JSON
TEDLIUM Corpus

The TEDLIUM corpus is a large corpus used for speech recognition and text summarization.
- Dataset
- JSON
TED Speech Summarization Corpus

Speech summarization, which generates a text summary from speech, can be achieved by combining automatic speech recognition (ASR) and text summarization (TS).
- Dataset
- JSON
LRS3

The LRS3 dataset is a large-scale dataset for visual speech recognition. It consists of thousands of spoken sentences from TED videos.
- Dataset
- JSON
MuST-C

MuST-C is a multilingual speech translation dataset, which contains at least 385 hours of audio recordings from TED Talks, with their manual transcriptions and translations at...
- Dataset
- JSON

You can also access this registry using the API (see API Docs).

5 datasets found