Dataset - LDM

MuST-C v1.0

MuST-C v1.0 is a multilingual corpus for end-to-end speech translation, containing 8 language pairs.
- Dataset
- JSON

Europarl-ST

Europarl-ST is a multilingual speech corpus that contains transcriptions of parliamentary debates in multiple languages.
- Dataset
- JSON

TED LIUM corpus

The dataset used in the paper is the TED LIUM corpus.
- Dataset
- JSON

Speech-translation TED corpus

The dataset used in the paper is the Speech-translation TED corpus.
- Dataset
- JSON

Fisher and Callhome Spanish-English Speech Translation Corpus

The dataset used in the paper is the Fisher and Callhome Spanish-English Speech Translation Corpus.
- Dataset
- JSON

IWSLT2018 Speech Translation Task

The dataset used in the paper is the IWSLT2018 speech translation task, which consists of five parts: TED corpus, Speech-translation TED corpus, TED LIUM corpus, WMT18 data and...
- Dataset
- JSON

TED2012 ASR and MT dataset

The dataset used in the paper is a collection of English ASR hypotheses from the eight submissions on the tst2012 test set in the IWSLT 2013 TED talk ASR track, along with...
- Dataset
- JSON

MuST-C

MuST-C is a multilingual speech translation dataset, which contains at least 385 hours of audio recordings from TED Talks, with their manual transcriptions and translations at...
- Dataset
- JSON

You can also access this registry using the API (see API Docs).

8 datasets found