Dataset - LDM

Turkish-English and Uyghur-Chinese machine translation tasks

The paper uses data from the Turkish-English and Uyghur-Chinese machine translation tasks.
- Dataset
- JSON
WMT22 Translation Suggestion Task

WMT22 Shared Task on Translation Suggestion (TS) dataset.
- Dataset
- JSON
WMT datasets

WMT datasets are large-scale machine translation datasets.
- Dataset
- JSON
LDC 2002 English-Chinese Dataset

The LDC 2002 English-Chinese dataset is used for testing the proposed approach.
- Dataset
- JSON
WMT 2016 English-German Dataset

The WMT 2016 English-German dataset is used for testing the proposed approach.
- Dataset
- JSON
WMT 2014 English-French Dataset

The WMT 2014 English-French dataset is used for testing the proposed approach.
- Dataset
- JSON
IWSLT'14 German-English Translation Dataset

The dataset contains 160K sentence pairs for German-English translation.
- Dataset
- JSON
WMT17 Chinese-English Translation Dataset

The dataset contains 20M sentence pairs for Chinese-English translation.
- Dataset
- JSON
IWSLT 2014

The IWSLT 2014 German-to-English dataset is a machine translation dataset, containing 153K sentence pairs.
- Dataset
- JSON
Workshop of Machine Translation 2018

The Workshop of Machine Translation 2018 dataset is used to train the text machine translation models.
- Dataset
- JSON
WMT 2014 English-German

The dataset used in the paper is the WMT 2014 English-German dataset, a machine translation dataset.
- Dataset
- JSON
WMT'19

The dataset is used for machine translation, text summarization, and open-ended text generation tasks.
- Dataset
- JSON
IWSLT'14

The dataset is used for machine translation, text summarization, and open-ended text generation tasks.
- Dataset
- JSON
United Nations Parallel Corpus

High-quality human translations from books, leveraging the inductive bias that high-quality human translations are superior to machine-generated translations.
- Dataset
- JSON
MuST-C

MuST-C is a multilingual speech translation dataset, which contains at least 385 hours of audio recordings from TED Talks, with their manual transcriptions and translations at...
- Dataset
- JSON
WeTS

Translation Suggestion (TS) dataset for the WMT22 Shared Task on Translation Suggestion.
- Dataset
- JSON
Penn Treebank

The Penn Treebank dataset contains one million words of 1989 Wall Street Journal material annotated in Treebank II style, with 42k sentences of varying lengths.
- Dataset
- JSON
Librispeech

The Librispeech dataset is a large-scale speaker-dependent speech corpus containing 1080 hours of speech, 5600 utterances, and 1000 speakers.
- Dataset
- JSON
IWSLT14 EN→DE, WMT14 EN→DE, WMT16 EN→DE

The paper does not describe its dataset explicitly, but it mentions that the authors used the IWSLT14 EN→DE, WMT14 EN→DE, and WMT16 EN→DE tasks.
- Dataset
- JSON
WMT’16 English-Romanian dataset

The WMT’16 English-Romanian dataset was used for the machine translation task.
- Dataset
- JSON

You can also access this registry using the API (see API Docs).

81 datasets found