Dataset - LDM

No language left behind: Scaling human-centered machine translation

The dataset is used for training and evaluating multilingual language models.
- Dataset
- JSON
PADIC

Machine translation experiments on PADIC: A parallel Arabic dialect corpus
- Dataset
- JSON
WMT'14 English-German, WAT'17 Japanese-English, and WMT'17 Chinese-English tr...

The datasets used in the paper are the WMT'14 English-German, WAT'17 Japanese-English, and WMT'17 Chinese-English translation tasks.
- Dataset
- JSON
OCR4MT

OCR4MT is a benchmark for OCR systems on low-resource languages and scripts.
- Dataset
- JSON
WMT14 English-French

The dataset used for the bilingual resynchronization task. It includes WMT14 English-French data and a small parallel sentence compression dataset.
- Dataset
- JSON
Bilingual Synchronization

The dataset used for the bilingual synchronization task. It covers simulated interactive MT, translation with a Translation Memory (TM), and TM cleaning.
- Dataset
- JSON
WMT14

A large text corpus used for training and testing machine translation models.
- Dataset
- JSON
Diabla: A Corpus of Bilingual Spontaneous Written Dialogues

A corpus of bilingual spontaneous written dialogues for machine translation.
- Dataset
- JSON
Various Machine Translation datasets

The paper does not describe the dataset explicitly; it states only that the authors used various datasets for machine translation tasks.
- Dataset
- JSON
Moses Toolkit dataset

The paper does not describe the dataset explicitly; it states only that the authors used the Moses toolkit to tokenize sentences and split words into subword units.
- Dataset
- JSON
IT, Koran, Medical, and Law datasets

The paper does not describe the dataset explicitly; it states only that the authors used four commonly used benchmarks: IT, Koran, Medical, and Law.
- Dataset
- JSON
IWSLT 2014 Shared Task Dataset

The IWSLT 2014 shared task dataset contains 152K, 156K, 141K and 172K training sentences for the de-en, zh-en, en-tr and en-es language pairs, respectively.
- Dataset
- JSON
OPUS-100

The dataset used in the paper is a subset of the OPUS-MT dataset, containing 1M randomly sampled examples from the OPUS-100 dataset.
- Dataset
- JSON
WMT17 Zh-En

Non-autoregressive machine translation dataset
- Dataset
- JSON
WMT14 En-De

The WMT14 En-De dataset contains 4.5M pairs of English and German sentences.
- Dataset
- JSON
newstest2019.orig-en.p

The paraphrased reference translations used for the experiments in the paper.
- Dataset
- JSON
newstest2018.orig-en.p

The paraphrased reference translations used for the experiments in the paper.
- Dataset
- JSON
WMT 2019 English-German news translation task

The dataset used for the experiments in the paper, consisting of data from the WMT 2019 English-German news translation task.
- Dataset
- JSON
IWSLT 2015 English-Vietnamese

The IWSLT 2015 English-Vietnamese dataset, which has around 133k training sentence pairs.
- Dataset
- JSON
COMET: A neural framework for MT evaluation

The COMET dataset contains human-annotated scores for machine translation candidates.
- Dataset
- JSON

You can also access this registry using the API (see API Docs).

100 datasets found