-
Machine Translation Datasets
The dataset used in the paper is a collection of adversarial and natural examples for machine translation tasks. -
KFTT datasets
The Kyoto Free Translation Task (KFTT) English↔Japanese translation dataset. -
NIST 2003 (MT03), NIST 2004 (MT04), NIST 2005 (MT05), NIST 2006 (MT06) datasets
Chinese↔English translation tasks (NIST MT03 to MT06), together with the KFTT English↔Japanese translation datasets. -
A New Aligned Simple German Corpus
A new sentence-aligned monolingual corpus for Simple German – German. It contains multiple document-aligned sources which we have aligned using automatic sentence-alignment... -
Samanantar dataset
The Samanantar dataset contains 49.6 million sentence pairs between English and 11 Indian languages. -
DiscEvalMT
The dataset used for the experiments with document-level metrics for machine translation. -
OpenSubtitles2018
This dataset is used to evaluate the performance of context-aware machine translation systems. It consists of English-Russian subtitles with varying levels of context. -
WMT 2021 metrics shared task
The dataset used for the experiments with document-level metrics for machine translation. -
MultiLexNorm dataset
The MultiLexNorm dataset is used to evaluate the robustness of MT models to lexical normalization. -
MTNT dataset
The MTNT dataset is used to evaluate the robustness of MT models to noisy text. -
FLORES-200 devtest dataset
The FLORES-200 devtest dataset is used to evaluate the robustness of MT models to synthetic character perturbations. -
Covid-19 MLIA @ Eval initiative
The Covid-19 MLIA @ Eval initiative consists of three Natural Language Processing tasks: information extraction, multilingual semantic search and machine translation. The goal... -
TED2012 ASR and MT dataset
The dataset used in the paper is a collection of English ASR hypotheses from the eight submissions on the tst2012 test set in the IWSLT 2013 TED talk ASR track, along with... -
JW300 Dataset
A multilingual parallel corpus for low-resource languages. -
Tanzil Dataset
A parallel corpus for low-resource languages. -
China Workshop on Machine Translation in 2017
The dataset used in the paper is the news data from the China Workshop on Machine Translation in 2017. -
WIT corpus, SETimes corpus, newsdev2016, newstest2016, and newstest2017
The datasets used in the paper are the WIT corpus, SETimes corpus, newsdev2016, newstest2016, and newstest2017. -
Turkish-English and Uyghur-Chinese machine translation tasks
The datasets used in the paper are those of the Turkish-English and Uyghur-Chinese machine translation tasks. -
WMT22 Translation Suggestion Task
WMT22 Shared Task on Translation Suggestion (TS) dataset. -
WMT datasets
WMT datasets are large-scale machine translation datasets.