-
Tatoeba and IWSLT2014 datasets
Tatoeba and IWSLT2014 datasets for simultaneous machine translation (SMT). -
IWSLT-14 DE-EN
The dataset used in this paper is a machine translation dataset, specifically IWSLT-14 DE-EN. -
WMT16 English-Romanian
Diffusion models have achieved state-of-the-art synthesis quality on both visual and audio tasks, and recent works further adapt them to textual data by diffusing on the... -
WMT14 English-German
Non-Autoregressive Translation: Given a source sentence x, an AT model generates each target word y_t conditioned on previously generated ones y_<t, leading to high latency on... -
IWSLT14 German-English
Diffusion models have achieved state-of-the-art synthesis quality on both visual and audio tasks, and recent works further adapt them to textual data by diffusing on the... -
WMT19 English-German
Two widely-used resource-rich benchmarks, WMT17 English-Chinese (20M) and WMT19 English-German (36M) translation tasks -
WMT17 English-Chinese
Two widely-used resource-rich benchmarks, WMT17 English-Chinese (20M) and WMT19 English-German (36M) translation tasks -
KFTT datasets
KFTT English↔Japanese translation datasets. -
NIST 2003 (MT03), NIST 2004 (MT04), NIST 2005 (MT05), NIST 2006 (MT06) datasets
Chinese↔English translation tasks, KFTT English↔Japanese translation datasets. -
A New Aligned Simple German Corpus
A new sentence-aligned monolingual corpus for Simple German – German. It contains multiple document-aligned sources which we have aligned using automatic sentence-alignment... -
Samanantar dataset
The Samanantar dataset containing 49.6 million sentence pairs between English and 11 Indian languages. -
DiscEvalMT
The dataset used for the experiments with document-level metrics for machine translation. -
OpenSubtitles2018
This dataset is used to evaluate the performance of context-aware machine translation systems. It consists of English-Russian subtitles with varying levels of context. -
WMT 2021 metrics shared task
The dataset used for the experiments with document-level metrics for machine translation. -
Covid-19 MLIA @ Eval initiative
The Covid-19 MLIA @ Eval initiative consists of three Natural Language Processing tasks: information extraction, multilingual semantic search and machine translation. The goal... -
TED2012 ASR and MT dataset
The dataset used in the paper is a collection of English ASR hypotheses from the eight submissions on the tst2012 test set in the IWSLT 2013 TED talk ASR track, along with... -
JW300 Dataset
A multilingual parallel corpus for low-resource languages -
Tanzil Dataset
A parallel corpus for low-resource languages -
China Workshop on Machine Translation in 2017
The dataset used in the paper is the news data from China Workshop on Machine Translation in 2017. -
WIT corpus, SETimes corpus, newsdev2016, newstest2016, and newstest2017
The datasets used in the paper are the WIT corpus, SETimes corpus, newsdev2016, newstest2016, and newstest2017.