Machine Translation - Groups

Machine Translation Datasets

The dataset used in the paper is a collection of adversarial and natural examples for machine translation tasks.
- Dataset
- JSON
KFTT datasets

KFTT English↔Japanese translation datasets.
- Dataset
- JSON
NIST 2003 (MT03), NIST 2004 (MT04), NIST 2005 (MT05), NIST 2006 (MT06) datasets

NIST test sets for Chinese↔English translation tasks, listed together with the KFTT English↔Japanese translation datasets.
- Dataset
- JSON
A New Aligned Simple German Corpus

A new sentence-aligned monolingual corpus for Simple German – German. It contains multiple document-aligned sources which we have aligned using automatic sentence-alignment...
- Dataset
- JSON
Samanantar dataset

The Samanantar dataset contains 49.6 million sentence pairs between English and 11 Indian languages.
- Dataset
- JSON
DiscEvalMT

The dataset used for the experiments with document-level metrics for machine translation.
- Dataset
- JSON
OpenSubtitles2018

This dataset is used to evaluate the performance of context-aware machine translation systems. It consists of English-Russian subtitles with varying levels of context.
- Dataset
- JSON
WMT 2021 metrics shared task

The dataset used for the experiments with document-level metrics for machine translation.
- Dataset
- JSON
MultiLexNorm dataset

The MultiLexNorm dataset is used to evaluate the robustness of MT models to lexical normalization.
- Dataset
- JSON
MTNT dataset

The MTNT dataset is used to evaluate the robustness of MT models to noisy text.
- Dataset
- JSON
FLORES-200 devtest dataset

The FLORES-200 devtest dataset is used to evaluate the robustness of MT models to synthetic character perturbations.
- Dataset
- JSON
Covid-19 MLIA @ Eval initiative

The Covid-19 MLIA @ Eval initiative consists of three Natural Language Processing tasks: information extraction, multilingual semantic search and machine translation. The goal...
- Dataset
- JSON
TED2012 ASR and MT dataset

The dataset used in the paper is a collection of English ASR hypotheses from the eight submissions on the tst2012 test set in the IWSLT 2013 TED talk ASR track, along with...
- Dataset
- JSON
JW300 Dataset

A multilingual parallel corpus for low-resource languages.
- Dataset
- JSON
Tanzil Dataset

A parallel corpus for low-resource languages.
- Dataset
- JSON
China Workshop on Machine Translation in 2017

The dataset used in the paper is the news data from the 2017 China Workshop on Machine Translation.
- Dataset
- JSON
WIT corpus, SETimes corpus, newsdev2016, newstest2016, and newstest2017

The datasets used in the paper are the WIT corpus, the SETimes corpus, newsdev2016, newstest2016, and newstest2017.
- Dataset
- JSON
Turkish-English and Uyghur-Chinese machine translation tasks

The datasets used in the paper come from the Turkish-English and Uyghur-Chinese machine translation tasks.
- Dataset
- JSON
WMT22 Translation Suggestion Task

WMT22 Shared Task on Translation Suggestion (TS) dataset.
- Dataset
- JSON
WMT datasets

WMT datasets are large-scale machine translation datasets.
- Dataset
- JSON
