-
XNMT: The eXtensible Neural Machine Translation Toolkit
XNMT is a neural machine translation toolkit that focuses on modular code design, making it easy to swap in and out different parts of the model. -
MADAR dataset
The MADAR dataset is a parallel corpus covering low-resource Arabic dialects. -
ArzEnSEG corpus
The ArzEnSEG corpus is a morphologically annotated dataset for code-switched Egyptian Arabic-English. -
ArzEn parallel corpus
The ArzEn parallel corpus consists of speech transcriptions gathered through informal interviews with bilingual Egyptian Arabic-English speakers, as well as their English... -
English-to-Chinese Controlled Machine Translation
A dataset for English-to-Chinese controlled machine translation. -
Chinese-to-English Controlled Machine Translation
A dataset for Chinese-to-English controlled machine translation. -
English Controlled Machine Translation
A dataset for English controlled machine translation. -
IWSLT 2017
A collection of text for multilingual machine translation, in which a single system handles multiple language directions. -
WMT 2023 Metrics Shared Task
Findings of the WMT 2023 shared task on automatic post-editing -
Europarl English Romanian dataset
Europarl English Romanian dataset. -
IWSLT Vietnamese→English and ACL Romanian→English datasets
IWSLT Vietnamese→English and ACL Romanian→English datasets. -
French-English Translation Task
The dataset is drawn from a French-English translation task. -
Vietnamese Diacritic Restoration Dataset
A dataset for the Vietnamese diacritic restoration problem, consisting of 180,000 sentence pairs. -
IWSLT 2014 English-to-Turkish
The English-to-Turkish task of the IWSLT 2014 dataset. -
IWSLT 2014 English-to-Portuguese
The English-to-Portuguese task of the IWSLT 2014 dataset. -
IWSLT 2014 English-to-German
The English-to-German task of the IWSLT 2014 dataset. -
WMT 2010 and WMT 2012 datasets
The paper uses the WMT 2010 and WMT 2012 datasets, which cover machine translation tasks. -
Zh-En Multi-Domain Dataset
The Zh-En multi-domain dataset consists of four balanced domains: news, patent, subtitles, and COVID-19.