- English-Hindi Parallel Corpus
  The dataset used for training and testing the machine translation systems.
- English-Hindi Outputs Quality Estimation using Naive Bayes Classifier
  The dataset used for training and testing the Naive Bayes classifier for quality estimation of English-Hindi translation outputs.
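Sentence-level quality estimation of this kind is typically framed as text classification. A minimal sketch of the idea using scikit-learn follows; the example sentences, labels, and bag-of-words features are illustrative stand-ins and not the actual feature set or data of the corpus above.

```python
# Minimal quality-estimation sketch: classify MT outputs as "good"/"bad"
# with a multinomial Naive Bayes classifier over bag-of-words features.
# All data below is hypothetical toy data, not the real corpus.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy MT outputs with fluency-based labels (illustrative only).
outputs = [
    "the boy goes to school every day",
    "boy school the goes to day",
    "she is reading a book in the garden",
    "reading she book garden a in",
]
labels = ["good", "bad", "good", "bad"]

# Bag-of-words counts feeding a multinomial Naive Bayes model.
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(outputs, labels)

# Predict the quality label of an unseen output.
print(model.predict(["the girl goes to the garden"])[0])
```

In practice, quality-estimation features go beyond bag-of-words (e.g. source/target length ratios or language-model scores), but the classification setup is the same.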
- XNMT: The eXtensible Neural Machine Translation Toolkit
  XNMT is a neural machine translation toolkit that focuses on modular code design, making it easy to swap different parts of the model in and out.
- Vietnamese Diacritic Restoration Dataset
  The dataset used for the Vietnamese diacritic restoration problem, consisting of 180,000 sentence pairs.
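Diacritic-restoration sentence pairs are commonly built by stripping diacritics from clean Vietnamese text, yielding (undiacritized, original) pairs. A minimal sketch of the stripping side, using only the standard library (note that `đ`/`Đ` are separate code points that do not decompose, so a real pipeline would map them explicitly):

```python
import unicodedata

def strip_diacritics(text: str) -> str:
    """Remove combining diacritical marks to produce the input side of a
    restoration pair. Caveat: Vietnamese đ/Đ have no NFD decomposition
    and pass through unchanged here."""
    # NFD splits base characters from their combining marks,
    # e.g. "ô" -> "o" + U+0302 (combining circumflex).
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

print(strip_diacritics("không"))       # -> "khong"
print(strip_diacritics("tiếng Việt"))  # -> "tieng Viet"
```

A restoration model then learns the inverse mapping, from the stripped text back to the fully diacritized original.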
- Zh-En Multi-Domain Dataset
  The Zh-En multi-domain dataset consists of four balanced domains: news, patent, subtitles, and COVID-19.
- Machine Translation and Automated Analysis of the Sumerian Language Dataset
  The Machine Translation and Automated Analysis of the Sumerian Language dataset, which contains Sumerian texts in cuneiform script.
- MultiLexNorm dataset
  The MultiLexNorm dataset is used to evaluate the robustness of MT models to lexical normalization.
- MTNT dataset
  The MTNT dataset is used to evaluate the robustness of MT models to noisy text.
- FLORES-200 devtest dataset
  The FLORES-200 devtest dataset is used to evaluate the robustness of MT models to synthetic character perturbations.
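Synthetic character perturbations for robustness evaluation are usually simple random edits applied to clean test sentences. A minimal sketch of one possible perturbation function (the edit operations and rate here are illustrative choices, not the specific noise model used with FLORES-200):

```python
import random

def perturb(sentence: str, rate: float = 0.1, seed: int = 0) -> str:
    """Apply random character-level edits (delete, duplicate, or swap with
    the next character) to a sentence, each position perturbed with
    probability `rate`. Seeded for reproducibility."""
    rng = random.Random(seed)
    chars = list(sentence)
    out = []
    i = 0
    while i < len(chars):
        if rng.random() < rate:
            op = rng.choice(["delete", "duplicate", "swap"])
            if op == "delete":
                i += 1               # drop this character
                continue
            if op == "duplicate":
                out.append(chars[i])  # emit the character twice
                out.append(chars[i])
                i += 1
                continue
            if op == "swap" and i + 1 < len(chars):
                out.append(chars[i + 1])  # transpose adjacent pair
                out.append(chars[i])
                i += 2
                continue
        out.append(chars[i])
        i += 1
    return "".join(out)

print(perturb("the quick brown fox", rate=0.2, seed=42))
```

Translating both the clean and perturbed test sets and comparing scores gives a simple measure of a model's robustness to character noise.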
- Covid-19 MLIA @ Eval initiative
  The Covid-19 MLIA @ Eval initiative consists of three Natural Language Processing tasks: information extraction, multilingual semantic search, and machine translation. The goal...
- Penn Treebank
  The Penn Treebank dataset contains one million words of 1989 Wall Street Journal material annotated in Treebank II style, with 42k sentences of varying lengths.