- English-Hindi Parallel Corpus
  The dataset used for training and testing the machine translation systems.
- English-Hindi Outputs Quality Estimation using Naive Bayes Classifier
  The dataset used for training and testing the Naive Bayes classifier for quality estimation of English-Hindi translation outputs.
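Sentence-level quality estimation of this kind is typically framed as text classification. A minimal sketch of the idea using scikit-learn follows; the example sentences, labels, and bag-of-words features are illustrative stand-ins and not the actual feature set or data of the corpus above.

```python
# Minimal quality-estimation sketch: classify MT outputs as "good"/"bad"
# with a multinomial Naive Bayes classifier over bag-of-words features.
# All data below is hypothetical toy data, not the real corpus.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy MT outputs with fluency-based labels (illustrative only).
outputs = [
    "the boy goes to school every day",
    "boy school the goes to day",
    "she is reading a book in the garden",
    "reading she book garden a in",
]
labels = ["good", "bad", "good", "bad"]

# Bag-of-words counts feeding a multinomial Naive Bayes model.
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(outputs, labels)

# Predict the quality label of an unseen output.
print(model.predict(["the girl goes to the garden"])[0])
```

In practice, quality-estimation features go beyond bag-of-words (e.g. source/target length ratios or language-model scores), but the classification setup is the same.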
- XNMT: The eXtensible Neural Machine Translation Toolkit
  XNMT is a neural machine translation toolkit that focuses on modular code design, making it easy to swap different parts of the model in and out.
- Vietnamese Diacritic Restoration Dataset
  The dataset used for the Vietnamese diacritic restoration problem, consisting of 180,000 sentence pairs.
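Diacritic-restoration sentence pairs are commonly built by stripping diacritics from clean Vietnamese text, yielding (undiacritized, original) pairs. A minimal sketch of the stripping side, using only the standard library (note that `đ`/`Đ` are separate code points that do not decompose, so a real pipeline would map them explicitly):

```python
import unicodedata

def strip_diacritics(text: str) -> str:
    """Remove combining diacritical marks to produce the input side of a
    restoration pair. Caveat: Vietnamese đ/Đ have no NFD decomposition
    and pass through unchanged here."""
    # NFD splits base characters from their combining marks,
    # e.g. "ô" -> "o" + U+0302 (combining circumflex).
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

print(strip_diacritics("không"))       # -> "khong"
print(strip_diacritics("tiếng Việt"))  # -> "tieng Viet"
```

A restoration model then learns the inverse mapping, from the stripped text back to the fully diacritized original.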
- Zh-En Multi-Domain Dataset
  The Zh-En multi-domain dataset consists of four balanced domains: news, patent, subtitles, and COVID-19.
- Machine Translation and Automated Analysis of the Sumerian Language Dataset
  The Machine Translation and Automated Analysis of the Sumerian Language dataset, which contains Sumerian texts in cuneiform script.
- MultiLexNorm dataset
  The MultiLexNorm dataset is used to evaluate the robustness of MT models to lexical normalization.
- MTNT dataset
  The MTNT dataset is used to evaluate the robustness of MT models to noisy text.
- FLORES-200 devtest dataset
  The FLORES-200 devtest dataset is used to evaluate the robustness of MT models to synthetic character perturbations.
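Synthetic character perturbations for robustness evaluation are usually simple random edits applied to clean test sentences. A minimal sketch of one possible perturbation function (the edit operations and rate here are illustrative choices, not the specific noise model used with FLORES-200):

```python
import random

def perturb(sentence: str, rate: float = 0.1, seed: int = 0) -> str:
    """Apply random character-level edits (delete, duplicate, or swap with
    the next character) to a sentence, each position perturbed with
    probability `rate`. Seeded for reproducibility."""
    rng = random.Random(seed)
    chars = list(sentence)
    out = []
    i = 0
    while i < len(chars):
        if rng.random() < rate:
            op = rng.choice(["delete", "duplicate", "swap"])
            if op == "delete":
                i += 1               # drop this character
                continue
            if op == "duplicate":
                out.append(chars[i])  # emit the character twice
                out.append(chars[i])
                i += 1
                continue
            if op == "swap" and i + 1 < len(chars):
                out.append(chars[i + 1])  # transpose adjacent pair
                out.append(chars[i])
                i += 2
                continue
        out.append(chars[i])
        i += 1
    return "".join(out)

print(perturb("the quick brown fox", rate=0.2, seed=42))
```

Translating both the clean and perturbed test sets and comparing scores gives a simple measure of a model's robustness to character noise.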
- Covid-19 MLIA @ Eval initiative
  The Covid-19 MLIA @ Eval initiative consists of three Natural Language Processing tasks: information extraction, multilingual semantic search, and machine translation. The goal...
- Penn Treebank
  The Penn Treebank dataset contains one million words of 1989 Wall Street Journal material annotated in Treebank II style, with 42k sentences of varying lengths.