-
Vietnamese Diacritic Restoration Dataset
The dataset used for Vietnamese diacritic restoration problem, consisting of 180,000 sentence pairs. -
Machine Translation and Automated Analysis of the Sumerian Language Dataset
The Machine Translation and Automated Analysis of the Sumerian Language dataset, which contains Sumerian texts in cuneiform script. -
Covid-19 MLIA @ Eval initiative
The Covid-19 MLIA @ Eval initiative consists of three Natural Language Processing tasks: information extraction, multilingual semantic search and machine translation. The goal... -
Penn Treebank
The Penn Treebank dataset contains one million words of 1989 Wall Street Journal material annotated in Treebank II style, with 42k sentences of varying lengths.