-
Indian Legal Documents Corpus
The Indian Legal Documents Corpus (ILDC) dataset contains cases from the Indian Supreme Court, published in English. -
Arabic Names Transliterated in English
The dataset used to train the Arabic name transliteration model. It contains 3,600 Arabic names transliterated into English. -
Hebrew Names Transliterated in English
The dataset used for training the language identification model, containing 16,500 Hebrew names transliterated in English, 3,600 Arabic names transliterated in English, and... -
Historical texts for spelling variation analysis
A dataset of historical texts in English and German, used for spelling variation analysis. -
Dx dataset
A dataset for stance detection; similar datasets also exist in other languages, such as English. -
WMT14 En-De
The WMT14 En-De dataset contains 4.5M pairs of English and German sentences. -
Hindi-English Code-Switched Sentences
The dataset used in the paper is a collection of Hindi-English code-switched sentences. -
CoNLL-2009
The CoNLL-2009 dataset is used for the semantic role labeling (SRL) task. It contains 10,177 sentences in English and 10,177 sentences in Chinese. -
ArzEnSEG corpus
The ArzEnSEG corpus is a morphologically annotated dataset for code-switched Egyptian Arabic-English. -
ArzEn parallel corpus
The ArzEn parallel corpus consists of speech transcriptions gathered through informal interviews with bilingual Egyptian Arabic-English speakers, as well as their English... -
English-to-Chinese Controlled Machine Translation
The dataset for English-to-Chinese controlled machine translation. -
English Controlled Machine Translation
The dataset for English controlled machine translation. -
English Controlled Paraphrase Generation
The dataset for English controlled paraphrase generation. -
LDC2015E86
LDC2015E86 is a dataset of Abstract Meaning Representation (AMR) annotations for English. -
SemEval07 corpus
The SemEval07 corpus is a dataset for semantic frame parsing in English.