Dataset - LDM

Indian Legal Documents Corpus

The Indian Legal Documents Corpus (ILDC) dataset contains cases from the Indian Supreme Court, published in English.
- Dataset
- JSON
Arabic Names Transliterated in English

The dataset used for training the Arabic names transliteration model, containing 3,600 Arabic names transliterated in English.
- Dataset
- JSON
Hebrew Names Transliterated in English

The dataset used for training the language identification model, containing 16,500 Hebrew names transliterated in English, 3,600 Arabic names transliterated in English, and...
- Dataset
- JSON
WMT'15

Character-level neural machine translation (NMT) dataset for the English-to-German, English-to-Czech, and English-to-Finnish language pairs.
- Dataset
- JSON
Historical texts for spelling variation analysis

A dataset of historical texts in English and German, used for spelling variation analysis.
- Dataset
- JSON
Dx dataset

A dataset for stance detection; stance detection datasets also exist in other languages, such as English.
- Dataset
- JSON
WMT14 En-De

The WMT14 En-De dataset contains 4.5M pairs of English and German sentences.
- Dataset
- JSON
WAT2015

The dataset used in the paper is the WAT2015 translation task from Japanese (ja) to/from English (en) and Chinese (zh).
- Dataset
- JSON
LRW

The LRW dataset is an English language lip reading dataset, containing 500 different words, each spoken by over 1,000 persons.
- Dataset
- JSON
Hindi-English Code-Switched Sentences

The dataset used in the paper is a collection of Hindi-English code-switched sentences.
- Dataset
- JSON
WMT 2015

A German-to-English parallel corpus used to build the NMT model.
- Dataset
- JSON
CoNLL-2009

The CoNLL-2009 dataset is used for the semantic role labeling (SRL) task. It contains 10,177 sentences in English and 10,177 sentences in Chinese.
- Dataset
- JSON
ArzEnSEG corpus

The ArzEnSEG corpus is a morphologically annotated dataset for code-switched Egyptian Arabic-English.
- Dataset
- JSON
ArzEn parallel corpus

The ArzEn parallel corpus consists of speech transcriptions gathered through informal interviews with bilingual Egyptian Arabic-English speakers, as well as their English...
- Dataset
- JSON
PPDB 2.0

The PPDB 2.0 dataset is a paraphrase database.
- Dataset
- JSON
English-to-Chinese Controlled Machine Translation

The dataset for English-to-Chinese controlled machine translation.
- Dataset
- JSON
English Controlled Machine Translation

The dataset for English controlled machine translation.
- Dataset
- JSON
English Controlled Paraphrase Generation

The dataset for English controlled paraphrase generation.
- Dataset
- JSON
LDC2015E86

LDC2015E86 is a dataset of abstract meaning representation (AMR) annotations for English.
- Dataset
- JSON
SemEval07 corpus

The SemEval07 corpus is a dataset for semantic frame parsing in English.
- Dataset
- JSON

You can also access this registry using the API (see API Docs).

44 datasets found