-
AIDA-CoNLL
The AIDA-CoNLL dataset consists of annotated entities in a large corpus for named entity disambiguation tasks. -
FQuAD: French Question Answering Dataset
The French Question Answering Dataset (FQuAD) is a native Reading Comprehension dataset comprising questions and answers extracted from Wikipedia articles. It aims to provide a... -
English Web Treebank
The English Web Treebank is part of the Universal Dependencies framework and serves as a syntactically and semantically annotated corpus for training and evaluating dependency... -
EmoContext Dataset
The EmoContext task dataset consists of conversations extracted from social media, intended for emotion detection, annotated with four main emotions: Happy, Sad, Angry, and... -
ClueWeb09-B
ClueWeb includes documents from ClueWeb09-B and queries from the TREC Web Track ad hoc retrieval task 2009-2012. The dataset consists of 200 queries with relevance judgements... -
LINNAEUS Dataset
The LINNAEUS dataset is a corpus for species name identification in biomedical literature. -
Species-800 Corpus
The Species-800 corpus is used for species name recognition in text. -
JNLPBA Corpus
The JNLPBA corpus serves as a benchmark for bio-entity recognition tasks. -
BioCreative V CDR Corpus
The BioCreative V CDR task corpus is a resource for chemical disease relation extraction. -
English-Finnish and English-Estonian Datasets
Monolingual English datasets consisting of backtranslated and parallel data used for training the translation models between English, Finnish, and Estonian. -
Finnish-Estonian Parallel Data
A bilingual corpus created by triangulating English–Finnish and English–Estonian parallel data, resulting in a set of 679,252 sentence pairs used to extract cognates and improve... -
WMT 2014 English-German Translation Dataset
The WMT 2014 English-German translation dataset consists of parallel sentences in English and German used to evaluate machine translation models.