- ClueWeb09-B
ClueWeb includes documents from ClueWeb09-B and queries from the TREC Web Track ad hoc retrieval task 2009-2012. The dataset consists of 200 queries with relevance judgements...
- LINNAEUS Dataset
The LINNAEUS dataset is a corpus for species name identification in biomedical literature.
- Species-800 Corpus
The Species-800 corpus is used for species name recognition in text.
- JNLPBA Corpus
The JNLPBA corpus serves as a benchmark for bio-entity recognition tasks.
- BioCreative V CDR Corpus
The BioCreative V CDR task corpus is a resource for chemical-disease relation extraction.
- English-Finnish and English-Estonian Datasets
Monolingual English datasets consisting of backtranslated and parallel data used for training the translation models between English, Finnish, and Estonian.
- Finnish-Estonian Parallel Data
A bilingual corpus created by triangulating English–Finnish and English–Estonian parallel data, resulting in a set of 679,252 sentence pairs used to extract cognates and improve...
- WMT 2014 English-German Translation Dataset
The WMT 2014 English-German translation dataset consists of parallel sentences in English and German used to evaluate machine translation models.
- Commoncrawl Dataset
The commoncrawl dataset provides parallel text collected from the web, serving as a resource for training and evaluating machine translation systems.
- Europarl Corpus
The Europarl corpus is an English-French dataset derived from the proceedings of the European Parliament, used for training and evaluating SMT systems.
- WMT Biomedical Test Set
The WMT Biomedical test set includes Medline abstracts intended for evaluating the translation of scientific texts.
- WMT 2016 News Translation Test Set
The WMT 2016 news translation test set is used for evaluating translation performance in the context of news articles.
- Multi30k Dataset
The Multi30k dataset is a multilingual extension of the Flickr30k image-captioning dataset, containing English and German language captions for images.
- Heuristic-based Adversarial Dataset
A dataset generated from templates that create sentences designed to reveal weaknesses in NLI models based on syntactic heuristics, focusing on entailment and non-entailment...
- Error-Analysis Motivated Attacks
This dataset consists of various tests categorized by the mistakes NLI models make, including tests such as antonyms and negation words, designed to stress-test NLP models.
- Single Word Replacement Attacks
The dataset is created by modifying SNLI examples with single word replacements that test lexical inferences and simple world knowledge, aimed at analyzing model robustness...