ClueWeb09-B

ClueWeb includes documents from ClueWeb09-B and queries from the TREC Web Track ad hoc retrieval task 2009-2012. The dataset consists of 200 queries with relevance judgements...

Dataset
JSON

LINNAEUS Dataset

The LINNAEUS dataset is a corpus for species name identification in biomedical literature.

Dataset
JSON

Species-800 Corpus

The Species-800 corpus is used for species name recognition in text.

Dataset
JSON

JNLPBA Corpus

The JNLPBA corpus serves as a benchmark for bio-entity recognition tasks.

Dataset
JSON

BioCreative V CDR Corpus

The BioCreative V CDR task corpus is a resource for chemical-disease relation extraction.

Dataset
JSON

English-Finnish and English-Estonian Datasets

Monolingual English datasets consisting of backtranslated and parallel data used for training the translation models between English, Finnish, and Estonian.

Dataset
JSON

Finnish-Estonian Parallel Data

A bilingual corpus created by triangulating English–Finnish and English–Estonian parallel data, resulting in a set of 679,252 sentence pairs used to extract cognates and improve...

Dataset
JSON

WMT 2014 English-German Translation Dataset

The WMT 2014 English-German translation dataset consists of parallel sentences in English and German used to evaluate machine translation models.

Dataset
JSON

Commoncrawl Dataset

The commoncrawl dataset provides parallel text collected from the web, serving as a resource for training and evaluating machine translation systems.

Dataset
JSON

Europarl Corpus

The Europarl corpus is an English-French dataset derived from the proceedings of the European Parliament, used for training and evaluating statistical machine translation (SMT) systems.

Dataset
JSON

WMT Biomedical Test Set

The WMT Biomedical test set includes Medline abstracts intended for evaluating the translation of scientific texts.

Dataset
JSON

WMT 2016 News Translation Test Set

The WMT 2016 news translation test set is used for evaluating translation performance in the context of news articles.

Dataset
JSON

Multi30k Dataset

The Multi30k dataset is a multilingual extension of the Flickr30k image-captioning dataset, containing English- and German-language captions for images.

Dataset
JSON

MultiWOZ

The MultiWOZ dataset is a large-scale multi-domain dialogue dataset containing fully annotated human-human conversations related to tourism, spanning seven domains...

Dataset
JSON

Heuristic-based Adversarial Dataset

A dataset generated from templates that create sentences designed to reveal weaknesses in NLI models based on syntactic heuristics, focusing on entailment and non-entailment...

Dataset
JSON

Error-Analysis Motivated Attacks

This dataset consists of tests categorized by the mistakes NLI models make, such as tests involving antonyms and negation words, designed to stress-test NLP models.

Dataset
JSON

Single Word Replacement Attacks

The dataset is created by modifying SNLI examples with single-word replacements that test lexical inferences and simple world knowledge, aimed at analyzing model robustness...

Dataset
JSON

MNLI

The Multi-Genre Natural Language Inference (MNLI) corpus is a crowd-sourced collection of 433k sentence pairs annotated with textual entailment information, covering multiple...

Dataset
JSON

FewRel

FewRel is a relation classification dataset that consists of sentences with annotated subject and object entity mentions, used for supervised relation classification tasks.

Dataset
JSON

AIDA

The AIDA dataset is annotated with Wikipedia URLs for entity linking; its documents require linking entity mentions in text to knowledge base entities.

Dataset
JSON

24,167 datasets found