CommonCrawl Dataset
The CommonCrawl dataset provides parallel text mined from web crawls, serving as a resource for training and evaluating machine translation systems.
Europarl Corpus
The Europarl corpus is an English-French parallel dataset derived from the proceedings of the European Parliament, used for training and evaluating statistical machine translation (SMT) systems.
WMT Biomedical Test Set
The WMT Biomedical test set includes Medline abstracts intended for evaluating the translation of scientific texts.
WMT 2016 News Translation Test Set
The WMT 2016 news translation test set is used for evaluating translation performance on news articles.
Multi30k dataset
The Multi30k dataset is a multilingual extension of the Flickr30k image-captioning dataset, containing English and German captions for the images.
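Parallel corpora such as CommonCrawl and Europarl are typically distributed as pairs of line-aligned plain-text files, one file per language, where line i of each file forms a translation pair. A minimal reading sketch (the file names and sentence contents below are illustrative, not real corpus data):

```python
from pathlib import Path

def read_parallel_corpus(src_path, tgt_path):
    """Yield (source, target) sentence pairs from two line-aligned files."""
    with open(src_path, encoding="utf-8") as src, \
         open(tgt_path, encoding="utf-8") as tgt:
        for src_line, tgt_line in zip(src, tgt):
            src_line, tgt_line = src_line.strip(), tgt_line.strip()
            if src_line and tgt_line:  # skip empty lines on either side
                yield src_line, tgt_line

# Tiny illustrative files (hypothetical content, not actual corpus data).
Path("toy.en").write_text(
    "Resumption of the session\nI declare the session open\n", encoding="utf-8")
Path("toy.fr").write_text(
    "Reprise de la session\nJe déclare la session ouverte\n", encoding="utf-8")

pairs = list(read_parallel_corpus("toy.en", "toy.fr"))
```

Skipping pairs with an empty side is a cheap guard; real pipelines add further filtering (length ratios, language identification) because web-mined corpora like CommonCrawl are noisy.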
Heuristic-based Adversarial Dataset
A dataset generated from templates that produce sentences designed to expose weaknesses in NLI models that rely on syntactic heuristics, labeled as entailment or non-entailment.
Error-Analysis Motivated Attacks
This dataset consists of tests categorized by the kinds of mistakes NLI models make, including antonym and negation-word tests, and is designed to stress-test NLP models.
Single Word Replacement Attacks
The dataset is created by modifying SNLI examples with single-word replacements that probe lexical inference and simple world knowledge, aimed at analyzing model robustness.
Multi-Genre Natural Language Inference (MultiNLI) dataset
The MultiNLI corpus consists of 433k sentence pairs drawn from ten genres of spoken and written text. It provides matched and mismatched development and test sets.
Stanford Natural Language Inference (SNLI) dataset
The Stanford Natural Language Inference (SNLI) dataset consists of sentence pairs labeled with semantic relations (entailment, contradiction, or neutral). In this work, the authors ignore the labels and...
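The template-generation idea behind the heuristic-based adversarial data above can be sketched as follows; the template, word lists, and the single heuristic shown (lexical overlap: a hypothesis built only from premise words need not be entailed) are hypothetical stand-ins, not the actual dataset contents:

```python
import itertools

# Hypothetical word lists for a subject-verb-object template.
AGENTS = ["the doctor", "the lawyer"]
VERBS = ["saw", "helped"]

def lexical_overlap_examples():
    """Generate premise/hypothesis pairs with full lexical overlap
    but reversed argument structure, labeled non-entailment."""
    examples = []
    for a1, a2 in itertools.permutations(AGENTS, 2):
        for verb in VERBS:
            premise = f"{a1} {verb} {a2}"
            # Swapping the arguments keeps every word of the hypothesis
            # inside the premise, yet flips who does what to whom.
            hypothesis = f"{a2} {verb} {a1}"
            examples.append((premise, hypothesis, "non-entailment"))
    return examples

examples = lexical_overlap_examples()
```

A model that classifies by word overlap alone will predict entailment on every pair here and score at chance or below, which is exactly the failure mode such template-based sets are built to surface.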
Ubuntu Dialogue
The Ubuntu Dialogue dataset is extracted from Ubuntu Internet Relay Chat (IRC) logs and contains about 1.85 million conversations with an average of 5 utterances per conversation, making it well suited to multi-turn dialogue modeling.
Movie Triples
The Movie Triples dataset contains about 240,000 dialogue triples covering a wide range of topics, making it suitable for studying the relevance-diversity tradeoff in multi-turn dialogue generation.
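One common way to quantify the diversity side of that tradeoff is distinct-n, the ratio of unique to total n-grams across generated responses; the metric choice here is our illustration, not something prescribed by the dataset:

```python
def distinct_n(responses, n):
    """Ratio of unique n-grams to total n-grams across a list of responses."""
    total = 0
    unique = set()
    for response in responses:
        tokens = response.split()
        ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
        total += len(ngrams)
        unique.update(ngrams)
    return len(unique) / total if total else 0.0

# A degenerate, repetitive response set scores low on distinct-1.
dull = ["i do not know", "i do not know"]
```

Generic responses such as "i do not know" can be highly "relevant" on average yet collapse distinct-n toward zero, which is the tradeoff the Movie Triples setting is used to study.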
BERT Pretraining Dataset
The BERT pretraining data combines the English Wikipedia corpus and BookCorpus, totaling roughly 3.3B words, and is used for unsupervised pre-training.
WMT14 English-to-German
The WMT14 English-to-German dataset consists of about 4.5M parallel sentence pairs used for training machine translation systems.
IWSLT14 German-to-English
The IWSLT14 German-to-English dataset contains approximately 153K sentence pairs used for the machine translation task.
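Systems trained on corpora like WMT14 and IWSLT14 are conventionally scored with BLEU. A simplified single-reference, sentence-level sketch is below; production evaluations use corpus-level BLEU with standardized tokenization (e.g. the sacrebleu tool) rather than this whitespace-token version:

```python
import math
from collections import Counter

def sentence_bleu(candidate, reference, max_n=4):
    """Simplified BLEU: geometric mean of modified n-gram precisions
    (n = 1..max_n) times a brevity penalty. Single reference,
    whitespace tokenization."""
    cand, ref = candidate.split(), reference.split()
    precisions = []
    for n in range(1, max_n + 1):
        cand_ngrams = Counter(tuple(cand[i:i + n]) for i in range(len(cand) - n + 1))
        ref_ngrams = Counter(tuple(ref[i:i + n]) for i in range(len(ref) - n + 1))
        # Clip each candidate n-gram count by its count in the reference.
        overlap = sum(min(count, ref_ngrams[gram]) for gram, count in cand_ngrams.items())
        total = max(sum(cand_ngrams.values()), 1)
        precisions.append(overlap / total)
    if min(precisions) == 0:
        return 0.0  # unsmoothed: any zero precision zeroes the score
    log_avg = sum(math.log(p) for p in precisions) / max_n
    # Brevity penalty discourages overly short candidates.
    bp = 1.0 if len(cand) >= len(ref) else math.exp(1 - len(ref) / len(cand))
    return bp * math.exp(log_avg)
```

A perfect match scores 1.0, a candidate sharing no n-grams with the reference scores 0.0, and partial overlaps fall in between; reported WMT scores are this quantity at corpus level, usually multiplied by 100.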