CommonCrawl Dataset
The CommonCrawl dataset provides parallel text mined from web crawls, serving as a resource for training and evaluating machine translation systems.
Europarl Corpus
The Europarl corpus is an English-French parallel dataset derived from the proceedings of the European Parliament, used for training and evaluating statistical machine translation (SMT) systems.
WMT Biomedical Test Set
The WMT Biomedical test set includes Medline abstracts intended for evaluating the translation of scientific texts.
WMT 2016 News Translation Test Set
The WMT 2016 news translation test set is used for evaluating translation performance on news articles.
Multi30k dataset
The Multi30k dataset is a multilingual extension of the Flickr30k image-captioning dataset, containing English and German captions for the images.
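Parallel corpora such as CommonCrawl and Europarl are typically distributed as pairs of line-aligned plain-text files, one file per language, where line i of each file forms a translation pair. A minimal reading sketch (the file names and sentence contents below are illustrative, not real corpus data):

```python
from pathlib import Path

def read_parallel_corpus(src_path, tgt_path):
    """Yield (source, target) sentence pairs from two line-aligned files."""
    with open(src_path, encoding="utf-8") as src, \
         open(tgt_path, encoding="utf-8") as tgt:
        for src_line, tgt_line in zip(src, tgt):
            src_line, tgt_line = src_line.strip(), tgt_line.strip()
            if src_line and tgt_line:  # skip empty lines on either side
                yield src_line, tgt_line

# Tiny illustrative files (hypothetical content, not actual corpus data).
Path("toy.en").write_text(
    "Resumption of the session\nI declare the session open\n", encoding="utf-8")
Path("toy.fr").write_text(
    "Reprise de la session\nJe déclare la session ouverte\n", encoding="utf-8")

pairs = list(read_parallel_corpus("toy.en", "toy.fr"))
```

Skipping pairs with an empty side is a cheap guard; real pipelines add further filtering (length ratios, language identification) because web-mined corpora like CommonCrawl are noisy.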
Heuristic-based Adversarial Dataset
A dataset generated from templates that produce sentences designed to expose weaknesses in NLI models that rely on syntactic heuristics, labeled as entailment or non-entailment.
Error-Analysis Motivated Attacks
This dataset consists of tests categorized by the kinds of mistakes NLI models make, including antonym and negation-word tests, and is designed to stress-test NLP models.
Single Word Replacement Attacks
The dataset is created by modifying SNLI examples with single-word replacements that probe lexical inference and simple world knowledge, aimed at analyzing model robustness.
Multi-Genre Natural Language Inference (MultiNLI) dataset
The MultiNLI corpus consists of 433k sentence pairs drawn from ten genres of spoken and written text. It provides matched and mismatched development and test sets.
Stanford Natural Language Inference (SNLI) dataset
The Stanford Natural Language Inference (SNLI) dataset consists of sentence pairs labeled with semantic relations (entailment, contradiction, or neutral). In this work, the authors ignore the labels and...
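The template-generation idea behind the heuristic-based adversarial data above can be sketched as follows; the template, word lists, and the single heuristic shown (lexical overlap: a hypothesis built only from premise words need not be entailed) are hypothetical stand-ins, not the actual dataset contents:

```python
import itertools

# Hypothetical word lists for a subject-verb-object template.
AGENTS = ["the doctor", "the lawyer"]
VERBS = ["saw", "helped"]

def lexical_overlap_examples():
    """Generate premise/hypothesis pairs with full lexical overlap
    but reversed argument structure, labeled non-entailment."""
    examples = []
    for a1, a2 in itertools.permutations(AGENTS, 2):
        for verb in VERBS:
            premise = f"{a1} {verb} {a2}"
            # Swapping the arguments keeps every word of the hypothesis
            # inside the premise, yet flips who does what to whom.
            hypothesis = f"{a2} {verb} {a1}"
            examples.append((premise, hypothesis, "non-entailment"))
    return examples

examples = lexical_overlap_examples()
```

A model that classifies by word overlap alone will predict entailment on every pair here and score at chance or below, which is exactly the failure mode such template-based sets are built to surface.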
Ubuntu Dialogue
The Ubuntu Dialogue dataset is extracted from Ubuntu Internet Relay Chat (IRC) logs and contains about 1.85 million conversations with an average of 5 utterances per conversation, making it well suited to multi-turn dialogue modeling.
Movie Triples
The Movie Triples dataset contains about 240,000 dialogue triples covering a wide range of topics, making it suitable for studying the relevance-diversity tradeoff in multi-turn dialogue generation.
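One common way to quantify the diversity side of that tradeoff is distinct-n, the ratio of unique to total n-grams across generated responses; the metric choice here is our illustration, not something prescribed by the dataset:

```python
def distinct_n(responses, n):
    """Ratio of unique n-grams to total n-grams across a list of responses."""
    total = 0
    unique = set()
    for response in responses:
        tokens = response.split()
        ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
        total += len(ngrams)
        unique.update(ngrams)
    return len(unique) / total if total else 0.0

# A degenerate, repetitive response set scores low on distinct-1.
dull = ["i do not know", "i do not know"]
```

Generic responses such as "i do not know" can be highly "relevant" on average yet collapse distinct-n toward zero, which is the tradeoff the Movie Triples setting is used to study.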
BERT Pretraining Dataset
The BERT pretraining data combines the English Wikipedia corpus and BookCorpus, totaling roughly 3.3B words, and is used for unsupervised pre-training.
WMT14 English-to-German
The WMT14 English-to-German dataset consists of about 4.5M parallel sentence pairs used for training machine translation systems.
IWSLT14 German-to-English
The IWSLT14 German-to-English dataset contains approximately 153K sentence pairs used for the machine translation task.
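Systems trained on corpora like WMT14 and IWSLT14 are conventionally scored with BLEU. A simplified single-reference, sentence-level sketch is below; production evaluations use corpus-level BLEU with standardized tokenization (e.g. the sacrebleu tool) rather than this whitespace-token version:

```python
import math
from collections import Counter

def sentence_bleu(candidate, reference, max_n=4):
    """Simplified BLEU: geometric mean of modified n-gram precisions
    (n = 1..max_n) times a brevity penalty. Single reference,
    whitespace tokenization."""
    cand, ref = candidate.split(), reference.split()
    precisions = []
    for n in range(1, max_n + 1):
        cand_ngrams = Counter(tuple(cand[i:i + n]) for i in range(len(cand) - n + 1))
        ref_ngrams = Counter(tuple(ref[i:i + n]) for i in range(len(ref) - n + 1))
        # Clip each candidate n-gram count by its count in the reference.
        overlap = sum(min(count, ref_ngrams[gram]) for gram, count in cand_ngrams.items())
        total = max(sum(cand_ngrams.values()), 1)
        precisions.append(overlap / total)
    if min(precisions) == 0:
        return 0.0  # unsmoothed: any zero precision zeroes the score
    log_avg = sum(math.log(p) for p in precisions) / max_n
    # Brevity penalty discourages overly short candidates.
    bp = 1.0 if len(cand) >= len(ref) else math.exp(1 - len(ref) / len(cand))
    return bp * math.exp(log_avg)
```

A perfect match scores 1.0, a candidate sharing no n-grams with the reference scores 0.0, and partial overlaps fall in between; reported WMT scores are this quantity at corpus level, usually multiplied by 100.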