- Natural Questions (NQ)
Natural Questions (NQ) is a dataset of Google search queries paired with answers that human annotators identified in Wikipedia pages.
- Stanford Question Answering Dataset 2.0 (SQuAD 2.0)
SQuAD 2.0 is a dataset of questions about Wikipedia passages, written by human annotators while viewing those passages.
- Synthetic Parallel Corpus
A dataset composed of synthetic parallel sentences generated by translating English monolingual data.
- Chinese Gigaword
The Chinese monolingual corpus used for training models, selected based on quality metrics.
- English Gigaword Corpus
The English monolingual corpus used to create synthetic training data through back-translation.
- CWMT Corpus
The bilingual training corpus for Chinese-to-English translation, consisting of parallel sentences selected to maximize translation quality.
- NIST Chinese-to-English dataset
The NIST Chinese-to-English dataset includes bilingual sentence pairs for training, development, and evaluation in neural machine translation.
- IWSLT 2017 German-English (DE-EN)
The IWSLT 2017 dataset is used for German-to-English machine translation, with a focus on spoken language.
- KFTT Japanese-English (JA-EN)
The KFTT dataset is used for Japanese-English machine translation and for evaluating translation performance.
- Music Glove Sensor Dataset
The dataset consists of sensor readings collected from a music glove instrument, including pressure sensors, flex sensors, and IMU data, alongside MIDI outputs from a connected...
- Google Distant Supervision (GDS)
The Google Distant Supervision (GDS) dataset is an extension of the Google relation extraction corpus with additional instances from entity pairs.
- WMT18 Dataset
The dataset is drawn from the WMT18 training data and includes randomly selected samples of human references and machine translations, together with their source sentences, for training purposes.
- WMT14 Dataset
The dataset consists of translations produced by a state-of-the-art neural machine translation (NMT) Transformer model. It follows the WMT14 data setup, optimized on the test...
- Google Universal Dependency Treebanks
Google Universal Dependency Treebanks provide syntactic annotations used for training and evaluation across various languages, including Indonesian, Korean, and Japanese.
- CoNLL 2003 Named Entity Recognition Dataset
The CoNLL 2003 dataset is used for Named Entity Recognition (NER), containing annotations for four types of named entities.
- CoNLL 2012 Coreference Resolution Shared Task
The CoNLL 2012 shared task dataset is designed for coreference resolution, consisting of articles annotated with mentions that refer to the same entities.
- Stanford Natural Language Inference (SNLI) Corpus
The SNLI dataset consists of human-written English sentences annotated with the labels entailment, contradiction, or neutral, aimed at measuring textual entailment.
- One Billion Word Benchmark
The One Billion Word Benchmark is a dataset used for measuring progress in statistical language modeling, featuring a large unannotated text corpus.