Natural Questions (NQ)

Natural Questions (NQ) is a dataset of Google search queries paired with answers drawn from Wikipedia pages by human annotators.

Dataset
JSON

Stanford Question Answering Dataset 2.0 (SQuAD 2.0)

SQuAD 2.0 is a dataset of questions about Wikipedia passages, written by human annotators while viewing those passages.

Dataset
JSON

Synthetic Parallel Corpus

A dataset composed of synthetic parallel sentences generated from English monolingual data through translation.

Dataset
JSON

Chinese Gigaword

The Chinese monolingual corpus used for training models, selected based on quality metrics.

Dataset
JSON

English Gigaword Corpus

The English monolingual corpus used to create synthetic data for training models by back-translation.

Dataset
JSON

CWMT Corpus

The bilingual training corpus for Chinese-to-English translation, consisting of parallel sentences selected to maximize translation quality.

Dataset
JSON

NIST Chinese-to-English dataset

The NIST Chinese-to-English dataset includes bilingual sentence pairs for training, development, and evaluation in neural machine translation.

Dataset
JSON

IWSLT 2017 German−English (DE−EN)

The IWSLT 2017 dataset is used for German-to-English machine translation, with a focus on spoken language.

Dataset
JSON

KFTT Japanese−English (JA−EN)

The KFTT dataset is used for the Japanese-English machine translation task, facilitating the evaluation of translation performance.

Dataset
JSON

Music Glove Sensor Dataset

The dataset consists of sensor readings collected from a music glove instrument, including pressure sensors, flex sensors, and IMU data, alongside MIDI outputs from a connected...

Dataset
JSON

Google Distant Supervision (GDS)

The Google Distant Supervision (GDS) dataset is an extension of the Google relation extraction corpus with additional instances from entity pairs.

Dataset
JSON

WMT18 Dataset

The dataset comes from WMT18 training data and includes randomly selected samples of human references and machine translations alongside their sources for training purposes.

Dataset
JSON

WMT14 Dataset

The dataset consists of translations produced by a state-of-the-art neural machine translation (NMT) Transformer model. It follows the WMT14 data setup, optimized on the test...

Dataset
JSON

IMST-UD

IMST-UD is the Turkish Dependency Treebank, which contains syntactically annotated data used for dependency extraction and parsing.

Dataset
JSON

Google Universal Dependency Treebanks

Google Universal Dependency Treebanks provide syntactic annotations that are used for training and evaluation across various languages including Indonesian, Korean, and Japanese.

Dataset
JSON

EuroParl

The EuroParl dataset consists of multi-parallel sentences in several European languages, used for training and evaluation with automatic Universal Dependencies annotations.

Dataset
JSON

CoNLL 2003 Named Entity Recognition Dataset

The CoNLL 2003 dataset is used for Named Entity Recognition (NER), containing annotations for four types of named entities.

Dataset
JSON

CoNLL 2012 Coreference Resolution Shared Task

The CoNLL 2012 shared task dataset is designed for coreference resolution, consisting of articles annotated with mentions that refer to the same entities.

Dataset
JSON

Stanford Natural Language Inference (SNLI) Corpus

The SNLI dataset consists of human-written English sentences annotated with the labels entailment, contradiction, or neutral, aimed at measuring textual entailment.

Dataset
JSON

One Billion Word Benchmark

The One Billion Word Benchmark is a dataset used for measuring progress in statistical language modeling, featuring a large unannotated text corpus.

Dataset
JSON
