20,499 datasets found
  • USPTO-50k

    The USPTO-50k dataset is a curated subset of chemical reaction examples from patent literature, where each reaction is labeled with one of ten reaction classes, focusing on the...
  • FLoRes Benchmark

    The FLoRes dataset is a benchmark designed for low-resource machine translation. It includes English-to-Nepali translations with approximately 564,000 parallel sentences, making...
  • IWSLT14

    The IWSLT14 dataset is a low-resource dataset used for machine translation tasks, specifically for the German-to-English translation direction, containing approximately 160,000...
  • TREC CAR

    TREC CAR is a synthetic dataset based on Wikipedia that consists of about 29 million passages, using titles and section headings to generate queries associated with relevance...
  • Audio Visual Scene-aware Dialog dataset

    The Audio Visual Scene-aware Dialog (AVSD) dataset requires systems to generate answers about events observed in a video, using the video and the previous dialog turns as context.
  • VisDial dataset

    The VisDial dataset consists of dialogs composed of question-answer pairs about an image, aiming to enhance visual dialog systems.
  • JCR, Europarl, news-commentary, and wikititles corpora

    The training data is made up of the JCR, Europarl, news-commentary, and wikititles corpora, used to train machine translation systems between Spanish and Portuguese.
  • French-20K

    The French-20K dataset is used for cross-lingual evaluation of the semantic parsing approach, where training data from English and German is leveraged due to limited French data.
  • German-20K

    The German-20K dataset is utilized for training and evaluating the model's performance on semantic parsing tasks in the German language.
  • English-Wiki

    The English-Wiki dataset is used for training and evaluation of the UCCA semantic parsing model, consisting of syntactic and semantic structures in the English language.
  • SQuAD 1.1

    The SQuAD 1.1 dataset serves as the source dataset for generating questions from provided answer spans.
  • ASPEC

    The ASPEC dataset is a collection of scientific papers used for neural machine translation tasks between English and Japanese, providing parallel corpora for model training and...
  • French-English Translation Dataset

    The training data for the FR-EN task contains around 0.2M sentence pairs.
  • German-English Translation Dataset

    The training data for the DE-EN task consists of 4.6M sentence pairs.
  • NIST Chinese-English Translation Dataset

    The training data for the ZH-EN task consists of 1.8M sentence pairs. NIST02 is used as the development set, and NIST05, NIST06, and NIST08 as the test sets.
  • CASP12 ProteinNet dataset

    The CASP12 ProteinNet dataset consists of around 50,000 protein structures used to evaluate models for protein structure prediction, specifically in the context of free modeling...
  • Propaganda Techniques Corpus

    The Propaganda Techniques Corpus (PTC) is a dataset consisting of news articles with sentences annotated for the presence of specific propaganda techniques, aimed at binary...
  • VQA1.0

    VQA1.0 is a dataset used to derive VQG data, consisting of 82,783 training images, 40,504 validation images, and 81,434 testing images, where each image has 3 associated questions.
  • VQGCOCO

    VQGCOCO is a dataset consisting of 2,500 training images, 1,250 validation images, and 1,250 test images from MS COCO, each with 5 corresponding questions and 5 ground-truth captions.
  • CoNLL 2002/2003 NER

    The CoNLL 2002/2003 NER corpus is a standard dataset for named entity recognition, providing annotated data for Spanish and Dutch (2002) and for English and German (2003).