20,499 datasets found
  • USPTO-50k

    The USPTO-50k dataset is a curated subset of chemical reaction examples from patent literature, where each reaction is labeled with one of ten reaction classes, focusing on the...
  • FLoRes Benchmark

    The FLoRes dataset is a benchmark designed for low-resource machine translation. It includes English-to-Nepali translations with approximately 564,000 parallel sentences, making...
  • IWSLT14

    The IWSLT14 dataset is a low-resource dataset used for machine translation tasks, specifically for the German-to-English translation direction, containing approximately 160,000...
  • TREC CAR

    TREC CAR is a synthetic dataset based on Wikipedia that consists of about 29 million passages, using titles and section headings to generate queries associated with relevance...
  • Audio Visual Scene-aware Dialog dataset

    The Audio Visual Scene-aware Dialog (AVSD) dataset requires systems to generate answers about events observed in a video, using the video and the previous dialog turns as context.
  • VisDial dataset

    The VisDial dataset consists of dialogs composed of question-answer pairs about an image, aiming to enhance visual dialog systems.
  • JCR, Europarl, news-commentary, and wikititles corpora

    The training data is made up of the JCR, Europarl, news-commentary, and wikititles corpora, used to train machine translation systems between Spanish and Portuguese.
  • French-20K

    The French-20K dataset is used for cross-lingual evaluation of the semantic parsing approach, where training data from English and German is leveraged due to limited French data.
  • German-20K

    The German-20K dataset is utilized for training and evaluating the model's performance on semantic parsing tasks in the German language.
  • English-Wiki

    The English-Wiki dataset is used for training and evaluation of the UCCA semantic parsing model, consisting of syntactic and semantic structures in the English language.
  • SQuAD 1.1

    The SQuAD 1.1 dataset serves as the source dataset for generating questions from provided answer spans.
  • ASPEC

    The ASPEC dataset is a collection of scientific papers used for neural machine translation tasks between English and Japanese, providing parallel corpora for model training and...
  • French-English Translation Dataset

    The training data for the FR-EN task contains around 0.2M sentence pairs.
  • German-English Translation Dataset

    The training data for the DE-EN task consists of 4.6M sentence pairs.
  • NIST Chinese-English Translation Dataset

    The training data for the ZH-EN task consists of 1.8M sentence pairs. NIST02 is used as the development set, and NIST05, NIST06, and NIST08 as the test sets.
  • CASP12 ProteinNet dataset

    The CASP12 ProteinNet dataset consists of around 50,000 protein structures used to evaluate models for protein structure prediction, specifically in the context of free modeling...
  • Propaganda Techniques Corpus

    The Propaganda Techniques Corpus (PTC) is a dataset consisting of news articles with sentences annotated for the presence of specific propaganda techniques, aimed at binary...
  • VQA1.0

    VQA1.0 is a dataset used to derive VQG data, consisting of 82,783 training images, 40,504 validation images, and 81,434 testing images, where each image has 3 associated questions.
  • VQGCOCO

    VQGCOCO is a dataset consisting of 2,500 training images, 1,250 validation images, and 1,250 test images from MS COCO, each with 5 corresponding questions and 5 ground-truth captions.
  • CoNLL 2002/2003 NER

    The CoNLL 2002/2003 NER corpus is a standard dataset for named entity recognition, providing annotated data for Spanish and Dutch (2002) and for English and German (2003).