20,491 datasets found

  • Spherical MNIST

    Spherical MNIST is constructed from the MNIST dataset by back-projecting the digits into an equirectangular projection with a resolution of 160×80 (a projection sketch follows this list). The digit labels are used to...
  • 1 Billion Word Language Model Benchmark

    The 1 Billion Word Language Model Benchmark is a dataset used for measuring progress in statistical language modeling, consisting of a large collection of text data.
  • Caltech-UCSD Birds 200 dataset (CUB-200)

    The Caltech-UCSD Birds-200-2011 dataset (CUB-200-2011) contains 11,788 images of 200 bird species and is widely used as a benchmark for text-to-image generation (a metadata-indexing sketch follows this list).
  • TaoMultimodal Dataset

    A large-scale dataset for multi-modal pretraining in Chinese, consisting of 3.1M image-text pairs from the mobile Taobao platform.
  • French Street Name Signs Dataset

    The French Street Name Signs (FSNS) dataset contains images of French street name signs extracted from Google Street View, featuring low-resolution text lines in natural scenes...
  • WFLW

    WFLW contains 10,000 faces, each with 98 fully manually annotated landmarks, and is designed as a challenging dataset with rich attribute annotations.
  • 300-W

    300-W is currently the most widely used dataset for facial landmark detection, created from four datasets (AFW, LFPW, HELEN, and IBUG), with each image annotated with 68...
  • Google Billion Word dataset

    The Google Billion Word dataset is one of the largest language modeling datasets with almost one billion tokens and a vocabulary of over 800K words, based on an English corpus...
  • MojiTalk

    The MojiTalk dataset consists of 596,959 post-response pairs from Twitter, where each response is labeled with one of 64 emojis indicating the emotion of the response.
  • CNN/Daily Mail corpus

    The CNN/Daily Mail corpus contains pairs of online news articles and their summaries, consisting of approximately 287,000 training pairs, 13,368 validation pairs, and 11,490 test pairs (a loading sketch follows this list).
  • TL;DR Reddit corpus

    The TL;DR Reddit corpus consists of approximately 3 million content-summary pairs mined from Reddit, designed for the TL;DR challenge focusing on text summarization.
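
Below, a minimal sketch of the kind of back-projection the Spherical MNIST entry describes: rendering a planar 28×28 digit onto a 160×80 equirectangular grid. The inverse gnomonic mapping, the `fov` parameter, and the name `spherical_mnist_frame` are assumptions for illustration; the published dataset is generated with its own projection code.

```python
import numpy as np

def spherical_mnist_frame(digit, width=160, height=80, fov=np.pi / 2):
    """Render a 28x28 digit onto a (height, width) equirectangular grid.

    Illustrative inverse gnomonic projection centred on (lon=0, lat=0);
    not the exact projection used to build the published dataset.
    """
    assert digit.shape == (28, 28)
    lon = np.linspace(-np.pi, np.pi, width, endpoint=False)
    lat = np.linspace(-np.pi / 2, np.pi / 2, height)
    lon, lat = np.meshgrid(lon, lat)

    # Gnomonic projection onto the tangent plane at (0, 0); only the
    # front hemisphere (c > 0) reaches the plane.
    c = np.cos(lat) * np.cos(lon)
    x = np.where(c > 0, np.tan(lon), np.inf)
    y = np.where(c > 0, np.sin(lat) / np.maximum(c, 1e-9), np.inf)

    # Scale the plane so the digit fills the chosen field of view,
    # then sample with nearest-neighbour lookup.
    half = np.tan(fov / 2)
    col = np.round((x / half + 1) / 2 * 27)
    row = np.round((1 - (y / half + 1) / 2) * 27)
    inside = (col >= 0) & (col <= 27) & (row >= 0) & (row <= 27)
    frame = np.zeros((height, width), dtype=digit.dtype)
    frame[inside] = digit[row[inside].astype(int), col[inside].astype(int)]
    return frame

# Example: project a random "digit" and check the output shape.
frame = spherical_mnist_frame(np.random.rand(28, 28))
print(frame.shape)  # (80, 160)
```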
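
For CUB-200-2011, a short sketch for indexing the archive's standard metadata files (`images.txt`, `image_class_labels.txt`, `train_test_split.txt`). The `load_cub_index` helper and the `CUB_200_2011` root path are assumptions; adjust to a local copy.

```python
from pathlib import Path

def load_cub_index(root="CUB_200_2011"):
    """Index CUB-200-2011 as (image_path, label, is_train) triples."""
    root = Path(root)

    def id_map(name):
        # Each metadata file holds one "<image_id> <value>" pair per line.
        with open(root / name) as f:
            return dict(line.split() for line in f)

    paths = id_map("images.txt")
    labels = id_map("image_class_labels.txt")
    train = id_map("train_test_split.txt")

    index = [
        (root / "images" / paths[i], int(labels[i]) - 1, train[i] == "1")
        for i in paths
    ]
    assert len(index) == 11788  # 11,788 images across 200 classes
    return index
```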
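
For the CNN/Daily Mail corpus, the split sizes quoted above can be checked directly, assuming the Hugging Face `datasets` mirror (`cnn_dailymail`, config `3.0.0`):

```python
from datasets import load_dataset

cnn_dm = load_dataset("cnn_dailymail", "3.0.0")
for split in ("train", "validation", "test"):
    print(split, len(cnn_dm[split]))

# Each record pairs a full article with its bullet-point summary.
example = cnn_dm["train"][0]
print(example["article"][:200])
print(example["highlights"])
```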