CIFAR10

The CIFAR10 dataset is used for training and evaluating deep neural networks; in this study it is used to assess the performance of decision gates in ResNet-101 and...

SQuAD dataset

The dataset used for training BERT consists of a concatenation of Wikipedia and BooksCorpus, specifically focused on the SQuAD task.

VIPeR

The VIPeR dataset contains 632 persons, each with two images captured from different cameras, and is used for person re-identification tasks.

CUHK01

CUHK01 is a medium-sized dataset containing 3,884 images of 971 identities, intended for testing person re-identification methods.

CUHK03

CUHK03 is a large dataset containing 13,164 images of 1,360 identities captured by 6 cameras. It includes both detected and labeled images for training and testing.

PennTreebank

The PennTreebank dataset is used for language modeling, containing a large annotated corpus of English text to evaluate the task of predicting the next character or word based...

Nottingham

The Nottingham dataset contains British and American folk tunes and is used to evaluate models' capabilities in polyphonic music modeling.

Pano2Vid

Pano2Vid is a real-world 360-degree video dataset containing videos from several categories. The frames are sampled and resized to 640x320 resolution for training and testing,...

Spherical MNIST

Spherical MNIST is constructed from the MNIST dataset by back-projecting the digits onto an equirectangular projection with a resolution of 160x80. The digit labels are used to...

1 Billion Word Language Model Benchmark

The 1 Billion Word Language Model Benchmark is a dataset used for measuring progress in statistical language modeling, consisting of a large collection of text data.

Caltech-UCSD Birds 200 dataset (CUB-200)

The 2011 Caltech-UCSD Birds 200 dataset (CUB-200) contains 11,788 images of 200 different types of birds, widely used as a benchmark for text-to-image generation.

TaoMultimodal Dataset

A large-scale dataset for multi-modal pretraining in Chinese, consisting of 3.1M image-text pairs from the mobile Taobao platform.

French Street Name Signs Dataset

The French Street Name Signs (FSNS) dataset contains images of French street name signs extracted from Google Street View, featuring low-resolution text lines in natural scenes...

WFLW

WFLW contains 10,000 faces with 98 fully manually annotated landmarks, designed to be a challenging dataset with rich attribute annotations.

300-W

300-W is currently the most widely used dataset for facial landmark detection, created from four datasets including AFW, LFPW, HELEN, and IBUG, with each image annotated with 68...

Google Billion Word dataset

The Google Billion Word dataset is one of the largest language modeling datasets with almost one billion tokens and a vocabulary of over 800K words, based on an English corpus...

MojiTalk

The MojiTalk dataset consists of 596,959 post-response pairs from Twitter, where each response is labeled with one of 64 emojis indicating the emotion of the response.

CNN/Daily Mail corpus

The CNN/Daily Mail corpus contains pairs of online news articles and their summaries, consisting of approximately 287,000 training pairs, 13,368 validation pairs, and 11,490...

TL;DR Reddit corpus

The TL;DR Reddit corpus consists of approximately 3 million content-summary pairs mined from Reddit, designed for the TL;DR challenge focusing on text summarization.

20,499 datasets found