Colored MNIST dataset

The dataset used in the paper is a binary classification task in a 300-dimensional space. The training dataset is generated as follows. Each label y ∈ {−1, 1} is sampled uniformly at random. The first component x1 is sampled from a mixture of two Gaussian distributions, each with variance 0.15, centered at y and 1 − y, with mixing proportions 0.9 and 0.1 respectively. As the training dataset grows, the model learns this feature better, which improves test accuracy. The remaining 299 dimensions (x2, ..., x300) are drawn from a Gaussian distribution with zero mean and variance 0.1. They constitute the nuisance subspace, which is primarily used to memorise label noise.
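
The generation procedure described above can be sketched in a few lines of Python. This is a minimal illustration using only the standard library, not the authors' code: the function name make_dataset, its parameters, and the seed handling are assumptions made for this example. It follows the description literally, including the mixture centers y and 1 − y.

```python
import math
import random


def make_dataset(n_samples, n_dims=300, signal_var=0.15,
                 nuisance_var=0.1, p_main=0.9, seed=0):
    """Sample (X, y) as described in the text (hypothetical helper)."""
    rng = random.Random(seed)
    signal_std = math.sqrt(signal_var)      # std of each mixture component
    nuisance_std = math.sqrt(nuisance_var)  # std of the nuisance dimensions

    X, y = [], []
    for _ in range(n_samples):
        # Label drawn uniformly from {-1, 1}.
        label = rng.choice([-1, 1])
        # First component: Gaussian mixture centered at y (prob. 0.9)
        # or at 1 - y (prob. 0.1), as stated in the description.
        center = label if rng.random() < p_main else 1 - label
        x1 = rng.gauss(center, signal_std)
        # Remaining dimensions: zero-mean Gaussian nuisance features.
        nuisance = [rng.gauss(0.0, nuisance_std) for _ in range(n_dims - 1)]
        X.append([x1] + nuisance)
        y.append(label)
    return X, y
```

Calling make_dataset(1000) would return 1000 feature vectors of length 300 together with their labels. Fixing the seed makes the draw reproducible, which is convenient when comparing models trained on different training set sizes.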

Data and Resources

This dataset has no data

Cite this as

Borja Rodríguez-Gálvez, Ragnar Thobaben, Mikael Skoglund (2024). Dataset: Colored MNIST dataset. https://doi.org/10.57702/vhvkfs6c

Private DOI: This DOI is not yet resolvable. It is available for use in manuscripts, and will be published when the Dataset is made public.

Additional Info

Field	Value
Created	December 16, 2024
Last update	December 16, 2024
Defined In	https://doi.org/10.48550/arXiv.2406.19049
Citation	https://doi.org/10.48550/arXiv.1811.00073 https://doi.org/10.48550/arXiv.2006.06332
Author	Borja Rodríguez-Gálvez
More Authors	Ragnar Thobaben, Mikael Skoglund
Homepage	https://arxiv.org/abs/1904.01059