Colored MNIST dataset

The dataset used in the paper is a binary classification task in a 300-dimensional space. The training dataset is generated as follows. Each label y ∈ {−1, 1} is sampled uniformly at random. The first component x1 is sampled from a mixture of two Gaussian distributions, each with variance 0.15, centered at y and 1 − y with mixing proportions 0.9 and 0.1, respectively. As the training dataset grows, the model learns this feature better, which improves test accuracy. The remaining 299 dimensions (x2, ..., x300) are drawn from a zero-mean normal distribution with variance 0.1. They constitute the nuisance subspace, which is primarily used to memorise label noise.
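The generation procedure described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' code: the function name `make_dataset`, its parameters, and the seed handling are hypothetical, and the mixture centers are taken literally from the description (y and 1 − y).

```python
import numpy as np


def make_dataset(n_samples, n_dims=300, seed=0):
    """Sketch of the described procedure (hypothetical helper, not the authors' code)."""
    rng = np.random.default_rng(seed)

    # Labels y in {-1, 1}, sampled uniformly at random.
    y = rng.choice(np.array([-1, 1]), size=n_samples)

    # First component: Gaussian mixture with variance 0.15,
    # centered at y with probability 0.9 and at 1 - y with probability 0.1.
    use_main = rng.random(n_samples) < 0.9
    centers = np.where(use_main, y, 1 - y)
    x1 = rng.normal(loc=centers, scale=np.sqrt(0.15))

    # Remaining n_dims - 1 components: zero-mean Gaussian with variance 0.1
    # (the nuisance subspace).
    nuisance = rng.normal(loc=0.0, scale=np.sqrt(0.1), size=(n_samples, n_dims - 1))

    X = np.column_stack([x1, nuisance])
    return X, y
```

Calling `X, y = make_dataset(1000)` under these assumptions returns a feature matrix of shape (1000, 300) and a label vector of length 1000. Note that `scale` expects a standard deviation, so the variances from the description are passed through a square root.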

Data and Resources

Cite this as

Borja Rodríguez-Gálvez, Ragnar Thobaben, Mikael Skoglund (2024). Dataset: Colored MNIST dataset. https://doi.org/10.57702/vhvkfs6c

DOI retrieved: December 16, 2024

Additional Info

Field Value
Created December 16, 2024
Last update December 16, 2024
Defined In https://doi.org/10.48550/arXiv.2406.19049
Citation
  • https://doi.org/10.48550/arXiv.1811.00073
  • https://doi.org/10.48550/arXiv.2006.06332
Author Borja Rodríguez-Gálvez
More Authors
Ragnar Thobaben
Mikael Skoglund
Homepage https://arxiv.org/abs/1904.01059