The Shoes dataset consists of approximately 50,000 RGB images of shoes, spanning 4 categories and more than 3,000 subcategories.
The Birds-to-Words dataset contains 15,931 images (12,770 for training and 3,151 for testing), annotated with descriptions of the fine-grained differences between pairs of bird images.
The FashionIQ dataset contains images of fashion products across 3 categories: Dress, Toptee, and Shirt, with 46,609 images in the training set and 31,075 images in the validation set.
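As a rough illustration of how such annotations are typically organized, the sketch below loads FashionIQ-style triplets of (reference image, relative captions, target image). The file layout and field names (`captions/cap.<category>.<split>.json` with `candidate`, `target`, and `captions` keys) follow the public FashionIQ release but are assumptions here, not details given in this section.

```python
import json
from pathlib import Path
from typing import Dict, List

CATEGORIES = ["dress", "toptee", "shirt"]

def load_fashioniq_split(root: str, split: str) -> List[Dict]:
    """Collect (reference, relative captions, target) triplets for one split."""
    triplets = []
    for category in CATEGORIES:
        # Assumed annotation path from the public FashionIQ release.
        ann_path = Path(root) / "captions" / f"cap.{category}.{split}.json"
        with open(ann_path) as f:
            annotations = json.load(f)
        for ann in annotations:
            triplets.append({
                "category": category,
                "reference": ann["candidate"],  # reference image id
                "target": ann["target"],        # target image id
                "captions": ann["captions"],    # list of relative captions
            })
    return triplets

# Example usage (paths are placeholders):
# train = load_fashioniq_split("/data/fashion-iq", "train")
# val = load_fashioniq_split("/data/fashion-iq", "val")
```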