R4R Dataset
The R4R dataset is larger than R2R and has more complicated navigation paths. -
R2R Dataset
The R2R dataset is built from real photographs of indoor environments. It has attracted broad attention for its simply stated task, which nonetheless requires complex... -
Unsupervised alignment of embeddings with Wasserstein Procrustes
This study introduces a new method for unsupervised alignment of embeddings with Wasserstein Procrustes. -
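As a rough illustration of the alignment this entry describes: Wasserstein Procrustes alternates an optimal-transport matching of points with a closed-form orthogonal Procrustes rotation. The sketch below shows only the rotation step; the function name and the toy data are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def procrustes_rotation(X, Y):
    """Closed-form orthogonal Procrustes: the orthogonal W minimizing
    ||X W - Y||_F, obtained from the SVD of X^T Y. In Wasserstein
    Procrustes this step alternates with an optimal-transport matching
    of the rows (not shown here)."""
    U, _, Vt = np.linalg.svd(X.T @ Y)
    return U @ Vt

# Toy sanity check: when Y is X under a known orthogonal map Q and the
# rows are already matched, the recovered rotation equals Q.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
Q, _ = np.linalg.qr(rng.normal(size=(5, 5)))  # random orthogonal matrix
Y = X @ Q
W = procrustes_rotation(X, Y)
print(np.allclose(W, Q))  # True
```

The unsupervised difficulty handled by the full method is that the row correspondence between X and Y is unknown; the Procrustes step above assumes it is given.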
Discovering Universal Geometry in Embeddings with ICA
This study utilizes Independent Component Analysis (ICA) to unveil a consistent semantic structure within embeddings of words or images. -
One-stage Visual Grounding
A fast and accurate one-stage approach to visual grounding -
InstanceRefer
Cooperative holistic understanding for visual grounding on point clouds through instance multi-level contextual referring -
Free-form description guided 3D visual graph network for object grounding in ...
Free-form description guided 3D visual graph network for 3D object grounding in point clouds -
CIFAR-10, FEMNIST, and IMDB
The datasets used in the paper are CIFAR-10, FEMNIST, and IMDB. The authors used these datasets to evaluate the performance of the EmbracingFL framework. -
Room-to-Room (R2R) dataset
The Room-to-Room (R2R) dataset is a benchmark for vision-and-language navigation tasks. It consists of 7,189 paths sampled from the environments' navigation graphs, each with three... -
PipeTransformer: Automated Elastic Pipelining for Distributed Training of Tra...
The datasets used in this paper are ImageNet, SQuAD, and GLUE. -
Data-driven Instruction Augmentation for Language-conditioned Control
Data-driven Instruction Augmentation for Language-conditioned Control (DIAL) is a method that uses pre-trained vision-language models (VLMs) to label offline datasets for... -
Vision-and-Language Navigation
The Vision-and-Language Navigation (VLN) task provides a natural-language instruction I = {w_0, ..., w_l}, where w_i is a word token and l is the length of the... -
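To make the notation concrete, a minimal sketch of the instruction-as-token-sequence representation I = {w_0, ..., w_l}; the whitespace tokenizer here is an illustrative assumption, not the task's actual tokenization.

```python
# An instruction is a sequence of word tokens w_0, ..., w_l.
# Whitespace tokenization is assumed purely for illustration.
def tokenize_instruction(sentence: str) -> list[str]:
    return sentence.lower().split()

I = tokenize_instruction("Walk past the sofa and stop at the door")
l = len(I)
print(I[:2], l)  # first tokens w_0, w_1 and the length l
```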
PhotoBot: Reference-Guided Interactive Photography via Natural Language
PhotoBot is a framework for fully automated photo acquisition based on an interplay between high-level human language guidance and a robot photographer. -
Training CLIP models on Data from Scientific Papers
Contrastive Language-Image Pretraining (CLIP) models are trained with datasets extracted from web crawls, which are of large quantity but limited quality. This paper explores... -
Validation Dataset
The Validation Dataset contains 1,428 images from nine distinct rooms. -
CIFAR-10, CIFAR-100, Stanford background dataset, VOC2012 dataset, Rotten Tom...
The datasets used in the paper are not explicitly enumerated. However, the authors used the CIFAR-10 and CIFAR-100 datasets for image classification, and the Stanford... -
Demystifying CLIP Data
Contrastive Language-Image Pre-training (CLIP) is an approach that has advanced research and applications in computer vision, fueling modern recognition systems and generative... -
Various Datasets
The datasets used in the paper are WikiMIA, BookMIA, Temporal Wiki, Temporal arXiv, ArXiv-1 month, Multi-Webdata, LAION-MI, and Gutenberg.