The AMI Meeting Corpus: A Multimodal Corpus for Meeting Transcription
The AMI Meeting Corpus is a multimodal corpus containing audio and video recordings of meetings.
EasyCom: An Augmented Reality Dataset for Easy Communication in Noisy Environments
The EasyCom dataset is a relatively new dataset recorded using Meta's augmented-reality (AR) glasses.
RWTH-PHOENIX-Weather
Continuous sign language recognition (SLR) deals with unaligned video-text pairs and uses word error rate (WER), i.e., edit distance, as its main evaluation metric.
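WER is the word-level Levenshtein (edit) distance between the hypothesis and the reference, normalized by reference length. A minimal sketch of the computation (function name and whitespace tokenization are illustrative choices, not part of any benchmark toolkit):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance divided by reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i  # deleting i reference words
    for j in range(len(hyp) + 1):
        dp[0][j] = j  # inserting j hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            substitution = dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            deletion = dp[i - 1][j] + 1
            insertion = dp[i][j - 1] + 1
            dp[i][j] = min(substitution, deletion, insertion)
    return dp[len(ref)][len(hyp)] / len(ref)
```

Note that WER can exceed 1.0 when the hypothesis contains many insertions relative to the reference.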
ShapeNeRF–Text
The ShapeNeRF–Text dataset consists of 40K paired NeRFs and language annotations for ShapeNet objects.
IMAGINE: An Imagination-Based Automatic Evaluation Metric for Natural Language Generation
Automatic evaluation for natural language generation (NLG) conventionally relies on token-level or embedding-level comparisons with text references. This is different from...
Multimodal Variational Autoencoder for Cardiac Hemodynamics Instability Detection
A multimodal variational autoencoder for low-cost detection of cardiac hemodynamic instability from chest X-rays (CXR) and electrocardiograms (ECG).
BioMedClip
BioMedClip is a CLIP model pretrained on image-text pairs extracted from the PubMed Central repository.
Training CLIP models on Data from Scientific Papers
Contrastive Language-Image Pretraining (CLIP) models are trained on datasets extracted from web crawls, which are large in quantity but of limited quality. This paper explores...
Conceptual Captions
The dataset used in the paper "Scaling Laws of Synthetic Images for Model Training", where it serves supervised image classification and zero-shot classification tasks.
SSv2-Small
A few-shot split of the Something-Something V2 video dataset. Few-shot classification is a research area that focuses on identifying new classes from a small number of samples.
CLIP-guided Prototype Modulating for Few-shot Action Recognition
Few-shot action recognition is a promising direction for alleviating the data-labeling problem; it aims to identify unseen classes from a few labeled videos.
Conceptual 12M
Conceptual 12M (CC12M) is a dataset of roughly 12 million image-text pairs for automatic image captioning.