Audio-Visual Dataset and Deep Learning Frameworks for Crowded Scene Classification
An audio-visual dataset covering five crowded-scene classes: 'Riot', 'Noise-Street', 'Firework-Event', 'Music-Event', and 'Sport-Atmosphere'. -
CORSMAL Containers Manipulation
The CORSMAL Containers Manipulation dataset comprises audio-visual recordings of people interacting with containers. -
LRS2 dataset
The LRS2 dataset consists of news recordings from the BBC, featuring varied lighting, backgrounds, and face poses, and speakers of different origins. -
GRID dataset
The GRID dataset was introduced by [5] as a corpus for tasks such as speech perception and speech recognition. GRID contains 33 unique speakers, each articulating 1000 word sequences. -
Estimating visual information from audio through manifold learning
A method for estimating visual information from audio signals via manifold learning. -
Visual to sound: Generating natural sound for videos in the wild
A method for generating natural-sounding audio for videos captured in the wild. -
Deep cross-modal audio-visual generation
A deep-learning framework for cross-modal generation between audio and visual data. -
Sound2Scene
Sound2Scene is a sound-to-image generative model and training procedure designed to bridge the large gap that often exists between sight and sound. -
Music-AVQA
The Music-AVQA dataset contains 9,288 videos and 45,867 question-and-answer pairs. -
Audio-Visual Question Answering
Audio-visual question answering (AVQA) requires reasoning jointly over video content and auditory information, correlating both with the question to predict the most accurate answer. -
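To make the task concrete, the following is a minimal sketch of the late-fusion pattern many AVQA baselines follow: encode audio, video, and the question separately, fuse the features, and score each candidate answer. All names, dimensions, and the random projection standing in for a learned layer are illustrative assumptions, not the pipeline of any specific model.

```python
import numpy as np

rng = np.random.default_rng(0)

def answer_question(audio_feat, video_feat, question_feat, answer_feats):
    """Toy AVQA scorer: fuse audio, video, and question features, then
    rank candidate answers by dot-product similarity.
    (Illustrative only; real models use learned encoders and attention.)"""
    # Fuse modalities by simple concatenation.
    fused = np.concatenate([audio_feat, video_feat, question_feat])
    # Fixed random projection into the answer-embedding space; in a real
    # model this would be a trained fusion layer.
    W = rng.standard_normal((answer_feats.shape[1], fused.shape[0]))
    joint = W @ fused                # joint audio-visual-question embedding
    scores = answer_feats @ joint    # one score per candidate answer
    return int(np.argmax(scores))

# Mock 8-dimensional features for one clip/question and 4 candidate answers.
audio = rng.standard_normal(8)
video = rng.standard_normal(8)
question = rng.standard_normal(8)
answers = rng.standard_normal((4, 8))

best = answer_question(audio, video, question, answers)
print(best)  # index of the highest-scoring candidate answer (0..3)
```

In practice the encoders are pretrained networks (e.g., audio and visual CNNs) and the fusion step is learned end to end, but the shape of the computation is the same.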
Visually Indicated Sounds
A dataset pairing videos of physical interactions with objects (e.g., a drumstick striking and scratching surfaces) with the sounds those interactions produce. -
Vggsound: A large-scale audio-visual dataset
A large-scale audio-visual dataset of short video clips spanning over 300 audio classes, in which the sound source is visually present in the video.