Audio-Visual Dataset and Deep Learning Frameworks for Crowded Scene Classification
An audio-visual dataset covering five crowded-scene classes: 'Riot', 'Noise-Street', 'Firework-Event', 'Music-Event', and 'Sport-Atmosphere'. -
CORSMAL Containers Manipulation
The CORSMAL Containers Manipulation dataset comprises audio-visual recordings of people interacting with containers. -
LRS2 dataset
The LRS2 dataset consists of news recordings from the BBC, featuring varied lighting, backgrounds, and face poses, and speakers of different origins. -
GRID dataset
The GRID dataset was introduced by [5] as a corpus for tasks such as speech perception and speech recognition. GRID contains 33 unique speakers, each articulating 1000 word sequences. -
Estimating visual information from audio through manifold learning
A method for estimating visual information from audio signals via manifold learning. -
Visual to sound: Generating natural sound for videos in the wild
A method for generating natural-sounding audio for videos captured in the wild. -
Deep cross-modal audio-visual generation
A deep-learning framework for cross-modal generation between audio and visual data. -
Sound2Scene
Sound2Scene is a sound-to-image generative model and training procedure designed to bridge the large gap that often exists between sight and sound. -
Music-AVQA
The Music-AVQA dataset contains 9,288 videos and 45,867 question-and-answer pairs. -
Audio-Visual Question Answering
Audio-visual question answering (AVQA) requires reasoning jointly over video content and auditory information, correlating both with the question to predict the most accurate answer. -
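To make the task concrete, the following is a minimal sketch of the late-fusion pattern many AVQA baselines follow: encode audio, video, and the question separately, fuse the features, and score each candidate answer. All names, dimensions, and the random projection standing in for a learned layer are illustrative assumptions, not the pipeline of any specific model.

```python
import numpy as np

rng = np.random.default_rng(0)

def answer_question(audio_feat, video_feat, question_feat, answer_feats):
    """Toy AVQA scorer: fuse audio, video, and question features, then
    rank candidate answers by dot-product similarity.
    (Illustrative only; real models use learned encoders and attention.)"""
    # Fuse modalities by simple concatenation.
    fused = np.concatenate([audio_feat, video_feat, question_feat])
    # Fixed random projection into the answer-embedding space; in a real
    # model this would be a trained fusion layer.
    W = rng.standard_normal((answer_feats.shape[1], fused.shape[0]))
    joint = W @ fused                # joint audio-visual-question embedding
    scores = answer_feats @ joint    # one score per candidate answer
    return int(np.argmax(scores))

# Mock 8-dimensional features for one clip/question and 4 candidate answers.
audio = rng.standard_normal(8)
video = rng.standard_normal(8)
question = rng.standard_normal(8)
answers = rng.standard_normal((4, 8))

best = answer_question(audio, video, question, answers)
print(best)  # index of the highest-scoring candidate answer (0..3)
```

In practice the encoders are pretrained networks (e.g., audio and visual CNNs) and the fusion step is learned end to end, but the shape of the computation is the same.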
Visually Indicated Sounds
A dataset pairing videos of physical interactions with objects (e.g., a drumstick striking and scratching surfaces) with the sounds those interactions produce. -
Vggsound: A large-scale audio-visual dataset
A large-scale audio-visual dataset of short video clips spanning over 300 audio classes, in which the sound source is visually present in the video.