Deep Visual Forced Alignment: Learning to Align Transcription with Talking Face Video
The proposed Deep Visual Forced Alignment (DVFA) time-aligns an input transcription with the corresponding talking face video without using any speech audio.
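To make the alignment task concrete, the sketch below shows the kind of word-level output a forced aligner produces: each word of the transcription is mapped to a span of video frames. This is a hypothetical illustration of the output format only, not the DVFA model; the frame rate, word boundaries, and helper names are assumed values.

```python
# Hypothetical illustration (not the DVFA implementation): forced alignment
# assigns each word of a transcription a start/end position in the video.
from dataclasses import dataclass


@dataclass
class WordAlignment:
    word: str
    start_frame: int  # first video frame of the word
    end_frame: int    # last video frame of the word (inclusive)


def frame_to_seconds(frame: int, fps: float = 25.0) -> float:
    """Convert a frame index to seconds, assuming a fixed frame rate."""
    return frame / fps


# Example output an aligner might produce for the transcription "place blue now"
# (timings are made up for illustration).
alignment = [
    WordAlignment("place", start_frame=0, end_frame=11),
    WordAlignment("blue", start_frame=12, end_frame=20),
    WordAlignment("now", start_frame=21, end_frame=33),
]

for a in alignment:
    print(f"{a.word}: {frame_to_seconds(a.start_frame):.2f}s - "
          f"{frame_to_seconds(a.end_frame + 1):.2f}s")
```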
The AMI Meeting Corpus: A Multimodal Corpus for Meeting Transcription
The AMI Meeting Corpus is a multimodal dataset containing roughly 100 hours of audio and video recordings of meetings.
EasyCom: An Augmented Reality Dataset for Easy Communication in Noisy Environments
The EasyCom dataset is a relatively new corpus recorded with Meta's augmented-reality (AR) glasses.
LRS2 dataset
The LRS2 dataset consists of BBC news recordings with varied lighting conditions, backgrounds, and face poses, and features speakers of diverse origins.
GRID dataset
The GRID dataset was introduced by [5] as a corpus for tasks such as speech perception and speech recognition. GRID contains 33 unique speakers, each articulating 1000 word sequences.
Lip Reading in the Wild
The Lip Reading in the Wild (LRW) dataset is a large-scale audio-visual dataset of short, word-level clips collected from BBC television broadcasts, used for lip reading and audio-visual speech recognition.