Deep Visual Forced Alignment: Learning to Align Transcription with Talking Face Video
The proposed Deep Visual Forced Alignment (DVFA) time-aligns an input transcription with the corresponding talking face video without using any speech audio.
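To make the alignment task concrete, the sketch below shows the kind of word-level output a forced aligner produces: each word of the transcription is mapped to a span of video frames. This is a hypothetical illustration of the output format only, not the DVFA model; the frame rate, word boundaries, and helper names are assumed values.

```python
# Hypothetical illustration (not the DVFA implementation): forced alignment
# assigns each word of a transcription a start/end position in the video.
from dataclasses import dataclass


@dataclass
class WordAlignment:
    word: str
    start_frame: int  # first video frame of the word
    end_frame: int    # last video frame of the word (inclusive)


def frame_to_seconds(frame: int, fps: float = 25.0) -> float:
    """Convert a frame index to seconds, assuming a fixed frame rate."""
    return frame / fps


# Example output an aligner might produce for the transcription "place blue now"
# (timings are made up for illustration).
alignment = [
    WordAlignment("place", start_frame=0, end_frame=11),
    WordAlignment("blue", start_frame=12, end_frame=20),
    WordAlignment("now", start_frame=21, end_frame=33),
]

for a in alignment:
    print(f"{a.word}: {frame_to_seconds(a.start_frame):.2f}s - "
          f"{frame_to_seconds(a.end_frame + 1):.2f}s")
```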
The AMI Meeting Corpus: A Multimodal Corpus for Meeting Transcription
The AMI Meeting Corpus is a multimodal dataset containing roughly 100 hours of audio and video recordings of meetings.
EasyCom: An Augmented Reality Dataset for Easy Communication in Noisy Environments
The EasyCom dataset is a relatively new corpus recorded with Meta's augmented-reality (AR) glasses.
LRS2 dataset
The LRS2 dataset consists of BBC news recordings with varied lighting conditions, backgrounds, and face poses, and features speakers of diverse origins.
GRID dataset
The GRID dataset was introduced by [5] as a corpus for tasks such as speech perception and speech recognition. GRID contains 33 unique speakers, each articulating 1000 word sequences.
Lip Reading in the Wild
The Lip Reading in the Wild (LRW) dataset is a large-scale audio-visual dataset of short, word-level clips collected from BBC television broadcasts, used for lip reading and audio-visual speech recognition.