VisSpeech

VisSpeech is a dataset for the audio-visual speech recognition task. It consists of instructional videos with semantically related visual content.

Data and Resources

Original Metadata (JSON)
The JSON representation of the dataset and its distributions, based on DCAT.

Puyuan Peng, Brian Yan, Shinji Watanabe, David Harwath (2025). Dataset: VisSpeech. https://doi.org/10.57702/ct0blch5

DOI retrieved: January 3, 2025

Field	Value
Created	January 3, 2025
Last update	January 3, 2025
Defined In	https://doi.org/10.48550/arXiv.2305.11095
Author	Puyuan Peng
More Authors	Brian Yan, Shinji Watanabe, David Harwath