VisSpeech

VisSpeech is a dataset for the audio-visual speech recognition task. It consists of instructional videos whose visual content is semantically related to the speech.

Cite this as

Puyuan Peng, Brian Yan, Shinji Watanabe, David Harwath (2025). Dataset: VisSpeech. https://doi.org/10.57702/ct0blch5

DOI retrieved: January 3, 2025

Additional Info

Field          Value
Created        January 3, 2025
Last update    January 3, 2025
Defined In     https://doi.org/10.48550/arXiv.2305.11095
Author         Puyuan Peng
More Authors   Brian Yan, Shinji Watanabe, David Harwath