Deep Visual Forced Alignment: Learning to Align Transcription with Talking Face Video

The proposed Deep Visual Forced Alignment (DVFA) method time-aligns an input transcription with an input talking face video without using speech audio.

Cite this as

Minsu Kim, Chae Won Kim, Yong Man Ro (2024). Dataset: Deep Visual Forced Alignment: Learning to Align Transcription with Talking Face Video. https://doi.org/10.57702/pyhelu8l

DOI retrieved: December 16, 2024

Additional Info

Created: December 16, 2024
Last update: December 16, 2024
Defined In: https://doi.org/10.48550/arXiv.2303.08670
Author: Minsu Kim
More Authors: Chae Won Kim, Yong Man Ro