Deep Visual Forced Alignment: Learning to Align Transcription with Talking Face Video

The proposed Deep Visual Forced Alignment (DVFA) method time-aligns the input transcription with the input talking face video without using speech audio.
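
To make the notion of time alignment concrete, the sketch below turns a per-frame labelling into word-level time spans. It is only an illustration of what an alignment result can look like, not DVFA's actual model or output format: the function name `frames_to_segments`, the per-frame label layout (an index into the transcription, or `None` for frames with no word), and the 25 fps frame rate are all assumptions made for this example.

```python
from itertools import groupby


def frames_to_segments(frame_labels, words, fps=25.0):
    """Collapse per-frame word indices into (word, start_sec, end_sec) spans.

    frame_labels: one entry per video frame; an int index into `words`,
                  or None when no word is being spoken (hypothetical format).
    words:        the transcription as a list of words.
    fps:          video frame rate (assumed to be 25 here).
    """
    segments = []
    frame = 0  # index of the first frame of the current run
    for label, run in groupby(frame_labels):
        length = len(list(run))
        if label is not None:
            start = frame / fps
            end = (frame + length) / fps
            segments.append((words[label], start, end))
        frame += length
    return segments


# Toy input: 15 frames, two words, silence before, between, and after.
words = ["hello", "world"]
frame_labels = [None, None, 0, 0, 0, 0, 0, None, 1, 1, 1, 1, 1, 1, None]

segments = frames_to_segments(frame_labels, words, fps=25.0)
for word, start, end in segments:
    print(f"{word}: {start:.2f}s - {end:.2f}s")
# hello: 0.08s - 0.28s
# world: 0.32s - 0.56s
```

In this toy layout each word's span is simply the run of consecutive frames that carry its index, so the start and end times follow from the frame positions and the frame rate.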

BibTeX: