The LRS2 dataset consists of 48,164 video clips from outdoor shows on BBC television. Each video is accompanied by an audio track corresponding to a sentence of up to 100 characters.
Generative models have attracted growing attention in recent years for their remarkable success in tasks that require estimating and sampling from a data distribution to generate...