The LRS2 dataset consists of 48,164 video clips drawn from BBC television shows. Each clip is accompanied by an audio track corresponding to a sentence of up to 100 characters.
The WHAM! dataset is used to evaluate the proposed Bayesian factorised speaker-environment adaptive training and test-time adaptation approach for Conformer models.