MASA: Motion-aware Masked Autoencoder with Semantic Alignment for Sign Language Recognition

Sign language recognition (SLR) has long been plagued by insufficient model representation capabilities. Although current pre-training approaches have alleviated this dilemma to some extent and yielded promising performance by employing various pretext tasks on sign pose data, these methods still suffer from two primary limitations: i) Explicit motion information is usually disregarded in previous pretext tasks, leading to partial information loss and limited representation capability. ii) Previous methods focus on the local context of a sign pose sequence, without incorporating the guidance of the global meaning of lexical signs.
