Matching Latent Encoding for Audio-Text based Keyword Spotting

Figure: The proposed end-to-end model architecture for flexible keyword spotting, consisting of encoder, projector, and audio-text aligner modules.
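
As a rough illustration of how the three modules named above could fit together, the sketch below treats the encoder outputs as given, projects both modalities into a shared latent space, and aligns them with a cosine-similarity matrix. This is not the authors' implementation: the feature dimensions, the random linear projectors, the similarity-based aligner, and every function name (`linear`, `align`, `match_score`, `keyword_score`) are assumptions chosen only to make the data flow concrete.

```python
import math
import random


def linear(x, w):
    """Project a sequence of vectors x (T x d_in) with a weight matrix w (d_in x d_out)."""
    return [
        [sum(v * w[i][j] for i, v in enumerate(row)) for j in range(len(w[0]))]
        for row in x
    ]


def l2_normalize(x):
    """Scale every vector to unit length so dot products become cosine similarities."""
    out = []
    for row in x:
        norm = math.sqrt(sum(v * v for v in row)) or 1.0
        out.append([v / norm for v in row])
    return out


def random_matrix(rows, cols, rng):
    """Hypothetical stand-in for a learned projector weight matrix."""
    return [[rng.gauss(0.0, 1.0 / math.sqrt(rows)) for _ in range(cols)] for _ in range(rows)]


def align(audio_latent, text_latent):
    """Assumed aligner: similarity of every text token with every audio frame."""
    return [[sum(a * b for a, b in zip(t, f)) for f in audio_latent] for t in text_latent]


def match_score(affinity):
    """Assumed pooling: mean over text tokens of the best-matching frame similarity."""
    return sum(max(row) for row in affinity) / len(affinity)


def keyword_score(audio_feats, text_feats, w_audio, w_text):
    """Projector followed by aligner; inputs stand in for the two encoder outputs."""
    audio_latent = l2_normalize(linear(audio_feats, w_audio))
    text_latent = l2_normalize(linear(text_feats, w_text))
    affinity = align(audio_latent, text_latent)
    return affinity, match_score(affinity)


rng = random.Random(0)
# Placeholder encoder outputs: 20 audio frames (8-dim) and 5 text tokens (6-dim).
audio_feats = [[rng.gauss(0.0, 1.0) for _ in range(8)] for _ in range(20)]
text_feats = [[rng.gauss(0.0, 1.0) for _ in range(6)] for _ in range(5)]
w_audio = random_matrix(8, 4, rng)  # audio projector into a 4-dim shared space
w_text = random_matrix(6, 4, rng)   # text projector into the same 4-dim space

affinity, score = keyword_score(audio_feats, text_feats, w_audio, w_text)
print(len(affinity), len(affinity[0]))  # text tokens x audio frames
```

In a trained system the projector weights would be learned and the aligner would likely be more elaborate than a max-pooled similarity; the point here is only the shape of the computation, from two encoder outputs to a single match score.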

BibTeX: