Rethinking Multi-Modal Alignment in Multi-Choice VideoQA from Feature and Sample Perspectives

Reasoning about causal and temporal event relations in videos is an emerging direction in Video Question Answering (VideoQA). The major stumbling block to this goal is the semantic gap between language and video, since the two modalities are at different levels of abstraction.
