Rethinking Multi-Modal Alignment in Multi-Choice VideoQA from Feature and Sample Perspectives

Reasoning about causal and temporal event relations in videos is an emerging goal of Video Question Answering (VideoQA). The major obstacle to achieving this goal is the semantic gap between language and video, since the two modalities sit at different levels of abstraction.

Data and Resources

Cite this as

Shaoning Xiao, Long Chen, Kaifeng Gao, Zhao Wang, Yi Yang, Zhimeng Zhang, Jun Xiao (2024). Dataset: Rethinking Multi-Modal Alignment in Multi-Choice VideoQA from Feature and Sample Perspectives. https://doi.org/10.57702/f4746e2g

DOI retrieved: December 16, 2024

Additional Info

Created: December 16, 2024
Last update: December 16, 2024
Defined In: https://doi.org/10.48550/arXiv.2204.11544
Author: Shaoning Xiao
More Authors: Long Chen, Kaifeng Gao, Zhao Wang, Yi Yang, Zhimeng Zhang, Jun Xiao