Rethinking Multi-Modal Alignment in Multi-Choice VideoQA from Feature and Sample Perspectives

Reasoning about causal and temporal event relations in videos is an emerging goal of Video Question Answering (VideoQA). The major obstacle to achieving this goal is the semantic gap between language and video, since the two modalities sit at different levels of abstraction.

Data and Resources

Cite this as

Shaoning Xiao, Long Chen, Kaifeng Gao, Zhao Wang, Yi Yang, Zhimeng Zhang, Jun Xiao (2024). Dataset: Rethinking Multi-Modal Alignment in Multi-Choice VideoQA from Feature and Sample Perspectives. https://doi.org/10.57702/f4746e2g

DOI retrieved: December 16, 2024

Additional Info

Created: December 16, 2024
Last update: December 16, 2024
Defined In: https://doi.org/10.48550/arXiv.2204.11544
Author: Shaoning Xiao
More Authors: Long Chen, Kaifeng Gao, Zhao Wang, Yi Yang, Zhimeng Zhang, Jun Xiao