-
Dense-Caption Matching and Frame-Selection Gating for Temporal Localization i...
This paper proposes a video question answering model that effectively integrates multi-modal input sources and finds the temporally relevant information to answer questions.