Dense-Caption Matching and Frame-Selection Gating for Temporal Localization i...
This paper proposes a video question answering model that effectively integrates multi-modal input sources and finds the temporally relevant information to answer questions.
Slot-VLM: SlowFast Slots for Video-Language Modeling
Video-Language Models (VLMs), powered by the advancements in Large Language Models (LLMs), are charting new frontiers in video understanding. A pivotal challenge is the...
Youtube2Text-QA
A video question answering task that requires machines to answer questions about videos in natural language.
Zero-shot video question answering via frozen bidirectional language models