TOPA: Extend Large Language Models for Video Understanding via Text-Only Pre-Alignment

TOPA is a text-only pre-alignment framework that extends large language models to video understanding without pre-training on real video data.

BibTeX: