Video-LLaMA: An instruction-tuned audio-visual language model for video understanding

A Video-LLaMA model for video understanding, comprising 100k videos with detailed captions.

BibTeX: