Valley: A Video Assistant with Large Language Model Enhanced Ability

A large multi-modal instruction-following dataset for video understanding, comprising 37k conversation pairs, 26k complex reasoning QA pairs, and 10k detailed description instruction pairs.

Cite this as

Ruipu Luo, Ziwang Zhao, Min Yang, Junwei Dong, Da Li, Pengcheng Lu, Tao Wang, Linmei Hu, Minghui Qiu, Zhongyu Wei (2024). Dataset: Valley: A Video Assistant with Large Language Model Enhanced Ability. https://doi.org/10.57702/5o0eqfc5

DOI retrieved: December 3, 2024

Additional Info

Field          Value
Created        December 3, 2024
Last update    December 3, 2024
Defined In     https://doi.org/10.48550/arXiv.2306.07207
Author         Ruipu Luo
More Authors   Ziwang Zhao, Min Yang, Junwei Dong, Da Li, Pengcheng Lu, Tao Wang, Linmei Hu, Minghui Qiu, Zhongyu Wei
Homepage       https://arxiv.org/abs/2305.06500