LLaVA-Video: Video Instruction Tuning With Synthetic Data

The LLaVA-Video models are a series of Video-LLMs trained entirely on our high-quality synthetic dataset, LLaVA-Video-178K, which comprises 178K video captions and 1.15M video QA pairs. The models perform strongly across more than 10 video understanding tasks.