Pretrained Image-Text Models are Secretly Video Captioners

Zhang, Chunhui; Jian, Yiren; Ouyang, Zhongyu; Vosoughi, Soroush

arXiv:2502.13363 (cs)

[Submitted on 19 Feb 2025]

Abstract: Developing video captioning models is computationally expensive. The dynamic nature of video also complicates the design of multimodal models that can effectively caption these sequences. However, we find that by using minimal computational resources and without complex modifications to address video dynamics, an image-based model can be repurposed to outperform several specialised video captioning systems. Our adapted model demonstrates top-tier performance on major benchmarks, ranking 2nd on MSRVTT and MSVD, and 3rd on VATEX. We transform a typical image captioning model, BLIP2, into a competitive video captioner by post-training it with only 6,000 video-text pairs and simply concatenating frames. This is significantly less data than other methods, which use 2.5 to 144 million pairs. From a resource optimization perspective, this video captioning study focuses on three fundamental factors: optimizing model scale, maximizing data efficiency, and incorporating reinforcement learning. This extensive study demonstrates that a lightweight, image-based adaptation strategy can rival state-of-the-art video captioning systems, offering a practical solution for low-resource scenarios.
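
The abstract describes the adaptation only as "simply concatenating frames". As a rough illustration of what that could look like, the sketch below samples a few frames evenly from a video and joins their per-frame token sequences into one longer sequence that a language model could consume. Everything here is an assumption made for illustration: the function names, the uniform sampling scheme, the token-level (rather than pixel-level) concatenation, and the toy encoder standing in for a real image encoder are hypothetical and are not taken from the paper.

```python
from typing import List, Sequence


def sample_frame_indices(num_frames: int, num_samples: int) -> List[int]:
    """Pick up to num_samples frame indices spread evenly across the video.

    Assumption: uniform sampling at segment centres; the abstract does not
    say how frames are chosen.
    """
    if num_frames <= 0 or num_samples <= 0:
        raise ValueError("num_frames and num_samples must be positive")
    if num_samples >= num_frames:
        return list(range(num_frames))
    step = num_frames / num_samples
    # Take the centre of each of the num_samples equal-length segments.
    return [int(step * i + step / 2) for i in range(num_samples)]


def concat_frame_tokens(
    frame_tokens: Sequence[Sequence[Sequence[float]]],
) -> List[List[float]]:
    """Join per-frame token sequences end to end along the sequence axis."""
    video_tokens: List[List[float]] = []
    for tokens in frame_tokens:
        video_tokens.extend(list(token) for token in tokens)
    return video_tokens


def toy_encoder(frame_index: int, tokens_per_frame: int = 2, dim: int = 3) -> List[List[float]]:
    """Hypothetical stand-in for an image encoder: emits constant tokens."""
    return [[float(frame_index)] * dim for _ in range(tokens_per_frame)]


indices = sample_frame_indices(num_frames=80, num_samples=8)
video_tokens = concat_frame_tokens([toy_encoder(i) for i in indices])
print(indices)
print(len(video_tokens))
```

In this toy setup, the concatenated sequence grows linearly with the number of sampled frames, which is one reason a frame budget would matter under limited compute.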

Comments:	Accepted to the 2025 Annual Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics (NAACL 2025). The first two authors contributed equally and were listed in random order
Subjects:	Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Cite as:	arXiv:2502.13363 [cs.CV]
	(or arXiv:2502.13363v1 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2502.13363

Submission history

From: Chunhui Zhang
[v1] Wed, 19 Feb 2025 01:53:03 UTC (2,484 KB)
