Any2Caption: Interpreting Any Condition to Caption for Controllable Video Generation

Wu, Shengqiong; Ye, Weicai; Wang, Jiahao; Liu, Quande; Wang, Xintao; Wan, Pengfei; Zhang, Di; Gai, Kun; Yan, Shuicheng; Fei, Hao; Chua, Tat-Seng

arXiv:2503.24379 (cs)

[Submitted on 31 Mar 2025]

Abstract: To address the bottleneck of accurate user intent interpretation within the current video generation community, we present Any2Caption, a novel framework for controllable video generation under any condition. The key idea is to decouple the various condition interpretation steps from the video synthesis step. By leveraging modern multimodal large language models (MLLMs), Any2Caption interprets diverse inputs--text, images, videos, and specialized cues such as region, motion, and camera poses--into dense, structured captions that give backbone video generators better guidance. We also introduce Any2CapIns, a large-scale dataset with 337K instances and 407K conditions for any-condition-to-caption instruction tuning. Comprehensive evaluations demonstrate that our system significantly improves controllability and video quality across various aspects of existing video generation models. Project Page: this https URL
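
The decoupling described in the abstract can be pictured as a two-stage pipeline: one component turns the user's conditions into a structured caption, and a separate, unchanged video generator consumes only that caption. The sketch below is illustrative only and is not the authors' implementation. The names `Conditions`, `interpret_conditions`, and `generate_video`, as well as the bracketed caption layout, are hypothetical, and a simple template stands in for the MLLM.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Conditions:
    """Hypothetical container for user inputs; field names are illustrative."""
    text: str
    # Non-text cues already summarized as strings, e.g. {"camera": "pan left"}.
    extras: Dict[str, str] = field(default_factory=dict)


def interpret_conditions(conds: Conditions) -> str:
    """Stage 1 stand-in. The paper uses an MLLM here; this template only
    mimics the idea of emitting one structured caption from all conditions."""
    parts: List[str] = [f"[dense caption] {conds.text}"]
    for name in sorted(conds.extras):  # sorted for a deterministic layout
        parts.append(f"[{name}] {conds.extras[name]}")
    return "\n".join(parts)


def generate_video(
    conds: Conditions,
    captioner: Callable[[Conditions], str],
    generator: Callable[[str], str],
) -> str:
    """Stage 2 wrapper: the generator sees only the caption, never the raw
    conditions, so either stage can be swapped without touching the other."""
    caption = captioner(conds)
    return generator(caption)
```

Because the generator's only input is text, any existing text-to-video backbone could in principle be passed as `generator` without retraining; that is the design point the sketch is meant to show, under the assumptions stated above.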

Subjects:	Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Cite as:	arXiv:2503.24379 [cs.CV]
	(or arXiv:2503.24379v1 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2503.24379

Submission history

From: Shengqiong Wu
[v1] Mon, 31 Mar 2025 17:59:01 UTC (17,074 KB)
