LLaVA-3D: A Simple yet Effective Pathway to Empowering LMMs with 3D-awareness

Zhu, Chenming; Wang, Tai; Zhang, Wenwei; Pang, Jiangmiao; Liu, Xihui

arXiv:2409.18125 (cs)

[Submitted on 26 Sep 2024 (v1), last revised 27 Apr 2025 (this version, v3)]

Abstract: Recent advancements in Large Multimodal Models (LMMs) have greatly enhanced their proficiency in 2D visual understanding tasks, enabling them to effectively process and understand images and videos. However, the development of LMMs with 3D scene understanding capabilities has been hindered by the lack of large-scale 3D vision-language datasets and powerful 3D encoders. In this paper, we introduce a simple yet effective framework called LLaVA-3D. Leveraging the strong 2D visual understanding priors from LLaVA, LLaVA-3D efficiently adapts LLaVA for 3D scene understanding without compromising its 2D understanding capabilities. To achieve this, we use 3D position embeddings to enhance the 2D CLIP patches with 3D spatial context information and construct 3D patches. By integrating the 3D position embeddings into 2D LMMs and employing joint 2D and 3D vision-language instruction tuning, we establish a unified architecture for both 2D visual understanding and 3D scene understanding. In contrast to previous 3D LMMs, LLaVA-3D supports decoding accurate 3D spatial perception outputs, e.g., 3D bounding boxes, directly from these 3D patches, without relying on time-consuming off-the-shelf 3D segmentors. Experimental results show that LLaVA-3D converges 3.5x faster than existing 3D LMMs when trained on 3D vision-language datasets. Moreover, LLaVA-3D not only achieves state-of-the-art performance across various 3D tasks but also maintains 2D visual understanding and vision-language conversation capabilities comparable to LLaVA.
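
The abstract does not say how the 3D position embeddings are computed, so the following is only a rough sketch of the general idea of attaching 3D positions to 2D patch features. It assumes a per-view depth map, camera intrinsics `K`, and a camera-to-world pose are available, and it uses a fixed sinusoidal encoding as a stand-in for whatever embedding the paper actually uses. All function names, the patch size of 14, and the array shapes are hypothetical and not taken from the LLaVA-3D code.

```python
import numpy as np

def backproject_patch_centers(depth, K, cam_to_world, patch=14):
    """Lift the centre pixel of each image patch to a 3D world-space point.

    depth: (H, W) metric depth map; K: (3, 3) pinhole intrinsics;
    cam_to_world: (4, 4) pose. Returns (num_patches, 3) in row-major patch order.
    """
    H, W = depth.shape
    ys = np.arange(patch // 2, H, patch)
    xs = np.arange(patch // 2, W, patch)
    u, v = np.meshgrid(xs, ys)              # pixel columns / rows of patch centres
    z = depth[v, u]
    x = (u - K[0, 2]) * z / K[0, 0]         # pinhole back-projection
    y = (v - K[1, 2]) * z / K[1, 1]
    pts_cam = np.stack([x, y, z, np.ones_like(z)], axis=-1).reshape(-1, 4)
    pts_world = pts_cam @ cam_to_world.T    # homogeneous camera -> world transform
    return pts_world[:, :3]

def sinusoidal_3d_embedding(xyz, dim):
    """Fixed sinusoidal encoding of 3D points (assumption: a learned
    embedding could be used instead). dim must be divisible by 6."""
    assert dim % 6 == 0
    freqs = 2.0 ** np.arange(dim // 6)      # geometric frequency ladder
    angles = xyz[:, :, None] * freqs        # (N, 3, dim // 6)
    emb = np.concatenate([np.sin(angles), np.cos(angles)], axis=-1)
    return emb.reshape(len(xyz), -1)        # (N, dim)

def make_3d_patches(patch_feats, depth, K, cam_to_world, patch=14):
    """Add a 3D position embedding to each 2D patch feature vector."""
    xyz = backproject_patch_centers(depth, K, cam_to_world, patch)
    return patch_feats + sinusoidal_3d_embedding(xyz, patch_feats.shape[1])
```

In this sketch the 2D patch features keep their shape, so the resulting "3D patches" could be fed to the same downstream layers as ordinary 2D patch tokens; whether LLaVA-3D does exactly this is not stated in the abstract.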

Comments:	Project page: this https URL
Subjects:	Computer Vision and Pattern Recognition (cs.CV)
Cite as:	arXiv:2409.18125 [cs.CV]
	(or arXiv:2409.18125v3 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2409.18125

Submission history

From: Chenming Zhu
[v1] Thu, 26 Sep 2024 17:59:11 UTC (37,097 KB)
[v2] Sat, 1 Feb 2025 12:01:50 UTC (9,103 KB)
[v3] Sun, 27 Apr 2025 06:50:23 UTC (40,869 KB)
