VHELM: A Holistic Evaluation of Vision Language Models

Lee, Tony; Tu, Haoqin; Wong, Chi Heem; Zheng, Wenhao; Zhou, Yiyang; Mai, Yifan; Roberts, Josselin Somerville; Yasunaga, Michihiro; Yao, Huaxiu; Xie, Cihang; Liang, Percy

arXiv:2410.07112 (cs)

[Submitted on 9 Oct 2024 (v1), last revised 24 Oct 2024 (this version, v2)]

Abstract: Current benchmarks for assessing vision-language models (VLMs) often focus on their perception or problem-solving capabilities and neglect other critical aspects such as fairness, multilinguality, or toxicity. Furthermore, they differ in their evaluation procedures and the scope of the evaluation, making it difficult to compare models. To address these issues, we extend the HELM framework to VLMs to present the Holistic Evaluation of Vision Language Models (VHELM). VHELM aggregates various datasets to cover one or more of the 9 aspects: visual perception, knowledge, reasoning, bias, fairness, multilinguality, robustness, toxicity, and safety. In doing so, we produce a comprehensive, multi-dimensional view of the capabilities of the VLMs across these important factors. In addition, we standardize the inference parameters, methods of prompting, and evaluation metrics to enable fair comparisons across models. Our framework is designed to be lightweight and automatic so that evaluation runs are cheap and fast. Our initial run evaluates 22 VLMs on 21 existing datasets to provide a holistic snapshot of the models. We uncover new key findings, such as the fact that efficiency-focused models (e.g., Claude 3 Haiku or Gemini 1.5 Flash) perform significantly worse than their full models (e.g., Claude 3 Opus or Gemini 1.5 Pro) on the bias benchmark but not when evaluated on the other aspects. For transparency, we release the raw model generations and complete results on our website (this https URL). VHELM is intended to be a living benchmark, and we hope to continue adding new datasets and models over time.

Comments:	NeurIPS 2024. First three authors contributed equally
Subjects:	Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Cite as:	arXiv:2410.07112 [cs.CV]
	(or arXiv:2410.07112v2 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2410.07112

Submission history

From: Chi Heem Wong
[v1] Wed, 9 Oct 2024 17:46:34 UTC (9,764 KB)
[v2] Thu, 24 Oct 2024 05:17:36 UTC (7,473 KB)
