@inproceedings{guo-etal-2025-iw,
title = "{IW}-Bench: Evaluating Large Multimodal Models for Converting Image-to-Web",
author = "Guo, Hongcheng and
Zhang, Wei and
Chen, Junhao and
Gu, Yaonan and
Yang, Jian and
Du, Junjia and
Cao, Shaosheng and
Hui, Binyuan and
Liu, Tianyu and
Ma, Jianxin and
Zhou, Chang and
Li, Zhoujun",
editor = "Che, Wanxiang and
Nabende, Joyce and
Shutova, Ekaterina and
Pilehvar, Mohammad Taher",
booktitle = "Findings of the Association for Computational Linguistics: ACL 2025",
month = jul,
year = "2025",
address = "Vienna, Austria",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2025.findings-acl.334/",
doi = "10.18653/v1/2025.findings-acl.334",
pages = "6449--6466",
ISBN = "979-8-89176-256-5",
abstract = "Recently, advancements in large multimodal models have led to significant strides in image comprehension capabilities. Despite these advancements, there is a lack of a robust benchmark specifically for assessing the image{-}to{-}web conversion proficiency of these large models. It is essential to ensure the integrity of the web elements generated, which comprise both visible and invisible categories. Previous evaluation methods (e.g., BLEU) are notably susceptible to significant alterations due to the presence of invisible elements. Furthermore, it is crucial to measure the layout information of web pages{---}i.e., the positional relationships between elements{---}which has been overlooked by prior work. To address these challenges, we have curated and aligned a benchmark of images and corresponding web codes (IW-bench). Specifically, we propose Element Accuracy, which tests the completeness of elements by parsing the Document Object Model (DOM) tree. We also introduce Layout Accuracy to analyze positional relationships by converting the DOM tree into a common subsequence. In addition, we design a five{-}hop multimodal Chain{-}of{-}Thought prompting strategy for improved performance, consisting of: 1) SoM prompt injection, 2) inferring elements, 3) inferring layout, 4) inferring web code, and 5) reflection. Our benchmark comprises 1,200 image{--}code pairs with varying levels of difficulty. We have conducted extensive experiments on existing large multimodal models, providing insights into their performance and identifying areas for improvement in the image{-}to{-}web domain."
}<?xml version="1.0" encoding="UTF-8"?>
<modsCollection xmlns="http://www.loc.gov/mods/v3">
<mods ID="guo-etal-2025-iw">
<titleInfo>
<title>IW-Bench: Evaluating Large Multimodal Models for Converting Image-to-Web</title>
</titleInfo>
<name type="personal">
<namePart type="given">Hongcheng</namePart>
<namePart type="family">Guo</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Wei</namePart>
<namePart type="family">Zhang</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Junhao</namePart>
<namePart type="family">Chen</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Yaonan</namePart>
<namePart type="family">Gu</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Jian</namePart>
<namePart type="family">Yang</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Junjia</namePart>
<namePart type="family">Du</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Shaosheng</namePart>
<namePart type="family">Cao</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Binyuan</namePart>
<namePart type="family">Hui</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Tianyu</namePart>
<namePart type="family">Liu</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Jianxin</namePart>
<namePart type="family">Ma</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Chang</namePart>
<namePart type="family">Zhou</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Zhoujun</namePart>
<namePart type="family">Li</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<originInfo>
<dateIssued>2025-07</dateIssued>
</originInfo>
<typeOfResource>text</typeOfResource>
<relatedItem type="host">
<titleInfo>
<title>Findings of the Association for Computational Linguistics: ACL 2025</title>
</titleInfo>
<name type="personal">
<namePart type="given">Wanxiang</namePart>
<namePart type="family">Che</namePart>
<role>
<roleTerm authority="marcrelator" type="text">editor</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Joyce</namePart>
<namePart type="family">Nabende</namePart>
<role>
<roleTerm authority="marcrelator" type="text">editor</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Ekaterina</namePart>
<namePart type="family">Shutova</namePart>
<role>
<roleTerm authority="marcrelator" type="text">editor</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Mohammad</namePart>
<namePart type="given">Taher</namePart>
<namePart type="family">Pilehvar</namePart>
<role>
<roleTerm authority="marcrelator" type="text">editor</roleTerm>
</role>
</name>
<originInfo>
<publisher>Association for Computational Linguistics</publisher>
<place>
<placeTerm type="text">Vienna, Austria</placeTerm>
</place>
</originInfo>
<genre authority="marcgt">conference publication</genre>
<identifier type="isbn">979-8-89176-256-5</identifier>
</relatedItem>
<abstract>Recently, advancements in large multimodal models have led to significant strides in image comprehension capabilities. Despite these advancements, there is a lack of a robust benchmark specifically for assessing the image-to-web conversion proficiency of these large models. It is essential to ensure the integrity of the web elements generated, which comprise both visible and invisible categories. Previous evaluation methods (e.g., BLEU) are notably susceptible to significant alterations due to the presence of invisible elements. Furthermore, it is crucial to measure the layout information of web pages—i.e., the positional relationships between elements—which has been overlooked by prior work. To address these challenges, we have curated and aligned a benchmark of images and corresponding web codes (IW-bench). Specifically, we propose Element Accuracy, which tests the completeness of elements by parsing the Document Object Model (DOM) tree. We also introduce Layout Accuracy to analyze positional relationships by converting the DOM tree into a common subsequence. In addition, we design a five-hop multimodal Chain-of-Thought prompting strategy for improved performance, consisting of: 1) SoM prompt injection, 2) inferring elements, 3) inferring layout, 4) inferring web code, and 5) reflection. Our benchmark comprises 1,200 image–code pairs with varying levels of difficulty. We have conducted extensive experiments on existing large multimodal models, providing insights into their performance and identifying areas for improvement in the image-to-web domain.</abstract>
<identifier type="citekey">guo-etal-2025-iw</identifier>
<identifier type="doi">10.18653/v1/2025.findings-acl.334</identifier>
<location>
<url>https://aclanthology.org/2025.findings-acl.334/</url>
</location>
<part>
<date>2025-07</date>
<extent unit="page">
<start>6449</start>
<end>6466</end>
</extent>
</part>
</mods>
</modsCollection>
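The abstract also names a five-hop multimodal Chain-of-Thought strategy: 1) SoM prompt injection, 2) inferring elements, 3) inferring layout, 4) inferring web code, and 5) reflection. The sketch below only illustrates that staged structure; the prompt wording and the query_model client are hypothetical placeholders, not the prompts or API from the paper.

# Structural sketch of the five-hop strategy named in the abstract.
# Hop order is taken from the abstract; prompt text and query_model
# are hypothetical placeholders.
HOPS = [
    ("som_injection", "Here is the screenshot with Set-of-Mark annotations."),
    ("infer_elements", "List the web elements visible in the annotated image."),
    ("infer_layout", "Describe the positional relationships between those elements."),
    ("infer_code", "Write HTML/CSS that reproduces the elements and layout."),
    ("reflection", "Compare your code against the image and fix any mismatch."),
]


def query_model(image_path: str, prompt: str) -> str:
    """Placeholder for a call to a large multimodal model."""
    raise NotImplementedError


def image_to_web(image_path: str) -> str:
    # Each hop sees the image plus the running transcript of earlier hops,
    # so later hops can build on the inferred elements and layout.
    transcript: list[str] = []
    answer = ""
    for name, instruction in HOPS:
        prompt = "\n".join(transcript + [instruction])
        answer = query_model(image_path, prompt)
        transcript += [instruction, answer]
    return answer  # web code as revised by the final reflection hop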