ScreenSpot-Pro: GUI Grounding for Professional High-Resolution Computer Use

Li, Kaixin; Meng, Ziyang; Lin, Hongzhan; Luo, Ziyang; Tian, Yuchen; Ma, Jing; Huang, Zhiyong; Chua, Tat-Seng

arXiv:2504.07981 (cs)

[Submitted on 4 Apr 2025]

Abstract: Recent advancements in Multi-modal Large Language Models (MLLMs) have led to significant progress in developing GUI agents for general tasks such as web browsing and mobile phone use. However, their application in professional domains remains under-explored. These specialized workflows introduce unique challenges for GUI perception models, including high-resolution displays, smaller target sizes, and complex environments. In this paper, we introduce ScreenSpot-Pro, a new benchmark designed to rigorously evaluate the grounding capabilities of MLLMs in high-resolution professional settings. The benchmark comprises authentic high-resolution images from a variety of professional domains with expert annotations. It spans 23 applications across five industries and three operating systems. Existing GUI grounding models perform poorly on this dataset, with the best model achieving only 18.9%. Our experiments reveal that strategically reducing the search area enhances accuracy. Based on this insight, we propose ScreenSeekeR, a visual search method that utilizes the GUI knowledge of a strong planner to guide a cascaded search, achieving state-of-the-art performance with 48.1% without any additional training. We hope that our benchmark and findings will advance the development of GUI agents for professional applications. Code, data and leaderboard can be found at this https URL.
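
The abstract's observation that reducing the search area helps can be illustrated with a small sketch. It is not the authors' code. It assumes a common GUI-grounding convention that the abstract does not state: a prediction counts as correct when the predicted click point falls inside the annotated target box. The names `Box`, `crop_to_screen`, and `grounding_accuracy` are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Axis-aligned rectangle in screenshot pixels (hypothetical helper)."""
    left: float
    top: float
    right: float
    bottom: float

    def contains(self, x: float, y: float) -> bool:
        return self.left <= x <= self.right and self.top <= y <= self.bottom


def crop_to_screen(point, crop: Box):
    """Map a point predicted inside a cropped region back to full-screenshot pixels."""
    x, y = point
    return (crop.left + x, crop.top + y)


def grounding_accuracy(predictions, targets) -> float:
    """Fraction of predicted points that land inside their target box.

    Assumed metric: point-in-box accuracy, not taken from the abstract.
    """
    if not targets:
        return 0.0
    hits = sum(1 for (x, y), box in zip(predictions, targets) if box.contains(x, y))
    return hits / len(targets)


# A 24x24 px icon on a 3840x2160 screenshot, searched within a 1280x720 crop.
target = Box(3000, 40, 3024, 64)
crop = Box(2560, 0, 3840, 720)
local_prediction = (450, 50)  # coordinates relative to the crop
screen_prediction = crop_to_screen(local_prediction, crop)
print(screen_prediction, target.contains(*screen_prediction))
```

In this toy setup the icon covers a far larger share of the crop than of the full screenshot, which is the intuition behind narrowing the search area before grounding.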

Comments:	13 pages
Subjects:	Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC); Multimedia (cs.MM)
MSC classes:	68-11 68-04
ACM classes:	I.2.7; I.2.10
Cite as:	arXiv:2504.07981 [cs.CV]
	(or arXiv:2504.07981v1 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2504.07981

Submission history

From: Meng Ziyang
[v1] Fri, 4 Apr 2025 14:25:17 UTC (16,239 KB)
