WebVoyager: Building an End-to-End Web Agent with Large Multimodal Models

Hongliang He1,3∗, Wenlin Yao2, Kaixin Ma2, Wenhao Yu2, Yong Dai2, Hongming Zhang2, Zhenzhong Lan3, Dong Yu2
1 Zhejiang University, 2 Tencent AI Lab, 3 Westlake University
hehongliang@westlake.edu.cn, wenlinyao@global.tencent.com

arXiv:2401.13919v4 [cs.CL] 6 Jun 2024

Abstract

The rapid advancement of large language models (LLMs) has led to a new era marked by the development of autonomous applications in real-world scenarios, which drives innovation in creating advanced web agents. Existing web agents typically only handle one input modality and are evaluated only in simplified web simulators or static web snapshots, greatly limiting their applicability in real-world scenarios. To bridge this gap, we introduce WebVoyager, an innovative Large Multimodal Model (LMM) powered web agent that can complete user instructions end-to-end by interacting with real-world websites. Moreover, we establish a new benchmark by compiling real-world tasks from 15 popular websites and introduce an automatic evaluation protocol that leverages the multimodal understanding abilities of GPT-4V to evaluate open-ended web agents. We show that WebVoyager achieves a 59.1% task success rate on our benchmark, significantly surpassing the performance of both GPT-4 (All Tools) and the WebVoyager (text-only) setups, underscoring the exceptional capability of WebVoyager. The proposed automatic evaluation metric achieves 85.3% agreement with human judgment, indicating its effectiveness in providing reliable and accurate assessments of web agents.1

1 Introduction

The recent advancement of large language models (LLMs), such as ChatGPT and GPT-4 (OpenAI, 2023), has sparked significant interest in developing LLM-based autonomous agents (AutoGPT, 2022) for complex task execution (Qin et al., 2023; Schick et al., 2023). Recent studies have explored the construction of text-based web browsing environments and how to instruct large language model agents to perform web navigation (Nakano et al., 2021; Gur et al., 2023; Zhou et al., 2023; Lu et al., 2023). The primary challenge in these works lies in managing complex and verbose HTML texts, and solutions include simplifying and structuring HTML (Nakano et al., 2021; Zhou et al., 2023; Gur et al., 2023; Deng et al., 2023).

However, existing approaches overlook a critical functionality of browsing: rendering HTML into visual webpages. In particular, vision capability is crucial for utilizing tools such as web browsers, as rendered web pages are inherently designed for user experience (UX), emphasizing intuitive information and structured presentation. This design principle makes visual analysis more effective than mere HTML representation. At present, large multimodal models (LMMs), particularly GPT-4V(ision) (OpenAI, 2023) and Gemini (Team et al., 2023), demonstrate a remarkable ability to integrate intricate visual cues with textual information. Existing studies such as Pix2Struct (Lee et al., 2023) and WebArena (Zhou et al., 2023) have initiated explorations into using screenshots as inputs for decision-making in web navigation, yet these explorations remain preliminary. Therefore, building multimodal web agents that leverage the environment rendered by browsers through screenshots, thus mimicking human web browsing behavior, is now a viable approach to enhancing web navigation abilities.

We introduce WebVoyager (Figure 1), a multimodal web agent designed to autonomously accomplish web tasks online from start to finish, managing the entire process end-to-end without any intermediate human intervention. WebVoyager processes the user query by making observations from screenshots and textual content in interactive web elements, formulates a thought on what action to take (such as clicking, typing, or scrolling), and then executes that action on the websites. Inspired by Set-of-Mark Prompting (Yang et al., 2023a), we mark interactive web elements on screenshots (see Figure 2) to facilitate decision-making for WebVoyager.

Figure 1: The overall workflow of WebVoyager. WebVoyager takes web tasks assigned by a human and automatically browses the web online. At each step, WebVoyager selects actions based on screenshots and text (the ‘type’ of the web element and its contents). Once the task is completed, the answer is returned to the user. For example, for the user query "Find the cost of a 2-year protection for PS4 on Amazon.", the agent interacts with Amazon online, locates the PS4, identifies the 2-year protection price, and returns "$30.99" to the user.

∗ Work done during an internship at Tencent AI Lab.
1 Our code and data will be released at https://github.com/MinorJerry/WebVoyager

Another challenge is the evaluation of an end-to-end web agent. Existing benchmarks, such as Mind2Web (Deng et al., 2023), primarily focus on stepwise and offline evaluation, where agents follow a predefined “golden” trajectory for action selection. This approach, however, may not fully account for the variety of viable strategies to accomplish a task, as it only reflects one possible plan. This limitation could lead to a biased evaluation and difficulties in fairly comparing different methods. To accurately evaluate the capabilities of web agents in end-to-end task completion, we propose an automated evaluation protocol using GPT-4V. Specifically, we save screenshots throughout the online navigation process and then use GPT-4V to evaluate these trajectories together with the final results automatically. Human evaluations are also conducted to verify the results, and the analysis shows that our evaluation protocol achieves 85.3% agreement with human judges, indicating that GPT-4V can serve as a reliable evaluator for online agents.

We conduct evaluations on a newly collected dataset, which is semi-automatically generated using a self-instruct (Wang et al., 2022) method and comprises 643 web tasks from 15 commonly accessed websites. We also evaluate WebVoyager on 90 web-related tasks of level 1 and level 2 from GAIA (Mialon et al., 2023) and 50 interactive open-web tasks from SeeAct (Zheng et al., 2024). We compare our WebVoyager with 1) GPT-4 (All Tools)2 and 2) WebVoyager in a text-only setting, which employs the textual accessibility tree proposed in WebArena (Zhou et al., 2023) to describe web pages.

The results show that WebVoyager achieves a Task Success Rate of 59.1% on our new benchmark, significantly outperforming GPT-4 (All Tools) with a rate of 30.8% and the text-only setting with a rate of 40.1%, demonstrating the effectiveness of our method. Our research demonstrates the effectiveness of the WebVoyager method for web tasks, offering insights into the development of more intelligent and efficient web automation solutions.

2 GPT-4 (All Tools) is an integrated tool-based agent released by OpenAI in Oct. 2023. See https://chat.openai.com/

2 Related Work

Autonomous web navigation (Shi et al., 2017; Yang et al., 2023b) requires an agent to follow instructions, construct plans, comprehend complex web structures, and decompose tasks into step-by-step decisions (Weng, 2023). To study web agents in a controlled environment, previous works constructed web simulators that contain simplified websites (Shi et al., 2017; Yao et al., 2022a). More recently, there has been a surge of interest in building more challenging and realistic benchmarks such as Mind2Web (Deng et al., 2023) and WebArena (Zhou et al., 2023).

Along with these new benchmarks, numerous efforts have been made to build autonomous web agents. WebGPT (Nakano et al., 2021) constructs a text-based web browsing environment and fine-tunes GPT-3 as a web agent. WebAgent (Gur et al., 2023) pretrains a T5 model to extract HTML snippets and leverages Flan-U-PaLM (Chowdhery et al., 2023) to generate Python code to interact with the environment. Besides fine-tuning, another line of work builds web agents by prompting LLMs (Yao et al., 2022b; Shinn et al., 2023; Ma et al., 2023). Multimodal web agents that integrate visual signals have also been explored: WebGUM (Furuta et al., 2023) combines T5 (Raffel et al., 2020) with a Vision Transformer (ViT) to navigate using both screenshots and HTML text, while PIX2ACT (Shaw et al., 2023) relies solely on web screenshots as inputs to predict agent actions. Unlike previous works that only consider a single modality or simplified web environments, we build a multimodal agent that can complete tasks on real-world websites. Concurrently with our work, SeeAct (Zheng et al., 2024) also leverages Large Multimodal Models (LMMs) for integrated visual understanding and actions on websites. However, the best SeeAct agent still relies on a finetuned cross-encoder model to select candidate elements for interaction. In contrast, WebVoyager does not require any additional modules.

Figure 2: Examples of webpage screenshots provided to the agent. We add borders to most of the interactive elements on the web pages and label them with numerical tags in the top left corner.
3 WebVoyager

We aim to build an agent that can browse the open web autonomously, without human intervention, to complete user instructions. Given an instruction, our WebVoyager first instantiates a web browser and then performs actions based on visual (i.e., screenshot) and textual (i.e., HTML element) signals from the web. The agent produces an action based on the inputs at every step, which is then executed in the browser environment. The process continues until the agent decides to stop. The details of WebVoyager, including the environment, interaction cycle, observation space, and action space, are as follows.

3.1 Browsing Environment

We develop an automated web-browsing environment using Selenium3. Unlike WebArena (Zhou et al., 2023), we do not host any websites locally and instead allow the agent to explore the open web, which poses unique challenges such as floating ads, pop-up windows, constant updates, etc.4 Still, we opt for online interaction with real websites because this setting truly reflects real-world use cases (e.g., the agent needs access to real-time information from the web), and a successful web agent should be able to adapt to these challenges and consistently solve problems robustly.

3 https://www.selenium.dev/
4 Regarding CAPTCHA (Completely Automated Public Turing test to tell Computers and Humans Apart) challenges, we believe it is important to respect the rules of these websites and prompt the agent to retrieve information from alternative sources.
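As a concrete illustration of such an environment, the sketch below starts a Chrome session through Selenium with a fixed window size and saves a screenshot that can serve as the visual observation. The helper name, the Chrome options, and the 1024x768 window size (taken from the experimental setup in Section 5) are illustrative assumptions rather than the authors' exact implementation.

```python
# Minimal sketch of a Selenium-based browsing environment (illustrative, not the
# authors' exact code). Assumes Chrome and the `selenium` package are installed.
from selenium import webdriver
from selenium.webdriver.chrome.options import Options


def create_browser(width: int = 1024, height: int = 768) -> webdriver.Chrome:
    """Start a Chrome session with a fixed window size so screenshots are consistent."""
    options = Options()
    options.add_argument(f"--window-size={width},{height}")
    options.add_argument("--disable-popup-blocking")  # real websites frequently open pop-ups
    driver = webdriver.Chrome(options=options)
    driver.set_page_load_timeout(30)  # open-web pages can be slow to load
    return driver


if __name__ == "__main__":
    driver = create_browser()
    driver.get("https://www.google.com")
    driver.save_screenshot("observation_step_0.png")  # screenshot used as the visual observation
    driver.quit()
```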
3.2 Interaction Formulation

Formally, we denote the Environment as E, the large Multimodal Model as M, the Observation Space as O, and the Action Space as A. At time step t, the model receives the context c_t as input, which consists of the historical actions a_i and observations o_i together with the task instruction I:

c_t = (o_1, a_1, ..., o_{t-1}, a_{t-1}, o_t, I).

The model then produces the action a_t = M(c_t), which is executed in the environment. After execution, the environment sends back the observation at time t + 1, o_{t+1} = E(o_t, a_t). The context is then updated, and this interaction process continues until the model generates a terminating action or the maximum number of steps is reached.

Inspired by the ReAct prompting paradigm (Yao et al., 2022b), we also prompt our agent to generate a thought process before generating the action code. Hence a_t can be further decomposed into (s_t, â_t), where s_t and â_t represent the natural language thought and the action code, respectively. Figure 7 in Appendix A presents the System Prompt we designed for the action prediction step. It is also worth noting that excessive observations of web pages from longer episodes may confuse the agent. Therefore, we perform context clipping to remove outdated web page information and keep only the three most recent observations in the inputs, while keeping the entire history of thoughts and actions to better guide the agent.
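A minimal sketch of this observe-think-act loop, including the context clipping described above, is given below. The helper callables (`env.observe`, `env.execute`, `call_lmm`) are hypothetical placeholders for the observation, execution, and model components; only the clipping policy (keep the three most recent observations but the full thought and action history) and the 15-step budget follow the text.

```python
# Sketch of the observe-think-act loop with context clipping (assumed helpers,
# not the authors' exact implementation).
MAX_STEPS = 15          # maximum exploration budget (Section 5)
KEEP_OBSERVATIONS = 3   # only the three most recent observations are kept


def run_episode(task_instruction, env, call_lmm):
    history = []                         # past (observation, thought, action) turns
    observation = env.observe()          # o_1: screenshot plus auxiliary text

    for step in range(MAX_STEPS):
        # Build the context c_t = (o_1, a_1, ..., o_{t-1}, a_{t-1}, o_t, I),
        # replacing outdated observations but keeping all thoughts and actions.
        context = []
        for i, turn in enumerate(history):
            recent = i >= len(history) - (KEEP_OBSERVATIONS - 1)
            context.append({
                "observation": turn["observation"] if recent else "[omitted]",
                "thought": turn["thought"],
                "action": turn["action"],
            })
        context.append({"observation": observation, "task": task_instruction})

        thought, action = call_lmm(context)        # (s_t, a_t) = M(c_t)
        if action.startswith("ANSWER"):
            return action                          # terminating action ends the episode

        history.append({"observation": observation, "thought": thought, "action": action})
        observation = env.execute(action)          # o_{t+1} = E(o_t, a_t)

    return None  # step budget exhausted (the "Navigation Stuck" failure mode, Section 5.4)
```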
3.3 Observation Space

Similar to how humans browse the web, our agent takes the visual information from the web (screenshots) as the primary source of input. Using screenshots allows us to avoid the burden of processing the HTML DOM tree or accessibility tree to portray the overall structure of webpages, which can lead to overly verbose text and impair the agent's decision-making. Inspired by Set-of-Mark Prompting (Yang et al., 2023a), we overlay bounding boxes of the interactive elements on the websites to better guide the agent's action prediction. Unlike Yang et al. (2023a), we do not need any object detection module (Zou et al., 2023). Instead, we utilize GPT-4V-ACT5, a JavaScript tool that extracts the interactive elements based on web element types and then overlays bounding boxes with numerical labels on the respective regions of the elements. GPT-4V-ACT is efficient because it is rule-based and does not incorporate any object detection model.

5 https://github.com/ddupont808/GPT-4V-Act

As illustrated in Figure 2, the nature of webpages allows us to locate and outline each interactive element precisely with this tool. The numerical labels assigned to each element are also essential for the model to identify the elements requiring interaction, thereby facilitating accurate action determination. We empirically choose black for the borders and the background of the labels to enhance clarity; we observe that using a single black color yields higher success rates than using multiple colors. We also provide the agent with auxiliary text as input, including the textual content embedded within the interactive element, the type of the element, and possibly some comment text in the aria-label attribute. To simplify the observation, we have disabled multiple tabs, i.e., all interactions occur within the current tab instead of opening new ones.

At every step, the agent receives the current screenshot, auxiliary text, and history as inputs, as discussed in §3.2. If the agent's action raises an exception during execution, we additionally incorporate the error message in the prompt and ask the model to regenerate the response. Note that each error correction attempt also consumes one step from the total exploration budget.
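The overlay step can be approximated by injecting a small piece of JavaScript through Selenium before each screenshot, as sketched below. This is a simplified stand-in for GPT-4V-ACT: the element selector, styling, and returned auxiliary fields are assumptions made for illustration.

```python
# Simplified Set-of-Mark style overlay: draw numbered black boxes around
# interactive elements, then capture the screenshot. Illustrative only.
OVERLAY_JS = """
const selector = 'a, button, input, textarea, select, [role="button"]';
const elements = Array.from(document.querySelectorAll(selector));
const items = [];
elements.forEach((el, idx) => {
  const rect = el.getBoundingClientRect();
  if (rect.width === 0 || rect.height === 0) { return; }   // skip invisible elements
  const box = document.createElement('div');
  box.style.cssText = 'position:absolute; border:2px solid black; pointer-events:none; z-index:2147483647;'
    + 'left:' + (rect.left + window.scrollX) + 'px; top:' + (rect.top + window.scrollY) + 'px;'
    + 'width:' + rect.width + 'px; height:' + rect.height + 'px;';
  const label = document.createElement('span');
  label.textContent = idx;                                  // numerical label in the top-left corner
  label.style.cssText = 'position:absolute; top:0; left:0; background:black; color:white; font-size:12px; padding:0 2px;';
  box.appendChild(label);
  document.body.appendChild(box);
  items.push({label: idx, tag: el.tagName,
              text: el.innerText || el.getAttribute('aria-label') || ''});
});
return items;
"""


def annotate_and_capture(driver, path="observation.png"):
    """Overlay numbered boxes, save the screenshot, and return auxiliary text per element."""
    elements = driver.execute_script(OVERLAY_JS)
    driver.save_screenshot(path)
    return elements  # e.g. [{'label': 2, 'tag': 'INPUT', 'text': 'Search Amazon'}, ...]
```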
3.4 Action Space

We define the action space of our agent similarly to how humans browse the web. To this end, we implement the most commonly used mouse and keyboard actions, sufficient for the agent to browse various web pages and locate the content required for the task. With the help of the numerical labels in screenshots, we enable the agent to respond with a concise Action Format. This approach precisely locates the elements requiring interaction and executes the corresponding actions. The usage of actions is as follows (more details in Appendix C): 1) Click. This action involves clicking on an element within a webpage, typically a link or a button. 2) Input. This composite action involves selecting a text box, deleting any existing content within it, and then inputting new content. 3) Scroll. Scrolling is a common operation for browsing webpages, usually involving the vertical movement of the entire page. 4) Wait. Action execution requires time, and this action is often used to wait for web pages to load. 5) Back. This action is used to return to the previous page. 6) Jump to Search Engine. There are often situations where agents get stuck at a certain website without finding an answer. This action enables the agent to jump to a search engine and start anew. 7) Answer. Once all questions in the task are resolved, this action concludes the iteration and provides an answer in line with the task requirements.
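To make the action format concrete, the following sketch parses a model reply such as "Click [12]" or "Type [3]; New York" and dispatches it to the browser. The regular expressions and the `elements` mapping from numerical labels to Selenium WebElements are illustrative assumptions, not the authors' parser.

```python
import re
import time

from selenium.webdriver.common.keys import Keys


def execute_action(driver, elements, action: str):
    """Map an action string onto browser operations (sketch; assumes `elements[label]`
    holds the Selenium WebElement behind each numerical label of the annotated screenshot)."""
    if m := re.match(r"Click \[(\d+)\]", action):
        elements[int(m.group(1))].click()
    elif m := re.match(r"Type \[(\d+)\]; (.+)", action):
        box = elements[int(m.group(1))]
        box.clear()                               # delete existing content first
        box.send_keys(m.group(2) + Keys.ENTER)    # automatic ENTER after typing
    elif m := re.match(r"Scroll \[(\w+)\]; \[(up|down)\]", action):
        offset = -500 if m.group(2) == "up" else 500
        driver.execute_script(f"window.scrollBy(0, {offset});")  # WINDOW scroll only in this sketch
    elif action == "Wait":
        time.sleep(5)                             # wait for the page to load
    elif action == "GoBack":
        driver.back()
    elif action == "Google":
        driver.get("https://www.google.com")      # jump to a search engine and start anew
    elif action.startswith("ANSWER"):
        return action.split(";", 1)[1].strip()    # final answer returned to the user
    return None
```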
4 Benchmark for WebVoyager

4.1 Website Selection

We select 15 representative websites that cover different aspects of daily life to ensure diversity in our evaluation: Allrecipes, Amazon, Apple, ArXiv, BBC News, Booking, Cambridge Dictionary, Coursera, ESPN, GitHub, Google Flights, Google Map, Google Search, Huggingface, and Wolfram Alpha. Due to technical limitations, we regretfully omit websites requiring login or CAPTCHA to access their content. Additionally, Google Search is a universal website that can serve as a starting point for any website, making our framework applicable to various scenarios.

4.2 Data Construction

We employ a combination of self-instruct (Wang et al., 2022) and human verification to construct our evaluation set. Figure 3 illustrates our data creation process. Initially, we manually sample and rewrite some tasks from Mind2Web (Yin et al., 2023; Deng et al., 2023) for websites including Google Flights, Google Map, Google Search, Booking, and Wolfram Alpha. This process yields initial seed tasks in the Task Pool for subsequent generation.
Figure 3: Data creation process using self-instruct.

In step two, we sample tasks from the Task Pool as in-context examples (Dong et al., 2022) and prompt GPT-4 Turbo to generate approximately 100 new tasks (20 iterations). We then manually verify each generated task and rewrite it if necessary to ensure its high quality and that its answer can be found on the corresponding website, before adding it to the Task Pool as an additional seed task. This step allows us to create human-validated seed tasks for each website. Finally, in step three, we sample more diverse in-context examples from the Task Pool and directly add the generated tasks to the Task Pool in each iteration. We manually verify that the generated tasks have low repetition and that their answers can be found on the websites. In total, we collected 40+ tasks per website, resulting in 643 tasks overall.

To further confirm that the generated tasks in the dataset have low repetition, we use the all-mpnet-base-v26 model to calculate pairwise similarity over the 643 questions. Out of a total of 206,403 pairs, only 49 pairs have a similarity greater than 0.8, and 140 pairs have a similarity between 0.7 and 0.8; all of these have been manually checked and found to be acceptable. 99.68% of pairs have a similarity of less than 0.6. This demonstrates the diversity of our tasks and the robustness of our approach.

6 https://huggingface.co/sentence-transformers/all-mpnet-base-v2
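A minimal version of this repetition check, assuming the 643 task texts are held in a Python list, can be written with the sentence-transformers package as follows.

```python
# Pairwise similarity check over the generated tasks with all-mpnet-base-v2.
# Sketch only; assumes `tasks` is the list of task strings.
from sentence_transformers import SentenceTransformer, util


def similarity_report(tasks):
    model = SentenceTransformer("sentence-transformers/all-mpnet-base-v2")
    embeddings = model.encode(tasks, convert_to_tensor=True, normalize_embeddings=True)
    sims = util.cos_sim(embeddings, embeddings)          # N x N cosine similarity matrix

    high, mid, total = 0, 0, 0
    n = len(tasks)
    for i in range(n):
        for j in range(i + 1, n):                        # each unordered pair counted once
            total += 1
            s = sims[i][j].item()
            if s > 0.8:
                high += 1
            elif s > 0.7:
                mid += 1
    return {"pairs": total, "sim>0.8": high, "0.7<sim<=0.8": mid}


# Example: similarity_report(["Find a vegan lasagna recipe on Allrecipes.", ...])
```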
4.3 Annotation Process

After collecting the full task pool, we annotate answers for each task. Since some questions are open-ended and web information may change, these questions may not have a fixed golden response. Thus, we label each data entry with an answer categorized as “Possible” or “Golden.” For answers labeled as “Golden,” we provide a comprehensive listing of possible responses and consider them stable in the short term. The “Possible” category covers the following scenarios: 1) answers for open-ended tasks where it is hard to find an exact-match answer, such as summarization; 2) tasks with multiple satisfying answers, making it impractical to list all of them, so we provide a partial listing; 3) tasks related to real-time information, where the answer might change, e.g., flight ticket prices. Hence, the “Possible” answers were also correct during our experiments. In total, 22.3% of questions are annotated with golden responses, and the rest only have possible answers.

5 Experiment

Dataset and Metrics Besides the benchmark introduced in Section 4, we also evaluate on 90 web browsing tasks (Level 1 and Level 2) from the GAIA dataset (Mialon et al., 2023), which also come with golden responses. Since GAIA does not specify websites, we instruct the agent to start with Google Search. We further evaluate our agent on the 50 tasks used in the SeeAct agent's online evaluation (Zheng et al., 2024) and compare with their results. Following WebArena (Zhou et al., 2023), the primary evaluation metric we adopt is the Task Success Rate, which measures the successful completion of tasks without considering whether the steps are optimal.

Experimental Details We employ GPT-4 Turbo with vision (gpt-4-vision-preview) as the backbone model of our agent, which showcases strong semantic and visual understanding capabilities equivalent to GPT-4V (OpenAI, 2023). Furthermore, we incorporate two additional backbone models, namely Claude 3 Opus (Anthropic, 2024) and GPT-4o (GPT-4 Omni, OpenAI, 2024), to increase the diversity of our study. We include GPT-4 (All Tools) as a baseline, which integrates vision, web browsing, code analysis, and various plugins in one model. Additionally, we consider a text-only baseline where the agent only receives the website's accessibility tree as input to predict actions. We use a fixed browser window size of 1024 x 768 pixels for our environment, ensuring a consistent size for the screenshots in our observations. We set the temperature to 1 during generation and allow the agent to explore at most 15 steps.
Step 1: Click [10] -> Step 2: Type [17]: Smart Folio for iPad -> Step 3: Click [46] -> Step 4: Click [22] -> Step 5: Type [0]: 90038 -> Step 6: ANSWER

Figure 4: Screenshots of a complete trajectory of online web browsing. Given the task: ‘Search Apple for the accessory Smart Folio for iPad and check the closest pickup availability next to zip code 90038.’ The agent interacts with the Apple website and obtains the answer: ‘Apple Tower Theatre.’

5.1 Evaluation Methods

We adopt human evaluation as our main evaluation metric since most of the questions in our benchmark have open-ended answers. In particular, we provide the human evaluators with the complete trajectories of the agent's interaction with the web (all screenshots and all actions) and ask them to give a binary judgment of whether the agent successfully completed the task. For a subset of 300 tasks, we invite three annotators to judge each trajectory to understand the agreement among human annotators.

Even though human evaluations are accurate, they are often not scalable. Hence, we want to see whether leveraging an LMM for automatic evaluation is feasible. To this end, we propose to use GPT-4V as an auto-evaluator that emulates the behavior of human evaluators when judging the navigation trajectories of WebVoyager. In particular, we provide the task, the responses from WebVoyager, and the last k screenshots to the evaluator and ask it to judge whether the agent has completed the task, where k is a hyperparameter. The prompt of the GPT-4V evaluator is shown in Appendix B.

5.2 Result

Figure 4 presents an example that demonstrates how the agent interacts with the Apple website step by step in an online fashion to complete a task. In the final screenshot, the agent acquires the desired information and then selects the “ANSWER” action to respond and conclude the navigation. Additional examples are provided in Appendix D.

We present the results for our dataset and the extracted GAIA web tasks in Table 1 and Figure 5. WebVoyager outperforms the text-only and GPT-4 (All Tools) baselines by large margins on most website tasks, while it is slightly lower than Text-only on Allrecipes and similar to Text-only on GitHub, ESPN, Cambridge Dictionary, and Wolfram Alpha. This is primarily because these websites are more text-heavy than others; since WebVoyager mostly relies on web screenshots for decision-making, dense text might not be easily recognizable from the image. We think extracting such text from the HTML to augment the input could be a potential solution to this problem, suggesting a direction for future work. In Figure 5, WebVoyager also achieves much stronger performance than both baselines. Finally, WebVoyager has a success rate of 30% on the SeeAct online test set, whereas the best SeeAct autonomous agent has 26%, showing the efficacy of our proposed agent.
Model                     Allrecipes   Amazon       Apple        ArXiv        GitHub       Booking      ESPN         Coursera
GPT-4 (All Tools)         11.1%        17.1%        44.2%        14.0%        48.8%        22.7%        31.8%        31.0%
WebVoyager (Text-only)    55.6%        31.7%        34.9%        32.6%        61.0%        2.3%         36.4%        23.8%
WebVoyager                53.3%        58.5%        65.1%        51.2%        63.4%        43.2%        38.6%        73.8%
WebVoyager (Text-only)*   57.8%±0.0%   43.1%±1.4%   36.4%±3.5%   50.4%±1.4%   63.4%±2.5%   2.3%±0.0%    38.6%±2.3%   24.6%±1.4%
WebVoyager*               51.1%±2.2%   52.9%±1.4%   62.8%±2.3%   52.0%±1.3%   59.3%±3.7%   32.6%±2.7%   47.0%±1.3%   57.9%±2.7%
WebVoyager (Claude)*      45.9%±3.4%   58.6%±4.2%   58.1%±4.0%   55.0%±7.0%   56.9%±1.4%   19.0%±1.3%   46.2%±1.3%   68.2%±1.3%
WebVoyager (GPT-4o)*      56.3%±1.3%   53.7%±2.5%   56.6%±1.3%   60.5%±0.0%   57.7%±3.7%   43.9%±3.5%   44.0%±2.7%   65.1%±2.8%

Model                     Cambridge    BBC News     Google       Google       Google       Huggingface  Wolfram      Overall
                          Dictionary                Flights      Map          Search                    Alpha
GPT-4 (All Tools)         25.6%        9.5%         2.4%         53.7%        60.5%        37.2%        52.2%        30.8%
WebVoyager (Text-only)    62.8%        45.2%        7.1%         61.0%        67.4%        20.9%        58.7%        40.1%
WebVoyager                65.1%        61.9%        59.5%        70.7%        76.7%        44.2%        63.0%        59.1%
WebVoyager (Text-only)*   66.7%±3.6%   45.2%±2.4%   7.1%±0.0%    62.6%±2.8%   75.2%±1.3%   31.0%±1.4%   60.2%±1.3%   44.3%±0.6%
WebVoyager*               71.3%±1.3%   60.3%±2.8%   51.6%±1.4%   64.3%±2.8%   77.5%±2.7%   55.8%±2.3%   60.9%±2.2%   57.1%±0.2%
WebVoyager (Claude)*      71.3%±3.6%   66.7%±4.8%   15.1%±5.5%   55.3%±1.4%   72.9%±1.3%   53.5%±4.7%   51.5%±5.4%   52.8%±1.4%
WebVoyager (GPT-4o)*      82.2%±1.3%   54.8%±2.4%   28.6%±0.0%   56.9%±2.8%   63.6%±1.3%   42.6%±3.6%   65.2%±2.2%   55.5%±0.8%

Table 1: The main result for WebVoyager. Each website contains 40 to 45 tasks, and we report the Task Success Rate. We show the results of GPT-4 (All Tools), WebVoyager (Text-only) using the accessibility tree, and WebVoyager, evaluated against human expert labels. WebVoyager*, WebVoyager (Text-only)*, WebVoyager (Claude)*, and WebVoyager (GPT-4o)* are results evaluated by GPT-4V (full trajectory, kappa = 0.70). For each automatic evaluation, we run the GPT-4V evaluator three times to calculate the mean and standard deviation.

             Success Rate   Agreement   κ
k = 1        47.7%          75.3%       0.51
k = 2        55.3%          79.7%       0.59
k = 3        54.3%          81.3%       0.62
Full         58.3%          85.3%       0.70

Table 2: Consistency between GPT-4V and Human. Success Rate is the overall success rate of all tasks given by GPT-4V. Based on the annotations given by GPT-4V and Human (after alignment), we compute Agreement, i.e., the label overlap, and the Kappa values.

WebVoyager Backbone    Evaluator: GPT-4V   Claude-3-Opus   GPT-4o
GPT-4V                 57.1                55.1            63.0
Claude-3-Opus          52.8                61.6            55.4
GPT-4o                 55.5                54.9            64.1

Table 3: Overall Task Success Rate of WebVoyager using automatic evaluation. We employ GPT-4V, Claude-3-Opus, and GPT-4o as backbones for WebVoyager and run all tasks, followed by automatic evaluation using these three models.
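As a concrete illustration of the automatic evaluation protocol (Section 5.1 and Appendix B), the sketch below assembles one evaluation request to a GPT-4V-style model from the task, the agent's answer, and the last k screenshots. The prompt wording, model name, and encoding helper are illustrative assumptions; the actual evaluator prompt is shown in Figure 8.

```python
import base64

from openai import OpenAI


def evaluate_trajectory(task: str, answer: str, screenshot_paths: list[str], k: int = 3) -> str:
    """Ask a vision model to judge task success from the last k trajectory screenshots (sketch)."""
    client = OpenAI()
    content = [{"type": "text",
                "text": (f"Task: {task}\nAgent response: {answer}\n"
                         "Judge from the screenshots whether the task was completed. "
                         "Reply with SUCCESS or NOT SUCCESS.")}]
    for path in screenshot_paths[-k:]:                      # only the last k screenshots
        with open(path, "rb") as f:
            b64 = base64.b64encode(f.read()).decode()
        content.append({"type": "image_url",
                        "image_url": {"url": f"data:image/png;base64,{b64}"}})

    response = client.chat.completions.create(
        model="gpt-4-vision-preview",
        temperature=0,                                      # reduce randomness during evaluation
        messages=[{"role": "user", "content": content}],
    )
    return response.choices[0].message.content
```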

We report the Agreement (the ratio of overlap) and Kappa (κ; Cohen 1960) between consolidated human labels7 and GPT-4V's judgments on the subset of 300 tasks in Table 2. Here, k denotes the number of screenshots provided to GPT-4V, with “full” implying the full trajectory. GPT-4V's agreement with human annotators gradually improves as it receives more information, and its final Kappa score reaches 0.7, which is on par with the agreement among human annotators. The consistency between GPT-4V and humans suggests that GPT-4V is a promising automatic evaluator for multimodal web agents. Accordingly, we report the automatic evaluation results based on GPT-4V in Table 1.

7 The Fleiss's Kappa (Fleiss, 1971) of the human annotators before any discussion is 0.7, which indicates substantial agreement.
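For reference, the Agreement and Cohen's kappa reported in Table 2 can be computed from two lists of binary judgments as in the sketch below (the example labels are made up; scikit-learn provides cohen_kappa_score).

```python
from sklearn.metrics import cohen_kappa_score

# Agreement (label overlap) and Cohen's kappa between human and GPT-4V judgments.
# The label lists are placeholders; in practice each list holds one binary success
# judgment per task for the 300-task subset.
human = [1, 0, 1, 1, 0, 1]
gpt4v = [1, 0, 1, 0, 0, 1]

agreement = sum(h == g for h, g in zip(human, gpt4v)) / len(human)
kappa = cohen_kappa_score(human, gpt4v)
print(f"Agreement: {agreement:.3f}, Cohen's kappa: {kappa:.3f}")
```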
The automatic evaluation results of the three backbone models, GPT-4V, Claude 3 Opus, and GPT-4o, are relatively close, and their performance is significantly better than the text-only setting (with GPT-4 as the backbone). However, there is a performance decline for both Claude and GPT-4o on Google Flights. Upon reviewing the trajectories, we observe that GPT-4o falls into a cognitive bias: it fails to correctly select the ‘one way’ option for one-way trip tasks, mistakenly assuming that only the departure date needs to be entered. Claude-3-Opus, on the other hand, encounters difficulties in correctly interacting with web elements while inputting basic flight information. Modifying the system prompt for GPT-4o or Claude may potentially improve the performance.

Besides, we conduct Claude-3-Opus-based and GPT-4o-based evaluations. When provided with the full trajectory, Claude-3-Opus achieves a kappa value of 0.6 with humans, indicating that it is less reliable than GPT-4V, while the kappa value between GPT-4o and humans is 0.72, slightly higher than that of GPT-4V. Table 3 reports the Task Success Rate when using GPT-4V, Claude-3-Opus, and GPT-4o as backbones and as auto-evaluators. We observe that GPT-4o exhibits a more lenient attitude towards task performance results, while GPT-4V tends to be relatively strict. However, both models agree that Claude-3-Opus performs the worst. Claude-3-Opus, in turn, demonstrates a clear preference for its own results, believing that GPT-4V and GPT-4o are similar but considering itself to have the best performance. GPT-4o and GPT-4V also exhibit a certain bias towards their own results, with each considering itself to be superior to the other.
Figure 5: Success Rate for GAIA Level 1 and Level 2.

Figure 6: Factors related to task success rate. We show the average number of elements per page and the average trajectory length for each website type. Darker colors indicate a higher task success rate.

5.3 Discussions

Direct interaction with the websites is necessary. From our experience of using GPT-4 (All Tools), the primary limitation of its performance is rooted in its reliance on Bing search for web browsing, predominantly depending on web pages fetched by Bing. It cannot directly access certain websites (such as Apple, Amazon, BBC News, etc.) for searching, clicking, or utilizing their sorting functions. This greatly limits the agent's ability to complete certain types of tasks.

Both text and vision are necessary for generalist web agents. As discussed earlier, WebVoyager struggles with text-heavy websites. On the other hand, we observe that the text-only agent demonstrates significantly poorer performance on websites with complex visual elements, such as Booking and Google Flights, which require interactions with calendars and other intricate components. In these scenarios, textual input such as the accessibility tree becomes highly complex and verbose, making it far less intuitive than using screenshots. Hence it is necessary to incorporate both modalities of input when building general-purpose agents.

Websites with more interactable elements are more challenging for agents. We also calculate the average trajectory length of tasks completed within the maximum number of iterations, as well as the average number of interactive web elements present on the webpage screenshots. Figure 6 illustrates their relationship with the Task Success Rate. We posit that the average trajectory length serves as a measure of a task's complexity to some extent, while the average number of numerical labels involved in the decision-making process reflects the complexity of a webpage. Intuitively, websites depicted in the lower-left corner of Figure 6, characterized by relatively simple webpages and shorter trajectory lengths, are expected to exhibit higher Task Success Rates. As observed in Figure 6, the results largely align with this intuition.

Why not use open-source models? A few critical limitations of existing open-sourced LMMs prevent us from exploring other alternatives. Specifically, the web navigation task requires the agent to process fine-grained details from web page screenshots to make decisions, so high resolution is required to preserve the information from the web. However, most open-sourced LMMs such as LLaVA (Liu et al., 2024) reduce the image resolution to 224x224 or 336x336, which makes text with smaller fonts unrecognizable, so they are unsuitable for web navigation tasks. Moreover, models like LLaVA have a maximum context length of 4096 tokens. In our case, the agent needs to handle trajectories as long as 15 steps, which requires approximately 7000+ tokens and does not fit in those models' context windows.

5.4 Error Analysis

In this section, we discuss and summarize the primary issues encountered by WebVoyager in the task completion process. These challenges will serve as critical entry points for future enhancements of the Task Success Rate and for devising strategies to obtain an optimal trajectory. We sampled 300 tasks from our benchmark and manually labeled the error category for each failed case; we show the distribution of errors in Table 4. In Appendix F, we also provide specific examples for each issue.
Main reason for failure     Ratio
Navigation Stuck            44.4%
Visual Grounding Issue      24.8%
Hallucination               21.8%
Prompt Misalignment         9.0%

Table 4: Distribution of main failure reasons.

Navigation Stuck The most common failure is running out of steps before completing the task. There are three failure scenarios: 1) when the agent's search query is not precise and explicit enough, it is overwhelmed by irrelevant search results, and it may prefer to browse different results or wait for incorrect outcomes rather than correct its previous action; 2) when the scrollable area is very small, the agent might not be able to locate the correct scrolling area and repeatedly request the execution of useless scrolling actions; 3) sometimes, in the middle of a page, the agent has trouble deciding whether to scroll up or down. The agent also tends to repeat its previous mistakes due to the input clipping mentioned in Section 3.2. These meaningless or repetitive actions may hinder the completion of the task.

Visual Grounding Issue The visual grounding ability of our agent still has a large room for improvement. We observe the following issues: 1) the agent cannot interpret less frequently observed patterns, such as misidentifying characters representing pronunciations or math formulas; 2) the agent does not recognize the subtle difference between two observations and thinks the execution has failed; 3) the agent selects the wrong element for action execution due to proximity; for example, the model sometimes confuses adjacent elements and misinterprets numbers on a calendar as numerical labels. Sometimes textual information plays a significant role, offering valuable cues and assisting in distinguishing between overly dense web elements. We find that incorporating the text content included in web elements can alleviate these problems to some extent. However, a stronger visual encoder or additional text inputs might be needed.

Hallucination Agents sometimes produce seemingly correct answers, which may require careful checking to identify errors. We mainly see two scenarios: 1) the agent may overlook certain task requirements and settle for an answer that is only partially correct; for instance, when asked for the cheapest product, the agent might respond with a cheap product visible in a screenshot, neglecting the need to sort the options first. 2) The agent might execute a seemingly correct action without raising any errors, which deviates it from the correct reasoning path; for example, inputting content into the wrong text box when there are many text boxes on the webpage is still a valid action, yet it would lead the agent to a wrong answer.

Prompt Misalignment Understanding and following complex prompts, as illustrated in Figure 7, often poses significant challenges. Moreover, longer trajectories may result in excessively lengthy contexts, hindering effective instruction following. Although many of the errors in the Navigation Stuck and Hallucination categories can also be attributed to prompt design, we use Prompt Misalignment to categorize the following situations: 1) the agent fails to generate outputs that can be parsed into executable actions, e.g., providing only the ‘Thought’ without the corresponding ‘Action’; 2) the agent prematurely terminates the process using the ANSWER action even though it knows that the task is not yet complete (explicitly mentioned in its answer).

6 Conclusion

We introduce WebVoyager, an innovative web agent powered by large multimodal models (LMMs) that can complete real-world web tasks end-to-end by interacting with websites. We have shown through evaluations that WebVoyager outperforms several baselines by leveraging both visual and textual signals. We also propose an automatic evaluation protocol that leverages GPT-4V as the evaluator for online agents. Our work demonstrates the promise of using advanced LMM capabilities in building intelligent web agents. We hope WebVoyager provides a strong foundation for future research toward building more versatile and capable web assistants.

Limitations

We recognize the following limitations of our work. First, we have not supported all possible actions in our environment compared to the actions a human user might take when browsing the web, e.g., the Drag action on web pages. Supporting such an action is challenging since the degree of a Drag is not a finite set; we may allow the agent to choose the pixel values to be dragged if the visual grounding capabilities of LMMs are further enhanced.
Second, our agent currently can only analyze basic file formats (such as text files and PDF files) and does not support all file formats, especially videos. Enhancing support for additional file formats is a crucial step in the development of web agents, and we leave it for future work.

Regarding the potential risks of our work, we believe that a substantial amount of safety checking is required before deploying web agents like WebVoyager into real-world applications, as the agent might unintentionally download malicious content from unauthorized websites or input private/confidential information on public websites. The agent might also send fake requests to website servers or generate fake user activities, which could be harmful to website owners. Therefore it is necessary to take extra caution when using and testing our agent.

Ethics Statement

Our experiments have been designed to operate within strict ethical guidelines. Specifically, we restrict our web agent to perform only non-login tasks. This approach is in full compliance with the terms of service and user agreements of the websites our agent interacts with. Furthermore, we closely monitor the agent's activities during its online evaluations. This monitoring is designed to identify and prevent any actions that could lead to potentially harmful consequences. By taking these precautions, we ensure that our research does not cross ethical boundaries or cause unintended harm.

Additionally, all task queries for evaluation undergo thorough manual inspection to ensure they are harmless and ethically sound and do not promote or propagate harmful content or actions. Our work aims to enhance user experience and accessibility while mitigating potential negative societal impacts. By proactively addressing ethics concerns, we are dedicated to conducting research that benefits society while upholding high ethical standards.

References

Armen Aghajanyan, Bernie Huang, Candace Ross, Vladimir Karpukhin, Hu Xu, Naman Goyal, Dmytro Okhonko, Mandar Joshi, Gargi Ghosh, Mike Lewis, et al. 2022. CM3: A causal masked multimodal model of the internet. arXiv preprint arXiv:2201.07520.

AI Anthropic. 2024. Introducing the next generation of Claude.

AutoGPT. 2022. AutoGPT.

Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. 2020. Language models are few-shot learners. Advances in Neural Information Processing Systems, 33:1877–1901.

Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, et al. 2021. Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374.

Qi Chen, Dileepa Pitawela, Chongyang Zhao, Gengze Zhou, Hsiang-Ting Chen, and Qi Wu. 2023. WebVLN: Vision-and-language navigation on websites.

Kanzhi Cheng, Qiushi Sun, Yougang Chu, Fangzhi Xu, Yantao Li, Jianbing Zhang, and Zhiyong Wu. 2024. SeeClick: Harnessing GUI grounding for advanced visual GUI agents.

Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, et al. 2023. PaLM: Scaling language modeling with pathways. Journal of Machine Learning Research, 24(240):1–113.

Jacob Cohen. 1960. A coefficient of agreement for nominal scales. Educational and Psychological Measurement, 20(1):37–46.

Yong Dai, Duyu Tang, Liangxin Liu, Minghuan Tan, Cong Zhou, Jingquan Wang, Zhangyin Feng, Fan Zhang, Xueyu Hu, and Shuming Shi. 2022. One model, multiple modalities: A sparsely activated approach for text, sound, image, video and code. arXiv preprint arXiv:2205.06126.

Xiang Deng, Yu Gu, Boyuan Zheng, Shijie Chen, Samuel Stevens, Boshi Wang, Huan Sun, and Yu Su. 2023. Mind2Web: Towards a generalist agent for the web. arXiv preprint arXiv:2306.06070.

Qingxiu Dong, Lei Li, Damai Dai, Ce Zheng, Zhiyong Wu, Baobao Chang, Xu Sun, Jingjing Xu, and Zhifang Sui. 2022. A survey for in-context learning. arXiv preprint arXiv:2301.00234.

Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. 2020. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929.

Joseph L Fleiss. 1971. Measuring nominal scale agreement among many raters. Psychological Bulletin, 76(5):378.
Hiroki Furuta, Ofir Nachum, Kuang-Huei Lee, Yutaka Matsuo, Shixiang Shane Gu, and Izzeddin Gur. 2023. Multimodal web navigation with instruction-finetuned foundation models. arXiv preprint arXiv:2305.11854.

Yash Goyal, Tejas Khot, Douglas Summers-Stay, Dhruv Batra, and Devi Parikh. 2017. Making the V in VQA matter: Elevating the role of image understanding in visual question answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6904–6913.

Izzeddin Gur, Hiroki Furuta, Austin Huang, Mustafa Safdari, Yutaka Matsuo, Douglas Eck, and Aleksandra Faust. 2023. A real-world WebAgent with planning, long context understanding, and program synthesis. arXiv preprint arXiv:2307.12856.

Jack Hessel, Jena D Hwang, Jae Sung Park, Rowan Zellers, Chandra Bhagavatula, Anna Rohrbach, Kate Saenko, and Yejin Choi. 2022. The abduction of Sherlock Holmes: A dataset for visual abductive reasoning. In European Conference on Computer Vision, pages 558–575. Springer.

Jing Yu Koh, Robert Lo, Lawrence Jang, Vikram Duvvur, Ming Chong Lim, Po-Yu Huang, Graham Neubig, Shuyan Zhou, Ruslan Salakhutdinov, and Daniel Fried. 2024. VisualWebArena: Evaluating multimodal agents on realistic visual web tasks.

Kenton Lee, Mandar Joshi, Iulia Raluca Turc, Hexiang Hu, Fangyu Liu, Julian Martin Eisenschlos, Urvashi Khandelwal, Peter Shaw, Ming-Wei Chang, and Kristina Toutanova. 2023. Pix2Struct: Screenshot parsing as pretraining for visual language understanding. In International Conference on Machine Learning, pages 18893–18912. PMLR.

Liunian Harold Li, Mark Yatskar, Da Yin, Cho-Jui Hsieh, and Kai-Wei Chang. 2019. VisualBERT: A simple and performant baseline for vision and language. arXiv preprint arXiv:1908.03557.

Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. 2024. Visual instruction tuning. Advances in Neural Information Processing Systems, 36.

Pan Lu, Swaroop Mishra, Tanglin Xia, Liang Qiu, Kai-Wei Chang, Song-Chun Zhu, Oyvind Tafjord, Peter Clark, and Ashwin Kalyan. 2022. Learn to explain: Multimodal reasoning via thought chains for science question answering. Advances in Neural Information Processing Systems, 35:2507–2521.

Pan Lu, Baolin Peng, Hao Cheng, Michel Galley, Kai-Wei Chang, Ying Nian Wu, Song-Chun Zhu, and Jianfeng Gao. 2023. Chameleon: Plug-and-play compositional reasoning with large language models. arXiv preprint arXiv:2304.09842.

Kaixin Ma, Hongming Zhang, Hongwei Wang, Xiaoman Pan, and Dong Yu. 2023. LASER: LLM agent with state-space exploration for web navigation. arXiv preprint arXiv:2309.08172.

Grégoire Mialon, Clémentine Fourrier, Craig Swift, Thomas Wolf, Yann LeCun, and Thomas Scialom. 2023. GAIA: A benchmark for general AI assistants. arXiv preprint arXiv:2311.12983.

Reiichiro Nakano, Jacob Hilton, Suchir Balaji, Jeff Wu, Long Ouyang, Christina Kim, Christopher Hesse, Shantanu Jain, Vineet Kosaraju, William Saunders, et al. 2021. WebGPT: Browser-assisted question-answering with human feedback. arXiv preprint arXiv:2112.09332.

OpenAI. 2023. GPT-4 technical report.

OpenAI. 2024. Hello GPT-4o.

Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. 2022. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35:27730–27744.

Yujia Qin, Shihao Liang, Yining Ye, Kunlun Zhu, Lan Yan, Yaxi Lu, Yankai Lin, Xin Cong, Xiangru Tang, Bill Qian, et al. 2023. ToolLLM: Facilitating large language models to master 16000+ real-world APIs. arXiv preprint arXiv:2307.16789.

Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. 2020. Exploring the limits of transfer learning with a unified text-to-text transformer. The Journal of Machine Learning Research, 21(1):5485–5551.

Timo Schick, Jane Dwivedi-Yu, Roberto Dessì, Roberta Raileanu, Maria Lomeli, Luke Zettlemoyer, Nicola Cancedda, and Thomas Scialom. 2023. Toolformer: Language models can teach themselves to use tools. arXiv preprint arXiv:2302.04761.

Peter Shaw, Mandar Joshi, James Cohan, Jonathan Berant, Panupong Pasupat, Hexiang Hu, Urvashi Khandelwal, Kenton Lee, and Kristina Toutanova. 2023. From pixels to UI actions: Learning to follow instructions via graphical user interfaces. arXiv preprint arXiv:2306.00245.

Tianlin Shi, Andrej Karpathy, Linxi Fan, Jonathan Hernandez, and Percy Liang. 2017. World of Bits: An open-domain platform for web-based agents. In International Conference on Machine Learning, pages 3135–3144. PMLR.

Noah Shinn, Federico Cassano, Ashwin Gopinath, Karthik R Narasimhan, and Shunyu Yao. 2023. Reflexion: Language agents with verbal reinforcement learning. In Thirty-seventh Conference on Neural Information Processing Systems.

Gemini Team, Rohan Anil, Sebastian Borgeaud, Yonghui Wu, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M Dai, Anja Hauth, et al. 2023. Gemini: A family of highly capable multimodal models. arXiv preprint arXiv:2312.11805.
Yizhong Wang, Yeganeh Kordi, Swaroop Mishra, Alisa Liu, Noah A Smith, Daniel Khashabi, and Hannaneh Hajishirzi. 2022. Self-Instruct: Aligning language model with self generated instructions. arXiv preprint arXiv:2212.10560.

Zirui Wang, Jiahui Yu, Adams Wei Yu, Zihang Dai, Yulia Tsvetkov, and Yuan Cao. 2021. SimVLM: Simple visual language model pretraining with weak supervision. arXiv preprint arXiv:2108.10904.

Lilian Weng. 2023. LLM-powered autonomous agents. lilianweng.github.io.

An Yan, Zhengyuan Yang, Wanrong Zhu, Kevin Lin, Linjie Li, Jianfeng Wang, Jianwei Yang, Yiwu Zhong, Julian McAuley, Jianfeng Gao, Zicheng Liu, and Lijuan Wang. 2023. GPT-4V in Wonderland: Large multimodal models for zero-shot smartphone GUI navigation.

Jianwei Yang, Hao Zhang, Feng Li, Xueyan Zou, Chunyuan Li, and Jianfeng Gao. 2023a. Set-of-Mark prompting unleashes extraordinary visual grounding in GPT-4V. arXiv preprint arXiv:2310.11441.

Zhengyuan Yang, Linjie Li, Kevin Lin, Jianfeng Wang, Chung-Ching Lin, Zicheng Liu, and Lijuan Wang. 2023b. The dawn of LMMs: Preliminary explorations with GPT-4V(ision). arXiv preprint arXiv:2309.17421, 9(1).

Shunyu Yao, Howard Chen, John Yang, and Karthik Narasimhan. 2022a. WebShop: Towards scalable real-world web interaction with grounded language agents.

Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. 2022b. ReAct: Synergizing reasoning and acting in language models. arXiv preprint arXiv:2210.03629.

Da Yin, Faeze Brahman, Abhilasha Ravichander, Khyathi Chandu, Kai-Wei Chang, Yejin Choi, and Bill Yuchen Lin. 2023. Lumos: Learning agents with unified data, modular design, and open-source LLMs. arXiv preprint arXiv:2311.05657.

Rowan Zellers, Yonatan Bisk, Ali Farhadi, and Yejin Choi. 2019. From recognition to cognition: Visual commonsense reasoning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6720–6731.

Chi Zhang, Zhao Yang, Jiaxuan Liu, Yucheng Han, Xin Chen, Zebiao Huang, Bin Fu, and Gang Yu. 2023. AppAgent: Multimodal agents as smartphone users.

Boyuan Zheng, Boyu Gou, Jihyung Kil, Huan Sun, and Yu Su. 2024. GPT-4V(ision) is a generalist web agent, if grounded. arXiv preprint arXiv:2401.01614.

Shuyan Zhou, Frank F Xu, Hao Zhu, Xuhui Zhou, Robert Lo, Abishek Sridhar, Xianyi Cheng, Yonatan Bisk, Daniel Fried, Uri Alon, et al. 2023. WebArena: A realistic web environment for building autonomous agents. arXiv preprint arXiv:2307.13854.

Zhengxia Zou, Keyan Chen, Zhenwei Shi, Yuhong Guo, and Jieping Ye. 2023. Object detection in 20 years: A survey. Proceedings of the IEEE.
A Prompt for WebVoyager

The System Prompt for WebVoyager is shown in Figure 7. The prompt's guidelines leave room for optimization but should remain generalist rather than website-specific in design; incorporating issues specific to individual websites into the system prompt may compromise the agent's universality.

B Prompt for Auto Evaluation

Figure 8 demonstrates using GPT-4V as an evaluator for web tasks, involving the web task instruction, the screenshots in the trajectory, and the WebVoyager responses. We require GPT-4V to mark success or not success. The temperature is set to 0 to reduce randomness during evaluation.

C Action Space

We detail the interaction actions that WebVoyager employs to navigate and operate within web environments. These actions are fundamental to how the agent interacts with web pages, retrieves information, and performs specific tasks as part of its operational protocol. The actions range from basic web navigation to more complex operations, tailored to efficiently gather data and respond to queries. Each action is designed with a specific format for easy identification and execution.

• Click. This action involves clicking on an element within a webpage, typically a link or a button. If clicking a link results in the download of a PDF file, we automatically parse its content using the OpenAI Assistant API8 and incorporate it into the Observation. Action Format: Click [Numerical_Label].

• Input. This is a composite action that involves selecting a text box, deleting any existing content within it, and then inputting new content. To minimize interaction frequency, an automatic ENTER key press follows the input completion. Action Format: Type [Numerical_Label]; [Content].

• Scroll. Scrolling is a common operation for browsing webpages, usually involving the vertical movement of the entire page. However, there are instances where only a specific section within the webpage is scrollable. In such cases, we expect the agent to select an element within the scrollable area and navigate to that particular region for scrolling. Action Format: Scroll [Numerical_Label or WINDOW]; [up or down].

• Wait. Action execution requires time, and this action is often used to wait for web pages to load. Action Format: Wait.

• Back. This action is used to return to the previous page. We consider a forward action unnecessary because it can be achieved by repeating previous actions. Action Format: GoBack.

• Jump to Search Engine. There are often situations where agents get stuck at a certain website without finding an answer. This action enables the agent to jump to a search engine and start anew. In this work, we adopt Google Search. Action Format: Google.

• Answer. Once all questions in the task are resolved, this action concludes the iteration and provides an answer in line with the task requirements. Action Format: ANSWER; [Content].

8 https://platform.openai.com/docs/assistants/overview

D Additional Trajectories

In Figure 4, we demonstrate how WebVoyager automatically navigates the Apple website and accomplishes the task. In this section, we exhibit the trajectories for the remaining websites where web tasks were completed successfully. We provide a screenshot for each step, accompanied by the action generated by WebVoyager. The specific navigation trajectories for each website are illustrated in Figures 9 to 22. In addition, we also explore the performance of WebVoyager on websites in other languages, and we provide two example trajectories in Chinese and Spanish in Figure 23 and Figure 24.

E Additional Related Work

Vision-based Agents Concurrent to our work, a few related works also studied vision-based autonomous agents. VisualWebArena (Koh et al., 2024) extends WebArena with additional websites and tasks that focus on visual reasoning to facilitate research on vision-based web agents. SeeClick (Cheng et al., 2024) focused on finetuning an LMM to solely leverage screenshots as inputs to interact
Imagine you are a robot browsing the web, just like humans. Now you need to complete a task. In each iteration,
you will receive an Observation that includes a screenshot of a webpage and some texts. This screenshot will
feature Numerical Labels placed in the TOP LEFT corner of each Web Element. Carefully analyze the visual
information to identify the Numerical Label corresponding to the Web Element that requires interaction, then follow
the guidelines and choose one of the following actions:
1. Click a Web Element.
2. Delete existing content in a textbox and then type content.
3. Scroll up or down.
...

Correspondingly, Action should STRICTLY follow the format:


- Click [Numerical_Label]
- Type [Numerical_Label]; [Content]
- Scroll [Numerical_Label or WINDOW]; [up or down]
- Wait
- GoBack
- Google
- ANSWER; [content]

Key Guidelines You MUST follow:


* Action guidelines *
1) Execute only one action per iteration.
...
* Web Browsing Guidelines *
1) Don't interact with useless web elements like Login, Sign-in, donation that appear in Webpages.
...

Your reply should strictly follow the format:


Thought: {Your brief thoughts (briefly summarize the info that will help ANSWER)}
Action: {One Action format you choose}
Then the User will provide:
Observation: {A labeled screenshot Given by User}
Figure 7: System Prompt for WebVoyager. The prompt instructs the agent to perform web navigation and specifies the available browsing actions and their formats. To enhance efficiency and accuracy, we can incorporate additional general guidelines into the prompts. These guidelines should be generic and not about a specific website to ensure generalizability.
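The action formats listed in the prompt of Figure 7 ultimately have to be grounded in concrete browser operations. The following is a minimal, hypothetical sketch of how such action strings could be dispatched with Selenium; the execute_action helper and the labeled_elements mapping (numerical label to WebElement) are illustrative assumptions, not the exact implementation used by WebVoyager.

from selenium.webdriver import Chrome
from selenium.webdriver.common.keys import Keys

def execute_action(driver: Chrome, labeled_elements: dict, action: str) -> None:
    """Dispatch one WebVoyager-style action string to the browser (illustrative only)."""
    if action.startswith("Click"):
        # e.g. "Click [7]"
        label = int(action.split("[")[1].rstrip("]"))
        labeled_elements[label].click()
    elif action.startswith("Type"):
        # e.g. "Type [7]; cheap hotels in Jakarta"
        head, content = action.split(";", 1)
        label = int(head.split("[")[1].rstrip("] "))
        box = labeled_elements[label]
        box.clear()                      # delete any existing content
        box.send_keys(content.strip())
        box.send_keys(Keys.ENTER)        # automatic ENTER press after typing
    elif action.startswith("Scroll"):
        # e.g. "Scroll [WINDOW]; down" or "Scroll [12]; up"
        head, direction = action.split(";", 1)
        target = head.split("[")[1].rstrip("] ")
        offset = 500 if "down" in direction else -500
        if target.upper() == "WINDOW":
            driver.execute_script("window.scrollBy(0, arguments[0]);", offset)
        else:
            driver.execute_script("arguments[0].scrollBy(0, arguments[1]);",
                                  labeled_elements[int(target)], offset)
    elif action.startswith("Wait"):
        import time
        time.sleep(5)                    # wait for the page to load
    elif action.startswith("GoBack"):
        driver.back()
    elif action.startswith("Google"):
        driver.get("https://www.google.com")
    # "ANSWER; ..." terminates the episode and is handled outside the browser loop.

After each dispatched action, a fresh labeled screenshot would be captured and returned to the agent as the next Observation.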

WebVLN (Chen et al., 2023) introduced a web simulator that provides both HTML text and screenshots to finetune supervised vision-language models. GPT-4V Wonderland (Yan et al., 2023) and AppAgent (Zhang et al., 2023) instead focus on building agents that can operate smartphone apps using GPT-4V as the backbone. These works further underscore the promising prospects in this field.

Large Multimodal Models. In recent years, significant strides have been made in unifying image and text representations within a single multimodal model through joint training with image and text (Li et al., 2019; Dosovitskiy et al., 2020; Wang et al., 2021; Dai et al., 2022; Aghajanyan et al., 2022). Large Multimodal Models (LMMs), following in the footsteps of Large Language Models (Brown et al., 2020; Chen et al., 2021; Chowdhery et al., 2023), attain the capability of instruction following (Ouyang et al., 2022) and exhibit robust multimodal comprehension. Represented by GPT-4V (OpenAI, 2023) and Gemini (Team et al., 2023), LMMs have demonstrated impressive performance on benchmarks (Goyal et al., 2017; Lu et al., 2022; Zellers et al., 2019; Hessel et al., 2022), establishing a foundation for the construction of multimodal agents in subsequent research.

F Error Cases

In this section, we provide specific examples of the four types of errors mentioned in the Error Analysis section. Figure 25 illustrates a snippet of WebVoyager navigating Google Flights, where a Visual Grounding issue appears. The task is to retrieve one-way flight information for January 22nd; however, the agent selects December 22nd on the calendar and fails to make the necessary correction. Although it attempts to modify the date in step 6, it ultimately fails to do so. Figure 26 illustrates a situation in which WebVoyager navigates Allrecipes and encounters the Navigation Stuck issue. The agent requires multiple downward scrolls to locate the correct ingredients. However, it experiences confusion during the process, and it is uncertain whether to scroll up or down.
As an evaluator, you will be presented with three primary components to assist you in your role:

1. Web Task Instruction: This is a clear and specific directive provided in natural language, detailing the online activity to be carried
out. These requirements may include conducting searches, verifying information, comparing prices, checking availability, or any other
action relevant to the specified web service (such as Amazon, Apple, ArXiv, BBC News, Booking etc).

2. Result Screenshots: This is a visual representation of the screen showing the result or intermediate state of performing a web task.
It serves as visual proof of the actions taken in response to the instruction.

3. Result Response: This is a textual response obtained after the execution of the web task. It serves as textual result in response to
the instruction.

-- You DO NOT NEED to interact with web pages or perform actions such as booking flights or conducting searches on websites.
-- You SHOULD NOT make assumptions based on information not presented in the screenshot when comparing it to the instructions.
-- Your primary responsibility is to conduct a thorough assessment of the web task instruction against the outcome depicted in the
screenshot and in the response, evaluating whether the actions taken align with the given instructions.
-- NOTE that the instruction may involve more than one task, for example, locating the garage and summarizing the review. Failing to
complete either task, such as not providing a summary, should be considered unsuccessful.
-- NOTE that the screenshot is authentic, but the response provided by LLM is generated at the end of web browsing, and there may
be discrepancies between the text and the screenshots.
-- Note the difference: 1) Result response may contradict the screenshot, then the content of the screenshot prevails, 2) The content
in the Result response is not mentioned on the screenshot, choose to believe the content.

You should elaborate on how you arrived at your final evaluation and then provide a definitive verdict on whether the task has been
successfully accomplished, either as 'SUCCESS' or 'NOT SUCCESS'.

Figure 8: System Prompt for Auto Evaluation by GPT-4V.
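As an illustration of how the evaluation prompt in Figure 8 might be invoked programmatically, the sketch below packages the task instruction, the agent's final response, and the result screenshots into a single GPT-4V request and checks the verdict. The auto_evaluate helper, the model identifier, and the screenshot handling are assumptions for illustration, not the paper's exact evaluation script.

import base64
from openai import OpenAI

client = OpenAI()

def auto_evaluate(system_prompt: str, task: str, response: str,
                  screenshot_paths: list) -> bool:
    """Return True if the GPT-4V judge reports a SUCCESS verdict (illustrative only)."""
    content = [{"type": "text",
                "text": f"Web Task Instruction: {task}\nResult Response: {response}"}]
    for path in screenshot_paths:
        with open(path, "rb") as image_file:
            b64 = base64.b64encode(image_file.read()).decode("utf-8")
        content.append({"type": "image_url",
                        "image_url": {"url": f"data:image/png;base64,{b64}"}})
    reply = client.chat.completions.create(
        model="gpt-4-vision-preview",    # assumed model identifier
        messages=[{"role": "system", "content": system_prompt},
                  {"role": "user", "content": content}],
        max_tokens=512,
    )
    verdict = reply.choices[0].message.content
    return "NOT SUCCESS" not in verdict and "SUCCESS" in verdict

The final substring check mirrors the prompt's requirement that the judge end with a definitive 'SUCCESS' or 'NOT SUCCESS' verdict.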

Figure 27 depicts the Hallucination issue encountered by WebVoyager on the Coursera website. In the task, we query the number of quizzes in the "Artificial Intelligence for Healthcare" course. However, the agent only identifies the quiz in module 1 of the course, which is not the complete answer and does not fulfill the task requirements. Figure 28 illustrates the issue of Prompt Misalignment encountered while browsing BBC News. WebVoyager fails to fulfill all the task requirements: instead of completing the navigation, it provides a partial answer and tells the user how to find the complete answer, which is not an end-to-end behavior.
Step 1: Click [2] Step 2: Type [2]; Baked Salmon Step 3: Scroll down

Step 4: Click [6] Step 5: Scroll down Step 6: ANSWER

Figure 9: Screenshots of a complete trajectory of browsing Allrecipes. Given the task: “Find a recipe for Baked
Salmon that takes less than 30 minutes to prepare and has at least a 4-star rating based on user reviews.” The agent
interacts with the Allrecipes website and obtains the answer: “The "Baked Dijon Salmon" recipe meets the user’s
criteria, with a 4.6-star rating and a preparation time of 15 minutes.”

Step 1: Type [2]; green Xbox Wireless controller 4 stars Step 2: Click [33] Step 3: ANSWER

Figure 10: Screenshots of a complete trajectory of browsing Amazon. Given the task: “Search for an Xbox Wireless
controller with green color and rated above 4 stars.” The agent interacts with the Amazon website and obtains the
answer: “The green Xbox Wireless controller ("Xbox Core Wireless Gaming Controller – Velocity Green") rated
above 4 stars has been found on Amazon with a rating of 4.7 out of 5 stars.”
Step 1: Click [5] Step 2: Click [8] Step 3: Click [9]

Step 4: Click [10] Step 5: Click [13] Step 6: Click [34]

Step 7: ANSWER

Figure 11: Screenshots of a complete trajectory of browsing ArXiv. Given the task: “For non-English submissions,
do I need to provide a multi-language abstract, if needed, answer the separator between the multiple abstracts.” The
agent interacts with the ArXiv website and obtains the answer: “Line ("-----")”

Step 1: Click [10] Step 2: Click [17] Step 3: ANSWER

Figure 12: Screenshots of a complete trajectory of browsing BBC News. Given the task: “Find out which musician
made the headlines in Music News.” The agent interacts with the BBC News website and obtains the answer: “The
musician who made the headlines in Music News is Taylor Swift.”
Step 1: Type [15]; Jakarta Step 2: Click [60] Step 3: Click [63]

Step 4: Click [99] Step 5: Scroll down Step 6: Scroll up

Step 7: Click [41] Step 8: Click [54] Step 9: ANSWER

Figure 13: Screenshots of a complete trajectory of browsing Booking. Given the task: “Find the cheapest available
hotel room for a three-night stay from 1st Jan in Jakarta. The room is for 2 adults, just answer the cheapest hotel
room and the price.” The agent interacts with the Booking website and obtains the answer: “The cheapest hotel
room is at OYO 3755 Sweet Home for US$14 for a three-night stay.”
Step 1: Click [24] Step 2: Click [25] Step 3: Click [21]

Step 4: Click [3] Step 5: Click [22] Step 6: Click [25]

Step 7: Click [27] Step 8: Click [26] Step 9: Click [25]

Step 10: Click [27] Step 11: Click [26] Step 12: ANSWER

Figure 14: Screenshots of a complete trajectory of browsing Cambridge Dictionary. Given the task: “Go to the
Plus section of Cambridge Dictionary, find Image quizzes and do an easy quiz about Animals and tell me your final
score.” The agent interacts with the Cambridge Dictionary website and obtains the answer: “The final score on the
easy Animals quiz is 6 out of 6.”
Step 1: Click [2] Step 2: Type [0]; Introduction to Finance: The Basics Step 3: Click [17]

Step 4: Type [2]; Xi Yang Step 5: ANSWER

Figure 15: Screenshots of a complete trajectory of browsing Coursera. Given the task: “Identify a course on
Coursera named ’Introduction to Finance: The Basics’, who is the course instructor, and what other courses does
he/she teach.” The agent interacts with the Coursera website and obtains the answer: The course instructor for
‘Introduction to Finance: The Basics’ is Xi Yang. Xi Yang also teaches ‘Introduction to Finance: The Role of
Financial Markets’ and ‘Introduction to Financial Analysis - The "Why?"’

Step 1: Click [16] Step 2: Click [23] Step 3: Scroll down

Step 4: Scroll down Step 5: ANSWER

Figure 16: Screenshots of a complete trajectory of browsing ESPN. Given the task: “How many NBA teams are
there and list all the teams with ‘New’ in their name.” The agent interacts with the ESPN website and obtains the
answer: “There are 30 NBA teams. The teams with ‘New’ in their name are the New York Knicks and the New
Orleans Pelicans.”
Step 1: Click [5] Step 2: Type [0]; climate change data visualization Step 3: Click [29]

Step 4: Click [52] Step 5: ANSWER

Figure 17: Screenshots of a complete trajectory of browsing GitHub. Given the task: “Search for an open-source
project related to ‘climate change data visualization’ on GitHub and report the project with the most stars.” The
agent interacts with the GitHub website and obtains the answer: “resource-watch/resource-watch with 63 stars”

Step 1: Click [2] Step 2: Type [7]; Boston Logan Airport Step 3: Type [8]; North Station

Step 4: ANSWER

Figure 18: Screenshots of a complete trajectory of browsing Google Map. Given the task: “Plan a trip from Boston
Logan Airport to North Station.” The agent interacts with the Google Map website and obtains the answer: “The
fastest route from Boston Logan International Airport to North Station is via MA-1A S and takes approximately 8
minutes in current traffic conditions.”
Step 1: Click [10] Step 2: Click [12] Step 3: Type [13]; Dublin

Step 4: Type [14]; Athens, Greece Step 5: Click [16] Step 6: Click [41]

Step 7: Click [78] Step 8: Click [19] Step 9: Click [34]

Step 10: Click [2] Step 11: ANSWER

Figure 19: Screenshots of a complete trajectory of browsing Google Flights. Given the task: “Search a one-way
flight from Dublin To Athens Greece for 1 Adult that leaves on December 30 and analyse the price graph for the
next 2 months.” The agent interacts with the Google Flights website and obtains the answer: “The price graph has
been analyzed, and it shows the trends for flight prices from Dublin to Athens over the next two months.”
Step 1: Type [7]; comedy movies sorted by user ratings Step 2: Click [29] Step 3: Scroll down

Step 4: Scroll down Step 5: ANSWER

Figure 20: Screenshots of a complete trajectory of browsing Google Search. Given the task: “Show me a list of
comedy movies, sorted by user ratings. Show me the Top 5 movies.” The agent interacts with the Google Search
website and obtains the answer: “The Top 5 comedy movies sorted by user ratings are: Life Is Beautiful; Back to the
Future; The Intouchables; City Lights; Modern Times”

Step 1: Click [1] Step 2: Click [15] Step 3: Click [30]

Step 4: Click [68] Step 5: Click [70] Step 6: ANSWER

Figure 21: Screenshots of a complete trajectory of browsing Huggingface. Given the task: “Look up a model
with a license of cc-by-sa-4.0 with the most likes on Huggingface.” The agent interacts with the Huggingface
website and obtains the answer: “The model with a license of cc-by-sa-4.0 with the most likes on Hugging Face is
"replit/replit-code-v1-3b" with 703 likes.”
Step 1: Type [8]; x^5-20x^4+163x^3-676x^2+1424x-1209 Step 2: Scroll down Step 3: ANSWER

Figure 22: Screenshots of a complete trajectory of browsing Wolfram Alpha. Given the task: “Simplify x^5-
20x^4+163x^3-676x^2+1424x-1209 so that it has fewer items.” The agent interacts with the Wolfram Alpha website
and obtains the answer: “(x - 4)^5 + 3(x - 4)^3 + 7”
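The simplified form in Figure 22 can be verified by direct expansion:

(x - 4)^5 + 3(x - 4)^3 + 7
= (x^5 - 20x^4 + 160x^3 - 640x^2 + 1280x - 1024) + (3x^3 - 36x^2 + 144x - 192) + 7
= x^5 - 20x^4 + 163x^3 - 676x^2 + 1424x - 1209,

which recovers the original polynomial.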

Step 1: Type [13]; 杭州 Step 2: Type [15]; 深圳 Step 3: Click [16]

Step 4: Click [58] Step 5: Click [66] Step 6: Click [80]

TASK: 请帮我查找一个航班,2024年6月15日杭州飞往深圳,6月20日返回,晚上出发,回答一个符合要求的航班,并回答其具体出发时间,航空公司以及价格。
(Find a flight from Hangzhou to Shenzhen on June 15, 2024, returning on June 20 and departing at night; answer with one flight that meets the requirements, including its specific departure time, airline, and price.)

ANSWER: 出发时间:17:35,航空公司:深航,价格:HK$2,680
(Departure time: 17:35; airline: Shenzhen Airlines; price: HK$2,680)

Step 7: Click [22] Step 8: ANSWER

Figure 23: Screenshots of a complete trajectory of browsing Google Flights in Chinese. Given the task: “Find
a flight from Hangzhou to Shenzhen on June 15, 2024, returning on June 20, departing at night, answer a flight
that meets the requirements, and answer its specific departure time, airline and price.” The agent interacts with
the Google Flights website and obtains the answer: “Departure time: 17:35, airline: Shenzhen Airlines, price:
HK$2,680”
Step 1: Type [14]; sostenibilidad Step 2: Click [20] Step 3: ANSWER

TASK: Busca la pronunciación y la definición de la palabra "sostenibilidad" en el diccionario.


(Look up the pronunciation and definition of the word "sustainability" in the dictionary.)

ANSWER: The pronunciation of "sostenibilidad" is /sosteniβiliˈðað/ and it is a feminine noun. The definition relates
to the quality of being able to continue over a period of time, causing minimal damage to the environment, which
implies the term is equivalent to "sustainability" in English.

Figure 24: Screenshots of a complete trajectory of browsing Cambridge Dictionary in Spanish. The task description and the answer are shown in the figure.

Step 1: Click [34] Step 2: Click [13] Step 3: Click [14]

Step 4: Click [16] Step 5: Click [78] Step 6: Click [16]

Figure 25: An error case for Google Flights. Given the task: “Find the lowest fare from all eligible one-way flights for 1 adult from JFK to Heathrow on Jan. 22.” The agent fails to select the correct numerical label even though it intends to select January 22.
Step 1: Scroll down Step 2: Scroll down Step 3: Scroll down

Step 4: Scroll up Step 5: Scroll up Step 6: Scroll down

Figure 26: An error case for Allrecipes. Given the task: “Search for a recipe for Beef Wellington on Allrecipes that has at least 200 reviews and an average rating of 4.5 stars or higher. List the main ingredients required for the dish.” The agent fails to scroll the page correctly and cannot locate the ingredients.

Step 1: Click [2] Step 2: Type [0]; Artificial Intelligence for Healthcare Step 3: Click [17]

Step 4: Click [13] Step 5: Click [8] Step 6: ANSWER

Figure 27: An error case for Coursera. Given the task: “Identify a course on Coursera related to ’Artificial Intelligence for Healthcare’ and note the course duration along with the number of quizzes in Assessments.” The agent provides the answer: “Module 1 includes 3 quizzes in the Assessments.” However, the task also requires the other modules, and the agent fails to provide those results.
Step 1: Click [44] Step 2: Click [24] Step 3: ANSWER

Figure 28: An error case for BBC News. Given the task: “Find out how many teams are in the Scottish Premiership
of the Football Tournament and when did the Hibernian team’s most recent match start?” The agent provides the
answer: “There are 12 teams in the Scottish Premiership. To find out the exact start time of Hibernian’s most recent
match, further interaction with the website would be required.” The agent knows that the task is not yet complete, but it ends its navigation early, even though it could have found the Hibernian team’s most recent match.
