Computer Use Is 45x More Expensive Than Structured APIs

We ran a benchmark comparing two ways of letting an AI agent operate the same admin panel, with the goal of putting a price tag on vision agents (browser-use, computer-use).
Here is what we measured, what we had to change to make the vision agent work at all, and what changes when generating an API surface stops being a separate engineering project.
Why vision agents?
Vision agents are the default for letting AI agents operate web apps that don't expose APIs. The alternative, writing an MCP or REST surface per app, is its own engineering project multiplied across the 20+ internal tools most teams have. Teams pick vision not because it is better, but because the alternative is too expensive to build, and so the cost of the vision approach gets treated as a fixed price.
We wanted to measure the price.
The setup
The test app is an admin panel for managing customers, orders, and reviews, modeled on the react-admin Posters Galore demo. Two agents target the same running app: one drives the UI via screenshots and clicks, the other calls the app's HTTP endpoints directly. Same Claude Sonnet, same pinned dataset, same task. The interface is the only variable.
The task: find the customer named "Smith" with the most orders, locate their most recent pending order, accept all of their pending reviews, and mark the order as delivered. It touches three resources and requires filtering, pagination, cross-entity lookups, and both reads and writes. It is the shape of work a typical internal tool sees daily.
Path A: Vision agent. Claude Sonnet driving the UI via browser-use 0.12. Vision mode, taking screenshots and executing clicks.
Path B: API agent. Claude Sonnet with tool-use, calling the handlers the UI calls. Each tool maps to one or more event handlers on the app's State, the same functions a button click would trigger. The agent gets the structured response back instead of a rendered page.
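As a rough sketch of how Path B is wired, each tool is a JSON-schema wrapper around a handler, passed to the Anthropic Messages API. The tool names, fields, and model id below are illustrative assumptions, not the repo's exact schema:

```python
# Illustrative sketch of the API agent's tool surface (names and model id
# are assumptions, not copied from the benchmark repo).
import anthropic

client = anthropic.Anthropic()

TOOLS = [
    {
        "name": "list_reviews",
        "description": "List reviews, optionally filtered by customer and status.",
        "input_schema": {
            "type": "object",
            "properties": {
                "customer_id": {"type": "integer"},
                "status": {"type": "string", "enum": ["pending", "accepted", "rejected"]},
                "page": {"type": "integer"},
            },
        },
    },
    # ... list_customers, list_orders, update_review, update_order ...
]

response = client.messages.create(
    model="claude-sonnet-4-5",  # assumption: any Sonnet model id works here
    max_tokens=1024,
    tools=TOOLS,
    messages=[{"role": "user", "content": "Accept all pending reviews for the customer named Smith."}],
)
```

The runner executes the named handler and feeds its return value back to the model as a tool result, so the agent never sees a rendered page.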
All code is open source.
The vision agent couldn't complete the task
We started by giving both agents the same six-sentence task above and seeing what happened.
The API agent completed it in 8 calls. It listed the customer's reviews filtered by pending status, accepted each one, and marked the order as delivered. Both agents are calling into the same application logic; the API agent just reads the structured response directly instead of looking at a rendered page.
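The actual traces are in the repo's raw results; one plausible decomposition of the task into eight calls, using the illustrative tool names from the sketch above, looks like this (ids and filters are made up):

```python
# Hypothetical 8-call trace; the real traces are in the benchmark repo.
calls = [
    ("list_customers", {"last_name": "Smith", "sort": "-total_orders"}),  # find the right Smith
    ("list_orders",    {"customer_id": 42, "status": "pending"}),         # most recent pending order
    ("list_reviews",   {"customer_id": 42, "status": "pending"}),         # all four pending reviews
    ("update_review",  {"id": 101, "status": "accepted"}),
    ("update_review",  {"id": 102, "status": "accepted"}),
    ("update_review",  {"id": 103, "status": "accepted"}),
    ("update_review",  {"id": 104, "status": "accepted"}),
    ("update_order",   {"id": 7,   "status": "delivered"}),               # close out the order
]
```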
The vision agent, on the same prompt, found one of four pending reviews, accepted it, and moved on. It never paginated. The remaining three reviews were below the visible fold of the reviews page and the agent had no signal to scroll for them.
This is not a model problem. The vision agent was reasoning about a rendered page and had no signal that the page wasn't showing everything. The API agent calls the same handler the UI calls, but the response includes the full result set the handler returned, not just the rows currently rendered. The agent reads "page 1 of 4 with 50 results per page" directly instead of having to interpret pagination controls from pixels.
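Concretely, "reads the structured response" means a list call comes back as data plus pagination metadata, so a partial page is explicit rather than something to infer from a scrollbar. A hypothetical response shape (field names and counts are illustrative, not the repo's exact payload):

```python
# Hypothetical structured response for a paginated list call (illustrative only).
response = {
    "data": [
        {"id": 101, "customer_id": 42, "status": "pending", "rating": 2},
        # ... remaining rows on this page ...
    ],
    "page": 1,
    "per_page": 50,
    "total_pages": 4,
    "total": 187,  # the agent sees immediately that more pages exist
}
```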
With a 14-step walkthrough, it succeeded
To make the comparison apples-to-apples, we rewrote the vision prompt as an explicit UI walkthrough, naming the sidebar items, tabs, and form fields the agent should interact with at each step. Fourteen numbered instructions covering the navigation the agent had failed to figure out on its own.
With the walkthrough, the vision agent completed the task. It also ran for fourteen minutes and consumed about half a million input tokens.
The walkthrough is itself a finding. Each numbered instruction is engineering work that doesn't show up in token counts but represents real cost. Anyone deploying a vision agent against an internal tool is either writing prompts at this level of specificity or accepting that the agent will silently miss work.
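For a sense of the specificity involved, the opening of such a walkthrough looks roughly like this (the wording is invented for illustration, not the prompt from the repo):

```python
# Illustrative excerpt of the 14-step UI walkthrough prompt (invented wording).
WALKTHROUGH = """
1. Click "Customers" in the left sidebar.
2. Type "Smith" into the search box and press Enter.
3. Sort the results by number of orders, descending, and open the top customer.
4. Open the "Orders" tab and click the most recent order with status "pending".
5. ...
"""
```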
How we ran it
We ran the API path five times and the vision path three times. The vision path was capped at three trials because each run takes 14-22 minutes and consumes 400-750k tokens.
Variance was the most surprising part of the vision results. Across three trials the wall-clock time spanned 749s to 1257s, and input tokens spanned 407k to 751k. The agent took 43 cycles in the shortest run and 68 in the longest. The screenshot-reason-click loop has enough non-determinism that a single run is not a representative cost estimate.
The API path had no such variance. Sonnet hit identical 8 tool calls on every trial, with input token counts varying by ±27 across all five runs. The agent calls the same handlers in the same order because the structured responses give it no reason to deviate.
The full results
| Metric | Vision agent (Sonnet) | API (Sonnet) | API (Haiku) |
|---|---|---|---|
| Steps / calls | 53 ± 13 | 8 ± 0 | 8 ± 0 |
| Wall-clock time | 1003s ± 254s (~17 min) | 19.7s ± 2.8s | 7.7s ± 0.5s |
| Input tokens | 550,976 ± 178,849 | 12,151 ± 27 | 9,478 ± 809 |
| Output tokens | 37,962 ± 10,850 | 934 ± 41 | 819 ± 52 |
Numbers are mean ± sample standard deviation (n−1), with n=5 per API path and n=3 for the vision path. Full run details are available in the repo.
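The headline multiple falls out of the input-token column: 550,976 / 12,151 ≈ 45. The wall-clock gap is about 51x (1003s vs 19.7s), and against Haiku on the API path the token ratio widens to roughly 58x.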
Haiku could not complete the vision path. The failure was specific to browser-use 0.12's structured-output schema, which Haiku could not reliably produce in either vision or text-only mode. On the API path, Haiku finished in under 8 seconds for under 10k input tokens, which is the cheapest configuration we tested.
The structural gap
The cost difference follows directly from the architecture. An agent that must see in order to act will always pay for the seeing, regardless of how good the model gets. Better vision models reduce error rates per screenshot, but they do not reduce the number of screenshots required to reach the relevant data. Every intermediate state has to be rendered, every render is a screenshot, and every screenshot is thousands of input tokens.
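Dividing the table through makes the per-step cost concrete: 550,976 input tokens over roughly 53 cycles is about 10,000 tokens per screenshot-reason-click cycle, versus about 1,500 tokens per call on the API path (12,151 over 8).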
Both agents in this benchmark walk through the same application logic. They both filter, paginate, and update the same way the UI does. The difference is what they read at each step. The vision agent reads pixels and has to render every intermediate state to interpret it. The API agent reads the structured response from the same handlers, which already contains the data the UI was going to display.
Better models will narrow the cost per step. They will not narrow the step count, because the step count is set by the interface.
How we justify the API engineering cost
The benchmark was made cheap to run by Reflex 0.9, which includes a plugin that auto-generates HTTP endpoints from a Reflex application's event handlers. None of the structural argument depends on Reflex specifically, but it is what made running the API path possible without writing a second codebase.
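For context on what "event handler" means here, a minimal sketch of a Reflex state with the kind of handler the plugin exposes; the state, field, and handler names are invented for this sketch, and the plugin wiring itself is in the docs linked below:

```python
import reflex as rx


class AdminState(rx.State):
    """Illustrative state; field and handler names are invented for this sketch."""

    orders: list[dict] = []

    def mark_delivered(self, order_id: int):
        # The same logic a "Mark delivered" button in the UI would trigger.
        for order in self.orders:
            if order["id"] == order_id:
                order["status"] = "delivered"
```

The point of the plugin is that a handler like this, unchanged, serves as both the button's behavior in the UI and an endpoint the API agent can call.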
The interesting question is what becomes possible when the engineering cost of an API surface drops to zero. Vision agents remain the right tool for applications you do not control: third-party SaaS products, legacy systems, anything you cannot modify. For internal tools you build yourself, the math now points the other way.
Notes
- Vision results are specific to browser-use 0.12 in vision mode; other vision agents may behave differently.
- The Path B runner shapes the auto-generated endpoints into a small REST tool surface of about thirty lines, which the agent sees as list_customers, update_order, and similar.
- The dataset is pinned and small (900 customers, 600 orders, 324 reviews), so behavior on production-scale data is not measured here.
- The vision agent runs through LangChain's ChatAnthropic; the API agent runs through the Anthropic SDK directly.
- Reported token counts are uncached input tokens.
Reproduce it
All benchmark code is open-source: github.com/reflex-dev/agent-benchmark
The repo includes seed data generation, the patched react-admin demo, both agent scripts, and raw results.
Reflex agent-native plugin docs: reflex/docs/enterprise/event-handler-api.md