Computer Use Agents Cost 45x More Than Structured APIs
Reflex published a benchmark comparing two approaches to letting an AI agent operate the same admin panel: vision-based computer use (screenshots and clicks) versus direct structured API calls.
The task — find a customer named “Smith” with the most orders, locate their most recent pending order, accept all pending reviews, and mark the order as delivered — touches three resources, requires filtering, pagination, and cross-entity lookups. It represents the shape of work internal tools handle daily.
Both agents used Claude Sonnet with the same dataset and application logic. Only the interface varied.
The API agent completed the task in 8 tool calls and roughly 20 seconds, with identical results across every run. The vision agent, given the same prompt, couldn’t even finish: it found one pending review but missed three others below the visible fold of the reviews page. With a manually written 14-step walkthrough specifying every navigation step, the vision agent succeeded — but took 14-22 minutes, consumed 400,000-750,000 input tokens, and cost 45x more than the API path.
The numbers tell the story. API agent (Sonnet): 8 steps, 20 seconds, ~12k input tokens. Vision agent: 53 steps on average, 1,003 seconds, 550k input tokens, with 3x variance across trials. Using Haiku for the API path widened the gap further.
The finding challenges a common assumption in agent architecture: teams default to computer-use vision agents not because they are better, but because building API surfaces for every internal tool is expensive. This benchmark makes the hidden costs of that choice visible — and suggests that for teams building agent workflows, the engineering investment in structured interfaces may pay for itself quickly in reduced token consumption.
As Reflex notes, the 14-step walkthrough required to make the vision agent work is itself an engineering cost that doesn’t show up in token counts.