5 posts tagged with "computer-use"

The Browser Selector Your Agent Memorized

· 10 min read
Tian Pan
Software Engineer

Your computer-use agent had a great run last Tuesday. It logged into the vendor portal, clicked through five nested menus, exported the report, attached it to a ticket, and closed out the task in under two minutes. You saved the trace. You praised the model. You shipped the workflow. And somewhere in that successful trace, the agent committed to memory that the "Export CSV" action lives at div.toolbar > div:nth-child(2) > button.btn-secondary:nth-child(4).

By Friday, the vendor pushed a redesign. The toolbar is now a flex container, the secondary buttons are inside a dropdown, and the "Export" verb has been replaced with a download icon. Your agent's memorized path resolves to nothing — or worse, it resolves to a button that now says "Delete Account." The agent has no way to tell the difference. Both are buttons. Both are at the same selector. The trace from Tuesday is no longer a memory; it is a landmine.
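One way to defuse that landmine is to store the intent next to the path and check both before replaying. The sketch below is a minimal illustration, not any framework's API: `Element`, `MemorizedStep`, and `safe_to_replay` are hypothetical names, and `resolve` stands in for whatever lookup your browser driver provides.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class Element:
    """Hypothetical, minimal view of whatever a selector resolves to."""
    role: str
    label: str  # accessible name or visible text


@dataclass(frozen=True)
class MemorizedStep:
    """A replayable step that records what the agent meant, not only where it clicked."""
    selector: str
    expected_role: str
    expected_label: str


def safe_to_replay(
    step: MemorizedStep,
    resolve: Callable[[str], Optional[Element]],
) -> bool:
    """Return True only if the selector still resolves to the element the agent intended."""
    element = resolve(step.selector)
    if element is None:
        # The path resolves to nothing: the layout changed, so re-ground instead of clicking.
        return False
    same_role = element.role == step.expected_role
    same_label = element.label.strip().lower() == step.expected_label.strip().lower()
    return same_role and same_label


SELECTOR = "div.toolbar > div:nth-child(2) > button.btn-secondary:nth-child(4)"
step = MemorizedStep(SELECTOR, expected_role="button", expected_label="Export CSV")

# Assumed page states for illustration: the same selector before and after a redesign.
tuesday = {SELECTOR: Element("button", "Export CSV")}
friday = {SELECTOR: Element("button", "Delete Account")}
```

With these assumed page states, `safe_to_replay(step, tuesday.get)` passes and `safe_to_replay(step, friday.get)` fails, so the Friday run falls back to re-grounding rather than clicking. An exact label match is the strictest possible check; a real system would likely need something more tolerant for cases like the label being replaced by an icon.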

The GUI Agent That Clicked the Right Button on the Wrong Screen

· 10 min read
Tian Pan
Software Engineer

A computer-use agent takes a screenshot, reasons about it, decides to click the "Confirm" button at pixel (840, 612), and dispatches the click. By the time the cursor lands, a modal has appeared. The pixel that was "Confirm" three seconds ago is now "Delete." The agent did exactly what it planned. It planned against a screen that no longer exists.

This is not a grounding error. The model correctly identified the button. It is not a reasoning error. The plan was sound. It is a timing error — the most under-instrumented failure class in GUI automation — and your test suite almost certainly does not cover it, because your test environment never moves between the observation and the action.

The uncomfortable measurement: one recent study of desktop agents on real Ubuntu workloads found a mean gap of 6.51 seconds between when an agent observes the screen and when it acts on that observation. Six and a half seconds is an eternity for a UI. Notifications fire, lazy lists finish loading, animations settle, focus shifts. The agent's mental model of the screen has a shelf life, and almost no agent framework treats it that way.
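Treating an observation as something with a shelf life can be sketched in a few lines. Everything here is an assumption made for illustration: the `Observation` type, the one-second `max_age_s` budget, and the choice to fingerprint the whole screen with SHA-256 rather than only the region around the click target.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class Observation:
    """What the agent planned against: when it looked, and a digest of what it saw."""
    captured_at: float  # seconds on a monotonic clock
    fingerprint: str


def fingerprint(screen_bytes: bytes) -> str:
    """Digest of the raw screenshot bytes (exact match, so any pixel change counts)."""
    return hashlib.sha256(screen_bytes).hexdigest()


def observation_is_fresh(
    obs: Observation,
    now: float,
    current_screen: bytes,
    max_age_s: float = 1.0,  # assumed budget, tune per application
) -> bool:
    """Allow the action only if the plan is recent and the screen has not changed since."""
    if now - obs.captured_at > max_age_s:
        return False
    return fingerprint(current_screen) == obs.fingerprint


# The agent observed a screen at t=100.0 and planned a click on "Confirm".
obs = Observation(captured_at=100.0, fingerprint=fingerprint(b"confirm-screen"))
```

The check runs immediately before the click is dispatched: if a modal has appeared, the fingerprint differs and the agent re-observes instead of clicking. A whole-screen digest is deliberately strict and will also trip on a blinking cursor, so comparing only a crop around the target is a plausible refinement.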

Browser Agent Session Bleed: When One Profile Serves Many Tenants

· 10 min read
Tian Pan
Software Engineer

A computer-use agent finishes a task on a customer's CRM, the worker pool returns the browser to its idle ring, the next request lands a few hundred milliseconds later, and the navigation to the dashboard succeeds — except it succeeds as the wrong user. The OAuth cookie from the previous session was still on the profile. The trace shows navigation succeeded, screenshot captured, action performed. Nothing in the run log says the agent was acting as someone who never asked it to.

This is the failure class that browser agents inherit silently from the libraries they're built on. Headless browser frameworks were designed for one user per profile because that's how a browser has worked for thirty years. When a worker pool reuses profiles to amortize the eight-second cold start of a fresh Chromium instance, that one-user assumption breaks, and the breakage is invisible to every layer of telemetry the team usually trusts.
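A minimal sketch of one mitigation, assuming the pool can key idle profiles by tenant: a warm profile is reused only by the tenant that created it, and everyone else pays the cold start. `TenantProfilePool` is a hypothetical name, and the returned directory is assumed to be passed to the browser as its profile location (for Chromium, the `--user-data-dir` flag).

```python
import tempfile
from collections import defaultdict
from typing import DefaultDict, Dict, List


class TenantProfilePool:
    """Hands out browser profile directories that never cross tenant boundaries."""

    def __init__(self) -> None:
        self._idle: DefaultDict[str, List[str]] = defaultdict(list)
        self._owner: Dict[str, str] = {}

    def acquire(self, tenant_id: str) -> str:
        """Reuse a warm profile only if this tenant created it; otherwise make a fresh one."""
        idle = self._idle[tenant_id]
        if idle:
            return idle.pop()
        profile_dir = tempfile.mkdtemp(prefix="agent-profile-")
        self._owner[profile_dir] = tenant_id
        return profile_dir

    def release(self, tenant_id: str, profile_dir: str) -> None:
        """Return a profile to its owner's idle list; refuse a mismatched tenant."""
        if self._owner.get(profile_dir) != tenant_id:
            raise ValueError("profile does not belong to this tenant")
        self._idle[tenant_id].append(profile_dir)


pool = TenantProfilePool()
first = pool.acquire("tenant-a")
pool.release("tenant-a", first)
other = pool.acquire("tenant-b")   # fresh directory, never tenant-a's cookies
again = pool.acquire("tenant-a")   # warm reuse within the same tenant
```

This trades some of the amortization for isolation: reuse still happens, but only within a tenant. Whether a tenant is the right isolation unit, as opposed to an individual end user, depends on how the credentials are scoped.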

Browser Agents in Production: The DOM Fragility Tax

· 13 min read
Tian Pan
Software Engineer

A calendar date picker broke a production browser agent for three days before anyone noticed. The designer had swapped a native <input type="date"> for a custom React component during a minor UI refresh. No API changed. No content moved. Just 24px cells in a new layout — and the vision model that had been reliably clicking the right dates now missed by one cell, silently booking appointments on the wrong day.

This is the DOM fragility tax: the ongoing operational cost of building automated agents on top of a web that was never designed to be operated by machines. Unlike most infrastructure taxes, it compounds. The web changes. Anti-bot defenses evolve. SPAs get more dynamic. And your agent quietly degrades.

Computer Use Agents in Production: When Pixels Replace API Calls

· 9 min read
Tian Pan
Software Engineer

Most AI agents interact with the world through structured APIs — clean JSON in, clean JSON out. But a growing class of agents has abandoned that contract entirely. Computer use agents look at screenshots, reason about what they see, and drive a mouse and keyboard like a human operator. When the only integration surface is a screen, pixels become the API.

This sounds like a party trick until you realize how much enterprise software has no API at all. Legacy ERP systems, internal admin panels, proprietary desktop applications — the GUI is the only interface. For years, robotic process automation (RPA) handled this with brittle, selector-based scripts that shattered whenever a button moved three pixels. Computer use agents promise something different: visual understanding that adapts to UI changes the way a human would.