2 posts tagged with "web-scraping"

The Browser Selector Your Agent Memorized

· 10 min read
Tian Pan
Software Engineer

Your computer-use agent had a great run last Tuesday. It logged into the vendor portal, clicked through five nested menus, exported the report, attached it to a ticket, and closed out the task in under two minutes. You saved the trace. You praised the model. You shipped the workflow. And somewhere in that successful trace, the agent committed to memory that the "Export CSV" action lives at div.toolbar > div:nth-child(2) > button.btn-secondary:nth-child(4).

By Friday, the vendor pushed a redesign. The toolbar is now a flex container, the secondary buttons are inside a dropdown, and the "Export" verb has been replaced with a download icon. Your agent's memorized path resolves to nothing — or worse, it resolves to a button that now says "Delete Account." The agent has no way to tell the difference. Both are buttons. Both are at the same selector. The trace from Tuesday is no longer a memory; it is a landmine.
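One way to picture the failure is to treat a memorized selector as a hint rather than a fact, and to check what it resolves to before acting. The sketch below is a minimal illustration under stated assumptions: `ResolvedElement` and `check_before_click` are hypothetical names, not part of any browser automation library, and the role and label would in practice come from whatever DOM or accessibility snapshot the agent already has.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class ResolvedElement:
    """Hypothetical snapshot of whatever a memorized selector resolves to today."""
    role: str   # e.g. "button"
    label: str  # visible text or accessible name


def check_before_click(
    element: Optional[ResolvedElement],
    expected_role: str,
    expected_keywords: Sequence[str],
) -> str:
    """Return "click", "missing", or "mismatch" for a replayed selector."""
    if element is None:
        # The selector resolves to nothing: the layout changed.
        return "missing"
    if element.role != expected_role:
        return "mismatch"
    label = element.label.lower()
    # The element must still describe the action the agent intended.
    if not any(keyword.lower() in label for keyword in expected_keywords):
        return "mismatch"
    return "click"


# Tuesday's page: the selector still points at the export button.
tuesday = check_before_click(ResolvedElement("button", "Export CSV"), "button", ["export"])

# Friday's page: same selector, different button.
friday = check_before_click(ResolvedElement("button", "Delete Account"), "button", ["export"])
```

Anything other than "click" would send the agent back to re-locate the control from the live page instead of replaying the stored path. Note that a keyword check like this would also reject the redesigned download icon, which is the intended behavior: the agent should re-ground rather than guess.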

Agentic Web Data Extraction at Scale: When Agents Replace Scrapers

· 10 min read
Tian Pan
Software Engineer

The demo takes 20 minutes to build. You paste a URL, an LLM reads the HTML, and structured data comes out the other end. It feels like the future of web extraction has arrived.

Then you run it at 1,000 pages per hour. Costs spiral, blocks accumulate, and extracted fields start drifting in ways that don't look like errors — they look like normal data until your downstream pipeline has silently ingested three weeks of garbage. The "LLM reads the page" pattern is not wrong; it's just priced for prototype throughput.
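Drift that looks like normal data is hard to catch record by record, but it can show up in aggregate. The following is a rough sketch, not a method taken from the post: it compares how often each field passes a simple validator in a trusted baseline batch versus the current batch. The names `pass_rate` and `drifted_fields`, the price pattern, and the 10% threshold are all illustrative assumptions.

```python
import re
from typing import Callable, Iterable, List, Mapping

# Assumed format for a price field; real validators would be domain-specific.
PRICE_PATTERN = re.compile(r"^\d+(\.\d{2})?$")


def is_price(value: object) -> bool:
    return isinstance(value, str) and bool(PRICE_PATTERN.match(value))


def pass_rate(
    records: Iterable[Mapping[str, object]],
    field: str,
    is_valid: Callable[[object], bool],
) -> float:
    """Fraction of records whose field is present and passes the validator."""
    rows = list(records)
    if not rows:
        return 0.0
    ok = sum(1 for row in rows if field in row and is_valid(row[field]))
    return ok / len(rows)


def drifted_fields(
    baseline: List[Mapping[str, object]],
    current: List[Mapping[str, object]],
    validators: Mapping[str, Callable[[object], bool]],
    max_drop: float = 0.10,  # assumed tolerance, tune per field
) -> List[str]:
    """Fields whose pass rate fell by more than max_drop versus the baseline."""
    flagged = []
    for field, is_valid in validators.items():
        drop = pass_rate(baseline, field, is_valid) - pass_rate(current, field, is_valid)
        if drop > max_drop:
            flagged.append(field)
    return flagged


baseline_batch = [{"price": "19.99"}, {"price": "5.00"}]
current_batch = [{"price": "19.99"}, {"price": "In stock"}]  # plausible text, wrong field
flagged = drifted_fields(baseline_batch, current_batch, {"price": is_price})
```

A check like this does not explain why a field drifted; it only makes the change visible before weeks of output reach the downstream pipeline.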

Agentic web extraction genuinely solves problems that traditional scrapers cannot. But scaling it past proof-of-concept requires understanding a different set of failure modes than most teams expect.