861 posts tagged with "insider"

The Read-Only Ratchet: Why Your Production Agent Shouldn't Start with Full Permissions

· 11 min read
Tian Pan
Software Engineer

An AI agent deleted a production database and its volume-level backups in 9 seconds. It didn't go rogue. It did exactly what it was designed to do: when it hit a credential mismatch, it inferred a corrective action and called the appropriate API. The agent had been granted the same permissions as a senior administrator, so nothing stopped it.

This is not an edge case. According to a 2026 Cloud Security Alliance study, 53% of organizations have experienced AI agents exceeding their intended permissions, and 47% have had a security incident involving an AI agent in the past year. Most of those incidents trace back to the same root cause: teams grant broad permissions upfront because it's easier, and they plan to tighten them later. Later never comes until something breaks.

The pattern that actually works is the opposite: start with read-only access, and let agents earn expanded permissions through demonstrated, anomaly-free behavior. This is the read-only ratchet.
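
A minimal sketch of how such a ratchet might be modeled. The tier names, the `promote_after` threshold, and the rule that any anomaly drops the agent back to read-only are illustrative assumptions, not details taken from the post.

```python
from dataclasses import dataclass
from enum import IntEnum


class Tier(IntEnum):
    READ_ONLY = 0
    WRITE_STAGING = 1  # hypothetical intermediate tier
    WRITE_PROD = 2     # hypothetical top tier


@dataclass
class Ratchet:
    promote_after: int = 50      # assumed number of clean actions needed per promotion
    tier: Tier = Tier.READ_ONLY  # every agent starts with read-only access
    clean_streak: int = 0

    def allows(self, required: Tier) -> bool:
        """True if the current tier is at least the tier an action requires."""
        return self.tier >= required

    def record(self, anomaly: bool) -> Tier:
        """Record one completed action and return the resulting tier."""
        if anomaly:
            # Assumption: a single anomaly resets the agent to read-only.
            self.tier = Tier.READ_ONLY
            self.clean_streak = 0
            return self.tier
        self.clean_streak += 1
        if self.clean_streak >= self.promote_after and self.tier < Tier.WRITE_PROD:
            self.tier = Tier(self.tier + 1)
            self.clean_streak = 0
        return self.tier


# Usage: three clean actions earn one promotion, one anomaly revokes it.
ratchet = Ratchet(promote_after=3)
starts_read_only = not ratchet.allows(Tier.WRITE_STAGING)
for _ in range(3):
    ratchet.record(anomaly=False)
tier_after_clean_run = ratchet.tier
tier_after_anomaly = ratchet.record(anomaly=True)
```

Promotion moves one tier at a time, so an agent cannot jump from read-only to production writes on the strength of a single clean streak.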

Reranking Is the Real Work: Why Your Retrieval System's Bottleneck Is Never the Index

· 10 min read
Tian Pan
Software Engineer

Teams building RAG systems almost universally hit the same wall: they spend a week tuning their HNSW index parameters, add product quantization, push recall@100 from 0.81 to 0.87 — and then watch LLM output quality barely budge. The assumption baked into months of effort is that a better index equals better answers. It doesn't. The bottleneck was never the index.

The actual chokepoint is the ranking step between your candidate set and your context window. What you put into the LLM determines what comes out, and the job of ranking is to ensure that the most genuinely relevant documents, not just the most semantically similar ones, make it through. That distinction matters more than any HNSW configuration you'll ever tune.

Thinking Budgets: When Extended Reasoning Models Actually Make Economic Sense

· 10 min read
Tian Pan
Software Engineer

A surprising number of AI teams default to extended thinking on every query once they gain access to an o3-class or Claude extended thinking model. The logic seems obvious: smarter reasoning equals better outputs, so why not always enable it? The problem is that this reasoning fails to account for a basic fact of how test-time compute scaling works in practice. Extended thinking dramatically improves performance on a specific class of tasks, degrades quality on others, and can inflate your inference costs by 5–30x across the board. The teams getting the most value from these models treat the reasoning budget as an explicit decision — one with the same weight as model selection or prompt engineering.

This post lays out the task taxonomy, the cost structure, and the routing decision framework that distinguishes teams who use thinking budgets strategically from teams who are just paying a premium for an illusion of quality.

Timeout-Aware Agent Design: How to Deliver Partial Results Instead of Silent Failure

· 10 min read
Tian Pan
Software Engineer

An agent successfully creates a GitHub issue, opens a Jira ticket, and updates a shared spreadsheet. Then it times out before sending the Slack announcement. The framework records the run as delivered. The user never gets notified. The side effects exist in three systems; the result that matters to the human doesn't.

This is the most common timeout failure mode in production agent systems, and it's almost never the one teams prepare for. Most agent implementations treat a timeout like any other exception: catch it, log it, return an error. The user gets nothing, even though the agent completed 90% of the work. The question isn't whether to set timeouts — every production system needs them. The question is what an agent does when the clock runs out.
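
One way to hand back partial results is to track which steps finished before the deadline and return that record instead of a bare error. The sketch below assumes steps are independent coroutines run in order; `RunReport`, `run_with_deadline`, and the step names are hypothetical.

```python
import asyncio
from dataclasses import dataclass, field


@dataclass
class RunReport:
    completed: list[str] = field(default_factory=list)
    pending: list[str] = field(default_factory=list)
    timed_out: bool = False


async def run_with_deadline(steps, budget_s: float) -> RunReport:
    """Run (name, coroutine function) pairs in order under one shared time budget."""
    report = RunReport(pending=[name for name, _ in steps])
    loop = asyncio.get_running_loop()
    deadline = loop.time() + budget_s
    for name, step in steps:
        remaining = deadline - loop.time()
        try:
            if remaining <= 0:
                raise asyncio.TimeoutError
            await asyncio.wait_for(step(), timeout=remaining)
        except asyncio.TimeoutError:
            # Stop here, but keep the record of what already succeeded.
            report.timed_out = True
            break
        report.completed.append(name)
        report.pending.remove(name)
    return report


async def fast_step():
    await asyncio.sleep(0)


async def slow_step():
    await asyncio.sleep(1)  # longer than the budget used below


report = asyncio.run(
    run_with_deadline(
        [("create_issue", fast_step), ("send_slack", slow_step)],
        budget_s=0.05,
    )
)
```

The caller can then tell the user what was done and what was not, rather than reporting only that the run failed.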

Token Economics for AI-Powered API Products: Pricing What You Cannot Predict

· 10 min read
Tian Pan
Software Engineer

A team ships a customer-facing AI assistant. They price it at $49/month per seat, targeting 70% gross margins based on a spreadsheet that assumed "average 500 tokens per query." Three months later, finance flags that their heaviest users are consuming 15,000 tokens per session. The pricing model collapses not because the feature failed, but because the product team priced something they didn't yet understand.

This isn't a failure of forecasting. It's a structural problem: the cost basis of an LLM-powered product is fundamentally unlike anything traditional SaaS pricing was designed to handle. Every API call carries an unpredictable and material token cost. The inputs vary wildly by user, task, and time of day. The outputs compound in ways that only show up weeks later on your cloud bill. And once you layer in agentic patterns (tool calls, multi-turn reasoning, subagent orchestration), a single user interaction can cost $0.02 or $20 depending on what the model decides to do.
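
The arithmetic behind that collapse can be sketched in a few lines. The $49 seat price, the 70% target, and the 500 and 15,000 token figures come from the example above; the blended token price and the monthly query and session counts are hypothetical values chosen only to make the calculation concrete.

```python
SEAT_PRICE_USD = 49.0     # from the example above
TARGET_MARGIN = 0.70      # from the example above
USD_PER_1K_TOKENS = 0.01  # hypothetical blended input/output price


def gross_margin(price_usd: float, tokens_per_month: int, usd_per_1k: float) -> float:
    """Gross margin on one seat, counting only token cost as cost of goods sold."""
    cost = tokens_per_month / 1000 * usd_per_1k
    return (price_usd - cost) / price_usd


# Spreadsheet assumption: 500 tokens per query, 1,000 queries a month (hypothetical volume).
planned = gross_margin(SEAT_PRICE_USD, 500 * 1_000, USD_PER_1K_TOKENS)

# Heavy user: 15,000 tokens per session, 400 sessions a month (hypothetical volume).
heavy = gross_margin(SEAT_PRICE_USD, 15_000 * 400, USD_PER_1K_TOKENS)

print(f"planned margin: {planned:.1%}, heavy-user margin: {heavy:.1%}")
```

Under these assumed numbers the planned seat clears the margin target comfortably, while the heavy seat costs more to serve than it pays.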

Tool Discovery at Scale: Why Embedding-Only Retrieval Fails Past 20 Tools

· 10 min read
Tian Pan
Software Engineer

Most teams building AI agents discover the same problem on their fifth sprint: the agent can't reliably pick the right tool anymore. At ten tools, it mostly works. At twenty, accuracy starts to slip. At fifty, you're watching the agent call search_documents when it should call update_record, and the logs offer no explanation. The usual reaction is to tweak the tool descriptions — add more context, be more explicit, rewrite the examples. This occasionally helps. But it misses the root cause: flat embedding retrieval is architecturally wrong for large tool inventories, and better descriptions cannot fix an architectural problem.

Tool selection is retrieval, and retrieval has known scaling limits. Understanding those limits — and the structured metadata patterns that work around them — is what separates agent systems that hold up in production from ones that require constant babysitting.
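
As a rough illustration of what a structured metadata pattern could look like, the sketch below filters tools on a metadata field before ranking by similarity. The `action` field and the idea that it is known before retrieval are assumptions, and `overlap` is a word-overlap stand-in for embedding similarity, not what a production system would use.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tool:
    name: str
    description: str
    action: str  # hypothetical metadata field: "read" or "write"


def overlap(a: str, b: str) -> float:
    """Jaccard word overlap; a stand-in for embedding cosine similarity."""
    x, y = set(a.lower().split()), set(b.lower().split())
    return len(x & y) / len(x | y) if x | y else 0.0


def select_tool(query: str, action: str, tools: list[Tool]) -> Tool:
    """Filter on metadata first, then rank only the surviving candidates."""
    candidates = [t for t in tools if t.action == action]
    # max() raises ValueError if no tool matches the requested action.
    return max(candidates, key=lambda t: overlap(query, t.description))


tools = [
    Tool("search_documents", "search customer record documents", "read"),
    Tool("update_record", "update a customer record", "write"),
]
query = "find and change the customer record documents"

# Flat ranking over every tool picks the read tool for this write request.
flat_choice = max(tools, key=lambda t: overlap(query, t.description))
# Filtering on the write action first leaves only the correct candidate.
filtered_choice = select_tool(query, "write", tools)
```

The filter shrinks the candidate set before any similarity score is computed, so a lexically similar but wrong-category tool never competes.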

Vector DB Sharding: Why HNSW Breaks at Partition Boundaries and What to Do About It

· 9 min read
Tian Pan
Software Engineer

Most vector database tutorials show you how to insert a million embeddings and run a query. What they don't show you is what happens six months later, when your corpus has grown past what a single node can hold, and you're trying to shard the HNSW index your entire retrieval pipeline depends on. The answer, which vendors leave out of the marketing copy, is that HNSW graphs resist partitioning in ways that cause silent recall degradation — and the operational patterns needed to recover that quality add real complexity.

This post covers the technical reasons HNSW sharding breaks down, what recall loss looks like in practice, and the operational patterns teams use to maintain retrieval accuracy when they've outgrown a single node.

Why AI Coding Tools Amplify Juniors and Plateau Seniors

· 9 min read
Tian Pan
Software Engineer

Ask any VP of Engineering whether AI coding tools are a productivity win and they'll say yes. Put the same question to a staff engineer who lives in a ten-year-old codebase with six undocumented data models and a deployment process held together with shell scripts, and you'll get a different answer.

The productivity story for AI coding tools is bifurcated in a way that most organizations haven't fully processed. Junior engineers are seeing 27–39% gains in completed weekly tasks. Experienced developers are, in a controlled study of real-world issues, taking 19% longer to finish tasks when they have AI assistance than when they don't. Both results are consistent with how these tools work — and they lead to a management trap that's playing out quietly on engineering teams right now.

Your Prompts Are Configuration: Treating AI Settings as Production Infrastructure

· 9 min read
Tian Pan
Software Engineer

Most engineering teams can tell you exactly which environment variable controls their database connection pool. Almost none can tell you which system prompt version is serving 90% of their traffic right now — or what changed since the last model behavior complaint rolled in.

This is the AI configuration footprint problem. Teams building LLM-powered features accumulate an implicit configuration layer — model selection, sampling parameters, system prompts, tool schemas, retry budgets — that governs how their product behaves in production. Most of this layer lives in no system of record. It gets updated through direct code edits, spreadsheet hand-offs, or Slack messages. When something breaks, nobody can say what changed.

That's not a process problem. It's an architecture problem. And the fix requires treating AI configuration with the same rigor that mature teams bring to environment config, feature flags, and infrastructure-as-code.
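
A minimal sketch of what a system of record for this layer might look like: an immutable config object whose version is derived from its contents, plus a diff that answers "what changed". The field names, the model name, and the hashing scheme are illustrative assumptions.

```python
import hashlib
import json
from dataclasses import asdict, dataclass, replace


@dataclass(frozen=True)
class AIConfig:
    model: str
    temperature: float
    system_prompt: str
    max_retries: int

    def version(self) -> str:
        """Content-derived version id: identical settings always hash the same."""
        payload = json.dumps(asdict(self), sort_keys=True).encode("utf-8")
        return hashlib.sha256(payload).hexdigest()[:12]


def diff(old: AIConfig, new: AIConfig) -> dict:
    """Map each changed field to its (old, new) pair."""
    a, b = asdict(old), asdict(new)
    return {key: (a[key], b[key]) for key in a if a[key] != b[key]}


# "example-model" and the prompt text are placeholders.
v1 = AIConfig("example-model", 0.2, "You are a support assistant.", 2)
v2 = replace(v1, temperature=0.7)
changes = diff(v1, v2)
```

Logging `version()` alongside each request would let a team tie a behavior complaint back to the exact settings that served it.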

AI Content Drift: When Your Documentation Corpus Starts Contradicting Itself

· 10 min read
Tian Pan
Software Engineer

Your documentation looked fine six months ago. It still looks fine today — individually. But a user filed a bug this week: two pages of your developer docs give opposite advice on the same configuration option. One page says to set max_retries to 3 for production workloads; another page says to leave it at the default of 0. Both were AI-generated. Both sound authoritative. One reflects what your system actually did in January; the other reflects how your AI tool interpreted a slightly different prompt in June. Nobody caught it because nobody was looking at the corpus as a whole.

This is AI content drift. It is not a hallucination problem. The AI was accurate at the time of generation. The drift happened in the gap between runs.

The Coverage Illusion: Why AI-Generated Tests Inherit Your Code's Blind Spots

· 9 min read
Tian Pan
Software Engineer

An engineer on a small team spent three months delegating test generation to AI. Code coverage jumped from 47% to 72% to 98%. Every PR came back green. Then production broke. A race condition in user registration allowed duplicate emails due to database replication lag. A promo code endpoint returned null instead of zero when a code was invalid, and the payment calculation silently broke for 4,700 customers. The total damage: $47,000 in refunds and 66 hours of engineering time. The tests hadn't missed a few edge cases. The tests had covered the code that was written, not the system that was deployed.

This is the coverage illusion. And it's getting easier to fall into as AI-assisted development becomes the default.

AI System Design Advisor: What It Gets Right, What It Gets Confidently Wrong, and How to Tell the Difference

· 9 min read
Tian Pan
Software Engineer

A three-person team spent a quarter implementing event sourcing for an application serving 200 daily active users. The architecture was technically elegant. It was operationally ruinous. The design came from an AI recommendation, and the team accepted it because the reasoning was fluent, the tradeoff analysis sounded rigorous, and the system they ended up with looked exactly like the kind of thing you'd see on a senior engineer's architecture diagram.

That story is now a cautionary pattern, not an edge case. AI produces genuinely useful architectural input in specific, identifiable situations — and produces confidently wrong advice in situations that look nearly identical from the outside. The gap between them is not obvious if you approach AI as an answer machine. It becomes navigable if you approach it as a sparring partner.