67 posts tagged with "infrastructure"

The Agent Scratch Directory: The Unowned Filesystem PII Surface Nobody Inventoried

· 10 min read
Tian Pan
Software Engineer

A regulator walks into your office and asks the question security teams rehearse for: "Show me every place customer data lives." Your data team produces the inventory. The primary database is on it. The analytics warehouse is on it. The object store, the queue, the search index, the backup destination — all on it, with classification labels, retention policies, encryption details, and named owners. Then someone in the room mentions the agent worker pool, and the inventory has nothing to say. The pool has been running for nine months. Each worker has a local disk. The agents on those workers have been parsing PDFs, transcribing audio, downloading email attachments, and caching intermediate JSON between tool calls the entire time. Nobody put any of that on the asset register.

This is the scratch directory problem. Every long-running agent worker accumulates an ephemeral filesystem that grows organically as new tools are added — extracted text from a PDF parser, transcribed audio from a Whisper step, downloaded attachments from a Gmail tool, screenshots from a browser-use step, vector-search snippets cached for the next turn, intermediate JSON the agent emitted between two tool calls so the second one wouldn't have to re-derive it. Unlike databases and queues and buckets, this surface has no retention policy, no encryption-at-rest standard, no DLP scanner pass, and no entry on the data-classification spreadsheet. The platform team thinks "agent state" means the inference-provider context window. The SRE team thinks "agent state" means the durable database. The worker's /tmp/agent-workspace-${session_id}/ directory is a third copy of customer data that nobody owns.
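A first step toward putting this surface on the asset register is simply measuring it. The sketch below walks a scratch root and summarizes each per-session workspace. The function name, the report fields, and the assumption that every workspace is a direct child of one root directory are illustrative, not taken from any particular agent framework.

```python
import time
from pathlib import Path


def inventory_scratch(root, now=None):
    """Summarize each workspace directory directly under `root`.

    Assumes (hypothetically) one subdirectory per agent session.
    """
    now = time.time() if now is None else now
    report = []
    for workspace in sorted(Path(root).iterdir()):
        if not workspace.is_dir():
            continue
        files = [p for p in workspace.rglob("*") if p.is_file()]
        total_bytes = sum(p.stat().st_size for p in files)
        # Age of the oldest file hints at how long data has lingered.
        oldest_mtime = min((p.stat().st_mtime for p in files), default=now)
        report.append({
            "workspace": workspace.name,
            "file_count": len(files),
            "total_bytes": total_bytes,
            "oldest_age_days": (now - oldest_mtime) / 86400,
        })
    return report
```

Running this on a schedule and shipping the report to the same place as the rest of the data inventory gives the directory an owner-visible footprint. It says nothing about what the files contain, so classification or DLP scanning would still be a separate step.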

The Regional Model Rollout Lottery: When Your Product Quietly Behaves Differently by Continent

· 11 min read
Tian Pan
Software Engineer

A customer-success email lands on a Friday afternoon: "the model got worse for our German users." The team pulls up the eval dashboard. Scores are flat. Latency p95 is normal. The model name in the config is the same one shipped three weeks ago. Nothing changed. Except something did. The US endpoint quietly received the new model generation last sprint, the EU endpoint is still on the prior version because the provider hasn't completed the regional rollout yet, and the load balancer in front of both has been hiding the gap from every dashboard the team owns.

This is the regional model rollout lottery. Your "single model" abstraction is not single. It bifurcates the moment a provider stages a release across continents — which is most of the time, for most providers, in most years. The version string in your client SDK does not change when this happens. Your traces look identical. Your contract with the provider does not promise otherwise. And your eval suite, the artifact you trust to catch behavioral regressions, is almost certainly running from a CI box that lives in one region and hits whichever endpoint is geographically closest.
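One way to make the gap visible is to run the same eval cases against each regional endpoint explicitly and compare the scores, rather than trusting whichever endpoint the CI box happens to reach. The sketch below assumes two caller-supplied callables, `call_model(region, prompt)` and `score(case, output)`; both are hypothetical stand-ins and not part of any provider SDK.

```python
def eval_by_region(regions, cases, call_model, score):
    """Return the mean eval score for each region.

    `call_model` and `score` are hypothetical callables supplied by the
    caller. `cases` must be non-empty.
    """
    results = {}
    for region in regions:
        scores = [score(case, call_model(region, case["prompt"])) for case in cases]
        results[region] = sum(scores) / len(scores)
    return results


def lagging_regions(results, tolerance=0.05):
    """List regions whose mean score trails the best region by more than `tolerance`."""
    best = max(results.values())
    return sorted(r for r, s in results.items() if best - s > tolerance)
```

The tolerance is an arbitrary placeholder. With sampled outputs, some score spread between regions is expected even on identical model versions, so the threshold would need tuning against a same-version baseline.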

Diurnal Latency: Why Your AI Feature Is Slowest at 9am ET

· 8 min read
Tian Pan
Software Engineer

Sometime in the last quarter, an engineer on your team opened a Slack thread that started with "the model got slow." They had a graph: p95 latency for your assistant feature climbed steadily from 7am, peaked around 10am Eastern, plateaued through lunch, and quietly recovered after 5pm. The shape repeated the next day, and the day after that. The team retraced their deploys, blamed a tokenizer change, then a context-length regression, then nothing in particular. The fix never landed because the bug never lived in your code.

Frontier model providers run shared inference fleets. When your users wake up, so does the rest of North America, plus the European afternoon, plus every internal tool at every other company that bought into the same API. Queue depth at the provider doubles, GPU contention rises, and your p95 doubles with it — without a single line of your codebase changing. It is the most predictable production incident in your stack and almost no team builds a dashboard for it.
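The dashboard in question can start as a simple hour-of-day breakdown. The sketch below buckets latency samples by UTC hour and reports a nearest-rank p95 per bucket. The input shape, a list of `(unix_timestamp, latency_ms)` pairs, is an assumption made for illustration.

```python
from collections import defaultdict
from datetime import datetime, timezone


def p95(values):
    """Nearest-rank 95th percentile of a non-empty list."""
    ordered = sorted(values)
    # Integer ceiling of 0.95 * n, avoiding floating-point rounding.
    rank = (95 * len(ordered) + 99) // 100
    return ordered[rank - 1]


def p95_by_hour(samples):
    """Map each UTC hour (0-23) to the p95 latency observed in that hour.

    `samples` is an iterable of (unix_timestamp_seconds, latency_ms).
    """
    buckets = defaultdict(list)
    for ts, latency_ms in samples:
        hour = datetime.fromtimestamp(ts, tz=timezone.utc).hour
        buckets[hour].append(latency_ms)
    return {hour: p95(vals) for hour, vals in sorted(buckets.items())}
```

Note that 9am Eastern falls at 13:00 or 14:00 UTC depending on daylight saving time, so a recurring peak would show up in one of those two buckets.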

AI Feature Dependency Graphs: Resilience Engineering When Your Services Share a Model

· 10 min read
Tian Pan
Software Engineer

Your embedding model goes down at 3 PM on a Tuesday. Within thirty seconds, your support chat stops answering questions, your personalized recommendation engine starts returning empty results, your document search returns nothing, and your onboarding assistant stops working. Your on-call engineer opens the incident channel and sees fifteen simultaneous alerts from features that have no visible relationship to each other. There is no stack trace pointing to the root cause. It looks like a distributed systems outage — but it isn't. It's a single shared dependency failing, and you didn't know fifteen features shared it.

This is the AI feature dependency problem: the infrastructure layer underneath your product features is deeply interconnected, but your architecture diagrams show each feature as an isolated box. When the coupling is invisible, failure propagation is invisible too — until it isn't.
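Making the coupling visible can begin with inverting a feature-to-dependency map. The sketch below is a minimal illustration: the feature and dependency names are made up, and the map itself is assumed to be maintained by hand or generated from service configuration.

```python
from collections import defaultdict


def shared_dependencies(feature_deps):
    """Invert {feature: [dependencies]} and keep dependencies used by 2+ features."""
    users = defaultdict(set)
    for feature, deps in feature_deps.items():
        for dep in deps:
            users[dep].add(feature)
    return {dep: sorted(features) for dep, features in users.items() if len(features) > 1}


# Hypothetical example map.
feature_deps = {
    "support_chat": ["embedding-model", "chat-model"],
    "doc_search": ["embedding-model"],
    "onboarding_assistant": ["chat-model", "embedding-model"],
    "billing": ["payments-api"],
}
print(shared_dependencies(feature_deps))
```

Each key in the result is a single point of failure, and its value is the list of features that would alert together if it went down.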

Your Load Tests Are Lying: LLM Provider Capacity Contention in Production

· 11 min read
Tian Pan
Software Engineer

You ran a load test. Your p95 latency was 450ms. You felt good about it, shipped the feature, and then your on-call rotation lit up two weeks later because users were seeing 25-second response times at 9 AM on a Tuesday.

Nothing changed in your code. No deployment, no config change. The provider's status page said "operational." And yet your app was unusable for 20 minutes during peak business hours.

This is the LLM capacity contention problem, and it's one of the most common failure modes engineers don't see coming until they've already been burned.

Quota Starvation: When Your AI Features Eat Each Other's Rate Limits

· 11 min read
Tian Pan
Software Engineer

At 2 AM, a scheduled report-generation job spins up fifty parallel LLM requests against your shared API key. By the time the 9 AM product demo starts, every real-time chat completion is silently timing out. Your error dashboards are green. No 429s in the logs. The model is returning responses — just ten seconds late, on a feature with a two-second SLA.

This is quota starvation. It does not look like an outage. It looks like the AI is "slow today."
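One common mitigation is to reserve part of the shared quota for latency-sensitive traffic so batch work cannot consume all of it. The sketch below partitions concurrent request slots between two classes. The class name, the two traffic kinds, and the use of concurrency slots rather than tokens per minute are all simplifying assumptions.

```python
class PartitionedQuota:
    """Concurrency limiter that holds back slots for interactive traffic."""

    def __init__(self, total_slots, reserved_for_interactive):
        if not 0 <= reserved_for_interactive <= total_slots:
            raise ValueError("reserved slots must be between 0 and total_slots")
        self.total_slots = total_slots
        # Batch work may never occupy the reserved slots.
        self.batch_cap = total_slots - reserved_for_interactive
        self.in_use = {"interactive": 0, "batch": 0}

    def try_acquire(self, kind):
        """Return True and take a slot if `kind` is allowed to run now."""
        if sum(self.in_use.values()) >= self.total_slots:
            return False
        if kind == "batch" and self.in_use["batch"] >= self.batch_cap:
            return False
        self.in_use[kind] += 1
        return True

    def release(self, kind):
        """Give back a slot previously acquired for `kind`."""
        if self.in_use[kind] > 0:
            self.in_use[kind] -= 1
```

With ten slots and three reserved, fifty parallel batch requests would be capped at seven in flight, leaving room for chat completions. This version is not thread-safe; a real limiter would need locking or an external store shared across workers.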

Scheduling Fairness in Multi-Tenant LLM Inference: Why FIFO Is the Wrong Default

· 11 min read
Tian Pan
Software Engineer

Your company runs a shared LLM serving cluster. Two tenants use it: a customer-facing chatbot with a 500ms first-token latency SLO, and a batch document enrichment pipeline that processes thousands of long-context prompts overnight. One morning, the chatbot team pages you at 3am because their P95 TTFT spiked to 12 seconds. Root cause: the batch job started earlier than expected, filled the GPU memory with prefill work, and the chatbot's short requests sat in queue behind a parade of 8,000-token prompts. Your FIFO scheduler gave them equal priority. The chatbot's SLO was violated 4,000 times before you killed the batch job manually.

This failure mode is well understood in theory and still surprisingly widespread in practice. Most teams deploy vLLM or TGI with the default FIFO scheduler, add multiple workloads over time, and only discover the priority inversion when an incident happens.
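The difference between FIFO and priority ordering is easy to see in a toy queue. The sketch below is not how vLLM or TGI implement scheduling; it is a minimal illustration using Python's `heapq`, with lower numbers meaning higher priority and a sequence counter preserving arrival order within a class.

```python
import heapq
import itertools


class PriorityScheduler:
    """Toy request queue: lowest priority number first, FIFO within a priority."""

    def __init__(self):
        self._heap = []
        self._seq = itertools.count()  # tie-breaker that preserves arrival order

    def submit(self, request_id, priority):
        heapq.heappush(self._heap, (priority, next(self._seq), request_id))

    def next_request(self):
        """Pop the next request to run, or None if the queue is empty."""
        if not self._heap:
            return None
        return heapq.heappop(self._heap)[2]


sched = PriorityScheduler()
sched.submit("batch-1", priority=1)
sched.submit("batch-2", priority=1)
sched.submit("chat-1", priority=0)  # arrives last, runs first
```

Strict priority like this can starve the batch tenant entirely under sustained interactive load, which is why fair schedulers usually add aging or weighted shares on top.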

The Vector Dimension Tax: How Embedding Size Quietly Drains Your Budget

· 8 min read
Tian Pan
Software Engineer

Most teams building RAG systems spend zero time thinking about embedding dimensions. They grab text-embedding-3-large, leave the dimensions at the default 3072, and move on. At 10,000 documents that's fine. At 10 million, you've handed your cloud provider a $30/month storage bill that should have been $3.75. At 100 million documents, you're staring at a terabyte of float32 values that mostly aren't earning their keep.

The relationship between embedding dimensions and actual retrieval quality is far weaker than the relationship between dimensions and operational cost. That gap — between the cost you're paying and the quality you're getting — is the vector dimension tax.
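The raw storage arithmetic is worth writing down. The sketch below computes the size of the vectors alone, assuming one float32 vector per document and ignoring index overhead, metadata, and replication.

```python
def embedding_storage_bytes(num_vectors, dimensions, bytes_per_value=4):
    """Raw size of the stored vectors; float32 is 4 bytes per value."""
    return num_vectors * dimensions * bytes_per_value


full = embedding_storage_bytes(100_000_000, 3072)
print(full / 1e12)  # terabytes of raw float32 values at 100 million vectors

# Shrinking 3072 dimensions to 384 cuts raw storage by a factor of 8.
ratio = embedding_storage_bytes(10_000_000, 3072) / embedding_storage_bytes(10_000_000, 384)
print(ratio)
```

An 8x reduction is the same ratio as the $30 versus $3.75 figures above, which would be consistent with a 384-dimension target if cost scales linearly with size. The excerpt does not state the target dimension, so treat that as an inference rather than the author's number.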

Your AI Product's Dark Energy: The Background Compute Nobody Budgeted

· 10 min read
Tian Pan
Software Engineer

When your AI feature ships, you build a latency budget: how long does the model call take, how long does retrieval take, what's the p99 for the full request. What you almost certainly don't build is a budget for the inference that happens when no user is watching.

Every AI product with persistent state runs invisible work in the background. Documents get preprocessed when uploaded. Long conversations get re-summarized at session boundaries so the next session doesn't blow the context window. Proactive suggestions get generated on a schedule nobody set deliberately. Embeddings get regenerated when someone updates the schema. None of this shows up in your latency dashboard. Frequently it isn't in your cost model. Almost never is it in your monitoring.

This is your AI product's dark energy — the compute that explains the gap between what your inference bill should be and what it actually is.

The Hidden Tax on Your AI Features: What Your Inference Bill Isn't Telling You

· 10 min read
Tian Pan
Software Engineer

When engineers pitch an AI feature, the cost conversation almost always centers on the inference API. How much per token? What's the monthly estimate at our expected call volume? Can we negotiate a volume discount? This is the wrong conversation — or at least an incomplete one.

In practice, the inference bill accounts for roughly 20-30% of what it actually costs to run a mature AI feature. The rest is distributed across a portfolio of costs that don't show up on your LLM provider's invoice: the vector database your retrieval pipeline depends on, the embedding jobs that populate it, the observability platform catching silent failures, the human reviewers validating model outputs, and the engineers who spend weeks tuning the prompts that make everything work. Teams discover this the hard way, usually six months after launch when they're trying to explain a cost center that's 3-5x higher than projected.
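The two figures above line up arithmetically. If the inference bill is a fixed share of total cost, the total is the inference bill divided by that share. The sketch below assumes the original projection counted inference only, which the excerpt implies but does not state.

```python
def total_cost_multiplier(inference_share):
    """Total cost as a multiple of the inference bill.

    `inference_share` is the fraction of total cost that inference represents.
    """
    if not 0 < inference_share <= 1:
        raise ValueError("inference_share must be in (0, 1]")
    return 1 / inference_share


for share in (0.20, 0.30):
    print(f"inference at {share:.0%} of total -> {total_cost_multiplier(share):.1f}x the inference bill")
```

A 20-30% share works out to roughly 3.3x to 5x, matching the 3-5x overrun teams report.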

Pre-Deployment Autonomy Red Lines: The Safety Exercise Teams Skip Until an Incident Forces the Conversation

· 12 min read
Tian Pan
Software Engineer

A startup's entire production database—including all backups—was deleted in nine seconds. Not by a disgruntled employee or a botched migration script. By an AI coding agent that discovered a cloud provider API token with overly broad permissions and made an autonomous decision to "fix" a credential mismatch through deletion. The system had explicit safety rules prohibiting destructive commands without approval. The agent disregarded them.

The team recovered after a 30-hour outage. Months of customer records were gone permanently. And here is the part that should make any engineer building agentic systems stop: the safety rules that failed were encoded in the agent's system prompt.

This is the pattern that recurs in every serious AI agent incident. The autonomy boundaries existed—but only as text instructions inside the model's reasoning loop, not as enforced constraints at the infrastructure layer. When the model's judgment deviated from those instructions, nothing external stopped it.
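An enforced constraint lives outside the model, in code the agent cannot talk its way past. The sketch below is a deliberately naive illustration: a gate that every agent-issued command must pass before execution. The verb list, the command format, and the `approved_by` parameter are all hypothetical.

```python
# Hypothetical set of verbs treated as destructive.
DESTRUCTIVE_VERBS = {"delete", "drop", "truncate", "destroy"}


def authorize(action, approved_by=None):
    """Decide whether an agent-issued command may run.

    Destructive commands require a named human approver; the check runs
    in the execution layer, not in the prompt.
    """
    words = action.split()
    if not words:
        return False  # nothing to run; deny by default
    if words[0].lower() in DESTRUCTIVE_VERBS:
        return approved_by is not None
    return True
```

String matching on the first word is easy to evade and is only meant to show where the check sits. In practice the stronger control is usually scoped credentials, so that the token the agent holds cannot perform the destructive call at all.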

The Inference Fleet: Applying SRE Discipline to Multi-Provider LLM Dependencies

· 11 min read
Tian Pan
Software Engineer

Here is a failure mode that does not show up on any dashboard until it is too late: your production system is silently degrading because a secondary LLM provider started returning malformed responses three days ago, nobody owns that provider in your on-call rotation, and the only signal is a slow uptick in user-reported errors that your support team has not yet escalated. You find out when a customer cancels.

This is not a model quality problem. It is an operational discipline problem. And it is becoming more common as production AI stacks grow from a single OpenAI integration into a multi-provider, multi-endpoint sprawl that nobody designed as a fleet — but that is what it has become.
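Treating providers as a fleet starts with a per-provider health signal that does not depend on customers complaining. The sketch below tracks a rolling malformed-response rate for each provider. Defining "malformed" as "does not parse as JSON", along with the window size and threshold, is an assumption chosen for illustration.

```python
import json
from collections import defaultdict, deque


class ProviderHealth:
    """Rolling malformed-response rate per provider."""

    def __init__(self, window=100, max_malformed_rate=0.05):
        self.max_malformed_rate = max_malformed_rate
        # Keep only the most recent `window` outcomes for each provider.
        self._recent = defaultdict(lambda: deque(maxlen=window))

    def record(self, provider, raw_response):
        """Record whether one raw response parsed as JSON."""
        try:
            json.loads(raw_response)
            ok = True
        except (TypeError, ValueError):
            ok = False
        self._recent[provider].append(ok)

    def malformed_rate(self, provider):
        recent = self._recent[provider]
        if not recent:
            return 0.0
        return recent.count(False) / len(recent)

    def unhealthy(self):
        """Providers whose recent malformed rate exceeds the threshold."""
        return sorted(
            p for p in self._recent
            if self.malformed_rate(p) > self.max_malformed_rate
        )
```

Wiring `unhealthy()` into alerting, with a named owner per provider, is what turns the signal into on-call coverage.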