Skip to main content

67 posts tagged with "infrastructure"

View all tags

Multi-Region AI Deployment: Data Residency, Model Parity, and the Latency Tax Nobody Budgets

· 10 min read
Tian Pan
Software Engineer

When engineers budget for multi-region AI deployments, they typically account for two variables: infrastructure cost per region and replication overhead. What they consistently underestimate — sometimes catastrophically — are three costs that only appear once you're live: model parity gaps that make your EU cluster produce different outputs than your US cluster, KV cache isolation penalties that make every token in GDPR territory more expensive to generate, and silent compliance violations that trigger when your retry logic routes a French user's data through Virginia.

A German bank spent 14 months deploying a large open-source model on-premises to satisfy GDPR requirements. That's not unusual. What's unusual is that the engineers who proposed the architecture understood the compliance constraint upfront. Most don't until an incident report forces the conversation.

The 90-Second Cold Start for Production Agents: When the LLM Isn't the Slow Part

· 10 min read
Tian Pan
Software Engineer

A user clicks the button. Ninety seconds later they get their first token. The team's response, almost reflexively, is to ask the model vendor for a faster TTFT — and the vendor's TTFT is 800 milliseconds. The model was never the slow part. The request waited 30 seconds for a tool registry to load, 20 seconds for a vector store client to negotiate its first connection, 15 seconds for the prompt cache to prime on a fresh container, and another 10 seconds for an agent framework to validate every tool schema in its registry against a JSON schema validator that was loading on first use.

This is the agent cold start, and it has almost nothing to do with the model. Teams that profile only the LLM call are optimizing the part of their request that wasn't slow. Worse, the cold start is invisible in steady state — load tests against a warm pool look great, dashboards plotted on the median look great, and the people who notice are the users who hit the first request after a deploy, an autoscaling event, or a low-traffic stretch where everything got recycled.

The Promotion Packet for AI Engineers Who Didn't Ship a Feature

· 11 min read
Tian Pan
Software Engineer

The AI engineer with the strongest case for promotion on your team has a promotion packet that looks empty. Two quarters of work and the impact graph is a flat line. The eval-regression rate that used to spike to 12% on every model swap now sits at 4%. The $40k/month cost spike that finance was about to escalate never reached finance because somebody added a budget guard to the gateway. The P0 incident that would have made the company's status page never happened because a kill-switch tripped and routed traffic to the previous prompt version.

The packet has nothing to write in the "shipped X" column. The calibration committee sits down with two engineers side by side: one who shipped two visible features this half, one who quietly absorbed the load that made those features possible. The committee, doing what it has always done, rates the shipper higher. The infra-shaped engineer either takes a "meets expectations" rating they don't deserve and quits inside a quarter, or learns to write the packet in a language the committee can actually read.

The Air-Gapped LLM Blueprint: What Egress-Free Deployments Actually Need

· 11 min read
Tian Pan
Software Engineer

The cloud AI playbook assumes one primitive that nobody writes down: outbound HTTPS. Vendor APIs, hosted judges, telemetry pipelines, model registries, vector stores, dashboard SaaS, secret managers — every one of them quietly resolves to a domain on the public internet. Pull that one cable and the stack does not degrade gracefully. It collapses.

That is the moment most teams discover their architecture has an egress dependency they never accounted for. A "small" prompt update needs to call out to a hosted classifier. The eval suite hits an LLM judge over the wire. The observability agent phones home. The model registry pulls weights from a CDN. None of it is malicious, and none of it is unusual. It is just what the cloud-native stack looks like when you stop noticing the cable.

Your LLM Bill Is Half Your Agent's COGS — The Other Half Is The Part Nobody Is Monitoring

· 10 min read
Tian Pan
Software Engineer

The first time a finance team asks an AI product team to forecast unit economics, the conversation goes the same way. The team pulls up the inference dashboard, points at the monthly token spend, and says "that's our COGS." The CFO multiplies by projected volume, draws a line on a chart, and asks where the gross margin curve crosses 70%. Six weeks later, when the actual P&L lands, the inference number on the dashboard is correct and the gross margin is twenty points lower than the forecast. Nobody is lying. Inference was just half of what the agent actually costs.

The other half is distributed across line items that nobody on the AI team owns. The vector database bill grows quietly because retrieval volume tracks usage and re-indexing costs are billed against compute, not storage. The observability platform's invoice arrives from the platform team's budget. Embedding regeneration shows up as a CI cost. Telemetry storage is filed under data warehouse. Human review is in customer-success headcount. None of these line items is alarming on its own — and that is exactly why the integrated number is the one that surprises everyone.

Your APM Is Quietly Dropping LLM Telemetry, and the Bug Lives in the Gap

· 11 min read
Tian Pan
Software Engineer

There is a broken prompt in your system right now that affects roughly three percent of traffic, and your dashboards do not know it exists. The p99 latency chart is green. The error rate is flat. The model-call success metric is at four nines. The only place the failure shows up is in a customer support ticket the platform team cannot reproduce, and by the time the ticket reaches a debugging session, the trace has been sampled away.

This is not a monitoring gap. It is a category mistake. The APM you are running was designed for a world in which dimensions are bounded sets — endpoint, status_code, region, service — and the cost of an additional label is at most a few new time series. LLM workloads do not fit that shape at all. The interesting dimensions are the user's prompt, the retrieved context IDs, the tool-call sequence, the model revision, the prompt template version, the tenant, the locale, the eval bucket the request fell into. Every one of those is high-cardinality, and any subset of them is enough to detonate the metrics store the moment you tag a span with it.

Token-Per-Watt: The AI Sustainability Metric Your Dashboard Cannot Compute

· 11 min read
Tian Pan
Software Engineer

Your sustainability dashboard reports "AI energy: 2.3 GWh this quarter, down 4% YoY" and the slide gets a polite nod in the ESG review. The CFO walks out of an analyst call six months later and asks the head of platform a question that sounds simple: "What is our token-per-watt, and how does it compare to our competitors?" The dashboard cannot answer. Not because the data is missing — the dashboard is full of data — but because it treats inference as a single line item and tasks as a product concept, and the only honest unit of AI sustainability lives at the intersection.

The mismatch is not a reporting bug. It is a category error that the existing carbon-accounting playbook, perfected for cloud workloads on CPU-hours and kWh per VM, cannot fix on its own. Inference is not a workload with a stable energy profile. The watts per token shift by 30× depending on which model tier served the request, by 4× depending on batch size at the moment of the call, and by another order of magnitude depending on whether the prefix cache hit or missed. Aggregating those into a single GWh number is like reporting "average car fuel economy" across a fleet that includes scooters, sedans, and 18-wheelers — accurate in the most useless sense.

Tokenizer Drift: Your Local Counter Lies, the Bill Tells the Truth

· 9 min read
Tian Pan
Software Engineer

A team I know spent three weeks chasing a "context truncation" bug that only fired in production for Japanese customers. Their CI fixtures were English. Their tiktoken count said the prompt fit in 8K with a 600-token margin. The provider's invoice said the request had been rejected for exceeding the limit. The two numbers were off by 11%, the safety margin lived inside that 11%, and nobody had ever measured the disagreement on CJK text. The fix wasn't a new model — it was throwing away the local counter as a source of truth.

That's the subtle, expensive shape of tokenizer drift: not a single wrong number, but a class of small systematic errors that accumulate at the boundaries you forgot to test. The local counter in your IDE, the budget calculator in your gateway, the rate-limit estimator in your retry middleware, and the authoritative count the provider charges against — none of these agree, and the gap widens exactly where your users live.

Counterfactual Logging: Log Enough Today to Replay Yesterday's Traffic Against Next Year's Model

· 13 min read
Tian Pan
Software Engineer

Every LLM team eventually gets the same email from a director: "Anthropic shipped a new Sonnet. Run our traffic against it and tell me by Friday whether we should switch." The team opens the production trace store, pulls last month's requests, queues them against the new model — and three hours in, somebody asks why the diff scores look insane on tool-using turns. The answer: nobody captured the tool responses in their original form. The traces logged the model's reply faithfully and stored a one-line summary of what each tool returned. Replaying those requests doesn't replay what the old model actually saw; it replays a heavily compressed projection of it. The migration evaluation isn't measuring the new model. It's measuring the new model talking to a different reality.

This is the failure mode I want to talk about. Most production LLM logs are output-shaped: they answer "what did the model say?" reasonably well, and answer "what did the model see?" only sketchily. That asymmetry is invisible until the day you need to replay history against a new model — at which point it becomes the entire story, because the gap between what was logged and what was sent is exactly the gap between a real evaluation and a fake one.

Call it counterfactual logging: capture today the inputs you'd need to ask "what would that other model have done with this exact request?" tomorrow. The bar isn't "we logged the request." The bar is "we can re-execute the request against a different model and trust the result is meaningful."

Prompt-Version Skew Across Regions: The Unintended A/B Test Your CDN Ran for Six Hours

· 10 min read
Tian Pan
Software Engineer

You shipped a system-prompt change at 09:14. The rollout dashboard turned green at 09:31. By 11:00 your eval tracker still looked clean, the cost dashboard was unremarkable, and a customer-success engineer pinged the team: structured-output errors on the parser side were up about three percent in Asia-Pacific only. Nothing in North America. Nothing in Europe.

The rollout had paused itself at 67% region coverage because a non-load-bearing health check on one POP flapped during the cutover, and nobody had noticed. For six hours, us-east and eu-west were running prompt v47 while ap-south and ap-northeast were still on v46. You were running a live A/B test split by geography — except you didn't design the test, you couldn't see the test, and the eval suite that was supposed to catch quality regressions was hitting the new version in one region and shrugging.

This failure mode is not a bug in any single tool. It is the predictable consequence of pushing prompts through deployment systems built for a different kind of artifact.

When Your CLI Speaks English: Least Authority for Promptable Infrastructure

· 13 min read
Tian Pan
Software Engineer

A platform team I talked to this quarter shipped a Slack bot that wrapped kubectl and accepted English. An engineer typed "clean up the unused branches in staging." The bot helpfully deleted twelve namespaces — including one whose name matched the substring "branch" but which happened to host a long-lived integration environment that the mobile team had been using for a week. No exception was thrown. Every individual call the bot made was a permission the bot legitimately held. The post-mortem could not point to a broken access rule, because no rule was broken. The bot did exactly what its IAM policy said it could do.

The Unix philosophy was a containment strategy hiding inside an aesthetic preference. Small tools with narrow surfaces meant that the blast radius of any single command was bounded by the verbs and flags it accepted. rm -rf was dangerous because everyone agreed it was; kubectl delete namespace required the operator to type out the namespace, and the typing was the gate. The principle of least authority was easy to enforce because authority was lexical: the shape of the command told you the shape of the action.

Then the wrappers started accepting English. Now "the shape of the command" is whatever the LLM decided it meant.

Your Agent's Outbox Is Your Next Deliverability Incident

· 11 min read
Tian Pan
Software Engineer

The first time it happens, the on-call engineer is staring at a Gmail Postmaster dashboard that has gone solid red, the support inbox is on fire because customer password resets are landing in spam, and the agent that did this is still running. It sent eighty thousand "personalized follow-ups" between 4 a.m. and 9 a.m. local time, all from the company's primary sending domain, all signed with the same DKIM key the billing system uses. By the time anyone notices, the domain reputation that took three years to build is gone, and so are the next six weeks of inbox placement on every transactional message the company depends on.

Sending email from an agent looks like a one-line tool call. send_email(to, subject, body) is the canonical demo, and every framework ships it as a starter integration. But email is not like other tools. A bad database query rolls back. A bad API call returns an error. A bad batch of email lowers the deliverability of every other email your company sends, for weeks, and there is no transaction to roll back because the messages are already in flight to recipient mailservers that are now writing your domain's reputation history.