678 posts tagged with "ai-engineering"

AI Infrastructure Carbon Accounting: The Sustainability Cost Your Team Hasn't Measured Yet

· 9 min read
Tian Pan
Software Engineer

Every engineering team building on LLMs right now is making infrastructure decisions with a hidden cost they're not measuring. You track tokens. You track latency. You track API spend. But almost nobody tracks the carbon output of the inference workload they're running — and that gap is closing fast, from both the regulatory side and the market side.

AI systems now account for 2.5–3.7% of global greenhouse gas emissions, surpassing aviation's 2% contribution, and growing at 15% annually. US data centers running AI-specific servers consumed an estimated 53–76 TWh in 2024 alone; the upper end of that range is roughly enough to power 7.2 million homes for a year. The scale is not hypothetical anymore, and the expectation that engineering teams will have visibility into their contribution is becoming a real organizational pressure.

AI Oncall: What to Page On When Your System Thinks

· 11 min read
Tian Pan
Software Engineer

A team running a multi-agent market research pipeline spent eleven days watching their system run normally — green dashboards, zero errors, normal latency — while four LangChain agents looped against each other in an infinite cycle. By the time someone glanced at the billing dashboard, the week's projected cost of $127 had become $47,000. The agents had never crashed. The API never returned an error. Every infrastructure alert stayed silent.

This is the defining problem of AI oncall: your system can be operationally green while failing catastrophically at the thing it's supposed to do. Traditional monitoring was built to detect crashes, latency spikes, and error rates. AI systems can hit all their infrastructure SLOs while silently producing wrong outputs, looping on a task indefinitely, or spending thousands of dollars on computation that produces nothing useful. The absence of errors is not evidence of correctness.
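One signal that would have fired in the scenario above is spend burn rate. The sketch below is a minimal illustration, not the post's method: `SpendGuard`, its fields, and the 2x tolerance are hypothetical names and values chosen for the example. It projects spend for the budget window from what has been spent so far and flags the run when the projection exceeds the budget by the tolerance factor.

```python
from dataclasses import dataclass, field


@dataclass
class SpendGuard:
    """Hypothetical helper: flags runs whose projected spend outpaces the budget."""

    budget_usd: float        # expected spend for the whole window
    window_hours: float      # length of the budget window, in hours
    tolerance: float = 2.0   # assumed threshold: page above budget * tolerance
    spent_usd: float = field(default=0.0, init=False)

    def record(self, cost_usd: float) -> None:
        """Add the cost of one model call (or one batch of calls)."""
        self.spent_usd += cost_usd

    def projected_spend(self, elapsed_hours: float) -> float:
        """Linearly extrapolate spend so far to the full window."""
        if elapsed_hours <= 0:
            return 0.0
        return self.spent_usd / elapsed_hours * self.window_hours

    def should_page(self, elapsed_hours: float) -> bool:
        """True when the projection exceeds the tolerated budget."""
        return self.projected_spend(elapsed_hours) > self.budget_usd * self.tolerance


# Example: a $127 weekly budget, with $50 already spent after 12 hours.
guard = SpendGuard(budget_usd=127.0, window_hours=168.0)
guard.record(50.0)
page_now = guard.should_page(elapsed_hours=12.0)
```

A linear projection is crude, but it depends only on billing data, so it still works when every infrastructure metric is green.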

The AI Product Metrics Trap: When Engagement Looks Like Value but Isn't

· 11 min read
Tian Pan
Software Engineer

A METR study published in 2025 asked 16 experienced open-source developers to predict how much faster AI tools would make them. They guessed 24% faster. The study then measured what actually happened across 246 real tasks — bug fixes, features, refactors — randomly assigned to AI-allowed and AI-disallowed conditions. The result: developers with AI access were 19% slower. After the study concluded, participants were surveyed again. They still believed AI had made them 20% faster.

That gap — between perceived productivity and measured productivity — is not a quirk of one study. It is the central problem with how most teams currently measure AI features. The signals that feel like success are, in many cases, measuring the novelty of the tool rather than its usefulness. And the first 30 days are the worst time to look.

AI Succession Planning: What Happens When the Team That Knows the Prompts Leaves

· 11 min read
Tian Pan
Software Engineer

The engineer who built your customer support AI leaves for another job. On their last day, you do an offboarding interview and ask them to document what they know. They write a few paragraphs explaining how the system works. Six months later, customer satisfaction scores start slipping. Someone suggests tightening the tone of the system prompt. Another engineer makes the edit, runs a few manual tests, and ships it. Three weeks later, you discover that a specific phrasing in the original system prompt was load-bearing in ways nobody knew — it was the only thing preventing the model from over-escalating tickets on Friday afternoons, a pattern the original engineer had noticed and quietly fixed with a single sentence.

No one knew that sentence existed for a reason. It looked like implementation detail. It was actually institutional knowledge.

Ambient AI Architecture: Designing Always-On Agents That Don't Get Disabled

· 9 min read
Tian Pan
Software Engineer

Most teams building ambient AI ship something users immediately turn off.

The pattern is consistent: the team demos the feature internally, everyone agrees it's useful in theory, and within two weeks of launch the disable rate exceeds 60%. This isn't a model quality problem. It's an architecture problem — and specifically an interrupt threshold problem. Teams design their ambient agents around what the AI can do rather than what users will tolerate when they didn't ask for help.

The gap between explicit invocation ("ask the AI") and ambient monitoring ("the AI watches and acts") is not just a UX question. It demands a fundamentally different system architecture, a different event model, and a different mental model for when an AI agent earns the right to speak.

Annotator Bias in Eval Ground Truth: When Your Labels Are Systematically Steering You Wrong

· 10 min read
Tian Pan
Software Engineer

A team spent six months training a sentiment classifier. Accuracy on the holdout set looked solid. They shipped it. Three months later, an audit revealed the model consistently rated product complaints from non-English-native speakers as more negative than identical complaints from native speakers — even when the text said the same thing. The root cause wasn't the model architecture. It wasn't the training procedure. It was the annotation team: twelve native English speakers in one timezone, none of whom noticed that certain phrasings carried different emotional weight in translated text.

The model had learned the annotators' blind spots, not the actual signal.

This is annotator bias in practice. It doesn't announce itself. It shows up as an eval score you trust, a benchmark rank that looks reasonable, a deployed system that behaves strangely on subgroups you didn't test carefully enough. Ground truth corruption is upstream of everything else in your ML pipeline — and it's the problem most teams discover too late.

Behavioral SLAs for AI-Powered APIs: Writing Contracts for Non-Deterministic Outputs

· 10 min read
Tian Pan
Software Engineer

Your payment service has a 99.9% uptime SLA. Requests either succeed or fail with a documented error code. When something breaks, you know exactly what broke.

Now imagine you've shipped a smart invoice-parsing API that wraps an LLM. One Monday morning, your largest customer calls: "Your API returned a valid JSON object, but the total_amount field is off by a factor of ten on invoices with foreign currencies." Your service returned HTTP 200. Your uptime dashboard is green. By every traditional SLA metric, you didn't break anything. But you absolutely broke something — and you have no contractual language to even describe what went wrong.

This is the gap at the center of most AI API deployments today. The contract that governs what your API promises was written for deterministic systems, and LLMs are not deterministic systems.
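To make the invoice example concrete, here is a rough sketch of one behavioral check that an HTTP 200 cannot express. It assumes each parsed invoice carries a `line_items` list with an `amount` per item; that field, the function names, and the 1% tolerance are assumptions for illustration. Only `total_amount` comes from the scenario above.

```python
import math


def total_is_consistent(invoice: dict, rel_tol: float = 0.01) -> bool:
    """Check that total_amount matches the sum of line items within rel_tol."""
    line_sum = sum(item["amount"] for item in invoice["line_items"])
    return math.isclose(invoice["total_amount"], line_sum, rel_tol=rel_tol)


def pass_rate(invoices: list[dict]) -> float:
    """Fraction of sampled responses that pass the consistency check."""
    if not invoices:
        return 1.0
    passed = sum(total_is_consistent(inv) for inv in invoices)
    return passed / len(invoices)


# Two sampled responses: one correct, one off by a factor of ten.
good = {"total_amount": 120.0, "line_items": [{"amount": 100.0}, {"amount": 20.0}]}
bad = {"total_amount": 1200.0, "line_items": [{"amount": 100.0}, {"amount": 20.0}]}
rate = pass_rate([good, bad])
```

A contract could then promise a minimum pass rate over a sampled window, which is a statement about output behavior and not about uptime.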

Coding Agents in the Monorepo: Why Context Windows and 50-Service Repos Don't Mix

· 9 min read
Tian Pan
Software Engineer

Here's a failure mode that happens silently: you ask a coding agent to update the authentication service's token refresh endpoint. The agent produces clean-looking code — confident, well-commented, type-safe. It also calls a method signature that was renamed three months ago in a shared library three directories up. The tests for that endpoint pass because the mock still uses the old signature. The bug surfaces in staging when the real library gets pulled in.

This isn't a hallucination in the abstract sense. The model knew about that method — it existed somewhere in the training data or was briefly visible in context. The problem is architectural: the agent never had access to the current version of the interface it was calling.

The Cold Start Trap in AI Products

· 12 min read
Tian Pan
Software Engineer

There's a specific kind of failure that kills AI features before they ever get a chance to prove themselves. It doesn't look like a technical failure — the model architecture is sound, the eval scores are decent, and the feature ships. But adoption is flat, users bounce, and six months later the team quietly deprioritizes the feature. The diagnosis, delivered in a retrospective: "not enough data."

This is the cold start trap. AI features improve with engagement data, but users won't engage until the feature is good enough to be useful. The circular dependency is not a solvable math problem — it's a product design challenge disguised as an engineering problem. And most teams walk into it with the same wrong plan: collect data first, ship ML second.

Cultural Calibration for Global AI Products: Why Translation Is 10% of the Problem

· 9 min read
Tian Pan
Software Engineer

There is a quiet failure mode baked into almost every globally deployed AI product. An engineer localizes the UI strings, runs the model outputs through a translation API, has a native speaker spot-check a handful of responses, and ships. The product is technically multilingual. It is not culturally competent. Users in Tokyo, Riyadh, and Chengdu receive outputs that are grammatically correct and culturally wrong — responses that signal disrespect, confusion, or distrust in ways the team will never see in aggregate metrics.

The research is unambiguous: every major LLM tested reflects the worldview of English-speaking, Protestant European societies. Studies testing models against representative data from 107 countries found not a single model that aligned with how people in Africa, Latin America, or the Middle East build trust, show respect, or resolve conflict. Translation patches the surface. The underlying calibration remains Western.

Database Connection Pools Are the Hidden Bottleneck in Your AI Pipeline

· 9 min read
Tian Pan
Software Engineer

Your AI feature ships. Response times look reasonable in staging. A week later, production starts throwing mysterious p99 spikes — latency jumps from 800ms to 8 seconds under moderate load, with no GPU pressure, no model errors, and no obvious cause. You add more replicas. It doesn't help. You profile the model server. It's fine. You add caching. Still no improvement.

Eventually someone checks the database connection pool wait time. It's been sitting at 95% utilization since day three.

This is the most common category of AI production incident that nobody talks about, because connection pool exhaustion looks like model slowness. The symptoms appear in the wrong layer — you see high latency on LLM calls, not on database queries — so the diagnosis takes days while users experience degraded responses.
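One way to surface this earlier is to record how long each request waits for a connection, separately from how long the query itself takes. The sketch below uses a toy pool built on the standard library's `queue.Queue`; `TimedPool` and its methods are hypothetical, and a real driver's pool would expose its own metrics under different names.

```python
import queue
import time
from contextlib import contextmanager


class TimedPool:
    """Toy connection pool that records how long each acquire waited."""

    def __init__(self, connections):
        self._free = queue.Queue()
        for conn in connections:
            self._free.put(conn)
        self.wait_seconds = []  # one entry per successful acquire

    @contextmanager
    def acquire(self, timeout: float = 5.0):
        start = time.perf_counter()
        # Blocks until a connection is free; raises queue.Empty on timeout.
        conn = self._free.get(timeout=timeout)
        self.wait_seconds.append(time.perf_counter() - start)
        try:
            yield conn
        finally:
            # Always return the connection, even if the caller raised.
            self._free.put(conn)

    def max_wait(self) -> float:
        """Longest observed wait for a connection, in seconds."""
        return max(self.wait_seconds, default=0.0)


# Placeholder strings stand in for real connection objects.
pool = TimedPool(["conn-1"])
with pool.acquire() as conn:
    first = conn
with pool.acquire() as conn:
    second = conn
```

Exporting `wait_seconds` as its own latency histogram places the symptom in the layer where it originates, so it is less likely to be mistaken for model slowness.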

The Demo-to-Production Failure Pattern: Why AI Prototypes Collapse When Real Users Arrive

· 10 min read
Tian Pan
Software Engineer

Thirty percent of generative AI projects are abandoned after proof of concept. Ninety-five percent of enterprise pilots deliver zero measurable business impact. Gartner projects 40% of agentic AI projects will be canceled before the end of 2027. These aren't failures of the underlying technology — they're failures of the gap between demo and production.

The demo-to-production failure pattern is predictable, repeatable, and almost entirely preventable. It happens because the conditions that make a demo look great are systematically different from the conditions that make production work. Teams optimize for the former and get ambushed by the latter.