720 posts tagged with "llm"

Fine-Tune Orphan: Recovering Domain Expertise When the Base Model Is Deprecated

· 9 min read
Tian Pan
Software Engineer

On January 4, 2024, OpenAI retired the /fine-tunes endpoint. Every fine-tuned Ada, Babbage, Curie, and Davinci model stopped responding. Teams that had spent months building production systems on these models — careful prompt design, annotated datasets, labeling pipelines — woke up to HTTP 404s. The fine-tunes didn't migrate. The learned behaviors didn't transfer. The domain expertise was gone.

This wasn't an isolated case. Google followed in August 2024 by completely decommissioning the PaLM API, with no backward-compatible grace period. Unlike OpenAI, which at least let existing GPT-3.5 fine-tunes keep running while blocking new training runs, Google's shutdown meant production inference stopped the same day. If your fine-tuned PaLM model was in the critical path, you had a service outage.

Statistical Watermarking for LLM Output: How Token Logit Bias Creates Detectable Signatures

· 9 min read
Tian Pan
Software Engineer

Google has been watermarking Gemini output for every user since October 2024 — 20 million users, no perceptible quality degradation, algorithmically detectable. OpenAI has a working prototype that requires only a few hundred tokens to produce a reliable signal. Anthropic says it's on the roadmap. The EU AI Act's Article 50 mandates machine-readable marking of AI-generated content for covered providers. And yet: a $0.88-per-million-token attack achieves ~100% evasion success against seven recent watermarking schemes simultaneously.

This is the actual state of LLM text watermarking. The gap between what's deployed, what the papers claim, and what adversaries can do is wider than most teams realize — and the engineering decisions you make about watermarking depend heavily on which side of that gap you're standing on.
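To make "algorithmically detectable" concrete, here is a minimal sketch of detection for a green-list watermark, in the style of published logit-bias schemes. It is an assumption for illustration, not the scheme any vendor named above has deployed. The key, the `GAMMA` fraction, the toy vocabulary, and the `toy_watermarked` generator (which always picks a green token rather than softly biasing logits) are all hypothetical.

```python
import hashlib
import math
import random

GAMMA = 0.5  # assumed fraction of the vocabulary marked "green" at each step


def is_green(prev_token: str, token: str, key: str) -> bool:
    """Keyed pseudo-random green/red assignment, seeded by the previous token."""
    digest = hashlib.sha256(f"{key}|{prev_token}|{token}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") / 2**64 < GAMMA


def z_score(tokens: list[str], key: str) -> float:
    """Standard deviations by which the green count exceeds the chance level."""
    n = len(tokens) - 1  # the first token has no predecessor to seed from
    if n < 1:
        raise ValueError("need at least two tokens")
    green = sum(is_green(p, t, key) for p, t in zip(tokens, tokens[1:]))
    return (green - GAMMA * n) / math.sqrt(n * GAMMA * (1 - GAMMA))


def toy_watermarked(vocab: list[str], length: int, key: str, seed: int = 0) -> list[str]:
    """Toy stand-in for a biased sampler: always emit a green token if one exists."""
    rng = random.Random(seed)
    tokens = [vocab[0]]
    for _ in range(length):
        greens = [w for w in vocab if is_green(tokens[-1], w, key)]
        tokens.append(rng.choice(greens or vocab))
    return tokens


VOCAB = [f"w{i}" for i in range(64)]  # hypothetical vocabulary
marked = toy_watermarked(VOCAB, 200, key="demo-key")
plain = [f"w{i}" for i in range(201)]  # fixed sequence with no watermark applied

print(f"watermarked z = {z_score(marked, 'demo-key'):.2f}")
print(f"plain z       = {z_score(plain, 'demo-key'):.2f}")
```

A real detector would compare the z-score against a threshold chosen for the false-positive rate it can tolerate. The evasion attacks mentioned above work by pushing the green fraction back toward the chance level.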

The AI A/B Test That Lied: Novelty, Carryover, and Anchoring Bias in LLM Experiments

· 10 min read
Tian Pan
Software Engineer

Your AI feature shipped with confidence. The A/B test showed a statistically significant 12% lift in user engagement. The confidence intervals didn't overlap. The sample size was adequate. The p-value was comfortably under 0.05. Six weeks later, the metric has flatlined back to baseline. Three months in, it's actually below baseline. The experiment told you the feature worked. The experiment lied.

This isn't a bug in your statistical tooling. It's a fundamental mismatch between what standard A/B testing measures and what happens when humans interact with probabilistic AI systems over time. Three specific biases — novelty inflation, anchoring, and carryover — conspire to inflate every AI feature experiment, and the standard remedy of adding a holdout group doesn't fix any of them.

The AI Efficiency Paradox: When Your Best Feature Kills Your Revenue

· 9 min read
Tian Pan
Software Engineer

In early 2026, Atlassian reported something that hadn't happened in the company's history: a decline in enterprise seat counts. For a company whose entire growth model rests on expansion revenue — selling more seats as customer organizations grow — this was a structural alarm, not a blip. The proximate cause wasn't churn or product failure. It was that Atlassian's own AI features had made teams so much more productive that fewer seats were needed to do the same amount of work.

This is the AI efficiency paradox: build a feature that genuinely saves users time, and you may be training them to need less of your product. The more useful your AI, the faster your pricing model breaks.

Story Points Don't Survive First Contact With an LLM

· 8 min read
Tian Pan
Software Engineer

Here is a failure mode that happens quietly, at every company with a mature Agile practice that decides to add an LLM feature: the team estimates the work in story points, assigns it to a two-week sprint, and then spends three sprints in a row reporting "70% done" while the engineering manager stares at a burndown chart that refuses to burn down. Nobody lied. The feature is genuinely hard to finish — because the conditions that make story points a useful planning tool don't exist for AI features, and nobody noticed until they were already committed.

The problem is not that engineers are bad at estimating. The problem is that story points encode assumptions about the nature of software work — assumptions that LLM features violate structurally, not accidentally.

AI Feature Dependency Graphs: Resilience Engineering When Your Services Share a Model

· 10 min read
Tian Pan
Software Engineer

Your embedding model goes down at 3 PM on a Tuesday. Within thirty seconds, your support chat stops answering questions, your personalized recommendation engine starts returning empty results, your document search returns nothing, and your onboarding assistant stops working. Your on-call engineer opens the incident channel and sees fifteen simultaneous alerts from features that have no visible relationship to each other. There is no stack trace pointing to the root cause. It looks like a distributed systems outage — but it isn't. It's a single shared dependency failing, and you didn't know fifteen features shared it.

This is the AI feature dependency problem: the infrastructure layer underneath your product features is deeply interconnected, but your architecture diagrams show each feature as an isolated box. When the coupling is invisible, failure propagation is invisible too — until it isn't.
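One way to make that coupling visible is to record each feature's dependencies explicitly and invert the map. The sketch below assumes a hand-maintained mapping; the feature and dependency names are hypothetical, loosely following the scenario above.

```python
from collections import defaultdict

# Hypothetical feature -> dependency map; every name here is illustrative.
FEATURE_DEPS = {
    "support_chat": ["embedding_model", "chat_llm"],
    "recommendations": ["embedding_model"],
    "document_search": ["embedding_model", "vector_index"],
    "onboarding_assistant": ["embedding_model", "chat_llm"],
}


def shared_dependencies(feature_deps, min_features=2):
    """Invert the map and keep dependencies used by at least min_features features."""
    dependents = defaultdict(set)
    for feature, deps in feature_deps.items():
        for dep in deps:
            dependents[dep].add(feature)
    return {
        dep: sorted(features)
        for dep, features in dependents.items()
        if len(features) >= min_features
    }


def blast_radius(feature_deps, failed_dep):
    """List the features that break when a single dependency fails."""
    return sorted(f for f, deps in feature_deps.items() if failed_dep in deps)


print(shared_dependencies(FEATURE_DEPS))
print(blast_radius(FEATURE_DEPS, "embedding_model"))
```

With a map like this, an alert on one dependency can be annotated with the features it takes down, so the on-call engineer sees one root cause instead of unrelated alarms.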

AI Output Volatility Is a Business Risk You're Probably Underpricing

· 9 min read
Tian Pan
Software Engineer

When companies talk about AI risk, the conversation usually gravitates toward the obvious failures: hallucinated facts, biased outputs, legal liability from generated content. What gets far less attention is a quieter structural problem: you've made commercial commitments — pricing tiers, SLAs, customer-facing accuracy claims — on top of a system whose outputs are inherently probabilistic. Every time the model generates a response, it's sampling from a distribution. The contract doesn't mention distributions.

This is a business risk that most teams discover late, when a customer complains that the same document review workflow gave completely different results on Monday and Friday. Or when a regulator asks for reproducibility guarantees that the system architecturally cannot provide.

Your System Prompts Are Still in English: The Silent Cost of Incomplete AI Localization

· 8 min read
Tian Pan
Software Engineer

Your team ships an AI feature. You celebrate the localization work: every button label, tooltip, and error message has been translated into twelve languages. The product manager signs off. The feature goes live globally.

Then, six weeks later, a user in Germany posts a screenshot. The AI's response has the right words but wrong register — awkward formality for a casual support context. A Japanese user reports that structured outputs contain dates formatted as MM/DD/YYYY, confusing their downstream tooling. A Brazilian support engineer notices the AI occasionally slips into English mid-sentence when reasoning through complex queries. These aren't infrastructure failures. Your dashboards show green. But for non-English users, the product is quietly worse.

The root cause is almost always the same: teams translate UI strings but leave system prompts in English. It feels like localization. It isn't.

The Context Format Decision Most Teams Make Accidentally: JSON vs Markdown vs Plain Text

· 9 min read
Tian Pan
Software Engineer

Most teams pick a context format once, early in development, and never revisit it. A developer reaches for JSON because it looks structured and machine-readable. Another grabs markdown because it's what they use in README files. Plain text gets chosen when nothing else seems necessary. These are not engineering decisions — they're habits. And they silently shape how your model reasons.

The format you pass to an LLM is not inert packaging. It is an instruction. Structured JSON context primes the model toward schema-following behavior. Markdown encourages hierarchical synthesis. Plain text opens up more flexible inference. Choosing the wrong format for the task can degrade accuracy by 40% or more, and you won't see the error in your logs.
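To treat format as a deliberate choice, it helps to render the same context in all three forms and evaluate each against the same test set. This is a minimal sketch; the `record` fields and the `render_context` helper are hypothetical, and the evaluation loop is left to whatever harness you already use.

```python
import json


def as_json(record: dict) -> str:
    """Structured rendering: keys and types are explicit."""
    return json.dumps(record, indent=2)


def as_markdown(record: dict, title: str) -> str:
    """Hierarchical rendering: a heading followed by a bullet per field."""
    lines = [f"## {title}"] + [f"- **{key}**: {value}" for key, value in record.items()]
    return "\n".join(lines)


def as_plain_text(record: dict) -> str:
    """Flat rendering: one 'key: value' line per field."""
    return "\n".join(f"{key}: {value}" for key, value in record.items())


def render_context(record: dict, fmt: str, title: str = "Context") -> str:
    """Pick a rendering by name so the format becomes an experiment variable."""
    renderers = {
        "json": as_json,
        "markdown": lambda r: as_markdown(r, title),
        "text": as_plain_text,
    }
    return renderers[fmt](record)


# Hypothetical context record.
record = {"customer": "Acme", "plan": "enterprise", "open_tickets": 3}
for fmt in ("json", "markdown", "text"):
    print(f"--- {fmt} ---")
    print(render_context(record, fmt, title="Customer"))
```

Keeping the renderer behind a single function makes it cheap to rerun an evaluation when the model or the task changes.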

The AI Code Feedback Loop: How Today's Generated Code Trains Tomorrow's Models

· 9 min read
Tian Pan
Software Engineer

About 41% of all new code merged globally in 2025 was AI-generated. Most of that code flows into production repositories that are publicly indexed, scraped, and eventually fed back into the next round of training data for AI coding tools. The implication is straightforward but its consequences are still unfolding: AI models are increasingly being trained on the outputs of prior AI models, with no structured record of which code came from where.

This is the context pollution problem. It is not hypothetical. The feedback loop is already operating at scale, the quality effects are measurable, and the failure mode is unusual enough that it can look like improvement in the short term while the underlying distribution quietly degrades.

The Cross-User Consistency Problem: When Your AI Gives Different Answers to the Same Question

· 9 min read
Tian Pan
Software Engineer

Two analysts at the same company both ask your AI assistant: "What was our Q3 churn rate?" One gets 4.2%. The other gets 4.8%. Neither is wrong: they just queried at different times, in different session contexts, against a retrieval index that ranked slightly different chunks. The AI answered both confidently, without hedging, without flagging the discrepancy. The analysts go into the same meeting with different numbers, and your tool has just become a liability.

This is the cross-user consistency problem, and it's one of the most common reasons enterprise AI deployments quietly lose trust. The failure isn't a hallucination in the classic sense — no facts were invented. The failure is that your system is non-deterministic at scale, and that non-determinism is invisible until two users compare notes.

The Dev-to-Prod Cost Shock: Why Your AI Feature Costs Pennies in Staging and Dollars in Production

· 8 min read
Tian Pan
Software Engineer

A proof-of-concept costs you $200 in API tokens. You get the green light to ship. Six weeks later, the invoice is $18,000. This is not a pricing change or a billing mistake — it is a failure of cost modeling, and it is the most predictable surprise in AI engineering.

The gap between staging and production costs for AI features is not random. It follows a consistent pattern: staging is structurally designed, often by accident, to hide every single cost driver that matters in production. Understanding those drivers is how you avoid the first invoice being a crisis.