
17 posts tagged with "llm-engineering"


Synthetic Training Data Quality Collapse: How Feedback Loops Destroy Your Fine-Tuned Models

· 10 min read
Tian Pan
Software Engineer

You generate 50,000 synthetic instruction-following examples with GPT-4, fine-tune a smaller model on them, deploy it, and the results look great. Six months later, your team repeats the process — except this time you generate the examples with the fine-tuned model to save costs. The second model's evals are slightly lower, but within noise. You fine-tune the next version the same way. By the fourth iteration, your model's outputs have a strange homogeneity. Users report it sounds robotic. It struggles with anything that doesn't fit a narrow template. What should be your most capable fine-tune has become your worst.

This is model collapse — the progressive, self-reinforcing degradation that happens when LLMs train on data generated by other LLMs. It is not a theoretical risk. It is a documented failure mode with measurable mechanics, and it is increasingly likely to affect teams that have normalized synthetic data generation without thinking carefully about the feedback dynamics.
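
To see the mechanic in miniature, here is a toy sketch (illustrative only, not code from the post): fit a distribution, sample from it, refit on the samples, and repeat. Because generation favors high-probability outputs, each pass narrows the distribution a little more, which is the same self-reinforcing dynamic that homogenizes a fine-tuned model.

```python
import numpy as np

# Toy feedback loop: each "generation" is trained only on data sampled from
# the previous one, and sampling favors high-probability outputs (mimicked
# here by dropping the tails). Watch sigma -- the diversity -- decay.
rng = np.random.default_rng(0)
mu, sigma = 0.0, 1.0                                      # generation 0: real data
for gen in range(1, 7):
    samples = rng.normal(mu, sigma, 5_000)                # synthetic corpus
    kept = samples[np.abs(samples - mu) < 1.65 * sigma]   # tails underrepresented
    mu, sigma = kept.mean(), kept.std()                   # next model fits only this
    print(f"generation {gen}: sigma = {sigma:.3f}")
```

Run it and sigma falls by roughly a fifth per generation; by the sixth it is about a quarter of the original spread. That shrinking spread is the numerical analogue of "it sounds robotic."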

Writing Tools for Agents: The ACI Is as Important as the API

· 9 min read
Tian Pan
Software Engineer

Most engineers approach agent tools the same way they approach writing a REST endpoint or a library function: expose the capability cleanly, document the parameters, handle errors. That's the right instinct for humans. For AI agents, it's exactly wrong.

A tool used by an agent is consumed non-deterministically, parsed token by token, and selected by a model that has no persistent memory of which tool it used last Tuesday. The tool schema you write is not documentation — it is a runtime prompt, injected into the model's context at inference time, shaping every decision the agent makes. Every field name, every description, every return value shape is a design decision with measurable performance consequences. This is the agent-computer interface (ACI), and it deserves the same engineering investment you'd put into any critical user-facing interface.
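
As a concrete illustration, here is a hypothetical tool definition (the tool name and every field below are invented for this example, not taken from the post). Every string in it is prompt text the model reads at inference time:

```python
# Hypothetical tool schema in the JSON-schema style most agent frameworks use.
# The name, every description, and the argument shape are injected into the
# model's context verbatim; they steer tool selection and argument
# construction on every single call.
search_customer_orders = {
    "name": "search_customer_orders",
    "description": (
        "Look up a customer's recent orders. Use this before answering any "
        "question about shipping status. Returns at most 20 orders, newest first."
    ),
    "input_schema": {
        "type": "object",
        "properties": {
            "customer_email": {
                "type": "string",
                "description": "Exact email on the account, e.g. 'ada@example.com'.",
            },
            "status_filter": {
                "type": "string",
                "enum": ["pending", "shipped", "delivered", "any"],
                "description": "Narrow the results; pass 'any' when unsure.",
            },
        },
        "required": ["customer_email"],
    },
}
```

Renaming `status_filter` to `filter`, or deleting the "use this before answering" sentence, is the kind of change that shifts agent behavior in measurable ways; that is the sense in which the schema is an interface rather than documentation.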

Why Your AI Agent Should Write Code Instead of Calling Tools

· 11 min read
Tian Pan
Software Engineer

Most AI agents are expensive because of a subtle architectural mistake: they treat every intermediate result as a message to be fed back into the model. Each tool call becomes a round trip through the LLM's context window, and by the time a moderately complex task completes, you've paid to process the same data five, ten, maybe twenty times. A single 2-hour sales transcript passed between three analysis tools might cost you 50,000 tokens — not for the analysis, just for the routing.

There's a better way. When agents write and execute code rather than calling tools one at a time, intermediate results stay in the execution environment, not the context window. The model sees summaries and filtered outputs, not raw data. The difference isn't incremental — it's been measured at 98–99% token reductions on real workloads.
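
Here is a minimal sketch of what that looks like (the file name and the analysis are invented for the example, and assume a sandboxed Python interpreter): the agent writes code, the sandbox runs it, and only the printed summary travels back into the model's context.

```python
from collections import Counter

# Agent-authored snippet running in an execution sandbox. The raw transcript
# and every intermediate result stay here; the model only ever sees stdout.
with open("sales_call_transcript.txt") as f:          # ~2 hours of dialogue
    transcript = f.read()

turns = [line for line in transcript.splitlines() if line.strip()]
questions = [t for t in turns if t.rstrip().endswith("?")]
speakers = Counter(line.split(":", 1)[0] for line in turns if ":" in line)

# A few dozen tokens re-enter the context window, not the 50,000-token transcript.
print(f"turns={len(turns)} questions={len(questions)} speakers={dict(speakers)}")
```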

How AI Agents Actually Learn Over Time

· 8 min read
Tian Pan
Software Engineer

Most teams building AI agents treat the model as a fixed artifact. You pick a foundation model, write your prompts, wire up some tools, and ship. If the agent starts making mistakes, you tweak the system prompt or switch to a newer model. Learning, in this framing, happens upstream—at the AI lab, during pretraining and RLHF—not in your stack.

This is the wrong mental model. Agents that improve over time do so at three distinct architectural layers, and only one of them involves touching model weights. Teams that understand this distinction build systems that compound in quality; teams that miss it keep manually patching the same failure modes.

Measuring AI Agent Autonomy in Production: What the Data Actually Shows

· 7 min read
Tian Pan
Software Engineer

Most teams building AI agents spend weeks on pre-deployment evals and almost nothing on measuring what their agents actually do in production. That's backwards. The metrics that matter—how long agents run unsupervised, how often they ask for help, how much risk they take on—only emerge at runtime, across thousands of real sessions. Without measuring these, you're flying blind.
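
As a rough sketch of the instrumentation this implies (the session fields below are assumptions about your own logging schema, not the study's), the core metrics reduce to a small aggregation over session records:

```python
from dataclasses import dataclass

@dataclass
class Session:
    duration_minutes: float   # wall-clock time the agent ran
    actions: int              # tool calls, edits, or commands it executed
    escalations: int          # times it stopped to ask a human
    risky_actions: int        # writes, deletes, or other external side effects

def autonomy_metrics(sessions: list[Session]) -> dict[str, float]:
    total = len(sessions)
    return {
        "avg_unsupervised_minutes": sum(s.duration_minutes for s in sessions) / total,
        "escalation_rate": sum(s.escalations > 0 for s in sessions) / total,
        "risky_action_share": sum(s.risky_actions for s in sessions)
                              / max(1, sum(s.actions for s in sessions)),
    }

print(autonomy_metrics([
    Session(42.0, actions=18, escalations=1, risky_actions=2),
    Session(7.5, actions=4, escalations=0, risky_actions=0),
]))
```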

A large-scale study of production agent behavior across thousands of deployments and software engineering sessions has surfaced some genuinely counterintuitive findings. The picture that emerges is not the one most builders expect.