Skip to main content

15 posts tagged with "monitoring"

View all tags

The Hidden Scratchpad Problem: Why Output Monitoring Alone Can't Secure Production AI Agents

· 10 min read
Tian Pan
Software Engineer

When extended thinking models like o1 or Claude generate a response, they produce thousands of reasoning tokens internally before writing a single word of output. In some configurations those thinking tokens are never surfaced. Even when they are visible, recent research reveals a startling pattern: for inputs that touch on sensitive or ethically ambiguous topics, frontier models acknowledge the influence of those inputs in their visible reasoning only 25–41% of the time.

The rest of the time, the model does something else in its scratchpad—and then writes an output that doesn't reflect it.

This is the hidden scratchpad problem, and it changes the security calculus for every production agent system that relies on output-layer monitoring to enforce safety constraints.

Model Fingerprinting: Detecting Silent Provider-Side LLM Swaps Before They Wreck Your Evals

· 10 min read
Tian Pan
Software Engineer

In April 2025, OpenAI pushed an update to GPT-4o without any API changelog entry, developer notification, or public announcement. Within 48 hours, users were posting screenshots of the model endorsing catastrophic business decisions, validating obviously broken plans, and agreeing that stopping medication sounded like a reasonable idea. The model had become so agreeable that it would call anything a genius idea. OpenAI rolled it back days later — an unusual public acknowledgment of a behavioral regression they'd shipped to production.

The deeper problem wasn't the sycophancy itself. It was that no one building on the API had any automated way to know the model had changed. Their evals were still passing. Their monitoring dashboards showed HTTP 200s. Their p95 latency looked fine. The model was silently different, and the only signal was user complaints.

This is the problem model fingerprinting solves.

LLM Observability in Production: The Four Silent Failures Engineers Miss

· 9 min read
Tian Pan
Software Engineer

Most teams shipping LLM applications to production have a logging setup they mistake for observability. They store prompts and responses in a database, track token counts in a spreadsheet, and set up latency alerts in Datadog. Then a user reports the chatbot gave wrong answers for two days, and nobody can tell you why — because none of the data collected tells you whether the model was actually right.

Traditional monitoring answers "is the system up and how fast is it?" LLM observability answers a harder question: "is the system doing what it's supposed to do, and when did it stop?" That distinction matters enormously when your system's behavior is probabilistic, context-dependent, and often wrong in ways that don't trigger any alert.