Skip to main content

2 posts tagged with "ai-quality"

View all tags

When LLMs Review LLMs, Errors Get Laundered Not Caught

· 10 min read
Tian Pan
Software Engineer

Trace the path of a single quality signal through a modern AI pipeline. An agent drafts a response. A second model reviews it and scores it 9 out of 10. That score gets logged. At the end of the quarter, the logged scores become the new eval set, and the next model is tuned to do well against it. Now ask the obvious question: where in that loop did a human ever look at the actual output?

In a lot of pipelines, the honest answer is nowhere. The agent that does the work is reviewed by another agent, and that reviewer's verdict feeds the next round of evaluation. The loop is closed. It runs continuously, it produces a dashboard, and the dashboard is green. What it does not contain, at any point, is a measurement against reality.

Why The Weekly Transcript Review Beats Your AI Dashboard

· 12 min read
Tian Pan
Software Engineer

The most underpriced asset in your AI organization is the hour every week when three people sit in a room and read what your product actually said to users. Not the aggregate scores. Not the rolling averages. Not the dashboard. The actual transcripts. The verbatim outputs. The lazy phrasing the model has quietly settled into. The intent your taxonomy doesn't have a bucket for. The user trying for the third time to express what they want, in three different ways, while your eval rubric scores all three turns "satisfactory."

Teams who institutionalize this hour develop a mental model of their AI feature their dashboards will never surface. Teams who skip it ship for six months on metrics that look fine and learn at the next QBR that the median experience drifted somewhere unfortunate when nobody was looking.