3 posts tagged with "real-time"

The Avatar in the Conference Call: Engineering Real-Time Talking-Head AI for Video Meetings

· 12 min read
Tian Pan
Software Engineer

A voice agent with a face is not just a voice agent plus a face. It is a synchronous-video-AI system, and the difference shows up the first time a human watches the lips drift three frames behind the audio and decides, without being able to articulate why, that the thing on the screen is fake. The voice-only teams that built a 300ms speech pipeline and then bolted a rendering model onto the end of it have just inherited a real-time multimodal problem they did not price into the roadmap.

The threshold is not generous. Below roughly 45ms of audio-video offset, viewers report perfect sync. Past about 45ms with audio leading or 125ms with audio lagging, the brain flags the mismatch as wrong even when the viewer cannot point to the cause. Inside a conversational loop where the avatar must also listen, think, speak, and render, all while a network sits between you and the user, there is no slack to absorb a sloppy seam between the audio output and the rendered face.
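Those thresholds are concrete enough to encode. A minimal sketch of the perceptual check, where the function name, verdict strings, and exact cutoffs are illustrative placeholders for the rough numbers above:

```python
# A minimal sketch of the perceptual sync check described above.
# The numbers are the rough thresholds from the post; nothing here
# comes from a real library.

PERFECT_SYNC_MS = 45       # under this, viewers report perfect sync
AUDIO_LEAD_LIMIT_MS = 45   # audio ahead of the lips: flagged past ~45ms
AUDIO_LAG_LIMIT_MS = 125   # audio behind the lips: flagged past ~125ms

def av_sync_verdict(offset_ms: float) -> str:
    """Positive offset = audio leads video; negative = audio lags."""
    if offset_ms > AUDIO_LEAD_LIMIT_MS:
        return "flagged: audio leads beyond tolerance"
    if offset_ms < -AUDIO_LAG_LIMIT_MS:
        return "flagged: audio lags beyond tolerance"
    if abs(offset_ms) <= PERFECT_SYNC_MS:
        return "perceived as perfect sync"
    return "tolerated, but drifting toward the threshold"

print(av_sync_verdict(30.0))    # perceived as perfect sync
print(av_sync_verdict(-90.0))   # tolerated, but drifting toward the threshold
print(av_sync_verdict(60.0))    # flagged: audio leads beyond tolerance
```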

Why Your Voice Agent Feels Rude: Turn-Taking Is a Latency Budget You Never Wrote Down

· 11 min read
Tian Pan
Software Engineer

The first time you ship a voice agent, you'll get two complaints: "It interrupted me," and "It feels rude." Both are the same bug. The agent isn't impolite; it's running on a latency budget you never wrote down. The chat-style instinct that says "respond when complete" produces a system that, in voice, feels like talking to someone who keeps stepping on your sentences and going silent at all the wrong moments.

Conversational turn-taking in humans happens in a window of roughly 100 to 300 milliseconds, and it does so across every language ever measured. A median 200ms inter-speaker gap isn't an aspiration; it's the baseline humans calibrate against. Anything slower reads as confusion, anything faster reads as interruption, and a voice agent that doesn't model the rhythm explicitly is going to land in one bucket or the other on every turn.

The fix isn't a faster model. It's accepting that voice AI is a soft real-time system whose budget is set by human conversational physics, and writing the budget down before you ship.
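Written down, the budget is small enough to fit in a code review. A minimal sketch of what that looks like, with hypothetical stage names and allocations standing in for a real pipeline:

```python
# A minimal sketch of "writing the budget down," using the human
# turn-taking numbers above as the target. Stage names and estimates
# are hypothetical placeholders, not measurements from any real stack.

TARGET_GAP_MS = 200    # median human inter-speaker gap
CEILING_MS = 300       # top of the natural 100-300ms window

budget_ms = {
    "endpoint_detection": 60,  # deciding the user has actually stopped
    "asr_finalize": 40,        # last partial -> final transcript
    "llm_first_token": 80,
    "tts_first_audio": 40,
}

total = sum(budget_ms.values())
print(f"planned gap: {total}ms (target {TARGET_GAP_MS}ms, ceiling {CEILING_MS}ms)")
if total > CEILING_MS:
    raise SystemExit("over the ceiling: every turn will read as confusion")
if total > TARGET_GAP_MS:
    print("over the median: shave a stage or start speculating on endpoints")
```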

Voice Agents Are Not Chat Agents With a Microphone: The Half-Duplex Tax

· 10 min read
Tian Pan
Software Engineer

A voice agent that scores perfectly on every transcript-level benchmark can still feel subtly wrong on a real call. The words are right. The reasoning is right. The latency number on your dashboard reads 520ms end-to-end, which was the target. And yet the person on the other end keeps stumbling, talking over the agent, restarting their sentences, hanging up early. The team ships a better model, the numbers improve, the feeling does not.

The reason has almost nothing to do with what the model says and almost everything to do with when it says it. Voice is not text with audio attached. Human conversation runs on a tight half-duplex protocol with barge-in, backchannel, and overlapping speech, and the timing budgets are measured in milliseconds. Most voice agent problems, once you get past the first week of hallucination fixes, are turn-negotiation problems. And turn negotiation is architectural — you cannot prompt your way out of it.
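To see why it is architectural, consider what even a toy turn manager must do: treat barge-in as a state transition that cuts audio mid-sentence, not as an instruction the model can follow. A minimal sketch, with hypothetical event hooks and a single silence threshold standing in for real endpoint detection:

```python
# A minimal sketch of half-duplex turn negotiation with barge-in,
# assuming voice-activity events arrive from some upstream detector.
# States, hooks, and the 200ms threshold are illustrative, not a spec.

from enum import Enum, auto

class Turn(Enum):
    IDLE = auto()
    USER_SPEAKING = auto()
    AGENT_SPEAKING = auto()

class TurnManager:
    def __init__(self) -> None:
        self.state = Turn.IDLE

    def on_user_voice(self) -> None:
        if self.state is Turn.AGENT_SPEAKING:
            # Barge-in: the user takes the floor. Cut playback now,
            # mid-sentence, and flush anything queued in TTS.
            self.stop_agent_audio()
        self.state = Turn.USER_SPEAKING

    def on_user_silence(self, gap_ms: float) -> None:
        # Not every pause yields the turn: short gaps are breathing,
        # not an invitation to respond.
        if self.state is Turn.USER_SPEAKING and gap_ms > 200:
            self.state = Turn.AGENT_SPEAKING
            self.start_agent_audio()

    def stop_agent_audio(self) -> None:
        print("agent: stop playback, flush TTS queue")

    def start_agent_audio(self) -> None:
        print("agent: take the turn, begin response audio")

tm = TurnManager()
tm.on_user_voice()            # user starts talking
tm.on_user_silence(120.0)     # short pause: stay quiet
tm.on_user_silence(250.0)     # real gap: take the turn
tm.on_user_voice()            # user barges in: cut the agent off
```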