Voice-agent accuracy: from recognition to production

“Voice-agent accuracy” is not the word-recognition rate measured in lab tests. For a business, accuracy is the share of calls the agent handled correctly: it understood the request, kept the dialogue on track, and either brought the caller to the target action or handed off to a human in time. Let’s break down what that accuracy is made of and how to get it into production.

The call pipeline

In a single turn the agent passes through four stages: ASR (caller speech → text), understanding (the LLM decides what to say), TTS (text → voice) and the telephony that carries it all. Errors accumulate along the chain: an ASR miss distorts understanding, and latency at any stage makes the conversation feel unnatural.

Diagram: ASR → understanding (LLM) → synthesis (TTS) → telephony
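
To make the accumulation concrete, here is a minimal sketch of how per-stage success rates compound. The numbers are invented for illustration, and the model assumes stages fail independently and that nothing downstream recovers from an upstream miss. Both are simplifications, so treat the result as intuition rather than a forecast.

```python
from math import prod

# Hypothetical per-stage success rates for one turn (illustrative only).
stage_success = {
    "asr": 0.95,
    "understanding": 0.95,
    "tts": 0.99,
    "telephony": 0.99,
}


def turn_success(rates: dict) -> float:
    """Probability that every stage succeeds, assuming independent failures."""
    return prod(rates.values())


def call_success(rates: dict, turns: int) -> float:
    """Probability that all turns of a call succeed, under the same assumption."""
    return turn_success(rates) ** turns


if __name__ == "__main__":
    print(f"one turn:      {turn_success(stage_success):.3f}")
    print(f"six-turn call: {call_success(stage_success, 6):.3f}")
```

With these made-up rates a single turn succeeds about 88% of the time, and a six-turn call falls below 50%, even though each individual stage looks respectable on its own.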

Recognition: where words are lost

Phone speech arrives over a compressed channel, with noise, accents and interruptions on top. Accuracy drops most on exactly what matters to the business: names, addresses, numbers and amounts. What helps:

Barge-in: the caller can interrupt and the agent stops talking, as a human would
Endpointing: correctly detecting that the person has finished speaking, without cutting them off on a pause
Domain hints: a dictionary of products, common names and formats lifts recognition of critical entities
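
Endpointing is the easiest of the three to show in code. The sketch below assumes a voice-activity detector already labels each audio frame as speech or silence. The 20 ms frame length and 700 ms silence threshold are placeholder values, not recommendations, and `find_endpoint` is a hypothetical helper name.

```python
from typing import Iterable, Optional

FRAME_MS = 20          # assumed frame length
END_SILENCE_MS = 700   # assumed trailing silence that ends the turn


def find_endpoint(is_speech: Iterable[bool],
                  frame_ms: int = FRAME_MS,
                  end_silence_ms: int = END_SILENCE_MS) -> Optional[int]:
    """Return the frame index at which the turn is declared finished, or None."""
    needed = end_silence_ms // frame_ms   # silent frames required in a row
    heard_speech = False
    silent_run = 0
    for i, speech in enumerate(is_speech):
        if speech:
            heard_speech = True
            silent_run = 0                # a short pause is forgiven
        elif heard_speech:                # silence before any speech is ignored
            silent_run += 1
            if silent_run >= needed:
                return i
    return None


if __name__ == "__main__":
    # Silence, speech, a 200 ms pause, more speech, then a long silence.
    frames = [False] * 10 + [True] * 20 + [False] * 10 + [True] * 15 + [False] * 40
    print(find_endpoint(frames))                      # waits out the short pause
    print(find_endpoint(frames, end_silence_ms=100))  # fires inside the pause
```

The second call shows the trade-off: a threshold that is too short ends the turn in the middle of the caller's pause, while one that is too long adds dead air before every reply.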

Latency decides

Even a perfectly understood request is useless if the answer arrives three seconds late: by then the caller thinks the line has frozen. The target is a response within about 1 to 1.5 seconds. Streaming gets you there: ASR transcribes while the caller is still speaking rather than after the whole turn, and the agent starts synthesising speech before the reply has been fully generated.
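
A rough latency budget shows why streaming matters. In this sketch every timing is a hypothetical figure measured from the moment the caller stops speaking, and the model is deliberately simple: without streaming each stage waits for the previous one to finish, and with streaming each stage starts on the previous stage's first chunk.

```python
from dataclasses import dataclass


@dataclass
class Stage:
    name: str
    first_output_ms: int   # time until the stage emits its first usable chunk
    total_ms: int          # time until the stage has finished completely


# Hypothetical timings, for illustration only.
STAGES = [
    Stage("asr_finalise", 150, 400),
    Stage("llm", 350, 1200),
    Stage("tts", 200, 900),
    Stage("telephony", 100, 100),
]


def sequential_latency(stages) -> int:
    """Time to first audio when each stage waits for the previous one to finish."""
    return sum(s.total_ms for s in stages)


def streaming_latency(stages) -> int:
    """Time to first audio when each stage starts on the previous first chunk."""
    return sum(s.first_output_ms for s in stages)


if __name__ == "__main__":
    print("sequential:", sequential_latency(STAGES), "ms")
    print("streaming: ", streaming_latency(STAGES), "ms")
```

With these assumed numbers the sequential pipeline takes 2600 ms to produce the first audio and the streaming one 800 ms. Real pipelines sit somewhere in between, because the first LLM chunk is not always a speakable phrase.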

In voice, latency matters more than a polished phrase. The conversation stays alive as long as the agent replies at a human rhythm.

Script and human handoff

Accuracy also means honest boundaries. The agent should run a standard dialogue confidently, but when in doubt or faced with an unusual request it must not improvise: it escalates to a human and passes along the context already collected. The “time to call a human” logic is wired in explicitly, with a defined set of signals and phrases that trigger the handoff.
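
One way to wire the handoff logic explicitly is a small rule function that runs on every turn. Everything in this sketch is an assumption made for illustration: the phrase list, the thresholds, the intent names and the `TurnState` fields. A production rule set would be tuned on labelled calls.

```python
from dataclasses import dataclass, field
from typing import Optional

# Hypothetical triggers and thresholds.
HANDOFF_PHRASES = ("operator", "real person", "speak to a human", "complaint")
KNOWN_INTENTS = {"book_appointment", "check_order", "opening_hours"}
MIN_ASR_CONFIDENCE = 0.6
MAX_FAILED_TURNS = 2


@dataclass
class TurnState:
    transcript: str
    asr_confidence: float
    intent: Optional[str]          # None when no intent was recognised
    failed_turns: int = 0          # turns where the agent had to re-ask
    collected: dict = field(default_factory=dict)


def handoff_reason(state: TurnState) -> Optional[str]:
    """Return why the call should go to a human, or None to continue."""
    text = state.transcript.lower()
    if any(phrase in text for phrase in HANDOFF_PHRASES):
        return "caller_asked_for_human"
    if state.failed_turns >= MAX_FAILED_TURNS:
        return "repeated_misunderstanding"
    if state.asr_confidence < MIN_ASR_CONFIDENCE:
        return "low_asr_confidence"
    if state.intent not in KNOWN_INTENTS:
        return "unusual_request"
    return None


def handoff_payload(state: TurnState, reason: str) -> dict:
    """Context handed to the human so the caller does not repeat themselves."""
    return {
        "reason": reason,
        "last_utterance": state.transcript,
        "collected": dict(state.collected),
    }


if __name__ == "__main__":
    state = TurnState("Can I talk to an operator please", 0.9, "check_order",
                      collected={"order_id": "A17"})
    reason = handoff_reason(state)
    if reason:
        print(handoff_payload(state, reason))
```

Keeping the rules in one plain function makes them easy to review with the business side and to replay against recorded calls when a threshold changes.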


How to measure and improve

You can’t improve accuracy without measuring it. We label real calls, score the share of correctly handled dialogues at each stage, and test changes with A/B experiments on live traffic. The final metric isn’t the model’s WER (word error rate) but the result in the client’s money: connect rate, conversion to conversation, and the share of calls driven to the target action.
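
As a rough illustration of the measurement loop, the sketch below scores labelled calls per stage and compares two A/B variants with a two-proportion z-test. The label fields, sample data and function names are hypothetical, and the z-test is one common choice swapped in for illustration, not necessarily the method used on real traffic.

```python
from math import erf, sqrt


def stage_accuracy(labelled_calls, stages):
    """Share of calls a reviewer marked as correct at each stage."""
    total = len(labelled_calls)
    return {s: sum(1 for c in labelled_calls if c[s]) / total for s in stages}


def two_proportion_z(success_a, n_a, success_b, n_b):
    """Return (z, two-sided p-value) for the difference between two rates."""
    p_a, p_b = success_a / n_a, success_b / n_b
    pooled = (success_a + success_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Normal CDF expressed through the error function.
    p_value = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))
    return z, p_value


if __name__ == "__main__":
    # Hypothetical reviewer labels for four calls.
    calls = [
        {"asr": True, "understanding": True, "target_action": True},
        {"asr": True, "understanding": False, "target_action": False},
        {"asr": False, "understanding": False, "target_action": False},
        {"asr": True, "understanding": True, "target_action": True},
    ]
    print(stage_accuracy(calls, ["asr", "understanding", "target_action"]))

    # Hypothetical A/B result: calls that reached the target action.
    z, p = two_proportion_z(success_a=120, n_a=1000, success_b=150, n_b=1000)
    print(f"z = {z:.2f}, p = {p:.3f}")
```

The per-stage shares show where dialogues break, and the A/B comparison runs on the business outcome rather than on WER, which matches the metric that matters here.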