“Voice-agent accuracy” is not the word-recognition rate in lab tests. For a business, accuracy is the share of calls the agent handles correctly: it understands the request, keeps the dialogue on track, and either brings the caller to the target action or hands off to a human in time. Let’s break down what it’s made of and how to get it into production.
The call pipeline
In a single turn the agent passes through four stages: ASR (caller speech → text), understanding (the LLM decides what to say), TTS (text → voice), and the telephony that carries it all. Errors accumulate along the chain: an ASR miss distorts understanding, and latency at any stage drains the life out of the conversation.
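To make the accumulation concrete, here is a minimal sketch of how per-stage success rates compound. The rates are made-up illustrative numbers, not measurements, and the model assumes that stages and turns fail independently, which real calls only approximate.

```python
from math import prod

# Hypothetical per-stage success rates for one turn (illustrative, not measured).
stage_success = {
    "asr": 0.95,
    "understanding": 0.93,
    "tts": 0.99,
    "telephony": 0.98,
}


def turn_success(rates: dict[str, float]) -> float:
    """Probability that every stage of one turn succeeds, assuming independence."""
    return prod(rates.values())


def call_success(rates: dict[str, float], turns: int) -> float:
    """Probability that all turns of a call succeed, assuming independent turns."""
    return turn_success(rates) ** turns


if __name__ == "__main__":
    print(f"one turn:    {turn_success(stage_success):.3f}")
    print(f"6-turn call: {call_success(stage_success, 6):.3f}")
```

Even with every stage above 90%, the whole-turn figure is lower than the weakest stage, and it keeps shrinking with each additional turn. That is why the chain is worth measuring end to end rather than stage by stage in isolation.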
Recognition: where words are lost
Phone speech arrives over a compressed channel, with noise, accents and interruptions on top. Accuracy drops most on exactly what matters to the business: names, addresses, numbers and amounts. What helps:
- Barge-in: the caller can interrupt and the agent stops talking, as a person would
- Endpointing: correctly detecting that the person has finished, without cutting them off on a pause
- Domain hints: a dictionary of products, common names and formats that lifts recognition of critical entities
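As an illustration of the endpointing item, here is a minimal sketch of silence-based end-of-turn detection. It assumes an upstream voice-activity detector has already labelled each audio frame as speech or silence. The function name, the 20 ms frame size and the 700 ms threshold are assumptions chosen for the example, not recommended values.

```python
from typing import Optional, Sequence


def find_endpoint(
    frame_is_speech: Sequence[bool],
    frame_ms: int = 20,
    min_silence_ms: int = 700,
) -> Optional[int]:
    """Return the index of the frame at which the turn is considered finished.

    The turn ends once the caller has spoken at least once and then stayed
    silent for min_silence_ms. Returns None if that never happens.
    """
    heard_speech = False
    silent_frames = 0
    for i, is_speech in enumerate(frame_is_speech):
        if is_speech:
            heard_speech = True
            silent_frames = 0  # a pause shorter than the threshold is forgiven
        elif heard_speech:
            silent_frames += 1
            if silent_frames * frame_ms >= min_silence_ms:
                return i
    return None


# Leading silence, speech, a 200 ms pause, more speech, then 800 ms of silence.
frames = [False] * 5 + [True] * 10 + [False] * 10 + [True] * 10 + [False] * 40

if __name__ == "__main__":
    print(find_endpoint(frames))                      # patient threshold
    print(find_endpoint(frames, min_silence_ms=200))  # cuts in on the pause
```

The threshold is the trade-off in miniature: set it too low and the agent cuts the caller off mid-thought, set it too high and every reply starts late.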
Latency is decisive
Even a perfectly understood request is useless if the answer arrives three seconds late: by then the caller already thinks the line has frozen. The target is a response within roughly 1–1.5 seconds. Streaming is what makes this achievable: ASR transcribes while the caller is still speaking instead of waiting for the whole turn, and the agent starts synthesising speech before the reply has been fully generated.
In voice, latency matters more than a pretty phrase. The conversation stays alive as long as the agent replies at a human rhythm.
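A rough way to see why streaming matters is to compare the time to first audio in a sequential pipeline with the same figure in a streaming one. The sketch below uses invented stage timings and hypothetical field names. It is a back-of-the-envelope model, not a benchmark of any particular ASR, LLM or TTS.

```python
from dataclasses import dataclass

BUDGET_MS = 1500  # upper end of the ~1-1.5 second target


@dataclass(frozen=True)
class Timings:
    """Hypothetical stage timings in milliseconds, counted from end of caller speech."""
    asr_batch_ms: int           # transcribe the whole turn after the caller stops
    asr_stream_tail_ms: int     # streaming ASR: finalise only the last words
    llm_first_sentence_ms: int  # until the first sentence of the reply exists
    llm_full_reply_ms: int      # until the whole reply exists
    tts_first_chunk_ms: int     # synthesise the first sentence only
    tts_full_reply_ms: int      # synthesise the whole reply before playing


def sequential_first_audio_ms(t: Timings) -> int:
    """Each stage waits for the previous one to finish completely."""
    return t.asr_batch_ms + t.llm_full_reply_ms + t.tts_full_reply_ms


def streaming_first_audio_ms(t: Timings) -> int:
    """Each stage starts on the first usable piece of the previous stage's output."""
    return t.asr_stream_tail_ms + t.llm_first_sentence_ms + t.tts_first_chunk_ms


example = Timings(
    asr_batch_ms=600,
    asr_stream_tail_ms=200,
    llm_first_sentence_ms=500,
    llm_full_reply_ms=1500,
    tts_first_chunk_ms=250,
    tts_full_reply_ms=900,
)

if __name__ == "__main__":
    for name, ms in [
        ("sequential", sequential_first_audio_ms(example)),
        ("streaming", streaming_first_audio_ms(example)),
    ]:
        verdict = "within" if ms <= BUDGET_MS else "over"
        print(f"{name}: {ms} ms ({verdict} budget)")
```

With these assumed numbers the stages themselves are no faster in the streaming case. The gain comes entirely from not waiting for each stage to finish before the next one starts.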
Script and human handoff
Accuracy also means honest boundaries. The agent should handle a standard dialogue confidently, but when in doubt or faced with an unusual request it must not improvise: it escalates to a human and passes along the context already collected. The “time to call a human” logic is defined explicitly, as a list of the signals and phrases that trigger the handoff.
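One way to keep that logic explicit is a small rule function that returns the reason for a handoff, or nothing if the agent should carry on. Everything in the sketch below is hypothetical: the trigger phrases, the thresholds, and the `TurnState` fields are placeholders that a real deployment would derive from its own labelled calls.

```python
from dataclasses import dataclass, field
from typing import Optional

# Hypothetical trigger phrases; a real list would come from labelled calls.
HANDOFF_PHRASES = ("operator", "real person", "speak to a human", "complaint")


@dataclass
class TurnState:
    transcript: str                      # latest caller utterance
    intent_confidence: float             # 0..1, from the understanding stage
    misunderstandings_in_a_row: int = 0  # consecutive turns the agent failed to parse
    collected: dict = field(default_factory=dict)  # context gathered so far


def handoff_reason(
    state: TurnState,
    min_confidence: float = 0.6,
    max_misunderstandings: int = 2,
) -> Optional[str]:
    """Return why the call should go to a human, or None to continue."""
    text = state.transcript.lower()
    for phrase in HANDOFF_PHRASES:
        if phrase in text:  # naive substring match, enough for a sketch
            return f"caller phrase: {phrase}"
    if state.misunderstandings_in_a_row >= max_misunderstandings:
        return "repeated misunderstanding"
    if state.intent_confidence < min_confidence:
        return "low intent confidence"
    return None


def build_handoff(state: TurnState, reason: str) -> dict:
    """Package the collected context so the human does not start from zero."""
    return {
        "reason": reason,
        "last_utterance": state.transcript,
        "collected": dict(state.collected),
    }
```

Keeping the triggers in one place makes them reviewable: when a labelled call shows a handoff that came too late, the fix is a new phrase or a changed threshold rather than a prompt tweak.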
How to measure and improve
You can’t improve accuracy without measuring it. We label real calls, score the share of correctly handled dialogues per stage, and A/B test changes on live traffic. The final metric isn’t the model’s WER (word error rate) but the outcome in the client’s money: connect rate, conversion to a conversation, and the share of calls brought to the target action.
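As a sketch of what per-stage scoring can look like, the snippet below computes stage pass rates and two call-level shares from a handful of labelled calls. The label schema (`failed_stages`, `target_action`) and the sample data are invented for illustration. A real labelling scheme would be richer.

```python
from collections import Counter

STAGES = ("asr", "understanding", "tts", "telephony")

# Hypothetical reviewer labels: which stages failed, and whether the call
# reached the target action. Invented data, for illustration only.
labelled_calls = [
    {"failed_stages": [], "target_action": True},
    {"failed_stages": ["asr"], "target_action": False},
    {"failed_stages": [], "target_action": False},
    {"failed_stages": ["asr", "understanding"], "target_action": False},
    {"failed_stages": [], "target_action": True},
]


def stage_pass_rates(calls: list[dict]) -> dict[str, float]:
    """Share of calls in which each stage was NOT marked as failed."""
    if not calls:
        raise ValueError("no labelled calls to score")
    failures = Counter(s for call in calls for s in call["failed_stages"])
    return {s: (len(calls) - failures[s]) / len(calls) for s in STAGES}


def correctly_handled_share(calls: list[dict]) -> float:
    """Share of calls with no failed stage at all."""
    if not calls:
        raise ValueError("no labelled calls to score")
    return sum(1 for c in calls if not c["failed_stages"]) / len(calls)


def target_action_share(calls: list[dict]) -> float:
    """Share of calls that reached the business target action."""
    if not calls:
        raise ValueError("no labelled calls to score")
    return sum(1 for c in calls if c["target_action"]) / len(calls)


if __name__ == "__main__":
    print(stage_pass_rates(labelled_calls))
    print(correctly_handled_share(labelled_calls))
    print(target_action_share(labelled_calls))
```

Computing the same three functions separately for the control and the variant arm of an A/B test gives a like-for-like comparison. The per-stage rates then show where a change helped, while the target-action share shows whether it paid off.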