Evaluating voice agents: what to measure (and what to ignore)
A practical evaluation stack for voice: containment, latency, correctness, and safety—measured per workflow.
If you can’t measure it, you can’t improve it—and for voice agents, the gap between a great demo and a reliable deployment is mostly measurement discipline.
This post outlines a lightweight evaluation stack that works in practice.
Start with outcomes, not vibes
Your north-star metrics should map to business outcomes:
- Containment rate: % of conversations resolved without human handoff
- Task success rate: % of conversations that achieved the intended outcome
- Average handle time (AHT): how long a conversation takes end to end, including any escalation
- Cost per resolution: model + tool + infra + human time
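As a rough illustration, the four outcome metrics above could be computed from labeled conversation records along these lines. This is a minimal sketch: the `Conversation` record and its field names are hypothetical, containment is read literally as "succeeded and never handed off", and cost per resolution is taken as total cost divided by the number of successful conversations. Adjust both definitions to match how your team counts.

```python
from dataclasses import dataclass


@dataclass
class Conversation:
    # Hypothetical record shape; rename fields to match your own logs.
    handed_off: bool      # True if a human took over at any point
    task_succeeded: bool  # labeled outcome for the workflow's goal
    duration_s: float     # wall-clock seconds, including escalation time
    cost_usd: float       # model + tool + infra + human time


def outcome_metrics(convs: list[Conversation]) -> dict[str, float]:
    n = len(convs)
    if n == 0:
        raise ValueError("need at least one conversation")
    resolved = sum(c.task_succeeded for c in convs)
    contained = sum(c.task_succeeded and not c.handed_off for c in convs)
    total_cost = sum(c.cost_usd for c in convs)
    return {
        "containment_rate": contained / n,
        "task_success_rate": resolved / n,
        "aht_s": sum(c.duration_s for c in convs) / n,
        # Assumption: cost is spread over successful conversations only.
        "cost_per_resolution": total_cost / resolved if resolved else float("inf"),
    }
```

Dividing cost by resolutions rather than by conversations makes failed calls show up as a higher price per outcome, which is usually the number the business cares about.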
For voice, add two experience metrics:
- p95 end-to-end latency per turn
- interruption rate (barge-in / talk-over) as a proxy for pacing issues
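Both experience metrics are cheap to compute once per-turn data is logged. The sketch below assumes you have a list of per-turn latencies in milliseconds and a per-agent-turn flag for whether the caller talked over the agent; both inputs and the function names are hypothetical. It uses the nearest-rank definition of a percentile, which is one of several common conventions.

```python
def p95(values: list[float]) -> float:
    """Nearest-rank 95th percentile of a non-empty list."""
    if not values:
        raise ValueError("need at least one latency sample")
    ordered = sorted(values)
    # ceil(0.95 * n), done in integer arithmetic to avoid float rounding.
    rank = (95 * len(ordered) + 99) // 100
    return ordered[rank - 1]


def interruption_rate(agent_turn_interrupted: list[bool]) -> float:
    """Share of agent turns during which the caller barged in."""
    if not agent_turn_interrupted:
        raise ValueError("need at least one agent turn")
    return sum(agent_turn_interrupted) / len(agent_turn_interrupted)
```

With 20 turns, one slow outlier does not move the nearest-rank p95, but two do, so small samples deserve a look at the raw values as well.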
Evaluate per workflow
“Overall accuracy” is not a useful metric, because a single average can hide a failing workflow behind a high-volume one that works. A workflow is a specific set of tools, policies, and success criteria.
Examples:
- order status
- refund request
- appointment scheduling
- password reset
Each workflow gets its own scorecard.
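To see why the split matters, here is a small sketch that groups labeled outcomes by workflow. The `(workflow, succeeded)` tuple format and the function name are assumptions made for illustration.

```python
from collections import defaultdict


def success_rate_by_workflow(records: list[tuple[str, bool]]) -> dict[str, float]:
    """Task success rate per workflow from (workflow, succeeded) pairs."""
    totals: dict[str, list[int]] = defaultdict(lambda: [0, 0])  # [successes, count]
    for workflow, succeeded in records:
        totals[workflow][0] += int(succeeded)
        totals[workflow][1] += 1
    return {w: s / n for w, (s, n) in totals.items()}
```

With made-up data of 90 successful order status calls and 10 refund requests of which only 2 succeed, the overall rate is 92% while the refund workflow sits at 20%.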
A simple scorecard template
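One possible shape for such a scorecard, as a sketch: a list of targets per workflow, each with a metric name, a threshold, and a direction. The `Target` type, the metric names, and every threshold below are illustrative placeholders, not recommended values.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Target:
    metric: str
    threshold: float
    higher_is_better: bool

    def passed(self, observed: float) -> bool:
        if self.higher_is_better:
            return observed >= self.threshold
        return observed <= self.threshold


# Placeholder thresholds for a hypothetical refund-request workflow.
REFUND_SCORECARD = [
    Target("task_success_rate", 0.85, higher_is_better=True),
    Target("containment_rate", 0.70, higher_is_better=True),
    Target("p95_latency_ms", 1200.0, higher_is_better=False),
]


def score(targets: list[Target], observed: dict[str, float]) -> dict[str, bool]:
    """Pass/fail per metric for one workflow."""
    return {t.metric: t.passed(observed[t.metric]) for t in targets}
```

Keeping the direction explicit on each target avoids the common bug of treating a latency increase as an improvement.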
The minimum viable dataset
You don’t need 50k conversations to start. You need:
- 50–200 real conversations per workflow
- labeled outcomes (success/fail + why)
- tool traces and timestamps
Then iterate weekly.
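A dataset this small is only useful if every record is complete, so a quick validation pass before each weekly run can help. The sketch below assumes a hypothetical record layout (`LabeledConversation`, `ToolCall`) that covers the three items listed above: a labeled outcome with a reason, tool traces, and timestamps.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ToolCall:
    name: str
    started_at: float  # seconds since call start (assumed convention)
    ended_at: float


@dataclass
class LabeledConversation:
    workflow: str
    success: bool
    failure_reason: Optional[str] = None  # the "why" for failed calls
    tool_calls: list[ToolCall] = field(default_factory=list)


def validate(record: LabeledConversation) -> list[str]:
    """Return a list of labeling problems; an empty list means usable."""
    problems = []
    if not record.success and not record.failure_reason:
        problems.append("failed conversation is missing a failure reason")
    for call in record.tool_calls:
        if call.ended_at < call.started_at:
            problems.append(f"tool call {call.name!r} ends before it starts")
    return problems
```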
What to ignore early
- “Response similarity to gold text” (voice isn’t a chat benchmark)
- “BLEU/ROUGE” style metrics
- “Model win rate” without workflow constraints
Put evals in CI
Treat the agent like software:
- change prompt/tools/policy
- run the eval suite
- block merges on regressions
Even a tiny suite catches the most common failures: tool misuse, policy drift, and latency blowups.
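A merge gate can be as simple as comparing the current eval run against a stored baseline and failing the job when a metric gets worse by more than a tolerance. This is a minimal sketch: the metric names, the tolerances, and the idea of passing both runs as plain dictionaries are assumptions, and wiring the exit code into a specific CI system is left out.

```python
# (metric, higher_is_better, tolerated change) -- placeholder values.
CHECKS = [
    ("task_success_rate", True, 0.02),
    ("p95_latency_ms", False, 100.0),
]


def find_regressions(baseline: dict[str, float], current: dict[str, float]) -> list[str]:
    """Describe every metric that worsened by more than its tolerance."""
    regressions = []
    for metric, higher_is_better, tolerance in CHECKS:
        delta = current[metric] - baseline[metric]
        worsened_by = -delta if higher_is_better else delta
        if worsened_by > tolerance:
            regressions.append(f"{metric}: {baseline[metric]} -> {current[metric]}")
    return regressions


def gate(baseline: dict[str, float], current: dict[str, float]) -> int:
    """Return a process exit code: 0 to allow the merge, 1 to block it."""
    regressions = find_regressions(baseline, current)
    for line in regressions:
        print("REGRESSION", line)
    return 1 if regressions else 0
```

A CI step would typically call `sys.exit(gate(baseline, current))` after loading both result sets, one pair per workflow.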
Want help applying this to your workflow?
Share your top call types and integrations, and we’ll map a safe, measurable rollout plan for your first production voice agent.