Evaluating voice agents: what to measure (and what to ignore)

A practical evaluation stack for voice: containment, latency, correctness, and safety—measured per workflow.

If you can’t measure it, you can’t improve it. For voice agents, the gap between a great demo and a reliable deployment is mostly measurement discipline.

This post outlines a lightweight evaluation stack that works in practice.

Start with outcomes, not vibes

Your north-star metrics should map to business outcomes:

  • Containment rate: % of conversations resolved without human handoff
  • Task success rate: % of conversations that achieved the intended outcome
  • Average handle time (AHT): how long a conversation takes from start to resolution, including any time spent in escalation
  • Cost per resolution: model + tool + infra + human time
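
As a rough illustration, the four outcome metrics above can be computed from a batch of labeled conversations. This is a minimal sketch, not a reference implementation: the `Conversation` record and its field names are hypothetical, and it assumes a conversation counts as "contained" only if the task succeeded and no human took over.

```python
from dataclasses import dataclass


@dataclass
class Conversation:
    # Hypothetical record shape; adapt the fields to your own logs.
    handed_off: bool      # a human took over at some point
    task_succeeded: bool  # labeled outcome for the intended task
    duration_s: float     # total handle time, including any escalation
    cost_usd: float       # model + tool + infra + human time


def outcome_metrics(convs: list[Conversation]) -> dict[str, float]:
    """Compute the four outcome metrics over a non-empty batch."""
    if not convs:
        raise ValueError("need at least one conversation")
    n = len(convs)
    # Assumption: "contained" means resolved AND never handed to a human.
    contained = sum(c.task_succeeded and not c.handed_off for c in convs)
    succeeded = sum(c.task_succeeded for c in convs)
    total_cost = sum(c.cost_usd for c in convs)
    return {
        "containment_rate": contained / n,
        "task_success_rate": succeeded / n,
        "aht_s": sum(c.duration_s for c in convs) / n,
        # Undefined when nothing was resolved; report infinity in that case.
        "cost_per_resolution_usd": (
            total_cost / succeeded if succeeded else float("inf")
        ),
    }
```

Dividing total cost by successful resolutions (rather than by all conversations) is a design choice here: it makes failed conversations raise the cost of each success instead of diluting it.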

For voice, add two experience metrics:

  • p95 end-to-end latency per turn
  • interruption rate (barge-in / talk-over) as a proxy for pacing issues
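
The two experience metrics can be sketched in the same way. The `AgentTurn` record is hypothetical, and the sketch assumes per-turn latency is measured from the end of the caller's speech to the first agent audio; the percentile uses the simple nearest-rank method, which is one of several common definitions.

```python
from dataclasses import dataclass


@dataclass
class AgentTurn:
    # Hypothetical record shape for one agent turn.
    latency_s: float   # end of user speech to first agent audio (assumed)
    interrupted: bool  # the caller barged in while the agent was speaking


def p95(values: list[float]) -> float:
    """Nearest-rank 95th percentile of a non-empty list."""
    if not values:
        raise ValueError("need at least one sample")
    ordered = sorted(values)
    rank = (95 * len(ordered) + 99) // 100  # integer ceil(0.95 * n), 1-based
    return ordered[rank - 1]


def experience_metrics(turns: list[AgentTurn]) -> dict[str, float]:
    """p95 latency and share of agent turns the caller interrupted."""
    return {
        "p95_latency_s": p95([t.latency_s for t in turns]),
        "interruption_rate": sum(t.interrupted for t in turns) / len(turns),
    }
```

Reporting p95 rather than the mean keeps slow outlier turns visible, which is usually what callers notice.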

Evaluate per workflow

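“Overall accuracy” is not a useful metric: it averages across workflows with different tools, policies, and failure modes, so a weak workflow can hide behind a strong one. A workflow is a specific set of tools, policies, and success criteria.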

Examples:

  • order status
  • refund request
  • appointment scheduling
  • password reset

Each workflow gets its own scorecard.
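
To see why a single aggregate number can mislead, here is a small sketch that splits task success by workflow. The workflow names and the `(workflow, succeeded)` tuple shape are illustrative assumptions.

```python
from collections import defaultdict


def success_by_workflow(results: list[tuple[str, bool]]) -> dict[str, float]:
    """Task success rate per workflow from (workflow, succeeded) pairs."""
    totals: dict[str, int] = defaultdict(int)
    wins: dict[str, int] = defaultdict(int)
    for workflow, succeeded in results:
        totals[workflow] += 1
        wins[workflow] += succeeded  # True counts as 1
    return {w: wins[w] / totals[w] for w in totals}


# Illustrative data: many easy calls, a few hard ones.
results = (
    [("order_status", True)] * 18
    + [("refund_request", True), ("refund_request", False)]
)
overall = sum(ok for _, ok in results) / len(results)
per_workflow = success_by_workflow(results)
```

With this made-up data the overall success rate is 0.95, while `refund_request` alone sits at 0.5, which the aggregate would never reveal.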

A simple scorecard template

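Metric             | Definition           | Target | Notes
-------------------|----------------------|--------|----------------------------------
Task success       | Outcome reached      | 0.90   | measured on labeled set
Safety violations  | Policy breached      | 0.00   | hard fail
p95 latency        | end-to-end per turn  | 2.3s   | budgeted by segment
Escalation quality | Good handoff context | 0.95   | summary + entities + tool results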
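
One way to make the scorecard checkable is to encode each target with a direction and compare measured values against it. This is a sketch under assumptions: the metric keys are hypothetical, and the directions (success and escalation quality at or above target, violations and latency at or below) are inferred from the table rather than stated in it.

```python
# Targets from the scorecard above; the comparison directions are assumed.
SCORECARD = {
    "task_success": (">=", 0.90),
    "safety_violations": ("<=", 0.00),  # hard fail on any violation
    "p95_latency_s": ("<=", 2.3),
    "escalation_quality": (">=", 0.95),
}


def failing_metrics(measured: dict[str, float]) -> list[str]:
    """Return the names of metrics that miss their scorecard target."""
    failures = []
    for name, (direction, target) in SCORECARD.items():
        value = measured[name]
        ok = value >= target if direction == ">=" else value <= target
        if not ok:
            failures.append(name)
    return failures


measured = {
    "task_success": 0.93,
    "safety_violations": 0.0,
    "p95_latency_s": 2.6,
    "escalation_quality": 0.95,
}
failed = failing_metrics(measured)
```

In this example only the latency target is missed, so `failed` contains just `p95_latency_s`.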

The minimum viable dataset

You don’t need 50k conversations to start. You need:

  • 50–200 real conversations per workflow
  • labeled outcomes (success/fail + why)
  • tool traces and timestamps

Then iterate weekly.
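
A labeled conversation could be stored as one JSON object per line. The sketch below shows one possible record shape and a parser that rejects failed conversations with no reason attached; every field name here is a hypothetical choice, not a required schema.

```python
import json
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ToolCall:
    name: str
    started_at: float  # Unix timestamp, seconds
    ended_at: float
    ok: bool


@dataclass
class LabeledConversation:
    conversation_id: str
    workflow: str
    success: bool
    failure_reason: Optional[str] = None  # the "why" for failed outcomes
    tool_calls: list[ToolCall] = field(default_factory=list)


def parse_record(line: str) -> LabeledConversation:
    """Parse one JSON line into a labeled conversation, enforcing the label rule."""
    raw = json.loads(line)
    if not raw["success"] and not raw.get("failure_reason"):
        raise ValueError("failed conversations need a failure_reason label")
    calls = [ToolCall(**c) for c in raw.get("tool_calls", [])]
    return LabeledConversation(
        conversation_id=raw["conversation_id"],
        workflow=raw["workflow"],
        success=raw["success"],
        failure_reason=raw.get("failure_reason"),
        tool_calls=calls,
    )
```

Keeping tool traces and timestamps next to the label makes it possible to recompute latency and tool-usage metrics later without re-collecting calls.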

What to ignore early

  • “Response similarity to gold text” (voice isn’t a chat benchmark)
  • BLEU/ROUGE-style metrics (n-gram overlap with a reference text)
  • “Model win rate” without workflow constraints

Put evals in CI

Treat the agent like software:

  • change prompt/tools/policy
  • run the eval suite
  • block merges on regressions

Even a tiny suite catches the most common failures: tool misuse, policy drift, and latency blowups.
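
A merge gate can be as small as a function that compares a candidate run against a stored baseline and returns a nonzero exit code on regression. This is a minimal sketch: the metric names, the per-metric tolerances, and the idea of a stored baseline are all assumptions for illustration.

```python
# metric name -> (higher_is_better, allowed worsening); values are assumptions.
GATES = {
    "task_success": (True, 0.01),
    "safety_violations": (False, 0.0),  # no tolerance: any increase blocks
    "p95_latency_s": (False, 0.1),
}


def regressions(baseline: dict[str, float], candidate: dict[str, float]) -> list[str]:
    """Names of metrics that got worse by more than their tolerance."""
    found = []
    for name, (higher_is_better, tolerance) in GATES.items():
        delta = candidate[name] - baseline[name]
        worse_by = -delta if higher_is_better else delta
        if worse_by > tolerance:
            found.append(name)
    return found


def gate(baseline: dict[str, float], candidate: dict[str, float]) -> int:
    """Exit code for CI: 0 to allow the merge, 1 to block it."""
    found = regressions(baseline, candidate)
    for name in found:
        print(f"REGRESSION {name}: {baseline[name]} -> {candidate[name]}")
    return 1 if found else 0
```

A CI job would then run the eval suite, load both metric dictionaries, and pass the result of `gate(...)` to `sys.exit`, so that a regression fails the pipeline.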

Want help applying this to your workflow?

Share your top call types and integrations, and we’ll map a safe, measurable rollout plan for your first production voice agent.
