Latency budgeting for voice agents

A practical guide to measuring, allocating, and improving end-to-end latency without breaking user experience.

Feb 04, 2026. Tags: engineering, latency, evaluation

Voice UX has zero patience. If your agent responds like a sluggish IVR, users will talk over it, repeat themselves, or bounce.

The fix isn’t “make everything faster” (you can’t). The fix is to treat latency like a budget: measure it, allocate it across components, and enforce it with fallbacks.

The mental model: one budget, many contributors

A typical turn looks like:

Audio in → VAD / endpointing decides “the user is done”
STT produces text (partial + final)
Model decides the next action (respond vs call tool vs ask clarification)
Tools (CRM, ticketing, KB search, payments) do I/O
TTS produces audio
Audio out plays to the user

Each segment has variability; the user experiences the sum.

A concrete budget table

Start by capturing p50 / p95 for each segment and then decide what you can afford.

Segment	p50 (ms)	p95 (ms)	Target p95 budget (ms)	Notes
Endpointing (VAD)	180	450	400	Prefer short turns; allow barge-in
STT finalization	120	350	300	Stream partials; don’t wait for perfect
Model (first token)	150	600	450	Choose faster model for routing
Tool call(s)	220	1200	800	Parallelize; cache; timebox
TTS (first audio)	140	520	400	Stream audio; shorten first sentence
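End-to-end	810	3120	2350	User-perceived budget; each figure is the column sum (a measured end-to-end p95 is typically lower than the sum of segment p95s, so treat 3120 as a conservative estimate)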
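To fill in a table like the one above from raw measurements, a nearest-rank percentile is usually enough. The sketch below is illustrative only: `percentile` and `checkBudget` are hypothetical helper names, and the sample durations are made up for the example.

```typescript
// Nearest-rank percentile: the smallest sample such that at least p% of
// samples are less than or equal to it. Requires a non-empty array.
function percentile(samplesMs: number[], p: number): number {
  if (samplesMs.length === 0) throw new Error("no samples");
  const sorted = [...samplesMs].sort((a, b) => a - b);
  const rank = Math.ceil((p / 100) * sorted.length);
  return sorted[Math.max(rank, 1) - 1];
}

type SegmentReport = { p50: number; p95: number; withinBudget: boolean };

// Compare one segment's measured p95 against its target p95 budget.
function checkBudget(samplesMs: number[], targetP95Ms: number): SegmentReport {
  const p50 = percentile(samplesMs, 50);
  const p95 = percentile(samplesMs, 95);
  return { p50, p95, withinBudget: p95 <= targetP95Ms };
}

// Made-up tool-call durations (ms), for illustration only.
const toolSamplesMs = [180, 200, 210, 220, 230, 250, 300, 400, 900, 1200];
console.log(checkBudget(toolSamplesMs, 800));
// p50 is 230, p95 is 1200, withinBudget is false
```

With only ten samples the p95 is simply the slowest call, which is one reason to collect a few hundred turns before trusting the tail numbers.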

Two important rules:

The end-to-end p95 is your product. If p95 is bad, the experience feels bad.
Budgets imply tradeoffs. If tools are slow, you must constrain model time or response length (or both).
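One way to make the second rule concrete is to shrink the budgets of the stages that have not run yet whenever an earlier stage overspends. This is a minimal sketch, assuming the target budgets from the table and assuming the model's response step and TTS run after the tool call; `rebalance` is a hypothetical helper, and proportional scaling is just one possible policy.

```typescript
type Stage = "vad" | "stt" | "llm" | "tool" | "tts";

// Target p95 budgets from the table above, in milliseconds.
const TARGET_P95_MS: Record<Stage, number> = {
  vad: 400, stt: 300, llm: 450, tool: 800, tts: 400,
};
const END_TO_END_MS = 2350;

// Scale down the budgets of the pending stages so that, together with the
// time already spent, the turn stays inside the end-to-end budget.
function rebalance(spentMs: number, pending: Stage[]): Partial<Record<Stage, number>> {
  const leftMs = Math.max(0, END_TO_END_MS - spentMs);
  const plannedMs = pending.reduce((sum, s) => sum + TARGET_P95_MS[s], 0);
  const scale = plannedMs === 0 ? 0 : Math.min(1, leftMs / plannedMs);
  const out: Partial<Record<Stage, number>> = {};
  for (const s of pending) out[s] = Math.floor(TARGET_P95_MS[s] * scale);
  return out;
}

// Endpointing (400) and STT (300) were on budget, but tools took 1200, not 800.
console.log(rebalance(400 + 300 + 1200, ["llm", "tts"]));
// llm gets 238 ms and tts gets 211 ms instead of 450 and 400
```

In practice the reduced model allowance might translate into a shorter response or a faster model rather than a literal timeout.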

Measure the right thing (and label it consistently)

Most teams measure “request time” for the model and call it done. For voice, you need at least these events, which mark the boundaries of the spans you will report on:

turn.start (first audio frame after silence)
vad.end_of_speech (endpointing fires)
stt.final
llm.first_token and llm.done
tool.start / tool.done per tool
tts.first_audio
audio.playback.start

A good convention is to log durations with a single correlation id per turn.

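// Per-turn durations in milliseconds, all logged under one correlation id.
type TurnSpans = {
  turnId: string;     // correlation id shared by every log line for this turn
  vadMs: number;      // endpointing delay
  sttMs: number;      // STT finalization
  llmTtfbMs: number;  // time to first token
  llmTotalMs: number; // full generation time
  toolMs: number;     // critical path across tool calls (equals the sum only when calls run serially)
  ttsTtfbMs: number;  // time to first audio
  endToEndMs: number; // measured wall-clock time for the whole turn
};

// criticalPathMs uses time-to-first-token and time-to-first-audio rather than
// totals, because streaming lets later stages start before earlier ones finish.
function summarizeTurn(spans: TurnSpans): { turnId: string; endToEndMs: number; criticalPathMs: number } {
  return {
    turnId: spans.turnId,
    endToEndMs: spans.endToEndMs,
    criticalPathMs: spans.vadMs + spans.sttMs + spans.llmTtfbMs + spans.toolMs + spans.ttsTtfbMs,
  };
}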
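The event names listed earlier are timestamps, while `TurnSpans` holds durations. One way to bridge the two is a small per-turn recorder that stores each event under the turn's correlation id and computes the gap between any two of them. This is a sketch, not a prescribed design: `TurnTimer` is a hypothetical class, and the clock is injectable only so the example is deterministic.

```typescript
// Collects named timestamps for a single turn.
class TurnTimer {
  readonly turnId: string;
  private readonly now: () => number;
  private readonly marks = new Map<string, number>();

  constructor(turnId: string, now: () => number = () => Date.now()) {
    this.turnId = turnId;
    this.now = now;
  }

  // Record the current time under an event name such as "stt.final".
  mark(event: string): void {
    this.marks.set(event, this.now());
  }

  // Milliseconds between two recorded events, or undefined if either is missing.
  between(from: string, to: string): number | undefined {
    const a = this.marks.get(from);
    const b = this.marks.get(to);
    return a === undefined || b === undefined ? undefined : b - a;
  }
}

// Fake clock for the example; production code would use the default.
let fakeNowMs = 0;
const timer = new TurnTimer("turn-1", () => fakeNowMs);
timer.mark("vad.end_of_speech");
fakeNowMs = 120;
timer.mark("stt.final");
console.log(timer.turnId, timer.between("vad.end_of_speech", "stt.final")); // turn-1 120
```

Returning `undefined` for a missing event keeps turns that skip a stage (for example, no tool call) from producing misleading zero durations.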

Budget enforcement: timeboxes + fallbacks

A budget that isn’t enforced becomes a dashboard curiosity. Enforce budgets with:

1) Timeboxed tools

Put hard timeouts on external calls.
Prefer partial answers over waiting forever.

Example fallback text when a tool misses its budget:

I’m still checking that. While it loads, can you confirm the last 4 digits of the order number?
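A timebox like this can be sketched with `Promise.race` and a timer. `withTimebox` is a hypothetical helper and the slow lookup is simulated; note that this version only stops waiting and does not cancel the underlying call.

```typescript
// Resolves with the tool result if it arrives within budgetMs, otherwise
// with the fallback. Errors from the tool call still propagate to the caller.
async function withTimebox<T>(work: Promise<T>, budgetMs: number, fallback: T): Promise<T> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<T>((resolve) => {
    timer = setTimeout(() => resolve(fallback), budgetMs);
  });
  try {
    return await Promise.race([work, timeout]);
  } finally {
    clearTimeout(timer); // avoid a stray timer when the tool wins the race
  }
}

// Simulated slow tool (200 ms) against a 50 ms budget, for illustration only.
const slowLookup = new Promise<string>((resolve) =>
  setTimeout(() => resolve("Order shipped."), 200),
);
withTimebox(slowLookup, 50, "I'm still checking that.").then((text) => console.log(text));
// prints: I'm still checking that.
```

If the late result is still useful, the caller can keep a reference to the original promise and speak the answer once it resolves.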

2) Progressive responses

Don’t wait to craft a perfect paragraph. Voice wants quick acknowledgement, then substance.

A useful pattern is:

First sentence: acknowledge + plan (short)
Then: result (from tools / reasoning)
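The acknowledge-then-result pattern maps naturally onto an async generator: the first chunk is available immediately, the second only when the slow work finishes. A minimal sketch, where `progressiveReply` is a hypothetical name, the lookup is simulated, and `console.log` stands in for handing a chunk to TTS:

```typescript
// Yields the acknowledgement right away, then the substantive result once
// the slow work (tool call, longer generation) has finished.
async function* progressiveReply(
  ack: string,
  result: Promise<string>,
): AsyncGenerator<string> {
  yield ack;          // first sentence: acknowledge + plan
  yield await result; // then: the result
}

// Simulated usage; the strings and the 50 ms delay are made up.
async function demo(): Promise<void> {
  const lookup = new Promise<string>((resolve) =>
    setTimeout(() => resolve("Your order shipped yesterday."), 50),
  );
  for await (const chunk of progressiveReply("Let me check that order.", lookup)) {
    console.log(chunk);
  }
}

demo();
```

Because the lookup promise is created before the generator is consumed, the slow work is already in flight while the acknowledgement is being spoken.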

3) Routing models by phase

Use a fast model for early routing (tool/no-tool, intent classification) and a slower model only when needed.
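A rough shape for phase-based routing is shown below. Everything here is an assumption for illustration: the `Decision` fields, the `Models` interface, and `handleUtterance` are hypothetical, and the three model functions stand in for whatever clients you actually use.

```typescript
type Decision = {
  action: "respond" | "call_tool" | "clarify";
  needsSlowModel: boolean;
};

// Hypothetical model clients; replace with real calls.
type Models = {
  route: (utterance: string) => Promise<Decision>;   // fast model
  fastReply: (utterance: string) => Promise<string>; // fast model
  slowReply: (utterance: string) => Promise<string>; // slower model
};

// Route with the fast model first; pay for the slow model only when the
// router says the reply needs it.
async function handleUtterance(
  utterance: string,
  models: Models,
): Promise<{ decision: Decision; reply?: string }> {
  const decision = await models.route(utterance);
  if (decision.action !== "respond") {
    return { decision }; // tool calls and clarifications are handled elsewhere
  }
  const reply = decision.needsSlowModel
    ? await models.slowReply(utterance)
    : await models.fastReply(utterance);
  return { decision, reply };
}
```

The routing call sits on the critical path of every turn, so it is the step whose latency budget matters most.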

Endpointing is a big share of perceived latency

Users will feel 300ms vs 600ms at the end of speech more than they’ll feel a 50ms STT improvement.

Tips:

Use barge-in. Let users interrupt long responses.
Tune endpointing by intent: short confirmations vs explanations.
Use partial STT to start planning while the user speaks.
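Tuning endpointing by intent can be as simple as choosing the silence threshold from what the agent just asked. The sketch below assumes the agent knows which kind of answer it expects; the `ExpectedAnswer` categories and the threshold values are hypothetical placeholders to tune against your own data.

```typescript
type ExpectedAnswer = "confirmation" | "short_fact" | "explanation";

// Hypothetical silence thresholds in milliseconds, not recommendations.
const SILENCE_THRESHOLD_MS: Record<ExpectedAnswer, number> = {
  confirmation: 300, // "yes" / "no": end the turn quickly
  short_fact: 450,   // e.g. an order number
  explanation: 600,  // leave room for pauses mid-thought
};

// True once the trailing silence is long enough to treat the turn as done.
function shouldEndTurn(expected: ExpectedAnswer, silenceSoFarMs: number): boolean {
  return silenceSoFarMs >= SILENCE_THRESHOLD_MS[expected];
}

console.log(shouldEndTurn("confirmation", 300)); // true
console.log(shouldEndTurn("explanation", 300));  // false
```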

A practical workflow to improve p95

Pick one workflow (e.g. “order status”).
Capture 200+ real turns; compute p50/p95 per segment.
Identify the biggest p95 contributor (often tools + endpointing).
Improve one thing at a time (cache, parallelize, timebox, shorten response).
Re-measure and lock budgets into code as timeouts + heuristics.
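For the third step, one simple way to pick the biggest contributor is to rank segments by how far their p95 exceeds the target. The sketch below uses the numbers from the budget table earlier in this post; `rankByOverage` is a hypothetical helper.

```typescript
type SegmentStats = { segment: string; p95Ms: number; targetP95Ms: number };

// Values copied from the budget table earlier in this post.
const segments: SegmentStats[] = [
  { segment: "Endpointing (VAD)", p95Ms: 450, targetP95Ms: 400 },
  { segment: "STT finalization", p95Ms: 350, targetP95Ms: 300 },
  { segment: "Model (first token)", p95Ms: 600, targetP95Ms: 450 },
  { segment: "Tool call(s)", p95Ms: 1200, targetP95Ms: 800 },
  { segment: "TTS (first audio)", p95Ms: 520, targetP95Ms: 400 },
];

// Returns a new array sorted by p95 overage, worst offender first.
function rankByOverage(stats: SegmentStats[]): Array<SegmentStats & { overMs: number }> {
  return stats
    .map((s) => ({ ...s, overMs: s.p95Ms - s.targetP95Ms }))
    .sort((a, b) => b.overMs - a.overMs);
}

const worst = rankByOverage(segments)[0];
console.log(worst.segment, worst.overMs); // Tool call(s) 400
```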

Checklist

Correlation id per turn
p95 per segment on a dashboard
Tool timeouts and graceful fallback prompts
Streaming TTS with fast first audio
Barge-in enabled

If you do nothing else: set tool timeouts and optimize endpointing. Those two changes usually move the needle the most.
