Latency budgeting for voice agents

A practical guide to measuring, allocating, and improving end-to-end latency without breaking user experience.

Voice UX has zero patience. If your agent responds like a sluggish IVR, users will talk over it, repeat themselves, or bounce.

The fix isn’t “make everything faster” (you can’t). The fix is to treat latency like a budget: measure it, allocate it across components, and enforce it with fallbacks.

The mental model: one budget, many contributors

A typical turn looks like:

  1. Audio in → VAD / endpointing decides “the user is done”
  2. STT produces text (partial + final)
  3. Model decides the next action (respond vs call tool vs ask clarification)
  4. Tools (CRM, ticketing, KB search, payments) do I/O
  5. TTS produces audio
  6. Audio out plays to the user

Each segment has variability; the user experiences the sum.

A concrete budget table

Start by capturing p50 / p95 for each segment and then decide what you can afford.

| Segment | p50 (ms) | p95 (ms) | Target p95 budget (ms) | Notes |
|---|---|---|---|---|
| Endpointing (VAD) | 180 | 450 | 400 | Prefer short turns; allow barge-in |
| STT finalization | 120 | 350 | 300 | Stream partials; don’t wait for perfect |
| Model (first token) | 150 | 600 | 450 | Choose faster model for routing |
| Tool call(s) | 220 | 1200 | 800 | Parallelize; cache; timebox |
| TTS (first audio) | 140 | 520 | 400 | Stream audio; shorten first sentence |
| End-to-end | 810 | 3120 | 2350 | User-perceived budget |

Two important rules:

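  • The end-to-end p95 is your product. One turn in twenty is slower than p95, so on a ten-turn call there is roughly a 40% chance the user hits at least one such turn (assuming turns are independent). If p95 is bad, the experience feels bad.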
  • Budgets imply tradeoffs. If tools are slow, you must constrain model time or response length (or both).
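
To make the tradeoff concrete, here is a minimal TypeScript sketch that compares each segment's measured p95 against its target, using the numbers from the table above. The type and function names are illustrative assumptions, not part of any library.

```typescript
// Hypothetical names; the numbers are copied from the budget table.
type Segment = "vad" | "stt" | "llmFirstToken" | "tools" | "ttsFirstAudio";
type SegmentStats = { p95Ms: number; budgetMs: number };

const segments: Record<Segment, SegmentStats> = {
  vad: { p95Ms: 450, budgetMs: 400 },
  stt: { p95Ms: 350, budgetMs: 300 },
  llmFirstToken: { p95Ms: 600, budgetMs: 450 },
  tools: { p95Ms: 1200, budgetMs: 800 },
  ttsFirstAudio: { p95Ms: 520, budgetMs: 400 },
};

// Returns the segments whose p95 exceeds the budget, worst overrun first.
function findOverruns(stats: Record<Segment, SegmentStats>) {
  return (Object.keys(stats) as Segment[])
    .map((segment) => ({
      segment,
      overMs: stats[segment].p95Ms - stats[segment].budgetMs,
    }))
    .filter((entry) => entry.overMs > 0)
    .sort((a, b) => b.overMs - a.overMs);
}

const overruns = findOverruns(segments);
const totalOverMs = overruns.reduce((sum, entry) => sum + entry.overMs, 0);
```

With the table's numbers, tools are the largest overrun (400 ms) and the overruns total 770 ms, which is the gap between the 3120 ms measured and 2350 ms budgeted end-to-end rows.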

Measure the right thing (and label it consistently)

Most teams measure “request time” for the model and call it done. For voice, you need at least these timestamps, which mark the boundaries of each span:

  • turn.start (first audio frame after silence)
  • vad.end_of_speech (endpointing fires)
  • stt.final
  • llm.first_token and llm.done
  • tool.start / tool.done per tool
  • tts.first_audio
  • audio.playback.start

A good convention is to log durations with a single correlation id per turn.

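// One record per turn, keyed by a single correlation id. All durations are in ms.
type TurnSpans = {
  turnId: string;
  vadMs: number;
  sttMs: number;
  llmTtfbMs: number; // time to first token
  llmTotalMs: number;
  toolMs: number;    // critical path when tools run in parallel, otherwise the sum
  ttsTtfbMs: number; // time to first audio
  endToEndMs: number;
};

// Sums the segments that gate first audio. llmTotalMs is left out because only
// time to first token is counted here. A large gap between endToEndMs and
// criticalPathMs points at time that no span is measuring.
function summarizeTurn(spans: TurnSpans) {
  return {
    turnId: spans.turnId,
    endToEndMs: spans.endToEndMs,
    criticalPathMs: spans.vadMs + spans.sttMs + spans.llmTtfbMs + spans.toolMs + spans.ttsTtfbMs,
  };
}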
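
One way to produce those durations is to record the event timestamps listed earlier against the turn's correlation id and subtract them. The sketch below is an assumption about how that could look: `TurnTimeline` and its methods are hypothetical names, and which pair of events defines each segment depends on your pipeline. Timestamps are passed in explicitly here; in practice they would come from a single monotonic clock.

```typescript
// Sketch only: TurnTimeline and its method names are hypothetical.
class TurnTimeline {
  readonly turnId: string;
  private readonly marks = new Map<string, number>();

  constructor(turnId: string) {
    this.turnId = turnId;
  }

  // Record an event timestamp in ms. The first occurrence wins, so repeated
  // events (such as streamed tokens) do not overwrite llm.first_token.
  mark(event: string, atMs: number): void {
    if (!this.marks.has(event)) this.marks.set(event, atMs);
  }

  // Duration between two recorded events, or undefined if either is missing.
  between(from: string, to: string): number | undefined {
    const start = this.marks.get(from);
    const end = this.marks.get(to);
    return start === undefined || end === undefined ? undefined : end - start;
  }
}

const timeline = new TurnTimeline("turn-001");
timeline.mark("vad.end_of_speech", 1000);
timeline.mark("stt.final", 1120);
timeline.mark("llm.first_token", 1270);
timeline.mark("llm.first_token", 1300); // ignored: first occurrence wins

const sttMs = timeline.between("vad.end_of_speech", "stt.final"); // 120
const llmTtfbMs = timeline.between("stt.final", "llm.first_token"); // 150
```

Returning `undefined` for a missing event keeps incomplete turns (for example, a caller who hangs up mid-turn) from being logged as zero-latency turns.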

Budget enforcement: timeboxes + fallbacks

A budget that isn’t enforced becomes a dashboard curiosity. Enforce budgets with:

1) Timeboxed tools

  • Put hard timeouts on external calls.
  • Prefer partial answers over waiting forever.

Example fallback text when a tool misses its budget:

I’m still checking that. While it loads, can you confirm the last 4 digits of the order number?
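
A timebox can be sketched as a race between the tool call and a timer. This is a minimal illustration under stated assumptions: `withTimebox` and `answerOrderStatus` are hypothetical helpers, and the timer only stops the agent from waiting. It does not cancel the underlying request, and a rejected tool call still propagates as an error.

```typescript
type Timeboxed<T> = { ok: true; value: T } | { ok: false; reason: "timeout" };

// Resolves with the tool result if it arrives within budgetMs, otherwise
// with a timeout marker. Hypothetical helper, not a library API.
async function withTimebox<T>(work: Promise<T>, budgetMs: number): Promise<Timeboxed<T>> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<Timeboxed<T>>((resolve) => {
    timer = setTimeout(() => resolve({ ok: false, reason: "timeout" }), budgetMs);
  });
  try {
    return await Promise.race([
      work.then((value): Timeboxed<T> => ({ ok: true, value })),
      timeout,
    ]);
  } finally {
    clearTimeout(timer); // avoid a stray timer once the race is settled
  }
}

const FALLBACK =
  "I’m still checking that. While it loads, can you confirm the last 4 digits of the order number?";

// Speak the tool result if it met the budget, otherwise the fallback prompt.
async function answerOrderStatus(lookup: Promise<string>, budgetMs: number): Promise<string> {
  const result = await withTimebox(lookup, budgetMs);
  return result.ok ? result.value : FALLBACK;
}
```

The fallback keeps the conversation moving while the lookup continues in the background, so the late result can still be used on the next turn.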

2) Progressive responses

Don’t wait to craft a perfect paragraph. Voice wants quick acknowledgement, then substance.

A useful pattern is:

  • First sentence: acknowledge + plan (short)
  • Then: result (from tools / reasoning)
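
The acknowledge-then-result pattern can be sketched as follows. Here `speak` is a hypothetical hook that hands text to streaming TTS, and the caller is assumed to have started the tool call already, so the acknowledgement plays while the work is in flight.

```typescript
// Sketch only: `speak` stands in for whatever feeds text to your TTS stream.
async function respondProgressively(
  speak: (text: string) => void,
  acknowledgement: string,
  result: Promise<string>,
): Promise<void> {
  speak(acknowledgement); // goes out immediately, before the result is ready
  speak(await result); // substance follows as soon as it arrives
}
```

Keeping the acknowledgement to one short sentence matters, because time to first audio scales with the length of the first chunk sent to TTS.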

3) Routing models by phase

Use a fast model for early routing (tool/no-tool, intent classification) and a slower model only when needed.
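
A rough sketch of phase-based routing, assuming a cheap classifier call decides whether the turn needs the slower model. `classify`, the `Route` shape, and the `fast` / `slow` model functions are all hypothetical stand-ins for your own clients.

```typescript
type Model = (prompt: string) => Promise<string>;
type Route = { intent: string; complex: boolean };

// Classify with the fast path first; only pay for the slow model when the
// classifier says the turn needs it.
async function routeTurn(
  utterance: string,
  classify: (utterance: string) => Promise<Route>,
  models: { fast: Model; slow: Model },
): Promise<{ tier: "fast" | "slow"; reply: string }> {
  const route = await classify(utterance);
  const tier = route.complex ? "slow" : "fast";
  const reply = await models[tier](utterance);
  return { tier, reply };
}
```

Note that the classifier call sits on the critical path, so its own latency has to fit inside the model budget for this to be a net win.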

Endpointing is where users feel latency most

Users will feel 300ms vs 600ms at the end of speech more than they’ll feel a 50ms STT improvement.

Tips:

  • Use barge-in. Let users interrupt long responses.
  • Tune endpointing by intent: short confirmations vs explanations.
  • Use partial STT to start planning while the user speaks.
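
Tuning endpointing by intent can be as simple as choosing the silence threshold from the kind of answer the agent just asked for. The sketch below is illustrative: the 300 ms and 600 ms values are placeholders to tune against recorded calls, and the names are assumptions.

```typescript
type ExpectedAnswer = "confirmation" | "explanation";

// Placeholder thresholds: short for yes/no answers, longer when the user
// is likely to pause mid-thought.
const SILENCE_THRESHOLD_MS: Record<ExpectedAnswer, number> = {
  confirmation: 300,
  explanation: 600,
};

// True once the trailing silence is long enough to treat the turn as over.
function hasTurnEnded(silenceMs: number, expected: ExpectedAnswer): boolean {
  return silenceMs >= SILENCE_THRESHOLD_MS[expected];
}
```

For example, 350 ms of silence ends the turn after "Is that correct?" but not after "Can you describe the problem?".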

A practical workflow to improve p95

  1. Pick one workflow (e.g. “order status”).
  2. Capture 200+ real turns; compute p50/p95 per segment.
  3. Identify the biggest p95 contributor (often tools + endpointing).
  4. Improve one thing at a time (cache, parallelize, timebox, shorten response).
  5. Re-measure and lock budgets into code as timeouts + heuristics.
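
Steps 2 and 3 can be sketched in a few lines. This assumes nearest-rank percentiles and per-segment arrays of durations in milliseconds; the function names are hypothetical.

```typescript
// Nearest-rank percentile: the smallest sample such that at least p% of
// samples are less than or equal to it.
function percentile(samples: number[], p: number): number {
  if (samples.length === 0) throw new Error("no samples");
  if (p <= 0 || p > 100) throw new Error("p must be in (0, 100]");
  const sorted = [...samples].sort((a, b) => a - b);
  const rank = Math.ceil((p * sorted.length) / 100);
  return sorted[rank - 1];
}

// Returns the segment with the highest p95 across the captured turns.
function biggestP95Contributor(
  samplesBySegment: Record<string, number[]>,
): { segment: string; p95Ms: number } {
  let worst: { segment: string; p95Ms: number } | undefined;
  for (const [segment, samples] of Object.entries(samplesBySegment)) {
    const p95Ms = percentile(samples, 95);
    if (worst === undefined || p95Ms > worst.p95Ms) worst = { segment, p95Ms };
  }
  if (worst === undefined) throw new Error("no segments");
  return worst;
}
```

Compute the end-to-end p95 from the `endToEndMs` samples directly rather than by adding per-segment p95s. Percentiles do not add, so the summed figure is only a rough planning number.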

Checklist

  • Correlation id per turn
  • p95 per segment on a dashboard
  • Tool timeouts and graceful fallback prompts
  • Streaming TTS with fast first audio
  • Barge-in enabled

If you do nothing else: set tool timeouts and optimize endpointing. Those two changes usually move the needle the most.

Want help applying this to your workflow?

Share your top call types and integrations, and we’ll map a safe, measurable rollout plan for your first production voice agent.
