Latency budgeting for voice agents
A practical guide to measuring, allocating, and improving end-to-end latency without breaking user experience.
Voice UX has zero patience. If your agent responds like a sluggish IVR, users will talk over it, repeat themselves, or bounce.
The fix isn’t “make everything faster” (you can’t). The fix is to treat latency like a budget: measure it, allocate it across components, and enforce it with fallbacks.
The mental model: one budget, many contributors
A typical turn looks like:
- Audio in → VAD / endpointing decides “the user is done”
- STT produces text (partial + final)
- Model decides the next action (respond vs call tool vs ask clarification)
- Tools (CRM, ticketing, KB search, payments) do I/O
- TTS produces audio
- Audio out plays to the user
Each segment has variability; the user experiences the sum.
A concrete budget table
Start by capturing p50 / p95 for each segment and then decide what you can afford.
Two important rules:
- The end-to-end p95 is your product. If p95 is bad, the experience feels bad.
- Budgets imply tradeoffs. If tools are slow, you must constrain model time or response length (or both).
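One way to make the tradeoff concrete is to write the budget down as data and compare it with what you measure. The sketch below is only an illustration: the segment names, the `overBudget` and `remainingForLlm` helpers, and every millisecond value are placeholders, not recommendations. Per-segment p95 values also do not add up exactly to the end-to-end p95, so treat the arithmetic as a planning aid.

```typescript
// Hypothetical per-segment p95 budgets in milliseconds (placeholder values).
type Segment = "vad" | "stt" | "llm" | "tools" | "tts";
type Budget = Record<Segment, number>;

// Returns the segments whose measured p95 exceeds the budgeted p95.
function overBudget(budget: Budget, measuredP95: Budget): Segment[] {
  return (Object.keys(budget) as Segment[]).filter(
    (segment) => measuredP95[segment] > budget[segment],
  );
}

// Time left for the model once every other segment has taken its share.
function remainingForLlm(totalMs: number, budget: Budget): number {
  const others = budget.vad + budget.stt + budget.tools + budget.tts;
  return totalMs - others;
}

const budget: Budget = { vad: 300, stt: 150, llm: 400, tools: 500, tts: 150 };
const measured: Budget = { vad: 280, stt: 160, llm: 390, tools: 900, tts: 140 };
```

With these placeholder numbers, `overBudget(budget, measured)` flags `stt` and `tools`, and `remainingForLlm(1500, budget)` leaves 400 ms for the model. If the tool budget grows, that number shrinks, which is the tradeoff in the second rule.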
Measure the right thing (and label it consistently)
Most teams measure “request time” for the model and call it done. For voice, you need at least these spans:
- turn.start (first audio frame after silence)
- vad.end_of_speech (endpointing fires)
- stt.final
- llm.first_token and llm.done
- tool.start / tool.done per tool
- tts.first_audio
- audio.playback.start
A good convention is to log durations with a single correlation id per turn.
// Durations for one turn, all in milliseconds, keyed by a single correlation id.
type TurnSpans = {
  turnId: string;
  vadMs: number;
  sttMs: number;
  llmTtfbMs: number; // time to first token
  llmTotalMs: number;
  toolMs: number; // sum or critical path
  ttsTtfbMs: number; // time to first audio
  endToEndMs: number;
};

function summarizeTurn(spans: TurnSpans) {
  return {
    turnId: spans.turnId,
    endToEndMs: spans.endToEndMs,
    // Path to first audio: uses time-to-first-token and time-to-first-audio, not totals.
    criticalPathMs: spans.vadMs + spans.sttMs + spans.llmTtfbMs + spans.toolMs + spans.ttsTtfbMs,
  };
}
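To fill in durations like these, one option is to record a timestamp at each named event and subtract. The following is a minimal sketch, not a tracing library: `TurnTimer`, `mark`, and `between` are hypothetical names, and the injectable `now` clock exists only to keep the example testable.

```typescript
// Collects named timestamps for one turn and derives durations between them.
class TurnTimer {
  readonly turnId: string;
  private readonly now: () => number;
  private readonly marks = new Map<string, number>();

  constructor(turnId: string, now: () => number = () => Date.now()) {
    this.turnId = turnId;
    this.now = now;
  }

  // Records the event time; the first occurrence wins if an event repeats.
  mark(name: string): void {
    if (!this.marks.has(name)) this.marks.set(name, this.now());
  }

  // Milliseconds between two marks, or undefined if either one is missing.
  between(from: string, to: string): number | undefined {
    const start = this.marks.get(from);
    const end = this.marks.get(to);
    return start === undefined || end === undefined ? undefined : end - start;
  }
}
```

For example, `timer.between("turn.start", "vad.end_of_speech")` would supply `vadMs`. Returning `undefined` for a missing mark keeps an incomplete turn from silently showing up as a zero-latency one.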
Budget enforcement: timeboxes + fallbacks
A budget that isn’t enforced becomes a dashboard curiosity. Enforce budgets with:
1) Timeboxed tools
- Put hard timeouts on external calls.
- Prefer partial answers over waiting forever.
Example fallback text when a tool misses its budget:
I’m still checking that. While it loads, can you confirm the last 4 digits of the order number?
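A hard timeout with a fallback could be sketched as follows. The helper name `withTimebox` and its parameters are assumptions made for illustration; `AbortController`, `Promise.race`, and `setTimeout` are standard. The signal is passed to the tool so it can cancel its own I/O, for example by forwarding it to `fetch`.

```typescript
// Resolves with the tool result, or with `fallback` if the tool misses its budget.
async function withTimebox<T>(
  run: (signal: AbortSignal) => Promise<T>,
  budgetMs: number,
  fallback: T,
): Promise<T> {
  const controller = new AbortController();
  let timer: ReturnType<typeof setTimeout> | undefined;

  // Resolves with the fallback once the budget is spent and asks the tool to stop.
  const timeout = new Promise<T>((resolve) => {
    timer = setTimeout(() => {
      controller.abort();
      resolve(fallback);
    }, budgetMs);
  });

  try {
    // Whichever settles first wins: the tool or the timeout.
    return await Promise.race([run(controller.signal), timeout]);
  } finally {
    clearTimeout(timer);
  }
}
```

Note that a tool that fails before the budget expires still rejects, so the caller decides how to handle real errors; only a slow tool turns into the fallback.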
2) Progressive responses
Don’t wait to craft a perfect paragraph. Voice wants quick acknowledgement, then substance.
A useful pattern is:
- First sentence: acknowledge + plan (short)
- Then: result (from tools / reasoning)
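The acknowledge-then-result pattern can be sketched as an async generator that hands sentences to TTS as they become available. Everything here is hypothetical: `progressiveReply`, the `lookup` callback standing in for a tool call or reasoning step, and the `fallback` sentence used if that lookup fails.

```typescript
// Yields a short acknowledgement immediately, then the substantive result.
async function* progressiveReply(
  acknowledgement: string,
  lookup: () => Promise<string>,
  fallback: string,
): AsyncGenerator<string> {
  // Start the slow work before speaking; a failure becomes the fallback sentence.
  const pending = lookup().catch(() => fallback);
  yield acknowledgement; // hand this to TTS right away
  yield await pending;   // speak the result once it is ready
}
```

A caller would consume it with `for await (const sentence of progressiveReply(...))` and send each sentence to TTS. Because the lookup starts before the acknowledgement is yielded, the time spent speaking the first sentence overlaps with the tool call.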
3) Routing models by phase
Use a fast model for early routing (tool/no-tool, intent classification) and a slower model only when needed.
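As a rough sketch of phase-based routing, assume a fast classifier has already labeled the turn. The `Classification` shape, the `chooseModel` function, and the idea of comparing the slow model's p95 against the remaining budget are illustrative assumptions, not a prescribed design.

```typescript
type Route = "fast" | "slow";

// Hypothetical output of a fast first-pass classifier.
interface Classification {
  intent: string;
  needsTool: boolean;
  complex: boolean;
}

// Picks the slow model only when the turn needs it and it still fits the budget.
function chooseModel(
  c: Classification,
  remainingBudgetMs: number,
  slowModelP95Ms: number,
): Route {
  if (!c.complex) return "fast";
  return slowModelP95Ms <= remainingBudgetMs ? "slow" : "fast";
}
```

The second check ties routing back to the budget: a complex turn with little time left still goes to the fast model rather than blowing through the p95 target.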
Endpointing is half your latency
Users will feel 300ms vs 600ms at the end of speech more than they’ll feel a 50ms STT improvement.
Tips:
- Use barge-in. Let users interrupt long responses.
- Tune endpointing by intent: short confirmations vs explanations.
- Use partial STT to start planning while the user speaks.
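Tuning endpointing by intent might look like the sketch below: the silence threshold depends on what kind of reply the agent just asked for. The `ExpectedReply` categories and the millisecond values are placeholders to be tuned on real calls, not measured recommendations.

```typescript
// What kind of answer the agent's last prompt is expected to produce.
type ExpectedReply = "confirmation" | "identifier" | "open_ended";

// Silence to wait before declaring end of speech (placeholder values, in ms).
const SILENCE_MS: Record<ExpectedReply, number> = {
  confirmation: 300, // "yes" / "no" rarely has a mid-utterance pause
  identifier: 600,   // people pause while reading out numbers
  open_ended: 800,   // explanations include thinking pauses
};

// True once the trailing silence is long enough for this kind of reply.
function endOfSpeech(expected: ExpectedReply, silenceSoFarMs: number): boolean {
  return silenceSoFarMs >= SILENCE_MS[expected];
}
```

The payoff is that short confirmations stop paying the long threshold that explanations need.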
A practical workflow to improve p95
- Pick one workflow (e.g. “order status”).
- Capture 200+ real turns; compute p50/p95 per segment.
- Identify the biggest p95 contributor (often tools + endpointing).
- Improve one thing at a time (cache, parallelize, timebox, shorten response).
- Re-measure and lock budgets into code as timeouts + heuristics.
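The measurement steps in this workflow can be sketched with a small percentile helper. This version uses the nearest-rank method, which is one of several common percentile definitions; the function names and the `SegmentSamples` shape are assumptions for illustration.

```typescript
// Nearest-rank percentile: the smallest sample with at least p% of samples at or below it.
function percentile(values: number[], p: number): number {
  if (values.length === 0) throw new Error("no samples");
  const sorted = [...values].sort((a, b) => a - b);
  const rank = Math.ceil((p * sorted.length) / 100);
  return sorted[Math.max(rank, 1) - 1];
}

// Per-segment latency samples in milliseconds, e.g. { stt: [...], tools: [...] }.
type SegmentSamples = Record<string, number[]>;

// Returns the segment with the largest p95, the first candidate to improve.
function worstP95(samples: SegmentSamples): { segment: string; p95: number } {
  let worst = { segment: "", p95: -Infinity };
  for (const [segment, values] of Object.entries(samples)) {
    const p95 = percentile(values, 95);
    if (p95 > worst.p95) worst = { segment, p95 };
  }
  return worst;
}
```

Run it on the captured turns before and after each change so that the comparison uses the same percentile definition both times.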
Checklist
- Correlation id per turn
- p95 per segment on a dashboard
- Tool timeouts and graceful fallback prompts
- Streaming TTS with fast first audio
- Barge-in enabled
If you do nothing else: set tool timeouts and optimize endpointing. Those two changes usually move the needle the most.