Cutting voice-agent latency below 300ms at the edge
Every millisecond between a spoken word and a streamed response is borrowed trust. Cross 300ms and a conversation stops feeling like a conversation; it becomes a transaction with a server. The work of a real-time voice agent is mostly the work of refusing to spend time you do not have.
I keep a fixed budget and treat it like a memory allocation: capture, transport, inference, and synthesis each get a fraction, and nothing is allowed to overdraw. When one hop runs long, another hop has to repay the time, because there is no overdraft facility on a live call.
Latency is not a number you optimize at the end. It is a constraint you design the whole system around from the first request.
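To make the budget idea concrete, here is a minimal sketch of a per-stage ledger. The stage names come from the paragraph above; the millisecond split, the `LatencyBudget` class, and its method names are hypothetical, chosen only to illustrate how an overrun in one hop shrinks what the next hop may spend.

```typescript
// Stage names follow the article; the split below is an illustrative
// assumption, not a measured allocation.
type Stage = "capture" | "transport" | "inference" | "synthesis";

const BUDGET_MS = 300;

const allocation: Record<Stage, number> = {
  capture: 30,
  transport: 60,
  inference: 150,
  synthesis: 60,
};

// Hypothetical ledger: tracks time spent and any debt left by earlier stages.
class LatencyBudget {
  private spent = 0;
  private debt = 0;

  constructor(private readonly plan: Record<Stage, number>) {}

  // What a stage may spend once earlier overruns have been repaid.
  allowance(stage: Stage): number {
    return Math.max(0, this.plan[stage] - this.debt);
  }

  // Record the time a stage actually took; overruns carry forward as debt,
  // and a stage that finishes early pays existing debt down.
  record(stage: Stage, actualMs: number): void {
    this.spent += actualMs;
    this.debt = Math.max(0, this.debt + actualMs - this.plan[stage]);
  }

  // Time left in the overall budget (negative once it is blown).
  remaining(): number {
    return BUDGET_MS - this.spent;
  }
}

const call = new LatencyBudget(allocation);
call.record("capture", 30);
call.record("transport", 90); // 30ms over plan
console.log(call.allowance("inference")); // 120: inference repays the 30ms
```

In this sketch a stage that finishes early only pays down existing debt; it does not bank credit for later stages. That is a deliberately conservative choice, and a real system might decide otherwise.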
Move the model to the conversation
The single largest win was geographic: terminating audio and running first-token inference at the edge region nearest the speaker, rather than routing every packet to a central cluster. Below is the guard that decides whether a token stream is fast enough to begin speaking. If the first token arrives inside the budget, it is tagged as the lead so synthesis can start at once; if not, nothing is tagged and we wait for a fuller buffer.
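// Assumed minimal token shape; the real Token type is defined elsewhere.
interface Token {
  text: string;
}

const BUDGET_MS = 300;

// Tags the first token as `lead` when it arrives inside the budget.
// If the first token is already late, no token is ever tagged and the
// consumer is expected to wait for a fuller buffer.
export async function* guardedStream(
  source: AsyncIterable<Token>,
  startedAt: number, // a performance.now() timestamp
) {
  let spoken = false;
  for await (const token of source) {
    const elapsed = performance.now() - startedAt;
    if (!spoken && elapsed < BUDGET_MS) {
      spoken = true; // begin synthesis early
      yield { ...token, lead: true };
    } else {
      yield token;
    }
  }
}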
Starting synthesis on the first confident token, rather than the first complete sentence, is what makes a slow model feel immediate. The listener hears the agent begin to think out loud, and the remaining latency hides inside speech that is already underway.
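One way the consuming side might act on that lead marker is sketched below. The `Synthesizer` interface, the `speakStream` function, and the `text` field on `Token` are assumptions made for illustration, not part of the guard above. The sketch speaks token by token once a lead token has been seen, and otherwise holds text back until the first sentence boundary.

```typescript
// Assumed shapes for illustration only.
interface Token {
  text: string;
  lead?: boolean; // set by the guard when the first token beat the budget
}

interface Synthesizer {
  speak(text: string): void;
}

// Hypothetical consumer: stream immediately after a lead token, otherwise
// buffer until the first complete sentence and stream from there.
export async function speakStream(
  tokens: AsyncIterable<Token>,
  tts: Synthesizer,
): Promise<void> {
  let speaking = false;
  let buffer = "";
  for await (const token of tokens) {
    if (token.lead) speaking = true;
    buffer += token.text;
    const sentenceDone = /[.!?]\s*$/.test(buffer);
    if (speaking || sentenceDone) {
      tts.speak(buffer);
      buffer = "";
      speaking = true; // speech is underway, so later tokens flush at once
    }
  }
  if (buffer) tts.speak(buffer); // flush any trailing partial text
}
```

The sentence-boundary regex is a crude stand-in; a production system would more likely rely on the synthesizer's own chunking or a proper segmenter.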