Cutting voice-agent latency below 300ms at the edge

Every millisecond between a spoken word and a streamed response is borrowed trust. Cross 300ms and a conversation stops feeling like a conversation — it becomes a transaction with a server. The work of a real-time voice agent is mostly the work of refusing to spend time you do not have.

I keep a fixed budget and treat it like a memory allocation: capture, transport, inference, and synthesis each get a fraction, and nothing is allowed to overdraw. When one hop runs long, the time has to be repaid by another — there is no overdraft facility on a live call.
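One way to make that budget concrete is a small ledger that records what each hop actually spent. This is only a sketch: the four stage names come from the paragraph above, but the per-stage allowances and the `LatencyBudget` class are illustrative assumptions, not the split used in production.

```typescript
// The four hops named above; the millisecond split is a made-up example.
type Stage = "capture" | "transport" | "inference" | "synthesis";

const BUDGET_MS = 300;

// Hypothetical allowances, chosen only so that they sum to BUDGET_MS.
const ALLOWANCE_MS: Record<Stage, number> = {
  capture: 30,
  transport: 60,
  inference: 150,
  synthesis: 60,
};

class LatencyBudget {
  private readonly totalMs: number;
  private readonly used: Record<Stage, number> = {
    capture: 0,
    transport: 0,
    inference: 0,
    synthesis: 0,
  };

  constructor(totalMs: number = BUDGET_MS) {
    this.totalMs = totalMs;
  }

  // Record the time a stage actually took on this turn.
  spend(stage: Stage, elapsedMs: number): void {
    this.used[stage] += elapsedMs;
  }

  // How far a stage ran past its allowance: the time other stages must repay.
  debt(stage: Stage): number {
    return Math.max(0, this.used[stage] - ALLOWANCE_MS[stage]);
  }

  // Time left for the whole turn; negative means the budget is overdrawn.
  remaining(): number {
    const spent =
      this.used.capture +
      this.used.transport +
      this.used.inference +
      this.used.synthesis;
    return this.totalMs - spent;
  }
}
```

If transport takes 90 ms against a 60 ms allowance, `debt("transport")` reports the 30 ms that inference or synthesis now has to give back, and `remaining()` says how much of the turn is left to do it in.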

Latency is not a number you optimize at the end. It is a constraint you design the whole system around from the first request.

Move the model to the conversation

The single largest win was geographic: terminating audio and running first-token inference at the edge region nearest the speaker, rather than routing every packet to a central cluster. Below is the guard that decides whether a token stream is fast enough to begin speaking. If the first token arrives inside the budget, it is flagged as the lead token and synthesis can start; if it arrives late, nothing is flagged, which is the signal to wait for a fuller buffer.

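const BUDGET_MS = 300;

// Assumed minimal shape for illustration; the real Token type lives elsewhere.
interface Token {
  text: string;
}

// Flags the first token as `lead` when it arrives inside the budget, so the
// caller can begin synthesis early. A late first token is passed through
// unflagged, as are all later tokens.
// `startedAt` must be a performance.now() timestamp taken at request start.
export async function* guardedStream(
  source: AsyncIterable<Token>,
  startedAt: number,
): AsyncGenerator<Token & { lead?: boolean }> {
  let spoken = false;
  for await (const token of source) {
    const elapsed = performance.now() - startedAt;
    if (!spoken && elapsed < BUDGET_MS) {
      spoken = true; // begin synthesis early
      yield { ...token, lead: true };
    } else {
      yield token;
    }
  }
}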

Starting synthesis on the first confident token — rather than the first complete sentence — is what makes a slow model feel immediate. The listener hears the agent begin to think out loud, and the remaining latency hides inside speech that is already underway.
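The consuming side might look something like the sketch below. It assumes a token carries a `text` field and an optional `lead` flag, and that synthesis is reachable through a `speak` callback; `GuardedToken`, `Speak`, and `speakGuarded` are hypothetical names for illustration, not part of the actual pipeline. A lead token switches the consumer into immediate streaming, and without one it falls back to buffering until a sentence ends.

```typescript
// Assumed token shape: `text` is a guess, `lead` is the flag set by the guard.
interface GuardedToken {
  text: string;
  lead?: boolean;
}

// Hypothetical synthesis hook: receives each chunk of text to be spoken.
type Speak = (chunk: string) => void;

async function speakGuarded(
  stream: AsyncIterable<GuardedToken>,
  speak: Speak,
): Promise<void> {
  let streaming = false; // true once a lead token has been seen
  let buffer = "";

  for await (const token of stream) {
    if (token.lead) streaming = true;

    if (streaming) {
      // Fast path: speak each token as it arrives, flushing anything buffered.
      speak(buffer + token.text);
      buffer = "";
      continue;
    }

    // Slow path: hold tokens until the buffer ends a sentence.
    buffer += token.text;
    if (/[.!?]\s*$/.test(buffer)) {
      speak(buffer);
      buffer = "";
    }
  }

  // Flush whatever is left when the stream ends mid-sentence.
  if (buffer) speak(buffer);
}
```

The sentence-boundary regex is deliberately crude. A real system would probably want a smarter boundary detector and a timeout, so a slow stream does not sit silent while it waits for punctuation.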