−450m · Reference architecture

Low-latency voice agent: Conversation Relay + ElevenLabs + Deepgram + Claude

Reference architecture for a real-time voice AI agent on the phone. Twilio Conversation Relay as the transport, ElevenLabs and Deepgram on the audio I/O, Claude as the brain. Paired with a worked example in RecallIQ.

Public reference repo: sgeddy/voice-ai-conversation-relay-elevenlabs — runnable Node.js + TypeScript implementation with latency instrumentation, a synthetic-call benchmark harness, and a failure-mode catalog. Apache-style pattern, no application specifics.

−100m · Problem

Real-time voice AI agents sit across more hops than people remember: caller → PSTN → Twilio → Conversation Relay → STT → LLM → TTS → Conversation Relay → caller. Each hop adds latency. Most public examples skip instrumentation entirely, so builders adopt the pattern without knowing where their budget is being spent.

This pattern is for the workloads where you actually want a phone call instead of a browser chat — outbound study/review/coaching, customer-initiated agent flows, anything where the user needs to be hands-free and the experience starts with a ring. Worked example: the RecallIQ phone-call review session, deployed at study.samueleddy.com. The case study goes deeper on the application side; this page is the pattern.

−300m · Architecture

Five components, one phone call.

caller ──(PSTN)──> Twilio Voice
                     │  TwiML <Connect><ConversationRelay>
              Conversation Relay  ◄──(Deepgram STT)── caller speech
                     │  WebSocket
              your API server
                     │  anthropic.messages.stream()
                   Claude
                     │  content_block_delta tokens
              your API server
                     │  {type:"text", token, last:false}
              Conversation Relay  ──(ElevenLabs TTS)──> caller

Your server’s only job in the audio path is to translate Claude tokens into Conversation Relay frames. Twilio handles the phone call, Deepgram handles STT, ElevenLabs handles TTS, Claude handles reasoning. You glue.

TwiML at call connect:

<Response>
  <Connect>
    <ConversationRelay
      url="wss://your-api.example.com/ws/review"
      ttsProvider="ElevenLabs"
      voice="${ELEVENLABS_VOICE_ID}"
      transcriptionProvider="deepgram"
      speechModel="nova-2"
      interruptible="any"
      welcomeGreeting="Hi! Give me a moment to load your session."
      intelligenceService="${TWILIO_INTELLIGENCE_SERVICE_SID}"
    >
      <Parameter name="enrollmentId" value="${enrollmentId}" />
    </ConversationRelay>
  </Connect>
</Response>
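The `${...}` placeholders imply the TwiML is rendered from a template at request time. A minimal sketch of one way to do that, with attribute values XML-escaped so a stray quote or ampersand cannot break the document. The names `RelayConfig`, `escapeXml`, and `buildReviewTwiml` are hypothetical and not taken from the reference repo.

```typescript
// Hypothetical helper: renders the <ConversationRelay> TwiML shown above.
interface RelayConfig {
  wsUrl: string;
  voiceId: string;
  enrollmentId: string;
  intelligenceServiceSid?: string; // optional; attribute omitted when unset
}

// Escape the five XML special characters so values are safe in attributes.
function escapeXml(value: string): string {
  return value
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;")
    .replace(/'/g, "&apos;");
}

function buildReviewTwiml(cfg: RelayConfig): string {
  const attrs: Array<[string, string]> = [
    ["url", cfg.wsUrl],
    ["ttsProvider", "ElevenLabs"],
    ["voice", cfg.voiceId],
    ["transcriptionProvider", "deepgram"],
    ["speechModel", "nova-2"],
    ["interruptible", "any"],
    ["welcomeGreeting", "Hi! Give me a moment to load your session."],
  ];
  if (cfg.intelligenceServiceSid) {
    attrs.push(["intelligenceService", cfg.intelligenceServiceSid]);
  }
  const attrText = attrs
    .map(([name, value]) => `      ${name}="${escapeXml(value)}"`)
    .join("\n");
  return [
    "<Response>",
    "  <Connect>",
    "    <ConversationRelay",
    attrText,
    "    >",
    `      <Parameter name="enrollmentId" value="${escapeXml(cfg.enrollmentId)}" />`,
    "    </ConversationRelay>",
    "  </Connect>",
    "</Response>",
  ].join("\n");
}
```

Building the attribute list as data makes the optional `intelligenceService` attribute a one-line conditional instead of a second template.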

Four attributes are doing most of the work:

intelligenceService is optional — it hands the post-call audio to Twilio Conversational Intelligence for transcription, sentiment, custom operators. Drop the attribute if you don’t need it.

WebSocket message flow: setup → prompt* → interrupt? → prompt* → end. Your server handles five event types: setup, prompt, interrupt, dtmf, error. The Claude streaming loop runs inside the prompt handler.
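One way to structure the five handlers is a pure routing function that turns each incoming frame into an action for the rest of the server. This is a sketch under assumptions: apart from `utteranceUntilInterrupt`, which appears in the interrupt snippet on this page, the field names (`customParameters`, `voicePrompt`, `digit`, `description`) are my reading of the Conversation Relay message shapes and should be checked against Twilio's documentation. `Action` and `routeRelayMessage` are hypothetical names.

```typescript
// Assumed message shapes; verify field names against Twilio's docs.
type RelayMessage =
  | { type: "setup"; callSid: string; customParameters?: Record<string, string> }
  | { type: "prompt"; voicePrompt: string }
  | { type: "interrupt"; utteranceUntilInterrupt?: string }
  | { type: "dtmf"; digit: string }
  | { type: "error"; description: string };

// What the server should do next; keeps routing separate from side effects.
type Action =
  | { kind: "init"; enrollmentId: string | null }
  | { kind: "reply"; userText: string }
  | { kind: "patchHistory"; heard: string }
  | { kind: "keypress"; digit: string }
  | { kind: "logError"; description: string }
  | { kind: "ignore" };

function routeRelayMessage(raw: string): Action {
  const msg = JSON.parse(raw) as RelayMessage;
  switch (msg.type) {
    case "setup": {
      // <Parameter> values from the TwiML arrive with the setup message.
      const params = msg.customParameters ?? {};
      return { kind: "init", enrollmentId: params["enrollmentId"] ?? null };
    }
    case "prompt":
      // Caller speech, already transcribed: start the Claude streaming loop.
      return { kind: "reply", userText: msg.voicePrompt };
    case "interrupt":
      return { kind: "patchHistory", heard: msg.utteranceUntilInterrupt ?? "" };
    case "dtmf":
      return { kind: "keypress", digit: msg.digit };
    case "error":
      return { kind: "logError", description: msg.description };
    default:
      // Unknown frame types are ignored rather than crashing the call.
      return { kind: "ignore" };
  }
}
```

Returning an action instead of calling `ws.send` directly keeps the router testable without a live socket.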

−500m · Code

Three snippets that matter. Full runnable implementation in the reference repo — Fastify server, Conversation Relay handler, Anthropic streaming, pino instrumentation, plus a benchmark harness that places real Twilio calls.

Stream Claude tokens back as Conversation Relay frames:

// messages.stream() returns the stream object directly, so no await is needed.
const stream = anthropic.messages.stream({
  model: "claude-haiku-4-5-20251001",
  max_tokens: 300,
  system: buildSystemPrompt(session),
  messages: session.history,
});

// Forward each text delta to Conversation Relay as soon as it arrives.
for await (const event of stream) {
  if (event.type === "content_block_delta" &&
      event.delta.type === "text_delta") {
    ws.send(JSON.stringify({
      type: "text",
      token: event.delta.text,
      last: false,
    }));
  }
}

// Empty closing frame: last: true marks the end of this reply.
ws.send(JSON.stringify({ type: "text", token: "", last: true }));

Each Claude text_delta becomes a Conversation Relay text frame. The final empty frame with last: true tells Conversation Relay to flush — ElevenLabs starts synthesizing as tokens arrive, not after the full reply.
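The loop above sends frames but does not show the reply being kept anywhere, and the barge-in handler needs an assistant turn in history to patch. A minimal sketch of a small helper that builds the frames and accumulates the full reply at the same time. `ReplyFramer` is a hypothetical name; the frame shape is the one used on this page.

```typescript
// Hypothetical helper: builds Conversation Relay text frames from streamed
// deltas and keeps the concatenated reply for session history.
class ReplyFramer {
  private parts: string[] = [];

  // Frame for one streamed delta; last: false keeps the reply open.
  frame(delta: string): string {
    this.parts.push(delta);
    return JSON.stringify({ type: "text", token: delta, last: false });
  }

  // Empty closing frame; last: true marks the end of the reply.
  close(): string {
    return JSON.stringify({ type: "text", token: "", last: true });
  }

  // Full reply text as generated so far.
  get text(): string {
    return this.parts.join("");
  }
}
```

Inside the streaming loop this would be used as `ws.send(framer.frame(event.delta.text))`, followed by `ws.send(framer.close())` and `session.history.push({ role: "assistant", content: framer.text })` once the stream ends.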

Patch history on barge-in so the next turn reflects reality:

case "interrupt": {
  const heard = msg.utteranceUntilInterrupt ?? "";
  const last = session.history[session.history.length - 1];
  if (last?.role === "assistant") {
    last.content = heard + " [interrupted]";
  }
  break;
}

Without this, the model thinks it delivered the full reply. The caller thinks it didn’t. The next turn drifts.
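To make the drift concrete, here is a sketch of the same patch as a standalone function, run against a made-up two-turn history. `Turn` and `patchInterruptedTurn` are hypothetical names and the sample dialogue is invented for illustration.

```typescript
interface Turn {
  role: "user" | "assistant";
  content: string;
}

// Returns a new history where the last assistant turn holds only what the
// caller actually heard. Returns the input unchanged if the last turn is not
// an assistant turn (for example, the interrupt arrived before a reply was stored).
function patchInterruptedTurn(history: Turn[], heard: string): Turn[] {
  const last = history[history.length - 1];
  if (!last || last.role !== "assistant") return history;
  const patched: Turn = { role: "assistant", content: `${heard} [interrupted]` };
  return [...history.slice(0, -1), patched];
}

// Invented sample: the caller barges in after the first sentence fragment.
const before: Turn[] = [
  { role: "user", content: "Quiz me on chapter three." },
  { role: "assistant", content: "Sure. First question: what is the capital of France? Take your time." },
];
const after = patchInterruptedTurn(before, "Sure. First question:");
console.log(after[1]?.content); // prints "Sure. First question: [interrupted]"
```

After the patch, the next request to Claude no longer claims the question was asked in full, so the model can repeat or rephrase it instead of waiting for an answer the caller never heard the prompt for.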

Validate Twilio’s webhook signature in prod:

const signature = request.headers["x-twilio-signature"] ?? "";
const url = `${API_BASE_URL}/twiml/review-call`;
const isValid = twilio.validateRequest(
  TWILIO_AUTH_TOKEN, signature, url, params,
);
if (!isValid && process.env["NODE_ENV"] === "production") {
  return reply.code(403).send("Forbidden");
}

Cheap, important. Without signature validation, anyone who finds the webhook URL can send forged requests that your server will treat as coming from Twilio.
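For intuition about what the check verifies, here is a sketch of the signing scheme Twilio documents for form-encoded webhooks: HMAC-SHA1 over the full request URL followed by each POST parameter (key then value, keys sorted), keyed with the auth token and base64-encoded. It is illustrative only; the handler should keep calling `twilio.validateRequest`. The function names are hypothetical.

```typescript
import { createHmac, timingSafeEqual } from "node:crypto";

// Illustrative re-implementation of the documented scheme; not a replacement
// for the twilio helper library.
function computeTwilioSignature(
  authToken: string,
  url: string,
  params: Record<string, string>,
): string {
  // Append key + value for every param, in sorted key order, to the URL.
  const data = Object.keys(params)
    .sort()
    .reduce((acc, key) => acc + key + (params[key] ?? ""), url);
  return createHmac("sha1", authToken).update(data, "utf8").digest("base64");
}

// Constant-time comparison so timing does not leak how many bytes matched.
function signaturesMatch(expected: string, received: string): boolean {
  const a = Buffer.from(expected);
  const b = Buffer.from(received);
  return a.length === b.length && timingSafeEqual(a, b);
}
```

Because the URL is part of the signed data, the `url` passed to validation has to match exactly what Twilio requested, including scheme and host. A proxy that rewrites either will make valid requests fail.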

−750m · Latency budget

Measure end-to-end, then per-hop. Targets to design against on a US-to-US phone call:

Hop                                      Reasonable budget
Caller speech → Deepgram final word      ~300–500 ms
Conversation Relay → your WS             ~50 ms
First Claude token (Haiku, streaming)    ~400–800 ms
First TTS audio playing                  ~200–400 ms
Total p50 first audio                    ~1.0–1.7 s
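As a sanity check, the per-hop ranges add up to roughly the stated total. A short sketch of the arithmetic, using the figures from the table; the type and function names are hypothetical.

```typescript
interface HopBudget {
  hop: string;
  lowMs: number;
  highMs: number;
}

// Per-hop budgets copied from the table above.
const hops: HopBudget[] = [
  { hop: "Caller speech -> Deepgram final word", lowMs: 300, highMs: 500 },
  { hop: "Conversation Relay -> your WS", lowMs: 50, highMs: 50 },
  { hop: "First Claude token (Haiku, streaming)", lowMs: 400, highMs: 800 },
  { hop: "First TTS audio playing", lowMs: 200, highMs: 400 },
];

// Sum the low and high ends independently to get the end-to-end range.
function totalBudget(budgets: HopBudget[]): { lowMs: number; highMs: number } {
  return budgets.reduce(
    (acc, h) => ({ lowMs: acc.lowMs + h.lowMs, highMs: acc.highMs + h.highMs }),
    { lowMs: 0, highMs: 0 },
  );
}

const total = totalBudget(hops);
console.log(`${total.lowMs}-${total.highMs} ms`); // prints "950-1750 ms"
```

That is 0.95 to 1.75 s, consistent with the ~1.0–1.7 s total. The Claude first-token hop is the widest range, so it accounts for the largest share of the spread.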

What moves the number:

Instrument the timestamps at WS receive, messages.stream() start, first delta, last delta, and the interrupt event if any. Log them per turn. The data is the architecture.
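A minimal sketch of per-turn instrumentation for those five timestamps, assuming one timer object per turn. The class, mark names, and summary fields are hypothetical; the reference repo logs through pino, whereas this sketch only returns a plain object that any logger could emit.

```typescript
type Mark = "wsReceive" | "streamStart" | "firstDelta" | "lastDelta" | "interrupt";

interface TurnSummary {
  queueMs: number | null;      // WS receive -> messages.stream() start
  firstTokenMs: number | null; // stream start -> first delta
  generationMs: number | null; // first delta -> last delta
  interrupted: boolean;
}

class TurnTimer {
  private marks = new Map<Mark, number>();
  private now: () => number;

  // The clock is injectable so the timer can be tested deterministically.
  constructor(now: () => number = () => Date.now()) {
    this.now = now;
  }

  // Record a mark once; later calls with the same name are ignored.
  mark(name: Mark): void {
    if (!this.marks.has(name)) this.marks.set(name, this.now());
  }

  // Record a mark and overwrite any earlier value (used for lastDelta).
  markLatest(name: Mark): void {
    this.marks.set(name, this.now());
  }

  private diff(from: Mark, to: Mark): number | null {
    const a = this.marks.get(from);
    const b = this.marks.get(to);
    return a === undefined || b === undefined ? null : b - a;
  }

  summary(): TurnSummary {
    return {
      queueMs: this.diff("wsReceive", "streamStart"),
      firstTokenMs: this.diff("streamStart", "firstDelta"),
      generationMs: this.diff("firstDelta", "lastDelta"),
      interrupted: this.marks.has("interrupt"),
    };
  }
}
```

In the streaming loop, each delta would call both `timer.mark("firstDelta")` and `timer.markLatest("lastDelta")`, so the first call pins the first-token time and the last call pins the end of generation.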

−1000m · Failure modes

The ones you only see in production, taken from comments in the worked example:

When this pattern fits — and when it doesn’t

Fits:

Doesn’t fit:

Worked example

The applied implementation lives in RecallIQ — the indie spaced-repetition study app where I built this pattern in production. The phone-call review path uses everything described above; the browser quiz path uses a different stack and is documented in the case study.

The sanitized, runnable pattern lives in the public reference repo: sgeddy/voice-ai-conversation-relay-elevenlabs. Clone, fill in env vars, point a Twilio number at it, call.