−450m · Reference architecture
Low-latency voice agent: Conversation Relay + ElevenLabs + Deepgram + Claude
Reference architecture for a real-time voice AI agent on the phone. Twilio Conversation Relay as the transport, ElevenLabs and Deepgram on the audio I/O, Claude as the brain. Paired with a worked example in RecallIQ.
Public reference repo: sgeddy/voice-ai-conversation-relay-elevenlabs — runnable Node.js + TypeScript implementation with latency instrumentation, a synthetic-call benchmark harness, and a failure-mode catalog. Apache-style pattern, no application specifics.
−100m · Problem
Real-time voice AI agents sit across more hops than people remember: caller → PSTN → Twilio → Conversation Relay → STT → LLM → TTS → Conversation Relay → caller. Each hop adds latency. Most public examples skip instrumentation entirely, so builders adopt the pattern without knowing where their budget is being spent.
This pattern is for the workloads where you actually want a phone call instead of a browser chat — outbound study/review/coaching, customer-initiated agent flows, anything where the user needs to be hands-free and the experience starts with a ring. Worked example: the RecallIQ phone-call review session, deployed at study.samueleddy.com. The case study goes deeper on the application side; this page is the pattern.
−300m · Architecture
Five components, one phone call.
caller ──(PSTN)──> Twilio Voice
│
│ TwiML <Connect><ConversationRelay>
▼
Conversation Relay ◄──(Deepgram STT)── caller speech
│
│ WebSocket
▼
your API server
│
│ anthropic.messages.stream()
▼
Claude
│
│ content_block_delta tokens
▼
your API server
│
│ {type:"text", token, last:false}
▼
Conversation Relay ──(ElevenLabs TTS)──> caller
Your server’s only job in the audio path is to translate Claude tokens into Conversation Relay frames. Twilio handles the phone call, Deepgram handles STT, ElevenLabs handles TTS, Claude handles reasoning. You glue.
TwiML at call connect:
<Response>
  <Connect>
    <ConversationRelay
      url="wss://your-api.example.com/ws/review"
      ttsProvider="ElevenLabs"
      voice="${ELEVENLABS_VOICE_ID}"
      transcriptionProvider="deepgram"
      speechModel="nova-2"
      interruptible="any"
      welcomeGreeting="Hi! Give me a moment to load your session."
      intelligenceService="${TWILIO_INTELLIGENCE_SERVICE_SID}"
    >
      <Parameter name="enrollmentId" value="${enrollmentId}" />
    </ConversationRelay>
  </Connect>
</Response>
Four attributes are doing most of the work:
- ttsProvider="ElevenLabs" + voice: picks the TTS engine and voice ID. Conversation Relay calls ElevenLabs for you; you never see the audio stream.
- transcriptionProvider="deepgram" + speechModel="nova-2": Deepgram runs in the inbound audio path. Transcripts arrive in your WebSocket as prompt messages.
- interruptible="any": barge-in. The caller talking cuts off the in-flight TTS; Conversation Relay sends an interrupt event with utteranceUntilInterrupt (what was actually played before the cut).
- welcomeGreeting: plays the instant the call connects. Buys you the latency budget to load context (user, due cards, history) before the first model call.
intelligenceService is optional — it hands the post-call audio to Twilio Conversational Intelligence for transcription, sentiment, custom operators. Drop the attribute if you don’t need it.
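The template above interpolates values straight into attributes. A minimal sketch of building the same TwiML with XML escaping, so a greeting or ID containing a quote or ampersand cannot break the markup. buildRelayTwiml, escapeXml, and the options shape are hypothetical names for illustration, not taken from the reference repo; the attribute values simply mirror the TwiML above.

```typescript
// Hypothetical option shape for this sketch.
interface RelayTwimlOptions {
  wsUrl: string;
  voiceId: string;
  welcomeGreeting: string;
  enrollmentId: string;
  intelligenceServiceSid?: string; // optional, dropped from the output when absent
}

// Escape the five XML special characters in an attribute value.
function escapeXml(value: string): string {
  return value
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;")
    .replace(/'/g, "&apos;");
}

function buildRelayTwiml(opts: RelayTwimlOptions): string {
  // Attribute names and fixed values copied from the TwiML shown above.
  const attrs: Array<[string, string]> = [
    ["url", opts.wsUrl],
    ["ttsProvider", "ElevenLabs"],
    ["voice", opts.voiceId],
    ["transcriptionProvider", "deepgram"],
    ["speechModel", "nova-2"],
    ["interruptible", "any"],
    ["welcomeGreeting", opts.welcomeGreeting],
  ];
  if (opts.intelligenceServiceSid) {
    attrs.push(["intelligenceService", opts.intelligenceServiceSid]);
  }
  const attrText = attrs
    .map(([name, value]) => `${name}="${escapeXml(value)}"`)
    .join(" ");
  return [
    "<Response>",
    "<Connect>",
    `<ConversationRelay ${attrText}>`,
    `<Parameter name="enrollmentId" value="${escapeXml(opts.enrollmentId)}" />`,
    "</ConversationRelay>",
    "</Connect>",
    "</Response>",
  ].join("\n");
}
```

Escaping matters most for the Parameter value and the greeting, since those are the fields most likely to carry user-derived text.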
WebSocket message flow: setup → prompt* → interrupt? → prompt* → end. Your server handles five event types: setup, prompt, interrupt, dtmf, error. The Claude streaming loop runs inside the prompt handler.
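One way to sketch the handler for those five event types is a typed dispatcher. Treat it as an illustration under assumptions: utteranceUntilInterrupt comes from the text above, while voicePrompt, digit, description, and customParameters are field names to check against Twilio's Conversation Relay message reference. RelayHandlers and dispatchRelayMessage are made-up names for this example.

```typescript
// Inbound message shapes. Only the fields this sketch reads are listed;
// names other than utteranceUntilInterrupt are assumptions to verify.
type RelayMessage =
  | { type: "setup"; callSid: string; customParameters?: Record<string, string> }
  | { type: "prompt"; voicePrompt: string }
  | { type: "interrupt"; utteranceUntilInterrupt?: string }
  | { type: "dtmf"; digit: string }
  | { type: "error"; description: string };

interface RelayHandlers {
  onSetup(callSid: string, params: Record<string, string>): void;
  onPrompt(text: string): void;
  onInterrupt(heard: string): void;
  onDtmf(digit: string): void;
  onError(description: string): void;
}

// Parse one raw WebSocket frame and route it to a handler.
// Returns false for invalid JSON or an unknown type so the caller can log it.
function dispatchRelayMessage(raw: string, handlers: RelayHandlers): boolean {
  let parsed: unknown;
  try {
    parsed = JSON.parse(raw);
  } catch {
    return false;
  }
  if (typeof parsed !== "object" || parsed === null) return false;
  const msg = parsed as RelayMessage;
  switch (msg.type) {
    case "setup":
      handlers.onSetup(msg.callSid, msg.customParameters ?? {});
      return true;
    case "prompt":
      handlers.onPrompt(msg.voicePrompt); // the Claude streaming loop starts here
      return true;
    case "interrupt":
      handlers.onInterrupt(msg.utteranceUntilInterrupt ?? "");
      return true;
    case "dtmf":
      handlers.onDtmf(msg.digit);
      return true;
    case "error":
      handlers.onError(msg.description);
      return true;
    default:
      return false;
  }
}
```

In a WebSocket server this would be called from the message event, with the incoming frame converted to a string first.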
−500m · Code
Three snippets that matter. Full runnable implementation in the reference repo — Fastify server, Conversation Relay handler, Anthropic streaming, pino instrumentation, plus a benchmark harness that places real Twilio calls.
Stream Claude tokens back as Conversation Relay frames:
// messages.stream() starts the request and yields events as they arrive.
const stream = await anthropic.messages.stream({
  model: "claude-haiku-4-5-20251001",
  max_tokens: 300,
  system: buildSystemPrompt(session),
  messages: session.history,
});
for await (const event of stream) {
  // Forward only text deltas; other event types carry no speakable text.
  if (event.type === "content_block_delta" &&
      event.delta.type === "text_delta") {
    ws.send(JSON.stringify({
      type: "text",
      token: event.delta.text,
      last: false,
    }));
  }
}
// The empty final frame marks the end of this turn.
ws.send(JSON.stringify({ type: "text", token: "", last: true }));
Each Claude text_delta becomes a Conversation Relay text frame, so ElevenLabs starts synthesizing as tokens arrive, not after the full reply. The final empty frame with last: true tells Conversation Relay the turn is complete so it can flush.
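The loop above sends frames but never records the reply, while the barge-in handler below expects the last history entry to be the assistant's turn. A minimal sketch of one way to close that gap, assuming you accumulate the deltas yourself. createTurnRelay, StreamEvent, TextFrame, and HistoryEntry are hypothetical names, and the event shape mirrors only the fields the loop reads.

```typescript
// Minimal event shape: just the fields the streaming loop inspects.
interface StreamEvent {
  type: string;
  delta?: { type: string; text?: string };
}

interface TextFrame {
  type: "text";
  token: string;
  last: boolean;
}

interface HistoryEntry {
  role: "user" | "assistant";
  content: string;
}

// Forwards each text delta as a frame and keeps the full reply so it can be
// appended to history when the turn ends.
function createTurnRelay(send: (frame: TextFrame) => void) {
  let reply = "";
  return {
    onEvent(event: StreamEvent): void {
      if (event.type !== "content_block_delta") return;
      const delta = event.delta;
      if (!delta || delta.type !== "text_delta" || !delta.text) return;
      reply += delta.text;
      send({ type: "text", token: delta.text, last: false });
    },
    // Sends the closing frame, records the assistant turn, returns the full text.
    finish(history: HistoryEntry[]): string {
      send({ type: "text", token: "", last: true });
      history.push({ role: "assistant", content: reply });
      return reply;
    },
  };
}
```

Inside the for await loop you would call relay.onEvent(event) for every event and relay.finish(session.history) after the loop, with send wrapping ws.send(JSON.stringify(frame)).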
Patch history on barge-in so the next turn reflects reality:
case "interrupt": {
  const heard = msg.utteranceUntilInterrupt ?? "";
  const last = session.history[session.history.length - 1];
  if (last?.role === "assistant") {
    last.content = heard + " [interrupted]";
  }
  break;
}
Without this, the model thinks it delivered the full reply. The caller thinks it didn’t. The next turn drifts.
Validate Twilio’s webhook signature in prod:
const signature = request.headers["x-twilio-signature"] ?? "";
const url = `${API_BASE_URL}/twiml/review-call`;
const isValid = twilio.validateRequest(
  TWILIO_AUTH_TOKEN, signature, url, params,
);
if (!isValid && process.env["NODE_ENV"] === "production") {
  return reply.code(403).send("Forbidden");
}
Cheap, important. Skipping signature validation is how someone else’s API keys end up making calls on your account.
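The url passed to validateRequest has to match the URL Twilio actually requested, query string included. The snippet above hardcodes it from API_BASE_URL. A hedged alternative for deployments behind a reverse proxy is to rebuild it from forwarded headers. publicRequestUrl and its parameters are hypothetical, and the sketch assumes your proxy sets X-Forwarded-Proto and X-Forwarded-Host and is the only thing that can reach the server; do not trust these headers otherwise.

```typescript
type HeaderValue = string | string[] | undefined;

// Proxies may append values as a comma-separated list; take the first entry.
function firstHeader(value: HeaderValue): string | undefined {
  const raw = Array.isArray(value) ? value[0] : value;
  return raw?.split(",")[0]?.trim() || undefined;
}

// Rebuild the public URL Twilio called.
// originalUrl: path plus query string as the server saw it.
// fallbackBase: configured public base URL, used when forwarded headers are missing.
// pathPrefix: a prefix the proxy strips before forwarding, if any.
function publicRequestUrl(
  headers: Record<string, HeaderValue>,
  originalUrl: string,
  fallbackBase: string,
  pathPrefix = "",
): string {
  const proto = firstHeader(headers["x-forwarded-proto"]);
  const host =
    firstHeader(headers["x-forwarded-host"]) ?? firstHeader(headers["host"]);
  const base =
    proto && host ? `${proto}://${host}` : fallbackBase.replace(/\/+$/, "");
  return `${base}${pathPrefix}${originalUrl}`;
}
```

Whichever approach you pick, log the computed URL next to a failed validation in staging; a mismatch is usually visible at a glance.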
−750m · Latency budget
Measure end-to-end, then per-hop. Targets to design against on a US-to-US phone call:
| Hop | Reasonable budget |
|---|---|
| Caller speech → Deepgram final word | ~300–500 ms |
| Conversation Relay → your WS | ~50 ms |
| First Claude token (Haiku, streaming) | ~400–800 ms |
| First TTS audio playing | ~200–400 ms |
| Total p50 first audio | ~1.0–1.7 s |
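
As a quick check on the table, the per-hop ranges can be summed directly. This sketch only restates the numbers above; HopBudget, totalBudget, and largestHop are illustrative names. The raw sums come to 950 ms and 1750 ms, which the table rounds to roughly 1.0–1.7 s.

```typescript
interface HopBudget {
  hop: string;
  lowMs: number;
  highMs: number;
}

// Ranges copied from the table above.
const hops: HopBudget[] = [
  { hop: "Caller speech → Deepgram final word", lowMs: 300, highMs: 500 },
  { hop: "Conversation Relay → your WS", lowMs: 50, highMs: 50 },
  { hop: "First Claude token (Haiku, streaming)", lowMs: 400, highMs: 800 },
  { hop: "First TTS audio playing", lowMs: 200, highMs: 400 },
];

// Sum the best-case and worst-case ends of every hop.
function totalBudget(rows: HopBudget[]): { lowMs: number; highMs: number } {
  return rows.reduce(
    (acc, r) => ({ lowMs: acc.lowMs + r.lowMs, highMs: acc.highMs + r.highMs }),
    { lowMs: 0, highMs: 0 },
  );
}

// The hop with the largest worst case; expects a non-empty list.
function largestHop(rows: HopBudget[]): HopBudget {
  return rows.reduce((worst, r) => (r.highMs > worst.highMs ? r : worst));
}
```

On these targets the first Claude token is the largest single line item, which is why model choice and streaming lead the list that follows.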
What moves the number:
- Model choice. Haiku is the right call for turn-by-turn agent reasoning. Sonnet for synthesis tasks done off the call path.
- Stream, don’t await. Use messages.stream() and pipe tokens. Awaiting the full reply before the first frame is the easiest way to feel slow.
- Deepgram utterance settings. An utterance_end_ms setting below 1000 ms in the streaming case will produce false “user finished” events and degrade the experience. 1000 ms is the published floor; trust it.
- System prompt size. Every token in the system prompt is paid on every turn. Move per-card state into the user message, keep the system prompt focused on persona and rules.
- Network locality. Run the API close to the Twilio region terminating the call. Crossing an ocean adds roughly 150 ms in each direction, on every turn.
Instrument the timestamps at WS receive, messages.stream() start, first delta, last delta, and the interrupt event if any. Log them per turn. The data is the architecture.
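A minimal sketch of that per-turn instrumentation, assuming an injectable clock so it can be tested. createTurnTimer, the mark names, and the derived field names are hypothetical, not taken from the reference repo.

```typescript
type Mark = "wsReceive" | "streamStart" | "delta" | "interrupt";

interface TurnLatency {
  queueMs: number | undefined;         // wsReceive → streamStart
  firstTokenMs: number | undefined;    // streamStart → first delta
  generationMs: number | undefined;    // first delta → last delta
  totalMs: number | undefined;         // wsReceive → last delta
  interruptedAtMs: number | undefined; // wsReceive → interrupt, if any
}

function createTurnTimer(now: () => number = Date.now) {
  const t: {
    wsReceive?: number;
    streamStart?: number;
    firstDelta?: number;
    lastDelta?: number;
    interrupt?: number;
  } = {};
  // Difference between two marks, or undefined if either is missing.
  const span = (from?: number, to?: number): number | undefined =>
    from === undefined || to === undefined ? undefined : to - from;
  return {
    mark(name: Mark): void {
      const at = now();
      if (name === "delta") {
        // Call on every delta: the first call pins firstDelta, each call moves lastDelta.
        if (t.firstDelta === undefined) t.firstDelta = at;
        t.lastDelta = at;
      } else {
        t[name] = at;
      }
    },
    summary(): TurnLatency {
      return {
        queueMs: span(t.wsReceive, t.streamStart),
        firstTokenMs: span(t.streamStart, t.firstDelta),
        generationMs: span(t.firstDelta, t.lastDelta),
        totalMs: span(t.wsReceive, t.lastDelta),
        interruptedAtMs: span(t.wsReceive, t.interrupt),
      };
    },
  };
}
```

Create one timer per prompt message and log timer.summary() as a single structured line when the turn ends, so p50 and p95 per hop fall out of a log query.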
−1000m · Failure modes
The ones you only see in production, taken from comments in the worked example:
- In-memory session state breaks horizontally. A Map<callSid, VoiceSession> is fine on one API process. Two processes behind a load balancer and the WS routes to the wrong one. Move session state to Redis before scaling.
- Heuristic session control. Substring-matching the model’s reply for “correct” or “next question” works in the demo and gets fragile fast. Use Claude tool use, or a structured-output field, to drive state transitions from the model side.
- Conversation history grows. Long calls hit token limits before they hit time limits. Decide your truncation policy explicitly: rolling window, summarized prefix, or session-ending hand-off.
- Barge-in vs echo bleed. Carrier echo can look like barge-in to STT. Some interrupt events are real; some aren’t. The mitigations are at the audio layer (good echo cancellation on the caller side) and at the application layer (require some minimum utterance length before treating barge-in as authoritative).
- DTMF as fallback. Speech recognition fails on bad networks, accents the model wasn’t trained on, and high-noise environments. Wire DTMF as a parallel input for any structured prompt (MCQ letters, yes/no, session-end). It’s almost free to implement and saves the experience in the long tail.
- Webhook auth. Twilio signature validation only works against the public URL Twilio actually called. Behind a reverse proxy that rewrites paths, get the original URL right or signature validation will reject legitimate calls. Test it once in staging, then leave it on.
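For the history-growth item above, a rolling window is the simplest of the three policies. A minimal sketch, assuming a crude four-characters-per-token estimate rather than a real tokenizer. rollingWindow, estimateTokens, and HistoryEntry are hypothetical names. The window is trimmed to open on a user message, on the assumption that the Messages API expects the conversation to start with a user turn.

```typescript
interface HistoryEntry {
  role: "user" | "assistant";
  content: string;
}

// Rough size estimate for budgeting only: about four characters per token.
function estimateTokens(text: string): number {
  return Math.ceil(text.length / 4);
}

// Keep the most recent entries that fit in maxTokens. The newest entry is
// always kept, even if it alone exceeds the budget.
function rollingWindow(history: HistoryEntry[], maxTokens: number): HistoryEntry[] {
  const kept: HistoryEntry[] = [];
  let used = 0;
  for (let i = history.length - 1; i >= 0; i--) {
    const entry = history[i];
    if (entry === undefined) continue;
    const cost = estimateTokens(entry.content);
    if (kept.length > 0 && used + cost > maxTokens) break;
    kept.unshift(entry);
    used += cost;
  }
  // Drop leading assistant turns so the window opens on a user message.
  while (kept.length > 0 && kept[0]?.role !== "user") {
    kept.shift();
  }
  return kept;
}
```

Pass the result as messages instead of session.history. The function returns a new array, so the full transcript stays available for post-call analysis.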
When this pattern fits — and when it doesn’t
Fits:
- Outbound study/coaching/reminder workloads where hands-free phone is the right modality
- Customer-initiated agent flows where the business can absorb the recurring cost of a dedicated phone number
- Anything where Twilio Conversational Intelligence is going to do work on the call after the fact
- Workloads where ElevenLabs voice quality matters more than the per-minute cost
Doesn’t fit:
- Low-friction web quiz / browser-first interactions — keep that in the browser. Voice mode in the browser is a different stack (Deepgram REST + ElevenLabs via AudioContext, no Conversation Relay involved)
- Workloads where cost per minute is the dominant constraint — ElevenLabs is not the cheapest TTS, and Twilio voice minutes aren’t free
- Workloads that need anything that isn’t text in/text out at the LLM — Conversation Relay’s interface is text-based; richer modalities go through other patterns
Worked example
The applied implementation lives in RecallIQ — the indie spaced-repetition study app where I built this pattern in production. The phone-call review path uses everything described above; the browser quiz path uses a different stack and is documented in the case study.
The sanitized, runnable pattern lives in the public reference repo: sgeddy/voice-ai-conversation-relay-elevenlabs. Clone, fill in env vars, point a Twilio number at it, call.