−750m · Case study

RecallIQ — voice AI study app on a $10/month box

PDF in, flashcards out, spaced repetition with two voice modes — one browser, one phone call. Built indie on Twilio Conversation Relay, ElevenLabs, Deepgram, and Claude. Live at study.samueleddy.com.

May 17, 2026 · Status: Published · Stack: Twilio Conversation Relay · ElevenLabs · Deepgram · Claude (Haiku + Sonnet) · Next.js · Fastify · BullMQ · Postgres · Redis · Caddy · Docker · AWS

The project is closed source. This case study covers only some of the decisions made and lessons learned.

Personal work, built and written outside Twilio working hours. Views and opinions expressed here are my own and do not represent Twilio.

−100m · What it is

Upload a PDF, paste a URL, or both. Claude extracts the testable concepts, generates flashcards, and schedules them on a spaced-repetition cadence. Quiz yourself by text or by voice. Spacing widens as you retain.

Live at study.samueleddy.com. Indie, gated signup. Request access below.

−150m · Why I built it

I built RecallIQ for myself. Memorization doesn’t always come easy for me. I wish I could be someone who memorizes by reading something once or twice — for me it takes more effort, more time, more discipline. It’s gotten harder as an adult. Between work, family, and a packed schedule, my study minutes scatter across the day. Optimizing what gets reviewed and when is the difference between retaining the material and watching it slide off.

The material I’m trying to retain doesn’t fit one shape:

Certification study — practice exams stop helping the second you’ve seen the questions twice.
Prayer memorization — needs word-perfect recall, not just gist. Standard flashcard apps don’t handle that well.
General learning — books, talks, technical material I want to keep past the first read.

Different requirements, one system that handles all three.

Voice mode came from the same place. I wanted to quiz myself on the road, while washing dishes — not chained to a screen. From there it evolved into the two voice paths below: one in the browser for laptop sessions, one over the phone for hands-free.

And it gave me the excuse to build, with my own hands, the kind of system I spend my day job helping Twilio customers build. There’s a category of understanding you only get from shipping it yourself.

−200m · The system, one breath

Three runtimes, one monorepo (Turborepo + pnpm workspaces).

apps/web — Next.js 14 App Router. Learner UI, course editor, settings. Clerk for auth.
apps/api — Fastify + Zod. Route handlers, ownership checks, voice routes, the Claude grader.
apps/worker — BullMQ + Redis. Async jobs: course generation from uploads, scheduled reviews, notification dispatch.

Postgres + Drizzle ORM for state. Caddy out front for TLS. Everything runs in Docker Compose on a single ARM EC2 instance — more on that in §7.

Multi-tenant isolation is enforced at the API layer. Every user-data query filters by userId or enrollmentId. No global queries to slip through. The design rule is one sentence: never trust the URL.
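A minimal sketch of what "never trust the URL" can look like, using an in-memory array in place of the real Postgres query. The `Enrollment` shape and the `findOwnedEnrollment` helper are hypothetical, not RecallIQ's actual code. The point is only that the owner comes from the authenticated session and is part of the lookup itself.

```typescript
// Hypothetical row shape; the real schema is not shown in this case study.
interface Enrollment {
  id: string;
  userId: string;
  courseId: string;
}

// sessionUserId must come from the verified auth session, never from the
// URL, query string, or request body. enrollmentId may come from the URL
// because it is only honored when the owner also matches.
function findOwnedEnrollment(
  rows: Enrollment[],
  enrollmentId: string,
  sessionUserId: string,
): Enrollment | undefined {
  return rows.find((e) => e.id === enrollmentId && e.userId === sessionUserId);
}

const rows: Enrollment[] = [
  { id: "enr_1", userId: "user_a", courseId: "course_1" },
  { id: "enr_2", userId: "user_b", courseId: "course_1" },
];

// user_a asking for user_b's enrollment gets nothing back.
const leaked = findOwnedEnrollment(rows, "enr_2", "user_a");
const own = findOwnedEnrollment(rows, "enr_1", "user_a");
```

Returning `undefined` for a row that exists but belongs to someone else lets the route answer with the same 404 it would give for a missing row.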

−300m · Why Claude, and why two tiers

Anthropic is the default. Claude is the model RecallIQ ships against in production. Other providers are swap-in alternates, not active.

The two-tier strategy matters more than the vendor choice. Different jobs, different models.

Claude Haiku 4.5 for grading and extraction — given a question, a learner’s free-recall answer, and an expected concept list, decide what they covered and what they missed. Fast, cheap, plenty smart enough. Same model also pulls testable concepts out of uploaded practice-exam PDFs.
Claude Sonnet 4 for synthesis — course generation from raw content, AI-assisted course editing, prompt-only course creation. Slower, more expensive, materially better at the kind of structured judgment curriculum design needs.

The grader prompt isn’t trivial. Two failure modes had to be designed around:

The model counting concepts that just restate the question as “missing” — expected concept “Advent” when the question is “What is Advent?” should resolve as matched, not missing.
The model treating any list-format question as enumeration when the learner only needs one right answer — “Who founded the Jesuits?” doesn’t require every concept in the list, just a correct name.

The prompt has explicit single-vs-list rules, and the parser does a defensive set-validation pass on the way out: matched + missing must equal the expected concept list exactly. If the model hallucinates concepts that weren’t in the input, they get dropped.
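The set-validation pass can be sketched as below. The `GradeResult` field names and the `reconcileGrade` helper are assumptions for illustration. So is the tie-breaking choice that an expected concept the model never mentions counts as missing.

```typescript
// Hypothetical shape of the grader's parsed output.
interface GradeResult {
  matched: string[];
  missing: string[];
}

// Rebuilds matched/missing from the expected list so that, together, they
// contain every expected concept exactly once and nothing else.
function reconcileGrade(expected: string[], raw: GradeResult): GradeResult {
  const expectedSet = new Set(expected);
  // Concepts the model invented (not in the input list) are dropped here.
  const confirmed = new Set(raw.matched.filter((c) => expectedSet.has(c)));
  return {
    matched: expected.filter((c) => confirmed.has(c)),
    // Assumption: anything not explicitly matched is treated as missing.
    missing: expected.filter((c) => !confirmed.has(c)),
  };
}

const graded = reconcileGrade(
  ["Advent", "four weeks"],
  { matched: ["Advent", "Christmas"], missing: [] },
);
// graded.matched is ["Advent"], graded.missing is ["four weeks"];
// the hallucinated "Christmas" is gone.
```

Deriving both lists from `expected` rather than trusting the model's `missing` array keeps the invariant true by construction.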

One other habit worth naming: Haiku sometimes wraps JSON in markdown fences even when the system prompt tells it not to. Strip the fences before parsing. Don’t trust the format.
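A small sketch of the fence-stripping step, assuming the model wraps the whole reply in a single fenced block with an optional language tag. The helper names are hypothetical. The fence marker is built with `repeat` only so the example does not contain a literal triple backtick.

```typescript
// Three backticks, built programmatically to keep this example fence-safe.
const FENCE = "`".repeat(3);

// Matches a reply that is entirely one fenced block: fence, optional
// language tag, newline, body, fence.
const FENCED_REPLY = new RegExp(
  `^${FENCE}[a-zA-Z]*[ \\t]*\\r?\\n([\\s\\S]*?)\\s*${FENCE}$`,
);

function stripJsonFences(text: string): string {
  const trimmed = text.trim();
  const match = trimmed.match(FENCED_REPLY);
  // If there is no fence, return the trimmed text unchanged.
  return match ? match[1].trim() : trimmed;
}

function parseModelJson(text: string): unknown {
  // JSON.parse still throws on malformed JSON; callers should handle that.
  return JSON.parse(stripJsonFences(text));
}
```

Both a fenced and an unfenced reply go through the same `parseModelJson` call, so the caller does not need to know which one the model produced.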

−400m · Two voice paths, one app

RecallIQ ships two voice modes that target different ergonomics, on different stacks.

Browser voice mode — quiz yourself with the laptop on the kitchen counter.

STT: Deepgram nova-2 prerecorded (per-card audio capture), with keyword boost biasing recognition toward single letters (A/B/C/D) and option-specific terms for MCQ questions. Falls back to the browser’s Web Speech API if DEEPGRAM_API_KEY isn’t set.
TTS: ElevenLabs streamed via the API server’s /voice/tts proxy, played through AudioContext.destination. Why not SpeechSynthesis? See §5.
Mic and audio session are juggled per-card — open the mic to listen, close it to flip the iOS audio category back to playback before speaking.

Phone-call review — reviews scheduled by BullMQ trigger an outbound call so you can review while doing literally anything else.

Twilio Conversation Relay does the heavy lifting. The TwiML names ttsProvider="ElevenLabs", transcriptionProvider="deepgram", speechModel="nova-2", interruptible="any", plus an optional intelligenceService for post-call analysis through Twilio Conversational Intelligence.
Claude Haiku is streamed token-by-token via anthropic.messages.stream() — each content_block_delta becomes a {type: "text", token, last: false} WebSocket frame back to Conversation Relay, which feeds ElevenLabs for synthesis on the fly.
Barge-in is handled through the interrupt event: when the caller talks over the assistant, Conversation Relay sends utteranceUntilInterrupt and the conversation history gets patched so the next Claude turn reflects what was actually heard, not what was generated.
DTMF digits map to MCQ letters (1→A, 2→B, 3→C, 4→D) so you can answer from the keypad without speaking.
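The three message-handling pieces above can be sketched as pure functions. The text frame shape and the `utteranceUntilInterrupt` field come from the description above. The helper names, the `Turn` type, and the choice to overwrite the last assistant turn are assumptions, not the actual route code.

```typescript
type Turn = { role: "user" | "assistant"; content: string };

interface TextFrame {
  type: "text";
  token: string;
  last: boolean;
}

// Serializes one streamed model token as a WebSocket frame for Conversation Relay.
function toTextFrame(token: string, last = false): string {
  const frame: TextFrame = { type: "text", token, last };
  return JSON.stringify(frame);
}

// On barge-in, rewrite the most recent assistant turn to what the caller
// actually heard, so the next model turn does not assume the full reply landed.
function applyInterrupt(history: Turn[], utteranceUntilInterrupt: string): Turn[] {
  const patched = history.slice();
  for (let i = patched.length - 1; i >= 0; i--) {
    if (patched[i].role === "assistant") {
      patched[i] = { role: "assistant", content: utteranceUntilInterrupt };
      break;
    }
  }
  return patched;
}

// Keypad digits to MCQ letters; any other digit returns undefined.
const DTMF_TO_LETTER: Record<string, string> = { "1": "A", "2": "B", "3": "C", "4": "D" };

function dtmfToAnswer(digit: string): string | undefined {
  return DTMF_TO_LETTER[digit];
}

const history: Turn[] = [
  { role: "user", content: "Ready." },
  { role: "assistant", content: "Question one: what is Advent? Take your time." },
];
const heard = applyInterrupt(history, "Question one: what is");
```

`applyInterrupt` returns a new array and leaves the input untouched, which keeps the patch easy to test without a live call.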

Why both exist: low-friction quiz UX needs to live in the browser. Hands-free review while doing other things needs to be a phone call. Same backend grader, same SRS engine, two front doors.

−500m · The iOS audio session chapter

This is the part where building taught me things I would not have figured out by reading. Roughly fifteen commits across two weeks of voice work were iOS-audio fixes — Safari on iOS gives PWAs second-class audio session access, and the constraints don’t line up with what a continuous-mic voice agent needs.

The working pattern, in four pieces:

Route ElevenLabs TTS through AudioContext.destination, not SpeechSynthesis. iOS sends AudioContext output to the speaker even with the mic active; SpeechSynthesis gets demoted in play-and-record mode.
Prime mic permission at session open, then release the tracks. Re-acquire per-listen, not held the whole session — keeps the audio session in playback mode for TTS.
Use the W3C Audio Session API on iOS 17+ Safari (navigator.audioSession.type = 'play-and-record') to give WebKit an explicit category hint.
Resume the AudioContext before every TTS. iOS suspends it between user gestures.

Each fix on its own is fragile. The working state is the combination — remove any one and a different failure mode comes back. The full chase, with the failed attempts, the gesture-chain breaker that cost me a day, the echo-cancellation false-positive heuristic, and the rationale for each piece, is in iOS Safari audio sessions: fifteen commits to a working voice mode.

That whole chapter is the motivation for a planned native iOS wrapper — Capacitor or React Native around the existing web UI, with native modules replacing only the voice plumbing. The web stays the source of truth. Deferred until the browser path has been proven unworkable in practice, not just in theory.

−600m · Cost discipline as architecture

Every chargeable thing in RecallIQ runs on someone else’s API. Deepgram per voice answer. Claude per grade, per extraction, per course generation. ElevenLabs per phone call. Resend per email. Twilio per call and per SMS. At personal-use scale that’s pennies a month. At any signup scale it becomes a real bill someone has to pay.

The system that makes that survivable is cost transparency as a first-class concern, designed in from the start, not bolted on:

Every action that costs money is tagged with one of a fixed set of buckets — STT (Deepgram), TTS (ElevenLabs), LLM grading (Haiku), LLM generation (Sonnet), LLM extraction (Haiku), Email (Resend), SMS / voice (Twilio), Compute (flat-rate per active user, allocated).
A usage_events row is written per chargeable action with the user, the bucket, the amount in cents, and metadata. One per action. No aggregation at write time — the data is there to slice however we need later.
Provider unit prices live in a cost_estimates table. When Deepgram raises rates, one row updates and the change propagates everywhere.
Every paid feature has a free-tier allowance and a graceful degradation path. Voice answer too expensive? Fall back to typed input. Semantic grading too expensive? Fall back to keyword overlap. The feature still works, just with less polish.

Prices are stored in cents (integers). No floating-point math anywhere near billing.
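A sketch of building one usage event with integer-only arithmetic. The bucket identifiers, the `UsageEvent` field names, and the numbers in the example are all illustrative assumptions, not real column names or provider prices.

```typescript
// Illustrative bucket identifiers mirroring the list above.
type Bucket =
  | "stt" | "tts" | "llm_grading" | "llm_generation"
  | "llm_extraction" | "email" | "sms_voice" | "compute";

interface UsageEvent {
  userId: string;
  bucket: Bucket;
  amountCents: number; // always an integer
  metadata: Record<string, unknown>;
}

// unitPriceCents would be read from the cost_estimates table; quantity is
// the number of billable units for this one action.
function buildUsageEvent(
  userId: string,
  bucket: Bucket,
  unitPriceCents: number,
  quantity: number,
  metadata: Record<string, unknown> = {},
): UsageEvent {
  if (!Number.isInteger(unitPriceCents) || !Number.isInteger(quantity)) {
    throw new Error("prices and quantities must be integers");
  }
  return { userId, bucket, amountCents: unitPriceCents * quantity, metadata };
}

// Made-up numbers: 3 units at 2 cents each.
const event = buildUsageEvent("user_a", "stt", 2, 3, { cardId: "card_1" });
```

Rejecting non-integers at the boundary means a fractional value can never reach the table by accident.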

This was easy to skip and would have come back to bite me the first time someone hammered the public demo. Building it before opening signups is the difference between “indie project” and “credit-card liability with extra steps.”

−650m · Pre-render TTS instead of streaming it

Voice mode in v1 streamed every utterance — every flashcard playback, every replay, every retry hit ElevenLabs live. The day my test account hit the ElevenLabs free-tier quota mid-session (2026-05-16, single user, normal study session) showed me what scaling that model would actually look like: a 100-card course consumed by 100 users for a year would burn ~$4,500 in TTS, plus quota-wall outages along the way.

The fix was to pre-render the canonical content — each card’s front and back text — once at course generation, store the MP3 in S3 with Cache-Control: immutable, and serve via short-lived presigned URLs. Only the genuinely dynamic content stays on streaming — grading feedback that varies per user answer, retry prompts, voice greetings. About 5% of the original streaming volume.
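One way to make the render step idempotent is to derive the object key from the content itself. This sketch is an assumption about how that could work, not the actual key scheme. The `tts/` prefix, the `audioKey` helper, and the decision to hash voice ID plus text are all hypothetical.

```typescript
import { createHash } from "node:crypto";

// Same voice + same text always yields the same key, so a retried job can
// check for the object and skip the paid render if it already exists.
// Any edit to the text or a voice change yields a new key, which is what
// makes an immutable cache header safe.
function audioKey(courseId: string, voiceId: string, text: string): string {
  const digest = createHash("sha256")
    .update(`${voiceId}\n${text}`)
    .digest("hex");
  return `tts/${courseId}/${digest}.mp3`;
}

const front = audioKey("course_1", "voice_x", "What is Advent?");
const frontAgain = audioKey("course_1", "voice_x", "What is Advent?");
const edited = audioKey("course_1", "voice_x", "What is Advent season?");
// front === frontAgain, and edited is a different key.
```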

A BullMQ worker handles the rendering with concurrency: 2 and limiter: { max: 4, duration: 1000 } — not because the box can’t handle more, but because the provider’s rate limits will throttle a freshly-generated 200-card course at full speed.

Full deep dive — the error taxonomy, the key prefix split, the Hail Mary’s “I-men” pronunciation incident, and what I learned about idempotency, voice commitment, and the architectural difference between “cost per playback” and “cost per content unit” — in Pre-render your TTS audio, don’t stream it.

−700m · Cheap by design

The whole thing runs on one t4g.small EC2 instance: ARM, Amazon Linux 2023, 2 GB RAM. Docker Compose, six services: Postgres, Redis, the API, the worker, the web app, and Caddy out front for TLS. ~$10/month with a budget alert wired up before anything else.

A few decisions that earned their keep:

Build off-box, pull on EC2. Next.js compile would OOM the 2 GB box. Images get built on GitHub Actions (matrix-parallel, ~3 min total for all three services), pushed to ECR, and the EC2 instance just pulls. The road there went through docker compose build OOMing the EC2, then Docker.raw eating my laptop’s disk, before settling on Actions. Full chase in Three deploy pivots: each fix bought time, not a solution.
Caddy, not nginx, not ALB. TLS via the Caddyfile, a handful of security headers, three route handlers (/api/* → Fastify, /ws/* → Fastify websocket, everything else → Next.js). Forty lines of config for the whole reverse proxy.
Postgres in the same compose, not RDS. RDS for a $10 indie app is malpractice. When this needs to outgrow the box, it’ll be a Postgres dump and a connection-string change, not an architecture migration.
Health checks on Postgres and Redis are wired up. The API and worker declare depends_on with condition: service_healthy, so the compose stack comes up in the right order automatically.

The deploy is one EC2 instance. There’s no orchestrator, no autoscaler, no service mesh. If the box dies, the data is in EBS, and the application is one docker compose up away from running again. That’s enough.

−800m · What’s shipped, what’s scaffolded

Honest list, because the case study isn’t credible if I oversell what’s working.

Shipped:

AI course generation from PDF/URL uploads (Claude Sonnet 4, async via BullMQ).
Browser voice quiz mode — Deepgram REST + ElevenLabs via AudioContext + Web Speech fallback.
Free-recall grading with Haiku semantic grader (replaces a prior keyword-overlap implementation).
Spaced-repetition engine — pure TypeScript, zero framework dependencies, lives in packages/srs-engine. Testable in isolation, portable to a future mobile client.
Mock exam sessions with domain-weighted scoring and review mode.
Pre-rendered TTS audio pipeline — ElevenLabs Flash to S3, idempotent BullMQ worker, presigned URL serving. Streaming fallback for the ~5% of audio that’s genuinely dynamic.
Pre-populated practice question bank from publicly available certification study materials, transformed for SRS playback.
Post-exam follow-up email + maintenance reminder jobs.
Signup gate via hashed access code + email allowlist.

Scaffolded but not yet end-to-end tested:

Phone-call review sessions over Twilio Conversation Relay. The full code path is in place and reads correctly, but the production env vars (TWILIO_*, ELEVENLABS_VOICE_ID, TWILIO_INTELLIGENCE_SERVICE_SID) haven’t been configured. First end-to-end call is the next milestone.
SMS review reminders — Twilio scaffolding is in place, needs a phone-number collection UI.

Deferred:

Native iOS wrapper for voice mode. Conditional on the browser path proving structurally unreliable, not just inconvenient.

−1000m · What I’d do differently

Some of this is already in the code as TODO comments to myself. Putting it in writing so it doesn’t get lost.

Conversation Relay session state should not be an in-memory Map. It works for single-process. A second API instance breaks it immediately. Redis is the obvious next move — same Redis already running for BullMQ, so the infrastructure is free.
Session-advancement heuristics in the voice route read the reply text for “correct” / “next question” / “session complete” substrings. That works today, but it’s a placeholder for structured output. Claude tool use or a dedicated state field in the JSON the model returns would both be more robust than substring matching.
Build a debug panel before building the iOS native wrapper. The decision to ship a Capacitor or RN wrapper should be informed by what actually fails in browser voice mode across iOS versions, not by what felt wrong in the worst session of debugging. Instrument the audio session transitions and watch real sessions before committing to native.
The grader’s reasoning is logged but not user-visible. Surfacing the reasoning (“you said X, expected Y, treated as paraphrase”) would close the trust gap on auto-grading without much code.
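For the session-advancement item, a dedicated state field could look like the sketch below. The `VoiceReply` shape, the `action` values, and `parseVoiceReply` are hypothetical. Nothing here reflects the current route, which still uses substring matching.

```typescript
type SessionAction = "continue" | "next_question" | "session_complete";

interface VoiceReply {
  speech: string; // what gets spoken to the learner
  action: SessionAction; // what the session should do next
}

const ACTIONS: readonly string[] = ["continue", "next_question", "session_complete"];

// Validates the model's JSON instead of guessing intent from the wording.
function parseVoiceReply(json: string): VoiceReply {
  const data: unknown = JSON.parse(json);
  if (typeof data !== "object" || data === null) {
    throw new Error("reply must be a JSON object");
  }
  const rec = data as Record<string, unknown>;
  if (typeof rec.speech !== "string") {
    throw new Error("reply.speech must be a string");
  }
  if (typeof rec.action !== "string" || !ACTIONS.includes(rec.action)) {
    throw new Error("reply.action is not a known session action");
  }
  return { speech: rec.speech, action: rec.action as SessionAction };
}

// The spoken text mentions "next question", but the explicit field says to
// stay on the current card. Substring matching would have advanced here.
const reply = parseVoiceReply(
  '{"speech":"Not quite. Try once more before the next question.","action":"continue"}',
);
```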

This list will keep growing. The point of writing it down is that the next session of work starts from “what would I do differently” instead of from a blank page.

Request access

RecallIQ is in private beta while I productize it. To request access, or to talk through any of the decisions above, drop me a note.