−750m · Case study

RecallIQ — voice AI study app on a $10/month box

PDF in, flashcards out, spaced repetition with two voice modes — one browser, one phone call. Built indie on Twilio Conversation Relay, ElevenLabs, Deepgram, and Claude. Live at study.samueleddy.com.

The project is closed source. This case study covers only some of the decisions made and the lessons learned.

Personal work, built and written outside Twilio working hours. Views and opinions expressed here are my own and do not represent Twilio.

−100m · What it is

Upload a PDF, paste a URL, or both. Claude extracts the testable concepts, generates flashcards, and schedules them on a spaced-repetition cadence. Quiz yourself by text or by voice. Spacing widens as you retain.

Live at study.samueleddy.com. Indie, gated signup. Request access below.

−150m · Why I built it

I built RecallIQ for myself. Memorization doesn’t always come easy for me. I wish I could be someone who memorizes by reading something once or twice — for me it takes more effort, more time, more discipline. It’s gotten harder as an adult. Between work, family, and a packed schedule, my study minutes scatter across the day. Optimizing what gets reviewed and when is the difference between retaining the material and watching it slide off.

The material I’m trying to retain doesn’t fit one shape:

Different requirements, one system that handles all three.

Voice mode came from the same place. I wanted to quiz myself on the road, while washing dishes — not chained to a screen. From there it evolved into the two voice paths below: one in the browser for laptop sessions, one over the phone for hands-free.

And it gave me the excuse to build, with my own hands, the kind of system I spend my day job helping Twilio customers build. There’s a category of understanding you only get from shipping it yourself.

−200m · The system, one breath

Three runtimes, one monorepo (Turborepo + pnpm workspaces).

Postgres + Drizzle ORM for state. Caddy out front for TLS. Everything runs in Docker Compose on a single ARM EC2 instance — more on that in §7.

Multi-tenant isolation is enforced at the API layer. Every user-data query filters by userId or enrollmentId. No global queries to slip through. The design rule is one sentence: never trust the URL.
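A minimal sketch of that rule, with an in-memory array standing in for the real Postgres/Drizzle query. The `Card` shape, `findCardForUser`, and the sample rows are hypothetical and not taken from the RecallIQ codebase. The owner comes from the authenticated session and is always part of the filter, so an ID lifted from someone else's URL matches nothing.

```typescript
// Hypothetical row shape; the real schema is not shown in this case study.
interface Card {
  id: string;
  userId: string;
  front: string;
}

// The card ID may arrive via the URL, but the userId must come from the
// authenticated session. Both are part of the filter, so a guessed or
// borrowed ID never returns another tenant's row.
function findCardForUser(
  cards: readonly Card[],
  sessionUserId: string,
  cardIdFromUrl: string,
): Card | undefined {
  return cards.find(
    (card) => card.id === cardIdFromUrl && card.userId === sessionUserId,
  );
}

const sampleCards: Card[] = [
  { id: "c1", userId: "alice", front: "What is Advent?" },
  { id: "c2", userId: "bob", front: "Who founded the Jesuits?" },
];

const ownCard = findCardForUser(sampleCards, "alice", "c1");
const foreignCard = findCardForUser(sampleCards, "alice", "c2");
console.log(ownCard?.front, foreignCard);
```

In a SQL-backed version the same idea would be a `WHERE` clause that always includes the session's user ID alongside the requested row ID.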

−300m · Why Claude, and why two tiers

Anthropic is the default. Claude is the model RecallIQ ships against in production. Other providers are swap-in alternates, not active.

The two-tier strategy matters more than the vendor choice. Different jobs, different models.

The grader prompt isn’t trivial. Two failure modes had to be designed around:

  1. The model marking a concept as “missing” when it merely restates the question: the expected concept “Advent” for the question “What is Advent?” should resolve as matched, not missing.
  2. The model treating any list-format question as enumeration when the learner only needs one right answer — “Who founded the Jesuits?” doesn’t require every concept in the list, just a correct name.

The prompt has explicit single-vs-list rules, and the parser does a defensive set-validation pass on the way out: matched + missing must equal the expected concept list exactly. If the model hallucinates concepts that weren’t in the input, they get dropped.
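One way such a set-validation pass could look, as a sketch under stated assumptions. The `GradeResult` shape and `validateGrade` name are hypothetical, concepts are compared as exact strings, and any expected concept the model did not cleanly mark as matched is treated as missing (a conservative choice made for this example, not something the source specifies).

```typescript
// Hypothetical shape of the grader's parsed output.
interface GradeResult {
  matched: string[];
  missing: string[];
}

// Rebuild matched/missing from the expected list so that:
//  - concepts the model invented are dropped,
//  - every expected concept lands in exactly one bucket,
//  - matched + missing always equals the expected list.
function validateGrade(
  expected: readonly string[],
  raw: GradeResult,
): GradeResult {
  const claimedMatched = new Set(raw.matched);
  const claimedMissing = new Set(raw.missing);
  const matched: string[] = [];
  const missing: string[] = [];
  for (const concept of Array.from(new Set(expected))) {
    // Assumption: a concept listed in both buckets, or in neither,
    // counts as missing.
    const cleanlyMatched =
      claimedMatched.has(concept) && !claimedMissing.has(concept);
    (cleanlyMatched ? matched : missing).push(concept);
  }
  return { matched, missing };
}

const validated = validateGrade(["Advent", "Lent", "Easter"], {
  matched: ["Advent", "Pentecost"], // "Pentecost" was never expected
  missing: ["Lent"], // "Easter" was not accounted for at all
});
console.log(validated);
```

Iterating over the expected list, instead of filtering the model's lists, is what makes the exact-equality invariant hold by construction.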

One other habit worth naming: Haiku sometimes wraps JSON in markdown fences even when the system prompt tells it not to. Strip the fences before parsing. Don’t trust the format.
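A small sketch of that habit. `stripJsonFences` and `parseModelJson` are hypothetical names, and the function only handles a single fence wrapping the whole response, which is the case described above.

```typescript
// Built at runtime so this example contains no literal fence of its own.
const FENCE = "`".repeat(3);

// Remove one wrapping markdown fence (with an optional language tag such
// as "json") from a model response; unfenced text passes through trimmed.
function stripJsonFences(raw: string): string {
  let text = raw.trim();
  if (text.startsWith(FENCE)) {
    // Drop the opening fence line, language tag included.
    const firstNewline = text.indexOf("\n");
    text =
      firstNewline === -1
        ? text.slice(FENCE.length)
        : text.slice(firstNewline + 1);
    // Drop the closing fence if the model emitted one.
    text = text.trimEnd();
    if (text.endsWith(FENCE)) {
      text = text.slice(0, -FENCE.length);
    }
  }
  return text.trim();
}

// JSON.parse still throws on malformed output; callers should catch that.
function parseModelJson(raw: string): unknown {
  return JSON.parse(stripJsonFences(raw));
}

const fencedReply = FENCE + 'json\n{"matched":["Advent"],"missing":[]}\n' + FENCE;
console.log(parseModelJson(fencedReply));
```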

−400m · Two voice paths, one app

RecallIQ ships two voice modes that target different ergonomics, on different stacks.

Browser voice mode — quiz yourself with the laptop on the kitchen counter.

Phone-call review — reviews scheduled by BullMQ trigger an outbound call so you can review while doing literally anything else.

Why both exist: low-friction quiz UX needs to live in the browser. Hands-free review while doing other things needs to be a phone call. Same backend grader, same SRS engine, two front doors.

−500m · The iOS audio session chapter

This is the part where building taught me things I would not have figured out by reading. Roughly fifteen commits across two weeks of voice work were iOS-audio fixes — Safari on iOS gives PWAs second-class audio session access, and the constraints don’t line up with what a continuous-mic voice agent needs.

The working pattern, in four pieces:

  1. Route ElevenLabs TTS through AudioContext.destination, not SpeechSynthesis. iOS sends AudioContext output to the speaker even with the mic active; SpeechSynthesis gets demoted in play-and-record mode.
  2. Prime mic permission at session open, then release the tracks. Re-acquire the mic for each listen rather than holding it for the whole session, which keeps the audio session in playback mode for TTS.
  3. Use the W3C Audio Session API on iOS 17+ Safari (navigator.audioSession.type = 'play-and-record') to give WebKit an explicit category hint.
  4. Resume the AudioContext before every TTS. iOS suspends it between user gestures.
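Pieces 2 through 4 can be sketched as three small helpers. This is an illustration under assumptions, not the app's code. The helper names are hypothetical, and the narrow interfaces exist only so the sketch type-checks without DOM typings; in a browser they are satisfied by `MediaStream`, `navigator`, and `AudioContext`. Piece 1, routing decoded TTS audio to `AudioContext.destination`, is not shown.

```typescript
// Structural stand-ins for the browser objects involved.
interface StoppableStream {
  getTracks(): { stop(): void }[];
}
interface NavigatorWithAudioSession {
  audioSession?: { type: string };
}
interface ResumableContext {
  state: string;
  resume(): Promise<void>;
}

// Piece 2: after priming permission, stop every track so the mic is
// released until the next listen re-acquires it.
function releaseMic(stream: StoppableStream): void {
  for (const track of stream.getTracks()) {
    track.stop();
  }
}

// Piece 3: set the category hint only where the Audio Session API exists.
// Returns false on browsers without navigator.audioSession.
function hintPlayAndRecord(nav: NavigatorWithAudioSession): boolean {
  if (!nav.audioSession) return false;
  nav.audioSession.type = "play-and-record";
  return true;
}

// Piece 4: call before every TTS playback, since the context may have
// been suspended since the last user gesture.
async function resumeBeforeTts(ctx: ResumableContext): Promise<void> {
  if (ctx.state !== "running") {
    await ctx.resume();
  }
}
```

In the browser the calls would look like `hintPlayAndRecord(navigator as NavigatorWithAudioSession)` and `await resumeBeforeTts(audioContext)`. Feature detection matters here because the Audio Session API is not available in every browser.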

Each fix on its own is fragile. The working state is the combination — remove any one and a different failure mode comes back. The full chase, with the failed attempts, the gesture-chain breaker that cost me a day, the echo-cancellation false-positive heuristic, and the rationale for each piece, is in iOS Safari audio sessions: fifteen commits to a working voice mode.

That whole chapter is the motivation for a planned native iOS wrapper — Capacitor or React Native around the existing web UI, with native modules replacing only the voice plumbing. The web stays the source of truth. Deferred until the browser path has been proven unworkable in practice, not just in theory.

−600m · Cost discipline as architecture

Every chargeable thing in RecallIQ runs on someone else’s API. Deepgram per voice answer. Claude per grade, per extraction, per course generation. ElevenLabs per phone call. Resend per email. Twilio per call and per SMS. At personal-use scale that’s pennies a month. At any signup scale it becomes a real bill someone has to pay.

The system that makes that survivable is cost transparency as a first-class concern, designed in from the start, not bolted on:

Prices are stored in cents (integers). No floating-point math anywhere near billing.
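A short sketch of what integer-cent accounting buys. The `UsageLine` shape, the helper names, and the sample amounts are hypothetical. It sums line items as integers, rejects anything fractional, and formats only at the display edge.

```typescript
// Hypothetical usage line item; amounts are whole cents.
interface UsageLine {
  label: string;
  cents: number;
}

// Sum line items, refusing any non-integer amount so floating-point
// error cannot leak into a total.
function totalCents(lines: readonly UsageLine[]): number {
  let total = 0;
  for (const line of lines) {
    if (!Number.isInteger(line.cents)) {
      throw new Error("non-integer amount for " + line.label);
    }
    total += line.cents;
  }
  return total;
}

// Convert to a display string only at the edge.
function formatCents(cents: number): string {
  const sign = cents < 0 ? "-" : "";
  const abs = Math.abs(cents);
  const dollars = Math.floor(abs / 100);
  const remainder = String(abs % 100).padStart(2, "0");
  return sign + "$" + dollars + "." + remainder;
}

// With floats, 0.1 + 0.2 === 0.3 is false; with cents, 10 + 20 === 30.
const bill = totalCents([
  { label: "voice answer", cents: 10 },
  { label: "grade", cents: 20 },
]);
console.log(formatCents(bill)); // "$0.30"
```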

This was easy to skip and would have come back to bite me the first time someone hammered the public demo. Building it before opening signups is the difference between “indie project” and “credit-card liability with extra steps.”

−650m · Pre-render TTS instead of streaming it

Voice mode in v1 streamed every utterance — every flashcard playback, every replay, every retry hit ElevenLabs live. The day my test account hit the ElevenLabs free-tier quota mid-session (2026-05-16, single user, normal study session) showed me what scaling that model would actually look like: a 100-card course consumed by 100 users for a year would burn ~$4,500 in TTS, plus quota-wall outages along the way.

The fix was to pre-render the canonical content — each card’s front and back text — once at course generation, store the MP3 in S3 with Cache-Control: immutable, and serve via short-lived presigned URLs. Only the genuinely dynamic content stays on streaming — grading feedback that varies per user answer, retry prompts, voice greetings. About 5% of the original streaming volume.
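The split can be sketched as a routing decision plus a render-once store. Everything named here is hypothetical: the `UtteranceKind` labels, the `ClipStore` class, and the key format. The in-memory map stands in for S3, and the `render` callback stands in for the provider call. The point is the cost model, where the provider is charged once per content unit and not once per playback.

```typescript
// Hypothetical labels for the kinds of speech the app produces.
type UtteranceKind =
  | "card-front"
  | "card-back"
  | "grading-feedback"
  | "retry-prompt"
  | "greeting";

// Canonical card text is rendered ahead of time; the rest stays streamed.
const PRE_RENDERED = new Set<UtteranceKind>(["card-front", "card-back"]);

function isPreRendered(kind: UtteranceKind): boolean {
  return PRE_RENDERED.has(kind);
}

// In-memory stand-in for object storage. renderCalls counts how often
// the (paid) provider would have been hit.
class ClipStore {
  private clips = new Map<string, string>();
  renderCalls = 0;

  getOrRender(key: string, render: () => string): string {
    const existing = this.clips.get(key);
    if (existing !== undefined) return existing;
    this.renderCalls += 1;
    const clip = render();
    this.clips.set(key, clip);
    return clip;
  }
}

const store = new ClipStore();
for (let playback = 0; playback < 3; playback++) {
  store.getOrRender("card-42:front", () => "mp3-bytes-placeholder");
}
console.log(store.renderCalls); // 1
```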

A BullMQ worker handles the rendering with concurrency: 2 and limiter: { max: 4, duration: 1000 }. That isn’t because the box can’t handle more; it’s because the provider’s rate limits would throttle a freshly generated 200-card course rendered at full speed.
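A rough sense of what those settings mean for one course, as a back-of-the-envelope sketch. The options object mirrors the values quoted above. `minRenderSeconds` is a hypothetical helper, and the job count assumes one job per clip and two clips per card (front and back), which the source does not confirm. The result is only the floor set by the limiter; actual render latency at concurrency 2 could make it slower.

```typescript
// Values as quoted in the text; in BullMQ the limiter allows at most
// `max` jobs per `duration` milliseconds.
const renderWorkerOptions = {
  concurrency: 2,
  limiter: { max: 4, duration: 1000 },
} as const;

// Lower bound on wall-clock time imposed by the limiter alone.
function minRenderSeconds(
  cardCount: number,
  clipsPerCard: number,
  limiter: { max: number; duration: number },
): number {
  const jobs = cardCount * clipsPerCard;
  const windows = Math.ceil(jobs / limiter.max);
  return (windows * limiter.duration) / 1000;
}

// 200 cards x 2 clips = 400 jobs at 4 per second.
const seconds = minRenderSeconds(200, 2, renderWorkerOptions.limiter);
console.log(seconds); // 100
```

Under those assumptions a 200-card course takes a little under two minutes to render, which is acceptable for a one-time cost at course generation.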

Full deep dive — the error taxonomy, the key prefix split, the Hail Mary’s “I-men” pronunciation incident, and what I learned about idempotency, voice commitment, and the architectural difference between “cost per playback” and “cost per content unit” — in Pre-render your TTS audio, don’t stream it.

−700m · Cheap by design

The whole thing runs on one t4g.small EC2 instance: ARM, Amazon Linux 2023, 2 GB RAM. Docker Compose runs five application services (Postgres, Redis, the API, the worker, and the web app), with Caddy out front for TLS. ~$10/month, with a budget alert wired up before anything else.

A few decisions that earned their keep:

The deploy is one EC2 instance. There’s no orchestrator, no autoscaler, no service mesh. If the box dies, the data is in EBS, and the application is one docker compose up away from running again. That’s enough.

−800m · What’s shipped, what’s scaffolded

Honest list, because the case study isn’t credible if I oversell what’s working.

Shipped:

Scaffolded but not yet end-to-end tested:

Deferred:

−1000m · What I’d do differently

Some of this is already in the code as TODO comments to myself. Putting it in writing so it doesn’t get lost.

This list will keep growing. The point of writing it down is that the next session of work starts from “what would I do differently” instead of from a blank page.


Request access

RecallIQ is in private beta while it is being productized. To request access, or to talk through any of the decisions above, drop me a note.