— field note

Pre-render your TTS audio, don't stream it

May 17, 2026

The day my test account hit the ElevenLabs free-tier cap mid-session was 2026-05-16. One test user — me, on the couch, with my iPhone. Quiz mode, voice on. Halfway through a deck of cards, the audio cut out. The browser’s network tab showed a 4xx with quota_exceeded in the body.

That was the moment I had to confront the actual cost shape of the voice mode I’d shipped in RecallIQ. Each card playback was a fresh round trip to ElevenLabs. Each user would be their own copy of the bill. The cost wasn’t bad with one user — it was a bill that grew with every review session, every replay, every retry. Not with the size of the catalog. With the use of the catalog. Two very different growth curves.

I rewrote the audio pipeline that week. This is the design and what I learned.

What streaming TTS actually costs you

Money is the loud signal. It’s not the only one.

Money. At ElevenLabs Flash rates (~$0.10 per 1,000 input characters), a 100-character flashcard front costs ~1¢ to synthesize. SRS reviews each card 10–20 times over its lifetime, and that is per user. So one 100-card course consumed by 100 users for a year is:

100 cards × ~15 plays × 100 users × 1¢ ≈ $1,500

That number was real for RecallIQ at any meaningful user count. The catalog wasn’t growing. The reviews against the catalog were. Streaming priced the wrong thing.
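
The same comparison can be made without committing to any per-character price by counting billable renders instead of dollars. A rough sketch; the interface and function names are illustrative and not taken from RecallIQ’s code:

```typescript
// Illustrative only: counts billable TTS renders under each model.
interface UsageEstimate {
  cards: number;        // cards in the course
  playsPerCard: number; // lifetime SRS plays per card, per user
  users: number;        // users studying the course
}

// Streaming: every playback is a paid render.
function streamingRenders(u: UsageEstimate): number {
  return u.cards * u.playsPerCard * u.users;
}

// Pre-rendering: each card is rendered once, no matter who plays it or how often.
function prerenderedRenders(u: UsageEstimate): number {
  return u.cards;
}

const course: UsageEstimate = { cards: 100, playsPerCard: 15, users: 100 };
const ratio = streamingRenders(course) / prerenderedRenders(course);
console.log(ratio); // 1500
```

Whatever one render costs, streaming pays for it 1,500 times more often in this scenario, and the multiplier keeps growing with users and plays.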

Quota. Every TTS provider has a hard usage cap on lower tiers and rate limits on higher ones. Streaming every utterance means every replay, every retry, every accidental tab-switch-and-back hits the meter. A user studying intensely can drain a daily quota by lunchtime. Once it’s gone, voice mode is dead until quota resets. The user has no idea why.

Latency. Every TTS call is a round trip: network out, synthesize, network back. ElevenLabs Flash is fast (~75 ms first-byte) but you’re still paying ~200–400 ms total before the first audio frame plays. Multiply by every TTS hit per session — including replays, retries, browser-cache-miss reloads.

A pre-rendered MP3 served from S3 (with the browser’s HTTP cache in front of it) hits the user’s speaker in dozens of milliseconds. Same audio. Different cost shape.

Predictability. Streaming means cost grows with usage. Every user, every replay, every retry costs marginal money. Predicting a monthly bill becomes guesswork at any non-trivial user count. Pre-rendering anchors cost to content creation, not content consumption. You know what a course costs to ship before anyone reviews it.

UX failure modes during outages. When the streaming quota dies, the user hears… nothing. No spoken error, no retry prompt: in a voice-first flow those would themselves need a TTS call, which is the thing that’s broken. Unless the client separately turns the failed request into an on-screen message, the failure sits at a layer the user can’t see. They think the app is broken in a way they can’t articulate.

The shape of the answer

Pre-render the canonical content — things that get played verbatim, repeatedly, by many users — once at creation time. Store immutable, serve fast. Stream only the genuinely dynamic content that varies per session.

For RecallIQ:

Pre-render: each card’s front and back text. Static per card. Re-played every SRS review.
Pre-render: prayer canonical text. The Hail Mary plays 50+ times in a single memorization session. Streaming that one session was more expensive than the rest of the app combined.
Stream: per-answer grading feedback (“You covered X but missed Y, Z”). Varies per user, per attempt.
Stream: retry prompts (“I didn’t catch that”). Fires rarely, dynamic by design.
Stream: voice greetings (“Voice mode on”, “Paused”). Fixed text, very low volume.

The streaming surface dropped to ~5% of the original volume. The other 95% became one-time renders amortized across every future playback.
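
The split above can be written down as a single routing function, which also makes the line between the two paths explicit in code. This is a sketch under assumptions: the `Utterance` kinds and `ttsStrategy` are hypothetical names that mirror the list, not RecallIQ’s actual types.

```typescript
// Hypothetical utterance kinds, mirroring the pre-render/stream list above.
type Utterance =
  | { kind: "card_side"; cardId: string; side: "front" | "back" }
  | { kind: "prayer_text"; prayerId: string }
  | { kind: "grading_feedback"; text: string }
  | { kind: "retry_prompt"; text: string }
  | { kind: "greeting"; text: string };

type Strategy = "prerender" | "stream";

function ttsStrategy(u: Utterance): Strategy {
  switch (u.kind) {
    case "card_side":
    case "prayer_text":
      return "prerender"; // static text, replayed many times by many users
    case "grading_feedback":
    case "retry_prompt":
    case "greeting":
      return "stream"; // dynamic, or too rare to be worth rendering ahead of time
  }
}
```

Because the switch is exhaustive over the union, adding a new utterance kind forces a compile-time decision about which side of the line it belongs on.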

The architectural pieces

The pipeline is four files in a monorepo. Each does one job. None are big.

A render function — calls ElevenLabs, returns an MP3 buffer. Same code used by the worker (pre-rendering) and the API (streaming fallback). One implementation, no drift between code paths.

const DEFAULT_VOICE_ID = "JBFqnCBsd6RMkjVDRZzb";   // George
const DEFAULT_MODEL_ID = "eleven_flash_v2_5";      // ~75ms first-byte

export async function renderTtsAudio(text: string, opts: RenderTtsOptions = {}): Promise<Buffer> {
  const apiKey = process.env.ELEVENLABS_API_KEY;
  if (!apiKey) throw new TtsError("ELEVENLABS_API_KEY not set", "not_configured");
  // POST to ElevenLabs, return Buffer
}

The error taxonomy is more important than the happy path:

export class TtsError extends Error {
  constructor(
    message: string,
    public readonly code:
      | "not_configured"
      | "quota_exceeded"
      | "network_error"
      | "provider_error",
    public readonly status?: number,
  ) { /* ... */ }
}

quota_exceeded is not a transient error — retrying immediately just burns more attempts against a quota that hasn’t reset. network_error is transient — retry with backoff. provider_error is “ElevenLabs returned a 5xx” — also transient. not_configured means the env var is missing — fail loudly at startup, never get here at all.

Without specific codes, your retry logic is wrong by default.
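
As an illustration of how the codes can drive retry behavior, here is a minimal sketch. The `TtsError` class is re-declared as a stand-in so the snippet runs on its own; `isRetryable`, `backoffDelayMs`, and `withRetry` are hypothetical helpers, and the backoff numbers are placeholders rather than tuned values.

```typescript
type TtsErrorCode = "not_configured" | "quota_exceeded" | "network_error" | "provider_error";

// Minimal stand-in for the TtsError class above, so the sketch is self-contained.
class TtsError extends Error {
  constructor(message: string, public readonly code: TtsErrorCode, public readonly status?: number) {
    super(message);
    this.name = "TtsError";
    // Keeps `instanceof` working even when compiled to an ES5 target.
    Object.setPrototypeOf(this, new.target.prototype);
  }
}

// Only transient failures are worth retrying.
function isRetryable(err: unknown): boolean {
  if (!(err instanceof TtsError)) return false; // unknown failure: surface it, don't loop
  return err.code === "network_error" || err.code === "provider_error";
}

// Exponential backoff with a cap: 500 ms, 1 s, 2 s, ... up to 30 s.
function backoffDelayMs(attempt: number, baseMs = 500, capMs = 30000): number {
  return Math.min(capMs, baseMs * Math.pow(2, attempt - 1));
}

// Runs fn, retrying transient TtsErrors up to maxAttempts times in total.
async function withRetry<T>(fn: () => Promise<T>, maxAttempts = 4): Promise<T> {
  for (let attempt = 1; ; attempt++) {
    try {
      return await fn();
    } catch (err) {
      if (!isRetryable(err) || attempt >= maxAttempts) throw err;
      await new Promise((resolve) => setTimeout(resolve, backoffDelayMs(attempt)));
    }
  }
}
```

With this shape, quota_exceeded and not_configured fail on the first attempt, while network_error and provider_error get a bounded number of spaced-out retries.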

An S3 storage module with key conventions, uploads, and presigned URLs:

// Per-card audio
export function ttsObjectKey(cardId: string, side: "front" | "back"): string {
  return `tts/cards/${cardId}-${side}.mp3`;
}

// Canonical shared audio (one render serves all users)
export function prayerAudioKey(opts: { slug: string; isCanonical: boolean; prayerId: string }): string {
  if (opts.isCanonical) return `library/prayers/${opts.slug}.mp3`;
  return `tts/prayers/${opts.prayerId}.mp3`;
}

export async function uploadTtsAudio(key: string, body: Buffer): Promise<void> {
  await s3().send(new PutObjectCommand({
    Bucket: BUCKET,
    Key: key,
    Body: body,
    ContentType: "audio/mpeg",
    CacheControl: "public, max-age=31536000, immutable",
  }));
}

Two non-obvious details:

The key prefix split. Canonical shared audio (the Hail Mary anyone studies) lives under library/prayers/. User-owned audio (a card from a private course, a non-canonical prayer) lives under tts/. The split lets IAM policies treat them differently: public readability for the library, strict per-user authorization for the rest. Plan the prefix before you upload the first object; S3 has no rename, so changing it later means copying every object.
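Cache-Control: public, max-age=31536000, immutable. A year of TTL. The MP3 under a key never changes: if the underlying text changes, we render to a new key. The old one is orphaned and ages out naturally. Browsers cache it. CDNs cache it. After the first fetch, the client can replay it from cache for as long as the entry survives. One caveat: HTTP caches key on the full URL, query string included, so with presigned URLs the cache only hits while the same signed URL is reused.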
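
The key functions shown above derive the key from the card ID and side alone, so “new text, new key” only holds if an edit also produces a new ID. One possible way to make it hold by construction is to fold a content hash into the key. This is a sketch, not RecallIQ’s implementation; `versionedTtsKey` and its parameters are hypothetical.

```typescript
import { createHash } from "node:crypto";

// Hypothetical helper: the same text + voice + model always maps to the same key,
// and changing any of them yields a new key, so an immutable cached copy never goes stale.
function versionedTtsKey(
  cardId: string,
  side: "front" | "back",
  text: string,
  voiceId: string,
  modelId: string,
): string {
  const digest = createHash("sha256")
    .update(JSON.stringify([text, voiceId, modelId]))
    .digest("hex")
    .slice(0, 12); // a short prefix keeps keys readable
  return `tts/cards/${cardId}-${side}-${digest}.mp3`;
}
```

Including the voice and model in the hash also means a later voice change writes new objects next to the old ones instead of overwriting them in place.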

Bucket is private. The API mints short-lived presigned URLs:

const DEFAULT_PRESIGN_TTL_SECONDS = 3600; // 1 hour

export async function getPresignedTtsUrl(key: string, expiresIn = DEFAULT_PRESIGN_TTL_SECONDS): Promise<string> {
  return getSignedUrl(s3(), new GetObjectCommand({ Bucket: BUCKET, Key: key }), { expiresIn });
}

1 hour TTL is deliberate. Long enough to outlive a quiz card (cards take ~30s of audio, generously). Short enough that a scraped URL pasted into a Reddit comment goes stale before anyone gets value from it.

A BullMQ worker that renders audio asynchronously when a course is generated. One job per card. Idempotent — re-runs skip sides that are already populated:

import { UnrecoverableError, Worker } from "bullmq";

const cardTtsWorker = new Worker<CardTtsJobData>(CARD_TTS_QUEUE, async (job) => {
  const card = await loadCard(job.data.cardId);
  if (!card) return;  // card was deleted after the job was enqueued

  const sidesToRender = pendingSides(card);  // skip sides with an existing audio key
  if (sidesToRender.length === 0) return;

  for (const { side, text } of sidesToRender) {
    try {
      const audio = await renderTtsAudio(text);
      const key = ttsObjectKey(card.id, side);
      await uploadTtsAudio(key, audio);
      await updateCardAudioKey(card.id, side, key);
    } catch (err) {
      if (err instanceof TtsError && err.code === "quota_exceeded") {
        // Quota's gone. UnrecoverableError fails the job without any further attempts;
        // a plain rethrow would let BullMQ's retries beat against the wall.
        throw new UnrecoverableError(err.message);
      }
      throw err;  // transient: retried if the job was enqueued with attempts/backoff
    }
  }
}, {
  connection: redis,
  concurrency: 2,
  limiter: { max: 4, duration: 1000 },
});

Two worker-config decisions that took an outage each to learn:

concurrency: 2 — only two cards rendering at once. Not because the box can’t handle more — because ElevenLabs’s rate limits pop quickly when you batch-render a 200-card course at full speed. The bottleneck isn’t your CPU.
limiter: { max: 4, duration: 1000 } — at most four jobs per second. Belt-and-suspenders for the rate limit. Without this, my first batch of 200 cards killed itself in under a minute on 429s.
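
The idempotency check in the worker hinges on `pendingSides`, which the post doesn’t show. A minimal sketch of what it might look like, assuming a card record with nullable audio-key columns; the `CardRecord` field names are guesses, not RecallIQ’s schema.

```typescript
// Assumed card shape; field names are illustrative only.
interface CardRecord {
  id: string;
  frontText: string;
  backText: string;
  frontAudioKey: string | null;
  backAudioKey: string | null;
}

type Side = "front" | "back";

// Returns only the sides that still need audio, so re-running the same job is a no-op.
function pendingSides(card: CardRecord): Array<{ side: Side; text: string }> {
  const sides: Array<{ side: Side; text: string; audioKey: string | null }> = [
    { side: "front", text: card.frontText, audioKey: card.frontAudioKey },
    { side: "back", text: card.backText, audioKey: card.backAudioKey },
  ];
  return sides
    .filter((s) => s.audioKey === null && s.text.trim().length > 0) // skip rendered or empty sides
    .map(({ side, text }) => ({ side, text }));
}
```

Because the worker writes the audio key only after the upload succeeds, a job that dies between sides picks up with just the remaining side on its next run.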

A client lookup in the browser. When the player needs to speak a card, it checks if the card record has a pre-rendered audio key. If yes, ask the API for a presigned URL and play that. If no (or for dynamic content like grading feedback), fall back to the streaming /voice/tts endpoint.
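
A sketch of that lookup, under assumptions: the `PlayableCard` fields, `pickAudioSource`, and `resolveAudioUrl` are hypothetical, and the two network-facing functions are injected so the snippet doesn’t presume specific API routes.

```typescript
// Hypothetical shapes; the real card record and endpoints may differ.
interface PlayableCard {
  id: string;
  frontText: string;
  backText: string;
  frontAudioKey?: string | null;
  backAudioKey?: string | null;
}

type AudioSource =
  | { kind: "prerendered"; key: string }
  | { kind: "stream"; text: string };

// Decide where the audio for one side of a card should come from.
function pickAudioSource(card: PlayableCard, side: "front" | "back"): AudioSource {
  const key = side === "front" ? card.frontAudioKey : card.backAudioKey;
  if (key) return { kind: "prerendered", key };
  const text = side === "front" ? card.frontText : card.backText;
  return { kind: "stream", text }; // not rendered yet: fall back to streaming
}

// Turn the decision into a URL an <audio> element can play.
// `presign` asks the API for a presigned URL; `streamUrl` builds the streaming endpoint URL.
async function resolveAudioUrl(
  source: AudioSource,
  presign: (key: string) => Promise<string>,
  streamUrl: (text: string) => string,
): Promise<string> {
  return source.kind === "prerendered" ? presign(source.key) : streamUrl(source.text);
}
```

Keeping the decision synchronous and separate from the network call makes the fallback rule easy to unit-test without mocking fetch.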

That’s the whole thing. Render → store → presigned URL → play. Maybe 400 lines across four files.

The 5% that still streams

Tempting to pre-render everything. Don’t.

Grading feedback. Every free-recall answer gets a response unique to what the user said. “You covered the concepts X and Y but missed Z” — content varies per user per attempt. No pre-rendering possible.
Retry prompts. “I didn’t catch that.” Pre-renderable in theory, but the volume is so low (sometimes zero per session) that the migration cost wasn’t worth the savings.
Voice greetings. Fixed text but rarely played. Same reasoning.

Together, ~5% of the original streaming volume. The other 95% became zero-marginal-cost playbacks the moment they were rendered.

What I learned

Idempotency is non-negotiable. Render jobs will be re-enqueued — manual retries, redeploys, BullMQ’s at-least-once semantics. The handler must check whether the audio key already exists and skip if so. Without that, one botched deploy re-renders a 200-card course and burns quota for nothing.

Concurrency limits live at the provider, not your worker. Your worker can scale to whatever the box handles. The third party cannot. Match the provider’s tolerance, not your machine’s headroom.

Cache-Control: immutable is one of the highest-leverage header values you’ll write. A year of TTL on content that genuinely never changes. Browser cache. CDN cache. Each client fetches the bytes once, then replays them from cache. One word in a header saves an unbounded number of round trips.

Specific error codes pay for themselves the first incident. quota_exceeded and network_error route to different recovery paths. The union type is the documentation for the runbook.

Voice choice locks in early. Once you’ve pre-rendered 1,000 cards with George, swapping to a different voice means re-rendering all 1,000. The “nuclear option” is supported but expensive. Pick the voice deliberately; document the choice as a course-level commitment.

Mispronunciation is a pre-rendered cost. RecallIQ shipped with the Hail Mary’s pre-rendered audio saying “I-men” instead of “ah-men” for “Amen” — a non-American-English voice slipped through on one render. With streaming, you can swap voice or model per playback and just try again. With pre-rendered audio, you have to detect the issue, re-render, and re-upload. The fix is straightforward; the detection needs a user complaining. Build a “this audio sounds wrong” feedback path early.

Pre-rendering changes the cost shape from “per playback” to “per content unit.” This isn’t just an optimization — it’s a different relationship between your bill and your traffic. Streaming makes the bill a function of how much your users use the app. Pre-rendering makes it a function of how much content exists. Two completely different growth curves. The architectural choice is also a business-architecture choice.

When you should NOT pre-render

The case for streaming hasn’t disappeared. Stream when:

The content is genuinely dynamic — varies per session, per user, per turn.
Total volume is small: at a few hundred renders a day, the bill isn’t worth optimizing.
You’re prototyping. Get the voice working first. Optimize when usage justifies it.
Voice itself is personalized (per-user voice cloning, or a voice picked at session time), so every render is unique anyway and there is nothing to amortize.

The right architecture is hybrid. Pre-render the canonical static content. Stream the dynamic content. You probably already know which is which in your app — you just haven’t named the line.

The meta-lesson

Efficiency is multidimensional. Cost is the loudest signal because it’s a number you can compare. But quota is efficiency. Latency is efficiency. Predictability is efficiency. UX during outages is efficiency.

The streaming model I started with optimized for the engineer — easier to write, just call the API. The pre-render model optimizes for everyone downstream: the user, the budget, the operations, the future maintainer.

Naive defaults compound. Pick the default deliberately.

This is one of the architectural decisions from the RecallIQ case study, expanded into its own piece.