— field note

iOS Safari audio sessions: fifteen commits to a working voice mode

Voice mode is supposed to be the easy part. You ask the user a question, they speak, you transcribe, you reply, you synthesize. On a laptop, in Chrome, this works the first time and you move on. On iOS Safari, you can spend two weeks here.

What the user sees, on their iPhone, in different combinations:

Every one of those was a real session in my browser logs. The voice mode in RecallIQ — the spaced-repetition study app I built — ended up shipping with a working iOS Safari implementation after about fifteen commits of fixes, reverts, and re-fixes. This post is the chase, in order, with the patterns that worked, why they work, and the limits I still hit.

The category problem, in one paragraph

iOS audio routes through AVAudioSession categories. Two of them matter for a voice agent: playback (TTS to the speaker, the default for media) and play-and-record (mic and speaker both active, mandatory for capture, but the speaker output gets de-prioritized: sometimes silenced, sometimes routed to the earpiece, sometimes attenuated). In native iOS you set the category explicitly with AVAudioSession.sharedInstance().setCategory(.playAndRecord, options: [.defaultToSpeaker]). Three lines, well-trodden, lots of documentation. In Safari, you can’t. WebKit infers the category from your media API calls: AudioContext alone gives you playback, getUserMedia flips you to play-and-record, and both at once gives you play-and-record without a defaultToSpeaker equivalent, so TTS goes wherever Safari thinks is polite. The browser decides. The browser decides wrong for voice agents.

That’s the root cause of every symptom above.
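One way to keep the inference straight is to write it down as a decision function. The sketch below is a toy model of the behavior described in this post, not WebKit’s implementation; PageMediaState and inferCategory are names made up for illustration.

```typescript
// Toy model of the category inference described above.
// Not WebKit source code: it only encodes the behavior observed in this post.
type InferredCategory = 'none' | 'playback' | 'play-and-record';

interface PageMediaState {
  audioContextOpen: boolean; // an AudioContext has been created
  micTrackLive: boolean;     // a getUserMedia audio track is still live
}

function inferCategory(state: PageMediaState): InferredCategory {
  // Any live mic track wins: capture requires play-and-record.
  if (state.micTrackLive) return 'play-and-record';
  // Output only: the default media category.
  if (state.audioContextOpen) return 'playback';
  // Nothing audio-related has happened yet.
  return 'none';
}
```

The property to notice is that a live mic track always wins, which is why the fixes that follow all revolve around when tracks are opened and stopped.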

The first fix that didn’t fix it

The obvious move: stop using getUserMedia until you actually need to listen. Open the AudioContext at session start. Only call getUserMedia when entering a listen phase, then release the tracks when the user finishes.

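// audioCtx and workletNode are assumed to be created once at session start.
let stream: MediaStream | null = null;
let micSource: MediaStreamAudioSourceNode | null = null;

async function openListen() {
  stream = await navigator.mediaDevices.getUserMedia({audio: true});
  // Feed the mic into the worklet; without this the worklet never sees audio.
  micSource = audioCtx.createMediaStreamSource(stream);
  micSource.connect(workletNode);
  workletNode.connect(audioCtx.destination);
  // ...pipe to Deepgram WS
}

function closeListen() {
  micSource?.disconnect();
  micSource = null;
  // Stopping every track is what lets iOS leave play-and-record.
  stream?.getTracks().forEach(t => t.stop());
  stream = null;
}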

Mostly worked — except for the first card. TTS for the first card was silent. Every single time.

The reason: even after I’d closed all the mic tracks, iOS hadn’t actually flipped the audio session back to playback. It clung to play-and-record for several hundred milliseconds. The first TTS would start before the flip completed, route to the earpiece, and the user heard nothing.

Attempted fix: delay the first utterance with setTimeout(speakFirstCard, 400) so the flip has time to finish. Works in dev. Ship it. Test on a real phone. Doesn’t work.

The reason it didn’t work on the real phone: setTimeout breaks the gesture chain. iOS requires that audio playback be initiated within a synchronous call chain rooted at a user gesture (a tap, a click). A setTimeout in the middle of that chain breaks the chain. The audio plays “in response to a timer,” not “in response to the tap.” iOS silently denies it.

setTimeout is off the table.

That commit got reverted. Onto fix two.

Routing TTS through AudioContext.destination

The actual fix turned out to be: stop using SpeechSynthesis entirely. Send the TTS audio bytes to the AudioContext directly.

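async function playTtsViaAudioContext(text: string, ctx: AudioContext): Promise<void> {
  const response = await fetch('/voice/tts', {
    method: 'POST',
    body: JSON.stringify({text}),
    headers: {'Content-Type': 'application/json'},
  });
  if (!response.ok) {
    // Fail loudly instead of handing an error page to the decoder.
    throw new Error(`TTS request failed: ${response.status}`);
  }
  const arrayBuffer = await response.arrayBuffer();
  const audioBuffer = await ctx.decodeAudioData(arrayBuffer);

  const source = ctx.createBufferSource();
  source.buffer = audioBuffer;
  source.connect(ctx.destination);

  // Register the listener before start(); the promise resolves when playback finishes.
  const finished = new Promise<void>(resolve => {
    source.addEventListener('ended', () => resolve(), {once: true});
  });
  source.start();
  return finished;
}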

The /voice/tts endpoint is a thin proxy to ElevenLabs — POST text, return audio bytes. The browser fetches the bytes, decodes them through the AudioContext, plays through ctx.destination.

Why this works: iOS treats AudioContext.destination output differently from SpeechSynthesis audio. Even in play-and-record mode, AudioContext output goes to the speaker. Reliable, loud, normal.

SpeechSynthesis is the API designed for accessibility readers — VoiceOver, screen readers. iOS routes it as “system speech.” When the audio session is in play-and-record, system speech gets demoted. The AudioContext path is the media path. Media plays through the speaker.

This was the largest single fix. A handful of lines of code, two days of debugging.

AudioContext suspension between gestures

The next thing that broke: TTS would play on card one, and the audio cut off halfway through card two.

AudioContexts on iOS Safari suspend themselves between user gestures. A tap opens the session, audio plays. The model finishes responding. The user reads the response. Five seconds pass. The user taps “next card.” The next TTS doesn’t play.

The fix is a small guard before every playback:

async function speak(text: string) {
  if (audioCtx.state === 'suspended') {
    await audioCtx.resume();
  }
  await playTtsViaAudioContext(text, audioCtx);
}

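AudioContext.resume() is safe to call repeatedly: on a context that is already running it resolves without doing anything. It is not synchronous, though. It returns a promise, so await it before starting playback, and trigger it from the tap handler so it counts as gesture-initiated. Always check state === 'suspended' before playing TTS. Always.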

I missed this for a day because it works fine on the desktop. iOS only.
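A slightly more defensive version of that guard, as a sketch. It treats any state other than running as needing a resume (I believe Safari can also report an interrupted state, for example after a phone call, but verify that on a device), and it reports failure instead of throwing. ensureRunning and ResumableContext are hypothetical names; a real AudioContext fits the interface structurally.

```typescript
// Minimal structural type: only the members this sketch touches.
interface ResumableContext {
  readonly state: string;
  resume(): Promise<void>;
}

// Hypothetical helper: resolves to true if the context is running afterwards.
async function ensureRunning(ctx: ResumableContext): Promise<boolean> {
  // A closed context cannot be revived; the caller has to create a new one.
  if (ctx.state === 'closed') return false;

  if (ctx.state !== 'running') {
    try {
      await ctx.resume();
    } catch {
      // resume() can reject, e.g. when the call did not come from a gesture.
      return false;
    }
  }
  // Re-read the state rather than trusting that resume() succeeded.
  return ctx.state === 'running';
}
```

In the speak function above, the guard would become a check on the returned boolean, so a failed resume can surface a “tap to continue” prompt rather than a silent card.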

The mic-permission lifecycle

Even with TTS-via-AudioContext working, a third class of failure remained: the user grants mic permission, voice mode works for the first interaction, the user starts a new session ten minutes later, grants mic permission again, and TTS is silent on the welcome utterance.

The fix: prime the mic permission at session open, immediately release the tracks.

async function openSession() {
  // 1. Prime mic permission inside the user gesture
  const primingStream = await navigator.mediaDevices.getUserMedia({audio: true});
  primingStream.getTracks().forEach(t => t.stop());

  // 2. Audio session is now back in 'playback' mode.
  //    AudioContext opens cleanly.
  audioCtx = new AudioContext();

  // 3. Greet — TTS plays through the speaker.
  await speak("Voice mode on. Here's your first card.");
}

What’s happening: opening the priming stream forces iOS to ask for permission (if not already granted) and flips to play-and-record. Releasing the tracks immediately flips back to playback. Permission is now granted; the audio session is in the right mode for the welcome TTS.

When the user taps “I’m ready to answer,” we open the mic again with getUserMedia. That flips back to play-and-record for the listen phase. After they finish speaking, we close the tracks. Back to playback for the next TTS.

This open-close-open dance is non-obvious. On desktop, you’d just open the mic once and keep it open. On iOS, that would route every TTS through the earpiece.
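The dance is easier to keep correct when a single object owns the stream. The sketch below is one way to do that, not the RecallIQ code: MicGate is a hypothetical helper, and the openMic function is injected so that in the page it can be () => navigator.mediaDevices.getUserMedia({audio: true}).

```typescript
// Structural types: only what the sketch needs. A real MediaStream fits.
interface StoppableTrack { stop(): void; }
interface TrackedStream { getTracks(): StoppableTrack[]; }

// Hypothetical helper that owns the mic for one listen phase at a time.
// It does not guard against overlapping open() calls; await each phase.
class MicGate<S extends TrackedStream> {
  private live: S | null = null;
  private readonly openMic: () => Promise<S>;

  constructor(openMic: () => Promise<S>) {
    this.openMic = openMic;
  }

  get isOpen(): boolean {
    return this.live !== null;
  }

  // Enter the listen phase. Reuses the stream if one is already open.
  async open(): Promise<S> {
    if (this.live !== null) return this.live;
    const acquired = await this.openMic();
    this.live = acquired;
    return acquired;
  }

  // Leave the listen phase: stop every track so no live track remains.
  release(): void {
    this.live?.getTracks().forEach(t => t.stop());
    this.live = null;
  }

  // Prime permission at session open: acquire, then release immediately.
  async prime(): Promise<void> {
    await this.open();
    this.release();
  }
}
```

With this shape, openSession calls prime() once, each listen phase is open() followed by release(), and there is exactly one place where a forgotten track could keep the session in play-and-record.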

The W3C Audio Session API on iOS 17+

iOS 17 shipped the W3C Audio Session API in Safari. It lets you tell WebKit explicitly which audio session category you want, rather than letting it infer.

if ('audioSession' in navigator) {
  (navigator as any).audioSession.type = 'play-and-record';
}

This isn’t a silver bullet — Safari still does its own thing with category inference — but giving it an explicit hint helps. Without it, audioSession.type defaults to auto, and Safari guesses based on which APIs you’ve called. With it, you’re declaring intent.

Set it once at session open. Don’t try to flip it during a session — iOS gets confused.

The API isn’t on iOS 16 or earlier. Wrap in a feature detect.
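If you’d rather not sprinkle casts to any through the code, the feature detect can live in one typed helper. This is a sketch: hintAudioSession, SessionTypeHint, and MaybeAudioSessionHost are hypothetical names, and the list of type values is taken from the Audio Session API draft, so check it against the current spec.

```typescript
// Values the Audio Session API draft defines for `type`.
type SessionTypeHint =
  | 'auto'
  | 'playback'
  | 'transient'
  | 'transient-solo'
  | 'ambient'
  | 'play-and-record';

// Minimal shape of a navigator that may or may not expose audioSession.
interface MaybeAudioSessionHost {
  audioSession?: { type: string };
}

// Hypothetical helper: true if the hint was applied, false if unsupported.
function hintAudioSession(host: MaybeAudioSessionHost, type: SessionTypeHint): boolean {
  const session = host.audioSession;
  if (!session) return false; // API not present on this browser
  try {
    session.type = type;
  } catch {
    return false; // the browser refused the assignment
  }
  // Read the value back: the browser may ignore a value it does not support.
  return session.type === type;
}
```

In the page this would be called once at session open with navigator (cast as needed if your DOM typings lack audioSession). The boolean is worth logging, since it tells you which sessions ran on inference alone.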

Barge-in is a heuristic, not a fact

Conversation Relay’s interruptible="any" setting lets the user talk over the assistant — barge-in cancels the in-flight TTS and starts a new listen turn. On iOS, it fires more often than you want.

The reason: Safari’s echo cancellation is tuned for VoIP. It aggressively suppresses what the speaker plays before the mic signal reaches the page, but the suppression is imperfect. Some speaker bleed gets through and is reported to the application as if the user were talking. So you get phantom barge-ins triggered by the assistant’s own TTS leaking from the speaker into the mic.

The mitigation isn’t at the API layer — Safari doesn’t expose echo cancellation tuning. The mitigation is heuristic: require some minimum utterance length before treating a barge-in event as real. If the “user speech” is shorter than ~300ms, assume it was echo bleed. Continue the TTS.

This works most of the time. It also occasionally suppresses real short user interjections (“yes,” “no,” “stop”). Tradeoff.
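A minimal sketch of that heuristic, assuming your speech layer gives you start and end signals with timestamps. BargeInFilter and its method names are hypothetical; the 300 ms default is the threshold from this post, not a tuned constant.

```typescript
// Hypothetical filter: a barge-in counts only after speech has lasted minMs.
class BargeInFilter {
  private speechStartedAt: number | null = null;
  private readonly minMs: number;

  constructor(minMs = 300) {
    this.minMs = minMs;
  }

  // Call when the speech layer reports that speech started.
  onSpeechStart(nowMs: number): void {
    if (this.speechStartedAt === null) this.speechStartedAt = nowMs;
  }

  // Call while speech continues; true means "interrupt the TTS now".
  shouldInterrupt(nowMs: number): boolean {
    return this.speechStartedAt !== null && nowMs - this.speechStartedAt >= this.minMs;
  }

  // Call when speech stops. Returns true if the burst was long enough to be real.
  onSpeechEnd(nowMs: number): boolean {
    const real = this.shouldInterrupt(nowMs);
    this.speechStartedAt = null;
    return real;
  }
}
```

Passing timestamps in, rather than reading a clock inside the class, keeps the filter easy to replay against recorded sessions when you want to try a different threshold.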

The pattern, end-to-end

Every individual fix above is fragile on its own. The working state is the combination. Remove any one and a different failure mode comes back.

  1. AudioContext for TTS, not SpeechSynthesis. Routes around iOS’s audio-session demotion of system speech.
  2. AudioContext.resume() before every TTS. iOS suspends contexts between user gestures.
  3. Mic prime-and-release at session open. Permission granted, audio session flipped back to playback for the welcome TTS.
  4. Open mic on listen, close mic on speak. Don’t hold a continuous mic stream; flip categories per phase.
  5. navigator.audioSession.type = 'play-and-record' on iOS 17+. Give WebKit explicit intent.
  6. Barge-in minimum-utterance heuristic. Discard sub-300ms “user speech” as echo bleed.

That’s the configuration that produces a reliable voice mode in Safari. None of it is documented as a coherent pattern; every piece is a separate workaround for a separate undocumented iOS behavior.
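To show how the pieces line up in time, here is a sketch of the call order with every dependency injected. VoiceDeps, openVoiceSession, and runCardTurn are placeholder names, and setting the session hint before priming the mic is my assumption about ordering; the list above only says both happen at session open. Barge-in filtering (item 6) would live inside the listen and playTts implementations and is not shown.

```typescript
// Placeholders for the pieces discussed in this post, not a real API.
interface VoiceDeps {
  hintSession(): void;                  // item 5: navigator.audioSession.type, if present
  primeMic(): Promise<void>;            // item 3: getUserMedia, then stop the tracks
  resumeContext(): Promise<void>;       // item 2: resume the AudioContext if not running
  playTts(text: string): Promise<void>; // item 1: decode and play via AudioContext
  listen(): Promise<string>;            // item 4: open mic, transcribe, close mic
}

// Runs inside the tap that turns voice mode on.
async function openVoiceSession(deps: VoiceDeps, greeting: string): Promise<void> {
  deps.hintSession();
  await deps.primeMic();
  await deps.resumeContext();
  await deps.playTts(greeting);
}

// Runs inside each "next card" tap.
async function runCardTurn(deps: VoiceDeps, prompt: string): Promise<string> {
  await deps.resumeContext(); // the context may have suspended since the last tap
  await deps.playTts(prompt); // mic is closed while speaking
  return deps.listen();       // mic opens only for the listen phase
}
```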

What I still don’t know

Safari’s audio plumbing is undocumented at the depth this work requires. The constraints I’ve reverse-engineered are stable across iOS 17.x and 18.x. I don’t know what 19 will change. WebKit doesn’t ship a changelog for category inference behavior.

I also don’t have a clean way to detect, from the page, when iOS demotes my TTS to the earpiece versus the speaker. The source node’s ended event fires either way. Telemetry stops there.

Native iOS is fully tunable here. AVAudioSession.sharedInstance().setCategory(.playAndRecord, options: [.defaultToSpeaker]) and AVAudioEngine give you the control Safari doesn’t expose. The eventual move for any production-grade continuous-mic voice agent is a thin native wrapper: Capacitor or React Native around the web UI, with native modules for the audio plumbing. The web stays the source of truth for everything except the bytes flowing in and out of the speaker.

I haven’t done that yet for RecallIQ. The Safari path is good enough for my use. If the user base grows, the wrapper is the next milestone.

When I’d go native instead

If you can ship the entire thing without a continuous-mic voice mode — half-duplex, press-to-talk, release-to-listen — you don’t need to leave Safari. getUserMedia opened only during the talk window, released afterward, AudioContext for TTS. Works fine.

If you need continuous-mic with barge-in and the experience has to be polished, you’re going native eventually. The threshold is roughly: are you willing to accept phantom barge-ins, occasional earpiece routing, and the open-close-open mic dance? On a personal project, yes. On something users pay for, probably not.

The meta-lesson

You don’t learn this by reading. The Apple documentation tells you about AVAudioSession categories on native iOS — it doesn’t tell you how Safari maps them to WebKit. The MDN documentation tells you about AudioContext and getUserMedia — it doesn’t tell you that they affect each other on iOS in undocumented ways.

You learn it by shipping the app on an actual iPhone, recording sessions where the user can’t hear anything, and stepping through the audio session transitions until you understand what’s happening.

The reason this post exists is to spare the next person fifteen commits of that. If it spared you any, drop me a note at contact@samueleddy.com.


This is the iOS audio chapter of the RecallIQ case study, expanded into its own piece.