— field note

Three deploy pivots: each fix bought time, not a solution

The day I noticed I had 58 MiB free on my Mac was the day I had to admit my deploy plumbing wasn’t actually working.

I’d been shipping RecallIQ deploys for two weeks. Build locally on the laptop, push to ECR, swap containers on the EC2 instance. Each deploy added a few hundred MB to Docker’s disk image. None of it ever came back. Two weeks in, the laptop’s Docker.raw had grown to 228 GB and the OS was 58 MiB from wedging itself.

That was pivot two of three. Each one looked like the answer when I shipped it. Each one held for a while. Each one had a wall behind it that I couldn’t see.

This is the shape of the problem, what I learned at each wall, and why the third pivot is the one I’d start with if I rebuilt the pipeline from scratch.

Wall one: build on EC2

The first version was the obvious version. One box. Code on the box. Build on the box. Run on the box. git pull && docker compose up -d --build. That’s the entire deploy.

The box was a t4g.small. 2 GB of RAM. ARM. Cheap. Plenty for an indie app’s runtime — Postgres, Redis, the Fastify API, the BullMQ worker, the Next.js web app, Caddy out front. Together they used about 1.4 GB resident in steady state.

The Next.js compile alone needs more memory than the box has.

The first time I shipped a change that touched the web app, docker compose build OOM-killed itself partway through next build. The kernel started swapping the rest of the stack to make room. Postgres got swapped out. The API stopped responding. The site went offline. Recovery was rebooting the box, which dropped any in-flight connections and lost a few minutes of uptime.

The lesson is one sentence: don’t build where you serve. Especially not on a 2 GB box. Especially not when the build’s peak memory is uncorrelated with the runtime’s steady-state memory.
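One way to make that rule mechanical is a guard that refuses to start a build when the host lacks headroom. This is a minimal sketch, not what the original deploy did: the function names and the NEED_MB threshold are assumptions for illustration, and the real peak of next build has to be measured.

```shell
#!/usr/bin/env bash
# Guard sketch: only build when the host has memory headroom.
# NEED_MB is an assumed figure, not a measured one.

# mem_available_mb [FILE]: MemAvailable from a meminfo-style file, in MiB.
mem_available_mb() {
  awk '/^MemAvailable:/ { print int($2 / 1024) }' "${1:-/proc/meminfo}"
}

# can_build AVAILABLE_MB NEED_MB: succeeds only when there is enough headroom.
can_build() {
  [ "$1" -ge "$2" ]
}

NEED_MB="${NEED_MB:-2048}"   # assumption for illustration
if [ -r /proc/meminfo ]; then
  avail="$(mem_available_mb)"
  avail="${avail:-0}"        # older kernels may not report MemAvailable
  if can_build "$avail" "$NEED_MB"; then
    echo "ok: ${avail} MiB available, build can proceed"
  else
    echo "refusing to build: ${avail} MiB available, need ${NEED_MB} MiB" >&2
  fi
fi
```

On a 2 GB box with about 1.4 GB resident, a guard like this would refuse every time, which is the point: the check fails before the kernel has to.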

The fix: stop building on EC2. Build somewhere with more RAM, push the image to ECR, have EC2 pull instead of build.

Wall two: build locally, push to ECR

I wrote a scripts/deploy.sh that runs on my Mac. 16 GB of RAM, all the headroom in the world. It builds three images (api, worker, web) for linux/arm64 (EC2 is Graviton), tags each as latest and <git-sha>, and pushes them to ECR. Then it SSHes to the EC2 instance and runs docker compose pull && docker compose up -d.

docker buildx build \
  --platform linux/arm64 \
  --target "$svc" \
  --tag "${IMAGE}:latest" \
  --tag "${IMAGE}:${GIT_SHA}" \
  --push \
  .
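The command above is one iteration of a loop over the three services. A sketch of how that loop might look follows. It is not the actual scripts/deploy.sh: the ECR_HOST placeholder, the RUN dry-run switch, and the assumption that the script names images recalliq-<service> (as the CI workflow does) are all mine.

```shell
#!/usr/bin/env bash
# Loop sketch around the buildx call. Hypothetical names: ECR_HOST, RUN.
ECR_HOST="${ECR_HOST:-123456789012.dkr.ecr.us-east-1.amazonaws.com}"  # placeholder account ID
GIT_SHA="${GIT_SHA:-$(git rev-parse --short HEAD 2>/dev/null || echo dev)}"
RUN="${RUN-echo}"   # default prints the commands; set RUN= (empty) to execute them

# image_name SERVICE: repository path for one service.
image_name() {
  printf '%s/recalliq-%s' "$ECR_HOST" "$1"
}

# build_one SERVICE: build the service's Dockerfile target for arm64 and push both tags.
build_one() {
  local svc="$1" image
  image="$(image_name "$svc")"
  $RUN docker buildx build \
    --platform linux/arm64 \
    --target "$svc" \
    --tag "${image}:latest" \
    --tag "${image}:${GIT_SHA}" \
    --push \
    .
}

for svc in api worker web; do
  build_one "$svc"
done
```

Defaulting to a dry run is a deliberate choice here: a deploy script that pushes to a registry is safer when the destructive path has to be asked for explicitly.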

This worked the first time. It worked the second time. It worked for two weeks. I shipped maybe 15 deploys against it. Felt great.

Each deploy left behind layers in Docker’s local store. Cached image layers, intermediate buildx caches, three services × two tags × N builds. Docker doesn’t aggressively clean — that’s by design, so subsequent builds can reuse layers and finish faster.

The actual storage lives in a single sparse file on macOS: ~/Library/Containers/com.docker.docker/Data/vms/0/data/Docker.raw. The file appears small on disk until you allocate space inside Docker’s VM. Then it grows. When you delete things inside the VM, the file does not shrink back. The sparse file holds at its high-water mark.

After two weeks of deploys, my Docker.raw was 228 GB. The disk image setting in Docker Desktop was capped at 256 GB by default. I ran docker system prune -a --volumes. The output said it freed 180 GB. The disk image stayed at 228 GB.

That’s a subtle one. The prune reclaimed space in the filesystem inside the VM. The file on the host disk is still 228 GB until you explicitly delete it and let Docker recreate it.
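The gap between the two views is easy to see from the host. The sketch below compares a file's apparent size (what ls reports) with the space actually allocated to it (what du reports). The helper names are made up for illustration, and the demo runs against a throwaway sparse file rather than Docker.raw.

```shell
#!/usr/bin/env bash
# Compare a file's apparent size with the blocks the host has allocated to it.

# apparent_bytes FILE: logical size in bytes, as ls reports it.
apparent_bytes() {
  ls -ln "$1" | awk '{ print $5 }'
}

# allocated_kib FILE: space actually backed by disk blocks, in KiB.
allocated_kib() {
  du -k "$1" | awk '{ print $1 }'
}

# sparse_report FILE: print both numbers side by side.
sparse_report() {
  printf '%s\n  apparent:  %s bytes\n  allocated: %s KiB\n' \
    "$1" "$(apparent_bytes "$1")" "$(allocated_kib "$1")"
}

# Demo: a 10 MiB sparse file made by seeking past the end and writing nothing.
demo="$(mktemp)"
dd if=/dev/zero of="$demo" bs=1 count=0 seek=10485760 2>/dev/null
sparse_report "$demo"
rm -f "$demo"
```

Pointing sparse_report at Docker.raw shows what the host is paying for, while docker system df reports usage as seen from inside the VM. Comparing the allocated figure before and after a prune is the quickest way to check whether the host got anything back.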

To recover the host disk: Docker Desktop → Settings → Resources → Disk image size slider → reduce. That deletes Docker.raw and creates a new one at the requested size. You lose every image and volume in the process. After that the next deploy rebuilds everything from scratch, which is slow because no layers are cached.

I ran into this when the Mac hit 58 MiB free. Two deploys had failed mid-push that morning. One had corrupted Docker’s blob store. One had triggered a kernel-extension reset to free up swap. The OS was alerting me about low disk every few minutes.

The fix was to nuke Docker.raw and start fresh, which I did. The deploys started working again.

It was at this point that I had to admit Wall Two wasn’t fixed — it was timed. The 14 GB → 228 GB growth had taken two weeks. The next two weeks would be the same problem with a fresh-from-scratch Docker.raw. The wall didn’t move; I just made room behind it for a while.

The lesson is the meta-pattern: each fix bought time, not a solution. Build-on-EC2 was bounded by 2 GB of RAM. Build-on-Mac was bounded by my drive size. Different dimension, same shape.

There was a smaller lesson too: there are two extremely surprising things about Docker.raw that I would have liked to know on day one.

  1. It’s sparse. It doesn’t shrink when you delete files inside it.
  2. docker system prune operates inside the VM. It does not touch the file on your host disk.

If you Google this, the search results are mostly people complaining and being told to “just turn off Docker Desktop and delete the file.” Which is correct, but also exactly the kind of thing you only learn by hitting it.

Wall three: GitHub Actions

The third pivot solves two dimensions at once.

The first problem was network. Pushing a 700 MB image from a residential connection takes 14-19 minutes per service over my uplink. Three services pushed serially is 42-57 minutes of “watch the progress bar” before EC2 sees anything new.
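The arithmetic behind those numbers, as a quick sketch. The function names are illustrative, and the throughput figure assumes decimal units (1 MB = 8 Mbit) and ignores protocol overhead.

```shell
#!/usr/bin/env bash
# Back-of-envelope numbers for pushing three 700 MB images one after another.

# serial_minutes PER_SERVICE_MIN SERVICES: total time when pushes run serially.
serial_minutes() {
  echo $(( $1 * $2 ))
}

# uplink_mbit SIZE_MB MINUTES: implied average upload throughput in Mbit/s.
uplink_mbit() {
  awk -v mb="$1" -v min="$2" 'BEGIN { printf "%.1f\n", mb * 8 / (min * 60) }'
}

echo "serial push: $(serial_minutes 14 3)-$(serial_minutes 19 3) min"        # 42-57 min
echo "implied uplink: $(uplink_mbit 700 19)-$(uplink_mbit 700 14) Mbit/s"   # 4.9-6.7 Mbit/s
```

An uplink in the 5 to 7 Mbit/s range is the real constraint here. No amount of local RAM or disk changes it.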

The second problem was disk. As long as Docker is running on a developer machine, builds accumulate. Push the builds off the developer machine, the disk pressure goes away. Permanently.

GitHub Actions does both. The matrix build runs each service on its own runner, in parallel, with gigabit uplink to AWS. The result is about 3 minutes wall-clock from push to “containers swapped on EC2”, versus up to 57 minutes locally.

Here’s the build job:

build:
  runs-on: ubuntu-latest
  strategy:
    fail-fast: false
    matrix:
      service: [api, worker, web]
  steps:
    - uses: actions/checkout@v4
    - name: Set short SHA
      id: sha  # required so steps.sha.outputs.value resolves in the tags below
      run: echo "value=$(git rev-parse --short HEAD)" >> "$GITHUB_OUTPUT"

    - name: Configure AWS credentials
      uses: aws-actions/configure-aws-credentials@v4
      with:
        aws-access-key-id: ${{ secrets.AWS_ACCESS_KEY_ID }}
        aws-secret-access-key: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
        aws-region: us-east-1

    - uses: aws-actions/amazon-ecr-login@v2
    - uses: docker/setup-qemu-action@v3
      with: { platforms: arm64 }
    - uses: docker/setup-buildx-action@v3

    - name: Build and push
      uses: docker/build-push-action@v6
      with:
        context: .
        platforms: linux/arm64
        target: ${{ matrix.service }}
        tags: |
          ${{ env.ECR_HOST }}/recalliq-${{ matrix.service }}:latest
          ${{ env.ECR_HOST }}/recalliq-${{ matrix.service }}:${{ steps.sha.outputs.value }}
        push: true
        cache-from: type=gha,scope=${{ matrix.service }}
        cache-to: type=gha,scope=${{ matrix.service }},mode=max

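A few things in there matter:

  1. fail-fast: false lets the other two services finish building even if one of them fails.
  2. setup-qemu-action is what lets an x86 ubuntu-latest runner produce linux/arm64 images. The build runs under emulation.
  3. The type=gha cache is scoped per service, so the three matrix jobs don’t overwrite each other’s layers, and mode=max caches intermediate stages as well as the final image.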

A second job — swap — depends on build and SSHs into EC2 to pull the new images and roll the containers:

swap:
  runs-on: ubuntu-latest
  needs: build
  steps:
    - name: Configure SSH key
      run: |
        mkdir -p ~/.ssh
        echo "${{ secrets.EC2_SSH_KEY }}" > ~/.ssh/ec2.pem
        chmod 600 ~/.ssh/ec2.pem

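    - name: Pull images + restart containers
      run: |
        # A fresh runner has no known_hosts; accept-new trusts the host key on first connect.
        ssh -i ~/.ssh/ec2.pem -o StrictHostKeyChecking=accept-new "$EC2_HOST" "cd $REPO_DIR && \
          git pull origin main && \
          docker compose pull api worker web && \
          docker compose up -d api worker web"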

    - name: Verify
      run: |
        sleep 5
        curl -sf https://study.samueleddy.com/api/health > /dev/null
        curl -sf https://study.samueleddy.com/ -o /dev/null
        echo "✓ both endpoints responding"
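A fixed sleep 5 works as long as the containers come up within five seconds. A retry loop is a common alternative. This is a sketch, not what the workflow runs: wait_healthy, its defaults, and the CHECK override (there so the loop can be exercised without a network) are assumptions for illustration.

```shell
#!/usr/bin/env bash
# Poll a URL until it responds instead of sleeping a fixed interval.

# wait_healthy URL [ATTEMPTS] [DELAY_SECONDS]
# Returns 0 as soon as the check passes, 1 once all attempts are used up.
# CHECK can replace the default curl probe, e.g. CHECK=true in a test.
wait_healthy() {
  local url="$1" attempts="${2:-10}" delay="${3:-3}" i=1
  while [ "$i" -le "$attempts" ]; do
    if ${CHECK:-curl -sf -o /dev/null} "$url"; then
      echo "healthy after ${i} attempt(s): ${url}"
      return 0
    fi
    # Pause between attempts, but not after the last one.
    [ "$i" -lt "$attempts" ] && sleep "$delay"
    i=$((i + 1))
  done
  echo "still unhealthy after ${attempts} attempts: ${url}" >&2
  return 1
}

# Usage in the Verify step would look like:
#   wait_healthy https://study.samueleddy.com/api/health
#   wait_healthy https://study.samueleddy.com/
```

The trade-off is a slower failure (up to attempts × delay seconds) in exchange for not flagging a deploy as broken because a container took six seconds to boot.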

scripts/deploy.sh still exists — it’s the fallback if GH Actions is down, or for testing pre-merge images, or for a service-specific deploy where you don’t want to push to main first. But the standard path is git push and walk away.

Honest cost framing

I’m not going to claim “free CI/CD.” That’s true for me right now and not true for everyone.

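GitHub Actions’ free tier is 2,000 minutes per month for private repos on the Free plan. Minutes are billed per job and rounded up to the whole minute, so a deploy with three parallel build jobs plus the swap job costs more than its ~3 minutes of wall-clock: up to roughly 10 billed minutes. At a few deploys per day, that still leaves headroom.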

If you’re on a team pushing a hundred deploys a day, or if your builds are longer because you’re not using ARM-emulation tricks, the math changes. Public repos are unlimited. Private repos with heavy CI need a paid plan or a self-hosted runner.
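The quota math is worth doing explicitly, because billing counts job-minutes and not wall-clock. The sketch below assumes three build jobs at about 3 billed minutes each, a swap job at 1 minute, and 3 deploys a day. All of those are illustrative figures, not measurements.

```shell
#!/usr/bin/env bash
# Estimate billed GitHub Actions minutes for the matrix deploy.

# billed_per_deploy BUILD_JOBS MIN_PER_BUILD_JOB SWAP_MIN
billed_per_deploy() {
  echo $(( $1 * $2 + $3 ))
}

# monthly_minutes PER_DEPLOY DEPLOYS_PER_DAY DAYS
monthly_minutes() {
  echo $(( $1 * $2 * $3 ))
}

per_deploy="$(billed_per_deploy 3 3 1)"          # assumed job durations
month="$(monthly_minutes "$per_deploy" 3 30)"    # assumed 3 deploys/day
echo "per deploy: ${per_deploy} billed min"              # 10
echo "per month:  ${month} of 2000 free min"             # 900
```

Under those assumptions the free tier runs out at around six or seven deploys a day, which is a more useful number to know than “plenty.”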

It also gets cheaper the more you use it, in a sense — the type=gha cache amortizes Dockerfile layer rebuilds across runs. The first build is slow (10+ minutes cold). Subsequent builds that hit cache are 1-3 minutes.

The pattern

Three walls, three dimensions:

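| Pivot          | What broke                        | What was undersized |
|----------------|-----------------------------------|---------------------|
| Build on EC2   | RAM (OOM during Next.js compile)  | Compute             |
| Build on Mac   | Disk (Docker.raw grew to 228 GB)  | Storage             |
| GitHub Actions | (nothing yet)                     | n/a                 |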

When one box does compute, transport, and storage, the smallest dimension breaks first. The fix is always the same shape: take that dimension off that box.

The meta-thesis is that “fix it on the same box” buys time, not a solution. Build-on-EC2 was bounded by 2 GB of RAM. I moved the build to a box with more RAM. New ceiling: my Mac’s disk. Move the build off the Mac. New ceiling: GitHub Actions’ quota. That ceiling is high enough that I haven’t hit it yet, but it exists. When I do hit it, the answer will be “move the build off GitHub Actions” (self-hosted runners, dedicated build VM, whatever).

Every pivot is a step further in the same direction: separate the work from the machine that serves.

What I’d do differently from day one

Skip Walls one and two. Start with GitHub Actions on day one for any project that isn’t a five-minute prototype.

The friction of setting up Actions on day one is small. The friction of un-tangling two weeks of habits formed around the wrong topology is larger. By the time I had scripts/deploy.sh working, I was psychologically committed to the local-build pattern. It took 58 MiB free to admit it had to change.

If GH Actions isn’t right (private team, compliance, whatever) — fine, but apply the same principle. Pick a compute resource for builds that’s distinct from your serving resource. A small dedicated build VM. A teammate’s spare Mac mini. Anything that lets your serving box stay focused on serving.

The smaller lessons

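Two specific things worth knowing because most developers learn them on the floor: Docker.raw doesn’t shrink when you delete files inside it, and docker system prune frees space inside the VM without touching the file on your host disk.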

The closer

If you’re shipping an indie app on a small EC2 instance and the deploys are slow, the wall isn’t where you think it is. The first wall is RAM. The next wall is your laptop’s disk. The wall after that is the third party you offloaded to.

Each fix buys time. None of them are the solution. The solution is separating the work from the machine that serves, and doing it explicitly, before the box in front of you tells you to.


This is one of the architectural decisions from the RecallIQ case study, extracted into its own piece.