Database Migrations Mental Model

How experienced teams ship Postgres schema changes to running prod systems without downtime — and why some migrations are easy while others can take down your service.

Why this is scary

A schema change in prod is unlike a code change. Code rolls back instantly; a deploy is a few minutes. A schema change can:

Acquire a lock that blocks every read AND write to a table for the duration. On a busy table, that's an outage.
Rewrite the entire table on disk (ALTER TYPE, certain ADD COLUMN patterns) — minutes to hours of full-table I/O.
Be irreversible without a restore (DROP COLUMN, DROP TABLE).
Get serialized behind another long transaction, so you can't even abort it cleanly.
Replicate to read replicas with lag — 30 minutes of replica lag means 30 minutes of stale reads for whatever uses the replica.
Race with rolling deploys: the DB has the new schema, but half your pods are still running old code that breaks on it.

Migrations are scary because the worst-case blast radius is "everyone who uses this DB is down right now".

The good news: every dangerous category has a known-safe alternative. The job is to recognise the category and pick the right pattern.

Taxonomy: easy / medium / hard

The single most useful skill is being able to look at a migration and classify it. Roughly:

🟢 Easy (just ship it)

Operation	Why safe
Add new table	No existing data, no locks held
Add column with constant default (Postgres 11+)	Metadata-only — no table rewrite, no lock beyond microseconds
Add column nullable, no default	Metadata-only
`CREATE INDEX CONCURRENTLY`	Doesn't lock writes
Drop unused index	Almost instant; no lock issues

These are "type and run" — pre-deploy checklist still applies, but you don't need expand/contract.

🟡 Medium (think before running)

Operation	What to watch out for
Add column with NOT NULL + default	Pre-PG11 → table rewrite. PG11+ ok if default is a constant; volatile default (e.g. `gen_random_uuid()`) still rewrites.
Add foreign key	Default takes `ACCESS EXCLUSIVE` on both tables. Use `ADD CONSTRAINT ... NOT VALID` then `VALIDATE CONSTRAINT` separately.
Add unique constraint	Lock during validation. Build a unique index `CONCURRENTLY` first, then `ADD CONSTRAINT ... USING INDEX`.
Drop column	Quick, but deploy code that stops referencing it first, or old code 500s.
Backfill data	Row count × per-row cost. Batch it (100-10,000 rows at a time, with sleeps). Don't `UPDATE ... WHERE x IS NULL` against millions of rows in one statement.

🔴 Hard (expand/contract is the only sane path)

Operation	Why hard
Rename column / table	Old code reads old name; new code reads new name. Single rename = guaranteed window of brokenness.
Change column type	Almost always rewrites the table. Locks for the duration.
Add `NOT NULL` to existing column with NULLs	Need to backfill first; then validate; the `SET NOT NULL` itself takes a lock to scan.
Split one column into two (or merge two into one)	Pure schema-shape change → many app code paths affected.
Restructure a 1:1 relationship into 1:many	Same — multi-step, can't be atomic.

The ones in this row are where teams either (a) take downtime, (b) do expand/contract over multiple deploys, or (c) reach for tooling like pgroll to automate the dance.

The expand/contract pattern (the key idea)

The single most useful pattern. Also called "parallel change."

The shape: never change the schema and the app at the same time. Instead, evolve them in additive only steps until both old and new code work, then remove the old.

Renaming email → email_address as a worked example:

Phase	DB	App	Both old + new code work?
0. Start	`email`	reads/writes `email`	✓
1. Expand (deploy A)	Add `email_address`, nullable	unchanged	✓
2. Dual-write (deploy B)	both columns exist	writes to both, reads from `email`	✓
3. Backfill (no deploy)	populate `email_address` from `email` in batches	unchanged	✓
4. Switch reads (deploy C)	both columns exist	reads from `email_address`, still writes both	✓
5. Stop dual-write (deploy D)	both columns exist	writes only to `email_address`	✓
6. Contract (deploy E)	drop `email`	unchanged	✓

That's a single rename across five deploys. Yes, really. Each step keeps both versions of the code happy → rolling deploys, canaries, and rollbacks all work cleanly.

The same shape covers type changes, constraint tightening, splitting, and 1-to-many restructures. The pattern is the contract: at no point is schema and app version of the column going to disagree.

Postgres-specific traps you have to know

1. The lock queue

ALTER TABLE waits for an ACCESS EXCLUSIVE lock. If a slow query is holding even a SELECT against that table, the DDL waits — and every other query that touches the table now queues behind the DDL. A single slow query → instant outage during a "fast" migration.

Mitigation: always set a short lock_timeout on the migration session, and retry on failure:

SET lock_timeout = '2s';
ALTER TABLE users ADD COLUMN ...;

If we can't grab the lock in 2 seconds, fail fast and retry later. Don't let our DDL be the head of an ever-growing queue.

2. Default values that rewrite the table

The PG11+ optimization: ALTER TABLE ... ADD COLUMN ... DEFAULT <expr> is metadata-only (milliseconds) when the default is non-volatile — Postgres evaluates the expression once and stores it in the catalog (pg_attribute.attmissingval). Every old row gets the same value at read time. New rows evaluate normally on insert.

The optimization applies based on Postgres function volatility class, not on whether the value is "constant" in the everyday sense:

Default	Volatility	Optimization?	Notes
`true`, `0`, `'active'`, `42`	constant (immutable)	✓ fast	Stored once.
`now()`, `CURRENT_TIMESTAMP`	STABLE	✓ fast	Stored once → every old row gets the same timestamp (the moment you ran ALTER TABLE). Often surprising — usually you want different per-row timestamps, in which case backfill in batches instead.
`gen_random_uuid()`, `random()`	VOLATILE	✗ rewrite	Each row needs a distinct value → full table rewrite, ACCESS EXCLUSIVE held the whole time.
`clock_timestamp()`	VOLATILE	✗ rewrite	Different per call.

Mental model: "Can Postgres compute the value once and reuse it?" Yes → metadata-only. No → table rewrite.

Safe pattern for volatile defaults (or when you need real per-row values even from a stable function like now()):

-- step 1: cheap, metadata only
ALTER TABLE t ADD COLUMN c TYPE;

-- step 2: cheap, metadata only — sets the default for FUTURE inserts
ALTER TABLE t ALTER COLUMN c SET DEFAULT <expr>;

-- step 3: backfill existing rows in batches (separate transactions)
UPDATE t SET c = <expr>
  WHERE c IS NULL
  ORDER BY ctid
  LIMIT 10000;
-- repeat until 0 rows updated; sleep between batches to give VACUUM room

-- step 4: optionally enforce NOT NULL after backfill
ALTER TABLE t ALTER COLUMN c SET NOT NULL;  -- still scans the table briefly

Why each step is safe:

(1) new column nullable → metadata only, no row touch
(2) future-only default → metadata only
(3) small batches → row-level locks for milliseconds, no ACCESS EXCLUSIVE. Total wall-clock can be hours on big tables, but the app keeps serving the whole time.
(4) SET NOT NULL does scan the whole table to verify no nulls; takes a brief ACCESS EXCLUSIVE. On big tables you can avoid even this scan with the CHECK (c IS NOT NULL) NOT VALID + VALIDATE CONSTRAINT
- SET NOT NULL trick (PG12+) — Postgres uses the validated check constraint as proof and skips re-scanning.

3. `CREATE INDEX` blocks writes

Plain CREATE INDEX takes a SHARE lock — blocks writes until done. On big tables, that's minutes of write blocking.

Always: CREATE INDEX CONCURRENTLY .... It takes only SHARE UPDATE EXCLUSIVE instead of SHARE — INSERT/UPDATE/DELETE keep working. (It does still block other DDL and VACUUM FULL on the same table.) Wall-clock is slower (it does two table scans and waits for in-flight writers), but no app downtime.

Hard Postgres rule: CREATE INDEX CONCURRENTLY cannot run inside a transaction. Alembic wraps each migration in a transaction by default, so you have to opt out for the index step. The cleanest way is alembic's autocommit_block():

def upgrade():
    with op.get_context().autocommit_block():
        op.create_index(
            "ix_users_email",
            "users",
            ["email"],
            postgresql_concurrently=True,
        )

The block commits the outer transaction before the CREATE INDEX CONCURRENTLY, runs it standalone, and resumes the migration after. Works on Postgres only; the kwarg is ignored on other dialects.

4. Validating new constraints

ADD CONSTRAINT ... NOT VALID skips the validation pass — fast. Then VALIDATE CONSTRAINT separately, which only takes a SHARE UPDATE EXCLUSIVE lock (allows reads + writes, blocks DDL).

This trick works for CHECK, FOREIGN KEY. Use it religiously.

5. MVCC, dead tuples, and autovacuum

This isn't a "trap" exactly — it's the model that explains why the "batch + breathing room" mantra exists everywhere in this doc.

MVCC. When you UPDATE or DELETE a row in Postgres, the old version is not removed from disk. It's marked obsolete; new versions are written elsewhere. This is what lets concurrent transactions see consistent snapshots without blocking each other — Postgres's super-power.

The cost: dead row versions ("dead tuples") accumulate. A million UPDATEs leaves a million dead rows on disk, slowing scans.

VACUUM reclaims the space dead tuples occupy and marks it reusable for future inserts. (VACUUM FULL actually compacts the file, but locks the whole table — almost never used in prod.)

autovacuum is the background daemon that runs VACUUM (and ANALYZE, which refreshes planner stats) on its own. By default it fires when a table is roughly 20% dead tuples. If autovacuum is keeping up, you don't think about it. Ever.

Two failure modes that hit during migrations:

Backfill UPDATE generates dead tuples faster than autovacuum can reclaim them. The table file on disk balloons way past the row count — bloat. Scans slow down, queries get worse. After the backfill you may need an explicit VACUUM tablename; to recover.
A long-running transaction freezes autovacuum globally. Vacuum can't reclaim tuples that any open transaction might still see. So one big multi-hour UPDATE blocks vacuum across the entire DB, bloating every table — not just the one you're touching. This is the worst of the two, because the symptoms appear far from the cause.

This is why "batch + breathing room" matters: short transactions let autovacuum do its job between batches. A common shape:

-- run this in a loop until 0 rows updated:
UPDATE t SET c = <expr>
  WHERE c IS NULL
  ORDER BY ctid
  LIMIT 10000;
-- sleep 1s between batches; autovacuum gets a window to reclaim

After a heavy backfill, follow up with VACUUM tablename; (or even VACUUM ANALYZE tablename;) to be sure. Cheap insurance.

Operational practices

Where migrations run in the deploy pipeline

The two patterns most teams use:

A. Migration as a one-off task before deploy. Run alembic upgrade head as a separate ECS one-off task or CI step, before the new app version starts rolling out. App containers don't run migrations on startup.

Pros: explicit, atomic, fails fast. No race when N tasks all start at once.
Cons: more pipeline plumbing.

B. Migration on app startup. Each container runs alembic upgrade head in its entrypoint (what app4/fastapi-app/entrypoint.sh does today).

Pros: trivial, works for single-task setups.
Cons: N tasks racing. Alembic's locking helps but doesn't make it pretty. Don't use this past 1 task.

For anything resembling prod: use pattern A. Pattern B is fine while learning.

Two real pipelines that follow Pattern A — the migration step is its own stage, before the deploy stage that rolls the new app version:

CI/CD pipeline — orchestrator service

CI/CD pipeline — main app, with per-package fan-out

Backward compatibility is the real product

Code and DB at every deploy boundary must be mutually compatible. Always ask: "if half my fleet is on the new version and half on the old, do they both work against this schema?"

If the answer is no, you need expand/contract.

Rollback plans

Forward-only rollback (preferred). Don't run alembic downgrade. Instead, write a new migration that reverses what you did. Easier to reason about, plays nicely with linear migration history, and you only ever go forward in version.
Schema rollback after data has been written is hard. New rows in the new column don't fit the old schema. Rolling back schema and code together = data loss.
Snapshot before scary changes. RDS can restore a point-in-time snapshot if a migration goes catastrophically wrong. For non-trivial migrations on prod data, take a manual snapshot beforehand.

Test in stages

Migration testing flow — PR pipeline (top) and prod-deploy pipeline (bottom)

Local Postgres (docker-compose) — catches obvious bugs.
Staging with prod-like data volume — catches migrations that pass on 100 rows and fail on 100M. Use anonymized snapshots if you have to.
Production — run during low-traffic window if you can; have an abort plan; watch lock waits and slow queries during the run.

"But running the migration changes the DB — how do I test it again?"

The big problem with migration testing: each run mutates the DB. To test the same migration twice you need to either undo it or start from a clean state. Three patterns, in order from where you are now:

1. Local: nuke the volume.

cd app4/fastapi-app
docker compose down -v   # the -v wipes the postgres volume
docker compose up -d     # fresh DB; entrypoint runs migrations from scratch

Cheapest reset; ~10 seconds. This is the right default while developing locally. Test, break, reset, test again.

2. Staging on RDS: snapshot + restore (one-snapshot, many-instances).

The model: a snapshot is reusable. You restore from it as many times as you want; each restore is an independent fresh DB.

1. Take ONE snapshot of staging (or a sanitized prod snapshot)
   → sits cheap in S3, doesn't go anywhere

2. Restore that snapshot to a NEW disposable RDS instance
   → e.g. "migration-test-001"

3. Run `alembic upgrade head` against it
   → see what happens, log everything

4. Delete the disposable instance
   → no further cost

5. Snapshot is STILL there → repeat steps 2-4 as many times as needed

# create the snapshot (one-time)
aws rds create-db-snapshot \
  --db-instance-identifier staging \
  --db-snapshot-identifier pre-migration-test

# restore to a new instance for each test
aws rds restore-db-instance-from-db-snapshot \
  --db-snapshot-identifier pre-migration-test \
  --db-instance-identifier migration-test-001

# ... run migration, observe ...

# delete when done
aws rds delete-db-instance \
  --db-instance-identifier migration-test-001 \
  --skip-final-snapshot

Snapshots cost ~$0.095/GB/month — very cheap. Restored instances bill hourly at the normal RDS rate; for a 15-minute test run it's pennies.

The trap: people forget to delete the restored instance. RDS keeps billing it hourly forever. Always pair every restore with a delete in the same runbook, or set a CloudWatch alarm on idle test instances.

3. The modern shortcut: cloneable / branchable DBs.

When migration testing becomes a daily activity, the snapshot/restore roundtrip (minutes per cycle) becomes a bottleneck. Two paths from there:

Aurora DB Clones — copy-on-write clone of a running DB, near-instant. Same cost model as a normal Aurora instance.
Neon / Xata / similar — Git-style branching for Postgres outside AWS. "Branch" the DB, run the migration, drop the branch. Test the same migration ten times in a few minutes total.

For learning / small teams: the snapshot+restore pattern is plenty. Reach for clones once you're testing migrations several times a day.

Why not `alembic downgrade`?

You'll see alembic downgrade -1 in tutorials. In theory it's how you "undo" a migration; in practice:

Autogenerated downgrades are often wrong — alembic infers the reverse from the diff and can miss edge cases.
Many operations are lossy. Drop column? Data is gone. Downgrade recreates the column but not the rows that lived in it.
Most teams don't write downgrades at all. Forward-only is the norm — if you need to "undo," write a new migration that reverses what the previous one did. Easier to reason about.

Use alembic downgrade for trivial reversals while iterating in dev. For real testing, fresh DB > downgrade.

Migration review

"Always review auto-generated migrations before running them" — autogenerate misses table renames (sees drop+create, which is data-loss), data migrations, and some constraint changes.

The alembic revision --autogenerate output is a draft, not the final. Review every line. If autogenerate produced a drop_column followed by add_column, that's almost never what you want — it's a rename and the data is gone.

Alembic-specific tips

One change per migration. Easier to review, easier to rollback, easier to bisect when something breaks.
Name them with intent. 2026_04_27_add_email_verified.py not 9f8e7d6c_revision.py. Use --rev-id or filename templates.
Branch labels for splitting environments — rarely needed, but exists if your dev/staging/prod schemas have to diverge briefly.
op.execute() for raw SQL when alembic's helpers don't cover what you need (e.g. CONCURRENTLY, partial indexes, partitioning).
Test the downgrade even if you don't plan to use it — if you can't write a downgrade, the migration probably isn't reversible, which is worth knowing.
Run from CI, not from your laptop. Make the deploy pipeline the only thing with prod DB credentials.

When schema changes are too big: AWS Blue/Green

Sometimes you can't expand/contract — the change is too big (major version upgrade, dramatic restructure) or your app can't handle dual schemas. AWS supports Blue/Green deployments for exactly this — and contrary to what I'd thought, it works on both RDS and Aurora:

	Supported
RDS for MySQL, MariaDB, Postgres	✓
Aurora MySQL, Postgres, Global Database	✓

How it works:

AWS clones your prod ("blue") into a synchronized staging ("green") that streams replication from blue.
You apply the disruptive migration on green only. Blue keeps serving prod.
Test green thoroughly.
Switchover. AWS swaps the endpoints; blue becomes the standby and is retained for fallback.

Switchover times depend on engine:

RDS Blue/Green: typically <5 seconds (or ~2s with the AWS Advanced JDBC Driver, since DNS propagation is bypassed).
Aurora Blue/Green: typically under a minute.

The crucial limitation for schema migrations: replication blue → green is logical, so changes that break replication compatibility break the green deployment. Concretely:

✓ Adding new columns at the end of a table — replication still works.
✗ Renaming columns or tables — breaks replication. You can't use Blue/Green for renames.
✗ Type changes that aren't replication-compatible — see PG logical replication restrictions.

So Blue/Green is the right tool for: major version upgrades (PG14 → PG18), big index reorgs, additive structural changes you want to test against a copy first. Renames still need expand/contract — Blue/Green doesn't let you skip the dance.

Pre-deploy migration checklist

Before clicking "deploy" on anything that touches the DB:

If any answer is "I don't know" — pause and find out.

References

pgroll — zero-downtime Postgres schema migrations — the modern automated expand/contract tool
Zero-Downtime Alembic Migrations on PostgreSQL (Gold Lapel) — alembic-specific patterns
Avoiding deadlocks in Postgres migrations (pganalyze)
Schema changes and the Postgres lock queue (Xata)
Zero-downtime Postgres schema migrations need this: lock_timeout and retries (PostgresAI)
Running a safe database migration using Postgres (Retool)
Which ALTER TABLE Operations Lock Your PostgreSQL Table?
AWS RDS Blue/Green Deployments — schema modifications
strong_migrations gem — Rails-side, but the README's list of unsafe migrations is gold for any Postgres user
pgfence — migration safety CLI — lock mode analysis + risk scoring