Schema Changes Without Downtime: The Art of Zero-Disruption Migrations
databases backend postgresql migrations system-design
Every backend engineer eventually faces the same scary moment: you need to rename a column, split a table, or change a data type - and you cannot afford downtime.
The instinct is to just “run the migration.” The correct instinct is to treat schema evolution as a multi-step, backwards-compatible dance between your database and your code.
Let’s break it down.
Why Naive Migrations Break Things
When you run ALTER TABLE users RENAME COLUMN email TO email_address, the database-level change is instant on most engines - but your application code still refers to email. The moment the migration runs, every in-flight request reading or writing email throws an error.
This is the core problem: code and schema are out of sync during deployment.
The fix is to never let them be out of sync - even briefly.
The Expand / Contract Pattern
This is the foundational pattern. All zero-downtime schema migrations are variations of it.
Phase 1 - Expand: Add the new thing. Keep the old thing. Both work.
Phase 2 - Migrate: Backfill data. Update app to write to both.
Phase 3 - Contract: Remove the old thing once code no longer needs it.
Never compress these into one step. Each phase is a separate deployment.
Example 1: Renaming a Column
Suppose you want to rename users.name → users.full_name.
❌ The naive approach (causes downtime)
ALTER TABLE users RENAME COLUMN name TO full_name;
Your app still reads name - instant errors.
✅ The zero-downtime approach
Step 1 - Add the new column:
ALTER TABLE users ADD COLUMN full_name TEXT;
Step 2 - Deploy app to write to both columns:
// Write to both during transition
await db.query(
`UPDATE users SET full_name = $1, name = $1 WHERE id = $2`,
[fullName, userId]
);
Step 3 - Backfill existing rows:
-- Run in batches to avoid locking the table
UPDATE users
SET full_name = name
WHERE full_name IS NULL
AND id BETWEEN $1 AND $2;
Step 4 - Deploy app to read from full_name only.
Step 5 - Drop the old column (separate migration, separate deployment):
ALTER TABLE users DROP COLUMN name;
Example 2: Adding a NOT NULL Column
This trips people up constantly. ALTER TABLE ... ADD COLUMN status TEXT NOT NULL will fail if the table has existing rows - PostgreSQL can’t satisfy the constraint for them.
✅ The safe pattern
-- Step 1: Add nullable with a default
ALTER TABLE orders ADD COLUMN status TEXT DEFAULT 'pending';
-- Step 2: Backfill (or the default handles it)
UPDATE orders SET status = 'pending' WHERE status IS NULL;
-- Step 3: Add the NOT NULL constraint later
ALTER TABLE orders ALTER COLUMN status SET NOT NULL;
In PostgreSQL 11+, if you add a column with a non-volatile default, Postgres stores the default in the catalog and doesn’t rewrite the table - making it instantaneous.
Example 3: Splitting a Table
You have a fat users table and want to extract user_profiles. This is the most involved migration.
-- Step 1: Create the new table
CREATE TABLE user_profiles (
user_id BIGINT PRIMARY KEY REFERENCES users(id),
bio TEXT,
avatar_url TEXT,
website TEXT
);
-- Step 2: Backfill from the fat table
INSERT INTO user_profiles (user_id, bio, avatar_url, website)
SELECT id, bio, avatar_url, website FROM users;
-- Step 3: Deploy app to write to both tables
-- Step 4: Deploy app to read from user_profiles
-- Step 5: Drop columns from users
ALTER TABLE users
DROP COLUMN bio,
DROP COLUMN avatar_url,
DROP COLUMN website;
Handling Locks: The Silent Killer
Even “safe” DDL operations can acquire locks that block reads/writes. In PostgreSQL:
| Operation | Lock Type | Blocks |
|---|---|---|
ADD COLUMN (no default) | AccessExclusiveLock briefly | Reads + Writes |
ADD COLUMN (with default, PG11+) | metadata-only | Nothing |
DROP COLUMN | AccessExclusiveLock | Reads + Writes |
CREATE INDEX | ShareLock | Writes |
CREATE INDEX CONCURRENTLY | Weaker lock | Nothing |
Always prefer CREATE INDEX CONCURRENTLY:
-- ❌ Blocks all writes
CREATE INDEX idx_orders_user_id ON orders(user_id);
-- ✅ Safe for production
CREATE INDEX CONCURRENTLY idx_orders_user_id ON orders(user_id);
The trade-off: concurrent index builds take longer and can fail, requiring cleanup:
DROP INDEX CONCURRENTLY IF EXISTS idx_orders_user_id;
Backfilling Large Tables: Do It in Batches
Never run a bare UPDATE on millions of rows. It creates a long-running transaction, holds locks, generates enormous WAL, and can crash your replica lag.
async function backfillInBatches(batchSize = 1000) {
let lastId = 0;
while (true) {
const result = await db.query(`
UPDATE users
SET full_name = name
WHERE id > $1
AND full_name IS NULL
ORDER BY id
LIMIT $2
RETURNING id
`, [lastId, batchSize]);
if (result.rows.length === 0) break;
lastId = result.rows.at(-1).id;
// Breathe. Let replication catch up.
await sleep(50);
}
}
This keeps transactions short, respects replicas, and gives you checkpointing for free.
Deploying Safely: The Three-Commit Rule
Each phase of expand/contract maps to a separate git commit and deployment:
Commit 1: migration + app writes to old AND new
Commit 2: app reads from new only (old still populated)
Commit 3: migration drops old column
If you need to roll back Commit 2, the old column is still there - no data loss.
Tooling That Helps
- migrate - language-agnostic, file-based
- Flyway / Liquibase - Java-ecosystem, enterprise-grade
- Prisma Migrate - great for Node.js, generates expand/contract-friendly diffs
- Atlas - schema-as-code, drift detection, declarative
Most tools will just run your SQL - the pattern is still yours to design. The tool handles ordering and tracking, not safety.
The Mental Model
Think of your schema and application code as two independent services that must maintain a backwards-compatible contract at every point. If your next schema migration would break the currently-deployed version of your app - it’s not ready to ship.
Schema evolution is slow and intentional. That’s not a bug - it’s what keeps your users online.
Quick Reference
| Goal | Pattern |
|---|---|
| Rename column | Add new → backfill → migrate reads → drop old |
| Add NOT NULL column | Add nullable → backfill → add constraint |
| Change column type | Add new column → dual-write → migrate reads → drop old |
| Add index | CREATE INDEX CONCURRENTLY |
| Split table | New table → backfill → migrate reads/writes → drop old columns |
| Remove column | Stop writing → verify → drop (never skip the verify step) |
Zero-downtime migrations are more work. They’re also the only kind you should run in production.
This article is premium
One-time payment · Lifetime access to all premium content
Get Premium AccessAlready have access? Sign in