Schema Changes Without Downtime: The Art of Zero-Disruption Migrations


databases backend postgresql migrations system-design

Every backend engineer eventually faces the same scary moment: you need to rename a column, split a table, or change a data type - and you cannot afford downtime.

The instinct is to just “run the migration.” The correct instinct is to treat schema evolution as a multi-step, backwards-compatible dance between your database and your code.

Let’s break it down.


Why Naive Migrations Break Things

When you run ALTER TABLE users RENAME COLUMN email TO email_address, the database-level change is instant on most engines - but your application code still refers to email. The moment the migration runs, every in-flight request reading or writing email throws an error.

This is the core problem: code and schema are out of sync during deployment.

The fix is to never let them be out of sync - even briefly.


The Expand / Contract Pattern

This is the foundational pattern. All zero-downtime schema migrations are variations of it.

Phase 1 - Expand:     Add the new thing. Keep the old thing. Both work.
Phase 2 - Migrate:    Backfill data. Update app to write to both.
Phase 3 - Contract:   Remove the old thing once code no longer needs it.

Never compress these into one step. Each phase is a separate deployment.


Example 1: Renaming a Column

Suppose you want to rename users.nameusers.full_name.

❌ The naive approach (causes downtime)

ALTER TABLE users RENAME COLUMN name TO full_name;

Your app still reads name - instant errors.


✅ The zero-downtime approach

Step 1 - Add the new column:

ALTER TABLE users ADD COLUMN full_name TEXT;

Step 2 - Deploy app to write to both columns:

// Write to both during transition
await db.query(
  `UPDATE users SET full_name = $1, name = $1 WHERE id = $2`,
  [fullName, userId]
);

Step 3 - Backfill existing rows:

-- Run in batches to avoid locking the table
UPDATE users
SET full_name = name
WHERE full_name IS NULL
AND id BETWEEN $1 AND $2;

Step 4 - Deploy app to read from full_name only.

Step 5 - Drop the old column (separate migration, separate deployment):

ALTER TABLE users DROP COLUMN name;

Example 2: Adding a NOT NULL Column

This trips people up constantly. ALTER TABLE ... ADD COLUMN status TEXT NOT NULL will fail if the table has existing rows - PostgreSQL can’t satisfy the constraint for them.

✅ The safe pattern

-- Step 1: Add nullable with a default
ALTER TABLE orders ADD COLUMN status TEXT DEFAULT 'pending';

-- Step 2: Backfill (or the default handles it)
UPDATE orders SET status = 'pending' WHERE status IS NULL;

-- Step 3: Add the NOT NULL constraint later
ALTER TABLE orders ALTER COLUMN status SET NOT NULL;

In PostgreSQL 11+, if you add a column with a non-volatile default, Postgres stores the default in the catalog and doesn’t rewrite the table - making it instantaneous.


Example 3: Splitting a Table

You have a fat users table and want to extract user_profiles. This is the most involved migration.

-- Step 1: Create the new table
CREATE TABLE user_profiles (
  user_id BIGINT PRIMARY KEY REFERENCES users(id),
  bio TEXT,
  avatar_url TEXT,
  website TEXT
);

-- Step 2: Backfill from the fat table
INSERT INTO user_profiles (user_id, bio, avatar_url, website)
SELECT id, bio, avatar_url, website FROM users;

-- Step 3: Deploy app to write to both tables
-- Step 4: Deploy app to read from user_profiles
-- Step 5: Drop columns from users
ALTER TABLE users
  DROP COLUMN bio,
  DROP COLUMN avatar_url,
  DROP COLUMN website;

Handling Locks: The Silent Killer

Even “safe” DDL operations can acquire locks that block reads/writes. In PostgreSQL:

OperationLock TypeBlocks
ADD COLUMN (no default)AccessExclusiveLock brieflyReads + Writes
ADD COLUMN (with default, PG11+)metadata-onlyNothing
DROP COLUMNAccessExclusiveLockReads + Writes
CREATE INDEXShareLockWrites
CREATE INDEX CONCURRENTLYWeaker lockNothing

Always prefer CREATE INDEX CONCURRENTLY:

-- ❌ Blocks all writes
CREATE INDEX idx_orders_user_id ON orders(user_id);

-- ✅ Safe for production
CREATE INDEX CONCURRENTLY idx_orders_user_id ON orders(user_id);

The trade-off: concurrent index builds take longer and can fail, requiring cleanup:

DROP INDEX CONCURRENTLY IF EXISTS idx_orders_user_id;

Backfilling Large Tables: Do It in Batches

Never run a bare UPDATE on millions of rows. It creates a long-running transaction, holds locks, generates enormous WAL, and can crash your replica lag.

async function backfillInBatches(batchSize = 1000) {
  let lastId = 0;

  while (true) {
    const result = await db.query(`
      UPDATE users
      SET full_name = name
      WHERE id > $1
      AND full_name IS NULL
      ORDER BY id
      LIMIT $2
      RETURNING id
    `, [lastId, batchSize]);

    if (result.rows.length === 0) break;
    lastId = result.rows.at(-1).id;

    // Breathe. Let replication catch up.
    await sleep(50);
  }
}

This keeps transactions short, respects replicas, and gives you checkpointing for free.


Deploying Safely: The Three-Commit Rule

Each phase of expand/contract maps to a separate git commit and deployment:

Commit 1: migration + app writes to old AND new
Commit 2: app reads from new only (old still populated)
Commit 3: migration drops old column

If you need to roll back Commit 2, the old column is still there - no data loss.


Tooling That Helps

  • migrate - language-agnostic, file-based
  • Flyway / Liquibase - Java-ecosystem, enterprise-grade
  • Prisma Migrate - great for Node.js, generates expand/contract-friendly diffs
  • Atlas - schema-as-code, drift detection, declarative

Most tools will just run your SQL - the pattern is still yours to design. The tool handles ordering and tracking, not safety.


The Mental Model

Think of your schema and application code as two independent services that must maintain a backwards-compatible contract at every point. If your next schema migration would break the currently-deployed version of your app - it’s not ready to ship.

Schema evolution is slow and intentional. That’s not a bug - it’s what keeps your users online.


Quick Reference

GoalPattern
Rename columnAdd new → backfill → migrate reads → drop old
Add NOT NULL columnAdd nullable → backfill → add constraint
Change column typeAdd new column → dual-write → migrate reads → drop old
Add indexCREATE INDEX CONCURRENTLY
Split tableNew table → backfill → migrate reads/writes → drop old columns
Remove columnStop writing → verify → drop (never skip the verify step)

Zero-downtime migrations are more work. They’re also the only kind you should run in production.

This article is premium

One-time payment · Lifetime access to all premium content

Get Premium Access

Already have access? Sign in