Deploy at 4:59 PM on a Friday: Zero-Downtime Deployments That Actually Work




A “small fix” before the weekend. The on-call phone rings at 5:05. Zero-downtime deployments suddenly matter a lot.

15 min read · Intermediate · DevOps

4:59 PM, Friday

Marcus is a senior engineer who has shipped hundreds of deployments. He is careful. He reviews his diffs. He runs the test suite locally. And today he has a genuinely small fix: a one-line change to the session validation logic that prevents a rare but annoying login bug reported by a handful of enterprise customers.

It is 4:59 PM. The team is mentally wrapping up. He hits deploy.

By 5:05 PM, the on-call phone rings.

Users across three time zones are getting logged out mid-session. Customer support tickets are flooding in. The error rate dashboard turns red. The CEO texts the CTO. The CTO texts Marcus.

The fix was one line. The incident took four hours to resolve, two hours to write up, and one very long Monday morning to explain to the leadership team. And the postmortem revealed a painful truth: it was not the code that failed. It was the deployment strategy.


What Actually Went Wrong

Marcus deployed using a standard rolling update: instances were killed and replaced with the new version one by one. Here is what that looked like from the outside:

Rolling Deploy - 8 instances

t=0:00  All running version 1 (stable)
        [v1][v1][v1][v1][v1][v1][v1][v1]

t=0:30  First instance replaced
        [v2][v1][v1][v1][v1][v1][v1][v1]
        12.5% of traffic now hitting v2

t=1:00  Second instance replaced
        [v2][v2][v1][v1][v1][v1][v1][v1]
        25% of users getting unexpected logouts

t=1:30  Third instance replaced
        [v2][v2][v2][v1][v1][v1][v1][v1]
        37% impacted - on-call fires

t=2:00  Deploy paused manually
        Three mixed-version minutes already done
        Rollback requires another full 8-minute deploy cycle

The failure chain:

  1. The new session validation code was subtly incompatible with tokens issued by the old version
  2. Users whose requests landed on a v2 instance had their sessions invalidated
  3. The load balancer mixed v1 and v2 traffic randomly on every request
  4. Users experienced random logouts depending on which instance they hit each time
  5. Rolling back required another 8-minute deploy cycle of the same mixed-version problem
  6. Total blast radius: 35% of active users, 13 minutes of degraded experience
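
Steps 3 and 4 of the failure chain can be put into numbers. The sketch below is illustrative arithmetic, not data from the incident: it assumes every request is routed independently and uniformly across the 8 instances, with no sticky sessions, and the function names are made up for this example.

```python
def share_on_new_version(replaced: int, total: int) -> float:
    """Fraction of requests that land on v2 when load is spread evenly."""
    return replaced / total


def chance_of_hitting_new_version(replaced: int, total: int, requests: int) -> float:
    """Probability that at least one of a user's requests reaches a v2 instance.

    Assumes each request is routed independently and uniformly at random.
    """
    miss_every_time = (1 - share_on_new_version(replaced, total)) ** requests
    return 1 - miss_every_time


if __name__ == "__main__":
    for replaced in (1, 2, 3):
        share = share_on_new_version(replaced, 8)
        within_ten = chance_of_hitting_new_version(replaced, 8, requests=10)
        print(f"{replaced}/8 on v2: {share:.1%} of requests, "
              f"{within_ten:.1%} of users hit v2 within 10 requests")
```

Under these assumptions the per-request share understates the damage: a user who makes several requests during the mixed-version window is far more likely to reach a v2 instance at least once than the instance ratio alone suggests.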

The core problem: there was no safe way to validate the new version in production before it touched real users, and no instant escape hatch when it misbehaved. The team discovered the bug by watching user impact, not by catching it first.


Solution 1: Blue-Green Deployments

Blue-green is the simplest zero-downtime strategy. You maintain two identical production environments: Blue (currently serving traffic) and Green (the new version, warming up in parallel). To deploy, you switch the load balancer from Blue to Green atomically.

Blue-Green Architecture

              ┌─────────────────────┐
              │    Load Balancer    │
              └──────────┬──────────┘
                         │ 100% traffic
          ┌──────────────┴──────────────┐
          │                             │
          ▼                             ▼
   ┌─────────────┐               ┌─────────────┐
   │  BLUE  (v1) │               │ GREEN  (v2) │
   │  (live)     │               │ (standby)   │
   └─────────────┘               └─────────────┘

Deploy sequence:
  Step 1  Deploy v2 to Green (zero traffic, no user impact)
  Step 2  Run smoke tests and health checks against Green
  Step 3  Switch load balancer: Blue → Green (instant, atomic)
  Step 4  Blue stays live for 10-minute rollback window
  Step 5  If stable, decommission Blue

Rollback: Switch load balancer back to Blue (under 30 seconds)
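
The deploy sequence above can be sketched as a toy model. This is a minimal illustration, not a real load balancer integration: `Environment`, `BlueGreenRouter`, and the `smoke_test` callback are hypothetical names invented for this example.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class Environment:
    name: str
    version: str


class BlueGreenRouter:
    """Toy stand-in for a load balancer that sends all traffic to one environment."""

    def __init__(self, blue: Environment, green: Environment) -> None:
        self.environments: Dict[str, Environment] = {"blue": blue, "green": green}
        self.live = "blue"

    @property
    def idle(self) -> str:
        return "green" if self.live == "blue" else "blue"

    def deploy(self, version: str, smoke_test: Callable[[Environment], bool]) -> bool:
        """Deploy to the idle environment and switch only if smoke tests pass."""
        target = self.environments[self.idle]
        target.version = version      # Step 1: the idle side gets the new build
        if not smoke_test(target):    # Step 2: verify before any user traffic
            return False              # live pointer untouched, users never saw it
        self.live = target.name       # Step 3: a single pointer swap
        return True

    def rollback(self) -> None:
        """Point traffic back at the previous environment (the Step 4 window)."""
        self.live = self.idle


router = BlueGreenRouter(Environment("blue", "v1"), Environment("green", "v1"))
switched = router.deploy("v2", smoke_test=lambda env: env.version == "v2")
```

The point of the design is that both the switch and the rollback are one assignment to `live`. Nothing is rebuilt or restarted on the rollback path.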

The key insight: the traffic switch is instant and atomic. Users go from 100% Blue to 100% Green in a single load balancer config change. There is no mixed-version window where some users hit v1 and others hit v2.

In Kubernetes, this translates to two separate Deployments and a Service selector switch:

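# Assumes two Deployments whose pods are labeled version=blue and version=green,
# and a Service named app-service that selects on that label.

# Switch the Service from blue to green (a single selector update)
kubectl patch service app-service \
  -p '{"spec":{"selector":{"version":"green"}}}'

# Rollback is the same patch pointed back at blue
kubectl patch service app-service \
  -p '{"spec":{"selector":{"version":"blue"}}}'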

Blue-green costs double the infrastructure during the deployment window, but with modern autoscaling you spin up Green just before the deploy and tear it down afterward. For most services the overhead is 10 to 20 minutes of extra capacity - a very reasonable price for instant rollback.


Solution 2: Canary Releases

Canary releases are blue-green’s smarter sibling. Instead of switching 100% of traffic atomically, you route a small percentage to the new version first, observe metrics in production, and expand gradually. If anything looks wrong, you roll back before the majority of users are affected.

Canary Rollout Progression (boxes are illustrative, not to scale)

Stage 1 - 5% canary (observe for 10 min)
  [v1][v1][v1][v1][v1][v1][v1][v1][v1][v2]
                                          ↑

Stage 2 - 10% (if error rate and latency are clean)
  [v1][v1][v1][v1][v1][v1][v1][v1][v2][v2]

Stage 3 - 25%
  [v1][v1][v1][v1][v1][v1][v1][v2][v2][v2]

Stage 4 - 50%
  [v1][v1][v1][v1][v1][v2][v2][v2][v2][v2]

Stage 5 - 100% full rollout
  [v2][v2][v2][v2][v2][v2][v2][v2][v2][v2]

Automated gate at each stage:
  error_rate    < 0.1%
  p99_latency   < 500ms
  5xx_count     < baseline * 1.2
  → if any gate fails: auto-rollback, alert on-call
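
The gate above could be evaluated along these lines. This is a minimal sketch: the thresholds come from the gate list, while `Metrics`, `failed_gates`, `run_rollout`, and the `observe` callback (which would query a real metrics system) are hypothetical names for this example.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

STAGES = [5, 10, 25, 50, 100]  # percent of traffic on the new version


@dataclass
class Metrics:
    error_rate: float      # fraction of requests, so 0.001 means 0.1%
    p99_latency_ms: float
    count_5xx: int


def failed_gates(canary: Metrics, baseline_5xx: int) -> List[str]:
    """Return the names of the gates the canary violates (empty list = healthy)."""
    failures = []
    if not canary.error_rate < 0.001:
        failures.append("error_rate")
    if not canary.p99_latency_ms < 500:
        failures.append("p99_latency")
    # Strict comparison: a baseline of zero fails every canary, so a real
    # gate would probably add a small absolute floor here.
    if not canary.count_5xx < baseline_5xx * 1.2:
        failures.append("5xx_count")
    return failures


def run_rollout(observe: Callable[[int], Metrics], baseline_5xx: int) -> Tuple[str, int]:
    """Walk the stages in order and stop at the first stage with a failed gate."""
    for percent in STAGES:
        if failed_gates(observe(percent), baseline_5xx):
            return ("rolled_back", percent)
    return ("completed", 100)
```

A rollout that returns `("rolled_back", 5)` never exposed more than the first stage's traffic to the new version, which is the property the canary is buying.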

With a 5% canary, Marcus’s session bug would have affected only 5% of users. The automated health gate would have caught the error rate spike within 2 minutes and halted the rollout before Stage 2 ever started. Total blast radius: 5% of active users for 3 minutes, instead of 35% for 13 minutes.

This is the gold standard for teams shipping frequently. Netflix, Google, and Amazon all use some form of progressive delivery for production deployments.


Solution 3: Feature Flags

Feature flags decouple deployment from release. You ship the code to production in a disabled state. The feature is activated through a configuration toggle, with no new deployment required, for a controlled subset of users.

Without feature flags:
  Deploy → Feature is live for everyone immediately

With feature flags:
  Deploy (flag off) → Zero user impact
  Enable for 1%     → Monitor
  Enable for 10%    → Monitor
  Enable for 100%   → Done

Rollback:
  Flip the flag off  → Instant, no redeploy, no 8-minute window

In code, the feature flag wraps the new behavior:

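# feature_flags, new_validator, and legacy_validator are provided by the application.
def validate_session(token):
    # The flag is evaluated per user, so a rollout can target a subset of users.
    if feature_flags.enabled("new_session_validator", user_id=token.user_id):
        return new_validator.validate(token)
    # Flag off (the default): behavior is identical to the previous release.
    return legacy_validator.validate(token)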

With this pattern, Marcus’s Friday deploy would have landed with the new validator completely dormant. On Monday morning, after everyone was refreshed, the team could have enabled it for 1% of users, watched the metrics, and expanded from there. A bad result? Flip the flag off in seconds, no phones ringing.
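
A percentage rollout like “enable for 1%, then 10%” is usually implemented by hashing each user into a stable bucket. The class below is a minimal in-memory sketch of that idea, not the API of any particular flag product: `FeatureFlags`, `set_rollout`, and `_bucket` are hypothetical names, and a real service would load the percentages from configuration.

```python
import hashlib
from typing import Dict


class FeatureFlags:
    """In-memory flag store with deterministic percentage rollout (illustrative)."""

    def __init__(self) -> None:
        self._rollout_percent: Dict[str, float] = {}

    def set_rollout(self, flag: str, percent: float) -> None:
        if not 0 <= percent <= 100:
            raise ValueError("percent must be between 0 and 100")
        self._rollout_percent[flag] = percent

    def enabled(self, flag: str, user_id: str) -> bool:
        percent = self._rollout_percent.get(flag, 0.0)  # unknown flags are off
        return self._bucket(flag, user_id) < percent

    @staticmethod
    def _bucket(flag: str, user_id: str) -> float:
        """Map (flag, user) to a stable value in [0, 100)."""
        digest = hashlib.sha256(f"{flag}:{user_id}".encode("utf-8")).digest()
        return (int.from_bytes(digest[:8], "big") % 10_000) / 100


feature_flags = FeatureFlags()
feature_flags.set_rollout("new_session_validator", 1)  # 1% of users
```

Because the bucket depends only on the flag name and the user ID, a user who is in the 1% cohort stays enabled when the rollout widens to 10%, and setting the percentage back to 0 turns the feature off for everyone without a deploy.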

Tools to consider: LaunchDarkly, Unleash (open source), AWS AppConfig, Flipper (Ruby/Rails), or even a simple database-backed flag table for basic needs.

The main tradeoff is flag sprawl. A codebase with 200 stale feature flags becomes impossible to reason about. Establish a policy upfront: every flag has a named owner and a target removal date.


Solution 4: Health Checks and Automated Rollback

All of the above strategies need a signal that something is wrong. Health checks provide that signal. Automated rollback acts on it - without waiting for a human to notice, decide, and act.

Deployment Pipeline with Automated Rollback

New version deployed (canary at 5%)
            │
            ▼
  Observation window (5-10 minutes)
            │
  Continuous health evaluation:
    ├── HTTP /health/ready returns 200?
    ├── Error rate < threshold?
    ├── p99 latency within bounds?
    └── Business metrics (logins, checkouts) stable?
            │
       ┌────┴────┐
      PASS      FAIL
       │          │
       ▼          ▼
  Expand       Automated rollback triggered
  to 25%       Alert fires to on-call
               (within 90 seconds of detection)

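In Kubernetes, readiness probes are your first line of defense. A pod that fails its readiness probe failureThreshold times in a row is removed from the Service's endpoints, so no new traffic reaches it until the probe passes again. With the settings below, that means three consecutive failures at 5-second intervals.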

readinessProbe:
  httpGet:
    path: /health/ready
    port: 8080
  initialDelaySeconds: 10
  periodSeconds: 5
  failureThreshold: 3
livenessProbe:
  httpGet:
    path: /health/live
    port: 8080
  initialDelaySeconds: 30
  periodSeconds: 10

Critical rule: your /health/ready endpoint must verify actual readiness, not just process health. It should test database connectivity, cache connectivity, and any critical dependencies. “The process is alive” is a liveness check. “I can serve requests right now” is a readiness check. They are different endpoints with different jobs.
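
The split between the two endpoints might look like this in application code. This is a framework-agnostic sketch: `liveness`, `readiness`, and the `checks` mapping are hypothetical names, and each check is assumed to be a cheap callable (for example, a ping to the database) supplied by the application.

```python
from typing import Callable, Dict, Tuple

Check = Callable[[], bool]


def liveness() -> Tuple[int, Dict[str, str]]:
    """If this handler runs at all, the process can still answer a request."""
    return 200, {"status": "alive"}


def readiness(checks: Dict[str, Check]) -> Tuple[int, Dict[str, str]]:
    """Return 200 only when every dependency check passes, otherwise 503."""
    results: Dict[str, str] = {}
    for name, check in checks.items():
        try:
            results[name] = "ok" if check() else "failed"
        except Exception as exc:  # a crashing check counts as a failed dependency
            results[name] = f"error: {exc}"
    status = 200 if all(value == "ok" for value in results.values()) else 503
    return status, results
```

A handler for /health/ready would call `readiness({"database": ping_db, "cache": ping_cache})` with the application's own check functions and return the status code as the HTTP response. Each check should time out well inside the probe's own timeout, or a slow dependency will look the same as a dead one.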


The Complete Safe Deployment Pipeline

Here is what a mature deployment pipeline looks like end to end:

Code merged to main
        │
        ▼
CI: Build + Unit Tests + Integration Tests
        │
        ▼
Deploy to staging (full environment mirror)
        │
        ▼
Automated smoke tests in staging
        │
        ▼
Deploy canary to production (5%)
        │
        ▼
Automated observation window (10 minutes)
  Checks: error rate / p99 latency / business KPIs
        │
     ┌──┴──┐
    PASS   FAIL
     │       │
     ▼       ▼
Expand     Auto-rollback triggered
to 25%     On-call alerted
     │
Expand to 50%, then 100%
(or activate via feature flag)
        │
        ▼
Monitor for 24 hours
Decommission old version

With this pipeline, Marcus’s one-line fix goes through two production validation stages before it reaches the full user base. A session validation bug that breaks 5% of canary traffic is caught automatically. The rollback fires. Marcus gets a Slack notification, not a phone call from the CEO at 5:05 PM.


Summary Table

Strategy             Rollback Speed          Max Blast Radius              Extra Infrastructure  Complexity
-------------------  ----------------------  ----------------------------  --------------------  ----------
Traditional Rolling  8+ min redeploy         100% during rollout           None                  Low
Blue-Green           Under 30 seconds        0% before switch, 100% after  2x during deploy      Low
Canary Release       Automatic, under 5 min  5% in canary window           Marginal              Medium
Feature Flags        Seconds                 Configurable %                None                  Medium
Full Pipeline        Automatic               5% max                        Marginal              High

Key Takeaways

  • Never deploy to 100% of users at once for changes touching auth, sessions, payments, or any core user flow.
  • Blue-green is your minimum baseline. Double infrastructure for 20 minutes buys you instant rollback.
  • Canary releases are the industry standard. They let production traffic validate your code before full exposure.
  • Feature flags decouple deploy from release. Ship the code dormant, enable it deliberately, disable it instantly.
  • Health checks must be meaningful. “Process is running” is not a health check. Test real readiness.
  • Automate the rollback decision. Humans detecting a bad deploy and deciding to roll back costs minutes. Automation costs seconds.

Frequently Asked Questions

Q: Is blue-green deployment expensive to maintain?

Only during the deployment window. With containers and autoscaling, you spin up the Green environment just before the deploy and tear it down when Green is stable. You are paying for extra capacity for 10 to 20 minutes per deploy, not maintaining a duplicate environment permanently.

Q: How do you handle database migrations with blue-green deployments?

This is the hardest part of zero-downtime deployments. Schema changes must be backward compatible with both the old and new application versions simultaneously. The standard approach is the expand/contract pattern: add the new column as nullable first (no app change required), deploy the new app that writes to both old and new columns, backfill historical data, switch reads to the new column, then remove the old column in a later deploy.

Q: What percentage of traffic should a canary start at?

Typically 1% to 5% for high-traffic services. For lower-traffic services, you need enough volume through the canary to get statistically meaningful error rates in a reasonable time window. If you get 20 requests per minute total, a 1% canary means 1 request every 5 minutes - too slow. Scale the canary percentage up until you have enough signal.
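
The sizing arithmetic can be written down directly. This is a rough sketch with hypothetical function names; the `needed_requests` value in the usage note is a placeholder to show the calculation, not a statistical recommendation.

```python
import math


def canary_requests(total_rpm: float, canary_percent: float, window_minutes: float) -> float:
    """Expected number of requests the canary serves during the window."""
    return total_rpm * (canary_percent / 100) * window_minutes


def minimum_canary_percent(total_rpm: float, window_minutes: float, needed_requests: int) -> float:
    """Smallest whole-number canary share (capped at 100) that yields the sample size."""
    total_requests = total_rpm * window_minutes
    if total_requests <= 0:
        raise ValueError("traffic and window must be positive")
    percent = needed_requests / total_requests * 100
    return float(min(100, math.ceil(percent)))
```

For the service in the answer above, `canary_requests(20, 1, 5)` works out to about one request in five minutes. If the team wanted 100 canary requests inside a 10-minute window, `minimum_canary_percent(20, 10, 100)` shows that half of all traffic would have to go through the canary, which is a sign that a longer window or synthetic traffic may be the better lever.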

Q: Can feature flags introduce performance issues?

Yes, if implemented naively. A flag check that makes a remote network call to a flag service (such as LaunchDarkly or Unleash) on every request adds latency to your hot path. The standard fix is local caching: flag configuration is held in your application’s memory and the flag service pushes changes to it, so evaluations are local and sub-millisecond.

Q: What is a deployment freeze and when should one be used?

A deployment freeze is a window where no code ships to production, typically before major holidays, around high-traffic events, or before important customer milestones. Some engineering organizations also hold non-critical deployments from Thursday evening through the weekend. Under a policy like that, Marcus’s Friday afternoon deploy would have been blocked, and none of this would have happened.


Interview Questions

Q: Explain the difference between blue-green and canary deployments.

Blue-green maintains two full environments and switches traffic atomically between them - it is binary, all or nothing. Canary routes a small percentage of traffic to the new version and expands gradually based on observed metrics. Blue-green is simpler to reason about; canary is safer for catching bugs that only appear under real traffic patterns. Many mature organizations use both together: blue-green for the initial switch, canary for the traffic expansion.

Q: How do you do a zero-downtime database migration?

Use the expand/contract pattern. Step 1: add the new column as nullable, backward compatible with existing code. Step 2: deploy application code that writes to both the old and new columns. Step 3: backfill historical rows to populate the new column. Step 4: deploy code that reads exclusively from the new column. Step 5: make the new column NOT NULL. Step 6: remove the old column in a separate, later migration. Never bundle a schema change and an application behavior change into a single deploy.
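
Steps 1 through 4 can be tried end to end against an in-memory SQLite database. This is a sketch under stated assumptions: the `users` table, the `fullname` and `display_name` columns, and the `create_user` helper are invented for the example. Steps 5 and 6 are left out because SQLite's ALTER TABLE cannot add a NOT NULL constraint to an existing column, so the contract phase looks different there than on a server database.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, fullname TEXT NOT NULL)")
conn.execute("INSERT INTO users (fullname) VALUES ('Ada Lovelace')")  # pre-migration row

# Step 1 (expand): add the new column as nullable. Old code simply ignores it.
conn.execute("ALTER TABLE users ADD COLUMN display_name TEXT")


def create_user(name: str) -> None:
    """Step 2: the new application version writes both columns."""
    conn.execute(
        "INSERT INTO users (fullname, display_name) VALUES (?, ?)",
        (name, name),
    )


create_user("Grace Hopper")

# Step 3: backfill rows that were written before the dual-write deploy.
conn.execute("UPDATE users SET display_name = fullname WHERE display_name IS NULL")
conn.commit()

# Step 4: reads can now rely on the new column alone.
rows = conn.execute("SELECT display_name FROM users ORDER BY id").fetchall()
```

At every point in this sequence, both the old application version (which only knows `fullname`) and the new one can run against the same schema, which is what makes a rollback of the application safe mid-migration.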

Q: What makes a good health check endpoint?

A good readiness probe verifies the service can actually serve traffic: it checks database connectivity, cache connectivity, and any critical downstream dependency, and returns a non-200 status if any of them is unavailable. A good liveness probe is simpler - it verifies the process has not deadlocked or entered a broken state. These are separate endpoints with separate purposes and separate failure thresholds.

Q: How does Kubernetes handle rolling updates, and how is that different from blue-green?

The Kubernetes RollingUpdate strategy replaces pods incrementally, controlled by maxSurge (extra pods allowed during rollout) and maxUnavailable (pods that can be down at once). This is similar to a traditional rolling deploy - there is a window where both old and new pods serve traffic simultaneously. Blue-green in Kubernetes is implemented explicitly: you run two separate Deployments and switch the Service’s selector label, which is a single atomic operation. Kubernetes does not do blue-green natively; you build it on top of its primitives.

Q: How would you design a system that automatically detects and rolls back a bad deploy?

Define your SLOs clearly: acceptable error rate, latency bounds, key business metrics. Emit all of these as time-series metrics to a platform like Prometheus or Datadog. Set short-window alerts - 2 to 5 minutes - tuned to catch regressions without firing on noise. When an alert fires during a deploy window (tracked via deployment markers in your metrics system), trigger automated rollback via your CI/CD pipeline’s rollback API. The deploy window marker is critical - it tells your automation that a regression is likely deploy-caused, not a pre-existing issue.
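
The deploy-window check is the piece that is easiest to get wrong, so here is one way it might be sketched. `DeployMarker`, `deploy_to_roll_back`, and the 15-minute default window are assumptions made for this example; the actual rollback call would go to whatever API your pipeline exposes.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Optional


@dataclass
class DeployMarker:
    version: str
    started_at: datetime


def deploy_to_roll_back(
    alert_fired_at: datetime,
    markers: List[DeployMarker],
    window: timedelta = timedelta(minutes=15),
) -> Optional[DeployMarker]:
    """Return the most recent deploy that started within `window` before the alert.

    Returns None when no deploy is close enough, which suggests the regression
    is not deploy-caused and should page a human instead.
    """
    candidates = [
        marker for marker in markers
        if timedelta(0) <= alert_fired_at - marker.started_at <= window
    ]
    return max(candidates, key=lambda marker: marker.started_at, default=None)
```

An alert at 5:05 PM with a deploy marker at 4:59 PM falls inside the window and returns that deploy as the rollback target. The same alert two hours later returns None, so the automation leaves the running version alone.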
