Zero downtime migrations: the parts people ignore

2026-02-21

Most teams talk about "zero downtime" like it is a routing trick. It is not. It is a risk management problem.

1) Data correctness beats uptime

If you migrate fast but corrupt state, you did not win. Build validation into the pipeline: checksums, row counts, business invariants, and replayable audit trails.

2) Dual-write is where systems die

Dual-write introduces drift. If you cannot prove ordering and idempotency, you are manufacturing incidents. Prefer a single source of truth and replicate outward, not two primaries.

3) Rollback is not a vibe

Write down the rollback plan as code and rehearse it. A rollback that depends on people reading a wiki is not a rollback.

4) Observability is the migration

Without visibility, you are guessing. Define SLOs for the cutover window, track error budgets, and make "stop" the default when signals degrade.

5) Business definition matters

"Zero downtime" for the business means users can complete critical flows. Identify the 3-5 flows that matter and instrument them end-to-end.

If you want this pattern applied to your environment, send context and constraints. I will tell you what I would do and what I would not do.