TL;DR
The phrase “never roll back, always fix forward” is intentionally blunt.
In practice, you may still revert a bad deploy. But that revert should be treated as a tactical mitigation, not your primary recovery strategy.
The reason is simple: once a release has changed shared state, rolling code back does not roll the world back.
Databases may have been migrated. Jobs may already have been queued. Webhooks may have fired. Emails may have been sent. Records may have been transformed in ways that are not idempotent and not safely reversible.
That is why the safest production mindset is usually:
- Reduce blast radius immediately
- Revert only if the change is stateless and clearly safe
- Keep the system compatible
- Ship the forward fix
Why Pure Rollback Fails In Real Systems
Pure rollback assumes a comforting fiction: that the application, the database, the queue, and every side effect can be rewound together.
In small demos, that can be true.
In production, it usually is not.
The harder your system leans on state, the less trustworthy rollback becomes:
- A migration may already have added, removed, or rewritten columns
- A new worker may already have produced events in a new format
- A background job may already have partially processed a batch
- A billing or email workflow may already have triggered external side effects
Rolling back the app binary does nothing to undo those changes.
That is the key operational mistake: treating git revert as if it were a time machine.
Last Deploy Was Bad. Should You Revert?
Sometimes, yes.
If the bad release is mostly stateless application logic, a quick revert can be the fastest way to stop the bleeding. That is still a useful tool.
But before you redeploy an older tag, ask four questions:
- Did this release run any database migration?
- Did it emit data in a new shape to queues, caches, or other services?
- Did it trigger non-idempotent side effects such as billing, emails, or third-party writes?
- Can the older code safely run against the current production state?
If any answer is unclear, an older tag may be more dangerous than the broken one.
A revert is safest when it is code-only and the system state remains compatible.
That is not a rollback strategy. That is a narrow mitigation inside a fix-forward strategy.
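Those four questions can be encoded as a small pre-revert checklist. This is a hypothetical sketch, not part of any real tool: the `ReleaseFacts` type and its field names are invented for illustration.

```python
from dataclasses import dataclass


@dataclass
class ReleaseFacts:
    """Answers to the four pre-revert questions (all names illustrative)."""
    ran_migration: bool          # did this release run a database migration?
    changed_payload_shape: bool  # did it emit new shapes to queues/caches/services?
    fired_side_effects: bool     # billing, emails, third-party writes?
    old_code_compatible: bool    # can the older tag run against current state?


def revert_is_safe(facts: ReleaseFacts) -> bool:
    """A revert is only safe when the release was code-only and the
    current production state is still compatible with the older code."""
    return (
        not facts.ran_migration
        and not facts.changed_payload_shape
        and not facts.fired_side_effects
        and facts.old_code_compatible
    )
```

If any of those booleans is unknown, treat it as unsafe: the point of the checklist is that uncertainty counts against reverting.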
The Older-Tag Trap
The biggest risk in “just deploy the previous tag” is not Git. It is compatibility.
Production rarely contains only one version of your application at one instant. During a deploy, old and new instances can overlap. Background workers can lag behind. Queues can contain payloads created by a newer release.
This is exactly why safe schema evolution is designed around overlapping versions.
Prisma’s guide to the expand-and-contract pattern describes the right mental model:
- Expand the schema in a backwards-compatible way
- Deploy application code that can handle both shapes
- Migrate data gradually
- Contract only after the old shape is no longer needed
That pattern exists because immediate rollback to an older assumption is often unsafe.
If your new release already wrote to a new column, changed a payload, or stopped reading an old field, an older tag may come back expecting a world that no longer exists.
GitLab’s database docs make the same point in a more operational form: even something as ordinary as renaming a table has to be spread across multiple releases to preserve zero-downtime compatibility.
That is the practical argument against blind rollback. Older code is not automatically compatible code.
Database Migrations Are Where Rollback Goes To Die
This is where most “just roll it back” plans collapse.
GitLab explicitly says that there is no guarantee the application and the database can be rolled back together, and that the safest strategy is usually to roll forward instead.
That is the right default.
A schema migration often outlives the deploy that introduced it:
- New columns may already exist
- Old columns may already be unused
- Data may already have been backfilled
- Destructive changes may already have removed information
Even when your migration framework supports a down() method, that does not make production rollback safe.
“Migrations Should Never Create A Down”?
Taken literally, no. That is too absolute.
You can still write a down() migration when reversing the change is trivial, deterministic, and genuinely useful for local development, CI, or short-lived test environments.
But you should not confuse “this migration has a down() method” with “production can safely roll back”.
Those are different claims.
For many real migrations, a truthful down() is either impossible or misleading:
- If you dropped a column, the lost data is gone
- If you rewrote values, the original intent may be unrecoverable
- If you deduplicated records, you may not know what to restore
- If you triggered external side effects, the database is only part of the story
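To make the deduplication case concrete, here is an illustrative, framework-agnostic migration pair; the function names and row shape are invented. The up() step discards rows, so the only honest down() is one that refuses to pretend it can restore them.

```python
def up(rows: list[dict]) -> list[dict]:
    """Deduplicate by email, keeping the first occurrence of each address."""
    seen: set[str] = set()
    kept: list[dict] = []
    for row in rows:
        if row["email"] not in seen:
            seen.add(row["email"])
            kept.append(row)
    return kept


def down(rows: list[dict]) -> list[dict]:
    """A truthful down() is impossible here: the dropped duplicates are gone.
    Raising is more honest than silently returning the input unchanged."""
    raise RuntimeError(
        "irreversible migration: deduplicated rows cannot be restored"
    )
```

A down() that exists but lies is worse than no down() at all, because it lets an incident runbook assume a recovery path that does not exist.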
That is why experienced teams often treat destructive migrations as operationally irreversible, even if the framework allows a reverse function to be written.
The useful rule is this:
Write reversible migrations when they are truly reversible. Do not rely on down() as your production incident strategy.
If you need hard recovery from a bad migration, backups and restore procedures are far more honest than pretending every DDL or data change can be cleanly undone.
GitLab’s migration style guide states that GitLab production uses roll-forward instead of db:rollback, and advises self-managed users to restore the backup created before the upgrade started.
What Fix-Forward Actually Looks Like
Fix-forward does not mean “leave production broken and write a better version next week”.
It means your first response should be to restore safety while preserving compatibility.
That usually looks like this:
- Turn off the bad path
- Stop the blast radius
- Keep the schema and data model compatible
- Deploy the smallest safe patch
In well-run systems, step one is often a feature flag.
LaunchDarkly recommends feature flags during migrations because they let you disable a broken path quickly without immediately depending on a full redeploy.
That is exactly the kind of control you want in an incident:
- Disable a feature
- Pause a worker
- Stop a cron
- Route around the failure
- Ship the patch that restores correctness
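The flag-based mitigation above can be sketched in a few lines. This is a toy in-process flag; a real system would read flags from a service such as LaunchDarkly so the flip takes effect without a redeploy. All names here are invented.

```python
# In production this dict would be backed by a flag service, so flipping
# the value takes effect without a deploy.
FLAGS = {"new_billing_path": True}


def legacy_charge(order_id: str) -> str:
    return f"charged {order_id} via legacy path"  # known-good code path


def new_charge(order_id: str) -> str:
    raise RuntimeError("bug shipped in the new release")


def charge(order_id: str) -> str:
    # The risky path sits behind a kill switch, so an incident responder
    # can route around the failure without touching the deploy pipeline.
    if FLAGS["new_billing_path"]:
        return new_charge(order_id)
    return legacy_charge(order_id)


# Incident mitigation: flip the flag off. No rollback, no redeploy.
FLAGS["new_billing_path"] = False
print(charge("order-42"))
```

The important property is that the mitigation changes configuration, not code, so it carries none of the state-compatibility risk of redeploying an older tag.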
This is operationally stronger than betting everything on an older build still being safe.
A Better Incident Rule
When a deploy goes bad, do not ask only:
What code changed?
Ask:
What state changed, and can older code still live with it?
That question leads to better decisions.
If state has not changed in an incompatible way, a revert may be a perfectly reasonable temporary move.
If state has changed, your safest path is usually:
- Mitigate immediately
- Preserve forwards and backwards compatibility
- Patch and redeploy
That is fix-forward in practice.
Verification
If you want fix-forward to be more than a slogan, verify that your delivery process supports it:
- New deployments can run safely alongside the previous version for a short period
- Database migrations follow an expand-and-contract model
- Destructive schema changes are delayed until old code is no longer running
- Feature flags or kill switches exist for high-risk paths
- Incident runbooks ask about schema, queues, and side effects before suggesting a revert
If those are not true, your rollback plan is probably weaker than you think.
Final Take
“Always fix forward” is not a prohibition on ever reverting code.
It is a reminder that production recovery is about system state, not just source control history.
Use a revert when it is the fastest safe mitigation.
But design your releases, migrations, and incident response around the assumption that once shared state has moved forward, the safest path is usually forward too.
That is the difference between rolling back code and actually recovering a system.