TL;DR
The phrase “never roll back, always fix forward” is intentionally blunt.
In practice, you may still revert a bad deploy. But that revert should be treated as a tactical mitigation, not your primary recovery strategy.
The reason is simple: once a release has changed shared state, rolling code back does not roll the world back.
Databases may have been migrated. Jobs may already have been queued. Webhooks may have fired. Emails may have been sent. Records may have been transformed in ways that are not idempotent and not safely reversible.
That is why the safest production mindset is usually:
- Reduce blast radius immediately
- Revert only if the change is stateless and clearly safe
- Keep the system compatible
- Ship the forward fix
Why Pure Rollback Fails In Real Systems
Pure rollback assumes a comforting fiction: that the application, the database, the queue, and every side effect can be rewound together.
In small demos, that can be true.
In production, it usually is not.
The harder your system leans on state, the less trustworthy rollback becomes:
- A migration may already have added, removed, or rewritten columns
- A new worker may already have produced events in a new format
- A background job may already have partially processed a batch
- A billing or email workflow may already have triggered external side effects
Rolling back the app binary does nothing to undo those changes.
That is the key operational mistake: treating git revert as if it were a time machine.
Last Deploy Was Bad. Should You Revert?
Sometimes, yes.
If the bad release is mostly stateless application logic, a quick revert can be the fastest way to stop the bleeding. That is still a useful tool.
But before you redeploy an older tag, ask four questions:
- Did this release run any database migration?
- Did it emit data in a new shape to queues, caches, or other services?
- Did it trigger non-idempotent side effects such as billing, emails, or third-party writes?
- Can the older code safely run against the current production state?
If any answer is unclear, an older tag may be more dangerous than the broken one.
A revert is safest when it is code-only and the system state remains compatible.
That is not a rollback strategy. That is a narrow mitigation inside a fix-forward strategy.
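Those four questions can be encoded as a small pre-revert checklist. This is a hypothetical sketch, not part of any real tool: the `ReleaseFacts` type and its field names are invented for illustration.

```python
from dataclasses import dataclass


@dataclass
class ReleaseFacts:
    """Answers to the four pre-revert questions (all names illustrative)."""
    ran_migration: bool          # did this release run a database migration?
    changed_payload_shape: bool  # did it emit new shapes to queues/caches/services?
    fired_side_effects: bool     # billing, emails, third-party writes?
    old_code_compatible: bool    # can the older tag run against current state?


def revert_is_safe(facts: ReleaseFacts) -> bool:
    """A revert is only safe when the release was code-only and the
    current production state is still compatible with the older code."""
    return (
        not facts.ran_migration
        and not facts.changed_payload_shape
        and not facts.fired_side_effects
        and facts.old_code_compatible
    )
```

If any of those booleans is unknown, treat it as unsafe: the point of the checklist is that uncertainty counts against reverting.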
The Older-Tag Trap
The biggest risk in “just deploy the previous tag” is not Git. It is compatibility.
Production rarely contains only one version of your application at one instant. During a deploy, old and new instances can overlap. Background workers can lag behind. Queues can contain payloads created by a newer release.
This is exactly why safe schema evolution is designed around overlapping versions.
Prisma’s guide to the expand-and-contract pattern describes the right mental model:
- Expand the schema in a backwards-compatible way
- Deploy application code that can handle both shapes
- Migrate data gradually
- Contract only after the old shape is no longer needed
That pattern exists because immediate rollback to an older assumption is often unsafe.
If your new release already wrote to a new column, changed a payload, or stopped reading an old field, an older tag may come back expecting a world that no longer exists.
GitLab’s database docs make the same point in a more operational form: even something as ordinary as renaming a table has to be spread across multiple releases to preserve zero-downtime compatibility.
That is the practical argument against blind rollback. Older code is not automatically compatible code.
Database Migrations Are Where Rollback Goes To Die
This is where most “just roll it back” plans collapse.
GitLab explicitly says that there is no guarantee the application and the database can be rolled back together, and that the safest strategy is usually to roll forward instead.
That is the right default.
A schema migration often outlives the deploy that introduced it:
- New columns may already exist
- Old columns may already be unused
- Data may already have been backfilled
- Destructive changes may already have removed information
Even when your migration framework supports a down() method, that does not make production rollback safe.
“Migrations Should Never Create A Down”?
Taken literally, no. That is too absolute.
You can still write a down() migration when reversing the change is trivial, deterministic, and genuinely useful for local development, CI, or short-lived test environments.
But you should not confuse “this migration has a down() method” with “production can safely roll back”.
Those are different claims.
For many real migrations, a truthful down() is either impossible or misleading:
- If you dropped a column, the lost data is gone
- If you rewrote values, the original intent may be unrecoverable
- If you deduplicated records, you may not know what to restore
- If you triggered external side effects, the database is only part of the story
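To make the deduplication case concrete, here is an illustrative, framework-agnostic migration pair; the function names and row shape are invented. The up() step discards rows, so the only honest down() is one that refuses to pretend it can restore them.

```python
def up(rows: list[dict]) -> list[dict]:
    """Deduplicate by email, keeping the first occurrence of each address."""
    seen: set[str] = set()
    kept: list[dict] = []
    for row in rows:
        if row["email"] not in seen:
            seen.add(row["email"])
            kept.append(row)
    return kept


def down(rows: list[dict]) -> list[dict]:
    """A truthful down() is impossible here: the dropped duplicates are gone.
    Raising is more honest than silently returning the input unchanged."""
    raise RuntimeError(
        "irreversible migration: deduplicated rows cannot be restored"
    )
```

A down() that exists but lies is worse than no down() at all, because it lets an incident runbook assume a recovery path that does not exist.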
That is why experienced teams often treat destructive migrations as operationally irreversible, even if the framework allows a reverse function to be written.
The useful rule is this:
Write reversible migrations when they are truly reversible. Do not rely on down() as your production incident strategy.
If you need hard recovery from a bad migration, backups and restore procedures are far more honest than pretending every DDL or data change can be cleanly undone.
GitLab’s migration style guide states that GitLab production uses roll-forward instead of db:rollback, and advises self-managed users to restore the backup created before the upgrade started.
What Fix-Forward Actually Looks Like
Fix-forward does not mean “leave production broken and write a better version next week”.
It means your first response should be to restore safety while preserving compatibility.
That usually looks like this:
- Turn off the bad path
- Stop the blast radius
- Keep the schema and data model compatible
- Deploy the smallest safe patch
In well-run systems, step one is often a feature flag.
LaunchDarkly recommends feature flags during migrations because they let you disable a broken path quickly without immediately depending on a full redeploy.
That is exactly the kind of control you want in an incident:
- Disable a feature
- Pause a worker
- Stop a cron
- Route around the failure
- Ship the patch that restores correctness
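The flag-based mitigation above can be sketched in a few lines. This is a toy in-process flag; a real system would read flags from a service such as LaunchDarkly so the flip takes effect without a redeploy. All names here are invented.

```python
# In production this dict would be backed by a flag service, so flipping
# the value takes effect without a deploy.
FLAGS = {"new_billing_path": True}


def legacy_charge(order_id: str) -> str:
    return f"charged {order_id} via legacy path"  # known-good code path


def new_charge(order_id: str) -> str:
    raise RuntimeError("bug shipped in the new release")


def charge(order_id: str) -> str:
    # The risky path sits behind a kill switch, so an incident responder
    # can route around the failure without touching the deploy pipeline.
    if FLAGS["new_billing_path"]:
        return new_charge(order_id)
    return legacy_charge(order_id)


# Incident mitigation: flip the flag off. No rollback, no redeploy.
FLAGS["new_billing_path"] = False
print(charge("order-42"))
```

The important property is that the mitigation changes configuration, not code, so it carries none of the state-compatibility risk of redeploying an older tag.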
This is operationally stronger than betting everything on an older build still being safe.
A Better Incident Rule
When a deploy goes bad, do not ask only:
What code changed?
Ask:
What state changed, and can older code still live with it?
That question leads to better decisions.
If state has not changed in an incompatible way, a revert may be a perfectly reasonable temporary move.
If state has changed, your safest path is usually:
- Mitigate immediately
- Preserve forwards and backwards compatibility
- Patch and redeploy
That is fix-forward in practice.
Verification
If you want fix-forward to be more than a slogan, verify that your delivery process supports it:
- New deployments can run safely alongside the previous version for a short period
- Database migrations follow an expand-and-contract model
- Destructive schema changes are delayed until old code is no longer running
- Feature flags or kill switches exist for high-risk paths
- Incident runbooks ask about schema, queues, and side effects before suggesting a revert
If those are not true, your rollback plan is probably weaker than you think.
Final Take
“Always fix forward” is not a prohibition on ever reverting code.
It is a reminder that production recovery is about system state, not just source control history.
Use a revert when it is the fastest safe mitigation.
But design your releases, migrations, and incident response around the assumption that once shared state has moved forward, the safest path is usually forward too.
That is the difference between rolling back code and actually recovering a system.