Green builds or it doesn’t ship: why teams must keep tests passing

28 December 2025 at 18:05 ~ 3 min read

The Trust Crisis of Flaky Tests

Most teams value tests, but few treat a failing test run as the developer-productivity equivalent of a production incident. If your suite “usually passes” but periodically fails, it stops being a safety net and becomes background noise.

Adding more tests to a flaky suite compounds the problem. You aren’t increasing coverage; you’re increasing the surface area of instability.

Policy: Always keep main green. Fix flaky tests first. Don’t add new tests while the suite is failing, unless the same change also fixes a flake.

TL;DR

Trust is the goal: A test suite is only useful if its signal can be trusted, so that a red run means a real defect and a green run means it is safe to ship.
Stop-the-line: Treat red CI as a priority defect, not a nuisance.
Fix the root: Most flakiness stems from non-deterministic fake data or shared state.
Domain Builders: Prefer “valid by construction” factories over generic fakers.

The High Cost of “Mostly Green”

A flaky suite breaks the feedback contract. Instead of asking “did my change break something?”, developers start asking “did the dice roll against me today?” This leads to:

Rerunning CI until it passes.
Ignoring red builds and merging anyway.
Context switching and diagnostic waste.

Economically, stabilising tests is a high-ROI move. It improves lead time, reduces change failure rates, and protects developer flow.

Common Root Causes of Flakiness

Randomness without seeds: Random generators feeding unique-constrained fields (emails, usernames) occasionally collide, and without a fixed seed the failing run cannot be reproduced.
Domain violations: Fake data that is “realistic” but violates business rules (e.g. an end date before a start date).
Shared state: Tests mutating shared fixtures or databases without proper isolation.
Time leaks: Reliance on “now”, timezones, or month boundaries.
Concurrency: Parallel tests colliding on the same data or resources.
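
To make the first cause concrete, here is a minimal sketch of how a random generator feeding a unique field fails only on some runs. The generator random_username and its pool of 1,000 values are hypothetical, chosen so collisions are easy to observe. Passing in a seeded random.Random does not remove the collision, but it does make a failing run repeatable.

```python
import random


def random_username(rng: random.Random) -> str:
    # Hypothetical faker-style generator with only 1,000 possible values.
    return f"user{rng.randrange(1000)}"


def has_collision(seed: int, count: int = 40) -> bool:
    """Return True if `count` generated usernames contain a duplicate."""
    rng = random.Random(seed)  # seeded, so the same seed replays the same run
    names = [random_username(rng) for _ in range(count)]
    return len(set(names)) < count


# Each seed stands in for one CI run. Some runs collide, some do not.
colliding_seeds = [seed for seed in range(200) if has_collision(seed)]
print(f"{len(colliding_seeds)} of 200 simulated runs hit a duplicate username")

# Because the seed is recorded, any failing run can be replayed exactly.
if colliding_seeds:
    assert has_collision(colliding_seeds[0])
```

Logging the seed with every test run is what turns “it failed once on CI” into a failure someone can reproduce locally.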

A Practical Playbook for Stability

1. Stop the Bleeding

Freeze new tests unless the PR also fixes a flake.
Block merges on red. No “override because flaky”.
Make “fix flake” the default interrupt.

2. Measure and Categorise

Track failures by test name and signature (timeout vs. uniqueness violation). Focus on the top 3 offenders first. Systematic fixes beat whack-a-mole.
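
As a rough sketch of that tracking, the snippet below tallies failures by test name and a coarse signature, then lists the worst offenders. The Failure record, the signature keywords, and the test names are all assumptions for illustration; a real version would read your CI system’s own failure reports.

```python
from collections import Counter
from typing import Iterable, NamedTuple


class Failure(NamedTuple):
    test: str      # test name as reported by the runner
    message: str   # raw failure message


def signature(message: str) -> str:
    """Bucket a failure message into a coarse category (keywords are assumed)."""
    lowered = message.lower()
    if "timeout" in lowered or "timed out" in lowered:
        return "timeout"
    if "unique" in lowered or "duplicate" in lowered:
        return "uniqueness"
    return "other"


def top_offenders(failures: Iterable[Failure], n: int = 3):
    """Return the n most frequent (test name, signature) pairs with counts."""
    counts = Counter((f.test, signature(f.message)) for f in failures)
    return counts.most_common(n)


# Hypothetical failure history collected from recent CI runs.
history = [
    Failure("test_signup", "UNIQUE constraint failed: users.email"),
    Failure("test_signup", "duplicate key value violates unique constraint"),
    Failure("test_signup", "UNIQUE constraint failed: users.email"),
    Failure("test_export", "operation timed out after 30s"),
    Failure("test_export", "Timeout waiting for worker"),
    Failure("test_report", "AssertionError: expected 3, got 2"),
]

for (test, kind), count in top_offenders(history):
    print(f"{count}x {test} [{kind}]")
```

Grouping by signature matters because three tests failing on the same uniqueness violation usually share one root cause and one fix.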

3. Fix the Data Patterns

Prefer Domain Builders: Create factories (e.g. validUser(), activeSubscription()) that enforce invariants by default.
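Deterministic Uniqueness: Use monotonic suffixes (a per-run counter) instead of random strings. UUIDs also avoid collisions, but random UUIDs differ on every run, so prefer counters when a failure needs to be reproduced exactly.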
Isolate State: Wrap DB tests in transactions or use unique namespaces per worker.
Control Time: Inject a clock or freeze time. Avoid “now” in assertions.
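
The patterns above can be combined in one small builder. The following is a sketch, not a prescribed implementation: the Subscription type, its fields, the fixed_clock helper, and the example.test addresses are all hypothetical, and active_subscription is a Python-style spelling of the activeSubscription() factory mentioned above. It enforces the date invariant at construction, uses a monotonic counter for unique emails, and takes its notion of “now” from an injected clock.

```python
import itertools
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Callable

# Monotonic counter: unique within a run and identical across runs.
_sequence = itertools.count(1)

# A frozen instant used instead of the real wall clock (arbitrary value).
FIXED_NOW = datetime(2025, 1, 15, 12, 0, tzinfo=timezone.utc)


def fixed_clock() -> datetime:
    return FIXED_NOW


@dataclass(frozen=True)
class Subscription:
    email: str
    starts_at: datetime
    ends_at: datetime

    def __post_init__(self) -> None:
        # Business rule enforced by construction: no inverted date ranges.
        if self.ends_at <= self.starts_at:
            raise ValueError("ends_at must be after starts_at")

    def is_active(self, now: datetime) -> bool:
        return self.starts_at <= now < self.ends_at


def active_subscription(
    clock: Callable[[], datetime] = fixed_clock, **overrides
) -> Subscription:
    """Build a subscription that is valid and active at clock() by default."""
    now = clock()
    fields = {
        "email": f"user{next(_sequence)}@example.test",
        "starts_at": now - timedelta(days=1),
        "ends_at": now + timedelta(days=30),
    }
    fields.update(overrides)  # tests override only what they care about
    return Subscription(**fields)


first = active_subscription()
second = active_subscription()
print(first.email, second.email)       # distinct, counter-based emails
print(first.is_active(fixed_clock()))  # asserted against the injected clock
```

A test that needs an expired subscription passes explicit dates as overrides. If those dates are inconsistent, the builder raises immediately, rather than letting invalid data surface as a confusing failure later in the test.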

Stabilisation Strategies

Flake Burn-down: Declare main must be green and swarm flakes as they appear. This is the strongest cultural path.
Quarantine: Move known flakes to a non-gating job. Warning: This must be time-boxed with clear ownership, or it becomes a “test graveyard”.
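
One way to keep quarantine time-boxed is to record an owner and an expiry date for every quarantined test, and to fail a gating check once an entry is overdue. The sketch below assumes a hand-maintained list. The QuarantineEntry type, test names, team names, and dates are all hypothetical.

```python
from dataclasses import dataclass
from datetime import date
from typing import Iterable, List


@dataclass(frozen=True)
class QuarantineEntry:
    test: str      # name of the quarantined test
    owner: str     # person or team accountable for the fix
    expires: date  # last day the test may stay quarantined


def expired_entries(
    entries: Iterable[QuarantineEntry], today: date
) -> List[QuarantineEntry]:
    """Return entries whose time box has run out as of `today`."""
    return [entry for entry in entries if entry.expires < today]


# Hypothetical quarantine list, kept in the repository next to the tests.
QUARANTINE = [
    QuarantineEntry("test_export_large_file", "team-billing", date(2026, 1, 5)),
    QuarantineEntry("test_signup_race", "team-identity", date(2026, 2, 1)),
]

# In CI, `today` would be date.today(); it is fixed here for repeatability.
overdue = expired_entries(QUARANTINE, today=date(2026, 1, 10))
for entry in overdue:
    print(f"Quarantine expired: {entry.test} (owner: {entry.owner})")
```

Failing the gating build when the overdue list is non-empty forces a decision: fix the test, delete it, or consciously extend the deadline.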

The Cultural Rule: Red is an Emergency

Teams choose one of two cultures:

Red is normal: People shrug and rerun.
Red is stop-the-line: People swarm and fix.

The second culture requires leadership’s permission to pause feature work to restore the delivery system’s integrity. Once teams experience a reliable suite, they never want to go back.

Closing Thought

Coverage is worthless without credibility. A small, trusted suite beats a large, flaky one every time. If your tests fail periodically, you don’t need more tests; you need tests you can believe in.