Green builds or it doesn’t ship: why teams must keep tests passing

The Trust Crisis of Flaky Tests

Most teams value tests, but few treat a failing test run as a production incident for developer productivity. If your suite “usually passes” but periodically fails, it stops being a safety net and becomes background noise.

Adding more tests to a flaky suite compounds the problem. You aren’t increasing coverage; you’re increasing the surface area of instability.

Policy: Always keep main green. Fix flaky tests first. Don’t add new tests while the suite is failing.

TL;DR

  • Trust is the goal: A test suite is only useful if the signal is 100% reliable.
  • Stop-the-line: Treat red CI as a priority defect, not a nuisance.
  • Fix the root: Flakiness often stems from non-deterministic fake data or shared state.
  • Domain Builders: Prefer “valid by construction” factories over generic fakers.

The High Cost of “Mostly Green”

A flaky suite breaks the feedback contract. Instead of asking “did my change break something?”, developers start asking “did the dice roll against me today?” This leads to:

  • Rerunning CI until it passes.
  • Ignoring red builds and merging anyway.
  • Context switching and diagnostic waste.

Economically, stabilising tests is a high-ROI move. It improves lead time, reduces change failure rates, and protects developer flow.

Common Root Causes of Flakiness

  1. Randomness without seeds: Generators producing values for unique-constrained fields (emails, usernames) without a fixed seed, so collisions are possible and a failing run cannot be replayed.
  2. Domain violations: Fake data that is “realistic” but violates business rules (e.g. an end date before a start date).
  3. Shared state: Tests mutating shared fixtures or databases without proper isolation.
  4. Time leaks: Reliance on “now”, timezones, or month boundaries.
  5. Concurrency: Parallel tests colliding on the same data or resources.
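
To make the first cause concrete, here is a minimal TypeScript sketch of seeded fake data. It assumes a small seeded generator (mulberry32, a widely circulated public-domain PRNG) in place of Math.random(); fakeUsername and SEED are hypothetical names for illustration and do not come from any particular faker library. A fixed seed does not rule out collisions on its own, but it does make a failing run repeatable, which is what turns a flake into a debuggable bug.

```typescript
// Small seeded PRNG (mulberry32). The same seed always yields the same sequence.
function mulberry32(seed: number): () => number {
  let a = seed >>> 0;
  return () => {
    a = (a + 0x6d2b79f5) >>> 0;
    let t = a;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    // Scale the 32-bit result into the range [0, 1).
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

// Hypothetical fake-data helper: it draws from an injected generator
// rather than calling Math.random() directly.
function fakeUsername(rand: () => number): string {
  return `user_${Math.floor(rand() * 1000000)}`;
}

// Assumption: the seed is fixed in test setup and logged when a test fails.
const SEED = 42;
const firstRun = mulberry32(SEED);
const secondRun = mulberry32(SEED);

const namesFromFirstRun = [fakeUsername(firstRun), fakeUsername(firstRun), fakeUsername(firstRun)];
const namesFromSecondRun = [fakeUsername(secondRun), fakeUsername(secondRun), fakeUsername(secondRun)];

// Both runs produce identical data, so a failure can be replayed exactly.
console.log(JSON.stringify(namesFromFirstRun) === JSON.stringify(namesFromSecondRun));
```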

A Practical Playbook for Stability

1. Stop the Bleeding

  • Freeze new tests unless the PR also fixes a flake.
  • Block merges on red. No “override because flaky”.
  • Make “fix flake” the default interrupt.

2. Measure and Categorise

Track failures by test name and signature (timeout vs. uniqueness violation). Focus on the top 3 offenders first. Systematic fixes beat whack-a-mole.
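
One way to do this tally is sketched below in TypeScript. The FailureRecord shape, the topOffenders helper, and the sample records are all assumptions made for illustration; a real pipeline would read failures from whatever its CI system exports.

```typescript
// Hypothetical shape of one CI failure record; the field names are assumptions.
interface FailureRecord {
  testName: string;
  signature: string; // e.g. "timeout" or "uniqueness violation"
}

interface Offender {
  testName: string;
  signature: string;
  count: number;
}

// Count failures per (test name, signature) pair and return the worst offenders.
function topOffenders(failures: FailureRecord[], limit: number = 3): Offender[] {
  const counts: Record<string, Offender> = {};
  for (const f of failures) {
    const key = `${f.testName}|${f.signature}`;
    if (counts[key]) {
      counts[key].count += 1;
    } else {
      counts[key] = { testName: f.testName, signature: f.signature, count: 1 };
    }
  }
  return Object.keys(counts)
    .map((key) => counts[key])
    // Highest count first; ties are broken by test name so the order is stable.
    .sort((a, b) => b.count - a.count || (a.testName < b.testName ? -1 : 1))
    .slice(0, limit);
}

// Made-up sample data, for illustration only.
const sampleFailures: FailureRecord[] = [
  { testName: "creates user", signature: "uniqueness violation" },
  { testName: "creates user", signature: "uniqueness violation" },
  { testName: "renews subscription", signature: "timeout" },
  { testName: "creates user", signature: "uniqueness violation" },
  { testName: "renews subscription", signature: "timeout" },
  { testName: "exports report", signature: "timeout" },
  { testName: "sends email", signature: "timeout" },
];

const worstOffenders = topOffenders(sampleFailures);
console.log(worstOffenders);
```

Grouping by signature as well as by name matters because the same test can fail for two unrelated reasons, and each reason usually needs a different fix.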

3. Fix the Data Patterns

  • Prefer Domain Builders: Create factories (e.g. validUser(), activeSubscription()) that enforce invariants by default.
  • Deterministic Uniqueness: Use monotonic suffixes (or UUIDs, where uniqueness matters more than reproducibility) instead of short random strings.
  • Isolate State: Wrap DB tests in transactions or use unique namespaces per worker.
  • Control Time: Inject a clock or freeze time. Avoid “now” in assertions.
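
The data patterns above can be sketched together in TypeScript. This is an illustration under stated assumptions, not a definitive implementation: the User and Subscription shapes, the Clock type, the nextId counter, and the choice to pass the clock into activeSubscription are all hypothetical. State isolation is left out because it depends on the database and test runner in use.

```typescript
interface User {
  id: number;
  email: string;
}

interface Subscription {
  user: User;
  startsAt: Date;
  endsAt: Date;
}

// An injected clock: production code would pass () => new Date().
type Clock = () => Date;

// Monotonic counter: unique values with no randomness, identical on every run.
let sequence = 0;
function nextId(): number {
  sequence += 1;
  return sequence;
}

// Valid by construction; a test overrides only the fields it cares about.
function validUser(overrides: Partial<User> = {}): User {
  const id = nextId();
  return { id, email: `user${id}@example.test`, ...overrides };
}

const DAY_MS = 24 * 60 * 60 * 1000;

function activeSubscription(clock: Clock, overrides: Partial<Subscription> = {}): Subscription {
  const now = clock().getTime();
  const sub: Subscription = {
    user: validUser(),
    startsAt: new Date(now - DAY_MS),
    endsAt: new Date(now + 30 * DAY_MS),
    ...overrides,
  };
  // Enforce the business rule even when a test overrides the dates.
  if (sub.endsAt.getTime() <= sub.startsAt.getTime()) {
    throw new Error("invalid subscription: endsAt must be after startsAt");
  }
  return sub;
}

// Code under test reads time from the clock, never from the system directly.
function isActive(sub: Subscription, clock: Clock): boolean {
  const t = clock().getTime();
  return sub.startsAt.getTime() <= t && t < sub.endsAt.getTime();
}

// Frozen clock, deliberately placed on a month boundary.
const fixedClock: Clock = () => new Date("2024-01-31T23:59:59Z");

const firstUser = validUser();
const secondUser = validUser();
const subscription = activeSubscription(fixedClock);

console.log(firstUser.email !== secondUser.email, isActive(subscription, fixedClock));
```

Because the builder throws on an end date that precedes the start date, a test cannot accidentally create the kind of invalid record that a generic faker would happily produce.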

Stabilisation Strategies

  • Flake Burn-down: Declare main must be green and swarm flakes as they appear. This is the strongest cultural path.
  • Quarantine: Move known flakes to a non-gating job. Warning: This must be time-boxed with clear ownership, or it becomes a “test graveyard”.
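
A time-boxed quarantine can be enforced mechanically rather than by memory. The TypeScript sketch below assumes a hand-maintained list in which every entry carries an owner and an expiry date; the QuarantineEntry fields, the expiredQuarantines helper, and the sample entries are hypothetical and do not reflect any real CI tool's schema.

```typescript
// Hypothetical quarantine entry; the fields are assumptions for illustration.
interface QuarantineEntry {
  testName: string;
  owner: string;
  expiresOn: string; // ISO date, e.g. "2024-03-01"
}

// Return the entries whose time box has run out as of `today`.
function expiredQuarantines(entries: QuarantineEntry[], today: Date): QuarantineEntry[] {
  return entries.filter((entry) => {
    const expiry = new Date(`${entry.expiresOn}T00:00:00Z`);
    return expiry.getTime() <= today.getTime();
  });
}

// Made-up sample list, for illustration only.
const quarantineList: QuarantineEntry[] = [
  { testName: "exports report", owner: "team-billing", expiresOn: "2024-03-01" },
  { testName: "sends email", owner: "team-notifications", expiresOn: "2024-04-15" },
];

// The date is passed in explicitly so the check itself stays deterministic.
const overdue = expiredQuarantines(quarantineList, new Date("2024-03-10T00:00:00Z"));
console.log(overdue.map((entry) => `${entry.testName} (owner: ${entry.owner})`));
```

A gating job could fail whenever this list of overdue entries is non-empty, which forces the owner either to fix the test or to renew the time box deliberately.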

The Cultural Rule: Red is an Emergency

Teams choose one of two cultures:

  1. Red is normal: People shrug and rerun.
  2. Red is stop-the-line: People swarm and fix.

The second culture requires leadership’s permission to pause feature work to restore the delivery system’s integrity. Once teams experience a reliable suite, they never want to go back.

Closing Thought

Coverage is worthless without credibility. A small, trusted suite beats a large, flaky one every time. If your tests fail periodically, you don’t need more tests; you need tests you can believe in.