The Trust Crisis of Flakey Tests
Most teams value tests, but few treat a failing test run as a production incident for developer productivity. If your suite “usually passes” but periodically fails, it stops being a safety net and becomes background noise.
Adding more tests to a flaky suite compounds the problem. You aren’t increasing coverage; you’re increasing the surface area of instability.
Policy: Always keep
maingreen. Fix flaky tests first. Don’t add new tests while the suite is failing.
TL;DR
- Trust is the goal: A test suite is only useful if the signal is 100% reliable.
- Stop-the-line: Treat red CI as a priority defect, not a nuisance.
- Fix the root: Most flakiness stems from non-deterministic fake data or shared state.
- Domain Builders: Prefer “valid by construction” factories over generic fakers.
The High Cost of “Mostly Green”
A flaky suite breaks the feedback contract. Instead of asking “did my change break something?”, developers start asking “did the dice roll against me today?” This leads to:
- Rerunning CI until it passes.
- Ignoring red builds and merging anyway.
- Context switching and diagnostic waste.
Economically, stabilising tests is a high-ROI move. It improves lead time, reduces change failure rates, and protects developer flow.
Common Root Causes of Flakiness
- Randomness without seeds: Generators producing unique-constrained fields (emails, usernames) without a fixed seed.
- Domain violations: Fake data that is “realistic” but violates business rules (e.g. an end date before a start date).
- Shared state: Tests mutating shared fixtures or databases without proper isolation.
- Time leaks: Reliance on “now”, timezones, or month boundaries.
- Concurrency: Parallel tests colliding on the same data or resources.
A Practical Playbook for Stability
1. Stop the Bleeding
- Freeze new tests unless the PR also fixes a flake.
- Block merges on red. No “override because flaky”.
- Make “fix flake” the default interrupt.
2. Measure and Categorise
Track failures by test name and signature (timeout vs. uniqueness violation). Focus on the top 3 offenders first. Systematic fixes beat whack-a-mole.
3. Fix the Data Patterns
- Prefer Domain Builders: Create factories (e.g.
validUser(),activeSubscription()) that enforce invariants by default. - Deterministic Uniqueness: Use monotonic suffixes or UUIDs instead of random strings.
- Isolate State: Wrap DB tests in transactions or use unique namespaces per worker.
- Control Time: Inject a clock or freeze time. Avoid “now” in assertions.
Stabilization Strategies
- Flake Burn-down: Declare
mainmust be green and swarm flakes as they appear. This is the strongest cultural path. - Quarantine: Move known flakes to a non-gating job. Warning: This must be time-boxed with clear ownership, or it becomes a “test graveyard”.
The Cultural Rule: Red is an Emergency
Teams choose one of two cultures:
- Red is normal: People shrug and rerun.
- Red is stop-the-line: People swarm and fix.
The second culture requires leadership’s permission to pause feature work to restore the delivery system’s integrity. Once teams experience a reliable suite, they never want to go back.
Closing Thought
Coverage is worthless without credibility. A small, trusted suite beats a large, flaky one every time. If your tests fail periodically, you don’t need more tests, you need tests you can believe in.