When AI Writes Most of the Code, Quality Has to Become Infrastructure

~ 14 min read


TL;DR

AI coding tools are no longer just autocomplete with a chat box.

Since late 2025, the combination of stronger coding models, agentic IDEs, long-running coding agents, code review assistants, and better tool integration has made it realistic for AI to produce most of the code in some teams.

That changes the shape of software engineering.

The bottleneck is no longer just writing code. It is proving that the generated code is correct, secure, reviewable, operable, and maintainable.

The software development lifecycle (SDLC) has to move from a human-paced delivery process to a control system around high-throughput code generation.

Or, more bluntly:

When code generation becomes cheap, confidence becomes a scarce resource.

The inflection point was not just model quality

AI coding tools have existed for years, but something changed around October and November 2025.

The shift was not just that models had become better at writing code. The surrounding workflow had become credible too:

  • agents could inspect a repository, make a plan, edit multiple files, run commands, and iterate
  • code-review tools became good enough to be part of the review loop
  • frontier models became more reliable on real software engineering tasks
  • IDEs and developer platforms started treating AI as an active participant in delivery, not just a suggestion engine

JetBrains made the useful point that older benchmarks from 2023 cannot really measure 2025 coding workflows because the task is no longer just “patch this issue”. Modern coding agents need to deal with reviews, coverage, compliance, multi-language changes, upgrades, and longer-running work. JetBrains introduced Developer Productivity AI Arena with that in mind.

In the same period, GitHub's 2025 Octoverse reported more than one million pull requests created by the Copilot coding agent between May and September 2025. Google launched Gemini 3 with a strong focus on agentic coding and Google Antigravity. OpenAI released GPT-5.1-Codex-Max for long-running agentic coding work. Anthropic released Claude Opus 4.5 with a similar emphasis on real-world software engineering and code review.

Those individual announcements have aged quickly.

The more important point is the pattern: coding models and tools crossed into a mode where it became plausible for AI to plan, write, test, review, and revise meaningful parts of a change.

That makes a lot of older research awkward. Anything that measured “AI coding ability” before that inflection point is still useful for understanding failure modes, but it is probably not a good measure of current capability.

The old SDLC assumed code was scarce

Most software delivery processes still assume code is expensive to produce.

That assumption leaks into everything:

  • requirements are often vague because humans can clarify as they go
  • review is treated as a human checkpoint
  • testing often happens after implementation
  • tests are written around what developers had time to build
  • production feedback is not always connected back to the exact change path
  • maintainability is mostly discussed when it has already become painful

That mostly worked when human implementation speed was the limiting factor.

But AI changes the economics of delivery.

If a coding agent can produce ten plausible implementations before a human reviewer has finished the first review, the limiting factor moves. The hard problem becomes verification.

Not “can we generate code?”

But:

Can we generate enough evidence to trust this change?

That evidence has to cover more than tests passing once.

A generated change may pass CI and still be wrong. It may be secure today and fragile tomorrow. It may satisfy the ticket while quietly increasing coupling, duplicating logic, or making a future migration harder.

That is why I think the SDLC needs to be redesigned around evidence rather than output.

The data points to a verification gap

The strongest research signal is not that AI code is always good or always bad. It is that AI increases throughput faster than most teams increase verification capacity.

Google’s 2025 DORA work reported broad AI adoption and productivity gains, but also framed AI as an amplifier of the system it enters. Strong teams can turn the extra speed into delivery; weak systems can turn it into instability faster. Google’s DORA summary is worth reading with that lens.

Sonar’s 2026 survey is even more direct. It reported that AI accounts for 42% of committed code, with developers expecting that to reach 65% by 2027. But it also found that 96% of developers do not fully trust AI-generated code to be functionally correct, and only 48% always check AI-assisted code before committing. Sonar calls this a critical verification gap.

Faros found a similar bottleneck from another angle. High-AI-adoption teams completed more tasks and merged more pull requests, but PR review time increased sharply. In other words, generation got faster, but human approval became a constraint. Their AI Productivity Paradox report is useful because it looks at the software lifecycle rather than only at individual developer productivity.

CodeRabbit’s analysis of AI-generated pull requests found higher issue rates across several categories, including logic, readability, security, formatting, and error handling. I would not treat one report as universal proof that AI code is worse, but I would treat it as a warning against assuming that “green CI” is enough. The report is here: State of AI vs. Human Code Generation.

The practical conclusion is simple:

AI code throughput can outrun human review throughput.

That does not mean teams should stop using AI.

It means the SDLC has to be instrumented so that generated code carries stronger evidence with it.

[Illustration: AI-generated code throughput narrowing into a verification bottleneck]

Google is already talking about most new code being AI-generated

This is no longer a thought experiment.

At Google Cloud Next 2026, Sundar Pichai said that 75% of all new code at Google is now AI-generated and approved by engineers, up from 50% the previous fall. He also described engineers moving towards orchestrating autonomous “digital task forces”. The write-up is on Google’s own blog: Cloud Next ‘26: Momentum and innovation at Google scale.

Google is not a normal engineering organisation. Its tooling, infrastructure, review culture, and internal platforms are not what most teams have.

But that is exactly why the number matters.

If one of the most mature engineering organisations in the world is already operating with most new code AI-generated, the rest of us should not be asking whether this workflow is coming.

We should be asking what has to be true for it to be safe.

The SDLC needs to become a control system

I would stop thinking about the SDLC as a pipeline.

A pipeline suggests that work moves neatly from requirements, to implementation, to testing, to release.

That is too slow and too linear for agentic coding.

A better model is a control system:

  1. define intent
  2. generate a change
  3. collect evidence
  4. score risk
  5. apply gates
  6. release safely
  7. observe production
  8. feed outcomes back into the system

The point is not to remove humans.

The point is to use human attention where it matters most: intent, architecture, risk, judgement, and accountability.
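
As a rough sketch, the loop can even be written down. Everything below is a placeholder for real tooling (the issue tracker, the coding agent, CI, review gates, release automation, observability); the point is the shape, not the API:

type RiskClass = "low" | "medium" | "high" | "critical";

interface Intent { description: string; riskClass: RiskClass }
interface Change { prUrl: string }
interface Evidence { checksPassed: boolean; findings: string[] }

// Placeholders for real systems; none of these are real APIs.
declare function generateChange(intent: Intent): Promise<Change>;
declare function collectEvidence(change: Change): Promise<Evidence>;
declare function gatesPass(intent: Intent, evidence: Evidence): boolean;
declare function releaseSafely(change: Change): Promise<void>;
declare function observeProduction(change: Change): Promise<string[]>;
declare function feedOutcomesBack(outcomes: string[]): void;

export async function runChange(intent: Intent): Promise<void> {
  let change = await generateChange(intent);        // 2. generate a change
  let evidence = await collectEvidence(change);     // 3. collect evidence

  // 4-5. score risk and apply gates; regenerate rather than merge a weak change,
  // and escalate to a human instead of looping forever
  for (let attempt = 0; !gatesPass(intent, evidence) && attempt < 2; attempt++) {
    change = await generateChange(intent);
    evidence = await collectEvidence(change);
  }
  if (!gatesPass(intent, evidence)) throw new Error("gates failed: escalate to a human");

  await releaseSafely(change);                      // 6. release safely (flags, canaries)
  const outcomes = await observeProduction(change); // 7. observe production
  feedOutcomesBack(outcomes);                       // 8. feed outcomes back into the system
}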

Requirements should become executable intent

The first weakness AI exposes is vague requirements.

A human developer can often infer the missing pieces. They may ask questions, remember previous incidents, or know that a particular service has a weird edge case.

An AI agent may also infer the missing pieces, but it can do so confidently and incorrectly.

So the start of the workflow needs to become more explicit.

Before an agent writes code, the task should capture:

  • the user or business outcome
  • acceptance criteria
  • examples of expected behaviour
  • examples of rejected behaviour
  • risk class
  • architectural constraints
  • allowed services, APIs, and data stores
  • security and privacy requirements
  • performance and cost expectations
  • observability requirements
  • rollback or kill-switch expectations

This does not have to become heavy process.

It can be a lightweight issue template, an agent skill, or a pre-flight prompt that turns fuzzy intent into a clearer task contract.
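
As a rough sketch, using the API key rotation change that shows up again later in this post, a task contract could be as small as this (all field names are illustrative, not a standard):

// Hypothetical task contract shape; every field name here is illustrative.
interface TaskContract {
  outcome: string;                                   // user or business outcome
  acceptanceCriteria: string[];                      // observable, testable statements
  expectedExamples: string[];                        // behaviour that must hold
  rejectedExamples: string[];                        // behaviour that must never happen
  riskClass: "low" | "medium" | "high" | "critical";
  allowedServices: string[];                         // architectural constraints
  securityRequirements: string[];
  observability: string[];                           // signals the change must emit
  rollback: string;                                  // flag or kill-switch expectation
}

const task: TaskContract = {
  outcome: "Account admins can rotate API keys without breaking existing integrations",
  acceptanceCriteria: [
    "A rotated key is usable immediately",
    "The old key stops working after the agreed grace period",
  ],
  expectedExamples: ["admin rotates a key and sees an audit log entry"],
  rejectedExamples: ["a non-admin user triggers a rotation"],
  riskClass: "high",
  allowedServices: ["account-service", "audit-log-service"],
  securityRequirements: ["keys are never written to logs in plaintext"],
  observability: ["rotation success and failure metrics"],
  rollback: "feature flag: api_key_rotation_enabled",
};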

But the requirement should be explicit enough that the agent is not inventing the goal while also inventing the implementation.

That is too much freedom.

Every AI-generated change should carry an evidence bundle

When an agent produces a pull request, the PR should not just contain code.

It should contain evidence.

A useful evidence bundle might look like this:

change:
    intent: "Add account-level API key rotation"
    risk_class: "high"
    ai_assisted: true
    agent:
        tool: "example-agent"
        model: "example-model-version"
        mode: "agentic"
    scope:
        files_changed:
            - "app/security/api_keys.ts"
            - "app/routes/account/api_keys.ts"
            - "tests/api_keys.test.ts"
        services_touched:
            - "account-service"
            - "audit-log-service"
    evidence:
        plan_recorded: true
        assumptions_recorded: true
        tests_added: true
        commands_run:
            - "npm run lint"
            - "npm test -- api_keys"
            - "npm run typecheck"
        security_review_required: true
        rollback_path: "feature flag: api_key_rotation_enabled"
    open_questions:
        - "Should old keys remain valid for 24 hours or 7 days?"

The exact schema does not matter.

The important point is that the agent must explain what it changed, why it changed it, what it tested, what it did not test, and where risk remains.

That lets reviewers judge the change without rediscovering the whole context from scratch.
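
One way to make that non-optional is a CI step that refuses to proceed when the evidence is missing. A minimal sketch, assuming the bundle above is committed as evidence.yaml alongside the change and that the yaml npm package is available:

// Hypothetical CI gate: fail the build when the evidence bundle is incomplete.
// Assumes the bundle shown above is committed as evidence.yaml in the PR.
import { readFileSync } from "node:fs";
import { parse } from "yaml";

type RiskClass = "low" | "medium" | "high" | "critical";

interface EvidenceBundle {
  change: {
    risk_class: RiskClass;
    evidence: {
      plan_recorded: boolean;
      tests_added: boolean;
      commands_run: string[];
      rollback_path?: string;
    };
  };
}

const { change } = parse(readFileSync("evidence.yaml", "utf8")) as EvidenceBundle;
const failures: string[] = [];

if (!change.evidence.plan_recorded) failures.push("no recorded plan");
if (!change.evidence.tests_added) failures.push("no tests added");
if (change.evidence.commands_run.length === 0) failures.push("no verification commands run");

// Higher risk classes demand stronger evidence.
if ((change.risk_class === "high" || change.risk_class === "critical") && !change.evidence.rollback_path) {
  failures.push("high-risk change without a rollback path");
}

if (failures.length > 0) {
  console.error(`evidence gate failed: ${failures.join(", ")}`);
  process.exit(1);
}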

The reviewer’s job becomes:

Do I trust this evidence enough to accept the risk?

Not:

Can I manually re-derive every line of this patch?

[Illustration: an AI-generated change carrying a connected evidence bundle for review]

Review should be risk-based, not line-count-based

A 20-line change to authentication can be more dangerous than a 600-line generated form.

So review gates should be based on risk, not just PR size.

A simple version might look like this:

  • Low risk (UI copy, docs, harmless internal tooling): automated checks + AI review + lightweight human approval
  • Medium risk (normal feature work in a bounded area): human review focused on intent, tests, edge cases, and maintainability
  • High risk (auth, billing, permissions, customer data, migrations, public APIs): senior review + threat model + rollout plan + rollback or kill switch
  • Critical risk (regulated workflows, irreversible data operations, security-critical systems): no autonomous merging; require human-owned implementation or deep pair review

The aim is not to make AI-generated code special forever.

The aim is to recognise that generated code can arrive at a different speed and with different failure modes.

If the evidence is strong and the risk is low, the process can be light.

If the risk is high, the process should slow down deliberately.
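
Those gates can live in code rather than in a wiki page. One possible encoding, with illustrative values:

// One possible encoding of the review gates above; the values are illustrative defaults.
type RiskClass = "low" | "medium" | "high" | "critical";

interface ReviewGate {
  humanApprovals: number;   // minimum number of human approvals
  seniorReviewer: boolean;  // require a senior or domain owner
  threatModel: boolean;     // require an explicit threat model
  rolloutPlan: boolean;     // require a rollout plan with rollback or kill switch
  aiMayImplement: boolean;  // whether an agent may author the change at all
}

const gatesByRisk: Record<RiskClass, ReviewGate> = {
  low:      { humanApprovals: 1, seniorReviewer: false, threatModel: false, rolloutPlan: false, aiMayImplement: true },
  medium:   { humanApprovals: 1, seniorReviewer: false, threatModel: false, rolloutPlan: false, aiMayImplement: true },
  high:     { humanApprovals: 2, seniorReviewer: true,  threatModel: true,  rolloutPlan: true,  aiMayImplement: true },
  critical: { humanApprovals: 2, seniorReviewer: true,  threatModel: true,  rolloutPlan: true,  aiMayImplement: false },
};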

Testing becomes more important, not less

I do not think AI removes testing.

I think it changes testing from a phase into an operating system.

In a human-paced SDLC, testing can sometimes sit after implementation and still catch enough issues.

In an AI-paced SDLC, that is too late.

Testing needs to work before, during, and after code generation.

Before coding:

  • turn requirements into executable examples (see the sketch after this list)
  • identify invariants
  • generate adversarial cases
  • define what must never happen
  • classify risk
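
A minimal sketch of that first item, assuming a vitest-style runner and hypothetical rotateApiKey and verifyApiKey functions that the agent will later be asked to implement:

// One acceptance criterion written as an executable example before implementation.
// rotateApiKey and verifyApiKey are hypothetical functions the agent must provide;
// the 24-hour grace period is an assumption (the evidence bundle above leaves it open).
import { describe, it, expect, vi } from "vitest";
import { rotateApiKey, verifyApiKey } from "../app/security/api_keys";

const GRACE_PERIOD_MS = 24 * 60 * 60 * 1000;

describe("API key rotation", () => {
  it("accepts the old key only within the grace period", async () => {
    vi.useFakeTimers();

    const { oldKey, newKey } = await rotateApiKey("account-123");
    expect(await verifyApiKey(newKey)).toBe(true);   // new key valid immediately
    expect(await verifyApiKey(oldKey)).toBe(true);   // old key still valid for now

    vi.advanceTimersByTime(GRACE_PERIOD_MS + 1);
    expect(await verifyApiKey(oldKey)).toBe(false);  // must never be accepted after expiry

    vi.useRealTimers();
  });
});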

During implementation:

  • run type checks, linting, SAST, dependency scanning, and secret scanning
  • require relevant unit, integration, and contract tests
  • use test-impact analysis so agents cannot run only the convenient tests
  • use an independent AI reviewer that did not author the code

After merging:

  • release behind flags
  • canary risky changes
  • monitor SLOs and error budgets
  • connect incidents and defects back to the model, tool, PR, and review path
  • feed escaped defects into internal evals

That changes the testing role.

Testing stops being a phase applied to finished software.

It becomes the practice of designing and operating verification for machine-generated software.

Security needs deterministic guardrails

Security is where “the model will probably do the right thing” is not good enough.

Veracode’s GenAI code security work found that a large share of AI-generated code samples failed security tests and introduced OWASP Top 10-style vulnerabilities. Their summary is here: Insights from 2025 GenAI Code Security Report.

The UK National Cyber Security Centre has also warned that human-review-only approaches will creak under aggressive AI adoption. Their “vibe coding” guidance argues for secure-by-default models, model provenance, AI-assisted review, deterministic architectures, and platforms that limit what poor-quality or malicious code can do. See Vibe check: AI may replace SaaS, but not for a while.

That last point is important.

The safest way to use AI-generated code is not to hope every generated line is perfect.

It is to reduce the blast radius when the generated code is wrong:

  • least privilege
  • sandboxing
  • secret scanning
  • dependency policy
  • infrastructure policy as code
  • runtime isolation
  • feature flags
  • rate limits
  • audit logging
  • automated rollback or kill switches

A secure AI-majority SDLC should assume that bad code will occasionally get generated.

The system should make that survivable.
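
As a rough sketch of what that looks like around a single generated code path, reusing the rollback flag from the evidence bundle earlier; the flags and audit clients are placeholders for whatever you already run:

interface RotateRequest { accountId: string }
interface RotateResult { ok: boolean; message?: string }

// Placeholders: wire these to your real flag provider, audit log, and handlers.
declare const flags: { isEnabled(name: string, ctx: { accountId: string }): Promise<boolean> };
declare const audit: { log(event: string, fields: Record<string, unknown>): void };
declare function newRotationFlow(req: RotateRequest): Promise<RotateResult>;  // the AI-generated path
declare function legacyKeyFlow(req: RotateRequest): Promise<RotateResult>;    // the known-good path

export async function rotateApiKeyHandler(req: RotateRequest): Promise<RotateResult> {
  // Kill switch: turning the flag off reverts to the old behaviour without a deploy.
  if (!(await flags.isEnabled("api_key_rotation_enabled", { accountId: req.accountId }))) {
    return legacyKeyFlow(req);
  }

  try {
    return await newRotationFlow(req);
  } catch (err) {
    // When the generated path fails, record it and fall back; the blast radius is
    // one failed request, not a broken rotation feature for every account.
    audit.log("api_key_rotation_failed", { accountId: req.accountId, err });
    return legacyKeyFlow(req);
  }
}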

[Illustration: AI-generated code passing through deterministic security and policy guardrails]

Maintainability is the hard problem

A generated patch can pass tests and still make the codebase worse.

That is the part I worry about most.

AI is good at producing plausible local solutions. But long-term maintainability is not local. It is about how a change affects the shape of the system over time.

Useful maintainability signals include:

  • code duplication
  • cognitive complexity
  • dependency graph changes
  • coupling and fan-out
  • churn within 7, 30, and 90 days
  • corrective changes versus normal product changes
  • dead code growth
  • ownership coverage
  • refactoring ratio
  • time to understand and safely modify generated code

I would also track something more human:

Unknown code ratio: the proportion of production code that no current engineer can confidently explain.

That metric is uncomfortable, but it gets at the real risk.
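
Measuring it precisely is hard, but even a naive sketch is better than not looking. Assuming ownership data can be joined from CODEOWNERS, git blame, and a team directory into a shape like this:

// A naive sketch of "unknown code ratio": the share of production lines whose listed
// owners are no longer active on the codebase, or that have no owner at all.
// The data shape is illustrative.
interface FileOwnership {
  path: string;
  lines: number;
  owners: string[]; // engineers who claim they can confidently explain this file
}

function unknownCodeRatio(files: FileOwnership[], activeEngineers: Set<string>): number {
  let total = 0;
  let unknown = 0;
  for (const f of files) {
    total += f.lines;
    const hasActiveOwner = f.owners.some((o) => activeEngineers.has(o));
    if (!hasActiveOwner) unknown += f.lines;
  }
  return total === 0 ? 0 : unknown / total;
}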

The danger is not just that AI generates bad code.

The deeper danger is that AI generates good-looking code that nobody really owns.

There is some early research here. A 2026 preprint, Will It Survive? Deciphering the Fate of AI-Generated Code in Open Source, found that agent-authored code in the studied projects survived longer at the line level, while also showing modestly higher corrective modification rates.

That is exactly the kind of nuance we need.

Code survival does not automatically mean quality. Code may survive because it is correct. It may also survive because nobody wants to touch it.

The dashboard I would build

If AI is producing a meaningful share of the code, I would want a dashboard that connects generation to quality outcomes.

Not a vanity dashboard showing “lines of code generated”.

Something closer to this.

[Illustration: an engineering quality dashboard connecting AI code generation to verification, review, production, and maintainability signals]

Adoption and provenance

  • percentage of PRs that are AI-assisted
  • percentage of code that is AI-generated or significantly AI-assisted
  • model, tool, and version used
  • autocomplete versus chat-assisted versus agentic work
  • whether the agent generated tests as well as code
  • whether the agent reviewed its own work

Throughput

  • PRs opened and merged
  • lead time for change
  • cycle time by risk class
  • tasks completed
  • review queue depth

Verification

  • evidence bundle completeness
  • tests added per PR
  • relevant tests run per PR
  • static analysis findings
  • security scan findings
  • dependency findings
  • rejected AI review comments
  • accepted AI review comments

Review health

  • p50 and p95 review latency
  • PRs per reviewer per day
  • review comments per KLOC
  • issues found by humans only
  • issues found by AI only
  • issues found by both
  • escaped defects by review path

Production quality

  • change failure rate
  • incidents per PR
  • rollback or fix-forward events
  • SLO regressions
  • customer-impacting defects
  • MTTR
  • time from merge to first corrective change

Maintainability

  • complexity deltas
  • duplication growth
  • churn and rework
  • corrective modification rate
  • ownership coverage
  • unknown code ratio

The point is to close the loop.

If a particular model, prompt pattern, agent workflow, or repository area repeatedly produces defects, the team should know.

If AI-generated tests increase coverage but do not catch regressions, the team should know.

If reviewers are approving more PRs but production quality is falling, the team should know.
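
Closing the loop mostly means joining provenance with outcomes. A rough sketch, with an illustrative record shape:

// A rough sketch of closing the loop: join change provenance with production outcomes
// and break the change failure rate down by tool, model, or risk class. The record
// shape is illustrative; incident linkage usually comes from postmortems or
// fix-forward PRs that reference the original change.
interface ChangeRecord {
  pr: number;
  tool: string;
  model: string;
  riskClass: string;
  causedIncident: boolean;
}

function changeFailureRateBy(
  records: ChangeRecord[],
  key: (r: ChangeRecord) => string,
): Map<string, number> {
  const totals = new Map<string, { changes: number; failures: number }>();
  for (const r of records) {
    const t = totals.get(key(r)) ?? { changes: 0, failures: 0 };
    t.changes += 1;
    if (r.causedIncident) t.failures += 1;
    totals.set(key(r), t);
  }
  const rates = new Map<string, number>();
  for (const [k, t] of totals) rates.set(k, t.failures / t.changes);
  return rates;
}

// e.g. changeFailureRateBy(records, (r) => `${r.tool} / ${r.model}`)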

Internal evals matter more than public benchmarks

Public coding benchmarks are useful, but they are not enough.

OpenAI argued in 2026 that SWE-bench Verified no longer measured frontier coding capabilities well, because of test flaws and contamination concerns. Their write-up is here: Why SWE-bench Verified no longer measures frontier coding capabilities.

That does not mean benchmarks are useless.

It means organisations need internal evals based on their own reality.

Good internal evals should come from:

  • historical bugs
  • production incidents
  • security vulnerabilities
  • migration tasks
  • flaky test fixes
  • dependency upgrades
  • refactoring backlog
  • real code reviews
  • domain-specific business rules

A public benchmark can tell you whether a model is broadly capable.

An internal eval tells you whether it can work safely in your codebase.

That distinction matters.
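
A hedged sketch of what one internal eval case could look like, built from a past incident; the shape, commit, and paths are all illustrative rather than a standard format:

// One internal eval case derived from a historical incident. Everything here,
// including the commit hash and incident id, is illustrative.
interface InternalEvalCase {
  id: string;
  source: "bug" | "incident" | "vulnerability" | "migration" | "review";
  repoSnapshot: string;     // commit to check out before giving the agent the task
  task: string;             // the prompt handed to the coding agent
  mustPass: string[];       // commands that must succeed afterwards
  mustNotTouch: string[];   // paths the agent is not allowed to change
}

const expiredKeysRegression: InternalEvalCase = {
  id: "incident-expired-keys",
  source: "incident",
  repoSnapshot: "9f2c1ab",
  task: "Expired API keys are still accepted by account-service; find and fix the cause.",
  mustPass: ["npm test -- api_keys", "npm run typecheck"],
  mustNotTouch: ["app/billing/"],
};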

Where humans stay in the loop

The phrase “AI writes most of the code” can make it sound like humans become optional.

I think that is the wrong framing.

Humans move up a level.

Humans should own:

  • product intent
  • architecture
  • risk acceptance
  • security posture
  • domain rules
  • production readiness
  • incident response
  • long-term codebase health

AI can own or assist with:

  • implementation
  • refactoring
  • test generation
  • documentation
  • code review
  • static analysis triage
  • migrations
  • debugging
  • release note drafting

That is still software engineering.

It is just less about typing the implementation and more about designing the system that safely produces, verifies, and evolves it.

Final take

AI coding agents are making code production dramatically cheaper.

That does not make engineering discipline less important.

It makes engineering discipline more important because mistakes can now be produced at machine speed.

The teams that benefit most will not be the ones that simply let AI write more code.

They will be the ones that redesign their SDLC around:

  1. executable intent
  2. evidence bundles
  3. risk-based review
  4. continuous testing
  5. deterministic security guardrails
  6. production feedback
  7. maintainability telemetry

The old question was:

How do we write this code faster?

The new question is:

How do we produce enough confidence to safely accept this much generated change?

That is the engineering problem now.
