The Fake Confidence Refactor
AI rewrites working code with great confidence and quietly breaks edges.
What it is
Working code goes in. “Cleaner” code comes out. The diff is huge, the commit message is enthusiastic, the tests still pass — but specific behaviors that nobody tested (the encoding edge case, the timezone handling, the obscure permission check, the cache invalidation on the unusual code path) are subtly different. The refactor is a regression with a great-looking PR description.
How it happens
Someone asks AI to “clean this up” or “make this more idiomatic.” AI rewrites with high confidence: the new version is more readable, follows convention, removes apparent duplication. But “apparent duplication” was sometimes load-bearing: the two near-identical functions handled subtly different cases, and merging them silently picked one interpretation. The lookup table that “could be a switch” was actually data-driven for a reason. The check that “could move to the parent function” has to run in a context the parent function doesn’t run in.
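A minimal sketch of load-bearing duplication, in Python. All names here (normalize_username, normalize_api_key, normalize) are hypothetical and only illustrate how merging two near-identical helpers can silently pick one interpretation:

```python
def normalize_username(raw: str) -> str:
    """Usernames are matched case-insensitively, so case is folded."""
    return raw.strip().lower()


def normalize_api_key(raw: str) -> str:
    """API keys are case-sensitive, so only whitespace is removed."""
    return raw.strip()


def normalize(raw: str) -> str:
    """The 'deduplicated' helper a cleanup might produce.

    It silently applies the username rule to every caller.
    """
    return raw.strip().lower()


# The merged helper agrees with one of the originals...
print(normalize_username("  Alice ") == normalize("  Alice "))  # True
# ...and quietly changes behavior for the other.
print(normalize_api_key(" AbC123 ") == normalize(" AbC123 "))  # False
```

The two originals differ by a single method call, which is exactly why the merge looks safe in a large diff.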
The specific AI failure modes: the model can’t see the non-obvious constraints (compliance, performance, downstream-consumer expectations, the bug fix from two years ago that this line silently encodes). The diff is too large to read line by line. The commit message asserts the change is equivalent in confident prose. And “tests still pass” is read as “behavior preserved,” which it isn’t — it only means the tested behavior is preserved, and the dangerous edges are exactly the untested ones.
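To make the gap between “tests still pass” and “behavior preserved” concrete, here is a small hypothetical example. The function names and the tiny test table are assumptions made up for illustration; the point is only that both versions satisfy the suite while disagreeing on an input nobody tested:

```python
def parse_flag_old(value: str) -> bool:
    """Original: several spellings of 'off' are treated as False."""
    return value.strip().lower() not in ("", "0", "false", "no")


def parse_flag_new(value: str) -> bool:
    """'Cleaner' rewrite: any non-empty string counts as True."""
    return bool(value.strip())


# The only cases the (hypothetical) test suite ever checked.
TESTED = [("true", True), ("", False), ("1", True)]


def suite_passes(fn) -> bool:
    """Return True when fn gives the expected answer for every tested case."""
    return all(fn(raw) == expected for raw, expected in TESTED)


print(suite_passes(parse_flag_old), suite_passes(parse_flag_new))  # True True
print(parse_flag_old("false"), parse_flag_new("false"))  # False True
```

A green suite here says nothing about "false", "0", or "no", which are the inputs where the rewrite changes behavior.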
Why it’s dangerous
The bugs land in production and look unrelated to the refactor. The on-call engineer chases the symptom a week later and never thinks to suspect the cleanup PR. The original author is long gone — their code looked “messy” and got tidied up by someone with fresh eyes, except the messiness was load-bearing scar tissue from a real bug. The team trusts the commit’s confidence longer than it deserves.
The AI-era hinge: cleanups used to be small because they were laborious. Now they’re large because they’re cheap — and the variance in correctness scales with the size of the diff. Pre-AI, the friction of rewriting 800 lines was itself a brake on doing it casually. Post-AI, there is no brake, and the team needs to install one deliberately.
How to prevent it
Refactors that aren’t supposed to change behavior need behavior-preservation evidence, not vibes. The author has to point at the tests (or the contract, or property tests, or a shadow comparison) that prove behavior is preserved. “Tests still pass” alone is the absence of contradicting evidence — not the same thing as evidence. AI is excellent at producing the preservation evidence itself: running the same inputs through old and new and reporting diffs is exactly the kind of mechanical work it does well.
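The old-versus-new comparison described above can be sketched in a few lines of Python. This is an illustrative harness, not a prescribed tool: diff_behavior, outcome, and the two port parsers are hypothetical names, and a real input set would come from fixtures or recorded traffic rather than a hand-written list.

```python
from typing import Any, Callable, Iterable, List, Tuple


def outcome(fn: Callable[[Any], Any], arg: Any) -> Tuple[str, Any]:
    """Return ('ok', value) or ('raised', exception type name).

    Exceptions are recorded rather than propagated so they can be compared too.
    """
    try:
        return ("ok", fn(arg))
    except Exception as exc:
        return ("raised", type(exc).__name__)


def diff_behavior(old: Callable, new: Callable, inputs: Iterable[Any]) -> List[tuple]:
    """Run every input through both versions and collect the disagreements."""
    diffs = []
    for arg in inputs:
        before, after = outcome(old, arg), outcome(new, arg)
        if before != after:
            diffs.append((arg, before, after))
    return diffs


def old_port(raw: str) -> int:
    """Original: an empty setting falls back to port 80."""
    return int(raw.strip() or "80")


def new_port(raw: str) -> int:
    """Rewrite: the fallback was dropped as 'unnecessary'."""
    return int(raw)


for arg, before, after in diff_behavior(old_port, new_port, ["8080", " 443 ", "", "abc"]):
    print(f"{arg!r}: old={before} new={after}")
# prints: '': old=('ok', 80) new=('raised', 'ValueError')
```

The harness reports every disagreement and makes no attempt to decide which ones matter; that judgment stays with the reviewer.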
Scale the evidence to the stakes: renaming a private helper isn’t a payment-processor rewrite. The warning sign is a refactor PR that is bigger than the feature PR for the same area: at that point the burden of proof has flipped, and nobody noticed.
The serious team fix
Three things, reinforcing each other:
- A team habit of explicit preservation claims. Refactor PRs state what behavior is being preserved and how the author knows — “characterization tests cover the X path,” “OpenAPI contract tests cover the response shape,” “shadow comparison for 24h shows no diff.” “Tests still pass” alone isn’t a preservation claim; it’s an assumption. The PR template asks the question so the author has to answer it.
- AI as before/after differ. A slash command or agent that runs the same input through the old and new code paths and reports the diff — not just on return values, but on side effects, log lines, queries emitted, events published, response shapes, status codes. AI is excellent at the comparison; it is terrible at deciding when a diff matters. Surface every diff; let the human judge which ones are acceptable.
- Characterization tests and shadow traffic for critical paths. Before any refactor lands on a load-bearing component, the system has tests that pin the current observable behavior in enough detail that a regression can’t land silently. For the most critical paths, shadow traffic runs both versions in parallel against real load and the diff is monitored. The infrastructure catches the silent regression that the diff review couldn’t.
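The second and third items can be sketched together: pin the observable behavior of the current code, side effects included, then compare the refactored version against the pin. The example below is a minimal Python illustration under stated assumptions. The observe helper, the "billing" logger name, and both discount functions are hypothetical, and log lines stand in for the wider set of side effects (queries, events, status codes) a real differ would capture.

```python
import io
import logging
from typing import Any, Callable

log = logging.getLogger("billing")  # hypothetical logger name


def observe(fn: Callable[..., Any], *args: Any, logger_name: str = "billing") -> dict:
    """Run fn and record its result (or exception) plus the log lines it emits."""
    buffer = io.StringIO()
    handler = logging.StreamHandler(buffer)
    handler.setFormatter(logging.Formatter("%(levelname)s %(message)s"))
    logger = logging.getLogger(logger_name)
    previous_level = logger.level
    logger.addHandler(handler)
    logger.setLevel(logging.DEBUG)
    try:
        try:
            result = ("ok", fn(*args))
        except Exception as exc:
            result = ("raised", type(exc).__name__)
    finally:
        # Always restore the logger so observations do not leak into each other.
        logger.removeHandler(handler)
        logger.setLevel(previous_level)
    return {"result": result, "logs": buffer.getvalue().splitlines()}


def apply_discount_old(total_cents: int, percent: int) -> int:
    """Original: caps the discount and logs a warning when it does."""
    if percent > 50:
        log.warning("discount capped at 50")
        percent = 50
    return total_cents - total_cents * percent // 100


def apply_discount_new(total_cents: int, percent: int) -> int:
    """Rewrite: same arithmetic, but the warning disappeared."""
    percent = min(percent, 50)
    return total_cents - total_cents * percent // 100


pinned = observe(apply_discount_old, 1000, 80)      # characterization snapshot
candidate = observe(apply_discount_new, 1000, 80)   # behavior after the refactor

print(pinned["result"] == candidate["result"])  # True
print(pinned["logs"])                           # ['WARNING discount capped at 50']
print(candidate["logs"])                        # []
```

A comparison on return values alone would call these two versions equivalent. Whether the missing warning matters (for example, because an alert keys off it) is the human's call; the harness only makes sure the difference is surfaced.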
The shift is: AI’s confidence is uncorrelated with AI’s correctness. The refactor is safe only when something other than the AI’s tone is confirming it.