What it is
The tests and the code under test were generated by the same model in the same session, asserting the values the model just produced. The coverage number climbs. The suite is green. Nothing has been checked — only confirmed.
How it happens
Write the function with AI. Then: “now write tests for
this.” The AI reads the function it just wrote, walks the
branches, and asserts whatever the implementation happens to
return. If the function returns null for a missing
record, the test asserts null for a missing record
— not because that’s the right behavior, but because
that’s the observed behavior. Coverage goes from 0% to 94%.
The PR shows +340 lines, including tests. The
reviewer thinks: tests, great.
Nobody zoomed out to ask: what requirement do these tests prove? The AI never had the requirement. It had the code. It wrote tests about the code. The tests are a mirror, not a check.
Why it’s dangerous
A mirrored test suite is worse than no test suite, because no test suite is honest about what it doesn’t know. A green mirror tells you the code is safe to change when it isn’t. Bugs in the implementation get faithfully transcribed into the assertions — the off-by-one becomes the expected value, the swallowed exception becomes the expected silence. Refactoring becomes terrifying, because the tests are married to the current shape of the code, not to its behavior. Touch anything and dozens of tests go red for reasons that don’t correspond to any user-visible change.
And it compounds. The next AI session inherits the high coverage number, treats the suite as a safety net, and generates new mirrored tests against new mirrored code. The number keeps climbing while the confidence behind it keeps falling. The AI era hasn’t introduced this failure mode — engineers have always been able to write tautological tests — but it has industrialized it. A pattern that used to take ten minutes of bad discipline now takes ten seconds and arrives looking polished.
How to prevent it
Tests have to come from a different source than the code, or they’re not testing anything. The source can be a written spec, a ticket, an API contract, example inputs and outputs from the real world, the prior version’s behavior — anything that wasn’t produced by the same prompt as the implementation. If the only artifact in the room is the function, the tests will be a mirror.
Match rigor to stakes. A typo fix doesn’t need a new test. A new branch in business logic does — and the cheapest way to get a real one is to feed AI the requirement (not the implementation) and ask it to draft the test against the requirement. Then, before you trust the green, have AI mutate the function in a plausible way and rerun the suite. If nothing fails, the suite isn’t actually testing. The friction signal: when writing a test feels like rewriting the function in the negative, you’re mirroring.
The serious team fix
Three things, reinforcing each other:
- A team habit of writing tests from intent, not implementation. For any change with meaningful behavior, the author has to be able to answer one question: what requirement does this test prove? “It covers the function” is not an answer. The requirement might be a ticket, a contract, an example, a prior version’s output — but it has to exist outside the code under test. PR reviews ask the question out loud.
- An AI-leveraged adversarial test pass. A slash command or agent that takes the spec, contract, or example data as input — not the implementation — and writes tests against that artifact. Or a mutation testing pass: AI introduces small plausible bugs into the function (negate a comparison, off-by-one a loop, swap two arguments) and reports which mutations the existing suite fails to catch. Either way, the test source and the code source come from different artifacts, and the AI is doing the adversarial work the author won’t do alone.
- Real integration tests and mutation testing in CI. Unit tests that mock the world can mirror with impunity, because nothing outside the model is grounding the assertions. Tests that hit a real database, a real queue, a sandboxed third-party API — those anchor on something the model didn’t write. Mutation testing tools (Stryker, mutmut, PIT, and friends) make the lie visible at the CI gate: if the suite doesn’t notice when the code is sabotaged, the coverage number is fiction and the build can say so.
The shift is: coverage is a number that goes up. Confidence is the property you actually wanted, and it lives one layer down — in tests that came from a different artifact than the code, and in systems too real to deceive.