Your AI tests aren't actually testing anything

AI is great at generating green CI. It's worse at generating tests that catch real bugs. Five patterns I see in the wild — and a checklist to run before you trust an AI-written test suite.

Your tests are passing. Your prod is broken. The AI wrote both.

The single most common quality failure in AI-built codebases isn't "AI wrote bad code" — it's "AI wrote tests that pass without testing the thing." Green CI, red prod. Confidence without coverage.

This post covers what those tests actually look like, why they slip through review, and a checklist you can run before merging.

Why this happens

When you ask a model to "write a test for this function," it has two ways to make the test pass: write a test that genuinely verifies the behavior, or write a test that trivially passes. Both score equally well on "tests pass." The model has no incentive to pick the harder one — and if it doesn't have access to your real database, your real schema, or the actual integration it's testing, it often can't.

So it mocks. It asserts on the mocks. It tests the happy path with values it just made up. The test goes green and ships.

Here are the five patterns to look for.

1. The unit under test is mocked

The most expensive version of the problem. AI is asked to test processPayment(). It mocks Stripe, mocks the database, and, quietly, also mocks processPayment itself. The test calls the mock, asserts the mock was called, passes, and tells you nothing.

A real example I saw: a checkout function was deleted entirely from the codebase during a refactor. Its tests still passed, because every test had set up a mock for the function being tested. The test suite ran green for two weeks before someone noticed the import error.

How to spot it: in any test file, search for the name of the function being tested. If it, or the module it is imported from, appears in vi.mock(...), jest.mock(...), or any mock setup, that's the smell. The thing you're testing should be the only thing in the file you're not mocking.
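
That search can be roughed out as a script. This is a heuristic sketch, not a real linter: findSelfMocks is a hypothetical helper, it only understands named ES imports and string-literal module paths, and the test file contents are made up for illustration.

```javascript
// Heuristic: flag a test file that imports `unitName` from a module
// and also mocks that same module with vi.mock(...) or jest.mock(...).
function findSelfMocks(testSource, unitName) {
  // Module paths the unit is imported from, e.g.
  //   import { processPayment } from './checkout';
  const importRe = /import\s*\{([^}]*)\}\s*from\s*['"]([^'"]+)['"]/g;
  const unitModules = new Set();
  for (const match of testSource.matchAll(importRe)) {
    const names = match[1]
      .split(',')
      .map((name) => name.trim().split(/\s+as\s+/)[0]);
    if (names.includes(unitName)) unitModules.add(match[2]);
  }

  // Module paths passed as the first argument to vi.mock / jest.mock.
  const mockRe = /\b(?:vi|jest)\.mock\(\s*['"]([^'"]+)['"]/g;
  const mockedPaths = [...testSource.matchAll(mockRe)].map((m) => m[1]);

  // Any overlap means the unit under test is being replaced by a mock.
  return mockedPaths.filter((path) => unitModules.has(path));
}

// Made-up test file: Stripe is mocked (fine), and so is the module under test (not fine).
const hollowTest = `
import { processPayment } from './checkout';
vi.mock('./stripe');
vi.mock('./checkout', () => ({ processPayment: vi.fn() }));
`;

console.log(findSelfMocks(hollowTest, 'processPayment')); // lists './checkout'
```

An empty result doesn't prove the test is sound, since there are other ways to replace a function. A non-empty one is worth a look every time.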

2. Asserting on the mock, not on the behavior

This one is subtle. The test does this:

expect(sendEmail).toHaveBeenCalledWith(user.email, expect.any(String));

But it never checks what the email said. So when the templating logic later breaks and sends Hello {{name}}, your order #{{orderId}} literally — with the curly braces — to 4,000 customers, the test still passes. The mock was called with a string. That string just happened to be wrong.

AI loves this pattern because it's easy to generate and never fails on edge data. The test verifies the call, not the output.

How to spot it: if every assertion in a test is toHaveBeenCalled or toHaveBeenCalledWith, the test isn't checking behavior — it's checking that the function did the thing the function was written to do. That's a tautology dressed as a test.
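
Here is a sketch of the alternative: capture what the mock received and assert on the content. Everything in it is hypothetical (renderTemplate, notifyOrderPlaced, the hand-rolled recorder), and it uses plain throw statements so it runs without a test runner. With Vitest or Jest you would read the same arguments from the mock's .mock.calls array.

```javascript
// Hypothetical renderer and notifier standing in for the real code under test.
function renderTemplate(template, data) {
  return template.replace(/\{\{(\w+)\}\}/g, (_match, key) => String(data[key]));
}

function notifyOrderPlaced(user, order, sendEmail) {
  const template = 'Hello {{name}}, your order #{{orderId}}';
  sendEmail(user.email, renderTemplate(template, { name: user.name, orderId: order.id }));
}

// Hand-rolled recording fake: remembers the arguments of every call.
function createRecorder() {
  const calls = [];
  const fake = (...args) => {
    calls.push(args);
  };
  fake.calls = calls;
  return fake;
}

function checkOrderEmail() {
  const sendEmail = createRecorder();
  notifyOrderPlaced({ name: 'Alice', email: 'alice@example.com' }, { id: 42 }, sendEmail);

  const [to, body] = sendEmail.calls[0];
  // Check the recipient AND the rendered content, not just "some string".
  if (to !== 'alice@example.com') throw new Error(`wrong recipient: ${to}`);
  if (body !== 'Hello Alice, your order #42') throw new Error(`wrong body: ${body}`);
  // Guard against raw placeholders leaking to customers.
  if (/\{\{|\}\}/.test(body)) throw new Error(`unrendered placeholder in: ${body}`);
  return body;
}

console.log(checkOrderEmail());
```

If renderTemplate breaks and returns the template unchanged, the body check fails. The expect.any(String) version would have stayed green.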

3. The happy-path-only suite

test('returns user by id', () => {
  const user = getUser(1);
  expect(user.name).toBe('Alice');
});

Beautiful. Passes. Useless.

What about getUser(999999) (doesn't exist)? getUser(null)? getUser(-1)? getUser('1') (string instead of number)? getUser(1) when the user has been soft-deleted? When the DB connection times out?

AI defaults to the inputs it can imagine working. It rarely generates the inputs that would expose the bug. So you get five tests that all assert variations of "the function works when called correctly," and none that assert what happens when it isn't.

How to spot it: count the test cases per function. If there's one test, or three tests that look like rephrasings of the same scenario, the suite is testing existence — not correctness.
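
A table of inputs makes the gaps obvious. The sketch below assumes a contract the post never specifies: a hypothetical getUser that returns null for missing or soft-deleted users and throws a TypeError for ids that aren't positive integers. Your real function may behave differently. The point is that each row pins down one answer to the questions above.

```javascript
// Hypothetical in-memory store; a real getUser would hit a database.
const users = new Map([
  [1, { id: 1, name: 'Alice', deletedAt: null }],
  [2, { id: 2, name: 'Bob', deletedAt: '2024-01-01' }],
]);

// Assumed contract: TypeError for invalid ids, null for missing or soft-deleted users.
function getUser(id) {
  if (!Number.isInteger(id) || id <= 0) {
    throw new TypeError(`invalid user id: ${String(id)}`);
  }
  const user = users.get(id);
  return user && user.deletedAt === null ? user : null;
}

// Collapse "returned a user", "returned null" and "threw" into one comparable label.
function describeOutcome(id) {
  try {
    const user = getUser(id);
    return user === null ? 'null' : user.name;
  } catch (e) {
    return e instanceof TypeError ? 'TypeError' : 'unexpected error';
  }
}

// One row per scenario: [input, expected outcome].
const cases = [
  [1, 'Alice'],        // happy path
  [999999, 'null'],    // does not exist
  [2, 'null'],         // soft-deleted
  [null, 'TypeError'], // null id
  [-1, 'TypeError'],   // out of range
  ['1', 'TypeError'],  // string instead of number
];

const failures = cases.filter(([input, expected]) => describeOutcome(input) !== expected);
console.log(failures.length === 0 ? 'all cases pass' : failures);
```

In Vitest or Jest the same table fits test.each. Five of the six rows are not the happy path.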

4. The try/catch that swallows the assertion

test('handles invalid input', () => {
  try {
    parseConfig(badInput);
  } catch (e) {
    // expected
  }
  expect(true).toBe(true);
});

This test passes whether parseConfig throws, returns silently, or formats your hard drive. Nothing in the test proves the catch branch was ever reached, and the final expect(true).toBe(true) can never fail. AI writes this pattern because it pattern-matches "test that handles errors" without understanding that the test has to prove the error path was taken.

How to spot it: any test whose only assertion sits outside the try/catch and doesn't depend on what happened inside it, or whose try block has no expect.fail() after the call to enforce that the error actually fired. If removing the function call doesn't break the test, the test isn't testing the function.
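
For contrast, a minimal sketch of an error-path check that cannot pass unless the error actually fires. The parseConfig here is a hypothetical stand-in, and the TypeError it throws is an assumption about what the real one does.

```javascript
// Hypothetical parser: the real parseConfig and its error type are assumptions.
function parseConfig(input) {
  if (typeof input !== 'string') throw new TypeError('config must be a string');
  return JSON.parse(input);
}

// Takes the parser as a parameter so the check can be pointed at a broken one.
function checkRejectsInvalidInput(parse) {
  let caught = null;
  try {
    parse(undefined);
  } catch (e) {
    caught = e;
  }
  // Fails when nothing was thrown, and also when the wrong kind of error was thrown.
  if (!(caught instanceof TypeError)) {
    throw new Error('expected a TypeError for invalid input');
  }
}

// Passes: the error path ran.
checkRejectsInvalidInput(parseConfig);

// Point the same check at a parser that silently accepts anything.
let hollowParserDetected = false;
try {
  checkRejectsInvalidInput(() => ({}));
} catch {
  hollowParserDetected = true;
}
console.log(hollowParserDetected);
```

In Vitest or Jest the same guarantee is one line: expect(() => parseConfig(badInput)).toThrow(TypeError). It fails on its own if nothing is thrown.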

5. The tautological test

The most insidious. AI writes a function and its test in one prompt. The test mirrors the implementation exactly:

function isEligible(user) {
  if (user.age < 18) return false;
  if (!user.verified) return false;
  return true;
}

test('isEligible returns false for under-18', () => {
  expect(isEligible({ age: 17, verified: true })).toBe(false);
});

The test will catch a regression on this exact branch. But the spec — what "eligible" actually means in your business — is locked in at whatever the AI guessed. If the requirement was "must be 18 and have completed onboarding," the test will happily certify a half-broken implementation forever, because the test was generated from the same misunderstanding.

How to spot it: does the test reference any source of truth that isn't the implementation? A spec doc, a Jira ticket pasted in a comment, real production data, a schema constraint. If the only authority for "what should happen" is the function being tested, you have a tautology.
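
To make that concrete, here is a sketch that runs the isEligible implementation above against cases written from the requirement instead of from the code. The onboardingComplete field and the three cases are hypothetical. They stand in for whatever your spec actually says.

```javascript
// The implementation as generated (copied from the example above).
function isEligible(user) {
  if (user.age < 18) return false;
  if (!user.verified) return false;
  return true;
}

// Cases derived from the requirement "must be 18 and have completed onboarding".
// The field name onboardingComplete is an assumption made for this sketch.
const specCases = [
  { user: { age: 17, verified: true, onboardingComplete: true }, eligible: false },
  { user: { age: 18, verified: true, onboardingComplete: true }, eligible: true },
  { user: { age: 30, verified: true, onboardingComplete: false }, eligible: false },
];

// Every case where the implementation disagrees with the spec.
const mismatches = specCases.filter(({ user, eligible }) => isEligible(user) !== eligible);

console.log(mismatches.map(({ user }) => user));
```

The under-18 case passes, just like the generated test. The third case is the one that exposes the misunderstanding: a 30-year-old who never finished onboarding is certified as eligible.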

The unlock: stop pretending to test units, run the system

The five patterns above are symptoms of the same disease — pretending you can isolate something that only exists when it's plugged in. The unit calls a database. The unit hits an API. The unit renders to a screen. Mock those away and the test is just confirming the mocks behave like you told them to.

The fix in 2026 is to run integration tests against real infrastructure instead of mocking it.

Backend: spin up a real Postgres, a real Redis, whatever the code actually talks to. Docker Compose makes it a 30-second move. The test that catches the migration bug is the one that runs against a real schema — not the one that trusts a mock ORM to behave like Postgres does. If your CI can't run a real DB, that's the bug. Fix that first; everything else downstream is shadow-boxing.

Frontend: skip snapshot tests. They tell you nothing useful unless someone reads every diff, which nobody does. Run Playwright (or your equivalent). Open a real browser. Click the actual button. Submit the actual form. Take a screenshot.

The new move AI unlocks: show the AI the screenshot and ask if it looks right. Not "did the assertion pass" but "did the rendered UI actually come out acceptable": "Is the layout broken? Is the text readable? Does the error state match what we designed?" An AI reading a Playwright screenshot can catch visual regressions that snapshot tests miss, and it explains them in plain English. That's a check you couldn't write in code before, and it's one of the few that catches "the modal renders, but it renders off-screen on mobile."

Combine the two: run the actual flow against real infra, screenshot it, let an AI sanity-check the screenshot. That's the test that survives a refactor and catches what nobody anticipated.

The pre-merge checklist

Before you trust an AI-written test, run through these. None take more than 30 seconds:

  1. Is the unit under test mocked? Search the file for its name, or its module's path, in vi.mock / jest.mock. If yes, the test is hollow.
  2. Does every assertion check behavior, not just calls? toHaveBeenCalled is fine alongside an output check. On its own, it's a smell.
  3. How many edge cases? At minimum: empty input, null/undefined, wrong type, boundary values, error path. If the test only covers one of these, ask for more.
  4. Does the error path have an assertion that proves it ran? expect.fail() in the try, right after the call, or an explicit assertion that the catch was reached.
  5. Where did the spec come from? If the AI generated both the function and the test from the same prompt, hide the implementation, regenerate the test from the spec alone, and ask: would this test still pass against a wrong implementation?
  6. Comment out the function body and rerun. A real test should fail when the implementation is broken. If it still passes — the test was the bug.
  7. Did it run against the real thing? Not a mocked DB, not a stubbed API, not a snapshot of HTML — the actual integration. For frontend, that means a real browser via Playwright and a screenshot you (or an AI) can eyeball. If the test never touched the system it's testing, it didn't test it.

How to ask AI for tests that actually test

The cheap fix isn't more tests. It's better prompts.

  • Paste the actual schema, actual types, actual API contract. Don't let the AI invent the data shape.
  • Ask for "a test that fails if [specific behavior] is broken." Concrete failure modes, not "good test coverage."
  • Ask for the edge cases explicitly: "include null, empty, boundary, wrong type, and error path."
  • Generate the test from the spec, not from the implementation. If you have the function already, hide it. Give the AI the description of what it should do, not the code.
  • After the test is written, mutation-test it: change the function so it's wrong, and verify the test catches it. If it doesn't, the test is decorative.
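
The mutation step in the last bullet can be done by hand in a few lines. A minimal sketch, assuming a hypothetical applyDiscount function and a suite written as a plain function so it can be pointed at deliberately broken variants:

```javascript
// Hypothetical function under test.
function applyDiscount(total, percent) {
  if (percent < 0 || percent > 100) {
    throw new RangeError('percent must be between 0 and 100');
  }
  return total - (total * percent) / 100;
}

// The suite takes the implementation as a parameter and reports pass/fail.
function suitePasses(impl) {
  try {
    if (impl(200, 25) !== 150) return false;
    if (impl(200, 0) !== 200) return false;
    if (impl(200, 100) !== 0) return false;
    // Error path: an out-of-range percent must throw a RangeError.
    let threw = false;
    try {
      impl(200, 101);
    } catch (e) {
      threw = e instanceof RangeError;
    }
    return threw;
  } catch {
    return false;
  }
}

// Deliberately wrong variants. A useful suite fails for every one of them.
const mutants = {
  addsInsteadOfSubtracts: (total, percent) => total + (total * percent) / 100,
  ignoresPercent: (total) => total,
  skipsRangeCheck: (total, percent) => total - (total * percent) / 100,
  emptyBody: () => undefined,
};

// Names of the mutants the suite failed to catch.
const survivors = Object.entries(mutants)
  .filter(([, mutant]) => suitePasses(mutant))
  .map(([name]) => name);

console.log({ original: suitePasses(applyDiscount), survivors });
```

Any name left in survivors is a bug the suite would wave through. Delete the error-path check from suitePasses and skipsRangeCheck survives, which tells you exactly which assertion was doing the work.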

A green CI is not a guarantee. It's a hypothesis. AI-generated tests turn that hypothesis into a confident lie if you don't audit them.

The good news: once you know what to look for, the audit takes a minute per file. The cost of skipping it is the bug that ships, the customer that finds it, and the on-call evening you spend tracing why all the tests passed.
