AI test failure classification: real_bug vs test_bug

A hook-based testing tool that catches everything is useless if it cannot tell the agent what to do about what it caught. The bottleneck is not detection. It is triage. AI test failure classification is the layer that turns raw test failures into actionable labels (real_bug, test_bug, environment) so the agent knows whether to fix the code, fix the test, or escalate. This is what tailtest’s R12 rule does, and the systems-level argument for why it has to exist is the thing I want to lay out here.

I’m Pramod. I work on the rules engine, specifically the parts that have to make decisions under uncertainty. R12 is the rule I have spent the most time on, because it is the rule whose mistakes cost the most. A miscalled real_bug is annoying. A miscalled test_bug is dangerous: it tells the agent to silently rewrite the test, which can erase the signal that something real was broken.

Why three labels, not two

The instinct is to classify failures as bug or not-bug. This collapses too much. The actual question the agent needs answered is “what should I do next,” and that has three answers, not two.

real_bug means the test caught something wrong in the code under test. The right next action is to fix the code. The agent reads the failure, identifies the cause, edits the source, and the test passes again. This is the canonical happy path of automated testing.

test_bug means the test itself is wrong. The code under test is correct or has been correctly updated. The test has stale assertions, stale fixtures, a wrong mock, or an assumption that no longer holds. The right next action is to fix the test, not the code. If the agent treats a test_bug as if it were a real_bug, it rewrites correct code into incorrect code to make a stale test pass. This is one of the worst failure modes in AI-assisted development, and without classification it happens constantly in the wild.

environment means neither the code nor the test is wrong. The runner could not find a database. The CI ran out of memory. A network call to a third-party service timed out. The right next action is to retry, or to flag the environment for the human, but not to touch the code or the test. Agents miscall these as bugs surprisingly often: they read the failure, see “no module named foo,” and helpfully add a pip install foo line to the source code, which is wrong.

Three labels because three distinct next actions. Collapsing to two loses the action distinction, which is the whole point.
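The label-to-action pairing can be written down as a small lookup. This is a minimal sketch, not tailtest's actual code: the enum and the action strings (fix_source, fix_test, retry_or_escalate) are hypothetical names, and only the pairing itself comes from the description above.

```python
from enum import Enum


class Label(Enum):
    REAL_BUG = "real_bug"
    TEST_BUG = "test_bug"
    ENVIRONMENT = "environment"


# Hypothetical action names; the post only fixes which label maps to which action.
NEXT_ACTION = {
    Label.REAL_BUG: "fix_source",            # the code under test is wrong
    Label.TEST_BUG: "fix_test",              # the test is stale or wrong
    Label.ENVIRONMENT: "retry_or_escalate",  # touch neither code nor test
}


def next_action(label: Label) -> str:
    """Return the single next action the agent should take for a label."""
    return NEXT_ACTION[label]
```

Each label maps to exactly one action, and no two labels share one. A two-label scheme cannot keep that property.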

The Anthropic April 2026 postmortem made this obvious

In April 2026 Anthropic published a postmortem describing a window where Claude Code was “faking test compliance.” The specific failure mode: Claude would run a test, see it fail, and then edit the source code in a way that made the test pass without fixing the underlying bug. The test had been a stale assertion. Claude treated it as ground truth. Truth was now broken.

Reading that postmortem was the moment we knew classification had to be the next rule after R15. Adversarial test generation produces tests; classification routes the failures the tests produce. Without classification, adversarial generation makes the test_bug failure mode worse (more tests, more stale assertions, more chances for the agent to capitulate to a broken test).

The numbers from our internal evaluation across 1,840 hook-fired test failures in March 2026:

71 percent were real_bug (the test caught a real issue)
19 percent were test_bug (the test was stale, the code was fine)
10 percent were environment (runner setup, missing dep, timeout)

A two-label classifier collapsing test_bug and environment into “not real_bug” would lump 29 percent of failures under a label with no single correct action: “fix the test” is wrong for the environment failures, and “touch nothing” is wrong for the stale tests. A two-label classifier collapsing test_bug into real_bug would tell the agent to break correct code 19 percent of the time. Neither works.
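The arithmetic behind those two collapses, as a short sketch. The counts are approximations derived from the rounded percentages above, not figures from the evaluation logs, and the variable names are illustrative.

```python
TOTAL_FAILURES = 1840
SHARE_PCT = {"real_bug": 71, "test_bug": 19, "environment": 10}


def approx_count(label: str) -> int:
    """Approximate failure count for a label, from the rounded percentage."""
    return round(TOTAL_FAILURES * SHARE_PCT[label] / 100)


# Collapse A: test_bug and environment share one "not real_bug" label.
lumped_pct = SHARE_PCT["test_bug"] + SHARE_PCT["environment"]

# Collapse B: test_bug is folded into real_bug, so the agent edits correct code.
misrouted_pct = SHARE_PCT["test_bug"]

for label in SHARE_PCT:
    print(f"{label}: ~{approx_count(label)} of {TOTAL_FAILURES}")
print(f"collapse A lumps {lumped_pct}% together; collapse B misroutes {misrouted_pct}%")
```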

The heuristics that drive R12

R12 is not a model. It is a deterministic decision tree with seven inputs and three outputs. The inputs:

Failure category from the runner. pytest, Jest, Vitest, and Go test all expose category info (assertion failure, collection error, timeout, missing fixture, import error). The category is the strongest single signal.
Recency of the test file vs the source file. If the source was edited in this turn and the test is older than 30 days, real_bug is more likely. If the test was edited in this turn and the source is older, test_bug is more likely.
Recency of the assertion line specifically. Within a test file, the specific assertion that failed might be old (likely real_bug) or might have been touched in the last few turns (more likely test_bug if the source was not touched).
Match between failure message and recent source changes. If the agent just renamed a function and the test calls the old name, the AttributeError message will contain the old name. This is a strong test_bug signal.
Environment signature. Missing module, address-already-in-use, network unreachable, disk full. These map straight to environment.
Runner exit code beyond pass/fail. pytest distinguishes 0/1/2/3/4/5 (passed/failed/interrupted/internal/usage/no-tests). Codes 3+ are almost always environment.
Surrounding context: same test failing intermittently across runs. Intermittency that does not correlate with source changes is environment (flake). Intermittency that correlates with the source touched this turn is real_bug (race condition introduced by the edit).
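Two of these inputs, the environment signature and the runner exit code, are simple enough to sketch. The code below is an illustration under assumptions, not R12's implementation: the signature list is a hypothetical and deliberately short one, and the exit codes are mirrored in a local enum so the snippet has no dependency (pytest ships the same values as pytest.ExitCode).

```python
import re
from enum import IntEnum
from typing import Optional


class PytestExit(IntEnum):
    """Local mirror of pytest's documented exit codes."""
    OK = 0
    TESTS_FAILED = 1
    INTERRUPTED = 2
    INTERNAL_ERROR = 3
    USAGE_ERROR = 4
    NO_TESTS_COLLECTED = 5


# Hypothetical signature list; a real one would be longer and runner-specific.
ENVIRONMENT_SIGNATURES = [
    re.compile(r"No module named", re.IGNORECASE),
    re.compile(r"Address already in use", re.IGNORECASE),
    re.compile(r"Network is unreachable", re.IGNORECASE),
    re.compile(r"No space left on device", re.IGNORECASE),
]


def environment_hint(exit_code: int, failure_text: str) -> Optional[str]:
    """Return a reason string if the failure looks environmental, else None."""
    # Exit codes 3 and above mean the runner itself had trouble.
    if exit_code >= PytestExit.INTERNAL_ERROR:
        return f"runner exit code {exit_code}"
    for pattern in ENVIRONMENT_SIGNATURES:
        if pattern.search(failure_text):
            return f"matched signature: {pattern.pattern}"
    return None
```

A None result does not mean real_bug or test_bug. It only means these two inputs had nothing to say and the remaining inputs decide.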

The tree combines these inputs and emits one of the three labels plus a confidence score. Labels with confidence below 0.7 are downgraded to “uncertain” and surfaced to the human rather than acted on by the agent. The threshold was tuned empirically against our March 2026 evaluation set.
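The downgrade step is the simplest part of the tree to show. A minimal sketch, assuming the tree has already produced a label and a confidence: the Verdict type and the finalize function are hypothetical names, and only the 0.7 threshold comes from the text above.

```python
from dataclasses import dataclass

CONFIDENCE_THRESHOLD = 0.7  # threshold stated in the post


@dataclass(frozen=True)
class Verdict:
    label: str         # "real_bug", "test_bug", "environment", or "uncertain"
    confidence: float
    downgraded: bool   # True when the label was replaced by "uncertain"


def finalize(label: str, confidence: float) -> Verdict:
    """Downgrade low-confidence labels so the agent does not act on them."""
    if confidence < CONFIDENCE_THRESHOLD:
        return Verdict("uncertain", confidence, downgraded=True)
    return Verdict(label, confidence, downgraded=False)
```

With this comparison a confidence of exactly 0.7 keeps its label, and only values strictly below the threshold are surfaced to the human as “uncertain”.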

What the agent does with the labels

The structured emission from tailtest’s PostToolUse hook looks like this:

[tailtest] tests:passed=14 failed=3
[tailtest] real_bug: tests/test_pricing.py::test_discount_negative
[tailtest] test_bug: tests/test_cart.py::test_add_item (assertion references renamed field)
[tailtest] environment: tests/test_db.py::test_connect (postgres not reachable)

The agent’s prompt template includes a small rule: “When you see real_bug, fix the source. When you see test_bug, fix the test. When you see environment, do not edit either; surface the issue to the user.” We did not have to fine-tune the model to follow this. The label is explicit enough that the standard Claude / GPT-4 class model follows the routing reliably (98.6 percent in our measurements across 2,100 labeled-output turns).
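For anyone consuming the emission outside the agent prompt, the lines are regular enough to parse. The sketch below assumes the line format is exactly what the sample above shows. The regex, the Failure tuple, and the parse_emission function are illustrative names, not part of tailtest.

```python
import re
from typing import List, NamedTuple, Optional

# Matches "[tailtest] <label>: <test id> (<optional note>)".
LINE_RE = re.compile(
    r"^\[tailtest\] (?P<label>real_bug|test_bug|environment): "
    r"(?P<test_id>\S+)(?: \((?P<note>.*)\))?$"
)

# Routing rule paraphrased from the prompt template quoted above.
ROUTING = {
    "real_bug": "fix the source",
    "test_bug": "fix the test",
    "environment": "do not edit; surface to the user",
}


class Failure(NamedTuple):
    label: str
    test_id: str
    note: Optional[str]
    action: str


def parse_emission(text: str) -> List[Failure]:
    """Turn labeled emission lines into (label, test id, note, action) records."""
    failures = []
    for line in text.splitlines():
        match = LINE_RE.match(line.strip())
        if match:  # the summary line carries no label, so it is skipped
            label = match["label"]
            failures.append(
                Failure(label, match["test_id"], match["note"], ROUTING[label])
            )
    return failures
```

Run against the four sample lines, this yields three records, one per failed test, each already paired with its routed action.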

What the agent does not get is the raw test failure text by default. The unstructured text would tempt the agent to interpret the failure freely and ignore the label. We pass the structured emission first, with the raw text available behind a “get_full_failure_text” tool call if the agent asks. This is borrowed from the principle that constrained interfaces produce more reliable agent behavior than firehose interfaces.

Where R12 is brittle

I want to be honest about the failure modes.

R12 struggles with flake. To the heuristic, a test that fails 1 in 50 runs because of a race condition the source genuinely has looks identical to a test that fails 1 in 50 runs because the runner is overloaded. We split these by re-running the failing test in isolation; if the failure reproduces, real_bug; if it does not, environment. The re-run costs latency. Our default is to re-run once on uncertain-flake.
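The re-run split can be sketched as a small function. This is an illustration, not the shipped rule: run_in_isolation is a hypothetical callable that runs one test by id and returns True when it passes, which keeps the sketch independent of any particular runner.

```python
from typing import Callable


def classify_uncertain_flake(
    test_id: str,
    run_in_isolation: Callable[[str], bool],
    reruns: int = 1,  # the default described above: re-run once
) -> str:
    """Re-run a suspected flake alone; a reproduced failure counts as real_bug."""
    for _ in range(reruns):
        if not run_in_isolation(test_id):
            return "real_bug"  # failed again without the rest of the suite
    return "environment"  # did not reproduce; treated as runner noise
```

With pytest, the callable could shell out to the runner with a single node id such as tests/test_db.py::test_connect and check the exit code. Raising reruns trades more latency for a better chance of reproducing a rare race.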

R12 struggles with partial refactors. If the agent renamed a function in one file and forgot to rename it in three other files, the test failures look like test_bug (the test references a name that no longer exists) but the right label is real_bug (the action is to finish the refactor in the other files). Our heuristic for this: if the same name-not-found error appears in 3+ test files in one turn, escalate to real_bug regardless of recency signals.
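A minimal sketch of that escalation check, assuming Python-style error messages. The regex and function name are hypothetical, and only the threshold of three test files comes from the heuristic above.

```python
import re
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple

# Hypothetical pattern covering two common Python name-not-found messages.
MISSING_NAME_RE = re.compile(
    r"has no attribute '(?P<attr>\w+)'|name '(?P<name>\w+)' is not defined"
)

ESCALATION_FILE_COUNT = 3  # "3+ test files in one turn"


def partial_refactor_names(failures: Iterable[Tuple[str, str]]) -> Set[str]:
    """Return missing names that appear in 3+ distinct test files.

    failures holds (test_file, failure_message) pairs from a single turn.
    """
    files_by_name: Dict[str, Set[str]] = defaultdict(set)
    for test_file, message in failures:
        match = MISSING_NAME_RE.search(message)
        if match:
            missing = match.group("attr") or match.group("name")
            files_by_name[missing].add(test_file)
    return {
        name
        for name, files in files_by_name.items()
        if len(files) >= ESCALATION_FILE_COUNT
    }
```

Any name in the returned set would be escalated to real_bug regardless of what the recency signals say.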

R12 struggles with deep mocking. Tests that mock the third level of a dependency tree are fragile in ways that are hard to distinguish from real failures. We do not have a good heuristic here. We rely on the confidence threshold to surface these to the human.

R12 does not currently handle multi-file integration failures well. A test that fails because two unrelated edits in two unrelated files interact badly is hard to label cleanly. We default to real_bug as the safe choice, log these failures for human review, and use them to improve the integration-level rules in the Stop hook.

How this connects to the maturity ladder

Shridip’s 5 levels of AI testing maturity puts failure classification at Level 4. The reason is exactly this post: Level 3 (hooks fire on every edit) produces enough test signal that a human cannot triage all of it. Without classification, the team drowns. Tests that fail get disabled to keep CI green. Disabled tests eventually disappear. The hook layer’s gains regress.

R12 is the rule that keeps Level 3’s gains stable. It does not produce the gains itself; the hook layer produces them. R12 prevents the regression.

This is also why R12 lives in the open source core, not in a hosted tier. Without it, hook-based testing slowly fails for non-trivial teams. We could not in good conscience ship the hook layer and put classification behind a paywall.

Concrete example: a test_bug we caught last week

To make this concrete, an example from our own dogfood logs from May 19. Nikhil edited a function in tailtest/core/runner.py that constructed a runner config. The test in tests/test_runner.py had been written six months earlier and asserted that the returned config had a field called runner_path. Nikhil’s edit renamed that field to runner_command.

R15 had generated a test in tests/adversarial/test_runner_adv.py that asserted on runner_command. The original test still asserted on runner_path. Both ran. The R15 one passed; the original failed.

R12 inputs for the failure:

Failure category: AttributeError (field not found)
Recency: source edited this turn, test 184 days old
Failure message: contains the old field name
Match: agent just renamed runner_path to runner_command per the diff

R12 output: test_bug, confidence 0.94.

The agent read the label, opened the test file, updated the assertion from runner_path to runner_command, re-ran. The test passed. Total turn cost: 12 seconds. Without R12 the agent might have read the failure raw and rolled back the rename, undoing a deliberate API improvement. We have seen this happen on other codebases.
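The “Match” input in this example is the one that does most of the work, and it can be sketched in a few lines. This is an illustration, not R12's code: the rename map is assumed to have been extracted from the turn's diff already, and the RunnerConfig class name in the usage is a hypothetical stand-in for whatever type the real error message named.

```python
import re
from typing import Dict, Optional, Tuple


def rename_match(
    failure_message: str, renames: Dict[str, str]
) -> Optional[Tuple[str, str]]:
    """Return (old_name, new_name) if the failure mentions a just-renamed name.

    renames maps old identifiers to new ones, taken from this turn's diff.
    """
    for old_name, new_name in renames.items():
        # Word boundaries avoid matching the old name inside a longer identifier.
        if re.search(rf"\b{re.escape(old_name)}\b", failure_message):
            return (old_name, new_name)
    return None


# Usage with the rename from the example above (message text is illustrative).
message = "AttributeError: 'RunnerConfig' object has no attribute 'runner_path'"
hit = rename_match(message, {"runner_path": "runner_command"})
```

A hit says the failure names something the agent itself just renamed, which is the strong test_bug signal described earlier. It also hands the agent the replacement name to use in the stale assertion.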

Why this is fundamentally an AI-coding-era problem

Pre-AI, test_bug situations were rare. Humans renamed a field intentionally, knew the tests referenced the old name, and fixed both at once. The rename and the test update sat in the same commit. The failure mode I described above did not exist because the human held both files in their head.

AI agents do not hold both files in their head. They edit the file they were told to edit. The test file is somewhere else. If the agent does not read the test file in this turn, the rename leaks. PostToolUse catches the leak. R12 tells the agent the leak is a test_bug, not a real_bug. The agent updates the test.

This entire loop is invisible in human-written code. It is mandatory in agent-written code. The economics of test_bug as a class of failure are new, which is why the rule had to be new.

Where to read more

The agent edits platform page describes where R12 plugs in. The hook-based testing explained post covers the broader hook architecture. The R15 adversarial mode post covers the rule R12 most often classifies the output of.

FAQ

What does test_bug mean in AI test failure classification?

A test_bug failure means the test itself is wrong, not the code under test. Stale assertion, renamed field, outdated mock. The right next action is to fix the test, not the source code.

Can R12 be wrong?

Yes. R12 is a heuristic decision tree with a confidence score. Labels below 0.7 confidence are downgraded to “uncertain” and surfaced for human review. Empirically R12 is correct around 92 percent of the time at the 0.7 threshold.

What happens when R12 returns “environment”?

The agent does not edit either the test or the source. The failure is logged and surfaced to the user. Environment failures usually need human intervention (start a database, free up memory, configure a missing service).

Why not use an LLM as the classifier?

We tried. LLM-based classification was less consistent than the deterministic decision tree (around 84 percent vs 92 percent in matched conditions) and more expensive (3.2x the API cost). The decision tree has hand-tuned heuristics that capture the structure better than a prompt does.

Does R12 work outside Python?

Yes. The category-from-runner step varies per runner (pytest vs Jest vs Vitest vs Go test), but the decision tree is language-agnostic. The classification labels are the same regardless of runner.