R15 adversarial mode: 8 edge cases AI agents miss
Adversarial test generation against 47 OSS Python repos found 16 real bugs across 8 categories of edge cases AI agents systematically miss. The full taxonomy.
Adversarial test generation against AI-written code is the only test mode where the unit economics line up. A human writing eight categories of edge-case tests for every function would be slow and expensive. An LLM doing the same takes three minutes and a small amount of API quota. R15 is tailtest’s rule for that pass. This post walks through the eight edge-case categories R15 generates against, why those eight, and the empirical data behind picking them.
I’m Pallavi. I work on the rules engine and the empirical evaluation set that backs it. The eight categories below are not theoretical. They came out of running adversarial test passes against 47 open-source Python repositories between November 2025 and February 2026 and tracking which edge case categories produced real, confirmable, file-able bugs. We found 16. The categories below are where the 16 fell.
What adversarial test generation actually means
The phrase gets used loosely. In tailtest, adversarial test generation has a specific shape: after a file edit, the runner prompts the configured LLM to write tests against the just-edited code with explicit instruction to try to break it. The prompt template lives in tailtest/rules/r15/prompt.md. It does not say “write tests.” It says “write tests that the implementation is most likely to fail.”
The distinction matters. A standard test prompt produces tests that resemble the happy path of the implementation, because the LLM pattern-matches on the code structure. An adversarial prompt produces tests that resemble what a senior reviewer would ask about: boundary inputs, malformed payloads, race windows. The output is a test file with 8 to 20 cases per function, partitioned across the categories below.
The cost is meaningful but not prohibitive. R15 adds roughly 2.4x the API token cost of standard test generation in our measurements (because the prompt is longer and the output is more cases). It is gated by a per-session budget so you do not spend it on every minor edit. For functions that survive an R15 pass without a real bug surfacing, the resulting test file becomes the regression baseline.
The eight categories AI agents systematically miss
We started with a longer list. Empirical filtering brought it to eight. The criterion was whether running adversarial tests in a category produced at least one confirmable bug across the 47-repo evaluation set. Categories that produced zero bugs got cut. Categories that produced bugs only in code we wrote ourselves also got cut (selection effect). What remained is below.
1. Boundary inputs at extreme values
The pattern: code that operates on numbers, strings, or collections silently breaks when handed MAX_INT, MIN_INT, the empty string, a string of length 10 million, or a deeply nested data structure. The agent wrote the code against test fixtures of normal size and shape. The boundary cases were never in the agent’s context window.
Real bug found: a JSON parsing utility in a popular ETL library accepted a value of 2**63 - 1 without issue but silently truncated to 0 at 2**63. The agent had imported a C-backed parser without checking its overflow semantics. R15 generated the test that pinned the exact boundary.
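A minimal sketch of what a boundary test in this category can look like. The `parse_int_field` wrapper and its range check are hypothetical stand-ins, not the ETL library's code; the point is only that the test probes both sides of each signed 64-bit limit.

```python
import json

INT64_MAX = 2**63 - 1
INT64_MIN = -(2**63)


def parse_int_field(payload: str, key: str) -> int:
    """Hypothetical wrapper: parse JSON and reject integers outside int64."""
    value = json.loads(payload)[key]
    if isinstance(value, bool) or not isinstance(value, int):
        raise TypeError(f"{key!r} is not an integer")
    if not INT64_MIN <= value <= INT64_MAX:
        # Fail loudly instead of letting a fixed-width backend wrap or truncate.
        raise OverflowError(f"{key!r} does not fit in a signed 64-bit integer")
    return value


def boundary_cases() -> list[int]:
    """Values on and just past each edge, plus the usual small integers."""
    return [INT64_MIN - 1, INT64_MIN, -1, 0, 1, INT64_MAX, INT64_MAX + 1]


def run_boundary_checks() -> dict:
    """Map each boundary value to the parsed result or the string 'overflow'."""
    results = {}
    for n in boundary_cases():
        payload = json.dumps({"n": n})
        try:
            results[n] = parse_int_field(payload, "n")
        except OverflowError:
            results[n] = "overflow"
    return results
```

The useful part is the pair of adjacent values around each limit: a test that checks only 2**63 - 1 passes against the buggy parser described above.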
2. Format injection paths
The agent writes the happy path. The SQL-injection-resistant version, the shell-escape-aware version, the HTML-encoded version, are all “follow up” work the agent forgets to do. In code that constructs queries, commands, or markup from user input, this is consistently where the bugs live.
Real bug found: a CLI tool that built a shell command from a user-provided filename. The filename foo; rm -rf / would have done what you think. R15’s format injection category produced the test in one pass.
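A hedged sketch of the shape of that test, using only the standard library. The function names and the choice of `cat` are illustrative assumptions, not the CLI tool's code. Nothing here is executed by a shell; the checks look only at how the command would be tokenized.

```python
import shlex

HOSTILE = "foo; rm -rf /"


def unsafe_command(filename: str) -> str:
    """Interpolates the filename straight into a shell string."""
    return f"cat {filename}"


def safe_command(filename: str) -> str:
    """Quotes the filename so it stays a single shell word."""
    return f"cat -- {shlex.quote(filename)}"


def safe_argv(filename: str) -> list[str]:
    """Preferred form: an argv list for subprocess.run with shell=False."""
    return ["cat", "--", filename]


def tokens(command: str) -> list[str]:
    # shlex.split only approximates shell word-splitting. A real shell would
    # also treat the unquoted ';' as a command separator.
    return shlex.split(command)
```

With the hostile filename, `tokens(unsafe_command(HOSTILE))` contains a separate `rm` word, while `tokens(safe_command(HOSTILE))` keeps the whole filename as one argument. Passing `safe_argv(...)` without a shell avoids the question entirely.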
3. Type confusion under loose typing
Python especially, but also TypeScript with any. The agent assumes the duck typing it has most often seen. Calling .lower() on what might be None. Calling len() on what might be an integer. The static type checker either is not configured or accepts the broader signature.
Real bug found: a config loader that called .strip() on a value the agent assumed was a string. The actual config sometimes provided an integer (because YAML parses unquoted numbers as numbers). R15 enumerated the type mismatch and produced the failing case.
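The enumeration step can be sketched as follows. The sample values stand in for what a YAML loader might hand back for unquoted scalars; `normalize_unsafe` and `normalize` are hypothetical, not the config loader's functions.

```python
# Values a config parser could plausibly return for a single key.
SAMPLES = [" api ", 8080, 3.5, True, None, ["a"]]


def normalize_unsafe(value):
    """Assumes every config value is a string."""
    return value.strip()


def normalize(value) -> str:
    """Coerces to text first, treating a missing value as empty."""
    if value is None:
        return ""
    return str(value).strip()


def type_mismatch_report(fn, samples) -> list[str]:
    """Return the type names for which fn raises AttributeError."""
    failures = []
    for sample in samples:
        try:
            fn(sample)
        except AttributeError:
            failures.append(type(sample).__name__)
    return failures
```

Running the report against `normalize_unsafe` lists every non-string type in the samples; against `normalize` it comes back empty. Whether silent coercion or an explicit error is the right fix depends on the config schema.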
4. Concurrent state and race windows
Any shared state across threads, processes, or asyncio tasks is a coin flip. The agent writes code that “looks right sequentially” and adds nothing about ordering. Race conditions appear under load and disappear under debugging.
Real bug found: a singleton cache initializer in a Django middleware that was not thread-safe. R15’s concurrency category produced a threading.Barrier-based test that forced the race and reproduced the duplicate initialization.
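A minimal sketch of a Barrier-based race test, assuming a hypothetical `LazyCache` rather than the actual Django middleware. The barrier releases all workers at once, and a short sleep inside the initializer widens the window between the check and the assignment.

```python
import threading
import time


class LazyCache:
    """Hypothetical lazily initialized singleton, with or without a lock."""

    def __init__(self, use_lock: bool):
        self._use_lock = use_lock
        self._init_lock = threading.Lock()
        self._count_lock = threading.Lock()
        self._value = None
        self.init_calls = 0

    def _build(self) -> dict:
        with self._count_lock:
            self.init_calls += 1
        time.sleep(0.05)  # widen the check-then-set window
        return {}

    def get(self) -> dict:
        if self._use_lock:
            with self._init_lock:
                if self._value is None:
                    self._value = self._build()
                return self._value
        # Unsafe path: several threads can pass this check together.
        if self._value is None:
            self._value = self._build()
        return self._value


def count_initializations(use_lock: bool, workers: int = 8) -> int:
    """Start all workers at the same instant and count initializer calls."""
    cache = LazyCache(use_lock)
    barrier = threading.Barrier(workers)

    def worker():
        barrier.wait()  # every thread reaches get() together
        cache.get()

    threads = [threading.Thread(target=worker) for _ in range(workers)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return cache.init_calls
```

The locked variant initializes exactly once. The unlocked variant initializes more than once in practice because of the sleep, though a sleep-based window is a strong nudge rather than a strict guarantee.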
5. Time, locale, and timezone edges
DST transitions. Leap years. Negative UTC offsets. UTC offsets that include minutes (yes, those exist; Asia/Kolkata is UTC+5:30). Code that does date arithmetic without explicit awareness of these. AI agents have read about them, can name them, and consistently produce code that ignores them.
Real bug found: a scheduler library that computed “next run time” by adding timedelta(days=1). Across a DST transition this drifts by an hour. The library’s tests were all run in a timezone-naive test harness that never hit the bug. R15 forced a freezegun context around the transition and surfaced the drift.
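One way this kind of drift can arise, sketched with the standard library's zoneinfo instead of freezegun. This is an assumption about the failure shape, not the scheduler's code: the run time is kept in UTC, 24 elapsed hours are added, and the wall-clock time shifts when the result is converted back across the US spring-forward on 2026-03-08. It needs system time zone data to be available.

```python
from datetime import datetime, timedelta, timezone
from zoneinfo import ZoneInfo

NEW_YORK = ZoneInfo("America/New_York")


def next_run_elapsed(last_run: datetime) -> datetime:
    """Add 24 elapsed hours in UTC, then convert back to the local zone."""
    in_utc = last_run.astimezone(timezone.utc) + timedelta(days=1)
    return in_utc.astimezone(last_run.tzinfo)


def next_run_wall_clock(last_run: datetime) -> datetime:
    """Keep the same local wall-clock time on the next calendar day."""
    # Arithmetic on an aware datetime moves the wall-clock fields and leaves
    # tzinfo attached, so the UTC offset is re-derived for the new date.
    return last_run + timedelta(days=1)


# 09:00 local on the day before DST starts in New York.
LAST_RUN = datetime(2026, 3, 7, 9, 0, tzinfo=NEW_YORK)
```

Here `next_run_elapsed(LAST_RUN)` lands at 10:00 local and `next_run_wall_clock(LAST_RUN)` stays at 09:00. Which of the two is correct depends on whether the schedule means "every 24 hours" or "every day at nine".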
6. Partial and mid-call failures
Network mid-call failures. Disk-full conditions. EINTR. The success path gets written. A try/except block sometimes appears. The timing of the failure (during the read, during the write, during the commit) is almost never tested.
Real bug found: an HTTP retry helper that retried on connection failure but not on a partial-read failure mid-response, leaving the caller with a truncated body. R15 generated a test that used a streaming mock to fail at byte 17.
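A sketch of a streaming mock that fails at byte 17, with a retry helper that discards the partial body and starts over. All names here are hypothetical and the helper is deliberately simplified; it is not the library's retry logic.

```python
import io

BODY = bytes(range(40))


class FailingStream:
    """Hypothetical stream that raises once `fail_at` bytes have been read."""

    def __init__(self, body: bytes, fail_at: int):
        self._buffer = io.BytesIO(body)
        self._fail_at = fail_at
        self._served = 0

    def read(self, size: int) -> bytes:
        if self._served >= self._fail_at:
            raise ConnectionError("connection dropped mid-body")
        chunk = self._buffer.read(min(size, self._fail_at - self._served))
        self._served += len(chunk)
        return chunk


def read_body(stream, chunk_size: int = 8) -> bytes:
    """Read until the stream returns an empty chunk."""
    parts = []
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            return b"".join(parts)
        parts.append(chunk)


def fetch_with_retry(open_stream, attempts: int = 3) -> bytes:
    """Retry the whole read if the stream fails partway (attempts >= 1)."""
    last_error = None
    for _ in range(attempts):
        try:
            return read_body(open_stream())
        except ConnectionError as exc:
            last_error = exc  # drop the partial body and start again
    raise last_error


def make_opener():
    """First call fails at byte 17, later calls return a healthy stream."""
    calls = {"count": 0}

    def open_stream():
        calls["count"] += 1
        if calls["count"] == 1:
            return FailingStream(BODY, fail_at=17)
        return io.BytesIO(BODY)

    return open_stream, calls
```

The assertion that matters is on the returned body, not on the retry count: a helper that swallows the mid-read error returns 17 bytes and still looks successful.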
7. Resource exhaustion
Deeply nested input. Very large input. Many concurrent file descriptors. The agent’s code allocates without bound because the test fixtures it pattern-matched on were small.
Real bug found: a YAML parser wrapper that recursed without depth limit. A 10,000-level-deep YAML document blew the stack. R15 generated the depth-stress test that pinned the maximum the implementation could safely handle.
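A depth-stress test can be sketched without a YAML dependency by using nested lists as a stand-in for a parsed document. The two walkers below are hypothetical: one recurses without a bound, the other uses an explicit stack and a configurable depth limit.

```python
class TooDeep(ValueError):
    """Raised when input nesting exceeds the configured limit."""


def nest(depth: int):
    """Build a list nested `depth` levels deep, without recursion."""
    node = "leaf"
    for _ in range(depth):
        node = [node]
    return node


def depth_recursive(node) -> int:
    """Unbounded recursion: one Python frame per nesting level."""
    if not isinstance(node, list):
        return 0
    best = 0
    for child in node:
        best = max(best, depth_recursive(child))
    return 1 + best


def depth_bounded(node, max_depth: int = 1000) -> int:
    """Iterative walk that rejects input nested deeper than max_depth."""
    deepest = 0
    stack = [(node, 0)]
    while stack:
        current, level = stack.pop()
        if isinstance(current, list):
            level += 1
            if level > max_depth:
                raise TooDeep(f"nesting exceeds {max_depth} levels")
            deepest = max(deepest, level)
            stack.extend((child, level) for child in current)
    return deepest


def outcome(fn, document) -> str:
    """Classify how a walker handles a document."""
    try:
        fn(document)
        return "ok"
    except RecursionError:
        return "stack exhausted"
    except TooDeep:
        return "rejected"
```

Under CPython's default recursion limit, `outcome(depth_recursive, nest(10_000))` reports stack exhaustion, while the bounded walker rejects the same input with a clear error and handles moderate depths normally.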
8. Off-by-one in iteration boundaries
Fence-post errors. Especially in pagination, date math, and slicing. The agent writes for i in range(n) correctly most of the time, but gets the shifted variants, for i in range(1, n) or range(0, n+1), wrong some of the time.
Real bug found: a pagination helper that skipped the first item on every page after the first. The off-by-one was in the offset = page * page_size calculation, which should have been (page - 1) * page_size given the helper’s 1-indexed convention. The agent wrote the 0-indexed version. R15’s pagination test surfaced the missing row.
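A sketch of the kind of pagination test that catches this class of error. The helper names are hypothetical. The test walks every page under a 1-indexed convention and compares what was collected against the full list. In this simplified slicing model the 0-indexed formula drops the first page_size rows; which rows go missing in a real helper depends on how it uses the offset.

```python
from math import ceil


def offset_zero_indexed(page: int, page_size: int) -> int:
    """Correct only if the first page is numbered 0."""
    return page * page_size


def offset_one_indexed(page: int, page_size: int) -> int:
    """Correct when the first page is numbered 1."""
    if page < 1:
        raise ValueError("pages are 1-indexed")
    return (page - 1) * page_size


def paginate(items, page: int, page_size: int, offset_fn):
    """Return one page of items using the supplied offset formula."""
    start = offset_fn(page, page_size)
    return items[start:start + page_size]


def collect_all(items, page_size: int, offset_fn) -> list:
    """Request pages 1..N and concatenate whatever comes back."""
    total_pages = ceil(len(items) / page_size)
    seen = []
    for page in range(1, total_pages + 1):
        seen.extend(paginate(items, page, page_size, offset_fn))
    return seen


def missing_rows(items, page_size: int, offset_fn) -> list:
    """Rows that no page returned."""
    seen = set(collect_all(items, page_size, offset_fn))
    return [item for item in items if item not in seen]
```

Checking the union of all pages against the source list is what makes the test adversarial: a per-page length check passes for both formulas on the middle pages.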
How we got from 8 categories to 16 bugs
The 47-repo evaluation set was selected for properties that matter for AI testing tools: active Python projects, real users, code complex enough that adversarial generation has something to find. Repos were selected from a stratified sample of GitHub’s most-starred Python projects in domains we care about (web frameworks, data tooling, CLI tools, ML utility libraries). We excluded testing tools themselves, because those have unusual test surface bias.
For each repo we ran tailtest with R15 enabled at standard depth for two days of background work, then filed the bugs that R15 found as PRs or issues upstream. Sixteen of those were confirmed by maintainers as real bugs and either accepted as patches or acknowledged for fix. The category distribution:
- Boundary inputs: 4 bugs
- Format injection: 2 bugs
- Type confusion: 3 bugs
- Concurrent state: 1 bug
- Time and locale: 2 bugs
- Partial failures: 2 bugs
- Resource exhaustion: 1 bug
- Off-by-one iteration: 1 bug
The distribution is not uniform. Boundary inputs dominate because they are the easiest category to generate plausible tests for. Concurrent state is under-represented because reproducing concurrency bugs deterministically is hard even for an adversarial pass. The distribution also reflects the language: Python’s duck typing is why type confusion ranks second, with three bugs. A similar evaluation across a Rust corpus would not produce the same shape.
You can see the full case-study writeups on the case studies page. Each one links the PR or issue upstream, so the claim is verifiable.
How to think about R15 in your own workflow
Adversarial test generation is not free. It costs API tokens and it costs CI time. The pattern that works for our own development is:
- Run R15 at standard depth on changed files in the per-edit hook. This catches the obvious cases at edit time.
- Run R15 at thorough depth as a nightly scheduled job over the diff since the previous night. This catches the cases that need more samples per function.
- Treat the R15-generated test files as part of the test surface. They are checked in, reviewed, and live in tests/adversarial/ next to the regular suite.
The point is not to run adversarial generation on every function in your codebase forever. The point is to run it consistently on new code and on code that just changed, where the bug-introduction probability is highest. The 16 bugs we filed were all in code that had been actively edited in the previous month.
If you are evaluating tailtest, the R15 pass is the part that surprises people the most. Most testing tools do happy-path generation by default and call it good. R15 starts from “what is the agent most likely to have gotten wrong” and works backwards.
Where to read more
The “why testing AI-generated code is different” essay covers the underlying argument for why edge cases in AI-written code are predictably weak. The “platform agent edits” page shows the runtime where R15 plugs in. The case studies page lists the 16 bugs by repo, category, and outcome.
FAQ
What does R15 mean in tailtest’s rules system?
The R-series is tailtest’s internal rule numbering. R15 is the fifteenth rule we shipped and corresponds to adversarial test generation. R12 is failure classification. R1 through R14 cover the standard test-generation modes.
Why eight edge case categories specifically?
The eight came from empirical filtering. We started with a longer list of candidate categories and kept the ones that produced at least one confirmable bug across the 47-repo evaluation. Categories that produced zero confirmable bugs got cut.
Does R15 work on TypeScript and Go?
Yes. The category list is language-agnostic. The prompt templates vary slightly per language (type confusion looks different in TypeScript than in Python; Go’s concurrency model needs different test shapes). The TypeScript and Go rule shapes are in tailtest/rules/r15/.
How much API quota does R15 consume?
Roughly 2.4x a standard test generation pass per function in our measurements. The per-session budget gates this. The default budget allows R15 on the most recently edited files and on a sampled basis for the rest.
Are the generated tests checked into the repo?
We recommend yes. R15-generated tests in tests/adversarial/ are part of the regression surface and get re-run as the code evolves. The alternative (regenerating from scratch each time) costs more and loses the signal that comes from a stable test baseline.