From 47 OSS repos to 16 real bugs: testing Python with AI

We ran tailtest’s adversarial test generation against 47 open source Python repositories over four months, filed 23 candidate bugs upstream, and had 16 of them confirmed by maintainers. The result, on its own, is a small dataset. The shape of the result is what matters: which categories the bugs fell into, which categories produced nothing, what the false-positive rate looked like, and what it suggests about AI-driven bug discovery in OSS as a class of activity in 2026.

This post is the public version of the work. The full per-bug writeups are on the case studies page; this is the meta-analysis. If you want to know whether tailtest is real or marketing, the meta-analysis is the right level to look at.

Why we did this

A testing tool that has never caught a real bug in code its creators did not write is a testing tool no one should trust. The hardest version of “does it work” is “does it work against codebases your team has never touched, written by maintainers who have already shipped the code through their own tests and reviews.” Adversarial testing in that setting is the actual proof.

The 47 repos were chosen to be plausible targets. Not toy repos. Not abandoned ones. Active Python projects with real users, complex enough that adversarial generation has something to find. Stratified across web frameworks, data tooling, CLI tools, and machine learning utility libraries. We excluded testing tools themselves, because our adversarial passes against testing infrastructure have a selection bias that overstates results.

The full repo list and methodology are documented on the case studies page. The names you would recognize include some popular FastAPI ecosystem libraries, a few Pandas extensions, a couple of CLI tools with 10k+ stars, and some long-tail data utility packages. We did not file against repos with active deprecation notices or repos whose maintainers had publicly opted out of automated bug reports.

Methodology

For each repo:

1. Clone fresh. Install the development dependencies. Confirm the existing test suite passes.
2. Install tailtest with the Claude Code plugin. Run a 48-hour adversarial pass at standard depth across the source tree, partitioned by package.
3. Collect every test generated by R15 (tailtest’s adversarial-mode rule) that failed against the existing source.
4. For each failure, manually verify: is this a real bug, a test_bug (R15 generated a wrong test), or an environment problem?
5. For confirmed real bugs, prepare a minimal reproducer in the upstream issue style and file via the maintainer’s preferred channel (PR or issue).
6. Track outcomes: accepted, acknowledged, rejected, no response after 30 days.
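The verification and tracking steps above amount to a small record per failing test. A minimal sketch of that record follows, assuming nothing about tailtest's internal format: the verdict and outcome labels come from the steps above, while the class names, field names, and sample repos are hypothetical.

```python
from collections import Counter
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Verdict(Enum):
    """Result of manual verification of one failing generated test."""
    REAL_BUG = "real_bug"
    TEST_BUG = "test_bug"        # the generated test itself was wrong
    ENVIRONMENT = "environment"  # failure caused by the local setup


class Outcome(Enum):
    """What happened after a real bug was filed upstream."""
    ACCEPTED = "accepted"
    ACKNOWLEDGED = "acknowledged"
    REJECTED = "rejected"
    NO_RESPONSE = "no_response"  # nothing after 30 days


@dataclass(frozen=True)
class Failure:
    repo: str
    test_id: str
    verdict: Verdict
    outcome: Optional[Outcome] = None  # only set once something was filed


def confirmed(failures):
    """Failures that a maintainer accepted or acknowledged."""
    ok = {Outcome.ACCEPTED, Outcome.ACKNOWLEDGED}
    return [f for f in failures
            if f.verdict is Verdict.REAL_BUG and f.outcome in ok]


def outcome_counts(failures):
    """Tally of upstream outcomes for everything that was filed."""
    return Counter(f.outcome for f in failures if f.outcome is not None)


# Hypothetical sample records, not data from the campaign.
sample = [
    Failure("example/repo-a", "test_empty_input", Verdict.REAL_BUG, Outcome.ACCEPTED),
    Failure("example/repo-a", "test_bad_mock", Verdict.TEST_BUG),
    Failure("example/repo-b", "test_tz_edge", Verdict.REAL_BUG, Outcome.REJECTED),
    Failure("example/repo-c", "test_missing_dep", Verdict.ENVIRONMENT),
]
```

Keeping the verdict separate from the outcome makes it possible to report "looked real to us" and "confirmed by a maintainer" as two different numbers.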

The pass produced 184 R15 test failures across the 47 repos. After manual verification, 23 looked like genuine bugs. After filing upstream, 16 were confirmed real by maintainers (accepted as patches or acknowledged for fix). 5 were rejected as “working as intended” (mostly cases where the maintainer had a documented reason for the behavior we thought was a bug). 2 had no response after 30 days. We are not counting the 2 no-responses as bugs; they may be, but we cannot claim them.

Counting the 2 no-responses as misses, 7 of the 23 filing-grade candidates (30 percent) were not confirmed, which sounds high. In our experience it is in line with what experienced human bug hunters get when filing against unfamiliar codebases. The cost of investigating a false positive is real (maintainer time) and we have tightened our filing criteria since.
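The funnel arithmetic can be checked directly. The counts below are the ones reported above; the variable names and the rounding to one decimal place are choices made for this sketch.

```python
# Counts reported for the 47-repo campaign.
r15_failures = 184     # failing generated tests
candidates = 23        # looked like genuine bugs after manual verification
confirmed_count = 16   # accepted or acknowledged by maintainers
rejected_count = 5     # "working as intended"
no_response_count = 2  # no reply after 30 days

# Every filed candidate ends up in exactly one outcome bucket.
assert confirmed_count + rejected_count + no_response_count == candidates


def pct(part, whole):
    """Percentage of `whole` represented by `part`, to one decimal place."""
    return round(100 * part / whole, 1)


candidate_rate = pct(candidates, r15_failures)                        # 12.5
confirmed_rate = pct(confirmed_count, candidates)                     # 69.6
unconfirmed_rate = pct(rejected_count + no_response_count, candidates)  # 30.4
rejected_only_rate = pct(rejected_count, candidates)                  # 21.7
```

The 30 percent figure treats the two no-responses as false positives. Counting only explicit rejections gives about 22 percent, so the two numbers bracket the rate.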

What the 16 bugs were

By category, using the R15 edge-case taxonomy:

Category	Bugs
Boundary inputs at extremes	4
Type confusion under loose typing	3
Format injection paths	2
Partial and mid-call failures	2
Time, locale, timezone edges	2
Concurrent state and race windows	1
Resource exhaustion	1
Off-by-one in iteration	1

The distribution is not uniform, and the non-uniformity is informative. Boundary inputs dominate because they are the easiest category for an LLM to generate plausible adversarial tests against (large integer, empty string, max-length string). The implementations break in this category because the original developers wrote against fixture sizes that did not include the boundary.
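As an illustration of the boundary-input category, here is a minimal sketch of a table of boundary cases. The `chunk` helper is hypothetical and is not taken from any of the 47 repos. It stands in for the kind of small utility that is usually tested only with mid-sized fixtures.

```python
def chunk(items, size):
    """Split `items` into consecutive lists of at most `size` elements."""
    if size <= 0:
        raise ValueError("size must be positive")
    return [items[i:i + size] for i in range(0, len(items), size)]


# Each row: (input list, chunk size, expected result).
BOUNDARY_CASES = [
    ([], 3, []),                           # empty input
    ([1], 1, [[1]]),                       # smallest non-empty input
    ([1, 2, 3], 3, [[1, 2, 3]]),           # length exactly equal to size
    ([1, 2, 3, 4], 3, [[1, 2, 3], [4]]),   # one element past the boundary
    ([1, 2], 10**9, [[1, 2]]),             # size far larger than the input
]

for items, size, expected in BOUNDARY_CASES:
    assert chunk(items, size) == expected, (items, size)

# Zero is a boundary too: it must be rejected, not loop or return garbage.
try:
    chunk([1, 2, 3], 0)
except ValueError:
    zero_size_rejected = True
else:
    zero_size_rejected = False
```

None of these cases is hard to write once it is listed. The point of the category is that the rows at the edges are the ones a fixture-driven suite tends to leave out.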

Type confusion ranks high because Python is dynamically typed and most of these codebases enforce types loosely at runtime. The same evaluation against a Rust corpus would show a different shape. Format injection is small because most modern Python web frameworks have defaults that prevent the most obvious cases; the bugs we found were in code that bypassed the defaults.
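A small sketch of what type confusion looks like in practice. Both functions are hypothetical and written for this example. The unchecked version accepts a string and silently returns a wrong answer. The checked version shows one possible guard and does not describe how any of the affected projects fixed their bugs.

```python
from numbers import Real


def scale_unchecked(value, factor):
    """Multiply without checking types; relies on callers passing numbers."""
    return value * factor


def scale_checked(value, factor):
    """Same operation, but non-numeric values (and bools) are rejected."""
    if isinstance(value, bool) or not isinstance(value, Real):
        raise TypeError(f"value must be a real number, got {type(value).__name__}")
    return value * factor


# A string slips through the unchecked version: str * int is repetition,
# so no exception is raised and the result is simply wrong.
silent_result = scale_unchecked("21", 2)  # "2121", not 42

# The checked version fails loudly on the same input.
try:
    scale_checked("21", 2)
except TypeError:
    rejected = True
else:
    rejected = False
```

An adversarial test for this category passes a value of the wrong type and asserts that the call either raises or returns something sensible, rather than asserting a specific number.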

Concurrency is under-represented because reproducing concurrency bugs deterministically is hard even for an adversarial pass. We have ideas for improving this in R16 (the concurrency-focused rule we are designing for Q3 2026), but as of May 2026 this is an acknowledged weakness of R15.

A representative bug, end to end

To make this concrete, one of the 16 in full.

Repo: a popular Pandas extension library for time-series financial data. About 12k stars.

The code: a utility function that returned the “previous business day” for a given date. The implementation walked back one day at a time and skipped weekends and a hard-coded list of US bank holidays. It did not handle the case where the input date itself was a holiday or weekend. The first iteration of the loop did not pre-check.

R15 generated a test that asserted: “previous business day before Monday Jan 19 2026 (MLK Day, observed) should be Friday Jan 16, not Sunday Jan 18.” The implementation returned Sunday because the first iteration did not consider the input being non-business.

We filed. The maintainer confirmed the bug, thanked us for the fixture (the failing test case shipped with the bug report), and merged a patch within four days. The patch was a two-line check at the top of the function. The full case-study writeup is on the case studies page, with the upstream PR linked.

This bug had survived two years of human review and a test suite that tested the function dozens of times. The test suite had tested it against weekdays only. R15 found the case because R15’s prompt explicitly asks “what input is the implementation most likely to fail on” rather than “write a test for this function.”

The qualitative difference matters. Human-written tests pattern-match on the implementation. Adversarial tests pattern-match on the implementation’s likely blind spots.
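The failing case from this example can be written as a short regression test. The sketch below is not the library's code or the merged patch. It is one straightforward way to write a function that satisfies the assertion, with an illustrative two-entry holiday list and hypothetical function names.

```python
from datetime import date, timedelta

# Illustrative holiday list only; the real library's list is not reproduced here.
US_BANK_HOLIDAYS = frozenset({
    date(2026, 1, 1),   # New Year's Day
    date(2026, 1, 19),  # MLK Day (third Monday of January)
})


def is_business_day(d: date) -> bool:
    """True for Monday to Friday dates that are not in the holiday list."""
    return d.weekday() < 5 and d not in US_BANK_HOLIDAYS


def previous_business_day(d: date) -> date:
    """Most recent business day strictly before `d`, whatever `d` itself is."""
    candidate = d - timedelta(days=1)
    while not is_business_day(candidate):
        candidate -= timedelta(days=1)
    return candidate


def test_input_is_a_holiday_monday():
    mlk_day = date(2026, 1, 19)
    assert not is_business_day(mlk_day)
    # Friday Jan 16, not Sunday Jan 18.
    assert previous_business_day(mlk_day) == date(2026, 1, 16)


def test_input_is_a_weekend_day():
    assert previous_business_day(date(2026, 1, 18)) == date(2026, 1, 16)


def test_input_follows_a_holiday():
    # Tuesday Jan 20 must skip the holiday Monday and the weekend.
    assert previous_business_day(date(2026, 1, 20)) == date(2026, 1, 16)


test_input_is_a_holiday_monday()
test_input_is_a_weekend_day()
test_input_follows_a_holiday()
```

The first test is the adversarial case described above. The other two cover the neighbouring inputs that a weekday-only suite would also miss.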

What the result implies for AI-driven OSS bug discovery broadly

A few observations from the 47-repo dataset.

First, the bugs are not in obscure code. The 16 bugs were in functions called by real users in real production. They had escaped review by maintainers who are good at their jobs. The bugs were not difficult once stated; they were difficult to think of stating. Adversarial generation closes that gap.

Second, the bugs cluster in eight categories, which is itself a stable result. Across the 47 repos, the categories of failure were predictable in advance. This is what made R15 possible: the categories are an enumeration of “what AI-era testing has to specifically cover,” and they remain useful even when the code under test was written by humans. Pallavi wrote up the categories in detail in the R15 adversarial mode post if you want the full breakdown.

Third, the bug-yield-per-repo curve is not linear. Some repos produced 2 or 3 bugs each. Most produced none. The distribution is power-law-ish, which suggests that repo selection matters more than the per-repo intensity of the pass. We did not run the comparison, but a 47-repo run at standard depth very likely found more than a 5-repo run at thorough depth would have. This is informative for anyone running similar campaigns: breadth beats intensity at this stage of the tooling.

Fourth, 30 percent of the filing-grade candidates we sent upstream were not confirmed by maintainers. We can tighten this with better verification heuristics (we have already cut the rate to 18 percent on the current evaluation set). The number to internalize is that AI-driven bug discovery in OSS is not noise-free; it requires human judgment to file responsibly.

Why we ran this against ourselves first

Before the OSS campaign we ran the same passes against tailtest’s own code. Five months of internal dogfooding. We found 31 bugs in our own code, fixed all of them, and only after that did we feel honest running against external repos.

We think this sequencing is the standard that any responsible AI-driven bug discovery effort has to follow. The tool’s claim to be useful is undercut if its creators have not put their own code through it first. Tailtest’s plugin test suites (1,234 tests across four hosts as of May 24) include the adversarial tests that surfaced those 31 bugs. The regression baseline is there for anyone to inspect.

What this does not prove

A few honest disclaimers.

It does not prove that tailtest catches more bugs than another tool would have. We did not run the same campaign with a competing tool’s adversarial mode as a control. Maybe Mabl, TestSprite, or hand-written Hypothesis tests would have found a similar set. The comparison would be valuable; we have not done it.

It does not prove that tailtest catches all the bugs in the 47 repos. We caught the ones the adversarial pass surfaced at standard depth in 48 hours per repo. Deeper or longer passes would likely catch more. The 16 is a lower bound on what was findable, not a complete count.

It does not prove that tailtest will catch bugs in your code. Your codebase has its own shape. Some categories will hit harder; some will not. The 47-repo run is evidence the categories generalize, not a guarantee.

What it does prove is that the eight categories are not theoretical, that R15 generates real failing tests against real code, and that maintainers confirm the bugs as real when filed. That is the claim. The rest is your judgment.

What we are doing next

The 47-repo campaign is ongoing. We are adding 10 to 15 repos per month to the rotation. The current target is 100 repos by end of 2026. Each new bug filed gets added to the case studies page.

We are also running a parallel evaluation in TypeScript (Cline and Cursor users will care). The TypeScript evaluation set is 22 repos as of May 24 and the bug count so far is 4 confirmed. That is a lower per-repo yield than Python (about 0.18 confirmed bugs per repo versus about 0.34), which is consistent with the type system catching more at compile time.

The R16 rule (concurrency-focused, beyond R15’s category 4) is in design. Race conditions are the category R15 under-finds and the one most likely to cause production incidents in long-running services.

What to do if you want to try this yourself

If you maintain an OSS Python project and you want to run tailtest against it, the install is uvx tailtest install --agent claude from the repo root. R15 is enabled by default at standard depth. Let it run overnight. Read the report at .tailtest/reports/latest.html. The categories that show up are the ones to think about first.

If you find bugs in your own code, file them as you normally would. If you find bugs you can confirm, we would love to hear about them; the case studies page accepts community-contributed case studies if you want to add yours.

The broader argument for adversarial testing against AI-generated code is in our post on why testing AI-generated code is fundamentally different. The argument for adversarial testing against human-written OSS code is the result above. The economics are the same in both cases. The cost of generating tests has dropped to near zero. The cost of not generating them has not.

The 16 bugs above are 16 fewer surprises for 16 sets of users. That, at the end, is the work.