AI Code QA in CI: Where Your Tests Actually Belong

I get a version of this question on every infra call. “We’re going to put tailtest in CI, right? That’s where our tests live.” The instinct is reasonable. For two decades the answer to “where do tests run” was “CI, with a green check on the PR.” The default is so deep that engineering leads reach for it before considering whether the AI coding era still rewards that placement.

It does not. Not as the primary catch-net. CI is still useful, but it is now the second line of defense. The first line has to live inside the build loop, at the agent’s edit boundary, before the diff has reached the PR.

I’m Nikhil, the CTO. I have spent nine months watching what happens when AI coding agents produce 5,000 to 25,000 lines of code per engineer-day and the testing layer sits one stage downstream of where the bug is born. The math does not work. The fix is not “make CI faster.” The fix is “move the discovery moment closer to the edit.”

The framing problem: CI as a habit, not an analysis

When an engineering lead adds AI tools to their team, the test conversation usually starts and ends with CI. “We have GitHub Actions running pytest. We’ll add coverage gates. We’ll make Claude write tests; CI will run them.” That plan is reasonable on its surface and incomplete underneath.

CI’s placement in the workflow was designed for the human-engineer cadence. A human writes a feature over a day. Pushes a branch. Opens a PR. CI runs. A reviewer reads it. The discovery moment is the PR. That made sense when a feature took a day, a code review took an hour, and the weekly diff volume was bounded by how fast humans could type and reason.

AI agents do not work at that cadence. An AI agent edits a file every 30 seconds. A multi-turn task touches twelve files in twenty minutes. By the time CI runs, the agent has moved on, the context window has rolled forward, and the bug is strictly downstream of where it could have been fixed cheaply.

The 2026 industry data backs this up. Lightrun’s 2026 survey found 43 percent of AI-generated changes need debugging in production. Autonoma’s vibe-coded app study found 53 percent ship with security holes. ICSE 2026’s systematic review of 101 sources called QA the single most consistently overlooked dimension of vibe coding workflows. Most of those teams had CI pipelines. The pipelines were not the layer that caught the bugs.

Four gates where bugs can be caught

A bug in AI-generated code can be caught at one of four gates. Each gate has a cost per discovery, and that cost compounds the later the bug is caught.

Gate 1: Per-edit. The agent just wrote a function. A hook fires. Tests run. The agent sees the failure inside the same turn it produced the code. The fix happens in-context, with the agent’s reasoning chain still warm. No human, no PR, no CI minutes. The marginal cost is one extra tool turn, which is roughly free.

Gate 2: Per-turn. The agent finishes a multi-file change. A coordinated test pass runs across the changed files. If something fails, the agent fixes it before handing back to the human. Slightly more expensive than per-edit, still inside the agent’s head.

Gate 3: Per-PR. The agent hands the change to a human. CI runs. A reviewer reads it. If something fails, the author has to context-switch back, re-read what the agent did three hours ago, and either fix it manually or ask the agent to fix it. The agent has to rebuild context from the diff alone. Cost: roughly one human-hour per discovered bug.

Gate 4: Per-deploy. The bug reached staging or production. Cost: an incident, a roll-forward or revert, post-mortem cycles, and a customer apology if user impact landed.

The cost curve from Gate 1 to Gate 4 is roughly exponential. A bug caught at Gate 1 costs nothing measurable. A bug caught at Gate 4 costs hundreds of dollars in engineering time at minimum.
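
To make the shape of that curve concrete, here is a small sketch of the arithmetic. The dollar figures are placeholder assumptions picked only to match the rough description above (nothing measurable at Gate 1, about one human-hour at Gate 3, hundreds of dollars at Gate 4), and the two bug distributions are invented. Substitute your own numbers.

```python
from enum import IntEnum


class Gate(IntEnum):
    PER_EDIT = 1
    PER_TURN = 2
    PER_PR = 3
    PER_DEPLOY = 4


# ASSUMPTION: illustrative cost per discovered bug, in whole dollars.
# These are placeholders, not measurements.
ASSUMED_COST = {
    Gate.PER_EDIT: 0,      # one extra tool turn, roughly free
    Gate.PER_TURN: 1,      # slightly more than per-edit
    Gate.PER_PR: 100,      # roughly one human-hour
    Gate.PER_DEPLOY: 500,  # incident, revert, post-mortem
}


def total_cost(caught_at: dict) -> int:
    """Sum the cost of a set of bugs, given how many each gate caught."""
    return sum(ASSUMED_COST[gate] * count for gate, count in caught_at.items())


# The same 20 hypothetical bugs under two placements of the primary catch-net.
ci_first = {Gate.PER_PR: 18, Gate.PER_DEPLOY: 2}
edit_first = {Gate.PER_EDIT: 14, Gate.PER_TURN: 3, Gate.PER_PR: 2, Gate.PER_DEPLOY: 1}

print(total_cost(ci_first))    # 2800
print(total_cost(edit_first))  # 703
```

The absolute values mean little. What the sketch shows is that the total is dominated by whatever reaches the last two gates.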

Which gate should be the primary catch-net? In the human-engineer era the answer was Gate 3, because edits were rare and reviewers had the bandwidth. In the AI-coding era the answer is Gate 1, because edits are constant and reviewers no longer have that bandwidth.

What CI is structurally good at

I want to be careful here because the next section is going to sound like “skip CI” and that is not the argument. CI is excellent at a specific set of jobs no other gate can do well.

CI runs in a clean, reproducible environment, which makes it the right place for build-environment validation: the change must compile and pass tests somewhere that is not the engineer’s laptop. CI has privileged access to fixtures, services, and credentials that are not safe to expose to the agent’s local context. That makes it the right place for integration tests against a real database, end-to-end smoke tests that drive a real browser, security scans that need privileged scanners (SAST, dependency CVE checks), and the deploy gate itself.

These are real bug categories. A change that passes the per-edit hook might still fail the integration test because the new code talks to a service with a different schema. A change that passes integration might still fail the security scan because of a vulnerable dependency. A change that passes both might still fail the deploy gate because of a production-only config. CI catches these, and nothing else does, because surfacing them requires infrastructure the agent does not have.

The argument is not “remove CI.” The argument is “do not pretend CI is the unit-test layer for AI code.” Confusing those two sentences is what produces the 43 percent production-debugging rate.

What the in-build test layer looks like

The first-line catch-net has to fire at Gate 1, inside the build loop, before the agent’s context window has moved on. In practice this means a hook at the agent’s event boundary: PostToolUse in Claude Code and Codex CLI, afterFileEdit in Cursor, structured MCP tool returns in Cline. We’ve shipped four plugins, one per host, sharing the same R1 through R15 rule layer underneath.
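
As a rough sketch of what a per-edit hook can look like, the functions below take a hook payload and turn the edited file into a pytest run. This is not tailtest's implementation. It assumes the payload is a JSON object carrying the edited path under tool_input.file_path, that the host feeds stderr back to the agent when the hook exits with code 2, and that tests live at tests/test_<module>.py. The function names are made up for illustration; check your host's hook documentation before relying on any of it.

```python
import subprocess
import sys
from pathlib import Path
from typing import Callable, Optional


def edited_path(payload: dict) -> Optional[Path]:
    """Pull the edited file out of a hook payload, if there is one."""
    file_path = payload.get("tool_input", {}).get("file_path")
    return Path(file_path) if file_path else None


def pytest_command_for(source: Path) -> Optional[list]:
    """Map an edited Python file to a pytest command (assumed layout)."""
    if source.suffix != ".py":
        return None
    if source.name.startswith("test_"):
        return ["pytest", "-q", str(source)]
    return ["pytest", "-q", str(Path("tests") / f"test_{source.name}")]


def run_hook(payload: dict, runner: Callable = subprocess.run) -> int:
    """Run the tests for one edit and return the hook's exit code."""
    source = edited_path(payload)
    command = pytest_command_for(source) if source else None
    if command is None:
        return 0  # nothing to test for this edit
    result = runner(command, capture_output=True, text=True)
    if result.returncode != 0:
        # ASSUMPTION: exit code 2 makes the host show stderr to the agent.
        sys.stderr.write(result.stdout + result.stderr)
        return 2
    return 0
```

A hook script would call run_hook with the JSON it reads from standard input and exit with the returned code. The runner parameter exists only so the function can be exercised without launching pytest.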

The hook fires per edit. When it fires it runs a test pass biased toward the eight categories where AI agents systematically fail. These are not random; we built them from the empirical evidence of running adversarial tests against 55 open-source Python repositories and finding 17 real bugs. Almost every bug fell into one of these buckets:

Boundary inputs (MAX_INT, MIN_INT, very large strings, deeply nested structures, unicode outside the BMP)
Format and injection (the happy path is written; the injection-resistant version is a follow-up the agent forgets)
Type confusion (especially in Python; the agent assumes whichever duck-typed shape it has seen most often)
Concurrent state (race conditions get written without the writer noticing)
Time and locale edges (DST, leap years, timezone shifts)
Partial failures (mid-network failures, disk-full, EINTR; agents handle the success path)
Resource exhaustion (deep input nesting, many concurrent file descriptors)
Off-by-one in iteration boundaries (fence-post errors, pagination, date math)
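
To make two of these buckets concrete (boundary inputs and off-by-one in pagination), here is a small sketch of the kind of check a per-edit pass might run. The function page_count and the case table are hypothetical, written for this post. They are not taken from tailtest or from any of the repositories mentioned.

```python
def page_count(total_items: int, per_page: int) -> int:
    """Number of pages needed to show total_items, per_page at a time."""
    if per_page <= 0:
        raise ValueError("per_page must be positive")
    if total_items < 0:
        raise ValueError("total_items must not be negative")
    # Integer ceiling division, so very large counts stay exact.
    return (total_items + per_page - 1) // per_page


# (total_items, per_page, expected) cases the happy path tends to skip.
CASES = [
    (0, 10, 0),          # empty collection
    (1, 10, 1),          # single item
    (10, 10, 1),         # exactly one full page
    (11, 10, 2),         # one item over the fence
    (2**63, 1, 2**63),   # past a 64-bit signed integer
]


def failing_cases() -> list:
    """Return the cases whose result differs from the expected value."""
    return [case for case in CASES if page_count(case[0], case[1]) != case[2]]
```

A naive total_items // per_page passes the third case and fails the fourth, which is exactly the fence-post error this category is meant to surface.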

That category list is the R15 adversarial rule. The R12 classification rule does the next step: when a scenario fails, an automatic labeler tags the failure as real_bug (source is wrong), test_bug (test is wrong), or environment (infra-related). The label tells the agent what to act on. Without it the agent has to triage every failure manually, which costs context tokens and slows the loop.
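
The shape of such a labeler can be sketched in a few lines. Only the three labels come from the rule described above. The heuristics, the Failure record, and the exception lists are assumptions invented for this example; they are much cruder than a production labeler and are not tailtest's R12 logic.

```python
from dataclasses import dataclass
from pathlib import PurePath


@dataclass
class Failure:
    exception_type: str  # e.g. "AssertionError"
    raised_in: str       # path of the file where the exception was raised


# ASSUMPTION: exception names that usually point at infrastructure.
ENVIRONMENT_ERRORS = {"ConnectionError", "TimeoutError", "PermissionError"}
# ASSUMPTION: names that, when raised inside a test file, suggest a broken test.
TEST_AUTHORING_ERRORS = {"NameError", "SyntaxError", "ImportError"}


def classify(failure: Failure) -> str:
    """Label a failure as environment, test_bug, or real_bug."""
    if failure.exception_type in ENVIRONMENT_ERRORS:
        return "environment"
    in_test_file = PurePath(failure.raised_in).name.startswith("test_")
    if in_test_file and failure.exception_type in TEST_AUTHORING_ERRORS:
        return "test_bug"
    # Everything else is treated as a defect in the source under test.
    return "real_bug"
```

Defaulting to real_bug is a deliberate choice in this sketch: a mislabeled test problem costs the agent one wasted look at the source, while a real bug mislabeled as noise can slip through.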

In nine months of running this against the tailtest repo itself, we logged 11,800 PostToolUse events. All of them triggered the test cycle, at a median of 1.4 seconds of added latency per edit. Compliance is 100 percent, not 70, because the hook does not depend on the model deciding to test. Hooks are deterministic. Prompts are not. That distinction is the entire reason this works.

For how the hook plumbing fires, see hook-based testing explained. For where this sits on the testing maturity ladder, see the 5 levels of AI testing maturity. The economic reason this is suddenly viable is in why testing AI-generated code is different.

Where CI still belongs, sharpened

With Gate 1 doing its job, CI gets to do its job. The pipeline should be running the test categories that genuinely need its environment.

Integration across services. When two services have to agree on a schema, the agent that edited one did not have the other in context. Integration testing in CI catches the disagreement. Gate 1 cannot, because it only sees one repo at a time.

Real-data fixture tests. Some tests need a representative dataset that does not belong on the engineer’s laptop. CI mounts the fixture. Gate 1 runs against synthetic inputs.

Privileged security scans. SAST tools, dependency CVE feeds, secrets detection. These need entitlements Gate 1 does not have.

The deploy gate. The final check before code reaches production. CI is the only gate with authority to block a deploy.

What CI should not be doing in 2026 is acting as the discovery layer for “the agent wrote a function and might have left a boundary bug.” That is a Gate 1 question. By the time it reaches CI, the cost of answering has multiplied tenfold.

The roadmap for in-build testing

The four plugins ship unit and scenario testing on a per-edit basis today. The next layer extends the same hook architecture to coverage classes that have historically required separate disciplines and separate tools.

Our end-to-end roadmap page lays out the Q4 2026 target: E2E scenarios that fire on relevant edits, inside the same hook lifecycle. When a change touches a flow with E2E coverage, the affected specs re-execute before the turn completes. No separate CI step required for the user-flow check. Visual regression for UI edits follows the same pattern. API contract checks for service edits do too. Security scans tied into the hook are on the same roadmap.

The pattern is the same in each case: the edit boundary is the trigger. The discovery moment is the same moment as the edit. The fix happens in the same turn. CI becomes the deploy gate it should always have been.

What to do this week

If you are shipping AI-generated code into production right now, the move is not to refactor your CI pipeline. The move is to add a layer at Gate 1 and keep CI for what it does well.

Install one of the four plugins. Claude Code: uvx tailtest install --agent claude. Cursor: swap claude for cursor. Codex CLI: --agent codex. Cline: --agent cline. The per-host install paths are on the comparison page.
Keep your existing CI as it is. Do not delete tests. Do not lower the coverage gate. The point is additive, not subtractive.
Measure for two weeks. Count how many failures fire at Gate 1 (the per-edit hook) versus how many surface at Gate 3 (CI on the PR) versus how many reach Gate 4 (production). A manual tally in two notebooks works if your metrics tooling does not.
Look at the ratio. In our internal measurements and across our case studies of 17 real bugs filed against 55 OSS Python repos, Gate 1 catches the overwhelming majority of issues that would otherwise have needed a human-author context switch. CI-discovered bug counts drop to the categories CI is actually good at: integration, security, deploy safety.
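
If you end up tallying by hand, the ratio in the last step is a few lines of arithmetic. The sketch below assumes you have logged each failure with a label for the gate that caught it. The labels and the sample log are invented for illustration and are not measurements.

```python
from collections import Counter


def gate_shares(caught_at: list) -> dict:
    """Percentage of logged failures caught at each gate."""
    counts = Counter(caught_at)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {gate: round(100 * n / total, 1) for gate, n in sorted(counts.items())}


# Hypothetical two-week log: one entry per failure, labeled by gate.
log = ["gate1"] * 12 + ["gate3"] * 5 + ["gate4"] * 3
print(gate_shares(log))  # {'gate1': 60.0, 'gate3': 25.0, 'gate4': 15.0}
```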

The two-week math is usually surprising. Teams that ran the experiment expected Gate 1 to catch maybe 20 percent of issues. The actual figure was closer to 70 percent of unit-class issues, with CI failure rates dropping in proportion. The remaining CI failures were almost entirely integration, schema, or security, which is what CI should be running anyway.

tailtest is MIT-licensed. No SaaS account. No telemetry. The four plugins share the same R-series rule layer. If the architectural argument lands, the install is one command per host.

FAQ

Is in-build AI code testing supposed to replace CI?

No. CI keeps doing integration tests, security scans, build-environment validation, and the deploy gate. In-build testing covers the layer CI was structurally bad at: per-edit unit and scenario testing for AI-generated code, fired at the agent’s edit boundary before the diff has reached the PR.

Why does the discovery moment matter so much?

The cost of fixing a bug grows roughly exponentially with the distance between the moment it is created and the moment it is discovered. A bug caught at the per-edit gate gets fixed in the same context window the agent generated it from. A bug caught at the per-PR gate requires a context switch back. A bug caught at the deploy gate requires an incident response. Moving the catch from later gates to earlier gates is the single highest-leverage testing change for AI-coding teams.

What categories of bug should still be caught by CI in 2026?

Integration across services, real-data fixture tests, privileged security scans (SAST, dependency CVEs, secrets), and the deploy-safety gate itself. These need infrastructure or entitlements that the per-edit hook does not have. Everything else (boundary inputs, type confusion, off-by-one, format injection, partial failures) belongs at Gate 1.