Shipping today

Agent-edit testing.
Tests fire when the AI does.

Every time your AI coding agent edits a file, tailtest queues the file, generates production-shaped scenarios, runs them, and surfaces failures back to the agent within the same turn. Hook-based. Deterministic. No prompting. Works across Claude Code, Cursor, Codex CLI, and Cline.

The problem this solves

AI coding agents ship code faster than the human review loop can verify it. The Lightrun 2026 survey found 43 percent of AI-generated changes need debugging in production. Anthropic's own April 2026 postmortem on Claude Code documented a window where the agent was "faking test compliance", writing tests that pass by working around the broken code instead of catching the bug.

The root issue: when the agent moves faster than the test suite, every guarantee tests were supposed to provide degrades. Unit tests get skipped. Regressions slip in silently. The diff looks clean. CI is green. Then the production bug surfaces three weeks later. The question becomes "when did this break?" and the answer is "we have no idea."

Prompt-based instructions ("always write tests for the code you write") get 70-90 percent compliance under typical conditions. Hook-based enforcement gets 100 percent compliance, because the test cycle fires whether the agent wants it to or not. tailtest is built on hooks for that reason.

How agent-edit testing works

Two hooks collaborate to keep pending_files accurate across the agent's turn, and a third step generates and runs the scenarios:

1. PostToolUse hook (per-edit, mid-turn)

Fires after every file-mutating tool call. Parses the patch payload to identify which file changed. Applies the intelligence filter (skip test files, skip generated code, skip binaries). Queues qualified source files. Surfaces them to the agent as additionalContext so the agent sees what needs testing before the turn ends.
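The per-edit step above can be sketched roughly as follows. This is a minimal illustration, not tailtest's actual filter: the suffix lists, the test-file naming patterns, the function names, and the shape of the returned payload are all assumptions made for the example.

```python
from pathlib import PurePosixPath

# Hypothetical filter rules for illustration; the real intelligence filter may differ.
GENERATED_MARKERS = ("_pb2.py", ".generated.", ".min.js")
BINARY_SUFFIXES = {".png", ".jpg", ".gif", ".pdf", ".zip", ".so", ".pyc"}


def is_test_file(path: str) -> bool:
    """Guess whether a path is a test file from common naming conventions."""
    p = PurePosixPath(path)
    name = p.name
    return (
        name.startswith("test_")
        or name.endswith(("_test.py", "_test.go"))
        or ".test." in name
        or ".spec." in name
        or "tests" in p.parts
        or "__tests__" in p.parts
    )


def should_queue(path: str) -> bool:
    """Skip binaries, generated code, and test files; queue everything else."""
    p = PurePosixPath(path)
    if p.suffix.lower() in BINARY_SUFFIXES:
        return False
    if any(marker in p.name for marker in GENERATED_MARKERS):
        return False
    return not is_test_file(path)


def handle_edit(path: str, pending_files: list) -> dict:
    """Queue a qualifying file and build the context surfaced to the agent."""
    if should_queue(path) and path not in pending_files:
        pending_files.append(path)
    if not pending_files:
        return {}
    # Assumed payload shape: a single additionalContext string.
    return {"additionalContext": "Tests pending for: " + ", ".join(pending_files)}
```

Calling handle_edit once per file-mutating tool call keeps the queue free of duplicates, and the same pending_files list can then be handed to the turn-end check.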

2. Stop hook (turn-end safety net)

Fires at the end of the agent's turn. Sweeps the project for any files modified during the turn that PostToolUse missed (shell-driven writes, redirected output, files touched outside the standard apply_patch tool). If testing is still pending, returns decision: block, which prompts the agent to complete the test cycle before responding.
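One plausible way to implement the turn-end sweep is to compare file modification times against the moment the turn started. The sketch below assumes that approach; the function names, the skipped directory list, and the reason field are hypothetical, and only the decision: block value comes from the description above.

```python
import os
from pathlib import Path

# Directories assumed not worth sweeping; adjust per project.
SKIP_DIRS = {".git", "node_modules", "__pycache__", ".venv"}


def files_modified_since(root: str, turn_start: float) -> list:
    """Return project-relative paths whose mtime is at or after turn_start."""
    changed = []
    for dirpath, dirnames, filenames in os.walk(root):
        # Prune skipped directories in place so os.walk does not descend into them.
        dirnames[:] = [d for d in dirnames if d not in SKIP_DIRS]
        for name in filenames:
            full = Path(dirpath) / name
            if full.stat().st_mtime >= turn_start:
                changed.append(full.relative_to(root).as_posix())
    return sorted(changed)


def stop_decision(pending_files: list) -> dict:
    """Block the end of the turn while any file still awaits testing."""
    if pending_files:
        return {
            "decision": "block",
            "reason": "Tests still pending for: " + ", ".join(sorted(pending_files)),
        }
    return {}
```

Files found by the sweep would still need to pass the same filter as per-edit files before being added to the pending list.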

3. Scenario generation + execution

When a file is queued, tailtest's R1-R15 rule layer generates production-shaped scenarios covering the public surface. The agent writes those tests, executes them with the project's existing runner (pytest, jest, go test, cargo test, etc.), and reports failures with R12 classification (real_bug vs test_bug vs environment).
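To make the failure report concrete, here is a small sketch of how classified failures might be represented and tallied. Only the three labels (real_bug, test_bug, environment) come from the description above; the FailureReport fields and the summarize helper are hypothetical, and the sketch does not attempt to show how a failure gets its label.

```python
from collections import Counter
from dataclasses import dataclass
from enum import Enum


class FailureClass(Enum):
    REAL_BUG = "real_bug"        # the code under test is wrong
    TEST_BUG = "test_bug"        # the generated test is wrong
    ENVIRONMENT = "environment"  # missing dependency, network, permissions, etc.


@dataclass(frozen=True)
class FailureReport:
    test_id: str
    source_file: str
    classification: FailureClass
    message: str


def summarize(reports: list) -> dict:
    """Count failures per classification, including classes with zero hits."""
    counts = Counter(r.classification for r in reports)
    return {c.value: counts.get(c, 0) for c in FailureClass}
```

Keeping the label on each report lets the agent act on real_bug failures first while treating test_bug and environment failures as problems with the test setup rather than with the edited code.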

Per-host implementation

Same R1-R15 rule layer ships across four hosts. Hook integration is per-host because each editor exposes a different hook contract.

Host	Per-edit fire	Turn-end fire	Plugin tests
Claude Code	PostToolUse hook	Stop hook	491
Codex CLI	PostToolUse hook (v4.9.0+)	Stop hook	400
Cursor	afterFileEdit hook	stop hook	181
Cline	.clinerules instruction + auto-approve (v1.0.1); PostToolUse hook migration on roadmap (V14.10)	.clinerules instruction	162

Total: 1,234 plugin tests across all four hosts. Same R1-R15 rule layer, same scenario generation logic, same R12 classification on failures.

Adversarial mode (R15)

Standard test generation confirms happy-path correctness. Adversarial mode (configured via "depth": "adversarial" in .tailtest/config.json) biases scenarios toward breakage paths across 8 categories:

1. Boundary inputs (MAX_INT, MIN_INT, empty, unicode, null bytes)

2. Format / injection (path traversal, regex specials, shell metachars, SQL fragments)

3. Type confusion (wrong type passed where another was expected)

4. Concurrent state (race conditions, shared mutable state)

5. Time / locale edges (DST, leap year, timezone shifts)

6. Error handling under partial failures (network mid-call fail, disk full, EINTR)

7. Resource exhaustion (very large input, deeply nested, many file descriptors)

8. Off-by-one logic (boundary indices, fence-post errors)

Adversarial mode generates 8-12 scenarios per file, nearly all of which probe breakage. Categories that genuinely do not apply to a given file are skipped, with the reason stated explicitly.
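As an illustration of the configuration switch, the sketch below reads the "depth" key from .tailtest/config.json. The file path and the "adversarial" value come from the text above; the load_depth helper and the "standard" fallback name are assumptions made for the example.

```python
import json
from pathlib import Path


def load_depth(project_root: str, default: str = "standard") -> str:
    """Read the "depth" setting, falling back to a default when unset.

    The default name "standard" is hypothetical.
    """
    config_path = Path(project_root) / ".tailtest" / "config.json"
    if not config_path.is_file():
        return default
    with config_path.open(encoding="utf-8") as fh:
        config = json.load(fh)
    return config.get("depth", default)
```

With a config file containing {"depth": "adversarial"}, load_depth returns "adversarial"; without the file, it returns the fallback.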

Real bugs found, real bugs filed

We have run adversarial test passes against 47 open-source Python repositories and found 16 real bugs that maintainers have since acknowledged and worked through. Specific findings include:

jaraco/inflect #242: number_to_words(n) leaks IndexError instead of documented NumOutOfRangeError for n ≥ 10³⁶
majiidd/persiantools #62: JalaliDateTime copy-construct silently drops timezone info
mattrobenolt/jinja2-cli #145: Single-line docstring filter bug; merged and shipped
python-cmd2/cmd2 #1650: Module-level mutation in alias/macro create methods; merged within 24 hours
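The first finding is an example of a boundary-input scenario: a large value makes an internal exception escape instead of the documented one. The sketch below reproduces that class of bug with a toy function. It is not code from any of the libraries listed, and every name in it (OutOfRangeError, scale_name, leaked_exception) is hypothetical.

```python
class OutOfRangeError(ValueError):
    """Stand-in for a library's documented domain error."""


_SCALES = ["", "thousand", "million", "billion"]


def scale_name(n: int) -> str:
    """Toy function with a deliberate bug: no range check before indexing."""
    groups = (len(str(abs(n))) - 1) // 3
    return _SCALES[groups]  # raises IndexError once n reaches 10**12


def scale_name_fixed(n: int) -> str:
    """Same logic, but out-of-range input raises the documented error."""
    groups = (len(str(abs(n))) - 1) // 3
    if groups >= len(_SCALES):
        raise OutOfRangeError(f"{n} is too large to name")
    return _SCALES[groups]


def leaked_exception(func, value, documented):
    """Return the type of any undocumented exception that escapes func(value).

    Returns None when the call succeeds or raises the documented error.
    """
    try:
        func(value)
    except documented:
        return None
    except Exception as exc:
        return type(exc)
    return None
```

An adversarial scenario would then assert that leaked_exception(scale_name, 10**12, OutOfRangeError) is None. That assertion fails for the buggy version, because IndexError escapes, and passes for the fixed one.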

See all 16 findings on the case studies page

Frequently asked

Does tailtest slow down my agent?

No. The PostToolUse hook completes in under one second on a 5,000-file project. The Stop hook is similarly bounded. Scenario generation and execution take as long as pytest, jest, or your project's runner takes to run those tests. That latency is the same whether the agent runs them or you do.

Will my agent see the test failures or will I?

Both. tailtest emits additionalContext mid-turn so the agent sees what is queued and what failed; the failure summary is also visible to you in the terminal or IDE. By design, the agent's first response to a failing test is to fix it, not to show you.

What languages are supported?

Python, JavaScript, TypeScript, Java, Kotlin, C#, Go, Rust, Ruby, PHP. Test runners auto-detected per project. See the docs for the full matrix.
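Runner auto-detection is commonly done by looking for marker files in the project root. The sketch below assumes that approach; the marker-to-command table and the detect_runner helper are hypothetical and cover only a few of the languages listed.

```python
from typing import List, Optional, Set

# Hypothetical mapping from a marker file to the command that runs its tests.
# Order matters: the first marker found wins.
RUNNER_MARKERS = [
    ("Cargo.toml", ["cargo", "test"]),
    ("go.mod", ["go", "test", "./..."]),
    ("package.json", ["npx", "jest"]),
    ("pyproject.toml", ["pytest"]),
    ("pytest.ini", ["pytest"]),
]


def detect_runner(root_files: Set[str]) -> Optional[List[str]]:
    """Pick a test command from the file names present in the project root."""
    for marker, command in RUNNER_MARKERS:
        if marker in root_files:
            return list(command)
    return None
```

A real detector would likely inspect file contents as well, for example to tell jest from another JavaScript runner declared in package.json.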

Is this just unit testing?

Today, mostly yes: unit-level scenarios covering the public surface of edited files. End-to-end and integration testing are on the platform roadmap. The platform page lists what is shipping now versus what is planned.

What does "hook-based" actually buy me over prompt-based testing?

Deterministic enforcement. Prompt-based instructions ("always write tests") get 70-90 percent compliance because the agent may forget, deprioritize, or get convinced to skip. Hooks fire at the system level outside the LLM's reasoning chain, so compliance is 100 percent. For anything that must happen, hooks are the enforcement mechanism.

Get tailtest working in your project

One install command, no config files required. The agent picks up from there.

See install commands