# How tailtest works
This page covers the internal mechanics: how each platform detects file changes, how shared library modules handle the heavy lifting, how session state is stored, how depth is chosen per file, and how history persists across sessions.
## The hook loop, per platform
### Claude Code
Claude Code exposes three lifecycle hooks that tailtest uses.
SessionStart fires when you open a Claude Code session. Tailtest scans for test runners (pytest, vitest, jest, etc.), writes initial session state to .tailtest/session.json, and injects context into Claude. If history from prior sessions exists, failures and regressions from those sessions are included in the startup context so Claude knows about them.
PostToolUse fires after every file write or edit (Write, Edit, MultiEdit, NotebookEdit). Tailtest checks whether the file is a testable source file (not a test file itself, not generated code, not a config). If it passes the filter, a context note is added: `tailtest: billing.py queued (new, python). write test to tests/test_billing.py. runner: pytest`. Claude sees this note and knows what to do before it responds to you.
SessionEnd fires when the session closes. Tailtest persists scenario outcomes (which files were tested, whether they passed or needed fixes) to the scenario log in session state, then flushes that session's results to the cross-session history file.
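The PostToolUse step above can be sketched as follows. This is a simplified illustration, not tailtest's actual code: the stdin/stdout JSON field names (`tool_input.file_path`, `hookSpecificOutput.additionalContext`) follow Claude Code's hook interface, while the filter rules and the note format are assumptions.

```python
from pathlib import Path

# Directories assumed (for illustration) to hold non-testable files.
SKIP_PARTS = {"tests", "node_modules", "dist", "build", ".venv"}

def is_testable(path: str) -> bool:
    """Return True if the file should be queued for test generation."""
    p = Path(path)
    if p.name.startswith("test_") or p.name.endswith((".test.ts", ".config.js")):
        return False                      # test files and configs are skipped
    if SKIP_PARTS & set(p.parts):
        return False                      # generated code and build output are skipped
    return p.suffix in {".py", ".ts", ".js"}

def handle(event: dict) -> dict:
    """Build the PostToolUse hook response for one file-write event."""
    path = event.get("tool_input", {}).get("file_path", "")
    if not is_testable(path):
        return {}                         # empty output: nothing injected
    note = (f"tailtest: {Path(path).name} queued (python). "
            f"write test to tests/test_{Path(path).stem}.py. runner: pytest")
    return {"hookSpecificOutput": {"additionalContext": note}}
```

A real hook script would read the event JSON from stdin and print the returned dict as JSON; the filter here checks only a few common patterns.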
### Cursor
Cursor uses two hooks.
afterFileEdit fires when a file is saved. Tailtest filters the file the same way as Claude Code. Qualifying files are written to the session state at .cursor/hooks/state/tailtest.json.
stop fires when Cursor's AI finishes a turn. Tailtest checks the session state for any queued files and, if found, fires a followup_message that instructs the model to generate tests before the next user turn. The model handles the testing in that follow-up turn, then clears the queue.
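The stop-hook flow can be sketched like this. The `followup_message` output key and the state file path come from the description above; the state layout (`pending_files`) and the message wording are illustrative assumptions.

```python
import json
from pathlib import Path

STATE = Path(".cursor/hooks/state/tailtest.json")

def on_stop(state_path: Path = STATE) -> dict:
    """Check session state for queued files; request a follow-up turn if any."""
    if not state_path.exists():
        return {}
    state = json.loads(state_path.read_text())
    queued = state.get("pending_files", [])
    if not queued:
        return {}                              # nothing queued: end the turn normally
    msg = "tailtest: generate tests for " + ", ".join(sorted(queued))
    state["pending_files"] = []                # clear the queue for the next turn
    state_path.write_text(json.dumps(state))
    return {"followup_message": msg}
```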
### Codex CLI
Codex uses a single hook.
stop fires at the end of every Codex turn. Tailtest sweeps the project for source files whose mtime has changed since the last sweep. Changed files that pass the filter are queued. If the queue is non-empty, the hook returns decision: block, which pauses Codex and prompts it to run tests before continuing. This mtime-based approach works because Codex does not expose a per-file-write event.
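The mtime sweep can be sketched as follows. The `decision: block` output shape follows the description above; the extension list and the reason wording are simplified assumptions.

```python
import time
from pathlib import Path

def sweep(root: Path, last_sweep: float, exts=(".py",)) -> list[str]:
    """Return project files modified since the last sweep timestamp."""
    changed = []
    for p in root.rglob("*"):
        if p.suffix in exts and p.stat().st_mtime > last_sweep:
            changed.append(str(p.relative_to(root)))
    return changed

def on_stop(root: Path, last_sweep: float) -> dict:
    """Block the turn if any source files changed since the last sweep."""
    queue = sweep(root, last_sweep)
    if not queue:
        return {}                              # nothing changed: let Codex finish
    return {"decision": "block",
            "reason": "tailtest: run tests for " + ", ".join(sorted(queue))}
```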
## Shared lib/ modules
All three platforms ship a copy of the same Python library. The modules are:
| Module | What it does |
|---|---|
| `runners.py` | Detects test runners by scanning `pyproject.toml`, `package.json`, `pom.xml`, `Cargo.toml`, etc. Resolves the run command and test file location per language. |
| `filter.py` | Decides whether a file should be tested. Applies `.tailtest-ignore`; skips test files, generated code, config files, and build output. |
| `complexity_scorer.py` | Scores a file for depth. Uses path signals and content patterns (see the Depth system section below). |
| `history_manager.py` | Reads and writes `.tailtest/history.json`. Classifies entries as gap, regression, fixed, or passed. Detects recurring failures across sessions. |
| `scenario_log.py` | Builds per-file outcome entries (passed, fixed, unresolved, deferred) at session end. Feeds into history. |
| `impact_tracer.py` | Optional, Python only. Finds files that import the changed file, so Claude knows which other files may be affected by the change. |
| `api_validator.py` | Optional, Python only. Verifies that the public functions and classes in a Python file are importable before tests are written. Guards against hallucinated APIs where both the source and the tests reference a function that does not actually exist. |
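The api_validator idea can be sketched as a simple importability check. The function name and signature here are illustrative, not tailtest's actual API.

```python
import importlib

def missing_names(module_name: str, expected: list[str]) -> list[str]:
    """Return the expected public names that the module does not provide."""
    try:
        mod = importlib.import_module(module_name)
    except ImportError:
        return expected            # nothing importable: every name is missing
    return [name for name in expected if not hasattr(mod, name)]
```

If any names come back missing, the tests about to be written would reference an API that does not exist, so generation is stopped before a hallucinated function gets "confirmed" by a matching hallucinated test.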
## Session state
Tailtest writes a session state file while the session is active.
| Platform | Path |
|---|---|
| Claude Code | .tailtest/session.json |
| Codex | .tailtest/session.json |
| Cursor | .cursor/hooks/state/tailtest.json |
The file tracks:
- `pending_files`: files queued for testing in the current session
- `fix_attempts`: how many times Claude tried to fix each failing file
- `generated_tests`: which source files had tests written this session
- `scenario_log`: per-file outcomes appended at session end
- `last_failures`: failures from the previous session, injected into startup context
Add .tailtest/ (and .cursor/hooks/state/) to your .gitignore. These are working files, not artifacts to commit.
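A sketch of what the session state file might contain, using the keys listed above; the exact shape and values are illustrative, not a guaranteed schema:

```json
{
  "pending_files": ["billing.py"],
  "fix_attempts": {"billing.py": 1},
  "generated_tests": {"billing.py": "tests/test_billing.py"},
  "scenario_log": [
    {"file": "billing.py", "status": "fixed", "attempts": 1}
  ],
  "last_failures": ["invoice.py"]
}
```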
## Depth system
Depth controls how many test scenarios tailtest asks Claude to generate for each file.
| Depth | Scenarios | What is covered |
|---|---|---|
| `simple` | 2-4 | Happy path only |
| `standard` | 5-8 | Happy path plus key edge cases (default) |
| `thorough` | 10-15 | Happy path, edge cases, and failure modes |
You can set a session-level default in .tailtest/config.json:
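This page does not show the config schema; a minimal sketch, assuming `depth` is a top-level key:

```json
{
  "depth": "standard"
}
```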
### Automatic per-file override
Even if you set depth: simple, certain files get upgraded automatically based on a complexity score. The scorer runs on every queued file and returns a score from 0 upward. The score maps to a depth band:
- 0-5: simple
- 6-9: standard
- 10+: thorough
What drives the score:
- Path signals: files with `auth`, `permission`, `billing`, `payment`, `checkout`, `invoice`, or `subscription` in the path add 4 points. Files with `admin`, `upload`, `delete`, `remove`, `purge`, or `migrate` add 3 points.
- HTTP calls: any use of `requests`, `fetch`, `axios`, `httpx`, `aiohttp`, or similar adds 3 points.
- Database access: queries, ORM calls (`.filter()`, `.save()`, `.commit()`), or raw SQL keywords add 3 points.
- Branching: each `if`, `elif`, `else`, `match`, or `switch` adds 1 point, up to 4.
- Public functions: each exported or public function adds 1 point, up to 5.
If the computed depth is higher than the configured depth, the context note sent to Claude includes the override and the reasoning, for example: `Complexity: thorough (billing: +4 billing +3 HTTP +3 DB +2 branches = 12 scenarios). Generate ~12 scenarios.`
The configured depth is never downgraded. If you set depth: thorough, every file gets at least thorough.
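The scoring rules above can be sketched as follows. The regexes and weights are simplified from the listed rules (public-function counting is omitted for brevity), and the function names are illustrative, not tailtest's actual `complexity_scorer.py` API.

```python
import re

DEPTHS = ["simple", "standard", "thorough"]

def score_file(path: str, source: str) -> int:
    """Sum path and content signals into a complexity score."""
    score = 0
    if re.search(r"auth|permission|billing|payment|checkout|invoice|subscription", path):
        score += 4          # sensitive-domain path signal
    if re.search(r"admin|upload|delete|remove|purge|migrate", path):
        score += 3
    if re.search(r"\b(requests|fetch|axios|httpx|aiohttp)\b", source):
        score += 3          # HTTP calls
    if re.search(r"\.(filter|save|commit)\(|\bSELECT\b|\bINSERT\b", source):
        score += 3          # database access
    score += min(4, len(re.findall(r"\b(if|elif|else|match|switch)\b", source)))
    return score

def band(score: int) -> str:
    """Map a score to its depth band: 0-5 simple, 6-9 standard, 10+ thorough."""
    return "simple" if score <= 5 else "standard" if score <= 9 else "thorough"

def effective_depth(configured: str, score: int) -> str:
    # The configured depth is a floor: the scorer only upgrades, never downgrades.
    return DEPTHS[max(DEPTHS.index(configured), DEPTHS.index(band(score)))]
```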
## Cross-session memory
Tailtest maintains a persistent log of every file it has tested, across all sessions.
Location: .tailtest/history.json (all three platforms)
Cap: 1000 entries. Oldest entries are dropped when the cap is reached.
Each entry records:
- `file`: relative path of the source file
- `status`: `passed`, `fixed`, `unresolved`, or `deferred`
- `attempts`: how many fix cycles it took
- `session_id`: which session produced this entry
- `timestamp`: UTC ISO timestamp
- `classification`: `gap`, `regression`, `fixed`, or `passed`
Classifications:
- `gap`: first time this file has been tested (no prior history)
- `passed`: passed with no fix attempts
- `fixed`: failed initially, resolved within the session
- `regression`: was passing in the most recent prior session, now failing
Recurring failures: If a file has failed (unresolved or deferred) in 3 or more distinct sessions, tailtest flags it as a recurring failure. At the start of the next session, Claude sees a note: `Recurring failures across sessions: billing.py. These files have failed in multiple sessions -- consider adding validation.`
Regressions at startup: If a file was passing in the last session and is now failing, the startup context includes: `Recent regressions: billing.py (was passing, now failing)`.
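A minimal sketch of these classification rules, assuming the entry shape listed above; tailtest's actual `history_manager.py` logic may differ in detail.

```python
def classify(status: str, attempts: int, prior: list[dict]) -> str:
    """Classify a new history entry against this file's prior entries."""
    if not prior:
        return "gap"                       # first time this file is tested
    if status in ("unresolved", "deferred") and prior[-1]["status"] == "passed":
        return "regression"                # was passing last session, now failing
    if status == "passed" and attempts == 0:
        return "passed"                    # passed with no fix attempts
    if status == "fixed":
        return "fixed"                     # failed initially, resolved in-session
    return status

def is_recurring(prior: list[dict]) -> bool:
    """A file failing in 3+ distinct sessions is flagged as recurring."""
    failing = {e["session_id"] for e in prior
               if e["status"] in ("unresolved", "deferred")}
    return len(failing) >= 3
```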
This gives Claude useful context at the start of each session without requiring you to remember or re-explain anything.
## Failure classification
When tests fail, tailtest requires Claude to classify the failure before asking whether to fix it. Three categories apply:
| Category | Meaning | What happens |
|---|---|---|
| Real bug | The source code has incorrect logic. The test is exposing a genuine defect. | Claude states the bug and asks: "Want me to fix this?" |
| Environment issue | A missing dependency, misconfigured setup, or unavailable external service. The source code is not at fault. | Claude surfaces the issue without modifying source code. |
| Test bug | The test itself has a wrong expectation, wrong fixture, or wrong assertion. The source code is correct. | Claude corrects the test, not the source. |
Claude states the category and one sentence of reasoning before taking any action. For example: This is a real bug -- calculate_tax returns None when input is zero instead of 0.0. Want me to fix it?
Failures are never silently skipped. If Claude is uncertain which category applies, it defaults to treating the failure as a real bug and surfaces it.