How tailtest works

This page covers the internal mechanics: how each platform detects file changes, how shared library modules handle the heavy lifting, how session state is stored, how depth is chosen per file, and how history persists across sessions.


The hook loop, per platform

Claude Code

Claude Code exposes three lifecycle hooks that tailtest uses.

SessionStart fires when you open a Claude Code session. Tailtest scans for test runners (pytest, vitest, jest, etc.), writes initial session state to .tailtest/session.json, and injects context into Claude. If history from prior sessions exists, failures and regressions from those sessions are included in the startup context so Claude knows about them.
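The runner scan in this step amounts to probing project manifests. A minimal sketch of that detection, with hypothetical helper names and heuristics (the real runners.py covers more languages and manifests):

# Minimal sketch of runner detection; names and heuristics are illustrative.
import json
from pathlib import Path

def detect_runner(project_root: str = ".") -> str | None:
    root = Path(project_root)
    # Python projects: assume pytest if pyproject.toml mentions it.
    pyproject = root / "pyproject.toml"
    if pyproject.exists() and "pytest" in pyproject.read_text():
        return "pytest"
    # JS/TS projects: check package.json devDependencies for a known runner.
    package_json = root / "package.json"
    if package_json.exists():
        deps = json.loads(package_json.read_text()).get("devDependencies", {})
        for runner in ("vitest", "jest"):
            if runner in deps:
                return runner
    return None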

PostToolUse fires after every file write or edit (Write, Edit, MultiEdit, NotebookEdit). Tailtest checks whether the file is a testable source file (not a test file itself, not generated code, not a config file). If it passes the filter, a context note is added:

tailtest: billing.py queued (new, python). write test to tests/test_billing.py. runner: pytest.

Claude sees this note and knows what to do before it responds to you.
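A sketch of the kind of check and note this hook produces; the skip list, suffix filter, and test-path convention here are illustrative simplifications, not tailtest's actual internals:

# Illustrative sketch of the PostToolUse filter and context note.
from pathlib import Path

SKIP_DIRS = {"tests", "node_modules", "dist", "build", "__pycache__"}

def should_queue(path: str) -> bool:
    p = Path(path)
    if p.name.startswith("test_") or p.name.endswith("_test.py"):
        return False                      # never queue test files themselves
    if any(part in SKIP_DIRS for part in p.parts):
        return False                      # generated code and build output
    return p.suffix == ".py"              # the real filter covers more languages

def context_note(path: str, runner: str = "pytest") -> str:
    p = Path(path)
    return (f"tailtest: {p.name} queued (new, python). "
            f"write test to tests/test_{p.stem}.py. runner: {runner}.")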

SessionEnd fires when the session closes. Tailtest persists scenario outcomes (which files were tested, whether they passed or needed fixes) to the scenario log in session state, then flushes that session's results to the cross-session history file.

Cursor

Cursor uses two hooks.

afterFileEdit fires when a file is saved. Tailtest applies the same file filter it uses on Claude Code. Qualifying files are written to the session state at .cursor/hooks/state/tailtest.json.

stop fires when Cursor's AI finishes a turn. Tailtest checks the session state for any queued files and, if found, fires a followup_message that instructs the model to generate tests before the next user turn. The model handles the testing in that follow-up turn, then clears the queue.
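Sketched below, assuming Cursor hooks read JSON from stdin and reply with JSON on stdout; that contract and the field names are assumptions for illustration:

# Illustrative Cursor stop hook; the JSON-in/JSON-out contract is assumed.
import json
from pathlib import Path

STATE = Path(".cursor/hooks/state/tailtest.json")

def main() -> None:
    state = json.loads(STATE.read_text()) if STATE.exists() else {}
    pending = state.get("pending_files", [])
    if pending:
        # Ask the model to generate the tests in a follow-up turn.
        message = "Generate tests for: " + ", ".join(pending)
        print(json.dumps({"followup_message": message}))
    else:
        print(json.dumps({}))

if __name__ == "__main__":
    main()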

Codex CLI

Codex uses a single hook.

stop fires at the end of every Codex turn. Tailtest sweeps the project for source files whose mtime has changed since the last sweep. Changed files that pass the filter are queued. If the queue is non-empty, the hook returns decision: block, which pauses Codex and prompts it to run tests before continuing. This mtime-based approach works because Codex does not expose a per-file-write event.
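A sketch of that sweep, assuming a hypothetical last_sweep_mtime key in the session state (the real schema may differ):

# Illustrative mtime sweep; the state schema here is an assumption.
import json
from pathlib import Path

STATE = Path(".tailtest/session.json")

def sweep(project_root: str = ".") -> list[str]:
    state = json.loads(STATE.read_text()) if STATE.exists() else {}
    last_sweep = state.get("last_sweep_mtime", 0.0)
    changed, newest = [], last_sweep
    for path in Path(project_root).rglob("*.py"):
        mtime = path.stat().st_mtime
        newest = max(newest, mtime)
        if mtime > last_sweep and "tests" not in path.parts:
            changed.append(str(path))     # the real filter is richer
    state["last_sweep_mtime"] = newest
    STATE.parent.mkdir(exist_ok=True)
    STATE.write_text(json.dumps(state))
    return changed

# A non-empty result is what makes the hook return decision: block.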


Shared lib/ modules

All three platforms ship a copy of the same Python library. The modules are:

  • runners.py: Detects test runners by scanning pyproject.toml, package.json, pom.xml, Cargo.toml, etc. Resolves the run command and test file location per language.
  • filter.py: Decides whether a file should be tested. Applies .tailtest-ignore and skips test files, generated code, config files, and build output.
  • complexity_scorer.py: Scores a file to choose its depth, using path signals and content patterns (see the Depth system section below).
  • history_manager.py: Reads and writes .tailtest/history.json. Classifies entries as gap, regression, fixed, or passed, and detects recurring failures across sessions.
  • scenario_log.py: Builds per-file outcome entries (passed, fixed, unresolved, deferred) at session end. Feeds into history.
  • impact_tracer.py: Optional, Python only. Finds files that import the changed file, so Claude knows which other files the change may affect.
  • api_validator.py: Optional, Python only. Verifies that the public functions and classes in a Python file are importable before tests are written, guarding against hallucinated APIs where both the source and the tests reference a function that does not actually exist (sketched below).
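The importability check can be small. A sketch; validate_api and its signature are hypothetical:

# Illustrative API validation via importlib; names are hypothetical.
import importlib

def validate_api(module_name: str, expected: list[str]) -> list[str]:
    """Return the expected public names missing from the module."""
    module = importlib.import_module(module_name)
    return [name for name in expected if not hasattr(module, name)]

# validate_api("billing", ["calculate_tax"]) returning ["calculate_tax"]
# would mean the tests reference an API that does not actually exist.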

Session state

Tailtest writes a session state file while the session is active.

  • Claude Code: .tailtest/session.json
  • Codex: .tailtest/session.json
  • Cursor: .cursor/hooks/state/tailtest.json

The file tracks the following (an illustrative snapshot follows the list):

  • pending_files: files queued for testing in the current session
  • fix_attempts: how many times Claude tried to fix each failing file
  • generated_tests: which source files had tests written this session
  • scenario_log: per-file outcomes appended at session end
  • last_failures: failures from the previous session, injected into startup context
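A hypothetical snapshot, shown as a Python dict for brevity (the on-disk file is JSON; exact keys and value shapes may differ):

# Hypothetical session state; value shapes are illustrative.
session_state = {
    "pending_files": ["billing.py"],
    "fix_attempts": {"billing.py": 1},
    "generated_tests": ["billing.py"],
    "scenario_log": [
        {"file": "auth.py", "status": "passed", "attempts": 0},
    ],
    "last_failures": ["payments.py"],
}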

Add .tailtest/ (and .cursor/hooks/state/) to your .gitignore. These are working files, not artifacts to commit.


Depth system

Depth controls how many test scenarios tailtest asks Claude to generate for each file.

  • simple: 2-4 scenarios. Happy path only.
  • standard (the default): 5-8 scenarios. Happy path plus key edge cases.
  • thorough: 10-15 scenarios. Happy path, edge cases, and failure modes.

You can set a session-level default in .tailtest/config.json:

{
  "depth": "standard"
}

Automatic per-file override

Even if you set depth: simple, certain files get upgraded automatically based on a complexity score. The scorer runs on every queued file and returns a score from 0 upward. The score maps to a depth band:

  • 0-5: simple
  • 6-9: standard
  • 10+: thorough

What drives the score (a sketch of the full algorithm appears below):

  • Path signals: files with auth, permission, billing, payment, checkout, invoice, or subscription in the path add 4 points. Files with admin, upload, delete, remove, purge, or migrate add 3 points.
  • HTTP calls: any use of requests, fetch, axios, httpx, aiohttp, or similar adds 3 points.
  • Database access: queries, ORM calls (.filter(), .save(), .commit()), raw SQL keywords: 3 points.
  • Branching: each if, elif, else, match, or switch adds 1 point, up to 4.
  • Public functions: each exported or public function adds 1 point, up to 5.

If the computed depth is higher than the configured depth, the context note sent to Claude includes the override and the reasoning: for example, Complexity: thorough (billing: +4 billing +3 HTTP +3 DB +2 branches = 12 scenarios). Generate ~12 scenarios.

The override only ever raises depth; it never drops below what you configured. If you set depth: thorough, every file gets at least thorough.
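Putting the rules together, a sketch of the scorer and the upgrade-only override. The weights mirror the list above, but the regexes, names, and "public function" heuristic are illustrative simplifications:

# Illustrative complexity scorer; weights follow the documented heuristics.
import re

HIGH_RISK = ("auth", "permission", "billing", "payment", "checkout",
             "invoice", "subscription")
MED_RISK = ("admin", "upload", "delete", "remove", "purge", "migrate")

def score_file(path: str, source: str) -> int:
    score = 0
    lowered = path.lower()
    if any(word in lowered for word in HIGH_RISK):
        score += 4                        # high-risk path signal
    if any(word in lowered for word in MED_RISK):
        score += 3                        # medium-risk path signal
    if re.search(r"\b(requests|fetch|axios|httpx|aiohttp)\b", source):
        score += 3                        # HTTP calls
    if re.search(r"\.(filter|save|commit)\(|\b(SELECT|INSERT|UPDATE)\b", source):
        score += 3                        # database access
    branches = len(re.findall(r"\b(if|elif|else|match|switch)\b", source))
    score += min(branches, 4)             # branching, capped at 4
    public_funcs = len(re.findall(r"^def [a-z]", source, re.MULTILINE))
    score += min(public_funcs, 5)         # crude public-function count, capped at 5
    return score

def pick_depth(configured: str, score: int) -> str:
    bands = ["simple", "standard", "thorough"]
    computed = "simple" if score <= 5 else "standard" if score <= 9 else "thorough"
    # The override only ever raises depth, never lowers it.
    return max(configured, computed, key=bands.index)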


Cross-session memory

Tailtest maintains a persistent log of every file it has tested, across all sessions.

Location: .tailtest/history.json (all three platforms)

Cap: 1000 entries. Oldest entries are dropped when the cap is reached.

Each entry records (an example follows the list):

  • file: relative path of the source file
  • status: passed, fixed, unresolved, or deferred
  • attempts: how many fix cycles it took
  • session_id: which session produced this entry
  • timestamp: UTC ISO timestamp
  • classification: gap, regression, fixed, or passed
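An illustrative entry (all values here are made up):

# Hypothetical history.json entry; the log is a list of these,
# pruned to the newest 1000 (roughly: history = history[-1000:]).
entry = {
    "file": "billing.py",
    "status": "fixed",
    "attempts": 2,
    "session_id": "2025-01-15-a1b2",
    "timestamp": "2025-01-15T10:30:00Z",
    "classification": "fixed",
}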

Classifications:

  • gap: first time this file has been tested (no prior history)
  • passed: passed with no fix attempts
  • fixed: failed initially, resolved within the session
  • regression: was passing in the most recent prior session, now failing

Recurring failures: If a file has failed (unresolved or deferred) in 3 or more distinct sessions, tailtest flags it as a recurring failure. At the start of the next session, Claude sees a note: Recurring failures across sessions: billing.py. These files have failed in multiple sessions -- consider adding validation.

Regressions at startup: If a file was passing in the last session and is now failing, the startup context includes: Recent regressions: billing.py (was passing, now failing).
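Both startup checks reduce to simple passes over the history log. A sketch, assuming entries are ordered oldest to newest; the helper names are hypothetical, and "most recent prior session" is approximated by each file's previous entry:

# Illustrative startup checks over history entries.
from collections import defaultdict

FAILING = {"unresolved", "deferred"}

def recurring_failures(history: list[dict], threshold: int = 3) -> list[str]:
    """Files that failed in `threshold` or more distinct sessions."""
    sessions: dict[str, set] = defaultdict(set)
    for entry in history:
        if entry["status"] in FAILING:
            sessions[entry["file"]].add(entry["session_id"])
    return [f for f, ids in sessions.items() if len(ids) >= threshold]

def regressions(history: list[dict]) -> list[str]:
    """Files failing in their latest entry that passed in the one before."""
    by_file: dict[str, list] = defaultdict(list)
    for entry in history:                 # assumed oldest-to-newest
        by_file[entry["file"]].append(entry["status"])
    return [f for f, statuses in by_file.items()
            if len(statuses) >= 2
            and statuses[-1] in FAILING
            and statuses[-2] in {"passed", "fixed"}]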

This gives Claude useful context at the start of each session without requiring you to remember or re-explain anything.


Failure classification

When tests fail, tailtest requires Claude to classify the failure before asking whether to fix it. Three categories apply:

  • Real bug: the source code has incorrect logic, and the test is exposing a genuine defect. Claude states the bug and asks: "Want me to fix this?"
  • Environment issue: a missing dependency, misconfigured setup, or unavailable external service; the source code is not at fault. Claude surfaces the issue without modifying source code.
  • Test bug: the test itself has a wrong expectation, fixture, or assertion; the source code is correct. Claude corrects the test, not the source.

Claude states the category and one sentence of reasoning before taking any action. For example: This is a real bug -- calculate_tax returns None when input is zero instead of 0.0. Want me to fix it?

Failures are never silently skipped. If Claude is uncertain which category applies, it defaults to treating the failure as a real bug and surfaces it.