Inside Codex CLI PostToolUse: what fires on apply_patch

The Codex CLI PostToolUse hook is the youngest of the four agent hook surfaces tailtest integrates with. It shipped in tailtest-codex v4.9.0 in May 2026, after about six weeks of dogfooding on bug-hunt work against real Python repositories. This is the post I wish I had when I started: the actual payload shape, the apply_patch parser quirks, what additionalContext does and does not do, and the bug categories the hook catches that Codex’s own loop misses.

I am Pallavi. I work on the Codex plugin and most of the case-study bug hunt sits on my desk. I write the deep-dives that follow the bugs back to the hook event that surfaced them. That means I have read more PostToolUse payloads than I would like to admit.

Why Codex CLI needed PostToolUse

Codex CLI ships with three hook events: SessionStart, PostToolUse, and Stop. Earlier tailtest-codex versions, through 4.8, used only SessionStart and Stop. The reasoning was that Stop fires at the turn boundary, so you can run the full test cycle once per turn with a smaller config surface.

That was wrong on long turns. A Codex turn can include a dozen apply_patch calls plus several shell invocations. Running tests only at turn end gave the agent no per-edit feedback. The model wrote, kept writing, and only saw the failures after the entire turn was committed to context. By the time it saw a failing test, the relevant edit was eight tool calls back and the agent had moved on.

v4.9.0 added a PostToolUse hook that fires per tool call, accumulates pending files, and surfaces structured context inside the same turn. The Stop hook stayed registered as a turn-end safety net. Both hooks share the scanner library at hooks/lib/scanner.py, which is the unit of code that decides “this tool call touched these files.” The PostToolUse hook calls into the same scanner the Stop hook always used. No logic duplicated, no risk of drift.

The hooks.json shape

Codex CLI’s hook config lives in ~/.codex/plugins/<plugin>/hooks/hooks.json. The shape that tailtest-codex 4.9.0 ships:

{
  "hooks": {
    "SessionStart": [
      {
        "matcher": "startup",
        "hooks": [
          { "type": "command", "command": "python3 $HOME/.codex/plugins/tailtest/hooks/session_start.py" }
        ]
      }
    ],
    "PostToolUse": [
      {
        "matcher": ".*",
        "hooks": [
          { "type": "command", "command": "python3 $HOME/.codex/plugins/tailtest/hooks/post_tool_use.py" }
        ]
      }
    ],
    "Stop": [
      {
        "matcher": "",
        "hooks": [
          { "type": "command", "command": "python3 $HOME/.codex/plugins/tailtest/hooks/stop.py" }
        ]
      }
    ]
  }
}

A few notes on the schema. The outer matcher on PostToolUse is a regex against tool names. .* says “fire for every tool call.” If you only care about file mutations, you can narrow it to apply_patch|patch|shell, but tailtest filters inside the hook script because the regex matcher is per-config and we want the filter logic to live with the runtime code.

The Stop matcher is empty, which Codex CLI reads as “always fire.” There is no equivalent to Claude Code’s file_path_regex or cwd_regex matcher. All filtering happens in the script.

The event payload

PostToolUse pipes a JSON payload to stdin. The shape, abbreviated:

{
  "event": "PostToolUse",
  "tool_name": "apply_patch",
  "tool_input": {
    "patch": "*** Begin Patch\n*** Update File: src/pricing.py\n@@\n-def discount(p, q):\n-    return p * q\n+def discount(p, q):\n+    if q < 0:\n+        raise ValueError(\"negative qty\")\n+    return p * q\n*** End Patch\n"
  },
  "tool_result": { "success": true, "exit_code": 0 },
  "cwd": "/abs/path/to/project",
  "session_id": "01HMXN..."
}

The interesting field is tool_input.patch. Codex CLI’s apply_patch tool takes a custom patch format that is not unified diff. It has its own headers (*** Begin Patch, *** Update File, *** Add File, *** Delete File) and its own context conventions. Tailtest parses this format in hooks/lib/scanner.py via extract_files_from_patch. The parser handles four cases: update, add, delete, and rename. The patch can touch multiple files in one call; the parser returns the full set.
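To make the header scan concrete, here is a minimal sketch of pulling file paths out of a patch string. It is not the code in hooks/lib/scanner.py: the function name files_touched_by_patch is hypothetical, and treating a "*** Move to:" line as the rename marker is an assumption on my part that you should check against the patch format your Codex CLI version emits.

```python
import re

# Header lines of the patch format described above. "Move to" as the rename
# marker is an assumption of this sketch, not something this post confirms.
_HEADER = re.compile(r"^\*\*\* (Update File|Add File|Delete File|Move to): (.+)$")


def files_touched_by_patch(patch: str) -> set[str]:
    """Return every path named in a patch header line (hypothetical helper)."""
    touched = set()
    for line in patch.splitlines():
        # Hunk body lines start with '+', '-' or a space, so only real
        # headers can match at column zero.
        match = _HEADER.match(line)
        if match:
            touched.add(match.group(2).strip())
    return touched
```

Returning a set keeps a multi-file patch from queueing the same path twice when one call updates a file in more than one section.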

For shell tool calls (shell, bash, exec), the patch field is absent and the scanner falls back to an mtime sweep over the project tree since the last PostToolUse fire. The mtime sweep catches files written indirectly by shell commands (pip install modifying installed packages, a script generating output, a code formatter rewriting files in place). It is slower than the patch parser, which is why we only fall back when the patch field is missing.
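The mtime fallback can be sketched in a few lines. This is an illustration under assumptions, not tailtest's sweep: the function name mtime_sweep, the `since` argument (epoch seconds of the previous hook fire), and the default skip list are all mine.

```python
import os


def mtime_sweep(root, since, skip_dirs=(".git",)):
    """Return paths under root (relative to root) modified after `since`.

    `since` is an epoch timestamp in seconds. The skip list is a choice of
    this sketch; widen it only if you accept missing edits in those dirs.
    """
    changed = []
    for dirpath, dirnames, filenames in os.walk(root):
        # Prune in place so os.walk never descends into skipped directories.
        dirnames[:] = [d for d in dirnames if d not in skip_dirs]
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                if os.stat(path).st_mtime > since:
                    changed.append(os.path.relpath(path, root))
            except OSError:
                continue  # file vanished between listing and stat
    return sorted(changed)
```

The cost of a walk like this grows with the number of files under the root, which is why a sweep is the slow path compared with reading paths out of a patch string.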

What the hook actually does

Inside the hook script, the flow looks like this:

1. Read stdin, parse JSON, exit zero on parse failure.
2. Quick exit if tool_name is not in the file-mutating tool set.
3. Identify changed files: parse the patch if present, or mtime-sweep if not.
4. Run each changed file through the ignore filter and language detector.
5. Append eligible files to pending_files in .codex/hooks/state/tailtest.json.
6. Emit additionalContext via the hookSpecificOutput envelope so the agent sees a short structured summary on the next turn.
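Steps 1, 2, and 5 of the flow above can be sketched as follows. The helper names, the exact tool set, and the shape of the state file beyond its pending_files key are assumptions for illustration, not tailtest's code.

```python
import json
from pathlib import Path

# Tool names treated as file-mutating; this exact set is an assumption.
MUTATING_TOOLS = ("apply_patch", "patch", "shell", "bash", "exec")


def parse_payload(raw: str):
    """Steps 1 and 2: return the payload dict, or None if the hook should exit 0."""
    try:
        payload = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(payload, dict) or payload.get("tool_name") not in MUTATING_TOOLS:
        return None
    return payload


def queue_files(state_path: Path, files) -> list:
    """Step 5: merge files into pending_files, keeping order and dropping repeats."""
    try:
        state = json.loads(state_path.read_text())
    except (FileNotFoundError, json.JSONDecodeError):
        state = {}
    if not isinstance(state, dict):
        state = {}
    pending = list(state.get("pending_files", []))
    for path in files:
        if path not in pending:
            pending.append(path)
    state["pending_files"] = pending
    state_path.parent.mkdir(parents=True, exist_ok=True)
    state_path.write_text(json.dumps(state, indent=2))
    return pending
```

A hook entry point would call parse_payload(sys.stdin.read()) and exit with status zero whenever it returns None, so a malformed payload never blocks the agent.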

Step 6 is the part that matters most. Codex CLI reads stdout from PostToolUse hooks looking for a specific JSON envelope:

{
  "hookSpecificOutput": {
    "additionalContext": "tailtest: 2 files queued for verification (src/pricing.py, src/cart.py)"
  }
}

Anything in additionalContext is surfaced to the agent as context on the next turn, without blocking the current turn’s response. This is the closest Codex CLI gets to Claude Code’s stdout-to-tool-output channel. The envelope is the only sanctioned way to talk to the agent from a PostToolUse hook. Bare stdout outside the envelope is logged but not surfaced to the model.
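A small sketch of producing that envelope from Python. The helper name build_envelope is hypothetical; the message wording simply mirrors the example above.

```python
import json


def build_envelope(queued):
    """Return the envelope JSON string for a list of queued file paths."""
    count = len(queued)
    noun = "file" if count == 1 else "files"
    message = f"tailtest: {count} {noun} queued for verification ({', '.join(queued)})"
    # json.dumps handles escaping, so odd characters in a path cannot
    # break the envelope.
    return json.dumps({"hookSpecificOutput": {"additionalContext": message}})
```

In a hook script, print(build_envelope(files)) should be the only write to stdout; anything diagnostic is safer on stderr so it cannot end up next to the envelope.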

The summary we emit is deliberately small. Tailtest does not run the test suite inside PostToolUse on Codex; it only queues files. The actual run happens at turn end inside the Stop hook, because the Codex turn structure makes per-tool-call test runs too expensive at p99. The PostToolUse hook exists for queue management and context surfacing, not for runner dispatch.

Latency budget

Across 1,800 PostToolUse events on the tailtest-codex repo and three external bug-hunt projects in May, the latency distribution at the hook entry point was:

p50: 64ms
p90: 118ms
p99: 310ms

The p99 tail is dominated by mtime sweeps on shell tool calls in projects with large node_modules or vendored Python dependencies. The patch-parser path is reliably under 100ms because the patch text is small and the parse is local.
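If you want to reproduce this kind of breakdown from your own hook timings, a nearest-rank percentile is enough. The method is an assumption of this sketch; other percentile definitions interpolate and give slightly different numbers on small samples.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile: smallest sample with at least pct% at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    # Rank is 1-indexed; clamp to 1 so pct=0 returns the minimum.
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]
```

Calling percentile(latencies_ms, 99) on a list of per-event durations gives the p99 in the same unit as the samples.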

The Stop hook is where the runner dispatch happens. It carries the actual test latency: p50 around 1.1 seconds, p99 around 3.4 seconds when the R15 adversarial pass runs. The split is deliberate. PostToolUse stays cheap so it can fire on every tool call without the user noticing. Stop pays the runner cost once per turn over the union of queued files.

What we have caught

The case-study work is the empirical case for the per-edit cycle. Across 55 open-source Python repositories, tailtest has filed 17 real bugs, eight of which surfaced first on the Codex plugin specifically. A few representative patterns:

A timezone handler that worked at noon UTC and broke at midnight local time. The agent added a datetime.now() call to a billing function. The happy-path tests passed because the test suite ran during business hours. R15’s time and locale edges category generated a midnight-UTC test that failed. The PostToolUse hook queued the file, the Stop hook ran the new test, and the failure surfaced in the same turn. The agent rewrote the function to use datetime.now(timezone.utc) and the bug never reached the repo.
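The shape of that bug is easy to reproduce. The function below is a hypothetical stand-in for the billing code, not the repository's source; it shows why an aware UTC timestamp and a local calendar date can disagree near midnight.

```python
from datetime import datetime, timedelta, timezone


def billing_date(now=None):
    """Return the UTC calendar date used for billing (hypothetical function)."""
    # An aware UTC timestamp does not depend on the machine's local timezone.
    now = now or datetime.now(timezone.utc)
    return now.astimezone(timezone.utc).date()


# 23:30 on 31 January in a UTC-5 zone is already 04:30 on 1 February in UTC.
local_late = datetime(2026, 1, 31, 23, 30, tzinfo=timezone(timedelta(hours=-5)))
print(local_late.date())         # 2026-01-31
print(billing_date(local_late))  # 2026-02-01
```

A test that only ever runs in the middle of the day never exercises the window where the two dates differ, which is the gap a midnight edge case fills.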

A JSON parser that accepted Infinity as a valid number. The agent imported json and trusted the defaults. R15’s format and injection category fed an Infinity literal through the parser, which Python’s json accepts by default but the repo’s downstream consumer did not. The test failed at turn end, the agent saw the failure, the patch landed with parse_constant=lambda c: None and validation downstream.
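The default behavior is quick to confirm with the standard library. In the sketch below, strict_loads and reject_constant are hypothetical names, and raising is a stricter variant than the lambda c: None fix described above, shown alongside it for comparison.

```python
import json

# The default decoder accepts the non-standard Infinity literal.
value = json.loads("Infinity")  # float('inf')


def reject_constant(name):
    """parse_constant hook: receives 'Infinity', '-Infinity', or 'NaN'."""
    raise ValueError(f"non-finite JSON constant: {name}")


def strict_loads(text):
    """json.loads variant that refuses non-finite numeric constants."""
    return json.loads(text, parse_constant=reject_constant)


# The fix described above maps the constant to None and validates downstream.
lenient = json.loads('{"price": Infinity}', parse_constant=lambda c: None)
```

Mapping to None keeps parsing permissive and pushes the decision to the consumer; raising fails fast at the parse boundary. Which one fits depends on where the repository wants the error to surface.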

An off-by-one in a paginator that only showed up at exactly the page-size boundary. Standard tests used page sizes of 5 and 50. R15’s off-by-one category tested exactly the boundary. The paginator returned an empty page when the total count was an exact multiple of the page size. The PostToolUse hook saw the patch, the Stop hook ran the new test, the test failed, the agent fixed the slice. R15 is the rule that earns its keep on bugs like this.
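One plausible shape for this class of bug, sketched with hypothetical helpers (the actual paginator in the case study may have failed differently): computing the page count as floor division plus one invents an extra page exactly when the total is a multiple of the page size.

```python
def page_count(total, page_size):
    """Number of pages needed for `total` items (ceiling division)."""
    return -(-total // page_size)


def page_count_buggy(total, page_size):
    """Off-by-one variant: reports a trailing empty page on exact multiples."""
    return total // page_size + 1


def get_page(items, page, page_size):
    """Return the 1-indexed page as a list slice."""
    start = (page - 1) * page_size
    return items[start:start + page_size]
```

With 10 items and a page size of 5, the buggy count reports 3 pages and get_page(items, 3, 5) returns an empty list. With 11 items both functions agree on 3, which is why tests away from the boundary pass.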

The full set is at the case studies page. The categories that recur most often are boundary inputs, time and locale edges, and partial failures. The R12 classification labels them as real_bug (vs test_bug or environment) so the agent knows to fix the production code rather than rewrite the test.

Non-obvious behaviors

The Codex hook layer has its share of edge cases. The ones that cost me time:

SessionStart does not fire on resume. If the user resumes a Codex session from history, you get PostToolUse and Stop events but no SessionStart. Tailtest’s session_start.py writes initial runner config, so on a resumed session the config has to be lazy-loaded by the PostToolUse hook instead. We check for .tailtest/session.json presence and run a lightweight init if it is missing.
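The presence check can be sketched like this. The function name ensure_session and the placeholder config are assumptions; this post does not describe the real schema of session.json.

```python
import json
from pathlib import Path


def ensure_session(project_root, defaults=None):
    """Return (config, created); writes .tailtest/session.json only if missing."""
    session_file = Path(project_root) / ".tailtest" / "session.json"
    if session_file.exists():
        return json.loads(session_file.read_text()), False
    # Placeholder config: the real file's contents are not specified here.
    config = dict(defaults) if defaults else {"runner": "pytest"}
    session_file.parent.mkdir(parents=True, exist_ok=True)
    session_file.write_text(json.dumps(config))
    return config, True
```

Because the check is a single exists() call on the common path, running it at the top of every PostToolUse fire adds little to the hook's latency.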

tool_result.success lies for shell calls that wrote files before exiting non-zero. A shell command that wrote a file and then failed at the last step shows success: false but the file is on disk. Tailtest’s mtime sweep does not look at success; it scans the tree regardless. This is intentional. We would rather queue a file unnecessarily than miss an edit.

apply_patch can include zero-context hunks. The Codex patch format allows hunks without surrounding context lines. The parser has to handle this without anchoring to context. The reference parser at hooks/lib/scanner.py is the source of truth on the corner cases.

additionalContext longer than a few hundred tokens tends to get lost under context-window pressure. Even if Codex CLI accepts the full envelope, the agent may not act on a long summary. The structured one-line form (tailtest: tests passed=10 failed=2 classified=real_bug,test_bug) is what survives the context squeeze. Long prose summaries get ignored.
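A formatter for that one-line form might look like the sketch below. The function name and the max_chars cap are assumptions; a character limit is only a rough proxy for a token budget.

```python
def one_line_summary(passed, failed, labels, max_chars=200):
    """Format the compact summary line, capped at max_chars characters."""
    text = f"tailtest: tests passed={passed} failed={failed}"
    # Deduplicate labels while keeping first-seen order.
    unique = list(dict.fromkeys(labels))
    if unique:
        text += " classified=" + ",".join(unique)
    return text[:max_chars]
```

Keeping the counts at the front of the line means they survive even if the cap cuts off the label list.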

How it fits with the other agents

Tailtest ships across four agents: Claude Code, Cursor, Codex CLI, and Cline. The four share the R1-R15 rule layer, the R12 three-label failure classifier, and the 8-category R15 adversarial pass (boundary inputs, format and injection, type confusion, concurrent state, time and locale edges, partial failures, resource exhaustion, off-by-one). The per-agent code is the hook entry point and the payload parser; everything past the scanner is shared.

The architectural argument is in hook-based testing explained. The per-agent posts are at:

Claude Code: the PostToolUse deep-dive
Cursor: the afterFileEdit deep-dive
Cline: the MCP-and-clinerules deep-dive (forthcoming)

The plugins keep cross-coverage honest. We currently run 1,234 tests across the four plugins, and a shared-scanner regression has to clear the test suite in all four. That is what keeps the R12 and R15 behavior identical regardless of which agent the user is on.

How to wire your own

The minimum viable PostToolUse hook for Codex CLI, without tailtest, looks like:

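#!/usr/bin/env bash
# hooks/post_tool_use.sh
payload="$(cat)"
# Exit quietly if the payload is not valid JSON or has no tool name.
tool="$(printf '%s' "$payload" | jq -r '.tool_name // empty' 2>/dev/null)" || exit 0
case "$tool" in
  apply_patch|patch)
    # Parse the patch here if you only want to run tests for touched files.
    # --testmon requires the pytest-testmon plugin.
    # Send pytest output to stderr so stdout carries only the JSON envelope.
    pytest --testmon -q 1>&2
    code=$?
    ;;
  *) exit 0 ;;
esac
printf '{"hookSpecificOutput":{"additionalContext":"pytest exit code %d"}}\n' "$code"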

Wire it into hooks.json:

{
  "hooks": {
    "PostToolUse": [
      { "matcher": ".*", "hooks": [{ "type": "command", "command": "bash ./hooks/post_tool_use.sh" }] }
    ]
  }
}

That is the MVP. Under thirty lines including the JSON. Tailtest adds the patch parser, the mtime fallback, the R12 classifier, the R15 adversarial pass, the structured report at .tailtest/reports/latest.json, and the four-agent abstraction. If you only need pytest on every Codex apply_patch, you do not need a framework. Tailtest is MIT licensed, ships no telemetry, and requires no SaaS account. The installer is uvx tailtest install --agent codex. The Codex solution page walks through the full integration.

FAQ

What is the Codex CLI PostToolUse hook?

PostToolUse is a Codex CLI hook event that fires after every tool call (apply_patch, shell, and others). It is configured in the plugin’s hooks/hooks.json and receives a JSON payload on stdin with the tool name, tool input, and tool result.

When did tailtest-codex add PostToolUse support?

v4.9.0, shipped May 2026. Earlier versions used only SessionStart and Stop. The PostToolUse addition gives the agent per-edit feedback inside the same turn.

What is the apply_patch parser?

Codex CLI uses a custom patch format with *** Begin Patch, *** Update File, *** Add File, and *** Delete File headers. Tailtest parses it in hooks/lib/scanner.py via extract_files_from_patch, which returns the set of files touched by the patch.

What does additionalContext do?

It is the sanctioned channel for a PostToolUse hook to surface context to the agent without blocking the current turn. The hook prints {"hookSpecificOutput": {"additionalContext": "..."}} to stdout, and Codex CLI surfaces the message on the next turn.

How does Codex CLI’s PostToolUse compare to Claude Code’s?

The naming was borrowed deliberately. Both fire after tool calls and both can surface context back to the agent. The differences: Codex CLI uses a custom patch format that needs a dedicated parser, and the context channel is the hookSpecificOutput.additionalContext envelope rather than raw stdout.