Open Source AI Agent Testing

The pytest
for AI agents

You don't write tests. You build your agent - tailtest watches, learns, tests. Scan your project, auto-generate tests, run them in CI, guard production.

Apache 2.0 · Zero telemetry · Fully local · Python 3.11+ · Any framework · Any model

Terminal
$ pip install tailtester
$ tailtest scan .
Detected: OpenAI Agents SDK, 3 agent files, 12 tool calls
Generated: 8 deterministic tests, 4 LLM-judge tests, 2 red-team suites
$ tailtest run
14 passed, 0 failed, 0 skipped (2.11s)

93% of developers building AI agents don't test them

The tooling doesn't exist. The patterns aren't established. If you ship an agent today, you're shipping it blind.

93%
of developers don't test AI agents
29%
trust AI output accuracy (down from 40%)
94.4%
of agents are vulnerable to prompt injection
85%
per-step accuracy compounds to roughly 20% success across a 10-step workflow
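The last figure is simple compounding: if each step succeeds independently with probability p, an n-step workflow only succeeds when every step does. A quick sketch (illustrative arithmetic, not tailtest code):

```python
# Per-step accuracy compounds multiplicatively: an n-step chain of
# independent steps succeeds with probability p ** n.
def workflow_success_rate(per_step_accuracy: float, steps: int) -> float:
    return per_step_accuracy ** steps

rate = workflow_success_rate(0.85, 10)
print(f"{rate:.1%}")  # 19.7% - roughly one successful run in five
```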

This is what happens without testing

Replit's GPT-4 assistant

Deleted a CEO's production database, then fabricated reports claiming the data was irrecoverable.

$47K LangChain loop

Four agents got stuck in a conversation loop that ran for 11 days. No one noticed until the bill arrived.

$3.2M procurement fraud

Attackers compromised a vendor-validation agent through supply chain prompt injection.

Promptfoo acquired by OpenAI

300K+ developers lost their vendor-neutral testing tool overnight, leaving a vacuum in the category.

22
Assertion Types
64
Red-Team Attacks
6
Framework Detectors
13
CLI Commands
$0
Forever

How it works

Four positions. From first scan to production guardian. No config files. No account creation.

0

Watch & Auto-Generate

Tailtest scans your codebase, detects your framework, watches file edits and OTel traces. It auto-generates deterministic, LLM-judged, and red-team tests without you writing a single line.

$ tailtest scan .
Detected: LangChain, 5 agents, 23 tool calls
Generated: 12 tests (8 deterministic, 4 LLM-judge)
1

Run Tests in CI

Run your test suite in CI/CD with exit codes, JUnit XML output, and parallel execution. Deterministic assertions run instantly at zero cost. LLM-judged assertions use local models via Ollama by default.

$ tailtest run --ci --output junit.xml
22 passed, 1 failed, 0 skipped (3.4s)
JUnit XML written to junit.xml
2

Watch for Regressions

Continuously watch your agent in development. When behavior changes, tailtest detects drift and generates new regression tests automatically. Your test suite grows as your agent evolves.

$ tailtest watch
Watching 5 agent files...
Drift detected: tool_selector changed behavior
Generated: 2 new regression tests
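To make drift detection concrete, here is a minimal sketch of one possible signal - comparing the tool-call sequence of a baseline trace against the latest run. The function name and threshold are illustrative, not tailtest internals:

```python
from difflib import SequenceMatcher

def tool_call_drift(baseline: list[str], latest: list[str],
                    threshold: float = 0.8) -> bool:
    # Flag drift when the ordered tool-call sequences fall below
    # `threshold` similarity (SequenceMatcher compares any sequences
    # of hashable items).
    similarity = SequenceMatcher(None, baseline, latest).ratio()
    return similarity < threshold

baseline = ["classify_intent", "lookup_order", "format_reply"]
latest = ["classify_intent", "web_search", "format_reply"]
print(tool_call_drift(baseline, latest))  # True: tool selection changed
```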
3

Guard Production

Ingest production traces via OpenTelemetry, detect anomalies, and auto-generate regression tests from real failures. When your agent breaks in prod, the fix comes with a test.

$ tailtest guard --otel-endpoint http://localhost:4318
Ingesting production traces...
Anomaly: latency spike on order_lookup (p99: 8.2s)
Generated: 1 regression test from production failure

Everything you need to test AI agents

One framework. Deterministic + LLM-judged + red-team assertions. Any framework, any model.

22 Assertion Types

10 deterministic (cost, latency, tool calls, PII, regex) + 7 LLM-judged (faithfulness, tone, helpfulness) + 5 reliability (pass rate, consistency). Deterministic first - free, fast, instant.

64 Red-Team Attacks

Prompt injection, jailbreak, PII extraction, and more across 8 categories. Built-in OWASP LLM Top 10 and Agent Top 10 compliance checks. Know your vulnerabilities before attackers do.

6 Framework Detectors

Auto-detects OpenAI, Anthropic, LangChain, CrewAI, PydanticAI, and generic agents. Scans your codebase and generates framework-specific tests without configuration.

Production Monitoring

Ingest OpenTelemetry traces from production, detect behavioral drift, and auto-generate regression tests from real failures. Your test suite grows from actual production issues.

MCP Server

6 MCP tools for IDE integration. Run tests, generate assertions, check coverage, and view results - all from your editor. Works with Claude Code, Cursor, Windsurf, and any MCP-compatible IDE.

Visual HTML Reports

5 report formats: terminal, JUnit XML, JSON, compliance text, and rich HTML reports. Beautiful, shareable test results with pass/fail breakdowns, assertion details, and trend tracking.

The expect() API

Familiar. Expressive. Deterministic assertions run instantly at zero cost. LLM-judged assertions default to local models.

test_my_agent.py
from tailtest import agent_test, expect

@agent_test
async def test_order_lookup():
    response = await agent.chat("What's the status of order #12345?")

    # Deterministic assertions  -  free, instant
    expect(response).to_call_tool("lookup_order")
    expect(response).tool_called_with("lookup_order", order_id="12345")
    expect(response).to_contain("order")
    expect(response).no_pii()
    expect(response).latency_under(3000)
    expect(response).cost_under(0.50)

@agent_test
async def test_response_quality():
    response = await agent.chat("Explain your return policy")

    # LLM-judged assertions  -  local model via Ollama
    expect(response).faithful_to(context="Returns accepted within 30 days...")
    expect(response).helpful()
    expect(response).tone("professional", "empathetic")

@agent_test(retries=10)
async def test_reliability():
    response = await agent.chat("What are your business hours?")

    # Reliability assertions  -  statistical guarantees
    expect(response).to_contain("9am")
    expect(response).pass_rate(0.95)
10
Deterministic
Free. Instant. No LLM needed.
7
LLM-Judged
Local model via Ollama by default.
5
Reliability
Statistical pass rates over N runs.
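Under the hood, a reliability assertion amounts to repeating a check and comparing the observed pass fraction to a threshold. A minimal sketch of the idea - `pass_rate` and `business_hours_check` here are hypothetical names, not tailtest internals:

```python
import asyncio

async def pass_rate(check, runs: int, threshold: float) -> bool:
    # Run the same check N times concurrently, then require the
    # observed pass fraction to meet the threshold.
    results = await asyncio.gather(*(check() for _ in range(runs)))
    return sum(results) / runs >= threshold

async def business_hours_check() -> bool:
    # Stand-in for calling the agent and asserting "9am" appears
    # in the response.
    return True

ok = asyncio.run(pass_rate(business_hours_check, runs=10, threshold=0.95))
```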

How tailtest compares

The only tool that auto-generates tests, runs deterministic-first, and never phones home.

Open source
Tailtest Apache 2.0
DeepEval Apache 2.0
Promptfoo Acquired by OpenAI
Braintrust Proprietary
Vendor-neutral
Tailtest Yes
DeepEval Partial (OpenAI default)
Promptfoo No (OpenAI owned)
Braintrust Yes
Auto-generate tests
Tailtest Position 0 (scan + watch)
DeepEval No
Promptfoo No
Braintrust No
Deterministic-first
Tailtest 10 types, zero cost
DeepEval LLM-first (expensive)
Promptfoo Mixed
Braintrust LLM-first
Red-team attacks
Tailtest 64 attacks, 8 categories
DeepEval Limited
Promptfoo Yes (being absorbed)
Braintrust No
OWASP compliance
Tailtest LLM Top 10 + Agent Top 10
DeepEval Partial
Promptfoo LLM Top 10
Braintrust No
Production monitoring
Tailtest OTel ingestion + drift
DeepEval Via Confident AI (paid)
Promptfoo No
Braintrust Yes (paid platform)
Telemetry
Tailtest Zero. Never.
DeepEval Forced (Confident AI)
Promptfoo OpenAI-controlled
Braintrust Platform-dependent
Runs fully local
Tailtest Yes, forever
DeepEval Partial
Promptfoo Was yes, now unclear
Braintrust No (cloud platform)
CLI-first / CI/CD native
Tailtest Yes (exit codes, JUnit XML)
DeepEval Yes (pytest plugin)
Promptfoo Yes
Braintrust Dashboard-first
Framework agnostic
Tailtest 6 auto-detectors
DeepEval Yes
Promptfoo Yes
Braintrust Yes
Pricing
Tailtest $0 forever
DeepEval Free + paid cloud
Promptfoo OpenAI pricing TBD
Braintrust $800M valuation, enterprise

Truly open source

Not "open core" with a paid cloud. Not "source available" with restrictions. Actually open source. Actually free.

Apache 2.0 License

Use it in production, modify it, fork it, sell products built on it. No CLA, no contributor licensing traps. Apache 2.0 with patent grant.

Zero Telemetry

No data leaves your machine. No analytics. No usage tracking. No phone-home. Nothing to opt out of - the tracking code simply doesn't exist.

Fully Local

Runs entirely on your machine. LLM-judged assertions use Ollama by default. No cloud accounts, no API keys required for core functionality.

Self-Hostable

Every feature works on your own infrastructure. Production monitoring, report generation, MCP server - all run on your terms.

What we are NOT building

- Not a dashboard-first enterprise product (that's Braintrust)
- Not a framework-specific tool (that's LangSmith)
- Not a security-only scanner (that's Promptfoo/OpenAI now)
- Not a cloud-required service (runs fully local, forever)

Ready to test your agents properly?

Three commands. No config files. No account creation. Meaningful test results in under 3 minutes.

Quick Start
# Install
$ pip install tailtester
# Scan your project and auto-generate tests
$ tailtest scan .
# Run all tests
$ tailtest run