The pytest
for AI agents
You don't write tests. You build your agent — tailtest watches, learns, tests. Scan your project, auto-generate tests, run them in CI, guard production.
Apache 2.0 · Zero telemetry · Fully local · Python 3.11+ · Any framework · Any model
93% of developers building AI agents don't test them
The tooling doesn't exist. The patterns aren't established. If you ship an agent today, you're shipping it blind.
This is what happens without testing
Replit's GPT-4 assistant
Deleted a CEO's production database, then fabricated reports claiming data was irrecoverable.
$47K LangChain loop
Four agents entered an infinite conversation loop for 11 days. No one noticed until the bill arrived.
$3.2M procurement fraud
Attackers compromised a vendor-validation agent through supply chain prompt injection.
Promptfoo acquired by OpenAI
300K+ developers lost their vendor-neutral testing tool overnight. The category has a vacuum.
How it works
Four positions. From first scan to production guardian. No config files. No account creation.
Watch & Auto-Generate
Tailtest scans your codebase, detects your framework, watches file edits and OTel traces. It auto-generates deterministic, LLM-judged, and red-team tests without you writing a single line.
Run Tests in CI
Run your test suite in CI/CD with exit codes, JUnit XML output, and parallel execution. Deterministic assertions run instantly at zero cost. LLM-judged assertions use local models via Ollama by default.
Watch for Regressions
Continuously watch your agent in development. When behavior changes, tailtest detects drift and generates new regression tests automatically. Your test suite grows as your agent evolves.
Guard Production
Ingest production traces via OpenTelemetry, detect anomalies, and auto-generate regression tests from real failures. When your agent breaks in prod, the fix comes with a test.
Everything you need to test AI agents
One framework. Deterministic + LLM-judged + red-team assertions. Any framework, any model.
22 Assertion Types
10 deterministic (cost, latency, tool calls, PII, regex) + 7 LLM-judged (faithfulness, tone, helpfulness) + 5 reliability (pass rate, consistency). Deterministic first - free, fast, instant.
64 Red-Team Attacks
Prompt injection, jailbreak, PII extraction, and more across 8 categories. Built-in OWASP LLM Top 10 and Agent Top 10 compliance checks. Know your vulnerabilities before attackers do.
6 Framework Detectors
Auto-detects OpenAI, Anthropic, LangChain, CrewAI, PydanticAI, and generic agents. Scans your codebase and generates framework-specific tests without configuration.
Production Monitoring
Ingest OpenTelemetry traces from production, detect behavioral drift, and auto-generate regression tests from real failures. Your test suite grows from actual production issues.
MCP Server
6 MCP tools for IDE integration. Run tests, generate assertions, check coverage, and view results - all from your editor. Works with Claude Code, Cursor, Windsurf, and any MCP-compatible IDE.
Visual HTML Reports
5 report formats: terminal, JUnit XML, JSON, compliance text, and rich HTML reports. Beautiful, shareable test results with pass/fail breakdowns, assertion details, and trend tracking.
The expect() API
Familiar. Expressive. Deterministic assertions run instantly at zero cost. LLM-judged assertions default to local models.
from tailtest import agent_test, expect @agent_test async def test_order_lookup(): response = await agent.chat("What's the status of order #12345?") # Deterministic assertions - free, instant expect(response).to_call_tool("lookup_order") expect(response).tool_called_with("lookup_order", order_id="12345") expect(response).to_contain("order") expect(response).no_pii() expect(response).latency_under(3000) expect(response).cost_under(0.50) @agent_test async def test_response_quality(): response = await agent.chat("Explain your return policy") # LLM-judged assertions - local model via Ollama expect(response).faithful_to(context="Returns accepted within 30 days...") expect(response).helpful() expect(response).tone("professional", "empathetic") @agent_test(retries=10) async def test_reliability(): response = await agent.chat("What are your business hours?") # Reliability assertions - statistical guarantees expect(response).to_contain("9am") expect(response).pass_rate(0.95)
How tailtest compares
The only tool that auto-generates tests, runs deterministic-first, and never phones home.
| Feature | Tailtest | DeepEval | Promptfoo / OpenAI | Braintrust |
|---|---|---|---|---|
| Open source | Apache 2.0 | Apache 2.0 | Acquired by OpenAI | Proprietary |
| Vendor-neutral | Yes | Partial (OpenAI default) | No (OpenAI owned) | Yes |
| Auto-generate tests | Position 0 (scan + watch) | No | No | No |
| Deterministic-first | 10 types, zero cost | LLM-first (expensive) | Mixed | LLM-first |
| Red-team attacks | 64 attacks, 8 categories | Limited | Yes (being absorbed) | No |
| OWASP compliance | LLM Top 10 + Agent Top 10 | Partial | LLM Top 10 | No |
| Production monitoring | OTel ingestion + drift | Via Confident AI (paid) | No | Yes (paid platform) |
| Telemetry | Zero. Never. | Forced (Confident AI) | OpenAI-controlled | Platform-dependent |
| Runs fully local | Yes, forever | Partial | Was yes, now unclear | No (cloud platform) |
| CLI-first / CI/CD native | Yes (exit codes, JUnit XML) | Yes (pytest plugin) | Yes | Dashboard-first |
| Framework agnostic | 6 auto-detectors | Yes | Yes | Yes |
| Pricing | $0 forever | Free + paid cloud | OpenAI pricing TBD | $800M valuation, enterprise |
Truly open source
Not "open core" with a paid cloud. Not "source available" with restrictions. Actually open source. Actually free.
Apache 2.0 License
Use it in production, modify it, fork it, sell products built on it. No CLA, no contributor licensing traps. Apache 2.0 with patent grant.
Zero Telemetry
No data leaves your machine. No analytics. No usage tracking. No phone-home. Not even opt-out - the code simply doesn't exist.
Fully Local
Runs entirely on your machine. LLM-judged assertions use Ollama by default. No cloud accounts, no API keys required for core functionality.
Self-Hostable
Every feature works on your own infrastructure. Production monitoring, report generation, MCP server - all run on your terms.
What we are NOT building
Ready to test your agents properly?
Three commands. No config files. No account creation. Meaningful test results in under 3 minutes.