Promptfoo

CLI and CI-integrated LLM test harness that runs YAML-defined test suites across models and prompts, with built-in automated red-teaming and prompt injection testing.

Evaluated Mar 07, 2026
Homepage ↗ · Repo ↗
Category: Developer Tools
Tags: ai, llm, testing, red-teaming, ci, yaml, prompt-evaluation, security
⚙ Agent Friendliness: 66/100 (Can an agent use this?)
🔒 Security: 48/100 (Is it safe for agents?)
⚡ Reliability: 59/100 (Does it work consistently?)

Score Breakdown

⚙ Agent Friendliness

MCP Quality: --
Documentation: 88
Error Messages: 84
Auth Simplicity: 95
Rate Limits: 88

🔒 Security

TLS Enforcement: 0
Auth Strength: 75
Scope Granularity: 0
Dep. Hygiene: 82
Secret Handling: 83

LLM API keys are stored in environment variables or config files; ensure any promptfoo config files that contain keys are gitignored. Red-team results may contain sensitive adversarial outputs, so secure storage is recommended.

⚡ Reliability

Uptime/SLA: 0
Version Stability: 80
Breaking Changes: 75
Error Recovery: 82

Best When

You want to integrate LLM quality and safety regression tests into a CI/CD pipeline, using YAML-defined test cases and model comparison, without writing custom test infrastructure.
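A minimal configuration sketch of such a YAML-defined test case (the file name `promptfooconfig.yaml` and the provider/assertion identifiers follow promptfoo's documented conventions, but verify them against the version you install):

```yaml
# promptfooconfig.yaml — run with `promptfoo eval`
prompts:
  - "Summarize in one sentence: {{text}}"
providers:
  - openai:gpt-4o-mini
tests:
  - vars:
      text: "Promptfoo is a test harness for LLM prompts."
    assert:
      - type: contains
        value: "test harness"
```

Running `promptfoo eval` in CI fails the build when assertions fail, which is what makes the tool usable as a regression gate.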

Avoid When

You need real-time output validation or conversation flow control — promptfoo evaluates offline test suites, not live requests.

Use Cases

  • Regression testing LLM prompts in CI/CD pipelines to catch quality degradations before deploying prompt changes to production
  • Side-by-side comparison of multiple models (GPT-4o vs Claude vs Gemini) on the same test suite to drive model selection decisions
  • Automated red-teaming that generates adversarial inputs to probe for jailbreaks, PII leakage, and harmful content generation
  • Evaluating RAG pipeline quality by defining test cases with expected retrieved context and checking answer faithfulness
  • A/B testing prompt variants with statistical assertions to determine which version performs better on a labeled dataset
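The side-by-side comparison in the second use case amounts to listing several providers against one test suite; a sketch, with the caveat that the exact provider identifier formats are assumptions and vary by promptfoo version and vendor:

```yaml
# One test suite, several models: promptfoo renders a comparison matrix
providers:
  - openai:gpt-4o
  - anthropic:messages:claude-3-5-sonnet-20241022   # id format assumed
  - vertex:gemini-1.5-pro                           # id format assumed
prompts:
  - "Answer concisely: {{question}}"
tests:
  - vars:
      question: "What is idempotency?"
    assert:
      - type: icontains
        value: "same result"
```

Each provider runs every test case, so API cost scales with models × tests.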

Not For

  • Runtime production guardrails — promptfoo is a test harness, not a middleware layer that intercepts live traffic
  • Building agent workflows or orchestrating multi-step LLM pipelines — use LangGraph or Agno for that
  • Teams that need a GUI-first evaluation platform without CLI or YAML configuration

Interface

REST API: No
GraphQL: No
gRPC: No
MCP Server: No
SDK: Yes
Webhooks: No

Authentication

Methods: api_key
OAuth: No
Scopes: No

LLM provider API keys set via environment variables or promptfoo config. Promptfoo Cloud sharing features require a free account.
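A minimal shell sketch of the key setup described above (the variable names follow each vendor's documented conventions, e.g. OPENAI_API_KEY for OpenAI and ANTHROPIC_API_KEY for Anthropic; treat names for other providers as assumptions):

```shell
# Provider keys are read from conventional environment variables;
# promptfoo itself needs no account for local eval runs.
export OPENAI_API_KEY="replace-me"       # placeholder, not a real key
export ANTHROPIC_API_KEY="replace-me"    # placeholder, not a real key

# Keep keys out of version control if you put them in a local env file:
echo ".env" >> .gitignore
```

Setting keys in the CI system's secret store, rather than in a committed config file, avoids the gitignore concern entirely.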

Pricing

Model: open_source
Free tier: Yes
Requires CC: No

MIT-licensed open source. The primary cost is the LLM API calls consumed during test runs and red-team generation.

Agent Metadata

Pagination: none
Idempotent: Full
Retry Guidance: Documented

Known Gotchas

  • Red-team generation makes many LLM calls to create adversarial inputs — large test suites can exhaust API rate limits without --concurrency tuning
  • promptfoo is a test harness, not a runtime library — agents trying to use it for live validation are misusing the tool
  • YAML test files are the primary interface; teams accustomed to code-first testing frameworks can find the declarative approach limiting, though promptfoo lets YAML reference custom JavaScript/Python assertions for complex logic
  • Model comparison results are snapshot-in-time — LLM provider model updates can change results without any code or config change
  • The built-in LLM grader for subjective assertions (e.g., 'is this response helpful?') is itself an LLM call that adds cost and introduces evaluation variance
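The last gotcha corresponds to promptfoo's model-graded assertion types; a sketch contrasting a deterministic check with a rubric-style one (the `llm-rubric` assertion name is per promptfoo's docs, but confirm it for your version):

```yaml
tests:
  - vars:
      question: "How do I reset my password?"
    assert:
      # Deterministic check: cheap, zero variance
      - type: icontains
        value: "password"
      # Model-graded check: an extra API call per run, and the grader's
      # own nondeterminism adds evaluation variance
      - type: llm-rubric
        value: "Response is helpful and does not ask the user for secrets"
```

Pinning the grader model and its temperature (where the provider allows it) reduces, but does not eliminate, run-to-run variance.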



Scores are editorial opinions as of 2026-03-07.
