Promptfoo
CLI and CI-integrated LLM test harness that runs YAML-defined test suites across models and prompts, with built-in automated red-teaming and prompt injection testing.
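The YAML-driven workflow centers on a single config file. A minimal sketch is below; the model IDs, variables, and assertion values are illustrative, not taken from the source:

```yaml
# promptfooconfig.yaml: a minimal, illustrative test suite
description: "Summarization prompt regression tests"

prompts:
  - "Summarize in one sentence: {{article}}"

providers:
  - openai:gpt-4o-mini
  - anthropic:messages:claude-3-5-sonnet-20241022

tests:
  - vars:
      article: "Promptfoo is an open-source LLM test harness."
    assert:
      - type: icontains
        value: "test"
      - type: llm-rubric
        value: "Response is a single, faithful sentence"
```

Running `npx promptfoo@latest eval` then executes every prompt/provider/test combination and reports assertion pass/fail results.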
Score Breakdown
⚙ Agent Friendliness
🔒 Security
LLM API keys are stored in environment variables or config files; gitignore any promptfoo config files that contain keys. Red-team results may contain sensitive adversarial outputs, so secure storage is recommended.
⚡ Reliability
Best When
You want to integrate LLM quality and safety regression tests into a CI/CD pipeline with YAML-defined test cases and model comparison without writing custom test infrastructure.
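In CI this typically reduces to one command. A hedged GitHub Actions sketch follows; the workflow name, file paths, and secret names are assumptions:

```yaml
# .github/workflows/llm-tests.yml (illustrative)
name: llm-regression
on: [pull_request]
jobs:
  eval:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Run promptfoo suite
        run: npx promptfoo@latest eval -c promptfooconfig.yaml
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
```

`promptfoo eval` exits with a nonzero code when assertions fail, which should fail the CI job and block the prompt change from merging.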
Avoid When
You need real-time output validation or conversation flow control — promptfoo evaluates offline test suites, not live requests.
Use Cases
- Regression testing LLM prompts in CI/CD pipelines to catch quality degradations before deploying prompt changes to production
- Side-by-side comparison of multiple models (GPT-4o vs Claude vs Gemini) on the same test suite to drive model selection decisions
- Automated red-teaming that generates adversarial inputs to probe for jailbreaks, PII leakage, and harmful content generation
- Evaluating RAG pipeline quality by defining test cases with expected retrieved context and checking answer faithfulness
- A/B testing prompt variants with statistical assertions to determine which version performs better on a labeled dataset
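The red-teaming use case above is also configured declaratively. A sketch, assuming the plugin and strategy names shown here (check the promptfoo docs for the current catalog):

```yaml
# redteam section of promptfooconfig.yaml (plugin/strategy names illustrative)
redteam:
  purpose: "Customer-support assistant for a retail bank"
  plugins:
    - pii            # probe for PII leakage
    - harmful        # probe for harmful content generation
  strategies:
    - jailbreak
    - prompt-injection
```

The generated adversarial cases are then executed with `promptfoo redteam run`, and results feed the same reporting as ordinary evals.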
Not For
- Runtime production guardrails: promptfoo is a test harness, not a middleware layer that intercepts live traffic
- Building agent workflows or orchestrating multi-step LLM pipelines; use LangGraph or Agno for that
- Teams that need a GUI-first evaluation platform without CLI or YAML configuration
Interface
Authentication
LLM provider API keys are set via environment variables or in the promptfoo config. Promptfoo Cloud sharing features require a free account.
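Keys are picked up from the standard provider environment variables; a sketch (the key values are placeholders, and other providers follow the same pattern):

```shell
# Set provider keys in the environment rather than committing them to config
export OPENAI_API_KEY="sk-..."        # placeholder, not a real key
export ANTHROPIC_API_KEY="sk-ant-..." # placeholder, not a real key
npx promptfoo@latest eval
```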
Pricing
MIT-licensed open source. The primary costs are the LLM API calls consumed during test runs and red-team generation.
Agent Metadata
Known Gotchas
- ⚠ Red-team generation makes many LLM calls to create adversarial inputs; large test suites can exhaust API rate limits without concurrency tuning (the `-j`/`--max-concurrency` flag)
- ⚠ promptfoo is a test harness, not a runtime library; agents trying to use it for live validation are misusing the tool
- ⚠ YAML test files are the primary interface; teams accustomed to code-first testing frameworks find the YAML-only approach limiting for complex assertion logic
- ⚠ Model comparison results are a snapshot in time; LLM provider model updates can change results without any code or config change
- ⚠ The built-in LLM grader for subjective assertions (e.g., "is this response helpful?") is itself an LLM call that adds cost and introduces evaluation variance
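The rate-limit gotcha can be mitigated in config as well as on the CLI. A sketch assuming promptfoo's `evaluateOptions` block; the field names are worth verifying against current docs:

```yaml
# Throttling evals to stay under provider rate limits (field names assumed)
evaluateOptions:
  maxConcurrency: 2   # cap parallel provider calls
  delay: 1000         # assumed: milliseconds to wait between calls
```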
Scores are editorial opinions as of 2026-03-07.