{"id":"accenture-mcp-bench","name":"mcp-bench","af_score":49.0,"security_score":46.2,"reliability_score":32.5,"what_it_does":"MCP-Bench is an evaluation framework that benchmarks tool-using LLM agents on complex tasks via the Model Context Protocol (MCP). It orchestrates discovery of and connection to multiple MCP servers, runs benchmark tasks in single- and multi-server settings, and evaluates outputs (including LLM-as-judge scoring).","best_when":"You want repeatable, local benchmarking of MCP tool-using agents across many heterogeneous tool servers, with configurable model providers and judge-based scoring.","avoid_when":"You cannot safely provision and manage multiple third-party API keys, you need a minimal dependency surface, or you require strong guarantees that tool calls and logs won't expose sensitive data.","last_evaluated":"2026-03-30T13:27:52.586183+00:00","has_mcp":false,"has_api":false,"auth_methods":["Environment variables for LLM provider API keys (OPENROUTER_API_KEY, AZURE_OPENAI_API_KEY/ENDPOINT)","Environment-file based API keys for MCP servers (./mcp_servers/api_key)"],"has_free_tier":false,"known_gotchas":["The judge model required to reproduce results is hard-coded (o4-mini is mentioned), which can surprise automated harnesses.","The benchmark relies on many external MCP servers (28 listed), so failures may be tool- or provider-specific rather than framework-specific.","Sensitive API keys are required for several MCP servers; misconfiguration can lead to partial server availability.","Multi-server task settings increase the complexity of tool routing and multiply failure modes."],"error_quality":0.0}