{
  "id": "open-compass-opencompass",
  "name": "opencompass",
  "homepage": "https://opencompass.org.cn/",
  "repo_url": "https://github.com/open-compass/opencompass",
  "category": "ai-ml",
  "subcategories": [],
  "tags": ["ai-ml", "benchmark", "evaluation", "llm", "python", "open-source", "cli", "batch-processing"],
  "what_it_does": "OpenCompass is an open-source LLM evaluation platform. It provides configurable evaluation pipelines (via CLI and Python scripts) to run model benchmarks across many datasets, including support for local/open-source models and API-based models (e.g., OpenAI/Qwen), with optional inference acceleration backends (e.g., vLLM, LMDeploy).",
  "use_cases": [
    "Running standardized and custom LLM benchmark suites across many datasets",
    "Evaluating chat/reasoning/long-context/multimodal-style tasks using configurable evaluators and scoring pipelines",
    "Integrating LLM-as-judge and other evaluation strategies (e.g., math verification)",
    "Reproducing leaderboard-style results via provided example configs/scripts",
    "Testing model outputs with different inference backends for speed/cost tradeoffs"
  ],
  "not_for": [
    "Building an end-user hosted web service API for evaluation without running your own infrastructure (it is primarily a local/batch evaluation framework)",
    "Real-time interactive evaluation/serving where you need low-latency request/response APIs",
    "Environments that require strict data residency/compliance guarantees without careful deployment controls"
  ],
  "best_when": "You need reproducible offline/batch evaluation of LLMs with configurable datasets and scoring logic, and you are comfortable running Python tooling with model/dataset dependencies.",
  "avoid_when": "You need a simple SaaS-style API with documented HTTP endpoints, or you want turnkey managed hosting without handling datasets, credentials, and compute yourself.",
  "alternatives": [
    "lm-evaluation-harness (EleutherAI)",
    "promptfoo",
    "HELM (holistic evaluation)",
    "LangSmith/LangChain evaluation tooling",
    "OpenAI Evals (for OpenAI-focused workflows)",
    "BentoML and similar model-serving tools (not direct substitutes; prefer a dedicated evaluation framework)"
  ],
  "af_score": 36.2,
  "security_score": 48.8,
  "reliability_score": 32.5,
  "package_type": "skill",
  "discovery_source": ["openclaw"],
  "priority": "high",
  "status": "evaluated",
  "version_evaluated": null,
  "last_evaluated": "2026-03-29T14:56:44.992714+00:00",
  "interface": {
    "has_rest_api": false,
    "has_graphql": false,
    "has_grpc": false,
    "has_mcp_server": false,
    "mcp_server_url": null,
    "has_sdk": false,
    "sdk_languages": [],
    "openapi_spec_url": null,
    "webhooks": false
  },
  "auth": {
    "methods": ["Environment-variable-based API key for OpenAI-style API model evaluation (e.g., OPENAI_API_KEY)"],
    "oauth": false,
    "scopes": false,
    "notes": "Authentication details appear to be primarily delegated to the underlying model backend (e.g., API keys for API models) rather than an OpenCompass-managed auth system."
  },
  "pricing": {
    "model": null,
    "free_tier_exists": false,
    "free_tier_limits": null,
    "paid_tiers": [],
    "requires_credit_card": false,
    "estimated_workload_costs": null,
    "notes": "OpenCompass itself is open-source; costs mainly come from compute and any external API model usage (not specified in the provided README excerpt)."
  },
  "requirements": {
    "requires_signup": false,
    "requires_credit_card": false,
    "domain_verification": false,
    "data_residency": [],
    "compliance": [],
    "min_contract": null
  },
  "agent_readiness": {
    "af_score": 36.2,
    "security_score": 48.8,
    "reliability_score": 32.5,
    "mcp_server_quality": 0.0,
    "documentation_accuracy": 55.0,
    "error_message_quality": 0.0,
    "error_message_notes": null,
    "auth_complexity": 70.0,
    "rate_limit_clarity": 0.0,
    "tls_enforcement": 80.0,
    "auth_strength": 55.0,
    "scope_granularity": 10.0,
    "dependency_hygiene": 40.0,
    "secret_handling": 55.0,
    "security_notes": "The README indicates use of API keys via environment variables for API model evaluation, which is generally better than hardcoding. However, there is no provided evidence here of fine-grained scopes, explicit secret-logging protections, or secure-by-default handling/error redaction. TLS enforcement and dependency hygiene cannot be fully verified from the excerpt; the score reflects typical expectations for Python tooling but with limited observable documentation.",
    "uptime_documented": 0.0,
    "version_stability": 60.0,
    "breaking_changes_history": 40.0,
    "error_recovery": 30.0,
    "idempotency_support": false,
    "idempotency_notes": null,
    "pagination_style": "none",
    "retry_guidance_documented": false,
    "known_agent_gotchas": [
      "Primarily a CLI/Python batch workflow, so an agent must orchestrate runs rather than call a stable request/response API",
      "Large dependency surface (datasets, models, acceleration backends) can cause environment-specific failures; an agent may need careful setup/installation extras",
      "Authentication is backend-specific (e.g., API keys in env vars); agents should avoid logging secrets and ensure the correct environment variables are set",
      "Evaluation results/reproducibility depend heavily on configuration files and dataset/model versions; changes in config structure (a breaking change was noted around 0.4.0) can break automation"
    ]
  }
}