DataHub

Open-source data catalog and metadata management platform. DataHub indexes metadata from databases, data warehouses, ETL tools, ML platforms, and BI tools to provide a unified interface for search, discovery, and lineage. Agents can query DataHub's GraphQL API to discover available datasets, understand schemas, trace data lineage, and check data quality status, which enables data-aware agent pipelines. DataHub originated at LinkedIn and is a widely adopted enterprise data catalog.
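As a rough illustration of dataset discovery through GraphQL, the sketch below builds a search request body and extracts URNs from a decoded response. The operation and field names (searchAcrossEntities, SearchAcrossEntitiesInput, searchResults) are assumptions based on DataHub's public GraphQL schema and should be checked against the schema your deployment serves. The helper names are hypothetical.

```python
import json

# GraphQL document for a catalog search. Operation and field names are
# assumptions; verify them against your deployment's GraphQL schema.
SEARCH_QUERY = """
query search($input: SearchAcrossEntitiesInput!) {
  searchAcrossEntities(input: $input) {
    total
    searchResults {
      entity {
        urn
        type
      }
    }
  }
}
"""


def build_search_payload(text, count=10, start=0):
    """Return the JSON-serializable body for a dataset search request."""
    return {
        "query": SEARCH_QUERY,
        "variables": {
            "input": {
                "types": ["DATASET"],
                "query": text,
                "start": start,
                "count": count,
            }
        },
    }


def extract_urns(response_body):
    """Pull entity URNs out of a decoded search response, tolerating gaps."""
    data = response_body.get("data") or {}
    results = (data.get("searchAcrossEntities") or {}).get("searchResults") or []
    return [r["entity"]["urn"] for r in results if r.get("entity")]
```

An agent would serialize the payload with json.dumps, POST it to the GraphQL endpoint, and pass the decoded response to extract_urns before deciding which tables to query.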

Evaluated Mar 06, 2026, v0.13+

Category: Developer Tools

Tags: data-catalog, lineage, metadata, open-source, governance, discovery, data-mesh

⚙ Agent Friendliness

Can an agent use this?

🔒 Security

Is it safe for agents?

⚡ Reliability

Does it work consistently?

Score Breakdown

⚙ Agent Friendliness

MCP Quality

Documentation

Error Messages

Auth Simplicity

Rate Limits

🔒 Security

TLS Enforcement

Auth Strength

Scope Granularity

Dep. Hygiene

Secret Handling

Apache 2.0 open source with an active security community. SOC 2 for DataHub Cloud. OIDC/SSO support. PAT-based API access. Self-hosted deployments keep full control of data residency. Fine-grained access control is available in the enterprise tier (DataHub Cloud).

⚡ Reliability

Uptime/SLA

Version Stability

Breaking Changes

Error Recovery

Best When

You have complex data infrastructure (multiple warehouses, ETL tools, ML platforms) and need agents to discover, understand, and trace data across your stack without hardcoding dataset locations.

Avoid When

Your data estate is small and well-documented — a simple data dictionary or dbt docs provides sufficient metadata without DataHub's operational overhead.

Use Cases

• Enable agents to discover available datasets by searching DataHub's catalog — 'find tables containing customer transaction data' before writing SQL
• Trace data lineage for agent pipeline debugging — understand upstream dependencies when agent outputs are incorrect
• Check dataset freshness and quality status via the DataHub API before agents consume data that may be stale
• Allow agents to understand schema context for unfamiliar tables by querying DataHub's schema registry
• Build self-documenting agent pipelines that register their data consumption in DataHub for governance and lineage tracking
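The lineage use case above could be approached along these lines. This is a minimal sketch: the searchAcrossLineage operation, its input type, and the degree field are assumptions drawn from DataHub's public GraphQL schema, and the helper names are hypothetical.

```python
# GraphQL document for an upstream lineage lookup. Operation and field
# names are assumptions; confirm them against your deployment's schema.
LINEAGE_QUERY = """
query upstream($input: SearchAcrossLineageInput!) {
  searchAcrossLineage(input: $input) {
    total
    searchResults {
      degree
      entity {
        urn
        type
      }
    }
  }
}
"""


def build_upstream_payload(urn, count=50):
    """Return the request body asking for everything upstream of `urn`."""
    return {
        "query": LINEAGE_QUERY,
        "variables": {
            "input": {
                "urn": urn,
                "direction": "UPSTREAM",
                "query": "*",
                "start": 0,
                "count": count,
            }
        },
    }


def upstream_by_degree(response_body):
    """Group upstream URNs by hop distance (1 = direct parent)."""
    data = response_body.get("data") or {}
    results = (data.get("searchAcrossLineage") or {}).get("searchResults") or []
    grouped = {}
    for result in results:
        entity = result.get("entity")
        if entity:
            grouped.setdefault(result.get("degree"), []).append(entity["urn"])
    return grouped
```

Grouping by degree lets an agent inspect direct parents first when debugging an incorrect output, then widen the search only if needed.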

Not For

• Real-time data quality enforcement — DataHub stores metadata about data, not the data itself; use dbt tests for quality enforcement
• Teams without existing data infrastructure — DataHub is valuable when you have many data sources to catalog; overkill for small setups
• Non-technical users needing a simple data dictionary — DataHub is powerful but has a steep operational learning curve

Interface

REST API

Yes

GraphQL

Yes

gRPC

MCP Server

SDK

Yes

Webhooks

Yes

Authentication

Methods: api_key, bearer_token

OAuth: Yes Scopes: Yes

Personal access tokens (PAT) for API access. OIDC/SSO for dashboard login. DataHub Cloud adds role-based access control. Tokens are created in DataHub settings and passed in the Authorization header as a Bearer token.
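A minimal sketch of attaching a personal access token to a GraphQL request, using only the Python standard library. The /api/graphql path and the localhost base URL in the usage note are assumptions about a typical deployment, and build_graphql_request is a hypothetical helper name.

```python
import json
import urllib.request


def build_graphql_request(base_url, token, payload):
    """Build (but do not send) an authenticated POST to the GraphQL endpoint.

    The "/api/graphql" path is an assumption; adjust it to your deployment.
    """
    url = base_url.rstrip("/") + "/api/graphql"
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
```

To send it, pass the result to urllib.request.urlopen(request, timeout=10) and decode the JSON body. For example, build_graphql_request("http://localhost:8080", token, payload) targets a hypothetical local instance. Keep the token in an environment variable or secret store rather than in source.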

Pricing

Model: open_source

Free tier: Yes

Requires CC: No

Apache 2.0 open source core is fully free. Production self-hosting typically runs on Kubernetes and requires significant DevOps investment. DataHub Cloud is the managed offering from Acryl Data, which also provides commercial support.

Agent Metadata

Pagination

cursor

Idempotent

Full

Retry Guidance

Documented
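Because metadata becomes searchable only after asynchronous processing, an agent that has just written metadata may need to poll before reading it back. The following is a generic sketch of polling with exponential backoff. It is not DataHub's documented retry policy, and wait_until_visible and its parameters are hypothetical names.

```python
import time


def wait_until_visible(check, attempts=5, base_delay=0.5, sleep=time.sleep):
    """Call check() until it returns a truthy value, backing off exponentially.

    Returns the truthy result, or None if every attempt comes back empty.
    The sleep function is injectable so the helper can be tested quickly.
    """
    for attempt in range(attempts):
        result = check()
        if result:
            return result
        # Skip the pause after the final attempt; there is nothing to wait for.
        if attempt < attempts - 1:
            sleep(base_delay * (2 ** attempt))
    return None
```

Here check would wrap a search or entity lookup and return the entity once it appears. Capping attempts keeps an agent from waiting indefinitely on an ingestion that failed.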

Known Gotchas

⚠ DataHub's primary query API is GraphQL — agents must use GraphQL syntax for entity search and metadata retrieval; REST API is primarily for ingestion
⚠ Entity URNs have a specific format, and the platform component is itself a URN: urn:li:dataset:(urn:li:dataPlatform:<platform>,<name>,<env>). Agents must construct URNs exactly to reference specific entities
⚠ Self-hosted DataHub requires Kafka, Elasticsearch, and MySQL/PostgreSQL, which adds up to a significant infrastructure footprint
⚠ Search relevance tuning may be needed for specific data estates — default search ranking may not surface the most relevant datasets
⚠ DataHub's Python SDK (PyPI package acryl-datahub, imported as datahub) has separate ingestion and graph client modules; agents must import from the correct module for their use case
⚠ Metadata updates take seconds to appear in search after ingestion due to async processing through Kafka
⚠ Access control in open source is limited — DataHub Cloud or custom authorization middleware required for fine-grained access control
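For the URN gotcha above, a small helper can keep agents from hand-assembling strings. This sketch assumes the dataset URN layout used in DataHub's documentation examples, where the platform is wrapped as urn:li:dataPlatform:<platform>. The function names are hypothetical, and the Python SDK ships its own URN builders that are preferable when the SDK is installed.

```python
import re

# Expected shape: urn:li:dataset:(urn:li:dataPlatform:<platform>,<name>,<env>)
_DATASET_URN = re.compile(
    r"^urn:li:dataset:\(urn:li:dataPlatform:([^,()]+),(.+),([A-Z_]+)\)$"
)


def make_dataset_urn(platform, name, env="PROD"):
    """Build a dataset URN from its three components."""
    if not platform or not name:
        raise ValueError("platform and name are required")
    return f"urn:li:dataset:(urn:li:dataPlatform:{platform},{name},{env})"


def parse_dataset_urn(urn):
    """Split a dataset URN into (platform, name, env) or raise ValueError."""
    match = _DATASET_URN.match(urn)
    if match is None:
        raise ValueError(f"not a dataset URN: {urn!r}")
    return match.group(1), match.group(2), match.group(3)
```

Validating with parse_dataset_urn before a lookup turns a malformed URN into an immediate, readable error instead of an empty query result.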

Alternatives

openmetadata-api, atlan-api, collibra-api, alation-api

Scores are editorial opinions as of 2026-03-06.