DataHub

Open-source data catalog and metadata management platform. DataHub indexes metadata from databases, data warehouses, ETL tools, ML platforms, and BI tools to provide a unified interface for search, discovery, and lineage. Agents can query DataHub's GraphQL API to discover available datasets, inspect schemas, trace data lineage, and check data quality status, which enables data-aware agent pipelines. DataHub originated at LinkedIn and is widely adopted as an enterprise data catalog.

Evaluated: Mar 06, 2026
Version: v0.13+
Category: Developer Tools
Tags: data-catalog, lineage, metadata, open-source, governance, discovery, data-mesh
⚙ Agent Friendliness: 58 / 100 (Can an agent use this?)
🔒 Security: 84 / 100 (Is it safe for agents?)
⚡ Reliability: 78 / 100 (Does it work consistently?)

Score Breakdown

⚙ Agent Friendliness

MCP Quality: --
Documentation: 80
Error Messages: 75
Auth Simplicity: 78
Rate Limits: 72

🔒 Security

TLS Enforcement: 95
Auth Strength: 82
Scope Granularity: 78
Dependency Hygiene: 85
Secret Handling: 82

The core is Apache 2.0 open source with an active security community. DataHub Cloud is covered by SOC 2. OIDC/SSO is supported, and API access uses personal access tokens (PATs). Self-hosted deployments keep full control over data residency. Fine-grained access control is available in the enterprise tier.

⚡ Reliability

Uptime/SLA: 82
Version Stability: 78
Breaking Changes: 75
Error Recovery: 78

Best When

You have complex data infrastructure (multiple warehouses, ETL tools, ML platforms) and need agents to discover, understand, and trace data across your stack without hardcoding dataset locations.

Avoid When

Your data estate is small and well documented; a simple data dictionary or dbt docs provide sufficient metadata without DataHub's operational overhead.

Use Cases

  • Enable agents to discover available datasets by searching DataHub's catalog — 'find tables containing customer transaction data' before writing SQL
  • Trace data lineage for agent pipeline debugging — understand upstream dependencies when agent outputs are incorrect
  • Check dataset freshness and quality status via DataHub API before agents consume data that may be stale
  • Allow agents to understand schema context for unfamiliar tables by querying the schema metadata DataHub stores for each dataset
  • Build self-documenting agent pipelines that register their data consumption in DataHub for governance and lineage tracking
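The dataset-discovery use case above could be sketched as a GraphQL search request. This is a minimal sketch, not a definitive client: the query shape (a `search` field taking a `SearchInput` with `type`, `query`, `start`, and `count`) is modeled on DataHub's published GraphQL search examples and should be verified against the schema of your deployment. The helper names `build_search_payload` and `extract_urns` are hypothetical.

```python
import json

# Assumed query shape; confirm field and type names against your
# DataHub instance's GraphQL schema before relying on it.
SEARCH_QUERY = """
query search($input: SearchInput!) {
  search(input: $input) {
    start
    count
    total
    searchResults {
      entity {
        urn
        type
      }
    }
  }
}
"""


def build_search_payload(text, start=0, count=10):
    """Build the JSON body for a dataset search (hypothetical helper)."""
    return {
        "query": SEARCH_QUERY,
        "variables": {
            "input": {
                "type": "DATASET",
                "query": text,
                "start": start,
                "count": count,
            }
        },
    }


def extract_urns(response):
    """Pull entity URNs out of a decoded search response.

    Tolerates a null "data" field, which GraphQL servers may return
    alongside an "errors" list.
    """
    data = response.get("data") or {}
    results = (data.get("search") or {}).get("searchResults") or []
    return [item["entity"]["urn"] for item in results]


if __name__ == "__main__":
    payload = build_search_payload("customer transaction")
    print(json.dumps(payload["variables"], indent=2))
```

An agent would POST the payload to the GraphQL endpoint, then pass the decoded JSON response to `extract_urns` to get candidate datasets before writing SQL.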

Not For

  • Real-time data quality enforcement — DataHub stores metadata about data, not the data itself; use dbt tests for quality enforcement
  • Teams without existing data infrastructure — DataHub is valuable when you have many data sources to catalog; overkill for small setups
  • Non-technical users needing a simple data dictionary — DataHub is powerful but has a steep operational learning curve

Interface

REST API: Yes
GraphQL: Yes
gRPC: No
MCP Server: No
SDK: Yes
Webhooks: Yes

Authentication

Methods: api_key, bearer_token
OAuth: Yes
Scopes: Yes

Personal access tokens (PATs) are used for API access and are created in DataHub settings. Pass the token in an Authorization: Bearer <token> header. OIDC/SSO handles dashboard login. DataHub Cloud adds role-based access control.
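The bearer-token scheme described above can be illustrated with a small request builder that uses only the Python standard library. This is a sketch under stated assumptions: the `/api/graphql` path, the example base URL, and the helper name `build_graphql_request` are illustrative and not confirmed by this evaluation, so check the endpoint for your deployment.

```python
import json
import urllib.request


def build_graphql_request(base_url, token, query, variables=None):
    """Build (but do not send) an authenticated GraphQL POST request.

    Assumptions: the GraphQL endpoint lives at <base_url>/api/graphql and
    accepts a personal access token as a bearer token.
    """
    body = json.dumps({"query": query, "variables": variables or {}})
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/api/graphql",
        data=body.encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


if __name__ == "__main__":
    # Hypothetical host and placeholder token, for illustration only.
    req = build_graphql_request(
        "http://localhost:8080", "<your-token>", "{ __typename }"
    )
    print(req.full_url)
    # To send it: urllib.request.urlopen(req, timeout=10)
```

Building the request separately from sending it keeps the token handling in one place and makes the header logic easy to check without a running DataHub instance.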

Pricing

Model: open_source
Free tier: Yes
Requires CC: No

The Apache 2.0 open-source core is fully free. Self-hosting requires Kubernetes and significant DevOps investment. Acryl Data provides commercial support and DataHub Cloud, the managed offering.

Agent Metadata

Pagination: cursor
Idempotent: Full
Retry Guidance: Documented

Known Gotchas

  • DataHub's primary query API is GraphQL — agents must use GraphQL syntax for entity search and metadata retrieval; REST API is primarily for ingestion
  • Entity URNs have a specific format. For datasets it is urn:li:dataset:(urn:li:dataPlatform:<platform>,<name>,<env>), where the platform component is itself a URN; agents must construct URNs correctly to reference specific entities
  • Self-hosted DataHub requires Kafka, Elasticsearch, and MySQL/PostgreSQL — significant infrastructure footprint for self-hosting
  • Search relevance tuning may be needed for specific data estates — default search ranking may not surface the most relevant datasets
  • DataHub's Python SDK (datahub) has separate ingestion and graph client libraries — agents must import from correct module for their use case
  • Metadata updates take seconds to appear in search after ingestion due to async processing through Kafka
  • Access control in open source is limited — DataHub Cloud or custom authorization middleware required for fine-grained access control
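The URN gotcha above is easy to get wrong by hand, so a small builder and parser may help. This is a minimal sketch assuming the dataset URN layout `urn:li:dataset:(urn:li:dataPlatform:<platform>,<name>,<env>)`; the function names `build_dataset_urn` and `parse_dataset_urn` are hypothetical and are not part of the DataHub SDK. When the `datahub` Python package is installed, its own URN helpers are likely a better choice.

```python
DATASET_PREFIX = "urn:li:dataset:("
PLATFORM_PREFIX = "urn:li:dataPlatform:"


def build_dataset_urn(platform, name, env="PROD"):
    """Assemble a dataset URN from its three components (hypothetical helper)."""
    return f"{DATASET_PREFIX}{PLATFORM_PREFIX}{platform},{name},{env})"


def parse_dataset_urn(urn):
    """Split a dataset URN back into (platform, name, env).

    Raises ValueError when the string does not match the assumed layout.
    """
    if not (urn.startswith(DATASET_PREFIX) and urn.endswith(")")):
        raise ValueError(f"not a dataset URN: {urn!r}")
    inner = urn[len(DATASET_PREFIX):-1]
    # The platform is the first component and the environment the last;
    # splitting from both ends lets the dataset name contain commas.
    platform_urn, rest = inner.split(",", 1)
    name, env = rest.rsplit(",", 1)
    if not platform_urn.startswith(PLATFORM_PREFIX):
        raise ValueError(f"unexpected platform component: {platform_urn!r}")
    return platform_urn[len(PLATFORM_PREFIX):], name, env


if __name__ == "__main__":
    urn = build_dataset_urn("hive", "SampleHiveDataset")
    print(urn)
    print(parse_dataset_urn(urn))
```

Validating URNs before sending them avoids lookups that fail only because of a malformed identifier.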

Scores are editorial opinions as of 2026-03-06.
