DataHub
Open-source data catalog and metadata management platform. DataHub indexes metadata from databases, data warehouses, ETL tools, ML platforms, and BI tools to provide a unified interface for search, discovery, and lineage. Agents can query DataHub's GraphQL API to discover available datasets, inspect schemas, trace data lineage, and check data quality status, which enables data-aware agent pipelines. Originally developed at LinkedIn and now widely adopted as an enterprise data catalog.
Score Breakdown
🔒 Security
Apache 2.0 open source with an active security community. SOC2 for DataHub Cloud. OIDC/SSO support. PAT-based API access. Self-hosted deployments keep data residency fully under your control. Fine-grained access control is available in the enterprise tier.
Best When
You have complex data infrastructure (multiple warehouses, ETL tools, ML platforms) and need agents to discover, understand, and trace data across your stack without hardcoding dataset locations.
Avoid When
Your data estate is small and well-documented — a simple data dictionary or dbt docs provides sufficient metadata without DataHub's operational overhead.
Use Cases
- Enable agents to discover available datasets by searching DataHub's catalog, for example 'find tables containing customer transaction data' before writing SQL
- Trace data lineage when debugging agent pipelines: identify upstream dependencies when agent outputs are incorrect
- Check dataset freshness and quality status via the DataHub API before agents consume data that may be stale
- Allow agents to understand schema context for unfamiliar tables by querying the schema metadata stored in DataHub
- Build self-documenting agent pipelines that register their data consumption in DataHub for governance and lineage tracking
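The dataset discovery use case above could be sketched as follows. This is a minimal sketch, assuming DataHub's GraphQL schema exposes a `search` query that takes a `SearchInput` with a `DATASET` entity type; the selected fields, the helper names `build_search_payload` and `extract_urns`, and the sample response are illustrative and should be checked against the schema of your own deployment.

```python
import json

# Assumed GraphQL document; verify the field names against your DataHub version.
SEARCH_QUERY = """
query search($input: SearchInput!) {
  search(input: $input) {
    total
    searchResults {
      entity {
        urn
        type
      }
    }
  }
}
"""


def build_search_payload(text, start=0, count=10):
    """Return the JSON request body for a dataset search (hypothetical helper)."""
    variables = {
        "input": {"type": "DATASET", "query": text, "start": start, "count": count}
    }
    return json.dumps({"query": SEARCH_QUERY, "variables": variables})


def extract_urns(response):
    """Collect entity URNs from a decoded search response (hypothetical helper)."""
    data = response.get("data") or {}
    search = data.get("search") or {}
    return [hit["entity"]["urn"] for hit in search.get("searchResults", [])]


# Hand-written sample shaped like the query above; not real DataHub output.
sample_response = {
    "data": {
        "search": {
            "total": 1,
            "searchResults": [
                {
                    "entity": {
                        "urn": "urn:li:dataset:(urn:li:dataPlatform:snowflake,sales.orders,PROD)",
                        "type": "DATASET",
                    }
                }
            ],
        }
    }
}

print(build_search_payload("customer transaction", count=5))
print(extract_urns(sample_response))
```

Keeping payload construction and response parsing separate from the HTTP call lets an agent unit-test both without a running DataHub instance.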
Not For
- Real-time data quality enforcement: DataHub stores metadata about data, not the data itself, so use a tool such as dbt tests to enforce quality
- Teams without existing data infrastructure: DataHub pays off when there are many data sources to catalog and is overkill for small setups
- Non-technical users who need a simple data dictionary: DataHub is powerful but has a steep operational learning curve
Interface
Authentication
Personal access tokens (PATs) for API access, created in DataHub settings and passed in the Authorization header as a Bearer token. OIDC/SSO for dashboard login. DataHub Cloud adds role-based access control.
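As an illustration of the bearer-token scheme described above, the following minimal sketch builds (but does not send) an authenticated GraphQL request using only the Python standard library. The `/api/graphql` path, the host `datahub.example.com`, and the helper name `build_graphql_request` are assumptions for illustration; substitute the endpoint and token of your own deployment.

```python
import json
import urllib.request


def build_graphql_request(base_url, token, query, variables=None):
    """Build a POST request that carries a PAT as a bearer token (hypothetical helper)."""
    body = json.dumps({"query": query, "variables": variables or {}}).encode("utf-8")
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/api/graphql",  # assumed endpoint path
        data=body,
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


# Placeholder host and token; nothing is sent until urllib.request.urlopen() is called.
request = build_graphql_request(
    "https://datahub.example.com", "<personal-access-token>", "{ __typename }"
)
print(request.full_url)
```

Passing the resulting object to `urllib.request.urlopen` would perform the call. In practice the token should come from a secret store or environment variable rather than source code.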
Pricing
The Apache 2.0 open-source core is free to use. Self-hosting requires Kubernetes and a significant DevOps investment. DataHub Cloud is the managed offering, provided by Acryl Data along with commercial support.
Agent Metadata
Known Gotchas
- ⚠ DataHub's primary query API is GraphQL: agents must write GraphQL queries for entity search and metadata retrieval; the REST API is used primarily for ingestion
- ⚠ Dataset URNs follow a fixed format, urn:li:dataset:(urn:li:dataPlatform:<platform>,<name>,<env>), in which the platform is itself a URN; agents must construct URNs exactly to reference a specific entity
- ⚠ Self-hosted DataHub depends on Kafka, Elasticsearch, and MySQL or PostgreSQL, which is a significant infrastructure footprint
- ⚠ Search relevance tuning may be needed for specific data estates — default search ranking may not surface the most relevant datasets
- ⚠ DataHub's Python SDK (installed as acryl-datahub, imported as datahub) keeps its ingestion and graph client libraries in separate modules; agents must import from the module that matches their use case
- ⚠ Metadata updates can take several seconds to appear in search after ingestion because they are processed asynchronously through Kafka
- ⚠ Access control in the open-source edition is limited: fine-grained access control requires DataHub Cloud or custom authorization middleware
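The URN gotcha above is easy to get wrong by hand, so a small helper can be useful. This is a minimal sketch, assuming the dataset URN layout in which the platform is wrapped in its own `urn:li:dataPlatform:` URN; the function names `build_dataset_urn` and `parse_dataset_urn` and the example dataset name are hypothetical and are not part of the DataHub SDK.

```python
DATASET_PREFIX = "urn:li:dataset:("
PLATFORM_PREFIX = "urn:li:dataPlatform:"


def build_dataset_urn(platform, name, env="PROD"):
    """Assemble a dataset URN from its three parts (hypothetical helper)."""
    return f"{DATASET_PREFIX}{PLATFORM_PREFIX}{platform},{name},{env})"


def parse_dataset_urn(urn):
    """Split a dataset URN into (platform, name, env); raise ValueError if malformed."""
    if not (urn.startswith(DATASET_PREFIX) and urn.endswith(")")):
        raise ValueError(f"not a dataset URN: {urn!r}")
    inner = urn[len(DATASET_PREFIX):-1]
    # The platform URN comes first and the environment last; the dataset
    # name sits in between, so split once from each end.
    platform_urn, rest = inner.split(",", 1)
    name, env = rest.rsplit(",", 1)
    if not platform_urn.startswith(PLATFORM_PREFIX):
        raise ValueError(f"missing platform URN in: {urn!r}")
    return platform_urn[len(PLATFORM_PREFIX):], name, env


urn = build_dataset_urn("snowflake", "analytics.public.orders")
print(urn)
print(parse_dataset_urn(urn))
```

Where the `datahub` Python package is installed, its own URN builders in `datahub.emitter.mce_builder` are likely a better choice than a hand-rolled helper; the sketch only illustrates the format.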
Scores are editorial opinions as of 2026-03-06.