OpenLineage
Open standard for data lineage collection and propagation. OpenLineage defines a spec for how data pipelines emit lineage events (who ran what job, which datasets were inputs, which were outputs). Compatible backends: Marquez (open-source server), DataHub, Atlan. Integrations for Spark, Airflow, dbt, Flink, and more auto-emit lineage events without code changes.
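To make the event shape concrete, here is a minimal sketch of a run event built with only the Python standard library. The top-level field names follow the OpenLineage run event structure; the namespaces, job name, dataset names, and producer URL are hypothetical placeholders, and the `schemaURL` assumes a particular spec version that should be checked against the one you target.

```python
import json
import uuid
from datetime import datetime, timezone


def build_run_event(event_type, job_name, inputs, outputs, run_id=None):
    """Build a dict shaped like an OpenLineage run event.

    Namespaces and URLs below are illustrative placeholders, not real systems.
    """
    return {
        "eventType": event_type,  # e.g. START, COMPLETE, FAIL
        "eventTime": datetime.now(timezone.utc).isoformat(),
        "run": {"runId": run_id or str(uuid.uuid4())},
        "job": {"namespace": "example-namespace", "name": job_name},
        "inputs": [{"namespace": "example-db", "name": n} for n in inputs],
        "outputs": [{"namespace": "example-db", "name": n} for n in outputs],
        "producer": "https://example.com/my-pipeline",
        # Assumed spec version; verify against the version your backend expects.
        "schemaURL": "https://openlineage.io/spec/2-0-2/OpenLineage.json#/$defs/RunEvent",
    }


event = build_run_event(
    "COMPLETE",
    "daily_orders_rollup",
    inputs=["raw.orders"],
    outputs=["analytics.orders_daily"],
)
print(json.dumps(event, indent=2))
```

In practice the Airflow, Spark, or dbt integrations construct and send these events for you; building one by hand is mainly useful for custom jobs or for testing a backend.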
Score Breakdown
🔒 Security
Open source under the Apache 2.0 license and hosted as a Linux Foundation project, so the spec and integrations are fully auditable. Authentication is delegated to the backend implementation. HTTPS is recommended for event transport. The spec itself stores no credentials; credential handling is backend-specific.
Best When
You're using Airflow, Spark, or dbt and want automatic data lineage tracking across your data pipelines without writing custom lineage code.
Avoid When
You need commercial enterprise lineage management with support SLAs — use a commercial product like Collibra or Atlan that consumes OpenLineage events.
Use Cases
- Automatically track data lineage across AI training pipelines using the OpenLineage Airflow and Spark integrations, with no custom code required
- Feed lineage data to Marquez or DataHub for catalog enrichment, so agents can understand data provenance
- Implement impact analysis: understand which downstream datasets and models are affected when an upstream data source changes
- Audit ML model training data lineage for compliance: prove which datasets contributed to a model version via lineage events
- Build data observability pipelines that trigger alerts when expected lineage events are missing (indicating pipeline failure)
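The impact-analysis use case amounts to a graph traversal over job inputs and outputs. The sketch below assumes lineage events have already been fetched from a backend and reduced to a simplified, hypothetical shape (plain dataset-name strings under `inputs` and `outputs`); real events carry namespaces and facets as well.

```python
from collections import defaultdict, deque


def downstream_datasets(events, changed):
    """Return every dataset reachable downstream of `changed`.

    `events` is a list of simplified, hypothetical records:
    {"job": str, "inputs": [str, ...], "outputs": [str, ...]}.
    """
    # Each job run contributes edges from every input to every output.
    edges = defaultdict(set)
    for ev in events:
        for src in ev["inputs"]:
            edges[src].update(ev["outputs"])

    # Breadth-first search from the changed dataset.
    seen, queue = set(), deque([changed])
    while queue:
        current = queue.popleft()
        for nxt in edges.get(current, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


sample_events = [
    {"job": "ingest", "inputs": ["a"], "outputs": ["b"]},
    {"job": "clean", "inputs": ["b"], "outputs": ["c", "d"]},
    {"job": "unrelated", "inputs": ["x"], "outputs": ["y"]},
]
affected = downstream_datasets(sample_events, "a")
print(sorted(affected))
```

Backends such as Marquez and DataHub expose their own lineage queries, so a traversal like this is mostly relevant when working from raw events.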
Not For
- Real-time row-level data tracking: OpenLineage tracks job-level dataset lineage, not individual record provenance
- Uniform column-level lineage: the spec defines a column lineage facet, but it is not fully supported across all integrations
- Teams using only proprietary tools: lineage integration requires OpenLineage-compatible data tools
Interface
Authentication
The OpenLineage spec itself is auth-agnostic; backends implement authentication. Marquez (open-source backend) has optional API key auth. DataHub uses its own auth. The HTTP transport supports an Authorization header.
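As an illustration of header-based auth over HTTP, the following sketch prepares a POST request carrying a bearer token, using only the standard library. The URL assumes a local Marquez instance at its lineage endpoint, and the API key and the `build_lineage_request` helper are hypothetical; whether a bearer token is the right scheme depends on how your backend is configured.

```python
import json
import urllib.request


def build_lineage_request(event, url, api_key=None):
    """Prepare (but do not send) an HTTP POST carrying a lineage event."""
    headers = {"Content-Type": "application/json"}
    if api_key:
        # Assumption: the backend accepts a bearer token in this header.
        headers["Authorization"] = f"Bearer {api_key}"
    return urllib.request.Request(
        url,
        data=json.dumps(event).encode("utf-8"),
        headers=headers,
        method="POST",
    )


req = build_lineage_request(
    {"eventType": "START"},  # truncated placeholder event
    "http://localhost:5000/api/v1/lineage",  # assumed local Marquez endpoint
    api_key="example-key",
)
print(req.get_method(), req.full_url)
```

Passing the prepared request to `urllib.request.urlopen(req)` would send it; use an `https://` URL outside of local testing so the token is not sent in clear text.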
Pricing
OpenLineage is a standard, not a product — free to use. The receiver/backend (Marquez, DataHub) has its own cost. Linux Foundation project with strong industry backing.
Agent Metadata
Known Gotchas
- ⚠ OpenLineage is a standard, not a complete product — you need a backend (Marquez, DataHub) to store and query lineage
- ⚠ Integrations auto-emit lineage for standard operations — custom dataset transformations may require manual event emission
- ⚠ Lineage event schema is strict — invalid events are dropped silently by some backends
- ⚠ Column-level lineage support varies by integration — not all Spark/Airflow operations emit column lineage
- ⚠ Backend choice significantly affects query capabilities — Marquez offers simpler queries; DataHub offers richer search
- ⚠ Async event transport means lineage data may lag behind pipeline execution — don't query lineage immediately after job completion
- ⚠ OpenLineage facets extend the base spec — verify that your backend supports the specific facets your integration emits
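Because some backends drop invalid events silently and facet support varies, a pre-flight check before emitting custom events can surface problems early. The sketch below is not a substitute for validating against the published JSON schema; the set of fields it treats as required and the `find_event_problems` helper are assumptions for illustration, and the supported-facet list would come from your backend's documentation.

```python
REQUIRED_TOP_LEVEL = ("eventTime", "producer", "schemaURL", "run", "job")


def find_event_problems(event, supported_run_facets=None):
    """Return a list of human-readable problems found in a run event dict.

    If `supported_run_facets` is given, run facets outside it are reported.
    """
    problems = [f"missing field: {key}" for key in REQUIRED_TOP_LEVEL if key not in event]

    run = event.get("run", {})
    if "runId" not in run:
        problems.append("missing field: run.runId")

    job = event.get("job", {})
    for key in ("namespace", "name"):
        if key not in job:
            problems.append(f"missing field: job.{key}")

    if supported_run_facets is not None:
        for facet in run.get("facets", {}):
            if facet not in supported_run_facets:
                problems.append(f"unsupported run facet: {facet}")
    return problems


candidate = {
    "eventTime": "2026-03-06T00:00:00+00:00",
    "producer": "https://example.com/my-pipeline",  # placeholder
    "run": {"runId": "example-run-id", "facets": {"nominalTime": {}}},
    "job": {"namespace": "example-namespace", "name": "example-job"},
}
problems = find_event_problems(candidate, supported_run_facets={"parent"})
print(problems)
```

Logging or failing on a non-empty result gives a visible signal in cases where the backend itself would stay quiet.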
Scores are editorial opinions as of 2026-03-06.