Daft

Distributed DataFrame library for large-scale data processing with native multi-modal support (images, embeddings, tensors, URLs alongside tabular data). Written in Rust with a Python API. Runs locally or distributed on Ray/AWS. Unlike Spark DataFrames or Pandas, Daft treats images and tensors as first-class column types, enabling AI/ML data pipelines that mix tabular and model data without custom serialization. Designed specifically for ML data preprocessing at scale.

Evaluated Mar 06, 2026 · v0.3+
Tags: dataframe, distributed, python, ray, multi-modal, parquet, images, embeddings, rust
⚙ Agent Friendliness: 64 / 100 (Can an agent use this?)
🔒 Security: 81 / 100 (Is it safe for agents?)
⚡ Reliability: 66 / 100 (Does it work consistently?)

Score Breakdown

⚙ Agent Friendliness

MCP Quality: --
Documentation: 78
Error Messages: 75
Auth Simplicity: 98
Rate Limits: 95

🔒 Security

TLS Enforcement: 90
Auth Strength: 80
Scope Granularity: 75
Dep. Hygiene: 78
Secret Handling: 82

No network API surface: Daft is a pure Python library. Cloud storage credentials use the standard provider auth chains. The Rust core reduces the risk of memory-safety vulnerabilities. Apache 2.0 source is available for audit. Ray cluster security is a separate concern.

⚡ Reliability

Uptime/SLA: 68
Version Stability: 65
Breaking Changes: 62
Error Recovery: 68

Best When

You're building ML data pipelines at scale that mix tabular data with images, embeddings, or tensors and need distributed execution beyond single-machine capacity.

Avoid When

Your data fits in memory and you don't need multi-modal types — Polars or Pandas are significantly faster and simpler for pure tabular data.

Use Cases

  • Process millions of images, embeddings, or video frames in distributed agent data pipelines with Daft's native multi-modal column types
  • Scale Pandas-like DataFrame transformations to datasets too large for single machines using Daft's Ray-based distributed execution
  • Build ML training data pipelines that mix tabular metadata with image/embedding columns without custom serialization or format conversion code
  • Query and filter large Parquet/Delta Lake/Iceberg datasets using Daft's lazy evaluation and predicate pushdown for efficient data loading
  • Run distributed Python UDFs on GPU clusters via Ray for batch inference or feature extraction on multi-modal agent datasets
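The predicate-pushdown point in the Parquet use case above can be made concrete with a stdlib-only sketch. This is an illustration of the general technique, not Daft's implementation: a declarative predicate carries metadata (column, threshold) that the scanner can check against per-group min/max statistics, so whole row groups are skipped before any rows are read, while an opaque Python function forces a full load first. All names here are illustrative.

```python
# Illustrative sketch of predicate pushdown (not Daft's code).
# Pretend Parquet row groups, each with min/max stats per column.
ROW_GROUPS = [
    {"stats": {"year": (2018, 2020)}, "rows": [{"year": 2018}, {"year": 2020}]},
    {"stats": {"year": (2021, 2023)}, "rows": [{"year": 2021}, {"year": 2023}]},
]

def scan(predicate=None):
    """Read row groups; skip any group the predicate's stats rule out.

    `predicate` is a (column, minimum_value) pair the scanner can inspect,
    standing in for a declarative expression filter.
    """
    groups_read = 0
    out = []
    for rg in ROW_GROUPS:
        if predicate is not None:
            col, lo_needed = predicate
            _, hi = rg["stats"][col]
            if hi < lo_needed:      # stats prove no row in this group matches
                continue            # -> skip the group without reading it
        groups_read += 1
        rows = rg["rows"]
        if predicate is not None:
            col, lo_needed = predicate
            rows = [r for r in rows if r[col] >= lo_needed]
        out.extend(rows)
    return out, groups_read

# Declarative filter: pushed into the scan, one group skipped entirely.
pushed, read_pushed = scan(predicate=("year", 2021))

# Opaque lambda: the engine must read everything, then filter in Python.
all_rows, read_all = scan()
late = [r for r in all_rows if r["year"] >= 2021]

print(read_pushed, read_all, pushed == late)  # → 1 2 True
```

The same asymmetry is why, in Daft, filters written as expressions can be pushed into the Parquet scan while Python UDF filters cannot: only the former expose structure the planner can reason about.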

Not For

  • Simple single-machine data analysis — Pandas or Polars are simpler and faster for datasets that fit in memory
  • SQL-first analytics workloads — use DuckDB, ClickHouse, or Spark SQL for SQL-centric analytics without Python UDFs
  • Transactional or real-time data — Daft is a batch processing framework, not a streaming or OLTP system

Interface

REST API: No
GraphQL: No
gRPC: No
MCP Server: No
SDK: Yes
Webhooks: No

Authentication

Methods: none
OAuth: No
Scopes: No

No authentication — Daft is a Python library. Cloud storage access uses standard cloud credentials (AWS_ACCESS_KEY_ID, GCP Application Default Credentials). Ray cluster authentication is separate from Daft.
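As a concrete illustration of the standard AWS credential chain mentioned above (these are the usual AWS SDK environment variables, not anything Daft-specific; the region is an example value):

```shell
# No Daft-specific auth: S3 reads pick up the standard AWS credential chain.
# Set these in the environment, or rely on an instance profile / SSO session.
export AWS_ACCESS_KEY_ID=...        # elided; supply your own credentials
export AWS_SECRET_ACCESS_KEY=...
export AWS_REGION=us-west-2         # example region
```

GCP access works analogously via Application Default Credentials.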

Pricing

Model: open_source
Free tier: Yes
Requires CC: No

Apache 2.0 open source. Compute costs depend on Ray cluster size or local machine. Free to use in production. Eventual Inc., the company behind Daft, provides commercial support.

Agent Metadata

Pagination: none
Idempotent: Full
Retry Guidance: Not documented

Known Gotchas

  • Daft uses lazy evaluation — transformations build a logical plan but don't execute until .collect() or .show() is called; errors in transformations only surface at collect time
  • Multi-modal column types (ImageType, TensorType) are Daft-specific and require understanding Daft's type system; LLM-generated code may default to serializing these as bytes without using native types
  • Ray cluster setup is required for distributed execution — local mode works for development but agents must configure Ray for production scale; Ray initialization is separate from Daft
  • Daft is early-stage (v0.x) — API may have breaking changes between minor versions; pin exact version in agent environments and test on upgrades
  • GPU UDFs require Ray actors with GPU resources configured — GPU-enabled execution is not automatic; agents must annotate UDFs with @daft.udf(return_dtype=...) and configure Ray resource requests
  • Parquet reading uses predicate pushdown for filter optimization — agents must write filters using Daft expressions (not Python lambdas) to benefit from pushdown; Python UDF filters don't get pushed down
  • Memory management differs from Pandas — Daft batches data internally; for large datasets, out-of-memory errors may surface mid-execution rather than upfront with a clear memory estimate
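The first gotcha above, lazy plans deferring errors to `.collect()`, can be modeled with a stdlib-only sketch. This mimics the general lazy-evaluation pattern only; the class and method names are hypothetical and this is not Daft's API or implementation:

```python
# Minimal lazy-plan sketch (illustrative only, not Daft's implementation).
# Transformations append to a plan; nothing runs -- and nothing fails --
# until collect() executes the whole plan at once.

class LazyFrame:
    def __init__(self, rows):
        self._rows = rows
        self._plan = []                       # deferred transformations

    def with_column(self, name, fn):
        self._plan.append((name, fn))
        return self                           # builds the plan, runs nothing

    def collect(self):
        rows = [dict(r) for r in self._rows]
        for name, fn in self._plan:           # errors surface only here
            for r in rows:
                r[name] = fn(r)
        return rows

df = LazyFrame([{"x": 1}, {"x": 0}])
df = df.with_column("inv", lambda r: 1 / r["x"])   # no error yet: plan only

try:
    df.collect()                               # ZeroDivisionError surfaces now
except ZeroDivisionError:
    print("error deferred until collect()")
```

The practical consequence for agents is the same as in Daft: a transformation that is broken for some rows looks fine at definition time, so validation has to happen at materialization points (`.collect()`, `.show()`), not when the plan is built.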

Scores are editorial opinions as of 2026-03-06.
