Daft

Distributed DataFrame library for large-scale data processing with native multi-modal support (images, embeddings, tensors, URLs alongside tabular data). Written in Rust with a Python API. Runs locally or distributed on Ray/AWS. Unlike Spark DataFrames or Pandas, Daft treats images and tensors as first-class column types, enabling AI/ML data pipelines that mix tabular and model data without custom serialization. Designed specifically for ML data preprocessing at scale.

Evaluated Mar 06, 2026 · v0.3+
Tags: dataframe, distributed, python, ray, multi-modal, parquet, images, embeddings, rust
⚙ Agent Friendliness: 64 / 100 (Can an agent use this?)
🔒 Security: 81 / 100 (Is it safe for agents?)
⚡ Reliability: 66 / 100 (Does it work consistently?)

Score Breakdown

⚙ Agent Friendliness

MCP Quality: --
Documentation: 78
Error Messages: 75
Auth Simplicity: 98
Rate Limits: 95

🔒 Security

TLS Enforcement: 90
Auth Strength: 80
Scope Granularity: 75
Dep. Hygiene: 78
Secret Handling: 82

No network API surface: Daft is a pure Python library. Cloud storage credentials use the standard provider auth chains. The Rust core reduces the risk of memory-safety vulnerabilities. Apache 2.0 source is available for audit. Ray cluster security is a separate concern.

⚡ Reliability

Uptime/SLA: 68
Version Stability: 65
Breaking Changes: 62
Error Recovery: 68

Best When

You're building ML data pipelines at scale that mix tabular data with images, embeddings, or tensors and need distributed execution beyond single-machine capacity.

Avoid When

Your data fits in memory and you don't need multi-modal types — Polars or Pandas are significantly faster and simpler for pure tabular data.

Use Cases

  • Process millions of images, embeddings, or video frames in distributed agent data pipelines with Daft's native multi-modal column types
  • Scale Pandas-like DataFrame transformations to datasets too large for single machines using Daft's Ray-based distributed execution
  • Build ML training data pipelines that mix tabular metadata with image/embedding columns without custom serialization or format conversion code
  • Query and filter large Parquet/Delta Lake/Iceberg datasets using Daft's lazy evaluation and predicate pushdown for efficient data loading
  • Run distributed Python UDFs on GPU clusters via Ray for batch inference or feature extraction on multi-modal agent datasets
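The predicate-pushdown point in the Parquet use case above can be made concrete with a stdlib-only sketch. This is an illustration of the general technique, not Daft's implementation: a declarative predicate carries metadata (column, threshold) that the scanner can check against per-group min/max statistics, so whole row groups are skipped before any rows are read, while an opaque Python function forces a full load first. All names here are illustrative.

```python
# Illustrative sketch of predicate pushdown (not Daft's code).
# Pretend Parquet row groups, each with min/max stats per column.
ROW_GROUPS = [
    {"stats": {"year": (2018, 2020)}, "rows": [{"year": 2018}, {"year": 2020}]},
    {"stats": {"year": (2021, 2023)}, "rows": [{"year": 2021}, {"year": 2023}]},
]

def scan(predicate=None):
    """Read row groups; skip any group the predicate's stats rule out.

    `predicate` is a (column, minimum_value) pair the scanner can inspect,
    standing in for a declarative expression filter.
    """
    groups_read = 0
    out = []
    for rg in ROW_GROUPS:
        if predicate is not None:
            col, lo_needed = predicate
            _, hi = rg["stats"][col]
            if hi < lo_needed:      # stats prove no row in this group matches
                continue            # -> skip the group without reading it
        groups_read += 1
        rows = rg["rows"]
        if predicate is not None:
            col, lo_needed = predicate
            rows = [r for r in rows if r[col] >= lo_needed]
        out.extend(rows)
    return out, groups_read

# Declarative filter: pushed into the scan, one group skipped entirely.
pushed, read_pushed = scan(predicate=("year", 2021))

# Opaque lambda: the engine must read everything, then filter in Python.
all_rows, read_all = scan()
late = [r for r in all_rows if r["year"] >= 2021]

print(read_pushed, read_all, pushed == late)  # → 1 2 True
```

The same asymmetry is why, in Daft, filters written as expressions can be pushed into the Parquet scan while Python UDF filters cannot: only the former expose structure the planner can reason about.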

Not For

  • Simple single-machine data analysis — Pandas or Polars are simpler and faster for datasets that fit in memory
  • SQL-first analytics workloads — use DuckDB, ClickHouse, or Spark SQL for SQL-centric analytics without Python UDFs
  • Transactional or real-time data — Daft is a batch processing framework, not a streaming or OLTP system

Interface

REST API: No
GraphQL: No
gRPC: No
MCP Server: No
SDK: Yes
Webhooks: No

Authentication

Methods: none
OAuth: No
Scopes: No

No authentication — Daft is a Python library. Cloud storage access uses standard cloud credentials (AWS_ACCESS_KEY_ID, GCP Application Default Credentials). Ray cluster authentication is separate from Daft.
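As a concrete illustration of the standard AWS credential chain mentioned above (these are the usual AWS SDK environment variables, not anything Daft-specific; the region is an example value):

```shell
# No Daft-specific auth: S3 reads pick up the standard AWS credential chain.
# Set these in the environment, or rely on an instance profile / SSO session.
export AWS_ACCESS_KEY_ID=...        # elided; supply your own credentials
export AWS_SECRET_ACCESS_KEY=...
export AWS_REGION=us-west-2         # example region
```

GCP access works analogously via Application Default Credentials.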

Pricing

Model: open_source
Free tier: Yes
Requires CC: No

Apache 2.0 open source. Compute costs depend on Ray cluster size or local machine. Free to use in production. Eventual Inc., the company behind Daft, provides commercial support.

Agent Metadata

Pagination: none
Idempotent: Full
Retry Guidance: Not documented

Known Gotchas

  • Daft uses lazy evaluation — transformations build a logical plan but don't execute until .collect() or .show() is called; errors in transformations only surface at collect time
  • Multi-modal column types (ImageType, TensorType) are Daft-specific and require understanding Daft's type system; LLM-generated code may default to serializing these as bytes without using native types
  • Ray cluster setup is required for distributed execution — local mode works for development but agents must configure Ray for production scale; Ray initialization is separate from Daft
  • Daft is early-stage (v0.x) — API may have breaking changes between minor versions; pin exact version in agent environments and test on upgrades
  • GPU UDFs require Ray actors with GPU resources configured — GPU-enabled execution is not automatic; agents must annotate UDFs with @daft.udf(return_dtype=...) and configure Ray resource requests
  • Parquet reading uses predicate pushdown for filter optimization — agents must write filters using Daft expressions (not Python lambdas) to benefit from pushdown; Python UDF filters don't get pushed down
  • Memory management differs from Pandas — Daft batches data internally; for large datasets, out-of-memory errors may surface mid-execution rather than upfront with a clear memory estimate
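The first gotcha above, lazy plans deferring errors to `.collect()`, can be modeled with a stdlib-only sketch. This mimics the general lazy-evaluation pattern only; the class and method names are hypothetical and this is not Daft's API or implementation:

```python
# Minimal lazy-plan sketch (illustrative only, not Daft's implementation).
# Transformations append to a plan; nothing runs -- and nothing fails --
# until collect() executes the whole plan at once.

class LazyFrame:
    def __init__(self, rows):
        self._rows = rows
        self._plan = []                       # deferred transformations

    def with_column(self, name, fn):
        self._plan.append((name, fn))
        return self                           # builds the plan, runs nothing

    def collect(self):
        rows = [dict(r) for r in self._rows]
        for name, fn in self._plan:           # errors surface only here
            for r in rows:
                r[name] = fn(r)
        return rows

df = LazyFrame([{"x": 1}, {"x": 0}])
df = df.with_column("inv", lambda r: 1 / r["x"])   # no error yet: plan only

try:
    df.collect()                               # ZeroDivisionError surfaces now
except ZeroDivisionError:
    print("error deferred until collect()")
```

The practical consequence for agents is the same as in Daft: a transformation that is broken for some rows looks fine at definition time, so validation has to happen at materialization points (`.collect()`, `.show()`), not when the plan is built.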

Scores are editorial opinions as of 2026-03-06.
