PyIceberg
Official Python implementation of Apache Iceberg, the open table format for huge analytic datasets. Provides a Python API for reading and writing Iceberg tables stored in S3, GCS, HDFS, or a local filesystem, with catalog support (REST, Hive, AWS Glue, Nessie). Enables Python agents to interact with Iceberg data lakes — schema evolution, time travel, partition management — without Spark or Java.
Score Breakdown
⚙ Agent Friendliness
🔒 Security
Security depends on catalog and object store configuration. Cloud credentials follow standard practices (IAM, service accounts). Data encrypted at rest by object store.
⚡ Reliability
Best When
You're building Python data engineering tools that need to interact with Apache Iceberg tables — catalog management, schema evolution, time travel — without a Spark dependency.
Avoid When
You need high-throughput production ETL (use Spark/Flink with Java Iceberg), simple ad-hoc analytics (use DuckDB), or haven't set up Iceberg infrastructure.
Use Cases
- Read and write Apache Iceberg tables from Python without requiring a Spark cluster for data lake workloads
- Execute schema evolution on Iceberg tables (add columns, rename, change types) from Python agent workflows
- Use Iceberg's time travel capabilities to query historical table snapshots for data validation and debugging
- Register and discover Iceberg tables via REST, Glue, or Hive catalogs for data mesh and lakehouse architectures
- Build Python ETL pipelines that write to Iceberg tables with ACID semantics for consistent lakehouse data
Not For
- Production heavy-write workloads — PyIceberg runs in-process Python and is slower than Spark/Flink for high-throughput writes; use Java Iceberg engines for production ETL
- Teams without Iceberg infrastructure — significant setup is required (catalog, object store); Delta Lake or Hudi may offer simpler entry points
- Ad-hoc SQL analytics — use DuckDB or ClickHouse for interactive SQL; PyIceberg is a table-format management library, not a query engine
Interface
Authentication
Auth depends on catalog type: REST catalog uses bearer tokens, AWS Glue uses IAM/boto3, Hive uses Kerberos. Object store auth via standard cloud credentials.
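A hedged sketch of per-catalog connection configuration. Every URI, token, and region below is a placeholder; the same properties can also live in a `.pyiceberg.yaml` file instead of being passed to `load_catalog` directly.

```python
from pyiceberg.catalog import load_catalog

# All endpoints and credentials below are placeholders.

# REST catalog: bearer-token auth against an HTTP endpoint.
rest_catalog = load_catalog(
    "rest_cat",
    **{"type": "rest", "uri": "https://catalog.example.com", "token": "…"},
)

# AWS Glue: credentials resolved by boto3's standard chain
# (environment variables, shared profile, or an attached IAM role).
glue_catalog = load_catalog("glue_cat", **{"type": "glue"})

# Hive Metastore: Thrift endpoint; Kerberos is negotiated at the Thrift/SASL layer.
hive_catalog = load_catalog(
    "hive_cat",
    **{"type": "hive", "uri": "thrift://metastore.example.com:9083"},
)
```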
Pricing
Apache 2.0 license. Apache Software Foundation project.
Agent Metadata
Known Gotchas
- ⚠ PyIceberg requires a catalog — there is no standalone mode; must configure REST, Hive, Glue, or in-memory catalog before any table operations
- ⚠ Writing to Iceberg tables creates Parquet files in the object store AND commits new metadata through the catalog — the catalog commit is atomic, but a failed commit can leave orphaned data files that require cleanup
- ⚠ Partition spec changes require new table rewrites or explicit partition evolution — adding new partition specs doesn't automatically re-partition existing data
- ⚠ PyIceberg's write performance is limited by Python's GIL — for high-throughput writes, use Spark with Java Iceberg or batch writes with PyArrow
- ⚠ Catalog connection strings vary by type (REST, Hive, Glue) — configuration syntax is not standardized across catalog types
- ⚠ Time travel queries require specifying a snapshot ID or timestamp — snapshot history lives in table metadata, and snapshots removed by expire-snapshots maintenance can no longer be queried
Alternatives
Full Evaluation Report
Detailed scoring breakdown, competitive positioning, security analysis, and improvement recommendations for PyIceberg.
Scores are editorial opinions as of 2026-03-06.