Apache Parquet (PyArrow)
Columnar binary storage format for analytical workloads, accessed via PyArrow or pandas, with efficient compression and predicate pushdown.
Score Breakdown
⚙ Agent Friendliness
🔒 Security
Column-level encryption available in Parquet spec but not widely implemented in PyArrow — use filesystem-level encryption.
⚡ Reliability
Best When
Best for storing large structured datasets for analytical reads where columnar efficiency and compression matter.
Avoid When
Avoid for OLTP workloads, row-level updates, or when human-readable formats are required for debugging.
Use Cases
- Store and retrieve large agent datasets from S3/GCS/Azure with column pruning for cost efficiency
- Build data lakes where AI training pipelines read only needed columns from parquet partitions
- Exchange large structured datasets between agents without CSV parsing overhead
- Implement efficient time-series data storage with Parquet partitioning by date columns
- Cache expensive ML feature computations to Parquet for reuse across agent runs
Not For
- Row-oriented workloads with frequent single-record updates — use databases instead
- Small datasets under 10MB where CSV or JSON is simpler and fast enough
- Streaming data that requires append-to-existing-file semantics
Interface
Authentication
Format library with no credentials of its own — auth for remote storage (S3, GCS) is handled by the filesystem layer (pyarrow.fs or fsspec).
Pricing
Apache 2.0 licensed. Cloud storage costs apply when reading/writing remote files.
Agent Metadata
Known Gotchas
- ⚠ Parquet files are immutable — 'updating' a record requires rewriting the entire file or using Delta Lake/Iceberg
- ⚠ Schema evolution is limited — adding nullable columns is safe, but renaming or changing types breaks readers
- ⚠ Partition column values are encoded in directory paths (Hive-style) not in the file — readers must infer partition schema
- ⚠ Row group size affects read performance (the Parquet spec suggests ~128MB; writer defaults vary) — too many small groups bloat metadata and seek overhead, too-large groups force reading unneeded data when filtering
- ⚠ pyarrow.parquet.read_table() reads the entire file by default — use the columns= and filters= parameters for column pruning and predicate pushdown to avoid loading all data
Full Evaluation Report
Detailed scoring breakdown, competitive positioning, security analysis, and improvement recommendations for Apache Parquet (PyArrow).
Scores are editorial opinions as of 2026-03-06.