vaex
Out-of-core lazy dataframe library for Python — processes billion-row datasets without loading them into RAM by using memory-mapped files. vaex features: open() for memory-mapped HDF5/Arrow/CSV, lazy evaluation (computations deferred until needed), virtual columns (computed on the fly), df.mean/std/sum aggregations at up to a billion rows per second, df.plot1d/plot2d for fast statistical plots, filtering with boolean expressions, df.apply() for UDFs, df.export_hdf5() for efficient storage, string operations, ML feature engineering, and JIT compilation via Numba/Pythran. Enables desktop analysis of datasets too large for pandas.
Score Breakdown
⚙ Agent Friendliness
🔒 Security
Local dataframe library with no network calls. HDF5 files are binary and machine-readable — protect large dataset files with filesystem permissions. No remote data access unless explicitly configured.
⚡ Reliability
Best When
Processing large datasets (100M-10B rows) that don't fit in RAM — vaex's memory-mapped lazy evaluation enables desktop analysis of truly large data without distributed infrastructure.
Avoid When
Small data (use pandas), complex multi-table operations (use DuckDB), mutable in-place updates, or real-time streaming.
Use Cases
- • Agent large dataset analysis — import vaex; df = vaex.open('large_data.hdf5'); mean_score = df[df['status'] == 'success']['score'].mean() — lazy evaluation; agent analyzes billion-row event log without loading into RAM; operations execute on memory-mapped file; result computed only when accessed
- • Agent CSV to HDF5 conversion — df = vaex.from_csv('large.csv', convert=True, chunk_size=1_000_000) — streaming conversion; agent converts large CSV to vaex HDF5 format for fast future access; chunk_size streams chunks without memory exhaustion; resulting HDF5 processes 100x faster
- • Agent feature engineering — df['log_value'] = np.log(df['value']); df['norm'] = (df['value'] - df['value'].mean()) / df['value'].std() — virtual columns; agent adds computed columns without copying data; NumPy ufuncs applied to vaex expressions stay lazy; virtual columns evaluated on access; no memory overhead for computed features
- • Agent statistical plots — df = vaex.open('events.hdf5'); df.plot1d(df['latency_ms'], limits=[0, 1000]) — fast plot; agent visualizes 1 billion events in seconds; vaex bins the full dataset rather than sampling it; matplotlib integration with fast aggregation-based binning
- • Agent filtering pipeline — df_filtered = df[(df['error_code'] == 0) & (df['latency'] < 100)]; count = df_filtered.count() — lazy filter chain; agent applies multiple filters before counting; filters compose without executing until count(); memory-efficient pipeline for complex queries
Not For
- • Small datasets — vaex has overhead for small data; for <10M rows use pandas
- • Complex joins — vaex has limited join support vs pandas; for multi-table joins use pandas or DuckDB
- • Mutable operations — vaex DataFrames are immutable (no in-place updates); for mutable tabular data use pandas
Interface
Authentication
No auth — local dataframe library.
Pricing
vaex is MIT licensed. Free for all use.
Agent Metadata
Known Gotchas
- ⚠ vaex DataFrame is not pandas DataFrame — vaex has similar API but not identical; df.groupby() works differently; df.merge() has limitations; agent code porting from pandas to vaex must test operations explicitly; don't assume pandas behavior
- ⚠ Lazy evaluation surprises — column arithmetic returns Expression objects, not arrays; df['col'] * 2 is an Expression until materialized; aggregations such as df['col'].mean() do compute, but return numpy scalars; agent code must force evaluation with .values, .to_numpy(), or float() to get plain values
- ⚠ CSV is slow — vaex.open('file.csv') is significantly slower than HDF5; for performance: vaex.from_csv('file.csv', convert=True) converts to HDF5 on first run; subsequent opens use HDF5; agent pipeline should convert CSV data once before analysis
- ⚠ Missing values different from pandas NaN — vaex uses masked arrays; vaex.ismissing(df['col']) to check; df.dropna() removes missing; but pandas NaN handling code doesn't directly apply to vaex missing values; agent code porting pandas NaN logic must adapt
- ⚠ vaex 4.x has breaking changes from 3.x — vaex 4.x moved to Apache Arrow backend; vaex 3.x HDF5 files may need migration; agent code upgrading must test data compatibility; check vaex version: import vaex; vaex.__version__
- ⚠ Memory mapping requires local filesystem — vaex.open() uses OS memory mapping; network filesystems (NFS, SMB) don't support memory mapping efficiently; agent code on cloud VMs should use local SSD not network storage; S3 requires full download or specialized cloud-native format
Alternatives
Scores are editorial opinions as of 2026-03-06.