CuPy
GPU-accelerated NumPy/SciPy drop-in replacement — runs array operations on NVIDIA CUDA or AMD ROCm GPUs. CuPy features: NumPy-compatible API (cp.array, cp.dot, cp.fft), cupy.ndarray backed by GPU memory, cp.asnumpy() / cp.asarray() for CPU↔GPU transfer, element-wise custom kernels (cp.ElementwiseKernel), reduction kernels, cupy.RawKernel for hand-written CUDA C, sparse matrix support (cupyx.scipy.sparse), linear algebra (cupyx.scipy.linalg), and DLPack interop with PyTorch/JAX. Typically 10-100× faster than NumPy for large arrays. Used for signal processing, image processing, and scientific computing without PyTorch/TensorFlow overhead.
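A minimal sketch of the drop-in pattern described above: bind the array module to one name and write NumPy-style code once. The NumPy fallback branch is an illustrative assumption for hosts without a CuPy GPU wheel; on a GPU machine, `xp` is CuPy and the same code runs on the device.

```python
# Drop-in sketch: one name (xp) for the array module, NumPy-style code once.
try:
    import cupy as xp          # GPU path (requires a CUDA/ROCm wheel)
    def to_cpu(a):
        return xp.asnumpy(a)   # explicit device-to-host copy
except ImportError:
    import numpy as xp         # CPU fallback (assumption for GPU-less hosts)
    def to_cpu(a):
        return a               # already in host memory

a = xp.arange(6, dtype=xp.float32).reshape(2, 3)
b = xp.dot(a, a.T)             # runs on the GPU when xp is cupy
result = to_cpu(b)             # plain NumPy array either way
print(result.shape)            # (2, 2)
```

The `xp` aliasing convention mirrors CuPy's own `cupy.get_array_module` idiom for writing CPU/GPU-agnostic functions.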
Score Breakdown
⚙ Agent Friendliness
🔒 Security
Local GPU computation — no network access, no data exfiltration risk. CUDA kernel compilation runs locally. Ensure the cupy wheel comes from PyPI (not unofficial sources) for supply-chain safety.
⚡ Reliability
Best When
Your agent needs GPU-accelerated numerical computation (FFT, linear algebra, signal processing) without full PyTorch/TensorFlow framework overhead — CuPy provides a near-identical NumPy API running on GPU.
Avoid When
You don't have a CUDA/ROCm GPU, your arrays are small, or you need neural network autograd (use PyTorch instead).
Use Cases
- • Agent GPU array computation — import cupy as cp; x = cp.array([1, 2, 3]); result = cp.fft.fft(x) — FFTs run on the GPU, often tens of times faster than NumPy for large inputs; agent signal processing finishes in milliseconds instead of seconds; cp.asnumpy(result) copies back to the CPU for downstream processing
- • Agent large matrix operations — A = cp.random.randn(10000, 10000); B = cp.random.randn(10000, 1000); C = cp.dot(A, B) — 10K×10K by 10K×1K matrix multiply on the GPU; agent numerical computations on large datasets run on GPU without a PyTorch dependency
- • Agent custom CUDA kernel — kernel = cp.ElementwiseKernel('float32 x, float32 y', 'float32 z', 'z = x * x + y'); agent applies element-wise custom math to GPU arrays; ElementwiseKernel compiles once and reuses across calls
- • Agent sparse GPU computation — from cupyx.scipy.sparse import csr_matrix; sparse_gpu = csr_matrix(cpu_sparse); result = sparse_gpu.dot(vector_gpu) — agent graph or embedding operations on sparse GPU matrices; cupyx.scipy.sparse mirrors scipy.sparse API
- • Agent PyTorch interop — cp_array = cp.asarray(torch_tensor) — zero-copy conversion via DLPack; agent pipelines mixing CuPy preprocessing with PyTorch inference avoid CPU round-trips; torch.from_dlpack(cp_array) converts back to a PyTorch tensor on the same GPU
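The custom-kernel use case above can be sketched as follows. An `ElementwiseKernel` compiles its CUDA C body on first call and reuses the compiled kernel afterwards. The GPU guard and the `squared_add` kernel name are illustrative assumptions; on CPU-only hosts the script falls back to an equivalent NumPy expression used here as a correctness reference.

```python
# Sketch of cp.ElementwiseKernel (guarded so it also runs on CPU-only hosts).
import numpy as np

x_cpu = np.arange(4, dtype=np.float32)
y_cpu = np.ones(4, dtype=np.float32)
expected = x_cpu * x_cpu + y_cpu          # NumPy reference for z = x*x + y

try:
    import cupy as cp
    cp.cuda.runtime.getDeviceCount()      # raises when no usable GPU
    squared_add = cp.ElementwiseKernel(
        'float32 x, float32 y',           # input parameters
        'float32 z',                      # output parameter
        'z = x * x + y',                  # per-element CUDA C body
        'squared_add')                    # kernel name (assumed, arbitrary)
    out = cp.asnumpy(squared_add(cp.asarray(x_cpu), cp.asarray(y_cpu)))
    assert np.allclose(out, expected)     # kernel matches the CPU reference
except Exception:
    out = expected                        # CPU-only host: use the reference

print(out.tolist())  # [1.0, 2.0, 5.0, 10.0]
```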
Not For
- • CPU-only machines — CuPy requires NVIDIA CUDA or AMD ROCm GPU; no CPU fallback; agent deployments without GPU cannot use CuPy
- • General ML model training — use PyTorch or JAX for neural networks; CuPy is for array/numerical computation not autograd
- • Small arrays — GPU transfer overhead makes CuPy slower than NumPy for small arrays (<10K elements); only beneficial at scale
Interface
Authentication
No auth — local GPU library.
Pricing
CuPy is MIT licensed. Requires NVIDIA CUDA toolkit (free) or ROCm (free). GPU hardware required.
Agent Metadata
Known Gotchas
- ⚠ Install cupy-cuda12x not cupy — pip install cupy builds from source and needs a local CUDA toolkit, which is slow and often fails in CI; agent CI must install the prebuilt wheel (cupy-cuda11x, cupy-cuda12x, or cupy-rocm-5-0) matching the GPU driver; a wheel mismatched with the installed CUDA runtime fails at import time with CUDA library load errors
- ⚠ GPU memory not freed automatically — CuPy arrays hold GPU memory until garbage collected, and the default memory pool caches freed blocks for reuse rather than returning them to the driver; agent loops allocating many arrays can hit GPU OOM; call cp.get_default_memory_pool().free_all_blocks() between agent tasks
- ⚠ cp.asnumpy() is synchronous transfer — converting GPU array to NumPy blocks until GPU computation completes; agent pipelines doing cp.asnumpy() after every operation lose GPU parallelism; batch GPU ops then single asnumpy() at pipeline end
- ⚠ CuPy version must match CUDA toolkit exactly — cupy-cuda12x requires CUDA 12.x; agent Docker images with CUDA 11 cannot use cupy-cuda12x wheel; pin both CuPy version and CUDA base image version in agent Dockerfiles
- ⚠ Multi-GPU requires explicit device context — cp.cuda.Device(1).use() switches GPU; without it all agent operations default to GPU 0; agent multi-GPU code must use 'with cp.cuda.Device(n):' context manager to isolate operations per GPU
- ⚠ First kernel call triggers JIT compilation — first cp.ElementwiseKernel call compiles CUDA C and takes seconds; agent benchmarks without warmup show misleadingly slow first-call times; always warm up kernels with dummy call before benchmarking agent workloads
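Two of the gotchas above (pooled GPU memory and first-call JIT compilation) combine into this hedged sketch. The task body and array sizes are illustrative assumptions, not part of CuPy's API; the guard lets the script degrade to a message on CPU-only hosts.

```python
# Sketch: free pooled GPU memory between agent tasks, warm up kernels first.
try:
    import cupy as cp
    cp.cuda.runtime.getDeviceCount()      # raises when no usable GPU
    HAVE_GPU = True
except Exception:
    HAVE_GPU = False

def agent_task():
    # Illustrative workload: FFT of a random matrix (sizes are assumptions).
    x = cp.random.randn(1024, 1024, dtype=cp.float32)
    return float(cp.abs(cp.fft.fft2(x)).sum())  # syncs + copies scalar to host

if HAVE_GPU:
    agent_task()                          # warmup: JIT-compile off the clock
    mempool = cp.get_default_memory_pool()
    for _ in range(3):
        agent_task()
        mempool.free_all_blocks()         # return cached blocks to the driver
    status = "freed GPU pool between tasks"
else:
    status = "no GPU available; skipped"
print(status)
```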
Alternatives
Full Evaluation Report
Detailed scoring breakdown, competitive positioning, security analysis, and improvement recommendations for CuPy.
Scores are editorial opinions as of 2026-03-06.