pypdf
Pure-Python library for reading, splitting, merging, encrypting, decrypting, and extracting content from PDF files without any native binary dependencies.
Score Breakdown
⚙ Agent Friendliness
🔒 Security
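Pure Python with no native code reduces binary attack surface. PDF encryption support covers RC4 and AES; AES requires an optional cryptography backend (for example, installing pypdf[crypto]). Before trusting the confidentiality of an encrypted PDF, validate that it uses AES-256, not legacy RC4.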
⚡ Reliability
Best When
You need to manipulate PDF structure (merge, split, rotate, encrypt/decrypt, extract pages) using a zero-dependency pure-Python solution.
Avoid When
Accurate text extraction with positional awareness or table parsing is required; pypdf's text extraction is best-effort and often produces garbled output on complex layouts.
Use Cases
- Merge multiple PDF files into a single document for report assembly workflows
- Split a large PDF into individual pages or chapter ranges for per-page processing pipelines
- Decrypt password-protected PDFs before passing them to extraction or analysis tools
- Read and populate PDF form field values (AcroForm) for automated document completion workflows
- Rotate, crop, or reorder pages in a PDF as part of a document normalization step before OCR or ingestion
Not For
- High-quality text extraction with layout preservation: pdfplumber or PyMuPDF produce far better results
- Table extraction from PDFs: use pdfplumber or camelot for structured table data
- Rendering PDF pages to images: pypdf cannot render pages; use pdf2image or PyMuPDF for rasterization
Interface
Authentication
Library — no authentication required.
Pricing
BSD 3-Clause licensed. Successor to the deprecated PyPDF2 package.
Agent Metadata
Known Gotchas
- ⚠ pypdf (formerly PyPDF2) underwent a rename and significant API changes — code using PyPDF2 imports will break; always use 'from pypdf import PdfReader, PdfWriter'.
- ⚠ extract_text() produces unreliable output for PDFs with complex layouts, ligatures, or custom font encodings; do not rely on it for data extraction without manual validation.
- ⚠ Encrypted PDFs must be decrypted with reader.decrypt(password) before any page access; attempting to access pages on an encrypted PDF raises a FileNotDecryptedError with no partial content.
- ⚠ PdfWriter does not copy form field data automatically when merging pages from a PdfReader; AcroForm dictionaries must be copied separately to preserve interactive forms.
- ⚠ Large PDFs are loaded entirely into memory; for files exceeding a few hundred MB, memory pressure can cause OOM errors in constrained agent environments — stream page-by-page when possible.
Alternatives
Scores are editorial opinions as of 2026-03-06.