pdfminer.six

Pure-Python PDF text extraction library (maintained fork of pdfminer) that parses PDF structure and returns text with precise character-level position coordinates.

Evaluated Mar 06, 2026, v20231228
Category: Developer Tools
Tags: python, pdf, text-extraction, layout-analysis, pure-python, coordinates, no-binary-deps
⚙ Agent Friendliness: 65 / 100 (Can an agent use this?)
🔒 Security: 88 / 100 (Is it safe for agents?)
⚡ Reliability: 80 / 100 (Does it work consistently?)

Score Breakdown

⚙ Agent Friendliness

MCP Quality: --
Documentation: 80
Error Messages: 76
Auth Simplicity: 100
Rate Limits: 100

🔒 Security

TLS Enforcement: 90
Auth Strength: 90
Scope Granularity: 85
Dep. Hygiene: 84
Secret Handling: 90

Pure Python with no native binary code, which reduces the attack surface. Even so, process only trusted PDFs: malformed inputs can cause excessive memory or CPU usage.
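
One way to limit the damage from a pathological PDF is to run extraction in a child process with a wall-clock timeout. The sketch below is an illustration under stated assumptions, not a pdfminer.six feature: `run_isolated` and `extract_text_isolated` are hypothetical helper names, the 30-second default is arbitrary, and the timeout bounds elapsed time only, not memory.

```python
import subprocess
import sys
from typing import Optional

# Script run in the child process: extract text and write it to stdout as UTF-8.
_EXTRACT_SCRIPT = (
    "import sys\n"
    "from pdfminer.high_level import extract_text\n"
    "sys.stdout.buffer.write(extract_text(sys.argv[1]).encode('utf-8'))\n"
)


def run_isolated(script: str, *args: str, timeout: float = 30.0) -> Optional[str]:
    """Run a Python script in a child process.

    Returns the child's stdout, or None if it timed out or exited with an error.
    """
    try:
        result = subprocess.run(
            [sys.executable, "-c", script, *args],
            capture_output=True,
            timeout=timeout,  # the child is killed when the timeout expires
            check=True,       # a non-zero exit status raises CalledProcessError
        )
    except (subprocess.TimeoutExpired, subprocess.CalledProcessError):
        return None
    return result.stdout.decode("utf-8", errors="replace")


def extract_text_isolated(path: str, timeout: float = 30.0) -> Optional[str]:
    """Extract text from one PDF without letting a bad file stall the caller."""
    return run_isolated(_EXTRACT_SCRIPT, path, timeout=timeout)
```

A caller would treat a `None` result as "skip or quarantine this file" instead of retrying it in the main process.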

⚡ Reliability

Uptime/SLA: 80
Version Stability: 82
Breaking Changes: 80
Error Recovery: 76

Best When

You need character-level position data for layout reconstruction or you require a pure-Python solution with zero binary dependencies.

Avoid When

Processing thousands of PDFs in a batch pipeline where extraction speed is the primary constraint.

Use Cases

  • Extracting text with exact bounding box coordinates for layout-aware document processing
  • Parsing PDFs in environments where binary C extensions cannot be installed
  • Reconstructing reading order from complex multi-column or magazine-style PDF layouts
  • Extracting text from password-protected PDFs by supplying the decryption password
  • Analyzing PDF internal structure (fonts, objects, streams) for forensic or debugging purposes
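
The bounding-box and reading-order use cases above can be sketched with `extract_pages()` from `pdfminer.high_level`, which yields one layout tree per page. The helper names `text_boxes` and `reading_order` are illustrative and not part of pdfminer.six, and the sort is a naive top-to-bottom, left-to-right ordering that assumes a single-column page; multi-column layouts need a more careful grouping step. The pdfminer import sits inside the function so the sorting helper can be used on its own.

```python
from typing import Iterator, List, Tuple

# (page_number, x0, y0, x1, y1, text); coordinates are in PDF points with the
# origin at the bottom-left corner of the page.
Box = Tuple[int, float, float, float, float, str]


def text_boxes(path: str, password: str = "") -> Iterator[Box]:
    """Yield one Box per text container found by pdfminer.six layout analysis."""
    # Imported lazily so that reading_order() works without pdfminer installed.
    from pdfminer.high_level import extract_pages
    from pdfminer.layout import LAParams, LTTextContainer

    pages = extract_pages(path, password=password, laparams=LAParams())
    for page_number, page in enumerate(pages, start=1):
        for element in page:
            if isinstance(element, LTTextContainer):
                x0, y0, x1, y1 = element.bbox
                yield (page_number, x0, y0, x1, y1, element.get_text().strip())


def reading_order(boxes: List[Box]) -> List[Box]:
    """Naive ordering (an assumption): page, then top to bottom, then left to right."""
    # y grows upward in PDF space, so a larger y1 means higher on the page.
    return sorted(boxes, key=lambda b: (b[0], -b[4], b[1]))
```

For example, `reading_order(list(text_boxes("report.pdf")))` would return the boxes of a hypothetical `report.pdf` in that order; pass `password=` for an encrypted file.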

Not For

  • High-throughput bulk extraction where speed is critical (PyMuPDF is 5-10x faster)
  • Extracting images or rendering pages to raster graphics (use PyMuPDF for that)
  • Scanned PDFs where text is embedded in images rather than as native PDF content

Interface

REST API: No
GraphQL: No
gRPC: No
MCP Server: No
SDK: Yes
Webhooks: No

Authentication

Methods: none
OAuth: No
Scopes: No

Local Python library — no authentication required

Pricing

Model: open_source
Free tier: Yes
Requires CC: No

MIT license. pdfminer.six is the community-maintained fork of the original pdfminer project. Completely free with no usage restrictions.

Agent Metadata

Pagination: none
Idempotent: Full
Retry Guidance: Not documented

Known Gotchas

  • The high-level extract_text() helper returns plain text and discards coordinates. Agents needing positions should use extract_pages() from pdfminer.high_level, or the lower-level PDFPageAggregator pipeline configured with LAParams
  • Some PDFs with custom encodings return garbled Unicode; there is no automatic fallback or warning
  • Significantly slower than PyMuPDF — benchmark before using in latency-sensitive agent steps
  • Returns an empty or whitespace-only string for image-only (scanned) PDFs, with no exception raised
  • pdfminer (unmaintained original) and pdfminer.six coexist on PyPI — always install pdfminer.six to avoid installing the dead package

Scores are editorial opinions as of 2026-03-06.
