pdfminer.six

Pure-Python PDF text extraction library (maintained fork of pdfminer) that parses PDF structure and returns text with precise character-level position coordinates.

Evaluated Mar 06, 2026 · v20231228

Category: Developer Tools

Tags: python, pdf, text-extraction, layout-analysis, pure-python, coordinates, no-binary-deps

⚙ Agent Friendliness

/ 100

Can an agent use this?

🔒 Security

/ 100

Is it safe for agents?

⚡ Reliability

/ 100

Does it work consistently?

Score Breakdown

⚙ Agent Friendliness

MCP Quality

Documentation

Error Messages

Auth Simplicity

100

Rate Limits

100

🔒 Security

TLS Enforcement

Auth Strength

Scope Granularity

Dep. Hygiene

Secret Handling

The parser is pure Python with no native code of its own, which reduces attack surface. Still, process only trusted PDFs, since malformed inputs can cause excessive memory or CPU usage.

⚡ Reliability

Uptime/SLA

Version Stability

Breaking Changes

Error Recovery

Best When

You need character-level position data for layout reconstruction or you require a pure-Python solution with zero binary dependencies.

Avoid When

Processing thousands of PDFs in a batch pipeline where extraction speed is the primary constraint.

Use Cases

• Extracting text with exact bounding box coordinates for layout-aware document processing
• Parsing PDFs in environments where binary C extensions cannot be installed
• Reconstructing reading order from complex multi-column or magazine-style PDF layouts
• Extracting text from password-protected PDFs by supplying the decryption password
• Analyzing PDF internal structure (fonts, objects, streams) for forensic or debugging purposes
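
The coordinate and password use cases above can be sketched with the `extract_pages()` helper from `pdfminer.high_level`. This is a minimal sketch rather than a recommended pipeline. The `TextBox` dataclass and the `iter_text_boxes` and `sort_top_down` helpers are illustrative names, not part of pdfminer.six. The sort is a naive single-column ordering and assumes PDF coordinates with the origin at the bottom-left of the page; real multi-column layouts need more than this.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator, List


@dataclass
class TextBox:
    """Hypothetical container for one text block and its bounding box."""
    page: int
    x0: float
    y0: float
    x1: float
    y1: float
    text: str


def iter_text_boxes(pdf_path: str, password: str = "") -> Iterator[TextBox]:
    """Yield every text container on every page with its bounding box."""
    # Imported lazily so the pure-Python helpers below remain usable
    # even where pdfminer.six is not installed.
    from pdfminer.high_level import extract_pages
    from pdfminer.layout import LTTextContainer

    pages = extract_pages(pdf_path, password=password)
    for page_number, page_layout in enumerate(pages, start=1):
        for element in page_layout:
            if isinstance(element, LTTextContainer):
                x0, y0, x1, y1 = element.bbox
                yield TextBox(page_number, x0, y0, x1, y1,
                              element.get_text().strip())


def sort_top_down(boxes: Iterable[TextBox]) -> List[TextBox]:
    """Order boxes by page, then top to bottom, then left to right.

    Assumes the PDF origin is bottom-left, so a larger y1 is higher up.
    """
    return sorted(boxes, key=lambda b: (b.page, -b.y1, b.x0))
```

A typical call would be `sort_top_down(iter_text_boxes("report.pdf", password="secret"))`, where the file name and password are placeholders.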

Not For

• High-throughput bulk extraction where speed is critical (PyMuPDF is 5-10x faster)
• Extracting images or rendering pages to raster graphics (use PyMuPDF for that)
• Scanned PDFs where text is embedded in images rather than as native PDF content
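
Since throughput is the main reason to look elsewhere, it can help to measure extraction time on a sample of your own documents before committing. The harness below is a minimal sketch; `median_seconds` is a hypothetical helper name, and it accepts any callable that takes a file path, so it makes no assumption about which library does the work.

```python
import time
from statistics import median
from typing import Callable, Sequence


def median_seconds(extract: Callable[[str], object],
                   paths: Sequence[str],
                   repeats: int = 3) -> float:
    """Return the median wall-clock time to run `extract` over all paths."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        for path in paths:
            extract(path)
        timings.append(time.perf_counter() - start)
    return median(timings)
```

For pdfminer.six, `extract` could be `extract_text` imported from `pdfminer.high_level`; a second callable wrapping another library can be timed over the same paths for comparison.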

Interface

REST API

GraphQL

gRPC

MCP Server

SDK

Yes

Webhooks

Authentication

Methods: none

OAuth: No Scopes: No

Local Python library — no authentication required

Pricing

Model: open_source

Free tier: Yes

Requires CC: No

MIT license. pdfminer.six is the community-maintained fork of the original pdfminer project. Free to use, including commercially; the license only requires keeping the copyright and license notice.

Agent Metadata

Pagination: none

Idempotent: Full

Retry Guidance: Not documented

Known Gotchas

⚠ The high-level extract_text() helper returns plain text without coordinates; agents needing positions must use extract_pages() or the lower-level PDFPageAggregator pipeline configured with LAParams
⚠ Some PDFs with custom encodings return garbled Unicode or (cid:N) placeholders for unmapped glyphs; there is no automatic fallback or warning
⚠ Significantly slower than PyMuPDF — benchmark before using in latency-sensitive agent steps
⚠ Silently returns empty or whitespace-only text (for example, only form-feed page separators) for image-only (scanned) PDFs, with no exception raised
⚠ pdfminer (unmaintained original) and pdfminer.six coexist on PyPI; always install pdfminer.six (pip install pdfminer.six) to avoid the dead package
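
Because the empty-output and garbled-encoding cases above raise no exception, an agent has to inspect the returned text itself. The check below is a heuristic sketch under stated assumptions: `diagnose_extraction` and `extract_and_diagnose` are hypothetical helpers, the 10% threshold is an arbitrary starting point, and the check only catches glyphs that come back as `(cid:N)` placeholders, not text that was mapped to the wrong Unicode characters.

```python
import re
from typing import Tuple

# Placeholder text emitted for glyphs that have no Unicode mapping.
CID_PATTERN = re.compile(r"\(cid:\d+\)")


def diagnose_extraction(text: str, cid_ratio_threshold: float = 0.1) -> str:
    """Classify extracted text as 'empty', 'garbled', or 'ok'."""
    stripped = text.strip()  # also removes form-feed page separators
    if not stripped:
        return "empty"  # likely image-only (scanned) input; needs OCR
    non_space = "".join(stripped.split())
    cid_chars = sum(len(m) for m in CID_PATTERN.findall(non_space))
    if cid_chars / len(non_space) >= cid_ratio_threshold:
        return "garbled"  # many glyphs had no Unicode mapping
    return "ok"


def extract_and_diagnose(pdf_path: str, password: str = "") -> Tuple[str, str]:
    """Run extract_text() and return (text, diagnosis)."""
    # Imported lazily so diagnose_extraction() works without pdfminer.six.
    from pdfminer.high_level import extract_text

    text = extract_text(pdf_path, password=password)
    return text, diagnose_extraction(text)
```

An agent could route "empty" results to an OCR step and "garbled" results to a different extractor, rather than passing unusable text downstream.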

Alternatives

pymupdf-api pdfplumber-api pikepdf-api

Scores are editorial opinions as of 2026-03-06.