pdf-parse

Simple PDF text extraction library for Node.js, built on Mozilla's pdf.js. It extracts text content, page count, and PDF metadata (author, title, creation date), and supports a per-page callback for page-by-page text extraction. It runs server-side with no browser dependencies and is widely used in Node.js pipelines.
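As an illustration of the page-by-page callback, the sketch below uses the `pagerender` option of pdf-parse v1.x to collect each page's text separately. The helper names `itemsToText` and `makePageCollector` are hypothetical, and the line-break rule (a new line whenever the vertical position changes) is a simplifying assumption, not the library's exact default renderer.

```javascript
// Pure helper: join pdf.js text items, starting a new line whenever the
// vertical position (transform[5]) differs from the previous item.
function itemsToText(items) {
  let lastY = null;
  let text = '';
  for (const item of items) {
    const y = item.transform[5];
    if (lastY !== null && y !== lastY) text += '\n';
    text += item.str;
    lastY = y;
  }
  return text;
}

// Builds pdf-parse options whose `pagerender` callback also records each
// page's text in `pages` (assumed to arrive in page order).
function makePageCollector() {
  const pages = [];
  const options = {
    pagerender: (pageData) =>
      pageData.getTextContent().then((content) => {
        const text = itemsToText(content.items);
        pages.push(text);
        return text; // pdf-parse concatenates the returned strings into data.text
      }),
  };
  return { pages, options };
}
```

A possible call site, assuming pdf-parse is installed: `const { pages, options } = makePageCollector(); const data = await require('pdf-parse')(buffer, options);` after which `pages[0]` would hold the first page's text.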

Evaluated Mar 06, 2026 · v1.x
Developer Tools pdf parsing text-extraction node document pdf.js
⚙ Agent Friendliness: 62 / 100 (Can an agent use this?)
🔒 Security: 95 / 100 (Is it safe for agents?)
⚡ Reliability: 79 / 100 (Does it work consistently?)

Score Breakdown

⚙ Agent Friendliness

MCP Quality: --
Documentation: 75
Error Messages: 68
Auth Simplicity: 100
Rate Limits: 100

🔒 Security

TLS Enforcement: 100
Auth Strength: 100
Scope Granularity: 100
Dep. Hygiene: 68
Secret Handling: 100

Local computation. Minimally maintained library — dependency hygiene concern. Process untrusted PDFs carefully; malformed PDFs can cause exceptions or excessive memory use.
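One way to approach untrusted input is to validate the buffer before parsing and to convert parser exceptions into a result object. The sketch below is a minimal example under stated assumptions: `checkInput` and `safeParse` are hypothetical helpers, the 20 MB limit is an arbitrary placeholder, and `max` is the pdf-parse v1.x option that caps the number of pages parsed.

```javascript
// Assumed size limit; tune for your workload.
const MAX_BYTES = 20 * 1024 * 1024;

// Cheap pre-checks that run before any parsing work is done.
function checkInput(buffer, maxBytes = MAX_BYTES) {
  if (!Buffer.isBuffer(buffer)) return { ok: false, reason: 'not-a-buffer' };
  if (buffer.length > maxBytes) return { ok: false, reason: 'too-large' };
  // A PDF header ("%PDF-") is expected near the start of the file.
  if (!buffer.subarray(0, 1024).includes('%PDF-')) {
    return { ok: false, reason: 'not-a-pdf' };
  }
  return { ok: true };
}

// Parses only inputs that pass the checks; never throws to the caller.
async function safeParse(buffer, options = {}) {
  const check = checkInput(buffer, options.maxBytes);
  if (!check.ok) return check;
  try {
    const pdf = require('pdf-parse'); // lazy: loaded only when needed
    const data = await pdf(buffer, { max: options.maxPages || 0 }); // 0 = all pages
    return { ok: true, data };
  } catch (err) {
    return { ok: false, reason: 'parse-error', error: err };
  }
}
```

A try/catch does not protect against out-of-memory crashes, so for hostile inputs it may be worth running the parse in a child process or worker with a memory limit.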

⚡ Reliability

Uptime/SLA: 90
Version Stability: 75
Breaking Changes: 80
Error Recovery: 70

Best When

You need simple, quick PDF text extraction in Node.js for documents with embedded text (not scanned images).

Avoid When

Your PDFs are scanned images, require table structure preservation, or need complex layout analysis — use specialized tools.

Use Cases

  • Extract text content from PDF documents in agent document processing pipelines for RAG or analysis
  • Parse PDF metadata (author, title, creation date, page count) for document indexing workflows
  • Implement PDF-to-text conversion in agent content pipelines feeding into LLM contexts
  • Extract text from uploaded PDF files in Node.js server applications for search indexing
  • Process PDF reports, invoices, or contracts for agent data extraction workflows
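The extraction and metadata use cases above can be sketched as follows, assuming pdf-parse v1.x, whose result object exposes `numpages`, `info`, and `text`. The helper names `summarize` and `extractPdf` are hypothetical, and the `info` fields (`Title`, `Author`, `CreationDate`) come from the PDF's info dictionary, so any of them may be absent.

```javascript
const fs = require('fs');

// Pure helper: reshape a pdf-parse result into a small summary object.
function summarize(data) {
  const info = data.info || {};
  return {
    pages: data.numpages,
    title: info.Title || null,
    author: info.Author || null,
    created: info.CreationDate || null, // raw PDF date string, not a Date
    text: (data.text || '').trim(),
  };
}

// Reads a file from disk and parses it. pdf-parse is required lazily so
// summarize() stays usable without the package installed.
async function extractPdf(path) {
  const pdf = require('pdf-parse');
  const buffer = fs.readFileSync(path);
  return summarize(await pdf(buffer));
}
```

A caller might then index `summary.title` and `summary.author`, and pass `summary.text` to a search index or an LLM context.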

Not For

  • Complex PDF manipulation (merging, splitting, form filling) — use pdf-lib or PDFKit for PDF creation/modification
  • Scanned PDFs or image-based PDFs — pdf-parse extracts embedded text only; use OCR (Tesseract) for scanned documents
  • High-accuracy table extraction — pdf-parse extracts raw text without layout/table structure; use tabula-py or Camelot for structured table data

Interface

REST API: No
GraphQL: No
gRPC: No
MCP Server: No
SDK: Yes
Webhooks: No

Authentication

Methods: none
OAuth: No
Scopes: No

Local library — no authentication required. MIT licensed.

Pricing

Model: open_source
Free tier: Yes
Requires CC: No

MIT licensed. Zero cost.

Agent Metadata

Pagination: none
Idempotent: Full
Retry Guidance: Not documented

Known Gotchas

  • pdf-parse only extracts EMBEDDED text — scanned PDFs (image-based) return empty text; check numpages vs text.length to detect empty extractions
  • Text extraction may not preserve reading order for multi-column layouts — text from columns may be interleaved; layout-aware extraction requires pdfreader or more advanced tools
  • Memory usage scales with PDF size — large PDFs can cause out-of-memory errors; stream processing isn't natively supported
  • Package is minimally maintained (GitLab, last release 2019) — consider pdf2json or unpdf as more actively maintained alternatives for new projects
  • Password-protected PDFs are not supported — extraction fails with encrypted PDF error; pre-decrypt with qpdf CLI if needed
  • Text encoding issues: some PDFs use custom encodings — extracted text may contain garbled characters for non-standard fonts; inspect result for encoding artifacts
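Two of the gotchas above (empty extractions from scanned PDFs, and encoding artifacts) lend themselves to simple post-extraction checks. The sketch below is a heuristic only: `looksScanned` and `garbledRatio` are hypothetical helpers, and the 20 characters-per-page threshold is an assumed value that should be tuned on real documents.

```javascript
// Flags results whose non-whitespace text is too sparse for the page count,
// which often indicates an image-only (scanned) PDF.
function looksScanned(data, minCharsPerPage = 20) {
  const pages = data.numpages || 0;
  const chars = (data.text || '').replace(/\s+/g, '').length;
  if (pages === 0) return chars === 0;
  return chars / pages < minCharsPerPage;
}

// Fraction of characters that are U+FFFD replacement characters or
// unexpected control characters; a high value suggests encoding problems.
function garbledRatio(text) {
  if (!text) return 0;
  const bad = (text.match(/[\uFFFD\u0000-\u0008\u000E-\u001F]/g) || []).length;
  return bad / text.length;
}
```

A pipeline could route documents where `looksScanned(data)` is true to OCR, and log or review those with a high `garbledRatio(data.text)`.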

Scores are editorial opinions as of 2026-03-06.
