pdf-parse

A simple PDF text extraction library for Node.js, built on Mozilla's pdf.js. It extracts text content, page count, and PDF metadata (author, title, creation date), and supports page-by-page extraction through a render callback. It runs server-side without browser dependencies and is widely used in Node.js text-extraction pipelines.
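The page-by-page callback can be sketched roughly as follows. In pdf-parse v1.x the callback is supplied through the `pagerender` option and receives a pdf.js page object. The helper names `itemsToLines` and `renderPage` are hypothetical. Grouping items by y coordinate loosely follows the default renderer shown in the library's README; it is not something this evaluation specifies.

```javascript
// Group pdf.js text items into lines: items that share a y coordinate
// (transform[5]) are concatenated, and a change in y starts a new line.
function itemsToLines(items) {
  const lines = [];
  let lastY = null;
  for (const item of items) {
    const y = item.transform[5];
    if (lastY === null || y !== lastY) {
      lines.push(item.str);
    } else {
      lines[lines.length - 1] += item.str;
    }
    lastY = y;
  }
  return lines;
}

// Hypothetical per-page callback for the `pagerender` option.
function renderPage(pageData) {
  return pageData
    .getTextContent()
    .then((content) => itemsToLines(content.items).join('\n'));
}

// Usage (assumes `npm install pdf-parse` and a local file named report.pdf):
// const fs = require('fs');
// const pdf = require('pdf-parse');
// pdf(fs.readFileSync('report.pdf'), { pagerender: renderPage })
//   .then((data) => console.log(data.text));
```

Keeping the line-grouping logic in a plain function makes it possible to test the callback without loading a real PDF.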

Evaluated Mar 06, 2026 (v1.x)

Category: Developer Tools

Tags: pdf, parsing, text-extraction, node, document, pdf.js

Score Breakdown

⚙ Agent Friendliness

• MCP Quality: (score not shown)
• Documentation: (score not shown)
• Error Messages: (score not shown)
• Auth Simplicity: 100
• Rate Limits: 100

🔒 Security

• TLS Enforcement: 100
• Auth Strength: 100
• Scope Granularity: 100
• Dep. Hygiene: (score not shown)
• Secret Handling: 100

All processing is local. The library is minimally maintained, which is a dependency hygiene concern. Treat untrusted PDFs with care: malformed files can throw exceptions or consume excessive memory.
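One way to act on this advice is to run cheap checks before parsing and to catch parser failures. The sketch below is a minimal illustration, not a complete defense. The names `precheckPdf` and `safeExtract`, the 20 MB cap, and the 200-page limit are all assumptions chosen for the example. The `max` option is pdf-parse v1.x's page limit.

```javascript
const MAX_BYTES = 20 * 1024 * 1024; // hypothetical cap; tune for your memory budget

// Cheap checks before handing a buffer to the parser.
function precheckPdf(buffer, maxBytes = MAX_BYTES) {
  if (!Buffer.isBuffer(buffer) || buffer.length === 0) {
    return { ok: false, reason: 'empty or not a Buffer' };
  }
  if (buffer.length > maxBytes) {
    return { ok: false, reason: 'file too large' };
  }
  // A PDF file normally starts with the "%PDF-" header.
  if (buffer.subarray(0, 5).toString('latin1') !== '%PDF-') {
    return { ok: false, reason: 'missing %PDF- header' };
  }
  return { ok: true };
}

// Wrap the parse so a malformed file yields a value instead of a rejection.
// `parse` defaults to pdf-parse and is only required when no stub is passed.
async function safeExtract(buffer, parse = require('pdf-parse')) {
  const check = precheckPdf(buffer);
  if (!check.ok) return { ok: false, reason: check.reason };
  try {
    const data = await parse(buffer, { max: 200 }); // assumed page limit
    return { ok: true, text: data.text, pages: data.numpages };
  } catch (err) {
    return { ok: false, reason: String((err && err.message) || err) };
  }
}
```

A size cap on the input file does not bound memory on its own, because compressed streams can expand during parsing. For stronger isolation, consider running the parse in a separate process with a memory limit.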

⚡ Reliability

• Uptime/SLA: (score not shown)
• Version Stability: (score not shown)
• Breaking Changes: (score not shown)
• Error Recovery: (score not shown)

Best When

You need simple, quick PDF text extraction in Node.js for documents with embedded text (not scanned images).

Avoid When

Your PDFs are scanned images, require table structure preservation, or need complex layout analysis — use specialized tools.

Use Cases

• Extract text content from PDF documents in agent document processing pipelines for RAG or analysis
• Parse PDF metadata (author, title, creation date, page count) for document indexing workflows
• Implement PDF-to-text conversion in agent content pipelines feeding into LLM contexts
• Extract text from uploaded PDF files in Node.js server applications for search indexing
• Process PDF reports, invoices, or contracts for agent data extraction workflows
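The use cases above share the same basic call: pass a Buffer to pdf-parse and read `text`, `numpages`, and `info` from the result. The sketch below maps that result to a flat record for indexing. `toIndexRecord` and `extractForIndex` are hypothetical names. The `Title`, `Author`, and `CreationDate` keys are standard PDF info fields, but any of them may be absent from a given file.

```javascript
// Map a pdf-parse v1.x result ({ numpages, info, text }) to a flat record.
function toIndexRecord(result, source) {
  const info = result.info || {};
  return {
    source,
    pages: result.numpages,
    title: info.Title || null,
    author: info.Author || null,
    created: info.CreationDate || null, // raw PDF date string, not parsed here
    text: (result.text || '').trim(),
  };
}

// Read a file and extract it. `parse` defaults to pdf-parse and is only
// required when no replacement is passed (useful for tests).
async function extractForIndex(path, parse = require('pdf-parse')) {
  const fs = require('fs');
  const buffer = await fs.promises.readFile(path);
  return toIndexRecord(await parse(buffer), path);
}

// Usage (assumes `npm install pdf-parse` and a local file named invoice.pdf):
// extractForIndex('invoice.pdf').then((record) => console.log(record.title, record.pages));
```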

Not For

• Complex PDF manipulation (merging, splitting, form filling) — use pdf-lib or PDFKit for PDF creation/modification
• Scanned PDFs or image-based PDFs — pdf-parse extracts embedded text only; use OCR (Tesseract) for scanned documents
• High-accuracy table extraction — pdf-parse extracts raw text without layout/table structure; use tabula-py or Camelot for structured table data

Interface

• REST API: (not shown)
• GraphQL: (not shown)
• gRPC: (not shown)
• MCP Server: (not shown)
• SDK: Yes
• Webhooks: (not shown)

Authentication

Methods: none

OAuth: No

Scopes: No

Local library — no authentication required. MIT licensed.

Pricing

Model: open_source

Free tier: Yes

Requires CC: No

MIT licensed. Zero cost.

Agent Metadata

Pagination: none

Idempotent: Full

Retry Guidance: Not documented

Known Gotchas

⚠ pdf-parse extracts embedded text only. Scanned (image-based) PDFs return empty or near-empty text; to detect this, compare numpages with text.length, since a non-zero page count with little or no text suggests a scanned document
⚠ Text extraction may not preserve reading order for multi-column layouts — text from columns may be interleaved; layout-aware extraction requires pdfreader or more advanced tools
⚠ Memory usage scales with PDF size — large PDFs can cause out-of-memory errors; stream processing isn't natively supported
⚠ Package is minimally maintained (GitLab, last release 2019) — consider pdf2json or unpdf as more actively maintained alternatives for new projects
⚠ Password-protected PDFs are not supported — extraction fails with encrypted PDF error; pre-decrypt with qpdf CLI if needed
⚠ Text encoding issues: some PDFs use custom encodings — extracted text may contain garbled characters for non-standard fonts; inspect result for encoding artifacts
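Two of the gotchas above, empty extractions from scanned PDFs and encoding artifacts, can be screened with simple heuristics on the parse result. The sketch below is illustrative only. `looksScanned` and `garbledRatio` are hypothetical helpers, and the threshold of 20 characters per page is an arbitrary assumption that should be tuned on real documents.

```javascript
// Heuristic: pages exist but almost no non-whitespace text came back,
// which suggests a scanned (image-based) PDF that needs OCR.
function looksScanned(result, minCharsPerPage = 20) {
  const chars = (result.text || '').replace(/\s/g, '').length;
  return result.numpages > 0 && chars / result.numpages < minCharsPerPage;
}

// Heuristic: share of U+FFFD replacement characters and non-whitespace
// control characters in the text. A high ratio hints at encoding problems.
function garbledRatio(text) {
  if (!text) return 0;
  const bad = text.match(/[\uFFFD\u0000-\u0008\u000E-\u001F]/g);
  return (bad ? bad.length : 0) / text.length;
}

// Usage with a pdf-parse result `data`:
// if (looksScanned(data)) { /* route the file to an OCR step */ }
// if (garbledRatio(data.text) > 0.05) { /* flag the file for manual review */ }
```

Neither check is conclusive. Custom font encodings can produce readable-looking but wrong characters that no ratio will catch, so spot checks on sample output remain useful.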

Alternatives

pdfminer-api, pymupdf-api, unstructured-api

Scores are editorial opinions as of 2026-03-06.