tabula-py

Python wrapper for the Tabula Java library that extracts tables from PDF documents and returns them as pandas DataFrames.

Evaluated Mar 06, 2026 (v2.9.x)

Category: Developer Tools

Tags: python, pdf, tables, pandas, java, data-extraction, tabular-data

Score Breakdown

⚙ Agent Friendliness

MCP Quality

Documentation

Error Messages

Auth Simplicity

100

Rate Limits

100

🔒 Security

TLS Enforcement

Auth Strength

Scope Granularity

Dep. Hygiene

Secret Handling

Spawns a Java subprocess. Ensure the tabula-java jar is sourced from the official distribution, and avoid processing untrusted PDFs, since Java PDF parsers have a history of CVEs.
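
One way to act on the jar-sourcing advice is to compare the jar's SHA-256 digest with a value published for the release you trust. The sketch below is illustrative only: the function names are hypothetical, and the expected digest is an assumption that would have to come from the official tabula-java distribution.

```python
import hashlib
import hmac


def sha256_of(path, chunk_size=1 << 16):
    """Return the hex SHA-256 digest of the file at `path`, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def jar_matches(path, expected_hex):
    """Return True if the file's digest equals the expected digest.

    `expected_hex` is an assumption: a digest copied from a source you trust.
    The comparison ignores case and surrounding whitespace.
    """
    return hmac.compare_digest(sha256_of(path), expected_hex.strip().lower())
```

A pipeline could call `jar_matches(jar_path, expected_hex)` once at startup and refuse to run extraction if it returns False.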

⚡ Reliability

Uptime/SLA

Version Stability

Breaking Changes

Error Recovery

Best When

Your PDFs contain native (non-scanned) tables and you need pandas-ready output with minimal code.

Avoid When

Java is unavailable in your runtime environment, or the PDF tables are inside scanned images.

Use Cases

• Extracting financial tables from PDF reports and converting them to pandas DataFrames
• Batch processing regulatory filings or research papers to pull structured tabular data
• Automating data ingestion pipelines that receive data only as PDF tables
• Extracting multi-column tables from government or academic PDF publications
• Converting PDF price lists or schedules into machine-readable CSV or JSON
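
The conversion use cases above might look roughly like the following sketch. It assumes tabula-py and pandas are installed and Java is on PATH; `table_output_path` and `export_tables` are hypothetical helper names, and the `_tableN` file naming is an arbitrary choice for illustration.

```python
from pathlib import Path

SUPPORTED_FORMATS = ("csv", "json")


def table_output_path(pdf_path, index, fmt):
    """Build an output path such as report_table0.csv next to the source PDF."""
    pdf = Path(pdf_path)
    return pdf.with_name(f"{pdf.stem}_table{index}.{fmt}")


def export_tables(pdf_path, fmt="csv"):
    """Write every detected table to its own CSV or JSON file; return the paths."""
    if fmt not in SUPPORTED_FORMATS:
        raise ValueError(f"unsupported format: {fmt!r}")

    # Imported lazily so the helpers above work without tabula-py installed.
    import tabula

    written = []
    # read_pdf returns a list of DataFrames, one per detected table.
    for index, frame in enumerate(tabula.read_pdf(pdf_path, pages="all")):
        target = table_output_path(pdf_path, index, fmt)
        if fmt == "csv":
            frame.to_csv(target, index=False)
        else:
            frame.to_json(target, orient="records")
        written.append(target)
    return written
```

Writing one file per table keeps tables with different column layouts from being forced into a single schema.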

Not For

• Scanned PDFs where tables are images rather than native PDF content (use docTR + layout analysis instead)
• Environments where Java 8+ cannot be installed or is prohibited
• Extracting free-form prose or non-tabular text (use PyMuPDF or pdfminer instead)

Interface

REST API

GraphQL

gRPC

MCP Server

SDK

Yes

Webhooks

Authentication

Methods: none

OAuth: No Scopes: No

Local Python library — no authentication required; requires Java 8+ available on PATH
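
Since the only prerequisite is a Java 8+ runtime on PATH, an agent may want to verify it before calling the library. The following is a minimal sketch using only the standard library; the helper names are hypothetical, and the parsing assumes the common `version "..."` banner that `java -version` prints.

```python
import re
import shutil
import subprocess


def parse_java_major(version_output):
    """Extract the major Java version from `java -version` output, or None."""
    match = re.search(r'version "(\d+)(?:\.(\d+))?', version_output)
    if match is None:
        return None
    first, second = match.group(1), match.group(2)
    # Java 8 and earlier report themselves as 1.8, 1.7, and so on.
    if first == "1" and second is not None:
        return int(second)
    return int(first)


def installed_java_major():
    """Return the major version of the `java` found on PATH, or None."""
    java = shutil.which("java")
    if java is None:
        return None
    result = subprocess.run(
        [java, "-version"], capture_output=True, text=True, check=False
    )
    # The version banner is normally written to stderr; check both streams.
    return parse_java_major(result.stderr + result.stdout)
```

A caller could treat `None` or a value below 8 as a missing dependency and report that clearly before attempting extraction.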

Pricing

Model: open_source

Free tier: Yes

Requires CC: No

MIT license. tabula-py is the Python wrapper; the underlying Tabula Java engine is also open source (MIT). Both are free with no usage limits.

Agent Metadata

Pagination

none

Idempotent

Full

Retry Guidance

Not documented
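
Because extraction is rated fully idempotent, rerunning a failed call should be safe even though no retry policy is documented. The sketch below shows one possible policy with exponential backoff; the function name, attempt count, and delays are all assumptions and not guidance from tabula-py.

```python
import time


def call_with_retries(func, attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call `func()` up to `attempts` times, doubling the delay after each failure.

    The last failure is re-raised unchanged. `sleep` is injectable for testing.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            return func()
        except Exception:
            if attempt == attempts:
                raise
            sleep(base_delay * 2 ** (attempt - 1))
```

For example, `call_with_retries(lambda: tabula.read_pdf("report.pdf", pages="all"))` would retry a transient failure. Deterministic failures such as a missing Java runtime will fail on every attempt, so a production version would likely narrow the caught exception types.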

Known Gotchas

⚠ Java 8+ must be installed and on PATH — missing Java produces an opaque OSError, not a clear dependency message
⚠ read_pdf() returns a list of DataFrames, one per detected table — agents must iterate, not assume a single result
⚠ Table detection heuristics can merge or split tables incorrectly; lattice vs stream mode must be chosen manually
⚠ Password-encrypted PDFs require the password parameter — no automatic detection or helpful error
⚠ Very large PDFs spawn a long-lived JVM subprocess; agent timeouts must account for JVM startup overhead (~1-2s)
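
Several of these gotchas can be handled in a thin wrapper. The sketch below makes the lattice/stream choice explicit, passes a password only when one is supplied, and always returns a list. It assumes tabula-py is installed; `read_options` and `extract_tables` are hypothetical helper names, while `pages`, `lattice`, `stream`, and `password` are `read_pdf` parameters.

```python
def read_options(mode="stream", password=None, pages="all"):
    """Build keyword arguments for tabula.read_pdf with an explicit mode."""
    if mode not in ("lattice", "stream"):
        raise ValueError("mode must be 'lattice' or 'stream'")
    options = {
        "pages": pages,
        "lattice": mode == "lattice",  # tables drawn with ruling lines
        "stream": mode == "stream",    # tables separated by whitespace
    }
    if password is not None:
        options["password"] = password
    return options


def extract_tables(pdf_path, mode="stream", password=None, pages="all"):
    """Return all non-empty tables found in the PDF as a list of DataFrames."""
    # Imported lazily so read_options stays usable without tabula-py installed.
    import tabula

    tables = tabula.read_pdf(pdf_path, **read_options(mode, password, pages))
    # read_pdf returns one DataFrame per detected table; drop empty detections.
    return [frame for frame in tables if not frame.empty]
```

If one mode merges or splits tables badly for a given document, calling `extract_tables` again with the other mode and comparing the results is a cheap first check.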

Alternatives

camelot-api, pdfplumber-api, pymupdf-api

Scores are editorial opinions as of 2026-03-06.