Scrapy

High-performance Python web scraping and crawling framework. Scrapy provides a complete spider lifecycle (request scheduling, downloading, parsing, pipeline processing) with async I/O via Twisted. Built-in support for robots.txt, rate limiting, cookie handling, caching, and item pipelines for storing scraped data. The de-facto standard for large-scale Python web scraping.

Evaluated Mar 06, 2026. Version: v2.11+

Category: Developer Tools

Tags: python, scraping, crawling, spider, data-extraction, async, middleware

Score Breakdown

⚙ Agent Friendliness

MCP Quality

Documentation

Error Messages

Auth Simplicity

100

Rate Limits

🔒 Security

TLS Enforcement

Auth Strength

Scope Granularity

Dep. Hygiene

Secret Handling

Self-hosted framework, so the security model depends on the deployment. Twisted handles TLS. Do not commit credentials stored in the Scrapy settings file to version control.

⚡ Reliability

Uptime/SLA

Version Stability

Breaking Changes

Error Recovery

Best When

You need to crawl many pages systematically with built-in rate limiting, pipelines, and middleware for production-grade web scraping in Python.
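The built-in rate limiting, retry, caching, and resume features are configured through project settings. A minimal settings.py sketch follows. The setting names are Scrapy's own, while every value and the JOBDIR path are illustrative assumptions rather than recommendations.

```python
# settings.py fragment: every value below is an illustrative assumption.

# Rate limiting: a base delay plus AutoThrottle, which adjusts the delay
# to the latency it observes from the target site.
DOWNLOAD_DELAY = 1.0                    # seconds between requests to the same site
CONCURRENT_REQUESTS_PER_DOMAIN = 4
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 1.0
AUTOTHROTTLE_MAX_DELAY = 30.0
AUTOTHROTTLE_TARGET_CONCURRENCY = 2.0

# Retries for transient failures, handled by the built-in retry middleware.
RETRY_ENABLED = True
RETRY_TIMES = 3                         # retries in addition to the first attempt
RETRY_HTTP_CODES = [429, 500, 502, 503, 504]

# On-disk HTTP cache, mainly useful while developing selectors.
HTTPCACHE_ENABLED = True
HTTPCACHE_EXPIRATION_SECS = 3600

# Resume support: persist scheduler state so an interrupted crawl can continue.
JOBDIR = "crawls/example-run-1"         # hypothetical path
```

Starting with conservative values and loosening them after observing the target site's behavior is usually the safer direction.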

Avoid When

The target site requires JavaScript rendering or you need a simple one-off scrape — use Playwright or requests+BeautifulSoup respectively.

Use Cases

• Build agent data collection pipelines that crawl entire websites extracting structured data using CSS/XPath selectors
• Schedule and run automated web scrapers that feed agent knowledge bases with regularly updated content
• Extract product data, prices, and inventory from e-commerce sites for agent competitive intelligence
• Crawl documentation sites to build agent-searchable knowledge stores from HTML content
• Run large-scale domain crawls with rate limiting, politeness rules, and resume support for agent training data

Not For

• JavaScript-heavy single-page applications — Scrapy doesn't execute JS; use Playwright or Selenium for SPAs
• Simple one-off data extractions — requests + BeautifulSoup is simpler for small tasks
• Real-time event-driven scraping — Scrapy is batch-oriented; use streaming solutions for real-time needs

Interface

REST API

GraphQL

gRPC

MCP Server

SDK

Yes

Webhooks

Authentication

Methods: none

OAuth: No Scopes: No

Scrapy is a self-hosted Python framework with no authentication model of its own. Spiders authenticate to target sites by sending cookies or headers, or by submitting a login form.

Pricing

Model: open_source

Free tier: Yes

Requires CC: No

Scrapy framework itself is free and open source. Zyte (formerly Scrapinghub) offers managed Scrapy Cloud hosting and anti-bot proxy services commercially.

Agent Metadata

Pagination

none

Idempotent

Partial

Retry Guidance

Documented

Known Gotchas

⚠ Scrapy runs on Twisted async I/O. To await asyncio-based libraries from spider code, set TWISTED_REACTOR to twisted.internet.asyncioreactor.AsyncioSelectorReactor so that Twisted runs on top of the asyncio event loop.
⚠ JavaScript-rendered content is not accessible without scrapy-playwright or scrapy-splash middleware — many modern sites require JS execution
⚠ The settings.py generated by scrapy startproject sets ROBOTSTXT_OBEY=True, so requests that robots.txt disallows are filtered out. Agents should change this setting only when they are authorized to crawl the disallowed paths.
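⚠ Every scraped item passes through all enabled item pipelines in the order set by ITEM_PIPELINES (lower numbers run first). When one pipeline raises DropItem, later pipelines never see the item, and the drop shows up only as a log entry and a stats counter, so it is easy to miss.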
⚠ Memory usage can grow on large crawls because pending requests and seen-request fingerprints are held in memory. Set DEPTH_LIMIT, and watch the dupefilter/filtered stat (or enable DUPEFILTER_DEBUG) to detect crawl loops.
⚠ Anti-bot detection (Cloudflare, Akamai) blocks naive Scrapy requests — requires rotating proxies, browser fingerprint emulation, or Zyte Smart Proxy Manager
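
Several of the gotchas above come down to project settings. The fragment below is a hedged sketch: the setting names are Scrapy's, the download handler path assumes the separate scrapy-playwright package is installed, and the `myproject.pipelines.*` paths and all numeric values are hypothetical.

```python
# settings.py fragment: values and module paths are illustrative assumptions.

# Run Twisted on top of asyncio so asyncio-based libraries can be awaited.
TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"

# Route downloads through scrapy-playwright for JavaScript-rendered pages.
# Assumes scrapy-playwright is installed; individual requests still opt in
# with meta={"playwright": True}.
DOWNLOAD_HANDLERS = {
    "http": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
    "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
}

# Keep robots.txt filtering on unless crawling disallowed paths is authorized.
ROBOTSTXT_OBEY = True

# Pipelines run in ascending order of these numbers.
ITEM_PIPELINES = {
    "myproject.pipelines.ValidatePipeline": 100,   # hypothetical: may raise DropItem
    "myproject.pipelines.StorePipeline": 300,      # hypothetical: sees only kept items
}

# Bound the crawl and make duplicate filtering visible in the logs.
DEPTH_LIMIT = 5
DUPEFILTER_DEBUG = True   # log every filtered duplicate, not only the first
```

Placing validation before storage means a dropped item never reaches the storage pipeline, which is the ordering effect described in the list above.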

Alternatives

playwright-api selenium-api apify-api brightdata-api

Scores are editorial opinions as of 2026-03-06.