Scrapy
High-performance Python web scraping and crawling framework. Scrapy provides a complete spider lifecycle (request scheduling, downloading, parsing, pipeline processing) with async I/O via Twisted. Built-in support for robots.txt, rate limiting, cookie handling, caching, and item pipelines for storing scraped data. The de-facto standard for large-scale Python web scraping.
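As a rough illustration of the "item pipelines for storing scraped data" stage, the sketch below writes each item to a JSON Lines file. The class name `JsonLinesPipeline` and the output path are assumptions made for this example; only the hook names (`open_spider`, `close_spider`, `process_item`) follow Scrapy's pipeline interface.

```python
import json


class JsonLinesPipeline:
    """Append every scraped item to a JSON Lines file (illustrative sketch)."""

    def __init__(self, path="items.jsonl"):
        # The default path is an assumption for this example, not a Scrapy default.
        self.path = path
        self.file = None

    def open_spider(self, spider):
        # Called once when the spider starts.
        self.file = open(self.path, "w", encoding="utf-8")

    def close_spider(self, spider):
        # Called once when the spider finishes.
        self.file.close()

    def process_item(self, item, spider):
        # Called for every item; returning it passes it on to later pipelines.
        self.file.write(json.dumps(dict(item), ensure_ascii=False) + "\n")
        return item
```

A pipeline like this would typically be enabled through the `ITEM_PIPELINES` setting, for example `{"myproject.pipelines.JsonLinesPipeline": 300}`, where `myproject` is a hypothetical project name.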
Score Breakdown
🔒 Security
Self-hosted framework, so the security model depends on your deployment. Twisted handles TLS. Keep credentials out of any Scrapy settings file that is committed to version control.
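One common way to keep secrets out of a committed settings file is to read them from environment variables. The sketch below assumes a helper named `env_secret` and two custom setting names (`SITE_USERNAME`, `SITE_PASSWORD`); all three are hypothetical and not part of Scrapy.

```python
import os


def env_secret(name, default=None):
    """Return a secret from the environment, failing loudly if it is missing."""
    value = os.environ.get(name, default)
    if value is None:
        raise RuntimeError(f"Missing required environment variable: {name}")
    return value


# Hypothetical custom settings; these names are not built-in Scrapy settings.
# An empty default keeps local runs working when the variables are unset.
SITE_USERNAME = env_secret("SITE_USERNAME", default="")
SITE_PASSWORD = env_secret("SITE_PASSWORD", default="")
```

Calling `env_secret` without a default makes a missing variable fail at startup rather than midway through a crawl.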
Best When
You need to crawl many pages systematically with built-in rate limiting, pipelines, and middleware for production-grade web scraping in Python.
Avoid When
The target site requires JavaScript rendering or you need a simple one-off scrape — use Playwright or requests+BeautifulSoup respectively.
Use Cases
- Build agent data collection pipelines that crawl entire websites extracting structured data using CSS/XPath selectors
- Schedule and run automated web scrapers that feed agent knowledge bases with regularly updated content
- Extract product data, prices, and inventory from e-commerce sites for agent competitive intelligence
- Crawl documentation sites to build agent-searchable knowledge stores from HTML content
- Run large-scale domain crawls with rate limiting, politeness rules, and resume support for agent training data
Not For
- JavaScript-heavy single-page applications: Scrapy doesn't execute JS; use Playwright or Selenium for SPAs
- Simple one-off data extractions: requests + BeautifulSoup is simpler for small tasks
- Real-time event-driven scraping: Scrapy is batch-oriented; use streaming solutions for real-time needs
Interface
Authentication
Scrapy is a self-hosted Python framework with no authentication model of its own. Spiders authenticate to target sites themselves, using cookies, custom request headers, or a form login.
Pricing
The Scrapy framework itself is free and open source. Zyte (formerly Scrapinghub) sells managed Scrapy Cloud hosting and anti-bot proxy services.
Agent Metadata
Known Gotchas
- ⚠ Scrapy runs on Twisted async I/O. Mixing in asyncio code requires Twisted's asyncio reactor, enabled with TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor" (recent project templates set this for you)
- ⚠ JavaScript-rendered content is not accessible without an integration such as scrapy-playwright or scrapy-splash; many modern sites require JS execution
- ⚠ Projects generated by scrapy startproject set ROBOTSTXT_OBEY = True, so disallowed URLs are skipped (the library-level default has historically been False). Disable it only when you are permitted to crawl content that robots.txt excludes
- ⚠ Item pipelines are configured globally in ITEM_PIPELINES: every scraped item passes through each enabled pipeline in ascending priority order, so ordering matters. A pipeline that raises DropItem or fails with an exception stops that item from reaching later pipelines; watch the logs and the item_dropped_count stat so losses do not go unnoticed
- ⚠ Memory usage can grow on large crawls if DEPTH_LIMIT and URL deduplication are not tuned; monitor the dupefilter/filtered stat to detect loops
- ⚠ Anti-bot detection (Cloudflare, Akamai) blocks naive Scrapy requests; getting past it requires rotating proxies, browser fingerprint emulation, or Zyte Smart Proxy Manager
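Several of the gotchas above come down to project settings. The fragment below is a sketch of a `settings.py` that touches them; the setting names are Scrapy's, but every value, the `myproject` module path, and the two pipeline class names are illustrative assumptions rather than recommendations.

```python
# settings.py (sketch): values are illustrative assumptions, not recommendations.

# Run Twisted on top of asyncio so asyncio-based code can be awaited.
TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"

# Honor robots.txt rules for every domain crawled.
ROBOTSTXT_OBEY = True

# Politeness: base delay between requests, adaptive throttling,
# and a cap on parallel requests per domain.
DOWNLOAD_DELAY = 1.0
AUTOTHROTTLE_ENABLED = True
CONCURRENT_REQUESTS_PER_DOMAIN = 4

# Bound how deep the crawl follows links from the start URLs.
DEPTH_LIMIT = 5

# Persist crawl state on disk so a run can be paused and resumed.
JOBDIR = "crawls/example-run"

# Lower numbers run first: validate items before storing them.
# Both class paths are hypothetical.
ITEM_PIPELINES = {
    "myproject.pipelines.ValidationPipeline": 100,
    "myproject.pipelines.StoragePipeline": 300,
}
```

Per-run overrides are usually passed on the command line with `-s`, for example `scrapy crawl myspider -s DEPTH_LIMIT=2`, where `myspider` is a placeholder spider name.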
Scores are editorial opinions as of 2026-03-06.