scrapy
Full-featured web crawling and scraping framework for Python — provides Spider classes, an asynchronous Twisted-based engine, item pipelines, middleware, and built-in request scheduling. Key features: Spider subclasses with a parse() callback; Selector for CSS/XPath extraction; Item and ItemLoader for data modeling; item pipelines for processing (deduplication, validation, DB storage); downloader middleware (rotating proxies, user agents); CrawlerProcess for programmatic control; crawl rules for link following (CrawlSpider); sitemap crawling; feed exporters (JSON, CSV, XML); AutoThrottle; robots.txt compliance; and scrapy-splash for JS rendering.
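The item-pipeline idea can be sketched without scrapy installed: a pipeline is any object with a process_item(item, spider) method; returning the item passes it to the next stage, and raising DropItem discards it. Everything below (DropItem stand-in, DedupPipeline, run_pipelines) is an illustrative stand-in, not scrapy's actual engine:

```python
class DropItem(Exception):
    """Stand-in for scrapy.exceptions.DropItem."""

class DedupPipeline:
    """Drops items whose 'name' field was already seen."""

    def __init__(self):
        self.seen = set()

    def process_item(self, item, spider):
        if item["name"] in self.seen:
            raise DropItem(f"duplicate: {item['name']}")
        self.seen.add(item["name"])
        return item

def run_pipelines(items, pipelines, spider=None):
    """Minimal stand-in for scrapy's pipeline chain."""
    for item in items:
        try:
            for p in pipelines:
                item = p.process_item(item, spider)
            yield item
        except DropItem:
            continue  # item was rejected by a pipeline stage

scraped = [{"name": "widget"}, {"name": "widget"}, {"name": "gadget"}]
kept = list(run_pipelines(scraped, [DedupPipeline()]))
# kept == [{"name": "widget"}, {"name": "gadget"}]
```

In a real project the chain is declared via the ITEM_PIPELINES setting and scrapy instantiates and orders the stages for you.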
Score Breakdown
⚙ Agent Friendliness
🔒 Security
Web scraping framework. Respect robots.txt and site terms of service. Do not scrape personal data without consent/authorization. Rotate user agents and IPs ethically. Downloaded content may contain malicious payloads — sanitize before processing. The DOWNLOAD_HANDLERS setting loads arbitrary handler classes by import path, so treat settings from untrusted sources as a code-execution vector.
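Robots.txt compliance can also be checked directly with the standard library before any request is made; the rules string and bot name below are made-up examples:

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content for illustration.
rules = """\
User-agent: *
Disallow: /private/
"""

rp = RobotFileParser()
rp.parse(rules.splitlines())  # normally rp.set_url(...) + rp.read()

allowed = rp.can_fetch("mybot", "https://shop.example.com/products/1")
blocked = rp.can_fetch("mybot", "https://shop.example.com/private/data")
# allowed is True, blocked is False
```

scrapy performs an equivalent check automatically when ROBOTSTXT_OBEY is enabled.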
⚡ Reliability
Best When
Large-scale web crawling and scraping with site-wide coverage — scrapy's async architecture, middleware system, and pipelines make it well suited to scraping thousands of pages with rate limiting and retries.
Avoid When
JavaScript-heavy sites without scrapy-playwright (use playwright), simple one-page scraping (use requests+BS4), or when scrapy's project structure is overkill.
Use Cases
- • Agent spider — import scrapy; class ProductSpider(scrapy.Spider): name = 'products'; start_urls = ['https://shop.example.com']; def parse(self, response): for product in response.css('.product'): yield {'name': product.css('h2::text').get(), 'price': product.css('.price::text').get()} — spider class; agent scrapes structured data from paginated site; yield dict for each item
- • Agent link following — class LinkSpider(CrawlSpider): name = 'site'; start_urls = [start]; rules = (Rule(LinkExtractor(allow=r'/products/'), callback='parse_product'),); def parse_product(self, response): yield extract(response) — link crawler; agent follows links matching pattern and scrapes each matching page
- • Agent pipeline storage — class DatabasePipeline: def process_item(self, item, spider): db.insert(item); return item; ITEM_PIPELINES = {'myspider.pipelines.DatabasePipeline': 300} — pipeline; agent stores scraped items in database via pipeline; pipeline can filter, validate, deduplicate items
- • Agent programmatic run — from scrapy.crawler import CrawlerProcess; from scrapy.utils.project import get_project_settings; process = CrawlerProcess(get_project_settings()); process.crawl(MySpider); process.start() — run from script; agent triggers scrape programmatically; blocks until complete; use CrawlerRunner for non-blocking
- • Agent selector extraction — response.css('div.product > h2::text').getall(); response.xpath('//div[@class="price"]/text()').get(); response.css('a::attr(href)').getall() — selectors; agent extracts text, attributes, and links using CSS and XPath selectors; getall() returns list; get() returns first or None
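The get()/getall() contract used in the selector examples above can be mimicked offline with a minimal stand-in (FakeSelectorList below is illustrative, not scrapy's parsel-backed SelectorList):

```python
class FakeSelectorList(list):
    """Mimics the get()/getall() contract of scrapy's SelectorList."""

    def get(self, default=None):
        # First match, or the default (None) when nothing matched.
        return self[0] if self else default

    def getall(self):
        # Always a list; empty when nothing matched.
        return list(self)

matches = FakeSelectorList(["Widget", "Gadget"])
empty = FakeSelectorList()

first = matches.get()            # "Widget"
everything = matches.getall()    # ["Widget", "Gadget"]
missing = empty.get()            # None, never an exception
safe = empty.get(default="")     # "" — safe default
title = empty.get() or "Unknown" # common fallback idiom
```

The same semantics apply to real response.css(...) and response.xpath(...) results: get() never raises on a missing match, it returns None.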
Not For
- • JavaScript-heavy sites — scrapy makes plain HTTP requests; for JS-rendered content use playwright or scrapy-playwright extension
- • One-off simple scraping — scrapy requires project structure; for simple one-page scraping use requests + BeautifulSoup
- • Real-time scraping — scrapy is batch-oriented; for real-time web monitoring use playwright or direct HTTP polling
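For the one-off simple case, even the standard library suffices as a sketch — shown here on an inline HTML string rather than a live requests call; the markup and class names are made up:

```python
from html.parser import HTMLParser

class PriceParser(HTMLParser):
    """Collects text inside <span class="price"> tags."""

    def __init__(self):
        super().__init__()
        self._in_price = False
        self.prices = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) tuples.
        if tag == "span" and ("class", "price") in attrs:
            self._in_price = True

    def handle_endtag(self, tag):
        if tag == "span":
            self._in_price = False

    def handle_data(self, data):
        if self._in_price:
            self.prices.append(data.strip())

html = '<div><span class="price">$9.99</span><span class="price">$19.99</span></div>'
parser = PriceParser()
parser.feed(html)
# parser.prices == ["$9.99", "$19.99"]
```

requests + BeautifulSoup replaces this boilerplate with a one-liner, which is why it is the better fit for single-page jobs.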
Interface
Authentication
No auth for the library. Supports HTTP auth, cookies, and custom headers for target sites.
Pricing
scrapy is BSD 3-Clause licensed. Free for all use.
Agent Metadata
Known Gotchas
- ⚠ scrapy uses the Twisted reactor — scrapy runs on Twisted, NOT plain asyncio; classic parse() callbacks are Python generators: yield requests and items rather than awaiting; Scrapy 2.x supports async def callbacks when the asyncio reactor is enabled (TWISTED_REACTOR = 'twisted.internet.asyncioreactor.AsyncioSelectorReactor'), but generator-based yield remains the scrapy idiom
- ⚠ parse() should yield, not build-and-return — the idiom is a generator: for item in extract(response): yield item (or yield from extract(response)), plus yield scrapy.Request(url, callback=self.parse_page) for link following; callbacks may also return an iterable of items/requests, but returning a single item or None produces nothing, and yielding lets scrapy schedule follow-up requests incrementally instead of waiting for the whole page to finish
- ⚠ CrawlerProcess.start() blocks — process.start() blocks until the crawl completes; for non-blocking agent code, use CrawlerRunner and drive the Twisted reactor yourself (its crawl() returns a Deferred), or run the blocking call in a separate thread; scrapy's programmatic API documentation is sparse — the CrawlerRunner example in the docs shows the deferred pattern
- ⚠ Selectors: getall() returns list, get() returns first or None — response.css('h1::text').get() returns None if not found (not exception); response.css('h1::text').getall() returns [] if not found; agent code: use get(default='') for safe default; check None before using: title = response.css('h1::text').get() or 'Unknown'
- ⚠ Settings require a scrapy project or an explicit dict — scrapy.Settings() reads the module named by the SCRAPY_SETTINGS_MODULE env var; alternatively set custom_settings = {'DOWNLOAD_DELAY': 1} as a Spider class attribute, or pass a dict to CrawlerProcess; agent code without a full scrapy project: use CrawlerProcess({'USER_AGENT': 'mybot', 'LOG_LEVEL': 'INFO'}) directly
- ⚠ ROBOTSTXT_OBEY=True by default in generated projects — the `scrapy startproject` template enables robots.txt compliance (the library-level default is False); agent code scraping sites with restrictive robots.txt: check the legal/ethical implications and the site's terms of service before overriding with custom_settings={'ROBOTSTXT_OBEY': False}; obeying robots.txt is the ethical default
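The yield idiom from the gotchas above can be exercised with plain generators; the fake_response dict and extract helper below are stand-ins for scrapy's Response and selector extraction:

```python
def extract(response):
    """Stand-in for selector-based extraction; returns a list of dicts."""
    return [{"name": n} for n in response["names"]]

def parse(response):
    # Idiomatic scrapy-style callback: a generator that yields each item.
    # Equivalent shorthand: yield from extract(response)
    for item in extract(response):
        yield item

fake_response = {"names": ["widget", "gadget"]}
items = list(parse(fake_response))
# items == [{"name": "widget"}, {"name": "gadget"}]
```

In a real spider the same generator would also yield scrapy.Request objects for follow-up pages, and scrapy consumes both kinds of yielded values from the one callback.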
Alternatives
Scores are editorial opinions as of 2026-03-06.