scrapy
Full-featured web crawling and scraping framework for Python — provides Spider classes, an asynchronous Twisted-based engine, item pipelines, middleware, and built-in request scheduling. Key features: Spider subclasses with a parse() callback; Selector for CSS/XPath extraction; Item and ItemLoader for data modeling; item pipelines for processing (deduplication, validation, DB storage); downloader middleware (rotating proxies, user agents); CrawlerProcess for programmatic control; crawl rules for link following (CrawlSpider); sitemap crawling; feed exporters (JSON, CSV, XML); AutoThrottle; robots.txt compliance; and scrapy-splash for JS rendering.
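The item-pipeline idea can be sketched without scrapy installed: a pipeline is any object with a process_item(item, spider) method; returning the item passes it to the next stage, and raising DropItem discards it. Everything below (DropItem stand-in, DedupPipeline, run_pipelines) is an illustrative stand-in, not scrapy's actual engine:

```python
class DropItem(Exception):
    """Stand-in for scrapy.exceptions.DropItem."""

class DedupPipeline:
    """Drops items whose 'name' field was already seen."""

    def __init__(self):
        self.seen = set()

    def process_item(self, item, spider):
        if item["name"] in self.seen:
            raise DropItem(f"duplicate: {item['name']}")
        self.seen.add(item["name"])
        return item

def run_pipelines(items, pipelines, spider=None):
    """Minimal stand-in for scrapy's pipeline chain."""
    for item in items:
        try:
            for p in pipelines:
                item = p.process_item(item, spider)
            yield item
        except DropItem:
            continue  # item was rejected by a pipeline stage

scraped = [{"name": "widget"}, {"name": "widget"}, {"name": "gadget"}]
kept = list(run_pipelines(scraped, [DedupPipeline()]))
# kept == [{"name": "widget"}, {"name": "gadget"}]
```

In a real project the chain is declared via the ITEM_PIPELINES setting and scrapy instantiates and orders the stages for you.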
Score Breakdown
⚙ Agent Friendliness
🔒 Security
Web scraping framework. Respect robots.txt and site terms of service. Do not scrape personal data without consent/authorization. Rotate user agents and IPs ethically. Downloaded content may contain malicious payloads — sanitize before processing. The DOWNLOAD_HANDLERS setting loads arbitrary handler classes by import path, so treat settings from untrusted sources as a code-execution vector.
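Robots.txt compliance can also be checked directly with the standard library before any request is made; the rules string and bot name below are made-up examples:

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content for illustration.
rules = """\
User-agent: *
Disallow: /private/
"""

rp = RobotFileParser()
rp.parse(rules.splitlines())  # normally rp.set_url(...) + rp.read()

allowed = rp.can_fetch("mybot", "https://shop.example.com/products/1")
blocked = rp.can_fetch("mybot", "https://shop.example.com/private/data")
# allowed is True, blocked is False
```

scrapy performs an equivalent check automatically when ROBOTSTXT_OBEY is enabled.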
⚡ Reliability
Best When
Large-scale web crawling and scraping with site-wide coverage — scrapy's async architecture, middleware system, and pipelines make it well suited to scraping thousands of pages with rate limiting and retries.
Avoid When
JavaScript-heavy sites without scrapy-playwright (use playwright), simple one-page scraping (use requests+BS4), or when scrapy's project structure is overkill.
Use Cases
- • Agent spider — import scrapy; class ProductSpider(scrapy.Spider): name = 'products'; start_urls = ['https://shop.example.com']; def parse(self, response): for product in response.css('.product'): yield {'name': product.css('h2::text').get(), 'price': product.css('.price::text').get()} — spider class; agent scrapes structured data from paginated site; yield dict for each item
- • Agent link following — class LinkSpider(CrawlSpider): name = 'site'; start_urls = [start]; rules = (Rule(LinkExtractor(allow=r'/products/'), callback='parse_product'),); def parse_product(self, response): yield extract(response) — link crawler; agent follows links matching pattern and scrapes each matching page
- • Agent pipeline storage — class DatabasePipeline: def process_item(self, item, spider): db.insert(item); return item; ITEM_PIPELINES = {'myspider.pipelines.DatabasePipeline': 300} — pipeline; agent stores scraped items in database via pipeline; pipeline can filter, validate, deduplicate items
- • Agent programmatic run — from scrapy.crawler import CrawlerProcess; from scrapy.utils.project import get_project_settings; process = CrawlerProcess(get_project_settings()); process.crawl(MySpider); process.start() — run from script; agent triggers scrape programmatically; blocks until complete; use CrawlerRunner for non-blocking
- • Agent selector extraction — response.css('div.product > h2::text').getall(); response.xpath('//div[@class="price"]/text()').get(); response.css('a::attr(href)').getall() — selectors; agent extracts text, attributes, and links using CSS and XPath selectors; getall() returns list; get() returns first or None
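The get()/getall() contract used in the selector examples above can be mimicked offline with a minimal stand-in (FakeSelectorList below is illustrative, not scrapy's parsel-backed SelectorList):

```python
class FakeSelectorList(list):
    """Mimics the get()/getall() contract of scrapy's SelectorList."""

    def get(self, default=None):
        # First match, or the default (None) when nothing matched.
        return self[0] if self else default

    def getall(self):
        # Always a list; empty when nothing matched.
        return list(self)

matches = FakeSelectorList(["Widget", "Gadget"])
empty = FakeSelectorList()

first = matches.get()            # "Widget"
everything = matches.getall()    # ["Widget", "Gadget"]
missing = empty.get()            # None, never an exception
safe = empty.get(default="")     # "" — safe default
title = empty.get() or "Unknown" # common fallback idiom
```

The same semantics apply to real response.css(...) and response.xpath(...) results: get() never raises on a missing match, it returns None.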
Not For
- • JavaScript-heavy sites — scrapy makes plain HTTP requests; for JS-rendered content use playwright or scrapy-playwright extension
- • One-off simple scraping — scrapy requires project structure; for simple one-page scraping use requests + BeautifulSoup
- • Real-time scraping — scrapy is batch-oriented; for real-time web monitoring use playwright or direct HTTP polling
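For the one-off simple case, even the standard library suffices as a sketch — shown here on an inline HTML string rather than a live requests call; the markup and class names are made up:

```python
from html.parser import HTMLParser

class PriceParser(HTMLParser):
    """Collects text inside <span class="price"> tags."""

    def __init__(self):
        super().__init__()
        self._in_price = False
        self.prices = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) tuples.
        if tag == "span" and ("class", "price") in attrs:
            self._in_price = True

    def handle_endtag(self, tag):
        if tag == "span":
            self._in_price = False

    def handle_data(self, data):
        if self._in_price:
            self.prices.append(data.strip())

html = '<div><span class="price">$9.99</span><span class="price">$19.99</span></div>'
parser = PriceParser()
parser.feed(html)
# parser.prices == ["$9.99", "$19.99"]
```

requests + BeautifulSoup replaces this boilerplate with a one-liner, which is why it is the better fit for single-page jobs.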
Interface
Authentication
No auth for the library. Supports HTTP auth, cookies, and custom headers for target sites.
Pricing
scrapy is BSD 3-Clause licensed. Free for all use.
Agent Metadata
Known Gotchas
- ⚠ scrapy uses the Twisted reactor — scrapy runs on Twisted, NOT plain asyncio; classic parse() callbacks are Python generators: yield requests and items rather than awaiting; Scrapy 2.x supports async def callbacks when the asyncio reactor is enabled (TWISTED_REACTOR = 'twisted.internet.asyncioreactor.AsyncioSelectorReactor'), but generator-based yield remains the scrapy idiom
- ⚠ parse() should yield, not build-and-return — the idiom is a generator: for item in extract(response): yield item (or yield from extract(response)), plus yield scrapy.Request(url, callback=self.parse_page) for link following; callbacks may also return an iterable of items/requests, but returning a single item or None produces nothing, and yielding lets scrapy schedule follow-up requests incrementally instead of waiting for the whole page to finish
- ⚠ CrawlerProcess.start() blocks — process.start() blocks until the crawl completes; for non-blocking agent code, use CrawlerRunner and drive the Twisted reactor yourself (its crawl() returns a Deferred), or run the blocking call in a separate thread; scrapy's programmatic API documentation is sparse — the CrawlerRunner example in the docs shows the deferred pattern
- ⚠ Selectors: getall() returns list, get() returns first or None — response.css('h1::text').get() returns None if not found (not exception); response.css('h1::text').getall() returns [] if not found; agent code: use get(default='') for safe default; check None before using: title = response.css('h1::text').get() or 'Unknown'
- ⚠ Settings require a scrapy project or an explicit dict — scrapy.Settings() reads the module named by the SCRAPY_SETTINGS_MODULE env var; alternatively set custom_settings = {'DOWNLOAD_DELAY': 1} as a Spider class attribute, or pass a dict to CrawlerProcess; agent code without a full scrapy project: use CrawlerProcess({'USER_AGENT': 'mybot', 'LOG_LEVEL': 'INFO'}) directly
- ⚠ ROBOTSTXT_OBEY=True by default in generated projects — the `scrapy startproject` template enables robots.txt compliance (the library-level default is False); agent code scraping sites with restrictive robots.txt: check the legal/ethical implications and the site's terms of service before overriding with custom_settings={'ROBOTSTXT_OBEY': False}; obeying robots.txt is the ethical default
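The yield idiom from the gotchas above can be exercised with plain generators; the fake_response dict and extract helper below are stand-ins for scrapy's Response and selector extraction:

```python
def extract(response):
    """Stand-in for selector-based extraction; returns a list of dicts."""
    return [{"name": n} for n in response["names"]]

def parse(response):
    # Idiomatic scrapy-style callback: a generator that yields each item.
    # Equivalent shorthand: yield from extract(response)
    for item in extract(response):
        yield item

fake_response = {"names": ["widget", "gadget"]}
items = list(parse(fake_response))
# items == [{"name": "widget"}, {"name": "gadget"}]
```

In a real spider the same generator would also yield scrapy.Request objects for follow-up pages, and scrapy consumes both kinds of yielded values from the one callback.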
Alternatives
Scores are editorial opinions as of 2026-03-06.