Redis-based components for Scrapy.
- Usage: https://github.com/rmax/scrapy-redis/wiki/Usage
- Documentation: https://github.com/rmax/scrapy-redis/wiki
- Release: https://github.com/rmax/scrapy-redis/wiki/History
- Contribution: https://github.com/rmax/scrapy-redis/wiki/Getting-Started
- LICENSE: MIT license
Distributed crawling/scraping
You can start multiple spider instances that share a single Redis queue. Best suited for broad multi-domain crawls.
Distributed post-processing
Scraped items get pushed into a Redis queue, meaning that you can start as many post-processing processes as needed, all sharing the items queue.
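A post-processing worker for the items queue could be sketched as below. This assumes the stock `RedisPipeline` is enabled, that items are serialized as JSON, and that they land in a Redis list named `<spider>:items`; `drain_items`, `handle` and the spider name `myspider` are hypothetical names used only for illustration.

```python
import json


def drain_items(client, key, handle, timeout=1):
    """Pop JSON items from the Redis list `key` and pass each one to `handle`.

    `client` is anything with a redis-py style `blpop(key, timeout=...)`
    method. Returns the number of items processed once the list has stayed
    empty for `timeout` seconds.
    """
    processed = 0
    while True:
        popped = client.blpop(key, timeout=timeout)
        if popped is None:
            # Nothing arrived within the timeout: stop this worker.
            return processed
        _key, raw = popped
        handle(json.loads(raw))
        processed += 1


# Usage against a real server (assumed to run on localhost):
#   import redis
#   client = redis.Redis.from_url("redis://localhost:6379/0")
#   drain_items(client, "myspider:items", print)
```

Because every worker pops from the same list, several of these processes can run side by side and each item is handed to exactly one of them.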
Scrapy plug-and-play components
Scheduler + Duplication Filter, Item Pipeline, Base Spiders.
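A minimal `settings.py` sketch for enabling these components in a Scrapy project might look like this. The class paths are the ones shipped by scrapy-redis; the Redis URL and the pipeline priority `300` are assumptions for a local setup and should be adjusted.

```python
# settings.py (sketch; values marked "assumed" are illustrative)

# Store the request queue in Redis so all spider instances share it.
SCHEDULER = "scrapy_redis.scheduler.Scheduler"

# Deduplicate requests across all spider instances through Redis.
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"

# Keep the queue and dupefilter in Redis between runs (pause/resume).
SCHEDULER_PERSIST = True

# Push scraped items into Redis for later post-processing.
ITEM_PIPELINES = {
    "scrapy_redis.pipelines.RedisPipeline": 300,  # assumed priority
}

# Assumed local Redis instance.
REDIS_URL = "redis://localhost:6379/0"
```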
In this forked version: added JSON-supported data in Redis.
The data contains `url`, `meta` and other optional parameters. `meta` is nested JSON which contains sub-data. This function extracts the data and sends another FormRequest with `url`, `meta` and additional `formdata`. For example:
{ "url": "https://exaple.com", "meta": {"job-id":"123xsd", "start-date":"dd/mm/yy"}, "url_cookie_key":"fertxsas" }
This data can be accessed in the Scrapy spider through the response, like: `request.url`, `request.meta`, `request.cookies`.
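To make the payload shape concrete, here is a small sketch of building such a JSON entry and splitting it back into its parts. It is not the fork's actual implementation: `build_payload` and `split_payload` are hypothetical helpers, and treating every key other than `url` and `meta` as form data is an assumption based on the description above.

```python
import json


def build_payload(url, meta=None, **extra):
    """Serialize a start entry: `url`, nested `meta`, plus optional extras."""
    payload = {"url": url, "meta": meta or {}}
    payload.update(extra)
    return json.dumps(payload)


def split_payload(raw):
    """Split a JSON entry into (url, meta, formdata).

    Assumption: all keys besides `url` and `meta` become form fields,
    converted to strings as form data normally is.
    """
    data = json.loads(raw)
    url = data.pop("url")
    meta = data.pop("meta", {})
    formdata = {key: str(value) for key, value in data.items()}
    return url, meta, formdata


# Pushing an entry for a spider named "myspider" (hypothetical), assuming
# the default "<spider>:start_urls" key and a local Redis server:
#   import redis
#   client = redis.Redis.from_url("redis://localhost:6379/0")
#   client.lpush("myspider:start_urls", build_payload(
#       "https://example.com", {"job-id": "123xsd"}, url_cookie_key="fertxsas"))
```

The three values returned by `split_payload` correspond to the arguments one would pass when constructing a `FormRequest`.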
Note
These features cover the basic case of distributing the workload across multiple workers. If you need more features, such as URL expiration or advanced URL prioritization, we suggest you take a look at the Frontera project.
- Python 3.7+
- Redis >= 5.0
- Scrapy >= 2.0
- redis-py >= 4.0
From pip
pip install scrapy-redis
From GitHub
git clone https://github.com/darkrho/scrapy-redis.git
cd scrapy-redis
python setup.py install
Note
To use the JSON-supported data feature, make sure you have not installed scrapy-redis through pip. If you already have, uninstall it first:
pip uninstall scrapy-redis
Frontera is a web crawling framework consisting of a crawl frontier and distribution/scaling primitives, allowing you to build a large-scale online web crawler.