A tool for crawling websites.
Crawled results can be viewed at:
http://listen.supperxin.com/News
- HTML-rendered websites. Metadata is extracted directly from the HTML.
- JSON data from an API.
- Configurable. Adding a new website only requires a new configuration file.
- List page crawl. Items can be crawled directly from a list page.
- Iteration page crawl. Pages can be crawled according to a specified rule.
  - e.g. crawl http://site?p=1 through http://site?p=10
- Post-processing. Operations can be applied to crawled results.
- Result cache. Results that were already crawled are skipped, matched by a key (for example, the URL).
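The iteration crawl and the result cache described above can be sketched as follows. This is only an illustration in Python, not the tool's actual implementation: `page_urls`, `ResultCache`, `crawl`, and the `fetch_items` callback are hypothetical names, and the item URL is assumed to be the cache key.

```python
from urllib.parse import urlencode


def page_urls(base_url, param, first, last):
    """Yield base_url?param=first through base_url?param=last (inclusive)."""
    for page in range(first, last + 1):
        yield f"{base_url}?{urlencode({param: page})}"


class ResultCache:
    """Remembers the keys of items that were already crawled (in memory only)."""

    def __init__(self):
        self._seen = set()

    def add_if_new(self, key):
        """Return True and remember the key if it has not been seen before."""
        if key in self._seen:
            return False
        self._seen.add(key)
        return True


def crawl(urls, fetch_items, cache):
    """Collect items from every page, skipping items whose key is cached.

    fetch_items is a caller-supplied function (hypothetical) that takes a
    page URL and returns a list of dicts, each with a "url" entry.
    """
    new_items = []
    for url in urls:
        for item in fetch_items(url):
            # Assumption: the item URL serves as the cache key.
            if cache.add_if_new(item["url"]):
                new_items.append(item)
    return new_items
```

Calling `list(page_urls("http://site", "p", 1, 10))` produces the ten URLs from `http://site?p=1` to `http://site?p=10`. Passing the same `ResultCache` to repeated `crawl` calls means an item that appears on several pages is returned only once.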
You can get configuration files from:
Supperxin.Web.Webcrawler/Configurations/
Demo configurations are included for:
- V2EX hot topics and jobs
- Readhub
- Cyzone news flashes (创业邦快讯)
- Copy or create a configuration file
Choose a settings file from Supperxin.Web.Webcrawler/Configurations, or create your own.
cp Supperxin.Web.Webcrawler/Configurations/appsettings.cyzone.json Supperxin.Web.Webcrawler/appsettings.json
- Create the file docker-compose.tag.yml, then change the tag
The tag can be set to the name of the crawled site.
cp docker-compose.override.yml docker-compose.tag.yml
version: '3.4'
services:
  supperxin.web.webcrawler:
    image: supperxin.web.webcrawler:[tag]
- Add a SaveTo setting to save the results (optional)
"SaveTo": {
  "Type": "HttpPipline",
  "ProcessUrl": "[Your url]"
},
- Start the crawler
bash start.sh
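For the optional SaveTo step, ProcessUrl has to point at an HTTP endpoint that accepts the crawled results. The request format is not documented here, so the sketch below assumes the crawler sends each result, or a list of results, as a JSON body in a POST request. `parse_results`, `ResultHandler`, `serve`, and the in-memory `RECEIVED` list are hypothetical names used only for illustration.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

# In-memory store, for illustration only; a real receiver would persist results.
RECEIVED = []


def parse_results(body):
    """Decode a JSON request body into a list of result objects.

    Assumption: the body is either one JSON object or a JSON array.
    """
    data = json.loads(body.decode("utf-8"))
    return data if isinstance(data, list) else [data]


class ResultHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        try:
            items = parse_results(self.rfile.read(length))
        except ValueError:
            # Covers invalid JSON and invalid UTF-8.
            self.send_response(400)
            self.end_headers()
            return
        RECEIVED.extend(items)
        self.send_response(200)
        self.end_headers()


def serve(port=8000):
    """Listen on the given port until interrupted."""
    HTTPServer(("0.0.0.0", port), ResultHandler).serve_forever()
```

Calling `serve(8000)` starts the receiver on port 8000, and ProcessUrl would then be set to that host and port. If the crawler's payload differs from the assumed JSON format, only `parse_results` needs to change.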