GitHub - xiaoxin01/Supperxin.Web.Webcrawler

A tool for crawling websites.

You can view crawled results at:

http://listen.supperxin.com/News

Supported types

HTML-rendered websites. Metadata is extracted directly from the HTML.
JSON data returned by an API.
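
The two source types differ mainly in how fields are pulled out of a response. The following is a minimal sketch in Python, not the project's own code. It assumes a hypothetical HTML page that exposes `<meta>` tags and a hypothetical API payload that keeps its entries under an `items` key; both are illustrative choices, not settings taken from this repository.

```python
import json
from html.parser import HTMLParser


class MetaExtractor(HTMLParser):
    """Collects <meta name=...> and <meta property=...> content values."""

    def __init__(self):
        super().__init__()
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        key = attr.get("name") or attr.get("property")
        if key and attr.get("content") is not None:
            self.meta[key] = attr["content"]


def extract_from_html(html_text):
    """HTML source: read metadata straight from the markup."""
    parser = MetaExtractor()
    parser.feed(html_text)
    return parser.meta


def extract_from_json(json_text, list_key="items"):
    """JSON source: list_key is a hypothetical field name for the item array."""
    return json.loads(json_text).get(list_key, [])
```

In both cases the result is plain data (a dict or a list) that later steps can post-process and cache in the same way.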

Functions

Configurable. Adding a new website only requires creating a new configuration file.
List page crawl. Items can be crawled directly from a list page.
Iteration page crawl. Pages can be crawled according to a specified rule.
- e.g. crawl http://site?p=1 through http://site?p=10
Post-processing of crawled results.
- e.g. change https://www.v2ex.com/t/538237#reply5 to https://www.v2ex.com/t/538237
Result cache. A result whose key (for example, its URL) has already been crawled is not crawled again.
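
The iteration, post-processing, and cache functions listed above can be sketched together. This is an illustration in Python under stated assumptions, not the crawler's actual implementation: the `{page}` placeholder syntax, the `ResultCache` class, and the in-memory set are all hypothetical, and the real tool may persist its cache differently.

```python
from urllib.parse import urldefrag


def iterate_pages(template, first, last):
    """Yield one URL per page number, inclusive of both ends."""
    for page in range(first, last + 1):
        yield template.format(page=page)


def strip_fragment(url):
    """Drop the '#...' part so links to replies map to the topic URL."""
    return urldefrag(url).url


class ResultCache:
    """Remembers keys that were already crawled (in memory only)."""

    def __init__(self):
        self._seen = set()

    def is_new(self, key):
        if key in self._seen:
            return False
        self._seen.add(key)
        return True


def keep_fresh(urls, cache):
    """Normalize each URL, then keep only those not seen before."""
    fresh = []
    for url in urls:
        key = strip_fragment(url)
        if cache.is_new(key):
            fresh.append(key)
    return fresh
```

Normalizing before the cache lookup matters: `#reply5` and `#reply6` on the same topic would otherwise be stored as two different results.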

Configuration files

Configuration files are located in:

Supperxin.Web.Webcrawler/Configurations/

There are some demo configurations for:

v2ex hot topics and jobs
readhub
Cyzone news flash (创业邦快讯, cyzone)

3 steps to crawl data.

Copy or create a configuration file

Choose a settings file from Supperxin.Web.Webcrawler/Configurations, or create your own.

cp Supperxin.Web.Webcrawler/Configurations/appsettings.cyzone.json Supperxin.Web.Webcrawler/appsettings.json

Create the file docker-compose.tag.yml, then change the tag

The tag can be set to the name of the crawled site.

cp docker-compose.override.yml docker-compose.tag.yml

version: '3.4'

services:
  supperxin.web.webcrawler:
    image: supperxin.web.webcrawler:[tag]
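
As an illustration of the tag step, the sketch below fills the `[tag]` placeholder in the compose text shown above and rejects values that are not valid Docker image tags. It is written in Python and is not part of the repository; `render_compose` and `COMPOSE_TEMPLATE` are hypothetical names, and editing the file by hand works just as well.

```python
import re

# Compose text as shown above, with [tag] still to be filled in.
COMPOSE_TEMPLATE = """version: '3.4'

services:
  supperxin.web.webcrawler:
    image: supperxin.web.webcrawler:[tag]
"""

# Docker tags: letters, digits, '_', '.', '-'; at most 128 characters;
# the first character may not be '.' or '-'.
TAG_PATTERN = re.compile(r"^[A-Za-z0-9_][A-Za-z0-9_.-]{0,127}$")


def render_compose(tag, template=COMPOSE_TEMPLATE):
    """Return the compose text with [tag] replaced by a validated tag."""
    if not TAG_PATTERN.match(tag):
        raise ValueError(f"not a valid image tag: {tag!r}")
    return template.replace("[tag]", tag)
```

For example, `render_compose("cyzone")` produces a file whose image line reads `supperxin.web.webcrawler:cyzone`.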

Add a destination for saved results (optional)

"SaveTo": {
    "Type": "HttpPipline",
    "ProcessUrl": [Your url]
},
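
Before starting the crawler, it can help to check that the placeholder in this section was actually replaced. The sketch below is a hypothetical Python helper, not something the project ships. It assumes the `SaveTo` object has already been loaded into a dict and that `ProcessUrl` should be an absolute http(s) URL, which is an assumption about the tool's expectations.

```python
from urllib.parse import urlparse


def check_save_to(save_to):
    """Return a list of problems found in a SaveTo section (empty if none)."""
    problems = []
    if not save_to.get("Type"):
        problems.append("Type is missing")
    parsed = urlparse(str(save_to.get("ProcessUrl", "")))
    if parsed.scheme not in ("http", "https") or not parsed.netloc:
        problems.append("ProcessUrl is not an absolute http(s) URL")
    return problems


# Hypothetical example value; replace the URL with your own endpoint.
example = {"Type": "HttpPipline", "ProcessUrl": "http://localhost:5000/api/items"}
```

An empty list from `check_save_to(example)` means both fields look usable; it does not confirm that the endpoint accepts the crawler's requests.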

Start the crawler

bash start.sh

