xiaoxin01/Supperxin.Web.Webcrawler

A tool for crawling websites.

You can view the crawled results on this site:

http://listen.supperxin.com/News

Supported types

  • HTML-rendered websites: metadata is extracted directly from the HTML.
  • JSON data returned by an API.

Functions

Configuration files

Configuration files are located in:

Supperxin.Web.Webcrawler/Configurations/

There are some demo configurations for:

  1. V2EX hot topics and jobs
  2. Readhub
  3. Cyzone news flashes (创业邦快讯)

3 steps to crawl data

  1. Copy or create a configuration file

Choose a configuration file from Supperxin.Web.Webcrawler/Configurations, or create your own.

cp Supperxin.Web.Webcrawler/Configurations/appsettings.cyzone.json Supperxin.Web.Webcrawler/appsettings.json
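
The copy step can be wrapped in a small helper that fails loudly when the chosen configuration does not exist. This is only a sketch: the function name use_config is hypothetical, it assumes you run it from the repository root, and it assumes the other demo configurations follow the same appsettings.<site>.json naming as the cyzone file (check the Configurations directory for the actual names).

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of the repository): activate a configuration
# by site name. Assumes the appsettings.<site>.json naming seen for cyzone.
use_config() {
    local site="$1"
    local src="Supperxin.Web.Webcrawler/Configurations/appsettings.${site}.json"
    local dest="Supperxin.Web.Webcrawler/appsettings.json"

    # Refuse to continue if the source configuration is missing.
    if [ ! -f "$src" ]; then
        echo "No configuration found at $src" >&2
        return 1
    fi

    # Same copy as the cp command above, with the site name filled in.
    cp "$src" "$dest"
}
```

For example, use_config cyzone performs the same copy as the command above.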

Create docker-compose.tag.yml by copying docker-compose.override.yml, then change the image tag. The tag can be set to the name of the crawled site.

cp docker-compose.override.yml docker-compose.tag.yml

The resulting docker-compose.tag.yml:

version: '3.4'

services:
  supperxin.web.webcrawler:
    image: supperxin.web.webcrawler:[tag]
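
As an alternative to copying and editing by hand, the file can be generated with the tag already filled in. This is a minimal sketch, assuming docker-compose.tag.yml needs only the lines shown above; the function name write_tag_file is hypothetical, and cyzone is used purely as an example tag.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of the repository): write docker-compose.tag.yml
# with the given tag in place of the [tag] placeholder.
write_tag_file() {
    local tag="$1"
    # The heredoc body reproduces the YAML shown above; ${tag} is expanded.
    cat > docker-compose.tag.yml <<EOF
version: '3.4'

services:
  supperxin.web.webcrawler:
    image: supperxin.web.webcrawler:${tag}
EOF
}
```

Calling write_tag_file cyzone writes the file to the current directory, overwriting any existing docker-compose.tag.yml.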
  2. (Optional) Add a SaveTo section to send the results to your own URL

"SaveTo": {
    "Type": "HttpPipline",
    "ProcessUrl": "[Your url]"
},
  3. Start the crawler

    bash start.sh
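
Before starting, it can help to confirm that the files created in the earlier steps are in place. The check below is a sketch under assumptions: the function name check_crawler_files is hypothetical, it assumes you are in the repository root, and it assumes start.sh relies on both files, which this README does not state explicitly.

```shell
#!/usr/bin/env bash
# Hypothetical pre-flight check (not part of the repository): verify that the
# files produced in steps 1 and 2 exist before launching the crawler.
check_crawler_files() {
    local missing=0
    local f
    for f in Supperxin.Web.Webcrawler/appsettings.json docker-compose.tag.yml; do
        if [ ! -f "$f" ]; then
            echo "Missing: $f" >&2
            missing=1
        fi
    done
    # Returns 0 when both files exist, 1 otherwise.
    return "$missing"
}
```

A possible usage is check_crawler_files && bash start.sh, which starts the crawler only when both files are present.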
