This repository has been archived by the owner on Mar 8, 2023. It is now read-only.

AlpineMarmot/pulse

Pulse

Pulse is a crawler built on top of gocolly/colly.

Features:

  • Expose all gocolly/colly options through a YAML configuration file
  • Create rules that export crawled data to MongoDB

Installation

Go modules must be enabled

$ go build

Usage

$ pulse [-q][--no-logging] [-c configFile] [url entrypoint]

$ pulse -c conf.yml https://www.example.com

Configuration example

See default.yml.

Grab HTML data

The rule below stores the value of the src attribute of every img tag in the MongoDB collection "images". The value of the attribute named by context-attr (here, alt) is stored alongside it as image metadata.

collection: "images"
tag: "img"
attr: "src"
context-attr: "alt"
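
To make the rule's behaviour concrete, here is a minimal Go sketch of how one matched element could be turned into a stored record. The Rule and Record types, their field names, and the Extract method are all hypothetical and only mirror the YAML keys above; this README does not show pulse's actual types or its MongoDB document layout.

```go
package main

import "fmt"

// Rule mirrors the YAML keys from the example above.
// Hypothetical type: not taken from pulse's source.
type Rule struct {
	Collection  string
	Tag         string
	Attr        string
	ContextAttr string
}

// Record is an assumed shape for one exported document.
type Record struct {
	Collection string
	Value      string
	Context    string
}

// Extract builds a Record from the attributes of one matched element.
// It reports false when the element lacks the attribute named by Attr.
func (r Rule) Extract(attrs map[string]string) (Record, bool) {
	value, ok := attrs[r.Attr]
	if !ok {
		return Record{}, false
	}
	return Record{
		Collection: r.Collection,
		Value:      value,
		Context:    attrs[r.ContextAttr], // empty when the attribute is absent
	}, true
}

func main() {
	rule := Rule{Collection: "images", Tag: "img", Attr: "src", ContextAttr: "alt"}

	// Attributes of a hypothetical <img src="/logo.png" alt="Site logo">.
	rec, ok := rule.Extract(map[string]string{"src": "/logo.png", "alt": "Site logo"})
	fmt.Println(rec, ok)
}
```

In this sketch an element without the target attribute yields no record, while a missing context attribute simply leaves the metadata empty. Whether pulse makes the same choice is an assumption.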

You can also grab HTML attributes with a selector instead of a tag.

collection: "images-test"
selector: "img[data-src]"
attr: "data-src"
context-attr: "alt"

More information about selectors: PuerkitoBio/goquery
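
Since a bare tag name is itself a valid CSS type selector, both rule styles can be reduced to a single query string. The sketch below illustrates that idea only. The Rule type and Query method are hypothetical, and the precedence when both keys are set (selector wins) is an assumption, not documented pulse behaviour.

```go
package main

import "fmt"

// Rule holds the two alternative ways of matching elements.
// Hypothetical type; field names follow the YAML keys.
type Rule struct {
	Tag      string
	Selector string
}

// Query returns the CSS selector string for the rule.
// Assumption: an explicit selector takes precedence over a tag.
func (r Rule) Query() string {
	if r.Selector != "" {
		return r.Selector
	}
	return r.Tag
}

func main() {
	rules := []Rule{
		{Tag: "img"},
		{Selector: "img[data-src]"},
	}
	for _, r := range rules {
		// The resulting string could be handed to a selector-based
		// callback such as colly's OnHTML.
		fmt.Println(r.Query())
	}
}
```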
