Minner - Web Scraper

Minner is an easy way to build a web scraper for data mining. It is built in C++14 and depends on only one shared library, libcurl. Log messages are sent to Slack and to the terminal.

In its original version (and still in some parts), this scraper was only a service for NF-eBOT. My goal now is to refactor the project so that more people can use it.

Fork it and refactor it for your situation.

Compile and Run

Production:

1. Dependencies

gcc >= 3.5.1
libcurl - install via your OS package manager (e.g. apt install libcurl4-openssl-dev on Debian/Ubuntu)

2. Build configuration file

Create doc/config.h from the doc/config.h.dist template.

3. Compile:

cmake . && make

4. Run:

./minner --SCRAPER_KEY

With Docker:

(best choice for dev and good choice for production)

1. Dependencies

Install Docker

2. Build configuration file

Create doc/config.h from the doc/config.h.dist template.

3. Build the container and compile minner (the first time, and every time you change config.h)

docker build -t nfebot/minner .

4. Run:

docker run -ti --rm nfebot/minner --SCRAPER_KEY

With Vagrant:

(best choice for Windows and dev)

1. Dependencies

Install Vagrant

2. Create the VM and log in

vagrant up && vagrant ssh

3. Build configuration file

Create doc/config.h from the doc/config.h.dist template.

4. Compile:

cd /data && cmake . && make

5. Run:

./minner --SCRAPER_KEY

Available Scraper keys

--nfe-notas-tecnicas nfe.fazenda.gov.br / Notas Técnicas

--nfe-avisos nfe.fazenda.gov.br / Avisos

--sped sped.rfb.gov.br / Destaques

To create a new scraper key/parser, explore the source in `app/include/parsers`.
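As a rough illustration of how a scraper key can be tied to a parser, the sketch below maps a command-line key to a parse function. It does not mirror the real code in `app/include/parsers`: the names `Parser`, `parse_title`, `parser_registry`, `find_parser`, and the key `--my-new-scraper` are all hypothetical, and the parser itself only extracts a page title.

```cpp
#include <functional>
#include <map>
#include <string>

// Hypothetical parser signature: takes raw page HTML, returns extracted text.
using Parser = std::function<std::string(const std::string &)>;

// Hypothetical example parser: returns the contents of the first <title> element,
// or an empty string when no complete <title>...</title> pair is found.
std::string parse_title(const std::string &html) {
    const std::string open = "<title>";
    const std::string close = "</title>";
    const auto start = html.find(open);
    if (start == std::string::npos) return "";
    const auto begin = start + open.size();
    const auto end = html.find(close, begin);
    if (end == std::string::npos) return "";
    return html.substr(begin, end - begin);
}

// Registry mapping command-line keys to parsers (the key here is made up).
const std::map<std::string, Parser> &parser_registry() {
    static const std::map<std::string, Parser> registry = {
        {"--my-new-scraper", parse_title},
    };
    return registry;
}

// Look up the parser for a key; returns an empty std::function when unknown.
Parser find_parser(const std::string &key) {
    const auto it = parser_registry().find(key);
    if (it == parser_registry().end()) return Parser{};
    return it->second;
}
```

With a layout like this, adding a scraper would mean writing one parse function and adding one line to the registry. Check the existing parsers first, since the project may wire keys differently.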

Folder Structure

app: application source files
- app/include: application lib/module source files
  - app/include/parsers: web page parse layer
  - app/include/services: external web services
build: where the built executable is saved (if you use ./scripts/gcc_build.sh)
doc: configuration file
lib: vendor libs
scripts: scripts to help build and install
spike: files to test technologies or ideas

TODO

Simplify doc/config.h
Change all #include directives to use .h files
Make const parameters in include/helpers.h
Refactor this code block in app/main.cpp:

rapidxml::xml_document<> doc;
char *cstr = new char[res.size() + 1];
strcpy(cstr, res.c_str());
doc.parse<0>(cstr);
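One possible direction for this refactor, offered as a sketch rather than the planned fix: the block above allocates with `new char[]` and never frees the buffer, so a `std::vector<char>` can own the mutable copy instead. The helper name `to_mutable_buffer` is hypothetical. The rapidxml call is shown only in a comment so the snippet compiles without the library.

```cpp
#include <string>
#include <vector>

// Copy `res` into a mutable, zero-terminated buffer that frees itself
// when it goes out of scope (no manual delete[] needed).
std::vector<char> to_mutable_buffer(const std::string &res) {
    std::vector<char> buffer(res.begin(), res.end());
    buffer.push_back('\0');  // rapidxml expects a zero-terminated string
    return buffer;
}

// Assumed usage at the call site in app/main.cpp:
//   std::vector<char> buffer = to_mutable_buffer(res);
//   rapidxml::xml_document<> doc;
//   doc.parse<0>(buffer.data());
// rapidxml parses in place, so `buffer` has to outlive `doc`.
```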

And many more refactors...

Special thanks to:

@mattgodbolt @dascandy @famastefano @grisumbras @Corristo

and others in the C++ Slack group
