PageCollector

A distributed async page crawler

Simple usage from the CLI

  • Example 1:

    You can simply run it with no parameters:

    python3 cli.py

    When you run it this way, the program reads all of its information from the configuration

  • Example 2:

    This is the common way to use it:

    python3 cli.py -i <input_path> -o <output_dir>
    • The -i option is the path of the file that lists the sites to be crawled (an example input file is sketched after this list)
    • The -o option is the directory where the crawl results are saved
  • More information:

    You can use the command

    python3 cli.py -h
  • You can also use the program with Splash and a proxy pool by adding the -S and -P options
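
For reference, here is a minimal sketch of the input file passed with -i. It assumes one URL per line; this format is an assumption for illustration, not confirmed by the project:

    http://www.example.com
    http://www.example.org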

Distributed

You can start worker processes on multiple machines; the distribution is based on dramatiq. You can then submit tasks to the crawler. Before using the crawler, you need to install dramatiq and configure Redis.
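
The sketch below is illustrative, not PageCollector's actual code: it shows how a dramatiq actor backed by a Redis broker distributes work across machines. The actor name crawl and its body are assumptions:

    import dramatiq
    from dramatiq.brokers.redis import RedisBroker

    # Point dramatiq at the Redis instance used as the message queue.
    dramatiq.set_broker(RedisBroker(url="redis://localhost:6379"))

    @dramatiq.actor
    def crawl(url):
        # In the real project this would fetch the page asynchronously
        # and store the result in MongoDB.
        print(f"crawling {url}")

    # Enqueue a task; any worker process on any machine can pick it up.
    crawl.send("http://www.example.com")

In the project itself, workers started with python command.py start play this consumer role.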

  • Caution:

    To run the crawler, you need the following components:

    • MongoDB: to store the crawl results
    • Redis: as the dramatiq message queue
  • start:

    All commands are integrated into the file command.py. To start some workers, run:

    python command.py start

    You can also specify the number of processes to start on the current machine:

    python command.py start -p 16
  • submit:

    You can submit tasks from the CLI or from a source file.

    Simple usage:

    python command.py submit -u "http://www.example.com"

    From a source file:

    python command.py submit -s "path/to/file"

    Important:

    You should specify the name of the crawler when submitting a task; otherwise the program will use the default name 'spider'

    Like this:

    python command.py submit -u "http://www.example.com" -N "spider_name"

    For more information, you can run:

    python command.py submit --help
  • stop:

    If you want to stop the workers on the current machine, use the stop command:

    python command.py stop

    If they do not shut down properly, you can kill them:

    python command.py kill
  • export:

    Export the crawl results.

    For example, you can export the results for a target site to a directory:

    python command.py export -t "http://www.example.com" -o "path/to/dir"

    For other uses, please refer to the help information

  • help:

    Simply run python command.py --help to get more information
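
Putting it all together, a typical distributed session might look like the following. The crawler name, URL, and output directory are illustrative, and the source file is assumed to list one URL per line:

    # start 16 worker processes on this machine
    python command.py start -p 16

    # submit a task with an explicit crawler name
    python command.py submit -u "http://www.example.com" -N "my_spider"

    # later, export the results for that target
    python command.py export -t "http://www.example.com" -o "path/to/dir"

    # stop the workers when done
    python command.py stop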