A distributed async page crawler
-
example1:
You can simply run with no parameters
python3 cli.py
When run this way, the program reads all of its settings from the configuration.
-
example2:
This is a common way to use it
python3 cli.py -i <input_path> -o <output_dir>
- The option -i is the path to the file that lists the sites to be crawled.
- The option -o is the directory where the crawl results are saved.
- The option
-
More information:
You can use the command
python3 cli.py -h
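The two invocation styles above (no parameters, or explicit -i and -o) can be sketched with argparse. This is only an illustration, not the project's actual cli.py: the destination names input_path and output_dir, the resolve_settings helper, and the shape of the config dictionary are all assumptions made for the example. It also assumes that values given on the command line take precedence over the configuration.

```python
import argparse


def build_parser():
    # Hypothetical parser: only -i, -o and -h come from the usage text above.
    parser = argparse.ArgumentParser(prog="cli.py")
    parser.add_argument("-i", dest="input_path", default=None,
                        help="file listing the sites to crawl")
    parser.add_argument("-o", dest="output_dir", default=None,
                        help="directory for the crawl results")
    return parser


def resolve_settings(argv, config):
    """Return settings where command-line values override config values."""
    args = build_parser().parse_args(argv)
    settings = dict(config)
    for key, value in vars(args).items():
        # Options that were not given stay at None and leave the config value.
        if value is not None:
            settings[key] = value
    return settings


# Assumed configuration values, used when an option is not given.
config = {"input_path": "sites.txt", "output_dir": "results"}

print(resolve_settings([], config))
print(resolve_settings(["-i", "my_sites.txt", "-o", "out"], config))
```

Defaulting each option to None makes it possible to tell "not given" apart from a real value, so the configuration fills only the gaps. argparse adds the -h option automatically.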
-
You can also run the program with splash and a proxy pool by adding the options -S and -P.
You can start processes on multiple machines; this feature is based on dramatiq. You can then submit tasks to the crawler. Before using the crawler, you need to install dramatiq and configure redis.
-
Caution:
To run the crawler, you need the following components:
- MongoDB: stores the crawler results
- Redis: serves as the dramatiq message queue
-
start:
All commands are integrated into the file command.py.
To start some workers, run:
python command.py start
You can also specify the number of processes to start on the current machine:
python command.py start -p 16
-
submit:
You can submit tasks from the command line or from a source file.
Simple use:
python command.py submit -u "http://www.example.com"
From source file:
python command.py submit -s "path/to/file"
Important:
You should specify the name of the crawler when submitting a task; otherwise the program uses the default name 'spider'.
For example:
python command.py submit -u "http://www.example.com" -N "spider_name"
For more information, you can run:
python command.py submit --help
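As a rough sketch of what submitting from a source file involves, the snippet below reads URLs from a file and pairs each one with a crawler name, falling back to 'spider' when no name is given. The one-URL-per-line format, the handling of '#' comment lines, the dictionary layout of a task, and the names load_urls and build_tasks are all assumptions for illustration; the real file format and task structure are not described in this document.

```python
from pathlib import Path

DEFAULT_NAME = "spider"  # default crawler name mentioned above


def load_urls(path):
    """Read URLs from a text file (assumed format: one URL per line)."""
    urls = []
    for line in Path(path).read_text(encoding="utf-8").splitlines():
        line = line.strip()
        # Skip blank lines and '#' comments (an assumption of this sketch).
        if line and not line.startswith("#"):
            urls.append(line)
    return urls


def build_tasks(urls, name=None):
    """Pair each URL with a crawler name, defaulting to 'spider'."""
    return [{"url": url, "name": name or DEFAULT_NAME} for url in urls]


# Hypothetical usage with an explicit crawler name.
print(build_tasks(["http://www.example.com"], name="spider_name"))
```

Naming the crawler explicitly keeps tasks from different submissions distinguishable, which is presumably why the default name is discouraged.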
-
stop:
To stop the workers on the current machine, use the stop command:
python command.py stop
If the workers do not shut down properly, you can kill them:
python command.py kill
-
export:
Export the crawler results.
For example, you can export the results to a given output path:
python command.py export -t "http://www.example.com" -o "path/to/dir"
For other uses, refer to the help information.
-
help:
For more information, run:
python command.py --help