A distributed async page crawler
You can simply run it with no parameters:
python3 cli.py
When you run it this way, the program reads all of its settings from the configuration.
This is a common way to use it:
python3 cli.py -i <input_path> -o <output_dir>
- The option -i means the path of the file that lists the sites to be crawled
- The option -o means the directory where the crawl results are saved
- For the other options, you can get more information with the command:
python3 cli.py -h
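As a rough sketch of the common invocation above, the following Python snippet writes a list of sites to a file and builds the matching command line. The one-URL-per-line file format, the names sites.txt and results, and the helpers write_site_list and build_crawl_command are assumptions made for illustration; they are not taken from the project, so check the help output for the input format the crawler actually expects.

```python
import subprocess
import sys
from pathlib import Path


def write_site_list(path, urls):
    """Write one URL per line (assumed input format, not confirmed by the project)."""
    path = Path(path)
    path.write_text("\n".join(urls) + "\n", encoding="utf-8")
    return path


def build_crawl_command(input_path, output_dir):
    """Build the `cli.py -i <input_path> -o <output_dir>` call as an argument list."""
    # sys.executable stands in for `python3` so the same interpreter is reused.
    return [sys.executable, "cli.py", "-i", str(input_path), "-o", str(output_dir)]


if __name__ == "__main__":
    sites = write_site_list("sites.txt", ["http://www.example.com"])
    Path("results").mkdir(exist_ok=True)
    command = build_crawl_command(sites, "results")
    print("Running:", " ".join(command))
    # Only launch the crawler when cli.py is present in the working directory.
    if Path("cli.py").exists():
        subprocess.run(command, check=True)
```

Passing the command as an argument list rather than a single shell string avoids quoting problems when paths contain spaces.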
By the way, you can also use the program with a proxy pool when you add the option -S
You can start processes on multiple machines; this is based on dramatiq. You can then submit tasks to the crawler. Before using the crawler, you need to install dramatiq and configure Redis.
To run the crawler, you need the following components:
- MongoDB: used to save the crawl results
- Redis: used as a message queue
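Before starting any workers, it can help to confirm that both components are reachable. The snippet below is a minimal sketch using only the Python standard library; it assumes MongoDB and Redis run on the local machine on their default ports (27017 and 6379). The helper is_reachable and the SERVICES table are hypothetical names, and the hosts and ports should be adjusted to match your own configuration.

```python
import socket


def is_reachable(host, port, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


# Assumed endpoints: default ports on the local machine, not read from the
# crawler's configuration.
SERVICES = {
    "MongoDB": ("127.0.0.1", 27017),
    "Redis": ("127.0.0.1", 6379),
}

if __name__ == "__main__":
    for name, (host, port) in SERVICES.items():
        status = "reachable" if is_reachable(host, port) else "NOT reachable"
        print(f"{name} at {host}:{port} is {status}")
```

This only checks that something accepts TCP connections on the port; it does not verify authentication or that the service is the expected one.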
All commands are integrated into the file command.py
To start some workers, you can run:
python command.py start
You can also specify the number of processes to start on the current machine:
python command.py start -p 16
You can submit tasks from the command line or from a source file
Simple use:
python command.py submit -u "http://www.example.com"
From source file:
python command.py submit -s "path/to/file"
It is best to specify the name of the crawler when submitting a task; otherwise the program uses the default name 'spider'
Like this:
python command.py submit -u "http://www.example.com" -N "spider_name"
For more information, you can run:
python command.py submit --help
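When several URLs should go to the same named crawler, the submit command can be driven from a small script. This is only a sketch: it reuses the -u and -N options described above, the helper build_submit_commands is a hypothetical name, and it assumes one submit call per URL is acceptable for your workload.

```python
import subprocess
import sys
from pathlib import Path


def build_submit_commands(urls, name="spider"):
    """Build one `command.py submit -u <url> -N <name>` argument list per URL."""
    # The default name mirrors the crawler's documented default, 'spider'.
    return [
        [sys.executable, "command.py", "submit", "-u", url, "-N", name]
        for url in urls
    ]


if __name__ == "__main__":
    urls = ["http://www.example.com"]
    for command in build_submit_commands(urls, name="spider_name"):
        print("Submitting:", " ".join(command))
        # Only call the crawler when command.py is in the working directory.
        if Path("command.py").exists():
            subprocess.run(command, check=True)
```

For a long list of URLs, submitting from a source file with the -s option is likely the simpler route.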
If you want to stop the workers on the current machine, you can use the command:
python command.py stop
If they don't shut down properly, you can kill them:
python command.py kill
Export the crawler results
For example, you can export the results to a given path:
python command.py export -t "http://www.example.com" -o "path/to/dir"
For other uses, please refer to the help information. Simply run the command
python command.py --help
to get more information.