Home
maksdma edited this page Apr 10, 2022
TorBot is a Python web crawler for the Deep and Dark Web.
The crawling algorithm takes a list of seed URLs as input and repeats the following steps until the list is empty:
- Remove a URL from the URL list.
- Check that the page exists.
- Download the corresponding page.
- Check the relevance of the page.
- Extract any links contained in it.
- Check the cache to see whether the links are already in it.
- Add the unique links back to the URL list.
- After all URLs are processed, return the most relevant page.
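The steps above can be sketched as a small breadth-first loop. This is a minimal illustration under stated assumptions, not TorBot's actual code: the names `crawl`, `extract_links`, `fetch`, and `relevance` are hypothetical, the existence check and the download are folded into a single `fetch` call that returns `None` for a missing page, and relevance is reduced to a numeric score supplied by the caller.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin


class LinkParser(HTMLParser):
    """Collects href values from <a> tags."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def extract_links(base_url, html):
    """Return absolute links found in html, with #fragments removed."""
    parser = LinkParser()
    parser.feed(html)
    return [urldefrag(urljoin(base_url, href)).url for href in parser.links]


def crawl(seeds, fetch, relevance, max_pages=100):
    """Return (url, score) of the most relevant page, or None.

    fetch(url) -> HTML string, or None if the page does not exist.
    relevance(html) -> numeric score; higher means more relevant.
    Both callables are assumptions made for this sketch.
    """
    frontier = deque(seeds)   # the URL list
    seen = set(seeds)         # cache of links already queued
    best = None
    pages = 0
    while frontier and pages < max_pages:
        url = frontier.popleft()          # remove a URL from the list
        html = fetch(url)                 # existence check and download
        if html is None:
            continue
        pages += 1
        score = relevance(html)           # relevance check
        if best is None or score > best[1]:
            best = (url, score)
        for link in extract_links(url, html):
            if link not in seen:          # only unique links go back in
                seen.add(link)
                frontier.append(link)
    return best


# Usage with an in-memory stand-in for the network (hypothetical pages).
pages = {
    "http://a.onion/": '<a href="/b">b</a> <a href="http://c.onion/">c</a>',
    "http://a.onion/b": '<a href="/">home</a> tor tor',
    "http://c.onion/": 'tor tor tor <a href="/missing">x</a>',
}
print(crawl(["http://a.onion/"], pages.get, lambda html: html.count("tor")))
# prints ('http://c.onion/', 3)
```

Passing `fetch` in as a parameter keeps the loop testable offline. A real `fetch` would wrap an HTTP client routed through Tor, which is left out here.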
Features and their current status:
- Crawl Tor links (.onion). (Partially Completed)
- Return the page title and address with a short description of the site. (Partially Completed)
- Save links to a database. (Not Started)
- Get emails from a site. (Completed)
- Save crawl info to a file. (Completed)
- Crawl custom domains. (Completed)
- Check if a link is live. (Completed)
- Built-in updater. (Completed) ...(will be updated)
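As an illustration of the "get emails from site" feature, email-like strings can be pulled out of a downloaded page with a regular expression. This is a rough sketch, not TorBot's own module: `extract_emails` and `EMAIL_RE` are hypothetical names, and the pattern is a deliberately simple approximation that will miss some valid addresses and accept some invalid ones.

```python
import re

# Simplified pattern (an assumption of this sketch, not a full RFC 5322 matcher).
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def extract_emails(html):
    """Return unique email-like strings from html, lowercased, in first-seen order."""
    found = {}
    for match in EMAIL_RE.findall(html):
        found.setdefault(match.lower(), None)  # dict keeps insertion order
    return list(found)


sample = (
    '<a href="mailto:Admin@example.onion">mail</a> '
    "Contact admin@example.onion or bob@mail.example.org."
)
print(extract_emails(sample))
# prints ['admin@example.onion', 'bob@mail.example.org']
```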
Contributions to this project are always welcome. To add a new feature, fork the dev branch and open a pull request once your feature is tested and complete. If it's a new module, put it inside the modules directory and import it in the main file. The branch name should be your new feature name in the format <Feature_featurename_version(optional)>, for example Feature_FasterCrawl_1.0. Contributor names will be added to the contributors list. 😃
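The branch naming convention can be checked mechanically before opening a pull request. The sketch below is only one reading of the format: `parse_branch_name` is a hypothetical helper, and it assumes the feature name is alphanumeric and the optional version is dot-separated digits, neither of which this page states explicitly.

```python
import re

# Assumed shape: Feature_<name>, optionally followed by _<version>.
BRANCH_RE = re.compile(
    r"^Feature_(?P<name>[A-Za-z0-9]+)(?:_(?P<version>\d+(?:\.\d+)*))?$"
)


def parse_branch_name(branch):
    """Return (feature_name, version_or_None), or None if the name does not fit."""
    match = BRANCH_RE.match(branch)
    if match is None:
        return None
    return match.group("name"), match.group("version")


print(parse_branch_name("Feature_FasterCrawl_1.0"))  # prints ('FasterCrawl', '1.0')
print(parse_branch_name("Feature_FasterCrawl"))      # prints ('FasterCrawl', None)
print(parse_branch_name("faster-crawl"))             # prints None
```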