Provides Python access to Google's parser for robots.txt files, as used by their Googlebot web crawler.
Websites may provide an optional robots.txt file at their domain's root to govern the access and behavior of web crawlers. Googlebot, one of the most prominent crawlers, is largely responsible for promoting this standard, and sites interested in SEO closely conform to Googlebot's behavior.
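For illustration, a hypothetical robots.txt file might look like the snippet below (the paths and rules are invented for this example, not taken from any real site):

```
# Applies to every crawler.
User-agent: *
# Keep crawlers out of the private area...
Disallow: /private/
# ...except for one publicly shareable page (the most specific matching rule wins).
Allow: /private/status.html
```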
All credit for the parser goes to the Google team who created, open-sourced, and promoted it.
SEO (Search Engine Optimization): the process of modifying a website's content or metadata to boost its ranking in search engines' page indexes. Higher rankings lead to higher positions in users' search results, which in turn bring more visitors. For further details, see the Wikipedia page on SEO.
Basic usage of the RobotsMatcher class provided by Google:
```python
import jwm.robotstxt.googlebot

robotstxt = """
user-agent: GoodBot
allowed: /path
"""

matcher = jwm.robotstxt.googlebot.RobotsMatcher()
assert matcher.AllowedByRobots(robotstxt, ("GoodBot",), "/path")
```

Check out the documentation for further details. For more use cases, see the test cases for jwm.robotstxt and robotstxt.
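As an additional, hypothetical sketch (the user agent, rules, and paths here are invented for illustration and are not taken from the project's tests), the same matcher can also be used to check that a disallow rule is enforced:

```python
import jwm.robotstxt.googlebot

# Illustrative robots.txt body with a single disallow rule.
robotstxt = """
user-agent: BadBot
disallow: /private
"""

matcher = jwm.robotstxt.googlebot.RobotsMatcher()

# Paths under /private are blocked for BadBot...
assert not matcher.AllowedByRobots(robotstxt, ("BadBot",), "/private/data")
# ...while paths with no matching rule remain allowed by default.
assert matcher.AllowedByRobots(robotstxt, ("BadBot",), "/public")
```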
Install from PyPI under the jwm.robotstxt distribution.
```shell
pip install jwm.robotstxt
```

Import into your program through the jwm.robotstxt.googlebot package.
```python
import jwm.robotstxt.googlebot
```

It is highly recommended to install Python projects into a virtual environment; see PEP 405 for motivation.
Create a virtual environment in the .venv directory.
```shell
python3 -m venv ./.venv
```

Activate with the correct command for your system.
```shell
# Linux/MacOS
. ./.venv/bin/activate
```

```shell
# Windows
.\.venv\Scripts\activate
```

Make sure you have cloned the repository and its submodules.
```shell
git clone --recurse-submodules https://github.com/jwmorley73/jwm.robotstxt.git
```

Install the project using pip. This will build the required robotstxt static library files and link them into the produced Python package.
```shell
pip install .
```

If you want to include the developer tooling, add the dev optional dependencies.
```shell
pip install .[dev]
```

- 32-bit Windows is not supported.