Pyison is a tarpit for AI web crawlers
- Like all web crawlers, AI bots request pages from webservers, and then follow links on the page to other pages. By doing this, they can build an index of an entire website.
- Unlike other web crawlers, though, AI bots present some unique issues
- Server admins can configure a robots.txt file, which tells web crawlers what pages they should and shouldn't crawl.
- Some AI crawlers have been found to ignore this file. There are privacy and copyright concerns with allowing these bots to use a website's data to train LLMs, especially when the user has explicitly opted out of crawling.
- Some crawlers also ignore ratelimiting. When they request pages as fast as possible, they can put significant load on a webserver.
- Enter Pyison. Like other AI crawler tarpits (Nepenthes, Iocane), Pyison feeds web crawlers an endless list of links to other pages on its site. This traps the crawlers on a single site, where they'll endlessly navigate an ever-growing sea of links.
- Keeping AI crawlers stuck in one place prevents them from indexing other parts of the site that the owner might not want to feed to LLMs.
- At the same time, these pages can contain tons of useless text. When LLMs incorporate this text into their models, they can gradually be "poisoned" as the random input will make their responses less coherent.
- Creating useful software to address the rise in LLM content thievery
- Solving problems in existing AI tarpits:
- Generating random text without use of a Markov Chain, LLMs, or any existing writing samples
- Not filtering AI bots by the User-Agent header, since crawlers will often disguise themselves as normal users
- Designing a realistic blog-esque site in order to prevent detection
- Creating an extremely customizable framework to allow further configuration
- Providing a solution that can easily integrate into an existing website
- Supports but doesn't require reverse proxies/specific webservers, provides sub-path configuration with a document root setting
- Running with a small footprint rather than an entire CMS or feature-rich webserver
- This project runs a dynamic web server using Python's HTTPServer library
- Sentences are produced from a 50/50 mix of random English words and "stop words"
- Pyison uses a global salt as well as a page url to seed its RNG. This ensures that if crawlers ever perform a "sanity check" by reloading a page, it will be the same as the last time they checked it. At the same time, different servers should use different global seeds so that not all sites using Pyison look the same.
- It's possible to generate an unlimited number of pages, since Pyison isn't actually creating any permanent files. It's a dynamic web server, so it just generates html and sends it to the client.
- Use of HTML content tags, CSS formatting, and images should make the site seem a bit more realistic to a crawler.
- It's possible to configure Pyison's output without rewriting the program, by changing the config, static, and template files
- Pyison responds to errant POST and PUT requests with a 404 in case crawlers test that those HTTP verbs are configured
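Taken together, the behaviors above (seeding the RNG from a global salt plus the page URL, the 50/50 stop-word mix, generating HTML on the fly, and answering errant verbs with 404) can be sketched with Python's built-in `http.server`. This is an illustrative sketch, not Pyison's actual implementation: `GLOBAL_SALT`, the tiny word lists, and `page_html` are all hypothetical stand-ins.

```python
import hashlib
import random
from http.server import BaseHTTPRequestHandler, HTTPServer

GLOBAL_SALT = "change-me"  # stand-in for the random-seed config setting

# Illustrative word lists; Pyison itself draws on nltk's word corpora
STOP_WORDS = ["the", "and", "of", "which", "however", "between"]
RANDOM_WORDS = ["nectarial", "electrofusion", "dressage", "hooliganism", "yeller"]

def make_sentence(rng: random.Random) -> str:
    """Build a sentence from a roughly 50/50 mix of stop words and random words."""
    words = [
        rng.choice(STOP_WORDS if rng.random() < 0.5 else RANDOM_WORDS)
        for _ in range(rng.randint(6, 14))
    ]
    return " ".join(words).capitalize() + "."

def page_html(path: str) -> bytes:
    """Generate a page deterministically from the salt and URL.

    Reloading the same URL always yields the same page, so a crawler's
    "sanity check" passes, while different salts keep different sites
    from looking identical.
    """
    seed = hashlib.sha256((GLOBAL_SALT + path).encode()).hexdigest()
    rng = random.Random(seed)
    paragraphs = "".join(
        "<p>" + " ".join(make_sentence(rng) for _ in range(5)) + "</p>"
        for _ in range(3)
    )
    return f"<html><body>{paragraphs}</body></html>".encode()

class TarpitHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        page = page_html(self.path)
        self.send_response(200)
        self.send_header("Content-Type", "text/html")
        self.send_header("Content-Length", str(len(page)))
        self.end_headers()
        self.wfile.write(page)

    # Answer errant verbs with a 404, as Pyison does
    def do_POST(self):
        self.send_error(404)
    do_PUT = do_POST

if __name__ == "__main__":
    HTTPServer(("", 8080), TarpitHandler).serve_forever()
```

Because the seed depends only on the salt and the path, no page state ever needs to be stored: the "site" is unbounded but fully reproducible.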
- Clone this project to a local folder
- Change settings in the `config/config.json` file
  - Set the `random-seed` to a random number!
  - Set the `port` number for the server to use
  - See the Configuration section for more info
- (Optional but recommended) Update the HTML template, CSS, images, and robots.txt to your liking
  - This project is more effective if pages look different, which can be done by varying the HTML and CSS structure. Rearrange and rename some stuff, or rewrite it completely.
- (Optional) Edit the robots.txt if you want to change which bots are affected
- Set up the server environment with any of the methods below
- Put the server behind a reverse proxy
- Install nltk: `pip install nltk`
- Run Pyison: `python3 src/server.py`
- Check that it's running: open `http://localhost:<your port number>` in a browser
- Make a bare-minimum docker-compose.yml file containing:

  ```yaml
  services:
    pyison:
      image: "ghcr.io/jonaslong/pyison:main"
      container_name: pyison
      tty: true
      ports:
        - 80:80
  ```

- Run `docker compose up`
  - (Optional) Use the `-d` flag to detach from the container
- For changes to local files to have an effect, clone this repository and use a bind mount. See the full compose file.
- Run `docker run --tty --name pyison -p "127.0.0.1:80:80" --rm ghcr.io/jonaslong/pyison:main`
  - (Optional) Remove the `--rm` flag to persist the container
- Clone the repository locally
- Build the image with `docker build -t pyison:latest .`
- Run the container with `docker run --tty --name pyison -p "127.0.0.1:80:80" --rm pyison:latest`
It is highly recommended that you use a reverse proxy to serve this content. It can reduce server load by caching pages and introducing rate limits, as well as serve the content over HTTPS and protect against some basic webserver exploits.
- These examples use Nginx Proxy Manager, but any reverse proxy software should work
- Pyison supports NPM's strictest SSL settings
- NPM should pass the `User-Agent` HTTP header to the Pyison server without any special configuration. Use `proxy_pass_header User-Agent;` if needed.
- Select the following settings in the UI, or enter the equivalent settings in the configuration file:
  - Domain Names: A domain or subdomain you control, like `tarpit.example.com`
  - Scheme: `http`
  - Forward Hostname / IP: `localhost`, or the name of the docker container if using a docker network
  - Forward Port: Whatever port is defined in the docker-compose/run command/config.json (default is `80`)
  - Enable `Cache Assets` and `Block Common Exploits`, but not `Websockets Support`
- To use Pyison in a sub-path, use the following location configuration:

  ```nginx
  # Handles the root page ("example.com/tarpit")
  location /tarpit {
      proxy_pass http://localhost:80;
  }

  # Handles all sub-pages ("example.com/tarpit/a" and "example.com/tarpit/a/b")
  location /tarpit/ {
      proxy_pass http://localhost:80;
  }

  # Handles all urls with a 3-letter file extension ("example.com/tarpit/style.css" and "example.com/tarpit/images/picture.jpg")
  # Note that the ~ denotes regex matching. The string immediately following it must be a valid regex statement.
  location ~ .*\/tarpit\/.*\....$ {
      proxy_pass http://localhost:80;
  }
  ```
- Here is a working setup using the UI:
  - The first location block, `/tarpit`, has no custom configuration. It will behave like the first location block defined above.
  - The second location block, `/tarpit/`, contains the second and third location blocks defined above
- Replace `tarpit` with the desired root path
  - If using a deeper root path like `/tar/pit`, note that the third location block uses regex, so any slashes must be escaped with a backslash
    - eg: `/tar/pit` would require `location ~ .*\/tar\/pit\/.*\....$` in the 3rd location block
- Update Pyison's `document-root` setting with the sub-path used by the proxy (see the Configuration section)
- See above sections for further configuration of host, ports, SSL, etc
- The config.json file defines various settings for easy customization:
  - `port` (default 80)
    - What port to serve content on
  - `random-seed` (Please change this!)
    - Global seed (salt) used to make sure not every webserver has the exact same text
    - Set this to some random number; it doesn't need to be secure
  - `document-root` (default "/")
    - Prepends a path onto links
    - You'll only need to change this if you're using a reverse proxy and serving from a sub-directory
      - If so, use either a fully-qualified path or one relative to the root
        - ie `https://example.com/pyison/` or `/pyison/`
  - `fake-image-dir` (default ["images"])
    - All images will appear to be served from this path
    - Accepts either a single string, null, or a list of options to randomly choose from
  - `fake-css-dir` (default ["css"])
    - All CSS files will appear to be served from this path
    - Accepts either a single string, null, or a list of options to randomly choose from
  - `spacing-characters` (default ["_","-","%20"])
    - Spaces to use between the words in a page URL
    - This also affects how page URLs are split to decode back into titles
  - `unsafe-characters` (default ["'","`"])
    - Characters that can occur naturally in the word list but should be removed from URLs
    - By default, this removes single quotes (') and backticks (`)
  - `robots-txt` (default "assets/robots.txt")
    - This configures the file that gets served at `/robots.txt`
    - If the string is empty, a 404 response will be returned instead
  - `html-templates` (default ["assets/template.html"])
    - HTML file to serve, containing format tags to substitute with random values (see HTML Templating)
    - Accepts either a single string, or a list of options to randomly choose from
  - `css-files` (default ["assets/style.css"])
    - CSS file(s) to serve
    - No substitution is done on CSS files
    - Accepts either a single string, or a list of options to randomly choose from
  - `images`
    - `ico` (default ["assets/logo.ico"])
    - `jpg` (default ["assets/logo.jpg"])
    - `png` (default ["assets/logo.png"])
    - For each of the above image extensions: a single image file, null, or a list of images to pick randomly from
  - `remove-from-stop-words`
    - The nltk library has a "stop words" list that's useful for generating lots of common words. However, some entries shouldn't be used for text generation because they're an obvious giveaway that this is generated content
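Putting the settings above together, a complete config.json might look like the following. The values are illustrative; the exact nesting of the `images` keys and the list form of `remove-from-stop-words` are assumptions, so check the shipped config file for the authoritative schema.

```json
{
  "port": 80,
  "random-seed": 483921650,
  "document-root": "/tarpit/",
  "fake-image-dir": ["images", "media"],
  "fake-css-dir": ["css"],
  "spacing-characters": ["_", "-", "%20"],
  "unsafe-characters": ["'", "`"],
  "robots-txt": "assets/robots.txt",
  "html-templates": ["assets/template.html"],
  "css-files": ["assets/style.css"],
  "images": {
    "ico": ["assets/logo.ico"],
    "jpg": ["assets/logo.jpg"],
    "png": ["assets/logo.png"]
  },
  "remove-from-stop-words": []
}
```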
- Before serving your HTML file(s), Pyison will substitute some preset tags with its own values
- When one of these tags appears multiple times in the template, each occurrence will receive the same value
- `{HOME}`
  - Link to the document root, as defined in the config
- `{TITLE}`
  - Title text of the page, based on the current URL
    - ex: `/blog/about/once-upon-a-time` -> `Once Upon A Time`
- `{UPTITLE}`
  - Title text of the parent page, generated from the current URL
- `{MAIN}`
  - This should be used as the site's main content. It will generate several paragraphs containing random text, along with section headings and subheadings. Random links may also be present throughout each paragraph.
- `{UP}`
  - Path to the parent page
    - eg: `/blog/about/once-upon-a-time` -> `/blog/about`
- `{CSSLINK}`
  - Random URL to a CSS document
  - The beginning path of this URL (eg `/styles`) can be set in the config
  - Make sure to specify '.css' immediately after the tag. This is how the webserver knows to send CSS rather than more HTML.
  - The actual document returned by the server will be chosen based on the CSS page(s) specified in the config
- `{WORD}`
  - Generates a single random word. Proper nouns may be capitalized.
- `{NAME}`
  - Generates two capitalized words with a space in between.
- `{SENTENCE}`
  - Generates a random sentence starting with a capital letter and ending with a period.
- `{PIC}`
  - Generates a random URL to an image file
  - The beginning path of this URL (eg `/images`) can be set in the config
  - Make sure to put an appropriate file extension after the link (.png, .jpg, .ico)
  - The actual image returned by the server will be chosen based on the provided extension and the image(s) specified in the config
- `{LINK}`
  - Generates a random link, which may contain sub-paths
  - The page name may be separated by dashes or underscores
    - eg: `/dressage/electrochronometer/himself-she-eciliate`
- `{OVER}`
  - Generates a random link to a sibling page
    - eg: When visiting `/neighborliness/wont_unhonest`, an `{OVER}` tag might be replaced with `/neighborliness/hooliganism`
- `{NEWTITLE}`
  - Generates a random title
  - Uses a series of random words in Title Capitalization
    - eg: `Nectarial Electrofusion Which Dephosphorization`
  - If this tag follows a `{LINK}` or `{OVER}` tag, the title and url will be synced
    - eg: `<a href="{OVER}">{NEWTITLE}</a>` might be replaced with `<a href="/swordmanship/yeller/over-will-oghuz">Over Will Oghuz</a>`
  - Tags are evaluated forwards through the template. A `{NEWTITLE}` tag will search backwards for the closest `{LINK}` or `{OVER}` tag to sync with.
    - Tags will always pair correctly when any `{LINK}` or `{OVER}` tag is immediately followed by its `{NEWTITLE}` tag, if a `{NEWTITLE}` is desired
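As an illustration of how the tags above fit together, a minimal blog-style template might look like the following. This is a hypothetical sketch, not Pyison's shipped `assets/template.html`; note the `.css` and `.jpg` extensions placed immediately after the `{CSSLINK}` and `{PIC}` tags, and the `{NEWTITLE}` tags placed directly after the links they should sync with.

```html
<!DOCTYPE html>
<html>
<head>
  <title>{TITLE}</title>
  <!-- The .css suffix tells the server to send CSS rather than more HTML -->
  <link rel="stylesheet" href="{CSSLINK}.css">
</head>
<body>
  <h1>{TITLE}</h1>
  <p>Posted by {NAME} &middot; <a href="{UP}">Back to {UPTITLE}</a></p>
  <img src="{PIC}.jpg" alt="{WORD}">
  {MAIN}
  <!-- Pairing each link tag immediately with {NEWTITLE} keeps title and URL synced -->
  <p>Related: <a href="{LINK}">{NEWTITLE}</a> &middot; <a href="{OVER}">{NEWTITLE}</a></p>
  <p><a href="{HOME}">Home</a></p>
</body>
</html>
```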
Here is a sample of how the site looks before any editing of the template:

