
Commit 8ae63ea

Gisle Aas committed
First revision.
1 parent 10a2e6a commit 8ae63ea

File tree

1 file changed: +314 -0 lines changed


doc/norobots.html

@@ -0,0 +1,314 @@
<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
<html>
<head>
<title>A Standard for Robot Exclusion</title>
</head>
<body>

<h1>A Standard for Robot Exclusion</h1>

Table of contents:

<ul>
<li><a href="#status">Status of this document</a>
<li><a href="#introduction">Introduction</a>
<li><a href="#method">Method</a>
<li><a href="#format">Format</a>
<li><a href="#examples">Examples</a>
<li><a href="#code">Example Code</a>
<li><a href="#author">Author's Address</a>
</ul>
<hr>

<h2><a name="status">Status of this document</a></h2>
This document represents a consensus on 30 June 1994 on the robots
mailing list ([email protected]), between the majority of
robot authors and other people with an interest in robots. It has
also been open for discussion on the Technical World Wide Web
mailing list ([email protected]). This document is based on a
previous working draft under the same title.

<p>
It is not an official standard backed by a standards body,
or owned by any commercial organisation.
It is not enforced by anybody, and there is no guarantee that
all current and future robots will use it.
Consider it a common facility the majority of robot authors
offer the WWW community to protect WWW servers against
unwanted accesses by their robots.</p>

<p>
The latest version of this document can be found on
<a href="http://web.nexor.co.uk/mak/doc/robots/norobots.html">
http://web.nexor.co.uk/mak/doc/robots/norobots.html</a>.</p>

<hr>
<h2><a name="introduction">Introduction</a></h2>

WWW Robots (also called wanderers or spiders) are programs
that traverse many pages in the World Wide Web by
recursively retrieving linked pages. For more information
see <a href="robots.html">the robots page</a>.

<p>
In 1993 and 1994 there have been occasions where robots
have visited WWW servers where they weren't welcome for
various reasons. Sometimes these reasons were robot-specific,
e.g. certain robots swamped servers with rapid-fire
requests, or retrieved the same files repeatedly.
In other situations robots traversed parts of WWW servers
that weren't suitable, e.g. very deep virtual trees,
duplicated information, temporary information, or
cgi-scripts with side-effects (such as voting).</p>

<p>
These incidents indicated the need for established
mechanisms for WWW servers to indicate to robots which parts
of their server should not be accessed. This standard
addresses this need with an operational solution.</p>

<hr>
<h2><a name="method">The Method</a></h2>
116+
117+
The method used to exclude robots from a server is to
118+
create a file on the server which specifies an access
119+
policy for robots.
120+
121+
This file must be accessible via HTTP on the local URL
122+
"<code>/robots.txt</code>".
123+
The contents of this file are specified <a href="#format">below</a>.
124+
125+
<p>
126+
127+
This approach was chosen because it can be easily
128+
implemented on any existing WWW server, and a robot can find
129+
the access policy with only a single document retrieval.</p>
130+
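<p>
As a minimal sketch (not part of this specification), a robot
written in Perl could fetch the whole policy with one request
using the LWP library's <code>LWP::Simple</code> module; the host
name below is illustrative, not normative:</p>

<hr>
<pre>
#!/usr/bin/perl
use LWP::Simple qw(get);

# A single document retrieval fetches the entire access policy.
my $robots_txt = get("http://www.site.com/robots.txt");

if (defined $robots_txt) {
    print $robots_txt;   # hand this to a parser (see the Format section)
} else {
    # No "/robots.txt" present: the robot may consider itself welcome.
    print "no access policy published\n";
}
</pre>
<hr>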
<p>
A possible drawback of this single-file approach is that only a
server administrator can maintain such a list, not the
individual document maintainers on the server. This can be
resolved by a local process to construct the single file
from a number of others, but if, or how, this is done is
outside the scope of this document.</p>

<p>
The choice of the URL was motivated by several criteria:</p>

<ul>
<li>
The filename should fit the file-naming restrictions of all
common operating systems.

<li>
The filename extension should not require extra server
configuration.

<li>
The filename should indicate the purpose of the file
and be easy to remember.

<li>
The likelihood of a clash with existing files should
be minimal.

</ul>
<hr>
<h2><a name="format">The Format</a></h2>

The format and semantics of the "<code>/robots.txt</code>" file
are as follows:

<p>
The file consists of one or more records separated by one or
more blank lines (terminated by CR, CR/NL, or NL). Each
record contains lines of the form
"<code>&lt;field&gt;:&lt;optionalspace&gt;&lt;value&gt;&lt;optionalspace&gt;</code>".
The field name is case insensitive.</p>

<p>
Comments can be included in the file using UNIX Bourne shell
conventions: the '<code>#</code>' character indicates that
any preceding space and the remainder of the line up to the
line terminator are discarded.
Lines containing only a comment are discarded completely,
and therefore do not indicate a record boundary.</p>

<p>
The record starts with one or more <code>User-agent</code>
lines, followed by one or more <code>Disallow</code> lines,
as detailed below. Unrecognised headers are ignored.</p>
<dl>
<dt>User-agent</dt>
<dd>
The value of this field is the name of the robot the
record is describing access policy for.

<p>
If more than one User-agent field is present the record
describes an identical access policy for more
than one robot. At least one field needs to be present
per record.</p>

<p>
The robot should be liberal in interpreting this field.
A case insensitive substring match of the name without
version information is recommended.</p>

<p>
If the value is '<code>*</code>', the record describes
the default access policy for any robot that has not
matched any of the other records. It is not allowed to
have two such records in the "<code>/robots.txt</code>"
file.</p></dd>

<dt>Disallow</dt>
<dd>
The value of this field specifies a partial URL that is not
to be visited. This can be a full path, or a partial
path; any URL that starts with this value will not be
retrieved. For example, <code>Disallow: /help</code>
disallows both <code>/help.html</code> and
<code>/help/index.html</code>, whereas
<code>Disallow: /help/</code> would disallow
<code>/help/index.html</code>
but allow <code>/help.html</code>.

<p>
An empty value indicates that all URLs can be
retrieved. At least one Disallow field needs to
be present in a record.</p></dd>

</dl>

The presence of an empty "<code>/robots.txt</code>" file
has no explicit associated semantics; it will be treated
as if it were not present, i.e. all robots will consider
themselves welcome.
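<p>
To illustrate, here is a minimal sketch of a parser in Perl that
reads a "<code>/robots.txt</code>" file from standard input and
applies the prefix-matching rule above. The robot name
"<code>examplebot</code>" is hypothetical, and the sketch is far
less thorough than the example code referenced below:</p>

<hr>
<pre>
#!/usr/bin/perl
# Minimal "/robots.txt" parser sketch (not the norobots.pl example code).

my $AGENT = "examplebot";   # hypothetical robot name, without version info

my @disallow;               # prefixes from a record naming this robot
my @default_disallow;       # prefixes from the '*' record
my $matched = 0;            # did a record name this robot directly?
my (@agents, @paths);       # fields of the record being read

sub end_record {
    if (grep { index(lc($AGENT), lc($_)) >= 0 } @agents) {
        @disallow = @paths;          # liberal substring match on the name
        $matched  = 1;
    }
    elsif (grep { $_ eq "*" } @agents) {
        @default_disallow = @paths;  # default policy record
    }
    @agents = ();
    @paths  = ();
}

while (&lt;STDIN&gt;) {
    s/[\r\n]+$//;                    # lines end in CR, CR/NL, or NL
    next if /^\s*#/;                 # comment-only lines: no record boundary
    s/\s*#.*$//;                     # discard a trailing comment
    if (/^\s*$/) {                   # a blank line ends the current record
        end_record() if @agents;
        next;
    }
    if    (/^user-agent:\s*(.*?)\s*$/i) { push @agents, $1 }
    elsif (/^disallow:\s*(.*?)\s*$/i)   { push @paths, $1 if $1 ne "" }
    # unrecognised headers are ignored
}
end_record() if @agents;             # file may end without a blank line
@disallow = @default_disallow unless $matched;

# Any URL path starting with a disallowed value must not be retrieved.
sub allowed {
    my($path) = @_;
    foreach my $prefix (@disallow) {
        return 0 if index($path, $prefix) == 0;
    }
    return 1;
}

print allowed("/help/index.html") ? "allowed\n" : "disallowed\n";
</pre>
<hr>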
<hr>

<h2><a name="examples">Examples</a></h2>

The following example "<code>/robots.txt</code>" file specifies
that no robots should visit any URL starting with
"<code>/cyberworld/map/</code>" or
"<code>/tmp/</code>":

<hr>
<pre>
# robots.txt for http://www.site.com/

User-agent: *
Disallow: /cyberworld/map/ # This is an infinite virtual URL space
Disallow: /tmp/ # these will soon disappear
</pre>
<hr>

This example "<code>/robots.txt</code>" file specifies
that no robots should visit any URL starting with
"<code>/cyberworld/map/</code>", except the robot called
"<code>cybermapper</code>":

<hr>
<pre>
# robots.txt for http://www.site.com/

User-agent: *
Disallow: /cyberworld/map/ # This is an infinite virtual URL space

# Cybermapper knows where to go.
User-agent: cybermapper
Disallow:
</pre>
<hr>

This example indicates that no robots should visit
this site further:

<hr>
<pre>
# go away
User-agent: *
Disallow: /
</pre>
<hr>
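<p>
Fed the second example above, the hypothetical parser sketch from
the Format section would leave "<code>cybermapper</code>" with an
empty disallow list, while any other robot falls back to the
'<code>*</code>' record (the URL path is illustrative):</p>

<hr>
<pre>
# Continuing the earlier sketch, after parsing the second example:
#   as "cybermapper",  @disallow is empty, so everything is allowed;
#   as any other name, the '*' record applies.
print allowed("/cyberworld/map/world.html")
    ? "retrieve it\n"
    : "leave it alone\n";   # most robots print "leave it alone"
</pre>
<hr>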
<h2><a name="code">Example Code</a></h2>

Although it is not part of this specification, some example code
in Perl is available in <a href="norobots.pl">norobots.pl</a>. It
is a bit more flexible in its parsing than this document
specifies, and is provided as-is, without warranty.

<h2><a name="author">Author's Address</a></h2>

<address>
<a href="/mak/mak.html">Martijn Koster</a>
&lt;[email protected]&gt;<br>
NEXOR<br>
PO Box 132,<br>
Nottingham,<br>
The United Kingdom<br>
Phone: +44 602 520576
</address>
</body>
</html>
