
Commit 8ae63ea

Gisle Aas committed
First revision.
1 parent 10a2e6a commit 8ae63ea

File tree

1 file changed: +314 -0 lines changed


doc/norobots.html

@@ -0,0 +1,314 @@
<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
<html>
<head>
<title>A Standard for Robot Exclusion</title>
</head>
<body>

<h1>A Standard for Robot Exclusion</h1>

Table of contents:

<ul>
<li><a href="#status">Status of this document</a>
<li><a href="#introduction">Introduction</a>
<li><a href="#method">Method</a>
<li><a href="#format">Format</a>
<li><a href="#examples">Examples</a>
<li><a href="#code">Example Code</a>
<li><a href="#author">Author's Address</a>
</ul>
<hr>

<h2><a name="status">Status of this document</a></h2>
This document represents a consensus on 30 June 1994 on the robots
mailing list ([email protected]), between the majority of
robot authors and other people with an interest in robots. It has
also been open for discussion on the Technical World Wide Web
mailing list ([email protected]). This document is based on a
previous working draft under the same title.

<p>
It is not an official standard backed by a standards body,
or owned by any commercial organisation.
It is not enforced by anybody, and there is no guarantee that
all current and future robots will use it.
Consider it a common facility the majority of robot authors
offer the WWW community to protect WWW servers against
unwanted accesses by their robots.</p>

<p>
The latest version of this document can be found on
<a href="http://web.nexor.co.uk/mak/doc/robots/norobots.html">
http://web.nexor.co.uk/mak/doc/robots/norobots.html</a>.</p>

<hr>
<h2><a name="introduction">Introduction</a></h2>

WWW Robots (also called wanderers or spiders) are programs
that traverse many pages in the World Wide Web by
recursively retrieving linked pages. For more information
see <a href="robots.html">the robots page</a>.

<p>
In 1993 and 1994 there have been occasions where robots
have visited WWW servers where they weren't welcome for
various reasons. Sometimes these reasons were robot-specific,
e.g. certain robots swamped servers with rapid-fire
requests, or retrieved the same files repeatedly.
In other situations robots traversed parts of WWW servers
that weren't suitable, e.g. very deep virtual trees,
duplicated information, temporary information, or
cgi-scripts with side-effects (such as voting).</p>

<p>
These incidents indicated the need for established
mechanisms for WWW servers to indicate to robots which parts
of their server should not be accessed. This standard
addresses this need with an operational solution.</p>

<hr>
<h2><a name="method">The Method</a></h2>
116+
117+
The method used to exclude robots from a server is to
118+
create a file on the server which specifies an access
119+
policy for robots.
120+
121+
This file must be accessible via HTTP on the local URL
122+
"<code>/robots.txt</code>".
123+
The contents of this file are specified <a href="#format">below</a>.
124+
125+
<p>
126+
127+
This approach was chosen because it can be easily
128+
implemented on any existing WWW server, and a robot can find
129+
the access policy with only a single document retrieval.</p>
130+
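<p>
As a minimal sketch (not part of this specification), a robot
written in Perl could fetch the whole policy with one request
using the LWP library's <code>LWP::Simple</code> module; the host
name below is illustrative, not normative:</p>

<hr>
<pre>
#!/usr/bin/perl
use LWP::Simple qw(get);

# A single document retrieval fetches the entire access policy.
my $robots_txt = get("http://www.site.com/robots.txt");

if (defined $robots_txt) {
    print $robots_txt;   # hand this to a parser (see the Format section)
} else {
    # No "/robots.txt" present: the robot may consider itself welcome.
    print "no access policy published\n";
}
</pre>
<hr>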
<p>
A possible drawback of this single-file approach is that only a
server administrator can maintain such a list, not the
individual document maintainers on the server. This can be
resolved by a local process to construct the single file
from a number of others, but if, or how, this is done is
outside the scope of this document.</p>

<p>
The choice of the URL was motivated by several criteria:</p>

<ul>
<li>
The filename should fit the file-naming restrictions of all
common operating systems.

<li>
The filename extension should not require extra server
configuration.

<li>
The filename should indicate the purpose of the file
and be easy to remember.

<li>
The likelihood of a clash with existing files should
be minimal.

</ul>
<hr>
<h2><a name="format">The Format</a></h2>

The format and semantics of the "<code>/robots.txt</code>" file
are as follows:

<p>
The file consists of one or more records separated by one or
more blank lines (terminated by CR, CR/NL, or NL). Each
record contains lines of the form
"<code>&lt;field&gt;:&lt;optionalspace&gt;&lt;value&gt;&lt;optionalspace&gt;</code>".
The field name is case insensitive.</p>

<p>
Comments can be included in the file using UNIX Bourne shell
conventions: the '<code>#</code>' character indicates that
any preceding space and the remainder of the line up to the
line terminator are discarded.
Lines containing only a comment are discarded completely,
and therefore do not indicate a record boundary.</p>

<p>
The record starts with one or more <code>User-agent</code>
lines, followed by one or more <code>Disallow</code> lines,
as detailed below. Unrecognised headers are ignored.</p>
<dl>
<dt>User-agent</dt>
<dd>
The value of this field is the name of the robot the
record is describing access policy for.

<p>
If more than one User-agent field is present the record
describes an identical access policy for more
than one robot. At least one field needs to be present
per record.</p>

<p>
The robot should be liberal in interpreting this field.
A case insensitive substring match of the name without
version information is recommended.</p>

<p>
If the value is '<code>*</code>', the record describes
the default access policy for any robot that has not
matched any of the other records. It is not allowed to
have two such records in the "<code>/robots.txt</code>"
file.</p></dd>

<dt>Disallow</dt>
<dd>
The value of this field specifies a partial URL that is not
to be visited. This can be a full path, or a partial
path; any URL that starts with this value will not be
retrieved. For example, <code>Disallow: /help</code>
disallows both <code>/help.html</code> and
<code>/help/index.html</code>, whereas
<code>Disallow: /help/</code> would disallow
<code>/help/index.html</code>
but allow <code>/help.html</code>.

<p>
An empty value indicates that all URLs can be
retrieved. At least one Disallow field needs to
be present in a record.</p></dd>

</dl>

The presence of an empty "<code>/robots.txt</code>" file
has no explicit associated semantics; it will be treated
as if it were not present, i.e. all robots will consider
themselves welcome.
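<p>
To illustrate, here is a minimal sketch of a parser in Perl that
reads a "<code>/robots.txt</code>" file from standard input and
applies the prefix-matching rule above. The robot name
"<code>examplebot</code>" is hypothetical, and the sketch is far
less thorough than the example code referenced below:</p>

<hr>
<pre>
#!/usr/bin/perl
# Minimal "/robots.txt" parser sketch (not the norobots.pl example code).

my $AGENT = "examplebot";   # hypothetical robot name, without version info

my @disallow;               # prefixes from a record naming this robot
my @default_disallow;       # prefixes from the '*' record
my $matched = 0;            # did a record name this robot directly?
my (@agents, @paths);       # fields of the record being read

sub end_record {
    if (grep { index(lc($AGENT), lc($_)) >= 0 } @agents) {
        @disallow = @paths;          # liberal substring match on the name
        $matched  = 1;
    }
    elsif (grep { $_ eq "*" } @agents) {
        @default_disallow = @paths;  # default policy record
    }
    @agents = ();
    @paths  = ();
}

while (&lt;STDIN&gt;) {
    s/[\r\n]+$//;                    # lines end in CR, CR/NL, or NL
    next if /^\s*#/;                 # comment-only lines: no record boundary
    s/\s*#.*$//;                     # discard a trailing comment
    if (/^\s*$/) {                   # a blank line ends the current record
        end_record() if @agents;
        next;
    }
    if    (/^user-agent:\s*(.*?)\s*$/i) { push @agents, $1 }
    elsif (/^disallow:\s*(.*?)\s*$/i)   { push @paths, $1 if $1 ne "" }
    # unrecognised headers are ignored
}
end_record() if @agents;             # file may end without a blank line
@disallow = @default_disallow unless $matched;

# Any URL path starting with a disallowed value must not be retrieved.
sub allowed {
    my($path) = @_;
    foreach my $prefix (@disallow) {
        return 0 if index($path, $prefix) == 0;
    }
    return 1;
}

print allowed("/help/index.html") ? "allowed\n" : "disallowed\n";
</pre>
<hr>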
<hr>

<h2><a name="examples">Examples</a></h2>

The following example "<code>/robots.txt</code>" file specifies
that no robots should visit any URL starting with
"<code>/cyberworld/map/</code>" or
"<code>/tmp/</code>":

<hr>
<pre>
# robots.txt for http://www.site.com/

User-agent: *
Disallow: /cyberworld/map/ # This is an infinite virtual URL space
Disallow: /tmp/ # these will soon disappear
</pre>
<hr>

This example "<code>/robots.txt</code>" file specifies
that no robots should visit any URL starting with
"<code>/cyberworld/map/</code>", except the robot called
"<code>cybermapper</code>":

<hr>
<pre>
# robots.txt for http://www.site.com/

User-agent: *
Disallow: /cyberworld/map/ # This is an infinite virtual URL space

# Cybermapper knows where to go.
User-agent: cybermapper
Disallow:
</pre>
<hr>

This example indicates that no robots should visit
this site further:

<hr>
<pre>
# go away
User-agent: *
Disallow: /
</pre>
<hr>
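<p>
Fed the second example above, the hypothetical parser sketch from
the Format section would leave "<code>cybermapper</code>" with an
empty disallow list, while any other robot falls back to the
'<code>*</code>' record (the URL path is illustrative):</p>

<hr>
<pre>
# Continuing the earlier sketch, after parsing the second example:
#   as "cybermapper",  @disallow is empty, so everything is allowed;
#   as any other name, the '*' record applies.
print allowed("/cyberworld/map/world.html")
    ? "retrieve it\n"
    : "leave it alone\n";   # most robots print "leave it alone"
</pre>
<hr>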
<h2><a name="code">Example Code</a></h2>

Although it is not part of this specification, some example code
in Perl is available in <a href="norobots.pl">norobots.pl</a>. It
is a bit more flexible in its parsing than this document
specifies, and is provided as-is, without warranty.

<h2><a name="author">Author's Address</a></h2>

<address>
<a href="/mak/mak.html">Martijn Koster</a>
&lt;[email protected]&gt;<br>
NEXOR<br>
PO Box 132,<br>
Nottingham,<br>
The United Kingdom<br>
Phone: +44 602 520576
</address>
</body>
</html>
