<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
<html>
<head>
<title>A Standard for Robot Exclusion</title>
</head>
<body>

<h1>A Standard for Robot Exclusion</h1>

Table of contents:

<ul>
<li><a href="#status">Status of this document</a>
<li><a href="#introduction">Introduction</a>
<li><a href="#method">Method</a>
<li><a href="#format">Format</a>
<li><a href="#examples">Examples</a>
<li><a href="#code">Example Code</a>
<li><a href="#author">Author's Address</a>
</ul>
<hr>
<h2><a name="status">Status of this document</a></h2>

This document represents a consensus reached on 30 June 1994 on the robots
mailing list ([email protected]) between the majority of
robot authors and other people with an interest in robots. It has
also been open for discussion on the Technical World Wide Web
mailing list ([email protected]). This document is based on a
previous working draft under the same title.

<p>

It is not an official standard backed by a standards body,
nor is it owned by any commercial organisation.

It is not enforced by anybody, and there is no guarantee that
all current and future robots will use it.

Consider it a common facility the majority of robot authors
offer the WWW community to protect WWW servers against
unwanted accesses by their robots.</p>

<p>

The latest version of this document can be found at
<a href="http://web.nexor.co.uk/mak/doc/robots/norobots.html">
http://web.nexor.co.uk/mak/doc/robots/norobots.html</a>.</p>
<hr>

<h2><a name="introduction">Introduction</a></h2>

WWW Robots (also called wanderers or spiders) are programs
that traverse many pages in the World Wide Web by
recursively retrieving linked pages. For more information
see <a href="robots.html">the robots page</a>.

<p>

In 1993 and 1994 there were occasions where robots
visited WWW servers where they weren't welcome, for
various reasons. Sometimes these reasons were robot specific,
e.g. certain robots swamped servers with rapid-fire
requests, or retrieved the same files repeatedly.
In other situations robots traversed parts of WWW servers
that weren't suitable, e.g. very deep virtual trees,
duplicated information, temporary information, or
cgi-scripts with side-effects (such as voting).</p>

<p>

These incidents indicated the need for established
mechanisms for WWW servers to indicate to robots which parts
of their server should not be accessed. This standard
addresses this need with an operational solution.</p>
<hr>

<h2><a name="method">The Method</a></h2>

The method used to exclude robots from a server is to
create a file on the server which specifies an access
policy for robots.

This file must be accessible via HTTP on the local URL
"<code>/robots.txt</code>".
The contents of this file are specified <a href="#format">below</a>.

<p>

This approach was chosen because it can be easily
implemented on any existing WWW server, and a robot can find
the access policy with only a single document retrieval.</p>

<p>

A possible drawback of this single-file approach is that only a
server administrator can maintain such a list, not the
individual document maintainers on the server. This can be
resolved by a local process that constructs the single file
from a number of others, but if, or how, this is done is
outside the scope of this document.</p>

<p>

The choice of the URL was motivated by several criteria:</p>
<ul>
<li>The filename should fit the file naming restrictions of all
common operating systems.

<li>The filename extension should not require extra server
configuration.

<li>The filename should indicate the purpose of the file
and be easy to remember.

<li>The likelihood of a clash with existing files should
be minimal.
</ul>
<hr>
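A robot applying this method first derives the "<code>/robots.txt</code>" URL from any URL on the server it is visiting. The following sketch is not part of the standard (and uses Python rather than the Perl of the example code) but illustrates the single fixed local URL:

```python
# Illustrative sketch, not part of the standard: derive the
# "/robots.txt" URL for the server a given page URL lives on.
from urllib.parse import urlsplit, urlunsplit

def robots_url(url):
    parts = urlsplit(url)
    # Keep the scheme and host, replace the path with the fixed local URL.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))
```

For example, <code>robots_url("http://www.site.com/cyberworld/map/index.html")</code> yields <code>http://www.site.com/robots.txt</code>.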
<h2><a name="format">The Format</a></h2>

The format and semantics of the "<code>/robots.txt</code>" file
are as follows:

<p>

The file consists of one or more records separated by one or
more blank lines (terminated by CR, CR/NL, or NL). Each
record contains lines of the form
"<code>&lt;field&gt;:&lt;optionalspace&gt;&lt;value&gt;&lt;optionalspace&gt;</code>".
The field name is case insensitive.</p>

<p>

Comments can be included in the file using UNIX Bourne shell
conventions: the '<code>#</code>' character indicates that
the preceding space (if any) and the remainder of
the line up to the line terminator are discarded.
Lines containing only a comment are discarded completely,
and therefore do not indicate a record boundary.</p>

<p>
A record starts with one or more <code>User-agent</code>
lines, followed by one or more <code>Disallow</code> lines,
as detailed below. Unrecognised headers are ignored.</p>
<dl>
<dt>User-agent</dt>
<dd>

The value of this field is the name of the robot for which
the record describes the access policy.

<p>
If more than one User-agent field is present, the record
describes an identical access policy for more
than one robot. At least one field needs to be present
per record.</p>

<p>
The robot should be liberal in interpreting this field.
A case insensitive substring match of the name without
version information is recommended.</p>

<p>

If the value is '<code>*</code>', the record describes
the default access policy for any robot that has not
matched any of the other records. It is not allowed to
have two such records in the "<code>/robots.txt</code>"
file.</p></dd>

<dt>Disallow</dt>
<dd>

The value of this field specifies a partial URL that is not
to be visited. This can be a full path, or a partial
path; any URL that starts with this value will not be
retrieved. For example, <code>Disallow: /help</code>
disallows both <code>/help.html</code> and
<code>/help/index.html</code>, whereas
<code>Disallow: /help/</code> would disallow
<code>/help/index.html</code>
but allow <code>/help.html</code>.

<p>

An empty value indicates that all URLs can be
retrieved. At least one Disallow field needs to
be present in a record.</p></dd>

</dl>
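The prefix matching described for Disallow amounts to a simple string comparison. A sketch (Python here for brevity; not part of the standard):

```python
# Illustrative sketch, not part of the standard: a URL path is blocked
# if it starts with any non-empty Disallow value of the applicable record.
def allowed(path, disallows):
    return not any(d and path.startswith(d) for d in disallows)
```

With <code>Disallow: /help</code>, both <code>/help.html</code> and <code>/help/index.html</code> are blocked; with <code>Disallow: /help/</code>, only the latter is.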

The presence of an empty "<code>/robots.txt</code>" file
has no explicit associated semantics; it will be treated
as if it were not present, i.e. all robots will consider
themselves welcome.
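The record and comment rules above can be turned into a small parser. The following Python sketch is illustrative only; the function name and the representation of a record as a pair of lists are this example's own, not part of the standard:

```python
# Illustrative sketch, not part of the standard.
def parse_robots(text):
    """Parse a robots.txt body into a list of (user_agents, disallows) records."""
    records, agents, disallows = [], [], []
    for raw in text.splitlines():
        if raw.strip().startswith("#"):
            continue  # comment-only lines are discarded, no record boundary
        line = raw.split("#", 1)[0].strip()  # drop any trailing comment
        if not line:
            if agents:  # a blank line ends the current record
                records.append((agents, disallows))
                agents, disallows = [], []
            continue
        field, _, value = line.partition(":")
        field, value = field.strip().lower(), value.strip()
        if field == "user-agent":
            agents.append(value)
        elif field == "disallow":
            disallows.append(value)
        # unrecognised headers are ignored
    if agents:
        records.append((agents, disallows))
    return records
```

Note that a comment-only line does not end a record, but a genuinely blank line does, matching the rules above.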

<hr>

<h2><a name="examples">Examples</a></h2>

The following example "<code>/robots.txt</code>" file specifies
that no robots should visit any URL starting with
"<code>/cyberworld/map/</code>" or
"<code>/tmp/</code>":

<hr>
<pre>
# robots.txt for http://www.site.com/

User-agent: *
Disallow: /cyberworld/map/ # This is an infinite virtual URL space
Disallow: /tmp/ # these will soon disappear
</pre>
<hr>

This example "<code>/robots.txt</code>" file specifies
that no robots should visit any URL starting with
"<code>/cyberworld/map/</code>", except the robot called
"<code>cybermapper</code>":

<hr>
<pre>
# robots.txt for http://www.site.com/

User-agent: *
Disallow: /cyberworld/map/ # This is an infinite virtual URL space

# Cybermapper knows where to go.
User-agent: cybermapper
Disallow:
</pre>
<hr>

This example indicates that no robots should visit
this site further:

<hr>
<pre>
# go away
User-agent: *
Disallow: /
</pre>
<hr>
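A robot deciding which record in such a file applies to it can use the case insensitive substring match recommended in the Format section, falling back to the '<code>*</code>' record. A Python sketch (the representation of records as (agents, disallows) pairs is this example's own, not part of the standard):

```python
# Illustrative sketch, not part of the standard.
def pick_record(records, robot_name):
    """Return the (agents, disallows) record applying to robot_name, if any."""
    name = robot_name.lower()
    default = None
    for agents, disallows in records:
        for agent in agents:
            if agent == "*":
                default = (agents, disallows)  # at most one such record exists
            elif agent.lower() in name:  # case insensitive substring match
                return (agents, disallows)
    return default
```

For the second example file above, a robot identifying itself as "CyberMapper/1.0" would match the <code>cybermapper</code> record; any other robot would fall back to the '<code>*</code>' record.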

<h2><a name="code">Example Code</a></h2>

Although it is not part of this specification, some example code
in Perl is available in <a href="norobots.pl">norobots.pl</a>. It
is a bit more flexible in its parsing than this document
specifies, and is provided as-is, without warranty.

<h2><a name="author">Author's Address</a></h2>

<address>
<a href="/mak/mak.html">Martijn Koster</a><br>

NEXOR<br>
PO Box 132,<br>
Nottingham,<br>
The United Kingdom<br>
Phone: +44 602 520576
</address>
</body>
</html>