Description
When downloading robots.txt files, we are ignoring the HTTP response Content-Type header. This is generally consistent with other people's published interpretations of the protocol (e.g. https://developers.google.com/webmasters/control-crawl-index/docs/robots_txt), but maybe we can do better.
The content type contains both the mime-type of the data and (optionally) the character encoding. Whether and how we handle each of these can be considered separately:
Mime-type
The most obvious thing to do here is to discard any data that isn't sent as text/plain. We could handle this like other server-side errors: allow the whole domain, but try again in a day or two. The advantage of doing this is that we would save bandwidth and processing time by not handling invalid data. The disadvantage is that we would likely disregard data that would otherwise have been parsable, because mime-type misconfiguration is common. Overall, I'm not convinced the advantage is worth the effort.
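To make the trade-off concrete, a minimal sketch of the check in Python (function names are illustrative, not from this codebase; it uses the stdlib's header parsing to strip any charset parameter before comparing):

```python
from email.message import Message

def media_type(content_type_header: str) -> str:
    """Extract the bare media type from a Content-Type header value,
    ignoring parameters such as charset."""
    msg = Message()
    msg["Content-Type"] = content_type_header
    return msg.get_content_type()

def should_discard(content_type_header: str) -> bool:
    # Treat anything other than text/plain like a server-side error:
    # the caller would allow the whole domain and retry in a day or two.
    return media_type(content_type_header) != "text/plain"
```

Note that a misconfigured server sending a valid robots.txt as text/html would be rejected by this check, which is exactly the downside described above.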
Character encoding
In cases where a character encoding is given, we could use it rather than the default (UTF-8). While this isn't required by the protocol, it would be a good-faith gesture to make a best effort at interpreting the site's wishes. There's no real disadvantage to doing this, other than developer time.
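A sketch of what honoring the charset parameter could look like (again hypothetical, using only the stdlib; falls back to UTF-8 when the charset is absent or unrecognized):

```python
from email.message import Message

def decode_robots(body: bytes, content_type_header: str) -> str:
    """Decode a robots.txt body using the charset from the Content-Type
    header, falling back to the UTF-8 default when it is missing or bogus."""
    msg = Message()
    msg["Content-Type"] = content_type_header
    charset = msg.get_content_charset() or "utf-8"
    try:
        return body.decode(charset, errors="replace")
    except LookupError:
        # Unknown encoding name advertised by the server; use the default.
        return body.decode("utf-8", errors="replace")
```

The `errors="replace"` choice keeps the parser from crashing on mojibake, at the cost of substituting replacement characters for undecodable bytes.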