Respect HTTP response header content-type #21

@hamishmorgan

When downloading robots.txt files, we currently ignore the HTTP response Content-Type header. This is generally consistent with other people's published interpretations of the protocol (e.g. https://developers.google.com/webmasters/control-crawl-index/docs/robots_txt), but maybe we can do better.

The content type contains both the mime-type of the data and (optionally) the character encoding. Whether and how we handle these two pieces of information can be considered separately:

Mime-type

The most obvious thing to do here is to discard all data that isn't sent as text/plain. We could handle this like other server-side errors: allow the whole domain, but try again in a day or two. The advantage is that we would save the bandwidth and processing time spent handling invalid data. The disadvantage is that we would be likely to disregard data that would otherwise have been parsable, because mime-type misconfiguration is common. Overall, I'm not convinced the advantage is worth the effort.
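
To make the trade-off concrete, here is a rough sketch of the two policies in Python. It is only an illustration, not a proposal for the actual implementation: `RobotsDecision`, `decide`, the `strict` flag and the one-day retry are all hypothetical names and values, and treating a missing header as acceptable is my own assumption.

```python
from dataclasses import dataclass
from datetime import timedelta
from typing import Optional


@dataclass
class RobotsDecision:
    parse_body: bool                  # hand the body to the robots.txt parser?
    allow_all: bool                   # treat the whole domain as allowed for now?
    retry_after: Optional[timedelta]  # when to re-fetch, if at all


def decide(content_type: Optional[str], strict: bool) -> RobotsDecision:
    """Decide what to do with a robots.txt response given its Content-Type."""
    # The mime-type is everything before the first ';', compared case-insensitively.
    mime = (content_type or "").split(";", 1)[0].strip().lower()
    # Assumption: a missing or empty header is not counted as a mismatch.
    if strict and mime and mime != "text/plain":
        # Strict policy: same handling as a server-side error.
        # One day is an arbitrary pick from "a day or two".
        return RobotsDecision(False, True, timedelta(days=1))
    # Lenient policy (current behaviour): try to parse whatever we received.
    return RobotsDecision(True, False, None)
```

With `strict=True`, a robots.txt served as text/html is never parsed, even if its body is perfectly valid, which is exactly the disadvantage described above.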

Character encoding

In cases where a character encoding is given, we could use that rather than the default (UTF-8). While this isn't required by the protocol, it would be a good-faith gesture to make a best effort at interpreting the site's wishes. There's no real disadvantage to doing this, other than developer time.
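
Something along these lines might be enough, sketched in Python for illustration only. `decode_robots_body` is a hypothetical helper; falling back to UTF-8 for an unknown charset and replacing undecodable bytes are assumptions on my part, not anything the protocol prescribes.

```python
import codecs
from email.message import Message
from typing import Optional


def decode_robots_body(body: bytes,
                       content_type: Optional[str] = None,
                       default: str = "utf-8") -> str:
    """Decode a robots.txt body using the declared charset when possible."""
    charset = None
    if content_type:
        # Reuse the standard library's header parameter parsing.
        msg = Message()
        msg["Content-Type"] = content_type
        charset = msg.get_content_charset()  # lowercased, or None if absent
    encoding = charset or default
    try:
        codecs.lookup(encoding)
    except LookupError:
        # Assumption: an unknown charset falls back to the default.
        encoding = default
    # Replace undecodable bytes so a bad byte never aborts the whole parse.
    return body.decode(encoding, errors="replace")
```

For example, a body sent as `text/plain; charset=ISO-8859-1` would be decoded as Latin-1 instead of being mangled by a UTF-8 decode.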
