Respect HTTP response header content-type

When downloading `robots.txt` files, we ignore the HTTP response `Content-Type` header. This is generally consistent with other people's published interpretations of the protocol (e.g. https://developers.google.com/webmasters/control-crawl-index/docs/robots_txt), but maybe we can do better.

The content type contains both the MIME type of the data and (optionally) the character encoding. Whether and how we handle each of these can be considered separately:

#### MIME type

The most obvious thing to do here is to discard all data that isn't sent as `text/plain`. We could handle this like other server-side errors: _allow_ the whole domain, but try again in a day or two. The advantage is that we save the bandwidth and processing time otherwise spent handling invalid data. The disadvantage is that we are likely to disregard data that would otherwise have been parsable, because MIME type misconfiguration is common. Overall, I'm not convinced the advantage is worth the effort.
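
To make the trade-off concrete, here is a minimal Python sketch of the two policies discussed above. The names `media_type` and `should_parse` and the `strict` flag are hypothetical and not part of this project. Treating a missing header as acceptable is also an assumption made for illustration.

```python
from typing import Optional


def media_type(content_type: Optional[str]) -> Optional[str]:
    """Return the lower-cased media type of a Content-Type value, or None."""
    if not content_type:
        return None
    # Drop any parameters such as "; charset=utf-8".
    return content_type.split(";", 1)[0].strip().lower() or None


def should_parse(content_type: Optional[str], strict: bool = False) -> bool:
    """Decide whether a robots.txt response body should be parsed.

    strict=True  : only accept text/plain (the "discard everything else" option).
    strict=False : parse regardless of the declared type (current behaviour).
    """
    declared = media_type(content_type)
    if declared is None:
        # Assumption: a missing or empty header is not treated as an error.
        return True
    if strict:
        return declared == "text/plain"
    return True
```

With `strict=True`, a response declared as `text/html` would be rejected and could then be handled like a server-side error. With the default, it is still parsed, which tolerates misconfigured servers.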
#### Character encoding

In cases where a character encoding is given, we could use that rather than the default (UTF-8). While this isn't required by the protocol, it would be a good-faith gesture to make a best effort at interpreting the site's wishes. There's no real disadvantage to doing this, other than developer time.
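
As a rough illustration of the idea, the following Python sketch decodes a response body using the declared charset when there is one and falls back to UTF-8 otherwise. The function name `decode_robots_txt` is hypothetical, and the fallback order (declared charset, then UTF-8, then UTF-8 with replacement characters) is an assumption rather than something the protocol specifies.

```python
from email.message import Message
from typing import Optional


def decode_robots_txt(body: bytes, content_type: Optional[str]) -> str:
    """Decode a robots.txt body, preferring the charset from Content-Type."""
    charset = None
    if content_type:
        # Let the standard library parse the header parameters.
        msg = Message()
        msg["Content-Type"] = content_type
        charset = msg.get_content_charset()  # lower-cased, or None if absent

    for encoding in (charset, "utf-8"):
        if encoding is None:
            continue
        try:
            return body.decode(encoding)
        except (LookupError, UnicodeDecodeError):
            # Unknown charset name, or bytes not valid in that encoding.
            continue

    # Last resort: keep whatever is readable rather than failing outright.
    return body.decode("utf-8", errors="replace")
```

For example, `decode_robots_txt(b"caf\xe9", "text/plain; charset=ISO-8859-1")` decodes the body as Latin-1, while the same bytes with no header are not valid UTF-8 and end up with a replacement character.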


Respect HTTP response header content-type #21
