Identical matching user-agent groups. #16

@hamishmorgan

Consider the example below (from http://blakesmalltalkblog.dailymail.co.uk/robots.txt). There are two distinct groups for User-agent: *. Our strategy is to choose the most specific matching group. Since the groups are equally specific, we fall back to choosing the first, so the path directives in the second group are ignored.

It's not entirely clear what the authors of this robots.txt intended, but I think they are under the impression that all matching groups are processed. The robots library applies at most one group, so the file doesn't work as they expect.

User-agent: *
Disallow: /t/trackback
Disallow: /t/comments
Disallow: /t/stats
Disallow: /t/app
Disallow: /.m/

# block against duplicate content
User-agent: *
Disallow: /*.html?cid=*
Disallow: /*/comments/page/*
Disallow: /*/comments/atom.xml
Disallow: /*/comments/rss.xml
Disallow: /*/comments/index.rdf

User-agent: Googlebot-Mobile
Allow: /.m/
Disallow: /

User-agent: Y!J-SRD
Allow: /.m/
Disallow: /

User-agent: Y!J-MBS
Allow: /.m/
Disallow: /

# block MSIE from abusing cache request
User-agent: Active Cache Request
Disallow: *
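
To make the behaviour concrete, here is a minimal Python sketch of the selection strategy described above. It is not the library's actual code: the `Group` tuple, `specificity`, and `select_group` are hypothetical names, and the scoring rule (wildcard scores 0, otherwise the length of a case-insensitively matched token) is an assumption made only for illustration.

```python
from typing import List, Optional, Tuple

# Hypothetical model of a group: (user-agent token, disallowed path patterns).
Group = Tuple[str, List[str]]


def specificity(token: str, agent: str) -> int:
    """Assumed scoring rule: -1 for no match, 0 for '*', else the token length."""
    if token == "*":
        return 0
    return len(token) if token.lower() in agent.lower() else -1


def select_group(groups: List[Group], agent: str) -> Optional[Group]:
    """Pick the single most specific matching group; ties go to the first one."""
    best: Optional[Group] = None
    best_score = -1
    for group in groups:
        score = specificity(group[0], agent)
        # Strictly greater, so an equally specific later group never replaces
        # the earlier one. This is why the second '*' group is ignored.
        if score > best_score:
            best, best_score = group, score
    return best


groups: List[Group] = [
    ("*", ["/t/trackback", "/t/comments", "/t/stats", "/t/app", "/.m/"]),
    ("*", ["/*.html?cid=*", "/*/comments/page/*"]),
    ("Googlebot-Mobile", ["/"]),
]

# "ExampleBot" is a made-up agent that only matches the wildcard groups.
chosen = select_group(groups, "ExampleBot/1.0")
```

Under these assumptions `chosen` is the first `*` group, so a rule such as `/*.html?cid=*` from the second group is never consulted.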

Here's another example, from http://sunshine-girls.net/robots.txt:

# This file was generated on Sun, 21 Sep 2014 21:20:00 +0000
# If you are regularly crawling WordPress.com sites, please use our firehose to receive real-time push updates instead.
# Please see http://en.wordpress.com/firehose/ for more details.

Sitemap: http://sunshine-girls.net/sitemap.xml
Sitemap: http://sunshine-girls.net/news-sitemap.xml

User-agent: IRLbot
Crawl-delay: 3600

User-agent: *
Disallow: /next/

User-agent: *
Disallow: /mshots/v1/

# har har
User-agent: *
Disallow: /activate/

User-agent: *
Disallow: /wp-login.php

User-agent: *
Disallow: /signup/

User-agent: *
Disallow: /related-tags.php

User-agent: *
Disallow: /public.api/

# MT refugees
User-agent: *
Disallow: /cgi-bin/

User-agent: *
Disallow: /wp-admin/

I'm not sure there is a coherent way to resolve this problem. One could combine the groups, but that feels like an awkward exception. If we combine two equally specific matching groups, then why don't we combine all matching groups? I'm inclined to ignore this until there is a clear need.
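
For reference, the "combine equally specific groups" option could look roughly like the Python sketch below. This is only an illustration of the idea, not a proposal for the library's API: `match_score` and `merged_rules` are hypothetical names, and the scoring rule (wildcard scores 0, otherwise the length of a case-insensitively matched token) is an assumption.

```python
from typing import List, Tuple

# Hypothetical model of a group: (user-agent token, disallowed path patterns).
Group = Tuple[str, List[str]]


def match_score(token: str, agent: str) -> int:
    """Assumed scoring rule: -1 for no match, 0 for '*', else the token length."""
    if token == "*":
        return 0
    return len(token) if token.lower() in agent.lower() else -1


def merged_rules(groups: List[Group], agent: str) -> List[str]:
    """Concatenate the rules of every group tied for the best score."""
    scores = [match_score(token, agent) for token, _ in groups]
    best = max(scores, default=-1)
    if best < 0:
        return []  # no group matches this agent at all
    merged: List[str] = []
    for (_, rules), score in zip(groups, scores):
        # Only groups tied at the top score are merged; less specific
        # matching groups are still ignored.
        if score == best:
            merged.extend(rules)
    return merged


# A shortened version of the second example; IRLbot only sets a crawl delay.
groups: List[Group] = [
    ("IRLbot", []),
    ("*", ["/next/"]),
    ("*", ["/mshots/v1/"]),
    ("*", ["/activate/"]),
]

# "ExampleBot" is a made-up agent that only matches the wildcard groups.
rules = merged_rules(groups, "ExampleBot/1.0")
```

With this approach the made-up agent sees all three wildcard rules, while `IRLbot` still gets only its own group and none of the `*` rules, which is exactly the asymmetry questioned above.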
