Consider the example below (from http://blakesmalltalkblog.dailymail.co.uk/robots.txt). There are two distinct groups for User-agent: *. Our strategy is to choose the most specific matching group. Since the two groups are equally specific, we fall back to choosing the first, so the path directives in the second group are ignored.
It's not entirely clear what the authors of this robots.txt intended, but I think they are under the impression that all matching groups are processed. The robots library applies at most one group, so the file doesn't work as they expect.
User-agent: *
Disallow: /t/trackback
Disallow: /t/comments
Disallow: /t/stats
Disallow: /t/app
Disallow: /.m/
# block against duplicate content
User-agent: *
Disallow: /*.html?cid=*
Disallow: /*/comments/page/*
Disallow: /*/comments/atom.xml
Disallow: /*/comments/rss.xml
Disallow: /*/comments/index.rdf
User-agent: Googlebot-Mobile
Allow: /.m/
Disallow: /
User-agent: Y!J-SRD
Allow: /.m/
Disallow: /
User-agent: Y!J-MBS
Allow: /.m/
Disallow: /
# block MSIE from abusing cache request
User-agent: Active Cache Request
Disallow: *
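To make the selection rule described above concrete, here is a minimal Python sketch of "most specific matching group, first one wins on a tie". It is not the robots library's actual code: `parse_groups`, `select_group`, the length-based specificity score, and the `ExampleBot` user agent are all assumptions made for illustration, and the input is a shortened copy of the file above.

```python
# Hypothetical sketch; not the robots library's real implementation.
ROBOTS = """\
User-agent: *
Disallow: /t/trackback
Disallow: /.m/
# block against duplicate content
User-agent: *
Disallow: /*.html?cid=*
User-agent: Googlebot-Mobile
Allow: /.m/
Disallow: /
"""


def parse_groups(text):
    """Split robots.txt text into (agents, rules) groups, in file order."""
    groups = []
    agents, rules = [], []
    collecting_agents = False
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments
        if ":" not in line:
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        field = field.lower()
        if field == "user-agent":
            # A User-agent line that follows other directives opens a new group.
            if not collecting_agents and agents:
                groups.append((agents, rules))
                agents, rules = [], []
            agents.append(value.lower())
            collecting_agents = True
        else:
            collecting_agents = False
            if field in ("allow", "disallow") and agents:
                rules.append((field, value))
    if agents:
        groups.append((agents, rules))
    return groups


def select_group(groups, user_agent):
    """Return the single most specific matching group, or None."""
    ua = user_agent.lower()
    best, best_score = None, -1
    for agents, rules in groups:
        for agent in agents:
            if agent == "*":
                score = 0  # wildcard is the least specific match
            elif agent in ua:
                score = len(agent)  # assumed measure of specificity
            else:
                continue
            if score > best_score:  # strict comparison: ties keep the earlier group
                best, best_score = (agents, rules), score
    return best


GROUPS = parse_groups(ROBOTS)
print(select_group(GROUPS, "ExampleBot/1.0"))
```

Under these assumptions, a crawler that only matches `*` gets the first wildcard group, and the "duplicate content" rules in the second wildcard group never apply to it.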
Here's another example, from http://sunshine-girls.net/robots.txt:
# This file was generated on Sun, 21 Sep 2014 21:20:00 +0000
# If you are regularly crawling WordPress.com sites, please use our firehose to receive real-time push updates instead.
# Please see http://en.wordpress.com/firehose/ for more details.
Sitemap: http://sunshine-girls.net/sitemap.xml
Sitemap: http://sunshine-girls.net/news-sitemap.xml
User-agent: IRLbot
Crawl-delay: 3600
User-agent: *
Disallow: /next/
User-agent: *
Disallow: /mshots/v1/
# har har
User-agent: *
Disallow: /activate/
User-agent: *
Disallow: /wp-login.php
User-agent: *
Disallow: /signup/
User-agent: *
Disallow: /related-tags.php
User-agent: *
Disallow: /public.api/
# MT refugees
User-agent: *
Disallow: /cgi-bin/
User-agent: *
Disallow: /wp-admin/
I'm not sure there is a coherent way to resolve this problem. One could combine the groups, but that feels like an awkward exception. If we combine two equally specific matching groups, then why don't we combine all matching groups? I'm inclined to ignore this until there is a clear need.
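For reference, the "combine equally specific groups" alternative could look roughly like the Python sketch below. This is only an illustration of the idea being weighed, not a proposal for the library's API: `merge_equally_specific`, the length-based specificity score, and the `ExampleBot` user agent are hypothetical, and the groups are a hand-written excerpt of the second file.

```python
# Hypothetical sketch of the "combine equally specific groups" alternative.
# Each group is (user-agent tokens, path rules), in file order.
GROUPS = [
    (["irlbot"], []),  # only carries Crawl-delay, so no path rules
    (["*"], [("disallow", "/next/")]),
    (["*"], [("disallow", "/mshots/v1/")]),
    (["*"], [("disallow", "/activate/")]),
]


def merge_equally_specific(groups, user_agent):
    """Concatenate the rules of every group that ties for most specific match."""
    ua = user_agent.lower()
    best_score, merged = -1, []
    for agents, rules in groups:
        # Wildcard scores 0; a named token scores its length (assumed measure).
        scores = [0 if a == "*" else len(a) for a in agents if a == "*" or a in ua]
        if not scores:
            continue  # this group does not match the crawler at all
        score = max(scores)
        if score > best_score:
            best_score, merged = score, list(rules)  # new winner replaces the rest
        elif score == best_score:
            merged.extend(rules)  # tie: append instead of ignoring
    return merged


print(merge_equally_specific(GROUPS, "ExampleBot/1.0"))
```

Note that less specific groups are still dropped once a more specific one matches, which is exactly the inconsistency raised above: the merge applies to ties only, not to all matching groups.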