From ec82fe756771f9d15d331b2827d72d2e273edc2e Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Finn=20=C3=85rup=20Nielsen?=
Date: Tue, 30 Nov 2021 08:50:44 +0100
Subject: [PATCH] Change robots.txt #1709

We had an old robots.txt for the wmflabs domain, where Scholia was served
from a subdirectory. As the Toolforge Scholia domain changed to
scholia.toolforge.org, the path in robots.txt was no longer correct and
the file should thus have been ineffective.

Search engines index dynamic content on Scholia pages differently: Bing
and Qwant seem to index the content, but DuckDuckGo and Google apparently
do not, see #1709.

With this change, not only is the path corrected, but bots are now
allowed. If this results in too much load on the Toolforge infrastructure,
then it should be changed to a 'Disallow: /'. Note that the 'robots' HTML
meta tag on each Scholia page has a nofollow to avoid crawling.
---
 scholia/app/views.py | 24 ++++++++++++++++++++++--
 1 file changed, 22 insertions(+), 2 deletions(-)

diff --git a/scholia/app/views.py b/scholia/app/views.py
index 707e139b2..f0e5d0fbc 100644
--- a/scholia/app/views.py
+++ b/scholia/app/views.py
@@ -1984,14 +1984,34 @@ def show_publisher_empty():
 
 
 def show_robots_txt():
     """Return robots.txt file.
 
+    A robots.txt file is returned that allows bots to index Scholia.
+
     Returns
     -------
     response : flask.Response
-        Rendered HTML for publisher index page.
+        Rendered plain text with robots.txt content.
+
+    Notes
+    -----
+    The default robots.txt for Toolforge-hosted tools is
+
+        User-agent: *
+        Disallow: /
+
+    Scholia's function returns a robots.txt with 'Allow' for all. We would
+    like bots to index, but not crawl, Scholia. Crawling is also controlled
+    by the HTML meta tag 'robots' that is set to the content 'noindex,
+    nofollow' on all pages. So Scholia's robots.txt is:
+
+        User-agent: *
+        Allow: /
+
+    If this results in too much crawling or load on the Toolforge
+    infrastructure then it should be changed.
     """
     ROBOTS_TXT = ('User-agent: *\n'
-                  'Disallow: /scholia/\n')
+                  'Allow: /\n')
     return Response(ROBOTS_TXT, mimetype="text/plain")
 
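
As a quick illustration of the new behaviour (not part of the patch): a
minimal test sketch that exercises the /robots.txt endpoint with Flask's
built-in test client. The `create_app` factory name in `scholia.app` is an
assumption here; adjust to however the project actually constructs the app.

    # Minimal sketch, not part of the patch. Assumes `scholia.app` exposes
    # a Flask app factory named `create_app`; that name is hypothetical.
    from scholia.app import create_app


    def test_robots_txt_allows_indexing():
        """The endpoint should serve 'Allow: /' as plain text."""
        app = create_app()
        client = app.test_client()

        response = client.get('/robots.txt')

        assert response.status_code == 200
        assert response.mimetype == 'text/plain'
        assert b'User-agent: *' in response.data
        assert b'Allow: /' in response.data
        # The old wmflabs-era subdirectory rule must be gone.
        assert b'Disallow: /scholia/' not in response.data

Run with pytest; such a test would also catch a regression if the file is
later switched back to 'Disallow: /' without updating the docstring.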