Since robots.txt often lists interesting pages, the extractor script should make sure to fetch them recursively as well (they could simply be downloaded with another `wget -r`, which would guarantee they end up in the archive if they exist). At this point it might be worth moving the whole "get entire website" logic into a separate script file, or at least into its own function. A rough sketch of the idea is below.
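A minimal sketch of what that could look like, assuming the site's base URL is passed as the first argument and that plain `Allow:`/`Disallow:` path prefixes are enough to seed the extra crawl. All variable names here are illustrative, not part of the existing script:

```bash
#!/usr/bin/env bash
# Sketch: fetch robots.txt, pull out path prefixes, and mirror each one with wget -r.
# Assumes $1 is the base URL (e.g. https://example.com); names are illustrative only.
set -euo pipefail

base_url="${1%/}"

# Download robots.txt quietly; ignore failure so a missing file doesn't abort the run.
robots="$(wget -q -O - "$base_url/robots.txt" || true)"

# Extract the path part of every Allow:/Disallow: line, dropping comments and blanks.
echo "$robots" \
  | sed 's/#.*//' \
  | grep -iE '^[[:space:]]*(allow|disallow):' \
  | sed -E 's/^[[:space:]]*[^:]*:[[:space:]]*//' \
  | grep -v '^[[:space:]]*$' \
  | sort -u \
  | while read -r path; do
      # Recursively fetch each referenced path; wget just 404s if it doesn't exist.
      wget -r -np -q "$base_url$path" || true
    done
```

Keeping this in its own function (or file) would also make it easy to reuse the same logic for other "seed" files such as sitemaps later.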
Note that some robots.txt files are more complex than the common ones. Examples:

- https://www.facebook.com/robots.txt (contains comments, multiple user-agents, both Allow and Disallow rules)
- https://www.google.com/robots.txt (also mixes Allow and Disallow, has Sitemap references, and uses wildcards and special characters: `*`, `?`, `=`, `$`)
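A hedged sketch of how those messier cases could be normalised before handing anything to wget. It deliberately ignores user-agent grouping (rules from every group are treated as candidate paths), and the file names (`robots.txt` on disk, `sitemaps.txt`, `paths.txt`) are placeholders, not anything the current script produces:

```bash
#!/usr/bin/env bash
# Sketch: normalise a complex robots.txt (comments, Sitemap lines, wildcards)
# into plain path prefixes and sitemap URLs. Names are illustrative only.
set -euo pipefail

robots_file="robots.txt"

# 1. Drop comments (whole-line and inline) before parsing anything.
cleaned="$(sed 's/#.*//' "$robots_file")"

# 2. Sitemap entries are full URLs and can be fetched directly later.
echo "$cleaned" \
  | grep -iE '^[[:space:]]*sitemap:' \
  | sed -E 's/^[[:space:]]*[^:]*:[[:space:]]*//' \
  > sitemaps.txt

# 3. Allow/Disallow paths may contain wildcards (* ? $); cut each path at the
#    first special character so what remains is a plain prefix wget can fetch.
echo "$cleaned" \
  | grep -iE '^[[:space:]]*(allow|disallow):' \
  | sed -E 's/^[[:space:]]*[^:]*:[[:space:]]*//' \
  | sed -E 's/[*$?].*$//' \
  | grep -v '^[[:space:]]*$' \
  | sort -u \
  > paths.txt
```

Truncating at the first wildcard is a lossy but simple choice: a pattern like `/search*` still yields `/search` as a usable starting point for a recursive fetch.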
Sometimes, downloading robots.txt returns the same content as index.html (or potentially other pages), so the file should be validated before the script starts working with it.
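One possible validation, sketched under the assumption that the downloaded `robots.txt` and `index.html` sit in the current directory; the function name is just illustrative:

```bash
#!/usr/bin/env bash
# Sketch: sanity-check a downloaded robots.txt before parsing it.
set -euo pipefail

looks_like_robots() {
  local file="$1"
  # Reject files that are byte-identical to index.html (server returned the same page).
  if [ -f index.html ] && cmp -s "$file" index.html; then
    return 1
  fi
  # Reject anything that looks like an HTML document rather than a robots file.
  if grep -qiE '<(html|!doctype)' "$file"; then
    return 1
  fi
  # Require at least one directive a real robots.txt would contain.
  grep -qiE '^[[:space:]]*(user-agent|allow|disallow|sitemap|crawl-delay):' "$file"
}

if looks_like_robots robots.txt; then
  echo "robots.txt looks valid, continuing"
else
  echo "robots.txt does not look like a real robots file, skipping" >&2
fi
```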