Add User-Agent in link checking #3304
Comments
So I still recommend this - because I was able to prove this was a blocking case using But, for some reason, running this with the given configuration still fails in a
I took
$ docker exec -it megalinter markdown-link-check -q -v -c /tmp/lint/.config/linters/.markdown-link-check.json /tmp/lint/.github/SUPPORT.md
[✖] https://meta.stackexchange.com/questions/66377/what-is-the-xy-problem/66378#66378 → Status: 403
ERROR: 1 dead links found in /tmp/lint/.github/SUPPORT.md !
[✖] https://meta.stackexchange.com/questions/66377/what-is-the-xy-problem/66378#66378 → Status: 403
But running the exact same command in the project outside of the container is successful:
$ npx -y markdown-link-check -q -v -c .config/linters/.markdown-link-check.json .github/SUPPORT.md
(node:45774) [DEP0040] DeprecationWarning: The `punycode` module is deprecated. Please use a userland alternative instead.
(Use `node --trace-deprecation ...` to show where the warning was created)
# (No errors)
Although I still believe this belongs in
And just for proactivity's sake:
$ docker exec -it megalinter markdown-link-check --version
3.11.2
$ npx markdown-link-check --version
(node:43301) [DEP0040] DeprecationWarning: The `punycode` module is deprecated. Please use a userland alternative instead.
(Use `node --trace-deprecation ...` to show where the warning was created)
3.11.2
Tagging @tcort in case they have any thoughts. 🤔 |
More progress - and I think this may actually need to be either moved or duplicated to https://github.com/tcort/markdown-link-check now, because of what I found. I wanted to completely rule out the idea that running in a container itself was the issue, so I followed the https://github.com/tcort/markdown-link-check directions on running
$ docker run -v ${PWD}:/tmp:ro --rm -i ghcr.io/tcort/markdown-link-check:stable -q -v -c /tmp/.config/linters/.markdown-link-check.json /tmp/.github/SUPPORT.md
[✖] https://meta.stackexchange.com/questions/66377/what-is-the-xy-problem/66378#66378 → Status: 403
ERROR: 1 dead links found in /tmp/.github/SUPPORT.md !
[✖] https://meta.stackexchange.com/questions/66377/what-is-the-xy-problem/66378#66378 → Status: 403
...but it didn't. It also fails in a much simpler environment:
$ docker run -it -v ${PWD}:/tmp:ro --rm node npx -y markdown-link-check -q -v -c /tmp/.config/linters/.markdown-link-check.json /tmp/.github/SUPPORT.md
(node:19) [DEP0040] DeprecationWarning: The `punycode` module is deprecated. Please use a userland alternative instead.
(Use `node --trace-deprecation ...` to show where the warning was created)
[✖] https://meta.stackexchange.com/questions/66377/what-is-the-xy-problem/66378#66378 → Status: 403
ERROR: 1 dead links found in /tmp/.github/SUPPORT.md !
[✖] https://meta.stackexchange.com/questions/66377/what-is-the-xy-problem/66378#66378 → Status: 403
It seems that the issue is that the user-agent (or something) is not working, but only when run within a Docker container (or something specific about how both the MegaLinter and this Docker container are built/configured).
Interestingly,
# Works fine locally...
$ curl -A "Mozilla/5.0 (compatible; markdown-link-check/3.11.2; MegaLinter/7.7.0; +https://megalinter.io)" -I "https://meta.stackexchange.com/questions/66377/what-is-the-xy-problem/66378#66378"
HTTP/2 200
date: Sun, 21 Jan 2024 21:44:16 GMT
# etc...
# Fails on the basic `node` image....
$ docker run -it --rm node curl -A "Mozilla/5.0 (compatible; markdown-link-check/3.11.2; MegaLinter/7.7.0; +https://megalinter.io)" -I "https://meta.stackexchange.com/questions/66377/what-is-the-xy-problem/66378#66378"
HTTP/2 403
date: Sun, 21 Jan 2024 21:44:50 GMT
# etc...
# And even fails on the base `alpine` image...
$ docker run -it --rm alpine sh -c 'apk update -q; apk add -q curl; curl -A "Mozilla/5.0 (compatible; markdown-link-check/3.11.2; MegaLinter/7.7.0; +https://megalinter.io)" -I "https://meta.stackexchange.com/questions/66377/what-is-the-xy-problem/66378#66378"'
HTTP/2 403
date: Sun, 21 Jan 2024 21:54:03 GMT
# etc....
I'm going to do some more digging to see if it's an issue with a commonality, Docker, or otherwise, but this is a deeper issue than I expected. Unfortunately,
To be clear - don't close this issue. Adding the user-agent above is still a very, very good idea. This is indicative of a secondary problem. |
More updates... it works fine with
# Local works fine...
$ wget --server-response -U "Mozilla/5.0 (compatible; markdown-link-check/3.11.2; MegaLinter/7.7.0; +https://megalinter.io)" https://meta.stackexchange.com/questions/66377/what-is-the-xy-problem/66378#66378
--2024-01-21 16:55:55-- https://meta.stackexchange.com/questions/66377/what-is-the-xy-problem/66378
Resolving meta.stackexchange.com (meta.stackexchange.com)... 172.64.144.30, 104.18.43.226
Connecting to meta.stackexchange.com (meta.stackexchange.com)|172.64.144.30|:443... connected.
HTTP request sent, awaiting response...
HTTP/1.1 200 OK
# Alpine fails out of the box
$ docker run -it --rm alpine sh -c 'wget --server-response -U "Mozilla/5.0 (compatible; markdown-link-check/3.11.2; MegaLinter/7.7.0; +https://megalinter.io)" https://meta.stackexchange.com/questions/66377/what-is-the-xy-problem/66378#66378'
Connecting to meta.stackexchange.com (172.64.144.30:443)
HTTP/1.1 403 Forbidden
wget: server returned error: HTTP/1.1 403 Forbidden
# But Alpine works fine if we reinstall wget...
$ docker run -it --rm alpine sh -c 'apk update -q; apk add -q wget; wget --server-response -U "Mozilla/5.0 (compatible; markdown-link-check/3.11.2; MegaLinter/7.7.0; +https://megalinter.io)" https://meta.stackexchange.com/questions/66377/what-is-the-xy-problem/66378#66378'
--2024-01-21 22:03:56-- https://meta.stackexchange.com/questions/66377/what-is-the-xy-problem/66378
Resolving meta.stackexchange.com (meta.stackexchange.com)... 172.64.144.30, 104.18.43.226
Connecting to meta.stackexchange.com (meta.stackexchange.com)|172.64.144.30|:443... connected.
HTTP request sent, awaiting response...
HTTP/1.1 200 OK
# Ubuntu works fine...
$ docker run -it --rm --entrypoint /bin/sh ubuntu -c 'apt update -qq; apt install -y -qq wget; wget --server-response -U "Mozilla/5.0 (compatible; markdown-link-check/3.11.2; MegaLinter/7.7.0; +https://megalinter.io)" https://meta.stackexchange.com/questions/66377/what-is-the-xy-problem/66378#66378'
# ...
Connecting to meta.stackexchange.com (meta.stackexchange.com)|104.18.43.226|:443... connected.
HTTP request sent, awaiting response...
HTTP/1.1 200 OK
# Not Megalinter (which is alpine-based)
$ docker run --entrypoint /bin/bash -it --rm oxsecurity/megalinter-python:v7.7.0 -c 'wget --server-response -U "Mozilla/5.0 (compatible; markdown-link-check/3.11.2; MegaLinter/7.7.0; +https://megalinter.io)" https://meta.stackexchange.com/questions/66377/what-is-the-xy-problem/66378#66378'
Connecting to meta.stackexchange.com (104.18.43.226:443)
HTTP/1.1 403 Forbidden
wget: server returned error: HTTP/1.1 403 Forbidden
# Nor markdown-link-check (which is also alpine-based)
$ docker run -it --rm --entrypoint /bin/sh ghcr.io/tcort/markdown-link-check:stable -c 'wget --server-response -U "Mozilla/5.0 (compatible; markdown-link-check/3.11.2; MegaLinter/7.7.0; +https://megalinter.io)" https://meta.stackexchange.com/questions/66377/what-is-the-xy-problem/66378#66378'
Connecting to meta.stackexchange.com (172.64.144.30:443)
HTTP/1.1 403 Forbidden
wget: server returned error: HTTP/1.1 403 Forbidden
So what about versions?
# Works fine
$ wget --version
GNU Wget 1.21.4 built on darwin22.4.0.
# Fails
$ docker run -it --rm alpine sh -c 'which wget; wget'
/usr/bin/wget
BusyBox v1.36.1 (2023-11-07 18:53:09 UTC) multi-call binary.
# Works fine
$ docker run -it --rm alpine sh -c 'apk update -q; apk add -q wget; which wget; wget --version'
/usr/bin/wget
GNU Wget 1.21.4 built on linux-musl.
# Works fine
$ docker run -it --rm --entrypoint /bin/sh ubuntu -c 'apt update -qq; apt install -y -qq wget; which wget; wget --version'
/usr/bin/wget
GNU Wget 1.21.2 built on linux-gnu.
# Fails
$ docker run --entrypoint /bin/bash -it --rm oxsecurity/megalinter-python:v7.7.0 -c 'which wget; wget'
/usr/bin/wget
BusyBox v1.36.1 (2023-11-06 11:32:24 UTC) multi-call binary.
So that's interesting... it seems that the default https://github.com/mirror/busybox bundle is the common point of failure in these images. I wonder if this could be solved simply by adding a proper |
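The version checks above suggest a quick way to tell which `wget` an image ships. A minimal sketch, assuming the banner formats shown in this thread; `classify_wget` and `wget_banner` are hypothetical helper names, and the sketch assumes BusyBox answers an unrecognized `--version` with the same usage text (banner included) that it prints when run with no arguments:

```shell
#!/bin/sh
# classify_wget: hypothetical helper that labels a wget banner line.
# It only matches the banner prefixes seen in the output above.
classify_wget() {
  case "$1" in
    "GNU Wget"*) echo "gnu" ;;
    "BusyBox"*)  echo "busybox" ;;
    *)           echo "unknown" ;;
  esac
}

# wget_banner: hypothetical helper that extracts the banner line from the
# local wget. stderr is merged into stdout because BusyBox prints its
# usage text there; grep keeps only a line that looks like a banner.
wget_banner() {
  wget --version 2>&1 | grep -E '^(GNU Wget|BusyBox)' | head -n 1
}

# Usage (not run automatically):
#   classify_wget "$(wget_banner)"
```

Running `classify_wget "$(wget_banner)"` inside each image gives a one-word answer that is easier to compare than the full banners.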
What an investigation @andrewvaughan :D it seems markdown-links-check
Maybe needle behaves differently depending on environment variables? One thing to check would be to expose a mock service and log the calls made inside and outside Docker to see the differences :) |
Lol you should see the comment I was half-way through writing... I have gone to the depths of the dependency stack. I am weary and tired, but I bear the fruits of my labor:
Bear with me friends, because this is where my soul started tearing apart. The code was a jungle.
Which brought me to the Node.js core source-code with even LESS documentation...
...that's about as far as I got |
Narrowed it down:
$ docker run --entrypoint /bin/bash -it --rm oxsecurity/megalinter-python:v7.7.0
# curl --no-alpn -I -A "Mozilla/5.0 (compatible; markdown-link-check/3.11.2; MegaLinter/7.7.0; +https://megalinter.io)" "https://meta.stackexchange.com/questions/66377/what-is-the-xy-problem/66378#66378"
HTTP/1.1 200 OK
# curl -I -A "Mozilla/5.0 (compatible; markdown-link-check/3.11.2; MegaLinter/7.7.0; +https://megalinter.io)" "https://meta.stackexchange.com/questions/66377/what-is-the-xy-problem/66378#66378"
HTTP/2 403
There must be something either about the
Although interesting, forcing
# curl --http1.1 -I -A "Mozilla/5.0 (compatible; markdown-link-check/3.11.2; MegaLinter/7.7.0; +https://megalinter.io)" "http
HTTP/1.1 403 Forbidden
Now why these remote hosts are allowing non-ALPN traffic through but blocking ALPN traffic is an interesting question.
Edit: I've answered this - every response came back with a
I will move the remainder of this discovery over to an issue on the markdown-link-checker, but I would definitely consider adding a unique UserAgent to Megalinter with this Issue - it will help prevent the default |
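The curl runs above differ only in their protocol-negotiation flags (ALPN is the TLS extension a client uses to offer HTTP/2 or HTTP/1.1 during the handshake). A small sketch for running the three variants side by side, using only standard curl options; `probe_status` and `compare_alpn` are hypothetical helper names, not part of curl or MegaLinter:

```shell
#!/bin/sh
# probe_status: hypothetical helper; prints "<http_version> <status>" for a
# HEAD request. Extra curl flags (e.g. --no-alpn) are passed through as-is.
probe_status() {
  url="$1"
  ua="$2"
  shift 2
  curl -s -o /dev/null -I -A "$ua" \
    -w '%{http_version} %{http_code}\n' "$@" "$url"
}

# compare_alpn: hypothetical helper; runs the same request three ways so
# the negotiated protocol and status code line up for comparison.
compare_alpn() {
  printf 'default: %s\n' "$(probe_status "$1" "$2")"
  printf 'no-alpn: %s\n' "$(probe_status "$1" "$2" --no-alpn)"
  printf 'http1.1: %s\n' "$(probe_status "$1" "$2" --http1.1)"
}

# Usage (not run automatically):
#   compare_alpn "https://meta.stackexchange.com/questions/66377/what-is-the-xy-problem/66378" \
#     "Mozilla/5.0 (compatible; markdown-link-check/3.11.2; MegaLinter/7.7.0; +https://megalinter.io)"
```

Running `compare_alpn` both on the host and inside the container shows which combination of negotiated protocol and environment is being rejected.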
Wow, what a pleasure to read the whole thinking process. Near the beginning of the thread, I was thinking of trying a Debian/Debian slim/Ubuntu container too. To make sure I'm not hitting some particular difference in musl-based packages, it's always good to check whether it works without musl. In the last couple of years of still being subscribed to notifications on the node-red docker repo, you'd be surprised by the frequency of weird behaviours that don't happen with a Debian-based base image (as they have both). I'd never have thought of going as deep as you did; you even taught me a new word, ALPN!
As for the user agent, I have three contradicting opinions. On one side, it is reasonable, and your explanations correctly justify the need to have a user agent. On another side, shouldn't it be a user agent for the linter rather than MegaLinter? While you are talking specifically about markdown-link-checker, the linter I struggle with a bit more is lychee. That brings me to the third competing opinion: some sites answer completely differently depending on the user agent. Wink wink, SourceForge. Even though I had already found this out myself before, it was apparent when working with a winget definition for a new software version, where the download URLs work only in specific cases. (Luckily they have an arrangement so their CI works better than locally.) But these differences came back at the beginning of the introduction of the lychee linter, before getting stuff smoothed out. So here, sometimes having the generic, most common user agent is the only way to get a (badly) configured website to work at all. So I can't decide yet what will weigh more in the balance. |
Thanks for the kind words! Per your concerns on the UA - you're 100% on point. That's why I particularly recommended the pattern of UA that I did. There's a link I put above with best-practices on generating UAs. Most "crawlers" literally put "crawler/2.2.2" which can be problematic, if only because some lazy admins block "everything not standard," which was never intended for UAs. That's where marking a
The UA format is
As such, nearly all UAs for browsers are:
With some level of standardization in what the
However! That comment can technically be anything - and there actually is a better pattern for applications that meet the "requirements" of a browser standard but make use of it in a different way; for example:
This is a great pattern, because it both informs the server as to what standard can be managed and allows for fine-tune bot management by administrators. Maybe someone wants to block all of MegaLinter - maybe just link checkers. Maybe just particular, problematic versions. It's their choice in this format with some simple string-matching. So you end up with something like the recommendation above, or, for something more simple, the following:
The reference URL at the end is also super helpful - coming from an admin, if I were to start seeing this new UA appear out of everywhere, my first reaction would be to block it. A responsible admin, however, will check the reference to see what its purpose is and determine whether or not it is nefarious for the purposes of the applications. Systems like Web Application Firewalls learn from this, and you might even start to see
Unfortunately, without any specification, you end up with the default for whatever the linters are - or sometimes no UA at all. For
Now, imagine how many people have probably used the
So, for me - I think the question is whether the responsibility of setting an appropriate UA is for the tool or the tool container. I lean toward the argument that the UA should always represent the technology closest to the end user (in this case, MegaLinter, being the utility I chose to incorporate into my project, not necessarily the specific linter), so I would prefer my UA to represent
This is just me thinking out loud, but that's my $0.00002 on the issue!
Edit: I realized I didn't touch on a concern - there's always the default argument to just "copy/paste" a "known working" UserAgent to mimic a browser entirely... but WAFs caught on to that decades ago, and it's barely worthwhile these days. It has to do with usage patterns - raises AI eyebrows when "iOS Safari" only makes
That said... you can always offer a configurable override to end-users! |
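The layered pattern described in this comment can be sketched as a small shell function. This is only an illustration, assuming the `Mozilla/5.0 (compatible; <tool>/<version>; <wrapper>/<version>; +<url>)` shape of the UA string used in the tests earlier in this thread; `build_user_agent` is a hypothetical helper, not something MegaLinter or markdown-link-check provides:

```shell
#!/bin/sh
# build_user_agent: hypothetical helper that assembles a UA string naming
# the link checker, the wrapping tool, and a reference URL for admins.
build_user_agent() {
  tool="$1"
  tool_version="$2"
  wrapper="$3"
  wrapper_version="$4"
  info_url="$5"
  printf 'Mozilla/5.0 (compatible; %s/%s; %s/%s; +%s)\n' \
    "$tool" "$tool_version" "$wrapper" "$wrapper_version" "$info_url"
}

# Usage:
#   build_user_agent markdown-link-check 3.11.2 MegaLinter 7.7.0 https://megalinter.io
```

Because each product token is `name/version`, an administrator can match on `MegaLinter/`, on `markdown-link-check/`, or on one specific version with plain substring matching.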
This issue has been automatically marked as stale because it has not had recent activity. If you think this issue should stay open, please remove the |
Is your feature request related to a problem? Please describe.
Many websites block (403 response) requests without a User-Agent HTTP request header set. This causes link checkers to automatically fail.
For https://github.com/tcort/markdown-link-check/ a proper issue has already been raised (tcort/markdown-link-check#172); however, given that, for one, this issue is now almost 3 years old without a response, and, for two, it's better for each individual client to provide their unique User-Agent to be a good netizen, I recommend having MegaLinter provide a versioned User-Agent in their default configurations.
Describe the solution you'd like
Add the following to the default https://github.com/oxsecurity/megalinter/blob/main/TEMPLATES/.markdown-link-check.json configuration, but also to any other link-checking linters that may exist:
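A minimal sketch of what such a configuration could look like, assuming markdown-link-check's `httpHeaders` option and the UA string used in the tests in this thread. The `urls` prefixes, the `CONFIG_PATH` placeholder, and the hard-coded version numbers are assumptions for illustration; a real default template would need the versions filled in at build time:

```shell
#!/bin/sh
# Write a sample markdown-link-check config that sends a custom User-Agent.
# CONFIG_PATH is a placeholder so an existing config is not overwritten.
# The "urls" prefixes are an assumption: markdown-link-check applies the
# headers to links that start with one of the listed prefixes.
CONFIG_PATH="${CONFIG_PATH:-./markdown-link-check.sample.json}"

cat > "$CONFIG_PATH" <<'EOF'
{
  "httpHeaders": [
    {
      "urls": ["https://", "http://"],
      "headers": {
        "User-Agent": "Mozilla/5.0 (compatible; markdown-link-check/3.11.2; MegaLinter/7.7.0; +https://megalinter.io)"
      }
    }
  ]
}
EOF
```

The resulting file can be passed with `-c "$CONFIG_PATH"`, in the same way as the `-c` option in the commands shown in this thread.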
For more information on User-Agent header best practices and why I recommend the above: https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/User-Agent
Describe alternatives you've considered
stackoverflow.com to my ignore list.
Additional context
My MegaLinter right now: