Links to sites that have DDoS protections are marked as failed #109
Comments
It was suggested that I try a real browser for these specific failure cases; https://github.com/puppeteer/puppeteer may help. To be investigated. |
It might be advantageous to extend MLC's enable/disable annotations to cover outcomes that should be marked as success or failure. I could see an annotation along the lines of |
I'm not sure that expecting a link to return a 403 makes a site's links any more robust. The goal of MLC is to help keep links unbroken. If we hide the problem behind a check for a specific error code, which could have any number of causes including a real failure, that may not help. A 403 may be caused by DDoS protection at first and later turn into a genuine access-denied error, and we won't notice. |
I’d say that providing a means to define a “good” link is in the spirit of MLC. There are occasions where it is necessary to redefine good, and extending the exclude framework to permit that makes the tool more flexible. I chose 403 specifically because it is a scenario in which the destination exists but MLC is unauthenticated. I actually have a use case for it personally on my family intranet site: I wish to verify that the remote document would succeed if authenticated, but without adding credentials to the source. There are potentially additional scenarios where defining an alternate status code might be beneficial to the user. HTTP status is a suggestion, not the law. |
You totally had my approval at "my family intranet site" 🤯 😉 |
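For readers weighing the "expect a 403" idea above, here is a minimal sketch of how it could look in a config file. It assumes the installed markdown-link-check version supports the `aliveStatusCodes` option described in the project README; the list of codes is only an illustration. Note that the setting applies to every link rather than to a single annotated one, so it carries the drawback raised earlier in the thread: a link that later becomes a genuine access-denied would still pass.

```json
{
  "aliveStatusCodes": [200, 206, 403]
}
```

A per-link annotation, as proposed in the comments, would be narrower than this global setting.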
It might be worthwhile investigating the headers that MLC sends with each request. I had a link report "bad" (403) that definitely was not; it turns out it was denied by my mod_security setup under a generic rule for requests that do not send an Accept header (log entry sanitized):
|
@cadeef could you post what your modsec would have accepted? |
The issue was the Accept header was missing from the request completely. A standard |
Should we modify link-check to add this as a default or only document it here for users to add the header in options if needed? (I tend to prefer adding the default in link-check) |
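Until a default exists, the header can be supplied through the `httpHeaders` option in the config file. The following is only a sketch: the URL prefix `https://example.com` and the `*/*` value are assumptions for illustration, since the exact header value mentioned above did not survive in this thread.

```json
{
  "httpHeaders": [
    {
      "urls": ["https://example.com"],
      "headers": {
        "Accept": "*/*"
      }
    }
  ]
}
```

The headers are applied only to links that start with one of the entries in `urls`, so a broad prefix such as `https://` would cover every HTTPS link.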
But anyway, it doesn't seem to be enough (that would be too easy) for Cloudflare DDoS protection; see for instance this page: https://metamask.zendesk.com/hc/en-us/articles/360015488991-Sending-Ether-New-UI- We need to keep improving the fabric of the Web, as broken links are bad for the Web: they make its original purpose fail. Having a free interconnected network where everyone is equally able to access data... it's less and less true every day, but I keep thinking we have to help find ways to prevent total failure from happening. Here are some thoughts about things we might do here (or in Link-check):
|
I should have been clearer in my initial comment: I was doubtful the change would make a difference with Cloudflare. Projects of MLC's nature will continually be on the losing end of the request-inspection cat and mouse. Shoring up the request profile for MLC would likely eliminate some existing edge cases (like mine with standard mod_sec rules), but it will always be a chase. WAFs evolve constantly by design. I'd argue it's worthwhile, at least in the short term, to patch low-hanging fruit that maintains the standard user experience without additional annotation. An exception framework, with toggleable handling, would be welcome, as I've alluded to before, but it creates additional complexity. Care in implementation would be necessary to ensure that the simplicity of the primary cause (dead simple, bolt-on markdown link checking) is maintained. I like the idea of optionally punting to a headless third-party service in the event of failure, but it opens up a can of worms (availability, privacy, support) that would need to be carefully navigated. |
I agree with your comments. DDoS protection on websites is something I have more and more issues with, though... Maybe it's specific to the domain I work in (blockchain), but I think it's, like spam, something we will have to live with... and it will break the basic fabric of the Web. Is it the role of MLC to fix this? Absolutely not, of course, but can we provide tools to help users deal with that? I think so. I think MLC has to evolve anyway. If we want it to remain a small script, we will end up with something that will be useless at some point. If we build proper software, with extension capabilities by design, we will be able to improve and follow users' needs. |
Resolving #111 would help. |
As an example of what you all long ago predicted, all links to GitHub Docs (https://docs.github.com/*) started failing for me within the past 13 hours with 403s. I am running markdown-link-check 3.10.0, and the issue is not related to a change in the version of markdown-link-check, so I assume GitHub enhanced their DDoS protections. I am a fan of the proposal to support expecting a 403, but an even bigger fan of the idea of escalating to a headless browser (and, if warranted, additionally escalating from a headless browser to a full browser). I don't think anyone would argue that Puppeteer is a panacea for the inherent arms race markdown-link-check finds itself in, but it has far more resources to keep up with changing expectations. I may have missed some pertinent issues, but these were the only two hits for DDoS in their issue tracker:
As suggested in the latter, puppeteer-extra-plugin-stealth can be used to dodge most DDoS protections. As an overall strategy, consolidating around a single (or small number of) large project(s) seems a solid one for the open-source community since we are much stronger together than separate. |
I was wrong. The issue with links to GitHub Docs had nothing to do with DDoS, but rather GitHub evidently started requiring that clients accept compression. This .markdown-link-check.json resolved the issue for us. |
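The config file referred to above is not reproduced in this thread. As a hedged sketch of what such a `.markdown-link-check.json` might contain, the example below sends an `Accept-Encoding` header for GitHub Docs links via `httpHeaders`. The URL prefix and the list of encodings are assumptions, not the commenter's actual values.

```json
{
  "httpHeaders": [
    {
      "urls": ["https://docs.github.com/"],
      "headers": {
        "Accept-Encoding": "gzip, deflate, br"
      }
    }
  ]
}
```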
Adding the "https://www.intel.com/content/www/us/en/developer/articles/news/llama2.html" link to the ignore patterns in the link check config file as a workaround for this particular link returning a 403 error. Not an elegant solution, but it was suggested here: tcort/markdown-link-check#109. Will keep investigating to see if there is a better workaround.
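A workaround of this kind can be sketched with the `ignorePatterns` option, whose entries are regular expressions matched against each link. The escaping and the leading `^` anchor below are illustrative choices, not taken from the referenced commit.

```json
{
  "ignorePatterns": [
    {
      "pattern": "^https://www\\.intel\\.com/content/www/us/en/developer/articles/news/llama2\\.html"
    }
  ]
}
```

Ignored links are skipped entirely, so a link that later breaks for real will go unnoticed.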
I hit this with an origin backed by Netlify. They return 403 for
The following in the config fixed it:

    "httpHeaders": [
      {
        "urls": [
          "https://"
        ],
        "headers": {
          "user-agent": "pls let me in"
        }
      }
    ],
|
More and more sites have DDoS or anti-spam protections, such as Cloudflare DDoS Attack Protection or GoDaddy's, that make the link checker fail on their links (for instance, the server returns a 502 code if you are not an actual browser).
For the moment, the only workaround I see is to exclude those sites.
But in the future, more and more sites will use these tools, and the link checker will only be able to check internal links.
We have to figure out whether there is a technical solution to this issue and how to implement it.
All suggestions are welcome in the comments.
Thanks.