External link checking flaw and service #4049
Replies: 4 comments 5 replies
-
I think I side with @peterbe here, from when he introduced the flaw checker: the tool can't be made much cleverer than it is. What I could imagine, though, is keeping a file with meta information. The HTTP status code might also be worth looking at (not only for 404, but also for redirects, for example).
-
In general, I love the idea! This is going to significantly improve the quality of MDN in terms of SEO: Google penalizes you when you link to spam (e.g. spam-site "link farms"), because it looks like an endorsement and counts against our ranking. The other point I want to emphasize is that we should not tie this to Markdown. The tricky part is that flaws are meant to be "a slam dunk": it's supposed to be indisputable that something is a flaw. But many external links are not; they're almost entirely best judged by humans, not by scripts. One thing that might definitely be a flaw, and that automates well, is:
Currently, the PR review companion highlights external URLs, but it's only a small start. I wish we could do more with that!
-
Last but not least: I really like the PR review companion "framework". It's a nice place to write all sorts of scripts whose output becomes a (Markdown) GitHub issue comment.
-
I still think it would be relatively straightforward to extend the review companion and do a bit more with the external links. A basic sweep of each external URL, listing the HTTP error codes it returns, would be a good start. Note that doing this in the review companion won't solve the problem for existing content, and it might cause confusion if it flags an external URL that was not introduced by the current PR author.
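As a rough illustration of that basic sweep, here's a minimal sketch (not the actual yari/review-companion code; the function names and URL regex are assumptions) that pulls external URLs out of page content and records the HTTP status each one returns:

```python
# Hypothetical sketch of a review-companion step: collect external URLs
# from changed page content and report each one's HTTP status code.
# The regex and function names are illustrative assumptions.
import re
import urllib.error
import urllib.request

EXTERNAL_URL = re.compile(r"https?://[^\s)\"'>\]]+")

def extract_external_urls(text: str) -> list[str]:
    """Pull candidate external URLs out of raw page content."""
    return sorted(set(EXTERNAL_URL.findall(text)))

def check_url(url: str, timeout: float = 10.0) -> int:
    """Return the HTTP status code for a HEAD request to the URL."""
    request = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as error:
        return error.code  # 404, 410, etc. still carry a status code
```

A companion script could then run `check_url` over `extract_external_urls` for each changed file and post the resulting `(url, status)` pairs as a comment.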
-
Currently we have flaw checkers that can spot broken internal links, and checkers that note http links that could probably be https. What we don't have is anything to check whether external links actually work. That's a problem!
Flaw checker proposal
As we move to Markdown, how about we add something like https://github.com/tcort/markdown-link-check to the flaw checker?
This is a good tool because you can feed it just the Markdown pages you are interested in checking. Because it parses the Markdown, it can find the "intended links" in the page content and ignore anything that might appear elsewhere, such as in sidebars.
The flaw checker should report only actual flaws, so I'd propose we report only links that return HTTP 404 errors, and possibly those that give permanent redirects. All other errors would be stripped out, since we can't be sure they are genuine problems.
CI based link checker service
On CI we could use a service that gives more detailed information. For example, there are services that can tell you if a domain is being resold, which for our purposes is as good as a 404.
This report could be wrapped in a twistie (a collapsible section) so that it is hidden by default, with perhaps only the "real" problems being displayed.
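For the twistie, GitHub comments support the HTML `<details>`/`<summary>` elements inside Markdown. Here's a sketch (the comment layout and function name are assumptions, not the actual companion output) of building such a comment:

```python
# Sketch: confirmed problems stay visible at the top; the full sweep is
# wrapped in a <details> "twistie" so it collapses by default in the
# GitHub comment. Layout and names are illustrative assumptions.
def format_comment(flaws: dict[str, str], full_report: list[str]) -> str:
    lines = ["### External link check"]
    if flaws:
        lines += [f"- `{url}`: {label}" for url, label in sorted(flaws.items())]
    else:
        lines.append("No confirmed broken links.")
    lines.append("")
    lines.append("<details><summary>Full report</summary>\n")
    lines += full_report
    lines.append("\n</details>")
    return "\n".join(lines)
```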
FYI @peterbe
Yes, I know we have spoken of this before. I'm going to keep it alive until the problem is fixed :-)