Fix issue #15: Mechanize discards first URL after self-closing anchor tag #405
+61
−1
Problem
When WWW::Mechanize encountered a self-closing anchor tag like <a name="anchor"/>, it would discard the first link that appeared immediately after it. This was originally reported in 2007 as issue #15 via RT.
For example, given HTML in which such an anchor is followed by two links, mech-dump --links before this fix would return only the second one. The first link (test1) was completely missing.
Root Cause
The _link_from_token() method in lib/WWW/Mechanize.pm unconditionally called $parser->get_trimmed_text("/a") for all <a> tags to extract the link text.
For self-closing tags like <a name="anchor"/>, this call caused HTML::TokeParser to read forward until it found the next </a> closing tag. Unfortunately, that closing tag belonged to the subsequent link (test1), so the entire first link was consumed during text extraction and never processed.
Solution
Modified _link_from_token() to check whether a tag is self-closing before calling get_trimmed_text(). HTML::TokeParser marks self-closing tags with a '/' key in the attributes hash. Self-closing tags have no content, so calling get_trimmed_text() on them is unnecessary and is what causes this bug.
The fix is minimal: just 4 lines, with comments explaining the check.
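The root cause and the guard can be illustrated with a small, self-contained sketch. It does not use HTML::TokeParser or the real _link_from_token(). It models the token stream as a plain Perl array, and extract_links(), the token layout, the file names, and the second link (test2) are all hypothetical names chosen for illustration. The '/' attribute key follows the description above.

```perl
use strict;
use warnings;

# Toy token stream standing in for what a tokenizer might emit for:
#   <a name="anchor"/><a href="test1.html">test1</a><a href="test2.html">test2</a>
# The file names and the second link are hypothetical.
my @tokens = (
    [ 'S', 'a', { name => 'anchor', '/' => '/' } ],    # self-closing anchor
    [ 'S', 'a', { href => 'test1.html' } ],
    [ 'T', 'test1' ],
    [ 'E', 'a' ],
    [ 'S', 'a', { href => 'test2.html' } ],
    [ 'T', 'test2' ],
    [ 'E', 'a' ],
);

# Collect link hrefs. When $check_self_closing is false, every <a> reads
# forward to the next </a>, which mirrors the behaviour described as the bug.
sub extract_links {
    my ($check_self_closing) = @_;
    my @stream = @tokens;    # work on a copy so both runs see the same input
    my @links;
    while ( my $tok = shift @stream ) {
        next unless $tok->[0] eq 'S' && $tok->[1] eq 'a';
        my $attrs        = $tok->[2];
        my $self_closing = exists $attrs->{'/'};
        unless ( $check_self_closing && $self_closing ) {
            # Consume everything up to and including the next </a>.
            while ( my $next = shift @stream ) {
                last if $next->[0] eq 'E' && $next->[1] eq 'a';
            }
        }
        push @links, $attrs->{href} if defined $attrs->{href};
    }
    return @links;
}

my @before = extract_links(0);    # no self-closing check
my @after  = extract_links(1);    # with the check

print "before: @before\n";
print "after:  @after\n";
```

In this toy model the run without the check reports only test2.html, because the forward read for the anchor swallows the start tag of the first link. The run with the check reports both test1.html and test2.html.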
Testing
Added t/anchor_name_bug.t, which reproduces the exact scenario from the original issue report.
Verified with mech-dump that both links are now properly extracted.
After this fix, mech-dump --links correctly returns both links.
Closes #15
Original prompt
Fixes #119