Copilot AI commented Oct 24, 2025

Problem

When WWW::Mechanize encountered a self-closing anchor tag like <a name="anchor"/>, it would discard the first link that appeared immediately after it. This was originally reported in 2007 on RT (ticket 22891) and later imported as Google Code issue 15.

For example, given this HTML:

<h1>hello world</h1>
<a name="anchor"/>
<p><a href="http://www.url1.com/gi1?a=1">test1</a></p>
<p><a href="http://www.url2.com/gi2?a=2">test2</a></p>

Before this fix, mech-dump --links would only return:

http://www.url2.com/gi2?a=2

The first link (test1) was completely missing.

Root Cause

The _link_from_token() method in lib/WWW/Mechanize.pm unconditionally called $parser->get_trimmed_text("/a") for all <a> tags to extract the link text.

For self-closing tags like <a name="anchor"/>, this call caused HTML::TokeParser to read forward until it found the next </a> closing tag. Unfortunately, that closing tag belonged to the subsequent link (test1), so the entire first link was consumed during text extraction and never processed.
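
The read-ahead described above can be modelled outside Perl. The following is a minimal Python sketch, not the WWW::Mechanize code: the token tuples, the TokenStream class and extract_links_naive are hypothetical names invented for illustration, and the "/" attribute key is assumed to mirror how the trailing slash is reported by the parser.

```python
# Tokens are (kind, name_or_text, attrs): "S" start tag, "E" end tag, "T" text.
# The "/" key on the first anchor is an assumption modelling the trailing slash.
TOKENS = [
    ("S", "h1", {}), ("T", "hello world", None), ("E", "h1", None),
    ("S", "a", {"name": "anchor", "/": "/"}),
    ("S", "p", {}),
    ("S", "a", {"href": "http://www.url1.com/gi1?a=1"}),
    ("T", "test1", None), ("E", "a", None), ("E", "p", None),
    ("S", "p", {}),
    ("S", "a", {"href": "http://www.url2.com/gi2?a=2"}),
    ("T", "test2", None), ("E", "a", None), ("E", "p", None),
]


class TokenStream:
    """Hypothetical stand-in for a pull parser with a read cursor."""

    def __init__(self, tokens):
        self._tokens = list(tokens)
        self._pos = 0

    def next_start_tag(self, name):
        # Advance until a start tag with the given name is found.
        while self._pos < len(self._tokens):
            tok = self._tokens[self._pos]
            self._pos += 1
            if tok[0] == "S" and tok[1] == name:
                return tok
        return None

    def text_until_end_tag(self, name):
        # Collect text up to the next matching end tag, which is left unread.
        # Every start tag passed on the way is silently consumed.
        parts = []
        while self._pos < len(self._tokens):
            tok = self._tokens[self._pos]
            if tok[0] == "E" and tok[1] == name:
                break
            if tok[0] == "T":
                parts.append(tok[1])
            self._pos += 1
        return " ".join(" ".join(parts).split())


def extract_links_naive(tokens):
    """Always read ahead to the next </a>, as the pre-fix behaviour is described."""
    stream = TokenStream(tokens)
    urls = []
    while (tag := stream.next_start_tag("a")) is not None:
        attrs = tag[2]
        stream.text_until_end_tag("a")  # for <a name="anchor"/> this swallows the test1 link
        if "href" in attrs:
            urls.append(attrs["href"])
    return urls


print(extract_links_naive(TOKENS))  # ['http://www.url2.com/gi2?a=2']
```

In this model the read-ahead triggered by the anchor without an href moves the cursor past the start tag of the test1 link, so only the test2 URL is ever seen.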

Solution

Modified _link_from_token() to check whether a tag is self-closing before calling get_trimmed_text(). HTML::TokeParser marks self-closing tags with a '/' key in the attributes hash. Self-closing tags have no content, so calling get_trimmed_text() on them is unnecessary and is what triggers this bug.

The fix is minimal: four lines, including comments explaining the check.
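
The guard can be sketched in a few lines of Python. This is a simplified model under stated assumptions, not the actual patch to lib/WWW/Mechanize.pm: the token tuples, the PAGE list, extract_links and its skip_self_closing parameter are all hypothetical, and the "/" key stands in for the marker described above.

```python
# Tokens are (kind, name_or_text, attrs): "S" start tag, "E" end tag, "T" text.
PAGE = [
    ("S", "h1", {}), ("T", "hello world", None), ("E", "h1", None),
    ("S", "a", {"name": "anchor", "/": "/"}),  # assumed self-closing marker
    ("S", "p", {}),
    ("S", "a", {"href": "http://www.url1.com/gi1?a=1"}),
    ("T", "test1", None), ("E", "a", None), ("E", "p", None),
    ("S", "p", {}),
    ("S", "a", {"href": "http://www.url2.com/gi2?a=2"}),
    ("T", "test2", None), ("E", "a", None), ("E", "p", None),
]


def extract_links(tokens, skip_self_closing=True):
    """Collect href values; optionally skip the read-ahead for self-closing anchors."""
    it = iter(tokens)
    urls = []
    for kind, name, attrs in it:
        if kind != "S" or name != "a":
            continue
        self_closing = "/" in attrs
        if not (skip_self_closing and self_closing):
            # Read ahead to the next </a>, standing in for link-text extraction.
            for k, n, _ in it:
                if k == "E" and n == "a":
                    break
        if "href" in attrs:
            urls.append(attrs["href"])
    return urls


print(extract_links(PAGE, skip_self_closing=False))  # ['http://www.url2.com/gi2?a=2']
print(extract_links(PAGE))
# ['http://www.url1.com/gi1?a=1', 'http://www.url2.com/gi2?a=2']
```

With the guard disabled the model reproduces the single-URL output from the report; with it enabled both links are returned. Whether a trailing slash should be treated as self-closing in HTML at all is a separate question that this sketch does not settle.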

Testing

  • Added comprehensive test case t/anchor_name_bug.t that reproduces the exact scenario from the original issue report
  • All existing tests continue to pass
  • Verified with mech-dump that both links are now properly extracted
  • Tested edge cases including multiple consecutive self-closing anchors, mixed anchor types, and empty anchor tags

After this fix, mech-dump --links correctly returns both links:

http://www.url1.com/gi1?a=1
http://www.url2.com/gi2?a=2

Closes #15

Warning

Firewall rules blocked me from connecting to one or more addresses (expand for details)

I tried to connect to the following addresses, but was blocked by firewall rules:

  • blahblahblah.xx-only-testing.foo
    • Triggering command: /usr/bin/perl t/local/failure.t (dns block)
  • esm.ubuntu.com
    • Triggering command: /usr/lib/apt/methods/https (dns block)

Original prompt

This section details the original issue you should resolve

<issue_title>WM: Mechanize seemed to discard the first URL after this tag in a html page</issue_title>
<issue_description>
[email protected] reported on Nov 9, 2007

From http://rt.cpan.org/Ticket/Display.html?id=22891

From:  	[email protected]
OK, the real problem turned out to be

Mechanize seemed to discard the first URL after this <a name="anchor"/>
tag in a html page.

<h1> hello world</h1> <a name="anchor"/>
<p><a href="http://www.url1.com/gi1?a=1">test1</a><p>
<p><a href="http://www.url2.com/gi2?a=2">test2</a><p>


mech-dump -links for above html code returns
http://www.url2.com/gi2?a=2

Details

Imported from Google Code issue 15 via archive

Comments

[email protected] commented on Apr 24, 2011 :

(No comment was entered for this change.)
  • Labels : WM

</issue_description>

Comments on the Issue (you are @copilot in this section)

@simbabque Someone attempting this would have to start with a test to see if this still happens. Even if it does not, we'll happily accept just the test.

Fixes #119



Copilot AI changed the title from "[WIP] Fix Mechanize issue with discarding first URL after anchor tag" to "Fix issue #15: Mechanize discards first URL after self-closing anchor tag" on Oct 24, 2025
Copilot finished work on behalf of oalders October 24, 2025 04:30
Copilot AI requested a review from oalders October 24, 2025 04:30
haarg (Member) commented Oct 24, 2025

<a name="anchor"/> is not a self-closing tag in HTML. A trailing / does not denote a self-closing tag; it is simply ignored. <a> tags are required to have a closing tag, so this is not really valid HTML. A proper parser will keep the first <a> tag open until it is implicitly closed by the start of the second <a> tag, because <a> tags cannot be nested.

Successfully merging this pull request may close these issues.

WM: Mechanize seemed to discard the first URL after this <a name="anchor"/> tag in a html page
