Parser blows up on certain malformed HTML content sections #38

Open
fluffy-critter opened this issue Jul 16, 2019 · 0 comments

So, this is a bit of a weird one. When the feed validator validates the HTML inside a content block, it gets tripped up by certain attributes that have no value. For example, this minimal reproduction:

<?xml version="1.0" encoding="utf-8"?>
<feed xmlns="http://www.w3.org/2005/Atom">

    <title>Crash the validator</title>
    <link href="http://example.com/feed.xml" rel="self" />
    
    
    <link href="http://example.com/" />
    <id>tag:example.com,2019-07-16:blog</id>
    <updated>2019-06-04T00:40:00-07:00</updated>

    
    <entry>
        <title>This feed crashes feedvalidator</title>
        <link href="http://example.com/crash.html" rel="alternate" type="text/html" />
        <published>2019-06-04T00:40:00-07:00</published>
        <updated>2019-06-04T00:40:00-07:00</updated>
        <id>urn:uuid:9a51cf78-c042-5254-b564-eec1fe3bb181</id>
        <author><name>fluffy</name></author>
        <content type="html"><![CDATA[
            <p>This generates a weird error.</p>
            <div class="images" style><img src="http://placekitten.com/200/300" alt="meow"></div>
            <p>Isn't it strange?</p>
        ]]></content>
    </entry>
    
</feed>

causes an error:

An error occurred while trying to validate this feed.
Possible causes:
• The address may be incorrect. Make sure the address is spelled correctly. Try loading the feed directly in your browser to make sure a feed exists at that address.
• The feed may be temporarily unavailable. The server may be down, or too slow. Try again later.
• The validator may be busted. If the feed exists, the server is fine, and the problem is reproducible, let us know on the feedvalidator-users mailing list.

The element that's causing the problem is the <div class="images" style>; removing that causes the validator to work perfectly.

Via the Python REPL I was able to figure out where exactly the code is blowing up; here's a stack trace:

>>> feedvalidator.validateStream(open('/Users/fluffy/Desktop/feed.xml'))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "feedvalidator/__init__.py", line 164, in validateStream
    validator = _validate(rawdata, firstOccurrenceOnly, loggedEvents, base, encoding, mediaType=mediaType)
  File "feedvalidator/__init__.py", line 115, in _validate
    parser.parse(source)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/xml/sax/expatreader.py", line 110, in parse
    xmlreader.IncrementalParser.parse(self, source)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/xml/sax/xmlreader.py", line 123, in parse
    self.feed(buffer)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/xml/sax/expatreader.py", line 213, in feed
    self._parser.Parse(data, isFinal)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/xml/sax/expatreader.py", line 365, in end_element_ns
    self._cont_handler.endElementNS(pair, None)
  File "feedvalidator/base.py", line 266, in endElementNS
    handler.endElementNS(name, qname)
  File "feedvalidator/base.py", line 516, in endElementNS
    self.validate()
  File "feedvalidator/content.py", line 82, in validate
    self.validateSafe(self.value)
  File "feedvalidator/validators.py", line 736, in validateSafe
    HTMLValidator(value, self)
  File "feedvalidator/validators.py", line 257, in __init__
    self.feed(value)
  File "feedvalidator/vendor/HTMLParser.py", line 169, in feed
    self.goahead(0)
  File "feedvalidator/vendor/HTMLParser.py", line 209, in goahead
    k = self.parse_starttag(i)
  File "feedvalidator/vendor/HTMLParser.py", line 332, in parse_starttag
    self.handle_starttag(tag, attrs)
  File "feedvalidator/validators.py", line 279, in handle_starttag
    for evil in checkStyle(value):
  File "feedvalidator/validators.py", line 304, in checkStyle
    if not re.match("""^([:,;#%.\sa-zA-Z0-9!]|\w-\w|'[\s\w]+'|"[\s\w]+"|\([\d,\s]+\))*$""", style):
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/re.py", line 141, in match
    return _compile(pattern, flags).match(string)
TypeError: expected string or buffer

So, it seems that the special-casing for the style attribute dies when it gets a None value: the bare style attribute has no value, so checkStyle ends up handing None to re.match, which raises the TypeError.
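Here is a minimal sketch of the failure and one possible guard. The regex is copied from the checkStyle frame in the traceback above. The names `STYLE_RE`, `raises_type_error` and `style_is_plain` are hypothetical and not part of feedvalidator, and treating a valueless attribute as an empty string is an assumption about the desired behaviour, not the fix that was actually committed.

```python
import re

# Pattern copied from the checkStyle frame in the traceback.
STYLE_RE = re.compile(
    r"""^([:,;#%.\sa-zA-Z0-9!]|\w-\w|'[\s\w]+'|"[\s\w]+"|\([\d,\s]+\))*$"""
)


def raises_type_error(value):
    """Return True if matching `value` directly raises TypeError."""
    try:
        re.match(STYLE_RE.pattern, value)
    except TypeError:
        return True
    return False


def style_is_plain(style):
    """Hypothetical guard: treat a valueless style attribute (None) as empty."""
    if style is None:
        style = ""  # assumption: a bare `style` carries no declarations
    return STYLE_RE.match(style) is not None
```

With a guard like this, a bare style attribute is checked as an empty declaration list instead of crashing the whole validation run.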

Oddly enough, I couldn't manage to reproduce this issue with a minimal content block like:

<div style></div>
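For comparison, here is a small sketch using Python 3's standard-library html.parser. This is not the vendored feedvalidator/vendor/HTMLParser.py from the traceback, which may well behave differently, so it says nothing definite about why the minimal block did not reproduce. `AttrCollector` and `start_tags` are hypothetical helper names. It only shows how the stdlib parser reports a valueless attribute to handle_starttag.

```python
from html.parser import HTMLParser


class AttrCollector(HTMLParser):
    """Record every start tag together with its attribute list."""

    def __init__(self):
        super().__init__()
        self.seen = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value is None
        # when the attribute was written without "=value".
        self.seen.append((tag, attrs))


def start_tags(html):
    """Parse `html` and return the recorded (tag, attrs) pairs."""
    parser = AttrCollector()
    parser.feed(html)
    parser.close()
    return parser.seen
```

In the stdlib parser the bare style attribute arrives as a None value in both the minimal block and the full reproduction, so any style check that runs on it has to tolerate None.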

Anyway, finding this error finally gave me a reason to care about fixing PlaidWeb/Publ#226 sooner rather than later. :)

dontcallmedom added a commit to w3c/feedvalidator that referenced this issue Mar 2, 2020
Make parser more robust rubys/feedvalidator#38
while still doing strict validity checking (which new HTMLParser doesn't allow, see rubys/feedvalidator#28)