Parser blows up on certain malformed HTML content sections #38

fluffy-critter · 2019-07-16T03:48:53Z

So, this is a bit of a weird one. It seems that when the feed validator validates the HTML inside a content block, it gets tripped up by certain attributes having no value. For example, this minimal reproduction:

<?xml version="1.0" encoding="utf-8"?>
<feed xmlns="http://www.w3.org/2005/Atom">

    <title>Crash the validator</title>
    <link href="http://example.com/feed.xml" rel="self" />
    
    
    <link href="http://example.com/" />
    <id>tag:example.com,2019-07-16:blog</id>
    <updated>2019-06-04T00:40:00-07:00</updated>

    
    <entry>
        <title>This feed crashes feedvalidator</title>
        <link href="http://example.com/crash.html" rel="alternate" type="text/html" />
        <published>2019-06-04T00:40:00-07:00</published>
        <updated>2019-06-04T00:40:00-07:00</updated>
        <id>urn:uuid:9a51cf78-c042-5254-b564-eec1fe3bb181</id>
        <author><name>fluffy</name></author>
        <content type="html"><![CDATA[
            <p>This generates a weird error.</p>
            <div class="images" style><img src="http://placekitten.com/200/300" alt="meow"></div>
            <p>Isn't it strange?</p>
        ]]></content>
    </entry>
    
</feed>

causes an error:

An error occurred while trying to validate this feed.
Possible causes:
• The address may be incorrect. Make sure the address is spelled correctly. Try loading the feed directly in your browser to make sure a feed exists at that address.
• The feed may be temporarily unavailable. The server may be down, or too slow. Try again later.
• The validator may be busted. If the feed exists, the server is fine, and the problem is reproducible, let us know on the feedvalidator-users mailing list.

The element that's causing the problem is <div class="images" style>; removing it lets the validator work perfectly.

Via the Python REPL I was able to figure out where exactly the code is blowing up; here's a stack trace:

>>> feedvalidator.validateStream(open('/Users/fluffy/Desktop/feed.xml'))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "feedvalidator/__init__.py", line 164, in validateStream
    validator = _validate(rawdata, firstOccurrenceOnly, loggedEvents, base, encoding, mediaType=mediaType)
  File "feedvalidator/__init__.py", line 115, in _validate
    parser.parse(source)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/xml/sax/expatreader.py", line 110, in parse
    xmlreader.IncrementalParser.parse(self, source)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/xml/sax/xmlreader.py", line 123, in parse
    self.feed(buffer)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/xml/sax/expatreader.py", line 213, in feed
    self._parser.Parse(data, isFinal)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/xml/sax/expatreader.py", line 365, in end_element_ns
    self._cont_handler.endElementNS(pair, None)
  File "feedvalidator/base.py", line 266, in endElementNS
    handler.endElementNS(name, qname)
  File "feedvalidator/base.py", line 516, in endElementNS
    self.validate()
  File "feedvalidator/content.py", line 82, in validate
    self.validateSafe(self.value)
  File "feedvalidator/validators.py", line 736, in validateSafe
    HTMLValidator(value, self)
  File "feedvalidator/validators.py", line 257, in __init__
    self.feed(value)
  File "feedvalidator/vendor/HTMLParser.py", line 169, in feed
    self.goahead(0)
  File "feedvalidator/vendor/HTMLParser.py", line 209, in goahead
    k = self.parse_starttag(i)
  File "feedvalidator/vendor/HTMLParser.py", line 332, in parse_starttag
    self.handle_starttag(tag, attrs)
  File "feedvalidator/validators.py", line 279, in handle_starttag
    for evil in checkStyle(value):
  File "feedvalidator/validators.py", line 304, in checkStyle
    if not re.match("""^([:,;#%.\sa-zA-Z0-9!]|\w-\w|'[\s\w]+'|"[\s\w]+"|\([\d,\s]+\))*$""", style):
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/re.py", line 141, in match
    return _compile(pattern, flags).match(string)
TypeError: expected string or buffer

So, it seems that the style attribute gets special-cased: checkStyle hands its value straight to re.match, which dies with a TypeError when that value is None.
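As a rough illustration of the failure and one possible guard, here is a minimal sketch. The regular expression is copied from the traceback; `style_is_safe` is a hypothetical helper written for this example and is not the project's actual `checkStyle`, whose full body and return value are not shown in this report. Treating a missing value as an empty string is an assumption about the desired behaviour.

```python
import re

# Pattern copied from the traceback (validators.py, checkStyle).
SAFE_STYLE = re.compile(
    r"""^([:,;#%.\sa-zA-Z0-9!]|\w-\w|'[\s\w]+'|"[\s\w]+"|\([\d,\s]+\))*$"""
)


def style_is_safe(style):
    """Hypothetical None-tolerant check; not feedvalidator's own code."""
    if style is None:
        # Assumption: a valueless attribute (<div style>) carries no CSS,
        # so it is treated like an empty style string.
        style = ""
    return SAFE_STYLE.match(style) is not None


if __name__ == "__main__":
    try:
        # Same kind of call as in the traceback: matching against None.
        SAFE_STYLE.match(None)
    except TypeError as exc:
        print("unguarded call fails:", exc)

    print(style_is_safe(None))
    print(style_is_safe("color: red"))
    print(style_is_safe("background: url(javascript:alert(1))"))
```

With the guard in place, a `None` style is accepted like an empty one, while a value the pattern rejects is still reported as unsafe.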

Oddly enough, I couldn't manage to reproduce this issue with a minimal content block like:

<div style></div>
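For reference, here is a small sketch of how a valueless attribute reaches `handle_starttag`. It uses Python 3's standard `html.parser`, which is an assumption: the traceback above goes through a vendored Python 2 `HTMLParser.py` that may behave differently, so this shows where a `None` value can come from and does not explain why the minimal block failed to reproduce the crash. `AttrCollector` and `start_tags` are names made up for this example.

```python
from html.parser import HTMLParser


class AttrCollector(HTMLParser):
    """Records every start tag together with its attribute list."""

    def __init__(self):
        super().__init__()
        self.seen = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value is None when the
        # attribute was written without "=value".
        self.seen.append((tag, attrs))


def start_tags(fragment):
    """Parse an HTML fragment and return the recorded (tag, attrs) pairs."""
    parser = AttrCollector()
    parser.feed(fragment)
    parser.close()
    return parser.seen


if __name__ == "__main__":
    for tag, attrs in start_tags('<div class="images" style></div>'):
        print(tag, attrs)
```

With the standard-library parser, the `style` attribute arrives as `('style', None)`, which is the kind of value the regular expression in `checkStyle` cannot handle.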

Anyway, finding this error finally gave me a reason to care about fixing PlaidWeb/Publ#226 sooner rather than later. :)



dontcallmedom added a commit to w3c/feedvalidator that referenced this issue Mar 2, 2020

Switch HTML Parser to HTML5Lib

f249ddd

Make parser more robust (rubys/feedvalidator#38) while still doing strict validity checking (which the new HTMLParser doesn't allow; see rubys/feedvalidator#28)
