extract_text does not work on lxml XHTML element #24

keturn · 2020-02-10T00:39:06Z

I guess the docs do explicitly state lxml.html.HtmlElement, but the lxml docs say:

Note that XHTML is best parsed as XML, parsing it with the HTML parser can lead to unexpected results.

so I had been using lxml in XML mode, and extract_text failed with this not-so-obvious error:

…/python3.7/site-packages/html_text/html_text.py in parse_html(html)
     47     XXX: mostly copy-pasted from parsel.selector.create_root_node
     48     """
---> 49     body = html.strip().replace('\x00', '').encode('utf8') or b'<html/>'
     50     parser = lxml.html.HTMLParser(recover=True, encoding='utf8')
     51     root = lxml.etree.fromstring(body, parser=parser)

AttributeError: 'lxml.etree._Element' object has no attribute 'strip'

Test case:

from lxml import etree
from html_text import extract_text


def test_extract_text_from_xml_tree():
    xhtml = (u'<html xmlns="http://www.w3.org/1999/xhtml"><head/><body>'
             u'<p>Hello,   World!</p>'
             u'</body></html>')
    # Parse in XML mode, as the lxml docs recommend for XHTML.
    tree = etree.fromstring(xhtml, parser=etree.XMLParser())

    text = u'Hello, World!'
    assert extract_text(tree, guess_punct_space=False,
                        guess_layout=False) == text

lopuhin · 2020-02-10T07:31:21Z

@keturn right, good catch - this is something we should fix. In the meantime, you can try calling html_text.etree_to_text directly; that won't fail in parse_html (but may fail later, as I didn't check it). EDIT: I see you already tried that in #25.

Also, I haven't experienced issues parsing XHTML with the HTML parser, at least as far as html-text is concerned.

keturn mentioned this issue Feb 10, 2020

guess_layout does not work on XHTML elements #25

Open
