guess_layout does not work on XHTML elements #25

Open
keturn opened this issue Feb 10, 2020 · 1 comment


keturn commented Feb 10, 2020

After the failure of extract_text in #24, I tried etree_to_text.

That ran without raising an exception, but guess_layout has no effect: no newlines are added after the tags listed in NEWLINE_TAGS and DOUBLE_NEWLINE_TAGS.

I think it's because element.tag includes the tag's XML namespace, so it doesn't match the namespaceless NEWLINE_TAGS and DOUBLE_NEWLINE_TAGS.
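To illustrate the suspected cause, here is a minimal sketch using the standard library's xml.etree.ElementTree as a stand-in for lxml (both report tags in the same {namespace}local form). NEWLINE_TAGS_SAMPLE is a hypothetical three-tag subset made up for this example, not the library's actual set.

```python
import xml.etree.ElementTree as ET  # stdlib stand-in for lxml.etree

# Hypothetical subset for illustration only; not the library's real NEWLINE_TAGS.
NEWLINE_TAGS_SAMPLE = frozenset(['li', 'div', 'p'])

xhtml = ('<html xmlns="http://www.w3.org/1999/xhtml">'
         '<body><p>text</p></body></html>')

root = ET.fromstring(xhtml)
p = root[0][0]  # html -> body -> p

# An XML parser keeps the default namespace as part of the tag name.
print(p.tag)

# So a membership test against plain tag names fails.
tag_matches = p.tag in NEWLINE_TAGS_SAMPLE
print(tag_matches)
```

If the analysis above is right, every layout check against a set of bare tag names would miss in the same way.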

Test:

# Imports assume the lxml and html_text packages.
from lxml import etree
from html_text import extract_text, etree_to_text


def test_guess_layout():
    # XHTML document: the default namespace applies to every element.
    xhtml = (u'<html xmlns="http://www.w3.org/1999/xhtml">'
             '<head><title>  title  </title></head>'
             '<body><div>text_1.<p>text_2 text_3</p>'
             '<p id="demo"></p><ul><li>text_4</li><li>text_5</li></ul>'
             '<p>text_6<em>text_7</em>text_8</p>text_9</div>'
             '<script>document.getElementById("demo").innerHTML = '
             '"This should be skipped";</script> <p>...text_10</p>'
             '</body></html>')

    # Expected text, with newlines inserted around block-level tags.
    text = ('title\n\ntext_1.\n\ntext_2 text_3\n\ntext_4\ntext_5'
            '\n\ntext_6 text_7 text_8\n\ntext_9\n\n...text_10')
    assert extract_text(xhtml, guess_punct_space=False, guess_layout=True) == text

    # Parsing with an XML parser keeps the namespace in each element's tag.
    tree = etree.fromstring(xhtml, parser=etree.XMLParser())
    assert etree_to_text(tree, guess_layout=True) == text

keturn commented Feb 10, 2020

This could be handled either by altering traverse_text_fragments to compare the tag's local name (using etree.QName), or by adding to NEWLINE_TAGS and DOUBLE_NEWLINE_TAGS a duplicate of each tag with {http://www.w3.org/1999/xhtml} prepended.
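A rough sketch of both options, written against the standard library's xml.etree.ElementTree so it runs without lxml. The helper names local_name and with_xhtml_variants and the SAMPLE_TAGS set are hypothetical, made up for illustration; with lxml, etree.QName(tag).localname should give the same result as local_name for namespaced tags.

```python
import xml.etree.ElementTree as ET  # stdlib stand-in for lxml.etree

XHTML_NS = 'http://www.w3.org/1999/xhtml'

# Hypothetical subset for illustration only; not the library's real set.
SAMPLE_TAGS = frozenset(['li', 'p'])


def local_name(tag):
    """Option 1: strip a leading '{namespace}' so the bare name can be compared."""
    # Non-string tags (lxml uses callables for comments) pass through unchanged.
    if isinstance(tag, str) and tag.startswith('{'):
        return tag.rpartition('}')[2]
    return tag


def with_xhtml_variants(tags):
    """Option 2: extend a tag set with XHTML-namespaced copies of each name."""
    namespaced = frozenset('{%s}%s' % (XHTML_NS, t) for t in tags)
    return frozenset(tags) | namespaced


root = ET.fromstring('<html xmlns="http://www.w3.org/1999/xhtml">'
                     '<body><p>x</p></body></html>')
p = root[0][0]  # html -> body -> p

print(local_name(p.tag) in SAMPLE_TAGS)           # option 1
print(p.tag in with_xhtml_variants(SAMPLE_TAGS))  # option 2
```

Option 1 would match elements in any namespace that share a local name, while option 2 only recognises XHTML; which trade-off is preferable is for the maintainers to decide.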
