Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

extract_text fails with misleading error message when given bytes instead of unicode [py3] #26

Open
keturn opened this issue Feb 10, 2020 · 2 comments

Comments

@keturn
Copy link

keturn commented Feb 10, 2020

The error is shown as "a bytes-like object is required, not str", but this is misleading, because the caller's error was that they did pass a bytes object.

Honestly not sure what the pythonic way to deal with this is.

  • Explicit assert isinstance type checking?
  • type annotations, and hope the user is running in an environment that will type check before they hit this exception?
html_text.extract_text(b'<html><body><p>Hello,   World!</p></body></html>')
…/python3.7/site-packages/html_text/html_text.py in parse_html(html)
     47     XXX: mostly copy-pasted from parsel.selector.create_root_node
     48     """
---> 49     body = html.strip().replace('\x00', '').encode('utf8') or b'<html/>'
     50     parser = lxml.html.HTMLParser(recover=True, encoding='utf8')
     51     root = lxml.etree.fromstring(body, parser=parser)

TypeError: a bytes-like object is required, not 'str'

I guess that, for this specific line, its whole goal is to convert a string to a bytes object, so parse_html could skip that line if html is already bytes.

@lopuhin
Copy link
Contributor

lopuhin commented Feb 10, 2020

Yeah, it's .replace method of bytestring which raises this error, and it is confusing for the user. For html-text, having an explicit type check in extract_text seems like a good usability improvement to me, but with raising TypeError instead of an assert.

@lopuhin
Copy link
Contributor

lopuhin commented Feb 10, 2020

I guess that, for this specific line, its whole goal is to convert a string to a bytes object, so parse_html could skip that line if html is already bytes.

That's also possible, but note that this must be a utf8-encoded html, so if it's just a raw response result in a different encoding, then it would not work correctly. Accepting only strings makes sure we don't have this error, and it seems that the time to do re-encoding is small compared to text extraction time. But maybe it's fine to support bytes if the error on non-utf8 html is not too obscure.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants