
Handle non-breaking spaces and other special unicode characters #6

Open
lopuhin opened this issue May 30, 2017 · 4 comments

Comments

lopuhin (Contributor) commented May 30, 2017

See discussion in #2 (comment)

codinguncut commented:

not sure if this is the same issue, but I'm getting:

ERROR:scrapy.core.scraper:Spider error processing <GET http://www.magnetoinvestigators.com/contact-us> (referer: http://www.magnetoinvestigators.com)
Traceback (most recent call last):
  File "/home/johannes/.virtualenvs/broadcrawl/lib/python3.5/site-packages/html_text/html_text.py", line 77, in cleaned_selector
    tree = _cleaned_html_tree(html)
  File "/home/johannes/.virtualenvs/broadcrawl/lib/python3.5/site-packages/html_text/html_text.py", line 33, in _cleaned_html_tree
    tree = lxml.html.fromstring(html.encode('utf8'), parser=parser)
UnicodeEncodeError: 'utf-8' codec can't encode character '\udb9e' in position 785: surrogates not allowed

Apparently this is a new strictness introduced by Python 3. Possibly using the surrogateescape flag in encode could help...?
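
A minimal, self-contained reproduction of this failure mode (the markup below is made up, not taken from the actual page):

```python
# Minimal reproduction of the traceback above (the markup is made up):
# a str containing a lone surrogate code point cannot be encoded to UTF-8
# with the default (strict) error handler.
html = '<p>broken \udb9e text</p>'

try:
    html.encode('utf8')
except UnicodeEncodeError as exc:
    # "'utf-8' codec can't encode character '\udb9e' in position 10:
    # surrogates not allowed"
    print(exc)
```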


lopuhin (Contributor, Author) commented Oct 2, 2017

Thanks for the report @codinguncut! For now you can work around this issue by parsing the document yourself and passing an lxml.html.HtmlElement into html_text.extract_text.
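
A sketch of that workaround, assuming extract_text accepts a parsed tree as described above (the response bytes here are made up):

```python
import lxml.html
import html_text

# Sketch of the workaround described above: parse the document ourselves and
# hand the resulting HtmlElement to extract_text, so html_text never has to
# re-encode the text to UTF-8 itself. The payload below is made up;
# \xc2\xa0 is a UTF-8 encoded non-breaking space.
raw = b'<html><body><p>Hello\xc2\xa0world</p></body></html>'
tree = lxml.html.fromstring(raw)

print(html_text.extract_text(tree))
```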

kmike (Contributor) commented Oct 2, 2017

The issue is that Scrapy used the Content-Type header to get the encoding ('utf-7'), while the site in fact seems to return utf-8. Then Scrapy decodes the body using errors='replace' (w3lib_replace to be precise, see https://github.com/scrapy/w3lib/blob/34435d085c6adb14c94cd0188c23f6dc7d4da0f7/w3lib/encoding.py#L174), and this produces an output which can't be encoded back to utf-8 for some reason.

I think the right place to fix it is probably w3lib. html-text can provide extra robustness by using surrogateescape, but it would be better to get a proper unicode body before passing it to html_text.
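
A hedged sketch of what such caller-side robustness could look like (this is not current html_text behaviour; surrogatepass is used instead of surrogateescape because surrogateescape only round-trips the U+DC80 to U+DCFF range):

```python
# Hedged sketch (not something html_text does today): before handing a
# possibly mis-decoded body to html_text, round-trip it through UTF-8 so
# that lone surrogates become U+FFFD replacement characters instead of
# blowing up inside html_text's own encode step.
def to_valid_unicode(text):
    # 'surrogatepass' lets lone surrogates through the encode step;
    # strict UTF-8 decoding then rejects their bytes, and 'replace'
    # maps the offending bytes to U+FFFD.
    return text.encode('utf-8', 'surrogatepass').decode('utf-8', 'replace')

broken = 'body with a lone surrogate \udb9e in it'
clean = to_valid_unicode(broken)
clean.encode('utf8')  # no longer raises UnicodeEncodeError
```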

kmike (Contributor) commented Oct 2, 2017

FTR, response.css / response.xpath also don't work for this website.
