
Handle non-breaking spaces and other special unicode characters #6

Open
lopuhin opened this issue May 30, 2017 · 4 comments

Comments

lopuhin (Contributor) commented May 30, 2017

See discussion in #2 (comment)

codinguncut commented:

not sure if this is the same issue, but I'm getting:

ERROR:scrapy.core.scraper:Spider error processing <GET http://www.magnetoinvestigators.com/contact-us> (referer: http://www.magnetoinvestigators.com)
Traceback (most recent call last):
  File "/home/johannes/.virtualenvs/broadcrawl/lib/python3.5/site-packages/html_text/html_text.py", line 77, in cleaned_selector
    tree = _cleaned_html_tree(html)
  File "/home/johannes/.virtualenvs/broadcrawl/lib/python3.5/site-packages/html_text/html_text.py", line 33, in _cleaned_html_tree
    tree = lxml.html.fromstring(html.encode('utf8'), parser=parser)
UnicodeEncodeError: 'utf-8' codec can't encode character '\udb9e' in position 785: surrogates not allowed

Apparently this is a new strictness introduced by Python 3. Possibly using the surrogateescape flag in encode could help...?
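
A minimal, self-contained reproduction of this failure mode (the markup below is made up, not taken from the actual page):

```python
# Minimal reproduction of the traceback above (the markup is made up):
# a str containing a lone surrogate code point cannot be encoded to UTF-8
# with the default (strict) error handler.
html = '<p>broken \udb9e text</p>'

try:
    html.encode('utf8')
except UnicodeEncodeError as exc:
    # "'utf-8' codec can't encode character '\udb9e' in position 10:
    # surrogates not allowed"
    print(exc)
```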


lopuhin (Contributor, Author) commented Oct 2, 2017

Thanks for the report @codinguncut! For now you can work around this issue by parsing the document yourself and passing an lxml.html.HtmlElement into html_text.extract_text.
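
A sketch of that workaround, assuming extract_text accepts a parsed tree as described above (the response bytes here are made up):

```python
import lxml.html
import html_text

# Sketch of the workaround described above: parse the document ourselves and
# hand the resulting HtmlElement to extract_text, so html_text never has to
# re-encode the text to UTF-8 itself. The payload below is made up;
# \xc2\xa0 is a UTF-8 encoded non-breaking space.
raw = b'<html><body><p>Hello\xc2\xa0world</p></body></html>'
tree = lxml.html.fromstring(raw)

print(html_text.extract_text(tree))
```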

kmike (Contributor) commented Oct 2, 2017

The issue is that Scrapy used the Content-Type header to get the encoding ('utf-7'), while the site in fact seems to return utf-8. Then Scrapy decodes the body using errors='replace' (w3lib_replace to be precise, see https://github.com/scrapy/w3lib/blob/34435d085c6adb14c94cd0188c23f6dc7d4da0f7/w3lib/encoding.py#L174), and this produces an output which can't be encoded back to utf-8 for some reason.

I think the right place to fix it is probably w3lib. html-text can provide extra robustness by using surrogateescape, but it would be better to get a proper unicode body before passing it to html_text.
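
A hedged sketch of what such caller-side robustness could look like (this is not current html_text behaviour; surrogatepass is used instead of surrogateescape because surrogateescape only round-trips the U+DC80 to U+DCFF range):

```python
# Hedged sketch (not something html_text does today): before handing a
# possibly mis-decoded body to html_text, round-trip it through UTF-8 so
# that lone surrogates become U+FFFD replacement characters instead of
# blowing up inside html_text's own encode step.
def to_valid_unicode(text):
    # 'surrogatepass' lets lone surrogates through the encode step;
    # strict UTF-8 decoding then rejects their bytes, and 'replace'
    # maps the offending bytes to U+FFFD.
    return text.encode('utf-8', 'surrogatepass').decode('utf-8', 'replace')

broken = 'body with a lone surrogate \udb9e in it'
clean = to_valid_unicode(broken)
clean.encode('utf8')  # no longer raises UnicodeEncodeError
```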

kmike (Contributor) commented Oct 2, 2017

FTR, response.css / response.xpath also don't work for this website.
