Fix unwanted joins for inline tags #2

Merged
lopuhin merged 6 commits into master from inline-tags-spaces on May 29, 2017

Conversation

@lopuhin (Contributor) commented May 26, 2017

Fixes #1 - see the examples by @codinguncut there. Inline tags are commonly used as block tags, and the current normalize-space() approach results in unwanted joining of words; this branch fixes it by always adding whitespace between tags.
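
To make the failure mode concrete, here is a minimal sketch (hypothetical markup, using parsel; not taken from the test suite):

from parsel import Selector

sel = Selector('<div><span>Hello</span><span>world</span></div>')

# normalize-space() works on the element's string value, so adjacent inline
# tags get glued together:
print(sel.xpath('normalize-space(.)').extract_first())  # 'Helloworld'

# Joining the individual text nodes with a space keeps the words apart,
# which is what this branch does:
print(' '.join(sel.xpath('//text()').extract()))  # 'Hello world'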

I checked the old vs. new way on about 1000 HTML pages; on average the text is longer by 0.2% of characters, with most pages having some difference. In all cases I inspected (about 10 pages) the new way is better, separating words that had been joined without spaces, and I didn't find any unwanted splits.

It is almost 2x slower though: 7 s for 1000 HTML pages before, 11.5 s without the regexp, 12.5 s with the regexp (and caching). But I guess it's not that bad.

@codinguncut @kmike I would appreciate your review :) I have some vague memory that in some cases //text() is not what we want, but I can't recall which ones, and I didn't see anything bad in the tests I ran.

Thanks @codinguncut for suggestion. Still needs testing.
re.sub is replicating xpath's normalize-space behaviour.
See GH-1
python 2 does not cache re.sub regexps,
and it's faster even on python 3
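
For reference, the normalize-space() behaviour these commit messages refer to strips leading/trailing whitespace and collapses internal runs into a single space; it can be approximated in Python roughly like this (an illustrative sketch, not the exact code from the commits):

import re

def normalize_space(text):
    # Rough Python counterpart of XPath's normalize-space(): trim the ends
    # and collapse runs of whitespace into a single space.
    return re.sub(r'\s+', ' ', text).strip()

print(normalize_space('  Hello \n\t world  '))  # 'Hello world'
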
@codecov-io commented May 26, 2017

Codecov Report

Merging #2 into master will not change coverage.
The diff coverage is 100%.

Impacted file tree graph

@@          Coverage Diff          @@
##           master     #2   +/-   ##
=====================================
  Coverage     100%   100%           
=====================================
  Files           2      2           
  Lines          26     42   +16     
  Branches        1      6    +5     
=====================================
+ Hits           26     42   +16
Impacted Files Coverage Δ
html_text/html_text.py 100% <100%> (ø) ⬆️

Continue to review full report at Codecov.

Legend
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update c7ebb57...1fb2ec4.

@codinguncut

In my fork github/fluquid/html-text I tried the following:
return ' '.join(x for x in sel.xpath("//text()[normalize-space(.)]").extract() if x)

But it doesn't yet seem to work properly with multiple whitespace characters within a string, or with newlines/tabs for that matter.

Your code looks fine, but it's understandable that re.sub would add overhead.

Johannes
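
For what it's worth, the [normalize-space(.)] predicate only filters out whitespace-only text nodes; it does not normalize whitespace inside the nodes that survive, which is why within-string whitespace still looks wrong. A small illustration (hypothetical markup, using parsel):

import re
from parsel import Selector

sel = Selector('<p>one\n  two</p> <p> </p>')

# Whitespace-only text nodes are dropped, but the surviving node keeps its
# internal newline and indentation:
print(sel.xpath('//text()[normalize-space(.)]').extract())  # ['one\n  two']

# Collapsing the remaining whitespace still needs a separate pass:
joined = ' '.join(sel.xpath('//text()[normalize-space(.)]').extract())
print(re.sub(r'\s+', ' ', joined).strip())  # 'one two'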

@lopuhin (Contributor Author) commented May 26, 2017

@codinguncut it seems that the main overhead comes not from re.sub, but from iterating over //text() results (although re.sub also has some overhead).

return ' '.join(x for x in sel.xpath("//text()[normalize-space(.)]").extract() if x)

That's an interesting approach, I'll check it out, thanks!

@codinguncut

I'll try xpath('//*[normalize-space()]').

Also, both solutions will add spaces before commas, periods, etc. at tag boundaries:
<a>mail</a>, and more
Can't be helped, I think.

@kmike (Contributor) commented May 26, 2017

To handle punctuation there is https://github.com/scrapinghub/webstruct/blob/5a3f39e2ec78a04ca021a12dff58f66686d86251/webstruct/utils.py#L61, but it may add even more overhead. It may be fine to provide it as an option though.

@kmike (Contributor) commented May 26, 2017

Ah, and it also removes all spaces before punctuation, not only those caused by joining, so maybe it is not the way to go.

@lopuhin (Contributor Author) commented May 29, 2017

I'll try xpath('//*[normalize-space()]').

@codinguncut to be honest, I didn't understand this xpath - it just returns all elements with some text, right?

To handle punctuation there is https://github.com/scrapinghub/webstruct/blob/5a3f39e2ec78a04ca021a12dff58f66686d86251/webstruct/utils.py#L61, but it may add even more overhead. It may be fine to provide it as an option though.

Nice suggestion, thanks @kmike! I implemented this as an option in e833357 (edit: f020f4b); it works only on tag boundaries, so only spaces caused by joining are affected. The overhead is not that big: for 1k pages the total time is 11.7 s vs 12.5 s (and 8.17 s vs 9.21 s when working on already parsed trees). Maybe it's OK to make it the default?
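
Roughly, the idea is to add the separating space at a tag boundary only when the next chunk does not start with punctuation. A hypothetical sketch of that approach (not the actual code from f020f4b; join_chunks and the punctuation pattern are made up for illustration):

import re

_punct_start = re.compile(r'^[,:;.!?")]')  # illustrative set of punctuation

def join_chunks(chunks):
    out = []
    for chunk in chunks:
        # add a space at the tag boundary unless the next chunk starts
        # with punctuation
        if out and not _punct_start.match(chunk):
            out.append(' ')
        out.append(chunk)
    return ''.join(out)

print(join_chunks(['mail', ', and more']))  # 'mail, and more'
print(join_chunks(['Hello', 'world']))      # 'Hello world'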

This is similar to webstruct.utils.smart_joins
(https://github.com/scrapinghub/webstruct/blob/5a3f39e2ec78a04ca021a12dff58f66686d86251/webstruct/utils.py#L61),
but is applied only on the tag boundaries.
This mode is just a little bit slower than default.
It's fine to apply whitespace cleaning regexp at the end
@kmike (Contributor) commented May 29, 2017

Nice! +1 to enable punctuation handling by default.
There is also a simple micro-optimization trick: instead of writing

_trailing_whitespace = re.compile(r'\s$')
# ...
if _trailing_whitespace.search(...):

One can write this, to save an attribute lookup in a tight loop:

_has_trailing_whitespace = re.compile(r'\s$').search
# ...
if _has_trailing_whitespace(...):

lopuhin merged commit cf48523 into master May 29, 2017
lopuhin deleted the inline-tags-spaces branch May 29, 2017 12:34
@lopuhin (Contributor Author) commented May 29, 2017

Thanks @codinguncut and @kmike, merged with punctuation handling enabled by default.

@codinguncut

Yes, I still don't 100% understand XPath syntax.
I was hoping to find an equivalent of //text() for normalize-space(), but maybe the two are simply different kinds of functions.

@kmike (Contributor) commented May 29, 2017

Maybe @redapple can share his experience. I think this issue is very much related to scrapy/parsel#34.


def fragments():
    prev = None
    for text in sel.xpath('//text()').extract():

@redapple:

I'd recommend using './/text()' so that it can be used for any selector, and not only those coming from extract_text(html)

@lopuhin (Contributor Author):

That's a great idea, thanks @redapple - I'd like to also make it possible to pass selectors via the public interface.

@redapple

@codinguncut, [normalize-space()](https://www.w3.org/TR/xpath/#function-normalize-space) only applies some trimming...:

whitespace normalized by stripping leading and trailing whitespace and replacing sequences of whitespace characters by a single space.

...on top of string(), which is itself a concatenation of descendant text nodes of the context node:

The string-value of an element node is the concatenation of the string-values of all text node descendants of the element node in document order.

And indeed, normalize-space() is a string function, whereas //text() means /descendant-or-self::node()/text(), so it selects text nodes, and does not produce a string. Two different operations.
lxml/parsel produces smart-strings out of text nodes, so they can be concatenated.
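
A small illustration of the difference (hypothetical markup, using parsel):

from parsel import Selector

sel = Selector('<p>Hello <b>big</b>\n world</p>')

# normalize-space() is a string function: one trimmed, collapsed string.
print(sel.xpath('normalize-space(.)').extract_first())  # 'Hello big world'

# //text() selects nodes: a list of text-node strings to join yourself.
print(sel.xpath('//text()').extract())  # ['Hello ', 'big', '\n world']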

@lopuhin, @kmike, this may not be the right place for my comment, as it is more about doing an html2text-ish transformation than plain text extraction, but what I found handy in the past was:

  • to have an option to keep newlines, which normalize-space() gobbles up
  • and add a newline after block elements for things like titles

You can find some (ugly, written before I knew about normalize-space()) code in parslepy.
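
As a rough sketch of the "newline after block elements" idea mentioned above (hypothetical code, not part of this PR; BLOCK_TAGS and text_with_newlines are made-up names):

import lxml.html

BLOCK_TAGS = {'p', 'div', 'h1', 'h2', 'h3', 'li', 'title'}

def text_with_newlines(html):
    tree = lxml.html.fromstring(html)
    parts = []

    def walk(el):
        if el.text:
            parts.append(el.text)
        for child in el:
            walk(child)
            if child.tail:
                parts.append(child.tail)
        # once the whole element is emitted, separate it from what follows:
        # a newline for block tags, a plain space for inline ones
        parts.append('\n' if el.tag in BLOCK_TAGS else ' ')

    walk(tree)
    return ''.join(parts).strip()

# Note the space before '.': this naive inline separator reintroduces the
# punctuation issue discussed earlier in the thread.
print(text_with_newlines('<div><h1>Title</h1><p>First <b>para</b>.</p></div>'))
# Title
# First para .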

kmike mentioned this pull request May 29, 2017
@redapple

@codecov-io @lopuhin, you may also be interested in this answer I wrote some time ago on whitespace and XPath's normalize-space() vs. Python's strip(): https://stackoverflow.com/a/33829869/

@lopuhin (Contributor Author) commented May 30, 2017

@redapple that's really interesting and useful, thanks! I think we should also try to strip non-breaking spaces - 85% of the sample HTML pages have at least one non-breaking space extracted.
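
As a hypothetical follow-up (not part of this PR): XPath's normalize-space() only collapses space, tab, CR and LF, so non-breaking spaces (U+00A0, i.e. &nbsp;) survive it and can be replaced on the Python side afterwards:

text = 'price:\xa0100'            # '\xa0' is what lxml gives you for &nbsp;
print(text.replace('\xa0', ' '))  # 'price: 100'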

Successfully merging this pull request may close these issues: whitespace issues (#1)