Fix unwanted joins for inline tags #2

Merged
lopuhin merged 6 commits into master from inline-tags-spaces on May 29, 2017

Conversation

@lopuhin (Contributor) commented May 26, 2017

Fixes #1 - see the examples by @codinguncut there. Inline tags are commonly used as block tags, and the current normalize-space() approach results in unwanted joining of words; this branch fixes it by always adding whitespace between tags.
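
To make the failure mode concrete, here is a minimal sketch (hypothetical markup, using parsel; not taken from the test suite):

from parsel import Selector

sel = Selector('<div><span>Hello</span><span>world</span></div>')

# normalize-space() works on the element's string value, so adjacent inline
# tags get glued together:
print(sel.xpath('normalize-space(.)').extract_first())  # 'Helloworld'

# Joining the individual text nodes with a space keeps the words apart,
# which is what this branch does:
print(' '.join(sel.xpath('//text()').extract()))  # 'Hello world'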

I checked the old vs. new way on about 1000 HTML pages; on average the text is longer by 0.2% of characters, with most pages having some difference. In all cases I inspected (about 10 pages) the new way is better, separating words that had been joined without spaces, and I didn't find any unwanted splits.

It is almost 2x slower though: 7 s for 1000 HTML pages before, 11.5 s without the regexp, 12.5 s with the regexp (and caching). But I guess it's not that bad.

@codinguncut @kmike I would appreciate your review :) I have some vague memory that in some cases //text() is not what we want, but I can't recall which ones, and I didn't see anything bad in the tests I ran.

Thanks @codinguncut for suggestion. Still needs testing.
re.sub is replicating xpath's normalize-space behaviour.
See GH-1
python 2 does not cache re.sub regexps,
and it's faster even on python 3
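
For reference, the normalize-space() behaviour these commit messages refer to strips leading/trailing whitespace and collapses internal runs into a single space; it can be approximated in Python roughly like this (an illustrative sketch, not the exact code from the commits):

import re

def normalize_space(text):
    # Rough Python counterpart of XPath's normalize-space(): trim the ends
    # and collapse runs of whitespace into a single space.
    return re.sub(r'\s+', ' ', text).strip()

print(normalize_space('  Hello \n\t world  '))  # 'Hello world'
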
@codecov-io commented May 26, 2017

Codecov Report

Merging #2 into master will not change coverage.
The diff coverage is 100%.

Impacted file tree graph

@@          Coverage Diff          @@
##           master     #2   +/-   ##
=====================================
  Coverage     100%   100%           
=====================================
  Files           2      2           
  Lines          26     42   +16     
  Branches        1      6    +5     
=====================================
+ Hits           26     42   +16
Impacted Files Coverage Δ
html_text/html_text.py 100% <100%> (ø) ⬆️

Continue to review full report at Codecov.

Legend
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update c7ebb57...1fb2ec4.

@codinguncut

In my fork github/fluquid/html-text I tried the following:
return ' '.join(x for x in sel.xpath("//text()[normalize-space(.)]").extract() if x)

But it doesn't yet seem to work properly with multiple whitespace characters within a string, or with newlines/tabs for that matter.

Your code looks fine, but it's understandable that re.sub would add overhead.

Johannes
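
For what it's worth, the [normalize-space(.)] predicate only filters out whitespace-only text nodes; it does not normalize whitespace inside the nodes that survive, which is why within-string whitespace still looks wrong. A small illustration (hypothetical markup, using parsel):

import re
from parsel import Selector

sel = Selector('<p>one\n  two</p> <p> </p>')

# Whitespace-only text nodes are dropped, but the surviving node keeps its
# internal newline and indentation:
print(sel.xpath('//text()[normalize-space(.)]').extract())  # ['one\n  two']

# Collapsing the remaining whitespace still needs a separate pass:
joined = ' '.join(sel.xpath('//text()[normalize-space(.)]').extract())
print(re.sub(r'\s+', ' ', joined).strip())  # 'one two'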

@lopuhin (Contributor Author) commented May 26, 2017

@codinguncut it seems that the main overhead comes not from re.sub, but from iterating over //text() results (although re.sub also has some overhead).

return ' '.join(x for x in sel.xpath("//text()[normalize-space(.)]").extract() if x)

That's an interesting approach, I'll check it out, thanks!

@codinguncut

I'll try xpath('//*[normalize-space()]').

Also, both solutions will add spaces before commas, periods, etc. at tag boundaries:
<a>mail</a>, and more
Can't be helped, I think.

@kmike (Contributor) commented May 26, 2017

To handle punctuation there is https://github.com/scrapinghub/webstruct/blob/5a3f39e2ec78a04ca021a12dff58f66686d86251/webstruct/utils.py#L61, but it may add even more overhead. It may be fine to provide it as an option though.

@kmike (Contributor) commented May 26, 2017

Ah, and it also removes all spaces before punctuation, not only those caused by joining, so maybe it is not the way to go.

@lopuhin (Contributor Author) commented May 29, 2017

I'll try xpath('//*[normalize-space()]').

@codinguncut to be honest, I didn't understand this xpath - it just returns all elements with some text, right?

To handle punctuation there is https://github.com/scrapinghub/webstruct/blob/5a3f39e2ec78a04ca021a12dff58f66686d86251/webstruct/utils.py#L61, but it may add even more overhead. It may be fine to provide it as an option though.

Nice suggestion, thanks @kmike! I implemented this as an option in e833357 (edit: f020f4b); it works only on tag boundaries, so only spaces caused by joining are affected. The overhead is not that big: for 1k pages the total time is 11.7 s vs 12.5 s (and 8.17 s vs 9.21 s when working on already parsed trees). Maybe it's OK to make it the default?
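
Roughly, the idea is to add the separating space at a tag boundary only when the next chunk does not start with punctuation. A hypothetical sketch of that approach (not the actual code from f020f4b; join_chunks and the punctuation pattern are made up for illustration):

import re

_punct_start = re.compile(r'^[,:;.!?")]')  # illustrative set of punctuation

def join_chunks(chunks):
    out = []
    for chunk in chunks:
        # add a space at the tag boundary unless the next chunk starts
        # with punctuation
        if out and not _punct_start.match(chunk):
            out.append(' ')
        out.append(chunk)
    return ''.join(out)

print(join_chunks(['mail', ', and more']))  # 'mail, and more'
print(join_chunks(['Hello', 'world']))      # 'Hello world'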

This is similar to webstruct.utils.smart_joins
(https://github.com/scrapinghub/webstruct/blob/5a3f39e2ec78a04ca021a12dff58f66686d86251/webstruct/utils.py#L61),
but is applied only on the tag boundaries.
This mode is just a little bit slower than default.
It's fine to apply whitespace cleaning regexp at the end
@kmike (Contributor) commented May 29, 2017

Nice! +1 to enable punctuation handling by default.
There is also a simple micro-optimization trick: instead of writing

_trailing_whitespace = re.compile(r'\s$')
# ...
if _trailing_whitespace.search(...):

One can write this, to save an attribute lookup in a tight loop:

_has_trailing_whitespace = re.compile(r'\s$').search
# ...
if _has_trailing_whitespace(...):

lopuhin merged commit cf48523 into master May 29, 2017
lopuhin deleted the inline-tags-spaces branch May 29, 2017 12:34
@lopuhin (Contributor Author) commented May 29, 2017

Thanks @codinguncut and @kmike, merged with punctuation handling enabled by default.

@codinguncut

Yes, I still don't 100% understand XPath syntax.
I was hoping to find an equivalent of //text() for normalize-space(), but maybe the two are simply different kinds of functions.

@kmike (Contributor) commented May 29, 2017

Maybe @redapple can share his experience. I think this issue is very much related to scrapy/parsel#34.


def fragments():
    prev = None
    for text in sel.xpath('//text()').extract():

@redapple:

I'd recommend using './/text()' so that it can be used for any selector, and not only those coming from extract_text(html)

@lopuhin (Contributor Author):

That's a great idea, thanks @redapple - I'd like to also make it possible to pass selectors via the public interface.

@redapple

@codinguncut, [normalize-space()](https://www.w3.org/TR/xpath/#function-normalize-space) only applies some trimming...:

whitespace normalized by stripping leading and trailing whitespace and replacing sequences of whitespace characters by a single space.

...on top of string(), which is itself a concatenation of descendant text nodes of the context node:

The string-value of an element node is the concatenation of the string-values of all text node descendants of the element node in document order.

And indeed, normalize-space() is a string function, whereas //text() means /descendant-or-self::node()/text(), so it selects text nodes, and does not produce a string. Two different operations.
lxml/parsel produces smart-strings out of text nodes, so they can be concatenated.
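
A small illustration of the difference (hypothetical markup, using parsel):

from parsel import Selector

sel = Selector('<p>Hello <b>big</b>\n world</p>')

# normalize-space() is a string function: one trimmed, collapsed string.
print(sel.xpath('normalize-space(.)').extract_first())  # 'Hello big world'

# //text() selects nodes: a list of text-node strings to join yourself.
print(sel.xpath('//text()').extract())  # ['Hello ', 'big', '\n world']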

@lopuhin, @kmike, this may not be the right place for my comment, as it is more about doing an html2text-ish transformation than plain text extraction, but what I found handy in the past was:

  • to have an option to keep newlines, which normalize-space() gobbles up
  • and add a newline after block elements for things like titles

You can find some (ugly, written before I knew about normalize-space()) code in parslepy.
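
As a rough sketch of the "newline after block elements" idea mentioned above (hypothetical code, not part of this PR; BLOCK_TAGS and text_with_newlines are made-up names):

import lxml.html

BLOCK_TAGS = {'p', 'div', 'h1', 'h2', 'h3', 'li', 'title'}

def text_with_newlines(html):
    tree = lxml.html.fromstring(html)
    parts = []

    def walk(el):
        if el.text:
            parts.append(el.text)
        for child in el:
            walk(child)
            if child.tail:
                parts.append(child.tail)
        # once the whole element is emitted, separate it from what follows:
        # a newline for block tags, a plain space for inline ones
        parts.append('\n' if el.tag in BLOCK_TAGS else ' ')

    walk(tree)
    return ''.join(parts).strip()

# Note the space before '.': this naive inline separator reintroduces the
# punctuation issue discussed earlier in the thread.
print(text_with_newlines('<div><h1>Title</h1><p>First <b>para</b>.</p></div>'))
# Title
# First para .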

kmike mentioned this pull request May 29, 2017
@redapple

@codecov-io @lopuhin, you may also be interested in this answer I wrote some time ago on whitespace and XPath's normalize-space() vs. Python's strip(): https://stackoverflow.com/a/33829869/

@lopuhin (Contributor Author) commented May 30, 2017

@redapple that's really interesting and useful, thanks! I think we should also try to strip non-breaking spaces - 85% of the sample HTML pages have at least one non-breaking space extracted.
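
As a hypothetical follow-up (not part of this PR): XPath's normalize-space() only collapses space, tab, CR and LF, so non-breaking spaces (U+00A0, i.e. &nbsp;) survive it and can be replaced on the Python side afterwards:

text = 'price:\xa0100'            # '\xa0' is what lxml gives you for &nbsp;
print(text.replace('\xa0', ' '))  # 'price: 100'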

Successfully merging this pull request may close these issues: whitespace issues (#1)