
Add guess page layout #9

Merged · 40 commits · Sep 25, 2018

Conversation

@Kebniss (Contributor) commented Aug 28, 2018

I tested the new extraction algorithm on 1200 HTML documents and timed extraction performance:

  • extract_text(html, guess_page_layout=True, guess_punct_space=True):
    • avg extraction time per html: 0.00769ms
    • fastest: 0.00006ms
    • slowest: 0.07818ms
  • html_text.selector_to_text(sel, guess_punct_space=True, guess_page_layout=True):
    • avg extraction time per html: 0.01368ms
    • fastest: 0.00012ms
    • slowest: 0.12133ms

Did the same on the old extraction algorithm:

  • extract_text(html, guess_punct_space=True):
    • avg extraction time per html: 0.00941ms
    • fastest: 0.00011ms
    • slowest: 0.15546ms
  • text = html_text.selector_to_text(sel, guess_punct_space=True):
    • avg extraction time per html: 0.01445ms
    • fastest: 0.00016ms
    • slowest: 0.12790ms

HTML extraction speed improved by ~20%, while selector extraction improved by ~6%.
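For reference, figures like the ones above can be gathered with a small timing harness along these lines (a sketch: the tag-stripping function below is a hypothetical stand-in for html_text.extract_text, used so the snippet is self-contained):

```python
import re
import time

def benchmark(fn, documents):
    """Time fn over each document; return (avg, fastest, slowest) in seconds."""
    times = []
    for doc in documents:
        start = time.perf_counter()
        fn(doc)
        times.append(time.perf_counter() - start)
    return sum(times) / len(times), min(times), max(times)

# Hypothetical stand-in extractor: strip tags, then collapse whitespace.
def strip_tags(html):
    return re.sub(r'\s+', ' ', re.sub(r'<[^>]+>', ' ', html)).strip()

avg, fastest, slowest = benchmark(strip_tags, ['<p>Hello</p> world!'] * 100)
print('avg: %.5fs fastest: %.5fs slowest: %.5fs' % (avg, fastest, slowest))
```

In the actual benchmark, html_text.extract_text (with the relevant keyword arguments) would replace strip_tags.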

TODO:

  • add multiple newlines
  • add backward compatibility with selectors
  • handle more tags if guess_page_layout=True
  • add tests on real webpages with guess_page_layout=True
  • handle nested tags without text when guess_page_layout=True
  • update readme
  • add performance evaluation

@Kebniss (Contributor, Author) commented Aug 28, 2018

The current implementation is ~20% faster than the previous one. I tested extraction on 5 different webpages with a lot of text (articles or forums); the average time for the old approach using selectors is 0.012s, while for the new one it is 0.010s.

@kmike (Contributor) commented Aug 30, 2018

Hey @Kebniss! Discussion about punctuation handling is here: #2.

  1. It shouldn't depend on the guess_punct_space = False option, because this is how HTML works: if there are several whitespaces or newlines inside an element, they're collapsed to one. There are some exceptions (<pre> elements, or overridden CSS styles), but the general rule is to collapse. See https://developer.mozilla.org/en-US/docs/Web/CSS/white-space.

  2. Check

    def extract_text(html, guess_punct_space=True):

    for the motivation. This heuristic produces better text in practice. Special handling of punctuation is added to make it work more reliably.

  3. We're maintaining whitespace because browsers maintain it.

So for me it looks like all failures are real.
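The collapsing rule from point 1 can be illustrated with a minimal sketch (plain Python, not html-text's actual implementation):

```python
import re

def collapse_whitespace(text):
    # Per CSS `white-space: normal`, runs of spaces, tabs and newlines
    # inside an element collapse to a single space when rendered.
    return re.sub(r'\s+', ' ', text)

print(collapse_whitespace('Hello   \n\n  world'))  # 'Hello world'
```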

@@ -47,45 +49,71 @@ def parse_html(html):
_has_punct_before = re.compile(r'\($').search


def selector_to_text(sel, guess_punct_space=True):
Contributor:

This function was useful - the main use case is to extract text from a part of a web page, finding this part using Scrapy or parsel.

Contributor:

You can get a parsed tree for a selector using sel.root

Contributor Author:

OK, so you want to create a selector in extract_text(html) and then apply traverse_text_fragments on sel.root, correct?

Contributor:

if selector_to_text is supported, cleaned_selector is also nice to have

Contributor:

@Kebniss no, extract_text doesn't need to use Selector, it is an additional overhead. The idea is to be backwards compatible and provide the same feature for Selector; internally it can work the other way around - likely selector_to_text should pass sel.root to html_to_text.

Contributor:

Can't we have a tail_text argument in html_to_text? When it is False, tail extraction is skipped, but only at the top level (i.e. text is still extracted from children's tails). It can be False by default - I don't see why anyone would want to extract text from the tail of the root element.

Contributor:

Alternatively, we can just not extract text from element tail by default, on the top level (i.e. children should have tail text processed as usual).

In the common case (root <html> element) there shouldn't be any text in the tail. And when the user passes another element explicitly, extracting text from the element's tail is likely undesirable - it is the same issue as with Selectors.
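The text/tail distinction being discussed here is the one from lxml's (and ElementTree's) data model; a quick stdlib illustration:

```python
from xml.etree import ElementTree as ET

root = ET.fromstring('<div><p>inside</p>after the tag</div>')
p = root.find('p')
# .text is the text inside the element; .tail is the text between this
# element's closing tag and the next sibling (or the parent's closing tag).
print(p.text)  # 'inside'
print(p.tail)  # 'after the tag'
```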

Contributor Author:

You are right - tail text is outside the selected nodes, and as such it should not be extracted. Not extracting it by default seems reasonable. I will add the root object as an argument to check when the recursive call is processing it.

@kmike (Contributor) commented Sep 6, 2018:

Why do you need a root object? Can't it just be a boolean flag process_tail in some internal function?

Contributor Author:

Yes, the root node is unnecessary. I added a depth argument so that we know when the recursion is back at the root and tail text is not extracted there.
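A minimal sketch of the depth-flag idea (stdlib ElementTree stands in for lxml here, and the function name is hypothetical, not html-text's actual code):

```python
from xml.etree import ElementTree as ET

def extract(node, depth=0):
    # Collect the node's text, recurse into children, and take tail text
    # only for non-root nodes (depth > 0) -- the depth flag in action.
    parts = []
    if node.text:
        parts.append(node.text.strip())
    for child in node:
        parts.append(extract(child, depth + 1))
    if depth > 0 and node.tail:
        parts.append(node.tail.strip())
    return ' '.join(p for p in parts if p)

root = ET.fromstring('<div><p>first</p>tail of p</div>')
print(extract(root))  # 'first tail of p'
```

The child's tail ("tail of p") is kept because it lies inside the selected root, while the root's own tail would be skipped.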

def html_to_text(tree, guess_punct_space=True, guess_page_layout=False):
""" Convert a cleaned html tree to text.
See html_text.extract_text docstring for description of the approach
and options.
Contributor:

Formatting is off - text in the docstring is indented too much (not per PEP 8).

and not _has_punct_before(prev)
)
)
)
@kmike (Contributor) commented Aug 30, 2018:

I think it can be more readable as several ifs, e.g.

if prev is None:
    return False
if _has_trailing_whitespace(prev):
    return False
if _has_punct_after(text) or _has_punct_before(prev):
    return False
return True

or when converted to an expression with or, not and:

return not(
    prev is None 
    or _has_trailing_whitespace(prev) 
    or _has_punct_after(text)
    or _has_punct_before(prev)
)
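Put together with the helper regexes, the refactored check could look like this (only _has_punct_before appears in the diff above; the other two patterns are assumptions for illustration):

```python
import re

_has_trailing_whitespace = re.compile(r'\s$').search
_has_punct_after = re.compile(r'^[,:;.!?")]').search  # assumed pattern
_has_punct_before = re.compile(r'\($').search

def should_add_space(prev, text):
    """Return True if a space should be inserted between prev and text."""
    return not (
        prev is None
        or _has_trailing_whitespace(prev)
        or _has_punct_after(text)
        or _has_punct_before(prev)
    )

print(should_add_space('Hello', 'world'))    # True
print(should_add_space('Hello ', 'world'))   # False: trailing whitespace
print(should_add_space('Hello', ', world'))  # False: leading punctuation
```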

Contributor Author:

👍

@codecov-io commented Aug 30, 2018

Codecov Report

Merging #9 into master will decrease coverage by 1.94%.
The diff coverage is 98.73%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master       #9      +/-   ##
==========================================
- Coverage     100%   98.05%   -1.95%     
==========================================
  Files           2        2              
  Lines          42      103      +61     
  Branches        6       28      +22     
==========================================
+ Hits           42      101      +59     
- Misses          0        2       +2
Impacted Files Coverage Δ
html_text/__init__.py 100% <100%> (ø) ⬆️
html_text/html_text.py 98.03% <98.71%> (-1.97%) ⬇️

Continue to review full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 4836865...05b979a.


def html_to_text(tree, guess_punct_space=True, guess_page_layout=False):
Contributor:

I think we should keep this function private - rename it to _html_to_text and remove it from __init__ exports. The API becomes rather confusing otherwise: there are html_text.extract_text and html_text.html_to_text, which do almost the same thing - extract_text supports html as a string in addition to lxml trees, and also does cleaning on its own; html_to_text only works on lxml trees and does not do cleaning.

If allowing to pass an already cleaned tree is important, we can add an argument to extract_text - though this can be done later.

Contributor Author:

I think the only reason for not wanting the html to be cleaned is tied to performance. Let's just make _html_to_text private and later I can check performance with and without cleaning to see if there is a difference

Contributor:

@Kebniss right, though let's not worry about adding clean=True argument now. Main use cases are

  • user has an already cleaned tree, and wants to save some time by avoiding cleaning again;
  • user wants to change cleaning options.

@@ -85,7 +147,7 @@ def cleaned_selector(html):
return sel


-def extract_text(html, guess_punct_space=True):
+def extract_text(html, guess_punct_space=True, guess_page_layout=False, new=True):
Contributor:

The new argument is unused and undocumented.

@@ -98,5 +160,7 @@ def extract_text(html, guess_punct_space=True):

html should be a unicode string or an already parsed lxml.html element.
Contributor:

guess_page_layout argument should be documented.

@@ -7,6 +7,9 @@
import parsel


NEWLINE_TAGS = ['li', 'dd', 'dt', 'dl', 'ul', 'ol']
DOUBLE_NEWLINE_TAGS = ['title', 'p', 'h1', 'h2', 'h3', 'h4', 'h5', 'h6']
@kmike (Contributor) commented Sep 6, 2018:

These lists are used for lookups; even though they're short, I think it is cleaner and faster to have them as sets.

In [1]: x = ['title', 'p', 'h1', 'h2', 'h3', 'h4', 'h5', 'h6']
In [2]: x_set = {'title', 'p', 'h1', 'h2', 'h3', 'h4', 'h5', 'h6'}
In [3]: %timeit 'foo' in x
162 ns ± 0.398 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)
In [4]: %timeit 'foo' in x_set
39.8 ns ± 0.153 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)

^^ lookup in a set is ~4x faster even when the list is that short; not a lot, but why not do that :)

@kmike (Contributor) commented Sep 6, 2018:

It'd also be nice to allow overriding these double_newline_tags and newline_tags in extract_text; these constants would then just be defaults (use frozenset instead of set if you do so). A use case is the following: you want to extract text from a particular website, and know that e.g. div element should add a new line. You then write

extract_text(html, guess_page_layout=True, newline_tags=NEWLINE_TAGS | {'div'})

@@ -7,6 +7,9 @@
import parsel


NEWLINE_TAGS = ['li', 'dd', 'dt', 'dl', 'ul', 'ol']
Contributor:

Shouldn't we add table tags like tr and th as well? Check e.g. http://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html - currently product information is on the same line.

It'd also be nice to add a few realistic tests: a few examples of HTML pages and their text output (in separate files, for readability). Text should be extracted with guess_page_layout=True. I think this would allow us to detect regressions / changes in the output better, and also find cases which are not handled properly.

@kmike (Contributor) commented Sep 6, 2018:

To clarify: which HTML tags are not in these two lists because they shouldn't be there, and which are not there because we're not handling them yet? Which of the tags from https://developer.mozilla.org/en-US/docs/Web/HTML/Element have you checked? Maybe it makes sense to handle more of them?

guess_page_layout=guess_page_layout)
if extracted:
text.append(extracted)
return ' '.join(text)
Contributor Author:

This is to have it work like the previous implementation; however, I think it would make more sense to return a list of the text extracted by each selector. This way the user can decide whether and how to join it. Maybe they need all the texts as separate entities, and that is why they selected specific elements.

@kmike (Contributor) commented Sep 21, 2018:

I agree it is not clear which behavior is better; let's keep the current one, but have this problem in mind.

@lopuhin (Contributor) commented Sep 20, 2018

This might be unrelated, but I wonder if we add a space in place of the <wbr> tag: https://developer.mozilla.org/en-US/docs/Web/HTML/Element/wbr - ideally we shouldn't (currently html-text does add a space).
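One way to handle it, sketched with a plain-text regex (the real fix would likely operate on the lxml tree during cleaning; the function name here is hypothetical):

```python
import re

def drop_wbr(html):
    # <wbr> marks a word-break *opportunity*; it should not become a space.
    # Remove the tag (including any self-closing form) before extraction.
    return re.sub(r'<wbr\s*/?>', '', html, flags=re.I)

print(drop_wbr('super<wbr>califragilistic'))  # 'supercalifragilistic'
```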

@kmike (Contributor) commented Sep 20, 2018

I haven't checked everything, but a question about benchmarks: why is selector_to_text slower than extract_text? I expected it to be the other way around, as selector_to_text shouldn't parse html.

@Kebniss (Contributor, Author) commented Sep 20, 2018

In the benchmark for selectors I included the time to create the selector. Without it, the average execution time for the new implementation is 0.00372, while the old implementation averages 0.00821. In this case this implementation is ~55% faster than master.

@Kebniss (Contributor, Author) commented Sep 20, 2018

This might be unrelated, but I wonder if we add a space at place of <wbr> tag: https://developer.mozilla.org/en-US/docs/Web/HTML/Element/wbr - ideally we shouldn't make it (currently html-text does add a space).

I can add this in a follow-up PR.

@lopuhin (Contributor) commented Sep 20, 2018

I can add this in the following pr

Thanks, that would be great if this does not blow up the scope :)

README.rst Outdated

>>> text = html_text.extract_text(u'<h1>Hello</h1> world!', guess_page_layout=True)
u'Hello
world!'
Contributor:

These examples look off (actually, the previous example was not good either). I think it should look like this if you try it in a Python console:

>>> html_text.extract_text(u'<h1>Hello</h1> world!', guess_page_layout=True)
'Hello\n\nworld!'

Contributor:

The examples below have the same issue: it shouldn't be text = html_text...., just html_text...., otherwise there is no output.

README.rst Outdated
* DOUBLE_NEWLINE_TAGS = frozenset([
'blockquote', 'dl', 'figure', 'h1', 'h2', 'h3', 'h4', 'h5', 'h6', 'ol',
'p', 'pre', 'title', 'ul'
])
Contributor:

I think it makes sense to just say the constants are html_text.NEWLINE_TAGS and html_text.DOUBLE_NEWLINE_TAGS (and maybe expose them at the top level) - copy-pasting these lists here requires maintenance; it is easy to forget to update the README when making a code change.

@@ -96,7 +185,22 @@ def extract_text(html, guess_punct_space=True):
for punctuation. This has a slight (around 10%) performance overhead
and is just a heuristic.

When guess_page_layout is True (default is False), a newline is added
before and after NEWLINE_TAGS and two newlines are added before and after
Contributor:

I think it is more precise to say newline_tags instead of NEWLINE_TAGS - "a newline is added before and after newline_tags", and remove a note below ("NEWLINE_TAGS and DOUBLE_NEWLINE_TAGS can be customized.") - users shouldn't be changing NEWLINE_TAGS, they should be passing newline_tags arguments, e.g.

html_text.extract_text(html, guess_page_layout=True, newline_tags=html_text.NEWLINE_TAGS | {'div'})

^^ maybe we should even provide this example somewhere, adding div may be a common thing

Contributor Author:

I will add an example in the readme, but div is already included in NEWLINE_TAGS. I will use a different tag for clarity :)

Contributor Author:

Or do you want me to add a test?

Contributor:

Hm, that's not a bad idea - let's do both.
