
Add guess page layout #9

Merged · 40 commits · Sep 25, 2018

Conversation

@Kebniss (Contributor) commented Aug 28, 2018

I tested the new extraction algorithm on 1200 HTML documents and timed extraction performance:

  • extract_text(html, guess_page_layout=True, guess_punct_space=True):
    • avg extraction time per html: 0.00769ms
    • fastest: 0.00006ms
    • slowest: 0.07818ms
  • html_text.selector_to_text(sel, guess_punct_space=True, guess_page_layout=True):
    • avg extraction time per html: 0.01368ms
    • fastest: 0.00012ms
    • slowest: 0.12133ms

Did the same on the old extraction algorithm:

  • extract_text(html, guess_punct_space=True):
    • avg extraction time per html: 0.00941ms
    • fastest: 0.00011ms
    • slowest: 0.15546ms
  • text = html_text.selector_to_text(sel, guess_punct_space=True):
    • avg extraction time per html: 0.01445ms
    • fastest: 0.00016ms
    • slowest: 0.12790ms

HTML extraction speed improved by ~20%, while selector extraction improved by ~6%.
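For reference, figures like the ones above can be gathered with a small timing harness along these lines (a sketch: the tag-stripping function below is a hypothetical stand-in for html_text.extract_text, used so the snippet is self-contained):

```python
import re
import time

def benchmark(fn, documents):
    """Time fn over each document; return (avg, fastest, slowest) in seconds."""
    times = []
    for doc in documents:
        start = time.perf_counter()
        fn(doc)
        times.append(time.perf_counter() - start)
    return sum(times) / len(times), min(times), max(times)

# Hypothetical stand-in extractor: strip tags, then collapse whitespace.
def strip_tags(html):
    return re.sub(r'\s+', ' ', re.sub(r'<[^>]+>', ' ', html)).strip()

avg, fastest, slowest = benchmark(strip_tags, ['<p>Hello</p> world!'] * 100)
print('avg: %.5fs fastest: %.5fs slowest: %.5fs' % (avg, fastest, slowest))
```

In the actual benchmark, html_text.extract_text (with the relevant keyword arguments) would replace strip_tags.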

TODO:

  • add multiple newlines
  • add backward compatibility with selectors
  • handle more tags if guess_page_layout=True
  • add tests on real webpages with guess_page_layout=True
  • handle nested tags without text when guess_page_layout=True
  • update readme
  • add performance evaluation

@Kebniss (Contributor, Author) commented Aug 28, 2018

The current implementation is ~20% faster than the previous one. I tested extraction on 5 different webpages with a lot of text (articles or forums); the average time for the old approach using selectors is 0.012s, while for the new one it is 0.010s.

@kmike (Contributor) commented Aug 30, 2018

Hey @Kebniss! Discussion about punctuation handling is here: #2.

  1. It shouldn't depend on the guess_punct_space = False option, because this is how HTML works: if there are several whitespaces or newlines inside an element, they're collapsed to one. There are some exceptions (<pre> elements, or overridden CSS styles), but the general rule is to collapse. See https://developer.mozilla.org/en-US/docs/Web/CSS/white-space.

  2. Check

    def extract_text(html, guess_punct_space=True):

    for the motivation. This heuristic produces better text in practice. Special handling of punctuation is added to make it work more reliably.

  3. We're maintaining whitespace because browsers maintain it.

So for me it looks like all failures are real.
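The collapsing rule from point 1 can be illustrated with a minimal sketch (plain Python, not html-text's actual implementation):

```python
import re

def collapse_whitespace(text):
    # Per CSS `white-space: normal`, runs of spaces, tabs and newlines
    # inside an element collapse to a single space when rendered.
    return re.sub(r'\s+', ' ', text)

print(collapse_whitespace('Hello   \n\n  world'))  # 'Hello world'
```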

@@ -47,45 +49,71 @@ def parse_html(html):
_has_punct_before = re.compile(r'\($').search


def selector_to_text(sel, guess_punct_space=True):
Contributor:

This function was useful - the main use case is to extract text from a part of a web page, finding this part using Scrapy or parsel.

Contributor:

You can get a parsed tree for a selector using sel.root

Contributor Author:

OK, so you want to create a selector in extract_text(html) and then apply traverse_text_fragments on sel.root, correct?

Contributor:

if selector_to_text is supported, cleaned_selector is also nice to have

Contributor:

@Kebniss no, extract_text doesn't need to use Selector, it is an additional overhead. The idea is to be backwards compatible and provide the same feature for Selector; internally it can work the other way around - likely selector_to_text should pass sel.root to html_to_text.

Contributor:

Can't we have a tail_text argument in html_to_text? When it is False, tail extraction is skipped, but only at the top level (i.e. text is still extracted from children's tails). It can be False by default - I don't see why anyone would want to extract text from the tail of the root element.

Contributor:

Alternatively, we can just not extract text from element tail by default, on the top level (i.e. children should have tail text processed as usual).

In the common case (root <html> element) there shouldn't be any text in the tail. And when the user passes another element explicitly, extracting text from the element's tail is likely undesirable - it is the same issue as with Selectors.
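The text/tail distinction being discussed here is the one from lxml's (and ElementTree's) data model; a quick stdlib illustration:

```python
from xml.etree import ElementTree as ET

root = ET.fromstring('<div><p>inside</p>after the tag</div>')
p = root.find('p')
# .text is the text inside the element; .tail is the text between this
# element's closing tag and the next sibling (or the parent's closing tag).
print(p.text)  # 'inside'
print(p.tail)  # 'after the tag'
```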

Contributor Author:

You are right - tail text is outside the selected nodes, and as such it should not be extracted. Not extracting it by default seems reasonable. I will add the root object as an argument to check when the recursive call is processing it.

@kmike (Contributor) commented Sep 6, 2018:

Why do you need a root object? Can't it just be a boolean flag process_tail in some internal function?

Contributor Author:

Yes, the root node is unnecessary. I added a depth argument so that we know when the recursion is back at the root and tail text is not extracted there.
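A minimal sketch of the depth-flag idea (stdlib ElementTree stands in for lxml here, and the function name is hypothetical, not html-text's actual code):

```python
from xml.etree import ElementTree as ET

def extract(node, depth=0):
    # Collect the node's text, recurse into children, and take tail text
    # only for non-root nodes (depth > 0) -- the depth flag in action.
    parts = []
    if node.text:
        parts.append(node.text.strip())
    for child in node:
        parts.append(extract(child, depth + 1))
    if depth > 0 and node.tail:
        parts.append(node.tail.strip())
    return ' '.join(p for p in parts if p)

root = ET.fromstring('<div><p>first</p>tail of p</div>')
print(extract(root))  # 'first tail of p'
```

The child's tail ("tail of p") is kept because it lies inside the selected root, while the root's own tail would be skipped.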

def html_to_text(tree, guess_punct_space=True, guess_page_layout=False):
""" Convert a cleaned html tree to text.
See html_text.extract_text docstring for description of the approach
and options.
Contributor:

Formatting is off - text in the docstring is indented too much (not per PEP 8).

and not _has_punct_before(prev)
)
)
)
@kmike (Contributor) commented Aug 30, 2018:

I think it can be more readable as several ifs, e.g.

if prev is None:
    return False
if _has_trailing_whitespace(prev):
    return False
if _has_punct_after(text) or _has_punct_before(prev):
    return False
return True

or when converted to an expression with or, not and:

return not(
    prev is None 
    or _has_trailing_whitespace(prev) 
    or _has_punct_after(text)
    or _has_punct_before(prev)
)
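Put together with the helper regexes, the refactored check could look like this (only _has_punct_before appears in the diff above; the other two patterns are assumptions for illustration):

```python
import re

_has_trailing_whitespace = re.compile(r'\s$').search
_has_punct_after = re.compile(r'^[,:;.!?")]').search  # assumed pattern
_has_punct_before = re.compile(r'\($').search

def should_add_space(prev, text):
    """Return True if a space should be inserted between prev and text."""
    return not (
        prev is None
        or _has_trailing_whitespace(prev)
        or _has_punct_after(text)
        or _has_punct_before(prev)
    )

print(should_add_space('Hello', 'world'))    # True
print(should_add_space('Hello ', 'world'))   # False: trailing whitespace
print(should_add_space('Hello', ', world'))  # False: leading punctuation
```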

Contributor Author:

👍

@codecov-io commented Aug 30, 2018

Codecov Report

Merging #9 into master will decrease coverage by 1.94%.
The diff coverage is 98.73%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master       #9      +/-   ##
==========================================
- Coverage     100%   98.05%   -1.95%     
==========================================
  Files           2        2              
  Lines          42      103      +61     
  Branches        6       28      +22     
==========================================
+ Hits           42      101      +59     
- Misses          0        2       +2
Impacted Files Coverage Δ
html_text/__init__.py 100% <100%> (ø) ⬆️
html_text/html_text.py 98.03% <98.71%> (-1.97%) ⬇️

Continue to review full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 4836865...05b979a.


def html_to_text(tree, guess_punct_space=True, guess_page_layout=False):
Contributor:

I think we should keep this function private - rename it to _html_to_text and remove it from __init__ exports. The API becomes rather confusing otherwise: there are html_text.extract_text and html_text.html_to_text, which do almost the same thing - extract_text supports html as a string in addition to lxml trees, and also does cleaning on its own; html_to_text only works on lxml trees and does not do cleaning.

If allowing to pass an already cleaned tree is important, we can add an argument to extract_text - though this can be done later.

Contributor Author:

I think the only reason for not wanting the html to be cleaned is tied to performance. Let's just make _html_to_text private and later I can check performance with and without cleaning to see if there is a difference

Contributor:

@Kebniss right, though let's not worry about adding clean=True argument now. Main use cases are

  • user has an already cleaned tree, and wants to save some time by avoiding cleaning again;
  • user wants to change cleaning options.

@@ -85,7 +147,7 @@ def cleaned_selector(html):
return sel


-def extract_text(html, guess_punct_space=True):
+def extract_text(html, guess_punct_space=True, guess_page_layout=False, new=True):
Contributor:

The new argument is unused and undocumented.

@@ -98,5 +160,7 @@ def extract_text(html, guess_punct_space=True):

html should be a unicode string or an already parsed lxml.html element.
Contributor:

guess_page_layout argument should be documented.

@@ -7,6 +7,9 @@
import parsel


NEWLINE_TAGS = ['li', 'dd', 'dt', 'dl', 'ul', 'ol']
DOUBLE_NEWLINE_TAGS = ['title', 'p', 'h1', 'h2', 'h3', 'h4', 'h5', 'h6']
@kmike (Contributor) commented Sep 6, 2018:

These lists are used for lookups; even though they're short, I think it is cleaner and faster to have them as sets.

In [1]: x = ['title', 'p', 'h1', 'h2', 'h3', 'h4', 'h5', 'h6']
In [2]: x_set = {'title', 'p', 'h1', 'h2', 'h3', 'h4', 'h5', 'h6'}
In [3]: %timeit 'foo' in x
162 ns ± 0.398 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)
In [4]: %timeit 'foo' in x_set
39.8 ns ± 0.153 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)

^^ lookup in a set is ~4x faster even when the list is that short; not a lot, but why not do that :)

@kmike (Contributor) commented Sep 6, 2018:

It'd also be nice to allow overriding these double_newline_tags and newline_tags in extract_text; these constants would then just be defaults (use frozenset instead of set if you do so). A use case is the following: you want to extract text from a particular website, and know that e.g. div element should add a new line. You then write

extract_text(html, guess_page_layout=True, newline_tags=NEWLINE_TAGS | {'div'})

@@ -7,6 +7,9 @@
import parsel


NEWLINE_TAGS = ['li', 'dd', 'dt', 'dl', 'ul', 'ol']
Contributor:

Shouldn't we add table tags like tr and th as well? Check e.g. http://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html - currently product information is on the same line.

It'd also be nice to add a few realistic tests: a few examples of HTML pages and their text output (in separate files, for readability). Text should be extracted with guess_page_layout=True. I think this would allow us to detect regressions / changes in the output better, and also find cases which are not handled properly.

@kmike (Contributor) commented Sep 6, 2018:

To clarify: which HTML tags are not in these two lists because they shouldn't be there, and which are not there because we're not handling them yet? Which of the tags from https://developer.mozilla.org/en-US/docs/Web/HTML/Element have you checked? Maybe it makes sense to handle more of them?

guess_page_layout=guess_page_layout)
if extracted:
text.append(extracted)
return ' '.join(text)
Contributor Author:

This is to have it work like the previous implementation; however, I think it would make more sense to return a list of the text extracted by each selector. This way the user can decide whether and how to join it. Maybe they need all the texts as separate entities, and that is why they selected specific elements.

@kmike (Contributor) commented Sep 21, 2018:

I agree it is not clear which behavior is better; let's keep the current one, but have this problem in mind.

@lopuhin (Contributor) commented Sep 20, 2018

This might be unrelated, but I wonder if we add a space in place of the <wbr> tag: https://developer.mozilla.org/en-US/docs/Web/HTML/Element/wbr - ideally we shouldn't (currently html-text does add a space).
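One way to handle it, sketched with a plain-text regex (the real fix would likely operate on the lxml tree during cleaning; the function name here is hypothetical):

```python
import re

def drop_wbr(html):
    # <wbr> marks a word-break *opportunity*; it should not become a space.
    # Remove the tag (including any self-closing form) before extraction.
    return re.sub(r'<wbr\s*/?>', '', html, flags=re.I)

print(drop_wbr('super<wbr>califragilistic'))  # 'supercalifragilistic'
```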

@kmike (Contributor) commented Sep 20, 2018

I haven't checked everything, but a question about benchmarks: why is selector_to_text slower than extract_text? I expected it to be the other way around, as selector_to_text shouldn't parse html.

@Kebniss (Contributor, Author) commented Sep 20, 2018

In the benchmark for selectors I included the time to create the selector. Without it, the average execution time for the new implementation is 0.00372, while the old implementation averages 0.00821. In this case this implementation is ~55% faster than master.

@Kebniss (Contributor, Author) commented Sep 20, 2018

This might be unrelated, but I wonder if we add a space at place of <wbr> tag: https://developer.mozilla.org/en-US/docs/Web/HTML/Element/wbr - ideally we shouldn't make it (currently html-text does add a space).

I can add this in a follow-up PR.

@lopuhin (Contributor) commented Sep 20, 2018

I can add this in the following pr

Thanks, that would be great if this does not blow up the scope :)

README.rst Outdated

>>> text = html_text.extract_text(u'<h1>Hello</h1> world!', guess_page_layout=True)
u'Hello
world!'
Contributor:

These examples look off (actually, the previous example was not good either). I think it should look like this if you try it in a Python console:

>>> html_text.extract_text(u'<h1>Hello</h1> world!', guess_page_layout=True)
'Hello\n\nworld!'

Contributor:

The examples below have the same issue: it shouldn't be text = html_text...., just html_text...., otherwise there is no output.

README.rst Outdated
* DOUBLE_NEWLINE_TAGS = frozenset([
'blockquote', 'dl', 'figure', 'h1', 'h2', 'h3', 'h4', 'h5', 'h6', 'ol',
'p', 'pre', 'title', 'ul'
])
Contributor:

I think it makes sense to just say the constants are html_text.NEWLINE_TAGS and html_text.DOUBLE_NEWLINE_TAGS (and maybe expose them at the top level) - copy-pasting these lists here requires maintenance; it is easy to forget to update the README when making a code change.

@@ -96,7 +185,22 @@ def extract_text(html, guess_punct_space=True):
for punctuation. This has a slight (around 10%) performance overhead
and is just a heuristic.

When guess_page_layout is True (default is False), a newline is added
before and after NEWLINE_TAGS and two newlines are added before and after
Contributor:

I think it is more precise to say newline_tags instead of NEWLINE_TAGS - "a newline is added before and after newline_tags", and remove a note below ("NEWLINE_TAGS and DOUBLE_NEWLINE_TAGS can be customized.") - users shouldn't be changing NEWLINE_TAGS, they should be passing newline_tags arguments, e.g.

html_text.extract_text(html, guess_page_layout=True, newline_tags=html_text.NEWLINE_TAGS | {'div'})

^^ maybe we should even provide this example somewhere, adding div may be a common thing

Contributor Author:

I will add an example in the readme, but div is already included in NEWLINE_TAGS. I will use a different tag for clarity :)

Contributor Author:

Or do you want me to add a test?

Contributor:

Hm, that's not a bad idea - let's do both.
