Add an option to guess page layout (try to preserve some of the formatting) #11

kmike · 2018-09-21T17:54:33Z

This is a follow-up to #9, with minor tweaks.

codecov-io · 2018-09-21T17:56:26Z

Codecov Report

Merging #11 into master will decrease coverage by 2.17%.
The diff coverage is 100%.

@@            Coverage Diff             @@
##           master      #11      +/-   ##
==========================================
- Coverage     100%   97.82%   -2.18%     
==========================================
  Files           2        2              
  Lines          42       92      +50     
  Branches        6       17      +11     
==========================================
+ Hits           42       90      +48     
- Misses          0        2       +2

| Impacted Files | Coverage Δ | |
|---|---|---|
| html_text/html_text.py | `97.8% <100%> (-2.2%)` | ⬇️ |
| html_text/__init__.py | `100% <100%> (ø)` | ⬆️ |

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 4836865...732c87d.

kmike · 2018-09-24T17:50:48Z

hey @Kebniss! The refactoring is ready; could you please run your benchmark on it, to check that it is not much slower?

Kebniss · 2018-09-25T07:25:12Z

Performance now is better than before in 2/3 cases :)

extract_text(html, guess_layout=True, guess_punct_space=True):
- avg extraction time per html: 0.00693ms
- fastest: 8e-05ms
- slowest: 0.06098ms
- ~10% faster than previous PR
html_text.selector_to_text(sel, guess_punct_space=True, guess_layout=True) not including time to create the selector:
- avg extraction time per html: 0.00406ms
- fastest: 3e-05ms
- slowest: 0.06513ms
- ~10% slower than previous PR
html_text.selector_to_text(sel, guess_punct_space=True, guess_layout=True) including time to create the selector:
- avg extraction time per html: 0.01195ms
- fastest: 0.00013ms
- slowest: 0.10915ms
- ~13% faster than previous PR
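
For anyone who wants to reproduce numbers of this shape, here is a minimal sketch of a timing harness that reports average, fastest and slowest per-document time in milliseconds. It is not the benchmark script used above (that script is not part of this PR). `benchmark` and `fake_extract` are hypothetical names; `fake_extract` is only a stand-in so the sketch runs without extra dependencies, and in practice one would pass something like `lambda html: html_text.extract_text(html, guess_layout=True, guess_punct_space=True)` instead.

```python
import time


def benchmark(extract, documents, repeat=3):
    """Return (avg, fastest, slowest) per-document time in milliseconds.

    For each document the best of `repeat` runs is kept, which reduces
    noise from unrelated system activity.
    """
    timings_ms = []
    for doc in documents:
        best = float("inf")
        for _ in range(repeat):
            start = time.perf_counter()
            extract(doc)
            elapsed_ms = (time.perf_counter() - start) * 1000.0
            best = min(best, elapsed_ms)
        timings_ms.append(best)
    avg = sum(timings_ms) / len(timings_ms)
    return avg, min(timings_ms), max(timings_ms)


def fake_extract(html):
    # Hypothetical stand-in for the real extractor: it only normalizes
    # whitespace, so the timings say nothing about html_text itself.
    return " ".join(html.split())


docs = ["<p>Hello world</p>" * n for n in (1, 10, 100)]
avg, fastest, slowest = benchmark(fake_extract, docs)
print("avg: %.5fms, fastest: %.5fms, slowest: %.5fms" % (avg, fastest, slowest))
```

Timing the selector-based variant with and without selector creation, as in the second and third groups above, would mean either building the selectors before calling `benchmark` or building them inside the function passed as `extract`.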

lopuhin

I think this is a great feature to add, and I like the docs update.
I think it's fine to enable guess_layout by default (or at least not take backwards compatibility into account when deciding whether to leave it on or off).
To be honest, I didn't quite follow some details of the algorithm that handles newlines; I left a question and will read through it again.

lopuhin · 2018-09-25T11:47:20Z

README.rst

-Two functions that do it are ``html_text.cleaned_selector`` and
-``html_text.selector_to_text``:
+NB Selectors are not cleaned automatically you need to call
+``html_text.cleaned_selector`` first.


Wow, I didn't realize this, 👍

lopuhin · 2018-09-25T11:47:35Z

CHANGES.rst

+------------------
+
+* ``guess_layout`` option to to make extracted text look more like how
+  it is rendered in browser.


Shall we enable it by default? There is no speed hit and the text looks nicer (almost always).

Yes, I also wonder about this. Let's enable it by default.

lopuhin · 2018-09-25T11:54:31Z

html_text/html_text.py

+
+    class Context:
+        """ workaround for missing `nonlocal` in Python 2 """
+        prev = '\n\n'


Can chunks[-1] be used instead? Is this added to avoid checking the case when chunks is empty? Edit: actually I see it's not always equal, please see next comment and sorry for any misunderstanding on my side.

Yeah, I actually tried it, and was unable to make it work, at least quickly enough :)
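
For readers unfamiliar with the pattern in the diff above, here is a minimal sketch of why a small holder class works as a substitute for `nonlocal`. It is a toy example, not the code from `html_text.py`: `collect_chunks` and `add` are hypothetical names, and the sketch uses a class attribute directly, which may differ from how the PR instantiates its `Context`.

```python
def collect_chunks(pieces):
    """Append pieces to a list, skipping immediate repeats."""
    chunks = []

    class Context:
        """Mutable holder; stands in for `nonlocal` on Python 2."""
        prev = None

    def add(piece):
        # Rebinding a plain outer variable here would need `nonlocal`
        # (Python 3 only). Setting an attribute mutates the shared class
        # object instead, so it works on both Python 2 and Python 3.
        if piece == Context.prev:
            return
        chunks.append(piece)
        Context.prev = piece

    for piece in pieces:
        add(piece)
    return chunks


print(collect_chunks(["a", "a", "b", "b", "a"]))  # ['a', 'b', 'a']
```

Note that `Context.prev` tracks the last piece seen, which is not necessarily the same as `chunks[-1]` once separators are also appended to `chunks`; that difference is what the question above is about.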

lopuhin · 2018-09-25T11:56:47Z

html_text/html_text.py

+            return
+        space = get_space_between(text, context.prev)
+        chunks.extend([space, text])
+        context.prev = text_content


What if text_content happened to be \n\n or \n? I'm probably missing how this works, but I thought that either context.prev should be in one of 3 states (\n, \n\n, or something else), or it must be equal to chunks[-1] (or even to the last characters of ''.join(chunks), to account for the case when the last two chunks are \n).

If text_content is \n\n or \n, then text is empty, so the function returns early. Can't say I understood this before you asked, that's a great question! Probably it makes sense to have the logic more explicit, though I'm not sure how.

Thanks for the explanation! Yes, I see that such a value won't reach this part of the code.

> Probably it makes sense to have the logic more explicit, though I'm not sure how.

I see two ways (not sure how much of an improvement they are):

- have an integer prev_newlines variable, which can have values 0 (instead of prev = text_content), 1 (instead of \n), and 2 (instead of \n\n)
- have an enum with 3 values

But putting a comment that explains that text_content here contains some text and can not be equal to newlines is also fine by me.
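
To make the first option concrete, here is a rough sketch of what an integer `prev_newlines` state could look like. This is only an illustration of the idea, not the code that was merged: `join_with_newlines`, `NEWLINE` and `DOUBLE_NEWLINE` are hypothetical names, and the input is simplified to a flat list of text pieces and newline requests.

```python
NEWLINE = 1         # request: at least one "\n" before the next text
DOUBLE_NEWLINE = 2  # request: at least a blank line before the next text


def join_with_newlines(items):
    """Join text pieces, collapsing consecutive newline requests."""
    chunks = []
    # Start as if a blank line precedes the document, so that leading
    # newline requests produce no output.
    prev_newlines = 2
    for item in items:
        if isinstance(item, int):
            # Emit only the newlines that are still missing.
            missing = item - prev_newlines
            if missing > 0:
                chunks.append("\n" * missing)
                prev_newlines = item
        else:
            text = item.strip()
            if not text:
                # Whitespace-only text never changes the state.
                continue
            if prev_newlines == 0:
                chunks.append(" ")
            chunks.append(text)
            prev_newlines = 0
    return "".join(chunks).rstrip("\n")


items = ["Title", DOUBLE_NEWLINE, NEWLINE, "first", "second",
         NEWLINE, DOUBLE_NEWLINE, "  ", "end", NEWLINE]
print(repr(join_with_newlines(items)))  # 'Title\n\nfirst second\n\nend'
```

With the state kept as 0, 1 or 2, the question of whether the previous value could itself be a newline string coming from element text does not arise, because text and newline requests are different types.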

great suggestion @lopuhin, thanks! Did something along these lines here: 7a1b57b

lopuhin

Looks great, thanks @Kebniss and @kmike !

kmike · 2018-09-25T15:10:20Z

Thanks @Kebniss for the implementation and benchmarks, and @lopuhin for the review!

Kebniss added 30 commits August 24, 2018 14:48

add first working approach plus debug code

0ae6d24

add newline only at the end of selected tags

566dc9b

fix multiple consecutive newlines

587e9a7

add guess_space = False option

6c9d27e

move add space and newline checks to a function

c22f3fa

add tests guess_page_layout

8a78fc5

remove old test

a783e31

guess_punct_space = False behavior same as before this PR

cb8dc1c

fix tests

fb599bc

fixed tests

90e37b7

fix indent and make add_space more readable

ae26d29

add double newline before and after title, p and h tags

bb33d4b

by default tail of root node will not be extracted

3069a73

add test

dd03201

fix indentation

0f2fb2b

newline tags as set and extendable, add new features comments, delete…

e8da507

… new argument

make html_to_text private, fix its signature

0b9d139

add new tags to handle

ba7cdc0

handle more tags

952d895

remove cleaning of inline tags

9dafbf0

fix bug with multiple newlines

b3229d6

remove newline

695b458

add test html without text

03259b9

fix newline + space bug

cba531f

add bad punct test

9811349

add newline

d47138c

add tests on real webpages

76f9028

tests to hopefully make codecov happy

05c7702

remove pathlib import

4505e24

fix test

a27e4c8

Kebniss and others added 8 commits September 19, 2018 17:11

update readme

7aec8d2

update history

7653bf9

update readme

4300fe6

update readme and add newline personalization tests

4772061

change documentation

05b979a

DOC cleanup README

ad95bff

DOC cleanup function docstring

59d2d54

revert formatting change

51947d4

kmike changed the title ~~Add guess page layout~~ Add an option to guess page layout (try to preserve some of the formatting) Sep 21, 2018

kmike added 3 commits September 21, 2018 23:39

minor cleanup

1370647

add pytest files to gitignore

ab3f776

refactor _html_to_text function for readability:

22a7fa1

* prev is always a string now, never a 1-element list
* unify newline and text handling between text and tail
* another workaround for mutable variable in the outer scope (Context class)
* append to a list instead of using a generator

kmike added 2 commits September 24, 2018 23:37

bikeshedding: rename guess_page_layout option to guess_layout, for br…

e161e92

…evity

TST mark test as xfail, change desired output

8b466f8

guess_punct_space doesn't provide good output in this case => xfail

kmike requested a review from lopuhin September 25, 2018 08:50

lopuhin reviewed Sep 25, 2018

kmike added 4 commits September 25, 2018 18:38

cleanup: comments in unclear places

729e11a

cleanup: remove unnecessary escaping in regex

2973ee0

backwards incompatible: make guess_layout=True by default

607b04a

typo fix in comment

13394ba

lopuhin approved these changes Sep 25, 2018

kmike added 2 commits September 25, 2018 19:50

make it clear "\n" and "\n\n" are constants which can't come from ele…

7a1b57b

…ment text directly

remove PY3-only assert

732c87d

not worths it to use six.string_types

kmike merged commit d8666c4 into master Sep 25, 2018

kmike mentioned this pull request Sep 25, 2018

improve newline handling #5

Closed

kmike deleted the add-guess-page-layout branch September 25, 2018 15:28
