Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add an option to guess page layout (try to preserve some of the formatting) #11

Merged
merged 54 commits into from
Sep 25, 2018

Conversation

kmike
Copy link
Contributor

@kmike kmike commented Sep 21, 2018

This is a follow-up to #9, with minor tweaks.

@kmike kmike changed the title Add guess page layout Add an option to guess page layout (try to preserve some of the formatting) Sep 21, 2018
@codecov-io
Copy link

codecov-io commented Sep 21, 2018

Codecov Report

Merging #11 into master will decrease coverage by 2.17%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master      #11      +/-   ##
==========================================
- Coverage     100%   97.82%   -2.18%     
==========================================
  Files           2        2              
  Lines          42       92      +50     
  Branches        6       17      +11     
==========================================
+ Hits           42       90      +48     
- Misses          0        2       +2
Impacted Files Coverage Δ
html_text/html_text.py 97.8% <100%> (-2.2%) ⬇️
html_text/__init__.py 100% <100%> (ø) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 4836865...732c87d. Read the comment docs.

* prev is always a string now, never a 1-element list
* unify newline and text handling between text and tail
* another workaround for mutable variable in the outer scope (Context class)
* append to a list instead of using a generator
@kmike
Copy link
Contributor Author

kmike commented Sep 24, 2018

hey @Kebniss! The refactoring is ready; could you please run your benchmark on it, to check that it is not much slower?

@Kebniss
Copy link
Contributor

Kebniss commented Sep 25, 2018

Performance now is better than before in 2/3 cases :)

  • extract_text(html, guess_layout=True, guess_punct_space=True):
    • avg extraction time per html: 0.00693ms
    • fastest: 8e-05ms
    • slowest: 0.06098ms
    • ~10% faster than previous PR
  • html_text.selector_to_text(sel, guess_punct_space=True, guess_layout=True) not including time to create the selector:
    • avg extraction time per html: 0.00406ms
    • fastest: 3e-05ms
    • slowest: 0.06513ms
    • ~10% slower than previous PR
  • html_text.selector_to_text(sel, guess_punct_space=True, guess_layout=True) including time to create the selector:
    • avg extraction time per html: 0.01195
    • fastest: 0.00013ms
    • slowest: 0.10915ms
    • ~13% faster than previous PR

@kmike kmike requested a review from lopuhin September 25, 2018 08:50
Copy link
Contributor

@lopuhin lopuhin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is a great feature to add, and I like the docs update.
I think it's fine to enable guess_layout by default (or at least not take backwards compatibility into account when deciding whether to leave it on or off).
To be honest I didn't quite get some details in the algorithm which handles newlines, I left a question, and will read into it again.

Two functions that do it are ``html_text.cleaned_selector`` and
``html_text.selector_to_text``:
NB Selectors are not cleaned automatically you need to call
``html_text.cleaned_selector`` first.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wow, I didn't realize this, 👍

------------------

* ``guess_layout`` option to to make extracted text look more like how
it is rendered in browser.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shall we enable it by default? There is no speed hit and the text looks nicer (almost always).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I also wonder about this. Let's enable it by default.


class Context:
""" workaround for missing `nonlocal` in Python 2 """
prev = '\n\n'
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can chunks[-1] be used instead? Is this added to avoid checking the case when chunks is empty? Edit: actually I see it's not always equal, please see next comment and sorry for any misunderstanding on my side.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I actually tried it, and was unable to make it work, at least quick enough :)

return
space = get_space_between(text, context.prev)
chunks.extend([space, text])
context.prev = text_content
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What if text_content happened to be \n\n or \n? Probably I'm completely missing how this works, but I though that either context.prev should be in 3 states: \n, \n\n or something else, or it must be equal to chunks[-1] (or even last chars of ''.join(chunks) to account for the case when two last chunks are \n).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If text_content is \n\n or \n, then text is empty, so function returns earlier. Can't say I understood this before you asked, that's a great question! Probably it makes sense to have the logic more explicit, though I'm not sure how.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the explanation! Yes, I see that such message won't reach this code part, thanks!

Probably it makes sense to have the logic more explicit, though I'm not sure how.

I see two ways (not sure how much of an improvement they are):

  • have an integer prev_newlines variable, which can have values 0 (instead of prev = text_content), 1 (instead of \n), and 2 (instead of \n\n)
  • have a enum with 3 values

But putting a comment that explains that text_content here contains some text and can not be equal to newlines is also fine by me.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

great suggestion @lopuhin, thanks! Did something along these lines here: 7a1b57b

Copy link
Contributor

@lopuhin lopuhin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great, thanks @Kebniss and @kmike !

@kmike kmike merged commit d8666c4 into master Sep 25, 2018
@kmike
Copy link
Contributor Author

kmike commented Sep 25, 2018

Thanks @Kebniss for the implementation and benchmarks, and @lopuhin for the review!

@kmike kmike mentioned this pull request Sep 25, 2018
@kmike kmike deleted the add-guess-page-layout branch September 25, 2018 15:28
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants