
Add guess page layout #9

Merged (40 commits, Sep 25, 2018)
Changes from 14 commits

Commits (40):
0ae6d24
add first working approach plus debug code
Kebniss Aug 24, 2018
566dc9b
add newline only at the end of selected tags
Kebniss Aug 24, 2018
587e9a7
fix multiple consecutive newlines
Kebniss Aug 27, 2018
6c9d27e
add guess_space = False option
Kebniss Aug 27, 2018
c22f3fa
move add space and newline checks to a function
Kebniss Aug 28, 2018
8a78fc5
add tests guess_page_layout
Kebniss Aug 28, 2018
a783e31
remove old test
Kebniss Aug 29, 2018
cb8dc1c
guess_punct_space = False behavior same as before this PR
Kebniss Aug 30, 2018
fb599bc
fix tests
Kebniss Aug 30, 2018
90e37b7
fixed tests
Kebniss Aug 30, 2018
ae26d29
fix indent and make add_space more readable
Kebniss Aug 30, 2018
bb33d4b
add double newline before and after title, p and h tags
Kebniss Aug 31, 2018
3069a73
by default tail of root node will not be extracted
Kebniss Sep 6, 2018
dd03201
add test
Kebniss Sep 6, 2018
0f2fb2b
fix indentation
Kebniss Sep 7, 2018
e8da507
newline tags as set and extendable, add new features comments, delete…
Kebniss Sep 7, 2018
0b9d139
make html_to_text private, fix its signature
Kebniss Sep 8, 2018
ba7cdc0
add new tags to handle
Kebniss Sep 8, 2018
952d895
handle more tags
Kebniss Sep 10, 2018
9dafbf0
remove cleaning of inline tags
Kebniss Sep 11, 2018
b3229d6
fix bug with multiple newlines
Kebniss Sep 11, 2018
695b458
remove newline
Kebniss Sep 11, 2018
03259b9
add test html without text
Kebniss Sep 11, 2018
cba531f
fix newline + space bug
Kebniss Sep 11, 2018
9811349
add bad punct test
Kebniss Sep 11, 2018
d47138c
add newline
Kebniss Sep 11, 2018
76f9028
add tests on real webpages
Kebniss Sep 11, 2018
05c7702
tests to hopefully make codecov happy
Kebniss Sep 11, 2018
4505e24
remove pathlib import
Kebniss Sep 11, 2018
a27e4c8
fix test
Kebniss Sep 11, 2018
b926c8c
remove space
Kebniss Sep 12, 2018
73f49ad
handle list of selectors
Kebniss Sep 19, 2018
15d22e0
a list of selectors returns a list of texts
Kebniss Sep 19, 2018
8f68b2c
selectors_to_text add to res only if something is extracted
Kebniss Sep 20, 2018
cf02b94
selectors_to_text merge results as in previous implementation
Kebniss Sep 20, 2018
7aec8d2
update readme
Kebniss Sep 20, 2018
7653bf9
update history
Kebniss Sep 20, 2018
4300fe6
update readme
Kebniss Sep 20, 2018
4772061
update readme and add newline personalization tests
Kebniss Sep 20, 2018
05b979a
change documentation
Kebniss Sep 20, 2018
html_text/__init__.py (2 changes: 1 addition & 1 deletion)
@@ -1,3 +1,3 @@
# -*- coding: utf-8 -*-

-from .html_text import extract_text, parse_html, cleaned_selector, selector_to_text
+from .html_text import extract_text, parse_html, html_to_text, cleaned_selector, selector_to_text
html_text/html_text.py (108 changes: 86 additions & 22 deletions)
@@ -7,6 +7,9 @@
import parsel


NEWLINE_TAGS = ['li', 'dd', 'dt', 'dl', 'ul', 'ol']

Contributor:

Shouldn't we add table tags like tr and th as well? Check e.g. http://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html - currently the product information is all on the same line.

It'd also be nice to add a few realistic tests: a few examples of HTML pages and their text output (in separate files, for readability). Text should be extracted with guess_page_layout=True. I think this would allow us to detect regressions / changes in the output better, and also find cases which are not handled properly.
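
For example, a minimal sketch of such a file-based test, assuming hypothetical fixture files tests/test_webpage.html and tests/test_webpage.txt (the file names and the _load helper are illustrations, not part of this PR):

import codecs

from html_text import extract_text

def _load(path):
    with codecs.open(path, encoding='utf8') as f:
        return f.read()

def test_webpage_layout():
    # compare a real page against a hand-checked text rendering
    html = _load('tests/test_webpage.html')
    expected = _load('tests/test_webpage.txt')
    assert extract_text(html, guess_page_layout=True) == expected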

kmike (Contributor), Sep 6, 2018:

To clarify: which HTML tags are not in these two lists because they shouldn't be there, and which are not there because we're not handling them yet? Which of the tags from https://developer.mozilla.org/en-US/docs/Web/HTML/Element have you checked? Maybe it makes sense to handle more of them?

DOUBLE_NEWLINE_TAGS = ['title', 'p', 'h1', 'h2', 'h3', 'h4', 'h5', 'h6']

kmike (Contributor), Sep 6, 2018:

These lists are used for lookups; even though they're short, I think it is cleaner and faster to have them as sets.

In [1]: x = ['title', 'p', 'h1', 'h2', 'h3', 'h4', 'h5', 'h6']
In [2]: x_set = {'title', 'p', 'h1', 'h2', 'h3', 'h4', 'h5', 'h6'}
In [3]: %timeit 'foo' in x
162 ns ± 0.398 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)
In [4]: %timeit 'foo' in x_set
39.8 ns ± 0.153 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)

^^ lookup in a set is ~4x faster even when the list is that short; not a lot, but why not do that :)

kmike (Contributor), Sep 6, 2018:

It'd also be nice to allow overriding these double_newline_tags and newline_tags in extract_text; these constants would then just be defaults (use frozenset instead of set if you do so). A use case is the following: you want to extract text from a particular website, and know that e.g. div element should add a new line. You then write

extract_text(html, guess_page_layout=True, newline_tags=NEWLINE_TAGS | {'div'})
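
Putting the two suggestions together, a rough sketch of what that could look like (the exact parameter names and defaults here are an assumption, not the final API):

NEWLINE_TAGS = frozenset(['li', 'dd', 'dt', 'dl', 'ul', 'ol'])
DOUBLE_NEWLINE_TAGS = frozenset(['title', 'p', 'h1', 'h2', 'h3', 'h4', 'h5', 'h6'])

def extract_text(html, guess_punct_space=True, guess_page_layout=False,
                 newline_tags=NEWLINE_TAGS,
                 double_newline_tags=DOUBLE_NEWLINE_TAGS):
    # frozenset defaults are safe to share between calls; a caller can still
    # pass a custom set, e.g. newline_tags=NEWLINE_TAGS | {'div'}
    ...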


_clean_html = Cleaner(
    scripts=True,
    javascript=False, # onclick attributes are fine
@@ -44,30 +47,89 @@ def parse_html(html):
_whitespace = re.compile(r'\s+')
_has_trailing_whitespace = re.compile(r'\s$').search
_has_punct_after = re.compile(r'^[,:;.!?"\)]').search
-_has_punct_before = re.compile(r'\($').search
+_has_open_bracket_before = re.compile(r'\($').search


def html_to_text(tree, guess_punct_space=True, guess_page_layout=False):

Contributor:

I think we should keep this function private - rename it to _html_to_text and remove it from the __init__ exports. Otherwise the API becomes rather confusing: there are html_text.extract_text and html_text.html_to_text, which do almost the same thing - extract_text supports html as a string in addition to lxml trees, and also does cleaning on its own; html_to_text only works on lxml trees and does no cleaning.

If allowing an already cleaned tree to be passed is important, we can add an argument to extract_text - though this can be done later.

Contributor Author:

I think the only reason for not wanting the HTML to be cleaned is performance. Let's just make html_to_text private (_html_to_text), and later I can check performance with and without cleaning to see if there is a difference.

Contributor:

@Kebniss right, though let's not worry about adding a clean=True argument now. The main use cases are:

  • the user has an already cleaned tree and wants to save some time by avoiding cleaning it again;
  • the user wants to change cleaning options.
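
For later reference, a rough sketch of what such an argument could look like (the clean name, the str check and the use of the proposed _html_to_text name are assumptions, not agreed API):

def extract_text(html, guess_punct_space=True, guess_page_layout=False,
                 clean=True):
    tree = parse_html(html) if isinstance(html, str) else html
    if clean:
        # skipped when the caller passes an already cleaned tree
        tree = _clean_html.clean_html(tree)
    return _html_to_text(tree, guess_punct_space=guess_punct_space,
                         guess_page_layout=guess_page_layout)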

"""
Convert a cleaned html tree to text.
See html_text.extract_text docstring for description of the approach
and options.
"""

def selector_to_text(sel, guess_punct_space=True):

Contributor:

This function was useful - the main use case is to extract text from a part of a web page, finding this part using Scrapy or parsel.

Contributor:

You can get a parsed tree for a selector using sel.root

Contributor Author:

Ok so you want to create a selector in extract_text(html) and then apply traverse_text_fragments on sel.root, correct?

Contributor:

if selector_to_text is supported, cleaned_selector is also nice to have

Contributor:

@Kebniss no, extract_text doesn't need to use Selector, it is an additional overhead. The idea is to be backwards compatible and provide the same feature for Selector; internally it can work the other way around - likely selector_to_text should pass sel.root to html_to_text.
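
A minimal usage sketch of that backwards-compatible path (the sample HTML and the XPath are just examples):

import html_text

html = u'<body><div id="content"><h1>Title</h1><p>First. <a href="#">More</a></p></div>footer</body>'
sel = html_text.cleaned_selector(html)        # parsel.Selector over a cleaned tree
part = sel.xpath('//div[@id="content"]')[0]   # extract text only from this part of the page
print(html_text.selector_to_text(part, guess_page_layout=True))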

Contributor:

Can't we have a tail_text argument in html_to_text? When it is False, tail extraction is skipped, but only at the top level (i.e. text is still extracted from children tails). It can be False by default - I don't see why anyone would want to extract text from the tail of the root element.

Contributor:

Alternatively, we can just not extract text from the element tail by default, at the top level (i.e. children should have their tail text processed as usual).

In the common case (root <html> element) there shouldn't be any text in the tail. And when a user passes another element explicitly, extracting text from the element tail is likely undesirable - it is the same issue as with Selectors.

Contributor Author:

You are right: tail text is outside the selected nodes and as such it should not be extracted. Not extracting it by default seems reasonable. I will add the root object as an argument so we can check when the recursion is processing it.

kmike (Contributor), Sep 6, 2018:

Why do you need a root object? Can't it just be a boolean process_tail flag in some internal function?

Contributor Author:

Yes, the root node is unnecessary. I added a depth argument so that we know when the recursion is back at the root and do not extract the tail there.
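
A tiny illustration of the tail issue (the snippet is only for explanation, not part of the PR):

import lxml.html

tree = lxml.html.fromstring(
    '<html><body><div>inside</div>outside</body></html>')
div = tree.xpath('//div')[0]
print(div.text)   # 'inside'  - text that belongs to the selected <div>
print(div.tail)   # 'outside' - text that sits outside <div>; with depth == 0 it is now skipped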

    def add_space(text, prev):
        if prev is None:
            return False
        if prev == '\n' or prev == '\n\n':
            return False
        if not _has_trailing_whitespace(prev):
            if _has_punct_after(text) or _has_open_bracket_before(prev):
                return False
        return True

    def add_newline(tag, prev):
        if prev is None or prev == '\n\n':
            return ''
        if tag in DOUBLE_NEWLINE_TAGS:
            if prev == '\n':
                return '\n'
            return '\n\n'
        if tag in NEWLINE_TAGS:
            if prev == '\n':
                return ''
            return '\n'
        return ''

    def traverse_text_fragments(tree, prev, depth):
        space = ' '
        newline = ''
        if tree.text:
            text = _whitespace.sub(' ', tree.text.strip())
            if text:
                if guess_page_layout:
                    newline = add_newline(tree.tag, prev[0])
                    if newline:
                        prev[0] = newline
                if guess_punct_space and not add_space(text, prev[0]):
                    space = ''
                yield [newline, space, text]
                prev[0] = tree.text
                space = ' '
                newline = ''

        for child in tree:
            for t in traverse_text_fragments(child, prev, depth+1):
                yield t

        if guess_page_layout:
            newline = add_newline(tree.tag, prev[0])
            if newline:
                prev[0] = newline

        tail = ''
        if tree.tail and depth != 0:
            tail = _whitespace.sub(' ', tree.tail.strip())
            if tail:
                if guess_punct_space and not add_space(tail, prev[0]):
                    space = ''
        if tail:
            yield [newline, space, tail]
            prev[0] = tree.tail
        elif newline:
            yield [newline]

    text = []
    for fragment in traverse_text_fragments(tree, [None], 0):
        text.extend(fragment)
    return ''.join(text).strip()


def selector_to_text(sel, guess_punct_space=True, guess_page_layout=False):
""" Convert a cleaned selector to text.
See html_text.extract_text docstring for description of the approach and options.
"""
-    if guess_punct_space:
-
-        def fragments():
-            prev = None
-            for text in sel.xpath('.//text()').extract():
-                if prev is not None and (_has_trailing_whitespace(prev)
-                                         or (not _has_punct_after(text) and
-                                             not _has_punct_before(prev))):
-                    yield ' '
-                yield text
-                prev = text
-
-        return _whitespace.sub(' ', ''.join(fragments()).strip())
-
-    else:
-        fragments = (x.strip() for x in sel.xpath('.//text()').extract())
-        return _whitespace.sub(' ', ' '.join(x for x in fragments if x))
+    return html_to_text(sel.root, guess_punct_space=guess_punct_space,
+                        guess_page_layout=guess_page_layout)


def cleaned_selector(html):
@@ -85,7 +147,7 @@ def cleaned_selector(html):
    return sel


-def extract_text(html, guess_punct_space=True):
+def extract_text(html, guess_punct_space=True, guess_page_layout=False, new=True):

Contributor:

new argument is unused and undocumented

"""
Convert html to text, cleaning invisible content such as styles.
Almost the same as normalize-space xpath, but this also
Expand All @@ -98,5 +160,7 @@ def extract_text(html, guess_punct_space=True):

    html should be a unicode string or an already parsed lxml.html element.

Contributor:

guess_page_layout argument should be documented.
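
A possible wording for the docstring (just a suggestion):

    guess_page_layout: when True, add a newline before and after elements
    from NEWLINE_TAGS, and an empty line (two newlines) before and after
    elements from DOUBLE_NEWLINE_TAGS, to better preserve the visual layout
    of the page.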

"""
sel = cleaned_selector(html)
return selector_to_text(sel, guess_punct_space=guess_punct_space)
if html is None or len(html) == 0:
return ''
cleaned = _cleaned_html_tree(html)
return html_to_text(cleaned, guess_punct_space=guess_punct_space, guess_page_layout=guess_page_layout,)
tests/test_html_text.py (44 changes: 41 additions & 3 deletions)
@@ -1,11 +1,17 @@
# -*- coding: utf-8 -*-
import pytest
import lxml

-from html_text import extract_text, parse_html, cleaned_selector, selector_to_text
+from html_text import (extract_text, html_to_text, parse_html, parse_html,
+                       cleaned_selector, selector_to_text)


@pytest.fixture(params=[{'guess_punct_space': True},
-                        {'guess_punct_space': False}])
+                        {'guess_punct_space': False},
+                        {'guess_punct_space': True, 'guess_page_layout': True},
+                        {'guess_punct_space': False, 'guess_page_layout': True}
+                        ])

def all_options(request):
    return request.param

@@ -48,10 +54,42 @@ def test_punct_whitespace_preserved():
    assert (extract_text(html, guess_punct_space=True) ==
            u'по ле, and , more ! now a (boo)')


def test_selector(all_options):
    html = '<div><div id="extract-me">text<div>more</div></div>and more text</div>'
    sel = cleaned_selector(html)
    assert selector_to_text(sel, **all_options) == 'text more and more text'
    subsel = sel.xpath('//div[@id="extract-me"]')[0]
    assert selector_to_text(subsel, **all_options) == 'text more'


def test_html_to_text():
    html = (u'<title> title </title><div>text_1.<p>text_2 text_3</p><ul>'
            '<li>text_4</li><li>text_5</li></ul><p>text_6<em>text_7</em>'
            'text_8</p>text_9</div><p>...text_10</p>')

    parser = lxml.html.HTMLParser(encoding='utf8')
    tree = lxml.html.fromstring(html.encode('utf8'), parser=parser)

    assert (html_to_text(tree, guess_punct_space=False) ==
            ('title text_1. text_2 text_3 text_4 text_5'
             ' text_6 text_7 text_8 text_9 ...text_10'))
    assert (html_to_text(tree, guess_punct_space=False, guess_page_layout=True) ==
            ('title\n\n text_1.\n\n text_2 text_3\n\n text_4\n text_5'
             '\n\n text_6 text_7 text_8\n\n text_9\n\n ...text_10'))
    assert (html_to_text(tree, guess_punct_space=True) ==
            ('title text_1. text_2 text_3 text_4 text_5'
             ' text_6 text_7 text_8 text_9...text_10'))
    assert (html_to_text(tree, guess_punct_space=True, guess_page_layout=True) ==
            ('title\n\ntext_1.\n\ntext_2 text_3\n\ntext_4\ntext_5'
             '\n\ntext_6 text_7 text_8\n\ntext_9\n\n...text_10'))

def test_guess_page_layout():
    html = (u'<title> title </title><div>text_1.<p>text_2 text_3</p>'
            '<p id="demo"></p><ul><li>text_4</li><li>text_5</li></ul>'
            '<p>text_6<em>text_7</em>text_8</p>text_9</div>'
            '<script>document.getElementById("demo").innerHTML = '
            '"This should be skipped";</script> <p>...text_10</p>'
            )
    assert (extract_text(html, guess_punct_space=True, guess_page_layout=True) ==
            ('title\n\ntext_1.\n\ntext_2 text_3\n\ntext_4\ntext_5'
             '\n\ntext_6 text_7 text_8\n\ntext_9\n\n...text_10'))