Add guess page layout #9

Merged
merged 40 commits into from
Sep 25, 2018
Changes from 36 commits
0ae6d24
add first working approach plus debug code
Kebniss Aug 24, 2018
566dc9b
add newline only at the end of selected tags
Kebniss Aug 24, 2018
587e9a7
fix multiple consecutive newlines
Kebniss Aug 27, 2018
6c9d27e
add guess_space = False option
Kebniss Aug 27, 2018
c22f3fa
move add space and newline checks to a function
Kebniss Aug 28, 2018
8a78fc5
add tests guess_page_layout
Kebniss Aug 28, 2018
a783e31
remove old test
Kebniss Aug 29, 2018
cb8dc1c
guess_punct_space = False behavior same as before this PR
Kebniss Aug 30, 2018
fb599bc
fix tests
Kebniss Aug 30, 2018
90e37b7
fixed tests
Kebniss Aug 30, 2018
ae26d29
fix indent and make add_space more readable
Kebniss Aug 30, 2018
bb33d4b
add double newline before and after title, p and h tags
Kebniss Aug 31, 2018
3069a73
by default tail of root node will not be extracted
Kebniss Sep 6, 2018
dd03201
add test
Kebniss Sep 6, 2018
0f2fb2b
fix indentation
Kebniss Sep 7, 2018
e8da507
newline tags as set and extendable, add new features comments, delete…
Kebniss Sep 7, 2018
0b9d139
make html_to_text private, fix its signature
Kebniss Sep 8, 2018
ba7cdc0
add new tags to handle
Kebniss Sep 8, 2018
952d895
handle more tags
Kebniss Sep 10, 2018
9dafbf0
remove cleaning of inline tags
Kebniss Sep 11, 2018
b3229d6
fix bug with multiple newlines
Kebniss Sep 11, 2018
695b458
remove newline
Kebniss Sep 11, 2018
03259b9
add test html without text
Kebniss Sep 11, 2018
cba531f
fix newline + space bug
Kebniss Sep 11, 2018
9811349
add bad punct test
Kebniss Sep 11, 2018
d47138c
add newline
Kebniss Sep 11, 2018
76f9028
add tests on real webpages
Kebniss Sep 11, 2018
05c7702
tests to hopefully make codecov happy
Kebniss Sep 11, 2018
4505e24
remove pathlib import
Kebniss Sep 11, 2018
a27e4c8
fix test
Kebniss Sep 11, 2018
b926c8c
remove space
Kebniss Sep 12, 2018
73f49ad
handle list of selectors
Kebniss Sep 19, 2018
15d22e0
a list of selectors returns a list of texts
Kebniss Sep 19, 2018
8f68b2c
selectors_to_text add to res only if something is extracted
Kebniss Sep 20, 2018
cf02b94
selectors_to_text merge results as in previous implementation
Kebniss Sep 20, 2018
7aec8d2
update readme
Kebniss Sep 20, 2018
7653bf9
update history
Kebniss Sep 20, 2018
4300fe6
update readme
Kebniss Sep 20, 2018
4772061
update readme and add newline personalization tests
Kebniss Sep 20, 2018
05b979a
change documentation
Kebniss Sep 20, 2018
48 changes: 41 additions & 7 deletions README.rst
@@ -26,9 +26,10 @@ or ``.get_text()`` from Beautiful Soup?
Text extracted with ``html_text`` does not contain inline styles,
javascript, comments and other text that is not normally visible to the users.
It normalizes whitespace, but is also smarter than ``.xpath('normalize-space())``,
adding spaces around inline elements too
(which are often used as block elements in html markup),
and tries to avoid adding extra spaces for punctuation.
adding spaces around inline elements (which are often used as block
elements in html markup), tries to avoid adding extra spaces for punctuation and
can add newlines so that the output text looks like how it is rendered in
browsers.

Apart from just getting text from the page (e.g. for display or search),
one intended usage of this library is for machine learning (feature extraction).
@@ -56,25 +57,58 @@ Usage
Extract text from HTML::

>>> import html_text
>>> text = html_text.extract_text(u'<h1>Hey</h1>')
u'Hey'
>>> text = html_text.extract_text(u'<h1>Hello</h1> world!')
u'Hello world!'

>>> text = html_text.extract_text(u'<h1>Hello</h1> world!', guess_page_layout=True)
u'Hello
world!'
Contributor

These examples look off (actually, the previous example was not good either). I think it should look like this if you try it in a Python console:

>>> html_text.extract_text(u'<h1>Hello</h1> world!', guess_page_layout=True)
'Hello\n\nworld!'

Contributor

The examples below have the same issue: it shouldn't be text = html_text...., just html_text....; otherwise the console shows no output.
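
The point made in the two comments above can be checked mechanically with the standard library's doctest module. This is only a minimal sketch, not code from the PR: the string concatenation stands in for the html_text.extract_text call, and failed_examples is a hypothetical helper introduced for illustration.

```python
import doctest

# An expression statement: the console echoes its repr, so the expected line matches.
GOOD = r"""
>>> 'Hello' + '\n\n' + 'world!'
'Hello\n\nworld!'
"""

# An assignment: the console prints nothing, so the expected line can never match.
BAD = r"""
>>> text = 'Hello' + '\n\n' + 'world!'
'Hello\n\nworld!'
"""


def failed_examples(source):
    """Run the doctest examples in `source` and return how many of them failed."""
    test = doctest.DocTestParser().get_doctest(source, {}, 'readme', None, 0)
    runner = doctest.DocTestRunner(verbose=False)
    # Swallow the failure report; only the count matters here.
    return runner.run(test, out=lambda message: None).failed


print(failed_examples(GOOD))  # 0
print(failed_examples(BAD))   # 1
```

Running the README through doctest in this way would probably have flagged the examples that use an assignment.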


You can also pass already parsed ``lxml.html.HtmlElement``:

>>> import html_text
>>> tree = html_text.parse_html(u'<h1>Hey</h1>')
>>> tree = html_text.parse_html(u'<h1>Hello</h1> world!')
>>> text = html_text.extract_text(tree)
u'Hey'
u'Hello world!'

Or define a selector to extract text only from specific elements:

>>> import html_text
>>> sel = html_text.cleaned_selector(u'<h1>Hello</h1> world!')
>>> subsel = sel.xpath('//h1')
>>> text = html_text.selector_to_text(subsel)
u'Hello'

Passed html will be first cleaned from invisible non-text content such
as styles, and then text would be extracted.
NB Selectors are not cleaned automatically you need to call
``html_text.cleaned_selector`` first.

Two functions that do it are ``html_text.cleaned_selector`` and
``html_text.selector_to_text``:

* ``html_text.cleaned_selector`` accepts html as text or as ``lxml.html.HtmlElement``,
and returns cleaned ``parsel.Selector``.
* ``html_text.selector_to_text`` accepts ``parsel.Selector`` and returns extracted
text.
* ``html_text.extract_text`` accepts html and returns extracted text.

If ``guess_page_layout`` is True (False by default for backward compatibility),
a newline is added before and after NEWLINE_TAGS and two newlines are added
before and after DOUBLE_NEWLINE_TAGS. This heuristic makes the extracted text
more similar to how it is rendered in the browser.
NEWLINE_TAGS and DOUBLE_NEWLINE_TAGS can be customized, here are the lists of
the tags that are handled by default:

* NEWLINE_TAGS = frozenset([
'article', 'aside', 'br', 'dd', 'details', 'div', 'dt', 'fieldset',
'figcaption', 'footer', 'form', 'header', 'hr', 'legend', 'li', 'main',
'nav', 'table', 'tr'
])
* DOUBLE_NEWLINE_TAGS = frozenset([
'blockquote', 'dl', 'figure', 'h1', 'h2', 'h3', 'h4', 'h5', 'h6', 'ol',
'p', 'pre', 'title', 'ul'
])
Contributor

I think it makes sense to just say the constants are html_text.NEWLINE_TAGS and html_text.DOUBLE_NEWLINE_TAGS (and maybe expose them at the top level). Copy-pasting these lists here requires maintenance: it is easy to forget to update the README when making a code change.
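
For readers who want to see the guess_page_layout heuristic from the README hunk above in isolation, here is a rough sketch. It is not the PR's implementation: it uses the standard library's xml.etree.ElementTree as a stand-in for lxml (so the input must be well-formed), trimmed-down tag sets, and a "strongest pending separator" rule instead of the PR's add_newline bookkeeping. The name layout_text is hypothetical, and punctuation guessing is left out.

```python
import re
import xml.etree.ElementTree as ET

# Trimmed-down stand-ins for the PR's tag sets (assumption: illustration only).
NEWLINE_TAGS = frozenset(['br', 'div', 'li', 'tr'])
DOUBLE_NEWLINE_TAGS = frozenset(['h1', 'p', 'ul'])

_whitespace = re.compile(r'\s+')


def layout_text(markup):
    """Extract text from well-formed markup, separating block tags by newlines."""
    out = []
    pending = ''  # strongest separator requested since the last emitted text

    def request(tag):
        nonlocal pending
        if tag in DOUBLE_NEWLINE_TAGS:
            sep = '\n\n'
        elif tag in NEWLINE_TAGS:
            sep = '\n'
        else:
            sep = ''
        # Adjacent separators collapse into the longest one instead of adding up.
        if len(sep) > len(pending):
            pending = sep

    def emit(raw):
        nonlocal pending
        text = _whitespace.sub(' ', raw).strip()
        if not text:
            return
        if out:
            # Inline neighbours get a plain space (no punctuation guessing here).
            out.append(pending or ' ')
        out.append(text)
        pending = ''

    def walk(node, depth):
        request(node.tag)              # separator before the element
        if node.text:
            emit(node.text)
        for child in node:
            walk(child, depth + 1)
        request(node.tag)              # separator after the element
        if node.tail and depth != 0:   # the root's tail lies outside the selection
            emit(node.tail)

    walk(ET.fromstring(markup), 0)
    return ''.join(out)


print(repr(layout_text('<div><h1>Hello</h1> world!</div>')))  # 'Hello\n\nworld!'
print(repr(layout_text('<ul><li>a</li><li>b</li></ul>')))     # 'a\nb'
```

Because separators are only written when the next piece of text arrives, leading and trailing newlines never reach the output, which matches the stripped result shown in the review comment above.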



Credits
3 changes: 2 additions & 1 deletion html_text/__init__.py
@@ -1,3 +1,4 @@
# -*- coding: utf-8 -*-

from .html_text import extract_text, parse_html, cleaned_selector, selector_to_text
from .html_text import (extract_text, parse_html, cleaned_selector,
selector_to_text)
156 changes: 130 additions & 26 deletions html_text/html_text.py
@@ -6,6 +6,15 @@
from lxml.html.clean import Cleaner
import parsel

NEWLINE_TAGS = frozenset([
'article', 'aside', 'br', 'dd', 'details', 'div', 'dt', 'fieldset',
'figcaption', 'footer', 'form', 'header', 'hr', 'legend', 'li', 'main',
'nav', 'table', 'tr'
])
DOUBLE_NEWLINE_TAGS = frozenset([
'blockquote', 'dl', 'figure', 'h1', 'h2', 'h3', 'h4', 'h5', 'h6', 'ol',
'p', 'pre', 'title', 'ul'
])

_clean_html = Cleaner(
scripts=True,
@@ -44,30 +53,108 @@ def parse_html(html):
_whitespace = re.compile(r'\s+')
_has_trailing_whitespace = re.compile(r'\s$').search
_has_punct_after = re.compile(r'^[,:;.!?"\)]').search
_has_punct_before = re.compile(r'\($').search
_has_open_bracket_before = re.compile(r'\($').search


def selector_to_text(sel, guess_punct_space=True):
Contributor

This function was useful - the main use case is to extract text from a part of a web page, finding this part using Scrapy or parsel.

Contributor

You can get a parsed tree for a selector using sel.root

Contributor Author

Ok so you want to create a selector in extract_text(html) and then apply traverse_text_fragments on sel.root, correct?

Contributor

if selector_to_text is supported, cleaned_selector is also nice to have

Contributor

@Kebniss no, extract_text doesn't need to use Selector, it is an additional overhead. The idea is to be backwards compatible and provide the same feature for Selector; internally it can work the other way around - likely selector_to_text should pass sel.root to html_to_text.

Contributor

Can't we have a tail_text argument in html_to_text? When it is False, tail extraction is skipped, but only on the top level (i.e. text is still extracted from children's tails). It can be False by default - I don't see why anyone would want to extract text from the tail of the root element.

Contributor

Alternatively, we can just not extract text from element tail by default, on the top level (i.e. children should have tail text processed as usual).

In a common case (root <html> element) there shouldn't be any text in the tail. And when the user passes another element explicitly, extracting text from the element's tail is likely undesirable - it is the same issue as with Selectors.

Contributor Author

you are right, tail text is outside selected nodes and as such it should not be extracted. Not extracting it by default seems reasonable. I will add the root object as argument to check when the recursion call is processing it

Contributor

@kmike kmike Sep 6, 2018

Why do you need a root object, can't it just be a boolean flag process_tail in some internal function?

Contributor Author

yy, root node is unnecessary. I added a depth argument so that we know when the recursion is back to root and does not extract tail there
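
To make the tail discussion above concrete, here is a small sketch of the depth-based approach. It is an illustration under assumptions rather than the PR's code: xml.etree.ElementTree from the standard library stands in for lxml (both expose .text and .tail on elements), and iter_text is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET


def iter_text(node, depth=0):
    """Yield text fragments in document order, skipping the tail of the start node."""
    if node.text:
        yield node.text
    for child in node:
        yield from iter_text(child, depth + 1)
    # The tail belongs to the parent's content, so it is only relevant
    # for nodes below the one the caller passed in.
    if node.tail and depth != 0:
        yield node.tail


doc = ET.fromstring('<body><h1>Hello</h1> world!<p>Bye</p></body>')
h1 = doc.find('h1')

print(repr(h1.tail))                # ' world!'
print(repr(''.join(iter_text(h1))))   # 'Hello'
print(repr(''.join(iter_text(doc))))  # 'Hello world!Bye'
```

Starting from the h1 element yields only its own text, while starting from the body still picks up the tail of h1 because that element is then a child rather than the root of the traversal.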

""" Convert a cleaned selector to text.
See html_text.extract_text docstring for description of the approach and options.
def _html_to_text(tree,
guess_punct_space=True,
guess_page_layout=False,
newline_tags=NEWLINE_TAGS,
double_newline_tags=DOUBLE_NEWLINE_TAGS):
"""
Convert a cleaned html tree to text.
See html_text.extract_text docstring for description of the approach
and options.
"""
if guess_punct_space:

def fragments():
prev = None
for text in sel.xpath('.//text()').extract():
if prev is not None and (_has_trailing_whitespace(prev)
or (not _has_punct_after(text) and
not _has_punct_before(prev))):
yield ' '
yield text
prev = text

return _whitespace.sub(' ', ''.join(fragments()).strip())

def add_space(text, prev):
if prev is None:
return False
if prev == '\n' or prev == '\n\n':
return False
if not _has_trailing_whitespace(prev):
if _has_punct_after(text) or _has_open_bracket_before(prev):
return False
return True

def add_newline(tag, prev):
if prev is None or prev == '\n\n':
return '', '\n\n'
if tag in double_newline_tags:
if prev == '\n':
return '\n', '\n\n'
return '\n\n', '\n\n'
if tag in newline_tags:
if prev == '\n':
return '', prev
return '\n', '\n'
return '', prev

def traverse_text_fragments(tree, prev, depth):
space = ' '
newline = ''
text = ''
if guess_page_layout:
newline, prev[0] = add_newline(tree.tag, prev[0])
if tree.text:
text = _whitespace.sub(' ', tree.text.strip())
if text and guess_punct_space and not add_space(text, prev[0]):
space = ''
if text:
yield [newline, space, text]
prev[0] = tree.text
space = ' '
newline = ''
elif newline:
yield [newline]
newline = ''

for child in tree:
for t in traverse_text_fragments(child, prev, depth + 1):
yield t

if guess_page_layout:
newline, prev[0] = add_newline(tree.tag, prev[0])

tail = ''
if tree.tail and depth != 0:
tail = _whitespace.sub(' ', tree.tail.strip())
if tail:
if guess_punct_space and not add_space(tail, prev[0]):
space = ''
if tail:
yield [newline, space, tail]
prev[0] = tree.tail
elif newline:
yield [newline]

text = []
for fragment in traverse_text_fragments(tree, [None], 0):
text.extend(fragment)
return ''.join(text).strip()


def selector_to_text(sel, guess_punct_space=True, guess_page_layout=False):
""" Convert a cleaned selector to text.
See html_text.extract_text docstring for description of the approach
and options.
"""
if isinstance(sel, list):
# if selecting a specific xpath
text = []
for t in sel:
extracted = _html_to_text(
t.root,
guess_punct_space=guess_punct_space,
guess_page_layout=guess_page_layout)
if extracted:
text.append(extracted)
return ' '.join(text)
Contributor Author

This is to have it work as the previous implementation however I think it would make more sense to have it return a list of the text extracted by each selector. This way the user can decide whether and how to join it. Maybe they need all text as separate entities and that's why they want to select specific elements.

Contributor

@kmike kmike Sep 21, 2018

I agree it is not clear which behavior is better; let's keep the current one, but have this problem in mind.
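
The two behaviours weighed in this thread can be compared side by side. The sketch below works on already-extracted strings only; merge_texts and its as_list flag are hypothetical and are not part of the PR, which keeps the joined-string behaviour.

```python
def merge_texts(texts, as_list=False):
    """Combine per-selector texts, dropping selectors that produced nothing."""
    non_empty = [text for text in texts if text]
    if as_list:
        # Alternative raised in the thread: let the caller decide how to join.
        return non_empty
    # Behaviour kept by the PR: a single string joined with spaces.
    return ' '.join(non_empty)


extracted = ['Hello', '', 'world!']
print(repr(merge_texts(extracted)))           # 'Hello world!'
print(merge_texts(extracted, as_list=True))   # ['Hello', 'world!']
```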

else:
fragments = (x.strip() for x in sel.xpath('.//text()').extract())
return _whitespace.sub(' ', ' '.join(x for x in fragments if x))
return _html_to_text(
sel.root,
guess_punct_space=guess_punct_space,
guess_page_layout=guess_page_layout)


def cleaned_selector(html):
@@ -76,16 +163,18 @@ def cleaned_selector(html):
try:
tree = _cleaned_html_tree(html)
sel = parsel.Selector(root=tree, type='html')
except (lxml.etree.XMLSyntaxError,
lxml.etree.ParseError,
lxml.etree.ParserError,
UnicodeEncodeError):
except (lxml.etree.XMLSyntaxError, lxml.etree.ParseError,
lxml.etree.ParserError, UnicodeEncodeError):
kmike marked this conversation as resolved.
# likely plain text
sel = parsel.Selector(html)
return sel


def extract_text(html, guess_punct_space=True):
def extract_text(html,
guess_punct_space=True,
guess_page_layout=False,
newline_tags=NEWLINE_TAGS,
double_newline_tags=DOUBLE_NEWLINE_TAGS):
"""
Convert html to text, cleaning invisible content such as styles.
Almost the same as normalize-space xpath, but this also
@@ -96,7 +185,22 @@ def extract_text(html, guess_punct_space=True):
for punctuation. This has a slight (around 10%) performance overhead
and is just a heuristic.

When guess_page_layout is True (default is False), a newline is added
before and after NEWLINE_TAGS and two newlines are added before and after
Contributor

I think it is more precise to say newline_tags instead of NEWLINE_TAGS - "a newline is added before and after newline_tags", and remove a note below ("NEWLINE_TAGS and DOUBLE_NEWLINE_TAGS can be customized.") - users shouldn't be changing NEWLINE_TAGS, they should be passing newline_tags arguments, e.g.

html_text.extract_text(html, guess_page_layout=True, newline_tags=html_text.NEWLINE_TAGS | {'div'})

^^ maybe we should even provide this example somewhere, adding div may be a common thing

Contributor Author

I will add an example in the readme but div is already included in NEWLINE_TAGS. I will use a different tag for clarity :)

Contributor Author

Or do you want me to add a test?

Contributor

Hm, that's not a bad idea - let's do both.
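
The customization pattern discussed in this thread relies on frozenset union, which can be tried without html_text installed. In the sketch below the default set is copied from this PR's html_text.py, and 'span' is an arbitrary tag chosen for illustration since 'div' is already in the defaults; the README example eventually added may well use a different tag.

```python
# Copy of the default set from this PR's html_text.py, repeated here only so
# the snippet runs without html_text installed.
NEWLINE_TAGS = frozenset([
    'article', 'aside', 'br', 'dd', 'details', 'div', 'dt', 'fieldset',
    'figcaption', 'footer', 'form', 'header', 'hr', 'legend', 'li', 'main',
    'nav', 'table', 'tr'
])

# `|` builds a new frozenset; the module-level default is left untouched.
custom_newline_tags = NEWLINE_TAGS | {'span'}

print('span' in custom_newline_tags)  # True
print('span' in NEWLINE_TAGS)         # False

# With html_text installed, the set would be passed per call, as suggested above:
# html_text.extract_text(html, guess_page_layout=True,
#                        newline_tags=custom_newline_tags)
```

Since the defaults are frozensets, callers cannot modify them in place by accident, which fits the advice to pass newline_tags rather than change NEWLINE_TAGS.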

DOUBLE_NEWLINE_TAGS. This heuristic makes the extracted text more similar
to how it is rendered in the browser.

NEWLINE_TAGS and DOUBLE_NEWLINE_TAGS can be customized.

html should be a unicode string or an already parsed lxml.html element.
Contributor

guess_page_layout argument should be documented.

"""
sel = cleaned_selector(html)
return selector_to_text(sel, guess_punct_space=guess_punct_space)
if html is None or len(html) == 0:
return ''
cleaned = _cleaned_html_tree(html)
return _html_to_text(
cleaned,
guess_punct_space=guess_punct_space,
guess_page_layout=guess_page_layout,
newline_tags=newline_tags,
double_newline_tags=double_newline_tags,
)