Add guess page layout #9
Changes from 14 commits
```diff
@@ -1,3 +1,3 @@
 # -*- coding: utf-8 -*-

-from .html_text import extract_text, parse_html, cleaned_selector, selector_to_text
+from .html_text import extract_text, parse_html, html_to_text, cleaned_selector, selector_to_text
```
```diff
@@ -7,6 +7,9 @@
 import parsel

+NEWLINE_TAGS = ['li', 'dd', 'dt', 'dl', 'ul', 'ol']
+DOUBLE_NEWLINE_TAGS = ['title', 'p', 'h1', 'h2', 'h3', 'h4', 'h5', 'h6']
+
```
Review comment: These lists are used for lookups; even though they're short, I think it is cleaner and faster to have them as sets.

Reply: ^^ lookup in a set is ~4x faster even when the list is that short; not a lot, but why not do that :)

Review comment: It'd also be nice to allow overriding these double_newline_tags and newline_tags in extract_text; these constants would then just be defaults (use frozenset instead of set if you do so). A use case: you want to extract text from a particular website whose layout you know, e.g. extract_text(html, guess_page_layout=True, newline_tags=NEWLINE_TAGS | {'div'}).
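A quick sketch of the reviewers' suggestion: the tag lists become frozensets used as defaults, and callers extend them with `|`. Note that the `newline_tags` keyword argument on `extract_text` is only proposed in this review, not something the diff implements yet.

```python
# Proposed defaults as frozensets: same membership semantics as the lists
# in the diff, but O(1) lookup and immutable (safe as default arguments).
NEWLINE_TAGS = frozenset(['li', 'dd', 'dt', 'dl', 'ul', 'ol'])
DOUBLE_NEWLINE_TAGS = frozenset(
    ['title', 'p', 'h1', 'h2', 'h3', 'h4', 'h5', 'h6'])

# Membership tests read the same as with lists:
assert 'li' in NEWLINE_TAGS
assert 'span' not in NEWLINE_TAGS

# A site that uses <div> for layout could extend the defaults,
# as in the reviewer's extract_text(..., newline_tags=...) example:
custom_newline_tags = NEWLINE_TAGS | {'div'}
assert 'div' in custom_newline_tags
```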
```diff
 _clean_html = Cleaner(
     scripts=True,
     javascript=False,  # onclick attributes are fine

@@ -44,30 +47,89 @@ def parse_html(html):
 _whitespace = re.compile(r'\s+')
 _has_trailing_whitespace = re.compile(r'\s$').search
 _has_punct_after = re.compile(r'^[,:;.!?"\)]').search
-_has_punct_before = re.compile(r'\($').search
+_has_open_bracket_before = re.compile(r'\($').search


+def html_to_text(tree, guess_punct_space=True, guess_page_layout=False):
```
Review comment: I think we should keep this function private and rename it accordingly. If allowing to pass an already cleaned tree is important, we can add an argument for that.

Reply: I think the only reason for not wanting the html to be cleaned is tied to performance. Let's just make cleaning optional.

Reply: @Kebniss right, though let's not worry about adding that for now.
```diff
+    """
+    Convert a cleaned html tree to text.
+    See html_text.extract_text docstring for description of the approach
+    and options.
+    """

-def selector_to_text(sel, guess_punct_space=True):
```
Review comment: This function was useful - the main use case is to extract text from a part of a web page, found using Scrapy or parsel.

Reply: You can get a parsed tree for a selector using sel.root.

Reply: Ok so you want to create a selector in extract_text?

Reply: @Kebniss no, extract_text doesn't need to use Selector, it is additional overhead. The idea is to be backwards compatible and provide the same feature for Selector; internally it can work the other way around - likely selector_to_text should pass sel.root to html_to_text.

Reply: Alternatively, we can just not extract text from the element tail by default at the top level (children should have their tail text processed as usual).

Reply: You are right, tail text is outside the selected nodes and as such it should not be extracted; not extracting it by default seems reasonable. I will add the root object as an argument, to check when the recursion call is processing it.

Reply: Why do you need a root object, can't it just be a boolean flag?

Reply: Yes, the root node is unnecessary. I added a depth argument instead, so that we know when the recursion is back at the root and don't extract the tail there.
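For context on the tail discussion above: in lxml (and the stdlib's ElementTree, which shares the model), text that follows an element's closing tag is stored on that element's `.tail`, so a selector pointing at the element would not normally include it. A quick stdlib illustration:

```python
import xml.etree.ElementTree as ET

# Text after </p> belongs to p.tail, not to the parent's .text.
root = ET.fromstring('<div><p>inside</p>outside tail</div>')
p = root.find('p')
assert p.text == 'inside'        # text inside the element
assert p.tail == 'outside tail'  # text *after* the element's closing tag
assert root.text is None         # <div> has no text before <p>
```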
```diff
+    def add_space(text, prev):
+        if prev is None:
+            return False
+        if prev == '\n' or prev == '\n\n':
+            return False
+        if not _has_trailing_whitespace(prev):
+            if _has_punct_after(text) or _has_open_bracket_before(prev):
+                return False
+        return True
+
+    def add_newline(tag, prev):
+        if prev is None or prev == '\n\n':
+            return ''
+        if tag in DOUBLE_NEWLINE_TAGS:
+            if prev == '\n':
+                return '\n'
+            return '\n\n'
+        if tag in NEWLINE_TAGS:
+            if prev == '\n':
+                return ''
+            return '\n'
+        return ''
```
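The two heuristics can be exercised standalone. Below they are copied out of the diff (in the patch they are closures inside html_to_text) together with the regexes and tag lists they rely on:

```python
import re

# Regexes and tag lists as defined earlier in the diff:
_has_trailing_whitespace = re.compile(r'\s$').search
_has_punct_after = re.compile(r'^[,:;.!?"\)]').search
_has_open_bracket_before = re.compile(r'\($').search
NEWLINE_TAGS = ['li', 'dd', 'dt', 'dl', 'ul', 'ol']
DOUBLE_NEWLINE_TAGS = ['title', 'p', 'h1', 'h2', 'h3', 'h4', 'h5', 'h6']

def add_space(text, prev):
    # Suppress the joining space before punctuation / after '('.
    if prev is None:
        return False
    if prev == '\n' or prev == '\n\n':
        return False
    if not _has_trailing_whitespace(prev):
        if _has_punct_after(text) or _has_open_bracket_before(prev):
            return False
    return True

def add_newline(tag, prev):
    # Emit at most two consecutive newlines, depending on the tag kind.
    if prev is None or prev == '\n\n':
        return ''
    if tag in DOUBLE_NEWLINE_TAGS:
        return '\n' if prev == '\n' else '\n\n'
    if tag in NEWLINE_TAGS:
        return '' if prev == '\n' else '\n'
    return ''

# No space before punctuation or after an opening bracket:
assert add_space('!', 'world') is False   # '!' is punctuation
assert add_space('x', '(') is False       # previous text ends with '('
assert add_space('world', 'hello') is True

# Block-level tags separate with one or two newlines, never more:
assert add_newline('p', 'some text') == '\n\n'
assert add_newline('p', '\n') == '\n'     # already have one newline
assert add_newline('li', '\n') == ''      # don't double up
assert add_newline('span', 'text') == ''  # inline tag: no newline
```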
```diff
+    def traverse_text_fragments(tree, prev, depth):
+        space = ' '
+        newline = ''
+        if tree.text:
+            text = _whitespace.sub(' ', tree.text.strip())
+            if text:
+                if guess_page_layout:
+                    newline = add_newline(tree.tag, prev[0])
+                    if newline:
+                        prev[0] = newline
+                if guess_punct_space and not add_space(text, prev[0]):
+                    space = ''
+                yield [newline, space, text]
+                prev[0] = tree.text
+                space = ' '
+                newline = ''
+
+        for child in tree:
+            for t in traverse_text_fragments(child, prev, depth + 1):
+                yield t
+
+        if guess_page_layout:
+            newline = add_newline(tree.tag, prev[0])
+            if newline:
+                prev[0] = newline
+
+        tail = ''
+        if tree.tail and depth != 0:
+            tail = _whitespace.sub(' ', tree.tail.strip())
+            if tail:
+                if guess_punct_space and not add_space(tail, prev[0]):
+                    space = ''
+        if tail:
+            yield [newline, space, tail]
+            prev[0] = tree.tail
+        elif newline:
+            yield [newline]
+
+    text = []
+    for fragment in traverse_text_fragments(tree, [None], 0):
+        text.extend(fragment)
+    return ''.join(text).strip()
```
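The traversal above walks the tree depth-first, normalizing whitespace in each `.text` and `.tail` fragment and joining the pieces at the end. A minimal stdlib sketch of the same idea (no layout or punctuation guessing):

```python
import re
import xml.etree.ElementTree as ET

_whitespace = re.compile(r'\s+')

def iter_text_fragments(tree):
    # Depth-first: element text first, then children, then the tail.
    if tree.text:
        text = _whitespace.sub(' ', tree.text.strip())
        if text:
            yield text
    for child in tree:
        yield from iter_text_fragments(child)
    if tree.tail:
        tail = _whitespace.sub(' ', tree.tail.strip())
        if tail:
            yield tail

root = ET.fromstring('<div>Hello <b>world</b>! More\n text.</div>')
result = ' '.join(iter_text_fragments(root))
print(result)  # Hello world ! More text.
```

The stray space before "!" is exactly what the guess_punct_space / add_space heuristic in the diff suppresses.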
```diff
-def selector_to_text(sel, guess_punct_space=True):
-    """ Convert a cleaned selector to text.
-    See html_text.extract_text docstring for description of the approach and options.
-    """
-    if guess_punct_space:
-
-        def fragments():
-            prev = None
-            for text in sel.xpath('.//text()').extract():
-                if prev is not None and (_has_trailing_whitespace(prev)
-                        or (not _has_punct_after(text) and
-                            not _has_punct_before(prev))):
-                    yield ' '
-                yield text
-                prev = text
-
-        return _whitespace.sub(' ', ''.join(fragments()).strip())
-
-    else:
-        fragments = (x.strip() for x in sel.xpath('.//text()').extract())
-        return _whitespace.sub(' ', ' '.join(x for x in fragments if x))
+def selector_to_text(sel, guess_punct_space=True, guess_page_layout=False):
+    """ Convert a cleaned selector to text.
+    See html_text.extract_text docstring for description of the approach
+    and options.
+    """
+    return html_to_text(sel.root, guess_punct_space=guess_punct_space,
+                        guess_page_layout=guess_page_layout)
```
```diff
@@ -85,7 +147,7 @@ def cleaned_selector(html):
     return sel


-def extract_text(html, guess_punct_space=True):
+def extract_text(html, guess_punct_space=True, guess_page_layout=False, new=True):
```
```diff
     """
     Convert html to text, cleaning invisible content such as styles.
     Almost the same as normalize-space xpath, but this also

@@ -98,5 +160,7 @@ def extract_text(html, guess_punct_space=True):

     html should be a unicode string or an already parsed lxml.html element.
```
Review comment: The guess_page_layout argument should be documented.
```diff
     """
-    sel = cleaned_selector(html)
-    return selector_to_text(sel, guess_punct_space=guess_punct_space)
+    if html is None or len(html) == 0:
+        return ''
+    cleaned = _cleaned_html_tree(html)
+    return html_to_text(cleaned, guess_punct_space=guess_punct_space,
+                        guess_page_layout=guess_page_layout)
```
Review comment: Shouldn't we add table tags like tr and th as well? Check e.g. http://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html - currently the product information ends up all on the same line.

Review comment: It'd also be nice to add a few realistic tests: a few examples of HTML pages and their text output (in separate files, for readability). Text should be extracted with guess_page_layout=True. I think this would allow us to detect regressions / changes in the output better, and also find cases which are not handled properly.
Review comment: To clarify: which HTML tags are not in these two lists because they shouldn't be there, and which are missing because we're not handling them yet? Which of the tags from https://developer.mozilla.org/en-US/docs/Web/HTML/Element have you checked? Maybe it makes sense to handle more of them?