
Added text_content() method to selectors. #34

Closed
wants to merge 1 commit into from

Conversation

paulo-raca

I recently needed to extract the text contents of HTML nodes as plain old strings, ignoring nested tags and extra spaces.

While that wasn't hard, it is a common operation that should be built into scrapy.

@codecov-io

Current coverage is 100.00%

Merging #34 into master will not affect coverage as of 26b4f8f


@redapple
Contributor

redapple commented Mar 3, 2016

hi @paulo-raca,
Although I was trying to get this kind of functionality into scrapy selectors before, I've now gotten used to chaining XPaths in these cases:

sel.xpath('//ul/li').xpath('string()')

I would prefer that people get to know XPath string functions better.
Also, normalize-space() is great, but sometimes you need string() to keep newlines, for example. I believe a normalize=True/False arg could be useful.
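
For illustration, a minimal sketch of the chained-XPath approach in parsel (the HTML and variable names are made up for the example):

    from parsel import Selector

    sel = Selector(text='<ul><li> Hello <b>world</b>\n</li><li>second</li></ul>')

    # string() yields the full text of each matched node, nested tags included,
    # with the original whitespace preserved.
    sel.xpath('//ul/li').xpath('string()').extract()

    # normalize-space() additionally collapses whitespace runs and trims the ends.
    sel.xpath('//ul/li').xpath('normalize-space()').extract()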

@Digenis
Member

Digenis commented Mar 3, 2016

I would prefer that people get to know XPath string functions better.

I agree 100%

Also, I'd prefer it not to have too many .extract()-like methods.
There's a long-running set of discussions and PRs
for a trivial extract_first shortcut which everyone can live without.
There's the .re() method, whose results can now be achieved with .xpath('re:replace()').

I don't think parsel should reinvent XPath through a collection of shortcut methods.
Let's for a second rename .text_content() to .normalize_whitespace()
and call redapple's suggestion a .string() method.
This better demonstrates the problem:
anyone can come up with a series of methods
that replicate every elementary XPath feature.
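
As an example of the re:replace() point, a minimal sketch, assuming the EXSLT regular-expressions namespace (re:) is available in XPath expressions, as it is with parsel/scrapy selectors:

    from parsel import Selector

    sel = Selector(text='<p>too   many    spaces</p>')
    # EXSLT re:replace(string, pattern, flags, replacement);
    # the "g" flag replaces every match, collapsing whitespace runs.
    sel.xpath(r're:replace(string(//p), "\s+", "g", " ")').extract_first()
    # -> 'too many spaces'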

@kmike
Member

kmike commented Mar 3, 2016

See also discussion at scrapy/scrapy#772.

I like the idea of providing customizable html2text-like features in parsel. This task is very common. This is not only about XPath: users may want to extract text knowing only CSS selectors (XPath is not required to use parsel), and not all transformations are easily expressed in XPath. I think parsel can't depend on https://github.com/Alir3z4/html2text/ for that though, because html2text is GPL.

IMHO a .text() method should be a 'best effort' method to convert HTML to plain text, with some options (e.g. how to join parts? preserve newlines or not? normalize whitespace or not? handle paragraphs or not?). If we want to do that, it is better to implement the 'smarter' features from the beginning, because if we just go with normalize-space() people may become accustomed to this straightforward behaviour; any change to .text() to make it smarter would then be backwards incompatible and would break user code.

There is no default implementation that will satisfy everyone, so it is fine to be opinionated. I'd love to have something quick which works better than .string() by default and provides some tweaks for common use cases.
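
To make this concrete, a purely hypothetical sketch of a 'best effort' helper along those lines (the function name, parameters, and defaults are all made up; nothing here is part of parsel's API):

    import re
    from parsel import Selector

    def best_effort_text(selectors, normalize_space=True, joiner='\n'):
        # Join the string value of each selected node, optionally collapsing
        # whitespace, as a stand-in for a smarter built-in .text().
        parts = selectors.xpath('string()').extract()
        if normalize_space:
            parts = [re.sub(r'\s+', ' ', p).strip() for p in parts]
        return joiner.join(p for p in parts if p)

    best_effort_text(Selector(text='<div><p>Hi  there</p><p>Bye</p></div>').css('p'))
    # -> 'Hi there\nBye'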

@Digenis
Member

Digenis commented Mar 3, 2016

lxml has an HTML cleaner.
Running it, followed by collecting the text nodes, can be used for elementary textualization.
It can't be used in the current state of parsel,
because parsel extracts attributes and text nodes with HTML entities unescaped,
and the cleaner needs to be able to tell what is not HTML by having plain text entity-escaped.

@paulo-raca
Author

IMHO, a new method would be handy, particularly with some extra params (normalization, etc.).

But I'm OK with telling people to use XPath instead of bloating the API.

If that is the way you want to go, I think XPath's string() and normalize-space() should be mentioned in the tutorial and docs.
(While I eventually figured that out, it took me longer than it should have -- particularly because I started out using CSS selectors.)

@kmike
Member

kmike commented Jun 17, 2016

Note for the future: this implementation returns text inside <style> and <script> tags; usually this is not desired. I think a general-purpose 'opinionated' implementation should exclude their content.
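
For reference, one way to do that with plain XPath today (a minimal sketch; the HTML is illustrative):

    from parsel import Selector

    sel = Selector(text='<body><p>visible</p><script>var x = 1;</script><style>p {}</style></body>')
    # Keep only text nodes that are not inside <script> or <style>.
    sel.xpath('//text()[not(ancestor::script | ancestor::style)]').extract()
    # -> ['visible']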

@Digenis
Member

Digenis commented Jun 17, 2016

lxml provides a well-performing built-in for HTML sanitization,
which I use in selector subclasses.

from lxml.html.clean import Cleaner
clean_html = Cleaner(style=True).clean_html

(style=False seems to be the only non-sane default)
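
A minimal sketch of combining that with parsel (re-parsing the cleaned markup into a fresh Selector; the example HTML is made up):

    from lxml.html.clean import Cleaner
    from parsel import Selector

    clean_html = Cleaner(style=True).clean_html  # scripts are already dropped by default

    sel = Selector(text='<p>keep</p><script>drop();</script><style>p {}</style>')
    # Clean the serialized HTML, then parse the result again into a new Selector.
    cleaned = Selector(text=clean_html(sel.extract()))
    cleaned.xpath('string()').extract_first().strip()
    # -> 'keep'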

HTML sanitization is a different topic.
This PR is about rendering, but it's relevant to sanitization
because such an approach should follow a sanitization step.

@kmike
Member

kmike commented Jun 17, 2016

@Digenis right; Cleaner is not straightforward to use with Parsel though because of #40

@Digenis
Member

Digenis commented Jun 17, 2016

Yes, it requires parsing again.

When using small fragments (the parts to be extracted from a page),
the performance impact of parsing them again is negligible.

It's the endless gotchas when using html fragments that make it non-straightforward.

The resulting tree is unpredictable and whitespace is so inconsistent
that just upgrading lxml can break tests.

The hacks it takes to implement this are only a few,
but they are unintuitive and require a lot of documentation examples and testing.
(Would parsel developers be interested in maintaining such a thing?)

My point in my earlier comments is that, since getting the text content without sanitization
is something already solved by XPath, there's no need for a shortcut method.
Better to reserve a name for something that gets the text content from sanitized HTML.

@NoxWings

NoxWings commented Jan 12, 2017

Instead of creating new methods, why not let us specify the etree.tostring "method" keyword argument?
Using the "text" method instead of "html" provides exactly the same output as text_content().
This way the API wouldn't change at all.

Something like the following:

    # Assumes parsel's existing module-level imports: `from lxml import etree` and `import six`.
    def extract(self, tostring_method=None):
        """
        Serialize and return the matched nodes in a single unicode string.
        Percent encoded content is unquoted.
        """
        try:
            # Use the caller's serialization method ('html', 'xml', 'text', ...),
            # falling back to the selector's default when none is given.
            return etree.tostring(self.root,
                                  method=tostring_method or self._tostring_method,
                                  encoding='unicode',
                                  with_tail=False)
        except (AttributeError, TypeError):
            # self.root is not an lxml element (e.g. the boolean or string
            # result of an XPath expression); convert it directly.
            if self.root is True:
                return u'1'
            elif self.root is False:
                return u'0'
            else:
                return six.text_type(self.root)
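
If something like that were merged, usage would presumably look as follows (hypothetical, since the tostring_method argument is only a proposal here):

    sel = Selector(text='<p>Hello <b>world</b></p>')
    sel.xpath('//p')[0].extract(tostring_method='text')  # -> 'Hello world'
    sel.xpath('//p')[0].extract()                         # -> '<p>Hello <b>world</b></p>'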

@Gallaecio
Member

@paulo-raca Are you OK with closing this pull request and continuing the discussion in #127?

@paulo-raca
Author

Sure 👍
