
Added text_content() method to selectors. #34

Closed
wants to merge 1 commit into from

Conversation

paulo-raca

I recently needed to extract the text contents of HTML nodes as plain old strings, ignoring nested tags and extra spaces.

While that wasn't hard, it is a common operation that should be built into scrapy.

@codecov-io

Current coverage is 100.00%

Merging #34 into master will not affect coverage as of 26b4f8f


@redapple
Contributor

redapple commented Mar 3, 2016

hi @paulo-raca,
Although I was trying to get this kind of functionality into scrapy selectors before, I've now gotten used to chaining XPaths in these cases:

sel.xpath('//ul/li').xpath('string()')

I would prefer that people get to know XPath string functions better.
Also, normalize-space() is great, but sometimes you need string() to keep newlines, for example. I believe a normalize=True/False arg could be useful.
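
For illustration, a minimal sketch of the chained-XPath approach in parsel (the HTML and variable names are made up for the example):

    from parsel import Selector

    sel = Selector(text='<ul><li> Hello <b>world</b>\n</li><li>second</li></ul>')

    # string() yields the full text of each matched node, nested tags included,
    # with the original whitespace preserved.
    sel.xpath('//ul/li').xpath('string()').extract()

    # normalize-space() additionally collapses whitespace runs and trims the ends.
    sel.xpath('//ul/li').xpath('normalize-space()').extract()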

@Digenis
Member

Digenis commented Mar 3, 2016

I would prefer that people get to know XPath string functions better.

I agree 100%

Also, I'd prefer it not to have too many .extract()-like methods.
There's a long-running set of discussions and PRs
for a trivial extract_first shortcut which everyone can live without.
There's the .re() method, whose results can now be achieved with .xpath('re:replace()').

I don't think parsel should reinvent XPath through a collection of shortcut methods.
Let's for a second rename .text_content() to .normalize_whitespace()
and call redapple's suggestion a .string() method.
This better demonstrates the problem:
anyone can come up with a series of methods
that replicate every elementary XPath feature.
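
As an example of the re:replace() point, a minimal sketch, assuming the EXSLT regular-expressions namespace (re:) is available in XPath expressions, as it is with parsel/scrapy selectors:

    from parsel import Selector

    sel = Selector(text='<p>too   many    spaces</p>')
    # EXSLT re:replace(string, pattern, flags, replacement);
    # the "g" flag replaces every match, collapsing whitespace runs.
    sel.xpath(r're:replace(string(//p), "\s+", "g", " ")').extract_first()
    # -> 'too many spaces'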

@kmike
Member

kmike commented Mar 3, 2016

See also discussion at scrapy/scrapy#772.

I like the idea of providing customizable html2text-like features in parsel. This task is very common. This is not only about XPath: users may want to extract text knowing only CSS selectors (XPath is not required to use parsel), and not all transformations are easily expressed in XPath. I think parsel can't depend on https://github.com/Alir3z4/html2text/ for that though, because html2text is GPL.

IMHO a .text() method should be a 'best effort' method to convert HTML to plain text, with some options (e.g. how to join parts? preserve newlines or not? normalize whitespace or not? handle paragraphs or not?). If we want to do that, it is better to implement the 'smarter' features from the beginning, because if we just go with normalize-space() people may become accustomed to this straightforward behaviour; any change to .text() to make it smarter would then be backwards incompatible and would break user code.

There is no default implementation that will satisfy everyone, so it is fine to be opinionated. I'd love to have something quick which works better than .string() by default and provides some tweaks for common use cases.
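
To make this concrete, a purely hypothetical sketch of a 'best effort' helper along those lines (the function name, parameters, and defaults are all made up; nothing here is part of parsel's API):

    import re
    from parsel import Selector

    def best_effort_text(selectors, normalize_space=True, joiner='\n'):
        # Join the string value of each selected node, optionally collapsing
        # whitespace, as a stand-in for a smarter built-in .text().
        parts = selectors.xpath('string()').extract()
        if normalize_space:
            parts = [re.sub(r'\s+', ' ', p).strip() for p in parts]
        return joiner.join(p for p in parts if p)

    best_effort_text(Selector(text='<div><p>Hi  there</p><p>Bye</p></div>').css('p'))
    # -> 'Hi there\nBye'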

@Digenis
Member

Digenis commented Mar 3, 2016

lxml has an HTML cleaner.
Running it, followed by collecting the text nodes, can be used for elementary textualization.
It can't be used in the current state of parsel,
because parsel extracts attributes and text nodes with HTML entities unescaped,
and the cleaner needs to be able to tell what is not HTML by having plain text entity-escaped.

@paulo-raca
Author

IMHO, a new method would be handy, particularly with some extra params (normalization, etc.).

But I'm OK with telling people to use XPath instead of bloating the API.

If that is the way you want to go, I think XPath's string() and normalize-space() should be mentioned in the tutorial and docs.
(While I eventually figured that out, it took me longer than it should have -- particularly because I started out using CSS selectors.)

@kmike
Member

kmike commented Jun 17, 2016

Note for the future: this implementation returns text inside <style> and <script> tags; usually this is not desired. I think a general-purpose 'opinionated' implementation should exclude their content.
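
For reference, one way to do that with plain XPath today (a minimal sketch; the HTML is illustrative):

    from parsel import Selector

    sel = Selector(text='<body><p>visible</p><script>var x = 1;</script><style>p {}</style></body>')
    # Keep only text nodes that are not inside <script> or <style>.
    sel.xpath('//text()[not(ancestor::script | ancestor::style)]').extract()
    # -> ['visible']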

@Digenis
Member

Digenis commented Jun 17, 2016

lxml provides a well-performing built-in for HTML sanitization,
which I use in selector subclasses.

from lxml.html.clean import Cleaner
clean_html = Cleaner(style=True).clean_html

(style=False seems to be the only non-sane default)
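
A minimal sketch of combining that with parsel (re-parsing the cleaned markup into a fresh Selector; the example HTML is made up):

    from lxml.html.clean import Cleaner
    from parsel import Selector

    clean_html = Cleaner(style=True).clean_html  # scripts are already dropped by default

    sel = Selector(text='<p>keep</p><script>drop();</script><style>p {}</style>')
    # Clean the serialized HTML, then parse the result again into a new Selector.
    cleaned = Selector(text=clean_html(sel.extract()))
    cleaned.xpath('string()').extract_first().strip()
    # -> 'keep'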

HTML sanitization is a different topic.
This PR is about rendering, but it's relevant to sanitization
because such an approach should follow a sanitization step.

@kmike
Member

kmike commented Jun 17, 2016

@Digenis right; Cleaner is not straightforward to use with Parsel though because of #40

@Digenis
Member

Digenis commented Jun 17, 2016

Yes, it requires parsing again.

When using small fragments (the parts to be extracted from a page),
the performance impact of parsing them again is negligible.

It's the endless gotchas when using html fragments that make it non-straightforward.

The resulting tree is unpredictable and whitespace is so inconsistent
that just upgrading lxml can break tests.

The hacks it takes to implement this are only a few,
but they are unintuitive and require a lot of documentation examples and testing.
(Would parsel developers be interested in maintaining such a thing?)

My point in my earlier comments is that, since getting the text content without sanitization
is something already solved by XPath, there's no need for a shortcut method.
Better to reserve a name for something that gets the text content from sanitized HTML.

@NoxWings

NoxWings commented Jan 12, 2017

Instead of creating new methods, why not let us specify the etree.tostring "method" keyword argument?
Using the "text" method instead of "html" provides exactly the same output as text_content().
This way the API wouldn't change at all.

Something like the following:

    # Assumes parsel's existing module-level imports: `from lxml import etree` and `import six`.
    def extract(self, tostring_method=None):
        """
        Serialize and return the matched nodes in a single unicode string.
        Percent encoded content is unquoted.
        """
        try:
            # Use the caller's serialization method ('html', 'xml', 'text', ...),
            # falling back to the selector's default when none is given.
            return etree.tostring(self.root,
                                  method=tostring_method or self._tostring_method,
                                  encoding='unicode',
                                  with_tail=False)
        except (AttributeError, TypeError):
            # self.root is not an lxml element (e.g. the boolean or string
            # result of an XPath expression); convert it directly.
            if self.root is True:
                return u'1'
            elif self.root is False:
                return u'0'
            else:
                return six.text_type(self.root)
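
If something like that were merged, usage would presumably look as follows (hypothetical, since the tostring_method argument is only a proposal here):

    sel = Selector(text='<p>Hello <b>world</b></p>')
    sel.xpath('//p')[0].extract(tostring_method='text')  # -> 'Hello world'
    sel.xpath('//p')[0].extract()                         # -> '<p>Hello <b>world</b></p>'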

@Gallaecio
Member

@paulo-raca Are you OK with closing this pull request and continuing the discussion in #127?

@paulo-raca
Author

Sure 👍
