Fix the issue where HTML elements cannot be dropped from the text selector returned by Selector.jmespath() #298

dream2333 · 2024-06-09T19:05:21Z

When .xpath() is called on a text-type selector, the resulting nodes appear to be built from a fresh copy parsed from the text rather than from the original root node. As a result, calling .drop() on them does not affect the original HTML tree. The issue mostly shows up when jmespath and xpath are used together.

body_selector = response.jmespath("news.body")
styles = body_selector.xpath("//style")
styles.drop()
# the result below still contains the text of the <style> tags
content = body_selector.xpath("string(.)").get()
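
To make the failure mode concrete without depending on parsel, here is a minimal sketch built on the standard library's xml.etree.ElementTree. The class ReparsingTextSelector and its methods are hypothetical stand-ins, not parsel's API; the only assumption carried over from the report is that a text-type selector re-parses its string on every query.

```python
import xml.etree.ElementTree as ET


class ReparsingTextSelector:
    """Hypothetical stand-in for a text-type selector.

    It stores only a string and parses a brand-new tree for every query,
    which mirrors the behaviour described in this report.
    """

    def __init__(self, text):
        self.text = text

    def _root(self):
        # A fresh tree is built on each call, so callers only ever see copies.
        return ET.fromstring(self.text)

    def drop(self, tag):
        root = self._root()
        # Remove every element named `tag` from this (temporary) tree.
        for parent in list(root.iter()):
            for child in list(parent):
                if child.tag == tag:
                    parent.remove(child)
        # `root` is discarded here, so the removal is lost.

    def text_content(self):
        return "".join(self._root().itertext())


sel = ReparsingTextSelector(
    "<div><style>p {color: red}</style><p>Hello</p></div>"
)
sel.drop("style")
print(sel.text_content())  # the style text is still present
```

Because drop() works on a tree that nobody keeps a reference to, the next query sees the untouched original string.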

This pull request fixes an issue where HTML elements were not being dropped correctly from a text selector. The _text_lazy_html_root attribute has been added to store a temporary root node, which prevents the creation of a new root HTMLElement copy each time a text selector is used.

Fixes #297, resolves #299

dream2333 · 2024-06-09T19:07:30Z

#297

Gallaecio · 2024-06-11T16:09:01Z

This feels like a workaround, I wonder if there is not some root issue that needs to be addressed here. Maybe type="text" should be removed here, although it would not surprise me if that broke something else.

Could you add a test for the issue, so it is easier to experiment with alternative solutions?

dream2333 · 2024-06-12T08:25:20Z

> This feels like a workaround, I wonder if there is not some root issue that needs to be addressed here. Maybe type="text" should be removed here, although it would not surprise me if that broke something else.
>
> Could you add a test for the issue, so it is easier to experiment with alternative solutions?

Sure, I'll add a test for this issue after work. My main concern is that direct modifications to the selector might break the forward compatibility of the entire library. In the current version, the implementation of converting text to HtmlElement is not very elegant either. It doesn't cache the HtmlElement generated from text, nor does it associate the HtmlElement with the selector itself. This leads to the creation of a new copy every time a query is made on a text node. Any modifications made to this copy will not affect the original Selector at all.

dream2333 · 2024-06-13T19:07:35Z

@Gallaecio
Updated pull request.
Added two test cases to verify the effect of dropping nodes from a Selector of type 'text'.

I tried to remove type="text", but it broke jmespath. Currently, I think adding a cache for root's HTML is the least destructive solution and it solves the problem of recreating etree._Element every time an xpath query is executed on a selector of type 'text'.
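
The caching idea can be sketched roughly as follows, again using only the standard library rather than parsel or lxml. CachingTextSelector and its _cached_root attribute are hypothetical names chosen for illustration; they only approximate the role that _text_lazy_html_root plays in this pull request.

```python
import xml.etree.ElementTree as ET


class CachingTextSelector:
    """Hypothetical text-type selector that parses its string only once."""

    def __init__(self, text):
        self.text = text
        self._cached_root = None  # filled lazily on first use

    def _root(self):
        # Parse on first access, then keep returning the same tree object.
        if self._cached_root is None:
            self._cached_root = ET.fromstring(self.text)
        return self._cached_root

    def drop(self, tag):
        # The removal happens on the cached tree, so it persists.
        for parent in list(self._root().iter()):
            for child in list(parent):
                if child.tag == tag:
                    parent.remove(child)

    def text_content(self):
        return "".join(self._root().itertext())


sel = CachingTextSelector(
    "<div><style>p {color: red}</style><p>Hello</p></div>"
)
sel.drop("style")
print(sel.text_content())
```

Since every query goes through the same cached tree, a drop made by one query is visible to the next. Note that ElementTree's remove() also discards the removed element's tail text, which lxml's drop handling does not; the sample markup avoids tails so that difference does not matter here.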

Gallaecio · 2024-06-14T19:07:26Z

parsel/selector.py

+                return etree.tostring(
+                    self._text_lazy_html_root, encoding="unicode", with_tail=False
+                )


A problem with this approach is that we are assuming HTML, when it could be XML.
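
As a rough illustration of why the HTML-versus-XML assumption matters when a cached tree is turned back into a string, the sketch below serializes the same tree in both dialects. It uses the standard library's xml.etree.ElementTree as a stand-in, not lxml, so it only shows that the chosen dialect changes the output, not what parsel itself would produce.

```python
import xml.etree.ElementTree as ET

# One tree, parsed once from well-formed markup.
root = ET.fromstring("<p>line one<br/>line two</p>")

# Serialize the same tree under each dialect's rules.
as_xml = ET.tostring(root, encoding="unicode", method="xml")
as_html = ET.tostring(root, encoding="unicode", method="html")

print(as_xml)
print(as_html)
print(as_xml == as_html)
```

The two strings differ in how the empty br element is written, so code that always serializes as HTML can hand back markup that is no longer what an XML consumer expects.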

Gallaecio · 2024-06-14T19:13:10Z

So, I have created #299 as an alternative, but I have no strong opinion on which way to go. To be honest, I think both could work, i.e. this approach makes things work by default for HTML (which is also assumed to be default when no type is specified), while #299 would provide a way to make things work with XML by specifying type="xml" manually.

@kmike @wRAR Any thoughts?

dream2333 mentioned this pull request Jun 9, 2024

SelectorList.drop() removing elements doesn't work as expected #297

Open

dream2333 closed this Jun 13, 2024

dream2333 force-pushed the master branch from d5728ed to 7407342 Compare June 13, 2024 02:39

dream2333 added 2 commits June 14, 2024 01:19

Fix drop html element from a text type Selector

8259dd4

Add testcases for drop html node

70aca9b

dream2333 reopened this Jun 13, 2024

dream2333 added 4 commits June 14, 2024 01:52

Add type hint

0c2b57a

Add testcases for drop html node

9c8869a

Fix drop html element from a text type Selector

955abd9

Merge remote-tracking branch 'origin/master'

4e140f9

# Conflicts: # parsel/selector.py # tests/test_selector.py

Gallaecio mentioned this pull request Jun 14, 2024

Support forcing a selector type into a subselector #299

Open

Gallaecio reviewed Jun 14, 2024

Gallaecio approved these changes Jun 14, 2024
