Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix the issue where HTML elements cannot be dropped from the text selector returned by Selector.jmespath() #298

Open
wants to merge 6 commits into
base: master
Choose a base branch
from

Conversation

dream2333
Copy link

@dream2333 dream2333 commented Jun 9, 2024

When using the .xpath method to create nodes from a text type selector, it appears that these nodes are actually copies generated from the text, rather than being generated based on the original root node. As a result, when executing the .drop method, it doesn't affect the content of the original HTML tree. This issue is mostly observed when using jmespath and xpath in combination.

body_selector = response.jmespath("news.body")
styles = body_selector.xpath("//style")
styles.drop()
# always contains the content of the style tags
content = body_selector.xpath("string(.)").get()

This pull request fixes an issue where HTML elements were not being dropped correctly from a text selector. The ·_text_lazy_html_root· attribute has been added to store a temporary root node, which prevents the creation of a new root HTMLElement copy each time a text selector is used

Fixes #297, resolves #299

@dream2333
Copy link
Author

#297

@Gallaecio
Copy link
Member

This feels like a workaround, I wonder if there is not some root issue that needs to be addressed here. Maybe type="text" should be removed here, although it would not surprise me if that broke something else.

Could you add a test for the issue, so it is easier to experiment with alternative solutions?

@dream2333
Copy link
Author

This feels like a workaround, I wonder if there is not some root issue that needs to be addressed here. Maybe type="text" should be removed here, although it would not surprise me if that broke something else.

Could you add a test for the issue, so it is easier to experiment with alternative solutions?

This feels like a workaround, I wonder if there is not some root issue that needs to be addressed here. Maybe type="text" should be removed here, although it would not surprise me if that broke something else.

Could you add a test for the issue, so it is easier to experiment with alternative solutions?

Sure, I'll add a test for this issue after work. My main concern is that direct modifications to the selector might break the forward compatibility of the entire library. In the current version, the implementation of converting text to HtmlElement is not very elegant either. It doesn't cache the HtmlElement generated from text, nor does it associate the HtmlElement with the selector itself. This leads to the creation of a new copy every time a query is made on a text node. Any modifications made to this copy will not affect the original Selector at all.

@dream2333 dream2333 reopened this Jun 13, 2024
@dream2333
Copy link
Author

@Gallaecio
Updated pull request.
Added two test cases to verify the effect of dropping nodes from a Selector of type 'text'.

I tried to remove type="text", but it broke jmespath. Currently, I think adding a cache for root's HTML is the least destructive solution and it solves the problem of recreating etree._Element every time an xpath query is executed on a selector of type 'text'.

Comment on lines +734 to +736
return etree.tostring(
self._text_lazy_html_root, encoding="unicode", with_tail=False
)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A problem with this approach is that we are assuming HTML, when it could be XML.

@Gallaecio
Copy link
Member

Gallaecio commented Jun 14, 2024

So, I have created #299 as an alternative, but I have no strong opinion on which way to go. To be honest, I think both could work, i.e. this approach makes things work by default for HTML (which is also assumed to be default when no type is specified), while #299 would provide a way to make things work with XML by specifying type="xml" manually.

@kmike @wRAR Any thoughts?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

SelectorList.drop() removing elements doesn't work as expected
2 participants