Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Provide option to strip attributes for empty tags #98

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

lukasbischof
Copy link

Introduces a new option "strip_attributes_of_empty_tags" which can be used to ignore any attributes when an empty tag is found, in favour of consistent return types. This can be helpful if the parsed XML document mainly specifies namespaces or other attributes one might not be interested in to reduce the code needed during parsing.

My use case which motivated this feature was a bug I had when parsing a feed, which sometimes returned

<Street xmlns="...">MyStreet</Street>
<BuildingNumber xmlns="...">6</BuildingNumber>

but sometimes also

<Street xmlns="...">MyStreet 6</Street>
<BuildingNumber xmlns="..." />

The code which processed the parsed feed looked something like

Model.update(address: [parsed[:street], parsed[:building_number]].join(' '))

which (sometimes) produced addressed like MyStreet {:@xmlns=>"..."}.

Since I'm not interested in the attribute anyway, it would be easier to just strip them already while parsing. closes #97

Introduce strip_attributes_of_empty_tags option
which can be set to have consistent return values for empty tags
by neglecting their attributes.
@pcai
Copy link
Member

pcai commented Oct 14, 2022

Hi. Thanks for making this PR as we discussed. My understanding from reading #97 is you care about consistency and the problem was that the presence of an extra attribute without a value changes the shape of an empty tag (e.g., without child elements or inner text). I agree this is weird!

To expand more on the problem, it should be the case that

result = Nori.new(ignore_empty_attributes: true).parse('<foo bar />')
#=> {"foo"=>nil}

should give similar same result as

result = Nori.new.parse('<foo />')
#=> {"foo"=>nil}

because they are semantically equivalent, except that result['foo'].attributes would return {"bar" => ""} in the first case.

Is there a reason you took this current approach of stripping the attributes instead? My fear is that as-is...you are adding an additional feature that makes nori operate in a different, backwards-incompatible way, instead of "opt in to make the api more consistent"

@lukasbischof
Copy link
Author

Hi, thanks for your review.
I'm not sure if I understand your concern about backwards-incompatibility correctly. The stripping is deactivated by default, so omitting the option would not activate the feature thus behave as it used to. Hence, all tests still pass without me setting it explicitly to false.

The reason why I implemented stripping of all attributes of empty tags is because it solves my current issue I have in my project. There, I had parsing difficulties because the SOAP API I'm using returns data sometimes already concatenated and sometimes not (The problem I described in the description of this issue).
I can fix my parsing problem also on application level where Savon first parses the request using Nori and I then have to take care of standardising the response so it results in consistent output. However, that's quite tedious since I have to recursively flatten the empty attributes. It would be easier if there was an option for that in the library itself, hence this feature.

I see that this may be quite specific and that the library may/should not be responsible for that or only be responsible for handling the <foo bar /> and <foo /> case.
Unfortunately, that's not too easy to achieve, since <foo bar /> actually isn't valid XML. Nokogiri handles it still since in HTML it's allowed (Or at least tolerated, I'm not 100% if it's actually in the standard, haven't checked :)). But when using rexml, it even raises:

pry(main)> Nori.new(parser: :rexml).parse('<foo bar />')
REXML::ParseException: Missing attribute equal: <bar>
Line: 1
Position: 11
Last 80 unconsumed characters:

from [...]/rexml-3.2.5/lib/rexml/parsers/baseparser.rb:619:in `block in parse_attributes'

so being consistent would only work for Nokogiri. But then, would this be an opt-in feature just for Nokogiri for backwards-compatibility? Would you still expect the attribute to be present somewhere? If yes, wouldn't it be weird that (in the example you gave), result['foo'] is actually nil but result['foo'].attributes returns {'bar'=>''}? And if this behaviour is wanted, it would require extending NilClass which seems rather hacky to me...
To circumvent this issue, one could implement only stripping of empty attributes in empty tags, but this would then mean that we're losing information again.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

Successfully merging this pull request may close these issues.

Inconsistent types when parsing empty tags
2 participants