Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Feature] [JSON] Broader support of allOf JSON keyword #1070

Merged
merged 72 commits into from
Nov 6, 2024

Conversation

hudson-ai
Copy link
Collaborator

Implements framework for support of allOf by manually intersecting schemas. Tests are adapted from the JSON Schema test suite.

Out of the box, this supports:

  • Unioning properties, recursively merging duplicate properties using allOf
  • Unioning required lists
  • Intersecting additionalProperties with a recursive allOf
  • Intersecting items with a recursive allOf
  • Intersecting types
  • Intersecting enums
  • Combining disjoint assertions like minItems, maxItems, minLength, maxLength, pattern, format, minimum, maximum, ...
    • Note that any duplicates across different schemas in the allOf will cause an exception, however we can extend this implementation in the future
  • Accounting for "sibling" keywords in the base schema next to an allOf, anyOf, oneOf, or $ref by internally rewriting the node as an allOf

A couple of notes:

  • As always, only one particular order of keys is allowed when generating an object. Usually, we just repeat the order given by the user's schema, but the idea of a "canonical order" is trickier here. We may want to play around with this a bit
  • Technically, we may have to intersect the additionalProperties of one schema with the properties of another in an allOf. This is subtle and here is where unevaluatedProperties may start to become important. I say we punt and make incremental improvements here
  • Wherever possible, I tried to prioritize raising exceptions if we aren't certain we know how to handle things (maybe with the exception of the previous additionalProperties note.)
  • I had to rewrite the $ref implementation again. The allOf tests revealed some edge cases I previously didn't know about. This caused some additional ugliness with passing around a base_uri string, so I'll have a think on how to simplify this if possible. I think it's ok for now.

@codecov-commenter
Copy link

codecov-commenter commented Oct 31, 2024

⚠️ Please install the 'codecov app svg image' to ensure uploads and comments are reliably processed by Codecov.

Codecov Report

Attention: Patch coverage is 92.97125% with 22 lines in your changes missing coverage. Please review.

Project coverage is 65.19%. Comparing base (11892c2) to head (0a63f59).
Report is 1 commits behind head on main.

Files with missing lines Patch % Lines
guidance/library/_json.py 92.97% 22 Missing ⚠️

❗ Your organization needs to install the Codecov GitHub app to enable full functionality.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1070      +/-   ##
==========================================
- Coverage   65.35%   65.19%   -0.17%     
==========================================
  Files          65       65              
  Lines        4892     5102     +210     
==========================================
+ Hits         3197     3326     +129     
- Misses       1695     1776      +81     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

const: Union[Unset, Any] = _unset

def handle_keyword(key: str, value: Any, base_uri: str):
nonlocal type
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have learned a new Python keyword today.....

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Super handy for modifying closure variables :)

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sort of 'closed but not really' ? :-)

@hudson-ai
Copy link
Collaborator Author

I made the mistake of refactoring test_json into a folder as well as running black on it... The diff in the tests looks much bigger than it should... ctrl-f for "unsatisfiable" in tests/unit/library/json/test_json.py to see the additions relevant to my last comment.

This reverts commit 496718e.
Undo blacken to reduce diff size of PR
@hudson-ai hudson-ai changed the title [Feature] [JSON] Broader support of allOf JSON keyword [WIP][Feature] [JSON] Broader support of allOf JSON keyword Nov 5, 2024
@hudson-ai
Copy link
Collaborator Author

@riedgar-ms just a heads up that I moved this back to WIP as @Saibo-creator gave me some examples of schemas for which this fails -- will let you know when it's ready for review again.

@riedgar-ms
Copy link
Collaborator

@riedgar-ms just a heads up that I moved this back to WIP as @Saibo-creator gave me some examples of schemas for which this fails -- will let you know when it's ready for review again.

Thanks. One suggestion: why not do the split into separate test files now, as a separate PR. Keep things orthogonal and more reasonably sized (well, slightly).

@hudson-ai hudson-ai changed the title [WIP][Feature] [JSON] Broader support of allOf JSON keyword [Feature] [JSON] Broader support of allOf JSON keyword Nov 6, 2024
@hudson-ai
Copy link
Collaborator Author

@riedgar-ms refactored a little bit just to move some of the potentially reusable parts out into their own methods. I was hoping to make anyOf and oneOf work inside of an allOf, but I'm finding it a bit mind-bending at the moment... Rather than make the PR wait on that functionality, I'd like to go ahead with it as-is, adding those in later. This PR already introduces a good amount of functionality with allOfs.

TLDR: ready for review.

try:
additional_properties_grammar = self.json(json_schema=additional_properties, base_uri=base_uri)
except UnsatisfiableSchemaError as e:
if any(k not in properties for k in required):
Copy link
Collaborator

@riedgar-ms riedgar-ms Nov 6, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No else needed here? Even to just re-raise (and in other exception blocks with a conditional), since this is going to swallow exceptions.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As-is, this says "if the additionalProperties schema is unsatisfiable, we simply disallow additionalProperties". This can come up, for example, if you take the intersection (allOf) of two schemas with contradicting additionalProperties. From JSON schema's point of view, this is expected behavior. additionalProperties=False (which we previously had an explicit check for) is just a sub-case of this, NOT a special case.

This only makes the parent schema unsatisfiable in the case that there are some required properties that don't have schemas in the properties dict -- (thereby requiring a property that is unsatisfiable). See just a few lines earlier (L574-L582) where we only raise an UnsatisfiableSchemaError for an unsatisfiable property if it's required, and we otherwise "blacklist" it.

TLDR: UnsatisfiableSchemaError is semantically different from all other exception types, as unsatisfiable schemas are part of JSON schema's design. We only re-raise them in particular situations.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK. If this and other places this sort of pattern pops up are considered normal. Perhaps a comment, though?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Happy to add a comment

# i corresponds to the number of items we've already satisfied
if i < min_items:
raise UnsatisfiableSchemaError(f"prefixItems[{i}] is unsatisfiable but min_items is {min_items}") from e
# Having an unsatisfiable prefix item is fine if we've already satisfied min_items, but this effectively sets max_items to i
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this worth a warnings.warn()? Both from a Python perspective (breaking out of a loop by catching an exception is not normal flow control), and from a JSON-schema one? Your comment is one which makes me feel that the user might not have quite written the schema they meant to.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

breaking out of a loop by catching an exception is not normal flow control

Not sure I agree with this premise, but I'll admit that doing something like {"prefixItems": [True, True, False, True]} would be a VERY odd way of saying "no more than two items please". That being said, this behavior is semantically consistent with JSON schema, and we only need to re-raise an UnsatisfiableSchemaError if minItems conflicts.

More philosophically, I'm not sure we should be in the business of policing the semantic meaning of peoples' schemas. I'd err on the side of giving the user what they asked for (without polluting them with warnings), only raising if there's something actually semantically invalid about what they asked for (e.g. an UnsatisfiableSchemaError makes its way up to the top without being caught) or if they ask for something we don't support (e.g. a NotImplementedError). Thoughts?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just some additional color on above -- this can arise organically if you try and do something like

{
   "allOf": [
      {"prefixItems": [{"type": "integer"}, {"type": "string"}, {"type": "integer"}]}
      {"prefixItems": [{"type": "integer"}, {"type": "boolean"}, {"type": "integer"}]}
   ]
}

This schema should be equivalent to

{
   "prefixItems": [
      {"type": "integer"},
      {"allOf": [{"type": "string"}, {"type": "string"}]},
      {"type": "integer"}]
}

and therefore

{"prefixItems": [{"type": "integer"}, false, {"type": "integer"}]}

As false isn't actually a grammar, we can't really return those from further down the stack. I suppose we could return a null grammar, but I (personally) think that's a bit trickier than having a special exception type.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, I'm happy to leave it as-is for now. Your point about expecting users to know how to write a JSON schema is fair (although it seems that it's easy to mess up by accident in some subtle and non-obvious way).

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My hope is that people writing JSON know json, and everyone else will use pydantic, which is less expressive and harder to shoot yourself in the foot with 😅

if len(allof_list) != 1:
raise ValueError("Only support allOf with exactly one item")
return lm + self.json(json_schema=allof_list[0])
assert not sibling_keys
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I assume that these bare assertions (without even a text message) should never be hit, unless things are going really wrong?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Honestly, there's a philosophy that "production code" should never have bare asserts, which I may agree with. But yes, these asserts are essentially "whoever wrote this code really messed up".

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Happy to try and find something better to raise in these situations if it would make you more comfortable. I have a (maybe bad) tendency to use asserts almost as comments to help myself reason about state.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds like it's good to leave them in as-is. I tend to like rechecking things too, since it's often the really simple bit of code that couldn't possibly be wrong.....

Copy link
Collaborator

@riedgar-ms riedgar-ms left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few queries, but happy to keep this moving

Copy link
Collaborator

@riedgar-ms riedgar-ms left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

@hudson-ai hudson-ai merged commit 0ace873 into guidance-ai:main Nov 6, 2024
105 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants