-
Notifications
You must be signed in to change notification settings - Fork 596
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[CORE-6836] schema registry json internal references #23461
Conversation
auto maybe_draft4_id_it = this_obj.find("id"); | ||
auto maybe_id_it = this_obj.find("$id"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should the dialect be lookup up first, to determine whether to use "$id" or "id"?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i tried to come with a simpler form but the fact is that we know that we are in bundled schema if we find a "id" member, and then if we are in a bundled schema then "$schema" can be analyzed, and the dialect has agree to "$id" or "id. it's kinda a chicken/egg problem,
so that's why i just try to find both, and if at least one is found then i assume we are in bundled schema and i can check if it's correct or not.
77db262
to
9096718
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
great effort, looks good so far. I'd love to see some tests for all this behaviour and various edge cases 😄
9096718
to
03ec3de
Compare
force push: addressed comments, added test cases |
new failures in https://buildkite.com/redpanda/redpanda/builds/55458#01924419-c191-4958-bfe6-605084731cb1:
new failures in https://buildkite.com/redpanda/redpanda/builds/55458#0192441a-3d8e-4787-b67a-3d90e154d27c:
|
…e uri type the base uri for a schema is saved with the "$id" or "id" key and any relative reference is relative to this uri. since the uri can contain a bunch of other parts like protocol (http vs https) or port, this type is meant to be a canonical representation as far as references are concerned. the form is host[/path] where /path is optional new type alias: id_to_schema_pointer it's a map to resolve references json_id_uri->{json_pointer, dialect} with a json_id_uri, we get the path to the actual json object for the bundled schema, and the dialect used by it. an absolute reference will first query the json_id_uri, get the path to the root object, and then reach for the specific object
03ec3de
to
2818ab0
Compare
force push: rebase on dev, switched from jsoncons::json to jsoncons::ojson to preserve insertion order of the json keys.
|
60e2e23
to
8fb9da1
Compare
force push: removed brittle heuristic, fixed a warning for an error message, tried to appease clang-format |
8fb9da1
to
92ca754
Compare
#23461 (comment) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good. Not reviewed all the tests yet.
bundled_schemas.insert_or_assign( | ||
{}, std::pair{json::Pointer{}, dialect}); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is anything inserted?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i wanted to keep the invariant of bundled schemas always having at least an entry for the root.
in this specific case there is no practical purpose, but since in the rest of the file we try to treat true
as {}
i felt that it's no harm to extend this also here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm wondering if this will be important to the implementation of external references. E.g. if there's a schema A referencing schema B where B is the bool schema "true". I guess it will depend on the implementation specifics.
for (auto i = 0u; i < value.size(); ++i) { | ||
if (value[i].is_object()) { | ||
collect_and_fix(i, value[i]); | ||
} | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I suppose in theory, there could be arrays of arrays?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
😭 in the interest of getting this pr in, I'll rework this in the next pr for external refs
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
on a second thought, I don't think it's possible meaningfully to have arrays of arrays with some reachable references:
"type": "array" requires prefixItems to be a schema of objects,
similarly allOf/oneOf/anyOf are schema arrays
ducktape was retried in https://buildkite.com/redpanda/redpanda/builds/55641#01924cdc-6216-4f22-bddd-a8d4fadad868 ducktape was retried in https://buildkite.com/redpanda/redpanda/builds/55641#01924cf7-13c1-441c-85f1-9474004c556e ducktape was retried in https://buildkite.com/redpanda/redpanda/builds/55641#01924cf7-13bd-4fcc-b7ca-c41b5c03981e ducktape was retried in https://buildkite.com/redpanda/redpanda/builds/55820#0192591e-d718-4be7-b33b-4e62a8ec48e3 |
switch roles between jsoncons<->rapidjson. jsoncons has a nicer api for the next commit
92ca754
to
bce34ad
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I left a few minor suggestions/questions but it pretty much looks good.
bundled_schemas.insert_or_assign( | ||
{}, std::pair{json::Pointer{}, dialect}); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm wondering if this will be important to the implementation of external references. E.g. if there's a schema A referencing schema B where B is the bool schema "true". I guess it will depend on the implementation specifics.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have 2 commits here:
- I didn't see tests for relative references so I added some
- Recursion results in stack-overflow (this should be fixed)
bce34ad
to
1f01a68
Compare
the switch to jsoncons::json in parse_json means that the function always performed key sorting, making the "normalize" flag redundant jsoncons can work in insertion-order mode, to do so we switch to the type alias jsoncons::ojson. this is done to preserve the original order of the input see tests/rptest/tests/schema_registry_test.py::SchemaRegistryAutoAuthTest.test_normalize for an example where this can be observed externally from the schema registry api
"^b" is repeated twice, and this is not illegal but the result is not dependent on the strategy used by the json parse
before this commit, get_object_or_empty would get a context to pass to get_schema. this was done with the idea of resolving references in get_schema and benefit from that. however, this clash with the design of resolving references only in is_superset, and the standard prevents "properties" "patternProperties" "dependency" from being references (see tests). this means that the whole code can be simplified (at the cost of handing objects and bools directly in get_object_or_empty). keywords (like items, additionalItems, additionalProperties) that can be references where already passed directly to is_superset, so resolving early the references is a capability that's not needed
this recursive function crawl the schema to gather all bundled schemas. a bundled schema is defined as a schema with "$id" defined. the recursion is on the "$defs"/"definitions" members of a schema, since the schema validation ensures that children of these properties are valid schemas. some small notes: - it's possible that bundled schemas use a different dialect than the root object, so we collect this - the bundled schemas could be nested - we can't error in this function since it's called at parse time and it's not assured that a bundled schema is actually accessed
add helper to resolve references (local or to bundled schemas) add _ref_unit to keep track of how many references can be traversed in the current subtree, this will be used to prevent infinite recursion.
this helper can hold a json::Value const& or a json::Document, and can transparently convert to json::Value const&. it will be used to merge references, when to resolve a reference with siblings we need to create a new json schema internally is a variant of two pointers, because holding a json::Document is the unlikely case of and it would occupy 104 bytes
this function accepts list of json object to merge into a new schema with the form { "allOf": [ *input_list ] } the input_list is expected to be a chain of reference objects of size >=2, all of them with a $ref key except the last one the final schema will contain these objects, with the $ref key removed from each one the user is responsible to keep the original json::Document alive, since the non-ref strings will be referenced in the result. one optimization: if the chain contains only one non-empty object without the $ref key, this will be returned directly. this is to support the likely case of a chain of pure $ref pointing to the final schema
get_schema will use resolve_reference to traverse references until the final target is reached. it's expected for references to absolute, thanks to the preprocessing phase. retrieval is two step process: first retrieve the bundled schema, then retrieve the sub object. if the sub object contains a reference, repeat the process. the function uses schema_context to limit the number of iterations. the number of iterations is shared across children of an object to prevent indirect references to cause a stack overflow. external references are not supported different dialects in bundled schemas are not supported in this commit, the function throws if the dialects do not match
the test cases are positive/negative test around $ref resolution. we have local relative refs local absolute ref refs to bundled schemas multiple jumps refs recursive refs (rejection) array of refs refs with siblings this last one shows the transformation perfomed: a { $ref: uri siblings... } will be internally treated as { allOf: [ { //uri referenced schema} { siblings... } ] }
this test shows how the in-memory representation of $ref is absolute relative to the correct base uri. it uses jsoncons and the jsonpatch extension for easier parse/print/compare ops
1f01a68
to
cae592f
Compare
|
fixed stackoverflow due to unhandled infinite recursion
take the safe approach and reject schemas with invalid bundled schema. reason for rejection: the dialect is unknown, the dialect does not match the key used for "$id", or the bundled schema does not pass validation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks great
// invalid bundled schemas | ||
error_test_case{ | ||
R"( | ||
{ | ||
"$comment": "the root schema is valid but the bundled schema is not", | ||
"$defs": { | ||
"$comment": "dialect is draft-04, but id key is '$id'", | ||
"$id": "https://example.com/mismatch_id", | ||
"$schema": "http://json-schema.org/draft-04/schema#" | ||
} | ||
} | ||
)", | ||
pps::error_info{ | ||
pps::error_code::schema_invalid, | ||
"bundled schema with mismatched dialect " | ||
"'http://json-schema.org/draft-04/schema#' for id key"}}, | ||
error_test_case{ | ||
R"( | ||
{ | ||
"$comment": "the root schema is valid but the bundled schema is not", | ||
"$defs": { | ||
"$comment": "dialect is unkown", | ||
"$id": "https://example.com/mismatch_id", | ||
"$schema": "http://json-schema.org/draft-3000/schema#" | ||
} | ||
} | ||
)", | ||
pps::error_info{ | ||
pps::error_code::schema_invalid, | ||
"bundled schema without a known dialect: " | ||
"'http://json-schema.org/draft-3000/schema#'"}}, | ||
error_test_case{ | ||
R"( | ||
{ | ||
"$comment": "the root schema is valid but the bundled schema is not", | ||
"$defs": { | ||
"$comment": "schema is invalid", | ||
"$id": "https://example.com/mismatch_id", | ||
"type": "potato" | ||
} | ||
} | ||
)", | ||
pps::error_info{ | ||
pps::error_code::schema_invalid, | ||
R"(bundled schema is invalid. Invalid json schema: '{"$comment":"schema is invalid","$id":"https://example.com/mismatch_id","type":"potato"}'. Error: '/type: Must be valid against at least one schema, but found no matching schemas')"}}, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It makes sense to me that these are rejected, but this appears stricter than the reference implementation.
I wonder if referencing a schema in the bundled schema changes that?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes so as far as i tested, only if a reference from the root points to an invalid bundled schema, then the whole schema is rejected.
So if a bundled schema is never referenced, then the schema is not rejected.
to implement this we need to evolve the whole collect+fix function, since the reference implementation only considers references that can be actually accessed, so this is not not considered
/randomname/$ref
while this is /properties/a_property_name/$ref
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes so as far as i tested, only if a reference from the root points to an invalid bundled schema, then the whole schema is rejected.
So if a bundled schema is never referenced, then the schema is not rejected.
to implement this we need to evolve the whole collect+fix function, since the reference implementation only considers references that can be actually accessed, so this is not not considered
/randomname/$ref
while this is/properties/a_property_name/$ref
Do you think anything needs changing?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just a more refined traversal. Now everything that looks like a $ref is resolved against the base uri.
The more complete operation would limit the operation to $ref that can be reached when applying the schema, and those $ref need to be validated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm happy for this to merge, any followups in another PR
Support for internal references for json schema compat checks
Implementation strategy:
prepoc phase:
unbundle schemas and save them in a map <$id value, <jsonpointer to schema, dialect>>
note: it’s possible to have nested bundled schemas, the code eagerly collects anything that looks like a bundled schema.
then
for each schema (main or bundled), resolve all relative $ref to absolute, using the correct base uri.
this is done by manually traversing the tree in parse_json, looking for "$id" or "id", and "$ref".
The object is then modified in memory.
No error is thrown in this phase, as it is relevant only if a reference is accessed during the compat check phase
Compat check phase:
Hook in get_schema to iteratively resolve references and use the previous map to handle references into bundled schemas.
Error out if:
To safeguard against recursion, a counter in schema_context is decremented every time a ref is resolved, and this counter is shared with all children of
is_superset
, the chosen value is 20get_object_or_empty make use of get_schema but it does not modify the counter for its siblings, need to think about this.
reference merging
$ref with siblings is transformed, during evaluation, to
allOf
schemas.This captures the semantics defined by the json schema standard and the implementation used by validators like jsoncons.
an example:
this schema
will be converted to
not implemented yet
Fixes https://redpandadata.atlassian.net/browse/CORE-6836
Backports Required
Release Notes