Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[CORE-6836] schema registry json internal references #23461

Merged
merged 14 commits into from
Oct 7, 2024

Conversation

andijcr
Copy link
Contributor

@andijcr andijcr commented Sep 24, 2024

Support for internal references for json schema compat checks

Implementation strategy:

prepoc phase:

unbundle schemas and save them in a map <$id value, <jsonpointer to schema, dialect>>

note: it’s possible to have nested bundled schemas, the code eagerly collects anything that looks like a bundled schema.

then
for each schema (main or bundled), resolve all relative $ref to absolute, using the correct base uri.

this is done by manually traversing the tree in parse_json, looking for "$id" or "id", and "$ref".

The object is then modified in memory.

No error is thrown in this phase, as it is relevant only if a reference is accessed during the compat check phase

Compat check phase:

Hook in get_schema to iteratively resolve references and use the previous map to handle references into bundled schemas.

Error out if:

  • the local ref is invalid
  • the absolute schema uri does not exist
  • The maximum depth of references is reached. This is to safeguard against direct recursive references and indirect recursive references

To safeguard against recursion, a counter in schema_context is decremented every time a ref is resolved, and this counter is shared with all children of is_superset, the chosen value is 20

get_object_or_empty make use of get_schema but it does not modify the counter for its siblings, need to think about this.

reference merging

$ref with siblings is transformed, during evaluation, to allOf schemas.
This captures the semantics defined by the json schema standard and the implementation used by validators like jsoncons.

an example:

this schema

{ 
"$ref": "#/a_fragment",
"type": "number", 
"minimum": 10
}

will be converted to

{
  "allOf": [
    { //schema at #/a_fragment
    },
    { "type": "number", "minimum": 10 }
  ]
}

not implemented yet

  • different dialects for bundled schema: need to keep track of the dialect for each json::Value
  • external references: part of another jira ticket.

Fixes https://redpandadata.atlassian.net/browse/CORE-6836

Backports Required

  • none - not a bug fix
  • none - this is a backport
  • none - issue does not exist in previous branches
  • none - papercut/not impactful enough to backport
  • v24.2.x
  • v24.1.x
  • v23.3.x

Release Notes

  • none

src/v/pandaproxy/schema_registry/json.cc Outdated Show resolved Hide resolved
src/v/pandaproxy/schema_registry/json.cc Outdated Show resolved Hide resolved
src/v/pandaproxy/schema_registry/json.cc Outdated Show resolved Hide resolved
Comment on lines +2029 to +2228
auto maybe_draft4_id_it = this_obj.find("id");
auto maybe_id_it = this_obj.find("$id");
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should the dialect be lookup up first, to determine whether to use "$id" or "id"?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i tried to come with a simpler form but the fact is that we know that we are in bundled schema if we find a "id" member, and then if we are in a bundled schema then "$schema" can be analyzed, and the dialect has agree to "$id" or "id. it's kinda a chicken/egg problem,
so that's why i just try to find both, and if at least one is found then i assume we are in bundled schema and i can check if it's correct or not.

src/v/pandaproxy/schema_registry/json.cc Outdated Show resolved Hide resolved
src/v/pandaproxy/schema_registry/json.cc Outdated Show resolved Hide resolved
src/v/pandaproxy/schema_registry/json.cc Outdated Show resolved Hide resolved
src/v/pandaproxy/schema_registry/json.cc Outdated Show resolved Hide resolved
@andijcr andijcr force-pushed the feat/core-6836/schema-registry-json-references branch from 77db262 to 9096718 Compare September 26, 2024 16:09
Copy link
Contributor

@pgellert pgellert left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

great effort, looks good so far. I'd love to see some tests for all this behaviour and various edge cases 😄

src/v/pandaproxy/schema_registry/json.cc Outdated Show resolved Hide resolved
src/v/pandaproxy/schema_registry/json.cc Outdated Show resolved Hide resolved
src/v/pandaproxy/schema_registry/json.cc Outdated Show resolved Hide resolved
src/v/pandaproxy/schema_registry/json.cc Outdated Show resolved Hide resolved
src/v/pandaproxy/schema_registry/json.cc Outdated Show resolved Hide resolved
@andijcr andijcr force-pushed the feat/core-6836/schema-registry-json-references branch from 9096718 to 03ec3de Compare September 30, 2024 16:16
@andijcr andijcr requested review from BenPope and pgellert September 30, 2024 16:37
@andijcr andijcr marked this pull request as ready for review September 30, 2024 16:37
@andijcr
Copy link
Contributor Author

andijcr commented Sep 30, 2024

force push: addressed comments, added test cases

@andijcr andijcr requested a review from a team September 30, 2024 16:56
@vbotbuildovich
Copy link
Collaborator

vbotbuildovich commented Sep 30, 2024

new failures in https://buildkite.com/redpanda/redpanda/builds/55458#01924419-c191-4958-bfe6-605084731cb1:

"rptest.tests.schema_registry_test.SchemaRegistryAutoAuthTest.test_normalize.dataset_type=JSON"
"rptest.tests.schema_registry_test.SchemaRegistryTest.test_normalize.dataset_type=JSON"

new failures in https://buildkite.com/redpanda/redpanda/builds/55458#0192441a-3d8e-4787-b67a-3d90e154d27c:

"rptest.tests.schema_registry_test.SchemaRegistryAutoAuthTest.test_normalize.dataset_type=JSON"
"rptest.tests.schema_registry_test.SchemaRegistryTest.test_normalize.dataset_type=JSON"

…e uri type

the base uri for a schema is saved with the "$id" or "id" key and any
relative reference is relative to this uri.

since the uri can contain a bunch of other parts like protocol (http vs
https) or port, this type is meant to be a canonical representation as
far as references are concerned.

the form is

host[/path]

where /path is optional

new type alias: id_to_schema_pointer

it's a map to resolve references json_id_uri->{json_pointer, dialect}

with a json_id_uri, we get the path to the actual json object for the
bundled schema, and the dialect used by it.

an absolute reference will first query the json_id_uri, get the path to
the root object, and then reach for the specific object
@andijcr andijcr force-pushed the feat/core-6836/schema-registry-json-references branch from 03ec3de to 2818ab0 Compare October 1, 2024 20:08
@andijcr
Copy link
Contributor Author

andijcr commented Oct 1, 2024

force push: rebase on dev, switched from jsoncons::json to jsoncons::ojson to preserve insertion order of the json keys.

    the switch to jsoncons::json in parse_json means that the function always performed
    key sorting, making the "normalize" flag redundant

    jsoncons can work in insertion-order mode, to do so we switch to the
    type alias jsoncons::ojson.

    this is done to preserve the original order of the input
    see
    tests/rptest/tests/schema_registry_test.py::SchemaRegistryAutoAuthTest.test_normalize
    for an example where this can be observed externally from the schema
    registry API

@andijcr andijcr force-pushed the feat/core-6836/schema-registry-json-references branch 3 times, most recently from 60e2e23 to 8fb9da1 Compare October 2, 2024 09:37
@andijcr
Copy link
Contributor Author

andijcr commented Oct 2, 2024

force push: removed brittle heuristic, fixed a warning for an error message, tried to appease clang-format

@andijcr andijcr force-pushed the feat/core-6836/schema-registry-json-references branch from 8fb9da1 to 92ca754 Compare October 2, 2024 09:41
@andijcr
Copy link
Contributor Author

andijcr commented Oct 2, 2024

#23461 (comment)
edit: issue was between keyboard and chair

Copy link
Member

@BenPope BenPope left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good. Not reviewed all the tests yet.

src/v/pandaproxy/schema_registry/json.cc Outdated Show resolved Hide resolved
Comment on lines 2086 to 2317
bundled_schemas.insert_or_assign(
{}, std::pair{json::Pointer{}, dialect});
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is anything inserted?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i wanted to keep the invariant of bundled schemas always having at least an entry for the root.
in this specific case there is no practical purpose, but since in the rest of the file we try to treat true as {} i felt that it's no harm to extend this also here.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm wondering if this will be important to the implementation of external references. E.g. if there's a schema A referencing schema B where B is the bool schema "true". I guess it will depend on the implementation specifics.

src/v/pandaproxy/schema_registry/json.cc Outdated Show resolved Hide resolved
Comment on lines +2072 to +2319
for (auto i = 0u; i < value.size(); ++i) {
if (value[i].is_object()) {
collect_and_fix(i, value[i]);
}
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suppose in theory, there could be arrays of arrays?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

😭 in the interest of getting this pr in, I'll rework this in the next pr for external refs

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

on a second thought, I don't think it's possible meaningfully to have arrays of arrays with some reachable references:
"type": "array" requires prefixItems to be a schema of objects,
similarly allOf/oneOf/anyOf are schema arrays

src/v/pandaproxy/schema_registry/json.cc Outdated Show resolved Hide resolved
src/v/pandaproxy/schema_registry/json.cc Outdated Show resolved Hide resolved
src/v/pandaproxy/schema_registry/json.cc Show resolved Hide resolved
src/v/pandaproxy/schema_registry/json.cc Outdated Show resolved Hide resolved
src/v/pandaproxy/schema_registry/json.cc Show resolved Hide resolved
switch roles between jsoncons<->rapidjson. jsoncons has a nicer api for
the next commit
@andijcr andijcr force-pushed the feat/core-6836/schema-registry-json-references branch from 92ca754 to bce34ad Compare October 2, 2024 13:23
Copy link
Contributor

@pgellert pgellert left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I left a few minor suggestions/questions but it pretty much looks good.

Comment on lines 2086 to 2317
bundled_schemas.insert_or_assign(
{}, std::pair{json::Pointer{}, dialect});
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm wondering if this will be important to the implementation of external references. E.g. if there's a schema A referencing schema B where B is the bool schema "true". I guess it will depend on the implementation specifics.

src/v/pandaproxy/schema_registry/json.cc Outdated Show resolved Hide resolved
src/v/pandaproxy/schema_registry/json.cc Outdated Show resolved Hide resolved
src/v/pandaproxy/schema_registry/json.cc Outdated Show resolved Hide resolved
src/v/pandaproxy/schema_registry/json.cc Outdated Show resolved Hide resolved
src/v/pandaproxy/schema_registry/json.cc Outdated Show resolved Hide resolved
src/v/pandaproxy/schema_registry/json.cc Outdated Show resolved Hide resolved
BenPope
BenPope previously requested changes Oct 3, 2024
Copy link
Member

@BenPope BenPope left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have 2 commits here:

  1. I didn't see tests for relative references so I added some
  2. Recursion results in stack-overflow (this should be fixed)

@andijcr andijcr force-pushed the feat/core-6836/schema-registry-json-references branch from bce34ad to 1f01a68 Compare October 4, 2024 14:07
andijcr added 11 commits October 4, 2024 16:39
the switch to jsoncons::json in parse_json means that the function always performed
key sorting, making the "normalize" flag redundant

jsoncons can work in insertion-order mode, to do so we switch to the
type alias jsoncons::ojson.

this is done to preserve the original order of the input
see
tests/rptest/tests/schema_registry_test.py::SchemaRegistryAutoAuthTest.test_normalize
for an example where this can be observed externally from the schema
registry api
"^b" is repeated twice, and this is not illegal but the result is not
dependent on the strategy used by the json parse
before this commit, get_object_or_empty would get a context to pass to
get_schema. this was done with the idea of resolving references in
get_schema and benefit from that.

however, this clash with the design of resolving references only in
is_superset, and the standard prevents "properties" "patternProperties"
"dependency" from being references (see tests).

this means that the whole code can be simplified (at the cost of handing
objects and bools directly in get_object_or_empty).

keywords (like items, additionalItems, additionalProperties) that can be
references where already passed directly to is_superset, so resolving
early the references is a capability that's not needed
this recursive function crawl the schema to gather all bundled schemas.
a bundled schema is defined as a schema with "$id" defined.
the recursion is on the "$defs"/"definitions" members of a schema, since
the schema validation ensures that children of these properties are
valid schemas.

some small notes:
- it's possible that bundled schemas use a different
dialect than the root object, so we collect this
- the bundled schemas could be nested
- we can't error in this function since it's called at parse time and
  it's not assured that a bundled schema is actually accessed
add helper to resolve references (local or to bundled schemas)

add _ref_unit to keep track of how many references can be traversed in
the current subtree, this will be used to prevent infinite recursion.
this helper can hold a json::Value const& or a json::Document, and can
transparently convert to json::Value const&.

it will be used to merge references, when to resolve a reference with siblings we need to
create a new json schema

internally is a variant of two pointers, because holding a json::Document is the
unlikely case of and it would occupy 104 bytes
this function accepts list of json object to merge into a new schema
with the form

{
  "allOf": [ *input_list ]
}

the input_list is expected to be a chain of reference objects of size
>=2, all of them with a $ref key except the last one

the final schema will contain these objects, with the $ref key removed
from each one

the user is responsible to keep the original json::Document alive, since
the non-ref strings will be referenced in the result.

one optimization: if the chain contains only one non-empty object
without the $ref key, this will be returned directly.
this is to support the likely case of a chain of pure $ref pointing to
the final schema
get_schema will use resolve_reference to traverse references until the
final target is reached.

it's expected for references to absolute, thanks to the preprocessing
phase.
retrieval is two step process: first retrieve the bundled schema, then
retrieve the sub object.

if the sub object contains a reference, repeat the process.

the function uses schema_context to limit the number of iterations. the
number of iterations is shared across children of an object to prevent
indirect references to cause a stack overflow.

external references are not supported

different dialects in bundled schemas are not supported in this commit,
the function throws if the dialects do not match
the test cases are positive/negative test around $ref resolution.

we have local relative refs
local absolute ref
refs to bundled schemas
multiple jumps refs
recursive refs (rejection)
array of refs
refs with siblings
	this last one shows the transformation perfomed:

a
{
    $ref: uri
    siblings...
}
will be internally treated as
{
    allOf: [
      { //uri referenced schema}
      { siblings... }
    ]
}
this test shows how the in-memory representation of $ref is absolute
relative to the correct base uri.

it uses jsoncons and the jsonpatch extension for easier
parse/print/compare ops
@andijcr andijcr force-pushed the feat/core-6836/schema-registry-json-references branch from 1f01a68 to cae592f Compare October 4, 2024 15:34
@andijcr andijcr requested review from pgellert and BenPope October 4, 2024 15:37
@andijcr
Copy link
Contributor Author

andijcr commented Oct 4, 2024

I have 2 commits here:

1. I didn't see tests for relative references so I added some

2. Recursion results in stack-overflow (this should be fixed)
  1. added them
  2. fix for this is to ensure that ref resolution is done only in is_superset. to do so i decoupled get_object_or_empty from get_schema. get_object_or_empty does not need the context because it's used for keyword like "properties" that can't be a reference (added a test for this), or for objects that gets passed directly to is_superset, so there is no need to resolve references

@andijcr andijcr dismissed BenPope’s stale review October 4, 2024 15:53

fixed stackoverflow due to unhandled infinite recursion

take the safe approach and reject schemas with invalid bundled schema.

reason for rejection: the dialect is unknown, the dialect does not match
the key used for "$id", or the bundled schema does not pass validation.
Copy link
Member

@BenPope BenPope left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great

Comment on lines +157 to +201
// invalid bundled schemas
error_test_case{
R"(
{
"$comment": "the root schema is valid but the bundled schema is not",
"$defs": {
"$comment": "dialect is draft-04, but id key is '$id'",
"$id": "https://example.com/mismatch_id",
"$schema": "http://json-schema.org/draft-04/schema#"
}
}
)",
pps::error_info{
pps::error_code::schema_invalid,
"bundled schema with mismatched dialect "
"'http://json-schema.org/draft-04/schema#' for id key"}},
error_test_case{
R"(
{
"$comment": "the root schema is valid but the bundled schema is not",
"$defs": {
"$comment": "dialect is unkown",
"$id": "https://example.com/mismatch_id",
"$schema": "http://json-schema.org/draft-3000/schema#"
}
}
)",
pps::error_info{
pps::error_code::schema_invalid,
"bundled schema without a known dialect: "
"'http://json-schema.org/draft-3000/schema#'"}},
error_test_case{
R"(
{
"$comment": "the root schema is valid but the bundled schema is not",
"$defs": {
"$comment": "schema is invalid",
"$id": "https://example.com/mismatch_id",
"type": "potato"
}
}
)",
pps::error_info{
pps::error_code::schema_invalid,
R"(bundled schema is invalid. Invalid json schema: '{"$comment":"schema is invalid","$id":"https://example.com/mismatch_id","type":"potato"}'. Error: '/type: Must be valid against at least one schema, but found no matching schemas')"}},
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It makes sense to me that these are rejected, but this appears stricter than the reference implementation.

I wonder if referencing a schema in the bundled schema changes that?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes so as far as i tested, only if a reference from the root points to an invalid bundled schema, then the whole schema is rejected.

So if a bundled schema is never referenced, then the schema is not rejected.

to implement this we need to evolve the whole collect+fix function, since the reference implementation only considers references that can be actually accessed, so this is not not considered
/randomname/$ref while this is /properties/a_property_name/$ref

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes so as far as i tested, only if a reference from the root points to an invalid bundled schema, then the whole schema is rejected.

So if a bundled schema is never referenced, then the schema is not rejected.

to implement this we need to evolve the whole collect+fix function, since the reference implementation only considers references that can be actually accessed, so this is not not considered /randomname/$ref while this is /properties/a_property_name/$ref

Do you think anything needs changing?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a more refined traversal. Now everything that looks like a $ref is resolved against the base uri.
The more complete operation would limit the operation to $ref that can be reached when applying the schema, and those $ref need to be validated

Copy link
Member

@BenPope BenPope left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm happy for this to merge, any followups in another PR

@andijcr andijcr merged commit ad164e3 into dev Oct 7, 2024
18 checks passed
@andijcr andijcr deleted the feat/core-6836/schema-registry-json-references branch October 7, 2024 13:16
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants