Support instantiating mutually referencing schema bundles #76

Open
handrews opened this issue Mar 17, 2023 · 0 comments · May be fixed by #82 or #85
This issue isn't exactly about bundling, but the recommended bundling process in the JSON Schema Core spec shows how the following schemas might get created:

```json
{
    "$schema": "https://json-schema.org/draft/2020-12/schema",
    "$id": "https://example.com/bundle1",
    "$defs": {
        "a": {
            "$id": "https://example.com/source1/a",
            "$ref": "../source2/b"
        },
        "b": {
            "$id": "https://example.com/source1/b",
            "type": "object"
        }
    }
}
```

```json
{
    "$schema": "https://json-schema.org/draft/2020-12/schema",
    "$id": "https://example.com/bundle2",
    "$defs": {
        "a": {
            "$id": "https://example.com/source2/a",
            "$ref": "../source1/b"
        },
        "b": {
            "$id": "https://example.com/source2/b",
            "type": "array"
        }
    }
}
```

It is currently difficult to work with such schemas in jschon. They cannot be directly instantiated as JSONSchema instances, because each document references the other. Nor can they be loaded via Catalog using the document root $ids, because each document's $ref targets an embedded resource URI, not the document root URI.
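To make the lookup mismatch concrete, here is a minimal dict-based model of a source keyed by document root URIs (a sketch for illustration only, not jschon's actual Catalog or Source API):

```python
# Minimal model (not jschon): a source that can only fetch documents
# by the URIs it was configured with, i.e. the document root $ids.
documents = {
    "https://example.com/bundle1": {"$id": "https://example.com/bundle1"},
    "https://example.com/bundle2": {"$id": "https://example.com/bundle2"},
}

def fetch(uri):
    """Simulate a source that maps request URIs directly to documents."""
    if uri in documents:
        return documents[uri]
    raise LookupError(f"no document found for {uri!r}")

# The $ref inside bundle1 resolves against its embedded $id to this URI:
ref_target = "https://example.com/source2/b"

# The document roots are retrievable...
assert fetch("https://example.com/bundle2")["$id"] == "https://example.com/bundle2"

# ...but the embedded resource URI is not, because nothing maps it to bundle2.
try:
    fetch(ref_target)
except LookupError:
    print("unresolvable:", ref_target)
```

The symlink and custom-Source workarounds mentioned below amount to manually adding the embedded-URI keys to this mapping.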

It is possible to get these to work by doing some very convoluted things with sources, such as symlinking the documents under filenames corresponding to the embedded $ids or writing a custom Source subclass that does... something. TBH I haven't quite worked that out, but I'm sure you could make it work.

But I would argue that this should be much easier than these workarounds. And it should also be possible to do without configuring a source — I often have schemas as in-memory Python structures and just want to instantiate them as schemas without having to put them somewhere a source can find them, or create an "InMemorySource" just to route them through the Catalog code.

Schema compilation and reference resolving

The issue here is the otherwise sensible architectural choice to resolve all references on construction of a JSONSchema instance. Whether this is done by instantiating schemas in careful order to always have reference targets available, or by configuring a Source that can load things on-demand-but-during-instantiation, this approach satisfies the directive:

JSON Schema implementations SHOULD understand ahead of time which schemas they will be using, and the URIs that identify them.

from §9.1.2 "Loading a referenced schema". And it satisfies that directive in a way that makes things easy for library users. But it breaks on this one use case.

Open world schema networks

There is another use case that is problematic, which would be more likely to be an issue if JSON Hyper-Schema had taken off. In a hypermedia environment with API schemas, the set of linked schemas is potentially unbounded - a resource with a link to another resource will $ref the other resource's schema as part of the link description. I don't want to focus on this too much as there's no real-world example where it's a problem, but theoretically there could be times when following all references would consume an impractical amount of time and memory.

Deferring reference resolution

After poking around a fair amount, I think the simplest thing to do is exploit the already-existing two phases of compilation. For the benefit of people other than the project owner: JSONSchema.__init__() first instantiates the keywords, and recursively instantiates subschemas. This registers any keyword-created URIs with the catalog, allowing the schema to be used as a reference target. At the very end of the constructor of a root schema, references are recursively resolved by calling JSONSchema._resolve_references().

The limitation can be solved by allowing those two phases to run completely separately. This is already how mutually referencing schema documents work today when the reference targets are document root URIs rather than embedded resource URIs: the first document registers its URIs with the catalog before resolving its own references; resolving them triggers loading of the second document, which in turn registers all of its URIs before resolving its own references back into the first document.
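The interleaving can be sketched with a toy two-phase register/resolve loop (a model of the idea only, not jschon internals; the $ref values are written out absolute for brevity):

```python
# Toy model of two-phase compilation: phase 1 registers every $id a
# document declares; phase 2 checks each $ref against the registry.
registry = {}

def register(doc):
    """Phase 1: walk the document and register all $id URIs."""
    if "$id" in doc:
        registry[doc["$id"]] = doc
    for sub in doc.get("$defs", {}).values():
        register(sub)

def resolve(doc):
    """Phase 2: every $ref must name a registered URI."""
    for sub in doc.get("$defs", {}).values():
        target = sub.get("$ref")
        if target is not None and target not in registry:
            raise LookupError(target)

bundle1 = {"$id": "https://example.com/bundle1", "$defs": {
    "a": {"$id": "https://example.com/source1/a",
          "$ref": "https://example.com/source2/b"},
    "b": {"$id": "https://example.com/source1/b"}}}
bundle2 = {"$id": "https://example.com/bundle2", "$defs": {
    "a": {"$id": "https://example.com/source2/a",
          "$ref": "https://example.com/source1/b"},
    "b": {"$id": "https://example.com/source2/b"}}}

# Register-then-resolve per document fails on the first bundle:
register(bundle1)
try:
    resolve(bundle1)
except LookupError as exc:
    print("deferred:", exc)

# Registering both bundles before resolving either one succeeds:
register(bundle2)
resolve(bundle1)
resolve(bundle2)
```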

It's only when the mutual references use URIs that don't naturally align with how Catalog.get_schema() finds schemas that there are problems.

Solving with direct instantiation

Since this is an unusual scenario, I think it is reasonable to preserve backwards compatibility and require a bit more effort (passing a constructor parameter) to handle this case. This solution also generally allows instantiating schemas regardless of reference ordering, which is useful when you have a bunch of schemas as in-memory data structures but do not know their referencing relationships (for example, if you extract all schemas from an OpenAPI 3.1 description file and instantiate each of them).

I would propose the following:

  • Add a resolve_references constructor parameter to JSONSchema, which defaults to True to preserve the current behavior
  • Add a boolean _references_resolved internal data member to JSONSchema, which means what it says
  • Change JSONSchema._resolve_references() to JSONSchema.resolve_references() and have it check _references_resolved before actually resolving the references; code that instantiates JSONSchema objects with resolve_references=False would call this method once all of the necessary schemas are instantiated
  • Optionally, add a Catalog method that loops over all schemas in the given cache and calls resolve_references() on them, so that if you are instantiating a bunch of schemas with references as shown above, you don't have to keep track of them and resolve them all individually
  • Have JSONSchema.evaluate() check self._references_resolved and raise a JSONSchemaError if it is still False (but see the next section below)
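A minimal sketch of the proposed behavior, using a stand-in class rather than jschon's real JSONSchema (the resolve_references parameter and _references_resolved member follow the proposal above; RuntimeError stands in for JSONSchemaError):

```python
class SketchSchema:
    """Stand-in for the proposed JSONSchema changes; not real jschon code."""

    def __init__(self, value, *, resolve_references=True):
        self.value = value
        self._references_resolved = False
        if resolve_references:          # default True preserves current behavior
            self.resolve_references()

    def resolve_references(self):
        if self._references_resolved:   # idempotent: a second call is a no-op
            return
        # ...real code would walk $ref keywords against the catalog here...
        self._references_resolved = True

    def evaluate(self, instance):
        if not self._references_resolved:
            raise RuntimeError("references not resolved; "
                               "call resolve_references() first")
        # ...real evaluation elided...
        return True

# Deferred construction, then explicit resolution before use:
s = SketchSchema({"$id": "https://example.com/bundle1"},
                 resolve_references=False)
try:
    s.evaluate({})
except RuntimeError as exc:
    print(exc)
s.resolve_references()
assert s.evaluate({})
```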

If this is all that is done, schemas of this sort will be required to be instantiated this way, and not loaded through a Source. I'd say this would meet the minimum requirements for compliance with the specification.

More convenient resolution

Keeping track of the JSONSchema instances and calling resolve_references() on them at the right time could get annoying, and the Catalog already tracks schemas. Adding a Catalog.resolve_references(self, cacheid: str = 'default') -> None method that loops through the given cache and calls JSONSchema.resolve_references() on everything would streamline the process.

This method would also give code that can't be certain that all instantiated schemas have had their references resolved a way to ensure resolution across all cached schemas.
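Sketched with stand-in classes (not jschon's actual API), the method would simply loop over one named cache:

```python
# Stand-in sketch of the proposed Catalog.resolve_references(): loop over
# one named schema cache and resolve everything in it. Names are illustrative.
class SketchCatalog:
    def __init__(self):
        self._schema_cache = {"default": {}}   # cacheid -> {uri: schema}

    def add_schema(self, uri, schema, cacheid="default"):
        self._schema_cache.setdefault(cacheid, {})[uri] = schema

    def resolve_references(self, cacheid="default"):
        for schema in self._schema_cache.get(cacheid, {}).values():
            schema.resolve_references()   # assumed idempotent per schema

class SketchSchema:
    def __init__(self):
        self.resolved = False
    def resolve_references(self):
        self.resolved = True

catalog = SketchCatalog()
a, b = SketchSchema(), SketchSchema()
catalog.add_schema("https://example.com/bundle1", a)
catalog.add_schema("https://example.com/bundle2", b)
catalog.resolve_references()
assert a.resolved and b.resolved
```

Because per-schema resolution is idempotent, calling this after every batch of instantiations is safe even when some schemas were already resolved.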

Also solving via Catalog.load_json()

It's possible that a schema with problematic referencing might first appear as a reference target to be loaded via the Catalog rather than be instantiated directly. This can be solved by adding a resolve_references parameter to Catalog.__init__() and create_catalog() that gets passed through to JSONSchema.__init__(). This also completes the symmetry suggested by the Catalog.resolve_references() method.

Since both Catalog extensions to the fundamental fix are quite simple, I'd propose doing those at the same time as the fix for direct instantiation.

Additional alternative: Automatic deferred resolution

The one drawback to the above plan is that users need to know that certain schemas will require non-default configuration, which further requires additional call(s) to resolve references prior to calling evaluate(). This is necessary to both maintain backwards compatibility and preserve the architectural principle of using only pre-compiled/pre-resolved schemas.

However, it is possible, using additional keyword-level resolution flags, to make a resolution failure not automatically fatal. Instead, the reference is just marked as unresolved. During evaluation, the code attempts to resolve the reference again if it is not already resolved. This makes the whole process as transparent and automatic as possible.
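The retry-during-evaluation idea might look roughly like this toy model (illustrative only; the names and structure are assumptions, not jschon code):

```python
# Sketch of on-demand resolution at evaluation time: an unresolved
# reference is retried lazily the next time evaluate() needs it.
class LazyRef:
    def __init__(self, uri, registry):
        self.uri = uri
        self.registry = registry
        self.target = None              # keyword-level "unresolved" flag

    def resolve(self):
        if self.target is None:         # retry only while still unresolved
            self.target = self.registry.get(self.uri)
        return self.target

    def evaluate(self, instance):
        target = self.resolve()         # re-attempt resolution during evaluation
        if target is None:
            raise RuntimeError(f"still unresolvable: {self.uri}")
        return target(instance)

registry = {}
ref = LazyRef("https://example.com/source2/b", registry)
# Construction succeeded even though the target wasn't registered yet.
registry["https://example.com/source2/b"] = \
    lambda instance: isinstance(instance, list)
assert ref.evaluate([]) is True
assert ref.evaluate("nope") is False
```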

However, on-demand reference resolution during evaluation comes with quite a few problems. There are several variations that mitigate them somewhat: for example, tracking resolution at the root JSONSchema level, checking it during evaluate(), and calling Catalog.resolve_references() so that deferred resolution failures are caught before evaluation actually starts.

I figure if any of this is appealing, it can be discussed further in a separate issue and implemented. For now, I think what I proposed under "Deferring reference resolution" above is the minimum change to be in compliance with the spec.

As always, I'd be happy to implement an alternative solution, but I wanted to file an issue with a concrete proposal.
