PoC recursive schemas #61
Conversation
👋 hey, thanks for the update @Stranger6667!
This looks pretty good to me - I think the remaining work is going to look like carefully tidying up around edge cases with a side of working out how the dataflow is meant to work.
We also seem to have some serious slowdowns in our tests, though part of that might just be running on slow schemas that didn't work at all before - not a problem for now, but I'll want to make CI fast again before merging.
Deepcopying should only be needed where we're resolving references too - I'm pretty sure dropping `deepcopy(resolved)` into `_recurse()` would help, and it might be needed in `resolve_all_refs()` too.
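(A minimal sketch of where that copy would sit - illustrative only, not the project's actual `_recurse`; it assumes a `jsonschema.RefResolver`-style resolver whose `.resolve()` returns `(url, document)`:)

```python
from copy import deepcopy


def _recurse_sketch(schema, resolver):
    # Illustrative only - the point is *where* the copy happens: only when a
    # "$ref" has just been resolved.  The resolver hands back an object that
    # is shared with its cached document, so we copy it before it gets
    # inlined (and possibly mutated later) as part of the containing schema.
    if isinstance(schema, dict) and "$ref" in schema:
        _, resolved = resolver.resolve(schema["$ref"])
        return deepcopy(resolved)
    # Plain dicts and lists are rebuilt rather than mutated - no copy needed.
    if isinstance(schema, dict):
        return {k: _recurse_sketch(v, resolver) for k, v in schema.items()}
    if isinstance(schema, list):
        return [_recurse_sketch(v, resolver) for v in schema]
    return schema
```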
I think careful use of

I also don't mind unconditionally making copies, so long as each place we do so is required for at least some schemas. Clean code over fast code, at least until there's a benchmark to justify any extra complexity!
I would prefer to keep this separate, e.g. by making a public
It sounds like you're on the right track, though I don't have any advice here.
Hi Zac! Thank you for your review! I've been a bit out for the last two weeks, but now I want to continue working on this issue :)
I did a bit of profiling for the following recursive schema:

```python
SCHEMA = {
    "definitions": {
        "Node": {
            "type": "object",
            "properties": {
                "children": {
                    "type": "array",
                    "items": {"$ref": "#/definitions/Node"},
                    "maxItems": 2,
                }
            },
            "required": ["children"],
            "additionalProperties": False,
        },
    },
    "$ref": "#/definitions/Node",
}
```

And the following test:

```python
import cProfile

from hypothesis import given, settings, HealthCheck, seed
from hypothesis_jsonschema import from_schema


@given(value=from_schema(SCHEMA))
@settings(
    max_examples=20,
    suppress_health_check=[
        HealthCheck.too_slow,
        HealthCheck.filter_too_much,
        HealthCheck.data_too_large,
    ],
    derandomize=True,
)
@seed(1)
def test(value):
    pass


with cProfile.Profile() as pr:
    test()
pr.print_stats(sort=1)
```

And here is the output:
So, JSON encoding takes most of the time. Such results are reproducible with other seeds and schemas. I am not sure whether this is specific to my current implementation or an inherent cost of generating data for recursive schemas here 🤷♂️

I tried to improve the performance of JSON encoding and wrote a small Rust encoder with Python bindings - it produces logically the same output, but more compact (and is not much optimized at the moment):

```python
from canonicalising_json import dumps


def encode_canonical_json(value: JSONType) -> str:
    """Canonical form serialiser, for uniqueness testing."""
    return dumps(value)
```

And the results are:
The cumulative time of "dumps" calls went from 44.484 seconds to 0.334, but I am not sure about bringing such an approach to the table - the encoder itself is at the proof-of-concept stage and not yet safe or well tested. Without cProfile tracing, the total running time went from ~16.8 seconds to 6.3 seconds. I also don't know whether this can be solved some other way, without bringing in a compiled language to fix a performance issue - maybe it is not necessary to call this encoding so often in the first place ... What do you think?
Indeed! I removed most of
Yep, you are right! Now I pass the resolver around - I will keep it at the internal-functions level.
Very nice work! And thanks again for taking on such a significant new feature 😍

The current design is biased very strongly towards obvious code rather than performant code, so it is very likely that we could simply call

If it still seems useful to you after we're confident that the Python code is performing reasonably well, I would also be happy to take an optional dependency on some native extensions for speed. After all, I have this handy set of tools to check that they're equivalent 😉
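(For illustration, an equivalence check of that kind could be written as a property-based test like the sketch below. The comparison against plain `json.dumps(..., sort_keys=True)` is an assumption - the real canonical encoder may normalise values further - and `canonicalising_json` is the proof-of-concept encoder mentioned above.)

```python
import json

from hypothesis import given
from hypothesis_jsonschema import from_schema

from canonicalising_json import dumps  # the PoC Rust encoder mentioned above


# The empty schema accepts any JSON document, so this exercises both encoders
# on arbitrary generated values: their outputs must parse back to equal data.
@given(value=from_schema({}))
def test_encoders_agree(value):
    assert json.loads(dumps(value)) == json.loads(json.dumps(value, sort_keys=True))
```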
Thanks!
I tried this approach, and it helped quite a lot - the running time for that example dropped from ~16.8 seconds to 10.2. However, I am wondering whether I got your idea correctly. What do you mean by object ID? I was thinking about the following approaches here:
Here is a rough implementation of the 3rd option:

```python
import functools
import json


def make_key(value):
    if value is None or isinstance(value, (bool, int, float, str)):
        return value
    if isinstance(value, list):
        return list, tuple(map(make_key, value))
    if isinstance(value, dict):
        return dict, tuple((k, make_key(v)) for k, v in value.items())


def hashable_json(func):
    class HashableDict(dict):
        def __hash__(self):
            return hash(make_key(self))

    class HashableList(list):
        def __hash__(self):
            return hash(make_key(self))

    @functools.wraps(func)
    def wrapped(value):
        if isinstance(value, dict):
            value = HashableDict(value)
        elif isinstance(value, list):
            value = HashableList(value)
        return func(value)

    return wrapped


@hashable_json
@functools.lru_cache()
def encode_canonical_json(value: JSONType) -> str:
    """Canonical form serialiser, for uniqueness testing."""
    return json.dumps(value, sort_keys=True, cls=CanonicalisingJsonEncoder)
```

Also, the current test-suite running time consistently improved by ~10% with this. Generally, it seems that recursive schemas are affected more by it. Does this approach make sense? If so, would it be better if I submit this cache in another PR, to reduce the change surface here?
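(For comparison, the "object ID" idea asked about above could look roughly like the sketch below. This is illustrative only - the helper name and the plain `json.dumps` call are assumptions, not code from this PR - and it only pays off when the very same object is encoded repeatedly.)

```python
import json

# Illustrative sketch: cache keyed on id() instead of a deep hash.  Since an
# id can be reused once the original object is garbage-collected, the cached
# entry keeps a reference to the object and verifies identity on lookup.
_canonical_cache = {}


def encode_canonical_json_by_id(value):
    key = id(value)
    hit = _canonical_cache.get(key)
    if hit is not None and hit[0] is value:
        return hit[1]
    encoded = json.dumps(value, sort_keys=True)
    _canonical_cache[key] = (value, encoded)
    return encoded
```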
Cool! I will keep experimenting :)
Haha, I was in fact thinking about

Your hash trick is pretty neat! I suspect it's still making shallow copies though, like the
Force-pushed from b9c9919 to c2d2acd
I tried an alternative approach by not calling

After updating the

And the data generation is much faster. The schema from my example above runs in 240.3 ms ± 1.2 ms vs 5.494 s ± 0.098 s on the current commit, which is ~23x faster (and the generated data is the same).

Most of the problems I experienced before with

Also, I found the problem that caused major slowdowns - the Open API schema with resolved references takes ~2.8 MB of space as JSON before canonicalisation, and then

I will clean up these changes as a separate commit and push it tomorrow - I had to pass the resolver almost everywhere :)
The really big one is that we need to inline references in order to canonicalise them - often we have "overlapping" subschemas, and if we can't canonicalise + merge them, the constraint can only be expressed to Hypothesis as a filter. So we need to be somewhat slower in most cases - though amortised, so the more examples you run the less it matters - in order to work at all for harder schemas.

This average-case/worst-case tradeoff comes up a lot in Hypothesis - and if you allow any pathological cases at all, they will show up at the least convenient times. We just have to live with that for recursive references, but we can fully inline non-recursive references.
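(A rough illustration of that point in plain Hypothesis strategies - not hypothesis-jsonschema's internals, and the specific schemas are made up for the example:)

```python
import jsonschema
from hypothesis import strategies as st

# Overlapping subschemas that canonicalise + merge cleanly map onto a precise
# strategy: e.g. {"type": "integer", "minimum": 0} merged with {"maximum": 10}.
merged = st.integers(min_value=0, max_value=10)

# A constraint we cannot merge in can only be applied as a filter, which is
# correct but may reject much of what we generate.
leftover = {"not": {"multipleOf": 3}}
filtered = st.integers(min_value=0, max_value=10).filter(
    lambda n: jsonschema.Draft7Validator(leftover).is_valid(n)
)
```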
And about recursive schemas: I think we're currently inlining too much of them. Consider the following:

```python
schema = {
    "type": "object",
    "$ref": "#/definitions/a",
    "definitions": {
        "a": {"properties": {"foo": {"$ref": "#/definitions/x"}}},
        "x": {"type": "array", "items": {"$ref": "#/definitions/y"}},
        "y": {"type": "array", "items": {"$ref": "#/definitions/x"}},
    },
}
```

We currently continue inlining possibly-recursive references until we get to a key which we've seen before. This tends to explode in size, because we've already inlined a copy of the subschema before we detect that it's recursive.

```python
current = {
    "type": "object",
    "properties": {
        "foo": {
            "type": "array",
            "items": {"type": "array", "items": {"$ref": "#/definitions/x"}},
        }
    },
    "definitions": {
        "x": {"type": "array", "items": {"$ref": "#/definitions/y"}},
        "y": {"type": "array", "items": {"$ref": "#/definitions/x"}},
    },
}
```

The ideal resolution would look like this: we inline up to the first reference which could recurse.

```python
wanted = {
    "type": "object",
    "properties": {"foo": {"$ref": "#/definitions/x"}},
    "definitions": {
        "x": {"type": "array", "items": {"$ref": "#/definitions/y"}},
        # in a future PR, we could inline `y` into `x`; not now though
        "y": {"type": "array", "items": {"$ref": "#/definitions/x"}},
    },
}
```

Actually doing this will probably be pretty fiddly (sorry!) but lead to much better performance.
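(For intuition about why the recursive part has to stay behind a reference: strategies for `x` and `y` can only be defined lazily. A sketch in plain Hypothesis - not this library's actual code, and the `max_size` bound is only there to keep the example small:)

```python
from hypothesis import strategies as st

# Mutually recursive definitions like `x` and `y` above cannot be fully
# inlined, but they can be generated from lazily: st.deferred delays building
# the strategy until draw time, so each side may refer to the other.
x_strategy = st.deferred(lambda: st.lists(y_strategy, max_size=2))
y_strategy = st.deferred(lambda: st.lists(x_strategy, max_size=2))

# e.g. x_strategy.example() -> nested lists such as [[], [[]]]
```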
Aha, that is an important insight I was missing! Thanks :)
Understood :) I pushed some more changes:
I also created #66, which may resolve some of the issues I faced in this branch :)
Force-pushed from 9aace2c to 7dfd92c
Force-pushed from 7dfd92c to c8f78f9
Closing this one in favor of #69, which seems to be much closer to a fully working implementation :)
Hello Zac!
I want to share my current progress on recursive-schema support, and I would appreciate it if you could advise whether I am moving in the right direction! :)
NOTE: The code has some hacks like `type: ignore` and generally is not cleaned up.

At this moment I have a few cases working (and there is a minimal test for it):
- `Node.children`, where `children` is a list of `Node` instances

During the implementation process I encountered a `ValueError: Circular reference detected` error inside the `encode_canonical_json` function, but adding a `deepcopy` call to the internals of `object_schema` helped. However, when I tried to generate Open API schemas (the spec), the circular reference issue happened again and `deepcopy` didn't solve the problem - I am not sure if it is the right approach, or maybe I missed it somewhere. And probably it is not needed in all cases; I am not 100% sure. What do you think?

Also, `from_schema` does not accept a resolver. The current one in `hypothesis-jsonschema` does not resolve remote references, but that is a crucial feature - big schemas are often decomposed into multiple local files, and such an ability would be very helpful in those cases. What do you think about extending `from_schema` so that it accepts a resolver?

Another point: I am not sure whether my current approach to detecting recursive references inside `resolve_all_refs` is sufficient. There is a corner case with `#`, which is hardcoded, and then it checks whether the reference being checked is in the same resolution scope; but as far as I understand, it needs to check whether the reference points to any ancestor of the current scope, plus track cycles. I am not sure what the best place to track cycles would be, though, since (as far as I can see) it would have to be passed along the recursive calls - maybe some extra list/set of seen references (roughly like the sketch below)?

I would be very grateful for any feedback!
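(For illustration, the seen-references idea could look roughly like the sketch below. The signature, the use of jsonschema's `RefResolver.resolving`, and the choice to key the set on raw reference strings are all assumptions, not the code in this PR.)

```python
from copy import deepcopy


def resolve_all_refs_sketch(schema, resolver, seen=frozenset()):
    """Illustrative only: inline non-recursive refs, keep recursive ones as "$ref"."""
    if isinstance(schema, dict) and "$ref" in schema:
        ref = schema["$ref"]
        if ref in seen:
            # The reference points back to something we are already inside:
            # it is recursive, so leave the "$ref" in place instead of
            # inlining it forever.  (Real code would track the fully-resolved
            # scope, not just the raw reference string.)
            return schema
        # resolver.resolving() pushes/pops the resolution scope around the body.
        with resolver.resolving(ref) as resolved:
            return resolve_all_refs_sketch(deepcopy(resolved), resolver, seen | {ref})
    if isinstance(schema, dict):
        return {k: resolve_all_refs_sketch(v, resolver, seen) for k, v in schema.items()}
    if isinstance(schema, list):
        return [resolve_all_refs_sketch(v, resolver, seen) for v in schema]
    return schema
```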
Thank you