
PoC recursive schemas #61

Closed

Conversation

Stranger6667 (Contributor)

Hello Zac!

I want to share my current progress with recursive schema support and would appreciate it if you could advise whether I am moving in the right direction! :)

NOTE: The code has some hacks, like type: ignore comments, and generally is not cleaned up yet.

At the moment I have a few cases working (there is a minimal test for them, and toy examples are sketched below):

  • A schema that contains a reference to its root
  • A schema that contains a parent-child style of relationship (Node.children, where children is a list of Node instances)
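
For illustration, toy versions of those two cases look roughly like this (these are not the exact test schemas):

# A schema whose property refers back to the schema root
root_ref = {"type": "object", "properties": {"next": {"$ref": "#"}}}

# A parent-child relationship: Node.children is a list of Node instances
node = {
    "definitions": {
        "Node": {
            "type": "object",
            "properties": {
                "children": {"type": "array", "items": {"$ref": "#/definitions/Node"}}
            },
        }
    },
    "$ref": "#/definitions/Node",
}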

During the implementation process, I encountered a ValueError: Circular reference detected inside the encode_canonical_json function, but adding a deepcopy call to the internals of object_schema helped.

However, when I tried to generate Open API schemas (the spec itself), the circular reference issue appeared again and deepcopy didn't solve the problem - I am not sure whether it is the right approach or whether I just missed a spot. It is probably not needed in all cases either, but I am not 100% sure. What do you think?

Also, from_schema does not accept a resolver. The current one in hypothesis-jsonschema does not resolve remote references, but that is a crucial feature - big schemas are often decomposed into multiple local files, and the ability to resolve across them would be very helpful in such cases. What do you think about extending from_schema so that it accepts a resolver?

Another point: I am not sure whether my current approach to detecting recursive references inside resolve_all_refs is sufficient. There is a hardcoded corner case for the bare # reference, and then a check for whether the reference being examined lives in the same resolution scope - but as far as I understand, it really needs to check whether the reference points to any ancestor of the current scope, and to track cycles. I am not sure where cycle tracking should live, though, since (as far as I can see) it has to be passed along the recursive calls - maybe an extra list/set of seen references?
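
Roughly what I have in mind is something like the sketch below - the function and argument names here are illustrative, not the actual resolve_all_refs internals:

from copy import deepcopy

def _resolve(schema, resolver, seen=frozenset()):
    # `seen` holds the references currently being expanded on this path;
    # meeting one of them again means we have gone around a cycle.
    # (A real version would also have to account for the resolution scope,
    # e.g. by keying on resolver.resolution_scope together with the ref.)
    if isinstance(schema, dict):
        ref = schema.get("$ref")
        if ref is not None:
            if ref in seen:
                return schema  # recursive reference - leave it in place
            with resolver.resolving(ref) as resolved:
                return _resolve(deepcopy(resolved), resolver, seen | {ref})
        return {k: _resolve(v, resolver, seen) for k, v in schema.items()}
    if isinstance(schema, list):
        return [_resolve(v, resolver, seen) for v in schema]
    return schema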

I would be very grateful for any feedback!

Thank you

Zac-HD (Member) left a comment

👋 hey, thanks for the update @Stranger6667!

This looks pretty good to me - I think the remaining work is going to look like carefully tidying up around edge cases with a side of working out how the dataflow is meant to work.

We also seem to have some serious slowdowns in our tests, though part of that might just be running on slow schemas that didn't work at all before - not a problem for now, but I'll want to make CI fast again before merging.

Deepcopying should only be needed where we're resolving references too - I'm pretty sure dropping deepcopy(resolved) into _recurse() would help, and it might be needed in resolve_all_refs() too.

(Review comment on src/hypothesis_jsonschema/_from_schema.py - outdated, resolved)
Zac-HD (Member) commented Aug 9, 2020

deepcopy didn't solve the problem - I am not sure whether it is the right approach or whether I just missed a spot. It is probably not needed in all cases either, but I am not 100% sure. What do you think?

I think careful use of deepcopy when resolving references ought to do the trick, though you'll have to try it to find out!

I also don't mind unconditionally making copies, so long as each place we do so is required for at least some schemas. Clean code over fast code, at least until there's a benchmark to justify any extra complexity!

Also, from_schema does not accept a resolver. The current one in hypothesis-jsonschema does not resolve remote references, but that is a crucial feature - big schemas are often decomposed into multiple local files, and the ability to resolve across them would be very helpful in such cases. What do you think about extending from_schema so that it accepts a resolver?

I would prefer to keep this separate, e.g. by making a public resolve_references(schema, *, allow_remote) function - there's nothing wrong with IO, I just don't want any implicit IO in functions as complicated as the ones we have here, because it makes them a pain to test.
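
Roughly this shape - a stub of the proposed public API only, nothing implemented yet:

def resolve_references(schema: dict, *, allow_remote: bool = False) -> dict:
    """Return a copy of `schema` with its references inlined.

    Local references would always be resolved; remote (http/file) references
    would only be fetched when allow_remote=True, so any IO stays explicit
    and opt-in instead of being hidden inside from_schema().
    """
    raise NotImplementedError  # sketch of the API shape only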

Another point: I am not sure whether my current approach to detecting recursive references inside resolve_all_refs is sufficient. There is a hardcoded corner case for the bare # reference, and then a check for whether the reference being examined lives in the same resolution scope - but as far as I understand, it really needs to check whether the reference points to any ancestor of the current scope, and to track cycles. I am not sure where cycle tracking should live, though, since (as far as I can see) it has to be passed along the recursive calls - maybe an extra list/set of seen references?

It sounds like you're on the right track, though I don't have any advice here.

Stranger6667 (Contributor, Author)

Hi Zac!

Thank you for your review! I've been away for the last two weeks, but now I want to continue working on this issue :)

We also seem to have some serious slowdowns in our tests, though part of that might just be running on slow schemas that didn't work at all before - not a problem for now, but I'll want to make CI fast again before merging.

I did a bit of profiling with the following recursive schema:

SCHEMA = {
    "definitions": {
        "Node": {
            "type": "object",
            "properties": {
                "children": {
                    "type": "array",
                    "items": {"$ref": "#/definitions/Node"},
                    "maxItems": 2,
                }
            },
            "required": ["children"],
            "additionalProperties": False,
        },
    },
    "$ref": "#/definitions/Node",
}

And the following test:

import cProfile

from hypothesis import given, settings, HealthCheck, seed

from hypothesis_jsonschema import from_schema

@given(value=from_schema(SCHEMA))
@settings(
    max_examples=20,
    suppress_health_check=[
        HealthCheck.too_slow,
        HealthCheck.filter_too_much,
        HealthCheck.data_too_large,
    ],
    derandomize=True,
)
@seed(1)
def test(value):
    pass


with cProfile.Profile() as pr:
    test()

pr.print_stats(sort=1)

And here is the output:

⇒  python run.py
         482137147 function calls (107024655 primitive calls) in 57.344 seconds

   Ordered by: internal time

            ncalls   tottime  percall  cumtime  percall    filename:lineno(function)
382364705/14962422    37.758    0.000   40.692    0.000    encoder.py:333(_iterencode_dict)
     6359117/56015     3.304    0.000    7.213    0.000    copy.py:128(deepcopy)
         97952/658     1.865    0.000   49.343    0.075    _canonicalise.py:248(canonicalish)
          14962422     1.735    0.000   42.458    0.000    encoder.py:413(_iterencode)
     1233854/56015     1.492    0.000    7.130    0.000    copy.py:226(_deepcopy_dict)
          22945640     1.400    0.000    1.409    0.000    {built-in method builtins.isinstance}
             96981     1.203    0.000   44.245    0.000    encoder.py:182(encode)
...
             96981     0.201    0.000   44.484    0.000    __init__.py:183(dumps)

So, JSON encoding takes most of the time. These results are reproducible with other seeds and schemas. I am not sure whether this is specific to my current implementation or whether it is inherent to generating data for recursive schemas here 🤷‍♂️

I tried to improve the performance of JSON encoding and wrote a small Rust encoder with Python bindings - it produces logically the same output, but more compact (and it is not heavily optimized at the moment):

from canonicalising_json import dumps

def encode_canonical_json(value: JSONType) -> str:
    """Canonical form serialiser, for uniqueness testing."""
    return dumps(value)

And the results are:

⇒  python run.py
         56660931 function calls (48977594 primitive calls) in 12.913 seconds

   Ordered by: internal time

         ncalls  tottime  percall  cumtime  percall   filename:lineno(function)
  6359117/56015    3.321    0.000    7.159    0.000   copy.py:128(deepcopy)
      97952/658    1.734    0.000    4.997    0.008   _canonicalise.py:248(canonicalish)
  1233854/56015    1.474    0.000    7.076    0.000   copy.py:226(_deepcopy_dict)
       17002148    0.973    0.000    0.973    0.000   {method 'get' of 'dict' objects}
          96981    0.753    0.000    0.753    0.000   decoder.py:343(raw_decode)
        9507798    0.548    0.000    0.548    0.000   {built-in method builtins.id}
        1546168    0.446    0.000    0.631    0.000   copy.py:242(_keep_alive)
        5955824    0.371    0.000    0.379    0.000   {built-in method builtins.isinstance}
      57331/658    0.339    0.000    7.683    0.012   _canonicalise.py:605(resolve_all_refs)
          96981    0.334    0.000    0.334    0.000   {dumps}

The cumulative time of the "dumps" calls went from 44.484 seconds to 0.334, but I am not sure about bringing such an approach to the table - the encoder itself is at the proof-of-concept stage and is not yet safe or well tested. Without cProfile tracing, the total running time went from ~16.8 seconds to 6.3 seconds. And I don't know whether this can be solved some other way than bringing in a compiled language - maybe it is simply not necessary to call this encoding so often ... What do you think?

Deepcopying should only be needed where we're resolving references too - I'm pretty sure dropping deepcopy(resolved) into _recurse() would help, and it might be needed in resolve_all_refs() too.

Indeed! I removed most of the deepcopy calls and placed a couple more in the resolve_all_refs function :) A couple more might be needed before the recursive resolve_all_refs call - I will check that!

I would prefer to keep this separate, e.g. by making a public resolve_references(schema, *, allow_remote) function - there's nothing wrong with IO, I just don't want any implicit in such complicated functions as we have here because it makes them a pain to test.

Yep, you are right! For now I pass the resolver around - I will keep it at the internal-functions level.

Zac-HD (Member) commented Aug 24, 2020

Very nice work! And thanks again for taking on such a significant new feature 😍

The current design is biased very strongly towards obvious code rather than performant code, so it is very likely that we could simply call encode_canonical_json much less often if we're willing to track the data more carefully. Adding an LRU cache keyed off object ID would probably also help, as we're serialising the same objects very many times.

If it still seems useful to you after we're confident that the Python code is performing reasonably well, I would also be happy to take an optional dependency on some native extensions for speed. After all, I have this handy set of tools to check that they're equivalent 😉

Stranger6667 (Contributor, Author)

Thanks!

Adding an LRU cache keyed off object ID would probably also help, as we're serialising the same objects very many times.

I tried this approach, and it helped quite a lot - the running time for that example dropped from ~16.8 seconds to 10.2. However, I am wondering if I got your idea correctly. What do you mean by object ID?

I was thinking about the following approaches here:

  • Checking the actual id() of input objects - but unfortunately, there are many situations where objects are mutated (e.g. $ref is removed), so the canonicalisation result for the same object in memory differs from one point in time to another. Otherwise, I suspect it would be the fastest way
  • Calculating a hash of the input via a fast non-cryptographic hash (something from hashlib, for example)
  • Converting the input into a hashable Python object, like the internals of lru_cache do (this approach brought the runtime down to 10.2 seconds)

Here is a rough implementation of the 3rd option:

import functools
import json

# JSONType and CanonicalisingJsonEncoder are the existing definitions
# from _canonicalise.py.

def make_key(value):
    # JSON values can only be null, booleans, numbers, strings, arrays,
    # or objects, so these branches cover everything.
    if value is None or isinstance(value, (bool, int, float, str)):
        return value
    if isinstance(value, list):
        return list, tuple(map(make_key, value))
    if isinstance(value, dict):
        return dict, tuple((k, make_key(v)) for k, v in value.items())


def hashable_json(func):

    class HashableDict(dict):

        def __hash__(self):
            return hash(make_key(self))

    class HashableList(list):

        def __hash__(self):
            return hash(make_key(self))

    @functools.wraps(func)
    def wrapped(value):
        if isinstance(value, dict):
            value = HashableDict(value)
        elif isinstance(value, list):
            value = HashableList(value)
        return func(value)

    return wrapped


@hashable_json
@functools.lru_cache()
def encode_canonical_json(value: JSONType) -> str:
    """Canonical form serialiser, for uniqueness testing."""
    return json.dumps(value, sort_keys=True, cls=CanonicalisingJsonEncoder)

Also, the current test suite's running time consistently improves by ~10% with this change. Generally, it seems like recursive schemas benefit the most. Does this approach make sense? If so, would it be better if I submitted the cache in a separate PR to reduce the change surface here?

If it still seems useful to you after we're confident that the Python code is performing reasonably well, I would also be happy to take an optional dependency on some native extensions for speed. After all, I have this handy set of tools to check that they're equivalent

Cool! I will keep experimenting :)

Zac-HD (Member) commented Aug 25, 2020

Haha, I was in fact thinking about id and as you say that's not going to work... that's why I don't merge anything late at night 😅

Your hash trick is pretty neat! I suspect it's still making shallow copies though, like the list and dict constructors - so a larger speedup might be on the table if we can use a reference instead via some kind of proxy object. If you get this working well and can measure a substantial performance improvement with recursive schemas, I'd be happy to accept that as a separate PR and you can rebase on it after that.

Stranger6667 (Contributor, Author) commented Aug 26, 2020

I tried an alternative approach: not calling resolve_all_refs at all - all references are basically left in the schema without inlining (as far as I understand). I also updated the xfail_on_reference_resolve_error decorator to exclude the check for recursive references.

And the data generation is much faster. The schema from my example above runs in 240.3 ms ± 1.2 ms vs 5.494 s ± 0.098 s on the current commit, which is ~23x faster (and the generated data is the same). Most of the problems I experienced before with resolve_all_refs are not there anymore ... It feels like it might be much easier to fix all the corner cases with this approach.
I am wondering if there could be downsides to such an approach? Harder debugging maybe?

Also, I found the problem that caused the major slowdowns - the Open API schema with resolved references takes ~2.8 MB as JSON before canonicalisation, and then canonicalise just gets stuck - I waited for a few minutes but it didn't finish; the JSON encoder was called a huge number of times there.

I will clean up these changes as a separate commit and push it tomorrow - I had to pass the resolver almost everywhere :)

Zac-HD (Member) commented Aug 27, 2020

I am wondering if there could be downsides to such an approach? Harder debugging maybe?

The really big one is that we need to inline references in order to canonicalise them - often we have "overlapping" subschemas, and if we can't canonicalise + merge them the constraint can only be expressed to Hypothesis as a filter.
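
As a toy illustration (not a schema from our test suite):

overlapping = {
    "allOf": [
        {"type": "integer", "minimum": 0},
        {"$ref": "#/definitions/small"},  # resolves to {"type": "integer", "maximum": 10}
    ],
    "definitions": {"small": {"type": "integer", "maximum": 10}},
}
# With the reference inlined, the two subschemas canonicalise and merge into
# {"type": "integer", "minimum": 0, "maximum": 10}, so we can generate
# in-range integers directly (roughly st.integers(0, 10)).
# Left behind a $ref, the second constraint can only be applied as a
# .filter(...) over values drawn from the first subschema, which is far
# less efficient.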

So we need to be somewhat slower in most cases - though amortised, so the more examples you run the less it matters - in order to work at all for harder schemas. This average-case/worst-case tradeoff comes up a lot in Hypothesis - and if you allow any pathological cases at all they will show up at the least convenient times.

We just have to live with that for recursive references, but we can fully inline non-recursive references.

Zac-HD (Member) commented Aug 27, 2020

And about recursive schemas: I think we're currently inlining too much of them. Consider the following:

schema = {
    "type": "object",
    "$ref": "#/definitions/a",
    "definitions": {
        "a": {"properties": {"foo": {"$ref": "#/definitions/x"}}},
        "x": {"type": "array", "items": {"$ref": "#/definitions/y"}},
        "y": {"type": "array", "items": {"$ref": "#/definitions/x"}},
    },
}

We currently continue inlining possibly-recursive references until we get to a key which we've seen before. This tends to explode in size, because we've already inlined a copy of the subschema before we detect that it's recursive.

current = {
    "type": "object",
    "properties": {
        "foo": {
            "type": "array",
            "items": {"type": "array", "items": {"$ref": "#/definitions/x"}},
        }
    },
    "definitions": {
        "x": {"type": "array", "items": {"$ref": "#/definitions/y"}},
        "y": {"type": "array", "items": {"$ref": "#/definitions/x"}},
    },
}

The ideal resolution would look like this: we inline up to the first reference which could recurse.

wanted = {
    "type": "object",
    "properties": {"foo": {"$ref": "#/definitions/x"}},
    "definitions": {
        "x": {"type": "array", "items": {"$ref": "#/definitions/y"}},
        # in a future PR, we could inline `y` into `x`; not now though
        "y": {"type": "array", "items": {"$ref": "#/definitions/x"}},
    },
}

Actually doing this will probably be pretty fiddly (sorry!) but lead to much better performance.

Stranger6667 (Contributor, Author)

The really big one is that we need to inline references in order to canonicalise them - often we have "overlapping" subschemas, and if we can't canonicalise + merge them the constraint can only be expressed to Hypothesis as a filter.

Aha, that is an important insight that I was missing! Thanks :)

We just have to live with that for recursive references, but we can fully inline non-recursive references.

Understood :) I pushed some more changes:

  • I try to detect recursive references by tracking them in a tree-like structure (see the sketch after this list). Each scopes stack is represented as a key in that tree, and each key has a set of seen references - when the same place is traversed a second time it means there is a cycle, since we somehow ended up in the same place with the same reference. I am not 100% sure it works in all cases, though
  • I pass the resolver to almost all functions that use JSON Schema validation, because in some cases those validators need proper reference resolution
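
In simplified form, the tracking looks roughly like this (a sketch only, with the tree flattened into a dict keyed by the scopes stack; the code in the branch differs):

from collections import defaultdict

# For each scopes stack (as a tuple) remember which references we have
# already resolved there; hitting the same reference for the same stack a
# second time means we went around a cycle.
seen_by_scope = defaultdict(set)

def is_recursive(scopes_stack, reference):
    key = tuple(scopes_stack)
    if reference in seen_by_scope[key]:
        return True  # second visit from the same place: a cycle
    seen_by_scope[key].add(reference)
    return False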

I also created #66, which may resolve some of the issues I faced in this branch :)

Stranger6667 force-pushed the dd/recursive-poc branch 2 times, most recently from 9aace2c to 7dfd92c on September 10, 2020 at 07:53
Stranger6667 (Contributor, Author)

Closing this one in favor of #69, which seems to be much closer to a fully working implementation :)
