
PoC recursive schemas #61

Closed

Conversation

Stranger6667 (Contributor)

Hello Zac!

I want to share my current progress with recursive schema support and would appreciate it if you could advise whether I am moving in the right direction! :)

NOTE: The code has some hacks, like type: ignore comments, and generally is not cleaned up yet.

At the moment I have a few cases working (there is a minimal test for them, and toy examples are sketched below):

  • A schema that contains a reference to its root
  • A schema that contains a parent-child style of relationship (Node.children, where children is a list of Node instances)
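
For illustration, toy versions of those two cases look roughly like this (these are not the exact test schemas):

# A schema whose property refers back to the schema root
root_ref = {"type": "object", "properties": {"next": {"$ref": "#"}}}

# A parent-child relationship: Node.children is a list of Node instances
node = {
    "definitions": {
        "Node": {
            "type": "object",
            "properties": {
                "children": {"type": "array", "items": {"$ref": "#/definitions/Node"}}
            },
        }
    },
    "$ref": "#/definitions/Node",
}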

During the implementation process, I encountered a ValueError: Circular reference detected inside the encode_canonical_json function, but adding a deepcopy call to the internals of object_schema helped.

However, when I tried to generate Open API schemas (the spec itself), the circular reference issue appeared again and deepcopy didn't solve the problem - I am not sure whether it is the right approach or whether I just missed a spot. It is probably not needed in all cases either, but I am not 100% sure. What do you think?

Also, from_schema does not accept a resolver. The current one in hypothesis-jsonschema does not resolve remote references, but that is a crucial feature - big schemas are often decomposed into multiple local files, and the ability to resolve across them would be very helpful in such cases. What do you think about extending from_schema so that it accepts a resolver?

Another point: I am not sure whether my current approach to detecting recursive references inside resolve_all_refs is sufficient. There is a hardcoded corner case for the bare # reference, and then a check for whether the reference being examined lives in the same resolution scope - but as far as I understand, it really needs to check whether the reference points to any ancestor of the current scope, and to track cycles. I am not sure where cycle tracking should live, though, since (as far as I can see) it has to be passed along the recursive calls - maybe an extra list/set of seen references?
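
Roughly what I have in mind is something like the sketch below - the function and argument names here are illustrative, not the actual resolve_all_refs internals:

from copy import deepcopy

def _resolve(schema, resolver, seen=frozenset()):
    # `seen` holds the references currently being expanded on this path;
    # meeting one of them again means we have gone around a cycle.
    # (A real version would also have to account for the resolution scope,
    # e.g. by keying on resolver.resolution_scope together with the ref.)
    if isinstance(schema, dict):
        ref = schema.get("$ref")
        if ref is not None:
            if ref in seen:
                return schema  # recursive reference - leave it in place
            with resolver.resolving(ref) as resolved:
                return _resolve(deepcopy(resolved), resolver, seen | {ref})
        return {k: _resolve(v, resolver, seen) for k, v in schema.items()}
    if isinstance(schema, list):
        return [_resolve(v, resolver, seen) for v in schema]
    return schema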

I would be very grateful for any feedback!

Thank you

Zac-HD (Member) left a comment

👋 hey, thanks for the update @Stranger6667!

This looks pretty good to me - I think the remaining work is going to look like carefully tidying up around edge cases with a side of working out how the dataflow is meant to work.

We also seem to have some serious slowdowns in our tests, though part of that might just be running on slow schemas that didn't work at all before - not a problem for now, but I'll want to make CI fast again before merging.

Deepcopying should only be needed where we're resolving references too - I'm pretty sure dropping deepcopy(resolved) into _recurse() would help, and it might be needed in resolve_all_refs() too.

(Review comment on src/hypothesis_jsonschema/_from_schema.py - outdated, resolved)
Zac-HD (Member) commented Aug 9, 2020

deepcopy didn't solve the problem - I am not sure whether it is the right approach or whether I just missed a spot. It is probably not needed in all cases either, but I am not 100% sure. What do you think?

I think careful use of deepcopy when resolving references ought to do the trick, though you'll have to try it to find out!

I also don't mind unconditionally making copies, so long as each place we do so is required for at least some schemas. Clean code over fast code, at least until there's a benchmark to justify any extra complexity!

Also, from_schema does not accept a resolver. The current one in hypothesis-jsonschema does not resolve remote references, but that is a crucial feature - big schemas are often decomposed into multiple local files, and the ability to resolve across them would be very helpful in such cases. What do you think about extending from_schema so that it accepts a resolver?

I would prefer to keep this separate, e.g. by making a public resolve_references(schema, *, allow_remote) function - there's nothing wrong with IO, I just don't want any implicit IO in functions as complicated as the ones we have here, because it makes them a pain to test.
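
Roughly this shape - a stub of the proposed public API only, nothing implemented yet:

def resolve_references(schema: dict, *, allow_remote: bool = False) -> dict:
    """Return a copy of `schema` with its references inlined.

    Local references would always be resolved; remote (http/file) references
    would only be fetched when allow_remote=True, so any IO stays explicit
    and opt-in instead of being hidden inside from_schema().
    """
    raise NotImplementedError  # sketch of the API shape only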

Another point: I am not sure whether my current approach to detecting recursive references inside resolve_all_refs is sufficient. There is a hardcoded corner case for the bare # reference, and then a check for whether the reference being examined lives in the same resolution scope - but as far as I understand, it really needs to check whether the reference points to any ancestor of the current scope, and to track cycles. I am not sure where cycle tracking should live, though, since (as far as I can see) it has to be passed along the recursive calls - maybe an extra list/set of seen references?

It sounds like you're on the right track, though I don't have any advice here.

Stranger6667 (Contributor, Author)

Hi Zac!

Thank you for your review! I've been away for the last two weeks, but now I want to continue working on this issue :)

We also seem to have some serious slowdowns in our tests, though part of that might just be running on slow schemas that didn't work at all before - not a problem for now, but I'll want to make CI fast again before merging.

I did a bit of profiling with the following recursive schema:

SCHEMA = {
    "definitions": {
        "Node": {
            "type": "object",
            "properties": {
                "children": {
                    "type": "array",
                    "items": {"$ref": "#/definitions/Node"},
                    "maxItems": 2,
                }
            },
            "required": ["children"],
            "additionalProperties": False,
        },
    },
    "$ref": "#/definitions/Node",
}

And the following test:

import cProfile

from hypothesis import given, settings, HealthCheck, seed

from hypothesis_jsonschema import from_schema

@given(value=from_schema(SCHEMA))
@settings(
    max_examples=20,
    suppress_health_check=[
        HealthCheck.too_slow,
        HealthCheck.filter_too_much,
        HealthCheck.data_too_large,
    ],
    derandomize=True,
)
@seed(1)
def test(value):
    pass


with cProfile.Profile() as pr:
    test()

pr.print_stats(sort=1)

And here is the output:

⇒  python run.py
         482137147 function calls (107024655 primitive calls) in 57.344 seconds

   Ordered by: internal time

            ncalls   tottime  percall  cumtime  percall    filename:lineno(function)
382364705/14962422    37.758    0.000   40.692    0.000    encoder.py:333(_iterencode_dict)
     6359117/56015     3.304    0.000    7.213    0.000    copy.py:128(deepcopy)
         97952/658     1.865    0.000   49.343    0.075    _canonicalise.py:248(canonicalish)
          14962422     1.735    0.000   42.458    0.000    encoder.py:413(_iterencode)
     1233854/56015     1.492    0.000    7.130    0.000    copy.py:226(_deepcopy_dict)
          22945640     1.400    0.000    1.409    0.000    {built-in method builtins.isinstance}
             96981     1.203    0.000   44.245    0.000    encoder.py:182(encode)
...
             96981     0.201    0.000   44.484    0.000    __init__.py:183(dumps)

So, JSON encoding takes most of the time. These results are reproducible with other seeds and schemas. I am not sure whether this is specific to my current implementation or whether it is inherent to generating data for recursive schemas here 🤷‍♂️

I tried to improve the performance of JSON encoding and wrote a small Rust encoder with Python bindings - it produces logically the same output, but more compact (and it is not heavily optimized at the moment):

from canonicalising_json import dumps

def encode_canonical_json(value: JSONType) -> str:
    """Canonical form serialiser, for uniqueness testing."""
    return dumps(value)

And the results are:

⇒  python run.py
         56660931 function calls (48977594 primitive calls) in 12.913 seconds

   Ordered by: internal time

         ncalls  tottime  percall  cumtime  percall   filename:lineno(function)
  6359117/56015    3.321    0.000    7.159    0.000   copy.py:128(deepcopy)
      97952/658    1.734    0.000    4.997    0.008   _canonicalise.py:248(canonicalish)
  1233854/56015    1.474    0.000    7.076    0.000   copy.py:226(_deepcopy_dict)
       17002148    0.973    0.000    0.973    0.000   {method 'get' of 'dict' objects}
          96981    0.753    0.000    0.753    0.000   decoder.py:343(raw_decode)
        9507798    0.548    0.000    0.548    0.000   {built-in method builtins.id}
        1546168    0.446    0.000    0.631    0.000   copy.py:242(_keep_alive)
        5955824    0.371    0.000    0.379    0.000   {built-in method builtins.isinstance}
      57331/658    0.339    0.000    7.683    0.012   _canonicalise.py:605(resolve_all_refs)
          96981    0.334    0.000    0.334    0.000   {dumps}

The cumulative time of the "dumps" calls went from 44.484 seconds to 0.334, but I am not sure about bringing such an approach to the table - the encoder itself is at the proof-of-concept stage and is not yet safe or well tested. Without cProfile tracing, the total running time went from ~16.8 seconds to 6.3 seconds. And I don't know whether this can be solved some other way than bringing in a compiled language - maybe it is simply not necessary to call this encoding so often ... What do you think?

Deepcopying should only be needed where we're resolving references too - I'm pretty sure dropping deepcopy(resolved) into _recurse() would help, and it might be needed in resolve_all_refs() too.

Indeed! I removed most of the deepcopy calls and placed a couple more in the resolve_all_refs function :) A couple more might be needed before the recursive resolve_all_refs call - I will check that!

I would prefer to keep this separate, e.g. by making a public resolve_references(schema, *, allow_remote) function - there's nothing wrong with IO, I just don't want any implicit in such complicated functions as we have here because it makes them a pain to test.

Yep, you are right! For now I pass the resolver around - I will keep it at the internal-functions level.

Zac-HD (Member) commented Aug 24, 2020

Very nice work! And thanks again for taking on such a significant new feature 😍

The current design is biased very strongly towards obvious code rather than performant code, so it is very likely that we could simply call encode_canonical_json much less often if we're willing to track the data more carefully. Adding an LRU cache keyed off object ID would probably also help, as we're serialising the same objects very many times.

If it still seems useful to you after we're confident that the Python code is performing reasonably well, I would also be happy to take an optional dependency on some native extensions for speed. After all, I have this handy set of tools to check that they're equivalent 😉

Stranger6667 (Contributor, Author)

Thanks!

Adding an LRU cache keyed off object ID would probably also help, as we're serialising the same objects very many times.

I tried this approach, and it helped quite a lot - the running time for that example dropped from ~16.8 seconds to 10.2. However, I am wondering if I got your idea correctly. What do you mean by object ID?

I was thinking about the following approaches here:

  • Checking the actual id() of input objects - but unfortunately, there are many situations where objects are mutated (e.g. $ref is removed), so the canonicalisation result for the same object in memory differs from one point in time to another. Otherwise, I suspect it would be the fastest way
  • Calculating a hash of the input via a fast non-cryptographic hash (something from hashlib, for example)
  • Converting the input into a hashable Python object, like the internals of lru_cache do (this approach brought the runtime down to 10.2 seconds)

Here is a rough implementation of the 3rd option:

import functools
import json

# JSONType and CanonicalisingJsonEncoder are the existing definitions
# from _canonicalise.py.

def make_key(value):
    # JSON values can only be null, booleans, numbers, strings, arrays,
    # or objects, so these branches cover everything.
    if value is None or isinstance(value, (bool, int, float, str)):
        return value
    if isinstance(value, list):
        return list, tuple(map(make_key, value))
    if isinstance(value, dict):
        return dict, tuple((k, make_key(v)) for k, v in value.items())


def hashable_json(func):

    class HashableDict(dict):

        def __hash__(self):
            return hash(make_key(self))

    class HashableList(list):

        def __hash__(self):
            return hash(make_key(self))

    @functools.wraps(func)
    def wrapped(value):
        if isinstance(value, dict):
            value = HashableDict(value)
        elif isinstance(value, list):
            value = HashableList(value)
        return func(value)

    return wrapped


@hashable_json
@functools.lru_cache()
def encode_canonical_json(value: JSONType) -> str:
    """Canonical form serialiser, for uniqueness testing."""
    return json.dumps(value, sort_keys=True, cls=CanonicalisingJsonEncoder)

Also, the current test suite's running time consistently improves by ~10% with this change. Generally, it seems like recursive schemas benefit the most. Does this approach make sense? If so, would it be better if I submitted the cache in a separate PR to reduce the change surface here?

If it still seems useful to you after we're confident that the Python code is performing reasonably well, I would also be happy to take an optional dependency on some native extensions for speed. After all, I have this handy set of tools to check that they're equivalent

Cool! I will keep experimenting :)

Zac-HD (Member) commented Aug 25, 2020

Haha, I was in fact thinking about id and as you say that's not going to work... that's why I don't merge anything late at night 😅

Your hash trick is pretty neat! I suspect it's still making shallow copies though, like the list and dict constructors - so a larger speedup might be on the table if we can use a reference instead via some kind of proxy object. If you get this working well and can measure a substantial performance improvement with recursive schemas, I'd be happy to accept that as a separate PR and you can rebase on it after that.

Stranger6667 (Contributor, Author) commented Aug 26, 2020

I tried an alternative approach: not calling resolve_all_refs at all - all references are basically left in the schema without inlining (as far as I understand). I also updated the xfail_on_reference_resolve_error decorator to exclude the check for recursive references.

And the data generation is much faster. The schema from my example above runs in 240.3 ms ± 1.2 ms vs 5.494 s ± 0.098 s on the current commit, which is ~23x faster (and the generated data is the same). Most of the problems I experienced before with resolve_all_refs are not there anymore ... It feels like it might be much easier to fix all the corner cases with this approach.
I am wondering if there could be downsides to such an approach? Harder debugging maybe?

Also, I found the problem that caused the major slowdowns - the Open API schema with resolved references takes ~2.8 MB as JSON before canonicalisation, and then canonicalise just gets stuck - I waited for a few minutes but it didn't finish; the JSON encoder was called a huge number of times there.

I will clean up these changes as a separate commit and push it tomorrow - I had to pass the resolver almost everywhere :)

Zac-HD (Member) commented Aug 27, 2020

I am wondering if there could be downsides to such an approach? Harder debugging maybe?

The really big one is that we need to inline references in order to canonicalise them - often we have "overlapping" subschemas, and if we can't canonicalise + merge them the constraint can only be expressed to Hypothesis as a filter.
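
As a toy illustration (not a schema from our test suite):

overlapping = {
    "allOf": [
        {"type": "integer", "minimum": 0},
        {"$ref": "#/definitions/small"},  # resolves to {"type": "integer", "maximum": 10}
    ],
    "definitions": {"small": {"type": "integer", "maximum": 10}},
}
# With the reference inlined, the two subschemas canonicalise and merge into
# {"type": "integer", "minimum": 0, "maximum": 10}, so we can generate
# in-range integers directly (roughly st.integers(0, 10)).
# Left behind a $ref, the second constraint can only be applied as a
# .filter(...) over values drawn from the first subschema, which is far
# less efficient.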

So we need to be somewhat slower in most cases - though amortised, so the more examples you run the less it matters - in order to work at all for harder schemas. This average-case/worst-case tradeoff comes up a lot in Hypothesis - and if you allow any pathological cases at all they will show up at the least convenient times.

We just have to live with that for recursive references, but we can fully inline non-recursive references.

Zac-HD (Member) commented Aug 27, 2020

And about recursive schemas: I think we're currently inlining too much of them. Consider the following:

schema = {
    "type": "object",
    "$ref": "#/definitions/a",
    "definitions": {
        "a": {"properties": {"foo": {"$ref": "#/definitions/x"}}},
        "x": {"type": "array", "items": {"$ref": "#/definitions/y"}},
        "y": {"type": "array", "items": {"$ref": "#/definitions/x"}},
    },
}

We currently continue inlining possibly-recursive references until we get to a key which we've seen before. This tends to explode in size, because we've already inlined a copy of the subschema before we detect that it's recursive.

current = {
    "type": "object",
    "properties": {
        "foo": {
            "type": "array",
            "items": {"type": "array", "items": {"$ref": "#/definitions/x"}},
        }
    },
    "definitions": {
        "x": {"type": "array", "items": {"$ref": "#/definitions/y"}},
        "y": {"type": "array", "items": {"$ref": "#/definitions/x"}},
    },
}

The ideal resolution would look like this: we inline up to the first reference which could recurse.

wanted = {
    "type": "object",
    "properties": {"foo": {"$ref": "#/definitions/x"}},
    "definitions": {
        "x": {"type": "array", "items": {"$ref": "#/definitions/y"}},
        # in a future PR, we could inline `y` into `x`; not now though
        "y": {"type": "array", "items": {"$ref": "#/definitions/x"}},
    },
}

Actually doing this will probably be pretty fiddly (sorry!) but lead to much better performance.

Stranger6667 (Contributor, Author)

The really big one is that we need to inline references in order to canonicalise them - often we have "overlapping" subschemas, and if we can't canonicalise + merge them the constraint can only be expressed to Hypothesis as a filter.

Aha, that is an important insight that I was missing! Thanks :)

We just have to live with that for recursive references, but we can fully inline non-recursive references.

Understood :) I pushed some more changes:

  • I try to detect recursive references by tracking them in a tree-like structure (see the sketch after this list). Each scopes stack is represented as a key in that tree, and each key has a set of seen references - when the same place is traversed a second time it means there is a cycle, since we somehow ended up in the same place with the same reference. I am not 100% sure it works in all cases, though
  • I pass the resolver to almost all functions that use JSON Schema validation, because in some cases those validators need proper reference resolution
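
In simplified form, the tracking looks roughly like this (a sketch only, with the tree flattened into a dict keyed by the scopes stack; the code in the branch differs):

from collections import defaultdict

# For each scopes stack (as a tuple) remember which references we have
# already resolved there; hitting the same reference for the same stack a
# second time means we went around a cycle.
seen_by_scope = defaultdict(set)

def is_recursive(scopes_stack, reference):
    key = tuple(scopes_stack)
    if reference in seen_by_scope[key]:
        return True  # second visit from the same place: a cycle
    seen_by_scope[key].add(reference)
    return False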

I also created #66, which may resolve some of the issues I faced in this branch :)

Stranger6667 force-pushed the dd/recursive-poc branch 2 times, most recently from 9aace2c to 7dfd92c on September 10, 2020 at 07:53
Stranger6667 (Contributor, Author)

Closing this one in favor of #69, which seems to be much closer to a fully working implementation :)
