
Excessive filtering when generating object properties #55

Closed
yurikhan opened this issue Jun 14, 2020 · 5 comments

Comments

@yurikhan

I’m bumping into Hypothesis’s filter_too_much health check.

I seem to be unable to reproduce the actual health check failure on a small example, but running the following test with --hypothesis-show-statistics shows filtering failure events:

from hypothesis import given
from hypothesis_jsonschema import from_schema

SCHEMA = {
    'type': 'object',
    'additionalProperties': False,
    'required': ['foo'],
    'properties': {
        'foo': {'type': 'integer'}
    }
}

@given(from_schema(SCHEMA))
def test_excessive_filtering(instance):
    pass
$ py.test --hypothesis-show-statistics
[…]
  - during generate phase (0.65 seconds):
    - Typical runtimes: < 1ms, ~ 88% in data generation
    - 99 passing examples, 0 failing examples, 475 invalid examples
    - Events:
      * 82.75%, Aborted test because unable to satisfy sampled_from(['foo']).filter(<jsonschema.validators.create.<locals>.Validator object at 0x7ff9a9962b00>.is_valid).filter(lambda s: s not in out)
      * 82.75%, Retried draw from sampled_from(['foo']).filter(<jsonschema.validators.create.<locals>.Validator object at 0x7ff9a9962b00>.is_valid).filter(lambda s: s not in out) to satisfy filter

  - Stopped because settings.max_examples=100

As far as I can understand the black magic in from_object_schema, here’s what happens:

  • It counts the minimum possible number of properties in the object, which is the number of required properties, which is 1.
  • It determines the maximum possible number of properties, which is +∞. (Seriously, does anybody ever set maxProperties on a schema for an object with all properties known?)
  • It draws some elements, no fewer than min_size (1) and on average min_size + 5 (6).
  • For each element drawn:
    • If a required property is missing, it generates that property. (This happens for the first element and property foo.)
    • If all required properties are set, it sees if any dependent properties are missing, and generates those. (My schema does not have any of these.)
    • Otherwise, it tries to generate an arbitrary property name it could add. But the only statically named property is already set, and no dynamic names are allowed, so it rejects the element. This happens, on average, 5 times per generated object.
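The rejection in that last step can be sketched as a tiny standalone function (hypothetical names and a deliberate simplification, not the actual hypothesis-jsonschema code):

```python
def addable_names(properties, patterns, additional_allowed, out):
    """Which property names could still be added to the partially built
    object `out`? A hypothetical sketch of the filtering step above."""
    remaining = [name for name in properties if name not in out]
    if remaining:
        return remaining  # a statically named property is still missing
    if patterns or additional_allowed:
        return ["<drawn dynamically>"]  # pattern/additional names are possible
    return []  # nothing left to add: every further draw gets filtered out

# With the schema from this issue, once 'foo' is generated there is nothing
# left to draw, so the ~5 extra element draws per object are all rejected.
print(addable_names({"foo": {"type": "integer"}}, [], False, {"foo": 1}))  # []
```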

The filtering events go away with this trivial patch:

--- _from_schema.py.orig	2020-06-15 03:14:06.321863065 +0700
+++ _from_schema.py	2020-06-15 03:13:56.965921488 +0700
@@ -474,6 +474,9 @@
     additional = schema.get("additionalProperties", {})
     additional_allowed = additional != FALSEY
 
+    if not patterns and not additional_allowed:
+        max_size = min(max_size, len(properties))
+
     dependencies = schema.get("dependencies", {})
     dep_names = {k: v for k, v in dependencies.items() if isinstance(v, list)}
     dep_schemas = {k: v for k, v in dependencies.items() if k not in dep_names}
@Zac-HD Zac-HD closed this as completed in e041bfc Jun 15, 2020
Zac-HD (Member) commented Jun 15, 2020

Fantastic issue writeup, making it very easy to fix - thanks so much! I've released 0.16.1 😁

I've patched the _canonicalise.canonicalish() function instead of from_object_schema(); canonicalish() is responsible for translating the schemas that people actually write into equivalent schemas that are more efficient to turn into strategies.
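The idea behind the canonicalisation fix can be sketched roughly as follows (a hypothetical simplification, not the actual code from e041bfc):

```python
def cap_max_properties(schema: dict) -> dict:
    """If an object schema allows no patternProperties and no additional
    properties, an instance can never have more properties than those
    explicitly named, so maxProperties can be tightened up front."""
    schema = dict(schema)  # don't mutate the caller's schema
    if (schema.get("type") == "object"
            and not schema.get("patternProperties")
            and schema.get("additionalProperties") is False):
        cap = len(schema.get("properties", {}))
        schema["maxProperties"] = min(schema.get("maxProperties", cap), cap)
    return schema

SCHEMA = {
    "type": "object",
    "additionalProperties": False,
    "required": ["foo"],
    "properties": {"foo": {"type": "integer"}},
}
print(cap_max_properties(SCHEMA)["maxProperties"])  # 1
```

With maxProperties capped at the number of named properties, the element-drawing loop never asks for a property name that cannot exist, so the filter rejections disappear.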

yurikhan (Author):

Thanks! Now hypothesis-jsonschema is working for me and finding real bugs :)

Zac-HD (Member) commented Jun 15, 2020

Happy to help!

I'd love to hear more about what you're doing with Hypothesis and hypothesis-jsonschema, if you've written it up somewhere?

yurikhan (Author):

We integrate with poorly documented external APIs. For each API, we write a response schema and validate responses; if the API suddenly starts responding differently, this validation will fail visibly and safely, so we will know we need to look at the code again. hypothesis-jsonschema helps us test that our code is actually able to cope with data that passes that validation.
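In practice the pattern looks roughly like this (a stdlib-only stand-in for jsonschema.validate, with a made-up response schema; the real code would validate with the jsonschema library):

```python
def validate_response(resp, schema):
    """Minimal stand-in for jsonschema.validate: check required keys and
    reject unexpected ones, raising so schema drift fails visibly."""
    if not isinstance(resp, dict):
        raise TypeError("expected a JSON object")
    missing = [k for k in schema.get("required", []) if k not in resp]
    extra = ([k for k in resp if k not in schema.get("properties", {})]
             if schema.get("additionalProperties", True) is False else [])
    if missing or extra:
        raise ValueError(f"missing={missing}, unexpected={extra}")
    return resp

# Hypothetical response schema for a poorly documented external API:
RESPONSE_SCHEMA = {
    "type": "object",
    "additionalProperties": False,
    "required": ["status"],
    "properties": {"status": {"type": "string"}, "detail": {"type": "string"}},
}
validate_response({"status": "ok"}, RESPONSE_SCHEMA)  # passes silently
```

The same RESPONSE_SCHEMA can then be fed to from_schema() so the test suite exercises the handling code with arbitrary data that passes the validation.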

As for the nature of those APIs, not sure I can talk about that. Consider them boring businessy things ;)

Zac-HD (Member) commented Jun 16, 2020

Nice! The "does my code actually work for arbitrary valid data" problem is exactly what I built hypothesis-jsonschema for; it's a lovely test pattern. Combine it with lots of internal assertions to find even more bugs 😁
