
Excessive filtering when generating object properties #55

Closed
yurikhan opened this issue Jun 14, 2020 · 5 comments

Comments

@yurikhan

I’m bumping into Hypothesis’s filter_too_much health check.

I seem to be unable to reproduce the actual health check failure on a small example, but running the following test with --hypothesis-show-statistics shows filtering failure events:

from hypothesis import given
from hypothesis_jsonschema import from_schema

SCHEMA = {
    'type': 'object',
    'additionalProperties': False,
    'required': ['foo'],
    'properties': {
        'foo': {'type': 'integer'}
    }
}

@given(from_schema(SCHEMA))
def test_excessive_filtering(instance):
    pass
$ py.test --hypothesis-show-statistics
[…]
  - during generate phase (0.65 seconds):
    - Typical runtimes: < 1ms, ~ 88% in data generation
    - 99 passing examples, 0 failing examples, 475 invalid examples
    - Events:
      * 82.75%, Aborted test because unable to satisfy sampled_from(['foo']).filter(<jsonschema.validators.create.<locals>.Validator object at 0x7ff9a9962b00>.is_valid).filter(lambda s: s not in out)
      * 82.75%, Retried draw from sampled_from(['foo']).filter(<jsonschema.validators.create.<locals>.Validator object at 0x7ff9a9962b00>.is_valid).filter(lambda s: s not in out) to satisfy filter

  - Stopped because settings.max_examples=100

As far as I can understand the black magic in from_object_schema, here’s what happens:

  • It counts the minimum possible number of properties in the object, which is the number of required properties, which is 1.
  • It determines the maximum possible number of properties, which is +∞. (Seriously, does anybody ever set maxProperties on a schema for an object with all properties known?)
  • It draws some elements, no fewer than min_size (1) and on average min_size + 5 (6).
  • For each element drawn:
    • If a required property is missing, it generates that property. (This happens for the first element and property foo.)
    • If all required properties are set, it sees if any dependent properties are missing, and generates those. (My schema does not have any of these.)
    • Otherwise, it tries to generate an arbitrary property name it could add. But the only statically named property is already set, and no dynamic names are allowed, so it rejects the element. This happens, on average, 5 times per generated object.
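The rejection in that last step can be sketched as a tiny standalone function (hypothetical names and a deliberate simplification, not the actual hypothesis-jsonschema code):

```python
def addable_names(properties, patterns, additional_allowed, out):
    """Which property names could still be added to the partially built
    object `out`? A hypothetical sketch of the filtering step above."""
    remaining = [name for name in properties if name not in out]
    if remaining:
        return remaining  # a statically named property is still missing
    if patterns or additional_allowed:
        return ["<drawn dynamically>"]  # pattern/additional names are possible
    return []  # nothing left to add: every further draw gets filtered out

# With the schema from this issue, once 'foo' is generated there is nothing
# left to draw, so the ~5 extra element draws per object are all rejected.
print(addable_names({"foo": {"type": "integer"}}, [], False, {"foo": 1}))  # []
```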

The filtering events go away with this trivial patch:

--- _from_schema.py.orig	2020-06-15 03:14:06.321863065 +0700
+++ _from_schema.py	2020-06-15 03:13:56.965921488 +0700
@@ -474,6 +474,9 @@
     additional = schema.get("additionalProperties", {})
     additional_allowed = additional != FALSEY
 
+    if not patterns and not additional_allowed:
+        max_size = min(max_size, len(properties))
+
     dependencies = schema.get("dependencies", {})
     dep_names = {k: v for k, v in dependencies.items() if isinstance(v, list)}
     dep_schemas = {k: v for k, v in dependencies.items() if k not in dep_names}
@Zac-HD Zac-HD closed this as completed in e041bfc Jun 15, 2020
Zac-HD (Member) commented Jun 15, 2020

Fantastic issue writeup, making it very easy to fix - thanks so much! I've released 0.16.1 😁

I've patched the _canonicalise.canonicalish() function instead of from_object_schema(); canonicalish() is responsible for translating the schemas that people actually write into equivalent schemas that are more efficient to turn into strategies.
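The idea behind the canonicalisation fix can be sketched roughly as follows (a hypothetical simplification, not the actual code from e041bfc):

```python
def cap_max_properties(schema: dict) -> dict:
    """If an object schema allows no patternProperties and no additional
    properties, an instance can never have more properties than those
    explicitly named, so maxProperties can be tightened up front."""
    schema = dict(schema)  # don't mutate the caller's schema
    if (schema.get("type") == "object"
            and not schema.get("patternProperties")
            and schema.get("additionalProperties") is False):
        cap = len(schema.get("properties", {}))
        schema["maxProperties"] = min(schema.get("maxProperties", cap), cap)
    return schema

SCHEMA = {
    "type": "object",
    "additionalProperties": False,
    "required": ["foo"],
    "properties": {"foo": {"type": "integer"}},
}
print(cap_max_properties(SCHEMA)["maxProperties"])  # 1
```

With maxProperties capped at the number of named properties, the element-drawing loop never asks for a property name that cannot exist, so the filter rejections disappear.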

yurikhan (Author):

Thanks! Now hypothesis-jsonschema is working for me and finding real bugs :)

Zac-HD (Member) commented Jun 15, 2020

Happy to help!

I'd love to hear more about what you're doing with Hypothesis and hypothesis-jsonschema, if you've written it up somewhere?

yurikhan (Author):

We integrate with poorly documented external APIs. For each API, we write a response schema and validate responses; if the API suddenly starts responding differently, this validation will fail visibly and safely, so we will know we need to look at the code again. hypothesis-jsonschema helps us test that our code is actually able to cope with data that passes that validation.
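In practice the pattern looks roughly like this (a stdlib-only stand-in for jsonschema.validate, with a made-up response schema; the real code would validate with the jsonschema library):

```python
def validate_response(resp, schema):
    """Minimal stand-in for jsonschema.validate: check required keys and
    reject unexpected ones, raising so schema drift fails visibly."""
    if not isinstance(resp, dict):
        raise TypeError("expected a JSON object")
    missing = [k for k in schema.get("required", []) if k not in resp]
    extra = ([k for k in resp if k not in schema.get("properties", {})]
             if schema.get("additionalProperties", True) is False else [])
    if missing or extra:
        raise ValueError(f"missing={missing}, unexpected={extra}")
    return resp

# Hypothetical response schema for a poorly documented external API:
RESPONSE_SCHEMA = {
    "type": "object",
    "additionalProperties": False,
    "required": ["status"],
    "properties": {"status": {"type": "string"}, "detail": {"type": "string"}},
}
validate_response({"status": "ok"}, RESPONSE_SCHEMA)  # passes silently
```

The same RESPONSE_SCHEMA can then be fed to from_schema() so the test suite exercises the handling code with arbitrary data that passes the validation.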

As for the nature of those APIs, not sure I can talk about that. Consider them boring businessy things ;)

Zac-HD (Member) commented Jun 16, 2020

Nice! The "does my code actually work for arbitrary valid data" problem is exactly what I built hypothesis-jsonschema for; it's a lovely test pattern. Combine it with lots of internal assertions to find even more bugs 😁
