Categories filtering for nested properties #59

kiendang · 2021-02-23T09:02:53Z

This PR adds support for setting categories on properties below the top level. Closes #58.

Rules for the `categories` keyword:

- must be under a `properties` subschema
- will be ignored if found inside a schema under any of these keywords: `if`, `not`, `then`, `else`, `anyOf`, `oneOf`
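To make the two rules concrete, here is a minimal sketch of a static walk over a schema. It is not how this PR extracts categories (that goes through jsonschema validation against an instance); `collect_categories` is a hypothetical helper that follows only `properties` and `allOf` and does not resolve `$ref`.

```python
def collect_categories(schema, path=()):
    """Return {property path: categories} for a schema (hypothetical helper).

    Only `properties` and `allOf` are followed. Subschemas under if, not,
    then, else, anyOf and oneOf are never visited, so categories declared
    there are ignored. `$ref` is not resolved in this sketch.
    """
    found = {}
    for name, subschema in schema.get("properties", {}).items():
        child = path + (name,)
        if "categories" in subschema:
            found[child] = subschema["categories"]
        found.update(collect_categories(subschema, child))
    for subschema in schema.get("allOf", []):
        # allOf does not add a path segment: it describes the same instance.
        found.update(collect_categories(subschema, path))
    return found


schema = {
    "type": "object",
    "properties": {
        "user": {
            "categories": ["user-identifier"],
            "allOf": [
                {"properties": {"email": {"categories": ["user-identifiable-information"]}}}
            ],
            "anyOf": [
                {"properties": {"id": {"categories": ["declared-under-anyOf"]}}}
            ],
        }
    },
}
result = collect_categories(schema)
```

Here `result` contains entries for `user` and `user.email`, while the declaration for `id` under `anyOf` is skipped.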

Implementation details

This makes use of jsonschema's validation to extract details about categories. jsonschema's iter_errors yields all validation errors that occur during validation. Among other things, each error records which part of the schema and which element of the JSON instance it comes from. By modifying the validator for the properties keyword so that it yields a custom error, ExtractCategories (inheriting from jsonschema.ValidationError), whenever the categories keyword is encountered, we can extract all categories declared in the schema.

The implementation is based on https://python-jsonschema.readthedocs.io/en/stable/faq/#why-doesn-t-my-schema-s-default-property-set-the-default-on-my-instance
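A minimal sketch of the idea, modelled on the FAQ recipe linked above and not the exact code in this PR. `ExtractCategories` is the name used in the description, but its body here is an assumption; `_properties_with_categories`, `CategoryExtractor` and `extract_categories` are hypothetical names.

```python
from jsonschema import Draft7Validator, ValidationError, validators


class ExtractCategories(ValidationError):
    """Marker "error" carrying the categories declared for one property."""


def _properties_with_categories(validator, properties, instance, schema):
    # Same guard the stock `properties` validator applies.
    if validator.is_type(instance, "object"):
        for name, subschema in properties.items():
            if name in instance and "categories" in subschema:
                error = ExtractCategories(
                    f"categories declared for {name!r}", path=[name]
                )
                error.categories = subschema["categories"]
                yield error
    # Defer to the stock validator so nested subschemas are still visited.
    yield from Draft7Validator.VALIDATORS["properties"](
        validator, properties, instance, schema
    )


CategoryExtractor = validators.extend(
    Draft7Validator, {"properties": _properties_with_categories}
)


def extract_categories(schema, instance):
    """Map each property path present in `instance` to its categories."""
    return {
        tuple(error.absolute_path): error.categories
        for error in CategoryExtractor(schema).iter_errors(instance)
        if isinstance(error, ExtractCategories)
    }


SCHEMA = {
    "type": "object",
    "properties": {
        "nothing-exciting": {"categories": ["unrestricted"], "type": "string"},
        "user": {
            "categories": ["user-identifier"],
            "type": "object",
            "properties": {
                "email": {
                    "categories": ["user-identifiable-information"],
                    "type": "string",
                },
                "id": {"type": "string"},
            },
        },
    },
}
EVENT = {"nothing-exciting": "hello", "user": {"email": "a@example.com", "id": "42"}}

categories = extract_categories(SCHEMA, EVENT)
```

Because the extended validator reports every categorised property as an error, `CategoryExtractor(SCHEMA).is_valid(EVENT)` is false even though the event is valid, so validation proper would still go through the plain `Draft7Validator`.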

Examples of schemas that work

simple

{
    "$id": "test.schema",
    "title": "Test Event",
    "version": 1.0,
    "description": "Test Event.",
    "type": "object",
    "properties": {
        "nothing-exciting": {
            "description": "a property with nothing exciting happening",
            "categories": ["unrestricted"],
            "type": "string"
        },
        "user": {
            "description": "user",
            "categories": ["user-identifier"],
            "type": "object",
            "properties": {
                "email": {
                    "description": "email address",
                    "categories": ["user-identifiable-information"],
                    "type": "string"
                },
                "id": {
                    "description": "user ID",
                    "type": "string"
                }
            }
        }
    }
}

allOf

{
    "$id": "test.schema",
    "title": "Test Event",
    "version": 1.0,
    "description": "Test Event.",
    "type": "object",
    "properties": {
        "nothing-exciting": {
            "description": "a property with nothing exciting happening",
            "categories": ["unrestricted"],
            "type": "string"
        },
        "user": {
            "description": "user",
            "categories": ["user-identifier"],
            "type": "object",
            "allOf": [
                {
                    "properties": {
                        "email": {
                            "description": "email address",
                            "categories": ["user-identifiable-information"],
                            "type": "string"
                        }
                    }
                },
                {
                    "properties": {
                        "id": {
                            "description": "user ID",
                            "type": "string"
                        }
                    }
                }
            ]
        }
    }
}

reference

{
    "$id": "test.schema",
    "title": "Test Event",
    "version": 1.0,
    "description": "Test Event.",
    "type": "object",
    "properties": {
        "nothing-exciting": {
            "description": "a property with nothing exciting happening",
            "categories": ["unrestricted"],
            "type": "string"
        },
        "user": {
            "description": "user",
            "categories": ["user-identifier"],
            "type": "object",
            "$ref": "#/definitions/properties"
        }
    },
    "definitions": {
        "properties": {
            "properties": {
                "email": {
                    "description": "email address",
                    "categories": ["user-identifiable-information"],
                    "type": "string"
                },
                "id": {
                    "description": "user ID",
                    "type": "string"
                }
            }
        }
    }
}

Caveats

Limitations

With this implementation, categories is treated as a validation error instead of an annotation. It gets the job done, but comes with some limitations:

- Since categories raises a validation error, it interferes with JSON schema validation. Thus, when recording an event, we have to run the equivalent of jsonschema.validate twice: once for validation and once for category extraction. Filtering categories and validating at the same time is not possible, since category filtering would invalidate some otherwise valid JSON instances. For example, setting categories for a property under the allOf keyword would raise an ExtractCategories error (which is also a jsonschema.ValidationError) and would invalidate perfectly valid JSON. Running validation twice might affect performance. A solution being considered is using fastjsonschema for fast validation. See "Use fastjsonschema for json schema validation" #64.
- Supporting categories under keywords such as if, then, else, anyOf, oneOf is nontrivial.
- Constructs like the following, which could be useful (perhaps for "Allow schemas to be extended" #52), are not supported:

"additionalProperties": {
  "categories": [...]
}

Semantically, categories is more like an annotation than a validation keyword and would be better implemented using annotation collection/output formats, a feature of JSON schema available since draft 2019-09 (formerly draft-08). Annotation collection/output formats aim to extend the use of JSON schema beyond validation; they were designed for use cases like ours and would address these shortcomings. More details here and here. However, support for draft 2019-09 in the python jsonschema library is not available yet, though development has started. Also note that currently not all JSON schema implementations support draft 2019-09, and not all of those that do support output formats, since this is a suggested but not required feature. I know of only a couple of implementations that support output formats at the moment, in Perl and C#. With the new release of OpenAPI supporting drafts 2019-09 and 2020-12, wider adoption will hopefully follow.

For now we could stick with this implementation and wait for draft 2019-09 and output formats to be supported.

Overlapped categories

There are situations where a property has categories declared multiple times. For example,

"user": {
  "allOf": [
    {
      "properties": {
        "id": {
          "categories": ["user-identifiable-information"]
        }
      }
    },
    {
      "properties": {
        "id": {
          "categories": ["user-identifier"]
        }
      }
    }
  ]
}

Here id has categories declared twice, as ["user-identifiable-information"] and ["user-identifier"].

"user": {
  "properties": {
    "id": {
      "categories": ["user-identifiable-information"]
    }
  },
  "dependencies": {
    "email": {
      "properties": {
        "id": {
          "categories": ["user-identifier"]
        }
      }
    }
  }
}

For the instance {"user": {"id": 1, "email": "[email protected]"}}, id has categories declared twice, as ["user-identifiable-information"] and ["user-identifier"].

The current stance, based on discussion with Min, is that we do not support this behavior and advise end users to design and test their event schemas to make sure that each property has its categories declared only once.

In the current implementation, if this situation happens, the property takes whichever categories the validator encounters last, and that order is not deterministic.
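A small sketch of why the last declaration wins, and of a check a schema author could run in tests to catch overlaps. Both helpers are hypothetical and take `(path, categories)` pairs in the order the validator encountered them; neither is part of this PR.

```python
from collections import defaultdict


def merge_last_wins(declarations):
    """Collapse (path, categories) pairs the way a plain dict update would."""
    merged = {}
    for path, categories in declarations:
        # A later declaration silently replaces an earlier one.
        merged[path] = categories
    return merged


def find_overlaps(declarations):
    """Return the paths whose categories were declared more than once."""
    seen = defaultdict(list)
    for path, categories in declarations:
        seen[path].append(categories)
    return {path: decls for path, decls in seen.items() if len(decls) > 1}


declarations = [
    (("user", "id"), ["user-identifiable-information"]),
    (("user", "id"), ["user-identifier"]),
    (("user", "email"), ["user-identifiable-information"]),
]
merged = merge_last_wins(declarations)
overlaps = find_overlaps(declarations)
```

A schema test could assert that `find_overlaps(...)` is empty for a set of representative events.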

Extras

Explicit JSON schema version

This PR also explicitly uses JSON schema Draft-07, instead of supporting all drafts and using jsonschema.validators.validator_for to determine which draft to use for a given schema.

`allowed_properties`

Currently allowed_properties is for top-level properties only. If a top-level property is included in allowed_properties then all the descendant properties under it are recorded, even if some of them do not have all their categories whitelisted.
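The rule above can be sketched as a small predicate. `is_recorded` is a hypothetical helper, not a function from this PR, and how properties without any declared categories are treated is left out of the sketch.

```python
def is_recorded(path, categories, allowed_categories, allowed_properties):
    """Decide whether the value at `path` is kept (hypothetical helper).

    `path` is a tuple of property names starting at the top level.
    """
    # A top-level property listed in allowed_properties keeps its whole subtree.
    if path[0] in allowed_properties:
        return True
    # Otherwise every category declared for the property must be allowed.
    return set(categories) <= set(allowed_categories)


email_path = ("user", "email")
email_categories = ["user-identifiable-information"]

kept_by_property = is_recorded(email_path, email_categories, {"unrestricted"}, {"user"})
dropped = is_recorded(email_path, email_categories, {"unrestricted"}, set())
kept_by_category = is_recorded(
    email_path, email_categories, {"user-identifiable-information"}, set()
)
```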

Censored fields

Properties that do not have all their categories allowed have their values set to null. There is no way to tell whether a property is null because its value is actually null or because it was hidden due to its categories. Should we have a field in the metadata that specifies which properties are censored? For example,

"__masked__": [
  ["user", "email"],
  ["user", "id"]
]
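A sketch of how such a list could be produced while filtering. `mask_event` is a hypothetical helper and `__masked__` is only the field name proposed above; nothing here is implemented in this PR.

```python
def mask_event(event, categories_by_path, allowed_categories):
    """Null out disallowed properties in place and return their paths."""
    allowed = set(allowed_categories)
    masked = []
    for path, categories in categories_by_path.items():
        if set(categories) <= allowed:
            continue
        # Walk down to the dict that holds the property to censor.
        parent = event
        for key in path[:-1]:
            parent = parent.get(key) if isinstance(parent, dict) else None
        if isinstance(parent, dict) and path[-1] in parent:
            parent[path[-1]] = None
            masked.append(list(path))
    return masked


event = {"nothing-exciting": "hi", "user": {"email": "a@example.com", "id": "42"}}
categories_by_path = {
    ("nothing-exciting",): ["unrestricted"],
    ("user", "email"): ["user-identifiable-information"],
    ("user", "id"): ["user-identifier"],
}
masked = mask_event(event, categories_by_path, {"unrestricted"})
metadata = {"__masked__": masked}
```

With the list in the metadata, a consumer can tell a censored `null` from a genuine one by checking whether the property's path appears in `__masked__`.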

Modify event in place?

UPDATE: We decided to just modify event in place for efficiency. Can't think of many use cases where the event is needed after being emitted, and even in that case users can just make a copy of the event themselves.

Category filtering means setting properties in the event to null, so to avoid modifying the event passed to record_event I did a deepcopy of the event at the beginning. This probably affects performance for large events, since deepcopy is notoriously slow. Should we skip that, i.e. get rid of the deepcopy, modify the event in place, and document this, so that users who want to preserve the event can make a copy themselves before passing it to record_event?
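A sketch of the trade-off from the caller's side. `censor_in_place` is a hypothetical stand-in for the filtering step inside record_event, not the real function.

```python
import copy


def censor_in_place(event, paths):
    """Set each path in `event` to None, mutating the caller's dict."""
    for path in paths:
        target = event
        for key in path[:-1]:
            target = target[key]
        target[path[-1]] = None


event = {"user": {"email": "a@example.com", "id": "42"}}

# Callers who still need the original take a copy themselves, and only
# pay the deepcopy cost when they actually need it.
preserved = copy.deepcopy(event)

censor_in_place(event, [("user", "email")])
```

After the call, `event` has been censored while `preserved` still holds the original values.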

jupyter_telemetry/_eventschema.py

Zsailer · 2021-06-01T18:26:17Z

Overall, this PR looks great! Thanks, @kiendang. Just one minor comment left above.

Zsailer · 2021-06-01T18:26:48Z

Should we have a field in the metadata that specifies which properties are censored?

~~Yes, I think so. I believe adding a masked field is the way to go here. Do you want to add it here in this PR or submit a separate PR?~~

I missed #61. Will comment over there when I get a chance.

Zsailer

Awesome work, @kiendang! Merging!

kiendang added 16 commits February 21, 2021 16:16

Update gitignore

a7b04ac

Add category filtering for sub fields

e0cfb39

Fix allowed_properties for sub-properties

9026d9b

Add tests for category filtering

a6a5a70

Fix deep_get

d2e86a6

Remove additional top-level properties without declared categories

6f4445b

Update capsule in place for better perf

9a34b1e

Add tests for category filtering for array elements

fb3e1ab

Refactor tests

5bdd155

Set top-level additional properties to null

7a86bb7

Explicitly use JSON Schema Draft-07

723c4fc

Modify tests

5073ec1

Move assert outside the util function so that diffs are displayed in pytest error message.

Separate schema validation and category extraction

5813b0e

Otherwise some valid schemas will fail by "error" caused by category extraction in subschema.

Yield ExtractCategories included as context in another validation error

ab0f509

Unexpose eventschema module

9e45aba

Make flake8 happy

b4e01f6

Zsailer mentioned this pull request Mar 3, 2021

Jupyter Server Notes 2021 jupyter-server/team-compass#4

Closed

kiendang added 2 commits April 30, 2021 02:02

Modify event data inplace for efficiency

4680b5e

Clean up

b17dc59

kiendang mentioned this pull request May 3, 2021

Handle category-filtered fields #61

Open

kiendang mentioned this pull request May 27, 2021

Use fastjsonschema for json schema validation #64

Draft

kiendang added this to the 0.2 milestone May 27, 2021

Zsailer reviewed Jun 1, 2021

jupyter_telemetry/_eventschema.py (outdated, resolved)

Reorganize

27b21e6

Add docstrings Rename stuff more appropriately Expose only what's necessary

kiendang force-pushed the categories-filter branch from 7538b01 to 27b21e6 Compare June 2, 2021 06:46

Zsailer approved these changes Jun 2, 2021

Zsailer merged commit d2ded1d into jupyter:master Jun 2, 2021

kiendang deleted the categories-filter branch June 2, 2021 17:49
