Categories filtering for nested properties #59

Merged
merged 19 commits into from
Jun 2, 2021

Conversation

kiendang
Member

@kiendang kiendang commented Feb 23, 2021

This PR adds support for setting categories for non-top-level properties. Closes #58.

Rules for the categories keyword:

  • must be under properties subschema
  • will be ignored if found inside a schema under any of these keywords: if, not, then, else, anyOf, oneOf

Implementation details

This makes use of jsonschema's validation machinery to extract details about categories. jsonschema's iter_errors lists all errors that occurred during validation. Each error records, among other things, which part of the schema and which element of the JSON instance it comes from. By modifying the validator for the properties keyword so that it yields a custom error ExtractCategories (inheriting from jsonschema.ValidationError) whenever the categories keyword is encountered, we can extract all categories declared in the schema.

The implementation is based on https://python-jsonschema.readthedocs.io/en/stable/faq/#why-doesn-t-my-schema-s-default-property-set-the-default-on-my-instance
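
A minimal sketch of the approach described above, following the pattern from the linked FAQ entry. The names properties_with_categories, CategoryExtractor, and extract_categories are illustrative and are not necessarily those used in this PR. The sketch also does not implement the rule of ignoring categories under if, not, then, else, anyOf, and oneOf.

```python
from jsonschema import Draft7Validator, ValidationError, validators


class ExtractCategories(ValidationError):
    """Not a real failure: carries the categories declared for one property."""


# Stock Draft-07 handler for "properties", reused so normal descent still happens.
_validate_properties = Draft7Validator.VALIDATORS["properties"]


def properties_with_categories(validator, properties, instance, schema):
    """Yield an ExtractCategories error per categorized property, then validate as usual."""
    if validator.is_type(instance, "object"):
        for name, subschema in properties.items():
            if (
                name in instance
                and isinstance(subschema, dict)
                and "categories" in subschema
            ):
                error = ExtractCategories(f"categories for {name!r}", path=[name])
                # Illustrative attribute, attached for later collection.
                error.categories = subschema["categories"]
                yield error
    yield from _validate_properties(validator, properties, instance, schema)


# Hypothetical validator class: Draft-07 with the patched "properties" handler.
CategoryExtractor = validators.extend(
    Draft7Validator, {"properties": properties_with_categories}
)


def extract_categories(schema, instance):
    """Map each property path (as a tuple) to the categories declared for it."""
    errors = CategoryExtractor(schema).iter_errors(instance)
    return {
        tuple(error.absolute_path): error.categories
        for error in errors
        if isinstance(error, ExtractCategories)
    }
```

Because every yielded error passes through jsonschema's normal descent, its absolute_path ends up holding the full path of the property in the instance, which is what makes nested properties addressable.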

Examples of schemas that work

simple
{
    "$id": "test.schema",
    "title": "Test Event",
    "version": 1.0,
    "description": "Test Event.",
    "type": "object",
    "properties": {
        "nothing-exciting": {
            "description": "a property with nothing exciting happening",
            "categories": ["unrestricted"],
            "type": "string"
        },
        "user": {
            "description": "user",
            "categories": ["user-identifier"],
            "type": "object",
            "properties": {
                "email": {
                    "description": "email address",
                    "categories": ["user-identifiable-information"],
                    "type": "string"
                },
                "id": {
                    "description": "user ID",
                    "type": "string"
                }
            }
        }
    }
}
allOf
{
    "$id": "test.schema",
    "title": "Test Event",
    "version": 1.0,
    "description": "Test Event.",
    "type": "object",
    "properties": {
        "nothing-exciting": {
            "description": "a property with nothing exciting happening",
            "categories": ["unrestricted"],
            "type": "string"
        },
        "user": {
            "description": "user",
            "categories": ["user-identifier"],
            "type": "object",
            "allOf": [
                {
                    "properties": {
                        "email": {
                            "description": "email address",
                            "categories": ["user-identifiable-information"],
                            "type": "string"
                        }
                    }
                },
                {
                    "properties": {
                        "id": {
                            "description": "user ID",
                            "type": "string"
                        }
                    }
                }
            ]
        }
    }
}
reference
{
    "$id": "test.schema",
    "title": "Test Event",
    "version": 1.0,
    "description": "Test Event.",
    "type": "object",
    "properties": {
        "nothing-exciting": {
            "description": "a property with nothing exciting happening",
            "categories": ["unrestricted"],
            "type": "string"
        },
        "user": {
            "description": "user",
            "categories": ["user-identifier"],
            "type": "object",
            "$ref": "#/definitions/properties"
        }
    },
    "definitions": {
        "properties": {
            "properties": {
                "email": {
                    "description": "email address",
                    "categories": ["user-identifiable-information"],
                    "type": "string"
                },
                "id": {
                    "description": "user ID",
                    "type": "string"
                }
            }
        }
    }
}

Caveats

Limitations

With this implementation, categories is treated as a validation error instead of an annotation. It gets the job done, but comes with some limitations:

  1. Since categories raises a validation error, it interferes with JSON schema validation. Thus, when recording an event, we have to run the equivalent of jsonschema.validate twice: once for validation and once for category extraction. Filtering categories and validating at the same time is not possible, since category filtering would invalidate some otherwise valid JSON instances. For example, setting categories for a property under the allOf keyword would raise an ExtractCategories (which is also a jsonschema.ValidationError) and would invalidate perfectly valid JSON. Running two passes might affect performance. A solution being considered is using fastjsonschema for fast validation. See Use fastjsonschema for json schema validation #64.

  2. Supporting categories under keywords such as if, then, else, anyOf, oneOf is nontrivial.

  3. Constructs like the following, which could be useful (perhaps for Allow schemas to be extended #52), are not supported:

"additionalProperties": {
  "categories": [...]
}

Semantically, categories is more like an annotation than a validation keyword and would be better implemented using annotation collection/output formats, a feature of JSON schema available since draft 2019-09 (formerly draft-08). Annotation collection/output formats aim to extend the use of JSON schema beyond validation; they were designed for use cases like ours and would address these shortcomings. More details here and here. However, support for draft 2019-09 in the python jsonschema library is not available yet, though development has started. Also note that currently not all JSON schema implementations support draft 2019-09, and not all of those support output formats, since this is a suggested but not required feature. Only a couple of implementations that I know of support output formats at the moment, in Perl and C#. With the new release of OpenAPI that supports drafts 2019-09 and 2020-12, wider adoption will hopefully follow.

For now, we could stick with this implementation and wait for draft 2019-09 support and output formats.

Overlapped categories

There are situations where a property has categories declared multiple times. For example,

"user": {
  "allOf": [
    {
      "properties": {
        "id": {
          "categories": ["user-identifiable-information"]
        }
      }
    },
    {
      "properties": {
        "id": {
          "categories": ["user-identifier"]
        }
      }
    }
  ]
}

Here id has categories declared twice, as ["user-identifiable-information"] and ["user-identifier"].

"user": {
  "properties": {
    "id": {
      "categories": ["user-identifiable-information"]
    }
  },
  "dependencies": {
    "email": {
      "properties": {
        "id": {
          "categories": ["user-identifier"]
        }
      }
    }
  }
}

For the instance {"user": {"id": 1, "email": "[email protected]"}}, id has categories declared twice, as ["user-identifiable-information"] and ["user-identifier"].

The current stance, based on discussion with Min, is that we do not support this behavior and advise end users to design and test their event schemas to make sure that each property has its categories declared only once.

In the current implementation, if this situation happens then the property takes whatever categories the validator encounters last, which is not deterministic.
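
Since schema authors are advised to test for this themselves, a small helper along these lines could flag overlapping declarations. This is only a sketch and not part of this PR: find_overlaps is a hypothetical name, and it assumes the declarations have already been collected as (path, categories) pairs in the order they were encountered.

```python
from collections import defaultdict


def find_overlaps(declarations):
    """Return the paths that have categories declared more than once.

    declarations: iterable of (path, categories) pairs, where path is a
    sequence of property names such as ["user", "id"].
    """
    seen = defaultdict(list)
    for path, categories in declarations:
        seen[tuple(path)].append(list(categories))
    # Keep only paths with two or more declarations.
    return {path: decls for path, decls in seen.items() if len(decls) > 1}
```

A schema test could assert that find_overlaps returns an empty dict for a representative set of instances.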

Extras

Explicit JSON schema version

This PR also explicitly uses JSON schema Draft-07, instead of supporting all drafts and using jsonschema.validators.validator_for to determine which draft to use for a given schema.
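
To illustrate the difference between the two strategies, here is a short sketch using jsonschema's public API. The schema itself is made up for the example.

```python
from jsonschema import Draft4Validator, Draft7Validator
from jsonschema.validators import validator_for

# Example schema that declares an older draft via "$schema".
schema = {
    "$schema": "http://json-schema.org/draft-04/schema#",
    "type": "object",
}

# Draft detection: the validator class follows the schema's own "$schema".
detected_cls = validator_for(schema, default=Draft7Validator)

# Pinning: always validate as Draft-07, whatever "$schema" says.
pinned = Draft7Validator(schema)
```

With detection, this schema would be handled by Draft4Validator; with pinning, the same validator class (and therefore the same patched keyword handlers) is used for every schema.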

allowed_properties

Currently allowed_properties is for top-level properties only. If a top-level property is included in allowed_properties then all the descendant properties under it are recorded, even if some of them do not have all their categories whitelisted.

Censored fields

Properties that do not have all their categories allowed have their values set to null. There is no way to tell whether a property is null because its value is actually null or because it is hidden due to categories. Should we have a field in the metadata that specifies which properties are censored? For example,

"__masked__": [
  ["user", "email"],
  ["user", "id"]
]
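
A rough sketch of how such a list could be produced while filtering. The function name mask_event and its arguments are hypothetical; it assumes the categories have already been extracted into a mapping from property path to categories.

```python
def mask_event(event, categories_by_path, allowed_categories):
    """Set disallowed properties to None in place and return the masked paths.

    categories_by_path: dict mapping a path tuple such as ("user", "email")
    to the list of categories declared for that property.
    """
    allowed = set(allowed_categories)
    masked = []
    for path, categories in categories_by_path.items():
        if set(categories) <= allowed:
            continue  # every category is allowed, keep the value
        # Walk down to the parent object of the property.
        parent = event
        for key in path[:-1]:
            parent = parent.get(key) if isinstance(parent, dict) else None
        if isinstance(parent, dict) and path[-1] in parent:
            parent[path[-1]] = None
            masked.append(list(path))
    return masked
```

The returned list has the same shape as the "__masked__" example above, so it could be attached to the event metadata as is.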

Modify event in place?

UPDATE: We decided to just modify event in place for efficiency. Can't think of many use cases where the event is needed after being emitted, and even in that case users can just make a copy of the event themselves.

Category filtering means setting properties in the event to null, so to avoid modifying the event passed to record_event I did a deepcopy of the event at the beginning. This probably affects performance for large events, since deepcopy is notoriously slow. Should we skip that, that is, get rid of the deepcopy, modify the event in place, document this behavior, and let users who want to preserve the event make a copy themselves before passing it to record_event?
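
For callers who do need the original event afterwards, the in-place behavior implies a pattern like the following. Here record_event_in_place is a hypothetical stand-in for record_event, reduced to the masking step only.

```python
import copy


def record_event_in_place(event, masked_paths):
    """Hypothetical stand-in: null out the given paths directly on the event."""
    for *parents, leaf in masked_paths:
        target = event
        for key in parents:
            target = target[key]
        target[leaf] = None
    return event


original = {"user": {"email": "someone", "id": "1"}}

# The caller opts in to the cost of a copy only when the event is needed later.
preserved = copy.deepcopy(original)
record_event_in_place(original, [["user", "email"]])
```

After the call, original has been modified while preserved still holds the unfiltered values.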

@Zsailer
Member

Zsailer commented Jun 1, 2021

Overall, this PR looks great! Thanks, @kiendang. Just one minor comment left above.

@Zsailer
Member

Zsailer commented Jun 1, 2021

Should we have a field in the metadata that specifies which properties are censored?

Yes, I think so. I believe adding a masked field is the way to go here. Do you want to add it here in this PR or submit a separate PR?

I missed #61. Will comment over there when I get a chance.

Add docstrings
Rename stuff more appropriately
Expose only what's necessary
Member

@Zsailer Zsailer left a comment

Awesome work, @kiendang! Merging!

@Zsailer Zsailer merged commit d2ded1d into jupyter:master Jun 2, 2021
@kiendang kiendang deleted the categories-filter branch June 2, 2021 17:49
Successfully merging this pull request may close these issues.

Set categories for properties that are not top-level
2 participants