-
Notifications
You must be signed in to change notification settings - Fork 20
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Categories filtering for nested properties #59
Conversation
Move assert outside the util function so that diffs are displayed in pytest error message.
Otherwise some valid schemas will fail by "error" caused by category extraction in subschema.
Overall, this PR looks great! Thanks, @kiendang. Just one minor comment left above. |
I missed #61. Will comment over there when I get a chance. |
Add docstrings Rename stuff more appropriately Expose only what's necessary
7538b01
to
27b21e6
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Awesome work, @kiendang! Merging!
This PR adds support for setting
categories
for non top-level properties. Close #58Rules for the
categories
keyword:properties
subschemaif
,not
,then
,else
,anyOf
,oneOf
Implementation details
This makes use of
jsonschema
's validation for extracting details about categories.jsonschema
'siter_errors
lists all validation errors occurred during validation. The errors contain information about which part of the schema and which element of the json instance the error comes from among other things. By modifying the validator for theproperties
keyword so that it yields a custom errorExtractCategories
(inheritingjsonschema.ValidationError
) when thecategories
keyword is encountered, we can extract all categories declared in the schema.The implementation is based on https://python-jsonschema.readthedocs.io/en/stable/faq/#why-doesn-t-my-schema-s-default-property-set-the-default-on-my-instance
Examples of schemas that work
simple
allOf
reference
Caveats
Limitations
With this implementation,
categories
is treated as a validation error instead of annotation. It gets the job done, but comes with some limitations:Since
categories
raises a validation error, it interferes with JSON schema validation. Thus when recording event we have to run the equivalence ofjsonschema.validate
2 times, one for validation, the other for category extraction. Filtering categories and validating at the same time since category filtering would invalidate some otherwise valid json's. For example, settingcategories
for a property under theallOf
keyword would raise aExtractCategories
(which is also ajsonschema.ValidationError
) and would invalidate all perfectly valid json. This might affect performance. A solution being considered is using fastjsonschema for fast validation. See Use fastjsonschema for json schema validation #64.Supporting
categories
under keywords such asif
,then
,else
,anyOf
,oneOf
is non trivial.Things like this which could be useful, maybe for Allow schemas to be extended #52, are not supported
Semantically
categories
is more like an annotation instead of validation keyword and would be better implemented using annotation collection/output formats, a feature of JSON schema available in draft 2019-09 (formerly draft-08). Annotation collection/output formats aims to extend the use of JSON schema beyond validation and was designed for use cases like ours and would address these shortcomings. More details here and here. However support for draft 2019-09 in the pythonjsonschema
library is not available yet, though development has started. Also note that currently not all JSON schema implementations support draft 2019-09 and not all such implementations support output formats since this is a suggested but not required feature. Only a couple of implementations supporting output formats at the moment that I know of, in Perl and C#. Though with the new release of OpenAPI that supports draft 2019-09 and 2020-12 wider adoptions hopefully would follow.For us we could stick with this implementation for now and wait for draft 2019-09 and output formats.
Overlapped categories
There are situations where a property have
categories
declared multiple times. For example,Here
id
hascategories
declared twice, as["user-identifiable-information"]
and["user-identifier"]
.For instance
{"user": {"id": 1, "email": "[email protected]"}}
,id
hascategories
declared twice, as["user-identifiable-information"]
and["user-identifier"]
.The current stand, based on discussion with Min, is that we do not support this behavior and advise end users to design and test their event schema to make sure that each property only has their categories declared once.
In the current implementation if this situation happens then the properties will take whatever categories the validator encounters last, which is not deterministic.
Extras
Explicit JSON schema version
This PR also explicitly uses json schema DRAFT-07 instead of supporting all drafts and use
jsonschema.validators.validator_for
to determine which draft to use for a certain schema.allowed_properties
Currently
allowed_properties
is for top-level properties only. If a top-level property is included inallowed_properties
then all the descendant properties under it are recorded, even if some of them do not have all their categories whitelisted.Censored fields
Properties that do not have all their categories allowed got their values set to
null
. There is no way to differentiate whether a property isnull
because its value is actuallynull
or because it is hidden due to categories. Should we have a field in the metadata that specifies which properties are censored? For example,Modify event in place?
UPDATE: We decided to just modify event in place for efficiency. Can't think of many use cases where the event is needed after being emitted, and even in that case users can just make a copy of the event themselves.
Category filtering means setting properties in then event tonull
so to avoid modifying the event passed torecord_event
I did adeepcopy
on the event at the beginning. This probably affects performance for large event sincedeepcopy
is notoriously slow. Should we just skip that, meaning getting rid of thedeepcopy
and just modifying the event in place and putting it in the doc and if users want to preserve the event they could make a copy themselves before passing torecord_event
?