Add redactionPolicies field to Jupyter Event schemas #2

Zsailer · 2022-07-14T20:26:02Z

Highlights

Replaces the required field "categories" with "redactionPolicies" for all schemas to explicitly redact fields in an event that shouldn't be emitted.
A new required field on all properties called "unredactedPolicies" to explicitly state which fields are emitted.
A new setting/trait (list) in the EventLogger, redacted_policies, to configure which properties should be plucked/redacted from the outgoing events.

Public API changes

"categories" is no longer used for redacting sensitive information. Instead, the explicit key "redactionPolicies" is used to remove fields from an event.
- To redact these fields, the EventLogger object uses a new trait, a list, called redactedPolicies. Any property labelled with a policy listed in this trait will be removed from emitted events.
- By default, the EventLogger emits all events unless they are explicitly redacted. This is a major difference from the previous version, where the EventLogger redacted all events unless they were explicitly listed.
- You can prevent all events from being emitted by setting redactedPolicies="all"
"allowed_schemas", "allowed_categories", and "allowed_properties", have been removed from the EventLogger.
- All registered event types will be emitted.
- To redact any data, use the "redactedPolicies" field
- The expectation is that consumers will use filters in the logging handlers to listen for specific events.
"redactionPolicies" field is required on every schema and (nested) property object in a Jupyter event schema to pass validation.
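
As a rough illustration of the flow described above (not the code in this PR), the sketch below drops every property whose `redactionPolicies` labels intersect the configured list of redacted policies, including nested properties. The names `redact_event` and `_redact_properties`, the example schema and its `$id`, and the choice to drop the whole event when its top-level labels match are all assumptions made for illustration.

```python
from typing import Any, Dict, Iterable, Optional, Set, Union


def _redact_properties(
    properties: Dict[str, dict], data: Dict[str, Any], redacted: Set[str]
) -> Dict[str, Any]:
    """Copy `data`, skipping properties labelled with a redacted policy."""
    kept: Dict[str, Any] = {}
    for name, value in data.items():
        prop = properties.get(name, {})
        if redacted & set(prop.get("redactionPolicies", [])):
            continue  # at least one label is redacted: drop the property
        if isinstance(value, dict) and "properties" in prop:
            # Nested property objects carry their own labels.
            value = _redact_properties(prop["properties"], value, redacted)
        kept[name] = value
    return kept


def redact_event(
    schema: dict, data: Dict[str, Any], redacted_policies: Union[str, Iterable[str]]
) -> Optional[Dict[str, Any]]:
    """Return a redacted copy of `data`, or None if nothing should be emitted."""
    if redacted_policies == "all":
        return None  # assumption: "all" suppresses every event
    redacted = set(redacted_policies)
    if redacted & set(schema.get("redactionPolicies", [])):
        return None  # assumption: a matching event-level label drops the event
    return _redact_properties(schema.get("properties", {}), data, redacted)


# Hypothetical schema and event, using labels from the docs in this PR.
SCHEMA = {
    "$id": "event.example.org/file-saved",
    "version": 1,
    "redactionPolicies": ["category.jupyter.org/unrestricted"],
    "properties": {
        "path": {
            "title": "Path",
            "redactionPolicies": ["category.jupyter.org/user-identifiable-information"],
        },
        "action": {
            "title": "Action",
            "redactionPolicies": ["category.jupyter.org/unrestricted"],
        },
        "user": {
            "title": "User",
            "redactionPolicies": ["category.jupyter.org/unrestricted"],
            "properties": {
                "id": {
                    "title": "ID",
                    "redactionPolicies": ["category.jupyter.org/user-identifier"],
                },
                "role": {
                    "title": "Role",
                    "redactionPolicies": ["category.jupyter.org/unrestricted"],
                },
            },
        },
    },
}

EVENT = {
    "path": "/home/alice/demo.ipynb",
    "action": "save",
    "user": {"id": "alice", "role": "admin"},
}

partially_redacted = redact_event(SCHEMA, EVENT, ["category.jupyter.org/user-identifier"])
nothing_emitted = redact_event(SCHEMA, EVENT, "all")
untouched = redact_event(SCHEMA, EVENT, [])
```

With an empty list nothing is removed, which mirrors the emit-by-default behavior described above.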

Zsailer · 2022-07-21T19:22:42Z

cc @kiendang. This is a pretty major refactor, so I'd love to get your eyes on this. I'd like to merge and iterate at a high velocity, so that we can get this into Jupyter Server before the 2.0 final release. Since this library isn't likely used by anyone else at this point, I think that's okay (most folks are probably using telemetry).

A couple of things to note...I found that filtering on "categories" is a pretty challenging thing to maintain and confusing for users in practice. So I've replaced most of that logic with a simpler flow. I've added a required key, redactionPolicies, to every schema property. Then added a new List configurable trait on the EventLogger called redacted_policies which is a list of redaction policies to remove.

Also, I was seeing nasty memory leaks from jsonschema when we didn't cache the validators for validated schemas. Those have been addressed in this PR. See the top comment for all the changes I made.
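
The caching idea can be sketched roughly as follows: build a validator once per schema, keyed by id and version, and reuse it for every event instead of constructing a new one on each validation. `ValidatorCache` and its `factory` argument are hypothetical names and not the API in this PR; in practice the factory might wrap something like `jsonschema.validators.validator_for(schema)(schema)`, but a stand-in is used here to keep the example dependency-free.

```python
from typing import Any, Callable, Dict, List, Tuple


class ValidatorCache:
    """Build each validator once and reuse it on later lookups."""

    def __init__(self, factory: Callable[[dict], Any]) -> None:
        self._factory = factory
        self._cache: Dict[Tuple[str, int], Any] = {}

    def get(self, schema: dict) -> Any:
        # Assumption: "$id" plus "version" uniquely identify a schema.
        key = (schema["$id"], schema["version"])
        if key not in self._cache:
            self._cache[key] = self._factory(schema)
        return self._cache[key]


# Stand-in factory that records how often a validator is built.
built: List[str] = []


def fake_factory(schema: dict) -> object:
    built.append(schema["$id"])
    return object()


cache = ValidatorCache(fake_factory)
schema = {"$id": "event.example.org/demo", "version": 1}  # hypothetical schema
first = cache.get(schema)
second = cache.get(schema)  # served from the cache, no second build
```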

kevin-bates

This is exciting to see. I do think the redaction stuff complicates this significantly and, given the subjective nature of what is considered "sensitive" and by whom, I wonder if this should be configurable outside the schemas. The initial thinking (although I'm not sure I fully understand the object relationships) would be to let the Operator redact specific properties of event definitions.

Of course, permitting operators to determine redacted data can affect potential logic that wants to act on field data that is now redacted. I suppose this issue exists with the current policy mechanism anyway.

We should also nail down the redaction policy labels and their ownership.

Thanks for getting this going @Zsailer and @kiendang!

kevin-bates · 2022-07-21T18:23:14Z

docs/pages/configure.md

+# Explicitly list the types of events
+# to record and what properties or what categories
+# of data to begin collecting.


This comment doesn't seem to fit the code below it, probably due to the switch from categories to redaction policies and the fact the default behavior has also changed.

kevin-bates · 2022-07-21T18:26:24Z

docs/pages/redaction_policies.md

@@ -0,0 +1,5 @@
+# Redacting Sensitive Data
+
+Jupyter Events might possible include sensitive data, specifically personally identifiable information (PII). To reduce


Suggested change

Jupyter Events might possible include sensitive data, specifically personally identifiable information (PII). To reduce

Jupyter Events might possibly include sensitive data, specifically personally identifiable information (PII). To reduce

kevin-bates · 2022-07-21T20:50:08Z

docs/pages/schemas.md

+Jupyter Event Schemas must be valid [JSON schema](https://json-schema.org/) and can be written in valid
+YAML or JSON. Every schema is validated against Jupyter Event's "meta"-JSON schema, [here]().
+
+At a minimum, valid Jupyter Event schema requires have the following keys:


Suggested change

At a minimum, valid Jupyter Event schema requires have the following keys:

At a minimum, valid Jupyter Event schema requires the following keys:

kevin-bates · 2022-07-21T20:53:16Z

docs/pages/schemas.md

+
+- `$id` : a URI to identify (and possibly locate) the schema.
+- `version` : the schema version.
+- `redactionPolicies`: a list of labels representing the personal data sensitivity of this event. The main logger can be configured to redact any events or event properties that might contain sensitive information. Set this value to `"unrestricted"` if emitting that this event happen does not reveal any person data.


The last phrase doesn't parse for me. Take the suggested change with a grain of salt...

Suggested change

- `redactionPolicies`: a list of labels representing the personal data sensitivity of this event. The main logger can be configured to redact any events or event properties that might contain sensitive information. Set this value to `"unrestricted"` if emitting that this event happen does not reveal any person data.

- `redactionPolicies`: a list of labels representing the personal data sensitivity of this event. The main logger can be configured to redact any events or event properties that might contain sensitive information. Set this value to `"unrestricted"` if emitting this event does not reveal any personal data.

kevin-bates · 2022-07-21T20:57:22Z

docs/pages/schemas.md

+  Each property should have the following attributes:
+
+  - `title` : name of the property
+  - `redactionPolicies`: a list of labels representing the personal data sensitivity of this property. This field will be redacted from the emitted event if the policy is not allowed.


Suggested change

- `redactionPolicies`: a list of labels representing the personal data sensitivity of this property. This field will be redacted from the emitted event if the policy is not allowed.

- `redactionPolicies`: a list of labels representing the personal data sensitivity of this property. This field will be redacted from the emitted event if any of its `redactionPolicies` labels are listed in the event logger's `redactedPolicies` set.

kevin-bates · 2022-07-21T21:41:23Z

jupyter_events/schema_registry.py

+        )
+        self._add(schema)
+
+    def get(self, id: str, version: int) -> EventSchema:


Is id the name of the schema? if so, could we call these parameters name or schema_name? Using id has a connotation of an identifier like an integer or UUID.

kevin-bates · 2022-07-21T21:47:03Z

docs/pages/schemas.md

+## Redaction Policies
+
+Each property can be labelled with `redactionPolicies` field. This makes it easier to
+filter properties based on a category. We recommend that schema authors use valid


?

Suggested change

filter properties based on a category. We recommend that schema authors use valid

filter out properties based on a redaction policy. We recommend that schema authors use valid

kevin-bates · 2022-07-21T21:47:43Z

docs/pages/schemas.md

+- `category.jupyter.org/unrestricted`
+- `category.jupyter.org/user-identifier`
+- `category.jupyter.org/user-identifiable-information`
+- `category.jupyter.org/action-timestamp`


Hmm. If different event providers have their own definition of what is PII (including their own labels, but even perhaps not), how does an Operator:
a) determine what set of labels to add to the redactedPolicies property on the event logger?
b) change a given property's redaction criteria because they happen to deem the current settings inadequate?

kevin-bates · 2022-07-21T21:51:34Z

jupyter_events/schema_registry.py

+
+    def __init__(self, schemas: dict = None, redacted_policies: list = None):
+        self._schemas = schemas or {}
+        self._redacted_policies = redacted_policies


I guess I'm not following the relationship between EventLogger, EventSchema, and SchemaRegistry.

Are any of these singletons? I figured SchemaRegistry would be a singleton but now I suspect it's single-valued relative to an EventLogger and there's an EventLogger per what? Per event definition (e.g., content.file-saved) or event class (e.g., content)?

I'm assuming EventSchema is a per event definition thing.

kevin-bates · 2022-07-21T21:53:20Z

jupyter_events/schema_registry.py

+        schema.validate(data)
+
+    def process_event(self, id: str, version: int, data: dict) -> None:
+        """Validate and event and enforce an redaction policies (in place).


Suggested change

"""Validate and event and enforce an redaction policies (in place).

"""Validate an event and enforce its redaction policies (in place).

kevin-bates · 2022-07-21T22:19:40Z

@Zsailer - I just submitted a first review and noticed that you made a few commits during that time - so I apologize for any overlap.

I think seeing an object relationship diagram of sorts would be super helpful.

While poking around the web at PII definitions and event logging systems I found this Zendesk feature request - which implies the subjective nature of PII.

Zsailer · 2022-07-22T16:47:25Z

Thank you for the review, @kevin-bates! This is great!

Kevin and I chatted about this PR yesterday and had a good discussion about the redaction policies logic. We put together the following plan:

Split out the redaction policy and redaction enforcement into a separate PR here
Open a PR in Jupyter Server with some basic events (without redaction).
- This will help us determine how to properly enable redaction policy
Iterate on this PR based on the needs of Jupyter Server

The thing that's challenging with "redactionPolicies" here is that there are instances where you'd want one event handler to redact data, while another handler does not. For example, in the Jupyter Server's Event system/bus, redacting data might break client-side extensions that depend on having all the data. If we can create a way for the data to never leave our system (or at least encrypt it in transit), we should be able to send non-redacted data to the event bus for these extensions, while handlers that direct data outside of the Jupyter Server (e.g. to file) get a redacted version of the data.
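
A minimal sketch of that per-handler idea, using only the standard `logging` module: one handler receives the full event while another receives a redacted copy. `ListHandler`, `RedactingHandler`, the logger name, and the event keys are hypothetical, and for brevity the redaction here is by property name rather than by policy label.

```python
import copy
import logging


class ListHandler(logging.Handler):
    """Collects event payloads in memory (stand-in for a real sink)."""

    def __init__(self) -> None:
        super().__init__()
        self.events = []

    def emit(self, record: logging.LogRecord) -> None:
        self.events.append(record.msg)


class RedactingHandler(logging.Handler):
    """Forwards a redacted copy of each event to an inner handler."""

    def __init__(self, inner: logging.Handler, redacted_keys) -> None:
        super().__init__()
        self.inner = inner
        self.redacted_keys = set(redacted_keys)

    def emit(self, record: logging.LogRecord) -> None:
        # Copy the record so other handlers still see the original data.
        clone = copy.copy(record)
        clone.msg = {
            key: value
            for key, value in record.msg.items()
            if key not in self.redacted_keys
        }
        self.inner.handle(clone)


log = logging.getLogger("sketch.events")
log.setLevel(logging.INFO)
log.propagate = False

bus = ListHandler()        # internal consumer: receives everything
file_sink = ListHandler()  # external consumer: receives redacted data
log.addHandler(bus)
log.addHandler(RedactingHandler(file_sink, {"username"}))

log.info({"path": "notebook.ipynb", "username": "alice"})
```

Redacting a copy inside the handler, rather than mutating the record in a filter, keeps one handler's redaction from leaking into the others.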

kevin-bates · 2022-07-22T18:15:05Z

Thanks @Zsailer.

For completeness, I think another challenge is the subjective nature of what is sensitive. Although we don't have to deal with explicitly personal information as part of our inherent framework, it still can be encountered. For example, path may be fine in most installations, but could also contain PII in others, so it seems like we should expose the ability to control policies on various properties. Perhaps decoupling what is sensitive/redacted from the event schema may be a better fit.

Zsailer · 2022-07-22T18:25:16Z

I almost wonder if we should replace redactionPolicies with recommendedRedactionPolicies in the schema. I understand the reason for decoupling the redaction enforcement from the schema definition, but I still feel that there is some responsibility for the schema to label its properties with sensitivity labels. Maybe those labels aren't used for redaction enforcement, but they help consumers understand the risk of collecting the data in question.

Zsailer · 2022-07-22T18:30:12Z

When actually redacting the data, I agree that we need some way for people to apply their own policies to make this system useful. Otherwise, this system will become too restrictive for anyone to use. We want something flexible enough for people to use, while remaining data conscious and safe for users.
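
One way to picture operators applying their own policies: labels in the schema act as defaults, and an operator-supplied override map takes precedence per property. Everything in this sketch (`effective_policies`, the layout of the override map, the property names) is hypothetical and only meant to make the idea concrete.

```python
from typing import Dict, List


def effective_policies(
    schema_props: Dict[str, dict], overrides: Dict[str, List[str]]
) -> Dict[str, List[str]]:
    """Resolve the labels for each property; operator overrides win."""
    return {
        name: list(overrides.get(name, prop.get("redactionPolicies", [])))
        for name, prop in schema_props.items()
    }


# Labels as shipped by a (hypothetical) schema author.
schema_props = {
    "path": {"redactionPolicies": ["category.jupyter.org/unrestricted"]},
    "action": {"redactionPolicies": ["category.jupyter.org/unrestricted"]},
}

# An operator who considers paths sensitive in their deployment.
operator_overrides = {
    "path": ["category.jupyter.org/user-identifiable-information"],
}

policies = effective_policies(schema_props, operator_overrides)
```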

kiendang · 2022-07-22T19:44:45Z

Thanks @Zsailer and @kevin-bates, I have been monitoring this PR as well. I have a couple of thoughts here and am happy to be involved in later discussions.

Agree with Zach that there might have already been solutions to this existing out there. Will be happy to take a look and report back what I find.
Earlier I had a much harder time understanding redaction policies compared to categories, so would be nice if we could go with a solution that not only covers all bases but is also intuitive, though clear documentation/examples help a lot here.
If I remember correctly, Jupyter Telemetry aimed to be opt-in. If I understand correctly, in this redaction approach everything is emitted by default and users have to actively set policies to filter out sensitive fields, so is that in any way conflicting with our commitment to opt-in?
Regarding the implementation I was working on a way to filter out categories more efficiently as part of a WIP to solve Handle category-filtered fields telemetry#61, will see if I can apply the same approach here and submit a PR later.

kevin-bates

Good stuff, just a few comments.

Since, iirc, the redaction policy stuff will be separately implemented, should the title and description be revised? Or is the idea that this will be the vehicle for that implementation?

docs/user_guide/defining-schema.md

docs/user_guide/event-schemas.md

docs/user_guide/first-event.md

jupyter_events/logger.py

Zsailer · 2022-08-10T18:30:33Z

Thanks, Kevin!

Since, iirc, the redaction policy stuff will be separately implemented, should the title and description be revised? Or is the idea that this will be the vehicle for that implementation?

Yeah, sorry for the lack of clarity here.

I've moved all my work to #4. That's where I think we should continue review on the current work. At a later time, I'll move the redaction stuff back to this PR, because there were a lot of good comments around the redaction stuff in this PR thread. I didn't want to lose that history.

For now, I'll keep the title as-is and leave the PR in draft state. I'll move the redaction policy stuff here once #4 is finished and merged.


Zsailer added 2 commits July 13, 2022 15:06

Add an EventSchema and SchemaRegistry API

1090f7d

working on tests

91a9501

Zsailer marked this pull request as draft July 14, 2022 20:26

blink1073 added the enhancement New feature or request label Jul 14, 2022

Zsailer added 5 commits July 19, 2022 16:03

working unit tests

9c36c10

add myst docs

c42cbcd

use myst for documentation

433832d

working tests

4b83963

unit test working with latest jsonschema

7f17a63

Zsailer mentioned this pull request Jul 21, 2022

Meeting Notes 2022 jupyter-server/team-compass#15

Closed

Zsailer added 2 commits July 21, 2022 12:12

protect reserved property names (starting with __)

b972ed3

protect reserved property names (starting with __)

be49d43

Zsailer added 3 commits July 21, 2022 13:45

more typing

a59619e

update readme

ea70c1d

precommit

5339800

kevin-bates reviewed Jul 21, 2022


Zsailer mentioned this pull request Jul 22, 2022

An event system for Jupyter jupyter-server/jupyter_server#780

Open

remove redacted policies

b90ce6e

Zsailer mentioned this pull request Jul 28, 2022

Add redactionPolicies field to Jupyter Event schemas #3

Closed

Zsailer added 2 commits August 9, 2022 15:13

rewrite most of the docs

3b98e7c

changelog update

f8f848c

Zsailer mentioned this pull request Aug 9, 2022

Add new EventSchema and SchemaRegistry API #4

Merged

Zsailer changed the title ~~Add new EventSchema and SchemaRegistry API~~ Add redactionPolicies field to Jupyter Event schemas Aug 9, 2022

Zsailer added 3 commits August 9, 2022 15:39

set version for now

492f5ae

remove link check

8390457

cleanup

98ac7c7

kevin-bates reviewed Aug 10, 2022


consolidate loading logic

461fa76

Zsailer added 3 commits August 10, 2022 11:31

remove unused import

f92a066

add a warning to capture the first time a non registered event is emi…

8d43c5b

…tted

separate the check for handler and registered schemas in the emit method

4659ddf

Zsailer merged commit 88acd8e into jupyter:main Aug 11, 2022
