-
Notifications
You must be signed in to change notification settings - Fork 20
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Enable finer-grained recording of event properties and categories. #46
Conversation
This seems like a good thing! I have one suggestion that I hope will make it even better 😉 The examples seems to use personal information (and degrees of this) as the only measure of sensitivity. I understand why this would be the top priority, and why a shorthand exist on the top-level and server config. That being said, I think this should support other kinds of sensitivity as well (and including them in the examples might make it easier to keep this in mind and reason about):
The main relevant points from this is that it might be relevant to assign several categories to one schema field (
Hope this makes sense, and that it can be added without detracting from achieving the personal data protection goal 👍 |
I agree with the concept of multiple categories for the reasons @vidartf states above (having needed to use such qualifiers many times at work). |
@vidartf good point on the possibility of multiple categories. For an inclusive match, should it be We can, of course, accomplish most of these with a flat list as long as enough unique tags are applied, but that probably gets untenable. It seems like maybe there's a case to be made for a more sophisticated label filter with arbitrary user filtering logic, as in kubernetes label filters? Should there be an arbitrary (one-level all strings) dict of labels to enable this? I think categories is a great start and we probably can solve this in a second pass with more generic 'labels', so we don't hold up progress, but I'm okay, too, if folks want to try to solve this better in the first pass. |
Thanks, @vidartf! I totally agree with everything you said. Allowing administrators to list multiple categories is probably something we should allow here.
Is the "unrestricted" category different than the "unknown" label? "unrestricted" essentially means "always record", which sounds similar to your use-case of "unknown". If these are equivalent, do you think "unknown" is a better label? |
@minrk, I think it should be I would like to err on the side of "more explicit" when handling personal data. Because this field is particularly targeted at handling personal data, I would rather data not be collected by mistake, rather than have an unexpected piece of data slip through from an This forces administrators to list every collected category explicitly in their config. It's more work on administrators, but safer for personal data. "Label filtering" with arbitrary filtering logic is an interesting idea, but again, I'm not sure we want to allow this when handling personal data. I would rather add more constraints that prevent major accidental breaches of personal data collection. Keep in mind, because we're using Python's logging standard library, admins can easily add their own filters to the logging handlers they pass to the EventLog (see my example here). It would be up to the schema authors to create separate properties for more clever filtering (preferably not on sensitivity level). Perhaps the issue is that "category" is too broad of a term. Maybe we should narrow this property to something like "pii-category". |
Thanks, then I think we can get started with a single category field, and require opt-in to all individual categories for now. Then maybe wait for a scenario where this becomes untenable before working on a next revision? |
Sorry for being slow to respond, I had a week off from coding. Given the example case, I might have misunderstood the PR so please don't stall on me. I was under the impression that the subscription was of the type "everything but |
It would be close to opposite, indicating "this message might match any and all categories" (e.g. if you for some reason wanted to log all kernel messages, it is hard to be exhaustive about which kinds of restriction categories it will match). Think of it as a |
Thanks, @vidartf. I think I understand what you mean. You're asking if there's a way to subscribe to multiple events without explicitly listing every event type—i.e. subscribe to all events matching some category. This is slightly different. This is mostly a post-processing mechanism that drops properties/fields/attrs from an event that includes personal data. It scans each incoming event (that you're already subscribed to) and drops any properties of that event that includes personal-data. The event itself is still collected, but won't include these properties. (You'll still record the type of event, timestamp, etc., just not any properties with categories not white-listed). That said, I did add one extra fail-safe switch to prevent collecting the events entirely if they include personal data. That's the |
I've updated this PR to allow multiple categories. I'd love another review before merging. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Apologies for the late review, and thank you very much for explicitly pinging me off-github, @Zsailer!
docs/pages/personal-data.rst
Outdated
|
||
from jupyter_telemetry import EventLog | ||
|
||
e = EventLog(collect_personal_data=True) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This will much more likely be set by operators with:
c.EventLog.collect_personal_data = True
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we make operators explicitly specify the list of schemas they want (which we already do), this would possibly unintentionally hide that in slightly hard to debug ways. I would love for us to move everything to a per-schema setting, possibly just using categories. Maybe by default only 'unrestricted' will be recorded unless you explicitly opt in? Should have the same effect as this, but be more explicit.
docs/pages/personal-data.rst
Outdated
|
||
e = EventLog( | ||
collect_personal_data=True, | ||
allowed_categories=['personally-identifiable-information'] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since this is set at an EventLog level, this doesn't let us tune what kinda information we want to collect from which schema. In one schema, 'personal-identifier' might be a pseudonym, and in another it could be your phone number. Those two should be treated completely differently, and keeping this at the global level doesn't allow admins to do that. Schemas could use something more specific than 'personal-identifier', but that requires co-ordination between different schema writers to make sure they don't overlap. This pseudonym / phone number mixup could happen unintentionally, for example, when you have a different spawner / authenticator combo.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What do you think about moving this instead to allowed_schemas? Instead of strings, you can optionally add a dict with schema name, allowed categories, etc.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like this idea a lot, @yuvipanda. Let me refactor things a bit and see what we think.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One thing that isn't clear here—should individual handlers dictate what type of data is logged by that handler? Right now, allowed_schemas
is set at the EventLog
level. Should we, instead, set allowed_schemas
(with extra metadata about categories) in each handler?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it should be layered - handlers inherit allowed_schemas
but can impose their own.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't know how to GitHub, tried to hit 'Request Changes' last time...
Ok! I've taken another crack at this given @yuvipanda's feedback. I think it's much better! The # User config to explicitly list what data should be recorded.
allowed_schemas = {
"uri.to.schema": {
# explicitly list properties to record
"properties": ["name", "email"],
# and/or list categories to record
"categories": ["personal-information"]
}
} If an event's property is explicitly listed, it will be recorded in that event's logged data. Otherwise, it will only be recorded if the property's categories matches the listed category. (<-- my only question is whether this should be an |
Closing and reopening as a new PR, since this work has changed dramatically. |
This replaces #30 and takes a different approach to handle personal data.
First, it's important to state under this PR (this is in the docs, too)—JT does not collect personal data+personally identifiable information by default.
This PR requires all event schemas to explicitly state whether they collect personal data or not. Further, it requires operators to explicitly opt-in to collecting personal data. Otherwise, no personal data will ever be collected.
Changes to the schema definition
personal-data: <bool>
, explicitly stating if the event would collect personal datacategory
field.unrestricted
is a special value for properties that are always safe to collect. Otherwise, category is a free field to schema authors and should define what type of data each event property is.Changes to the EventLog API.
personal-data: true
by default.collect_personal_data
. This must be set toTrue
to record events withpersonal-data: true
.collect_personal_data=True
, the EventLog will collect PII events, but remove all properties that aren't explicitly labeled ascategory: unrestricted
.allowed_categories
. EventLog will not remove properties whose categories are listed in the `allowed_categories`` trait.allowed_categories
attribute that supersedes theEventLog
trait.I think this addresses most of the points raised in #30. This does not apply any hashing/encryptions to personal data. I think we should definitely add that in a future PR. This is a step in the right direction that balances security and flexibility.
Example Snippets
Here is a simple example of a schema under this PR:
Here is an example of an EventLog configured to collect PII:
Let me know what y'all think. @ellisonbg @blink1073 @yuvipanda @jaipreet-s @minrk