JupyterLab Telemetry Design #8

jaipreet-s · 2019-09-15T23:48:24Z

Hi all,

I'm doing a quick recap of the proposed design as we're going to begin implementation on this basis.

Publishing and consuming events

Publishers call the recordEvent() API on the EventLog which emits event to each configured handlers. Handlers can be configured on the event log during instantiation. The EventLog will internally do schema validation and filter
events with allowed schemas.

const el = new EventLog({
  handlers: [handler1, handler2],
  allowedSchemas: ['org.jupyter.foo', 'org.jupyterlab.commands.docmanager:open']
})
el.recordEvent({
  schema: 'org.jupyter.foo',
  version: 1,
  body: {
    'foo': 'bar'
  }
});

The fan-out of an event to each handler is done using PhosphorJS Signals and each handler is simply a function with the signature

function consoleHandler(el: EventLog, events: EventLog.RecordedEvent[]) {
  console.log(`[Handler1] Received events ${JSON.stringify(events)}`)
}

There will be an in-built handler which sends events to the /eventlog endpoint on the notebook server.

JupyterLab commands

The EventLog has some configuration options which causes it to auto-subscribe to JupyterLab command executions (such as new Notebook opened, or terminal closed) and emit the to each handler. They still need to be whitelisting using the allowedSchemas property. This will be based off of Ian Rose's prototype

// This will record the "docmanager:open" command batched every 2 seconds
const el = new EventLog({
  handlers: [consoleHandler, consoleHandler2],
  allowedSchemas: ['org.jupyterlab.commands.docmanager:open'],
  commandRegistry: app.commands,
  commandEmitIntervalSeconds: 2
})

User Transparency

To make it transparent to users what evens are being recorded, the EventLog will publish the set of allowed event schemas to the JupyterLab Settings Registry. Users will be able to view what events are being recorded and can also modify the settings to opt-out of certain events.

{
    // Jupyter Telemetry
    // @jupyterlab/telemetry:plugin
    // Telemetry plgugin
    // *****************************************

    // Allowed telemetry events
    "allowedSchemas": [
        "org.jupyter.foo",
        "org.jupyterlab.commands.docmanager:open"
    ],
    // Disable or enable telemetry completely
    "telemetryEnabled": true
}

We'll have to deal with aggregating allowedSchemas from all EventLog instances but that is an implementation detail.

cc @Zsailer @yuvipanda @zuoyuanh

The text was updated successfully, but these errors were encountered:

Zsailer · 2019-09-16T17:15:25Z

Thanks, @jaipreet-s! This is great—I'm excited to see this move forward!

Just a few comments...

each handler is simply a function

I'm not sure if we're shooting for the Python and JS side to match exactly, but I wanted to bring up a minor difference I see here in the Handlers. The logging handlers we're using in Python are objects with a few extra helpful methods for filtering, formatting, etc. (I wrote an jupyter_config example for filtering here). If we want to follow this design on the JS side, handlers would objects with this design. What do you think?

There will be an in-built handler which sends events to the /eventlog endpoint on the notebook server.

(*thinking out loud here)

Is the /eventlog endpoint a part of the core server or a server extension?

If it's part of the core server, what should the server do with these events? Assuming the source is trusted (a whole other converation, started in jupyter/telemetry#21), does the server pass incoming events hitting the /eventlog endpoint to all Python handlers on the server? Should the Python handlers be able to to validate, filter, etc. this data?

If it's a server extension, jupyterlab-telemetry would have it's own EventLog that users/admins configure separately from the notebook server. That seems a bit cleaner...

jaipreet-s · 2019-09-16T17:45:07Z

@Zsailer - Thanks for the feedback!

If we want to follow this design on the JS side, handlers would objects with this design. What do you think?

In the Python side, we got filtering with little at the handler level because of the in-built capabilities of the logging module. There isn't an equivalent I'm aware of in the TS world. I'm proposing we take this incrementally by starting with filtering at the entire EventLog and then extend it to each handler, as required.

Is the /eventlog endpoint a part of the core server or a server extension?

This will be the /api/eventlog endpoint that is present on the notebook server (which is WIP as part of the PR in notebook jupyter/notebook#4752 and the relevant file)

Assuming the source is trusted (a whole other converation, started in jupyter/telemetry#21), does the server pass incoming events hitting the /eventlog endpoint to all Python handlers on the server? Should the Python handlers be able to to validate, filter, etc. this data?

Yes - the intention is to pass all whitelisted events to the server, and let the server EventLog configuration take care of which handlers to send the event to. This is useful in cases where an operator still wants to events occurring at the browser but keep common configuration across the server and browser.

BTW I'm missing some permissions on this repo which I have at the jupyter/telemetry repo, could you please check? :)

Zsailer · 2019-09-16T18:10:09Z

@jaipreet-s I added you to the list of maintainers :)

Zsailer · 2019-09-16T18:33:13Z

we got filtering with little at the handler level ... I'm proposing we take this incrementally

Right. Sounds good. I agree with taking an incremental approach. My question is whether we should start with a Handler object now (that lacks filtering etc. functionality for now) rather than using a function as the starting place. That way, we're not asking users to change from functions objects later? Or is this over-engineering 😆

This will be the /api/eventlog endpoint that is present on the notebook server.

Ah, yes. I forgot we added this in the notebook PR. I remember now... This means both EventLogs, server-side and client-side, must be configured by operators with the same set of whitelisted events. Then, validation happens twice, client-side then server-side. Correct? I'm not saying this is wrong, I'm just making sure I understand correctly :)

jaipreet-s · 2019-09-16T20:53:48Z

My question is whether we should start with a Handler object now (that lacks filtering etc. functionality for now) rather than using a function as the starting place.

The notion behind the function is because the "fan-out" of an event to each handler is managed by PhosphorJS signals, which accept a callback function for subscriber of an event. That is why the proposal was to start with a function so we get that aspect scalability for free. It looks very feasible to extend this to a Handler class and invoke the handleEvent() method of the handler instead.

This means both EventLogs, server-side and client-side, must be configured by operators with the same set of whitelisted events.

No its a good point - I've been thinking about this for other use-cases, and it may make sense to support regexes in the allowedSchemas. For e.g., a operator wanting to whitelist all JupyterLab commands could do allowedSchemas: [org.jupyterlab.commands.*]. An operator wanting to purely control the whitelisting configuration at the server, without duplicating to the client-side, could whitelist all events at the client-side using allowedSchemas: [*] and let the whitelist on the server do the actual work.

Zsailer · 2019-09-16T21:11:36Z

The notion behind the function is because the "fan-out" of an event to each handler is managed by PhosphorJS signals, which accept a callback function for subscriber of an event. That is why the proposal was to start with a function so we get that aspect scalability for free. It looks very feasible to extend this to a Handler class and invoke the handleEvent() method of the handler instead.

👍 great, that makes perfect sense to me. Thanks for explaining.

it may make sense to support regexes in the allowedSchemas.

That's a cool idea. We could definitely implement something like this. I'd be curious to hear what @yuvipanda thinks.

Zsailer · 2019-09-16T21:12:05Z

This proposal looks great to me!

jaipreet-s · 2019-09-29T23:10:56Z

xref #9 which implements this

ellisonbg · 2019-10-03T04:33:48Z

Finally having a look at this. One question that comes up is about the relationship between handlers and allowed schemas. My understanding is that we would like an event log to be able to send different events to different handlers. For example, an operator might be sending events from one set of schemas to one handler, but an extension author may be collecting a different schema and sending those somewhere else. From the above discussion, it sounds like this usage case would be handled in the following manner:

The EventLog would need to be configured with the union of all allowedSchema that all handlers would want to collect.
Each individual handler would then be configured to filter out the schemas it isn't interested in.

A pro of this model is that there is a single authoritative source for the all allowedSchemas, so a user doesn't have to worry about tracking down that information across different places. A con however, is that the use won't actually know precisely which data is being sent where. Is this evaluation accurate? Where this becomes really critical to understand is with private or sensitive data. If some of the allowedSchemas emit such data, as a user I would want to know exactly where that data is actually being sent, not just where the data might be sent.

I will have a look at the implementation now as well. Thanks for this work and discussion!

Zsailer · 2019-10-03T04:53:40Z

My understanding is that we would like an event log to be able to send different events to different handlers.

👍

use won't actually know precisely which data is being sent where

@ellisonbg allowedSchemas acts as the first "filter" in the EventLog object. Then, each handler can have additional filters to route subsets of allowedSchemas to different places. We should be able to query each handler for the schema subset they allow. We can surface this information to users so they know where events are being sent.

jaipreet-s · 2019-10-04T22:06:25Z

We should be able to query each handler for the schema subset they allow. We can surface this information to users so they know where events are being sent.

This can be codified in the Handler interface as well, for e.g. the allowedSchemas can be an abstract property. The EventLog then can query each handler's allowedSchemas property and take care of surfacing that to the user

ellisonbg · 2019-10-06T16:15:11Z

Good point!

…

On Fri, Oct 4, 2019 at 3:06 PM Jaipreet Singh ***@***.***> wrote: We should be able to query each handler for the schema subset they allow. We can surface this information to users so they know where events are being sent. This can be codified in the Handler interface as well, for e.g. the allowedSchemas can be an abstract property. The EventLog then can query each handler's allowedSchemas property and take care of surfacing that to the user — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub <#8?email_source=notifications&email_token=AAAGXUEWRPVZUKL34UMBVP3QM646FA5CNFSM4IW4MMSKYY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOEANADIA#issuecomment-538575264>, or mute the thread <https://github.com/notifications/unsubscribe-auth/AAAGXUEUOC22PFZZG2U3YBLQM646FANCNFSM4IW4MMSA> .

-- Brian E. Granger Principal Technical Program Manager, AWS AI Platform ([email protected]) On Leave - Professor of Physics and Data Science, Cal Poly @ellisonbg on GitHub

yuvipanda · 2019-10-08T11:00:10Z

Thanks for working on this, @jaipreet-s!

I'm curious what you think of the schema naming recommendations here: https://github.com/jupyter/telemetry/blob/master/proposal/design.md#schema-naming-recommendations. I'd love for us to follow that for events we emit from inside the project, for the reasons outlined there.

jaipreet-s · 2019-10-08T17:22:42Z

Agreed @yuvipanda. #11 tracks the implementation work for updating schema names.

jaipreet-s mentioned this issue Oct 8, 2019

Update schema names to be consistent with server-side #11

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

JupyterLab Telemetry Design #8

JupyterLab Telemetry Design #8

jaipreet-s commented Sep 15, 2019

Zsailer commented Sep 16, 2019

jaipreet-s commented Sep 16, 2019

Zsailer commented Sep 16, 2019

Zsailer commented Sep 16, 2019

jaipreet-s commented Sep 16, 2019 •

edited

Loading

Zsailer commented Sep 16, 2019

Zsailer commented Sep 16, 2019

jaipreet-s commented Sep 29, 2019

ellisonbg commented Oct 3, 2019

Zsailer commented Oct 3, 2019 •

edited

Loading

jaipreet-s commented Oct 4, 2019

ellisonbg commented Oct 6, 2019 via email

yuvipanda commented Oct 8, 2019

jaipreet-s commented Oct 8, 2019

JupyterLab Telemetry Design #8

JupyterLab Telemetry Design #8

Comments

jaipreet-s commented Sep 15, 2019

Publishing and consuming events

JupyterLab commands

User Transparency

Zsailer commented Sep 16, 2019

jaipreet-s commented Sep 16, 2019

Zsailer commented Sep 16, 2019

Zsailer commented Sep 16, 2019

jaipreet-s commented Sep 16, 2019 • edited Loading

Zsailer commented Sep 16, 2019

Zsailer commented Sep 16, 2019

jaipreet-s commented Sep 29, 2019

ellisonbg commented Oct 3, 2019

Zsailer commented Oct 3, 2019 • edited Loading

jaipreet-s commented Oct 4, 2019

ellisonbg commented Oct 6, 2019 via email

yuvipanda commented Oct 8, 2019

jaipreet-s commented Oct 8, 2019

jaipreet-s commented Sep 16, 2019 •

edited

Loading

Zsailer commented Oct 3, 2019 •

edited

Loading