Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(satellite,electric): Implement distributed tracing using OpenTelemetry in the client #1371

Open
wants to merge 41 commits into
base: main
Choose a base branch
from

Conversation

alco
Copy link
Member

@alco alco commented Jun 16, 2024

This is a follow-up to #1370.

Besides adding OT to the client, here we also implement the passing of client's tracing context to the server in RPC calls across the network boundary. The GH Actions workflows that run E2E tests are updated to enable the export of client traces to Honeycomb.io where each test gets its own ci.test_file attribute and a bunch of other CI-related attributes.

@alco alco force-pushed the alco/ot-export branch 4 times, most recently from 6b8a321 to 6076eaa Compare June 17, 2024 09:30
@alco alco force-pushed the alco/open-telemetry-in-client branch 9 times, most recently from c1ca87b to 91ce330 Compare June 17, 2024 14:41
@alco alco force-pushed the alco/ot-export branch 3 times, most recently from 8d9f744 to ccddd36 Compare June 17, 2024 23:01
@alco alco force-pushed the alco/open-telemetry-in-client branch from 91ce330 to 8a74517 Compare June 17, 2024 23:01
@alco alco force-pushed the alco/open-telemetry-in-client branch 3 times, most recently from 990e86f to e2ecd84 Compare June 18, 2024 11:04
msfstef and others added 23 commits June 19, 2024 11:43
These attributes are set on a "resource" which in our case is a
single CI job within a workflow.
This allows us to filter traces by test.

Unfortunately, LUX does not have a built-in way to get the current file
name, so we have to do some manual work for this.
@alco alco force-pushed the alco/open-telemetry-in-client branch from 014e5bd to 5a05ee9 Compare June 19, 2024 08:43
@alco alco changed the title Alco/open telemetry in client feat(satellite,electric): Implement distributed tracing using OpenTelemetry in the client Jun 19, 2024
@alco alco marked this pull request as ready for review June 19, 2024 10:19
Copy link
Contributor

@kevin-dp kevin-dp left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't really like the additional complexity added by the telemetry calls.
For example, the shape sync method which used to be 5 LoC is now >30 LoC.
Also, we know that we also want to add some improved error logging, which might add yet other code unrelated to the core logic.

Looks like the ideal use case for some aspected-oriented programming but haven't seen good AoP solutions for JS. Essentially, i want to move the telemetry logic outside of the core logic here. Perhaps the chain-of-responsibility pattern could solve this problem but we would have to see if and how that fits our implementation. If we're fine with only tracing public API methods, we can perhaps go for the chain-of-responsibility pattern.

@@ -88,6 +88,7 @@ jobs:
ELECTRIC_IMAGE_NAME: electric-sql-ci/electric
ELECTRIC_CLIENT_IMAGE_NAME: electric-sql-ci/electric-ws-client
ELECTRIC_IMAGE_TAG: ${{ env.ELECTRIC_VERSION }}
_OTEL_RESOURCE_ATTRIBUTES: deployment.environment=CI,ci.workflow=${{ github.workflow }},ci.job=${{ github.job }},ci.run=${{ github.run_id }},ci.branch=${{ github.head_ref }}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why does this env var start with an underscore?
In the docs the env var does not have the underscore at the start.

@@ -55,14 +61,18 @@ export class RPC {
public request(
_service: string,
method: string,
message: Uint8Array
message: Uint8Array,
reqOptions?: object
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we can type this more precisely using the generated SatRpcRequestOptions type: Partial<SatRpcRequestOptions>. If you want to make it 100% you'd have to use DeepPartial as used by protobuf-ts but is not exposed, as a workaround we could use reqOptions?: Parameters<typeof SatRpcRequestOptions.create>[0] (untested).

/**
* Wrap an RPC instance to augment RPC requests with trace propagation.
* Potenitally slight overkill, could directly add the propagation data
* on
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

broken sentence? on what?

* @param rpc RPC instance to wrap
* @returns A proxy around the RPC instance
*/
export function withRpcRequestTracing(rpc: RPC): RPC {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Although the proxy is a bit complicated, i like how the tracing logic is separate from the core logic.

@@ -358,6 +368,7 @@ export class SatelliteProcess implements Satellite {
}

private async _stop(shutdown?: boolean): Promise<void> {
const span = startSpan('satellite.process.stop')
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What would happen if _stop is called twice and they run concurrently?
i.e. the 2nd invocation would call startSpan again with the same label, what happens in that case?
Does it create a separate span that is somehow distinct from the first, or does it aggregate them, or does it ignore the 2nd call, or it throws an error, or something else? Tried looking it up but couldn't immediately find an answer in the docs.

@@ -178,6 +178,13 @@
},
"dependencies": {
"@electric-sql/prisma-generator": "workspace:*",
"@opentelemetry/api": "^1.9.0",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Btw, found this in the docs about dependencies:

If you’re instrumenting a library, only install the OpenTelemetry API package for your language. Your library will not emit telemetry on its own. It will only emit telemetry when it is part of an app that uses the OpenTelemetry SDK. For more on instrumenting libraries, see Libraries.

Are we somehow different that we do need all the dependencies? Or could we move them into optional peer dependencies?

}

/**
* Sets the span status to `ERROR` and optinally records the provided error
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

typo in "optinally"


/**
* Sets the span status to `ERROR` and optinally records the provided error
* ass an exception
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

type in "ass"

.then((_) => span.setStatus({ code: SpanStatusCode.OK }))
.catch((err) => recordSpanError(span, err))
.finally(() => span.end())
return result
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just leaving a comment here to make sure that we are on purpose returning the original promise and not the chained promise.

Base automatically changed from alco/ot-export to main June 27, 2024 10:19
@alco
Copy link
Member Author

alco commented Jun 28, 2024

@msfstef Can you help me out here with addressing Kevin's feedback?

I see there are merge conflicts but haven't rebased so as not to mess up code comments from Kevin.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants