Add cassandra exporter #98
Conversation
Signed-off-by: forestsword <[email protected]>
}
for _, span := range batch.GetSpans() {
	span.Process = process
	err = s.writer.WriteSpan(ctx, span)
So the approach uses the existing Jaeger Cassandra writer.
A preferable approach would be a native implementation that converts OTLP directly to the Cassandra model.
I agree, there are a few rounds of conversion that we could do just once. This is mostly me hooking up two hoses that fit together. I believe there is a considerable amount of work to make it work the way we think it should. I would prefer to use other people's (smarter people's) code, and having done this I wonder whether the better way to go about it is to submit a new exporter to the contrib repository. That would give us access to internal libraries for translation and testing that we don't have access to here.
For instance, much of the code in the translator would be great to have: https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/main/pkg/translator/jaeger/traces_to_jaegerproto.go
Thoughts? Otherwise I'll do what I'm told and try it here.
The translator from contrib will not help us with a native implementation that translates OTLP directly to the Jaeger Cassandra model: https://github.com/jaegertracing/jaeger/blob/8611137adaffb9c661845109462408d5499b47dc/plugin/storage/cassandra/spanstore/dbmodel/model.go#L40.
In the past we have also talked about potentially defining a new Cassandra model that is closer to OTLP, if there are valid arguments to do so.
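For illustration only, here is a rough sketch of what a direct OTLP-to-dbmodel translation could look like. The package name, the pdata import path, and the exact dbmodel field types are assumptions based on the linked sources, and only a few fields are mapped:

```go
package cassandraexporter // illustrative package name

// Partial, hypothetical sketch of translating an OTLP span directly into the
// Jaeger Cassandra dbmodel, skipping the intermediate Jaeger proto model.
// Field names are taken from the dbmodel package linked above; ID, tag, log,
// reference and process conversion is deliberately left out.

import (
	"github.com/jaegertracing/jaeger/plugin/storage/cassandra/spanstore/dbmodel"
	"go.opentelemetry.io/collector/pdata/ptrace"
)

func otlpSpanToDBModel(span ptrace.Span, serviceName string) dbmodel.Span {
	start := span.StartTimestamp().AsTime()
	end := span.EndTimestamp().AsTime()
	return dbmodel.Span{
		OperationName: span.Name(),
		// dbmodel stores start time and duration as microseconds.
		StartTime:   start.UnixMicro(),
		Duration:    end.Sub(start).Microseconds(),
		ServiceName: serviceName,
		// TraceID, SpanID, Tags, Logs, Refs, Process and SpanHash need
		// their own conversions and are omitted from this sketch.
	}
}
```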
	s.settings.Logger.Debug("NO PROCESS")
}
for _, span := range batch.GetSpans() {
	span.Process = process
this should be conditional on span.Process == nil
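i.e. roughly this (a sketch reusing the identifiers from the diff above; the error handling is assumed):

```go
for _, span := range batch.GetSpans() {
	// Only fall back to the batch-level process when the span
	// doesn't already carry its own.
	if span.Process == nil {
		span.Process = process
	}
	if err = s.writer.WriteSpan(ctx, span); err != nil {
		return err
	}
}
```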
}
for _, span := range batch.GetSpans() {
	span.Process = process
	err = s.writer.WriteSpan(ctx, span)
Q: what is the implication of this from a pipeline perspective? It seems that writing to the database is (a) synchronous and (b) will fail the rest of the batch if one span fails to save. That's quite different from the Jaeger collector's default behavior: accept a batch over RPC, send spans into an internal queue to be saved independently (and usually concurrently), and respond with OK.
My novice understanding is that that's handled for me by the mechanisms available in exporterhelper.
The queue actually sets up a set of workers to dequeue, 10 by default. So the function is synchronous here, but there are many workers. The incoming td may or may not be a batch; it could be just a single span. It depends on whether the batch processor is used somewhere earlier in the write path. But I use batch here because the translator returns a batch defined in the Jaeger model: https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/c0666b60f669993fc07369c60d3d7c13143df599/pkg/translator/jaeger/traces_to_jaegerproto.go#L29.
The retry helper takes the full 'batch' of failed spans and attempts to retry.
Ah, there's my bug. If the first span succeeds and the second doesn't, the first would be written again on retry. Which I guess doesn't matter for correctness, the data would still be consistent (it would just be overwritten), but we waste a write. I'd think you'd need a de-batcher earlier in the pipeline to address that.
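To make the discussion concrete, here is a rough sketch of the consume function being described, assuming the contrib Jaeger translator and Jaeger's spanstore.Writer. The struct, method name, and pdata import path are illustrative rather than the PR's actual code; exporterhelper wraps a function like this with the sending queue (10 consumers by default) and the retry helper:

```go
package cassandraexporter // illustrative package name

import (
	"context"

	"github.com/jaegertracing/jaeger/storage/spanstore"
	"github.com/open-telemetry/opentelemetry-collector-contrib/pkg/translator/jaeger"
	"go.opentelemetry.io/collector/pdata/ptrace"
)

// cassandraExporter stands in for the exporter type in this PR.
type cassandraExporter struct {
	writer spanstore.Writer
}

// pushTraces translates OTLP to Jaeger batches and writes each span
// synchronously. A failure part-way through fails the whole td, which the
// retry helper then re-submits in full, so spans that were already written
// get written (overwritten) again, as discussed above.
func (s *cassandraExporter) pushTraces(ctx context.Context, td ptrace.Traces) error {
	batches, err := jaeger.ProtoFromTraces(td)
	if err != nil {
		return err
	}
	for _, batch := range batches {
		process := batch.GetProcess()
		for _, span := range batch.GetSpans() {
			if span.Process == nil {
				span.Process = process
			}
			if err := s.writer.WriteSpan(ctx, span); err != nil {
				return err
			}
		}
	}
	return nil
}
```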
	options...,
)

/* We can't use the factory due to the way servers is populated in GetPrimary
need more details
This line would override whatever is defined in the configuration: the unexported `servers` (`127.0.0.1` by default) overrides the `Servers` field in the configuration. It's then added to the unexported `primaryConfig` in the factory, which I can't get at either. Maybe a simple setter in the factory would let me have a clean hookup here. I don't believe we would care about the archive at all in an otel world; it would be a secondary cassandra exporter in the otel config.
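For what it's worth, the 'simple setter' could look something like the sketch below. This is purely hypothetical; neither the method nor access to the unexported field exists in the Jaeger factory today, and the Cassandra `config.Configuration` type is an assumption:

```go
// Hypothetical addition to Jaeger's Cassandra storage factory (not existing
// API): let callers inject a fully-populated primary configuration instead of
// relying on the unexported defaults (servers = 127.0.0.1) set via GetPrimary.
func (f *Factory) SetPrimaryConfig(cfg *config.Configuration) {
	f.primaryConfig = cfg
}
```

With something like that in place, the exporter could reuse the factory as-is, and archive storage would just be a second cassandra exporter in the otel config, as noted above.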
"time" | ||
|
||
"github.com/jaegertracing/jaeger/pkg/metrics" | ||
"go.opencensus.io/stats" |
When is the OTEL Collector planning to switch to the OTEL-Go SDK? I am not a fan of depending on OpenCensus. Also, if you're implementing Jaeger's metrics.Factory anyway, couldn't we just use the normal one that we create? Or is the issue that the OTEL config does not allow propagating that custom object through configs?
Still blocked by the SDK: open-telemetry/opentelemetry-collector#4198
My intention was to keep it consistent with the collector experience. I don't believe we can do anything to avoid this without opening a second port to scrape. There's no way for me to tell the collector, 'no, use this instead'; it's still OpenCensus everywhere.
At least the dependency is only over here, not in the main source.
Signed-off-by: forestsword [email protected]
Which problem is this PR solving?
This lets us run the otel collector on the entire write path and gives us access to the full set of features that the otel collector has available.
Short description of the changes
Adds a Cassandra exporter. It uses the Jaeger translator from the contrib repo to do so. I'm not sure whether there is a more efficient way, but it was certainly the most straightforward.
The rest is also pretty straightforward plumbing for an otel exporter. I've called out a few things in comments in the code where I've had to compromise or copy code. For instance, I would have liked to use the factory for Cassandra, but due to the way the servers are populated I had to fall back to the spanstore writer directly.
I also wrote a metrics factory to be able to expose the Cassandra metrics that we had before in an otel-collector-friendly fashion. I apologize that it doesn't have tests and is not fully implemented; only the counter and the timer are. That was all that was required to get the same metrics out that I was getting before with the Jaeger collector, only now they're prefixed `otelcol_exporter_cassandra...`. It should be usable by a possible future Elasticsearch exporter.
I do not use this binary directly, but I am importing the exporter as a library into a collector that I build and run, and it is working to my expectations on our staging environment.
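For anyone curious what that adapter shape looks like, here is an illustrative sketch (not the code in this PR) of a Jaeger metrics.Counter backed by OpenCensus, which is still what the collector exposes on its metrics endpoint; the package and function names are placeholders:

```go
package cassandraexporter // illustrative package name

import (
	"context"

	"github.com/jaegertracing/jaeger/pkg/metrics"
	"go.opencensus.io/stats"
	"go.opencensus.io/stats/view"
	"go.opencensus.io/tag"
)

// ocCounter adapts Jaeger's metrics.Counter onto an OpenCensus measure so the
// collector's existing telemetry pipeline picks it up.
type ocCounter struct {
	measure *stats.Int64Measure
	tags    []tag.Mutator
}

var _ metrics.Counter = (*ocCounter)(nil)

// newCounter creates the measure and registers a Sum view so the metric shows
// up on the collector's Prometheus endpoint alongside its other metrics.
func newCounter(name, help string, constTags map[string]string) (*ocCounter, error) {
	m := stats.Int64(name, help, stats.UnitDimensionless)

	keys := make([]tag.Key, 0, len(constTags))
	mutators := make([]tag.Mutator, 0, len(constTags))
	for k, v := range constTags {
		key := tag.MustNewKey(k)
		keys = append(keys, key)
		mutators = append(mutators, tag.Upsert(key, v))
	}

	if err := view.Register(&view.View{
		Name:        name,
		Description: help,
		Measure:     m,
		TagKeys:     keys,
		Aggregation: view.Sum(),
	}); err != nil {
		return nil, err
	}
	return &ocCounter{measure: m, tags: mutators}, nil
}

// Inc implements metrics.Counter.
func (c *ocCounter) Inc(delta int64) {
	_ = stats.RecordWithTags(context.Background(), c.tags, c.measure.M(delta))
}
```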
Thanks!