Investigate feasibility of calculating breakdown metrics #5936
Comments
That was one of the main reasons why we implemented breakdown metrics in agents. With APM Server running on the edge, that certainly becomes viable again.
There's a POC in https://github.com/axw/apm-server/tree/breakdown
This branch introduces a new model processor that aggregates transactions and their reachable spans, calculating span breakdown metrics.
One significant issue I came across is how to deal with async spans that start before the transaction ends but end after it. In this case the transaction is received by apm-server before the span, so we would need a timer to delay aggregation until these spans arrive. This is the biggest downside of performing the calculation outside of agents. It's much easier within agents because we can take action when a span starts, and not just when it ends.
@felixbarny and I had a short discussion about whether it would be reasonable to revise how we calculate breakdown metrics, such that spans that end after a transaction are completely disregarded. Felix said:
With that change in place, we could trigger aggregation on transaction receipt: buffer span events and process reachable spans with the same trace ID when a transaction is received. Spans may be buffered in Badger, like in the POC, to avoid unbounded memory growth. I think the only remaining hurdle is to avoid duplicating these metrics for agents that already calculate breakdowns. A couple of options:
Option 2 is a little complicated by RUM, which produces breakdown metrics for page-load transactions for which there are no corresponding spans, i.e. DNS, TCP, Request, Response, Processing, and Load, derived from the Navigation Timing API. We could add an exception for RUM and not drop those page-load breakdown metrics. Alternatively, we could explore adding navigation timing details to page-load transaction events. Then the server could aggregate those too, and the details may be useful for filtering and analysing individual page loads.
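Example:
import (
	"context"
	"time"

	"go.opentelemetry.io/otel"
	// The semconv version is an assumption; use one that defines the keys below.
	semconv "go.opentelemetry.io/otel/semconv/v1.7.0"
)

func doWork(ctx context.Context) {
	tracer := otel.Tracer("test-tracer")
	// Root span; ends when doWork returns.
	ctx, span := tracer.Start(ctx, "example")
	defer span.End()

	// Child span representing a 50ms database call.
	_, span1 := tracer.Start(ctx, "db_span")
	span1.SetAttributes(semconv.DBSystemElasticsearch)
	time.Sleep(50 * time.Millisecond)
	span1.End()

	// Child span representing a 100ms gRPC call.
	_, span2 := tracer.Start(ctx, "rpc_span")
	span2.SetAttributes(
		semconv.RPCSystemKey.String("grpc"),
		semconv.NetPeerNameKey.String("rpc_server"),
		semconv.NetPeerPortKey.Int(1234),
	)
	time.Sleep(100 * time.Millisecond)
	span2.End()

	// 200ms spent in the root span itself.
	time.Sleep(200 * time.Millisecond)
}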
Does this issue also affect the display of the "Instances" and "Instances latency distribution" panels of the Overview tab and Metrics tab of a service? I'm using OpenTelemetry and I see that these metrics are empty, too, in addition to the "Time spent by span type" panel (here). There is also an error pop-up on those pages that says:
I observed this on Elastic 8.1.2 on Elastic Cloud.
@davemoore- no, that would be unrelated to these metrics. I think that would happen if your OTel traces do not have either a container ID or host name in resource attributes. If you have a small program that reproduces the issue, would you mind providing it in a new issue?
Currently if someone uses OTel or Jaeger, the "time spent by span type" will be empty: elastic/apm#471. The reason behind this is that breakdown metrics are computed by agents.
As an alternative, we could hypothetically have a feature where APM Server computes breakdown metrics. Similar to tail-based sampling, it would work by buffering events. When a transaction is received, the server would assemble all related spans (i.e. those with a matching transaction.id), and compute breakdown metrics.
This would only work (and only be supported) when all transaction (not trace) events are sent to the same APM Server. That would be straightforward when running APM Server co-located with the instrumented service.
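The buffering step described above could look roughly like the following. This is an in-memory sketch only, not APM Server code: `SpanEvent`, `SpanBuffer`, `AddSpan`, and `TakeSpans` are hypothetical names, and a real implementation would need bounded storage and expiry for spans whose transaction never arrives.

```go
package main

import "sync"

// SpanEvent is a hypothetical, minimal span representation that carries
// only the fields needed to group spans by their transaction.
type SpanEvent struct {
	TransactionID string
	Name          string
}

// SpanBuffer holds span events in memory until their transaction arrives.
type SpanBuffer struct {
	mu    sync.Mutex
	spans map[string][]SpanEvent
}

// NewSpanBuffer returns an empty buffer.
func NewSpanBuffer() *SpanBuffer {
	return &SpanBuffer{spans: make(map[string][]SpanEvent)}
}

// AddSpan buffers a span under its transaction ID.
func (b *SpanBuffer) AddSpan(s SpanEvent) {
	b.mu.Lock()
	defer b.mu.Unlock()
	b.spans[s.TransactionID] = append(b.spans[s.TransactionID], s)
}

// TakeSpans returns the spans buffered for the given transaction ID and
// removes them from the buffer. It would be called when the transaction
// event is received, just before computing breakdown metrics.
func (b *SpanBuffer) TakeSpans(transactionID string) []SpanEvent {
	b.mu.Lock()
	defer b.mu.Unlock()
	spans := b.spans[transactionID]
	delete(b.spans, transactionID)
	return spans
}
```

Because the buffer lives in a single process, it only sees a complete picture when every event for a transaction reaches the same server, which is the constraint stated above.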