Investigate feasibility of calculating breakdown metrics #5936

Open
axw opened this issue Aug 13, 2021 · 5 comments

Comments

axw (Member) commented Aug 13, 2021

Currently, if someone uses OTel or Jaeger, the "Time spent by span type" chart will be empty: elastic/apm#471. The reason is that breakdown metrics are computed by the agents themselves.

As an alternative, we could hypothetically have a feature where APM Server computes breakdown metrics. Similar to tail-based sampling, it would work by buffering events. When a transaction is received, the server would assemble all related spans (i.e. those with a matching transaction.id), and compute breakdown metrics.

This would only work (and only be supported) when all events for a given transaction (not necessarily the whole trace) are sent to the same APM Server. That would be straightforward when running APM Server co-located with the instrumented service.
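
To make the idea concrete, here is a minimal sketch of the per-transaction computation. It uses hypothetical event types (Span, Transaction, breakdownByType) rather than the actual apm-server model, and it ignores overlapping or concurrent spans:

package breakdown

import "time"

// Hypothetical, simplified event types; the real apm-server model differs.
type Span struct {
	TransactionID string
	Type          string // e.g. "db", "external", "messaging"
	Duration      time.Duration
}

type Transaction struct {
	ID       string
	Duration time.Duration
}

// breakdownByType sums the time spent in each span type for one transaction
// and attributes the transaction's remaining self-time to "app". Overlapping
// or concurrent spans are not handled in this sketch.
func breakdownByType(tx Transaction, spans []Span) map[string]time.Duration {
	byType := make(map[string]time.Duration)
	var total time.Duration
	for _, s := range spans {
		if s.TransactionID != tx.ID {
			continue // belongs to another transaction
		}
		byType[s.Type] += s.Duration
		total += s.Duration
	}
	if self := tx.Duration - total; self > 0 {
		byType["app"] = self
	}
	return byType
}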

felixbarny (Member):

> This would only work (and only be supported) when all events are sent to the same APM Server

That was one of the main reasons why we implemented breakdown metrics in the agents. With APM Server running on the edge, this approach certainly becomes viable again.
We might even want to do it when receiving data from our agents. Our agents could then disable breakdown metrics based on the APM Server version they report to.

axw (Member, Author) commented Jan 19, 2022

There's a POC in https://github.com/axw/apm-server/tree/breakdown

This branch introduces a new model processor which aggregates transactions and their reachable spans, calculating span breakdown metrics.

One significant issue I came across is how to deal with async spans that start before the transaction ends but end after it. In this case the transaction is received by apm-server before the span, so we would need a timer to wait for these late spans. This is the biggest downside of performing the calculation outside of agents; it's much easier within agents because we can take action when a span starts, not just when it ends.

@felixbarny and I had a short discussion about whether it would be reasonable to revise how we calculate breakdown metrics, such that spans that end after a transaction are completely disregarded. Felix said:

> I think that's fair enough. If a span outlives its transaction, one can reasonably assume that it did not impact the response time of the transaction. Following that line of thinking, it's safe to not consider spans that end after their transaction in breakdown metrics.

It definitely seems worthwhile to me to drop the timers, which lowers complexity (I think) and increases efficiency.

With that change in place, we could trigger aggregation on transaction receipt: buffer span events and process reachable spans with the same trace ID when a transaction is received. Spans may be buffered in Badger, like in the POC, to avoid unbounded memory growth.
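
For illustration, here is a rough sketch of that trigger-on-receipt flow. It uses an in-memory map keyed by trace ID in place of the Badger buffer, and hypothetical event types (Event, Aggregator); the actual model processor in the POC differs. Spans that end after their transaction are simply skipped, so no timers are needed:

package breakdown

import "time"

// Hypothetical, simplified event type; the real apm-server model differs.
type Event struct {
	TraceID       string
	TransactionID string
	Type          string
	Start, End    time.Time
}

// Aggregator buffers spans until their transaction arrives. The POC buffers
// spans in Badger; a plain map stands in for that here.
type Aggregator struct {
	spansByTrace map[string][]Event
}

func NewAggregator() *Aggregator {
	return &Aggregator{spansByTrace: make(map[string][]Event)}
}

// ProcessSpan buffers a span, keyed by trace ID.
func (a *Aggregator) ProcessSpan(span Event) {
	a.spansByTrace[span.TraceID] = append(a.spansByTrace[span.TraceID], span)
}

// ProcessTransaction computes breakdown metrics from the buffered spans that
// belong to tx. Spans ending after the transaction ends are disregarded, so
// there is no need for a timer to wait for late async spans.
func (a *Aggregator) ProcessTransaction(tx Event) map[string]time.Duration {
	byType := make(map[string]time.Duration)
	var remaining []Event
	for _, span := range a.spansByTrace[tx.TraceID] {
		switch {
		case span.TransactionID != tx.TransactionID:
			remaining = append(remaining, span) // another transaction in the same trace
		case span.End.After(tx.End):
			// Ended after the transaction: assumed not to affect its response time.
		default:
			byType[span.Type] += span.End.Sub(span.Start)
		}
	}
	a.spansByTrace[tx.TraceID] = remaining
	return byType
}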

I think the only remaining hurdle is to avoid duplicating these metrics for agents that already calculate breakdowns. A couple of options:

  1. Just enable this for OpenTelemetry and Jaeger for now.
  2. Discard breakdown metrics sent by our agents, and deprecate and remove the feature from agents later.

Option 2 is a little complicated by RUM, which produces breakdown metrics for page-load transactions that have no corresponding spans (DNS, TCP, Request, Response, Processing, Load), derived from the Navigation Timing API.

We could add an exception for RUM, and not drop those page-load breakdown metrics. Alternatively, we could explore adding navigation timing details to page-load transaction events. Then the server could aggregate those too, and the details may be useful for filtering and analysing individual page loads.
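
If we went with option 2, the server-side filter could look roughly like this. The field names and the "rum-js" agent-name check are illustrative assumptions, not the actual apm-server model:

package breakdown

// Hypothetical shape of a breakdown metricset; field names are illustrative.
type BreakdownMetricset struct {
	AgentName       string // e.g. "rum-js", "java"
	TransactionType string // e.g. "page-load", "request"
}

// keepAgentBreakdown reports whether a breakdown metricset received from an
// agent should be kept rather than discarded. RUM page-load breakdowns are
// kept because the server has no corresponding spans (DNS, TCP, Request,
// Response, Processing, Load) from which to recompute them.
func keepAgentBreakdown(m BreakdownMetricset) bool {
	return m.AgentName == "rum-js" && m.TransactionType == "page-load"
}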

axw (Member, Author) commented Jan 19, 2022

Example:

[screenshot]

package main

import (
	"context"
	"time"

	"go.opentelemetry.io/otel"
	// The semconv version here is an assumption; the original snippet did not show it.
	semconv "go.opentelemetry.io/otel/semconv/v1.4.0"
)

// doWork creates an "example" transaction containing an Elasticsearch span
// (~50ms), a gRPC span (~100ms), and ~200ms of remaining "app" self-time.
func doWork(ctx context.Context) {
	tracer := otel.Tracer("test-tracer")
	ctx, span := tracer.Start(ctx, "example")
	defer span.End()

	_, span1 := tracer.Start(ctx, "db_span")
	span1.SetAttributes(semconv.DBSystemElasticsearch)
	time.Sleep(50 * time.Millisecond)
	span1.End()

	_, span2 := tracer.Start(ctx, "rpc_span")
	span2.SetAttributes(
		semconv.RPCSystemKey.String("grpc"),
		semconv.NetPeerNameKey.String("rpc_server"),
		semconv.NetPeerPortKey.Int(1234),
	)
	time.Sleep(100 * time.Millisecond)
	span2.End()

	time.Sleep(200 * time.Millisecond)
}

davemoore- commented:

Does this issue also affect the display of the "Instances" and "Instances latency distribution" panels of the Overview tab and Metrics tab of a service? I'm using OpenTelemetry and I see that these metrics are empty, too, in addition to the "Time spent by span type" panel (here). There is also an error pop-up on those pages that says:

Error while fetching resource

Error
search_phase_execution_exception: [script_exception] Reason: link error (500)

I observed this on Elastic 8.1.2 on Elastic Cloud.

axw (Member, Author) commented Apr 26, 2022

@davemoore- no, that would be unrelated to these metrics.

I think that would happen if your OTel traces do not have either a container ID or host name in resource attributes. If you have a small program that reproduces the issue, would you mind providing it in a new issue?
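
For reference, here is a minimal sketch of attaching a host.name resource attribute with the OpenTelemetry Go SDK (exporter setup omitted; the service name, host name, and semconv version are placeholders):

package main

import (
	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/sdk/resource"
	sdktrace "go.opentelemetry.io/otel/sdk/trace"
	semconv "go.opentelemetry.io/otel/semconv/v1.4.0"
)

// initTracerProvider registers a tracer provider whose resource carries a
// host.name attribute, so the APM UI can identify service instances.
func initTracerProvider() *sdktrace.TracerProvider {
	res := resource.NewWithAttributes(
		semconv.SchemaURL,
		semconv.ServiceNameKey.String("example-service"),
		semconv.HostNameKey.String("example-host"), // or container.id for containerized services
	)
	tp := sdktrace.NewTracerProvider(sdktrace.WithResource(res))
	otel.SetTracerProvider(tp)
	return tp
}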
