Investigate feasibility of calculating breakdown metrics #5936

Open
axw opened this issue Aug 13, 2021 · 5 comments

Comments

axw (Member) commented Aug 13, 2021

Currently, if someone uses OTel or Jaeger, the "Time spent by span type" chart will be empty: elastic/apm#471. The reason is that breakdown metrics are computed by the agents themselves.

As an alternative, we could hypothetically have a feature where APM Server computes breakdown metrics. Similar to tail-based sampling, it would work by buffering events. When a transaction is received, the server would assemble all related spans (i.e. those with a matching transaction.id), and compute breakdown metrics.

This would only work (and only be supported) when all events for a given transaction (not necessarily the whole trace) are sent to the same APM Server. That would be straightforward when running APM Server co-located with the instrumented service.
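
To make the idea concrete, here is a minimal sketch of the per-transaction computation. It uses hypothetical event types (Span, Transaction, breakdownByType) rather than the actual apm-server model, and it ignores overlapping or concurrent spans:

package breakdown

import "time"

// Hypothetical, simplified event types; the real apm-server model differs.
type Span struct {
	TransactionID string
	Type          string // e.g. "db", "external", "messaging"
	Duration      time.Duration
}

type Transaction struct {
	ID       string
	Duration time.Duration
}

// breakdownByType sums the time spent in each span type for one transaction
// and attributes the transaction's remaining self-time to "app". Overlapping
// or concurrent spans are not handled in this sketch.
func breakdownByType(tx Transaction, spans []Span) map[string]time.Duration {
	byType := make(map[string]time.Duration)
	var total time.Duration
	for _, s := range spans {
		if s.TransactionID != tx.ID {
			continue // belongs to another transaction
		}
		byType[s.Type] += s.Duration
		total += s.Duration
	}
	if self := tx.Duration - total; self > 0 {
		byType["app"] = self
	}
	return byType
}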

felixbarny (Member):

> This would only work (and only be supported) when all events are sent to the same APM Server

That was one of the main reasons why we implemented breakdown metrics in the agents. With APM Server running on the edge, this approach certainly becomes viable again.
We might even want to do it when receiving data from our agents. Our agents could then disable breakdown metrics based on the APM Server version they report to.

axw (Member, Author) commented Jan 19, 2022

There's a POC in https://github.com/axw/apm-server/tree/breakdown

This branch introduces a new model processor which aggregates transactions and their reachable spans, calculating span breakdown metrics.

One significant issue I came across is how to deal with async spans that start before the transaction ends but end after it. In this case the transaction is received by apm-server before the span, so we would need a timer to wait for these late spans. This is the biggest downside of performing the calculation outside of agents; it's much easier within agents because we can take action when a span starts, not just when it ends.

@felixbarny and I had a short discussion about whether it would be reasonable to revise how we calculate breakdown metrics, such that spans that end after a transaction are completely disregarded. Felix said:

> I think that's fair enough. If a span outlives its transaction, one can reasonably assume that it did not impact the response time of the transaction. Following that line of thinking, it's safe to not consider spans that end after their transaction in breakdown metrics.

It definitely seems worthwhile to me to drop the timers, which lowers complexity (I think) and increases efficiency.

With that change in place, we could trigger aggregation on transaction receipt: buffer span events and process reachable spans with the same trace ID when a transaction is received. Spans may be buffered in Badger, like in the POC, to avoid unbounded memory growth.
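
For illustration, here is a rough sketch of that trigger-on-receipt flow. It uses an in-memory map keyed by trace ID in place of the Badger buffer, and hypothetical event types (Event, Aggregator); the actual model processor in the POC differs. Spans that end after their transaction are simply skipped, so no timers are needed:

package breakdown

import "time"

// Hypothetical, simplified event type; the real apm-server model differs.
type Event struct {
	TraceID       string
	TransactionID string
	Type          string
	Start, End    time.Time
}

// Aggregator buffers spans until their transaction arrives. The POC buffers
// spans in Badger; a plain map stands in for that here.
type Aggregator struct {
	spansByTrace map[string][]Event
}

func NewAggregator() *Aggregator {
	return &Aggregator{spansByTrace: make(map[string][]Event)}
}

// ProcessSpan buffers a span, keyed by trace ID.
func (a *Aggregator) ProcessSpan(span Event) {
	a.spansByTrace[span.TraceID] = append(a.spansByTrace[span.TraceID], span)
}

// ProcessTransaction computes breakdown metrics from the buffered spans that
// belong to tx. Spans ending after the transaction ends are disregarded, so
// there is no need for a timer to wait for late async spans.
func (a *Aggregator) ProcessTransaction(tx Event) map[string]time.Duration {
	byType := make(map[string]time.Duration)
	var remaining []Event
	for _, span := range a.spansByTrace[tx.TraceID] {
		switch {
		case span.TransactionID != tx.TransactionID:
			remaining = append(remaining, span) // another transaction in the same trace
		case span.End.After(tx.End):
			// Ended after the transaction: assumed not to affect its response time.
		default:
			byType[span.Type] += span.End.Sub(span.Start)
		}
	}
	a.spansByTrace[tx.TraceID] = remaining
	return byType
}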

I think the only remaining hurdle is to avoid duplicating these metrics for agents that already calculate breakdowns. A couple of options:

  1. Just enable this for OpenTelemetry and Jaeger for now.
  2. Discard breakdown metrics sent by our agents, and deprecate and remove the feature from agents later.

Option 2 is a little complicated by RUM, which produces breakdown metrics for page-load transactions that have no corresponding spans (DNS, TCP, Request, Response, Processing, Load), derived from the Navigation Timing API.

We could add an exception for RUM, and not drop those page-load breakdown metrics. Alternatively, we could explore adding navigation timing details to page-load transaction events. Then the server could aggregate those too, and the details may be useful for filtering and analysing individual page loads.
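
If we went with option 2, the server-side filter could look roughly like this. The field names and the "rum-js" agent-name check are illustrative assumptions, not the actual apm-server model:

package breakdown

// Hypothetical shape of a breakdown metricset; field names are illustrative.
type BreakdownMetricset struct {
	AgentName       string // e.g. "rum-js", "java"
	TransactionType string // e.g. "page-load", "request"
}

// keepAgentBreakdown reports whether a breakdown metricset received from an
// agent should be kept rather than discarded. RUM page-load breakdowns are
// kept because the server has no corresponding spans (DNS, TCP, Request,
// Response, Processing, Load) from which to recompute them.
func keepAgentBreakdown(m BreakdownMetricset) bool {
	return m.AgentName == "rum-js" && m.TransactionType == "page-load"
}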

axw (Member, Author) commented Jan 19, 2022

Example:

[screenshot]

package main

import (
	"context"
	"time"

	"go.opentelemetry.io/otel"
	// The semconv version here is an assumption; the original snippet did not show it.
	semconv "go.opentelemetry.io/otel/semconv/v1.4.0"
)

// doWork creates an "example" transaction containing an Elasticsearch span
// (~50ms), a gRPC span (~100ms), and ~200ms of remaining "app" self-time.
func doWork(ctx context.Context) {
	tracer := otel.Tracer("test-tracer")
	ctx, span := tracer.Start(ctx, "example")
	defer span.End()

	_, span1 := tracer.Start(ctx, "db_span")
	span1.SetAttributes(semconv.DBSystemElasticsearch)
	time.Sleep(50 * time.Millisecond)
	span1.End()

	_, span2 := tracer.Start(ctx, "rpc_span")
	span2.SetAttributes(
		semconv.RPCSystemKey.String("grpc"),
		semconv.NetPeerNameKey.String("rpc_server"),
		semconv.NetPeerPortKey.Int(1234),
	)
	time.Sleep(100 * time.Millisecond)
	span2.End()

	time.Sleep(200 * time.Millisecond)
}

davemoore- commented:

Does this issue also affect the display of the "Instances" and "Instances latency distribution" panels of the Overview tab and Metrics tab of a service? I'm using OpenTelemetry and I see that these metrics are empty, too, in addition to the "Time spent by span type" panel (here). There is also an error pop-up on those pages that says:

Error while fetching resource

Error
search_phase_execution_exception: [script_exception] Reason: link error (500)

I observed this on Elastic 8.1.2 on Elastic Cloud.

axw (Member, Author) commented Apr 26, 2022

@davemoore- no, that would be unrelated to these metrics.

I think that would happen if your OTel traces do not have either a container ID or host name in resource attributes. If you have a small program that reproduces the issue, would you mind providing it in a new issue?
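
For reference, here is a minimal sketch of attaching a host.name resource attribute with the OpenTelemetry Go SDK (exporter setup omitted; the service name, host name, and semconv version are placeholders):

package main

import (
	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/sdk/resource"
	sdktrace "go.opentelemetry.io/otel/sdk/trace"
	semconv "go.opentelemetry.io/otel/semconv/v1.4.0"
)

// initTracerProvider registers a tracer provider whose resource carries a
// host.name attribute, so the APM UI can identify service instances.
func initTracerProvider() *sdktrace.TracerProvider {
	res := resource.NewWithAttributes(
		semconv.SchemaURL,
		semconv.ServiceNameKey.String("example-service"),
		semconv.HostNameKey.String("example-host"), // or container.id for containerized services
	)
	tp := sdktrace.NewTracerProvider(sdktrace.WithResource(res))
	otel.SetTracerProvider(tp)
	return tp
}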
