From 2bd13cf237f8645ec4fe6e77014e655c4bda8e5e Mon Sep 17 00:00:00 2001 From: Felix Barnsteiner Date: Mon, 21 Jun 2021 16:12:32 +0200 Subject: [PATCH 01/46] First draft of handling huge tracing specs --- specs/agents/README.md | 1 + specs/agents/tracing-spans-compress.md | 199 ++++++++++++++++++ specs/agents/tracing-spans-drop-fast.md | 3 + specs/agents/tracing-spans-dropped-stats.md | 52 +++++ .../tracing-spans-handling-huge-traces.md | 17 ++ specs/agents/tracing-spans-limit.md | 19 ++ 6 files changed, 291 insertions(+) create mode 100644 specs/agents/tracing-spans-compress.md create mode 100644 specs/agents/tracing-spans-drop-fast.md create mode 100644 specs/agents/tracing-spans-dropped-stats.md create mode 100644 specs/agents/tracing-spans-handling-huge-traces.md create mode 100644 specs/agents/tracing-spans-limit.md diff --git a/specs/agents/README.md b/specs/agents/README.md index 23262440..bf5f8351 100644 --- a/specs/agents/README.md +++ b/specs/agents/README.md @@ -40,6 +40,7 @@ You can find details about each of these in the [APM Data Model](https://www.ela - [Transactions](tracing-transactions.md) - [Spans](tracing-spans.md) - [Span destination](tracing-spans-destination.md) + - [Handling huge traces](tracing-spans-handling-huge-traces.md) - [Sampling](tracing-sampling.md) - [Distributed tracing](tracing-distributed-tracing.md) - [Tracer API](tracing-api.md) diff --git a/specs/agents/tracing-spans-compress.md b/specs/agents/tracing-spans-compress.md new file mode 100644 index 00000000..32a70214 --- /dev/null +++ b/specs/agents/tracing-spans-compress.md @@ -0,0 +1,199 @@ +## Compressing spans + +To mitigate the potential flood of spans to a backend, +agents SHOULD implement the strategies laid out in this section to avoid sending almost identical and very similar spans. 
+ +While compressing multiple similar spans into a single composite span can't fully get rid of all the collection overhead, +it can significantly reduce the impact on the following areas, +with very little loss of information. +- Agent reporter queue utilization +- Capturing stack traces, serialization, compression, and sending events to APM Server +- Potential to re-use span objects, significantly reducing allocations +- Downstream effects like reducing impact on APM Server, ES storage, and UI performance + + +### Consecutive-Exact-Match compression strategy + +One of the biggest sources of excessive data collection are n+1 type queries and repetitive requests to a cache server. +This strategy detects consecutive spans that hold the same information (except for the duration) +and creates a single [composite span](tracing-spans-compress.md#composite-span). + +``` +[ ] +GET /users + [] [] [] [] [] [] [] [] [] [] + 10x SELECT FROM users +``` + +Two spans are considered to be an exact match if they are of the same kind and if their span names are equal. + +### Consecutive-Same-Kind compression strategy + +Another pattern that often occurs is a high amount of alternating queries to the same backend. +Especially if the individual spans are quite fast, recording every single query is likely to not be worth the overhead. + +``` +[ ] +GET /users + [] [] [] [] [] [] [] [] [] [] + 10x Calls to mysql +``` + +Two span are considered to be of the same type if the following properties are equal: +- `type` +- `subtype` +- `destination.service.resource` + +```java +boolean isSameKind(Span other) { + return type == other.type + && subtype == other.subtype + && destination.service.resource == other.destination.service.resource +} +``` + +When applying this compression strategy, the `span.name` is set to `Calls to $span.destination.service.resource`. 
+The rest of the context, such as the `db.statement` will be determined by the first compressed span, which is turned into a composite span. + + +#### Configuration option `same_kind_compression_max_duration` + +Consecutive spans to the same destination that are under this threshold will be compressed into a single composite span. +This reduces the collection, processing, and storage overhead, and removes clutter from the UI. +The tradeoff is that the statement information will not be collected. + +| | | +|----------------|----------| +| Type | `duration`| +| Default | `1ms` | +| Dynamic | `true` | + +### Composite span + +Compressed spans don't have a physical span document. +Instead, multiple compressed spans are represented by a composite span. + +#### Data model + +The `timestamp` and `duration` have slightly similar semantics, +and they define properties under the `composite` context. + +- `timestamp`: The start timestamp of the first span. +- `duration`: The sum of durations of all spans. +- `composite` + - `count`: The number of compressed spans this composite span represents. + The minimum count is 2 as a composite span represents at least two spans. + - `end`: The end timestamp of the last compressed span. + The net duration of all compressed spans is equal to the composite spans' `duration`. + The gross duration (including "whitespace" between the spans) is equal to `compressed.end - timestamp`. + +#### Turning compressed spans into a composite span + +Spans have a `compress` method. +The first time `compress` is called on a regular span, it becomes a composite span. + +```java +void compress(Span other) { + if (compressed.count == 0) { + compressed.count = 2 + } else { + compressedCount++ + } + endTimestamp = max(endTimestamp, other.endTimestamp) +} +``` + +#### Effects on metric processing + +As laid out in the [span destination spec](tracing-spans-destination.md#contextdestinationserviceresource), +APM Server tracks span destination metrics. 
+To avoid compressed spans to skew latency metrics and cause throughput metrics to be under-counted, +APM Server will take `composite.count` into account when tracking span destination metrics. + +### Compression algorithm + +#### Eligibility for compression + +A span is eligible for compression if all the following conditions are met +- It's an exit span +- The trace context of this span has not been propagated to a downstream service + +The latter condition is important so that we don't remove (compress) a span that may be the parent of a downstream service. +This would orphan the sub-graph started by the downstream service and cause it to not appear in the waterfall view. + +```java +boolean isCompressionEligible() { + return exit && !context.hasPropagated +} +``` + +#### Span buffering + +Non-compression-eligible spans may be reported immediately after they have ended. +When a compression-eligible span ends, it does not immediately get reported. +Instead, the span is buffered within its parent. +A span/transaction can buffer at most one child span. + +Span buffering allows to "look back" one span when determining whether a given span should be compressed. + +A buffered span gets reported when +1. its parent ends +2. a non-compressible sibling ends + +```java +void onSpanEnd() { + if (isCompressionEligible()) { + if (parent.hasBufferedSpan()) { + parent.tryCompress(this) + } else { + parent.buffered = this + } + } else { + report(buffered) + report(this) + } +} +``` + +#### Compression + +On span end, we compare each [compression-eligible](tracing-spans-compress.md#eligibility-for-compression) span to it's previous sibling. + +If the spans are of the same kind but have different span names, +we compress them using the [Consecutive-Same-Kind compression strategy](tracing-spans-compress.md#consecutive-same-kind-compression-strategy). 
+ +If the spans are of the same kind, and have the same name, +we apply the [Consecutive-Exact-Match compression strategy](tracing-spans-compress.md#consecutive-exact-match-compression-strategy). + +```java +void tryCompress(Span child) { + if (buffered.isSameKind(child)) { + if (buffered.name == child.name) { + buffered.compress(child) + return + } else if (buffered.duration <= same_kind_compression_max_duration + && child.duration <= same_kind_compression_max_duration) { + buffered.name = "Calls to $buffered.destination.service.resource" + buffered.compress(child) + return + } + } + report(buffered) + buffered = child +} +``` + +#### Concurrency + +The pseudo-code in this spec is intentionally not written in a thread-safe manner to make it more concise. +Also, thread safety is highly platform/runtime dependent, and some don't support parallelism or concurrency. + +However, if there can be a situation where multiple spans may end concurrently, agents MUST guard against race conditions. +To do that, agents should prefer [lock-free algorithms](https://en.wikipedia.org/wiki/Non-blocking_algorithm) +paired with retry loops over blocking algorithms that use mutexes or locks. + +In particular, operations that work with the buffer require special attention. +- Setting a span into the buffer must be handled atomically. +- Retrieving a span from the buffer must be handled atomically. + Retrieving includes atomically getting and clearing the buffer. + This makes sure that only one thread can compare span properties and call mutating methods, such as `compress` at a time. 
diff --git a/specs/agents/tracing-spans-drop-fast.md b/specs/agents/tracing-spans-drop-fast.md new file mode 100644 index 00000000..a8736c2b --- /dev/null +++ b/specs/agents/tracing-spans-drop-fast.md @@ -0,0 +1,3 @@ +## Dropping fast spans + +TODO [`span_min_duration`](https://github.com/elastic/apm/pull/314) is viable again when tracking stats for fast spans [`tracing-spans-dropped-stats.md`](tracing-spans-dropped-stats.md). \ No newline at end of file diff --git a/specs/agents/tracing-spans-dropped-stats.md b/specs/agents/tracing-spans-dropped-stats.md new file mode 100644 index 00000000..0e33dc1f --- /dev/null +++ b/specs/agents/tracing-spans-dropped-stats.md @@ -0,0 +1,52 @@ +## Collecting statistics about dropped spans + +To still retain some information about dropped spans (for example due to `transaction_max_spans` or `span_min_duration`), +agents SHOULD collect statistics on the corresponding transaction about dropped spans. +These statistics MUST only be sent for sampled transactions. + +### Use cases + +This allows APM Server to consider these metrics for the service destination metrics. +In practice, +this means that the service map, the dependencies table, +and the backend details view can show accurate throughput statistics for backends like Redis, +even if most of the spans are dropped. + +This also allows the transaction details view (aka. waterfall) to show a summary of the dropped spans. + +### Data model + +This is an example of the statistics that are added do the `transaction` events sent via the intake v2 protocol. 
+ +```json +{ + "dropped_spans_stats": [ + { + "type": "external", + "subtype": "http", + "destination_service_resource": "example.com:443", + "outcome": "failure", + "count": 28, + "duration.sum.us": 123456 + }, + { + "type": "db", + "subtype": "mysql", + "destination_service_resource": "mysql", + "outcome": "success", + "count": 81, + "duration.sum.us": 9876543 + } + ] +} +``` + +### Limits +TODO: limit the number of `dropped_spans_stats` elements? + +### Effects on destination service metrics + +As laid out in the [span destination spec](tracing-spans-destination.md#contextdestinationserviceresource), +APM Server tracks span destination metrics. +To avoid dropped spans to skew latency metrics and cause throughput metrics to be under-counted, +APM Server will take `dropped_spans_stats` into account when tracking span destination metrics. \ No newline at end of file diff --git a/specs/agents/tracing-spans-handling-huge-traces.md b/specs/agents/tracing-spans-handling-huge-traces.md new file mode 100644 index 00000000..30de8fd3 --- /dev/null +++ b/specs/agents/tracing-spans-handling-huge-traces.md @@ -0,0 +1,17 @@ +# Handling huge traces + +Instrumenting applications that make lots of requests (such as 10k+) to backends like caches or databases can lead to several issues: +- A significant performance impact in the target application. + For example due to high allocation rate, network traffic, garbage collection, additional CPU cycles for serializing, compressing and sending spans, etc. +- Dropping of events in agents or APM Server due to exhausted queues. +- High load on the APM Server. +- High storage costs. +- Decreased performance of the Elastic APM UI due to slow searches and rendering of huge traces. +- Loss of clarity and overview (--> decreased user experience) in the UI when analyzing the traces. 
+ +Agents can implement several strategies to mitigate these issues: +- [Hard limit on number of spans to collect](tracing-spans-limit.md) +- [Collecting statistics about dropped spans](tracing-spans-dropped-stats.md) +- [Dropping fast spans](tracing-spans-drop-fast.md) +- [Compressing spans](tracing-spans-compress.md) + diff --git a/specs/agents/tracing-spans-limit.md b/specs/agents/tracing-spans-limit.md new file mode 100644 index 00000000..9e64b5e1 --- /dev/null +++ b/specs/agents/tracing-spans-limit.md @@ -0,0 +1,19 @@ +## Hard limit on number of spans to collect + +This is the last line of defense that comes with the highest amount of data loss. +This strategy MUST be implemented by all agents. +Ideally, the other mechanisms limit the amount of spans enough so that the hard limit does not kick in. + +### Configuration option `transaction_max_spans` + +Limits the amount of spans that are recorded per transaction. + +This is helpful in cases where a transaction creates a very high amount of spans (e.g. thousands of SQL queries). + +Setting an upper limit will prevent overloading the agent and the APM server with too much work for such edge cases. + +| | | +|----------------|----------| +| Type | `integer`| +| Default | `500` | +| Dynamic | `true` | From 72d384d748674133ff3557aad0b6da4e824bf960 Mon Sep 17 00:00:00 2001 From: Felix Barnsteiner Date: Tue, 22 Jun 2021 10:56:35 +0200 Subject: [PATCH 02/46] Apply suggestions from code review Co-authored-by: Alexander Wert --- specs/agents/tracing-spans-compress.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/specs/agents/tracing-spans-compress.md b/specs/agents/tracing-spans-compress.md index 32a70214..41aea49a 100644 --- a/specs/agents/tracing-spans-compress.md +++ b/specs/agents/tracing-spans-compress.md @@ -159,7 +159,7 @@ void onSpanEnd() { On span end, we compare each [compression-eligible](tracing-spans-compress.md#eligibility-for-compression) span to it's previous sibling. 
-If the spans are of the same kind but have different span names, +If the spans are of the same kind but have different span names and the compression-eligible span's `duration` <= `same_kind_compression_max_duration`, we compress them using the [Consecutive-Same-Kind compression strategy](tracing-spans-compress.md#consecutive-same-kind-compression-strategy). If the spans are of the same kind, and have the same name, From 411f529af90a77d3124cb690b0942b569c3e0cba Mon Sep 17 00:00:00 2001 From: Felix Barnsteiner Date: Tue, 22 Jun 2021 10:56:50 +0200 Subject: [PATCH 03/46] Implement suggestions --- specs/agents/README.md | 4 ++++ specs/agents/tracing-spans-compress.md | 8 ++++++-- 2 files changed, 10 insertions(+), 2 deletions(-) diff --git a/specs/agents/README.md b/specs/agents/README.md index bf5f8351..8c9d32c1 100644 --- a/specs/agents/README.md +++ b/specs/agents/README.md @@ -41,6 +41,10 @@ You can find details about each of these in the [APM Data Model](https://www.ela - [Spans](tracing-spans.md) - [Span destination](tracing-spans-destination.md) - [Handling huge traces](tracing-spans-handling-huge-traces.md) + - [Hard limit on number of spans to collect](tracing-spans-limit.md) + - [Collecting statistics about dropped spans](tracing-spans-dropped-stats.md) + - [Dropping fast spans](tracing-spans-drop-fast.md) + - [Compressing spans](tracing-spans-compress.md) - [Sampling](tracing-sampling.md) - [Distributed tracing](tracing-distributed-tracing.md) - [Tracer API](tracing-api.md) diff --git a/specs/agents/tracing-spans-compress.md b/specs/agents/tracing-spans-compress.md index 41aea49a..91fe2400 100644 --- a/specs/agents/tracing-spans-compress.md +++ b/specs/agents/tracing-spans-compress.md @@ -25,7 +25,11 @@ GET /users 10x SELECT FROM users ``` -Two spans are considered to be an exact match if they are of the same kind and if their span names are equal.
+Two spans are considered to be an exact match if they are of the [same kind](tracing-spans-compress.md#consecutive-same-kind-compression-strategy) and if their span names are equal: +- `type` +- `subtype` +- `destination.service.resource` +- `name` ### Consecutive-Same-Kind compression strategy @@ -65,7 +69,7 @@ The tradeoff is that the statement information will not be collected. | | | |----------------|----------| | Type | `duration`| -| Default | `1ms` | +| Default | `5ms` | | Dynamic | `true` | ### Composite span From fd3d87910d31d25cb691daf4a6a257c7dd86ccaa Mon Sep 17 00:00:00 2001 From: Felix Barnsteiner Date: Tue, 22 Jun 2021 11:51:20 +0200 Subject: [PATCH 04/46] Update specs/agents/tracing-spans-compress.md Co-authored-by: Alexander Wert --- specs/agents/tracing-spans-compress.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/specs/agents/tracing-spans-compress.md b/specs/agents/tracing-spans-compress.md index 91fe2400..22668ea1 100644 --- a/specs/agents/tracing-spans-compress.md +++ b/specs/agents/tracing-spans-compress.md @@ -175,7 +175,7 @@ void tryCompress(Span child) { if (buffered.name == child.name) { buffered.compress(child) return - } else if (buffered.duration <= same_kind_compression_max_duration + } else if ( (buffered.duration <= same_kind_compression_max_duration || buffered.composite.count > 1) && child.duration <= same_kind_compression_max_duration) { buffered.name = "Calls to $buffered.destination.service.resource" buffered.compress(child) From 42ad30006d4832b8cec94ef8ec5fc7e2c8f18628 Mon Sep 17 00:00:00 2001 From: Felix Barnsteiner Date: Tue, 22 Jun 2021 11:57:07 +0200 Subject: [PATCH 05/46] Pseudo code for how the strategies work in combination --- .../agents/tracing-spans-handling-huge-traces.md | 15 +++++++++++++++ 1 file changed, 15 insertions(+) diff --git a/specs/agents/tracing-spans-handling-huge-traces.md b/specs/agents/tracing-spans-handling-huge-traces.md index 30de8fd3..5700afab 100644 ---
a/specs/agents/tracing-spans-handling-huge-traces.md +++ b/specs/agents/tracing-spans-handling-huge-traces.md @@ -15,3 +15,18 @@ Agents can implement several strategies to mitigate these issues: - [Dropping fast spans](tracing-spans-drop-fast.md) - [Compressing spans](tracing-spans-compress.md) +In a nutshell, this is how the different settings work in combination: + +```java +if (span.transaction.spanCount > transaction_max_spans) { + // drop span + // collect statistics for dropped spans +} else if (compression possible) { + // apply compression +} else if (span.duration < span_min_duration) { + // drop span + // collect statistics for dropped spans +} else { + // report span +} +``` \ No newline at end of file From ae09511f88f6340f0f063f786e6b1eac7ad3ce34 Mon Sep 17 00:00:00 2001 From: Felix Barnsteiner Date: Tue, 22 Jun 2021 12:00:54 +0200 Subject: [PATCH 06/46] Add composite.exact_match flag --- specs/agents/tracing-spans-compress.md | 10 +++++++--- 1 file changed, 7 insertions(+), 3 deletions(-) diff --git a/specs/agents/tracing-spans-compress.md b/specs/agents/tracing-spans-compress.md index 22668ea1..57bce90d 100644 --- a/specs/agents/tracing-spans-compress.md +++ b/specs/agents/tracing-spans-compress.md @@ -90,6 +90,9 @@ and they define properties under the `composite` context. - `end`: The end timestamp of the last compressed span. The net duration of all compressed spans is equal to the composite spans' `duration`. The gross duration (including "whitespace" between the spans) is equal to `compressed.end - timestamp`. + - `exact_match`: A boolean flag indicating whether the + [Consecutive-Same-Kind compression strategy](tracing-spans-compress.md#consecutive-same-kind-compression-strategy) or the + [Consecutive-Exact-Match compression strategy](tracing-spans-compress.md#consecutive-exact-match-compression-strategy) has been applied. #### Turning compressed spans into a composite span @@ -97,12 +100,13 @@ Spans have a `compress` method. 
The first time `compress` is called on a regular span, it becomes a composite span. ```java -void compress(Span other) { +void compress(Span other, boolean exactMatch) { if (compressed.count == 0) { compressed.count = 2 } else { compressedCount++ } + compressed.exactMatch = compressed.exactMatch && exactMatch endTimestamp = max(endTimestamp, other.endTimestamp) } ``` @@ -173,12 +177,12 @@ we apply the [Consecutive-Exact-Match compression strategy](tracing-spans-compre void tryCompress(Span child) { if (buffered.isSameKind(child)) { if (buffered.name == child.name) { - buffered.compress(child) + buffered.compress(child, exactMatch: true) return } else if ( (buffered.duration <= same_kind_compression_max_duration || buffered.composite.count > 1) && child.duration <= same_kind_compression_max_duration) { buffered.name = "Calls to $buffered.destination.service.resource" - buffered.compress(child) + buffered.compress(child, exactMatch: false) return } } From db173649ecf6bd2c7adea92ab9a60221545048e8 Mon Sep 17 00:00:00 2001 From: Felix Barnsteiner Date: Thu, 24 Jun 2021 08:09:52 +0200 Subject: [PATCH 07/46] Apply suggestions from code review Co-authored-by: Colton Myers --- specs/agents/tracing-spans-compress.md | 13 +++++++------ 1 file changed, 7 insertions(+), 6 deletions(-) diff --git a/specs/agents/tracing-spans-compress.md b/specs/agents/tracing-spans-compress.md index 57bce90d..e9bc971f 100644 --- a/specs/agents/tracing-spans-compress.md +++ b/specs/agents/tracing-spans-compress.md @@ -3,7 +3,7 @@ To mitigate the potential flood of spans to a backend, agents SHOULD implement the strategies laid out in this section to avoid sending almost identical and very similar spans. 
-While compressing multiple similar spans into a single composite span can't fully get rid of all the collection overhead, +While compressing multiple similar spans into a single composite span can't fully eliminate the collection overhead, it can significantly reduce the impact on the following areas, with very little loss of information. - Agent reporter queue utilization @@ -43,7 +43,7 @@ GET /users 10x Calls to mysql ``` -Two span are considered to be of the same type if the following properties are equal: +Two spans are considered to be of the same type if the following properties are equal: - `type` - `subtype` - `destination.service.resource` @@ -91,13 +91,14 @@ and they define properties under the `composite` context. The net duration of all compressed spans is equal to the composite spans' `duration`. The gross duration (including "whitespace" between the spans) is equal to `compressed.end - timestamp`. - `exact_match`: A boolean flag indicating whether the - [Consecutive-Same-Kind compression strategy](tracing-spans-compress.md#consecutive-same-kind-compression-strategy) or the - [Consecutive-Exact-Match compression strategy](tracing-spans-compress.md#consecutive-exact-match-compression-strategy) has been applied. + [Consecutive-Same-Kind compression strategy](tracing-spans-compress.md#consecutive-same-kind-compression-strategy) (`false`) or the + [Consecutive-Exact-Match compression strategy](tracing-spans-compress.md#consecutive-exact-match-compression-strategy) (`true`) has been applied. #### Turning compressed spans into a composite span Spans have a `compress` method. -The first time `compress` is called on a regular span, it becomes a composite span. +The first time `compress` is called on a regular span, it becomes a composite span, +incorporating the new span by updating the count and end timestamp. 
```java void compress(Span other, boolean exactMatch) { @@ -123,7 +124,7 @@ APM Server will take `composite.count` into account when tracking span destinati #### Eligibility for compression A span is eligible for compression if all the following conditions are met -- It's an exit span +- It's an [exit span](https://github.com/elastic/apm/blob/master/specs/agents/tracing-spans-destination.md#contextdestinationserviceresource) - The trace context of this span has not been propagated to a downstream service The latter condition is important so that we don't remove (compress) a span that may be the parent of a downstream service. From a81d78f9938894c497b6ff02b1cf9f5491987d20 Mon Sep 17 00:00:00 2001 From: Felix Barnsteiner Date: Wed, 30 Jun 2021 16:13:19 +0200 Subject: [PATCH 08/46] Add breadcrumbs --- specs/agents/tracing-spans-compress.md | 2 ++ specs/agents/tracing-spans-drop-fast.md | 2 ++ specs/agents/tracing-spans-dropped-stats.md | 2 ++ specs/agents/tracing-spans-handling-huge-traces.md | 2 ++ specs/agents/tracing-spans-limit.md | 2 ++ 5 files changed, 10 insertions(+) diff --git a/specs/agents/tracing-spans-compress.md b/specs/agents/tracing-spans-compress.md index e9bc971f..d770c191 100644 --- a/specs/agents/tracing-spans-compress.md +++ b/specs/agents/tracing-spans-compress.md @@ -1,3 +1,5 @@ +[Agent spec home](README.md) > [Handling huge traces](tracing-spans-handling-huge-traces.md) > [Compressing spans](tracing-spans-compress.md) + ## Compressing spans To mitigate the potential flood of spans to a backend, diff --git a/specs/agents/tracing-spans-drop-fast.md b/specs/agents/tracing-spans-drop-fast.md index a8736c2b..02528787 100644 --- a/specs/agents/tracing-spans-drop-fast.md +++ b/specs/agents/tracing-spans-drop-fast.md @@ -1,3 +1,5 @@ +[Agent spec home](README.md) > [Handling huge traces](tracing-spans-handling-huge-traces.md) > [Dropping fast spans](tracing-spans-drop-fast.md) + ## Dropping fast spans TODO 
[`span_min_duration`](https://github.com/elastic/apm/pull/314) is viable again when tracking stats for fast spans [`tracing-spans-dropped-stats.md`](tracing-spans-dropped-stats.md). \ No newline at end of file diff --git a/specs/agents/tracing-spans-dropped-stats.md b/specs/agents/tracing-spans-dropped-stats.md index 0e33dc1f..68768139 100644 --- a/specs/agents/tracing-spans-dropped-stats.md +++ b/specs/agents/tracing-spans-dropped-stats.md @@ -1,3 +1,5 @@ +[Agent spec home](README.md) > [Handling huge traces](tracing-spans-handling-huge-traces.md) > [Collecting statistics about dropped spans](tracing-spans-dropped-stats.md) + ## Collecting statistics about dropped spans To still retain some information about dropped spans (for example due to `transaction_max_spans` or `span_min_duration`), diff --git a/specs/agents/tracing-spans-handling-huge-traces.md b/specs/agents/tracing-spans-handling-huge-traces.md index 5700afab..b29bea85 100644 --- a/specs/agents/tracing-spans-handling-huge-traces.md +++ b/specs/agents/tracing-spans-handling-huge-traces.md @@ -1,3 +1,5 @@ +[Agent spec home](README.md) > [Handling huge traces](tracing-spans-handling-huge-traces.md) + # Handling huge traces Instrumenting applications that make lots of requests (such as 10k+) to backends like caches or databases can lead to several issues: diff --git a/specs/agents/tracing-spans-limit.md b/specs/agents/tracing-spans-limit.md index 9e64b5e1..c1036a34 100644 --- a/specs/agents/tracing-spans-limit.md +++ b/specs/agents/tracing-spans-limit.md @@ -1,3 +1,5 @@ +[Agent spec home](README.md) > [Handling huge traces](tracing-spans-handling-huge-traces.md) > [Hard limit on number of spans to collect](tracing-spans-limit.md) + ## Hard limit on number of spans to collect This is the last line of defense that comes with the highest amount of data loss. 
From af969da2f9793c27d2d86be0d066f2a017017fb7 Mon Sep 17 00:00:00 2001 From: Trent Mick Date: Tue, 29 Jun 2021 11:07:36 -0700 Subject: [PATCH 09/46] Add missing table of contents link to AWS tracing spec file --- specs/agents/README.md | 1 + 1 file changed, 1 insertion(+) diff --git a/specs/agents/README.md b/specs/agents/README.md index 8c9d32c1..415559d8 100644 --- a/specs/agents/README.md +++ b/specs/agents/README.md @@ -49,6 +49,7 @@ You can find details about each of these in the [APM Data Model](https://www.ela - [Distributed tracing](tracing-distributed-tracing.md) - [Tracer API](tracing-api.md) - Instrumentation + - [AWS](tracing-instrumentation-aws.md) - [Databases](tracing-instrumentation-db.md) - [HTTP](tracing-instrumentation-http.md) - [Messaging systems](tracing-instrumentation-messaging.md) From f5c010a157f9b237edc8981ddd2fed42f7ec9ec2 Mon Sep 17 00:00:00 2001 From: eyalkoren <41850454+eyalkoren@users.noreply.github.com> Date: Wed, 30 Jun 2021 06:49:48 +0300 Subject: [PATCH 10/46] Some clarifications for the destination APIs (#452) --- specs/agents/tracing-spans-destination.md | 26 +++++++++++++++++------ 1 file changed, 20 insertions(+), 6 deletions(-) diff --git a/specs/agents/tracing-spans-destination.md b/specs/agents/tracing-spans-destination.md index 7cdfd44e..0cd1e4aa 100644 --- a/specs/agents/tracing-spans-destination.md +++ b/specs/agents/tracing-spans-destination.md @@ -77,8 +77,11 @@ Same cardinality otherwise. **API** -Agents SHOULD offer a public API to set this field so that users can customize the value if the generic mapping is not sufficient. -User-supplied value MUST have the highest precedence, regardless if it was set before or after the automatic setting is invoked. +Agents SHOULD offer a public API to set this field so that users can customize the value if the generic mapping is not +sufficient. 
If set to `null` or an empty value, agents MUST omit the `span.destination.service` field altogether, thus +providing a way to manually disable the automatic setting/inference of this field (e.g. in order to remove a node +from a service map or an external service from the dependencies table). +A user-supplied value MUST have the highest precedence, regardless if it was set before or after the automatic setting is invoked. To allow for automatic inference, without users having to specify any destination field, @@ -87,16 +90,16 @@ This API sets the `exit` flag to `true` and returns `null` or a noop span in cas **Value** -For all exit spans, -agents MUST infer the value of this field based on properties that are set on the span. +For all exit spans, unless the `context.destination.service.resource` field was set by the user to `null` or an empty +string through API, agents MUST infer the value of this field based on properties that are set on the span. This is how to determine whether a span is an exit span: ```groovy exit = exit || context.destination || context.db || context.message || context.http ``` -For each exit span that does not have a value for `context.destination.service.resource`, -agents MUST run this logic to infer the value. +If no value is set to the `context.destination.service.resource` field, the logic for automatically inferring +it MUST be the following: ```groovy if (context.db?.instance) "${subtype ?: type}/${context.db?.instance}" else if (context.message?.queue?.name) "${subtype ?: type}/${context.message.queue.name}" @@ -104,6 +107,9 @@ else if (context.http?.url) "${context.http.url.host}:${context.http. else subtype ?: type ``` +If an agent API was used to set the `context.destination.service.resource` to `null` or an empty string, agents MUST +omit the `context.destination.service` field from the reported span event. 
+ The inference of `context.destination.service.resource` SHOULD be implemented in a central place within the agent, such as an on-span-end-callback or the setter of a dependant property, rather than being implemented for each individual library integration/instrumentation. @@ -158,8 +164,16 @@ ES field: [`destination.address`](https://www.elastic.co/guide/en/ecs/current/ec Address is the destination network address: hostname (e.g. `localhost`), FQDN (e.g. `elastic.co`), IPv4 (e.g. `127.0.0.1`) IPv6 (e.g. `::1`) +Agents MAY offer a public API to set this field so that users can override the automatically discovered one. +This includes the ability to set `null` or empty value in order to unset the automatically-set value. +A user-supplied value MUST have the highest precedence, regardless of whether it was set before or after the automatic setting is invoked. + #### `context.destination.port` ES field: [`destination.port`](https://www.elastic.co/guide/en/ecs/current/ecs-destination.html#_destination_field_details) Port is the destination network port (e.g. 443) + +Agents MAY offer a public API to set this field so that users can override the automatically discovered one. +This includes the ability to set a non-positive value in order to unset the automatically-set value. +A user-supplied value MUST have the highest precedence, regardless of whether it was set before or after the automatic setting is invoked.
From 5916a63ab426562549e35e0819ede2b7ba607c1f Mon Sep 17 00:00:00 2001 From: Felix Barnsteiner Date: Mon, 5 Jul 2021 13:19:43 +0200 Subject: [PATCH 11/46] Add limit to dropped_spans_stats --- specs/agents/tracing-spans-dropped-stats.md | 9 ++++++--- 1 file changed, 6 insertions(+), 3 deletions(-) diff --git a/specs/agents/tracing-spans-dropped-stats.md b/specs/agents/tracing-spans-dropped-stats.md index 68768139..abcd092b 100644 --- a/specs/agents/tracing-spans-dropped-stats.md +++ b/specs/agents/tracing-spans-dropped-stats.md @@ -2,7 +2,7 @@ ## Collecting statistics about dropped spans -To still retain some information about dropped spans (for example due to `transaction_max_spans` or `span_min_duration`), +To still retain some information about dropped spans (for example due to [`transaction_max_spans`](tracing-spans-limit.md) or [`span_min_duration`](tracing-spans-drop-fast.md)), agents SHOULD collect statistics on the corresponding transaction about dropped spans. These statistics MUST only be sent for sampled transactions. @@ -44,11 +44,14 @@ This is an example of the statistics that are added do the `transaction` events ``` ### Limits -TODO: limit the number of `dropped_spans_stats` elements? + +To avoid the structures from growing without bounds (which is only expected in pathological cases), +agents MUST limit the size of the `dropped_spans_stats` to 128 entries per transaction. +Any entries that would exceed the limit are silently dropped. ### Effects on destination service metrics As laid out in the [span destination spec](tracing-spans-destination.md#contextdestinationserviceresource), APM Server tracks span destination metrics. To avoid dropped spans to skew latency metrics and cause throughput metrics to be under-counted, -APM Server will take `dropped_spans_stats` into account when tracking span destination metrics. \ No newline at end of file +APM Server will take `dropped_spans_stats` into account when tracking span destination metrics. 
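The 128-entry limit introduced by this patch can be illustrated with a small sketch. It is not taken from any agent: the class `DroppedSpanStats`, its methods, and the grouping key (destination resource plus outcome) are assumptions made for illustration. Only the limit of 128 entries per transaction and the "silently dropped" behaviour come from the spec text.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical per-transaction container for dropped-span statistics.
// Names and the grouping key are illustrative assumptions, not an agent API.
final class DroppedSpanStats {
    // Limit stated in the spec: at most 128 entries per transaction.
    static final int MAX_ENTRIES = 128;

    // Aggregated values for one key.
    static final class Stats {
        long count;
        long sumUs;
    }

    private final Map<String, Stats> statsByKey = new HashMap<>();

    // Records one dropped span. Returns false when the span would need a new
    // entry but the limit is already reached, in which case it is silently ignored.
    synchronized boolean record(String destinationResource, String outcome, long durationUs) {
        String key = destinationResource + "|" + outcome; // assumed grouping key
        Stats stats = statsByKey.get(key);
        if (stats == null) {
            if (statsByKey.size() >= MAX_ENTRIES) {
                return false;
            }
            stats = new Stats();
            statsByKey.put(key, stats);
        }
        stats.count++;
        stats.sumUs += durationUs;
        return true;
    }

    synchronized int size() {
        return statsByKey.size();
    }

    synchronized long count(String destinationResource, String outcome) {
        Stats stats = statsByKey.get(destinationResource + "|" + outcome);
        return stats == null ? 0 : stats.count;
    }
}
```

In this sketch, spans that map to an existing entry keep being aggregated after the limit is reached. Only spans that would create a 129th entry are ignored, which is one possible reading of "entries that would exceed the limit are silently dropped".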
From ccf43495cf8eb6e642c5c5299defeec56ed2a58f Mon Sep 17 00:00:00 2001 From: Felix Barnsteiner Date: Mon, 5 Jul 2021 13:30:37 +0200 Subject: [PATCH 12/46] Add implementation section to transaction_max_spans --- .../tracing-spans-handling-huge-traces.md | 19 ++++++++--- specs/agents/tracing-spans-limit.md | 34 +++++++++++++++++++ 2 files changed, 48 insertions(+), 5 deletions(-) diff --git a/specs/agents/tracing-spans-handling-huge-traces.md b/specs/agents/tracing-spans-handling-huge-traces.md index b29bea85..b96c4dc9 100644 --- a/specs/agents/tracing-spans-handling-huge-traces.md +++ b/specs/agents/tracing-spans-handling-huge-traces.md @@ -11,11 +11,20 @@ Instrumenting applications that make lots of requests (such as 10k+) to backends - Decreased performance of the Elastic APM UI due to slow searches and rendering of huge traces. - Loss of clarity and overview (--> decreased user experience) in the UI when analyzing the traces. -Agents can implement several strategies to mitigate these issues: -- [Hard limit on number of spans to collect](tracing-spans-limit.md) -- [Collecting statistics about dropped spans](tracing-spans-dropped-stats.md) -- [Dropping fast spans](tracing-spans-drop-fast.md) -- [Compressing spans](tracing-spans-compress.md) +Agents can implement several strategies to mitigate these issues. +These strategies are designed to capture significant information about relevant spans while at the same time limiting the trace to a manageable size. +Applying any of these strategies inevitably lead to a loss of information. +However, they aim to provide a better tradeoff between cost and insight by not capturing or summarizing less relevant data. + +- [Hard limit on number of spans to collect](tracing-spans-limit.md) \ + Even after applying the most advanced strategies, there must always be a hard limit on the number of spans we collect. + This is the last line of defense that comes with the highest amount of data loss. 
+- [Collecting statistics about dropped spans](tracing-spans-dropped-stats.md) \ + Makes sure even if drop spans, we at least have stats about them. +- [Dropping fast spans](tracing-spans-drop-fast.md) \ + If a span was blazingly fast, it's probably not worth the cost to send and store it. +- [Compressing spans](tracing-spans-compress.md) \ + If there are a bunch of very similar spans, we can represent them in a single document - a composite span. In a nutshell, this is how the different settings work in combination: diff --git a/specs/agents/tracing-spans-limit.md b/specs/agents/tracing-spans-limit.md index c1036a34..5553680c 100644 --- a/specs/agents/tracing-spans-limit.md +++ b/specs/agents/tracing-spans-limit.md @@ -19,3 +19,37 @@ Setting an upper limit will prevent overloading the agent and the APM server wit | Type | `integer`| | Default | `500` | | Dynamic | `true` | + +### Implementation + +Before creating a span, +agents must determine whether creating that span would exceed the span limit. +The limit is reached when the total number of created spans minus the number of dropped spans is greater or equals to the max number of spans. +In other words, the limit is reached if this condition is true: + + span_count.total - span_count.dropped >= transaction_max_spans + +The `span_count.total` counter is not part of the intake API, +but it helps agents to determine whether the limit has been reached. +It reflects the total amount of started spans within a transaction. + +To ensure consistent behavior within one transaction, +the `transaction_max_spans` option should be read once on transaction start. +Even if the option is changed via remote config during the lifetime of a transaction, +the value that has been read at the start of the transaction should be used. 
+ +Note that it's not enough to just consider this condition on span start: + + span_count.sent >= transaction_max_spans + +That's because there may be any number of concurrent spans that are started but not yet ended. +While the condition could potentially be evaluated on span end, +it's preferable to know at the start of the span whether the span should be dropped. +The reason being that agents can omit heavy operations, such as capturing a request body. + +### Metric collection + +Even though we can determine whether to drop a span before starting it, it's not legal to return a `null` or noop span in that case. +That's because we're [collecting statistics about dropped spans](tracing-spans-dropped-stats.md) as well as +[breakdown metrics](https://docs.google.com/document/d/1-_LuC9zhmva0VvLgtI0KcHuLzNztPHbcM0ZdlcPUl64#heading=h.ondan294nbpt) +even for spans that exceed `transaction_max_spans`. From b318ae6dc0d754bdbd2e0d3b46a382bc10929c33 Mon Sep 17 00:00:00 2001 From: Felix Barnsteiner Date: Mon, 5 Jul 2021 16:08:13 +0200 Subject: [PATCH 13/46] Move exit span definition from destination spec to span spec --- specs/agents/tracing-spans-destination.md | 12 +----------- specs/agents/tracing-spans.md | 12 ++++++++++++ 2 files changed, 13 insertions(+), 11 deletions(-) diff --git a/specs/agents/tracing-spans-destination.md b/specs/agents/tracing-spans-destination.md index 0cd1e4aa..9176b3f7 100644 --- a/specs/agents/tracing-spans-destination.md +++ b/specs/agents/tracing-spans-destination.md @@ -83,21 +83,11 @@ providing a way to manually disable the automatic setting/inference of this fiel from a service map or an external service from the dependencies table). A user-supplied value MUST have the highest precedence, regardless if it was set before or after the automatic setting is invoked. -To allow for automatic inference, -without users having to specify any destination field, -agents SHOULD offer a dedicated API to start an exit span. 
-This API sets the `exit` flag to `true` and returns `null` or a noop span in case the parent already represents an `exit` span. - **Value** -For all exit spans, unless the `context.destination.service.resource` field was set by the user to `null` or an empty +For all [exit spans](tracing-spans.md#exit-spans), unless the `context.destination.service.resource` field was set by the user to `null` or an empty string through API, agents MUST infer the value of this field based on properties that are set on the span. -This is how to determine whether a span is an exit span: -```groovy -exit = exit || context.destination || context.db || context.message || context.http -``` - If no value is set to the `context.destination.service.resource` field, the logic for automatically inferring it MUST be the following: ```groovy diff --git a/specs/agents/tracing-spans.md b/specs/agents/tracing-spans.md index 9ed7a189..f4fcee63 100644 --- a/specs/agents/tracing-spans.md +++ b/specs/agents/tracing-spans.md @@ -76,6 +76,11 @@ Here's how the limit can be configured for [Node.js](https://www.elastic.co/guid Exit spans are spans that describe a call to an external service, such as an outgoing HTTP request or a call to a database. +A span is considered an exit span if it has explicitly been marked as such or if it has context fields that are indicative of it being an exit span: +```groovy +exit = exit || context.destination || context.db || context.message || context.http +``` + #### Child spans of exit spans Exit spans MUST not have child spans that have a different `type` or `subtype`. @@ -101,6 +106,13 @@ For example, agents MAY implement internal (or even public) APIs to mark a span Agents can then prevent the creation of a child span of a leaf/exit span. This can help to drop nested HTTP spans for instrumented calls that use HTTP as the transport layer (for example Elasticsearch). +#### Exit span API + +Agents SHOULD offer a dedicated API to start an exit span. 
+This API sets the `exit` flag to `true` and returns `null` or a noop span in case the parent already represents an `exit` span. +This helps with the automatic inference of [`context.destination.service.resource`](tracing-spans-destination.md#contextdestinationserviceresource) +without users having to specify any destination field. + #### Context propagation As a general rule, when agents are tracing an exit span where the downstream service is known not to continue the trace, From 7ab424b21e9262b5f148be78a9380cce75c7f64e Mon Sep 17 00:00:00 2001 From: Felix Barnsteiner Date: Mon, 5 Jul 2021 16:31:23 +0200 Subject: [PATCH 14/46] Add exit_span_min_duration spec --- specs/agents/README.md | 2 +- specs/agents/tracing-spans-drop-fast-exit.md | 80 +++++++++++++++++++ specs/agents/tracing-spans-drop-fast.md | 5 -- specs/agents/tracing-spans-dropped-stats.md | 2 +- .../tracing-spans-handling-huge-traces.md | 4 +- 5 files changed, 84 insertions(+), 9 deletions(-) create mode 100644 specs/agents/tracing-spans-drop-fast-exit.md delete mode 100644 specs/agents/tracing-spans-drop-fast.md diff --git a/specs/agents/README.md b/specs/agents/README.md index 415559d8..2da49bde 100644 --- a/specs/agents/README.md +++ b/specs/agents/README.md @@ -43,7 +43,7 @@ You can find details about each of these in the [APM Data Model](https://www.ela - [Handling huge traces](tracing-spans-handling-huge-traces.md) - [Hard limit on number of spans to collect](tracing-spans-limit.md) - [Collecting statistics about dropped spans](tracing-spans-dropped-stats.md) - - [Dropping fast spans](tracing-spans-drop-fast.md) + - [Dropping fast exit spans](tracing-spans-drop-fast-exit.md) - [Compressing spans](tracing-spans-compress.md) - [Sampling](tracing-sampling.md) - [Distributed tracing](tracing-distributed-tracing.md) diff --git a/specs/agents/tracing-spans-drop-fast-exit.md b/specs/agents/tracing-spans-drop-fast-exit.md new file mode 100644 index 00000000..962af5f7 --- /dev/null +++ 
b/specs/agents/tracing-spans-drop-fast-exit.md @@ -0,0 +1,80 @@ +[Agent spec home](README.md) > [Handling huge traces](tracing-spans-handling-huge-traces.md) > [Dropping fast exit spans](tracing-spans-drop-fast-exit.md) + +# Dropping fast exit spans + +If an exit span was really fast, chances are that it's not relevant for analyzing latency issues. +Therefore, agents SHOULD implement the strategy laid out in this section to let users choose the level of detail/cost tradeoff that makes sense for them. + +## `exit_span_min_duration` configuration + +Sets the minimum duration of exit spans. +Exit spans that execute faster than this threshold are attempted to be discarded. + +The attempt fails if they lead up to a span that can't be discarded. +Spans that propagate the trace context to downstream services, +such as outgoing HTTP requests, +can't be discarded. +However, external calls that don't propagate context, +such as calls to a database, can be discarded using this threshold. + +Additionally, spans that lead to an error can't be discarded. + +| | | +|----------------|------------| +| Type | `duration` | +| Default | `1ms` | +| Central config | `true` | + +TODO: should we introduce µs granularity for this config option? +Adding `us` to all `duration`-typed options would create compatibility issues. +So we probably want to support `us` for this option only. + +## Interplay with span compression + +If an agent implements [span compression](tracing-spans-compress.md), +the limit applies to the [composite span](tracing-spans-compress.md#composite-span). + +For example, if 10 Redis calls are compressed into a single composite span whose total duration is lower than `exit_span_min_duration`, +it will be dropped. +If, on the other hand, the individual Redis calls are below the threshold, +but the sum of their durations is above it, the composite span will not be dropped. 
+ +## Limitations + +The limitations are based on the premise that the `parent_id` of each span and transaction that's stored in Elasticsearch +should point to another valid transaction or span that's present in the Elasticsearch index. + +A span that refers to a missing span via its `parent_id` is also known as an "orphaned span". + +### Spans that propagate context to downstream services can't be discarded + +We only know whether to discard after the call has ended. +At that point, +the trace has already continued on the downstream service. +Discarding the span for the external request would orphan the transaction of the downstream call. + +Propagating the trace context to downstream services is also known as out-of-process context propagation. + +## Implementation + +### `discardable` flag + +Spans store an additional `discardable` flag in order to determine whether a span can be discarded. +The default value is `true` for [exit spans](tracing-spans.md#exit-spans) and `false` for any other span. + +According to the [limitations](#Limitations), +there are certain situations where the `discardable` flag of a span is set to `false`: +- When an error is reported for this span +- On out-of-process context propagation + +### Determining whether to report a span + +If the span's duration is less than `exit_span_min_duration` and the span is discardable (`discardable=true`), +the `span_count.dropped` count is incremented, and the span will not be reported. +We're deliberately using the same dropped counter we also use when dropping spans due to [`transaction_max_spans`](tracing-spans-limit.md#configuration-option-transaction_max_spans). +This ensures that a dropped fast span doesn't consume from the max spans limit. + +### Metric collection + +To reduce the data loss, agents [collect statistics about dropped spans](tracing-spans-dropped-stats.md). 
+Dropped spans contribute to [breakdown metrics](https://docs.google.com/document/d/1-_LuC9zhmva0VvLgtI0KcHuLzNztPHbcM0ZdlcPUl64#heading=h.ondan294nbpt) the same way as non-discarded spans. diff --git a/specs/agents/tracing-spans-drop-fast.md b/specs/agents/tracing-spans-drop-fast.md deleted file mode 100644 index 02528787..00000000 --- a/specs/agents/tracing-spans-drop-fast.md +++ /dev/null @@ -1,5 +0,0 @@ -[Agent spec home](README.md) > [Handling huge traces](tracing-spans-handling-huge-traces.md) > [Dropping fast spans](tracing-spans-drop-fast.md) - -## Dropping fast spans - -TODO [`span_min_duration`](https://github.com/elastic/apm/pull/314) is viable again when tracking stats for fast spans [`tracing-spans-dropped-stats.md`](tracing-spans-dropped-stats.md). \ No newline at end of file diff --git a/specs/agents/tracing-spans-dropped-stats.md b/specs/agents/tracing-spans-dropped-stats.md index abcd092b..d2da550a 100644 --- a/specs/agents/tracing-spans-dropped-stats.md +++ b/specs/agents/tracing-spans-dropped-stats.md @@ -2,7 +2,7 @@ ## Collecting statistics about dropped spans -To still retain some information about dropped spans (for example due to [`transaction_max_spans`](tracing-spans-limit.md) or [`span_min_duration`](tracing-spans-drop-fast.md)), +To still retain some information about dropped spans (for example due to [`transaction_max_spans`](tracing-spans-limit.md) or [`exit_span_min_duration`](tracing-spans-drop-fast-exit.md)), agents SHOULD collect statistics on the corresponding transaction about dropped spans. These statistics MUST only be sent for sampled transactions. 
diff --git a/specs/agents/tracing-spans-handling-huge-traces.md b/specs/agents/tracing-spans-handling-huge-traces.md index b96c4dc9..bd7788fa 100644 --- a/specs/agents/tracing-spans-handling-huge-traces.md +++ b/specs/agents/tracing-spans-handling-huge-traces.md @@ -21,7 +21,7 @@ However, they aim to provide a better tradeoff between cost and insight by not c This is the last line of defense that comes with the highest amount of data loss. - [Collecting statistics about dropped spans](tracing-spans-dropped-stats.md) \ Makes sure even if drop spans, we at least have stats about them. -- [Dropping fast spans](tracing-spans-drop-fast.md) \ +- [Dropping fast exit spans](tracing-spans-drop-fast-exit.md) \ If a span was blazingly fast, it's probably not worth the cost to send and store it. - [Compressing spans](tracing-spans-compress.md) \ If there are a bunch of very similar spans, we can represent them in a single document - a composite span. @@ -34,7 +34,7 @@ if (span.transaction.spanCount > transaction_max_spans) { // collect statistics for dropped spans } else if (compression possible) { // apply compression -} else if (span.duration < span_min_duration) { +} else if (span.duration < exit_span_min_duration) { // drop span // collect statistics for dropped spans } else { From bcd4a6dff82bbd2260db7ca00a6718a393026c91 Mon Sep 17 00:00:00 2001 From: Felix Barnsteiner Date: Mon, 5 Jul 2021 16:44:59 +0200 Subject: [PATCH 15/46] Apply suggestions from code review Co-authored-by: Sergey Kleyman --- specs/agents/tracing-spans-dropped-stats.md | 2 +- specs/agents/tracing-spans-handling-huge-traces.md | 4 ++-- 2 files changed, 3 insertions(+), 3 deletions(-) diff --git a/specs/agents/tracing-spans-dropped-stats.md b/specs/agents/tracing-spans-dropped-stats.md index d2da550a..57cb8347 100644 --- a/specs/agents/tracing-spans-dropped-stats.md +++ b/specs/agents/tracing-spans-dropped-stats.md @@ -18,7 +18,7 @@ This also allows the transaction details view (aka. 
waterfall) to show a summary ### Data model -This is an example of the statistics that are added do the `transaction` events sent via the intake v2 protocol. +This is an example of the statistics that are added to the `transaction` events sent via the intake v2 protocol. ```json { diff --git a/specs/agents/tracing-spans-handling-huge-traces.md b/specs/agents/tracing-spans-handling-huge-traces.md index bd7788fa..5442e5bb 100644 --- a/specs/agents/tracing-spans-handling-huge-traces.md +++ b/specs/agents/tracing-spans-handling-huge-traces.md @@ -13,7 +13,7 @@ Instrumenting applications that make lots of requests (such as 10k+) to backends Agents can implement several strategies to mitigate these issues. These strategies are designed to capture significant information about relevant spans while at the same time limiting the trace to a manageable size. -Applying any of these strategies inevitably lead to a loss of information. +Applying any of these strategies inevitably leads to a loss of information. However, they aim to provide a better tradeoff between cost and insight by not capturing or summarizing less relevant data. - [Hard limit on number of spans to collect](tracing-spans-limit.md) \ @@ -40,4 +40,4 @@ if (span.transaction.spanCount > transaction_max_spans) { } else { // report span } -``` \ No newline at end of file +``` From 834ac8b730f231e19c72d3d6a0141705f264c89a Mon Sep 17 00:00:00 2001 From: Felix Barnsteiner Date: Mon, 5 Jul 2021 16:45:57 +0200 Subject: [PATCH 16/46] Fix links, add clarification to max duration --- specs/agents/tracing-spans-compress.md | 7 ++++--- 1 file changed, 4 insertions(+), 3 deletions(-) diff --git a/specs/agents/tracing-spans-compress.md b/specs/agents/tracing-spans-compress.md index d770c191..06ba729d 100644 --- a/specs/agents/tracing-spans-compress.md +++ b/specs/agents/tracing-spans-compress.md @@ -18,7 +18,7 @@ with very little loss of information. 
One of the biggest sources of excessive data collection are n+1 type queries and repetitive requests to a cache server. This strategy detects consecutive spans that hold the same information (except for the duration) -and creates a single [composite span](tracing-spans-compress.md#composite-span). +and creates a single [composite span](#composite-span). ``` [ ] @@ -27,7 +27,7 @@ GET /users 10x SELECT FROM users ``` -Two spans are considered to be an exact match if they are of the [same kind](consecutive-same-kind-compression-strategy) and if their span names are equal: +Two spans are considered to be an exact match if they are of the [same kind](#consecutive-same-kind-compression-strategy) and if their span names are equal: - `type` - `subtype` - `destination.service.resource` @@ -65,6 +65,7 @@ The rest of the context, such as the `db.statement` will be determined by the fi #### Configuration option `same_kind_compression_max_duration` Consecutive spans to the same destination that are under this threshold will be compressed into a single composite span. +This option does not apply to [composite spans](#composite-span). This reduces the collection, processing, and storage overhead, and removes clutter from the UI. The tradeoff is that the statement information will not be collected. @@ -126,7 +127,7 @@ APM Server will take `composite.count` into account when tracking span destinati #### Eligibility for compression A span is eligible for compression if all the following conditions are met -- It's an [exit span](https://github.com/elastic/apm/blob/master/specs/agents/tracing-spans-destination.md#contextdestinationserviceresource) +- It's an [exit span](tracing-spans.md#exit-spans) - The trace context of this span has not been propagated to a downstream service The latter condition is important so that we don't remove (compress) a span that may be the parent of a downstream service. 
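The two eligibility conditions touched by this patch (the span is an exit span, and its trace context has not been propagated downstream) can be sketched as below. The class `SpanSketch` and its fields are hypothetical stand-ins for an agent's span model. The exit-span test mirrors the `exit = exit || context.destination || context.db || context.message || context.http` rule from the span spec, and the `contextPropagated` flag is an assumed way of tracking out-of-process propagation.

```java
// Hypothetical, simplified span model; not the class of any real agent.
final class SpanSketch {
    boolean exit;               // explicitly marked as exit span, e.g. via an exit span API
    Object destination;         // stands in for context.destination
    Object db;                  // stands in for context.db
    Object message;             // stands in for context.message
    Object http;                // stands in for context.http
    boolean contextPropagated;  // assumed flag: trace context was sent to a downstream service

    // A span is an exit span if marked as such or if it carries exit-like context.
    boolean isExit() {
        return exit || destination != null || db != null || message != null || http != null;
    }

    // Eligible only if it is an exit span whose context was not propagated,
    // so that compressing it cannot orphan a downstream transaction.
    boolean isCompressionEligible() {
        return isExit() && !contextPropagated;
    }
}
```

A database span without propagation would be eligible under this sketch, while an outgoing HTTP span that injected trace headers would not.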
From 42663a20893db518eb9fefb0fdfd260e8d188fac Mon Sep 17 00:00:00 2001 From: Felix Barnsteiner Date: Tue, 6 Jul 2021 08:10:51 +0200 Subject: [PATCH 17/46] Dropping fast spans requires stats --- specs/agents/tracing-spans-drop-fast-exit.md | 1 + 1 file changed, 1 insertion(+) diff --git a/specs/agents/tracing-spans-drop-fast-exit.md b/specs/agents/tracing-spans-drop-fast-exit.md index 962af5f7..3e322130 100644 --- a/specs/agents/tracing-spans-drop-fast-exit.md +++ b/specs/agents/tracing-spans-drop-fast-exit.md @@ -4,6 +4,7 @@ If an exit span was really fast, chances are that it's not relevant for analyzing latency issues. Therefore, agents SHOULD implement the strategy laid out in this section to let users choose the level of detail/cost tradeoff that makes sense for them. +If an agent implements this strategy, it MUST also implement [Collecting statistics about dropped spans](tracing-spans-dropped-stats.md). ## `exit_span_min_duration` configuration From 582865139a50e0b14830d5d79e137d79b756ba08 Mon Sep 17 00:00:00 2001 From: Felix Barnsteiner Date: Tue, 6 Jul 2021 09:23:58 +0200 Subject: [PATCH 18/46] Rework transaction_max_spans implementation logic --- specs/agents/tracing-spans-limit.md | 77 +++++++++++++++++++++++------ specs/agents/tracing-spans.md | 15 ------ 2 files changed, 61 insertions(+), 31 deletions(-) diff --git a/specs/agents/tracing-spans-limit.md b/specs/agents/tracing-spans-limit.md index 5553680c..2b4aa8aa 100644 --- a/specs/agents/tracing-spans-limit.md +++ b/specs/agents/tracing-spans-limit.md @@ -22,34 +22,79 @@ Setting an upper limit will prevent overloading the agent and the APM server wit ### Implementation +#### Span count + +When a span is put in the agent's reporter queue, a counter should be incremented on its transaction, in order to later identify the _expected_ number of spans. +In this way we can identify data loss, e.g. because events have been dropped. 
+ +This counter SHOULD internally be named `reported` and MUST be mapped to `span_count.started` in the intake API. +The word `started` is a misnomer but needs to be used for backward compatibility. +The rest of the spec will refer to this field as `span_count.reported`. + +When a span is dropped, it is not reported to the APM Server, +instead another counter is incremented to track the number of spans dropped. +In this case the above mentioned counter for `reported` spans is not incremented. + +```json +"span_count": { + "reported": 500, + "dropped": 42 +} +``` + +The total number of spans that an agent created within a transaction is equal to `span_count.started + span_count.dropped`. + +#### Checking the limit + Before creating a span, agents must determine whether creating that span would exceed the span limit. -The limit is reached when the total number of created spans minus the number of dropped spans is greater or equals to the max number of spans. +The limit is reached when the number of reported spans is greater or equal to the max number of spans. In other words, the limit is reached if this condition is true: - span_count.total - span_count.dropped >= transaction_max_spans - -The `span_count.total` counter is not part of the intake API, -but it helps agents to determine whether the limit has been reached. -It reflects the total amount of started spans within a transaction. + span_count.reported >= transaction_max_spans + +On span end, agents that support the concurrent creation of spans need to check the condition again. +That is because any number of spans may be started before any of them end. +Agents SHOULD guard against race conditions and SHOULD prefer lock-free CAS loops over using locks. 
+ +Example with lock: +```java +boolean report +lock() +report = span_count.reported < transaction_max_spans +if (report) { + span_count.reported++ +} +unlock() +``` + +Example CAS loop: +```java +boolean report +while (true) { + int reported = span_count.reported.atomic_get() + report = reported < transaction_max_spans + if (report && !span_count.reported.compareAndSet(reported, reported + 1)) { + // race condition - retry + continue + } + break +} +``` + +#### Configuration snapshot To ensure consistent behavior within one transaction, the `transaction_max_spans` option should be read once on transaction start. Even if the option is changed via remote config during the lifetime of a transaction, the value that has been read at the start of the transaction should be used. -Note that it's not enough to just consider this condition on span start: - - span_count.sent >= transaction_max_spans - -That's because there may be any number of concurrent spans that are started but not yet ended. -While the condition could potentially be evaluated on span end, -it's preferable to know at the start of the span whether the span should be dropped. -The reason being that agents can omit heavy operations, such as capturing a request body. - -### Metric collection +#### Metric collection Even though we can determine whether to drop a span before starting it, it's not legal to return a `null` or noop span in that case. That's because we're [collecting statistics about dropped spans](tracing-spans-dropped-stats.md) as well as [breakdown metrics](https://docs.google.com/document/d/1-_LuC9zhmva0VvLgtI0KcHuLzNztPHbcM0ZdlcPUl64#heading=h.ondan294nbpt) even for spans that exceed `transaction_max_spans`. + +For spans that are known to be dropped upfront, Agents SHOULD NOT collect information that is expensive to get and not needed for metrics collection. +This includes capturing headers, request bodies, and summarizing SQL statements, for example. 
diff --git a/specs/agents/tracing-spans.md b/specs/agents/tracing-spans.md index f4fcee63..1e8f9482 100644 --- a/specs/agents/tracing-spans.md +++ b/specs/agents/tracing-spans.md @@ -56,21 +56,6 @@ The documentation should clarify that spans with `unknown` outcomes are ignored Spans may have an associated stack trace, in order to locate the associated source code that caused the span to occur. If there are many spans being collected this can cause a significant amount of overhead in the application, due to the capture, rendering, and transmission of potentially large stack traces. It is possible to limit the recording of span stack traces to only spans that are slower than a specified duration, using the config variable `ELASTIC_APM_SPAN_FRAMES_MIN_DURATION`. -### Span count - -When a span is started a counter should be incremented on its transaction, in order to later identify the _expected_ number of spans. In this way we can identify data loss, e.g. because events have been dropped, or because of instrumentation errors. - -To handle edge cases where many spans are captured within a single transaction, the agent should enable the user to start dropping spans when the associated transaction exeeds a configurable number of spans. When a span is dropped, it is not reported to the APM Server, but instead another counter is incremented to track the number of spans dropped. In this case the above mentioned counter for started spans is not incremented. - -```json -"span_count": { - "started": 500, - "dropped": 42 -} -``` - -Here's how the limit can be configured for [Node.js](https://www.elastic.co/guide/en/apm/agent/nodejs/current/agent-api.html#transaction-max-spans) and [Python](https://www.elastic.co/guide/en/apm/agent/python/current/configuration.html#config-transaction-max-spans). 
- ### Exit spans Exit spans are spans that describe a call to an external service, From f260ee54e9479694e739a70394a7e4bd53ceb247 Mon Sep 17 00:00:00 2001 From: Felix Barnsteiner Date: Wed, 7 Jul 2021 09:08:01 +0200 Subject: [PATCH 19/46] Improve transaction_max_spans: no CAS --- specs/agents/tracing-spans-limit.md | 35 +++++++++++------------------ 1 file changed, 13 insertions(+), 22 deletions(-) diff --git a/specs/agents/tracing-spans-limit.md b/specs/agents/tracing-spans-limit.md index 2b4aa8aa..f402f749 100644 --- a/specs/agents/tracing-spans-limit.md +++ b/specs/agents/tracing-spans-limit.md @@ -6,6 +6,8 @@ This is the last line of defense that comes with the highest amount of data loss This strategy MUST be implemented by all agents. Ideally, the other mechanisms limit the amount of spans enough so that the hard limit does not kick in. +Agents SHOULD also [collect statistics about dropped spans](tracing-spans-dropped-stats.md) when implementing this spec. + ### Configuration option `transaction_max_spans` Limits the amount of spans that are recorded per transaction. @@ -47,7 +49,7 @@ The total number of spans that an agent created within a transaction is equal to #### Checking the limit Before creating a span, -agents must determine whether creating that span would exceed the span limit. +agents must determine whether that span would exceed the span limit. The limit is reached when the number of reported spans is greater or equal to the max number of spans. In other words, the limit is reached if this condition is true: @@ -55,32 +57,21 @@ In other words, the limit is reached if this condition is true: On span end, agents that support the concurrent creation of spans need to check the condition again. That is because any number of spans may be started before any of them end. -Agents SHOULD guard against race conditions and SHOULD prefer lock-free CAS loops over using locks. 
-Example with lock: ```java -boolean report -lock() -report = span_count.reported < transaction_max_spans -if (report) { - span_count.reported++ +if (atomic_get(transaction.span_count.eligible_for_reporting) <= transaction_max_spans // optional optimization + && atomic_get_and_increment(transaction.span_count.eligible_for_reporting) <= transaction_max_spans ) { + should_be_reported = true + atomic_increment(transaction.span_count.reported) +} else { + should_be_reported = false + atomic_increment(transaction.span_count.dropped) + transaction.track_dropped_stats(this) } -unlock() ``` -Example CAS loop: -```java -boolean report -while (true) { - int reported = span_count.reported.atomic_get() - report = reported < transaction_max_spans - if (report && !span_count.reported.compareAndSet(reported, reported + 1)) { - // race condition - retry - continue - } - break -} -``` +`eligible_for_reporting` is another counter in the span_count object, but it's not reported to APM Server. +It's similar to `reported` but the value may be higher. #### Configuration snapshot From 1f3cc6b33a4f8ce9d5837665c16f8d9b7374023d Mon Sep 17 00:00:00 2001 From: Felix Barnsteiner Date: Wed, 7 Jul 2021 09:46:03 +0200 Subject: [PATCH 20/46] Apply suggestions from code review Co-authored-by: Sergey Kleyman --- specs/agents/tracing-spans-limit.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/specs/agents/tracing-spans-limit.md b/specs/agents/tracing-spans-limit.md index f402f749..6c8558f1 100644 --- a/specs/agents/tracing-spans-limit.md +++ b/specs/agents/tracing-spans-limit.md @@ -39,7 +39,7 @@ In this case the above mentioned counter for `reported` spans is not incremented ```json "span_count": { - "reported": 500, + "started": 500, "dropped": 42 } ``` @@ -53,7 +53,7 @@ agents must determine whether that span would exceed the span limit. The limit is reached when the number of reported spans is greater or equal to the max number of spans. 
In other words, the limit is reached if this condition is true: - span_count.reported >= transaction_max_spans + atomic_get(transaction.span_count.eligible_for_reporting) >= transaction_max_spans On span end, agents that support the concurrent creation of spans need to check the condition again. That is because any number of spans may be started before any of them end. From bb1bcdeebf3147eff5aeb41dc8a93ca704b05d80 Mon Sep 17 00:00:00 2001 From: Sergey Kleyman Date: Tue, 13 Jul 2021 10:41:03 +0300 Subject: [PATCH 21/46] Update specs/agents/tracing-spans-compress.md --- specs/agents/tracing-spans-compress.md | 13 ++++++------- 1 file changed, 6 insertions(+), 7 deletions(-) diff --git a/specs/agents/tracing-spans-compress.md b/specs/agents/tracing-spans-compress.md index 06ba729d..8885690d 100644 --- a/specs/agents/tracing-spans-compress.md +++ b/specs/agents/tracing-spans-compress.md @@ -153,16 +153,15 @@ A buffered span gets reported when 2. a non-compressible sibling ends ```java -void onSpanEnd() { - if (isCompressionEligible()) { - if (parent.hasBufferedSpan()) { - parent.tryCompress(this) - } else { - parent.buffered = this +void onChildSpanEnd(Span child) { + if (child.isCompressionEligible()) { + if (!tryCompress(child)) { + report(buffered) + buffered = child } } else { report(buffered) - report(this) + report(child) } } ``` From e6b50d2eaac8130be78eb49a4d8013a195fe3704 Mon Sep 17 00:00:00 2001 From: Sergey Kleyman Date: Tue, 13 Jul 2021 10:41:15 +0300 Subject: [PATCH 22/46] Update specs/agents/tracing-spans-compress.md --- specs/agents/tracing-spans-compress.md | 52 ++++++++++++++++++++------ 1 file changed, 41 insertions(+), 11 deletions(-) diff --git a/specs/agents/tracing-spans-compress.md b/specs/agents/tracing-spans-compress.md index 8885690d..112b17be 100644 --- a/specs/agents/tracing-spans-compress.md +++ b/specs/agents/tracing-spans-compress.md @@ -177,20 +177,50 @@ If the spans are of the same kind, and have the same name, we apply the 
[Consecutive-Exact-Match compression strategy](tracing-spans-compress.md#consecutive-exact-match-compression-strategy). ```java -void tryCompress(Span child) { - if (buffered.isSameKind(child)) { +bool tryCompress(Span child) { + if (buffered == null) { + buffered = child + return true + } + + if (!buffered.isSameKind(child)) { + return false + } + + return buffered.isComposite() ? tryCompressWithComposite(child) : tryCompressWithRegular(child); +} + +bool tryCompressWithRegular(Span child) { + if (buffered.name == child.name) { + buffered.composite.exactMatch = true + return true + } + + if (buffered.duration <= same_kind_compression_max_duration && child.duration <= same_kind_compression_max_duration) { + buffered.composite.exactMatch = false + buffered.name = "Calls to $buffered.destination.service.resource" + return true + } + + return false +} + +bool tryCompressWithComposite(Span child) { + if (buffered.composite.exactMatch) { if (buffered.name == child.name) { - buffered.compress(child, exactMatch: true) - return - } else if ( (buffered.duration <= same_kind_compression_max_duration || buffered.composite.count > 1) - && child.duration <= same_kind_compression_max_duration) { - buffered.name = "Calls to $buffered.destination.service.resource" - buffered.compress(child, exactMatch: false) - return + buffered.compress(child) + return true } + + return false + } + + if (child.duration <= same_kind_compression_max_duration) { + buffered.compress(child) + return true } - report(buffered) - buffered = child + + return false } ``` From 9ba8957518ce6ce215e0c6e1d83fb240ed3e0225 Mon Sep 17 00:00:00 2001 From: Sergey Kleyman Date: Tue, 13 Jul 2021 16:48:15 +0300 Subject: [PATCH 23/46] Update specs/agents/tracing-spans-handling-huge-traces.md Co-authored-by: Trent Mick --- specs/agents/tracing-spans-handling-huge-traces.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/specs/agents/tracing-spans-handling-huge-traces.md 
b/specs/agents/tracing-spans-handling-huge-traces.md index 5442e5bb..80bca643 100644 --- a/specs/agents/tracing-spans-handling-huge-traces.md +++ b/specs/agents/tracing-spans-handling-huge-traces.md @@ -20,7 +20,7 @@ However, they aim to provide a better tradeoff between cost and insight by not c Even after applying the most advanced strategies, there must always be a hard limit on the number of spans we collect. This is the last line of defense that comes with the highest amount of data loss. - [Collecting statistics about dropped spans](tracing-spans-dropped-stats.md) \ - Makes sure even if drop spans, we at least have stats about them. + Makes sure even if dropping spans, we at least have stats about them. - [Dropping fast exit spans](tracing-spans-drop-fast-exit.md) \ If a span was blazingly fast, it's probably not worth the cost to send and store it. - [Compressing spans](tracing-spans-compress.md) \ From b20c102f80dc28ab2f1f96be3ae8f335690f4ac3 Mon Sep 17 00:00:00 2001 From: Sergey Kleyman Date: Thu, 15 Jul 2021 08:54:23 +0300 Subject: [PATCH 24/46] Renamed same_kind_compression_max_duration config option to span_compression_same_kind_max_duration --- specs/agents/tracing-spans-compress.md | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/specs/agents/tracing-spans-compress.md b/specs/agents/tracing-spans-compress.md index 112b17be..aef70b30 100644 --- a/specs/agents/tracing-spans-compress.md +++ b/specs/agents/tracing-spans-compress.md @@ -62,7 +62,7 @@ When applying this compression strategy, the `span.name` is set to `Calls to $sp The rest of the context, such as the `db.statement` will be determined by the first compressed span, which is turned into a composite span. -#### Configuration option `same_kind_compression_max_duration` +#### Configuration option `span_compression_same_kind_max_duration` Consecutive spans to the same destination that are under this threshold will be compressed into a single composite span. 
This option does not apply to [composite spans](#composite-span). @@ -170,7 +170,7 @@ void onChildSpanEnd(Span child) { On span end, we compare each [compression-eligible](tracing-spans-compress.md#eligibility-for-compression) span to it's previous sibling. -If the spans are of the same kind but have different span names and the compressions-eligible span's `duration` <= `same_kind_compression_max_duration`, +If the spans are of the same kind but have different span names and the compressions-eligible span's `duration` <= `span_compression_same_kind_max_duration`, we compress them using the [Consecutive-Same-Kind compression strategy](tracing-spans-compress.md#consecutive-same-kind-compression-strategy). If the spans are of the same kind, and have the same name, @@ -196,7 +196,7 @@ bool tryCompressWithRegular(Span child) { return true } - if (buffered.duration <= same_kind_compression_max_duration && child.duration <= same_kind_compression_max_duration) { + if (buffered.duration <= span_compression_same_kind_max_duration && child.duration <= span_compression_same_kind_max_duration) { buffered.composite.exactMatch = false buffered.name = "Calls to $buffered.destination.service.resource" return true @@ -215,7 +215,7 @@ bool tryCompressWithComposite(Span child) { return false } - if (child.duration <= same_kind_compression_max_duration) { + if (child.duration <= span_compression_same_kind_max_duration) { buffered.compress(child) return true } From 51db9498255636893e2290be2d97126991d45aeb Mon Sep 17 00:00:00 2001 From: Sergey Kleyman Date: Thu, 15 Jul 2021 09:07:37 +0300 Subject: [PATCH 25/46] Added span_compression_same_kind_max_duration config option --- specs/agents/tracing-spans-compress.md | 16 ++++++++++++++-- 1 file changed, 14 insertions(+), 2 deletions(-) diff --git a/specs/agents/tracing-spans-compress.md b/specs/agents/tracing-spans-compress.md index aef70b30..99520e49 100644 --- a/specs/agents/tracing-spans-compress.md +++ 
b/specs/agents/tracing-spans-compress.md @@ -33,6 +33,19 @@ Two spans are considered to be an exact match if they are of the [same kind](#co - `destination.service.resource` - `name` +#### Configuration option `span_compression_exact_match_max_duration` + +Consecutive spans that are exact match and that are under this threshold will be compressed into a single composite span. +This option does not apply to [composite spans](#composite-span). +This reduces the collection, processing, and storage overhead, and removes clutter from the UI. +The tradeoff is that the DB statements of all the compressed spans will not be collected. + +| | | +|----------------|----------| +| Type | `duration`| +| Default | `5ms` | +| Dynamic | `true` | + ### Consecutive-Same-Kind compression strategy Another pattern that often occurs is a high amount of alternating queries to the same backend. @@ -61,13 +74,12 @@ boolean isSameKind(Span other) { When applying this compression strategy, the `span.name` is set to `Calls to $span.destination.service.resource`. The rest of the context, such as the `db.statement` will be determined by the first compressed span, which is turned into a composite span. - #### Configuration option `span_compression_same_kind_max_duration` Consecutive spans to the same destination that are under this threshold will be compressed into a single composite span. This option does not apply to [composite spans](#composite-span). This reduces the collection, processing, and storage overhead, and removes clutter from the UI. -The tradeoff is that the statement information will not be collected. +The tradeoff is that the DB statements of all the compressed spans will not be collected. 
| | | |----------------|----------| From 473bb4d9c600b30b5c48d8fcd78b454dd43edd97 Mon Sep 17 00:00:00 2001 From: Sergey Kleyman Date: Thu, 15 Jul 2021 09:13:38 +0300 Subject: [PATCH 26/46] Added span_compression_enabled config option --- specs/agents/tracing-spans-compress.md | 12 ++++++++++++ 1 file changed, 12 insertions(+) diff --git a/specs/agents/tracing-spans-compress.md b/specs/agents/tracing-spans-compress.md index 99520e49..8a9f5d73 100644 --- a/specs/agents/tracing-spans-compress.md +++ b/specs/agents/tracing-spans-compress.md @@ -13,6 +13,18 @@ with very little loss of information. - Potential to re-use span objects, significantly reducing allocations - Downstream effects like reducing impact on APM Server, ES storage, and UI performance +#### Configuration option `span_compression_enabled` + +Setting this option to true will enable span compression feature. +Span compression reduces the collection, processing, and storage overhead, and removes clutter from the UI. +The tradeoff is that some information such as DB statements of all the compressed spans will not be collected. 
+ +| | | +|----------------|----------| +| Type | `boolean`| +| Default | `false` | +| Dynamic | `true` | + ### Consecutive-Exact-Match compression strategy From 00dcfa8eb602d346454c288634e6eec3f77de7e8 Mon Sep 17 00:00:00 2001 From: Sergey Kleyman Date: Thu, 15 Jul 2021 09:15:27 +0300 Subject: [PATCH 27/46] Update specs/agents/tracing-spans-compress.md Co-authored-by: eyalkoren <41850454+eyalkoren@users.noreply.github.com> --- specs/agents/tracing-spans-compress.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/specs/agents/tracing-spans-compress.md b/specs/agents/tracing-spans-compress.md index 8a9f5d73..76b61a1c 100644 --- a/specs/agents/tracing-spans-compress.md +++ b/specs/agents/tracing-spans-compress.md @@ -7,7 +7,7 @@ agents SHOULD implement the strategies laid out in this section to avoid sending While compressing multiple similar spans into a single composite span can't fully eliminate the collection overhead, it can significantly reduce the impact on the following areas, -with very little loss of information. +with very little loss of information: - Agent reporter queue utilization - Capturing stack traces, serialization, compression, and sending events to APM Server - Potential to re-use span objects, significantly reducing allocations From ccf2aa4cafeab6e97e94877ea1a12d4ba42dcb22 Mon Sep 17 00:00:00 2001 From: Sergey Kleyman Date: Thu, 15 Jul 2021 09:26:45 +0300 Subject: [PATCH 28/46] Changed end to sum.us in composite sub-object --- specs/agents/tracing-spans-compress.md | 7 +++---- 1 file changed, 3 insertions(+), 4 deletions(-) diff --git a/specs/agents/tracing-spans-compress.md b/specs/agents/tracing-spans-compress.md index 8a9f5d73..aa685eae 100644 --- a/specs/agents/tracing-spans-compress.md +++ b/specs/agents/tracing-spans-compress.md @@ -110,13 +110,12 @@ The `timestamp` and `duration` have slightly similar semantics, and they define properties under the `composite` context. 
- `timestamp`: The start timestamp of the first span. -- `duration`: The sum of durations of all spans. +- `duration`: gross duration (i.e., the end timestamp of the last compressed span minus the start timestamp of the first compressed span). - `composite` - `count`: The number of compressed spans this composite span represents. The minimum count is 2 as a composite span represents at least two spans. - - `end`: The end timestamp of the last compressed span. - The net duration of all compressed spans is equal to the composite spans' `duration`. - The gross duration (including "whitespace" between the spans) is equal to `compressed.end - timestamp`. + - `sum.us`: sum of durations of all compressed spans this composite span represents in microseconds. + Thus `sum.us` is the net duration of all the compressed spans while `duration` is the gross duration (including "whitespace" between the spans). - `exact_match`: A boolean flag indicating whether the [Consecutive-Same-Kind compression strategy](tracing-spans-compress.md#consecutive-same-kind-compression-strategy) (`false`) or the [Consecutive-Exact-Match compression strategy](tracing-spans-compress.md#consecutive-exact-match-compression-strategy) (`true`) has been applied. From f711c073b6d6a1b479b52e5101b7c895e5566495 Mon Sep 17 00:00:00 2001 From: Sergey Kleyman Date: Thu, 15 Jul 2021 10:54:59 +0300 Subject: [PATCH 29/46] Replaced exact_match bool with compression_strategy enum --- specs/agents/tracing-spans-compress.md | 136 +++++++++++++------------ 1 file changed, 73 insertions(+), 63 deletions(-) diff --git a/specs/agents/tracing-spans-compress.md b/specs/agents/tracing-spans-compress.md index 8395783b..824c627c 100644 --- a/specs/agents/tracing-spans-compress.md +++ b/specs/agents/tracing-spans-compress.md @@ -116,27 +116,9 @@ and they define properties under the `composite` context.
Thus `sum.us` is the net duration of all the compressed spans while `duration` is the gross duration (including "whitespace" between the spans). - - `exact_match`: A boolean flag indicating whether the - [Consecutive-Same-Kind compression strategy](tracing-spans-compress.md#consecutive-same-kind-compression-strategy) (`false`) or the - [Consecutive-Exact-Match compression strategy](tracing-spans-compress.md#consecutive-exact-match-compression-strategy) (`true`) has been applied. - -#### Turning compressed spans into a composite span - -Spans have a `compress` method. -The first time `compress` is called on a regular span, it becomes a composite span, -incorporating the new span by updating the count and end timestamp. - -```java -void compress(Span other, boolean exactMatch) { - if (compressed.count == 0) { - compressed.count = 2 - } else { - compressedCount++ - } - compressed.exactMatch = compressed.exactMatch && exactMatch - endTimestamp = max(endTimestamp, other.endTimestamp) -} -``` + - `compression_strategy`: A string value indicating which compression strategy was used. The valid values are: + - `exact_match` - [Consecutive-Exact-Match compression strategy](tracing-spans-compress.md#consecutive-exact-match-compression-strategy) + - `same_kind` - [Consecutive-Same-Kind compression strategy](tracing-spans-compress.md#consecutive-same-kind-compression-strategy) #### Effects on metric processing @@ -176,75 +158,103 @@ A buffered span gets reported when 2. 
a non-compressible sibling ends ```java -void onChildSpanEnd(Span child) { - if (child.isCompressionEligible()) { - if (!tryCompress(child)) { +void onEnd() { + if (buffered != null) { + report(buffered) + } +} + +void onChildEnd(Span child) { + if (!child.isCompressionEligible()) { + if (buffered != null) { report(buffered) - buffered = child + buffered = null } - } else { - report(buffered) report(child) + return + } + + if (buffered == null) { + buffered = child + return + } + + if (!buffered.tryToCompress(child)) { + report(buffered) + buffered = child } } ``` -#### Compression - -On span end, we compare each [compression-eligible](tracing-spans-compress.md#eligibility-for-compression) span to it's previous sibling. +#### Turning compressed spans into a composite span -If the spans are of the same kind but have different span names and the compressions-eligible span's `duration` <= `span_compression_same_kind_max_duration`, -we compress them using the [Consecutive-Same-Kind compression strategy](tracing-spans-compress.md#consecutive-same-kind-compression-strategy). +Spans have a `tryToCompress` method that is called on a span buffered by its parent. +On the first call, the span checks whether it can be compressed with the given sibling and selects the best compression strategy. +Note that the compression strategy is selected only once, based on the first two spans of the sequence. +The compression strategy cannot be changed by the rest of the spans in the sequence. +So when the current sibling span cannot be added to the ongoing sequence under the selected compression strategy, +the ongoing sequence is terminated: it is sent out as a composite span and the current sibling span is buffered.
-If the spans are of the same kind, and have the same name, +If the spans are of the same kind, have the same name, and both spans' `duration` <= `span_compression_exact_match_max_duration`, we apply the [Consecutive-Exact-Match compression strategy](tracing-spans-compress.md#consecutive-exact-match-compression-strategy). +Note that if the spans are an _exact match_ +but the duration threshold requirement is not satisfied, we just stop the compression sequence. +In particular, this means that the implementation should not proceed to try the _same kind_ strategy. +Otherwise, the user would have to lower both `span_compression_exact_match_max_duration` and `span_compression_same_kind_max_duration` +to prevent longer _exact match_ spans from being compressed. -```java -bool tryCompress(Span child) { - if (buffered == null) { - buffered = child - return true - } +If the spans are of the same kind but have different span names and both spans' `duration` <= `span_compression_same_kind_max_duration`, +we compress them using the [Consecutive-Same-Kind compression strategy](tracing-spans-compress.md#consecutive-same-kind-compression-strategy). - if (!buffered.isSameKind(child)) { +```java +bool tryToCompress(Span sibling) { + isAlreadyComposite = composite != null + canBeCompressed = isAlreadyComposite ? tryToCompressComposite(sibling) : tryToCompressRegular(sibling) + if (!canBeCompressed) { return false } - - return buffered.isComposite() ? 
tryCompressWithComposite(child) : tryCompressWithRegular(child); -} - -bool tryCompressWithRegular(Span child) { - if (buffered.name == child.name) { - buffered.composite.exactMatch = true - return true + + if (!isAlreadyComposite) { + composite.count = 1 + composite.sumUs = duration } + + ++composite.count + composite.sumUs += sibling.duration + return true +} - if (buffered.duration <= span_compression_same_kind_max_duration && child.duration <= span_compression_same_kind_max_duration) { - buffered.composite.exactMatch = false - buffered.name = "Calls to $buffered.destination.service.resource" - return true +bool tryToCompressRegular(Span sibling) { + if (!isSameKind(sibling)) { + return false } - return false -} - -bool tryCompressWithComposite(Span child) { - if (buffered.composite.exactMatch) { - if (buffered.name == child.name) { - buffered.compress(child) + if (name == sibling.name) { + if (duration <= span_compression_exact_match_max_duration && sibling.duration <= span_compression_exact_match_max_duration) { + composite.compressionStrategy = "exact_match" return true } - return false } - if (child.duration <= span_compression_same_kind_max_duration) { - buffered.compress(child) + if (duration <= span_compression_same_kind_max_duration && sibling.duration <= span_compression_same_kind_max_duration) { + composite.compressionStrategy = "same_kind" + name = "Calls to " + destination.service.resource return true } - + return false } + +bool tryToCompressComposite(Span sibling) { + switch (composite.compressionStrategy) { + case "exact_match": + return name == sibling.name && sibling.duration <= span_compression_exact_match_max_duration + + case "same_kind": + return sibling.duration <= span_compression_same_kind_max_duration + } +} ``` #### Concurrency From df344a25a7ea285095c18d51c776c6f3d6b42efa Mon Sep 17 00:00:00 2001 From: Sergey Kleyman Date: Thu, 15 Jul 2021 10:56:44 +0300 Subject: [PATCH 30/46] Update specs/agents/tracing-spans-compress.md Co-authored-by: 
eyalkoren <41850454+eyalkoren@users.noreply.github.com> --- specs/agents/tracing-spans-compress.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/specs/agents/tracing-spans-compress.md b/specs/agents/tracing-spans-compress.md index 824c627c..6c8c5c05 100644 --- a/specs/agents/tracing-spans-compress.md +++ b/specs/agents/tracing-spans-compress.md @@ -266,7 +266,7 @@ However, if there can be a situation where multiple spans may end concurrently, To do that, agents should prefer [lock-free algorithms](https://en.wikipedia.org/wiki/Non-blocking_algorithm) paired with retry loops over blocking algorithms that use mutexes or locks. -In particular, operations that work with the buffer require special attention. +In particular, operations that work with the buffer require special attention: - Setting a span into the buffer must be handled atomically. - Retrieving a span from the buffer must be handled atomically. Retrieving includes atomically getting and clearing the buffer. From 98a5bd97c4f5e43032b17836b80f98177cb13b8f Mon Sep 17 00:00:00 2001 From: Sergey Kleyman Date: Thu, 15 Jul 2021 11:34:34 +0300 Subject: [PATCH 31/46] Added outcome requirement to eligible for compression --- specs/agents/tracing-spans-compress.md | 2 ++ specs/agents/tracing-spans-drop-fast-exit.md | 3 ++- 2 files changed, 4 insertions(+), 1 deletion(-) diff --git a/specs/agents/tracing-spans-compress.md b/specs/agents/tracing-spans-compress.md index 6c8c5c05..22b30aa7 100644 --- a/specs/agents/tracing-spans-compress.md +++ b/specs/agents/tracing-spans-compress.md @@ -134,6 +134,8 @@ APM Server will take `composite.count` into account when tracking span destinati A span is eligible for compression if all the following conditions are met - It's an [exit span](tracing-spans.md#exit-spans) - The trace context of this span has not been propagated to a downstream service +- If the span has `outcome` (i.e., `outcome` is present and it's not `null`) then it should be `success`. 
+ It means spans with outcome indicating an issue of potential interest should not be compressed. The latter condition is important so that we don't remove (compress) a span that may be the parent of a downstream service. This would orphan the sub-graph started by the downstream service and cause it to not appear in the waterfall view. diff --git a/specs/agents/tracing-spans-drop-fast-exit.md b/specs/agents/tracing-spans-drop-fast-exit.md index 3e322130..616a30b4 100644 --- a/specs/agents/tracing-spans-drop-fast-exit.md +++ b/specs/agents/tracing-spans-drop-fast-exit.md @@ -65,7 +65,8 @@ The default value is `true` for [exit spans](tracing-spans.md#exit-spans) and `f According to the [limitations](#Limitations), there are certain situations where the `discardable` flag of a span is set to `false`: -- When an error is reported for this span +- the span has `outcome` (i.e., `outcome` is present and it's not `null`) `outcome` is not `success`. + So spans with outcome indicating an issue of potential interest are not discardable - On out-of-process context propagation ### Determining whether to report a span From ef501a3f306b3cbf38cb64e4ce77abdd58a774c0 Mon Sep 17 00:00:00 2001 From: Sergey Kleyman Date: Thu, 15 Jul 2021 11:39:06 +0300 Subject: [PATCH 32/46] Added outcome requirement to eligible for compression PART 2 Updated isCompressionEligible() pseudo-code --- specs/agents/tracing-spans-compress.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/specs/agents/tracing-spans-compress.md b/specs/agents/tracing-spans-compress.md index 22b30aa7..8792d084 100644 --- a/specs/agents/tracing-spans-compress.md +++ b/specs/agents/tracing-spans-compress.md @@ -142,7 +142,7 @@ This would orphan the sub-graph started by the downstream service and cause it t ```java boolean isCompressionEligible() { - return exit && !context.hasPropagated + return exit && !context.hasPropagated && (outcome == null || outcome == "success") } ``` From 
798d270ce938f13a2cc0c2b3d5db164e13433a3f Mon Sep 17 00:00:00 2001 From: Sergey Kleyman Date: Thu, 15 Jul 2021 11:45:41 +0300 Subject: [PATCH 33/46] Added links from tracing-spans.md to tracing-spans-compress.md --- specs/agents/tracing-spans.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/specs/agents/tracing-spans.md b/specs/agents/tracing-spans.md index 1e8f9482..29c75337 100644 --- a/specs/agents/tracing-spans.md +++ b/specs/agents/tracing-spans.md @@ -83,7 +83,7 @@ For example, an HTTP exit span may have child spans with the `action` `request`, These spans MUST NOT have any destination context, so that there's no effect on destination metrics. Most agents would want to treat exit spans as leaf spans, though. -This brings the benefit of being able to compress repetitive exit spans (TODO link to span compression spec once available), +This brings the benefit of being able to [compress](tracing-spans-compress.md) repetitive exit spans, as span compression is only applicable to leaf spans. Agents MAY implement mechanisms to prevent the creation of child spans of exit spans. @@ -108,7 +108,7 @@ However, when tracing a regular outgoing HTTP request (one that's not initiated and it's unknown whether the downsteam service continues the trace, the trace headers should be added. -The reason is that spans cannot be compressed (TODO link to span compression spec once available) if the context has been propagated, as it may lead to orphaned transactions. +The reason is that spans cannot be [compressed](tracing-spans-compress.md) if the context has been propagated, as it may lead to orphaned transactions. That means that the `parent.id` of a transaction may refer to a span that's not available because it has been compressed (merged with another span). There can, however, be exceptions to this rule whenever it makes sense. For example, if it's known that the backend system can continue the trace. 
From 3754297a1be5be35af8dfc5f2bc58e3d0804ea72 Mon Sep 17 00:00:00 2001 From: Sergey Kleyman Date: Thu, 15 Jul 2021 11:51:42 +0300 Subject: [PATCH 34/46] Fixed missing isSameKind check in tryToCompressComposite() --- specs/agents/tracing-spans-compress.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/specs/agents/tracing-spans-compress.md b/specs/agents/tracing-spans-compress.md index 8792d084..8f80170c 100644 --- a/specs/agents/tracing-spans-compress.md +++ b/specs/agents/tracing-spans-compress.md @@ -251,10 +251,10 @@ bool tryToCompressRegular(Span sibling) { bool tryToCompressComposite(Span sibling) { switch (composite.compressionStrategy) { case "exact_match": - return name == sibling.name && sibling.duration <= span_compression_exact_match_max_duration + return isSameKind(sibling) && name == sibling.name && sibling.duration <= span_compression_exact_match_max_duration case "same_kind": - return sibling.duration <= span_compression_same_kind_max_duration + return isSameKind(sibling) && sibling.duration <= span_compression_same_kind_max_duration } } ``` From 4df5afb8fa54355a296f94f79bf44880d46c4c6d Mon Sep 17 00:00:00 2001 From: Sergey Kleyman Date: Mon, 19 Jul 2021 06:53:42 +0300 Subject: [PATCH 35/46] Update specs/agents/tracing-spans-drop-fast-exit.md Co-authored-by: Alexander Wert --- specs/agents/tracing-spans-drop-fast-exit.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/specs/agents/tracing-spans-drop-fast-exit.md b/specs/agents/tracing-spans-drop-fast-exit.md index 616a30b4..9b8b3e90 100644 --- a/specs/agents/tracing-spans-drop-fast-exit.md +++ b/specs/agents/tracing-spans-drop-fast-exit.md @@ -11,7 +11,7 @@ If an agent implements this strategy, it MUST also implement [Collecting statist Sets the minimum duration of exit spans. Exit spans that execute faster than this threshold are attempted to be discarded. -The attempt fails if they lead up to a span that can't be discarded. 
+In some cases exit spans cannot be discarded. Spans that propagate the trace context to downstream services, such as outgoing HTTP requests, can't be discarded. From 0be6c902c183319786318b1c98ecc93eb5cd97a8 Mon Sep 17 00:00:00 2001 From: Sergey Kleyman Date: Mon, 19 Jul 2021 06:54:16 +0300 Subject: [PATCH 36/46] Update specs/agents/tracing-spans-compress.md Co-authored-by: Alexander Wert --- specs/agents/tracing-spans-compress.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/specs/agents/tracing-spans-compress.md b/specs/agents/tracing-spans-compress.md index 8f80170c..f05170d5 100644 --- a/specs/agents/tracing-spans-compress.md +++ b/specs/agents/tracing-spans-compress.md @@ -137,7 +137,7 @@ A span is eligible for compression if all the following conditions are met - If the span has `outcome` (i.e., `outcome` is present and it's not `null`) then it should be `success`. It means spans with outcome indicating an issue of potential interest should not be compressed. -The latter condition is important so that we don't remove (compress) a span that may be the parent of a downstream service. +The second condition is important so that we don't remove (compress) a span that may be the parent of a downstream service. This would orphan the sub-graph started by the downstream service and cause it to not appear in the waterfall view. 
```java From 44c3936baa01a0b571641a3d7edd44c88e7fbff8 Mon Sep 17 00:00:00 2001 From: Sergey Kleyman Date: Mon, 19 Jul 2021 06:56:38 +0300 Subject: [PATCH 37/46] Update specs/agents/tracing-spans-drop-fast-exit.md Co-authored-by: Alexander Wert --- specs/agents/tracing-spans-drop-fast-exit.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/specs/agents/tracing-spans-drop-fast-exit.md b/specs/agents/tracing-spans-drop-fast-exit.md index 9b8b3e90..6156621c 100644 --- a/specs/agents/tracing-spans-drop-fast-exit.md +++ b/specs/agents/tracing-spans-drop-fast-exit.md @@ -12,7 +12,7 @@ Sets the minimum duration of exit spans. Exit spans that execute faster than this threshold are attempted to be discarded. In some cases exit spans cannot be discarded. -Spans that propagate the trace context to downstream services, +For example, spans that propagate the trace context to downstream services, such as outgoing HTTP requests, can't be discarded. However, external calls that don't propagate context, From 182d610a42a20015a858901511b4e527f19003c0 Mon Sep 17 00:00:00 2001 From: Sergey Kleyman Date: Mon, 19 Jul 2021 06:56:50 +0300 Subject: [PATCH 38/46] Update specs/agents/tracing-spans-compress.md Co-authored-by: Alexander Wert --- specs/agents/tracing-spans-compress.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/specs/agents/tracing-spans-compress.md b/specs/agents/tracing-spans-compress.md index f05170d5..87b8c2d9 100644 --- a/specs/agents/tracing-spans-compress.md +++ b/specs/agents/tracing-spans-compress.md @@ -132,7 +132,7 @@ APM Server will take `composite.count` into account when tracking span destinati #### Eligibility for compression A span is eligible for compression if all the following conditions are met -- It's an [exit span](tracing-spans.md#exit-spans) +1. 
It's an [exit span](tracing-spans.md#exit-spans) - The trace context of this span has not been propagated to a downstream service - If the span has `outcome` (i.e., `outcome` is present and it's not `null`) then it should be `success`. It means spans with outcome indicating an issue of potential interest should not be compressed. From a7d728bccb1b86237a5ff4d7a58c3c883c8cd68c Mon Sep 17 00:00:00 2001 From: Sergey Kleyman Date: Mon, 19 Jul 2021 06:57:06 +0300 Subject: [PATCH 39/46] Update specs/agents/tracing-spans-compress.md Co-authored-by: Alexander Wert --- specs/agents/tracing-spans-compress.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/specs/agents/tracing-spans-compress.md b/specs/agents/tracing-spans-compress.md index 87b8c2d9..26625a5f 100644 --- a/specs/agents/tracing-spans-compress.md +++ b/specs/agents/tracing-spans-compress.md @@ -133,7 +133,7 @@ APM Server will take `composite.count` into account when tracking span destinati A span is eligible for compression if all the following conditions are met 1. It's an [exit span](tracing-spans.md#exit-spans) -- The trace context of this span has not been propagated to a downstream service +2. The trace context of this span has not been propagated to a downstream service - If the span has `outcome` (i.e., `outcome` is present and it's not `null`) then it should be `success`. It means spans with outcome indicating an issue of potential interest should not be compressed. 
From 6b364364e3f6e6bcb0c802c6b6698936da5c3bae Mon Sep 17 00:00:00 2001 From: Sergey Kleyman Date: Mon, 19 Jul 2021 06:57:14 +0300 Subject: [PATCH 40/46] Update specs/agents/tracing-spans-compress.md Co-authored-by: Alexander Wert --- specs/agents/tracing-spans-compress.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/specs/agents/tracing-spans-compress.md b/specs/agents/tracing-spans-compress.md index 26625a5f..734474e8 100644 --- a/specs/agents/tracing-spans-compress.md +++ b/specs/agents/tracing-spans-compress.md @@ -134,7 +134,7 @@ APM Server will take `composite.count` into account when tracking span destinati A span is eligible for compression if all the following conditions are met 1. It's an [exit span](tracing-spans.md#exit-spans) 2. The trace context of this span has not been propagated to a downstream service -- If the span has `outcome` (i.e., `outcome` is present and it's not `null`) then it should be `success`. +3. If the span has `outcome` (i.e., `outcome` is present and it's not `null`) then it should be `success`. It means spans with outcome indicating an issue of potential interest should not be compressed. The second condition is important so that we don't remove (compress) a span that may be the parent of a downstream service. From 2a54365eb3f91237fd7ee17ba0f7b393fad78f97 Mon Sep 17 00:00:00 2001 From: Sergey Kleyman Date: Mon, 19 Jul 2021 08:02:21 +0300 Subject: [PATCH 41/46] Removed "Exit span API" requirement from tracing-spans.md --- specs/agents/tracing-spans.md | 7 ------- 1 file changed, 7 deletions(-) diff --git a/specs/agents/tracing-spans.md b/specs/agents/tracing-spans.md index 29c75337..e25ce8d7 100644 --- a/specs/agents/tracing-spans.md +++ b/specs/agents/tracing-spans.md @@ -91,13 +91,6 @@ For example, agents MAY implement internal (or even public) APIs to mark a span Agents can then prevent the creation of a child span of a leaf/exit span. 
This can help to drop nested HTTP spans for instrumented calls that use HTTP as the transport layer (for example Elasticsearch). -#### Exit span API - -Agents SHOULD offer a dedicated API to start an exit span. -This API sets the `exit` flag to `true` and returns `null` or a noop span in case the parent already represents an `exit` span. -This helps with the automatic inference of [`context.destination.service.resource`](tracing-spans-destination.md#contextdestinationserviceresource) -without users having to specify any destination field. - #### Context propagation As a general rule, when agents are tracing an exit span where the downstream service is known not to continue the trace, From 990463ef55b67df9512aaebbed1a7646a51050cd Mon Sep 17 00:00:00 2001 From: Alexander Wert Date: Mon, 19 Jul 2021 10:16:44 +0200 Subject: [PATCH 42/46] Update specs/agents/tracing-spans-drop-fast-exit.md Co-authored-by: eyalkoren <41850454+eyalkoren@users.noreply.github.com> --- specs/agents/tracing-spans-drop-fast-exit.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/specs/agents/tracing-spans-drop-fast-exit.md b/specs/agents/tracing-spans-drop-fast-exit.md index 6156621c..22e6ce1e 100644 --- a/specs/agents/tracing-spans-drop-fast-exit.md +++ b/specs/agents/tracing-spans-drop-fast-exit.md @@ -65,7 +65,7 @@ The default value is `true` for [exit spans](tracing-spans.md#exit-spans) and `f According to the [limitations](#Limitations), there are certain situations where the `discardable` flag of a span is set to `false`: -- the span has `outcome` (i.e., `outcome` is present and it's not `null`) `outcome` is not `success`. +- the span's `outcome` field is set to anything other than `success`. 
So spans with outcome indicating an issue of potential interest are not discardable - On out-of-process context propagation From a8e1e91e7f0fad9f96cc3f2a5313e8ea8f766d3b Mon Sep 17 00:00:00 2001 From: Alexander Wert Date: Mon, 19 Jul 2021 12:50:52 +0200 Subject: [PATCH 43/46] refactored file structure for handling huge traces --- specs/agents/README.md | 10 +++---- .../README.md} | 2 -- .../tracing-spans-compress.md | 30 +++++++++---------- .../tracing-spans-drop-fast-exit.md | 2 -- .../tracing-spans-dropped-stats.md | 12 ++++---- .../tracing-spans-limit.md | 16 +++++----- 6 files changed, 31 insertions(+), 41 deletions(-) rename specs/agents/{tracing-spans-handling-huge-traces.md => handling-huge-traces/README.md} (96%) rename specs/agents/{ => handling-huge-traces}/tracing-spans-compress.md (94%) rename specs/agents/{ => handling-huge-traces}/tracing-spans-drop-fast-exit.md (96%) rename specs/agents/{ => handling-huge-traces}/tracing-spans-dropped-stats.md (86%) rename specs/agents/{ => handling-huge-traces}/tracing-spans-limit.md (91%) diff --git a/specs/agents/README.md b/specs/agents/README.md index 2da49bde..23a397af 100644 --- a/specs/agents/README.md +++ b/specs/agents/README.md @@ -40,11 +40,11 @@ You can find details about each of these in the [APM Data Model](https://www.ela - [Transactions](tracing-transactions.md) - [Spans](tracing-spans.md) - [Span destination](tracing-spans-destination.md) - - [Handling huge traces](tracing-spans-handling-huge-traces.md) - - [Hard limit on number of spans to collect](tracing-spans-limit.md) - - [Collecting statistics about dropped spans](tracing-spans-dropped-stats.md) - - [Dropping fast exit spans](tracing-spans-drop-fast-exit.md) - - [Compressing spans](tracing-spans-compress.md) + - [Handling huge traces](handling-huge-traces/tracing-spans-handling-huge-traces.md) + - [Hard limit on number of spans to collect](handling-huge-traces/tracing-spans-limit.md) + - [Collecting statistics about dropped
spans](handling-huge-traces/tracing-spans-dropped-stats.md) + - [Dropping fast exit spans](handling-huge-traces/tracing-spans-drop-fast-exit.md) + - [Compressing spans](handling-huge-traces/tracing-spans-compress.md) - [Sampling](tracing-sampling.md) - [Distributed tracing](tracing-distributed-tracing.md) - [Tracer API](tracing-api.md) diff --git a/specs/agents/tracing-spans-handling-huge-traces.md b/specs/agents/handling-huge-traces/README.md similarity index 96% rename from specs/agents/tracing-spans-handling-huge-traces.md rename to specs/agents/handling-huge-traces/README.md index 80bca643..297e18bc 100644 --- a/specs/agents/tracing-spans-handling-huge-traces.md +++ b/specs/agents/handling-huge-traces/README.md @@ -1,5 +1,3 @@ -[Agent spec home](README.md) > [Handling huge traces](tracing-spans-handling-huge-traces.md) - # Handling huge traces Instrumenting applications that make lots of requests (such as 10k+) to backends like caches or databases can lead to several issues: diff --git a/specs/agents/tracing-spans-compress.md b/specs/agents/handling-huge-traces/tracing-spans-compress.md similarity index 94% rename from specs/agents/tracing-spans-compress.md rename to specs/agents/handling-huge-traces/tracing-spans-compress.md index 734474e8..782edaab 100644 --- a/specs/agents/tracing-spans-compress.md +++ b/specs/agents/handling-huge-traces/tracing-spans-compress.md @@ -1,6 +1,4 @@ -[Agent spec home](README.md) > [Handling huge traces](tracing-spans-handling-huge-traces.md) > [Compressing spans](tracing-spans-compress.md) - -## Compressing spans +# Compressing spans To mitigate the potential flood of spans to a backend, agents SHOULD implement the strategies laid out in this section to avoid sending almost identical and very similar spans. 
@@ -13,7 +11,7 @@ with very little loss of information: - Potential to re-use span objects, significantly reducing allocations - Downstream effects like reducing impact on APM Server, ES storage, and UI performance -#### Configuration option `span_compression_enabled` +### Configuration option `span_compression_enabled` Setting this option to true will enable span compression feature. Span compression reduces the collection, processing, and storage overhead, and removes clutter from the UI. @@ -26,7 +24,7 @@ The tradeoff is that some information such as DB statements of all the compresse | Dynamic | `true` | -### Consecutive-Exact-Match compression strategy +## Consecutive-Exact-Match compression strategy One of the biggest sources of excessive data collection are n+1 type queries and repetitive requests to a cache server. This strategy detects consecutive spans that hold the same information (except for the duration) @@ -45,7 +43,7 @@ Two spans are considered to be an exact match if they are of the [same kind](#co - `destination.service.resource` - `name` -#### Configuration option `span_compression_exact_match_max_duration` +### Configuration option `span_compression_exact_match_max_duration` Consecutive spans that are exact match and that are under this threshold will be compressed into a single composite span. This option does not apply to [composite spans](#composite-span). @@ -58,7 +56,7 @@ The tradeoff is that the DB statements of all the compressed spans will not be c | Default | `5ms` | | Dynamic | `true` | -### Consecutive-Same-Kind compression strategy +## Consecutive-Same-Kind compression strategy Another pattern that often occurs is a high amount of alternating queries to the same backend. Especially if the individual spans are quite fast, recording every single query is likely to not be worth the overhead. 
@@ -86,7 +84,7 @@ boolean isSameKind(Span other) { When applying this compression strategy, the `span.name` is set to `Calls to $span.destination.service.resource`. The rest of the context, such as the `db.statement` will be determined by the first compressed span, which is turned into a composite span. -#### Configuration option `span_compression_same_kind_max_duration` +### Configuration option `span_compression_same_kind_max_duration` Consecutive spans to the same destination that are under this threshold will be compressed into a single composite span. This option does not apply to [composite spans](#composite-span). @@ -99,12 +97,12 @@ The tradeoff is that the DB statements of all the compressed spans will not be c | Default | `5ms` | | Dynamic | `true` | -### Composite span +## Composite span Compressed spans don't have a physical span document. Instead, multiple compressed spans are represented by a composite span. -#### Data model +### Data model The `timestamp` and `duration` have slightly similar semantics, and they define properties under the `composite` context. @@ -120,16 +118,16 @@ and they define properties under the `composite` context. - `exact_match` - [Consecutive-Exact-Match compression strategy](tracing-spans-compress.md#consecutive-exact-match-compression-strategy) - `same_kind` - [Consecutive-Same-Kind compression strategy](tracing-spans-compress.md#consecutive-same-kind-compression-strategy) -#### Effects on metric processing +### Effects on metric processing As laid out in the [span destination spec](tracing-spans-destination.md#contextdestinationserviceresource), APM Server tracks span destination metrics. To avoid compressed spans to skew latency metrics and cause throughput metrics to be under-counted, APM Server will take `composite.count` into account when tracking span destination metrics. 
-### Compression algorithm +## Compression algorithm -#### Eligibility for compression +### Eligibility for compression A span is eligible for compression if all the following conditions are met 1. It's an [exit span](tracing-spans.md#exit-spans) @@ -146,7 +144,7 @@ boolean isCompressionEligible() { } ``` -#### Span buffering +### Span buffering Non-compression-eligible spans may be reported immediately after they have ended. When a compression-eligible span ends, it does not immediately get reported. @@ -188,7 +186,7 @@ void onChildEnd(Span child) { } ``` -#### Turning compressed spans into a composite span +### Turning compressed spans into a composite span Spans have `tryToCompress` method that is called on a span buffered by its parent. On the first call the span checks if it can be compressed with the given sibling and it selects the best compression strategy. @@ -259,7 +257,7 @@ bool tryToCompressComposite(Span sibling) { } ``` -#### Concurrency +### Concurrency The pseudo-code in this spec is intentionally not written in a thread-safe manner to make it more concise. Also, thread safety is highly platform/runtime dependent, and some don't support parallelism or concurrency. diff --git a/specs/agents/tracing-spans-drop-fast-exit.md b/specs/agents/handling-huge-traces/tracing-spans-drop-fast-exit.md similarity index 96% rename from specs/agents/tracing-spans-drop-fast-exit.md rename to specs/agents/handling-huge-traces/tracing-spans-drop-fast-exit.md index 22e6ce1e..c75d7b6f 100644 --- a/specs/agents/tracing-spans-drop-fast-exit.md +++ b/specs/agents/handling-huge-traces/tracing-spans-drop-fast-exit.md @@ -1,5 +1,3 @@ -[Agent spec home](README.md) > [Handling huge traces](tracing-spans-handling-huge-traces.md) > [Dropping fast exit spans](tracing-spans-drop-fast-exit.md) - # Dropping fast exit spans If an exit span was really fast, chances are that it's not relevant for analyzing latency issues. 
diff --git a/specs/agents/tracing-spans-dropped-stats.md b/specs/agents/handling-huge-traces/tracing-spans-dropped-stats.md similarity index 86% rename from specs/agents/tracing-spans-dropped-stats.md rename to specs/agents/handling-huge-traces/tracing-spans-dropped-stats.md index 57cb8347..1fa2cf80 100644 --- a/specs/agents/tracing-spans-dropped-stats.md +++ b/specs/agents/handling-huge-traces/tracing-spans-dropped-stats.md @@ -1,12 +1,10 @@ -[Agent spec home](README.md) > [Handling huge traces](tracing-spans-handling-huge-traces.md) > [Collecting statistics about dropped spans](tracing-spans-dropped-stats.md) - -## Collecting statistics about dropped spans +# Collecting statistics about dropped spans To still retain some information about dropped spans (for example due to [`transaction_max_spans`](tracing-spans-limit.md) or [`exit_span_min_duration`](tracing-spans-drop-fast-exit.md)), agents SHOULD collect statistics on the corresponding transaction about dropped spans. These statistics MUST only be sent for sampled transactions. -### Use cases +## Use cases This allows APM Server to consider these metrics for the service destination metrics. In practice, @@ -16,7 +14,7 @@ even if most of the spans are dropped. This also allows the transaction details view (aka. waterfall) to show a summary of the dropped spans. -### Data model +## Data model This is an example of the statistics that are added to the `transaction` events sent via the intake v2 protocol. @@ -43,13 +41,13 @@ This is an example of the statistics that are added to the `transaction` events } ``` -### Limits +## Limits To avoid the structures from growing without bounds (which is only expected in pathological cases), agents MUST limit the size of the `dropped_spans_stats` to 128 entries per transaction. Any entries that would exceed the limit are silently dropped. 
-### Effects on destination service metrics +## Effects on destination service metrics As laid out in the [span destination spec](tracing-spans-destination.md#contextdestinationserviceresource), APM Server tracks span destination metrics. diff --git a/specs/agents/tracing-spans-limit.md b/specs/agents/handling-huge-traces/tracing-spans-limit.md similarity index 91% rename from specs/agents/tracing-spans-limit.md rename to specs/agents/handling-huge-traces/tracing-spans-limit.md index 6c8558f1..42bccffb 100644 --- a/specs/agents/tracing-spans-limit.md +++ b/specs/agents/handling-huge-traces/tracing-spans-limit.md @@ -1,6 +1,4 @@ -[Agent spec home](README.md) > [Handling huge traces](tracing-spans-handling-huge-traces.md) > [Hard limit on number of spans to collect](tracing-spans-limit.md) - -## Hard limit on number of spans to collect +# Hard limit on number of spans to collect This is the last line of defense that comes with the highest amount of data loss. This strategy MUST be implemented by all agents. @@ -8,7 +6,7 @@ Ideally, the other mechanisms limit the amount of spans enough so that the hard Agents SHOULD also [collect statistics about dropped spans](tracing-spans-dropped-stats.md) when implementing this spec. -### Configuration option `transaction_max_spans` +## Configuration option `transaction_max_spans` Limits the amount of spans that are recorded per transaction. @@ -22,9 +20,9 @@ Setting an upper limit will prevent overloading the agent and the APM server wit | Default | `500` | | Dynamic | `true` | -### Implementation +## Implementation -#### Span count +### Span count When a span is put in the agent's reporter queue, a counter should be incremented on its transaction, in order to later identify the _expected_ number of spans. In this way we can identify data loss, e.g. because events have been dropped. 
@@ -46,7 +44,7 @@ In this case the above mentioned counter for `reported` spans is not incremented The total number of spans that an agent created within a transaction is equal to `span_count.started + span_count.dropped`. -#### Checking the limit +### Checking the limit Before creating a span, agents must determine whether that span would exceed the span limit. @@ -73,14 +71,14 @@ if (atomic_get(transaction.span_count.eligible_for_reporting) <= transaction_max `eligible_for_reporting` is another counter in the span_count object, but it's not reported to APM Server. It's similar to `reported` but the value may be higher. -#### Configuration snapshot +### Configuration snapshot To ensure consistent behavior within one transaction, the `transaction_max_spans` option should be read once on transaction start. Even if the option is changed via remote config during the lifetime of a transaction, the value that has been read at the start of the transaction should be used. -#### Metric collection +### Metric collection Even though we can determine whether to drop a span before starting it, it's not legal to return a `null` or noop span in that case. 
That's because we're [collecting statistics about dropped spans](tracing-spans-dropped-stats.md) as well as From 48b08c919720f8f9268a45c5ecfbe11023678fb0 Mon Sep 17 00:00:00 2001 From: Alexander Wert Date: Mon, 19 Jul 2021 13:57:34 +0200 Subject: [PATCH 44/46] Update specs/agents/tracing-spans-destination.md --- specs/agents/tracing-spans-destination.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/specs/agents/tracing-spans-destination.md b/specs/agents/tracing-spans-destination.md index c6034761..1336b70a 100644 --- a/specs/agents/tracing-spans-destination.md +++ b/specs/agents/tracing-spans-destination.md @@ -85,7 +85,7 @@ A user-supplied value MUST have the highest precedence, regardless if it was set **Value** -For all [exit spans](tracing-spans.md#exit-spans), unless the `context.destination.service.resource` field was set by the user to `null` or an empty +For all [exit spans](handling-huge-traces/tracing-spans.md#exit-spans), unless the `context.destination.service.resource` field was set by the user to `null` or an empty string through API, agents MUST infer the value of this field based on properties that are set on the span. If no value is set to the `context.destination.service.resource` field, the logic for automatically inferring From 971c96f379ed164d8244b186c05b4636579e3737 Mon Sep 17 00:00:00 2001 From: Alexander Wert Date: Mon, 19 Jul 2021 13:57:42 +0200 Subject: [PATCH 45/46] Update specs/agents/tracing-spans.md --- specs/agents/tracing-spans.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/specs/agents/tracing-spans.md b/specs/agents/tracing-spans.md index e25ce8d7..2a3faa0a 100644 --- a/specs/agents/tracing-spans.md +++ b/specs/agents/tracing-spans.md @@ -83,7 +83,7 @@ For example, an HTTP exit span may have child spans with the `action` `request`, These spans MUST NOT have any destination context, so that there's no effect on destination metrics. 
Most agents would want to treat exit spans as leaf spans, though. -This brings the benefit of being able to [compress](tracing-spans-compress.md) repetitive exit spans, +This brings the benefit of being able to [compress](handling-huge-traces/tracing-spans-compress.md) repetitive exit spans, as span compression is only applicable to leaf spans. Agents MAY implement mechanisms to prevent the creation of child spans of exit spans. From 68215017f3a669661ea504cb325d6b7acd6fb120 Mon Sep 17 00:00:00 2001 From: Alexander Wert Date: Mon, 19 Jul 2021 13:57:49 +0200 Subject: [PATCH 46/46] Update specs/agents/tracing-spans.md --- specs/agents/tracing-spans.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/specs/agents/tracing-spans.md b/specs/agents/tracing-spans.md index 2a3faa0a..0b2fb31a 100644 --- a/specs/agents/tracing-spans.md +++ b/specs/agents/tracing-spans.md @@ -101,7 +101,7 @@ However, when tracing a regular outgoing HTTP request (one that's not initiated and it's unknown whether the downsteam service continues the trace, the trace headers should be added. -The reason is that spans cannot be [compressed](tracing-spans-compress.md) if the context has been propagated, as it may lead to orphaned transactions. +The reason is that spans cannot be [compressed](handling-huge-traces/tracing-spans-compress.md) if the context has been propagated, as it may lead to orphaned transactions. That means that the `parent.id` of a transaction may refer to a span that's not available because it has been compressed (merged with another span). There can, however, be exceptions to this rule whenever it makes sense. For example, if it's known that the backend system can continue the trace.
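
The rule restated in the hunk above (a span cannot be compressed once its trace context has been propagated) is one of the three eligibility conditions listed in `tracing-spans-compress.md`. The following is a minimal sketch of how those conditions might combine, not the agents' actual implementation. It assumes a simplified `Span` class with hypothetical fields `exit`, `contextPropagated`, and `outcome`, plus a hypothetical `Outcome` enum; none of these names are taken from a real agent.

```java
public class SpanCompressionSketch {

    // Hypothetical outcome values; a null outcome means "not set".
    public enum Outcome { SUCCESS, FAILURE, UNKNOWN }

    // Simplified, illustrative span: only the fields the eligibility check needs.
    public static final class Span {
        final boolean exit;               // true for exit spans
        final boolean contextPropagated;  // true once trace headers were sent downstream
        final Outcome outcome;            // may be null

        public Span(boolean exit, boolean contextPropagated, Outcome outcome) {
            this.exit = exit;
            this.contextPropagated = contextPropagated;
            this.outcome = outcome;
        }

        // Eligible only if all three conditions from the spec text hold:
        // 1. exit span, 2. context not propagated, 3. outcome unset or success.
        public boolean isCompressionEligible() {
            return exit
                && !contextPropagated
                && (outcome == null || outcome == Outcome.SUCCESS);
        }
    }

    public static void main(String[] args) {
        Span cacheCall = new Span(true, false, Outcome.SUCCESS);
        Span httpCall = new Span(true, true, Outcome.SUCCESS);
        System.out.println(cacheCall.isCompressionEligible()); // true
        System.out.println(httpCall.isCompressionEligible());  // false: context was propagated
    }
}
```

Under these assumptions, the outgoing HTTP call is kept as its own span because a downstream transaction may reference it as `parent.id`, while the cache call remains a candidate for compression.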