Handling huge tracing specs #453

Merged
merged 50 commits into from
Jul 19, 2021
Changes from 21 commits
2bd13cf
First draft of handling huge tracing specs
felixbarny Jun 21, 2021
72d384d
Apply suggestions from code review
felixbarny Jun 22, 2021
411f529
Implement suggestions
felixbarny Jun 22, 2021
fd3d879
Update specs/agents/tracing-spans-compress.md
felixbarny Jun 22, 2021
42ad300
Pseudo code for how the strategies work in combination
felixbarny Jun 22, 2021
ae09511
Add composite.exact_match flag
felixbarny Jun 22, 2021
db17364
Apply suggestions from code review
felixbarny Jun 24, 2021
a81d78f
Add breadcrumbs
felixbarny Jun 30, 2021
af969da
Add missing table of contents link to AWS tracing spec file
trentm Jun 29, 2021
f5c010a
Some clarifications for the destination APIs (#452)
eyalkoren Jun 30, 2021
5916a63
Add limit to dropped_spans_stats
felixbarny Jul 5, 2021
ccf4349
Add implementation section to transaction_max_spans
felixbarny Jul 5, 2021
9790529
Merge remote-tracking branch 'origin/master' into compressed-spans
felixbarny Jul 5, 2021
b318ae6
Move exit span definition from destination spec to span spec
felixbarny Jul 5, 2021
7ab424b
Add exit_span_min_duration spec
felixbarny Jul 5, 2021
bcd4a6d
Apply suggestions from code review
felixbarny Jul 5, 2021
834ac8b
Fix links, add clarification to max duration
felixbarny Jul 5, 2021
42663a2
Dropping fast spans requires stats
felixbarny Jul 6, 2021
5828651
Rework transaction_max_spans implementation logic
felixbarny Jul 6, 2021
f260ee5
Improve transaction_max_spans: no CAS
felixbarny Jul 7, 2021
1f3cc6b
Apply suggestions from code review
felixbarny Jul 7, 2021
bb1bcde
Update specs/agents/tracing-spans-compress.md
SergeyKleyman Jul 13, 2021
e6b50d2
Update specs/agents/tracing-spans-compress.md
SergeyKleyman Jul 13, 2021
9ba8957
Update specs/agents/tracing-spans-handling-huge-traces.md
SergeyKleyman Jul 13, 2021
b20c102
Renamed same_kind_compression_max_duration config option
SergeyKleyman Jul 15, 2021
51db949
Added span_compression_same_kind_max_duration config option
SergeyKleyman Jul 15, 2021
473bb4d
Added span_compression_enabled config option
SergeyKleyman Jul 15, 2021
00dcfa8
Update specs/agents/tracing-spans-compress.md
SergeyKleyman Jul 15, 2021
ccf2aa4
Changed end to sum.us in composite sub-object
SergeyKleyman Jul 15, 2021
a046548
Merge remote-tracking branch 'felixbarny/compressed-spans' into compr…
SergeyKleyman Jul 15, 2021
f711c07
Replaced exact_match bool with compression_strategy enum
SergeyKleyman Jul 15, 2021
df344a2
Update specs/agents/tracing-spans-compress.md
SergeyKleyman Jul 15, 2021
98a5bd9
Added outcome requirement to eligible for compression
SergeyKleyman Jul 15, 2021
ef501a3
Added outcome requirement to eligible for compression PART 2
SergeyKleyman Jul 15, 2021
798d270
Added links from tracing-spans.md to tracing-spans-compress.md
SergeyKleyman Jul 15, 2021
3754297
Fixed missing isSameKind check in tryToCompressComposite()
SergeyKleyman Jul 15, 2021
4df5afb
Update specs/agents/tracing-spans-drop-fast-exit.md
SergeyKleyman Jul 19, 2021
0be6c90
Update specs/agents/tracing-spans-compress.md
SergeyKleyman Jul 19, 2021
44c3936
Update specs/agents/tracing-spans-drop-fast-exit.md
SergeyKleyman Jul 19, 2021
182d610
Update specs/agents/tracing-spans-compress.md
SergeyKleyman Jul 19, 2021
a7d728b
Update specs/agents/tracing-spans-compress.md
SergeyKleyman Jul 19, 2021
6b36436
Update specs/agents/tracing-spans-compress.md
SergeyKleyman Jul 19, 2021
2a54365
Removed "Exit span API" requirement from tracing-spans.md
SergeyKleyman Jul 19, 2021
d3a4453
Merge remote-tracking branch 'felixbarny/compressed-spans' into compr…
SergeyKleyman Jul 19, 2021
990463e
Update specs/agents/tracing-spans-drop-fast-exit.md
AlexanderWert Jul 19, 2021
a8e1e91
reafctored file structure for handling huge traces
AlexanderWert Jul 19, 2021
916d1fa
Merge commit 'b338fe9e1539180b05ce57ac0cfb8f3c18aa9b88'
AlexanderWert Jul 19, 2021
48b08c9
Update specs/agents/tracing-spans-destination.md
AlexanderWert Jul 19, 2021
971c96f
Update specs/agents/tracing-spans.md
AlexanderWert Jul 19, 2021
6821501
Update specs/agents/tracing-spans.md
AlexanderWert Jul 19, 2021
5 changes: 5 additions & 0 deletions specs/agents/README.md
Expand Up @@ -40,6 +40,11 @@ You can find details about each of these in the [APM Data Model](https://www.ela
- [Transactions](tracing-transactions.md)
- [Spans](tracing-spans.md)
- [Span destination](tracing-spans-destination.md)
- [Handling huge traces](tracing-spans-handling-huge-traces.md)
- [Hard limit on number of spans to collect](tracing-spans-limit.md)
- [Collecting statistics about dropped spans](tracing-spans-dropped-stats.md)
- [Dropping fast exit spans](tracing-spans-drop-fast-exit.md)
- [Compressing spans](tracing-spans-compress.md)
- [Sampling](tracing-sampling.md)
- [Distributed tracing](tracing-distributed-tracing.md)
- [Tracer API](tracing-api.md)
211 changes: 211 additions & 0 deletions specs/agents/tracing-spans-compress.md
@@ -0,0 +1,211 @@
[Agent spec home](README.md) > [Handling huge traces](tracing-spans-handling-huge-traces.md) > [Compressing spans](tracing-spans-compress.md)

## Compressing spans

To mitigate the potential flood of spans to a backend,
agents SHOULD implement the strategies laid out in this section to avoid sending almost identical and very similar spans.

While compressing multiple similar spans into a single composite span can't fully eliminate the collection overhead,
it can significantly reduce the impact on the following areas,
with very little loss of information.
- Agent reporter queue utilization
- Capturing stack traces, serialization, compression, and sending events to APM Server
- Potential to re-use span objects, significantly reducing allocations
- Downstream effects like reducing impact on APM Server, ES storage, and UI performance


### Consecutive-Exact-Match compression strategy

One of the biggest sources of excessive data collection is n+1 type queries and repetitive requests to a cache server.
This strategy detects consecutive spans that hold the same information (except for the duration)
and creates a single [composite span](#composite-span).

```
[ ]
GET /users
[] [] [] [] [] [] [] [] [] []
10x SELECT FROM users
```

Two spans are considered to be an exact match if they are of the [same kind](#consecutive-same-kind-compression-strategy) and their span names are equal, that is, if all of the following properties are equal:
- `type`
- `subtype`
- `destination.service.resource`
- `name`
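
As a minimal sketch of this comparison, assuming a hypothetical `SpanInfo` holder (not part of any agent API) whose fields mirror the four properties above, a null-safe check could look like this:

```java
import java.util.Objects;

// Hypothetical holder for the properties compared by the compression strategies.
final class SpanInfo {
    final String type;
    final String subtype;
    final String resource; // stands in for destination.service.resource
    final String name;

    SpanInfo(String type, String subtype, String resource, String name) {
        this.type = type;
        this.subtype = subtype;
        this.resource = resource;
        this.name = name;
    }

    // Same kind: type, subtype and destination resource are equal.
    boolean isSameKind(SpanInfo other) {
        return Objects.equals(type, other.type)
            && Objects.equals(subtype, other.subtype)
            && Objects.equals(resource, other.resource);
    }

    // Exact match: same kind and, additionally, the same span name.
    boolean isExactMatch(SpanInfo other) {
        return isSameKind(other) && Objects.equals(name, other.name);
    }
}
```

`Objects.equals` is used here so that unset (`null`) properties compare equal to each other instead of throwing.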

### Consecutive-Same-Kind compression strategy

Another pattern that often occurs is a high number of alternating queries to the same backend.
Especially if the individual spans are quite fast, recording every single query is likely to not be worth the overhead.

```
[ ]
GET /users
[] [] [] [] [] [] [] [] [] []
10x Calls to mysql
```

Two spans are considered to be of the same kind if the following properties are equal:
- `type`
- `subtype`
- `destination.service.resource`

```java
boolean isSameKind(Span other) {
    return type == other.type
        && subtype == other.subtype
        && destination.service.resource == other.destination.service.resource
}
```

When applying this compression strategy, the `span.name` is set to `Calls to $span.destination.service.resource`.
The rest of the context, such as the `db.statement`, will be determined by the first compressed span, which is turned into a composite span.


#### Configuration option `same_kind_compression_max_duration`

Consecutive spans to the same destination that are under this threshold will be compressed into a single composite span.
This option does not apply to [composite spans](#composite-span).
This reduces the collection, processing, and storage overhead, and removes clutter from the UI.
The tradeoff is that the statement information will not be collected.

|                |            |
|----------------|------------|
| Type           | `duration` |
| Default        | `5ms`      |
| Dynamic        | `true`     |

### Composite span

Compressed spans don't have a physical span document.
Instead, multiple compressed spans are represented by a composite span.

#### Data model

For composite spans, the `timestamp` and `duration` have slightly different semantics than for regular spans,
and additional properties are defined under the `composite` context.

- `timestamp`: The start timestamp of the first span.
- `duration`: The sum of durations of all spans.
- `composite`
    - `count`: The number of compressed spans this composite span represents.
      The minimum count is 2 as a composite span represents at least two spans.
    - `end`: The end timestamp of the last compressed span.
      The net duration of all compressed spans is equal to the composite span's `duration`.
      The gross duration (including "whitespace" between the spans) is equal to `composite.end - timestamp`.
    - `exact_match`: A boolean flag indicating whether the
      [Consecutive-Same-Kind compression strategy](tracing-spans-compress.md#consecutive-same-kind-compression-strategy) (`false`) or the
      [Consecutive-Exact-Match compression strategy](tracing-spans-compress.md#consecutive-exact-match-compression-strategy) (`true`) has been applied.
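
To make the relationship between these fields concrete, here is a small sketch that derives them from a list of spans. The `TimedSpan` and `CompositeSummary` types and the microsecond units are assumptions made for illustration only, and the sketch assumes the spans are passed in start order.

```java
import java.util.List;

// Hypothetical value type: start timestamp and duration, both in microseconds.
final class TimedSpan {
    final long timestampUs;
    final long durationUs;

    TimedSpan(long timestampUs, long durationUs) {
        this.timestampUs = timestampUs;
        this.durationUs = durationUs;
    }

    long endUs() {
        return timestampUs + durationUs;
    }
}

// Hypothetical summary of the composite span fields described above.
final class CompositeSummary {
    final long timestampUs; // start timestamp of the first compressed span
    final long durationUs;  // net duration: sum of all span durations
    final int count;        // number of compressed spans (at least 2)
    final long endUs;       // end timestamp of the last compressed span

    CompositeSummary(List<TimedSpan> spans) {
        if (spans.size() < 2) {
            throw new IllegalArgumentException("a composite span represents at least two spans");
        }
        long sum = 0;
        long end = Long.MIN_VALUE;
        for (TimedSpan span : spans) {
            sum += span.durationUs;
            end = Math.max(end, span.endUs());
        }
        this.timestampUs = spans.get(0).timestampUs;
        this.durationUs = sum;
        this.count = spans.size();
        this.endUs = end;
    }

    // Gross duration, including the "whitespace" between the spans.
    long grossDurationUs() {
        return endUs - timestampUs;
    }
}
```

For three spans starting at 0, 150 and 300 µs with durations of 100, 100 and 50 µs, the net duration is 250 µs while the gross duration is 350 µs.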

#### Turning compressed spans into a composite span

Spans have a `compress` method.
The first time `compress` is called on a regular span, it becomes a composite span,
incorporating the new span by updating the count and end timestamp.

```java
void compress(Span other, boolean exactMatch) {
    if (composite.count == 0) {
        // first compression: this regular span becomes a composite span
        composite.count = 2
        composite.exactMatch = exactMatch
    } else {
        composite.count++
        composite.exactMatch = composite.exactMatch && exactMatch
    }
    endTimestamp = max(endTimestamp, other.endTimestamp)
}
```

#### Effects on metric processing

As laid out in the [span destination spec](tracing-spans-destination.md#contextdestinationserviceresource),
APM Server tracks span destination metrics.
To prevent compressed spans from skewing latency metrics and causing throughput metrics to be under-counted,
APM Server will take `composite.count` into account when tracking span destination metrics.
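
The following sketch illustrates why the count matters. It is not APM Server's actual implementation; `DestinationMetrics` and its fields are hypothetical names, and durations are assumed to be in microseconds.

```java
// Hypothetical accumulator for one destination; not APM Server's actual code.
final class DestinationMetrics {
    long callCount;     // throughput: number of calls represented
    long durationSumUs; // latency: total time spent in those calls

    // compositeCount is 1 for a regular span and composite.count for a composite span.
    void record(long durationUs, int compositeCount) {
        callCount += compositeCount;
        durationSumUs += durationUs;
    }

    double averageLatencyUs() {
        return callCount == 0 ? 0.0 : (double) durationSumUs / callCount;
    }
}
```

If a composite span representing 10 calls were counted as a single call, the throughput would be under-counted by 9 and the average latency would be inflated accordingly.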

### Compression algorithm

#### Eligibility for compression

A span is eligible for compression if all the following conditions are met:
- It's an [exit span](tracing-spans.md#exit-spans)
- The trace context of this span has not been propagated to a downstream service

The latter condition is important so that we don't remove (compress) a span that may be the parent of a downstream service.
This would orphan the sub-graph started by the downstream service and cause it to not appear in the waterfall view.

```java
boolean isCompressionEligible() {
    return exit && !context.hasPropagated
}
```

#### Span buffering

Non-compression-eligible spans may be reported immediately after they have ended.
When a compression-eligible span ends, it does not immediately get reported.
Instead, the span is buffered within its parent.
A span/transaction can buffer at most one child span.

Span buffering allows agents to "look back" one span when determining whether a given span should be compressed.

A buffered span gets reported when
1. its parent ends
2. a non-compressible sibling ends

```java
void onSpanEnd() {
    if (isCompressionEligible()) {
        if (parent.hasBufferedSpan()) {
            parent.tryCompress(this)
        } else {
            parent.buffered = this
        }
    } else {
        report(buffered)
        report(this)
    }
}
```

#### Compression

On span end, we compare each [compression-eligible](tracing-spans-compress.md#eligibility-for-compression) span to its previous sibling.

If the spans are of the same kind but have different span names and the compression-eligible span's `duration` <= `same_kind_compression_max_duration`,
we compress them using the [Consecutive-Same-Kind compression strategy](tracing-spans-compress.md#consecutive-same-kind-compression-strategy).

If the spans are of the same kind, and have the same name,
we apply the [Consecutive-Exact-Match compression strategy](tracing-spans-compress.md#consecutive-exact-match-compression-strategy).

```java
void tryCompress(Span child) {
    if (buffered.isSameKind(child)) {
        if (buffered.name == child.name) {
            buffered.compress(child, exactMatch: true)
            return
        } else if ( (buffered.duration <= same_kind_compression_max_duration || buffered.composite.count > 1)
                   && child.duration <= same_kind_compression_max_duration) {
            buffered.name = "Calls to $buffered.destination.service.resource"
            buffered.compress(child, exactMatch: false)
            return
        }
    }
    report(buffered)
    buffered = child
}
```

#### Concurrency

The pseudo-code in this spec is intentionally not written in a thread-safe manner to make it more concise.
Also, thread safety is highly platform/runtime dependent, and some don't support parallelism or concurrency.

However, if there can be a situation where multiple spans may end concurrently, agents MUST guard against race conditions.
To do that, agents should prefer [lock-free algorithms](https://en.wikipedia.org/wiki/Non-blocking_algorithm)
paired with retry loops over blocking algorithms that use mutexes or locks.

In particular, operations that work with the buffer require special attention.
- Setting a span into the buffer must be handled atomically.
- Retrieving a span from the buffer must be handled atomically.
  Retrieving includes atomically getting and clearing the buffer.
  This makes sure that only one thread can compare span properties and call mutating methods, such as `compress`, at a time.
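
On the JVM, one way to satisfy these two requirements is a single-slot buffer backed by `java.util.concurrent.atomic.AtomicReference`. This is only a sketch: the `SpanBuffer` class and its method names are hypothetical, and other runtimes will need their own primitives.

```java
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical single-slot buffer; S stands for the agent's span type.
final class SpanBuffer<S> {
    private final AtomicReference<S> slot = new AtomicReference<>();

    // Atomically stores a span only if the buffer is currently empty.
    boolean offer(S span) {
        return slot.compareAndSet(null, span);
    }

    // Atomically gets and clears the buffer, so only one thread obtains the span.
    S take() {
        return slot.getAndSet(null);
    }
}
```

A thread that receives a non-null span from `take()` owns it exclusively and can compare properties or call `compress` without further synchronization. If `offer` fails, the caller can retry with `take()` first, which is the retry loop mentioned above.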
12 changes: 1 addition & 11 deletions specs/agents/tracing-spans-destination.md
Expand Up @@ -83,21 +83,11 @@ providing a way to manually disable the automatic setting/inference of this fiel
from a service map or an external service from the dependencies table).
A user-supplied value MUST have the highest precedence, regardless if it was set before or after the automatic setting is invoked.

To allow for automatic inference,
without users having to specify any destination field,
agents SHOULD offer a dedicated API to start an exit span.
This API sets the `exit` flag to `true` and returns `null` or a noop span in case the parent already represents an `exit` span.

**Value**

For all exit spans, unless the `context.destination.service.resource` field was set by the user to `null` or an empty
For all [exit spans](tracing-spans.md#exit-spans), unless the `context.destination.service.resource` field was set by the user to `null` or an empty
string through API, agents MUST infer the value of this field based on properties that are set on the span.

This is how to determine whether a span is an exit span:
```groovy
exit = exit || context.destination || context.db || context.message || context.http
```

If no value is set to the `context.destination.service.resource` field, the logic for automatically inferring
it MUST be the following:
```groovy
Expand Down
81 changes: 81 additions & 0 deletions specs/agents/tracing-spans-drop-fast-exit.md
@@ -0,0 +1,81 @@
[Agent spec home](README.md) > [Handling huge traces](tracing-spans-handling-huge-traces.md) > [Dropping fast exit spans](tracing-spans-drop-fast-exit.md)

# Dropping fast exit spans

If an exit span was really fast, chances are that it's not relevant for analyzing latency issues.
Therefore, agents SHOULD implement the strategy laid out in this section to let users choose the level of detail/cost tradeoff that makes sense for them.
If an agent implements this strategy, it MUST also implement [Collecting statistics about dropped spans](tracing-spans-dropped-stats.md).

## `exit_span_min_duration` configuration

Sets the minimum duration of exit spans.
Agents attempt to discard exit spans that execute faster than this threshold.

The attempt fails if they lead up to a span that can't be discarded.
Spans that propagate the trace context to downstream services,
such as outgoing HTTP requests,
can't be discarded.
However, external calls that don't propagate context,
such as calls to a database, can be discarded using this threshold.

Additionally, spans that lead to an error can't be discarded.

| | |
|----------------|------------|
| Type | `duration` |
| Default | `1ms` |
| Central config | `true` |

TODO: should we introduce µs granularity for this config option?
Adding `us` to all `duration`-typed options would create compatibility issues.
So we probably want to support `us` for this option only.

## Interplay with span compression

If an agent implements [span compression](tracing-spans-compress.md),
the limit applies to the [composite span](tracing-spans-compress.md#composite-span).

For example, if 10 Redis calls are compressed into a single composite span whose total duration is lower than `exit_span_min_duration`,
it will be dropped.
If, on the other hand, the individual Redis calls are below the threshold,
but the sum of their durations is above it, the composite span will not be dropped.
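
The two cases can be worked through with a small sketch. The `FastExitSpanCheck` helper is hypothetical, durations are expressed in microseconds, and the threshold is assumed to be the default of `1ms` (1000 µs).

```java
// Hypothetical helper; all durations are in microseconds.
final class FastExitSpanCheck {
    // The composite span's duration is the sum of the compressed spans' durations.
    static long compositeDurationUs(long... durationsUs) {
        long sum = 0;
        for (long d : durationsUs) {
            sum += d;
        }
        return sum;
    }

    // True if a (composite) span is faster than exit_span_min_duration.
    static boolean isBelowThreshold(long spanDurationUs, long minDurationUs) {
        return spanDurationUs < minDurationUs;
    }
}
```

Ten Redis calls of 50 µs each add up to 500 µs, which is below the 1000 µs threshold, so the composite span is dropped. Ten calls of 200 µs each add up to 2000 µs, so the composite span is kept even though each individual call is below the threshold.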

## Limitations

The limitations are based on the premise that the `parent_id` of each span and transaction that's stored in Elasticsearch
should point to another valid transaction or span that's present in the Elasticsearch index.

A span that refers to a missing span via its `parent_id` is also known as an "orphaned span".

### Spans that propagate context to downstream services can't be discarded

We only know whether to discard after the call has ended.
At that point,
the trace has already continued on the downstream service.
Discarding the span for the external request would orphan the transaction of the downstream call.

Propagating the trace context to downstream services is also known as out-of-process context propagation.

## Implementation

### `discardable` flag

Spans store an additional `discardable` flag in order to determine whether a span can be discarded.
The default value is `true` for [exit spans](tracing-spans.md#exit-spans) and `false` for any other span.

According to the [limitations](#limitations),
there are certain situations where the `discardable` flag of a span is set to `false`:
- When an error is reported for this span
- On out-of-process context propagation

### Determining whether to report a span

If the span's duration is less than `exit_span_min_duration` and the span is discardable (`discardable=true`),
the `span_count.dropped` count is incremented, and the span will not be reported.
We're deliberately using the same dropped counter we also use when dropping spans due to [`transaction_max_spans`](tracing-spans-limit.md#configuration-option-transaction_max_spans).
This ensures that a dropped fast span doesn't consume from the max spans limit.
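
A minimal sketch of this decision, assuming hypothetical `TransactionCounters` and `ExitSpanFilter` types and durations in microseconds, might look like this:

```java
// Hypothetical types for illustration; the names are not taken from any agent.
final class TransactionCounters {
    int droppedSpans; // reported as span_count.dropped
}

final class ExitSpanFilter {
    private final long minDurationUs; // exit_span_min_duration in microseconds

    ExitSpanFilter(long minDurationUs) {
        this.minDurationUs = minDurationUs;
    }

    // Returns true if the span should be reported, false if it was dropped.
    boolean shouldReport(long durationUs, boolean discardable, TransactionCounters counters) {
        if (discardable && durationUs < minDurationUs) {
            counters.droppedSpans++;
            return false;
        }
        return true;
    }
}
```

A span whose `discardable` flag was set to `false`, for example because an error was reported for it, is always reported regardless of its duration.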

### Metric collection

To reduce the data loss, agents [collect statistics about dropped spans](tracing-spans-dropped-stats.md).
Dropped spans contribute to [breakdown metrics](https://docs.google.com/document/d/1-_LuC9zhmva0VvLgtI0KcHuLzNztPHbcM0ZdlcPUl64#heading=h.ondan294nbpt) the same way as non-discarded spans.
57 changes: 57 additions & 0 deletions specs/agents/tracing-spans-dropped-stats.md
@@ -0,0 +1,57 @@
[Agent spec home](README.md) > [Handling huge traces](tracing-spans-handling-huge-traces.md) > [Collecting statistics about dropped spans](tracing-spans-dropped-stats.md)

## Collecting statistics about dropped spans

To still retain some information about dropped spans (for example due to [`transaction_max_spans`](tracing-spans-limit.md) or [`exit_span_min_duration`](tracing-spans-drop-fast-exit.md)),
agents SHOULD collect statistics on the corresponding transaction about dropped spans.
These statistics MUST only be sent for sampled transactions.

### Use cases

This allows APM Server to consider these metrics for the service destination metrics.
In practice,
this means that the service map, the dependencies table,
and the backend details view can show accurate throughput statistics for backends like Redis,
even if most of the spans are dropped.

This also allows the transaction details view (aka waterfall) to show a summary of the dropped spans.

### Data model

This is an example of the statistics that are added to the `transaction` events sent via the intake v2 protocol.

```json
{
  "dropped_spans_stats": [
    {
      "type": "external",
      "subtype": "http",
      "destination_service_resource": "example.com:443",
      "outcome": "failure",
      "count": 28,
      "duration.sum.us": 123456
    },
    {
      "type": "db",
      "subtype": "mysql",
      "destination_service_resource": "mysql",
      "outcome": "success",
      "count": 81,
      "duration.sum.us": 9876543
    }
  ]
}
```

### Limits

To prevent the structures from growing without bounds (which is only expected in pathological cases),
agents MUST limit the size of the `dropped_spans_stats` to 128 entries per transaction.
Any entries that would exceed the limit are silently dropped.
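
The limit could be enforced roughly as in the following sketch. The `DroppedSpanStats` class is hypothetical, and grouping entries by `destination_service_resource` and `outcome` is an assumption made for this example; the text above does not state which fields form the key.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical per-transaction aggregator for dropped_spans_stats.
final class DroppedSpanStats {
    static final int MAX_ENTRIES = 128;

    static final class Entry {
        long count;
        long durationSumUs;
    }

    // Assumed key: [destination_service_resource, outcome].
    // A List is used because it provides value-based equals/hashCode.
    private final Map<List<String>, Entry> entries = new HashMap<>();

    void record(String resource, String outcome, long durationUs) {
        List<String> key = Arrays.asList(resource, outcome);
        Entry entry = entries.get(key);
        if (entry == null) {
            if (entries.size() >= MAX_ENTRIES) {
                return; // limit reached: silently drop stats for new keys
            }
            entry = new Entry();
            entries.put(key, entry);
        }
        entry.count++;
        entry.durationSumUs += durationUs;
    }

    int size() {
        return entries.size();
    }

    Entry get(String resource, String outcome) {
        return entries.get(Arrays.asList(resource, outcome));
    }
}
```

Note that once the limit is reached, only new keys are rejected; dropped spans that match an existing entry still update its `count` and duration sum.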

### Effects on destination service metrics

As laid out in the [span destination spec](tracing-spans-destination.md#contextdestinationserviceresource),
APM Server tracks span destination metrics.
To prevent dropped spans from skewing latency metrics and causing throughput metrics to be under-counted,
APM Server will take `dropped_spans_stats` into account when tracking span destination metrics.