Handling huge tracing specs #453

Merged
merged 50 commits into from
Jul 19, 2021
2bd13cf
First draft of handling huge tracing specs
felixbarny Jun 21, 2021
72d384d
Apply suggestions from code review
felixbarny Jun 22, 2021
411f529
Implement suggestions
felixbarny Jun 22, 2021
fd3d879
Update specs/agents/tracing-spans-compress.md
felixbarny Jun 22, 2021
42ad300
Pseudo code for how the strategies work in combination
felixbarny Jun 22, 2021
ae09511
Add composite.exact_match flag
felixbarny Jun 22, 2021
db17364
Apply suggestions from code review
felixbarny Jun 24, 2021
a81d78f
Add breadcrumbs
felixbarny Jun 30, 2021
af969da
Add missing table of contents link to AWS tracing spec file
trentm Jun 29, 2021
f5c010a
Some clarifications for the destination APIs (#452)
eyalkoren Jun 30, 2021
5916a63
Add limit to dropped_spans_stats
felixbarny Jul 5, 2021
ccf4349
Add implementation section to transaction_max_spans
felixbarny Jul 5, 2021
9790529
Merge remote-tracking branch 'origin/master' into compressed-spans
felixbarny Jul 5, 2021
b318ae6
Move exit span definition from destination spec to span spec
felixbarny Jul 5, 2021
7ab424b
Add exit_span_min_duration spec
felixbarny Jul 5, 2021
bcd4a6d
Apply suggestions from code review
felixbarny Jul 5, 2021
834ac8b
Fix links, add clarification to max duration
felixbarny Jul 5, 2021
42663a2
Dropping fast spans requires stats
felixbarny Jul 6, 2021
5828651
Rework transaction_max_spans implementation logic
felixbarny Jul 6, 2021
f260ee5
Improve transaction_max_spans: no CAS
felixbarny Jul 7, 2021
1f3cc6b
Apply suggestions from code review
felixbarny Jul 7, 2021
bb1bcde
Update specs/agents/tracing-spans-compress.md
SergeyKleyman Jul 13, 2021
e6b50d2
Update specs/agents/tracing-spans-compress.md
SergeyKleyman Jul 13, 2021
9ba8957
Update specs/agents/tracing-spans-handling-huge-traces.md
SergeyKleyman Jul 13, 2021
b20c102
Renamed same_kind_compression_max_duration config option
SergeyKleyman Jul 15, 2021
51db949
Added span_compression_same_kind_max_duration config option
SergeyKleyman Jul 15, 2021
473bb4d
Added span_compression_enabled config option
SergeyKleyman Jul 15, 2021
00dcfa8
Update specs/agents/tracing-spans-compress.md
SergeyKleyman Jul 15, 2021
ccf2aa4
Changed end to sum.us in composite sub-object
SergeyKleyman Jul 15, 2021
a046548
Merge remote-tracking branch 'felixbarny/compressed-spans' into compr…
SergeyKleyman Jul 15, 2021
f711c07
Replaced exact_match bool with compression_strategy enum
SergeyKleyman Jul 15, 2021
df344a2
Update specs/agents/tracing-spans-compress.md
SergeyKleyman Jul 15, 2021
98a5bd9
Added outcome requirement to eligible for compression
SergeyKleyman Jul 15, 2021
ef501a3
Added outcome requirement to eligible for compression PART 2
SergeyKleyman Jul 15, 2021
798d270
Added links from tracing-spans.md to tracing-spans-compress.md
SergeyKleyman Jul 15, 2021
3754297
Fixed missing isSameKind check in tryToCompressComposite()
SergeyKleyman Jul 15, 2021
4df5afb
Update specs/agents/tracing-spans-drop-fast-exit.md
SergeyKleyman Jul 19, 2021
0be6c90
Update specs/agents/tracing-spans-compress.md
SergeyKleyman Jul 19, 2021
44c3936
Update specs/agents/tracing-spans-drop-fast-exit.md
SergeyKleyman Jul 19, 2021
182d610
Update specs/agents/tracing-spans-compress.md
SergeyKleyman Jul 19, 2021
a7d728b
Update specs/agents/tracing-spans-compress.md
SergeyKleyman Jul 19, 2021
6b36436
Update specs/agents/tracing-spans-compress.md
SergeyKleyman Jul 19, 2021
2a54365
Removed "Exit span API" requirement from tracing-spans.md
SergeyKleyman Jul 19, 2021
d3a4453
Merge remote-tracking branch 'felixbarny/compressed-spans' into compr…
SergeyKleyman Jul 19, 2021
990463e
Update specs/agents/tracing-spans-drop-fast-exit.md
AlexanderWert Jul 19, 2021
a8e1e91
reafctored file structure for handling huge traces
AlexanderWert Jul 19, 2021
916d1fa
Merge commit 'b338fe9e1539180b05ce57ac0cfb8f3c18aa9b88'
AlexanderWert Jul 19, 2021
48b08c9
Update specs/agents/tracing-spans-destination.md
AlexanderWert Jul 19, 2021
971c96f
Update specs/agents/tracing-spans.md
AlexanderWert Jul 19, 2021
6821501
Update specs/agents/tracing-spans.md
AlexanderWert Jul 19, 2021
5 changes: 5 additions & 0 deletions specs/agents/README.md
@@ -40,6 +40,11 @@ You can find details about each of these in the [APM Data Model](https://www.ela
- [Transactions](tracing-transactions.md)
- [Spans](tracing-spans.md)
- [Span destination](tracing-spans-destination.md)
- [Handling huge traces](handling-huge-traces/tracing-spans-handling-huge-traces.md)
- [Hard limit on number of spans to collect](handling-huge-traces/tracing-spans-limit.md)
- [Collecting statistics about dropped spans](handling-huge-traces/tracing-spans-dropped-stats.md)
- [Dropping fast exit spans](handling-huge-traces/tracing-spans-drop-fast-exit.md)
- [Compressing spans](handling-huge-traces/tracing-spans-compress.md)
- [Sampling](tracing-sampling.md)
- [Distributed tracing](tracing-distributed-tracing.md)
- [Tracer API](tracing-api.md)
41 changes: 41 additions & 0 deletions specs/agents/handling-huge-traces/README.md
@@ -0,0 +1,41 @@
# Handling huge traces

Instrumenting applications that make lots of requests (such as 10k+) to backends like caches or databases can lead to several issues:
- A significant performance impact in the target application,
for example due to a high allocation rate, network traffic, garbage collection, and additional CPU cycles for serializing, compressing, and sending spans.
- Dropping of events in agents or APM Server due to exhausted queues.
- High load on the APM Server.
- High storage costs.
- Decreased performance of the Elastic APM UI due to slow searches and rendering of huge traces.
- Loss of clarity and overview in the UI when analyzing the traces, which degrades the user experience.

Agents can implement several strategies to mitigate these issues.
These strategies are designed to capture significant information about relevant spans while at the same time limiting the trace to a manageable size.
Applying any of these strategies inevitably leads to a loss of information.
However, they aim to provide a better tradeoff between cost and insight by not capturing or summarizing less relevant data.

- [Hard limit on number of spans to collect](tracing-spans-limit.md) \
Even after applying the most advanced strategies, there must always be a hard limit on the number of spans we collect.
This is the last line of defense that comes with the highest amount of data loss.
- [Collecting statistics about dropped spans](tracing-spans-dropped-stats.md) \
Ensures that even when spans are dropped, we at least have statistics about them.
- [Dropping fast exit spans](tracing-spans-drop-fast-exit.md) \
If a span was blazingly fast, it's probably not worth the cost to send and store it.
- [Compressing spans](tracing-spans-compress.md) \
If there are a bunch of very similar spans, we can represent them in a single document - a composite span.

In a nutshell, this is how the different settings work in combination:

```java
if (span.transaction.spanCount > transaction_max_spans) {
    // drop span
    // collect statistics for dropped spans
} else if (compression possible) {
    // apply compression
} else if (span.duration < exit_span_min_duration) {
    // drop span
    // collect statistics for dropped spans
} else {
    // report span
}
```
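
The order of checks above can also be written as a small, runnable Java sketch. The class name `HugeTraceDecision`, the `Action` enum, and the parameter names are hypothetical and only serve the illustration; `compressionPossible` stands in for the eligibility and strategy checks described in the span compression spec.

```java
/** Illustrative only: mirrors the order of checks in the pseudo-code above. */
final class HugeTraceDecision {

    enum Action { DROP_OVER_LIMIT, COMPRESS, DROP_FAST_EXIT, REPORT }

    static Action decide(int spanCount,
                         int transactionMaxSpans,
                         boolean compressionPossible,
                         long durationUs,
                         long exitSpanMinDurationUs) {
        if (spanCount > transactionMaxSpans) {
            // Hard limit reached: drop the span, but keep statistics about it.
            return Action.DROP_OVER_LIMIT;
        }
        if (compressionPossible) {
            // Compression is attempted before fast spans are dropped.
            return Action.COMPRESS;
        }
        if (durationUs < exitSpanMinDurationUs) {
            // Too fast to be worth reporting: drop it, but keep statistics.
            return Action.DROP_FAST_EXIT;
        }
        return Action.REPORT;
    }
}
```

Note that compression is checked before the minimum-duration check, so a fast span that can be compressed ends up in a composite span instead of being dropped.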
273 changes: 273 additions & 0 deletions specs/agents/handling-huge-traces/tracing-spans-compress.md
@@ -0,0 +1,273 @@
# Compressing spans

To mitigate the potential flood of spans to a backend,
agents SHOULD implement the strategies laid out in this section to avoid sending almost identical and very similar spans.

While compressing multiple similar spans into a single composite span can't fully eliminate the collection overhead,
it can significantly reduce the impact on the following areas,
with very little loss of information:
- Agent reporter queue utilization
- Capturing stack traces, serialization, compression, and sending events to APM Server
- Potential to re-use span objects, significantly reducing allocations
- Downstream effects like reducing impact on APM Server, ES storage, and UI performance

### Configuration option `span_compression_enabled`

Setting this option to true enables the span compression feature.
Span compression reduces the collection, processing, and storage overhead, and removes clutter from the UI.
The tradeoff is that some information such as DB statements of all the compressed spans will not be collected.

| | |
|----------------|----------|
| Type | `boolean`|
| Default | `false` |
| Dynamic | `true` |


## Consecutive-Exact-Match compression strategy

One of the biggest sources of excessive data collection is n+1 type queries and repetitive requests to a cache server.
This strategy detects consecutive spans that hold the same information (except for the duration)
and creates a single [composite span](#composite-span).

```
[                           ]
GET /users
[] [] [] [] [] [] [] [] [] []
10x SELECT FROM users
```

Two spans are considered to be an exact match if they are of the [same kind](#consecutive-same-kind-compression-strategy) and their span names are equal. That is, the following properties must all be equal:
- `type`
- `subtype`
- `destination.service.resource`
- `name`

### Configuration option `span_compression_exact_match_max_duration`

Consecutive spans that are an exact match and whose duration is under this threshold will be compressed into a single composite span.
This option does not apply to [composite spans](#composite-span).
This reduces the collection, processing, and storage overhead, and removes clutter from the UI.
The tradeoff is that the DB statements of all the compressed spans will not be collected.

| | |
|----------------|----------|
| Type | `duration`|
| Default | `5ms` |
| Dynamic | `true` |

## Consecutive-Same-Kind compression strategy

Another pattern that often occurs is a large number of alternating queries to the same backend.
Especially if the individual spans are quite fast, recording every single query is unlikely to be worth the overhead.

```
[                           ]
GET /users
[] [] [] [] [] [] [] [] [] []
10x Calls to mysql
```

Two spans are considered to be of the same kind if the following properties are equal:
- `type`
- `subtype`
- `destination.service.resource`

```java
boolean isSameKind(Span other) {
    return type == other.type
        && subtype == other.subtype
        && destination.service.resource == other.destination.service.resource
}
```
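
The pseudo-code uses `==` for brevity. In actual Java, `==` on strings compares references, so a runnable version would use a null-safe value comparison instead. The following is a minimal sketch under that assumption; the class name `SpanIdentity` and its flat `resource` field (standing in for `destination.service.resource`) are hypothetical. It also shows the exact-match check, which is the same-kind check plus equality of `name`.

```java
import java.util.Objects;

/** Illustrative only: the span properties that matter for compression matching. */
final class SpanIdentity {
    final String type;
    final String subtype;
    final String resource; // stands in for destination.service.resource
    final String name;

    SpanIdentity(String type, String subtype, String resource, String name) {
        this.type = type;
        this.subtype = subtype;
        this.resource = resource;
        this.name = name;
    }

    /** Same kind: type, subtype, and destination resource are equal. */
    boolean isSameKind(SpanIdentity other) {
        return Objects.equals(type, other.type)
            && Objects.equals(subtype, other.subtype)
            && Objects.equals(resource, other.resource);
    }

    /** Exact match: same kind and, in addition, the same span name. */
    boolean isExactMatch(SpanIdentity other) {
        return isSameKind(other) && Objects.equals(name, other.name);
    }
}
```

`Objects.equals` treats two `null` values as equal, so spans without a `subtype` can still be of the same kind.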

When applying this compression strategy, the `span.name` is set to `Calls to $span.destination.service.resource`.
The rest of the context, such as the `db.statement`, is determined by the first compressed span, which is turned into a composite span.

### Configuration option `span_compression_same_kind_max_duration`

Consecutive spans to the same destination that are under this threshold will be compressed into a single composite span.
This option does not apply to [composite spans](#composite-span).
This reduces the collection, processing, and storage overhead, and removes clutter from the UI.
The tradeoff is that the DB statements of all the compressed spans will not be collected.

| | |
|----------------|----------|
| Type | `duration`|
| Default | `5ms` |
| Dynamic | `true` |

## Composite span

Compressed spans don't have a physical span document.
Instead, multiple compressed spans are represented by a composite span.

### Data model

For composite spans, the `timestamp` and `duration` have slightly different semantics than for regular spans,
and additional properties are defined under the `composite` context.

- `timestamp`: The start timestamp of the first span.
- `duration`: gross duration (i.e., _<last compressed span's end timestamp>_ - _<first compressed span's start timestamp>_).
- `composite`
    - `count`: The number of compressed spans this composite span represents.
      The minimum count is 2 as a composite span represents at least two spans.
    - `sum.us`: sum of durations of all compressed spans this composite span represents in microseconds.
      Thus `sum.us` is the net duration of all the compressed spans while `duration` is the gross duration (including "whitespace" between the spans).
    - `compression_strategy`: A string value indicating which compression strategy was used. The valid values are:
        - `exact_match` - [Consecutive-Exact-Match compression strategy](tracing-spans-compress.md#consecutive-exact-match-compression-strategy)
        - `same_kind` - [Consecutive-Same-Kind compression strategy](tracing-spans-compress.md#consecutive-same-kind-compression-strategy)
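
To make the difference between gross and net duration concrete, here is a small worked sketch. The class names `CompositeSummary` and `TimedSpan` are hypothetical, and all values are kept in microseconds purely to simplify the illustration. It assumes the spans are passed in the order in which they ended, without overlap.

```java
import java.util.List;

/** Illustrative only: derives the composite fields from a sequence of compressed spans. */
final class CompositeSummary {

    /** Hypothetical minimal span: start timestamp and duration in microseconds. */
    static final class TimedSpan {
        final long startUs;
        final long durationUs;

        TimedSpan(long startUs, long durationUs) {
            this.startUs = startUs;
            this.durationUs = durationUs;
        }

        long endUs() {
            return startUs + durationUs;
        }
    }

    final long timestampUs; // start timestamp of the first span
    final long durationUs;  // gross duration: last end minus first start
    final int count;        // composite.count
    final long sumUs;       // composite.sum.us (net duration)

    private CompositeSummary(long timestampUs, long durationUs, int count, long sumUs) {
        this.timestampUs = timestampUs;
        this.durationUs = durationUs;
        this.count = count;
        this.sumUs = sumUs;
    }

    static CompositeSummary of(List<TimedSpan> spans) {
        if (spans.size() < 2) {
            throw new IllegalArgumentException("a composite span represents at least two spans");
        }
        TimedSpan first = spans.get(0);
        TimedSpan last = spans.get(spans.size() - 1);
        long sum = 0;
        for (TimedSpan span : spans) {
            sum += span.durationUs;
        }
        return new CompositeSummary(first.startUs, last.endUs() - first.startUs, spans.size(), sum);
    }
}
```

For three spans starting at 0, 150, and 400 microseconds with durations of 100, 200, and 50 microseconds, this yields `count = 3`, `sum.us = 350`, and a gross `duration` of 450 microseconds; the remaining 100 microseconds are the "whitespace" between the spans.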

### Effects on metric processing

As laid out in the [span destination spec](tracing-spans-destination.md#contextdestinationserviceresource),
APM Server tracks span destination metrics.
To prevent compressed spans from skewing latency metrics and causing throughput metrics to be under-counted,
APM Server will take `composite.count` into account when tracking span destination metrics.
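
The following sketch illustrates why the count matters. It is not APM Server's implementation: the class `DestinationMetricsSketch` and its methods are hypothetical, and using `composite.sum.us` as the duration of a composite span is an assumption made for this example only. The spec text above only states that `composite.count` is taken into account.

```java
/** Illustrative only: a per-destination aggregate of call count and total duration. */
final class DestinationMetricsSketch {
    long spanCount;     // number of calls to the destination
    long durationSumUs; // total time spent in those calls, in microseconds

    /** A regular span counts as a single call. */
    void recordSpan(long durationUs) {
        spanCount += 1;
        durationSumUs += durationUs;
    }

    /** A composite span counts as composite.count calls (assumption: duration taken from sum.us). */
    void recordComposite(int compositeCount, long compositeSumUs) {
        spanCount += compositeCount;
        durationSumUs += compositeSumUs;
    }

    double averageLatencyUs() {
        return spanCount == 0 ? 0.0 : (double) durationSumUs / spanCount;
    }
}
```

If a composite span representing 10 calls were counted as one span, throughput would be under-counted by 9 calls and the average latency would be inflated accordingly.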

## Compression algorithm

### Eligibility for compression

A span is eligible for compression if all the following conditions are met:
1. It's an [exit span](tracing-spans.md#exit-spans).
2. The trace context of this span has not been propagated to a downstream service.
3. If the span has an `outcome` (i.e., `outcome` is present and it's not `null`), then it should be `success`.
This means that spans with an outcome indicating an issue of potential interest should not be compressed.

The second condition is important so that we don't remove (compress) a span that may be the parent of a downstream service.
This would orphan the sub-graph started by the downstream service and cause it to not appear in the waterfall view.

```java
boolean isCompressionEligible() {
    return exit && !context.hasPropagated && (outcome == null || outcome == "success")
}
```

### Span buffering

Non-compression-eligible spans may be reported immediately after they have ended.
When a compression-eligible span ends, it does not immediately get reported.
Instead, the span is buffered within its parent.
A span/transaction can buffer at most one child span.

Span buffering allows the agent to "look back" one span when determining whether a given span should be compressed.

A buffered span gets reported when either
1. its parent ends, or
2. a non-compressible sibling ends.

```java
void onEnd() {
    if (buffered != null) {
        report(buffered)
    }
}

void onChildEnd(Span child) {
    if (!child.isCompressionEligible()) {
        if (buffered != null) {
            report(buffered)
            buffered = null
        }
        report(child)
        return
    }

    if (buffered == null) {
        buffered = child
        return
    }

    if (!buffered.tryToCompress(child)) {
        report(buffered)
        buffered = child
    }
}
```
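
To see the buffering rules in action, the following runnable sketch traces which spans get reported and when. The classes `BufferingDemo`, `Child`, and `Parent` are hypothetical, and `tryToCompress` is replaced by a deliberately simplified stand-in that merges siblings with equal names; the real matching rules are more involved.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative only: a parent that buffers at most one compression-eligible child. */
final class BufferingDemo {

    static final class Child {
        final String name;
        final boolean compressionEligible;
        int count = 1; // number of spans this object represents

        Child(String name, boolean compressionEligible) {
            this.name = name;
            this.compressionEligible = compressionEligible;
        }

        /** Simplified stand-in: merges a sibling only if the names are equal. */
        boolean tryToCompress(Child sibling) {
            if (!name.equals(sibling.name)) {
                return false;
            }
            count++;
            return true;
        }
    }

    static final class Parent {
        final List<String> reported = new ArrayList<>();
        private Child buffered;

        private void report(Child child) {
            reported.add(child.count + "x " + child.name);
        }

        void onChildEnd(Child child) {
            if (!child.compressionEligible) {
                // A non-compressible sibling flushes the buffer first.
                if (buffered != null) {
                    report(buffered);
                    buffered = null;
                }
                report(child);
                return;
            }
            if (buffered == null) {
                buffered = child;
                return;
            }
            if (!buffered.tryToCompress(child)) {
                report(buffered);
                buffered = child;
            }
        }

        void onEnd() {
            // The parent ending flushes whatever is still buffered.
            if (buffered != null) {
                report(buffered);
                buffered = null;
            }
        }
    }
}
```

Ending three eligible children named `a`, then a non-eligible child `render`, then an eligible child `b`, and finally the parent, leaves `reported` as `["3x a", "1x render", "1x b"]`: the three `a` spans are only reported once the non-compressible sibling ends.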

### Turning compressed spans into a composite span

Spans have a `tryToCompress` method that is called on a span buffered by its parent.
On the first call, the span checks whether it can be compressed with the given sibling and selects the best compression strategy.
Note that the compression strategy is selected only once, based on the first two spans of the sequence.
The compression strategy cannot be changed by the rest of the spans in the sequence.
So when the current sibling span cannot be added to the ongoing sequence under the selected compression strategy,
the ongoing sequence is terminated, it is sent out as a composite span, and the current sibling span is buffered.

If the spans are of the same kind, have the same name, and both spans' `duration` <= `span_compression_exact_match_max_duration`,
we apply the [Consecutive-Exact-Match compression strategy](tracing-spans-compress.md#consecutive-exact-match-compression-strategy).
Note that if the spans are an _exact match_
but the duration threshold requirement is not satisfied, we just stop the compression sequence.
In particular, this means that the implementation should not proceed to try the _same kind_ strategy.
Otherwise, users would have to lower both `span_compression_exact_match_max_duration` and `span_compression_same_kind_max_duration`
to prevent longer _exact match_ spans from being compressed.

If the spans are of the same kind but have different span names, and both spans' `duration` <= `span_compression_same_kind_max_duration`,
we compress them using the [Consecutive-Same-Kind compression strategy](tracing-spans-compress.md#consecutive-same-kind-compression-strategy).

```java
boolean tryToCompress(Span sibling) {
    isAlreadyComposite = composite != null
    canBeCompressed = isAlreadyComposite ? tryToCompressComposite(sibling) : tryToCompressRegular(sibling)
    if (!canBeCompressed) {
        return false
    }

    if (!isAlreadyComposite) {
        // first compression: this span itself becomes the composite
        composite.count = 1
        composite.sumUs = duration
    }

    ++composite.count
    composite.sumUs += sibling.duration
    return true
}

boolean tryToCompressRegular(Span sibling) {
    if (!isSameKind(sibling)) {
        return false
    }

    if (name == sibling.name) {
        if (duration <= span_compression_exact_match_max_duration && sibling.duration <= span_compression_exact_match_max_duration) {
            composite.compressionStrategy = "exact_match"
            return true
        }
        // exact match but too slow: do not fall back to same_kind
        return false
    }

    if (duration <= span_compression_same_kind_max_duration && sibling.duration <= span_compression_same_kind_max_duration) {
        composite.compressionStrategy = "same_kind"
        name = "Calls to " + destination.service.resource
        return true
    }

    return false
}

boolean tryToCompressComposite(Span sibling) {
    switch (composite.compressionStrategy) {
        case "exact_match":
            return isSameKind(sibling) && name == sibling.name && sibling.duration <= span_compression_exact_match_max_duration

        case "same_kind":
            return isSameKind(sibling) && sibling.duration <= span_compression_same_kind_max_duration
    }
}
```
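
For agents that want a starting point, here is a self-contained, runnable variant of the pseudo-code above. It is a sketch under several assumptions, not a reference implementation: the class name `CompressibleSpan` is hypothetical, both thresholds are hard-coded to the `5ms` defaults, all durations are in microseconds, and the update of the gross `duration` is added here based on the data model section rather than taken from the pseudo-code.

```java
import java.util.Objects;

/** Illustrative only: a span that can absorb consecutive siblings into a composite. */
final class CompressibleSpan {
    // Assumption: both thresholds use the 5ms default, expressed in microseconds.
    static final long EXACT_MATCH_MAX_US = 5_000;
    static final long SAME_KIND_MAX_US = 5_000;

    String name;
    final String type;
    final String subtype;
    final String resource; // stands in for destination.service.resource
    final long startUs;
    long durationUs;

    // Composite state; a null strategy means "not a composite span yet".
    String compressionStrategy;
    int count;
    long sumUs;

    CompressibleSpan(String name, String type, String subtype, String resource,
                     long startUs, long durationUs) {
        this.name = name;
        this.type = type;
        this.subtype = subtype;
        this.resource = resource;
        this.startUs = startUs;
        this.durationUs = durationUs;
    }

    boolean isSameKind(CompressibleSpan other) {
        return Objects.equals(type, other.type)
            && Objects.equals(subtype, other.subtype)
            && Objects.equals(resource, other.resource);
    }

    boolean tryToCompress(CompressibleSpan sibling) {
        boolean alreadyComposite = compressionStrategy != null;
        boolean canBeCompressed = alreadyComposite
            ? tryToCompressComposite(sibling)
            : tryToCompressRegular(sibling);
        if (!canBeCompressed) {
            return false;
        }
        if (!alreadyComposite) {
            count = 1;
            sumUs = durationUs;
        }
        count++;
        sumUs += sibling.durationUs;
        // Gross duration: end of the last compressed span minus start of the first.
        durationUs = sibling.startUs + sibling.durationUs - startUs;
        return true;
    }

    private boolean tryToCompressRegular(CompressibleSpan sibling) {
        if (!isSameKind(sibling)) {
            return false;
        }
        if (Objects.equals(name, sibling.name)) {
            if (durationUs <= EXACT_MATCH_MAX_US && sibling.durationUs <= EXACT_MATCH_MAX_US) {
                compressionStrategy = "exact_match";
                return true;
            }
            return false; // exact match but too slow: no fallback to same_kind
        }
        if (durationUs <= SAME_KIND_MAX_US && sibling.durationUs <= SAME_KIND_MAX_US) {
            compressionStrategy = "same_kind";
            name = "Calls to " + resource;
            return true;
        }
        return false;
    }

    private boolean tryToCompressComposite(CompressibleSpan sibling) {
        switch (compressionStrategy) {
            case "exact_match":
                return isSameKind(sibling) && Objects.equals(name, sibling.name)
                    && sibling.durationUs <= EXACT_MATCH_MAX_US;
            case "same_kind":
                return isSameKind(sibling) && sibling.durationUs <= SAME_KIND_MAX_US;
            default:
                return false;
        }
    }
}
```

Using a nullable `compressionStrategy` as the "is composite" marker avoids the need for a separate `composite` object in this sketch; an agent with a richer span model would likely keep a dedicated composite sub-object instead.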

### Concurrency

The pseudo-code in this spec is intentionally not written in a thread-safe manner to make it more concise.
Also, thread safety is highly platform/runtime dependent, and some don't support parallelism or concurrency.

However, if multiple spans can end concurrently, agents MUST guard against race conditions.
To do that, agents should prefer [lock-free algorithms](https://en.wikipedia.org/wiki/Non-blocking_algorithm)
paired with retry loops over blocking algorithms that use mutexes or locks.

In particular, operations that work with the buffer require special attention:
- Setting a span into the buffer must be handled atomically.
- Retrieving a span from the buffer must be handled atomically.
  Retrieving includes atomically getting and clearing the buffer.
  This makes sure that only one thread at a time can compare span properties and call mutating methods, such as `tryToCompress`.
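
On the JVM, one way to meet both requirements without locks is `java.util.concurrent.atomic.AtomicReference`. The sketch below is only one possible approach; the class name `SpanBuffer` and its method names are hypothetical, and other platforms will need their own primitives.

```java
import java.util.concurrent.atomic.AtomicReference;

/** Illustrative only: a single-slot buffer with atomic set and atomic get-and-clear. */
final class SpanBuffer<T> {
    private final AtomicReference<T> buffered = new AtomicReference<>();

    /**
     * Atomically retrieves and clears the buffered span.
     * At most one caller receives a given span; all others see null.
     */
    T take() {
        return buffered.getAndSet(null);
    }

    /**
     * Atomically stores a span if the buffer is currently empty.
     * Returns false if another thread has filled the buffer in the meantime.
     */
    boolean offer(T span) {
        return buffered.compareAndSet(null, span);
    }
}
```

A thread handling an ended child would `take()` the buffered span, compare or compress it while it has exclusive ownership, and then `offer()` the result back. If `offer()` fails because another thread has buffered a span in the meantime, the caller can retry the whole sequence or report its span directly.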