Handling huge tracing specs #453

Merged
merged 50 commits into from
Jul 19, 2021
Changes from 3 commits
Commits
2bd13cf
First draft of handling huge tracing specs
felixbarny Jun 21, 2021
72d384d
Apply suggestions from code review
felixbarny Jun 22, 2021
411f529
Implement suggestions
felixbarny Jun 22, 2021
fd3d879
Update specs/agents/tracing-spans-compress.md
felixbarny Jun 22, 2021
42ad300
Pseudo code for how the strategies work in combination
felixbarny Jun 22, 2021
ae09511
Add composite.exact_match flag
felixbarny Jun 22, 2021
db17364
Apply suggestions from code review
felixbarny Jun 24, 2021
a81d78f
Add breadcrumbs
felixbarny Jun 30, 2021
af969da
Add missing table of contents link to AWS tracing spec file
trentm Jun 29, 2021
f5c010a
Some clarifications for the destination APIs (#452)
eyalkoren Jun 30, 2021
5916a63
Add limit to dropped_spans_stats
felixbarny Jul 5, 2021
ccf4349
Add implementation section to transaction_max_spans
felixbarny Jul 5, 2021
9790529
Merge remote-tracking branch 'origin/master' into compressed-spans
felixbarny Jul 5, 2021
b318ae6
Move exit span definition from destination spec to span spec
felixbarny Jul 5, 2021
7ab424b
Add exit_span_min_duration spec
felixbarny Jul 5, 2021
bcd4a6d
Apply suggestions from code review
felixbarny Jul 5, 2021
834ac8b
Fix links, add clarification to max duration
felixbarny Jul 5, 2021
42663a2
Dropping fast spans requires stats
felixbarny Jul 6, 2021
5828651
Rework transaction_max_spans implementation logic
felixbarny Jul 6, 2021
f260ee5
Improve transaction_max_spans: no CAS
felixbarny Jul 7, 2021
1f3cc6b
Apply suggestions from code review
felixbarny Jul 7, 2021
bb1bcde
Update specs/agents/tracing-spans-compress.md
SergeyKleyman Jul 13, 2021
e6b50d2
Update specs/agents/tracing-spans-compress.md
SergeyKleyman Jul 13, 2021
9ba8957
Update specs/agents/tracing-spans-handling-huge-traces.md
SergeyKleyman Jul 13, 2021
b20c102
Renamed same_kind_compression_max_duration config option
SergeyKleyman Jul 15, 2021
51db949
Added span_compression_same_kind_max_duration config option
SergeyKleyman Jul 15, 2021
473bb4d
Added span_compression_enabled config option
SergeyKleyman Jul 15, 2021
00dcfa8
Update specs/agents/tracing-spans-compress.md
SergeyKleyman Jul 15, 2021
ccf2aa4
Changed end to sum.us in composite sub-object
SergeyKleyman Jul 15, 2021
a046548
Merge remote-tracking branch 'felixbarny/compressed-spans' into compr…
SergeyKleyman Jul 15, 2021
f711c07
Replaced exact_match bool with compression_strategy enum
SergeyKleyman Jul 15, 2021
df344a2
Update specs/agents/tracing-spans-compress.md
SergeyKleyman Jul 15, 2021
98a5bd9
Added outcome requirement to eligible for compression
SergeyKleyman Jul 15, 2021
ef501a3
Added outcome requirement to eligible for compression PART 2
SergeyKleyman Jul 15, 2021
798d270
Added links from tracing-spans.md to tracing-spans-compress.md
SergeyKleyman Jul 15, 2021
3754297
Fixed missing isSameKind check in tryToCompressComposite()
SergeyKleyman Jul 15, 2021
4df5afb
Update specs/agents/tracing-spans-drop-fast-exit.md
SergeyKleyman Jul 19, 2021
0be6c90
Update specs/agents/tracing-spans-compress.md
SergeyKleyman Jul 19, 2021
44c3936
Update specs/agents/tracing-spans-drop-fast-exit.md
SergeyKleyman Jul 19, 2021
182d610
Update specs/agents/tracing-spans-compress.md
SergeyKleyman Jul 19, 2021
a7d728b
Update specs/agents/tracing-spans-compress.md
SergeyKleyman Jul 19, 2021
6b36436
Update specs/agents/tracing-spans-compress.md
SergeyKleyman Jul 19, 2021
2a54365
Removed "Exit span API" requirement from tracing-spans.md
SergeyKleyman Jul 19, 2021
d3a4453
Merge remote-tracking branch 'felixbarny/compressed-spans' into compr…
SergeyKleyman Jul 19, 2021
990463e
Update specs/agents/tracing-spans-drop-fast-exit.md
AlexanderWert Jul 19, 2021
a8e1e91
reafctored file structure for handling huge traces
AlexanderWert Jul 19, 2021
916d1fa
Merge commit 'b338fe9e1539180b05ce57ac0cfb8f3c18aa9b88'
AlexanderWert Jul 19, 2021
48b08c9
Update specs/agents/tracing-spans-destination.md
AlexanderWert Jul 19, 2021
971c96f
Update specs/agents/tracing-spans.md
AlexanderWert Jul 19, 2021
6821501
Update specs/agents/tracing-spans.md
AlexanderWert Jul 19, 2021
5 changes: 5 additions & 0 deletions specs/agents/README.md
@@ -40,6 +40,11 @@ You can find details about each of these in the [APM Data Model](https://www.ela
- [Transactions](tracing-transactions.md)
- [Spans](tracing-spans.md)
- [Span destination](tracing-spans-destination.md)
- [Handling huge traces](tracing-spans-handling-huge-traces.md)
- [Hard limit on number of spans to collect](tracing-spans-limit.md)
- [Collecting statistics about dropped spans](tracing-spans-dropped-stats.md)
- [Dropping fast spans](tracing-spans-drop-fast.md)
- [Compressing spans](tracing-spans-compress.md)
- [Sampling](tracing-sampling.md)
- [Distributed tracing](tracing-distributed-tracing.md)
- [Tracer API](tracing-api.md)
203 changes: 203 additions & 0 deletions specs/agents/tracing-spans-compress.md
@@ -0,0 +1,203 @@
## Compressing spans

To mitigate the potential flood of spans to a backend,
agents SHOULD implement the strategies laid out in this section to avoid sending almost identical and very similar spans.

While compressing multiple similar spans into a single composite span can't fully get rid of all the collection overhead,
it can significantly reduce the impact on the following areas,
with very little loss of information.
- Agent reporter queue utilization
- Capturing stack traces, serialization, compression, and sending events to APM Server
- Potential to re-use span objects, significantly reducing allocations
- Downstream effects like reducing impact on APM Server, ES storage, and UI performance


### Consecutive-Exact-Match compression strategy

Two of the biggest sources of excessive data collection are n+1 type queries and repetitive requests to a cache server.
This strategy detects consecutive spans that hold the same information (except for the duration)
and creates a single [composite span](tracing-spans-compress.md#composite-span).

```
[                           ]
GET /users
[] [] [] [] [] [] [] [] [] []
10x SELECT FROM users
```

Two spans are considered to be an exact match if they are of the [same kind](tracing-spans-compress.md#consecutive-same-kind-compression-strategy) and their span names are equal, that is, if the following properties are equal:
- `type`
- `subtype`
- `destination.service.resource`
- `name`

### Consecutive-Same-Kind compression strategy

Another pattern that often occurs is a large number of alternating queries to the same backend.
Especially if the individual spans are quite fast, recording every single query is likely not worth the overhead.

```
[                           ]
GET /users
[] [] [] [] [] [] [] [] [] []
10x Calls to mysql
```

Two spans are considered to be of the same kind if the following properties are equal:
- `type`
- `subtype`
- `destination.service.resource`

```java
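// Compares by value; a reference comparison (==) would be wrong for Java strings.
boolean isSameKind(Span other) {
    return Objects.equals(type, other.type)
        && Objects.equals(subtype, other.subtype)
        && Objects.equals(destination.service.resource, other.destination.service.resource)
}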
```

When applying this compression strategy, the `span.name` is set to `Calls to $span.destination.service.resource`.
The rest of the context, such as the `db.statement`, is determined by the first compressed span, which is turned into a composite span.
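
To illustrate the difference between the two definitions above, here is a minimal sketch that compares three spans. The `SpanInfo` class, its field names, and the sample values are assumptions made for this example only; they are not part of the spec or of any agent API.

```java
import java.util.Objects;

final class SpanKindExample {
    // Hypothetical, simplified span that holds only the properties relevant for matching.
    static final class SpanInfo {
        final String type;
        final String subtype;
        final String resource; // stands in for destination.service.resource
        final String name;

        SpanInfo(String type, String subtype, String resource, String name) {
            this.type = type;
            this.subtype = subtype;
            this.resource = resource;
            this.name = name;
        }

        // Same kind: type, subtype and destination resource are equal.
        boolean isSameKind(SpanInfo other) {
            return Objects.equals(type, other.type)
                && Objects.equals(subtype, other.subtype)
                && Objects.equals(resource, other.resource);
        }

        // Exact match: same kind and, in addition, equal span names.
        boolean isExactMatch(SpanInfo other) {
            return isSameKind(other) && Objects.equals(name, other.name);
        }

        // Name given to a composite span created by the same-kind strategy.
        String sameKindCompositeName() {
            return "Calls to " + resource;
        }
    }

    public static void main(String[] args) {
        SpanInfo a = new SpanInfo("db", "mysql", "mysql", "SELECT FROM users");
        SpanInfo b = new SpanInfo("db", "mysql", "mysql", "SELECT FROM users");
        SpanInfo c = new SpanInfo("db", "mysql", "mysql", "SELECT FROM orders");

        System.out.println(a.isExactMatch(b));          // true
        System.out.println(a.isExactMatch(c));          // false
        System.out.println(a.isSameKind(c));            // true
        System.out.println(a.sameKindCompositeName());  // Calls to mysql
    }
}
```

Spans `a` and `c` differ only in their name, so they qualify for the same-kind strategy but not for the exact-match strategy.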


#### Configuration option `same_kind_compression_max_duration`

Consecutive spans to the same destination that are under this threshold will be compressed into a single composite span.
This reduces the collection, processing, and storage overhead, and removes clutter from the UI.
The tradeoff is that the statement information will not be collected.

| | |
|----------------|----------|
| Type | `duration`|
| Default | `5ms` |
| Dynamic | `true` |

### Composite span

Compressed spans don't have a physical span document.
Instead, multiple compressed spans are represented by a composite span.

#### Data model

For composite spans, the `timestamp` and `duration` have slightly different semantics than for regular spans,
and additional properties are defined under the `composite` context.

- `timestamp`: The start timestamp of the first span.
- `duration`: The sum of durations of all spans.
- `composite`
    - `count`: The number of compressed spans this composite span represents.
      The minimum count is 2 as a composite span represents at least two spans.
    - `end`: The end timestamp of the last compressed span.
      The net duration of all compressed spans is equal to the composite span's `duration`.
      The gross duration (including "whitespace" between the spans) is equal to `composite.end - timestamp`.
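
As a worked example of these fields, the following sketch builds a composite from three spans and prints the net and gross durations. The `Composite` class, the `of` helper, and the use of microseconds relative to an arbitrary origin are assumptions made for illustration.

```java
final class CompositeExample {
    // Hypothetical holder for the composite span fields; all times are in microseconds.
    static final class Composite {
        long timestamp; // start timestamp of the first compressed span
        long duration;  // net duration: sum of the durations of all compressed spans
        int count;      // number of compressed spans
        long end;       // end timestamp of the last compressed span

        // Gross duration, including the gaps between the compressed spans.
        long grossDuration() {
            return end - timestamp;
        }
    }

    // Builds a composite from parallel arrays of start timestamps and durations.
    static Composite of(long[] starts, long[] durations) {
        Composite c = new Composite();
        c.timestamp = starts[0];
        for (int i = 0; i < starts.length; i++) {
            c.duration += durations[i];
            c.count++;
            c.end = Math.max(c.end, starts[i] + durations[i]);
        }
        return c;
    }

    public static void main(String[] args) {
        // Three spans of 2 ms each, separated by gaps of 1 ms.
        Composite c = of(new long[] {0, 3_000, 6_000}, new long[] {2_000, 2_000, 2_000});
        System.out.println(c.count);            // 3
        System.out.println(c.duration);         // 6000 (net)
        System.out.println(c.grossDuration());  // 8000 (gross)
    }
}
```

The 2 ms difference between the gross and the net duration corresponds to the two 1 ms gaps between the spans.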

#### Turning compressed spans into a composite span

Spans have a `compress` method.
The first time `compress` is called on a regular span, it becomes a composite span.

```java
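void compress(Span other) {
    if (composite.count == 0) {
        // First call: this regular span becomes a composite span that represents two spans.
        composite.count = 2
    } else {
        composite.count++
    }
    // Tracks the end of the last compressed span, which is reported as `composite.end`.
    endTimestamp = max(endTimestamp, other.endTimestamp)
}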
```

#### Effects on metric processing

As laid out in the [span destination spec](tracing-spans-destination.md#contextdestinationserviceresource),
APM Server tracks span destination metrics.
To prevent compressed spans from skewing latency metrics and causing throughput metrics to be under-counted,
APM Server will take `composite.count` into account when tracking span destination metrics.
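
The sketch below shows why the count matters. It is not APM Server code: the `DestinationMetrics` class and its `record` method are hypothetical and only illustrate the arithmetic, assuming that a regular span counts as 1 and a composite span counts as `composite.count`.

```java
final class DestinationMetricsExample {
    // Hypothetical accumulator for the metrics of one destination resource.
    static final class DestinationMetrics {
        long count;
        long sumUs;

        // Records one span document; spanCount is 1 for a regular span
        // and composite.count for a composite span.
        void record(long durationUs, int spanCount) {
            count += spanCount;
            sumUs += durationUs;
        }

        double averageUs() {
            return count == 0 ? 0 : (double) sumUs / count;
        }
    }

    public static void main(String[] args) {
        DestinationMetrics metrics = new DestinationMetrics();
        metrics.record(1_000, 1); // one regular span of 1 ms
        metrics.record(9_000, 9); // one composite span: 9 spans, 9 ms in total
        System.out.println(metrics.count);       // 10
        System.out.println(metrics.averageUs()); // 1000.0
    }
}
```

If the composite span were counted as a single span, the same input would yield a throughput of 2 and an average latency of 5 ms, although every individual call took 1 ms.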

### Compression algorithm

#### Eligibility for compression

A span is eligible for compression if all the following conditions are met:
- It's an exit span
- The trace context of this span has not been propagated to a downstream service

The latter condition is important so that we don't remove (compress) a span that may be the parent of a downstream service.
This would orphan the sub-graph started by the downstream service and cause it to not appear in the waterfall view.

```java
boolean isCompressionEligible() {
    return exit && !context.hasPropagated
}
```

#### Span buffering

Non-compression-eligible spans may be reported immediately after they have ended.
When a compression-eligible span ends, it does not immediately get reported.
Instead, the span is buffered within its parent.
A span/transaction can buffer at most one child span.

Span buffering allows agents to "look back" one span when determining whether a given span should be compressed.

A buffered span gets reported when either of the following happens:
1. its parent ends
2. a non-compressible sibling ends

```java
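void onSpanEnd() {
    // 1. This span is the parent of its buffered child: report the child before this span ends.
    if (hasBufferedSpan()) {
        report(buffered)
        buffered = null
    }
    if (isCompressionEligible()) {
        if (parent.hasBufferedSpan()) {
            parent.tryCompress(this)
        } else {
            parent.buffered = this
        }
    } else {
        // 2. A non-compressible sibling ends: report the parent's buffered span first.
        if (parent.hasBufferedSpan()) {
            report(parent.buffered)
            parent.buffered = null
        }
        report(this)
    }
}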
```

#### Compression

On span end, we compare each [compression-eligible](tracing-spans-compress.md#eligibility-for-compression) span to its previous sibling.

If the spans are of the same kind but have different span names, and the `duration` of both spans is <= `same_kind_compression_max_duration`,
we compress them using the [Consecutive-Same-Kind compression strategy](tracing-spans-compress.md#consecutive-same-kind-compression-strategy).

If the spans are of the same kind and have the same name,
we apply the [Consecutive-Exact-Match compression strategy](tracing-spans-compress.md#consecutive-exact-match-compression-strategy).

```java
void tryCompress(Span child) {
    if (buffered.isSameKind(child)) {
        if (Objects.equals(buffered.name, child.name)) {
            // Consecutive-Exact-Match strategy
            buffered.compress(child)
            return
        } else if (buffered.duration <= same_kind_compression_max_duration
                && child.duration <= same_kind_compression_max_duration) {
            // Consecutive-Same-Kind strategy
            buffered.name = "Calls to $buffered.destination.service.resource"
            buffered.compress(child)
            return
        }
    }
    // No compression possible: report the buffered span and buffer the new one.
    report(buffered)
    buffered = child
}
```
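
To show how buffering and the two strategies interact, the following runnable sketch feeds a sequence of spans through a simplified parent. It is an illustration under several assumptions, not a reference implementation: all spans are treated as compression-eligible, the span kind is reduced to the destination resource, the threshold is hard-coded to the 5 ms default, and the names `Parent`, `onChildEnd`, `end`, and `sumUs` are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

final class CompressionExample {
    // Assumed threshold, mirroring the default of same_kind_compression_max_duration (5 ms).
    static final long MAX_SAME_KIND_US = 5_000;

    // Hypothetical, simplified span; every span is assumed to be compression-eligible.
    static final class Span {
        final String resource; // the span kind is reduced to the destination resource
        String name;
        final long durationUs;
        long sumUs;            // sum of the durations of all compressed spans
        int count;             // 0 for a regular span, otherwise the composite count

        Span(String resource, String name, long durationUs) {
            this.resource = resource;
            this.name = name;
            this.durationUs = durationUs;
            this.sumUs = durationUs;
        }

        void compress(Span other) {
            count = (count == 0) ? 2 : count + 1;
            sumUs += other.durationUs;
        }
    }

    // Stands in for the parent span or transaction, which buffers at most one child.
    static final class Parent {
        final List<Span> reported = new ArrayList<>();
        Span buffered;

        void onChildEnd(Span child) {
            if (buffered == null) {
                buffered = child;
                return;
            }
            if (buffered.resource.equals(child.resource)) {
                if (buffered.name.equals(child.name)) {
                    buffered.compress(child); // exact match
                    return;
                }
                if (buffered.durationUs <= MAX_SAME_KIND_US && child.durationUs <= MAX_SAME_KIND_US) {
                    buffered.name = "Calls to " + buffered.resource; // same kind
                    buffered.compress(child);
                    return;
                }
            }
            reported.add(buffered);
            buffered = child;
        }

        // The parent ends: the buffered span, if any, gets reported.
        void end() {
            if (buffered != null) {
                reported.add(buffered);
                buffered = null;
            }
        }
    }

    public static void main(String[] args) {
        Parent tx = new Parent();
        tx.onChildEnd(new Span("mysql", "SELECT FROM users", 1_000));
        tx.onChildEnd(new Span("mysql", "SELECT FROM users", 1_000));
        tx.onChildEnd(new Span("mysql", "SELECT FROM users", 1_000));
        tx.onChildEnd(new Span("redis", "GET", 500));
        tx.end();
        for (Span span : tx.reported) {
            System.out.println(span.name + " count=" + span.count + " sumUs=" + span.sumUs);
        }
        // SELECT FROM users count=3 sumUs=3000
        // GET count=0 sumUs=500
    }
}
```

Four ended spans result in two reported span documents: one composite span for the three identical queries and one regular span for the cache call.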

#### Concurrency

The pseudo-code in this spec is intentionally not written in a thread-safe manner to make it more concise.
Also, thread safety is highly platform/runtime dependent, and some don't support parallelism or concurrency.

However, if there can be a situation where multiple spans may end concurrently, agents MUST guard against race conditions.
To do that, agents should prefer [lock-free algorithms](https://en.wikipedia.org/wiki/Non-blocking_algorithm)
paired with retry loops over blocking algorithms that use mutexes or locks.

In particular, operations that work with the buffer require special attention.
- Setting a span into the buffer must be handled atomically.
- Retrieving a span from the buffer must be handled atomically.
  Retrieving includes atomically getting and clearing the buffer.
  This makes sure that only one thread can compare span properties and call mutating methods, such as `compress`, at a time.
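
On the JVM, one possible way to meet these two requirements is a single-slot buffer backed by `java.util.concurrent.atomic.AtomicReference`. The sketch below is only an assumption about how an agent might do this; the `SpanBuffer` class and its `trySet` and `take` methods are hypothetical names, and other runtimes will need different primitives.

```java
import java.util.concurrent.atomic.AtomicReference;

final class BufferExample {
    // Hypothetical single-slot buffer; T would be the agent's span type.
    static final class SpanBuffer<T> {
        private final AtomicReference<T> slot = new AtomicReference<>();

        // Atomically stores the span if the buffer is empty; returns false otherwise.
        boolean trySet(T span) {
            return slot.compareAndSet(null, span);
        }

        // Atomically gets and clears the buffer, so only one caller can obtain a given span.
        T take() {
            return slot.getAndSet(null);
        }
    }

    public static void main(String[] args) {
        SpanBuffer<String> buffer = new SpanBuffer<>();
        System.out.println(buffer.trySet("span-1")); // true
        System.out.println(buffer.trySet("span-2")); // false, the slot is occupied
        System.out.println(buffer.take());           // span-1
        System.out.println(buffer.take());           // null, the slot was already cleared
    }
}
```

A thread that obtained a span through `take` owns it exclusively and can compare and compress it without further synchronization. It would then put the result back with `trySet` and, if that fails because another span was buffered in the meantime, retry or report the span.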
3 changes: 3 additions & 0 deletions specs/agents/tracing-spans-drop-fast.md
@@ -0,0 +1,3 @@
## Dropping fast spans

TODO [`span_min_duration`](https://github.com/elastic/apm/pull/314) is viable again when tracking stats for fast spans [`tracing-spans-dropped-stats.md`](tracing-spans-dropped-stats.md).
52 changes: 52 additions & 0 deletions specs/agents/tracing-spans-dropped-stats.md
@@ -0,0 +1,52 @@
## Collecting statistics about dropped spans

To still retain some information about dropped spans (for example due to `transaction_max_spans` or `span_min_duration`),
agents SHOULD collect statistics on the corresponding transaction about dropped spans.
These statistics MUST only be sent for sampled transactions.

### Use cases

This allows APM Server to consider these metrics for the service destination metrics.
In practice,
this means that the service map, the dependencies table,
and the backend details view can show accurate throughput statistics for backends like Redis,
even if most of the spans are dropped.

This also allows the transaction details view (also known as the waterfall) to show a summary of the dropped spans.

### Data model

This is an example of the statistics that are added to the `transaction` events sent via the intake v2 protocol.

```json
{
  "dropped_spans_stats": [
    {
      "type": "external",
      "subtype": "http",
      "destination_service_resource": "example.com:443",
      "outcome": "failure",
      "count": 28,
      "duration.sum.us": 123456
    },
    {
      "type": "db",
      "subtype": "mysql",
      "destination_service_resource": "mysql",
      "outcome": "success",
      "count": 81,
      "duration.sum.us": 9876543
    }
  ]
}
```
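
The following sketch shows one way an agent could accumulate such statistics on a transaction. It assumes that dropped spans are grouped by the four identifying fields of the example above (`type`, `subtype`, `destination_service_resource`, `outcome`); that grouping, as well as the `DroppedSpanStats` class and the `onSpanDropped` method, are assumptions made for illustration.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class DroppedSpanStatsExample {
    // Accumulated values for one group of dropped spans.
    static final class Stats {
        int count;
        long durationSumUs;
    }

    // Hypothetical per-transaction accumulator.
    static final class DroppedSpanStats {
        final Map<List<String>, Stats> byGroup = new LinkedHashMap<>();

        void onSpanDropped(String type, String subtype, String resource, String outcome, long durationUs) {
            // Lists compare by value, so equal field combinations map to the same entry.
            List<String> group = Arrays.asList(type, subtype, resource, outcome);
            Stats stats = byGroup.computeIfAbsent(group, key -> new Stats());
            stats.count++;
            stats.durationSumUs += durationUs;
        }
    }

    public static void main(String[] args) {
        DroppedSpanStats stats = new DroppedSpanStats();
        stats.onSpanDropped("db", "mysql", "mysql", "success", 100);
        stats.onSpanDropped("db", "mysql", "mysql", "success", 200);
        stats.onSpanDropped("db", "mysql", "mysql", "success", 300);
        stats.onSpanDropped("external", "http", "example.com:443", "failure", 1_500);
        for (Map.Entry<List<String>, Stats> entry : stats.byGroup.entrySet()) {
            System.out.println(entry.getKey() + " count=" + entry.getValue().count
                + " sumUs=" + entry.getValue().durationSumUs);
        }
        // [db, mysql, mysql, success] count=3 sumUs=600
        // [external, http, example.com:443, failure] count=1 sumUs=1500
    }
}
```

Each map entry corresponds to one element of the `dropped_spans_stats` array when the transaction is serialized.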

### Limits
TODO: limit the number of `dropped_spans_stats` elements?

### Effects on destination service metrics

As laid out in the [span destination spec](tracing-spans-destination.md#contextdestinationserviceresource),
APM Server tracks span destination metrics.
To prevent dropped spans from skewing latency metrics and causing throughput metrics to be under-counted,
APM Server will take `dropped_spans_stats` into account when tracking span destination metrics.
17 changes: 17 additions & 0 deletions specs/agents/tracing-spans-handling-huge-traces.md
@@ -0,0 +1,17 @@
# Handling huge traces

Instrumenting applications that make lots of requests (such as 10k+) to backends like caches or databases can lead to several issues:
- A significant performance impact in the target application,
  for example due to a high allocation rate, network traffic, garbage collection, and additional CPU cycles for serializing, compressing, and sending spans.
- Dropping of events in agents or APM Server due to exhausted queues.
- High load on the APM Server.
- High storage costs.
- Decreased performance of the Elastic APM UI due to slow searches and rendering of huge traces.
- Loss of clarity and overview in the UI when analyzing the traces, which degrades the user experience.

Agents can implement several strategies to mitigate these issues:
- [Hard limit on number of spans to collect](tracing-spans-limit.md)
- [Collecting statistics about dropped spans](tracing-spans-dropped-stats.md)
- [Dropping fast spans](tracing-spans-drop-fast.md)
- [Compressing spans](tracing-spans-compress.md)

19 changes: 19 additions & 0 deletions specs/agents/tracing-spans-limit.md
@@ -0,0 +1,19 @@
## Hard limit on number of spans to collect

This is the last line of defense and comes with the highest amount of data loss.
This strategy MUST be implemented by all agents.
Ideally, the other mechanisms limit the number of spans enough so that the hard limit does not kick in.

### Configuration option `transaction_max_spans`

Limits the number of spans that are recorded per transaction.

This is helpful in cases where a transaction creates a very large number of spans (e.g. thousands of SQL queries).

Setting an upper limit prevents the agent and the APM Server from being overloaded with work in such edge cases.

| | |
|----------------|----------|
| Type | `integer`|
| Default | `500` |
| Dynamic | `true` |
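
As a minimal sketch of how such a limit could be enforced, the example below counts started spans per transaction and rejects spans once the limit is exceeded. The diff shown here does not define an implementation, so the `SpanLimiter` class, its method names, and the use of `AtomicInteger` are assumptions made for illustration.

```java
import java.util.concurrent.atomic.AtomicInteger;

final class MaxSpansExample {
    // Hypothetical per-transaction limiter for transaction_max_spans.
    static final class SpanLimiter {
        private final int maxSpans;
        private final AtomicInteger started = new AtomicInteger();
        private final AtomicInteger dropped = new AtomicInteger();

        SpanLimiter(int maxSpans) {
            this.maxSpans = maxSpans;
        }

        // Returns true if the span may be recorded, false if it has to be dropped.
        boolean tryStartSpan() {
            if (started.incrementAndGet() <= maxSpans) {
                return true;
            }
            dropped.incrementAndGet();
            return false;
        }

        int droppedCount() {
            return dropped.get();
        }
    }

    public static void main(String[] args) {
        SpanLimiter limiter = new SpanLimiter(2);
        System.out.println(limiter.tryStartSpan()); // true
        System.out.println(limiter.tryStartSpan()); // true
        System.out.println(limiter.tryStartSpan()); // false, the limit of 2 is reached
        System.out.println(limiter.droppedCount()); // 1
    }
}
```

Because the counter is incremented atomically, concurrent span starts cannot record more than `maxSpans` spans in total.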