Handling huge tracing specs #453

Merged
merged 50 commits into from
Jul 19, 2021
2bd13cf
First draft of handling huge tracing specs
felixbarny Jun 21, 2021
72d384d
Apply suggestions from code review
felixbarny Jun 22, 2021
411f529
Implement suggestions
felixbarny Jun 22, 2021
fd3d879
Update specs/agents/tracing-spans-compress.md
felixbarny Jun 22, 2021
42ad300
Pseudo code for how the strategies work in combination
felixbarny Jun 22, 2021
ae09511
Add composite.exact_match flag
felixbarny Jun 22, 2021
db17364
Apply suggestions from code review
felixbarny Jun 24, 2021
a81d78f
Add breadcrumbs
felixbarny Jun 30, 2021
af969da
Add missing table of contents link to AWS tracing spec file
trentm Jun 29, 2021
f5c010a
Some clarifications for the destination APIs (#452)
eyalkoren Jun 30, 2021
5916a63
Add limit to dropped_spans_stats
felixbarny Jul 5, 2021
ccf4349
Add implementation section to transaction_max_spans
felixbarny Jul 5, 2021
9790529
Merge remote-tracking branch 'origin/master' into compressed-spans
felixbarny Jul 5, 2021
b318ae6
Move exit span definition from destination spec to span spec
felixbarny Jul 5, 2021
7ab424b
Add exit_span_min_duration spec
felixbarny Jul 5, 2021
bcd4a6d
Apply suggestions from code review
felixbarny Jul 5, 2021
834ac8b
Fix links, add clarification to max duration
felixbarny Jul 5, 2021
42663a2
Dropping fast spans requires stats
felixbarny Jul 6, 2021
5828651
Rework transaction_max_spans implementation logic
felixbarny Jul 6, 2021
f260ee5
Improve transaction_max_spans: no CAS
felixbarny Jul 7, 2021
1f3cc6b
Apply suggestions from code review
felixbarny Jul 7, 2021
bb1bcde
Update specs/agents/tracing-spans-compress.md
SergeyKleyman Jul 13, 2021
e6b50d2
Update specs/agents/tracing-spans-compress.md
SergeyKleyman Jul 13, 2021
9ba8957
Update specs/agents/tracing-spans-handling-huge-traces.md
SergeyKleyman Jul 13, 2021
b20c102
Renamed same_kind_compression_max_duration config option
SergeyKleyman Jul 15, 2021
51db949
Added span_compression_same_kind_max_duration config option
SergeyKleyman Jul 15, 2021
473bb4d
Added span_compression_enabled config option
SergeyKleyman Jul 15, 2021
00dcfa8
Update specs/agents/tracing-spans-compress.md
SergeyKleyman Jul 15, 2021
ccf2aa4
Changed end to sum.us in composite sub-object
SergeyKleyman Jul 15, 2021
a046548
Merge remote-tracking branch 'felixbarny/compressed-spans' into compr…
SergeyKleyman Jul 15, 2021
f711c07
Replaced exact_match bool with compression_strategy enum
SergeyKleyman Jul 15, 2021
df344a2
Update specs/agents/tracing-spans-compress.md
SergeyKleyman Jul 15, 2021
98a5bd9
Added outcome requirement to eligible for compression
SergeyKleyman Jul 15, 2021
ef501a3
Added outcome requirement to eligible for compression PART 2
SergeyKleyman Jul 15, 2021
798d270
Added links from tracing-spans.md to tracing-spans-compress.md
SergeyKleyman Jul 15, 2021
3754297
Fixed missing isSameKind check in tryToCompressComposite()
SergeyKleyman Jul 15, 2021
4df5afb
Update specs/agents/tracing-spans-drop-fast-exit.md
SergeyKleyman Jul 19, 2021
0be6c90
Update specs/agents/tracing-spans-compress.md
SergeyKleyman Jul 19, 2021
44c3936
Update specs/agents/tracing-spans-drop-fast-exit.md
SergeyKleyman Jul 19, 2021
182d610
Update specs/agents/tracing-spans-compress.md
SergeyKleyman Jul 19, 2021
a7d728b
Update specs/agents/tracing-spans-compress.md
SergeyKleyman Jul 19, 2021
6b36436
Update specs/agents/tracing-spans-compress.md
SergeyKleyman Jul 19, 2021
2a54365
Removed "Exit span API" requirement from tracing-spans.md
SergeyKleyman Jul 19, 2021
d3a4453
Merge remote-tracking branch 'felixbarny/compressed-spans' into compr…
SergeyKleyman Jul 19, 2021
990463e
Update specs/agents/tracing-spans-drop-fast-exit.md
AlexanderWert Jul 19, 2021
a8e1e91
reafctored file structure for handling huge traces
AlexanderWert Jul 19, 2021
916d1fa
Merge commit 'b338fe9e1539180b05ce57ac0cfb8f3c18aa9b88'
AlexanderWert Jul 19, 2021
48b08c9
Update specs/agents/tracing-spans-destination.md
AlexanderWert Jul 19, 2021
971c96f
Update specs/agents/tracing-spans.md
AlexanderWert Jul 19, 2021
6821501
Update specs/agents/tracing-spans.md
AlexanderWert Jul 19, 2021
77 changes: 61 additions & 16 deletions specs/agents/tracing-spans-limit.md
@@ -22,34 +22,79 @@ Setting an upper limit will prevent overloading the agent and the APM server wit

### Implementation

#### Span count

When a span is put in the agent's reporter queue, a counter should be incremented on its transaction, in order to later identify the _expected_ number of spans.
In this way we can identify data loss, e.g. because events have been dropped.

This counter SHOULD internally be named `reported` and MUST be mapped to `span_count.started` in the intake API.
The word `started` is a misnomer but needs to be used for backward compatibility.
The rest of the spec will refer to this field as `span_count.reported`.

When a span is dropped, it is not reported to the APM Server;
instead, another counter is incremented to track the number of spans dropped.
In this case, the above-mentioned counter for `reported` spans is not incremented.

```json
"span_count": {
  "reported": 500,
  "dropped": 42
}
```

The total number of spans that an agent created within a transaction is equal to `span_count.started + span_count.dropped`.
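
The counters described above could be kept on the transaction roughly as follows. This is a minimal sketch, not the spec's implementation: the class `SpanCount` and its method names are hypothetical, and the hand-built JSON fragment only illustrates how the internal `reported` counter maps to `started` in the intake API.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical helper; class and method names are illustrative, not defined by the spec.
final class SpanCount {
    private final AtomicInteger reported = new AtomicInteger();
    private final AtomicInteger dropped = new AtomicInteger();

    // Call when a span is put in the agent's reporter queue.
    void onReported() {
        reported.incrementAndGet();
    }

    // Call when a span is dropped; `reported` is deliberately left untouched.
    void onDropped() {
        dropped.incrementAndGet();
    }

    // Total number of spans the agent created within the transaction.
    int total() {
        return reported.get() + dropped.get();
    }

    // Intake API fragment: the internal `reported` counter is sent as `started`.
    String toIntakeJson() {
        return "\"span_count\":{\"started\":" + reported.get()
                + ",\"dropped\":" + dropped.get() + "}";
    }
}
```

Keeping the internal name `reported` and renaming it only at serialization time keeps the backward-compatible field name out of the agent's own logic.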

#### Checking the limit

Before creating a span,
agents must determine whether creating that span would exceed the span limit.
The limit is reached when the total number of created spans minus the number of dropped spans is greater or equals to the max number of spans.
The limit is reached when the number of reported spans is greater than or equal to the maximum number of spans.
In other words, the limit is reached if this condition is true:

span_count.total - span_count.dropped >= transaction_max_spans

The `span_count.total` counter is not part of the intake API,
but it helps agents to determine whether the limit has been reached.
It reflects the total amount of started spans within a transaction.
span_count.reported >= transaction_max_spans

On span end, agents that support the concurrent creation of spans need to check the condition again.
That is because any number of spans may be started before any of them end.
Agents SHOULD guard against race conditions and SHOULD prefer lock-free CAS loops over using locks.

Example with lock:
```java
boolean report
lock()
// check and increment under the same lock so that both happen atomically
report = span_count.reported < transaction_max_spans
if (report) {
    span_count.reported++
}
unlock()
```

Example CAS loop:
```java
boolean report
while (true) {
    int reported = span_count.reported.atomic_get()
    report = reported < transaction_max_spans
    if (report && !span_count.reported.compareAndSet(reported, reported + 1)) {
        // race condition - retry
        continue
    }
    break
}
```
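
For reference, the CAS loop above might look like this in compilable Java. This is a sketch under assumptions: the class `SpanLimiter` and the method `tryReport` are hypothetical names, and `java.util.concurrent.atomic.AtomicInteger` stands in for the pseudocode's `span_count.reported`.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical class; names are illustrative and not defined by the spec.
final class SpanLimiter {
    private final AtomicInteger reported = new AtomicInteger();
    private final int transactionMaxSpans;

    SpanLimiter(int transactionMaxSpans) {
        this.transactionMaxSpans = transactionMaxSpans;
    }

    // Returns true if the span may be reported; a slot is reserved atomically.
    boolean tryReport() {
        while (true) {
            int current = reported.get();
            if (current >= transactionMaxSpans) {
                return false; // limit reached: the caller drops the span
            }
            if (reported.compareAndSet(current, current + 1)) {
                return true;
            }
            // Another thread changed the counter in between: retry.
        }
    }

    int reported() {
        return reported.get();
    }
}
```

With a limit of 2, the first two calls to `tryReport()` return `true` and every later call returns `false`, regardless of how many threads race on the counter.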

#### Configuration snapshot

To ensure consistent behavior within one transaction,
the `transaction_max_spans` option should be read once on transaction start.
Even if the option is changed via remote config during the lifetime of a transaction,
the value that has been read at the start of the transaction should be used.
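
One way to take such a snapshot is to copy the option into a final field when the transaction is created. The sketch below is illustrative only: `SnapshotTransaction` is a hypothetical type, and the `IntSupplier` stands in for whatever mechanism an agent uses to read the possibly remotely updated `transaction_max_spans` value.

```java
import java.util.function.IntSupplier;

// Hypothetical transaction type; the spec does not prescribe this API.
final class SnapshotTransaction {
    // Read once at transaction start and never re-read afterwards.
    private final int maxSpans;

    SnapshotTransaction(IntSupplier transactionMaxSpansConfig) {
        this.maxSpans = transactionMaxSpansConfig.getAsInt();
    }

    // All limit checks within this transaction use the snapshot value.
    int maxSpans() {
        return maxSpans;
    }
}
```

A transaction created before a remote config change keeps its original limit, while transactions started afterwards pick up the new value.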

Note that it's not enough to just consider this condition on span start:

span_count.sent >= transaction_max_spans

That's because there may be any number of concurrent spans that are started but not yet ended.
While the condition could potentially be evaluated on span end,
it's preferable to know at the start of the span whether the span should be dropped.
The reason being that agents can omit heavy operations, such as capturing a request body.

### Metric collection
#### Metric collection

Even though we can determine whether to drop a span before starting it, it's not legal to return a `null` or noop span in that case.
That's because we're [collecting statistics about dropped spans](tracing-spans-dropped-stats.md) as well as
[breakdown metrics](https://docs.google.com/document/d/1-_LuC9zhmva0VvLgtI0KcHuLzNztPHbcM0ZdlcPUl64#heading=h.ondan294nbpt)
even for spans that exceed `transaction_max_spans`.

For spans that are known to be dropped upfront, Agents SHOULD NOT collect information that is expensive to get and not needed for metrics collection.
This includes capturing headers, request bodies, and summarizing SQL statements, for example.
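
The following sketch shows one possible shape for this: a span that is known upfront to be dropped is still a real object that records timing, but it skips expensive capture. The type `SketchSpan` and its methods are hypothetical and not taken from any agent.

```java
import java.util.function.Supplier;

// Hypothetical span type; names are illustrative.
final class SketchSpan {
    private final boolean dropped; // decided upfront, before the span starts
    private String requestBody;    // stays null for dropped spans
    private long durationMicros;

    SketchSpan(boolean dropped) {
        this.dropped = dropped;
    }

    // The supplier is only invoked when the span will actually be reported.
    void captureBody(Supplier<String> expensiveCapture) {
        if (!dropped) {
            requestBody = expensiveCapture.get();
        }
    }

    // Timing is recorded for every span, because metrics collection needs it.
    void end(long durationMicros) {
        this.durationMicros = durationMicros;
    }

    String requestBody() {
        return requestBody;
    }

    long durationMicros() {
        return durationMicros;
    }
}
```

Passing the expensive work as a `Supplier` means a dropped span never pays for it, while its duration remains available for dropped-span statistics and breakdown metrics.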
15 changes: 0 additions & 15 deletions specs/agents/tracing-spans.md
@@ -56,21 +56,6 @@ The documentation should clarify that spans with `unknown` outcomes are ignored

Spans may have an associated stack trace, in order to locate the associated source code that caused the span to occur. If there are many spans being collected this can cause a significant amount of overhead in the application, due to the capture, rendering, and transmission of potentially large stack traces. It is possible to limit the recording of span stack traces to only spans that are slower than a specified duration, using the config variable `ELASTIC_APM_SPAN_FRAMES_MIN_DURATION`.

### Span count

When a span is started a counter should be incremented on its transaction, in order to later identify the _expected_ number of spans. In this way we can identify data loss, e.g. because events have been dropped, or because of instrumentation errors.

To handle edge cases where many spans are captured within a single transaction, the agent should enable the user to start dropping spans when the associated transaction exceeds a configurable number of spans. When a span is dropped, it is not reported to the APM Server, but instead another counter is incremented to track the number of spans dropped. In this case the above mentioned counter for started spans is not incremented.

```json
"span_count": {
  "started": 500,
  "dropped": 42
}
```

Here's how the limit can be configured for [Node.js](https://www.elastic.co/guide/en/apm/agent/nodejs/current/agent-api.html#transaction-max-spans) and [Python](https://www.elastic.co/guide/en/apm/agent/python/current/configuration.html#config-transaction-max-spans).

### Exit spans

Exit spans are spans that describe a call to an external service,