Problem Statement
When exporting with delta temporality, the measurement that occurs immediately after a Collect performs significantly worse (450 ns, 4 allocs with exemplars disabled) than the steady-state case (50 ns, 0 allocs). Collect clears the internal state associated with each attribute set, so the first measurement in a new collection interval must recreate that state, which is expensive. Because Collect does this for every instrument and every attribute set in the entire SDK, it can trigger a large burst of allocations in a short period of time.
This came up in the context of bound instruments (open-telemetry/opentelemetry-specification#5050 (review), open-telemetry/opentelemetry-specification#5050 (comment)), which could mitigate this by not clearing bound instruments from the SDK's internal storage during Collect.
Proposed Solution
Lazily clear the SDK's internal state for delta metrics (PoC: #8230). This can be done by marking everything in the SDK's storage as "stale" rather than immediately deleting it. This approach has already been implemented in .NET and Rust (open-telemetry/opentelemetry-specification#5050 (comment)). It would solve the issue generically, even for attribute sets that are not known ahead of time (e.g. otelhttp metrics).
Downsides:
- Complexity. Adding staleness tracking to the internal state makes the SDK somewhat more complex.
- Memory usage. Today, delta aggregations maintain "hot" and "cold" `map[attribute.Set]aggregation` maps. Once elements in the cold map are collected, they can be deleted. If we keep them instead, they stick around longer. In the best case (the same attribute sets observed every collection interval), this effectively doubles the memory usage for delta metrics.
Alternatives
Keep what we have today, or rely on bound instruments (future) to solve this case.
Prior Art
#8230
Additional Context
cc @cijothomas, who may be able to provide more context on the .NET and Rust implementations.