-
Notifications
You must be signed in to change notification settings - Fork 279
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add high rate queue for log template snapshots #7310
Conversation
Debugger benchmarksParameters
See matching parameters
SummaryFound 5 performance improvements and 0 performance regressions! Performance is the same for 4 metrics, 6 unstable metrics.
See unchanged results
Request duration reports for reportsgantt
title reports - request duration [CI 0.99] : candidate=None, baseline=None
dateFormat X
axisFormat %s
section baseline
noprobe (296.845 µs) : 260, 333
. : milestone, 297,
basic (290.781 µs) : 281, 300
. : milestone, 291,
loop (10.959 ms) : 10928, 10990
. : milestone, 10959,
section candidate
noprobe (289.995 µs) : 262, 318
. : milestone, 290,
basic (298.85 µs) : 290, 308
. : milestone, 299,
loop (10.251 ms) : 10222, 10280
. : milestone, 10251,
|
BenchmarksStartupParameters
See matching parameters
SummaryFound 0 performance improvements and 0 performance regressions! Performance is the same for 53 metrics, 10 unstable metrics. Startup time reports for petclinicgantt
title petclinic - global startup overhead: candidate=1.38.0-SNAPSHOT~fde88f2ab0, baseline=1.38.0-SNAPSHOT~0e5e2749b0
dateFormat X
axisFormat %s
section tracing
Agent [baseline] (1.064 s) : 0, 1064163
Total [baseline] (10.332 s) : 0, 10331622
Agent [candidate] (1.064 s) : 0, 1063592
Total [candidate] (10.319 s) : 0, 10319461
section appsec
Agent [baseline] (1.184 s) : 0, 1184404
Total [baseline] (10.573 s) : 0, 10572802
Agent [candidate] (1.191 s) : 0, 1190598
Total [candidate] (10.525 s) : 0, 10525469
section iast
Agent [baseline] (1.173 s) : 0, 1172618
Total [baseline] (10.731 s) : 0, 10731262
Agent [candidate] (1.173 s) : 0, 1173060
Total [candidate] (10.699 s) : 0, 10698681
section profiling
Agent [baseline] (1.261 s) : 0, 1261003
Total [baseline] (10.601 s) : 0, 10601336
Agent [candidate] (1.26 s) : 0, 1260311
Total [candidate] (10.553 s) : 0, 10552628
gantt
title petclinic - break down per module: candidate=1.38.0-SNAPSHOT~fde88f2ab0, baseline=1.38.0-SNAPSHOT~0e5e2749b0
dateFormat X
axisFormat %s
section tracing
BytebuddyAgent [baseline] (665.979 ms) : 0, 665979
BytebuddyAgent [candidate] (665.568 ms) : 0, 665568
GlobalTracer [baseline] (305.331 ms) : 0, 305331
GlobalTracer [candidate] (305.134 ms) : 0, 305134
AppSec [baseline] (49.94 ms) : 0, 49940
AppSec [candidate] (50.011 ms) : 0, 50011
Remote Config [baseline] (675.79 µs) : 0, 676
Remote Config [candidate] (678.046 µs) : 0, 678
Telemetry [baseline] (7.613 ms) : 0, 7613
Telemetry [candidate] (7.615 ms) : 0, 7615
section appsec
BytebuddyAgent [baseline] (676.308 ms) : 0, 676308
BytebuddyAgent [candidate] (681.144 ms) : 0, 681144
GlobalTracer [baseline] (299.227 ms) : 0, 299227
GlobalTracer [candidate] (301.184 ms) : 0, 301184
AppSec [baseline] (153.35 ms) : 0, 153350
AppSec [candidate] (154.219 ms) : 0, 154219
Remote Config [baseline] (622.189 µs) : 0, 622
Remote Config [candidate] (623.299 µs) : 0, 623
Telemetry [baseline] (8.818 ms) : 0, 8818
Telemetry [candidate] (8.002 ms) : 0, 8002
IAST [baseline] (22.765 ms) : 0, 22765
IAST [candidate] (21.048 ms) : 0, 21048
section iast
BytebuddyAgent [baseline] (781.224 ms) : 0, 781224
BytebuddyAgent [candidate] (781.352 ms) : 0, 781352
GlobalTracer [baseline] (296.244 ms) : 0, 296244
GlobalTracer [candidate] (295.381 ms) : 0, 295381
AppSec [baseline] (47.334 ms) : 0, 47334
AppSec [candidate] (48.037 ms) : 0, 48037
Remote Config [baseline] (572.794 µs) : 0, 573
Remote Config [candidate] (569.301 µs) : 0, 569
Telemetry [baseline] (6.942 ms) : 0, 6942
Telemetry [candidate] (6.978 ms) : 0, 6978
IAST [baseline] (26.75 ms) : 0, 26750
IAST [candidate] (27.206 ms) : 0, 27206
section profiling
ProfilingAgent [baseline] (95.054 ms) : 0, 95054
ProfilingAgent [candidate] (95.896 ms) : 0, 95896
BytebuddyAgent [baseline] (661.859 ms) : 0, 661859
BytebuddyAgent [candidate] (660.956 ms) : 0, 660956
GlobalTracer [baseline] (387.701 ms) : 0, 387701
GlobalTracer [candidate] (387.211 ms) : 0, 387211
AppSec [baseline] (51.341 ms) : 0, 51341
AppSec [candidate] (51.414 ms) : 0, 51414
Remote Config [baseline] (672.551 µs) : 0, 673
Remote Config [candidate] (661.483 µs) : 0, 661
Telemetry [baseline] (7.345 ms) : 0, 7345
Telemetry [candidate] (7.33 ms) : 0, 7330
Profiling [baseline] (95.078 ms) : 0, 95078
Profiling [candidate] (95.921 ms) : 0, 95921
Startup time reports for insecure-bankgantt
title insecure-bank - global startup overhead: candidate=1.38.0-SNAPSHOT~fde88f2ab0, baseline=1.38.0-SNAPSHOT~0e5e2749b0
dateFormat X
axisFormat %s
section tracing
Agent [baseline] (1.071 s) : 0, 1071494
Total [baseline] (8.538 s) : 0, 8537698
Agent [candidate] (1.072 s) : 0, 1071833
Total [candidate] (8.558 s) : 0, 8557604
section iast
Agent [baseline] (1.171 s) : 0, 1170958
Total [baseline] (8.999 s) : 0, 8999456
Agent [candidate] (1.169 s) : 0, 1168728
Total [candidate] (8.971 s) : 0, 8971452
section iast_HARDCODED_SECRET_DISABLED
Agent [baseline] (1.169 s) : 0, 1169035
Total [baseline] (8.942 s) : 0, 8941669
Agent [candidate] (1.17 s) : 0, 1169968
Total [candidate] (8.913 s) : 0, 8912745
section iast_TELEMETRY_OFF
Agent [baseline] (1.168 s) : 0, 1167635
Total [baseline] (8.97 s) : 0, 8970167
Agent [candidate] (1.172 s) : 0, 1172054
Total [candidate] (8.949 s) : 0, 8948501
gantt
title insecure-bank - break down per module: candidate=1.38.0-SNAPSHOT~fde88f2ab0, baseline=1.38.0-SNAPSHOT~0e5e2749b0
dateFormat X
axisFormat %s
section tracing
BytebuddyAgent [baseline] (670.632 ms) : 0, 670632
BytebuddyAgent [candidate] (670.854 ms) : 0, 670854
GlobalTracer [baseline] (307.32 ms) : 0, 307320
GlobalTracer [candidate] (307.566 ms) : 0, 307566
AppSec [baseline] (50.381 ms) : 0, 50381
AppSec [candidate] (50.292 ms) : 0, 50292
Remote Config [baseline] (676.105 µs) : 0, 676
Remote Config [candidate] (682.267 µs) : 0, 682
Telemetry [baseline] (7.697 ms) : 0, 7697
Telemetry [candidate] (7.747 ms) : 0, 7747
section iast
BytebuddyAgent [baseline] (779.289 ms) : 0, 779289
BytebuddyAgent [candidate] (778.779 ms) : 0, 778779
GlobalTracer [baseline] (295.507 ms) : 0, 295507
GlobalTracer [candidate] (295.16 ms) : 0, 295160
AppSec [baseline] (47.998 ms) : 0, 47998
AppSec [candidate] (49.594 ms) : 0, 49594
IAST [baseline] (27.173 ms) : 0, 27173
IAST [candidate] (24.069 ms) : 0, 24069
Remote Config [baseline] (574.515 µs) : 0, 575
Remote Config [candidate] (602.79 µs) : 0, 603
Telemetry [baseline] (6.986 ms) : 0, 6986
Telemetry [candidate] (7.068 ms) : 0, 7068
section iast_HARDCODED_SECRET_DISABLED
BytebuddyAgent [baseline] (778.388 ms) : 0, 778388
BytebuddyAgent [candidate] (777.867 ms) : 0, 777867
GlobalTracer [baseline] (294.718 ms) : 0, 294718
GlobalTracer [candidate] (295.109 ms) : 0, 295109
AppSec [baseline] (50.295 ms) : 0, 50295
AppSec [candidate] (48.067 ms) : 0, 48067
IAST [baseline] (23.819 ms) : 0, 23819
IAST [candidate] (27.116 ms) : 0, 27116
Remote Config [baseline] (560.04 µs) : 0, 560
Remote Config [candidate] (565.361 µs) : 0, 565
Telemetry [baseline] (7.725 ms) : 0, 7725
Telemetry [candidate] (7.752 ms) : 0, 7752
section iast_TELEMETRY_OFF
BytebuddyAgent [baseline] (776.604 ms) : 0, 776604
BytebuddyAgent [candidate] (780.596 ms) : 0, 780596
GlobalTracer [baseline] (295.075 ms) : 0, 295075
GlobalTracer [candidate] (294.583 ms) : 0, 294583
AppSec [baseline] (47.205 ms) : 0, 47205
AppSec [candidate] (47.315 ms) : 0, 47315
IAST [baseline] (27.911 ms) : 0, 27911
IAST [candidate] (27.79 ms) : 0, 27790
Remote Config [baseline] (564.636 µs) : 0, 565
Remote Config [candidate] (579.393 µs) : 0, 579
Telemetry [baseline] (6.756 ms) : 0, 6756
Telemetry [candidate] (7.597 ms) : 0, 7597
LoadParameters
See matching parameters
SummaryFound 0 performance improvements and 0 performance regressions! Performance is the same for 11 metrics, 17 unstable metrics. Request duration reports for insecure-bankgantt
title insecure-bank - request duration [CI 0.99] : candidate=1.38.0-SNAPSHOT~fde88f2ab0, baseline=1.38.0-SNAPSHOT~0e5e2749b0
dateFormat X
axisFormat %s
section baseline
no_agent (363.97 µs) : 345, 383
. : milestone, 364,
iast (482.077 µs) : 460, 504
. : milestone, 482,
iast_FULL (543.147 µs) : 522, 564
. : milestone, 543,
iast_GLOBAL (501.28 µs) : 480, 523
. : milestone, 501,
iast_HARDCODED_SECRET_DISABLED (475.917 µs) : 455, 497
. : milestone, 476,
iast_INACTIVE (447.667 µs) : 427, 469
. : milestone, 448,
iast_TELEMETRY_OFF (470.583 µs) : 449, 492
. : milestone, 471,
tracing (435.173 µs) : 415, 456
. : milestone, 435,
section candidate
no_agent (372.286 µs) : 353, 392
. : milestone, 372,
iast (480.095 µs) : 458, 502
. : milestone, 480,
iast_FULL (542.42 µs) : 522, 563
. : milestone, 542,
iast_GLOBAL (491.705 µs) : 471, 512
. : milestone, 492,
iast_HARDCODED_SECRET_DISABLED (471.096 µs) : 450, 492
. : milestone, 471,
iast_INACTIVE (444.82 µs) : 424, 466
. : milestone, 445,
iast_TELEMETRY_OFF (467.229 µs) : 446, 488
. : milestone, 467,
tracing (437.182 µs) : 417, 458
. : milestone, 437,
Request duration reports for petclinicgantt
title petclinic - request duration [CI 0.99] : candidate=1.38.0-SNAPSHOT~fde88f2ab0, baseline=1.38.0-SNAPSHOT~0e5e2749b0
dateFormat X
axisFormat %s
section baseline
no_agent (1.348 ms) : 1329, 1368
. : milestone, 1348,
appsec (1.721 ms) : 1698, 1745
. : milestone, 1721,
appsec_no_iast (1.71 ms) : 1687, 1733
. : milestone, 1710,
iast (1.461 ms) : 1438, 1484
. : milestone, 1461,
profiling (1.477 ms) : 1452, 1501
. : milestone, 1477,
tracing (1.46 ms) : 1435, 1484
. : milestone, 1460,
section candidate
no_agent (1.353 ms) : 1334, 1372
. : milestone, 1353,
appsec (1.722 ms) : 1699, 1746
. : milestone, 1722,
appsec_no_iast (1.69 ms) : 1665, 1715
. : milestone, 1690,
iast (1.456 ms) : 1433, 1478
. : milestone, 1456,
profiling (1.519 ms) : 1493, 1545
. : milestone, 1519,
tracing (1.457 ms) : 1433, 1481
. : milestone, 1457,
DacapoParameters
See matching parameters
SummaryFound 0 performance improvements and 0 performance regressions! Performance is the same for 12 metrics, 0 unstable metrics. Execution time for tomcatgantt
title tomcat - execution time [CI 0.99] : candidate=1.38.0-SNAPSHOT~fde88f2ab0, baseline=1.38.0-SNAPSHOT~0e5e2749b0
dateFormat X
axisFormat %s
section baseline
no_agent (1.465 ms) : 1454, 1476
. : milestone, 1465,
appsec (2.213 ms) : 2178, 2248
. : milestone, 2213,
iast (1.973 ms) : 1932, 2015
. : milestone, 1973,
iast_GLOBAL (2.012 ms) : 1970, 2054
. : milestone, 2012,
profiling (1.859 ms) : 1826, 1893
. : milestone, 1859,
tracing (1.841 ms) : 1808, 1873
. : milestone, 1841,
section candidate
no_agent (1.465 ms) : 1454, 1477
. : milestone, 1465,
appsec (2.218 ms) : 2183, 2253
. : milestone, 2218,
iast (1.978 ms) : 1936, 2020
. : milestone, 1978,
iast_GLOBAL (2.022 ms) : 1980, 2064
. : milestone, 2022,
profiling (1.858 ms) : 1825, 1892
. : milestone, 1858,
tracing (1.838 ms) : 1806, 1871
. : milestone, 1838,
Execution time for biojavagantt
title biojava - execution time [CI 0.99] : candidate=1.38.0-SNAPSHOT~fde88f2ab0, baseline=1.38.0-SNAPSHOT~0e5e2749b0
dateFormat X
axisFormat %s
section baseline
no_agent (15.581 s) : 15581000, 15581000
. : milestone, 15581000,
appsec (15.019 s) : 15019000, 15019000
. : milestone, 15019000,
iast (18.718 s) : 18718000, 18718000
. : milestone, 18718000,
iast_GLOBAL (17.8 s) : 17800000, 17800000
. : milestone, 17800000,
profiling (15.255 s) : 15255000, 15255000
. : milestone, 15255000,
tracing (14.949 s) : 14949000, 14949000
. : milestone, 14949000,
section candidate
no_agent (15.455 s) : 15455000, 15455000
. : milestone, 15455000,
appsec (15.018 s) : 15018000, 15018000
. : milestone, 15018000,
iast (18.494 s) : 18494000, 18494000
. : milestone, 18494000,
iast_GLOBAL (17.992 s) : 17992000, 17992000
. : milestone, 17992000,
profiling (15.872 s) : 15872000, 15872000
. : milestone, 15872000,
tracing (14.915 s) : 14915000, 14915000
. : milestone, 14915000,
|
Historically, the Queue inside SnapshotSink was designed for a low rate number of snapshots (1 per second) and the consumer thread is the one shared with all other tasks (AgentTaskScheduler.INSTANCE). So the reactivity of the thread to consume the queue is not enough for handling up to 5000 snapshots/s allowed for log template probes. We introduce then a dedicated high rate queue with dedicated threads with better reactivity (max 100ms between polls). All non-capturing snapshots are added into this queue.
d26c2d3
to
6ef7e14
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good for the core part
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I love the idea of having two simple queues with different logic on interval adjustment. Very NICE!
public static final int MAX_SNAPSHOT_SIZE = 1024 * 1024; | ||
private static final int MINUTES_BETWEEN_ERROR_LOG = 5; | ||
public static final int LOW_RATE_CAPACITY = 1024; | ||
static final int HIGH_RATE_MIN_FLUSH_INTERVAL = 1; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is that in ms?
this.scheduled = | ||
AgentTaskScheduler.INSTANCE.scheduleAtFixedRate( | ||
this::flush, this, currentFlushInterval, currentFlushInterval, TimeUnit.MILLISECONDS); | ||
private void lowRateReschedule() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why does the highRateLogic is on SnapshotSink but lowRateLogic is on DebuggerSink?
I would suggest continue this refactor and make both queues handle their logic.
Maybe DebuggerSink should own two SnapshotSinks or have two classes for SnapshotSinks.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeah it's a complicated beast as the flushing thread for low rate Q is also used for flushing probe status & symbols
void doReconsiderFlushInterval() { | ||
double remainingCapacityPercent = snapshotSink.remainingCapacity() * 1D / CAPACITY; | ||
long currentInterval = currentFlushInterval; | ||
void doReconsiderLowRateFlushInterval() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: comment on the reason/goal behind this function.
long interval = currentHighRateFlushInterval; | ||
if (snapshotCount == HIGH_RATE_CAPACITY) { | ||
currentHighRateFlushInterval = HIGH_RATE_MIN_FLUSH_INTERVAL; | ||
} else if (snapshotCount > HIGH_RATE_CAPACITY * 3 / 4) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
consider define HIGH_RATE_CAPACITY * 3 / 4 as a const HIGH_RATE_75_PERCENT_CAPACITY
} | ||
} | ||
|
||
private void reconsiderHighRateFlushInterval(int snapshotCount) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
not sure I follow the logic here.
From what it seems this function try to have interval where snapshotCount is around 10%-25% of capacity. not sure why this is the goal.
The way I think about it is that uploadPayloads take snapshots and create batches of them. I think we want interval that send 1 batch per flush.
ratio = 1 / (snapshotCount / averageSnapshotsPerAFullBatch)
newInterval = oldInterval * ratio.
this will try to get the interval to be the time it take to collect enough snapshots to emit them in a single batch.
We can estimate averageSnapshotsPerAFullBatch as we can take the average json size of a log message. or just calculate it after every batch was generated.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
no the logic is not like this.
We try to adjust the interval between flush to avoid filling the queue when we are doing the serialization of all the snapshots.
it's a compromise between polling frequently the queue but not too much if the rate of arrival is low, but then if we have high rate we want to decrease the flush interval to avoid dropping snapshots with a full queue.
This is a heuristic so not trying to be exact or something.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
log template snapshots are samll enough to consider taking the whole queue at once. Worst case we are splitting into 2 batches, not a big deal.
What Does This Do
We introduce a dedicated high rate queue with dedicated threads with better reactivity (max 100ms between polls). All non-capturing snapshots are added into this queue.
Motivation
Historically, the Queue inside SnapshotSink was designed for a low rate number of snapshots (1 per second) and the consumer thread is the one shared with all other tasks (AgentTaskScheduler.INSTANCE). So the reactivity of the thread to consume the queue is not enough for handling up to 5000 snapshots/s allowed for log template probes.
Additional Notes
Jira ticket: DEBUG-2499