Measure metrics latency #2323
base: main
Conversation
Codecov Report: all modified and coverable lines are covered by tests ✅
Additional details and impacted files:

@@ Coverage Diff @@
##             main   #2323   +/- ##
=====================================
  Coverage    79.5%   79.5%
=====================================
  Files         123     123
  Lines       21258   21258
=====================================
+ Hits        16905   16907      +2
+ Misses       4353    4351      -2
stress/src/metrics_latency.rs (outdated)
// collect threads in short intervals
let mut total_count = 0;
while threads.iter().any(|t| !t.is_finished()) {
    // collect aggressively so we could measure inserts properly,
Trying to understand what this is simulating. The metric collection interval is 30 or 60 seconds by default in the spec. What is the scenario for collecting every 500 microseconds?
The basic idea is to choose a collection interval that doesn't affect the numbers (p99) but still allows testing both updates and inserts (for inserts I need to clear the map regularly, but not so frequently that updates would become inserts).
I updated the code with an explanation of why this number was chosen. It could be different, but it should be good enough to work in all cases without affecting p99.
(Sorry for the rebase/force-push; the first commit was a bit rushed and I wasn't happy with it. Now it's better.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, I am not following the scenario being tested - are we expecting that users will run collect() every few microseconds? I don't think that is something we need to consider for performance optimizations.
Also, I opened #2328 to track revisiting the aggressive cleanup of the HashMap.
Force-pushed from e0df038 to aa55bb8.
    }));
}
let mut total_count = 0;
while start.elapsed() < Duration::from_secs(1) {
If the goal is to measure the impact of collect() on overall throughput, let's add the existing PeriodicReader with a dummy exporter to the existing stress test. The PeriodicReader interval should be the default (60 secs), or we can shorten it to a low number like 5 secs, but definitely not microseconds!
(I don't expect adding a Reader to have any measurable impact on the stress test.)
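For illustration only, a rough sketch in plain std Rust of the shape being suggested here (no OpenTelemetry types are used; in the real stress test a PeriodicReader with a dummy exporter would replace the hand-rolled background thread below): worker threads keep driving the hot loop while collection happens on a coarse, multi-second interval, so any collection cost shows up in overall throughput rather than in per-call latency.

use std::sync::atomic::{AtomicBool, AtomicU64, Ordering};
use std::sync::Arc;
use std::thread;
use std::time::Duration;

fn main() {
    let stop = Arc::new(AtomicBool::new(false));
    let updates = Arc::new(AtomicU64::new(0));

    // Worker threads: the existing stress-test hot loop would call
    // counter.add(1, &attribs) here; a plain counter increment stands in for it.
    let workers: Vec<_> = (0..4)
        .map(|_| {
            let stop = Arc::clone(&stop);
            let updates = Arc::clone(&updates);
            thread::spawn(move || {
                while !stop.load(Ordering::Relaxed) {
                    updates.fetch_add(1, Ordering::Relaxed);
                }
            })
        })
        .collect();

    // Stand-in for a periodic reader: collect on a coarse interval (seconds,
    // not microseconds). A dummy exporter would simply drop what is collected.
    let collector = {
        let stop = Arc::clone(&stop);
        thread::spawn(move || {
            while !stop.load(Ordering::Relaxed) {
                thread::sleep(Duration::from_secs(5));
                // Collection and export would happen here in the real setup.
            }
        })
    };

    thread::sleep(Duration::from_secs(10));
    stop.store(true, Ordering::Relaxed);
    for w in workers {
        w.join().unwrap();
    }
    collector.join().unwrap();
    println!("total updates in 10s: {}", updates.load(Ordering::Relaxed));
}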
The goal is the opposite: collect should be invisible in the statistics. But keep in mind that this number is almost arbitrary, and I should probably have a better implementation; for starters it already shows meaningful numbers :)
So with 300 microseconds it will clear around 1800-2500 times per second (the time to collect also adds up).
For p99, I care about 99% of measurements and can ignore the rest. That means collecting has no effect on the p99 of updates as long as I measure more than 2500 * 99 ≈ 250K/s; in my case I'm always over several million per second, so collecting doesn't affect the statistics at all.
For inserts the situation is different: I need to avoid hitting overflow (2k distinct attribute sets), which means I can reach at most around 2M/s (which might happen with better hardware). So I'm basically hoping that I clear frequently enough not to hit the 2k overflow, but not so frequently that it greatly affects p99 (it probably does in some scenarios, so this is not ideal).
I should probably implement some sort of dynamic/adaptive interval, so that it doesn't clear too frequently (affecting more than 1% of measurements for p99) but also not too rarely (hitting the 2k overflow).
In other words, if I collect after fewer than 100 updates it will affect p99, but if I only collect after 2k inserts, inserts might hit overflow. I could also implement different strategies for each case.
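A minimal sketch of the adaptive idea described above, using hypothetical names that are not part of the PR: trigger a collect after a fixed count of recorded measurements, chosen so that collects touch fewer than 1% of samples (keeping p99 clean) while the number of distinct attribute sets stays below the 2,000 overflow limit mentioned in the thread.

use std::sync::atomic::{AtomicU64, Ordering};

// Hypothetical gate: collect every `interval` measurements, with
// 100 < interval < 2_000 so collects affect < 1% of samples (p99) and the
// attribute-set map is cleared before it can reach the 2k overflow.
struct CollectEvery {
    count: AtomicU64,
    interval: u64,
}

impl CollectEvery {
    fn new(interval: u64) -> Self {
        assert!(interval > 100 && interval < 2_000);
        Self { count: AtomicU64::new(0), interval }
    }

    /// Call once per recorded measurement; returns true when a collect should run now.
    fn should_collect(&self) -> bool {
        self.count.fetch_add(1, Ordering::Relaxed) % self.interval == self.interval - 1
    }
}

fn main() {
    let gate = CollectEvery::new(1_000);
    let mut collects = 0u64;
    for _ in 0..10_000u64 {
        // ...record one measurement / insert one new attribute set here...
        if gate.should_collect() {
            // In the real test this is where the collect / map clearing would go.
            collects += 1;
        }
    }
    assert_eq!(collects, 10);
    println!("collects triggered: {collects}");
}

A count-based trigger like this sidesteps the timer tuning entirely: the ratio of collects to measurements is fixed by construction, regardless of how fast the hardware is.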
Can you elaborate on the scenario being covered? Without understanding the actual scenario, it's hard to provide useful feedback.
It tests how much time it takes to call .add(value, attribs) under various conditions. It measures several things: average time, median time (p50), p95, and p99.
The way it works is pretty simple: call .add(value, attribs) on several threads in a loop and measure each call.
Here are the results on my machine on the main branch. (See results.)
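A self-contained sketch of that measurement approach (illustrative names, not the PR's code): each thread times individual calls to the measured operation, all samples are merged and sorted, and percentiles are read off by nearest rank. In the actual stress test the measured operation would be counter.add(value, attribs).

use std::thread;
use std::time::{Duration, Instant};

// Run `op` `iterations` times on each of `threads` threads, timing every call.
fn measure<F>(threads: usize, iterations: usize, op: F) -> Vec<Duration>
where
    F: Fn() + Send + Sync + Copy + 'static,
{
    let handles: Vec<_> = (0..threads)
        .map(|_| {
            thread::spawn(move || {
                let mut samples = Vec::with_capacity(iterations);
                for _ in 0..iterations {
                    let start = Instant::now();
                    op(); // e.g. counter.add(1, &attribs) in the actual test
                    samples.push(start.elapsed());
                }
                samples
            })
        })
        .collect();

    let mut all: Vec<Duration> = handles
        .into_iter()
        .flat_map(|h| h.join().unwrap())
        .collect();
    all.sort_unstable();
    all
}

// Nearest-rank percentile over an already sorted sample set.
fn percentile(sorted: &[Duration], p: f64) -> Duration {
    let idx = ((p / 100.0) * (sorted.len() - 1) as f64).round() as usize;
    sorted[idx]
}

fn main() {
    // Stand-in workload; black_box keeps the compiler from optimizing it away.
    let samples = measure(4, 100_000, || {
        std::hint::black_box(1 + 1);
    });
    let avg: Duration = samples.iter().sum::<Duration>() / samples.len() as u32;
    println!(
        "avg {:?}, p50 {:?}, p95 {:?}, p99 {:?}",
        avg,
        percentile(&samples, 50.0),
        percentile(&samples, 95.0),
        percentile(&samples, 99.0)
    );
}

Sorting the merged samples once at the end keeps the hot loop free of any percentile bookkeeping, so the timing of each call reflects only the operation being measured.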
Merge requirement checklist
CHANGELOG.md files updated for non-trivial, user-facing changes