-
Notifications
You must be signed in to change notification settings - Fork 46
Add query per stage latency metrics #839
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
…er_stage_latency_metrics
@@ -328,7 +328,8 @@ public SearchResponse handle(IndexState indexState, SearchRequest searchRequest) | |||
InnerHitFetchTask::getDiagnostic))); | |||
} | |||
searchContext.getResponseBuilder().setDiagnostics(diagnostics); | |||
|
|||
// TODO: These are the diagnostics I want to publish to prometheus, I'll try to figure out |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I just left these comments here as a reference for myself.
@@ -120,6 +120,7 @@ public static void updateSearchResponseMetrics( | |||
.labelValues(index, "facet:" + entry.getKey()) | |||
.observe(entry.getValue()); | |||
} | |||
searchStageLatencyMs.labelValues(index, "rescore").observe(diagnostics.getRescoreTimeMs());//adding extra rescore metric to avoid calculating average of all rescorers | |||
for (Map.Entry<String, Double> entry : diagnostics.getRescorersTimeMsMap().entrySet()) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For now, searchStageLatencyMs only has the rescorer latency for each rescorer, but for an initial overview, a general latency for all rescorers might also be useful.
@@ -151,7 +155,6 @@ public MetricSnapshots collect() { | |||
if (publishVerboseMetrics) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When would we set this to true?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Only when we need to the extra metrics to investigate something
@@ -139,6 +140,9 @@ public MetricSnapshots collect() { | |||
try { | |||
metrics.add(searchTimeoutCount.collect()); | |||
metrics.add(searchTerminatedEarlyCount.collect()); | |||
metrics.add(searchStageLatencyMs.collect());// Just adding this here should mean it gets published to prometheus, is that what we want? | |||
// when is publishVerboseMetrics set to true, I couldn't find this metric in the grafana shard without any filters? | |||
// maybe it makes sense to add an extra parameter like publishSearchStageLatencyMs with default value true (?) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does it make sense to have an option to turn this on and off separately?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is a live index setting which enables publishing verbose metrics
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think if a metric is per-hit, it should come under the verbose metrics
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, that makes sense. I was considering adding something like publishOnlyPerQueryStageLatency
, which could offer a way to only publish searchStageLatencyMs
without searchResponseSizeBytes
and searchResponseTotalHits
. (All three would still be published if verbose is set to true.)
But if we only turn this on in case we want to investigate something, or the performance impact is not too high, I suppose that wouldn't be necessary.
@@ -328,7 +328,8 @@ public SearchResponse handle(IndexState indexState, SearchRequest searchRequest) | |||
InnerHitFetchTask::getDiagnostic))); | |||
} | |||
searchContext.getResponseBuilder().setDiagnostics(diagnostics); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the diagnostics object here contains all of the required metrics, including the rescorer metrics. Can you please confirm once?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, it looks like it here. What I couldn't find there was a total rescore time metric, but just adding up the time of all rescorers in a prometheus query might be an easier option.
Working on publishing per stage latency metrics via prometheus.