Conversation

mathisnyp:

Working on publishing per-stage latency metrics via Prometheus.
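
As a rough sketch of the kind of metric this adds (not the PR's actual code; the metric name, label names, and class name below are assumptions for illustration), a per-stage latency histogram with the Prometheus Java client (prometheus-metrics 1.x) could look like this:

import io.prometheus.metrics.core.metrics.Histogram;

public class SearchStageMetrics {
  // Sketch only: one histogram labeled by index and search stage, so each
  // stage's latency can be queried separately (e.g. stage="rescore",
  // stage="facet:<name>").
  static final Histogram searchStageLatencyMs =
      Histogram.builder()
          .name("search_stage_latency_ms")
          .help("Per-stage search latency in milliseconds")
          .labelNames("index", "stage")
          .build();

  static void record(String index, String stage, double latencyMs) {
    searchStageLatencyMs.labelValues(index, stage).observe(latencyMs);
  }
}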

@@ -328,7 +328,8 @@ public SearchResponse handle(IndexState indexState, SearchRequest searchRequest)
InnerHitFetchTask::getDiagnostic)));
}
searchContext.getResponseBuilder().setDiagnostics(diagnostics);

// TODO: These are the diagnostics I want to publish to prometheus, I'll try to figure out
Author (mathisnyp):

I just left these comments here as a reference for myself.

@@ -120,6 +120,7 @@ public static void updateSearchResponseMetrics(
.labelValues(index, "facet:" + entry.getKey())
.observe(entry.getValue());
}
searchStageLatencyMs.labelValues(index, "rescore").observe(diagnostics.getRescoreTimeMs()); // adding an extra rescore metric to avoid calculating the average of all rescorers
for (Map.Entry<String, Double> entry : diagnostics.getRescorersTimeMsMap().entrySet()) {
Author (mathisnyp), Mar 26, 2025:

For now, searchStageLatencyMs only has the latency for each individual rescorer, but for an initial overview, a combined latency across all rescorers might also be useful.
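
If a single total is not available in the diagnostics, one option (a sketch only, not this PR's code; the "rescore_total" label and the per-rescorer label format are assumptions) is to sum the per-rescorer map server-side and observe it alongside the per-rescorer values:

// Sketch: combined latency across all rescorers, derived from the per-rescorer map.
double totalRescoreMs =
    diagnostics.getRescorersTimeMsMap().values().stream()
        .mapToDouble(Double::doubleValue)
        .sum();
searchStageLatencyMs.labelValues(index, "rescore_total").observe(totalRescoreMs);

// Per-rescorer latencies, as in the current change (label format assumed here).
for (Map.Entry<String, Double> entry : diagnostics.getRescorersTimeMsMap().entrySet()) {
  searchStageLatencyMs
      .labelValues(index, "rescorer:" + entry.getKey())
      .observe(entry.getValue());
}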

@@ -151,7 +155,6 @@ public MetricSnapshots collect() {
if (publishVerboseMetrics) {
Author (mathisnyp):

When would we set this to true?

Contributor:

Only when we need the extra metrics to investigate something.

@@ -139,6 +140,9 @@ public MetricSnapshots collect() {
try {
metrics.add(searchTimeoutCount.collect());
metrics.add(searchTerminatedEarlyCount.collect());
metrics.add(searchStageLatencyMs.collect()); // Just adding this here should mean it gets published to Prometheus, is that what we want?
// When is publishVerboseMetrics set to true? I couldn't find this metric in the Grafana shard without any filters.
// Maybe it makes sense to add an extra parameter like publishSearchStageLatencyMs with a default value of true (?)
Author (mathisnyp), Mar 26, 2025:

Does it make sense to have an option to turn this on and off separately?

Contributor:

There is a live index setting which enables publishing verbose metrics.

Contributor:

I think if a metric is per-hit, it should come under the verbose metrics.

Author (mathisnyp):

Yes, that makes sense. I was considering adding something like publishOnlyPerQueryStageLatency, which could offer a way to only publish searchStageLatencyMs without searchResponseSizeBytes and searchResponseTotalHits. (All three would still be published if verbose is set to true.)

But if we only turn this on when we want to investigate something, or if the performance impact is not too high, I suppose that wouldn't be necessary.
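
For reference, here is a sketch of the gating being discussed (publishSearchStageLatencyMs is the hypothetical flag from the comment above, not an existing setting; the method is abbreviated to the metrics mentioned in this thread, with error handling omitted):

// Sketch only: keep per-query stage latency behind its own flag, and keep the
// per-hit metrics behind the existing verbose-metrics live setting.
public MetricSnapshots collect() {
  List<MetricSnapshot> metrics = new ArrayList<>();
  metrics.add(searchTimeoutCount.collect());
  metrics.add(searchTerminatedEarlyCount.collect());
  if (publishSearchStageLatencyMs) {
    // Defaults to true, so per-query stage latency is published without
    // turning on the heavier per-hit verbose metrics.
    metrics.add(searchStageLatencyMs.collect());
  }
  if (publishVerboseMetrics) {
    // Per-hit metrics stay behind the verbose-metrics setting.
    metrics.add(searchResponseSizeBytes.collect());
    metrics.add(searchResponseTotalHits.collect());
  }
  return new MetricSnapshots(metrics);
}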

@@ -328,7 +328,8 @@ public SearchResponse handle(IndexState indexState, SearchRequest searchRequest)
InnerHitFetchTask::getDiagnostic)));
}
searchContext.getResponseBuilder().setDiagnostics(diagnostics);
Contributor:

I think the diagnostics object here contains all of the required metrics, including the rescorer metrics. Can you please confirm once?

Author (mathisnyp), Apr 1, 2025:

Yes, it looks like it does. What I couldn't find there was a total rescore time metric, but just adding up the times of all the rescorers in a Prometheus query might be an easier option.
