Conversation

mathisnyp:

Working on publishing per-stage latency metrics via Prometheus.
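
As a rough sketch of the kind of metric this adds (not the PR's actual code; the metric name, label names, and class name below are assumptions for illustration), a per-stage latency histogram with the Prometheus Java client (prometheus-metrics 1.x) could look like this:

import io.prometheus.metrics.core.metrics.Histogram;

public class SearchStageMetrics {
  // Sketch only: one histogram labeled by index and search stage, so each
  // stage's latency can be queried separately (e.g. stage="rescore",
  // stage="facet:<name>").
  static final Histogram searchStageLatencyMs =
      Histogram.builder()
          .name("search_stage_latency_ms")
          .help("Per-stage search latency in milliseconds")
          .labelNames("index", "stage")
          .build();

  static void record(String index, String stage, double latencyMs) {
    searchStageLatencyMs.labelValues(index, stage).observe(latencyMs);
  }
}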

@@ -328,7 +328,8 @@ public SearchResponse handle(IndexState indexState, SearchRequest searchRequest)
InnerHitFetchTask::getDiagnostic)));
}
searchContext.getResponseBuilder().setDiagnostics(diagnostics);

// TODO: These are the diagnostics I want to publish to prometheus, I'll try to figure out
Author (mathisnyp):

I just left these comments here as a reference for myself.

@@ -120,6 +120,7 @@ public static void updateSearchResponseMetrics(
.labelValues(index, "facet:" + entry.getKey())
.observe(entry.getValue());
}
searchStageLatencyMs.labelValues(index, "rescore").observe(diagnostics.getRescoreTimeMs()); // adding an extra rescore metric to avoid calculating the average of all rescorers
for (Map.Entry<String, Double> entry : diagnostics.getRescorersTimeMsMap().entrySet()) {
Author (mathisnyp), Mar 26, 2025:

For now, searchStageLatencyMs only has the latency for each individual rescorer, but for an initial overview, a combined latency across all rescorers might also be useful.
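
If a single total is not available in the diagnostics, one option (a sketch only, not this PR's code; the "rescore_total" label and the per-rescorer label format are assumptions) is to sum the per-rescorer map server-side and observe it alongside the per-rescorer values:

// Sketch: combined latency across all rescorers, derived from the per-rescorer map.
double totalRescoreMs =
    diagnostics.getRescorersTimeMsMap().values().stream()
        .mapToDouble(Double::doubleValue)
        .sum();
searchStageLatencyMs.labelValues(index, "rescore_total").observe(totalRescoreMs);

// Per-rescorer latencies, as in the current change (label format assumed here).
for (Map.Entry<String, Double> entry : diagnostics.getRescorersTimeMsMap().entrySet()) {
  searchStageLatencyMs
      .labelValues(index, "rescorer:" + entry.getKey())
      .observe(entry.getValue());
}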

@@ -151,7 +155,6 @@ public MetricSnapshots collect() {
if (publishVerboseMetrics) {
Author (mathisnyp):

When would we set this to true?

Contributor:

Only when we need the extra metrics to investigate something.

@@ -139,6 +140,9 @@ public MetricSnapshots collect() {
try {
metrics.add(searchTimeoutCount.collect());
metrics.add(searchTerminatedEarlyCount.collect());
metrics.add(searchStageLatencyMs.collect()); // Just adding this here should mean it gets published to Prometheus, is that what we want?
// When is publishVerboseMetrics set to true? I couldn't find this metric in the Grafana shard without any filters.
// Maybe it makes sense to add an extra parameter like publishSearchStageLatencyMs with a default value of true (?)
Author (mathisnyp), Mar 26, 2025:

Does it make sense to have an option to turn this on and off separately?

Contributor:

There is a live index setting which enables publishing verbose metrics.

Contributor:

I think if a metric is per-hit, it should come under the verbose metrics.

Author (mathisnyp):

Yes, that makes sense. I was considering adding something like publishOnlyPerQueryStageLatency, which could offer a way to only publish searchStageLatencyMs without searchResponseSizeBytes and searchResponseTotalHits. (All three would still be published if verbose is set to true.)

But if we only turn this on when we want to investigate something, or if the performance impact is not too high, I suppose that wouldn't be necessary.
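
For reference, here is a sketch of the gating being discussed (publishSearchStageLatencyMs is the hypothetical flag from the comment above, not an existing setting; the method is abbreviated to the metrics mentioned in this thread, with error handling omitted):

// Sketch only: keep per-query stage latency behind its own flag, and keep the
// per-hit metrics behind the existing verbose-metrics live setting.
public MetricSnapshots collect() {
  List<MetricSnapshot> metrics = new ArrayList<>();
  metrics.add(searchTimeoutCount.collect());
  metrics.add(searchTerminatedEarlyCount.collect());
  if (publishSearchStageLatencyMs) {
    // Defaults to true, so per-query stage latency is published without
    // turning on the heavier per-hit verbose metrics.
    metrics.add(searchStageLatencyMs.collect());
  }
  if (publishVerboseMetrics) {
    // Per-hit metrics stay behind the verbose-metrics setting.
    metrics.add(searchResponseSizeBytes.collect());
    metrics.add(searchResponseTotalHits.collect());
  }
  return new MetricSnapshots(metrics);
}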

@@ -328,7 +328,8 @@ public SearchResponse handle(IndexState indexState, SearchRequest searchRequest)
InnerHitFetchTask::getDiagnostic)));
}
searchContext.getResponseBuilder().setDiagnostics(diagnostics);
Contributor:

I think the diagnostics object here contains all of the required metrics, including the rescorer metrics. Can you please confirm once?

Author (mathisnyp), Apr 1, 2025:

Yes, it looks like it does. What I couldn't find there was a total rescore time metric, but just adding up the times of all the rescorers in a Prometheus query might be an easier option.
