[router-bridge] misleading information about query planning time #455

Open
garypen opened this issue Mar 6, 2024 · 1 comment

garypen commented Mar 6, 2024

The router provides a query planning metric: apollo_router_query_planning_time as a histogram.

This metric is recorded inside the router and measures all of the time that elapses between submitting the query to the router-bridge and the router-bridge returning.

Only a small part of that duration may be actual query planning; most of it may be time spent queued, waiting for planning to start. This is because the router-bridge maintains a queue of queries to be planned, and the query planner works synchronously, pulling one job at a time from that queue. Let's imagine a scenario where we have two queries to be planned:
complex: takes 30 s
simple: takes 1 ms
If they arrive at times (seconds):
0.000: complex
0.001: simple
Then they will finish at:
30.000: complex
30.001: simple
so apollo_router_query_planning_time reports roughly 30 s for both queries, which can cause confusion when wondering why a simple query takes so long to plan.
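
The arithmetic in this scenario can be reproduced with a small simulation. This is only a sketch of a single-worker FIFO queue, not the router-bridge code; `Job`, `Timing` and `simulate` are hypothetical names, and jobs are assumed to be listed in arrival order.

```rust
/// One query submitted to a hypothetical single-worker FIFO planner.
#[derive(Debug, Clone, Copy)]
pub struct Job {
    pub arrival_ms: u64,
    pub planning_ms: u64,
}

/// How the elapsed time for one query splits up.
#[derive(Debug, PartialEq, Eq)]
pub struct Timing {
    /// Time spent waiting behind earlier queries.
    pub queued_ms: u64,
    /// Time spent actually planning.
    pub planning_ms: u64,
    /// Submit-to-return time, i.e. what the histogram would record.
    pub reported_ms: u64,
}

/// Simulate a synchronous planner that handles jobs one at a time,
/// in arrival order (the slice is assumed to be sorted by arrival).
pub fn simulate(jobs: &[Job]) -> Vec<Timing> {
    let mut worker_free_at = 0u64;
    jobs.iter()
        .map(|job| {
            // Planning cannot start before the previous job has finished.
            let start = worker_free_at.max(job.arrival_ms);
            let finish = start + job.planning_ms;
            worker_free_at = finish;
            Timing {
                queued_ms: start - job.arrival_ms,
                planning_ms: job.planning_ms,
                reported_ms: finish - job.arrival_ms,
            }
        })
        .collect()
}
```

With the two queries above (complex arriving at 0 ms and taking 30 000 ms, simple arriving at 1 ms and taking 1 ms), the simple query gets `queued_ms = 29_999`, `planning_ms = 1` and `reported_ms = 30_000`: almost all of the reported time is queueing.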

If we want to address this, there are several things we could do within the router/router-bridge:

  1. We'd need to make the number of jobs which can queue in the QP smaller than the currently hard-coded 10_000
  2. We'd need to modify the query planner service so that it doesn't always return Poll::Ready(Ok(())) and instead propagates that back-pressure (somehow)
  3. We could modify the documentation for the existing metric and clarify what the histogram is actually measuring
  4. We could add new metrics to break out queuing time from actual planning time
  5. ...
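
As a rough illustration of options 1, 2 and 4 together, here is a minimal sketch using only the standard library. It is not the router or router-bridge implementation: `PlanJob`, `PlanTimings`, `bounded_queue`, `try_submit` and `plan_next` are hypothetical names, and `capacity` stands in for the hard-coded 10_000. A full queue rejects the submission (a crude form of back-pressure), and each job carries its enqueue time so queueing and planning can be reported separately.

```rust
use std::sync::mpsc::{sync_channel, Receiver, SyncSender, TrySendError};
use std::time::{Duration, Instant};

/// A planning request stamped with the moment it entered the queue.
pub struct PlanJob {
    pub query: String,
    pub enqueued_at: Instant,
}

/// Queueing and planning durations, recorded separately (option 4).
#[derive(Debug)]
pub struct PlanTimings {
    pub queued: Duration,
    pub planning: Duration,
}

/// Hypothetical bounded planning queue (option 1).
pub fn bounded_queue(capacity: usize) -> (SyncSender<PlanJob>, Receiver<PlanJob>) {
    sync_channel(capacity)
}

/// Try to enqueue without blocking (option 2). On failure the query is
/// handed back so the caller can shed load or retry later.
pub fn try_submit(tx: &SyncSender<PlanJob>, query: &str) -> Result<(), String> {
    let job = PlanJob {
        query: query.to_string(),
        enqueued_at: Instant::now(),
    };
    match tx.try_send(job) {
        Ok(()) => Ok(()),
        // Queue full, or the planner side has gone away.
        Err(TrySendError::Full(job)) | Err(TrySendError::Disconnected(job)) => Err(job.query),
    }
}

/// Pull one job if available, run `plan` on it, and return both durations.
pub fn plan_next<F: FnOnce(&str)>(rx: &Receiver<PlanJob>, plan: F) -> Option<PlanTimings> {
    let job = rx.try_recv().ok()?;
    let started = Instant::now();
    let queued = started.duration_since(job.enqueued_at);
    plan(&job.query);
    Some(PlanTimings {
        queued,
        planning: started.elapsed(),
    })
}
```

In a real service the rejection in `try_submit` would presumably surface through the service's readiness check rather than as an error on submit; the sketch only shows where the signal could come from.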

Geal commented Mar 6, 2024

We could make a queue on the router side and make sure there's only 1 element in the planning queue on the router-bridge side?
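
A minimal sketch of that idea, assuming the bridge can be modelled as a blocking call: queries wait in a router-side queue and are handed to the bridge one at a time, so the time spent inside the bridge call is planning time only. `RouterSideQueue`, `Measured` and `dispatch_next` are hypothetical names, not existing router APIs.

```rust
use std::collections::VecDeque;
use std::time::{Duration, Instant};

/// Hypothetical router-side queue: queries wait here, not inside the bridge.
#[derive(Default)]
pub struct RouterSideQueue {
    pending: VecDeque<(String, Instant)>,
}

/// Per-query measurements, split by where the time was spent.
#[derive(Debug)]
pub struct Measured {
    pub query: String,
    /// Time spent waiting in the router-side queue.
    pub router_wait: Duration,
    /// Time spent inside the bridge call, i.e. planning only.
    pub bridge_time: Duration,
}

impl RouterSideQueue {
    /// Record the submission time and park the query on the router side.
    pub fn submit(&mut self, query: &str) {
        self.pending.push_back((query.to_string(), Instant::now()));
    }

    /// Hand exactly one query to the bridge (modelled by `bridge_plan`)
    /// and wait for it, so the bridge never holds more than one job.
    pub fn dispatch_next<F: FnOnce(&str)>(&mut self, bridge_plan: F) -> Option<Measured> {
        let (query, submitted_at) = self.pending.pop_front()?;
        let sent_at = Instant::now();
        bridge_plan(&query);
        Some(Measured {
            router_wait: sent_at.duration_since(submitted_at),
            bridge_time: sent_at.elapsed(),
            query,
        })
    }
}
```

With this split, a simple query stuck behind a complex one would show a large `router_wait` and a small `bridge_time`, instead of one large planning time.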
