Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Enough functionality to implement adaptive load shedding #6148

Draft
wants to merge 3 commits into
base: dev
Choose a base branch
from

Conversation

garypen
Copy link
Contributor

@garypen garypen commented Oct 14, 2024

An investigation into backpressure issues in the router.

Most of the changes are in various plugins to implement backpressure. However, those fixes are not enough to provide useful functionality...

The current implementation of the router create a new pipeline for each connection. This has the unfortunate impact of discarding state which is required for various load impacting layers to work correctly.

This exploration modifies the router to hold a single master pipeline which is clone'd for each connection. This allows the various tower connection limiting layers to work correctly.

I've got a version which works with standard tower layers, commented out here, but I've also got a potentially more interesting version which uses a load shedded based on Little's Law, which is what is active in this code.

Notes:

Modifying the pipeline to be cloneable has generally worked fine, but it has caused issues for the Limit layer. This layer looks "generally problematic" since it appears to make a number of assumptions about what request rejection actually means. I've done some minimal modification to try and make it work win a cloned pipeline, but tests are still failing and I'm not sure it does what it should do.

I also noticed that when implementing backpressure, various mock tests needed to be modified since test rejection happened earlier in the pipeline and a map_result() somewhere isn't triggered. That needs some investigation, but I think it's a small problem to address.

I modified the bridge query planner pool to prevent excessive queueing in this layer. Since I now want to control this by load_shedding before this service is reached, I only want enough channels to support the number of planners.

Summary:

This PR provides a router which will operate with approximately the same performance of the base router, but which controls memory and rejects excess load to prevent "over-commit" by the router. This is a very desirable property.

More testing is required, but this is looking promising so far.

Description here

Fixes #issue_number


Checklist

Complete the checklist (and note appropriate exceptions) before the PR is marked ready-for-review.

  • Changes are compatible1
  • Documentation2 completed
  • Performance impact assessed and acceptable
  • Tests added and passing3
    • Unit Tests
    • Integration Tests
    • Manual Tests

Exceptions

Note any exceptions here

Notes

Footnotes

  1. It may be appropriate to bring upcoming changes to the attention of other (impacted) groups. Please endeavour to do this before seeking PR approval. The mechanism for doing this will vary considerably, so use your judgement as to how and when to do this.

  2. Configuration is an important part of many changes. Where applicable please try to document configuration examples.

  3. Tick whichever testing boxes are applicable. If you are adding Manual Tests, please document the manual testing (extensively) in the Exceptions.

An investigation into backpressure issues in the router.

Most of the changes are in various plugins to implement backpressure.
However, those fixes are not enough to provide useful functionality...

The current implementation of the router create a new pipeline for each
connection. This has the unfortunate impact of discarding state which is
required for various load impacting layers to work correctly.

This exploration modifies the router to hold a single master pipeline
which is clone'd for each connection. This allows the various tower
connection limiting layers to work correctly.

I've got a version which works with standard tower layers, commented out
here, but I've also got a potentially more interesting version which
uses a load shedded based on Little's Law, which is what is active in
this code.

Notes:

Modifying the pipeline to be cloneable has generally worked fine, but it
has caused issues for the Limit layer. This layer looks "generally
problematic" since it appears to make a number of assumptions about what
request rejection actually means. I've done some minimal modification to
try and make it work win a cloned pipeline, but tests are still failing
and I'm not sure it does what it should do.

I also noticed that when implementing backpressure, various mock tests
needed to be modified since test rejection happened earlier in the
pipeline and a map_result() somewhere isn't triggered. That needs some
investigation, but I think it's a small problem to address.

I modified the bridge query planner pool to prevent excessive queueing
in this layer. Since I now want to control this by load_shedding before
this service is reached, I only want enough channels to support the
number of planners.

Summary:

This PR provides a router which will operate with approximately
the same performance of the base router, but which controls memory and
rejects excess load to prevent "over-commit" by the router. This is a
very desirable property.

More testing is required, but this is looking promising so far.
@garypen garypen self-assigned this Oct 14, 2024
@svc-apollo-docs
Copy link
Collaborator

svc-apollo-docs commented Oct 14, 2024

✅ Docs Preview Ready

No new or changed pages found.

Copy link
Contributor

@garypen, please consider creating a changeset entry in /.changesets/. These instructions describe the process and tooling.

@Geal Geal changed the title Enough functionality to implement adapative load shedding Enough functionality to implement adaptive load shedding Oct 18, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants