fix(grpc): stop server on SIGTERM and SIGINT #647

Merged
merged 2 commits into from
Mar 24, 2025

Conversation

dionysius
Contributor

@dionysius dionysius commented Mar 14, 2025

Why is this PR required? What issue does it fix?: Stopping the binary does not exit the process, because the gRPC server goroutine keeps running and is never cancelled.

What this PR does?: Additionally listens (see below) for the system signals SIGTERM and SIGINT and stops the gRPC server when one of them arrives.

Does this PR require any upgrade changes?: no

If the changes in this PR are manually verified, list down the scenarios covered: Stopping the binary in agent mode now stops the process, since all goroutines stop on their own.

Any additional information for your reviewer? :
This is only a quick fix, but it is small and concise. A better approach would be to refactor and pass the stop channel down from the top of the application. Better still would be to introduce a single context covering the whole lifetime of the application process. Even the function used to register those signals was updated to return a context many years ago: https://github.com/kubernetes-sigs/controller-runtime/blob/main/pkg/manager/signals/signal.go. See also: https://stackoverflow.com/a/74895157 and https://gist.github.com/embano1/e0bf49d24f1cdd07cffad93097c04f0a

Checklist:

  • Fixes #646 (does not cleanly shut down on SIGTERM)
  • PR Title follows the convention of <type>(<scope>): <subject>
  • Has the change log section been updated?
  • Commit has unit tests
  • Commit has integration tests
  • (Optional) Are upgrade changes included in this PR? If not, mention the issue/PR to track:
  • (Optional) If documentation changes are required, which issue on https://github.com/openebs/website is used to track them:

@dionysius dionysius requested a review from a team as a code owner March 14, 2025 06:51
Member

@Abhinandan-Purkait Abhinandan-Purkait left a comment

LGTM

@Abhinandan-Purkait
Member

cc @tiagolobocastro @niladrih

Member

@niladrih niladrih left a comment

@dionysius
Contributor Author

dionysius commented Mar 14, 2025

@niladrih I've considered it, but it would crash because of https://github.com/kubernetes-sigs/controller-runtime/blob/v0.2.0/pkg/manager/signals/signal.go#L24. This function may only be called once.

That's why I noted that the function should be called somewhere at the "top" of the application, with the stopCh then passed down everywhere. Currently both the agent mode and the controller mode call this function, and the gRPC server is started next to it. There is no easy way to pass stopCh around without starting a refactoring.

This PR is designed to be small and concise. Any better approach is a bigger refactor.
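To illustrate why a second call crashes: the linked line appears to guard against repeated calls by closing a package-level channel, and closing an already closed channel panics in Go. The sketch below imitates that pattern in isolation; `newSetup` and `callPanics` are hypothetical names, and this is not the controller-runtime code itself.

```go
package main

import "fmt"

// newSetup returns a setup function protected by a call-once guard,
// imitating the guard-channel pattern (simplified, hypothetical).
func newSetup() func() <-chan struct{} {
	guard := make(chan struct{})
	return func() <-chan struct{} {
		close(guard) // second call panics: close of closed channel
		return make(chan struct{})
	}
}

// callPanics reports whether calling f panics.
func callPanics(f func()) (panicked bool) {
	defer func() {
		if recover() != nil {
			panicked = true
		}
	}()
	f()
	return false
}

func main() {
	setup := newSetup()
	fmt.Println("first call panics:", callPanics(func() { setup() }))
	fmt.Println("second call panics:", callPanics(func() { setup() }))
}
```

The first call succeeds and the second one panics, which is why the handler has to be set up in exactly one place.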

Member

@niladrih niladrih left a comment

Makes sense. Could you add a // TODO: comment explaining the tech debt involved in this approach? Thanks.

@dionysius
Contributor Author

dionysius commented Mar 14, 2025

Hmm... I was about to prepare such commit. I just saw one more option, define:

var (
  // Shared by all of pkg/driver; SetupSignalHandler may only be called once.
  stopCh = signals.SetupSignalHandler()
)

at package level for pkg/driver. All affected files reside there.

@dionysius
Contributor Author

dionysius commented Mar 14, 2025

Yeah, this variant doesn't look too bad either: https://github.com/openebs/zfs-localpv/compare/develop...dionysius:zfs-localpv:grpc_stop_2?expand=1.

But it needs some comments and a quick check, since I want to be sure that two goroutines can wait on the channel simultaneously.
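On the channel question in general: in Go, closing a channel unblocks every goroutine receiving from it, so several goroutines can wait on the same stop channel, provided the stop is signalled by closing it rather than by sending a single value. A small sketch, with the hypothetical helper `countWoken`:

```go
package main

import (
	"fmt"
	"sync"
)

// countWoken starts n goroutines that all block on one stop channel,
// closes the channel, and returns how many goroutines were released.
func countWoken(n int) int {
	stopCh := make(chan struct{})

	var (
		wg    sync.WaitGroup
		mu    sync.Mutex
		woken int
	)
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			<-stopCh // every receiver is released when stopCh is closed
			mu.Lock()
			woken++
			mu.Unlock()
		}()
	}

	close(stopCh) // broadcast: wakes all waiting goroutines
	wg.Wait()
	return woken
}

func main() {
	fmt.Println("goroutines woken:", countWoken(2)) // prints "goroutines woken: 2"
}
```

A single value sent on the channel, by contrast, would be received by only one of the waiting goroutines.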

@dionysius
Contributor Author

dionysius commented Mar 14, 2025

No, it doesn't behave the same. While the issue might be fixed (the process stops), it doesn't report all of the stopped routines (a few are missing):

I0314 07:32:14.850218   23156 grpc.go:142] Shutting down gRPC server gracefully
I0314 07:32:14.850366   23156 backup.go:184] Shutting down Bkp workers
I0314 07:32:14.850422   23156 zfsnode.go:214] Shutting down Node controller

Yeah, this PR is the quick way to go. Anything else would best be done with a context and/or by "proxy"-copying the stopCh to the various places.

I've updated the affected places with TODO notes accordingly.
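As a rough idea of what "proxy"-copying the stopCh could mean, the sketch below derives several stop channels from a single source channel. The helper `fanOut` is hypothetical and not part of this PR, and whether such a proxy would restore the missing shutdown log lines is not established in this thread.

```go
package main

import "fmt"

// fanOut returns n stop channels. Each one is closed as soon as src is
// closed or delivers a value, so every component gets its own channel.
func fanOut(src <-chan struct{}, n int) []<-chan struct{} {
	outs := make([]chan struct{}, n)
	readOnly := make([]<-chan struct{}, n)
	for i := range outs {
		outs[i] = make(chan struct{})
		readOnly[i] = outs[i]
	}

	go func() {
		<-src // wait for the single upstream stop signal
		for _, c := range outs {
			close(c) // propagate it to every derived channel
		}
	}()
	return readOnly
}

func main() {
	src := make(chan struct{})
	stops := fanOut(src, 3)

	close(src) // simulate the signal handler firing

	for i, c := range stops {
		<-c
		fmt.Println("component", i, "received stop")
	}
}
```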

@codecov-commenter

codecov-commenter commented Mar 14, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 95.91%. Comparing base (8c402d3) to head (da25775).
Report is 54 commits behind head on develop.

Additional details and impacted files
@@             Coverage Diff             @@
##           develop     #647      +/-   ##
===========================================
- Coverage    95.99%   95.91%   -0.08%     
===========================================
  Files            1        1              
  Lines          574      686     +112     
===========================================
+ Hits           551      658     +107     
- Misses          19       23       +4     
- Partials         4        5       +1     
Flag Coverage Δ
bddtests 95.91% <ø> (∅)

@Abhinandan-Purkait
Member

@dionysius Are we good to merge?

@Abhinandan-Purkait Abhinandan-Purkait merged commit e92c61f into openebs:develop Mar 24, 2025
6 checks passed

Successfully merging this pull request may close these issues.

does not cleanly shut down on SIGTERM
5 participants