This is not a real PR, it's a collection of TODOs #46
base: main
Conversation
These are the questions I got while reviewing the code base. We can discuss them over GH and then create dedicated issues to track them (if needed). Signed-off-by: Flavio Castelli <[email protected]>
// creation of one of the CronJobs fails. We create one CronJob per node,
// hence failing on node 5/10 will cause this reconciliation loop to
// exit immediately. However, the shim finalizer has not been set yet;
// that happens at the end of this `if` block.
I would propose changing handleInstallShim to return an array of errors, so that all the nodes get a chance to have the shim installed on them.
Then, of course, we also need to ensure the finalizer is always set, even if there was an error.
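A minimal sketch of that idea (the handleInstallShim/deployJobOnNode signatures below are assumed, not copied from the repo; imports omitted; errors.Join is standard library, Go 1.20+):

// Hypothetical sketch: keep iterating over all nodes even when one deployment
// fails, and return the failures aggregated at the end.
func (sr *ShimReconciler) handleInstallShim(ctx context.Context, shim *kwasmv1.Shim, nodes *corev1.NodeList) error {
	var errs []error
	for i := range nodes.Items {
		node := &nodes.Items[i]
		if err := sr.deployJobOnNode(ctx, shim, node); err != nil {
			// Record the failure but keep going, so the remaining nodes
			// still get a chance to have the shim installed.
			errs = append(errs, fmt.Errorf("node %q: %w", node.Name, err))
		}
	}
	// nil if every node succeeded, otherwise a single joined error.
	return errors.Join(errs...)
}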
I agree that calling deployJobOnNode should happen concurrently, and that potential errors should be aggregated and returned; that would also make it easier to implement strategies such as retries later on, if needed.
Also, it would indeed make sense to call ensureFinalizerForShim even if an error happens:
// Ensure the finalizer is set even if a return happens before
defer func() {
	err := sr.ensureFinalizerForShim(ctx, &shimResource, KwasmOperatorFinalizer)
	if err != nil {
		log.Error().Msgf("Failed to ensure finalizer: %s", err)
	}
}()
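For the concurrent part, a rough sketch (again, the deployJobOnNode signature is assumed; sync, errors and fmt are standard library, errors.Join needs Go 1.20+):

// Hypothetical sketch: install the shim on all nodes concurrently and
// still report every failure, not just the first one.
func (sr *ShimReconciler) deployJobsConcurrently(ctx context.Context, shim *kwasmv1.Shim, nodes *corev1.NodeList) error {
	var (
		wg   sync.WaitGroup
		mu   sync.Mutex
		errs []error
	)
	for i := range nodes.Items {
		node := &nodes.Items[i] // per-iteration variable, safe to capture
		wg.Add(1)
		go func() {
			defer wg.Done()
			if err := sr.deployJobOnNode(ctx, shim, node); err != nil {
				mu.Lock()
				errs = append(errs, fmt.Errorf("node %q: %w", node.Name, err))
				mu.Unlock()
			}
		}()
	}
	wg.Wait()
	// nil when everything succeeded, otherwise a single joined error.
	return errors.Join(errs...)
}

golang.org/x/sync/errgroup would also work here, but its Wait() only returns the first error, which loses the per-node detail.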
@@ -474,14 +493,15 @@ func (sr *ShimReconciler) handleDeleteShim(ctx context.Context, shim *kwasmv1.Sh
}

func (sr *ShimReconciler) getNodeListFromShimsNodeSelctor(ctx context.Context, shim *kwasmv1.Shim) (*corev1.NodeList, error) {
	//TODO (flavio): probably eager optimization - do some pagination?
I know of clusters with 100s of nodes
Even with 5k nodes, the returned NodeList is likely less than 10 MB; hence, pagination is probably not worth it.
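If it ever becomes necessary, a hedged sketch with controller-runtime's Limit/Continue list options could look like this (the helper name and page size are made up; also note that a cache-backed client may ignore these options, so this mostly matters when listing straight from the API server):

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// listNodesPaginated is a hypothetical helper that fetches matching nodes
// page by page instead of in a single List call.
func listNodesPaginated(ctx context.Context, c client.Reader, selector map[string]string) ([]corev1.Node, error) {
	var nodes []corev1.Node
	continueToken := ""
	for {
		list := &corev1.NodeList{}
		opts := []client.ListOption{
			client.MatchingLabels(selector),
			client.Limit(500), // arbitrary example page size
		}
		if continueToken != "" {
			opts = append(opts, client.Continue(continueToken))
		}
		if err := c.List(ctx, list, opts...); err != nil {
			return nil, err
		}
		nodes = append(nodes, list.Items...)
		continueToken = list.Continue
		if continueToken == "" {
			return nodes, nil
		}
	}
}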
I'm leaving the PR as a draft because it is not intended to be merged.
These are the questions I got while reviewing the code base. The idea is to discuss them over GH and then create dedicated issues to track them (if needed).
There are also some small changes that, if agreed upon, I can submit as proper PRs.