Add synchronization between the MPS control daemon and the device plugin #606
base: main
Conversation
Signed-off-by: Evan Lezar <[email protected]>
	ReadyFilePath = "/mps/.ready"
)

// ReadyFile represents a file used to store readyness of the MPS daemon.
Suggested change:
- // ReadyFile represents a file used to store readyness of the MPS daemon.
+ // ReadyFile represents a file used to store readiness of the MPS daemon.
// Matches checks whether the contents of the ready file matches the specified config.
func (f ReadyFile) Matches(config *spec.Config) (bool, error) {
Maybe use a different verb than Matches to future-proof this (since we don't plan on doing an exact match of the full config file in the future).
	return nil
}

// Load loads the contents of th read file.
Suggested change:
- // Load loads the contents of th read file.
+ // Load loads the contents of the ready file.
// NewCommand constructs a mount command.
func NewCommand() *cli.Command {
	// Create the 'generate-cdi' command
Suggested change:
- // Create the 'generate-cdi' command
+ // Create the 'wait' command
	if err != nil {
		return nil, fmt.Errorf("unable to validate flags: %v", err)
	}
	config.Flags.GFD = nil
Is it worth a comment why we nil out the GFD config?
	readyFile := mps.ReadyFile{}
	if err := readyFile.Remove(); err != nil {
		klog.Warningf("Failed to remove .ready file: %v", err)
	}
What if MPS is turned off in the config? Will a previous ready file just be sitting around until MPS gets re-enabled at some point in the future?
Yes, with the current implementation. What options do we have to ensure that this is removed if the MPS daemonset is removed?
	klog.Info("Starting OS watcher.")
	sigs := watch.Signals(syscall.SIGHUP, syscall.SIGINT, syscall.SIGTERM, syscall.SIGQUIT)

	var retry <-chan time.Time

restart:
	err = assertMpsIsReady(config)
	if err == nil {
		klog.Infof("MPS is ready")
		return nil
	}
	klog.ErrorS(err, "MPS is not ready; retrying ...")
	retry = time.After(30 * time.Second)

	for {
		select {
		case <-retry:
			goto restart
		// Watch for any signals from the OS. On SIGHUP, restart this loop,
		// restarting all of the plugins in the process. On all other
		// signals, exit the loop and exit the program.
		case s := <-sigs:
			switch s {
			case syscall.SIGHUP:
				klog.Info("Received SIGHUP, restarting.")
				goto restart
			default:
				klog.Infof("Received signal \"%v\", shutting down.", s)
				goto exit
			}
		}
	}
exit:
	return nil
I'm not sure any of this restart logic is necessary. In fact, I'm not sure this command should be called wait; it should rather just be assert, and then fail if it cannot assert that the MPS daemon is up and running. Since the intention is to run this as an init container, I actually don't want it to loop; it should instead just keep failing until it asserts that the daemon is up and running.
			if event.Name == mps.ReadyFilePath {
				if config == nil || config.Sharing.SharingStrategy() != spec.SharingStrategyMPS {
					klog.InfoS("Ignoring event", "event", event)
					continue
				}
				switch {
				case event.Op&fsnotify.Remove == fsnotify.Remove:
					klog.Infof("%s removed; restarting", mps.ReadyFilePath)
					goto restart
				case event.Op&(fsnotify.Create|fsnotify.Write) != 0:
					klog.Infof("%s created or modified; checking config", mps.ReadyFilePath)
					readyFile := mps.ReadyFile{}

					matches, err := readyFile.Matches(config)
					if err != nil {
						klog.ErrorS(err, "failed to check .ready file")
						goto restart
					}
					if !matches {
						klog.Info("mismatched MPS sharing config; restarting.")
						goto restart
					}
					continue
				}
			}
I haven't read through this in detail, but at first glance it seems a bit convoluted. Can't we rely on the logic already in-place that handles config updates and hook into that somehow?
It may be possible to just rely on the config change restart logic. Let's make the following assumptions:

- The MPS daemonset owns the lifecycle of the .ready file, and this file contains the current config used to start the MPS control daemon.
- If the config changes and both the MPS daemon and the device plugin are restarted by their respective config managers, we should be able to fail plugin startup until:
  1. A .ready file exists with the expected config.
  2. The MPS pipe is ready from the perspective of the device plugin.

This should not require the additional error / watcher events here.

Note that since the config-manager sidecars are not always deployed, this seems to imply that there are configuration changes that we may miss in this case. For example, how do we capture the restart of the MPS daemon or some other configuration update that doesn't use the config manager? Do we expect the user to manually restart both pods in this case to ensure that we're in sync?
This PR is stale because it has been open 90 days with no activity. This PR will be closed in 30 days unless new comments are made or the stale label is removed.
@elezar I suppose you don't want this change here to stale out?
@frittentheke you're right, I need to revisit this.
This PR is stale because it has been open 90 days with no activity. This PR will be closed in 30 days unless new comments are made or the stale label is removed.
Please keep this alive @elezar
This change adds an init-container to the device plugin to ensure that the MPS daemon with the correct config has been started before the device plugin starts.
This init container logs: