Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

InvalidOperationException : Process must exit before requested information can be determined upon cancellation #133

Open
borland opened this issue Nov 28, 2024 · 1 comment

Comments

@borland
Copy link
Contributor

borland commented Nov 28, 2024

Encountered in a failed TeamCity build

Failed Tests.ShellCommandFixtureStdInput.ShouldBeCancellable(behaviour: Async) [8 ms]
Error Message:
 System.InvalidOperationException : Process must exit before requested information can be determined.
Stack Trace:
   at System.Diagnostics.Process.EnsureState(State state)
 at System.Diagnostics.Process.get_ExitCode()
 at Octopus.Shellfish.ShellCommandExecutorHelpers.SafelyGetExitCode(Process process) in /opt/buildagent/work/c4d67eb417aa367f/source/Shellfish/ShellCommandExecutorHelpers.cs:line 21
 at Octopus.Shellfish.ShellCommand.ExecuteAsync(CancellationToken cancellationToken) in /opt/buildagent/work/c4d67eb417aa367f/source/Shellfish/ShellCommand.cs:line 243
 at Tests.ShellCommandFixtureStdInput.ShouldBeCancellable(SyncBehaviour behaviour) in /opt/buildagent/work/c4d67eb417aa367f/source/Tests/ShellCommandFixture.StdInput.cs:line 228

Analysis

The relevant code is (edited for brevity) doing this:

try {
    await exitedTask.ConfigureAwait(true);
}
catch (OperationCanceledException) when (cancellationToken.IsCancellationRequested) {
    TryKillProcessAndChildrenRecursively(process);
}
// Exception thrown from SafelyGetExitCode
return new ShellCommandResult(SafelyGetExitCode(process));

What is most likely happening.

  • On linux, TryKillProcessAndChildrenRecursively is implemented by sending a SIGTERM to the process.
  • SIGTERM is a message asking the process to exit itself. It may not exit immediately, or even if it is trying to, it may not be fast enough before we try and get the exit code

Discussion

The old ShellExecutor API does not have this bug.
The old code goes like this:

  • In background, register cancellation
  • Wait for the process to exit (this cannot be cancelled)
  • When the cancellation token fires, the background registration kills the process
  • The wait does not return until after the process has actually exited.

Whereas the new code goes like this:

  • Wait for the process to exit (with cancellation)
  • Cancellation token fires, cancelling the wait
  • Proceed to trying to kill the process, then SafelyGetExitCode; we have not waited for the "kill" to actually kill.

Internally in OctopusDeploy we use a forked copy of the old code, called SilentProcessRunner. It also looks like it will have this bug, as the cancellation will cancel the wait operation before the process has actually exited.

Possible Solutions

  • We could simply handle the "not yet exited" code path and return -1 from SafelyGetExitCode

  • We could insert a second wait-for-exit on the cancellation path to ensure the process has actually exited before we proceed to SafelyGetExitCode

  • We could change the linux codepath to use SIGKILL rather than SIGTERM so it should be dead immediately. On the windows codepath we call Process.Kill (which is the Win32 TerminateProcess API) which is equivalent to SIGKILL

  • Or some combination of the above

My ideal solution:

  • Change windows to send a "soft shutdown" message to the process to match the linux SIGTERM.
    We could post a WM_QUIT or GenerateConsoleCtrlEvent or as this person on StackOverflow found, writing the Ctrl+C escape character to the child process stdinput stream

  • On both windows/linux, after sending a soft shutdown message, install a short additional WaitForExit with a timer. 5 seconds seems to be a common value and seems fair.

  • If the process does not exit after this time, hard kill it with TerminateProcess / SIGKILL

  • Proceed to SafelyGetExitCode, and install the fallback code so that it returns -1 rather than throwing exceptions if the process is still running for some reason.

That's quite a lot of work though, in practice I'm thinking probably just handling the "not yet exited" code path and return -1 from SafelyGetExitCode

@adam-mccoy
Copy link
Contributor

If I'm understanding the proposed fix correctly, on cancellation the reported exit code will vary between success (0) if the process managed to exit on time, or -1 if it didn't. If anything, I think the result should consistently be -1 in this situation.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants