You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
Failed Tests.ShellCommandFixtureStdInput.ShouldBeCancellable(behaviour: Async) [8 ms]
Error Message:
System.InvalidOperationException : Process must exit before requested information can be determined.
Stack Trace:
at System.Diagnostics.Process.EnsureState(State state)
at System.Diagnostics.Process.get_ExitCode()
at Octopus.Shellfish.ShellCommandExecutorHelpers.SafelyGetExitCode(Process process) in /opt/buildagent/work/c4d67eb417aa367f/source/Shellfish/ShellCommandExecutorHelpers.cs:line 21
at Octopus.Shellfish.ShellCommand.ExecuteAsync(CancellationToken cancellationToken) in /opt/buildagent/work/c4d67eb417aa367f/source/Shellfish/ShellCommand.cs:line 243
at Tests.ShellCommandFixtureStdInput.ShouldBeCancellable(SyncBehaviour behaviour) in /opt/buildagent/work/c4d67eb417aa367f/source/Tests/ShellCommandFixture.StdInput.cs:line 228
Analysis
The relevant code is (edited for brevity) doing this:
try{awaitexitedTask.ConfigureAwait(true);}catch(OperationCanceledException)when(cancellationToken.IsCancellationRequested){TryKillProcessAndChildrenRecursively(process);}// Exception thrown from SafelyGetExitCodereturnnewShellCommandResult(SafelyGetExitCode(process));
SIGTERM is a message asking the process to exit itself. It may not exit immediately, or even if it is trying to, it may not be fast enough before we try and get the exit code
Discussion
The old ShellExecutor API does not have this bug. The old code goes like this:
In background, register cancellation
Wait for the process to exit (this cannot be cancelled)
When the cancellation token fires, the background registration kills the process
The wait does not return until after the process has actually exited.
Whereas the new code goes like this:
Wait for the process to exit (with cancellation)
Cancellation token fires, cancelling the wait
Proceed to trying to kill the process, then SafelyGetExitCode; we have not waited for the "kill" to actually kill.
Internally in OctopusDeploy we use a forked copy of the old code, called SilentProcessRunner. It also looks like it will have this bug, as the cancellation will cancel the wait operation before the process has actually exited.
Possible Solutions
We could simply handle the "not yet exited" code path and return -1 from SafelyGetExitCode
We could insert a second wait-for-exit on the cancellation path to ensure the process has actually exited before we proceed to SafelyGetExitCode
We could change the linux codepath to use SIGKILL rather than SIGTERM so it should be dead immediately. On the windows codepath we call Process.Kill (which is the Win32 TerminateProcess API) which is equivalent to SIGKILL
On both windows/linux, after sending a soft shutdown message, install a short additional WaitForExit with a timer. 5 seconds seems to be a common value and seems fair.
If the process does not exit after this time, hard kill it with TerminateProcess / SIGKILL
Proceed to SafelyGetExitCode, and install the fallback code so that it returns -1 rather than throwing exceptions if the process is still running for some reason.
That's quite a lot of work though, in practice I'm thinking probably just handling the "not yet exited" code path and return -1 from SafelyGetExitCode
The text was updated successfully, but these errors were encountered:
If I'm understanding the proposed fix correctly, on cancellation the reported exit code will vary between success (0) if the process managed to exit on time, or -1 if it didn't. If anything, I think the result should consistently be -1 in this situation.
Encountered in a failed TeamCity build
Analysis
The relevant code is (edited for brevity) doing this:
What is most likely happening.
TryKillProcessAndChildrenRecursively
is implemented by sending a SIGTERM to the process.Discussion
The old ShellExecutor API does not have this bug.
The old code goes like this:
Whereas the new code goes like this:
Internally in OctopusDeploy we use a forked copy of the old code, called SilentProcessRunner. It also looks like it will have this bug, as the cancellation will cancel the wait operation before the process has actually exited.
Possible Solutions
We could simply handle the "not yet exited" code path and return -1 from SafelyGetExitCode
We could insert a second wait-for-exit on the cancellation path to ensure the process has actually exited before we proceed to
SafelyGetExitCode
We could change the linux codepath to use
SIGKILL
rather thanSIGTERM
so it should be dead immediately. On the windows codepath we callProcess.Kill
(which is the Win32TerminateProcess
API) which is equivalent to SIGKILLOr some combination of the above
My ideal solution:
Change windows to send a "soft shutdown" message to the process to match the linux
SIGTERM
.We could post a WM_QUIT or GenerateConsoleCtrlEvent or as this person on StackOverflow found, writing the Ctrl+C escape character to the child process stdinput stream
On both windows/linux, after sending a soft shutdown message, install a short additional WaitForExit with a timer. 5 seconds seems to be a common value and seems fair.
If the process does not exit after this time, hard kill it with TerminateProcess / SIGKILL
Proceed to SafelyGetExitCode, and install the fallback code so that it returns -1 rather than throwing exceptions if the process is still running for some reason.
That's quite a lot of work though, in practice I'm thinking probably just handling the "not yet exited" code path and return -1 from SafelyGetExitCode
The text was updated successfully, but these errors were encountered: