Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Process started by "bb run" has different signal handling disposition than "bazel run" #8326

Open
akirchhoff-modular opened this issue Feb 5, 2025 · 8 comments
Assignees

Comments

@akirchhoff-modular
Copy link

Describe the bug

When bazel run is used to run a process, it can be stopped with SIGINT, SIGTERM, or SIGQUIT. (Example: ctrl-C will stop the process.) When the BuildBuddy CLI is used instead (bb run), these signals are blocked, and the process cannot be stopped in this way.

To Reproduce
Steps to reproduce the behavior:

  1. In an empty directory, create a minimal workspace with these commands:

    touch MODULE.bazel
    echo 'sh_binary(name = "sleeper", srcs = ["sleeper.sh"])' > BUILD.bazel
    echo 'echo Sleeping; sleep 10' > sleeper.sh
    chmod +x sleeper.sh
  2. Run it with Bazel: bazel run :sleeper. When you see "Sleeping", press ctrl-C to interrupt it. (Works)

  3. Run it with Bazelisk: USE_BAZEL_VERSION=latest bazelisk run :sleeper. When you see "Sleeping", press ctrl-C to interrupt it. (Works)

  4. Run it with Bazel through BuildBuddy CLI: bb run :sleeper. When you see "Sleeping", press ctrl-C to interrupt it. (Does not work)

  5. Run it through the BuildBuddy CLI, through Bazelisk: USE_BAZEL_VERSION=buildbuddy-io/latest BB_USE_BAZEL_VERSION=latest bazelisk run :sleeper. When you see "Sleeping", press ctrl-C to interrupt it. (Does not work)

Expected behavior

When I press ctrl-C, I expect the process to terminate. Under Bazel and Bazel-under-Bazelisk, it does. But when using the BuildBuddy CLI (whether or not through Bazelisk), it does not, and remains running. I also expect SIGTERM and SIGQUIT to terminate the process.

Screenshots
If applicable, add screenshots to help explain your problem.

% USE_BAZEL_VERSION=latest bazelisk run :sleeper
INFO: Analyzed target //:sleeper (1 packages loaded, 7 targets configured).
INFO: Found 1 target...
Target //:sleeper up-to-date:
  bazel-bin/sleeper
INFO: Elapsed time: 0.139s, Critical Path: 0.00s
INFO: 2 processes: 4 action cache hit, 2 internal.
INFO: Build completed successfully, 2 total actions
INFO: Running command line: bazel-bin/sleeper
Sleeping
^C%
% USE_BAZEL_VERSION=buildbuddy-io/latest BB_USE_BAZEL_VERSION=latest bazelisk run :sleeper
INFO: Analyzed target //:sleeper (1 packages loaded, 7 targets configured).
INFO: Found 1 target...
Target //:sleeper up-to-date:
  bazel-bin/sleeper
INFO: Elapsed time: 0.127s, Critical Path: 0.00s
INFO: 2 processes: 4 action cache hit, 2 internal.
INFO: Build completed successfully, 2 total actions
Sleeping
^C^C^C^C^C^C^C^C^C^C^C^C^C

Additional context

This is at least in part a side effect of BuildBuddy's --script_path fiddling/interposition. When I comment out these lines and rebuild the BuildBuddy CLI:

buildbuddy/cli/cmd/bb/bb.go

Lines 187 to 192 in e6c1c33

// If this is a `bazel run` command, add a --run_script arg so that
// we can execute post-bazel plugins between the build and the run step.
bazelArgs, scriptPath, err = bazelisk.ConfigureRunScript(bazelArgs)
if err != nil {
return 1, err
}

Then signals work correctly again, matching the normal Bazel behavior.

@siggisim
Copy link
Member

siggisim commented Feb 5, 2025

cc @maggie-lou @bduffany

@maggie-lou
Copy link
Contributor

👍 I'll take a look tomorrow

@bduffany
Copy link
Member

bduffany commented Feb 11, 2025

It looks like this issue might only happen when running sh_binary targets. When running with bb the process tree looks like this:

        - 154116 bash (pgid: 154116, sid: 154116, tty: 34833) bash 
          - 580762 sh (pgid: 580762, sid: 154116, tty: 34833) /bin/sh /home/b/.cache/bazel/_bazel_b/edd33241c517dda06a232361ced2493d/execroot/_main/bazel-out/k8-fastbuild/bin/sleeper 
            - 581175 sleep (pgid: 580762, sid: 154116, tty: 34833) sleep 10 

But when running with bazel the process tree looks like this:

        - 154116 bash (pgid: 154116, sid: 154116, tty: 34833) bash 
          - 581554 bazel (pgid: 581554, sid: 154116, tty: 34833) bazel run :sleeper 
            - 581559 bash (pgid: 581554, sid: 154116, tty: 34833) /bin/bash -c /home/b/.cache/bazel/_bazel_b/edd33241c517dda06a232361ced2493d/execroot/_main/bazel-out/k8-fastbuild/bin/sleeper 
              - 581893 sleep (pgid: 581554, sid: 154116, tty: 34833) sleep 10 

When running under bazel, the sh_binary runs under bash, but for some reason, when running under bb, it runs under sh.

In the past, I've run into issues with signal handling in sh - it doesn't handle signals as nicely as bash does.

I wonder if there's some env variable that determines which shell gets used, which we're dropping, or something like that. Or maybe this is a bug with Bazel's --script_path flag.

@bduffany
Copy link
Member

eh, nevermind - probably not related to sh vs bash. When using --script_path with vanilla bazel, it runs under sh instead of bash like it would under bazel run, but Ctrl+C still works.

@akirchhoff-modular
Copy link
Author

I didn't include this in the issue description because I didn't want to bias towards a particular solution, but I did narrow down the cause of this behavior.

First, note that upon exec, the signal disposition of the exec'ed process will depend on that of the process before the exec: (Excerpt from POSIX.1-2024 standard)

Signals set to the default action (SIG_DFL) in the calling process image shall be set to the default action in the new process image. Except for SIGCHLD, signals set to be ignored (SIG_IGN) by the calling process image shall be set to be ignored by the new process image. Signals set to be caught by the calling process image shall be set to the default action in the new process image (see <signal.h>).

So: If the process before exec has set the signal to be ignored, it will be ignored in the process after exec as well. If the signal is set to either default, or has a non-default handler, then it will be set to default in the process after exec.

Case 1: Bazel is directly invoked. In this case, Bazel installs some signal handlers:

https://github.com/bazelbuild/bazel/blob/master/src/main/cpp/blaze_util_posix.cc#L209-L212

These are used to forward cancelations to the server process. However, how they're used doesn't actually matter. What matters is that they are set at all. This means that by the time the client process receives the ExecRequest and execs here:

https://github.com/bazelbuild/bazel/blob/5b4973534f6adf439ca6d61b7627e0fdc8d9dfe8/src/main/cpp/blaze_util_posix.cc#L328

The signals will be SIG_DFL in the exec'ed process, because the process before exec had a handler set.

Case 2: Bazelisk is invoked, and Bazelisk runs Bazel. In this case, Bazelisk runs this signal.Ignore line:

https://github.com/bazelbuild/bazelisk/blob/1f9a1aca958cdb50b4adb84b15cdda55a600ed31/core/core.go#L702

This ignores those signals in the Bazelisk process. However, it's called after the Bazel process was spawned, so the Bazel process still inherits reasonable (non-ignored) signal handlers. Even if the signal.Ignore ran first, Bazel installed its own signal handler anyway, so it works out fine. It's not a problem that the parent process has signals ignored -- when a user interrupts, the signal is sent to all processes in the process group, so even if a parent process ignores the signal, the child process will still receive it too.

Case 3: BuildBuddy CLI is run. In this case, the BuildBuddy CLI first runs a build, and then it runs the resulting --script_path. Digging into each of those steps:

  1. The build step calls directly into Bazelisk logic, when dispatching out to Bazel to do the build. This ends up running signal.Ignore on those three signals, and this applies globally to the entire process. From this point onwards, bb will ignore signals.
  2. Later, bb execs into the --script_path script. Because the signals are still set to SIG_IGN, the resulting exec'ed process will also have SIG_IGN for those signals.

Case 4: Bazelisk is invoked and runs BuildBuddy CLI. Basically the same as case 3, except with an extra parent process in the way ignoring signals (but again, that's not a problem because everything in the process group will receive the signal -- it's up to the child to figure out what to do with it though).


Basically the problem is that during the build step, when the BuildBuddy CLI called into Bazelisk as a library, Bazelisk contaminated the BuildBuddy CLI's signal disposition, so that when the BuildBuddy CLI goes then to exec the --script_path script, all the signals have already been set to SIG_IGN and that setting gets inherited through the exec.

There's many different ways this could be handled, but some options:

  1. Bazelisk could be patched to not run signal.Ignore -- a safer option could be to set a signal.Notify while running the cmd.Wait() and then running signal.Stop afterwards, which would restore the signal disposition to what it was before.
  2. Rather than using Bazelisk as a library, the BuildBuddy CLI could run it in a subprocess. Then the scope of the signal.Ignore is limited.
  3. Before exec'ing the --script_path, BuildBuddy could set a dummy signal.Notify on all of SIGINT, SIGTERM, and SIGQUIT. The channel passed to signal.Notify doesn't otherwise even have to be used at all -- the mere fact that signal.Notify now set a handler on these signals now means that the exec'ed process will receive SIG_DFL signal dispositions.

Option 3 has the downside that any steps that run between the bazel run --script_path and the exec, will still be uninterruptible. In particular this means that post-Bazel plugins won't be able to be interrupted. But at least interrupts for the executable actually getting run would go through.

@fmeum
Copy link
Contributor

fmeum commented Feb 12, 2025

Thanks for the great analysis!

I sent bazelbuild/bazelisk#657 to realize Option 1. We can patch it in even before the upstream PR has been merged.

@akirchhoff-modular
Copy link
Author

Thanks for the great analysis!

I sent bazelbuild/bazelisk#657 to realize Option 1. We can patch it in even before the upstream PR has been merged.

From my experiments, signal.Ignore is not actually reset by signal.Reset. signal.Ignore is somewhat permanent and doesn't have an easy "undo". Example:

package main

import (
	"fmt"
	"os/signal"
	"syscall"
)

func main() {
	fmt.Println(signal.Ignored(syscall.SIGINT))
	signal.Ignore(syscall.SIGINT)
	fmt.Println(signal.Ignored(syscall.SIGINT))
	signal.Reset(syscall.SIGINT)
	fmt.Println(signal.Ignored(syscall.SIGINT))
}

Running it:

% go run example.go 
false
true
true

signal.Ignored remains true after the signal.Reset. This is why my option 1 suggestion was to avoid using Ignore at all, not to try to restore after Ignore (which from my experiments, is hard-to-impossible, due to limitations in the Go standard library).

@bduffany
Copy link
Member

Thank you for sharing your knowledge about signals! Confirmed that the signal.Reset patch doesn't seem to fix the issue, but using signal.Notify/signal.Stop does. @fmeum here's bazelbuild/bazelisk#658 with the patch

modularbot pushed a commit to modular/max that referenced this issue Feb 13, 2025
Due to a BuildBuddy CLI bug
(buildbuddy-io/buildbuddy#8326), when using
`./bazelw run`, we get invoked with the `SIGINT` handler being
`SIG_IGN`. This is annoying because it becomes much harder to terminate
the process when `SIGINT` is disabled. While less-commonly used,
BuildBuddy also disables `SIGTERM` and `SIGQUIT`. Detect this situation
and return the state to what it would have been if we had been invoked
without the BuildBuddy CLI.

Part of AITLIB-132.

MODULAR_ORIG_COMMIT_REV_ID: bf1a3105755564fa03eaad2cf7c6e2636418e7a0
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

6 participants