-
Notifications
You must be signed in to change notification settings - Fork 374
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
__errno_location crash after call to __cxa_guard_acquire #2268
Comments
You can compile a module with |
I guess that could at least avoid the calls during the boot process, we just have to be careful about potential races. @fwsGonzo do you know why TLS may not be initialised properly? Was it removed at some point? |
Sadly, I don't remember - just keep in mind that global construction order is one of the big fiascos of C++: https://en.cppreference.com/w/cpp/language/siof Maybe the RNG can be initialized lazily. BSS will at least be zero, and since we have to re-seed it occasionally, just making it have to reseed on first use may be a way forward? |
I like that idea of lazy RNG - we might even want to make it a plugin or driver. I think in our case the global construction order is not the main issue, but rather that we can't depend on any constructors being called until libc is fully initialized - and we now know that cxa_guard is one of the reasons. We do decide when to initialize, and can also initialize different translation units in the order we prefer, via the linker script trick of putting them in different sections. But yea, I think we need to extend the rules:
There may be more rules - and I don't think we can enforce them automatically - but it should be a small subset of the kernel features that needs to be ready before we can initialize libc, so we should make sure to clearly identify that. One way of doing it would be to only allow pure C, but I think we can use a pretty decent subset of C++, including PMR allocators. |
After thinking a bit more about this I think it would make sense to compile the kernel with I added this in #2273 (see PR and commits for more details). We could change this if there's a better approach for delaying initialisation later. |
No need to compile the entire OS with no-threadsafe-statics. You can use this in CMake:
And
I think a lot of footguns have been removed when syscalls panic during boot, so being selective here is fine. |
My reasoning for doing it for the whole kernel in the PR was that if we are disabling it manually for classes that do this anyway, then it's better to not have it on. But I don't have a very strong preference. If we can turn it on for the service, do you think it would be beneficial to have it selectively in the kernel too? It wouldn't work for anything that is started before libc (and after that we don't have full futex). |
This should be fixed now that #2273 has been merged as we no longer use thread safe statics and disallow syscalls until musl has been initialised. |
When enabling SMP the initial call to
RNG::get().init()
inapic_revenant
crashes.RNG::get().init()
returns a static variable that is initialised by the first call toget()
(here). After stepping through the code I see that the crash is caused by a surrounding call to__cxa_guard_acquire
that protects the initialisation of the static variable, which then calls a syscall that returnsENOSYS
. This value is written to__errno_location
by musl, which then triggers a CPU exception.There seems to be multiple issues here:
__cxa_guard_acquire
is added by the compiler to protect the initialisation of static variables at runtime (such as in RNG). The implementation is here. Depending on whether we have threads enabled in LIBCXX it will call various locking mechanisms in libc/musl, which have to be initialised first. If this happens early in the boot process, the kernel/musl may be unable to handle them. I'm not sure what the best approach to fix this would be, but ideally we shouldn't use code that relied on this feature until we get far enough in the boot process.__errno_location
isn't set up properly, at least not on other cores than core 0, and syscalls will write to the memory location it points to. In combination with the__cxa_guard_acquire
issue above, this may happen unexpectedly. After a call to syscall, musl's syscall_ret writeserrno
to the location stored in__errno_location
function from the pthread struct. The struct is retrieved from%fs:0
here. It looks like TLS may not be properly set up, there's a function in IncludeOS to do it here -- but it doesn't look like it's called fromapic_revenant
.I'm not sure why this has worked previously, but perhaps the new toolchain/build system happens to make
errno_val
point outside valid memory, so we now get a crash instead of potential instability? Could this have been causing issues such as #2252?To reproduce, first enable SMP, then run
nix-shell --argstr unikernel test/kernel/integration/smp --run ./test.py
(this is on v0.16.0-release branch)
The text was updated successfully, but these errors were encountered: