Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

__sanitizer_start_switch_fiber() unmaps per-thread "fake" stack #1760

Open
stsp opened this issue May 30, 2024 · 28 comments
Open

__sanitizer_start_switch_fiber() unmaps per-thread "fake" stack #1760

stsp opened this issue May 30, 2024 · 28 comments

Comments

@stsp
Copy link

stsp commented May 30, 2024

It seems like these days detect_stack_use_after_return
breaks fiber switching.
When detect_stack_use_after_return is
enabled, asan malloc's the "fake" stack and
puts the locals there together with redzones.
That process is (partially) documented here:
https://github.com/google/sanitizers/wiki/AddressSanitizerUseAfterReturn
That very same stack ptr is put into the first
argument of __sanitizer_start_switch_fiber().
If detect_stack_use_after_return is disabled,
then the "fake stack" machinery is not used,
so __sanitizer_start_switch_fiber() always puts
NULL into its first arg.

Now the problem is, the fake-stack is per-thread,
not per-fiber. When some fiber exits, we put
NULL into the first arg of __sanitizer_start_switch_fiber(),
and that unmaps the entire per-thread fake-stack:
https://gnu.googlesource.com/gcc/+/refs/heads/trunk/libsanitizer/asan/asan_thread.cpp#166
Which, as noted above, contains current locals
and redzones. So all crashes.

Probably __sanitizer_start_switch_fiber() should
allocate and free its own fake stacks, and not touch
the per-thread one?

@stsp
Copy link
Author

stsp commented May 30, 2024

Things seem to work right as long as
__sanitizer_start_switch_fiber() never
gets NULL as the first argument.
So maybe it should just be re-documented?

@ioquatix
Copy link

If you think it's mainly a documentation issue, why not submit a PR for the documentation?

@stsp
Copy link
Author

stsp commented May 31, 2024

That will break all existing users,
and also will make the first arg
completely redundant, as the stack
then never changes within tid.

Also I wonder if it is actually correct.
Coroutine switch changes the current
stack position (AFAICS %r13 is used
to address fake stack), but I am not
sure if the fake stack allocator is
prepared for such switches. So thinking
about this a bit more, I really suspect
the documentation change is not the
way forward. If such documentation
change is done, then also disabling
detect_stack_use_after_return for
fibers is needed.
But it might be better to allocate
per-fiber fake-stacks.

@ioquatix
Copy link

I wrote a note here, maybe it will help you: https://github.com/ruby/ruby/blob/a15e4d405ba6cafbe2f63921bd771b1241049841/cont.c#L830-L839

@stsp
Copy link
Author

stsp commented May 31, 2024

Yes, that helped me getting rid of
the WARNING: ASan is ignoring requested __asan_handle_no_return,
just as you wrote in your note.

Unfortunately this has nothing to
do with the reported problem: unless
detect_stack_use_after_return is disabled,
you can't pass NULL as the first argument
of __sanitizer_start_switch_fiber().
And your suggestion is about not passing
NULLs to second and third args, not first.
Many thanks for the help though, that
problem was also real, plus I've found
bugs in my own code while integrating
your solution.

@stsp
Copy link
Author

stsp commented May 31, 2024

of __sanitizer_start_switch_fiber().
And your suggestion is about not passing
NULLs to second and third args

... of __sanitizer_finish_switch_fiber(),
not __sanitizer_start_switch_fiber().

@stsp
Copy link
Author

stsp commented May 31, 2024

In fact, you seem to have the
potential NULL call here:
https://github.com/ruby/ruby/blob/a15e4d405ba6cafbe2f63921bd771b1241049841/cont.c#L1577
Could you please check 2 things:

  • this is actually called with NULL
  • when it isn't called with NULL, it sets
    old_fiber->context.fake_stack to non-NULL?

@ioquatix
Copy link

I can take a look.

@ioquatix
Copy link

ioquatix commented Jun 1, 2024

I'll try to confirm this with real data from CRuby, but my understanding is as follows:

this is actually called with NULL

Yes, at the end of the life of the fiber, we call it with NULL to indicate that the fake stack can be freed.

when it isn't called with NULL, it sets old_fiber->context.fake_stack to non-NULL?

Yes, that seems correct, it should save the current fake stack, but I'll confirm it by logging the output.

@KJTsanaktsidis
Copy link

OK so I think the way this works is...

  • __sanitizer_start_switch_fiber unmaps the fake stack in AsanThread::fake_stack_ if NULL is passed as the first parameter
  • __sanitizer_finish_switch_fiber changes the fake stack associated with AsanThread::fake_stack_.

So the fake stack that is unmapped in __sanitizer_start_switch_fiber is replaced in __sanitizer_finish_switch_finish.

So what you're supposed to do when switching fibers:

  • IF the fiber being switched from is dead, then call: __sanitizer_start_switch_fiber(NULL, to_fiber->stack_base, new_fiber->stack_size)
  • Otherwise, if the fiber being switched from will be returned to eventually, then call: __sanitizer_start_switch_fiber(&from_fiber->fake_stack_ptr, to_fiber->stack_base, to_fiber->stack_size). That will save the current fake stack in &from_fiber->fiber_stack_ptr. It will also set the stack bounds for the switched-to fiber to stack_base/stack_size.
  • Then, after you switch out %rsp/%rip,
  • the new fiber should call __sanitizer__finish_switch_fiber(to_fiber->fake_stack_ptr, &from_fiber->stack_base, &from_fiber->stack_size). This will switch AsanFiber::fake_stack_ to the to-fiber's stack pointer, and retrieves the current stack base/size in &from_fiber.
    • Although this only has to happen at one "time" (when a fiber is switched to), this is two locations in code. Once in the new fiber creation path (will be called the first time a new fiber is switched to), and once at the end of the %rsp/%rip restore code (every subsequent time a fiber is switche dto).
    • Note that we don't need to save &from_fiber->stack_base more than once for a fiber (it won't change), so it's OK to pass NULL, NULL in the second place.

I'm far from an expert on this but this is what the Ruby code is doing and I think it's correct.

@stsp
Copy link
Author

stsp commented Jun 1, 2024

_sanitizer_finish_switch_fiber changes the fake stack associated with AsanThread::fake_stack.

Changes to what?
The problem is that fake_stack is per-thread.
So once unmapped, it gets changed to the
same unmapped stack.
https://gnu.googlesource.com/gcc/+/refs/heads/trunk/libsanitizer/asan/asan_thread.cpp#166
This is the libasan code, its very easy to
follow. You can see where the per-thread
stack is destroyed, and that its not
re-allocated by the fiber-switching routines.

I'm far from an expert on this but this is what the Ruby code is doing and I think it's correct.

Its correct by matching the libasan
documentation. But libasan itself doesn't
match its own documentation.

Yes, that seems correct, it should save the current fake stack, but I'll confirm it by logging the output.

Thanks!
If you can confirm it is actually non-NULL
(which would already be strange), please
also confirm that every coroutine sets it
to the different value. This will mean every
coroutine has its own fake-stack, just as
documented. Which will mean some other
version of libasan, not the one I am pointing
to the sources of.

@KJTsanaktsidis
Copy link

KJTsanaktsidis commented Jun 1, 2024

Ahh, i think i see where the confusion is - the allocation of fake stacks in ASAN is lazy.

You can see where the per-thread stack is destroyed, and that its not re-allocated by the fiber-switching routines

Not quite - the current_fake_stack being destroyed there is not a threadlocal variable or anything like that - it's the member variable AsanThread::_fake_stack. I mean it was threadlocal but the destruction is conditional on it being NULL, as you say.

When we start a new fiber in Ruby, we actually call __sanitizer_finish_switch_fiber(NULL, stack_base, stack_len) (https://github.com/ruby/ruby/blob/fc495951b128a4256ba28be560584d2a58530f21/cont.c#L839). to->fake_stack is actually always set to NULL on fiber create here: https://github.com/ruby/ruby/blob/fc495951b128a4256ba28be560584d2a58530f21/coroutine/arm64/Context.h#L76

__sanitizer_finish_start_fiber duly sets the current threadlocal fake stack to NULL - https://gnu.googlesource.com/gcc/+/refs/heads/trunk/libsanitizer/asan/asan_thread.cpp#163 - and that isn't undone by __sanitizer_finish_switch_fiber on coroutine start because we pass NULL and so don't hit this block: https://gnu.googlesource.com/gcc/+/refs/heads/trunk/libsanitizer/asan/asan_thread.cpp#176

So immediately after fiber creation, the thread doesn't yet have a fake stack.

TL;DR of this: You're right - we do pass NULL to the first invocation of __sanitizer_finish_switch_fiber in every fiber.

HOWEVER, as part of dispatching the first ruby code onto the new fiber, we call asan_get_thread_fake_stack_handle(), via this stacktrace:

(gdb) bt
#0  0x00005555556d2340 in __asan::SetTLSFakeStack(__asan::FakeStack*) ()
#1  0x000055555576bfcc in __asan::AsanThread::AsyncSignalSafeLazyInitFakeStack() ()
#2  0x0000555555955b22 in asan_get_thread_fake_stack_handle () at ../internal/sanitizers.h:241
#3  ec_switch (th=0x7fffc7600000, fiber=0x7fffcca1ff3c, fiber@entry=0x516000199b80) at ../cont.c:800
#4  fiber_restore_thread (th=0x7fffc7600000, fiber=0x7fffcca1ff3c, fiber@entry=0x516000199b80) at ../cont.c:820
#5  0x0000555555955ccb in fiber_entry (from=<optimized out>, to=<optimized out>) at ../cont.c:848
#6  0x0000000000000000 in ?? ()

AsyncSignalSafeLazyInitFakeStack does what it says - if the thread's fake stack is NULL, it initializes a new one. So that is how the fake stack for a new fiber is created.

@stsp
Copy link
Author

stsp commented Jun 2, 2024

__sanitizer_finish_start_fiber duly sets the current threadlocal fake stack to NULL

You have a typo here, meant to say
__sanitizer_start_switch_fiber.

TL;DR of this: You're right - we do pass NULL to the first invocation of __sanitizer_finish_switch_fiber in every fiber.

This wasn't my point, as I was talking about
passing NULL to __sanitizer_start_switch_fiber,
not to __sanitizer_finish_switch_fiber.
While I note your clever trick about calling
__asan_get_current_fake_stack() on a fiber
start, this doesn't solve the reported problem.
Please note 2 things:

  • fake-stack is still per-thread. Once you unmap
    and re-create in, you unmap and re-create the
    per-thread stack. It doesn't become per-fiber.
  • When detect_stack_use_after_return is enabled,
    you actually work on a per-thread fake-stack!
    It doesn't contain the return addresses, but it
    does contain your locals.

This means that once __sanitizer_start_switch_fiber(NULL, ...)
returns, every access to locals will crash. The only
thing you can do, is to immediately switch to
another coroutine, without touching any locals.
This is possible because the return addresses
are located on a normal stack, not fake-stack,
so you can still do the calls.
But this works only when you switch to a new
fiber, that quickly re-creates the fake stack with
your __asan_get_current_fake_stack(). This
scenario actually works, I checked it. What still
doesn't work is the switch to non-new coroutine.
Non-new coroutine doesn't have the chance to
apply your trick and re-create the fake-stack.
And even if it could, that still won't help because
it already has the locals on an unmapped
fake-stack.

So to summarize.
Your trick allowed me to get a bit further, if I:

  • don't touch any locals after __sanitizer_start_switch_fiber(NULL, ...)
    and quickly jump to another fiber instead.
  • that "another fiber" is a new fiber that doesn't
    yet have any locals (because its new, it doesn't
    have anything), so it can re-create the fake stack.

Obviously this is not an acceptable solution.

@KJTsanaktsidis
Copy link

fake-stack is still per-thread. Once you unmap
and re-create in, you unmap and re-create the
per-thread stack. It doesn't become per-fiber.

this is not correct. A new fiber starts with a NULL fake stack, and gains a fake stack lazily. When you switch away from the fiber, a pointer to its fake stack is saved in the first parameter to __sanitize_start_stack_switch.

Perhaps this is best summarised as: __sanitiser_start_stack_switch will either:

  • transfer ownership of the current fibers fake stack to the caller, with the first parameter being an out param to receive it, or
  • Deallovate the current fibers fake stack it NULL is passed in
  • but in either case, it will set the thread local fake stack pointer to NULL

When resuming a fiber with __sanitizer_finish_stack_switch

  • for the first time, pass NULL as the first argument; a fake stack will be lazily created
  • Subsequent times, using the pointer saved from the out-param of __sanitizer_start_stack_switch

@stsp
Copy link
Author

stsp commented Jun 2, 2024

OK, I applied your suggestion to my repo:
dosemu2/dosemu2@e652437
just to make sure we are on the same ground.
Now up to the discussion.

A new fiber starts with a NULL fake stack, and gains a fake stack lazily.

Not quite, as you set NULL fake stack by
calling __sanitizer_finish_stack_switch(NULL, ...);
explicitly.
So the beginning of a fiber still works on
an old fake-stack, but that is not a problem,
as with your technique it is never unmapped.
I have realized that now, after fixing a problem
in my code. :)
Still, its not a reported problem, but you
helped me fixing another one, thank you.

for the first time, pass NULL as the first argument; a fake stack will be lazily created

No, it is created only because you
explicitly call __asan_get_current_fake_stack(),
as you showed in your stack-trace.
This is an undocumented trick.
I added it to my repo so that we can continue
this quest, but please note that undocumented
tricks are normally not counted as a solutions.

Why this trick doesn't save the world, is because
the stack is re-created too late. After the call
to __sanitiser_start_stack_switch(NULL, ...);
you are already out of your locals. You must
be very careful to not touch any locals and
quickly jump to another fiber.
If you do so - everything works.

So 2 problems remain:

  • we have a completely undocumented "bug fixer"
    which is the need to explicitly call
    __asan_get_current_fake_stack(); after every
    stack switch.
  • even this bug-fixer doesn't allow us touching
    any locals between the call to __sanitizer_start_switch_fiber(NULL, ...);
    and swapcontext().

So let me say, your trick is very clever and it
can even solve the problem fully, but I am not
fond of either of 2 limitations above.

@stsp
Copy link
Author

stsp commented Jun 2, 2024

Note that I am not applying the
"don't touch any locals between
__sanitizer_start_switch_fiber(NULL, ...);
and swapcontext()" trick to my repo,
because I consider that a bit too much.
So your trick being in my repo, currently
doesn't help at all. But it allowed me to
make sure things are actually working the
way you describe, and locally I can produce
a fully working setup now, which wasn't
possible w/o your trick.

@KJTsanaktsidis
Copy link

your trick is very clever and it can even solve the problem fully

hm, you’re giving me too much credit here, I added this call for entirely unrelated reasons (the GC needs the info). Is a fake stack not lazily generated when calling into an instrumented function in a fiber for the first time too?

doesn't allow us touching any locals between the call to __sanitizer_start_switch_fiber(NULL, ...); and swapcontext().

this is a guess accidentally not a problem for us in Ruby because the bulk of our switching machinery (coroutine_transfer) is implemented in assembly. So we don’t actually have really any C code between the start and finish switches. Perhaps you should use an __attribute of some kind on your stack switching code to disable Asan instrumentation on that function? (We should probably do that in CRuby too for the function that calls __sanitizer_start_stack_switch too)

@stsp
Copy link
Author

stsp commented Jun 2, 2024

Is a fake stack not lazily generated when calling into an instrumented function in a fiber for the first time too?

OMG...
Well, yes it does. :)
Once you properly set the stack to NULL,
it gets re-allocated automatically. I made
this fix:
dosemu2/dosemu2@03b9298
while playing with your __asan_get_current_fake_stack();
trick, but I didn't realize that this fix alone
is enough, so I mistakenly applied also your
trick. :( Force-pushed it away now, its not needed.
So only the problem of touching locals stays...

So we don’t actually have really any C code between the start and finish switches.

No, this code:
https://github.com/ruby/ruby/blob/a15e4d405ba6cafbe2f63921bd771b1241049841/cont.c#L1576-L1581
calls to sanitizer and then to swapcontext().
Its obviously that you use no locals in between.
But add some, and you'll finally reproduce my
bug report. :)

@stsp
Copy link
Author

stsp commented Jun 2, 2024

To understand why is this important,
please see this code of mine:
https://github.com/dosemu2/dosemu2/blob/03b92988d9a5a798c6e98e7e5e7b0ddbb673b6fa/src/base/lib/libpcl/pcl.c#L72
This exact line crashes.
I later use the fake_stack_save var here:
https://github.com/dosemu2/dosemu2/blob/03b92988d9a5a798c6e98e7e5e7b0ddbb673b6fa/src/base/lib/libpcl/pcl.c#L82
While I can get the fake-stack-ptr from
the context structure, which would avoid
the problem, I follow the asan docs precisely:
https://chromium.googlesource.com/chromiumos/third_party/compiler-rt/+/google/stable/include/sanitizer/common_interface_defs.h#184

In most cases, this void* can be stored on the stack just before
switching.

So asan doc itself suggests storing that
into a local, rather than via other means.
And that suggestion breaks.

@stsp
Copy link
Author

stsp commented Jun 2, 2024

OK, actually I do realize that even more
precise reading would reveal that in
case of a NULL, nothing is actually
stored in a local... I just use the NULL
which I stored there manually. So in
that case no one suggests me to use it.
So I can't even blame the asan's suggestion
here...

@stsp
Copy link
Author

stsp commented Jun 2, 2024

Do you think this should be re-classified
to a documentation problem, so that the
doc just say to not touch any locals after
__sanitizer_start_switch_fiber(NULL, ...);?
IMHO its quite a bad thing to document though.

@KJTsanaktsidis
Copy link

No, this code: https://github.com/ruby/ruby/blob/a15e4d405ba6cafbe2f63921bd771b1241049841/cont.c#L1576-L1581
calls to sanitizer and then to swapcontext(). Its obviously that you use no locals in between. But add some, and you'll finally reproduce my bug report. :)

Yes, I fully agree - this is only working in Ruby at the moment by accident.

I suppose we should mark fiber_setcontext and fiber_entry in Ruby with __attribute__((no_sanitize("address"))). That I think would be the more correct thing to do - can't have ASAN enabled in the code which deals with switching the ASAN fake stacks around.

Do you think this should be re-classified to a documentation problem

Yes - I would like to contribute some documentation improvements to ASAN & Fibers but I actually don't know how. It's a github wiki and I don' t know how to propose changes. I think the docs should...

  • Explain __sanitizer_start_switch_fiber and __sanitizer_finish_switch_fiber in terms of "either handing off ownership of the fake stack, or destroying it" like I did in my last message - IMHO that's the clearest way to think about it.
  • Tell you to disable ASAN in the funtions which call these functions (and any other functions between them & swapcontext or your assembly swapping routine)
  • Explain where in code you will put these calls, not just where in time (i.e. call __sanitizer_start_switch_fiber before swapcontext, with a conditional to handle the dead case, one call to __sanitizer_finish_switch_fiber in your fiber entrypoint to handle the new fiber case, and one call to __sanitizer_finish_switch_fiber after swapcontext to handle the resuming-existing-fiber case).

I mostly had to figure this stuff out for Ruby by trial and error and looking at the LLVM sources.

@KJTsanaktsidis
Copy link

FWIW I had a look at your code in dosemu, it looks correct to me now, other than that I think you need __attribute__((no_sanitize("address")) on co_switch_context and maybe co_runner.

Thanks for opening this issue, it's clearly very under-documented and It was useful for me to have a closer look at how this all works in Ruby too :)

@stsp
Copy link
Author

stsp commented Jun 2, 2024

Thank you very much!
__attribute__((no_sanitize("address")))
is indeed a very nice trick that I now applied.
With that trick, this problem is certainly
doc-only.

Another thing worth documenting, is the
one from #1760 (comment)
where @ioquatix says how the stack for
the main coroutine should be retrieved.
I.e. that __sanitizer_finish_switch_fiber()
should never have NULLs in the second
and third args.

stsp added a commit to dosemu2/dosemu2 that referenced this issue Jun 2, 2024
Since on entering the new coroutine the fake-stack should be
re-initialized, we can as well explicitly call
`__sanitizer_finish_switch_fiber()` with NULL at the first arg.
But this means the stack-save pointer is completely redundant!

See google/sanitizers#1760
stsp added a commit to dosemu2/dosemu2 that referenced this issue Jun 2, 2024
In co_runner() we re-initialize the asan's fake stack.
Even though the old stack is not unmapped, it doesn't match the
asan's view of the current fake stack.
So its better to mark co_runner() with attribute no_sanitize("address")
to avoid any potential false-positives.

See google/sanitizers#1760
@stsp
Copy link
Author

stsp commented Jun 2, 2024

There are a few more things to add.
I referenced 2 commits here which I
did in a separate branch.
One commit just removes the fake-stack
ptr! After our discussion it appears
redundant, because on a coroutine entry
you are supposed to invalidate the stack
by passing NULL to __sanitizer_finish_switch_fiber().
But I can just pass NULL explicitly, and no
need for the saved ptr at all.

Another patch adds no-sanitize attribute
to the coroutine entry func. This is because
the fake-stack is re-initialized here. Yes, the
old one is not unmapped, so we do not
observe any crash. But still the asan's view
of the fake stack mismatches to the reality,
so to avoid any problems I suggest adding
an attribute also there.

Could you please review these 2 observations?

@ioquatix
Copy link

ioquatix commented Jun 2, 2024

For my own understanding:

Another patch adds no-sanitize attribute to the coroutine entry func.

Does it only apply to the function but not the coroutine co->func(co->data);?

But I can just pass NULL explicitly, and no need for the saved ptr at all.

Isn't this only when you want to free the internal fake stack?

stsp added a commit to dosemu2/dosemu2 that referenced this issue Jun 2, 2024
We need to save old stack only to retrieve the stack for main
coroutine. But we never return from main coroutine. So it is enough
to save the stack only when the new coroutine is started, as there
it may be started by the main one.

See google/sanitizers#1760
@stsp
Copy link
Author

stsp commented Jun 2, 2024

I made another very important observation
(now also referenced here), which is that
@ioquatix suggestion to not have NULLs in
the second and third args of
__sanitizer_finish_switch_fiber() are only
applied to the start of the coroutine, but
not to the return of some coroutine. Because
we never return from the main coroutine,
but we can start some from the main one.

So it appears:

  • on coroutine entry you should call
    __sanitizer_finish_switch_fiber(NULL, &ptr, &len)
  • on coroutine return you should call
    __sanitizer_finish_switch_fiber(fake_save, NULL, NULL)

So basically they "complement" each other,
and I'd rather add a separate function for
starting a fiber.

If either (or all) of my observations are true,
then this API should be documented entirely
differently, making my observations explicit.

Does it only apply to the function but not the coroutine co->func(co->data);?

Yes, because co->func() is called later, on a
new fake-stack already. Whereas co_runner()
is on an old one. Its not unmapped only by
a very deep magic in the @KJTsanaktsidis 's logic. :)

Isn't this only when you want to free the internal fake stack?

No, here I am talking about
__sanitizer_finish_switch_fiber(), not
__sanitizer_start_switch_fiber().
We call __sanitizer_finish_switch_fiber()
on a fiber start, and there we MUST pass
NULL as a first arg. Unlike
__sanitizer_start_switch_fiber() it
doesn't unmap the fake-stack, only invalidates
it to allocate a new one later. We always need
to invalidate and switch the stack on fiber
entry, which leads to this "always-NULL"
observation.

@stsp
Copy link
Author

stsp commented Jun 2, 2024

Because we never return from the main coroutine,

A vague explanation.
We can return from main coroutine
to some other if that "other" called
the main coroutine at some point.
But that can't happen before we
called to that "other" coroutine from
the main one. So to catch the stack
of the main coroutine, it is enough
to do that on "other" coroutine entry.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants