Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

napi_threadsafe_function is very hard to use safely #55706

Open
mika-fischer opened this issue Nov 3, 2024 · 4 comments · May be fixed by #55877
Open

napi_threadsafe_function is very hard to use safely #55706

mika-fischer opened this issue Nov 3, 2024 · 4 comments · May be fixed by #55877
Assignees
Labels
node-api Issues and PRs related to the Node-API.

Comments

@mika-fischer
Copy link
Contributor

Version

v22.10.0

Platform

Linux s7 5.15.0-69-generic #76-Ubuntu SMP Fri Mar 17 17:19:29 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux

Subsystem

node-api

What steps will reproduce the bug?

The handle obtained with napi_create_threadsafe_function is supposed to be usable by arbitrary threads.
In particular, the following operations should be safe to call from arbitrary threads (until that thread calls napi_release_threadsafe_function or receives napi_closing from napi_call_threadsafe_function):
napi_get_threadsafe_function_context, napi_call_threadsafe_function, napi_acquire_threadsafe_function, and napi_release_threadsafe_function.

However, this is currently not the case.
In particular, in the face of Node.js environment shutdown there are data races and use-after-frees that occur when threads call any of these functions during or after the cleanup that happens when the Node.js environment shuts down.

There are two main issues I already found:

  • During finalization, the queue is accessed without holding its mutex here
    • If another thread calls napi_call_threadsafe_function concurrently, this leads to a data race on the queue internals here
  • At the end of finalization, the whole internal state is just deleted here even if there are still threads holding handles to the tsnf.
    • Afterwards each use of any of the above-mentioned functions is a use-after-free bug

See https://github.com/mika-fischer/node-bug-napi-tsfn for a detailed reproduction of the issue on Linux using valgrind

While working around this is technically possible (see src/run/fixed.hpp) it involves attaching a finalizer in order to track the finalization state in an external flag, whose lifetime must be managed via a shared_ptr or similar and all access to the TSFN must be protected by another mutex (or read-write-lock). This makes the whole thing very unergonomic to use and I'm pretty sure nobody will jump through these hoops.

It should also be relatively easy to fix and not break API or ABI

How often does it reproduce? Is there a required condition?

Always

What is the expected behavior? Why is that the expected behavior?

  • Finalization should lock the mutex to make concurrent calls safe.
  • Finalization should put the TSFN into a state where all its resources are released, but only delete the actual TSFN object if there are no more handles to it. Otherwise the actual deletion should be deferred until one of napi_release_threadsafe_function or napi_call_threadsafe_function decreases the thread_count to zero.
  • In this finalized-but-not-yet deleted state, the operations mentioned above should work as follows:
    • napi_get_threadsafe_function_context should return the stored context pointer as usual
    • napi_call_threadsafe_function should return napi_closing, decrease the thread_count and if it falls to zero delete the TSFN
    • napi_acquire_threadsafe_function should return napi_closing
    • napi_release_threadsafe_function should return napi_ok, decrease the thread_count and if it falls to zero delete the TSFN

What do you see instead?

data races & use-after-frees leading to crashes

Additional information

No response

@RedYetiDev RedYetiDev added the node-api Issues and PRs related to the Node-API. label Nov 3, 2024
@mhdawson
Copy link
Member

It should also be relatively easy to fix and not break API or ABI

Is that something you would be able to submit a PR for ?

@mika-fischer
Copy link
Contributor Author

Sure, I can give it a go

mika-fischer added a commit to mika-fischer/node that referenced this issue Nov 16, 2024
Other threads can still hold a valid handle to the tsfn after
finalization if finalization was triggered by
- release with napi_tsfn_abort, or
- environment shutdown

Handle this by:
- protecting finalization itself with the mutex
- if necessary, delay deletion after finalization to when thread_count
  drops to 0

Fixes: nodejs#55706
mika-fischer added a commit to mika-fischer/node that referenced this issue Nov 16, 2024
Other threads can still hold a valid handle to the tsfn after
finalization if finalization was triggered by
- release with napi_tsfn_abort, or
- environment shutdown

Handle this by:
- protecting finalization itself with the mutex
- if necessary, delay deletion after finalization to when thread_count
  drops to 0
- releasing all resources as soon as possible before deletion

Fixes: nodejs#55706
@mika-fischer
Copy link
Contributor Author

@mhdawson See #55877

This fixes the test here https://github.com/mika-fischer/node-bug-napi-tsfn

I'll try to integrate the test into node next week, don't have any more free time this weekend.

@mhdawson
Copy link
Member

@mika-fischer thanks.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
node-api Issues and PRs related to the Node-API.
Projects
Status: In Progress
Development

Successfully merging a pull request may close this issue.

4 participants