[pjrt] ensure client destruction on process exit #1999
cc @kmabeeTT |
Have we filed an issue with the torch-xla folks about them not calling |
@hshahTT torch-xla folks are not a fan of implementing proper destructor logic / calling Several parties have encountered this issue, not just us. |
Thank you! Works good for me, am able to run back to back pytest cmds on n300-llmbox now without hangs (and needing to call tt-smi -r between tests). I still get DRAM leak between tests in same pytest cmd, but that's a separate issue to figure out. |
`torch_xla` doesn't call `PJRT_ClientDestroy` properly. This means that we are not closing the devices properly. Recently, this started causing hangs on `n300` boards on subsequent execution of tests.

This PR introduces a global singleton object which will ensure that we are properly destroying the client instance on process shutdown. The singleton serves as a fallback mechanism if the framework doesn't call `PJRT_ClientDestroy` - like in the case of `torch_xla`.

Additionally, optimizer sub-meshes are now closed after each compilation; this previously was needed to avoid hangs, but now it causes them. Leaving the mechanism of persisting optimizer submesh in the code base, so that we can play with it if needed. Obviously, we need to dig deep into these issues to fix them properly.

NOTE: this does not solve the case when the process terminates abruptly, e.g. in case of `SIGSEGV` (segmentation fault). For this, ideally we would want a fix on `tt-metal` side.

Closes #1824
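
For readers unfamiliar with the pattern, here is a minimal sketch of a global singleton acting as a destruction fallback. It is not the code from this PR: `FakeClient` and `ClientRegistry` are hypothetical names, and `FakeClient` only counts "open devices" instead of touching real hardware. The idea is that the registry owns the client, so if nobody releases it explicitly, the registry's destructor does so during normal process exit.

```cpp
#include <cassert>
#include <memory>
#include <mutex>
#include <utility>

// Hypothetical stand-in for a PJRT client: "opens" a device on construction
// and "closes" it on destruction.
struct FakeClient {
  // Counter of currently open devices (function-local static, so it is
  // initialized on first use and remains readable during shutdown).
  static int &open_devices() {
    static int count = 0;
    return count;
  }
  FakeClient() { ++open_devices(); }
  ~FakeClient() { --open_devices(); }
};

// Hypothetical global singleton that owns the client instance.
class ClientRegistry {
public:
  static ClientRegistry &instance() {
    // Function-local static: destroyed on normal process exit (return from
    // main or std::exit), which destroys any client still held.
    static ClientRegistry registry;
    return registry;
  }

  ClientRegistry(const ClientRegistry &) = delete;
  ClientRegistry &operator=(const ClientRegistry &) = delete;

  // Take ownership of a newly created client (replaces any previous one).
  void adopt(std::unique_ptr<FakeClient> client) {
    std::lock_guard<std::mutex> lock(mutex_);
    client_ = std::move(client);
  }

  // Explicit destroy path, used when the framework does clean up properly.
  // Returns true if a client was actually destroyed.
  bool release() {
    std::lock_guard<std::mutex> lock(mutex_);
    if (!client_) {
      return false;
    }
    client_.reset();
    return true;
  }

  bool has_client() const {
    std::lock_guard<std::mutex> lock(mutex_);
    return client_ != nullptr;
  }

private:
  ClientRegistry() = default;

  mutable std::mutex mutex_;
  std::unique_ptr<FakeClient> client_;
};
```

If `release()` is never called, the `unique_ptr` member is destroyed together with the registry at static destruction time, which closes the client. Static destructors do not run when the process is killed by a signal such as `SIGSEGV`, which matches the limitation stated in the NOTE above.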