Conversation

@pilkicTT (Contributor) commented Nov 5, 2025

torch_xla doesn't call PJRT_ClientDestroy on shutdown, which means we never close the devices properly.

Recently, this started causing hangs on n300 boards on subsequent test runs.

This PR introduces a global singleton object that ensures the client instance is properly destroyed on process shutdown. The singleton serves as a fallback mechanism for when the framework doesn't call PJRT_ClientDestroy - as in the case of torch_xla.
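For illustration, a minimal sketch of such a shutdown-time fallback in C++ (the `ClientInstance` type and the `track`/`release` hooks are placeholder names, not the actual tt-xla API):

```cpp
#include <cstdio>

// Placeholder for the plugin's real client type; its destructor stands in
// for the device/mesh teardown that PJRT_ClientDestroy would trigger.
struct ClientInstance {
  ~ClientInstance() { std::puts("client destroyed, devices closed"); }
};

class ClientDestroyGuard {
public:
  static ClientDestroyGuard &instance() {
    // Function-local static: constructed on first use, destroyed during
    // normal process shutdown (return from main / exit()).
    static ClientDestroyGuard guard;
    return guard;
  }

  void track(ClientInstance *client) { m_client = client; }
  void release() { m_client = nullptr; }  // framework called PJRT_ClientDestroy

private:
  ClientDestroyGuard() = default;
  ~ClientDestroyGuard() {
    // Fallback path: if the framework (e.g. torch_xla) never destroyed
    // the client, do it here so devices are closed before exit.
    delete m_client;
  }
  ClientInstance *m_client = nullptr;
};

int main() {
  ClientDestroyGuard::instance().track(new ClientInstance);
  // Simulate a torch_xla-style exit with no PJRT_ClientDestroy call:
  // the guard's destructor still frees the client at shutdown.
  return 0;
}
```

A function-local static is used here so the fallback runs during normal static teardown; registering an `std::atexit` handler would behave similarly.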

Additionally, optimizer sub-meshes are now closed after each compilation; persisting them across compilations was previously needed to avoid hangs, but now it causes them. The mechanism for persisting the optimizer sub-mesh is left in the code base so that we can experiment with it if needed. Obviously, we need to dig deeper into these issues to fix them properly.
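A rough sketch of the behavior change, using invented names (`SubMesh`, `persist_optimizer_submesh`) rather than the real tt-xla interfaces:

```cpp
// Hypothetical sketch only; SubMesh, CompiledGraph, and the
// persist_optimizer_submesh flag are illustrative names, not tt-xla code.
struct SubMesh {
  void close() { /* release the optimizer sub-mesh's devices */ }
};

struct CompiledGraph { /* compilation artifact */ };

CompiledGraph compileWithOptimizer(SubMesh &optimizer_submesh,
                                   bool persist_optimizer_submesh) {
  CompiledGraph graph{};  // ... run the actual compilation here ...
  if (!persist_optimizer_submesh) {
    // New default: close the sub-mesh after each compilation. Persisting
    // it (flag = true) is kept around so the old behavior can be tested.
    optimizer_submesh.close();
  }
  return graph;
}
```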

NOTE: this does not solve the case where the process terminates abruptly, e.g. on SIGSEGV (segmentation fault), since static destructors don't run then. For that, we would ideally want a fix on the tt-metal side.
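To make that limitation concrete, a standalone demonstration (assuming default POSIX signal handling) that static destructors don't run when a process is killed by SIGSEGV:

```cpp
#include <cstdio>

// On a clean exit this destructor runs and prints; if the process dies
// from SIGSEGV, the default signal action terminates it immediately and
// no static destructors (or atexit handlers) execute.
struct Guard {
  ~Guard() { std::puts("guard destructor ran"); }
};
static Guard g_guard;

int main(int argc, char **) {
  if (argc > 1) {
    volatile int *p = nullptr;
    *p = 42;  // crash path: SIGSEGV, the message above never prints
  }
  return 0;  // clean path: destructor runs at shutdown
}
```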

Closes #1824

@jameszianxuTT (Contributor)

cc @kmabeeTT

@hshahTT (Contributor) commented Nov 5, 2025

Have we filed an issue with the torch-xla folks about them not calling PJRT_ClientDestroy properly? It's fine to add this workaround for now, but it'd be nice to eventually fix the underlying issue within torch-xla itself.

@pilkicTT force-pushed the pilkic/workaround-destroy branch from 3257cff to fe091fd on November 5, 2025 at 18:32.
@jameszianxuTT (Contributor)

@hshahTT torch-xla folks are not fans of implementing proper destructor logic / calling PJRT_ClientDestroy (see pytorch/xla#9675), so I don't expect a fix to be upstreamed anytime soon.

Several parties have encountered this issue, not just us.

@github-actions bot commented Nov 5, 2025

TT-XLA Tests: 179 ran - 159 passed ✅, 20 skipped ⚠️, 0 failed

No test annotations available.

@pilkicTT force-pushed the pilkic/workaround-destroy branch from d7c948d to 19cd3a7 on November 5, 2025 at 19:47.
@kmabeeTT (Contributor) commented Nov 5, 2025

Thank you! Works well for me; I'm able to run back-to-back pytest commands on an n300-llmbox now without hangs (and without needing to call tt-smi -r between tests). I still get a DRAM leak between tests within the same pytest command, but that's a separate issue to figure out.

@pilkicTT force-pushed the pilkic/workaround-destroy branch from 19cd3a7 to aa67e43 on November 6, 2025 at 12:14.
@pilkicTT merged commit 409f079 into main on November 6, 2025 (38 checks passed).
@pilkicTT deleted the pilkic/workaround-destroy branch on November 6, 2025 at 14:16.

Development

Successfully merging this pull request may close these issues:

torch_xla ClientInstance destructor doesn't get called leading to leaked devices and mesh