Conversation

@pilkicTT (Contributor) commented Nov 5, 2025

torch_xla doesn't call PJRT_ClientDestroy on shutdown, which means we never close the devices properly.

Recently, this started causing hangs on n300 boards on subsequent test runs.

This PR introduces a global singleton object that ensures the client instance is properly destroyed on process shutdown. The singleton serves as a fallback mechanism for when the framework doesn't call PJRT_ClientDestroy - as in the case of torch_xla.
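For illustration, a minimal sketch of such a shutdown-time fallback in C++ (the `ClientInstance` type and the `track`/`release` hooks are placeholder names, not the actual tt-xla API):

```cpp
#include <cstdio>

// Placeholder for the plugin's real client type; its destructor stands in
// for the device/mesh teardown that PJRT_ClientDestroy would trigger.
struct ClientInstance {
  ~ClientInstance() { std::puts("client destroyed, devices closed"); }
};

class ClientDestroyGuard {
public:
  static ClientDestroyGuard &instance() {
    // Function-local static: constructed on first use, destroyed during
    // normal process shutdown (return from main / exit()).
    static ClientDestroyGuard guard;
    return guard;
  }

  void track(ClientInstance *client) { m_client = client; }
  void release() { m_client = nullptr; }  // framework called PJRT_ClientDestroy

private:
  ClientDestroyGuard() = default;
  ~ClientDestroyGuard() {
    // Fallback path: if the framework (e.g. torch_xla) never destroyed
    // the client, do it here so devices are closed before exit.
    delete m_client;
  }
  ClientInstance *m_client = nullptr;
};

int main() {
  ClientDestroyGuard::instance().track(new ClientInstance);
  // Simulate a torch_xla-style exit with no PJRT_ClientDestroy call:
  // the guard's destructor still frees the client at shutdown.
  return 0;
}
```

A function-local static is used here so the fallback runs during normal static teardown; registering an `std::atexit` handler would behave similarly.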

Additionally, optimizer sub-meshes are now closed after each compilation; persisting them across compilations was previously needed to avoid hangs, but now it causes them. The mechanism for persisting the optimizer sub-mesh is left in the code base so that we can experiment with it if needed. Obviously, we need to dig deeper into these issues to fix them properly.
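A rough sketch of the behavior change, using invented names (`SubMesh`, `persist_optimizer_submesh`) rather than the real tt-xla interfaces:

```cpp
// Hypothetical sketch only; SubMesh, CompiledGraph, and the
// persist_optimizer_submesh flag are illustrative names, not tt-xla code.
struct SubMesh {
  void close() { /* release the optimizer sub-mesh's devices */ }
};

struct CompiledGraph { /* compilation artifact */ };

CompiledGraph compileWithOptimizer(SubMesh &optimizer_submesh,
                                   bool persist_optimizer_submesh) {
  CompiledGraph graph{};  // ... run the actual compilation here ...
  if (!persist_optimizer_submesh) {
    // New default: close the sub-mesh after each compilation. Persisting
    // it (flag = true) is kept around so the old behavior can be tested.
    optimizer_submesh.close();
  }
  return graph;
}
```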

NOTE: this does not solve the case where the process terminates abruptly, e.g. on SIGSEGV (segmentation fault), since static destructors don't run then. For that, we would ideally want a fix on the tt-metal side.
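To make that limitation concrete, a standalone demonstration (assuming default POSIX signal handling) that static destructors don't run when a process is killed by SIGSEGV:

```cpp
#include <cstdio>

// On a clean exit this destructor runs and prints; if the process dies
// from SIGSEGV, the default signal action terminates it immediately and
// no static destructors (or atexit handlers) execute.
struct Guard {
  ~Guard() { std::puts("guard destructor ran"); }
};
static Guard g_guard;

int main(int argc, char **) {
  if (argc > 1) {
    volatile int *p = nullptr;
    *p = 42;  // crash path: SIGSEGV, the message above never prints
  }
  return 0;  // clean path: destructor runs at shutdown
}
```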

Closes #1824

@jameszianxuTT (Contributor)

cc @kmabeeTT

@hshahTT (Contributor) commented Nov 5, 2025

Have we filed an issue with the torch-xla folks about them not calling PJRT_ClientDestroy properly? It's fine to add this workaround for now, but it'd be nice to eventually fix the underlying issue within torch-xla itself.

@pilkicTT force-pushed the pilkic/workaround-destroy branch from 3257cff to fe091fd on November 5, 2025 at 18:32.
@jameszianxuTT (Contributor)

@hshahTT torch-xla folks are not fans of implementing proper destructor logic / calling PJRT_ClientDestroy (see pytorch/xla#9675), so I don't expect a fix to be upstreamed anytime soon.

Several parties have encountered this issue, not just us.

@github-actions bot commented Nov 5, 2025

TT-XLA Tests: 179 ran - 159 passed ✅, 20 skipped ⚠️, 0 failed

No test annotations available.

@pilkicTT force-pushed the pilkic/workaround-destroy branch from d7c948d to 19cd3a7 on November 5, 2025 at 19:47.
@kmabeeTT (Contributor) commented Nov 5, 2025

Thank you! Works well for me; I'm able to run back-to-back pytest commands on an n300-llmbox now without hangs (and without needing to call tt-smi -r between tests). I still get a DRAM leak between tests within the same pytest command, but that's a separate issue to figure out.

@pilkicTT force-pushed the pilkic/workaround-destroy branch from 19cd3a7 to aa67e43 on November 6, 2025 at 12:14.
@pilkicTT merged commit 409f079 into main on November 6, 2025 (38 checks passed).
@pilkicTT deleted the pilkic/workaround-destroy branch on November 6, 2025 at 14:16.

Development

Successfully merging this pull request may close these issues:

torch_xla ClientInstance destructor doesn't get called leading to leaked devices and mesh