
feat: Need a way to execute some cleanup calls before the program exits or crashes. #271

Open
dushyantbehl opened this issue Jul 31, 2024 · 1 comment


@dushyantbehl

Is your feature request related to a problem? Please describe.

Currently, when the fine-tuning script crashes, much of the state associated with it is lost, and some file descriptors or connections are left open. For example, runs tracked by Aim still show as running even though the program has exited.


The proposal is to add an exit handler that closes these descriptors and, optionally, saves some state from the system before the program exits.

Describe the solution you'd like

We need to look into what helps here. Modules like https://docs.python.org/3/library/atexit.html exist, but they only help in certain scenarios, not all of them.
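To make the gap concrete: per the Python docs, functions registered with `atexit` are not called when the process is killed by a signal Python does not handle, when a fatal internal error occurs, or when `os._exit()` is called. Below is a minimal sketch of one way to cover the SIGTERM case, by turning the signal into a normal interpreter exit so the `atexit` callbacks still run. The names `CleanupRegistry`, `registry`, and `install_handlers` are hypothetical and not part of this repository. SIGKILL and hard crashes (e.g. a segfault in native code) cannot be handled this way.

```python
import atexit
import signal
import sys


class CleanupRegistry:
    """Hypothetical helper: runs registered callbacks at most once, newest first."""

    def __init__(self):
        self._callbacks = []
        self._done = False

    def register(self, fn):
        """Add a zero-argument cleanup callable, e.g. a tracker's close method."""
        self._callbacks.append(fn)

    def run(self):
        """Run all callbacks; a failing callback does not block the others."""
        if self._done:
            return
        self._done = True
        for fn in reversed(self._callbacks):
            try:
                fn()
            except Exception as exc:
                print(f"cleanup failed: {exc!r}", file=sys.stderr)


registry = CleanupRegistry()


def _on_sigterm(signum, frame):
    # Raise SystemExit so the interpreter shuts down normally and
    # atexit callbacks fire. 128 + signum is the usual shell convention.
    sys.exit(128 + signum)


def install_handlers():
    """Call once from the main thread at program start."""
    atexit.register(registry.run)
    signal.signal(signal.SIGTERM, _on_sigterm)
```

A script would call `install_handlers()` early and then `registry.register(run.close)` for each resource that needs closing, where `run` is whatever tracker object the script holds.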

@VassilisVassiliadis

We faced this issue during our benchmarking runs: hundreds of Aim runs appeared active because NCCL and GPU out-of-memory exceptions did not close the Aim experiments. This large number of active experiments eventually caused the web dashboard to stop working.

To fix this, we did essentially what you suggest in the issue description. Instead of using the atexit module, we caught exceptions inside a wrapper file that invokes tuning.sft_trainer::train(). In our exception handler, we manually invoked the close() method on the Aim run.
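The wrapper approach described above could look roughly like the sketch below. This is an illustration, not our actual wrapper: `run_with_cleanup` and `DummyRun` are hypothetical names, `train_fn` stands in for a call to the trainer's `train()`, and `DummyRun` stands in for any tracker object that exposes a `close()` method. It uses `try`/`finally` rather than an `except` block, so the run is closed on success as well as on failure, and the original exception still propagates.

```python
class DummyRun:
    """Hypothetical stand-in for a tracker run object with a close() method."""

    def __init__(self):
        self.closed = False

    def close(self):
        self.closed = True


def run_with_cleanup(train_fn, run):
    """Call train_fn() and always close the run, even if training raises.

    train_fn: zero-argument callable wrapping the real training entry point.
    run: any object with a close() method.
    """
    try:
        return train_fn()
    finally:
        # Runs for normal return and for any exception, including
        # KeyboardInterrupt; the exception is re-raised afterwards.
        run.close()
```

Like `atexit`, this only helps when the failure surfaces as a Python exception. A process killed outright (for example by SIGKILL from an OOM killer) never reaches the `finally` block.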
