Is your feature request related to a problem? Please describe.
Currently, when the fine-tuning script crashes, much of its associated state is lost and some open file descriptors or connections are left unclosed. For example, runs tracked by Aim still show as running even though the program has exited.
The proposal is to add an exit handler that calls close on these descriptors and can also save some state from the system before exiting.
We faced this issue during our benchmarking runs. Specifically, we had hundreds of AIM runs appearing active because NCCL and GPU out-of-memory exceptions left the AIM experiments unclosed. This large number of active experiments eventually caused the web dashboard to stop working.
To fix this, we basically did what you suggest in the issue description. The only difference is that, instead of using atexit, we caught exceptions inside a wrapper file that invokes tuning.sft_trainer::train(). In our exception handler we manually invoked the close() method on the AIM run.
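For anyone wanting to try the same workaround, here is a minimal sketch of the wrapper idea. It is not our actual wrapper: `FakeRun` is a hypothetical stand-in for the tracker run object (all it assumes is a `close()` method), and `run_with_cleanup` is an illustrative name. It also uses `try`/`finally` rather than an explicit exception handler, so the run is closed on success as well as on failure.

```python
class FakeRun:
    """Hypothetical stand-in for a tracker run; only close() is assumed."""

    def __init__(self):
        self.closed = False

    def close(self):
        self.closed = True


def run_with_cleanup(train_fn, run):
    """Call train_fn and always close the run, even if training raises.

    The exception (e.g. a GPU out-of-memory error) still propagates to the
    caller after the run has been closed.
    """
    try:
        return train_fn()
    finally:
        run.close()
```

In a real wrapper, `train_fn` would be whatever callable starts training, and `run` would be the tracker's run object. This only helps when the failure surfaces as a Python exception; a process killed outright never reaches the `finally` block.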
Describe the solution you'd like
We need to look into what helps here. Modules like https://docs.python.org/3/library/atexit.html exist, but they only help in certain scenarios and not all of them.
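As a rough illustration of one possible direction (a sketch under assumptions, not a settled design), the snippet below combines `atexit` with a SIGTERM handler. `CleanupRegistry` and `install` are hypothetical names. Per the Python documentation, `atexit` handlers do not run when the process is killed by a signal Python does not handle, on a fatal interpreter error, or when `os._exit()` is called, so this covers normal exits, unhandled Python exceptions, and SIGTERM, but not SIGKILL or a hard crash in native code.

```python
import atexit
import signal
import sys


class CleanupRegistry:
    """Collects close-like callbacks and runs each of them at most once."""

    def __init__(self):
        self._callbacks = []
        self._done = False

    def register(self, callback):
        self._callbacks.append(callback)

    def run_all(self):
        if self._done:
            return
        self._done = True
        # Run in reverse registration order, mirroring atexit's behaviour.
        for callback in reversed(self._callbacks):
            try:
                callback()
            except Exception as exc:
                # One failing callback should not block the remaining ones.
                print(f"cleanup callback failed: {exc!r}", file=sys.stderr)


def install(registry):
    """Hook the registry into interpreter exit and SIGTERM (main thread only)."""
    atexit.register(registry.run_all)

    def _on_sigterm(signum, frame):
        # Raising SystemExit lets the interpreter unwind and run atexit handlers.
        sys.exit(128 + signum)

    signal.signal(signal.SIGTERM, _on_sigterm)
```

A training script would create one registry, call `install(registry)` early, and register callbacks such as the tracker run's `close` method as resources are opened.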