Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Have Full stack trace visible in logs when a job fails #16

Open
PhilippDahlinger opened this issue Nov 10, 2023 · 2 comments
Open

Have Full stack trace visible in logs when a job fails #16

PhilippDahlinger opened this issue Nov 10, 2023 · 2 comments

Comments

@PhilippDahlinger
Copy link

When I started a job using clusterduck with Slurm and an error is raised, I only see the following stack trace:

Error executing job with overrides: ['seed=0', '+experiment/deformable_plate=ltsgns_mesh_eval', '+platform=kluster_1_gpu']
submitit ERROR (2023-11-10 13:02:46,649) - Submitted job triggered an exception
Traceback (most recent call last):
  File "/home/i53/mitarbeiter/philipp_dahlinger/mambaforge/envs/ltsgns_mp/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/i53/mitarbeiter/philipp_dahlinger/mambaforge/envs/ltsgns_mp/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/home/i53/mitarbeiter/philipp_dahlinger/mambaforge/envs/ltsgns_mp/lib/python3.10/site-packages/submitit/core/_submit.py", line 11, in <module>
    submitit_main()
  File "/home/i53/mitarbeiter/philipp_dahlinger/mambaforge/envs/ltsgns_mp/lib/python3.10/site-packages/submitit/core/submission.py", line 76, in submitit_main
    process_job(args.folder)
  File "/home/i53/mitarbeiter/philipp_dahlinger/mambaforge/envs/ltsgns_mp/lib/python3.10/site-packages/submitit/core/submission.py", line 69, in process_job
    raise error
  File "/home/i53/mitarbeiter/philipp_dahlinger/mambaforge/envs/ltsgns_mp/lib/python3.10/site-packages/submitit/core/submission.py", line 55, in process_job
    result = delayed.result()
  File "/home/i53/mitarbeiter/philipp_dahlinger/mambaforge/envs/ltsgns_mp/lib/python3.10/site-packages/submitit/core/utils.py", line 133, in result
    self._result = self.function(*self.args, **self.kwargs)
  File "/home/i53/mitarbeiter/philipp_dahlinger/mambaforge/envs/ltsgns_mp/lib/python3.10/site-packages/hydra_plugins/clusterduck_launcher/clusterduck_launcher.py", line 116, in run_workers
    exceptions = [
  File "/home/i53/mitarbeiter/philipp_dahlinger/mambaforge/envs/ltsgns_mp/lib/python3.10/site-packages/hydra_plugins/clusterduck_launcher/clusterduck_launcher.py", line 117, in <listcomp>
    result.return_value
  File "/home/i53/mitarbeiter/philipp_dahlinger/mambaforge/envs/ltsgns_mp/lib/python3.10/site-packages/hydra/core/utils.py", line 260, in return_value
    raise self._return_value
IndexError: too many indices for tensor of dimension 5
srun: error: node2: task 0: Exited with exit code 1

I would like to see the stacktrace of my code, where in this example the IndexError was raised in order to find out the problem.
Is that possible?

@PhilippDahlinger
Copy link
Author

So I am using a solution presented here:
facebookresearch/hydra#2664

For me it works, you just have to catch the exception in your main training script. Not the prettiest solution, but for me, this resolves this.

@balazsgyenes
Copy link
Contributor

Great find! I'd still like to keep this issue open though, in the hopes of finding a cleaner solution.

@balazsgyenes balazsgyenes reopened this Nov 21, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants