Have Full stack trace visible in logs when a job fails #16

PhilippDahlinger · 2023-11-10T14:09:33Z

When I start a job using clusterduck with Slurm and an error is raised, I see only the following stack trace:

Error executing job with overrides: ['seed=0', '+experiment/deformable_plate=ltsgns_mesh_eval', '+platform=kluster_1_gpu']
submitit ERROR (2023-11-10 13:02:46,649) - Submitted job triggered an exception
Traceback (most recent call last):
  File "/home/i53/mitarbeiter/philipp_dahlinger/mambaforge/envs/ltsgns_mp/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/i53/mitarbeiter/philipp_dahlinger/mambaforge/envs/ltsgns_mp/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/home/i53/mitarbeiter/philipp_dahlinger/mambaforge/envs/ltsgns_mp/lib/python3.10/site-packages/submitit/core/_submit.py", line 11, in <module>
    submitit_main()
  File "/home/i53/mitarbeiter/philipp_dahlinger/mambaforge/envs/ltsgns_mp/lib/python3.10/site-packages/submitit/core/submission.py", line 76, in submitit_main
    process_job(args.folder)
  File "/home/i53/mitarbeiter/philipp_dahlinger/mambaforge/envs/ltsgns_mp/lib/python3.10/site-packages/submitit/core/submission.py", line 69, in process_job
    raise error
  File "/home/i53/mitarbeiter/philipp_dahlinger/mambaforge/envs/ltsgns_mp/lib/python3.10/site-packages/submitit/core/submission.py", line 55, in process_job
    result = delayed.result()
  File "/home/i53/mitarbeiter/philipp_dahlinger/mambaforge/envs/ltsgns_mp/lib/python3.10/site-packages/submitit/core/utils.py", line 133, in result
    self._result = self.function(*self.args, **self.kwargs)
  File "/home/i53/mitarbeiter/philipp_dahlinger/mambaforge/envs/ltsgns_mp/lib/python3.10/site-packages/hydra_plugins/clusterduck_launcher/clusterduck_launcher.py", line 116, in run_workers
    exceptions = [
  File "/home/i53/mitarbeiter/philipp_dahlinger/mambaforge/envs/ltsgns_mp/lib/python3.10/site-packages/hydra_plugins/clusterduck_launcher/clusterduck_launcher.py", line 117, in <listcomp>
    result.return_value
  File "/home/i53/mitarbeiter/philipp_dahlinger/mambaforge/envs/ltsgns_mp/lib/python3.10/site-packages/hydra/core/utils.py", line 260, in return_value
    raise self._return_value
IndexError: too many indices for tensor of dimension 5
srun: error: node2: task 0: Exited with exit code 1

I would like to see the stack trace from my own code, showing where the IndexError in this example was raised, so that I can track down the problem.
Is that possible?


PhilippDahlinger · 2023-11-21T10:39:59Z

I am now using the workaround described here:
facebookresearch/hydra#2664

It works for me: you just have to catch the exception in your main training script. It is not the prettiest solution, but it resolves the problem for me.
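
For anyone landing here, a minimal sketch of what "catch the exception in your main training script" could look like. This is not necessarily the exact code from the linked Hydra issue. The decorator name `log_full_traceback` and the `train` function are hypothetical, and the sketch leaves Hydra out so it runs on its own. It logs the full traceback while the exception is still being handled, then re-raises so the job still fails:

```python
import functools
import logging
import traceback

log = logging.getLogger(__name__)


def log_full_traceback(func):
    """Log the complete traceback of any exception raised by func, then re-raise."""

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        try:
            return func(*args, **kwargs)
        except Exception:
            # format_exc() renders the traceback of the exception currently
            # being handled, including the frames inside the user's own code.
            log.error("Job failed:\n%s", traceback.format_exc())
            raise  # keep the original failure behaviour

    return wrapper


# Hypothetical stand-in for a training entry point. In a real script this
# would be the function passed to @hydra.main, with log_full_traceback
# applied directly underneath that decorator (an assumption, not tested
# against clusterduck here).
@log_full_traceback
def train(cfg):
    data = [1, 2, 3]
    return data[cfg["index"]]
```

Calling `train({"index": 10})` still raises the `IndexError`, but the log now also contains the frames from inside `train`, which is the part missing from the trace above.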

balazsgyenes · 2023-11-21T12:12:09Z

Great find! I'd still like to keep this issue open though, in the hopes of finding a cleaner solution.

PhilippDahlinger closed this as completed Nov 21, 2023

balazsgyenes reopened this Nov 21, 2023
