[Not sure if bug/feature] Where does submitit error output go? #2664

Open
Ubadub opened this issue May 16, 2023 · 5 comments

@Ubadub

Ubadub commented May 16, 2023

Hi,

I am using the submitit plugin to run a --multirun sweep. Some of my jobs errored: based on the logging messages my code prints to the log file in the sweep subdir, I can tell they are exiting prematurely. However, at the bottom of the log file, submitit nevertheless claims that the "Job completed successfully".

I'm opening this issue to ask where I can find the actual stacktrace, and also why the log message erroneously claims the job completed successfully.

I do see two files: .submitit/[JOBNAME_JOBNUMBER]/[JOBNAME_JOBNUMBER]_log.out and .submitit/[JOBNAME_JOBNUMBER]/[JOBNAME_JOBNUMBER]_log.err. The former appears to be identical to the main log file that gets placed in the sweep subdir after the run completes. The latter, confusingly, does contain some error messages, but only ones produced by third-party libraries (in particular, I am using the HuggingFace transformers library, which prints certain warning messages that show up in the stderr file). I am used to seeing this output when running a job directly in the console, but it's unclear to me why that output appears there while nothing else I would normally expect to see in stderr does. My working hypothesis is that the error file contains messages that are printed directly to sys.stderr, but not error messages that result from raised exceptions.
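
The hypothesis above can be illustrated without submitit. The following is a minimal sketch, assuming a hypothetical `run_and_capture` helper that stands in for any launcher that catches a job's exception instead of letting it reach the interpreter (it is not submitit's actual code). Text written directly to `sys.stderr` is captured, but no traceback appears, because Python only prints one when an exception goes uncaught or when code prints it explicitly.

```python
import contextlib
import io
import sys


def job():
    # A library-style message written directly to stderr.
    print("warning: something looks off", file=sys.stderr)
    raise ValueError("job failed")


def run_and_capture(fn):
    """Hypothetical launcher stand-in: run fn and store any exception
    instead of letting it propagate to the interpreter."""
    try:
        return ("ok", fn())
    except Exception as exc:
        return ("failed", exc)


# Redirect stderr into a buffer, playing the role of the .err log file.
buffer = io.StringIO()
with contextlib.redirect_stderr(buffer):
    status, payload = run_and_capture(job)

captured = buffer.getvalue()
print(status, repr(payload))
print(repr(captured))
```

Here `captured` holds the warning line only; the `ValueError` survives as an object in `payload`, but its traceback was never written anywhere.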

When all the runs complete, I do see a short, relatively unhelpful message that usually consists of the error message only, without a stack trace or a line number. And even if many jobs fail, only one error message appears to be produced.

Is this customizable behavior? Is there some file or setting I'm missing? How can I recover a full, usable stack trace, like what I would have seen had I run the command without --multirun and without the submitit launcher?

@odelalleau
Collaborator

odelalleau commented May 16, 2023

Yeah, it's an annoying behavior with the submitit launcher. Here's the recipe I'm using myself:

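import sys
import traceback

import hydra
from omegaconf import DictConfig

@hydra.main(...)
def main(cfg: DictConfig):
    try:
        run(cfg)
    except Exception:
        # Print the full traceback to stderr before re-raising.
        traceback.print_exc(file=sys.stderr)
        raise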

==> you'll now see errors in the stderr log
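
If several entry points need the same treatment, the recipe can be factored into a decorator. This is a minimal sketch using only the standard library; the name `print_traceback_on_error` and the toy `run` function are hypothetical, and stacking the decorator beneath `@hydra.main(...)` is an assumption that has not been checked against every Hydra version.

```python
import functools
import sys
import traceback


def print_traceback_on_error(fn):
    """Wrap fn so any Exception prints its traceback to stderr, then re-raises."""

    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        try:
            return fn(*args, **kwargs)
        except Exception:
            # sys.stderr is looked up at call time, so redirection still works.
            traceback.print_exc(file=sys.stderr)
            sys.stderr.flush()
            raise

    return wrapper


@print_traceback_on_error
def run(cfg):
    # Toy job body (hypothetical) that always fails.
    raise RuntimeError(f"bad config: {cfg}")
```

Calling `run({"lr": 0.1})` still raises `RuntimeError`, so the caller sees the failure as before, but the full traceback has already been written to stderr.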

@Ubadub
Author

Ubadub commented May 23, 2023

Yeah, that's what I'm doing, but is this the only way to do it? It doesn't make much sense.

@luisenp

luisenp commented Jul 11, 2023

Had this exact issue, and found this, which was really helpful. Any clue why this is not the default behavior? The current state of things makes some errors really confusing.

@lrzpellegrini

I am adding a +1 to this. This is quite a strange issue.

  • .err files are empty
  • On top of that, the .out files report that the "Job completed successfully" even when the process exits because of an exception

I'd add that @Ubadub's solution works! However, I recommend using BaseException instead of Exception to cover all cases. I also added an optional flush. This is my attempt:

@hydra.main(...)
def main(cfg: DictConfig):
    import sys
    import traceback
    # This main is used to circumvent a bug in Hydra
    # See https://github.com/facebookresearch/hydra/issues/2664

    try:
        actual_main(cfg)
    except BaseException:
        traceback.print_exc(file=sys.stderr)
        raise
    finally:
        # Flush stdout and stderr so buffered output is not lost
        sys.stdout.flush()
        sys.stderr.flush()
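
The difference between the two handlers can be checked in isolation. A small sketch (the `caught_by` helper is hypothetical) showing that `except Exception` misses `KeyboardInterrupt` and `SystemExit`, which derive from `BaseException` directly:

```python
def caught_by(handler_type, exc):
    """Return True if `except handler_type` catches the raised exc."""
    try:
        raise exc
    except handler_type:
        return True
    except BaseException:
        # Fallback so the helper itself never propagates the exception.
        return False


# Each entry: (caught by Exception, caught by BaseException)
results = {
    type(exc).__name__: (
        caught_by(Exception, exc),
        caught_by(BaseException, exc),
    )
    for exc in (ValueError("x"), KeyboardInterrupt(), SystemExit(1))
}
print(results)
```

One consequence of the broader handler: a deliberate `sys.exit()` inside the job raises `SystemExit`, so it also gets a traceback printed before being re-raised.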

@gleize
Contributor

gleize commented Apr 12, 2024

A different solution: #2863
