[BUG] AssertionError: ERR: Assert Failed: More jobs found (via squeue and log files) than should exist #362
Description
hlitz@bohr1:~/git/scarab-infra$ ./sci --status myexp3
[datacenter, datacenter, clang, 509, None] is a valid simulation option.
Checking resource availability of slurm nodes:
UNAVAILABLE: bohr2
UNAVAILABLE: bohr3
UNAVAILABLE: bohr4
UNAVAILABLE: bohr5
AVAILABLE: maxwell
UNAVAILABLE: ohm
Checking what nodes currently have a running job with the following name(s):
allbench_traces_exp_test10_*_hlitz
Summary of running simulations (by node):
ohm: 1 Jobs
More log files than total runs. Maybe same experiment name was run multiple times?
WARN: Log files for config baseline, which is not in the experiment file
Currently running 1 simulations (from logs: 1)
Traceback (most recent call last):
File "/soe/hlitz/git/scarab-infra/sci", line 4246, in <module>
sys.exit(main())
^^^^^^
File "/soe/hlitz/git/scarab-infra/sci", line 4229, in main
handle_descriptor_action(args.status, "info", dbg_override=args.debug_level)
File "/soe/hlitz/git/scarab-infra/sci", line 407, in handle_descriptor_action
sim_module.run_simulation_command(str(path), sim_action, dbg_lvl=dbg_lvl, infra_dir=infra_dir)
File "/soe/hlitz/git/scarab-infra/scripts/run_simulation.py", line 521, in run_simulation_command
raise exc
File "/soe/hlitz/git/scarab-infra/scripts/run_simulation.py", line 465, in run_simulation_command
slurm_runner.print_status(user, experiment_name, docker_image_list, descriptor_data, workloads_data, dbg_lvl)
File "/soe/hlitz/git/scarab-infra/scripts/slurm_runner.py", line 308, in print_status
print_simulation_status_summary(
File "/soe/hlitz/git/scarab-infra/scripts/utilities.py", line 1948, in print_simulation_status_summary
assert total_per_conf >= total_found, "ERR: Assert Failed: More jobs found (via squeue and log files) than should exist"
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
AssertionError: ERR: Assert Failed: More jobs found (via squeue and log files) than should exist
I think this happened when I had some configurations defined, let them run to completion, and then removed them and added other configs.
Regarding the message "More log files than total runs. Maybe same experiment name was run multiple times?":
- It should print which log file is missing or additional.
- What does "Maybe same experiment name was run multiple times?" mean?
- Does it just count log files, or does it check which ones are running?
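To illustrate the first request, here is a minimal sketch of how a status check could name the offending log files instead of only comparing counts. It is not taken from scarab-infra: the function names `diff_log_configs` and `format_report` are hypothetical, and it assumes (possibly wrongly) that each log file sits in a directory named after its config.

```python
from pathlib import Path
from typing import Dict, Iterable, List


def diff_log_configs(expected_configs: Iterable[str],
                     log_paths: Iterable[Path]) -> Dict[str, object]:
    """Compare configs listed in the experiment file with configs that have logs.

    Assumption (hypothetical layout): a log lives at <config>/<name>.log,
    so the parent directory name identifies the config it belongs to.
    """
    expected = set(expected_configs)

    # Group the log files we found by the config directory they live in.
    found: Dict[str, List[str]] = {}
    for path in log_paths:
        path = Path(path)
        found.setdefault(path.parent.name, []).append(path.as_posix())

    return {
        # Configs in the experiment file that have no log files yet.
        "missing": sorted(expected - set(found)),
        # Configs with log files that are no longer in the experiment file.
        "additional": {
            config: sorted(paths)
            for config, paths in sorted(found.items())
            if config not in expected
        },
    }


def format_report(diff: Dict[str, object]) -> str:
    """Render the diff as readable lines rather than a bare assertion."""
    lines: List[str] = []
    for config in diff["missing"]:
        lines.append(f"MISSING: no log files for config {config}")
    for config, paths in diff["additional"].items():
        lines.append(f"ADDITIONAL: config {config} is not in the experiment file")
        lines.extend(f"  {p}" for p in paths)
    return "\n".join(lines)
```

For the situation described above (a completed `baseline` config that was later removed), calling `format_report(diff_log_configs(["new_a"], [Path("logs/baseline/w1.log")]))` would list the stale `baseline` log path explicitly, which is the information the current warning and assertion leave out.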