[BUG] AssertionError: ERR: Assert Failed: More jobs found (via squeue and log files) than should exist #362

@hlitz

hlitz@bohr1:~/git/scarab-infra$ ./sci --status myexp3
[datacenter, datacenter, clang, 509, None] is a valid simulation option.
Checking resource availability of slurm nodes:
UNAVAILABLE: bohr2
UNAVAILABLE: bohr3
UNAVAILABLE: bohr4
UNAVAILABLE: bohr5
AVAILABLE: maxwell
UNAVAILABLE: ohm

Checking what nodes currently have a running job with the following name(s):
allbench_traces_exp_test10_*_hlitz

Summary of running simulations (by node):
ohm: 1 Jobs

More log files than total runs. Maybe same experiment name was run multiple times?
WARN: Log files for config baseline, which is not in the experiment file
Currently running 1 simulations (from logs: 1)
Traceback (most recent call last):
  File "/soe/hlitz/git/scarab-infra/sci", line 4246, in <module>
    sys.exit(main())
             ^^^^^^
  File "/soe/hlitz/git/scarab-infra/sci", line 4229, in main
    handle_descriptor_action(args.status, "info", dbg_override=args.debug_level)
  File "/soe/hlitz/git/scarab-infra/sci", line 407, in handle_descriptor_action
    sim_module.run_simulation_command(str(path), sim_action, dbg_lvl=dbg_lvl, infra_dir=infra_dir)
  File "/soe/hlitz/git/scarab-infra/scripts/run_simulation.py", line 521, in run_simulation_command
    raise exc
  File "/soe/hlitz/git/scarab-infra/scripts/run_simulation.py", line 465, in run_simulation_command
    slurm_runner.print_status(user, experiment_name, docker_image_list, descriptor_data, workloads_data, dbg_lvl)
  File "/soe/hlitz/git/scarab-infra/scripts/slurm_runner.py", line 308, in print_status
    print_simulation_status_summary(
  File "/soe/hlitz/git/scarab-infra/scripts/utilities.py", line 1948, in print_simulation_status_summary
    assert total_per_conf >= total_found, "ERR: Assert Failed: More jobs found (via squeue and log files) than should exist"
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
AssertionError: ERR: Assert Failed: More jobs found (via squeue and log files) than should exist

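I think this happened because I had defined some configurations, let them run to completion, and then removed them from the experiment file and added other configs. The log files of the removed configs (e.g. baseline) are still on disk.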

Regarding the message "More log files than total runs. Maybe same experiment name was run multiple times?":

  1. It should print which log files are missing or extra.
  2. What does "Maybe same experiment name was run multiple times?" mean?
  3. Does it only count log files, or does it check which jobs are actually running?
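
To illustrate point 1, here is a minimal sketch of a check that names the extra log files and only counts logs for configs still in the experiment file. This is not the scarab-infra code: `find_unexpected_logs`, `check_job_counts`, the `(config, path)` input shape, and the example paths are all hypothetical, and I am assuming stale logs can be identified by their config name.

```python
from collections import defaultdict


def find_unexpected_logs(expected_configs, log_files):
    """Return {config: [log paths]} for logs whose config is not expected.

    expected_configs: config names currently in the experiment file.
    log_files: iterable of (config_name, log_path) pairs (hypothetical shape).
    """
    stale = defaultdict(list)
    for config, path in log_files:
        if config not in expected_configs:
            stale[config].append(path)
    return dict(stale)


def check_job_counts(expected_configs, runs_per_config, log_files):
    """Return (ok, message) instead of raising a bare AssertionError."""
    expected = set(expected_configs)
    log_files = list(log_files)  # iterated twice below
    total_expected = len(expected) * runs_per_config

    # Logs from configs that were removed from the experiment file.
    stale = find_unexpected_logs(expected, log_files)
    # Only logs that belong to current configs count toward the total.
    total_found = sum(1 for config, _ in log_files if config in expected)

    lines = []
    for config, paths in sorted(stale.items()):
        lines.append(
            f"WARN: config '{config}' is not in the experiment file; "
            f"ignoring {len(paths)} log file(s):"
        )
        lines.extend(f"  {path}" for path in paths)

    ok = total_found <= total_expected
    if not ok:
        lines.append(
            f"ERR: found {total_found} runs for current configs, "
            f"expected at most {total_expected}"
        )
    return ok, "\n".join(lines)


if __name__ == "__main__":
    # Hypothetical data: 'baseline' was removed, 'new_a' and 'new_b' were added.
    logs = [
        ("baseline", "logs/baseline/workload1.log"),
        ("new_a", "logs/new_a/workload1.log"),
    ]
    ok, message = check_job_counts(["new_a", "new_b"], 1, logs)
    print(ok)
    print(message)
```

With a check along these lines, leftover logs from a removed config would produce a warning that lists the files, and the status command would fail only if the current configs themselves have more runs than expected.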
