[BUG] AssertionError: ERR: Assert Failed: More jobs found (via squeue and log files) than should exist #362



hlitz@bohr1:~/git/scarab-infra$ ./sci --status myexp3
[datacenter, datacenter, clang, 509, None] is a valid simulation option.
Checking resource availability of slurm nodes:
UNAVAILABLE: bohr2
UNAVAILABLE: bohr3
UNAVAILABLE: bohr4
UNAVAILABLE: bohr5
AVAILABLE:   maxwell
UNAVAILABLE: ohm

Checking what nodes currently have a running job with the following name(s):
allbench_traces_*_exp_test10_*_*_hlitz

Summary of running simulations (by node):
ohm: 1 Jobs

More log files than total runs. Maybe same experiment name was run multiple times?
WARN: Log files for config baseline, which is not in the experiment file
Currently running 1 simulations (from logs: 1)
Traceback (most recent call last):
  File "/soe/hlitz/git/scarab-infra/sci", line 4246, in <module>
    sys.exit(main())
             ^^^^^^
  File "/soe/hlitz/git/scarab-infra/sci", line 4229, in main
    handle_descriptor_action(args.status, "info", dbg_override=args.debug_level)
  File "/soe/hlitz/git/scarab-infra/sci", line 407, in handle_descriptor_action
    sim_module.run_simulation_command(str(path), sim_action, dbg_lvl=dbg_lvl, infra_dir=infra_dir)
  File "/soe/hlitz/git/scarab-infra/scripts/run_simulation.py", line 521, in run_simulation_command
    raise exc
  File "/soe/hlitz/git/scarab-infra/scripts/run_simulation.py", line 465, in run_simulation_command
    slurm_runner.print_status(user, experiment_name, docker_image_list, descriptor_data, workloads_data, dbg_lvl)
  File "/soe/hlitz/git/scarab-infra/scripts/slurm_runner.py", line 308, in print_status
    print_simulation_status_summary(
  File "/soe/hlitz/git/scarab-infra/scripts/utilities.py", line 1948, in print_simulation_status_summary
    assert total_per_conf >= total_found, "ERR: Assert Failed: More jobs found (via squeue and log files) than should exist"
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
AssertionError: ERR: Assert Failed: More jobs found (via squeue and log files) than should exist


I think this happened because I had defined some configurations, let them run to completion, and then removed them from the experiment file and added other configs.
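A minimal sketch of the mismatch I suspect, assuming the check compares the number of runs implied by the current experiment file against the number of log files on disk. I have not confirmed this against `utilities.py`; the helper names, workload names, and `cfg_new` are hypothetical, and only `total_per_conf` and `total_found` are borrowed from the failing assertion.

```python
# Hypothetical model of the status check; not taken from utilities.py.

def count_expected(configs, workloads):
    # Assumption: one run per (config, workload) pair in the current experiment file.
    return len(configs) * len(workloads)

def count_found(log_configs):
    # Assumption: one entry per log file on disk, labelled by the config that wrote it.
    return len(log_configs)

workloads = ["wl_a", "wl_b"]   # illustrative workload names
new_configs = ["cfg_new"]      # what the experiment file lists now

# Two logs left behind by the removed "baseline" config, one from the new config.
logs_on_disk = ["baseline", "baseline", "cfg_new"]

total_per_conf = count_expected(new_configs, workloads)
total_found = count_found(logs_on_disk)

# Logs whose config no longer appears in the experiment file.
stale = [c for c in logs_on_disk if c not in new_configs]

# With stale logs counted, this is the condition the assertion rejects.
would_fail = total_found > total_per_conf
```

If this is roughly what happens, leftover logs from removed configs are enough to trip the assertion even though nothing is actually wrong with the running jobs.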

> More log files than total runs. Maybe same experiment name was run multiple times?

1. The message should name the log files that are extra or missing.
2. What does "Maybe same experiment name was run multiple times?" mean?
3. Does the check just count log files, or does it check which jobs are actually running?
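For point 1, something along these lines would already help. This is only a sketch under assumptions: I am guessing that the config name can be read from the log file's parent directory, and `find_unexpected_logs` and `format_report` are made-up names, not existing functions in scarab-infra.

```python
from collections import defaultdict
from pathlib import PurePosixPath

def find_unexpected_logs(expected_configs, log_paths):
    """Return {config: [log paths]} for logs whose config is not expected."""
    expected = set(expected_configs)
    by_config = defaultdict(list)
    for path in log_paths:
        # Assumption: the parent directory name of a log file is its config name.
        by_config[PurePosixPath(path).parent.name].append(path)
    return {cfg: sorted(paths) for cfg, paths in by_config.items()
            if cfg not in expected}

def format_report(unexpected):
    """Build warning lines that name every unexpected log file."""
    lines = []
    for cfg, paths in sorted(unexpected.items()):
        lines.append(f"WARN: {len(paths)} log file(s) for config '{cfg}', "
                     "which is not in the experiment file:")
        lines.extend(f"  {p}" for p in paths)
    return lines

# Illustrative paths only; the real log layout may differ.
logs = ["logs/baseline/wl_a.log", "logs/baseline/wl_b.log", "logs/cfg_new/wl_a.log"]
unexpected = find_unexpected_logs(["cfg_new"], logs)
report = format_report(unexpected)
```

Printing `report` instead of raising would let `--status` finish and still tell the user exactly which files to clean up.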
