feat: Run detector timelines on SLURM #290


Open

mfmceneaney wants to merge 17 commits into main from issue_156

Collaborator

mfmceneaney commented Mar 4, 2025

No description provided.


          feat: Added slurm distribution functionality.

github-project-automation bot added this to CLAS12 Timelines

Collaborator Author

mfmceneaney commented Mar 4, 2025

Redid this PR after closing #289 and fixing branch history.


          fix: Do not clean output directories if organizing job outputs.

907bbb3

mfmceneaney requested a review from c-dilks

March 7, 2025 16:29

c-dilks linked an issue

that may be closed by this pull request

run Step 2 on SLURM #156

Open


          Merge branch 'main' into issue_156

5106d24

c-dilks reviewed

Member

c-dilks left a comment

Thanks a lot for doing this; it will be very useful for chefs (and developers), and will significantly speed things up.

I have some comments below, which hopefully aren't too much work to address.

Also I think for now it's great that the default behavior is the same; after we do some more testing we can switch the default to be Slurm.

bin/run-detectors-timelines.sh Outdated

    
    # start job lists
    echo """
    Generating job scripts..."""
    slurmDir=$TIMELINESRC/slurm

Member

c-dilks Mar 20, 2025

Chefs won't have write access to $TIMELINESRC; use ./slurm instead (for consistency with run-monitoring.sh).

Also, some files created in this directory will overwrite those from step 1, if Slurm is used there too (e.g., job.$dataset.detectors.list). Perhaps an easy way to avoid this conflict is to set slurmDir to be different for the two steps, such as ./slurm/step1 for run-monitoring.sh and ./slurm/step2 here.

Collaborator Author

mfmceneaney Mar 24, 2025

I went with the second option, using ./slurm/step1 for run-monitoring.sh and ./slurm/step2 here, but I would still like to check that this runs fine.

Collaborator Author

mfmceneaney Mar 24, 2025

I ended up having to set the output slurm directory to $outputDir/slurm/step2 in run-detectors-timelines.sh, since the script changes directories. It might be wise to use the full path in run-monitoring.sh too, even though that script does not change directories in a way that would make ./slurm and $outputDir/slurm differ.
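
For illustration, the directory choice discussed in this thread could be sketched as below. This is a minimal sketch, not the code from this PR: `outputDir` is set to a temporary directory here, `dataset` is a made-up name, and the `step1`/`step2` subdirectories follow the suggestion above.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Assumption: a scratch directory stands in for the real output directory.
outputDir=$(mktemp -d)
# Resolve to an absolute path once, before the script changes directories.
outputDir=$(cd "$outputDir" && pwd -P)

# One Slurm directory per step, so equally named files cannot collide.
slurmDirStep1=$outputDir/slurm/step1
slurmDirStep2=$outputDir/slurm/step2
mkdir -p "$slurmDirStep1" "$slurmDirStep2"

dataset=example_dataset   # hypothetical dataset name

# Step 1 writes its job list.
echo "step1 job list" > "$slurmDirStep1/job.$dataset.detectors.list"

# Simulate the script changing directories; absolute paths still resolve.
cd /

# Step 2 writes a file with the same name, in its own directory.
echo "step2 job list" > "$slurmDirStep2/job.$dataset.detectors.list"

cat "$slurmDirStep1/job.$dataset.detectors.list"
cat "$slurmDirStep2/job.$dataset.detectors.list"
```

Because both paths are absolute, a later `cd` does not change where the job lists end up, and the step 1 file survives step 2.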

bin/run-detectors-timelines.sh

Comment on lines +329 to +330

    
    # set classpath
    export CLASSPATH=$CLASSPATH

Member

c-dilks Mar 20, 2025

$CLASSPATH is necessary for now, but is removed in #293. Depending on whether we merge this PR or #293 first, we'll need to remember to deal with this (though if we forget, the script will just fail, reporting $CLASSPATH as unbound).

Collaborator Author

mfmceneaney Mar 24, 2025

Ok, thanks! Will keep an eye on this.
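
To illustrate the failure mode mentioned above: assuming the script runs with `set -u` (an assumption; the thread only says the variable would be reported as unbound), the self-assignment aborts when `CLASSPATH` is unset, whereas a `${CLASSPATH:-}` default would not. A small sketch, with each case run in a subshell so the demo itself keeps going:

```shell
#!/usr/bin/env bash

# Case 1: CLASSPATH is unset and expanded without a default.
# Under `set -u` the subshell exits with an "unbound variable" error.
if ( unset CLASSPATH; set -u; export CLASSPATH=$CLASSPATH ) 2>/dev/null; then
  strict_result=ok
else
  strict_result=failed
fi

# Case 2: same situation, but with an empty default value.
if ( unset CLASSPATH; set -u; export CLASSPATH=${CLASSPATH:-} ) 2>/dev/null; then
  default_result=ok
else
  default_result=failed
fi

echo "without default: $strict_result"
echo "with default:    $default_result"
```

Whether a silent empty default is preferable to a loud failure depends on which of the two PRs is merged first; the loud failure at least makes the leftover line easy to spot.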

bin/run-detectors-timelines.sh Outdated

    
    export CLASSPATH=$CLASSPATH
    # produce detector timelines
    java $TIMELINE_JAVA_OPTS $run_detectors_script $key $inputDir > $logFile.out 2> $logFile.err || touch $logFile.fail

Member

c-dilks Mar 20, 2025

Slurm will handle the logging, automatically splitting stdout and stderr.

Suggested change

      
    - java $TIMELINE_JAVA_OPTS $run_detectors_script $key $inputDir > $logFile.out 2> $logFile.err || touch $logFile.fail
    + java $TIMELINE_JAVA_OPTS $run_detectors_script $key $inputDir

You may also remove the logFile=$logDir/$key from a few lines above.

Later below, in the "error checking" part, we'll need to figure out how to read the Slurm error logs... or just tell the user to check them for themselves...

If we do end up reading the Slurm error logs, we'll need to use the job ID or something similar, so that when the user runs this script twice, the correct set of log files is used.

Collaborator Author

mfmceneaney Mar 24, 2025

For now I just removed what you suggested and in the documentation I just told the user to check for the job errors following the directions in step 1.

Collaborator Author

mfmceneaney Mar 24, 2025

And I also removed the extra log file definition.
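
One possible way to keep the logs of repeated submissions apart, as raised in this thread, is to put the job ID into the log file names using sbatch's filename patterns (`%A` for a job array's master job ID, `%a` for the array task index, `%j` for a plain job ID). The sketch below only writes a job script and does not submit it; the directory, job name, and array range are placeholders, not the names used in this PR, and it assumes a job array is used at all.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Placeholders: a scratch directory stands in for the real Slurm directory.
slurmDir=$(mktemp -d)
logDir=$slurmDir/log
mkdir -p "$logDir"
jobScript=$slurmDir/example.slurm

# Write a job script whose stdout and stderr files carry the job ID,
# so a second submission cannot overwrite the logs of the first.
cat > "$jobScript" << EOF
#!/usr/bin/env bash
#SBATCH --job-name=example_detectors
#SBATCH --output=$logDir/%A_%a.out
#SBATCH --error=$logDir/%A_%a.err
#SBATCH --array=1-3
# Each array task would run entry \$SLURM_ARRAY_TASK_ID of a job list here.
EOF

cat "$jobScript"
# To submit (requires a Slurm installation): sbatch "$jobScript"
```

An error check could then look only at the `.err` files whose names start with the job ID reported by `sbatch` at submission time.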

bin/run-detectors-timelines.sh Outdated

    
    --focus-timelines   only produce the detector timelines, do not run detector QA code
    --focus-qa          only run the QA code (assumes you have detector timelines already)
    --run-slurm         run timelines on SLURM instead of running multi-threaded locally

Member

c-dilks Mar 20, 2025

Could you also update the documentation on how to use these new options?

doc/chef_guide.md: supposed to be as terse as possible
doc/procedure.md: where you don't have to be terse (in fact, chef_guide.md was created because procedure.md was too verbose...)

Collaborator Author

mfmceneaney Mar 24, 2025

Just added a paragraph in both of these files, but let me know if I was too verbose in the chefs' documentation.

Matthew McEneaney added 14 commits

March 24, 2025 10:54


          fix: Switch option name and update usage description.

ec91e62


          fix: Remove unnecessary control flow options.

e747db4


          fix: Rename option more intuitively.

505cf83


          docs: Reorganize help message.

137577d


          fix: Added missing parentheses.

898b940


          feat: Add check for simultaneity of exclusive options.

eb19e4b


          fix: Changed output slurm directory names to be unique for steps 1 and 2 of QA.

5f43d1e


          fix: Remove extraneous comments.

dcdfae9


          fix: Let slurm handle logging.

1d6c484


          fix: Replace continue with exit since not in loop.

e25670c


          docs: Add instructions for running detector timeline jobs.

dc4d422


          fix: Remove if block remnant.

80a81ca


          fix: Remove unnecessary log file definition.

22508a0


          fix: Correctly set slurm directory path.

5acfc0e

Collaborator Author

mfmceneaney commented Mar 24, 2025

@c-dilks Thanks for looking through this in detail! I made some more commits in response to your comments but let me know if anything else looks awry.

c-dilks added this to the v3 milestone

c-dilks mentioned this pull request

fix: OutOfMemoryError for FTOF and EPICS #152

Merged

3 tasks

Labels

None yet