Skip to content

Conversation

mfmceneaney
Copy link
Collaborator

No description provided.

@mfmceneaney
Copy link
Collaborator Author

Redid this PR after closing #289 and fixing branch history.

@mfmceneaney mfmceneaney requested a review from c-dilks March 7, 2025 16:29
@c-dilks c-dilks linked an issue Mar 12, 2025 that may be closed by this pull request
Copy link
Member

@c-dilks c-dilks left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks a lot for doing this; it will be very useful for chefs (and developers), and will significantly speed things up.

I have some comments below, which hopefully aren't too much work to address.

Also I think for now it's great that the default behavior is the same; after we do some more testing we can switch the default to be Slurm.

# start job lists
echo """
Generating job scripts..."""
slurmDir=$TIMELINESRC/slurm
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Chefs won't have write access to $TIMELINESRC, use ./slurm instead (for consistency with run-monitoring.sh)

Also, some files created in this directory will overwrite those from step 1, if Slurm is used there too (e.g., job.$dataset.detectors.list). Perhaps an easy way to avoid this conflict is to set slurmDir to be different for the two steps, such as ./slurm/step1 for run-monitoring.sh and ./slurm/step2 here.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I went with the second option using ./slurm/step1 for and ./slurm/step2 but would still like to check that this runs fine.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I ended up having to set the output slurm directory to $outputDir/slurm/step2 in run-detector-timelines.sh since the script changes directories. It might also be wise to use this full path in run-monitoring.sh too even though that script does not change directories in a way that it would affect whether ./slurm or $outputDir/slurm would be identical.

Comment on lines +329 to +330
# set classpath
export CLASSPATH=$CLASSPATH
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

$CLASSPATH is necessary for now, but is removed in #293. Depending on whether we merge this PR or #293 first, we'll need to remember to deal with this (though if we forget, the script will just fail, reporting $CLASSPATH as unbound).

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, thanks! Will keep an eye on this.

export CLASSPATH=$CLASSPATH
# produce detector timelines
java $TIMELINE_JAVA_OPTS $run_detectors_script $key $inputDir > $logFile.out 2> $logFile.err || touch $logFile.fail
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Slurm will handle the logging, automatically splitting stdout and stderr.

Suggested change
java $TIMELINE_JAVA_OPTS $run_detectors_script $key $inputDir > $logFile.out 2> $logFile.err || touch $logFile.fail
java $TIMELINE_JAVA_OPTS $run_detectors_script $key $inputDir

You may also remove the logFile=$logDir/$key from a few lines above.

Later below, in the "error checking" part, we'll need to figure out how to read the Slurm error logs... or just tell the user to check them for themselves...

If we do end up reading the Slurm error logs, we'll need to use the job ID or something, so in the case where the user runs this script on twice, the correct set of log files is used.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For now I just removed what you suggested and in the documentation I just told the user to check for the job errors following the directions in step 1.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And I also removed the extra log file definition.

--focus-timelines only produce the detector timelines, do not run detector QA code
--focus-qa only run the QA code (assumes you have detector timelines already)
--run-slurm run timelines on SLURM instead of running multi-threaded locally
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you also update the documentation how to use these new options?

  • doc/chef_guide.md: supposed to be as terse as possible
  • doc/procedure.md: where you don't have to be terse (in fact, chef_guide.md was created because procedure.md was too verbose...)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just added a paragraph in both of these files, but let me know if I was too verbose in the chefs' documentation.

@mfmceneaney
Copy link
Collaborator Author

@c-dilks Thanks for looking through this in detail! I made some more commits in response to your comments but let me know if anything else looks awry.

@c-dilks c-dilks added this to the v3 milestone Mar 30, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
Status: No status
Development

Successfully merging this pull request may close these issues.

run Step 2 on SLURM
2 participants