feat: Run detector timelines on SLURM #290
base: main
Redid this PR after closing #289 and fixing branch history.
Thanks a lot for doing this; it will be very useful for chefs (and developers), and will significantly speed things up.
I have some comments below, which hopefully aren't too much work to address.
Also I think for now it's great that the default behavior is the same; after we do some more testing we can switch the default to be Slurm.
bin/run-detectors-timelines.sh
Outdated
# start job lists
echo """
Generating job scripts..."""
slurmDir=$TIMELINESRC/slurm
Chefs won't have write access to $TIMELINESRC; use ./slurm instead (for consistency with run-monitoring.sh).
Also, some files created in this directory will overwrite those from step 1, if Slurm is used there too (e.g., job.$dataset.detectors.list). Perhaps an easy way to avoid this conflict is to set slurmDir to be different for the two steps, such as ./slurm/step1 for run-monitoring.sh and ./slurm/step2 here.
I went with the second option, using ./slurm/step1 for run-monitoring.sh and ./slurm/step2 here, but would still like to check that this runs fine.
I ended up having to set the output Slurm directory to $outputDir/slurm/step2 in run-detectors-timelines.sh, since the script changes directories. It might also be wise to use the full path in run-monitoring.sh, even though that script does not change directories in a way that would make ./slurm and $outputDir/slurm differ.
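The reasoning above (a relative ./slurm path silently moves when the script changes directories) can be illustrated with a minimal sketch. The values of `outputDir` and `dataset` below are placeholders for illustration, not the ones used in the actual scripts.

```shell
#!/usr/bin/env bash
# Sketch: build the Slurm directory from an absolute path, so that later
# `cd` calls cannot change where job files are written.
# NOTE: `outputDir` and `dataset` are placeholder values (assumptions).
set -euo pipefail

outputDir=$(mktemp -d)     # stand-in for the real output directory
dataset=example_dataset    # hypothetical dataset name

# Resolve to an absolute path once, before any directory change.
slurmDir=$(cd "$outputDir" && pwd)/slurm/step2
mkdir -p "$slurmDir"

cd /                       # simulate the script changing directories

# The job list still lands under $outputDir/slurm/step2.
jobList=$slurmDir/job.$dataset.detectors.list
: > "$jobList"
echo "$jobList"
```

Using a separate step2 subdirectory also keeps this job list from overwriting a same-named file written by step 1.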
# set classpath
export CLASSPATH=$CLASSPATH
Ok, thanks! Will keep an eye on this.
bin/run-detectors-timelines.sh
Outdated
export CLASSPATH=$CLASSPATH
# produce detector timelines
java $TIMELINE_JAVA_OPTS $run_detectors_script $key $inputDir > $logFile.out 2> $logFile.err || touch $logFile.fail
Slurm will handle the logging, automatically splitting stdout and stderr.
Suggested change:
- java $TIMELINE_JAVA_OPTS $run_detectors_script $key $inputDir > $logFile.out 2> $logFile.err || touch $logFile.fail
+ java $TIMELINE_JAVA_OPTS $run_detectors_script $key $inputDir
You may also remove the logFile=$logDir/$key from a few lines above.
Further below, in the "error checking" part, we'll need to figure out how to read the Slurm error logs, or just tell users to check them for themselves.
If we do end up reading the Slurm error logs, we'll need to use the job ID or something similar, so that when the user runs this script twice, the correct set of log files is used.
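One possible way to tie log files to a job ID, sketched under assumptions: sbatch's `--output` and `--error` options accept the `%j` filename pattern (the job ID), and `sbatch --parsable` prints the job ID on submission. The directory layout, job name, and job script name below are hypothetical, and the job body is a placeholder for the real java command.

```shell
#!/usr/bin/env bash
# Sketch: write a job script whose stdout/stderr files are named by job ID,
# so that two runs of the same script never share log files.
# NOTE: paths, job name, and script name are assumptions for illustration.
set -euo pipefail

slurmDir=$(mktemp -d)      # stand-in for the real Slurm directory
logDir=$slurmDir/log
mkdir -p "$logDir"
jobScript=$slurmDir/job.example.detectors.slurm   # hypothetical file name

# %j is expanded by Slurm to the job ID at run time; $logDir is expanded now.
cat > "$jobScript" << EOF
#!/usr/bin/env bash
#SBATCH --job-name=detector-timelines
#SBATCH --output=$logDir/%j.out
#SBATCH --error=$logDir/%j.err
echo "the real java command would go here"
EOF

# Submit only where Slurm is available; --parsable prints just the job ID
# (optionally followed by ';cluster', which is stripped below).
if command -v sbatch > /dev/null; then
  jobID=$(sbatch --parsable "$jobScript")
  jobID=${jobID%%;*}
  echo "after the job finishes, check $logDir/$jobID.err"
else
  echo "sbatch not found; wrote $jobScript"
fi
```

For job arrays, the `%A` (array master job ID) and `%a` (array task index) patterns play the same role as `%j`.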
For now I removed what you suggested, and in the documentation I tell the user to check for job errors by following the directions in step 1.
And I also removed the extra log file definition.
bin/run-detectors-timelines.sh
Outdated
--focus-timelines only produce the detector timelines, do not run detector QA code
--focus-qa only run the QA code (assumes you have detector timelines already)
--run-slurm run timelines on SLURM instead of running multi-threaded locally
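As a rough illustration of how an opt-in flag like `--run-slurm` can leave the default behavior unchanged, here is a minimal parsing sketch. The function name `parse_mode` and the mode strings are hypothetical and not taken from the script under review.

```shell
#!/usr/bin/env bash
# Sketch: select the execution mode from the command line, keeping local
# multi-threaded execution as the default unless --run-slurm is given.
# NOTE: `parse_mode` and the mode names are assumptions for illustration.
set -euo pipefail

parse_mode() {
  local mode=local
  local arg
  for arg in "$@"; do
    case $arg in
      --run-slurm) mode=slurm ;;   # opt in to Slurm
    esac
  done
  echo "$mode"
}

parse_mode                                # prints "local"
parse_mode --focus-timelines --run-slurm  # prints "slurm"
```

Switching the default to Slurm later would then only require changing the initial value of `mode`.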
Could you also update the documentation to explain how to use these new options?
- doc/chef_guide.md: supposed to be as terse as possible
- doc/procedure.md: where you don't have to be terse (in fact, chef_guide.md was created because procedure.md was too verbose...)
Just added a paragraph in both of these files, but let me know if I was too verbose in the chefs' documentation.
@c-dilks Thanks for looking through this in detail! I made some more commits in response to your comments, but let me know if anything else looks awry.