-
Notifications
You must be signed in to change notification settings - Fork 40
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
job_manager: Add htcondorVC3 job manager #251
base: maint-0.6
Are you sure you want to change the base?
Conversation
Added htcondor submission for workflow steps. No monitoring of any file transfers is implemented yet.
Add Cody changes for condor submission
…ller Conflicts: reana_job_controller/rest.py Merged upstream changes. Commit condor changes. Updated to most recent rjc. Added delete job for condor.
Update RJC
Still a lot of room for improvement.
…ller into cody_dev Conflicts: reana_job_controller/rest.py setup.py Merge khurtado's RJC with reanahub's current master. Conflicts involved changing the hardcoded paths within rjc/rest.py. Setup.py didn't like the added 'htcondor' requirement.
Create first instance of HTCondor's job manager There are a few VC3 specifics within. Testing is still needed.
Added ability to call htcondor_job_manager over the already defined kubernetes_job_manager.
- Implement watch jobs method for condor.
Changed how the htcondor job wrapper searches for a container. Instead of assuming singularity now searches through a list to be expanded later. Singularity is still hardcoded with arguments.
Erroneously left out shifter in module search list, fixed incorrect usage of cntr_path in job_wrapper.
To account for environments with bash versions > 4.0 associative arrays in job_wrapper for htcondor were removed and replaced with simple arrays. Adjusted call to modules within job_wrapper due to failure on Blue Waters.
First implementation to search for shifter
…ller into reanahub-master Conflicts: setup.py
Should only apply to tag 0.6.0
Removed an older method for passing the htcondor schedd address to the container, instead of at build time we moved it long ago to within the yaml configuration file for reana-cluster
Bump HTCondor installed through pip from 8.9.1 to 8.9.6
condor.py was used in the past before the creation of job_manager classes.
reana_job_controller/variables.py
Outdated
|
||
import os | ||
|
||
MAX_JOB_RESTARTS = 3 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If these variables are used only in VC3 context, it would be good to name the file more specifically. Examples:
- Do we need to have them as separate file? If not, consider moving into htcondorvc3 class.
- Shall they be reused for other VC3 backends than just HTCondor? If yes, leave them in separate file named like
vc3_config.py
orvc3_variables.py
. - We can always use
vc3_...
name prefix technique for other possible VC3-specific files such as job wrapper. Currently it lives infiles/job_wrapper.sh
, it may be better to use something likescripts/vc3/job_wrapper.sh
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@khurtado any thoughts on if we'd like to move these or prefix the naming?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looking into it, it seems we do not use these 2 anymore:
reana-job-controller/reana_job_controller/variables.py
Lines 16 to 17 in 0e76609
SHARED_VOLUME_PATH = os.getenv('SHARED_VOLUME_PATH', '/var/reana') | |
SHARED_VOLUME_PATH_ROOT = os.getenv('SHARED_VOLUME_PATH_ROOT', SHARED_VOLUME_PATH) |
So, I think we can simply remove variables.py
, get rid of this line and set MAX_JOB_RESTARTS = 3
below the import calls in htcondorvc3_job_manager.py
reana_job_controller/variables.py
Outdated
# -*- coding: utf-8 -*- | ||
# | ||
# This file is part of REANA. | ||
# Copyright (C) 2017, 2018 CERN. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great to see your contribution!
- Please amend copyright years.
- Please add yourself and @khurtado to the author list in
AUTHORS.rst
. - Regarding merging, we usually merge feature branches with linear history. There are basically two options: (1) Shall we rebase and keep the history? (2) Shall we squash everything under one commit? What would you prefer?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll add us to AUTHORS.rst
!
Regarding merging, we are fine with whatever would be easier for the REANA team. If squashing everything under one commit is easier for you then we are fine with that, the history of our commits is not as important to us as integrating them.
files/job_wrapper.sh
Outdated
#!/bin/bash | ||
|
||
# Replicate input files directory structure | ||
# @TODO: This could be executed |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We have:
etc/job_wrapper.sh
files/job_wrapper.sh
It is not clear which files are for what.
What about using:
etc/htcondorcern/job_wrapper.sh
etc/htcondorvc3/job_wrapper.sh
so that we can easily support several backends and distinguish between them?
This would mean that we'd have to move quite a few CERN-specific files from etc/*
to /etc/htcondorcern/*` ourselves... but merging this branch will be quite a lot of work anyway, so could perhaps do it alongside?!
(An alternative would be to use the same directory but strictly "cern" and "vc" naming prefix everywhere.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like creating different subdirectories in etc:
etc/htcondorvc3/job_wrapper.sh
etc/htcondorcern/job_wrapper.sh
As you said, handling several different backends would be easier to manage and visualize than prefixes or suffixes (job_wrapper_cern.sh / cern_job_wrapper.sh
).
If you're OK with the subdirectories in etc/
, I'd be happy to help move CERN specific files as well as the VC3 files.
Dockerfile
Outdated
@@ -77,5 +80,6 @@ EXPOSE 5000 | |||
|
|||
ENV COMPUTE_BACKENDS $COMPUTE_BACKENDS | |||
ENV FLASK_APP reana_job_controller/app.py | |||
ENV REANA_LOG_LEVEL DEBUG |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It should not be necessary to hard-code REANA_LOG_LEVEL
in Dockerfile
. One could pass it as environment variable...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just to add a bit more information on this: the way of passing this configuration is a bit cumbersome right now as one would have to: go inside RWC, read the config value and pass it as an environment variable to RJC. In reanahub/reana#277 there is a description of the current process and possible solutions we could take to centralise these configuration variables.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes this should have been taken out, I missed this when cleaning things up.
def __init__(self, docker_img=None, cmd=None, prettified_cmd=None, | ||
env_vars=None, workflow_uuid=None, workflow_workspace=None, | ||
cvmfs_mounts='false', shared_file_system=False, | ||
job_name=None, kerberos=False, kubernetes_uid=None): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We have added in the meantime support for running unpacked Singularity images from CVMFS. (See unpacked_img
in master
branch.) Is this something that would interest you for VC3 integration as well? If yes, this will be something to look into. If not, we can either raise NotImplemented
or pass
while merging. Asking just for the future in order to know how to propagate v0.6.0 -> master
changes to the HTCondorJobManagerVC3
class the best.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In the current moment we haven't explored CVMFS very much as we are targeting mainly National Science Foundation HPC sites here in the US which will not support it, so for now when merging a raise NotImplemented
may be best.
Dockerfile
Outdated
pip install --upgrade pip && \ | ||
pip install htcondor==8.9.6 retrying |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need to have these libraries installed at this specific point in the Dockerfile, maybe because of the deb packages? Otherwise, we could just continue setting them in setup.py
so they will be installed at this point, and later available in the code .
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You're right, these should be handled in setup.py
, I'll clean these up.
reana_job_controller/config.py
Outdated
# How is this set in the environment? | ||
# It is hardcoded in Dockerfile and there is no code | ||
# in workflow-controller to override that for the job-controller. | ||
#SUPPORTED_COMPUTE_BACKENDS = os.getenv('COMPUTE_BACKENDS', | ||
# DEFAULT_COMPUTE_BACKEND).split(",") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Very good point for improvement 👍. The way this is set is a bit shady right now as one could pass it in REANA-Workflow-Controller with the process described in https://github.com/reanahub/reana-job-controller/pull/251/files#r416428666, but actually this is set at build time because that knowledge is used to selectively install the correct packages.
So when we build images for reana.cern.ch
we use the Makefile like:
BUILD_ARGUMENTS="COMPUTE_BACKENDS=htcondorcern,slurmcern,kubernetes" BUILD_TYPE=release make build
What is happening behind the scenes is that we pass these build args to REANA-Job-Controller docker build
, which results in the final image having this environment variable.
Side note, as it might be of interest here: Then we take this image and we name it reanahub/reana-job-controller-htcondorcern-slurmcern
, not very pretty but this is what we have for now :) (see previous discussions here).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the note Diego, it's very helpful! We'll be sure to follow that from now on. Perhaps we've been 'cheating' while building our images, we have simply been building them straight from the dockerfile
within the RJC repo rather than through the reanahub/reana
repo. I suppose this is a bad habit of ours.
Rename MAX_JOB_RESTARTS to MAX_NUM_RETRIES to make it consistent with the HTCondor CERN implementation.
To better organize job wrappers, files/job_wrapper has been moved to etc/htcondorvc3/job_wrapper.sh
Increase number of default retries.
Fix issue with krb5 dialog
Increase parrot timeout
htcondor\_submit.py no longer used
Remove old settings
Update default backend
…pying whole workflow dir
Change s3 repository to get parrot
Integrating the VC3 job manager class for 0.6 branch.