Create frozen Conda environments for modules #2193

Open
pinin4fjords opened this issue Feb 24, 2023 · 9 comments

@pinin4fjords
Member

Description of feature

Problem

Conda environments are not reproducible over time. The sometimes large dependency trees mean you get a different software stack next week to the one you have today. This is bad for reproducible science.

The often-used workaround for this has been Docker images, which have the effect of freezing dependency trees. But if you find yourself rebuilding Docker images (e.g. to patch security issues), you lose those frozen dependencies. Some (e.g. Paolo, I think) would say that really, we should be using Docker as a software delivery mechanism only.

A better way is to record the state of the environment whenever a module is created or its conda dependencies are updated, producing a frozen dependencies file that can be used to create environments when the workflows are run.

Available solutions

pythonspeed has an excellent (if not quite up to date) summary of this.

Essentially there are two ways to go.

conda env export

Create the environments, immediately record their state.

  • Advantages: no extra software required
  • Disadvantages: an export only describes the platform it was run on, so it would be difficult for a developer on a single machine to generate the separate environment files required for e.g. macOS and Linux. Maybe it could be done with different machines in CI?
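
A minimal sketch of the export route, assuming a hypothetical naming scheme of one frozen file per platform (`environment.<platform>.lock.yml`) and an example env called `fastqc`. The platform tag is derived from the host, which illustrates why one machine cannot easily produce both the macOS and the Linux file. The helper only builds the `conda env export` command; actually running it is left to the caller.

```python
import platform
import shlex

# Mapping from (OS, CPU) as reported by Python to conda platform tags.
_SUBDIRS = {
    ("Linux", "x86_64"): "linux-64",
    ("Linux", "aarch64"): "linux-aarch64",
    ("Darwin", "x86_64"): "osx-64",
    ("Darwin", "arm64"): "osx-arm64",
}


def conda_subdir(system=None, machine=None):
    """Return the conda platform tag for a host (defaults to the current one)."""
    key = (system or platform.system(), machine or platform.machine())
    try:
        return _SUBDIRS[key]
    except KeyError:
        raise ValueError(f"no conda platform tag known for {key}") from None


def export_command(env_name, subdir):
    """Build the `conda env export` call that freezes an already-created env."""
    # Hypothetical naming scheme: one frozen file per platform.
    out_file = f"environment.{subdir}.lock.yml"
    return ["conda", "env", "export", "--name", env_name, "--file", out_file]


# Show the command a Linux x86_64 CI runner would execute for the example env.
print(shlex.join(export_command("fastqc", conda_subdir("Linux", "x86_64"))))
```

In CI this would have to run once per platform, on a runner of that platform.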

conda-lock

See https://github.com/conda/conda-lock.

  • Advantages:
    • can make multi-platform lock files
    • bypasses the conda solver (you're basically just storing a list of URIs to the package archives), which could speed things up significantly
  • Disadvantages:
    • Requires more software
    • Users would need to install conda-lock to re-create environments at run time.
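
To make the "list of URIs" point concrete, here is a rough sketch that reads a lock file in conda's explicit format: an `@EXPLICIT` marker followed by one package URL per line, optionally with a hash after `#`. The sample content is made up for illustration (package names and hashes are not real), and `parse_explicit_lock` is a hypothetical helper, not part of conda-lock.

```python
SAMPLE_LOCK = """\
# Illustrative content only: package names and hashes are made up.
@EXPLICIT
https://conda.anaconda.org/conda-forge/linux-64/examplelib-1.0-h0000000_0.tar.bz2#00000000000000000000000000000000
https://conda.anaconda.org/bioconda/noarch/exampletool-2.1-pyh0000000_0.tar.bz2#11111111111111111111111111111111
"""


def parse_explicit_lock(text):
    """Return (url, md5) pairs from an explicit-format lock file."""
    entries = []
    seen_marker = False
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # blank lines and comments carry no packages
        if line == "@EXPLICIT":
            seen_marker = True
            continue
        if not seen_marker:
            raise ValueError("package URL found before the @EXPLICIT marker")
        # Everything after the first '#' is the optional hash of the archive.
        url, _, md5 = line.partition("#")
        entries.append((url, md5 or None))
    return entries


for url, md5 in parse_explicit_lock(SAMPLE_LOCK):
    print(url.rsplit("/", 1)[-1], md5)
```

Because every entry is already a concrete archive, nothing is left for a solver to decide when the environment is created.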

How I imagine the tools commands working

I don't know how we might persuade Nextflow itself to create the environments from lock files at run time. So imagine a different sequence:

nf-core modules conda-lock - Runs conda-lock, creating lock files for all required architectures.
nf-core init-locked_envs - Creates environments from the lock files of all the modules in a workflow that have them.

Then, when the workflow is run, the module environments are all recognised as being in place, and off we go. This could work incrementally, such that environments were still created on the fly for modules lacking lock files.
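
The incremental behaviour could be sketched roughly as below. Everything about the layout is an assumption for illustration: a flat `<modules_dir>/<module>/environment.yml`, optionally accompanied by an explicit lock file named `conda-<platform>.lock`. Neither `plan_environments` nor `create_command` exists in nf-core tools.

```python
from pathlib import Path


def plan_environments(modules_dir, subdir):
    """Map each module name to (mode, file) for building its environment.

    Assumed (hypothetical) layout: <modules_dir>/<module>/environment.yml,
    optionally accompanied by conda-<subdir>.lock for each supported platform.
    """
    plan = {}
    for env_file in sorted(Path(modules_dir).glob("*/environment.yml")):
        module_dir = env_file.parent
        lock_file = module_dir / f"conda-{subdir}.lock"
        if lock_file.is_file():
            plan[module_dir.name] = ("locked", lock_file)
        else:
            # No lock file yet: fall back to solving environment.yml as today.
            plan[module_dir.name] = ("solve", env_file)
    return plan


def create_command(env_name, mode, source):
    """Build the conda call for one module; the caller decides whether to run it."""
    if mode == "locked":
        # Explicit lock files are installed with `conda create --file`.
        return ["conda", "create", "--name", env_name, "--file", str(source)]
    return ["conda", "env", "create", "--name", env_name, "--file", str(source)]
```

Modules with a lock file for the current platform get a frozen environment, and the rest keep the current on-the-fly behaviour.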

Potential problems

  • Rebuilding lock files when conda packages were bumped.
  • CI to ensure the above.
  • There may be some overlap with all the new funky Wave stuff
@pinin4fjords
Member Author

See also Paolo's post in #bioconda https://nfcore.slack.com/archives/CM46YC6BZ/p1677007405615889

@pinin4fjords
Member Author

See also discussion

@edmundmiller
Contributor

edmundmiller commented Nov 21, 2023

I believe wave supports conda-lock files now!

My issue would be with the readability of the environment.yml. I kinda just want to see what exactly we want and not the 100 dependencies.

@pinin4fjords
Member Author

@emiller88 maybe we need an environment-lock.yml in addition to the environment.yml? I know, another file, but it would serve the different use cases of complete reproducibility vs flexible environment solve.

Would get messy with different architectures though...

@edmundmiller
Contributor

Maybe a .conda directory to keep it cleaner?

I think it's a trade-off at the end of the day.

If you want to be sure about reproducibility, you use the container images.

If you want to roll the dice, use conda. It'll get you pretty close 95% of the time.

@pinin4fjords
Member Author

See where you're coming from, don't completely agree.

I should be able to inspect the package complement of a frozen software env without poking about in a Docker image, and in an ideal world I'd like to be able to tweak an env to add something simple without rebuilding the whole thing (though since the new thing may have its own deps, I appreciate that's not a given).

@edmundmiller
Contributor

I think this was in a time before tests/ and everything else in a modules directory. I think having both an environment.yml and an environment.lock.yml isn't ridiculous at this point.

My issue is if they'll get updated and maintained.

I think we can automate this now.

Bump the environment.yml -> Create a lock file -> Pass the lock file to wave
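
One way CI could notice that a bump has not been followed by a new lock file is sketched below. The `# source_sha256:` header is a made-up convention for this example, not something conda-lock or Wave writes; the idea is just to stamp the lock file with a digest of the environment.yml it was generated from and compare on every change.

```python
import hashlib

# Hypothetical header line, not a conda-lock convention.
HEADER = "# source_sha256: "


def spec_digest(environment_yml):
    """Digest of the environment.yml content the lock file was built from."""
    return hashlib.sha256(environment_yml.encode("utf-8")).hexdigest()


def stamp_lock(lock_text, environment_yml):
    """Prepend the digest as a comment line, which lock formats ignore."""
    return f"{HEADER}{spec_digest(environment_yml)}\n{lock_text}"


def lock_is_stale(lock_text, environment_yml):
    """True when the lock file was not generated from this environment.yml."""
    lines = lock_text.splitlines()
    first_line = lines[0] if lines else ""
    if not first_line.startswith(HEADER):
        return True  # unstamped lock files are treated as stale
    return first_line[len(HEADER):].strip() != spec_digest(environment_yml)
```

A CI job would fail, or regenerate the lock file, whenever `lock_is_stale` returns True for a module.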

@ewels
Member

ewels commented Jun 17, 2024

maybe we need an environment-lock.yml in addition to the environment.yml

Same as package.json and package-lock.json for npm. This is what I'd expect for conda lockfiles tbh.

Automation as @edmundmiller says FTW 👍🏻

@edmundmiller
Contributor

Made a proof of concept, though I forgot to link it here: nf-core/modules#5827
