Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Consider consolidating conda solves in CI tests #22

Open
2 of 18 tasks
charlesbluca opened this issue Feb 21, 2024 · 14 comments
Open
2 of 18 tasks

Consider consolidating conda solves in CI tests #22

charlesbluca opened this issue Feb 21, 2024 · 14 comments
Assignees
Labels
epic improvement Improves an existing functionality

Comments

@charlesbluca
Copy link
Member

charlesbluca commented Feb 21, 2024

Currently (using cuML as an example here), the conda test environment initialization for most CI jobs looks something like creating the test environment:

rapids-dependency-file-generator \
  --output conda \
  --file_key test_python \
  --matrix "cuda=${RAPIDS_CUDA_VERSION%.*};arch=$(arch);py=${RAPIDS_PY_VERSION}" | tee env.yaml

rapids-mamba-retry env create --force -f env.yaml -n test

And then downloading and installing build artifacts from previous jobs on top of this environment:

CPP_CHANNEL=$(rapids-download-conda-from-s3 cpp)
PYTHON_CHANNEL=$(rapids-download-conda-from-s3 python)

...

rapids-mamba-retry install \
  --channel "${CPP_CHANNEL}" \
  --channel "${PYTHON_CHANNEL}" \
  libcuml cuml

In addition to forcing us to eat the cost of a second conda environment solve, in many cases this can cause some pretty drastic changes to the environment which can be blocking - for example, consider this cuML run which fails because conda is unable to solve a downgrade from Arrow 15.0.0 (build 5) to 14.0.1.

Our current workaround for this is to manually add pinnings to the testing dependencies initially solved such that the artifact installation can be solved, but this can introduce a lot of burden in needing to:

  • identify what packages/changes are blocking artifact installation
  • open PR(s) modifying the impacted repos
  • follow up on each impacted repo to potentially remove the pinning later on

Would it be possible to consolidate some (or all) of these conda environment solves by instead:

  1. downloading the conda artifacts before creating the environment
  2. updating the dependencies.yaml (or, if this isn't possible, the generated conda environment file) to include the desired packages, making sure to explicitly specify source channel to ensure we're picking up the build artifacts
  3. creating the environment with this patched file

In my mind, the main blocker I could see to this working would be if rapids-download-conda-from-s3 requires some conda packages contained in the testing environment to work.

EDIT: Updating to capture state of other projects:

@msarahan
Copy link

I don't fully understand how much you need to be rebuilding, but if the answer is "not much" then maybe you could use https://conda.github.io/conda-pack/ or consider building a docker image instead of re-creating the environment each time?

@bdice
Copy link
Contributor

bdice commented Feb 22, 2024

I am proposing a flag --prepend-channels for rapids-dependency-file-generator that will enable a one-step conda environment creation. This would let us use local channels with CI artifacts and would eliminate the second conda solve used to install CI artifacts. rapidsai/dependency-file-generator#67

@bdice
Copy link
Contributor

bdice commented Feb 22, 2024

consider building a docker image instead of re-creating the environment each time?

This is a possible solution, but we'd need to maintain a separate Docker image for every repository and every test matrix configuration (e.g. a container containing the cuDF Python test environment on ARM using Python 3.9 and CUDA 12.2). I don't think we have the infrastructure to manage that while also ensuring that the conda environments are regularly updated (every night?) for all the possible containers so that we are aware of upstream changes that affect us. The workflows we have today would require an explosively large number of large containers (many of the containers would need to contain an entire CTK installed by conda).

This would also still require us to have a conda environment in that container that is compatible with the CI artifacts that we are installing. This doesn't avoid the "can't downgrade Arrow 15" problem because we must install the CI artifacts (conda packages built from the current PR) into the container.

@raydouglass
Copy link
Member

The docker repo suffers from a similar problem and it was (partly) mitigated by dynamically editing the environment file: https://github.com/rapidsai/docker/blob/branch-24.04/context/notebooks.sh#L31, so just throwing that out there.

@vyasr
Copy link
Contributor

vyasr commented Feb 26, 2024

Once rapidsai/cuml#5781 is merged, do we plan to roll out the same changes to the rest of RAPIDS? IIUC that's what it would take to close this issue.

@jakirkham
Copy link
Member

Seems reasonable

We may want ensure that we are always getting cached CI packages too, which would close out issue: #14

@bdice
Copy link
Contributor

bdice commented Feb 26, 2024

@vyasr Yes, my plan is to convert all of RAPIDS to this single-solve approach.

@jakirkham The approach I'm taking in rapidsai/cuml#5781 will pin the major.minor versions of the packages as shown below. That isn't a full solution for #14 because nightlies of the same major.minor could still be selected, but it should help constrain the solver and reduce the likelihood of problems. If you have further suggestions please weigh in on #14 or on that cuML PR (if the code suggestions are small) and I'll adopt your proposals in a follow-up change.

  test_libcuml:
    common:
      - output_types: conda
        packages:
          - libcuml==24.4.*
          - libcuml-tests==24.4.*
  test_cuml:
    common:
      - output_types: conda
        packages:
          - libcuml==24.4.*
          - cuml==24.4.*

@jakirkham
Copy link
Member

jakirkham commented Feb 26, 2024

Ok it sounds like that is mostly an issue when we are getting close to (or in the midst of) a release. Is that right?

Agree that could be an issue, but that is a much smaller issue than picking up the prior release (especially well after it went out)

@bdice
Copy link
Contributor

bdice commented Feb 26, 2024

Yeah, I think the problematic case would be something like changing the dependencies of a package in a way that the conda packages from CI artifacts are not compatible with the rest of the testing environment, and the solver chooses recent nightlies instead of the PR artifacts. @charlesbluca proposed we could pin the exact channels like ${CPP_CHANNEL}::libcuml but we'd have to find a way to insert that channel information into the package's data in dependencies.yaml since the channel (a local filepath) isn't known ahead of time.

rapids-bot bot pushed a commit to rapidsai/cuml that referenced this issue Feb 26, 2024
This PR creates a conda environment containing both test dependencies and cuml packages. This is a workaround for some issues seen with conda being unable to downgrade from Arrow 15 (in the initial environment with test dependencies) to Arrow 14 (currently pinned by cudf, which is a dependency of cuml).

This is a partial solution for rapidsai/build-planning#22. (More work is needed for other RAPIDS repos.)

Depends on rapidsai/dependency-file-generator#67.

Authors:
  - Bradley Dice (https://github.com/bdice)

Approvers:
  - Charles Blackmon-Luca (https://github.com/charlesbluca)
  - GALI PREM SAGAR (https://github.com/galipremsagar)
  - Vyas Ramasubramani (https://github.com/vyasr)
  - Jake Awe (https://github.com/AyodeAwe)
  - https://github.com/jakirkham
  - Dante Gama Dessavre (https://github.com/dantegd)

URL: #5781
@vyasr
Copy link
Contributor

vyasr commented Feb 27, 2024

An additional minor benefit is that this will speed up CI by reducing the number of solves.

@bdice
Copy link
Contributor

bdice commented Mar 1, 2024

An additional minor benefit is that this will speed up CI by reducing the number of solves.

This might not have a large impact on CI runtime. See rapidsai/cudf#15201 (comment). tl;dr, test startup time is dominated by network activity (downloading multiple GB of packages from RAPIDS and CTK dependencies) and the potential conda solve time savings are comparatively small.

@msarahan msarahan self-assigned this May 22, 2024
@msarahan
Copy link

msarahan commented May 22, 2024

Taking this on with high priority because cuspatial CI is blocked, and this might help there.

rapids-bot bot pushed a commit to rapidsai/cuspatial that referenced this issue May 24, 2024
@bdice
Copy link
Contributor

bdice commented Jun 4, 2024

@msarahan Is there a plan to continue on this work? We have implemented it for cuml and cuspatial but nothing else. It would be good to have alignment across RAPIDS to prevent future issues.

@msarahan msarahan added improvement Improves an existing functionality epic labels Jun 19, 2024
@msarahan
Copy link

Added task list to parent issue. We'll work on this as time allows.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
epic improvement Improves an existing functionality
Projects
None yet
Development

No branches or pull requests

6 participants