-
Notifications
You must be signed in to change notification settings - Fork 49
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Bump based CUDA image to ubuntu24.04 #1166
base: main
Are you sure you want to change the base?
Conversation
# Create a system-wide venv for the pip-installed world inside the containers | ||
RUN python -m venv --prompt jax /opt/venv && /opt/venv/bin/pip install --ignore-installed --no-cache-dir -e /opt/pip pip-tools | ||
# Make sure `python` refers to the venv version | ||
ENV PATH=/opt/venv/bin:${PATH} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this sufficient instead of source /opt/venv/bin/activate
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, it. The search mechanism starts checking the PATH one at a time till first match. So, even if there is another python in the PATH, it picks up the first one:
$> docker run -it --rm ghcr.io/nvidia/jax-toolbox-internal:11967699553-base-amd64 bash -c "which python && which pip"
/opt/venv/bin/python
/opt/venv/bin/pip
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/opt/venv/bin/activate
performs additional actions besides appending to PATH
. Are we sure it is OK to omit these in our containers?
@@ -2,7 +2,7 @@ | |||
|
|||
ARG BASE_IMAGE=ghcr.io/nvidia/jax-mealkit:jax | |||
ARG URLREF_MAXTEXT=https://github.com/google/maxtext.git#main | |||
ARG URLREF_TFTEXT=https://github.com/tensorflow/text.git#v2.13.0 | |||
ARG URLREF_TFTEXT=https://github.com/tensorflow/text.git#master |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this tested? It does not seem to be related to the bump to Ubuntu 24.04.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unfortunately, related. Google merged changes to support wheel build for python3.12 via this PR. For now we have to use master because the last release is 2.18.0 published 1 month ago and the fixing PR was merged 2 weeks ago. Waiting for 2.18+ release to ping it here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note, this make use track master TF-Text instead of a pinned version.
We could pin a commit as an alternative.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note, this make use track master TF-Text instead of a pinned version. We could pin a commit as an alternative.
Our current 'standard' approach is to tracker the master in the Dockerfile, and then use manifest.yaml
and the EXTRA_BUILD_ARGS
mechanism for CI reproducibility.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@DwarKapex could you please add the plumbing for lingvo/tftext version tracking following the example of
JAX-Toolbox/.github/workflows/_ci.yaml
Line 158 in b0e6753
URLREF_PAXML=${{ fromJson(inputs.SOURCE_URLREFS).PAXML }} |
@@ -31,7 +31,7 @@ on: | |||
required: true | |||
DOCKERFILE: | |||
type: string | |||
description: "Dockerfile to use, e.g. .github/container/Dockerfile.t5x.amd64" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why renaming all the amd64 Dockerfiles? Is there a discussion that I missed? 😁
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The difference between amd64 and arm64 is that we build some wheels (like lingo, tf-text) for arm64, but not for amd64. With this PR we build these packages for both archs, so no need to have 2 different Dockerfiles. I remove arm64 version of Dockerfile and extension amd64
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would view the manual building of lingvo/tftext wheel as a temporary remedy that (unfortunately) complicates the Dockerfile. This workaround shouldn't stay forever. We should push for proper wheel builds by upstream and at the meantime keep the simpler Dockerfiles so that they will become the default again hopefully soon.
@olupton desperately need your input here =) |
Some builds are still broken 😜 |
Ubuntu24.04 uses
python-3.12
as a main interpreter. Unfortunately, not all python packages, we use here, has py-3.12 wheel for amd64/arm64, so need to build the following packages from source:Also
python-3.12
added a system-wide protection layer (PEP 668) when install packages usingpip
:The solution is obvious -- install everything into venv (the initial solution was proposed by @olupton).