What happened?
When following the example at https://www.kubeflow.org/docs/components/trainer/getting-started on an installation from the Kubeflow Community Distribution, listing the runtimes logs the following error before the results are printed:
from kubeflow.trainer import TrainerClient, KubernetesBackendConfig
client = TrainerClient(backend_config=KubernetesBackendConfig(namespace="example"))
for r in client.list_runtimes():
    print(f"Runtime: {r.name}")
Trainer control-plane version info is not available: unable to read 'kubeflow_trainer_version' from ConfigMap 'kubeflow-trainer-public' in namespace 'kubeflow-system': (403)
Reason: Forbidden
HTTP response headers: HTTPHeaderDict({'Audit-Id': 'ede3d47e-0593-40c7-84a7-065651479f8f', 'Cache-Control': 'no-cache, private', 'Content-Type': 'application/json', 'X-Content-Type-Options': 'nosniff', 'X-Kubernetes-Pf-Flowschema-Uid': 'd8d1b44e-a124-4309-9e68-01915f929a6f', 'X-Kubernetes-Pf-Prioritylevel-Uid': 'ce7ac3cf-9053-4526-b182-8f3346ff2ffe', 'Date': 'Thu, 18 Jun 2026 12:43:21 GMT', 'Content-Length': '427'})
HTTP response body: {"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"configmaps \"kubeflow-trainer-public\" is forbidden: User \"system:serviceaccount:example:default-editor\" cannot get resource \"configmaps\" in API group \"\" in the namespace \"kubeflow-system\": Azure does not have opinion for this user.","reason":"Forbidden","details":{"name":"kubeflow-trainer-public","kind":"configmaps"},"code":403}
Runtime: deepspeed-distributed
Runtime: jax-distributed
Runtime: mlx-distributed
Runtime: torch-distributed
Runtime: torchtune-llama3.2-1b
Runtime: torchtune-llama3.2-3b
Runtime: torchtune-qwen2.5-1.5b
Runtime: xgboost-distributed
This is due to the following code (sdk/kubeflow/trainer/backends/kubernetes/backend.py, lines 63 to 85 in c247a8d):
def verify_backend(self) -> None:
    """Verify that the Trainer control plane exposes version metadata.

    This check only ensures that the public control-plane ConfigMap exists
    and contains a ``kubeflow_trainer_version`` field. It does not
    enforce version compatibility and never raises.
    """

    system_namespace = os.getenv("KUBEFLOW_SYSTEM_NAMESPACE", "kubeflow-system")
    config_map_name = "kubeflow-trainer-public"

    try:
        _ = self.core_api.read_namespaced_config_map(
            name=config_map_name,
            namespace=system_namespace,
        ).data["kubeflow_trainer_version"]
    except Exception as e:  # noqa: BLE001
        logger.warning(
            "Trainer control-plane version info is not available: "
            f"unable to read 'kubeflow_trainer_version' from ConfigMap "
            f"'{config_map_name}' in namespace '{system_namespace}': {e}"
        )
        return
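The behaviour can be reproduced without a cluster. The sketch below is a minimal stand-in, not the SDK code: `ForbiddenError`, `FakeCoreApi`, and the module-level `verify_backend` function are hypothetical names that mimic the method above with a stub client that always fails with a 403-like error. It illustrates why the message is logged while the runtimes are still listed afterwards: the check catches every exception, logs a warning, and returns.

```python
import logging
import os

logger = logging.getLogger("trainer_sketch")


class ForbiddenError(Exception):
    """Hypothetical stand-in for the 403 error raised by the Kubernetes client."""


class FakeCoreApi:
    """Hypothetical stub: every ConfigMap read is rejected, as in the report."""

    def read_namespaced_config_map(self, name, namespace):
        raise ForbiddenError(
            f"(403) configmaps '{name}' is forbidden in namespace '{namespace}'"
        )


def verify_backend(core_api) -> bool:
    """Mimic the quoted check; return True only if the version key was readable."""
    system_namespace = os.getenv("KUBEFLOW_SYSTEM_NAMESPACE", "kubeflow-system")
    config_map_name = "kubeflow-trainer-public"
    try:
        _ = core_api.read_namespaced_config_map(
            name=config_map_name,
            namespace=system_namespace,
        ).data["kubeflow_trainer_version"]
    except Exception as e:  # broad on purpose, like the quoted code
        logger.warning(
            "Trainer control-plane version info is not available: "
            "unable to read 'kubeflow_trainer_version' from ConfigMap "
            f"'{config_map_name}' in namespace '{system_namespace}': {e}"
        )
        return False
    return True


# The call logs a warning and returns instead of raising.
result = verify_backend(FakeCoreApi())
print(result)
```

Unlike the SDK method, which returns `None`, the sketch returns a boolean so the outcome can be inspected.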
What did you expect to happen?
Following the getting started tutorial should not produce an error.
Environment
Kubernetes version:
$ kubectl version
Client Version: v1.31.6
Kustomize Version: v5.4.2
Server Version: v1.34.2
WARNING: version difference between client (1.31) and server (1.34) exceeds the supported minor version skew of +/-1
Kubeflow Trainer version:
$ kubectl get pods -n kubeflow -l app.kubernetes.io/name=trainer -o jsonpath="{.items[*].spec.containers[*].image}"
ghcr.io/kubeflow/trainer/trainer-controller-manager:v2.2.0
Kubeflow Python SDK version:
$ pip show kubeflow
Name: kubeflow
Version: 0.4.0
Summary: Kubeflow Python SDK to manage ML workloads and to interact with Kubeflow APIs.
Home-page: https://github.com/kubeflow/sdk
Author:
Author-email: The Kubeflow Authors <kubeflow-discuss@googlegroups.com>
License-Expression: Apache-2.0
Location: /opt/conda/lib/python3.11/site-packages
Requires: kubeflow-katib-api, kubeflow-trainer-api, kubernetes, pydantic
Required-by:
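One detail in the environment above may be relevant: the Trainer controller pods were queried in the `kubeflow` namespace, while the SDK check defaults to `kubeflow-system`. Since the quoted code reads `KUBEFLOW_SYSTEM_NAMESPACE` at the time the check runs, setting that variable before creating the client might point the lookup at a different namespace. This is an untested assumption: the report does not confirm whether the `kubeflow-trainer-public` ConfigMap exists in `kubeflow` in this distribution, or whether the `default-editor` service account may read it there. `resolve_system_namespace` is a hypothetical helper that only mirrors the `os.getenv` call.

```python
import os


def resolve_system_namespace() -> str:
    """Hypothetical helper mirroring the os.getenv lookup in the quoted SDK code."""
    return os.getenv("KUBEFLOW_SYSTEM_NAMESPACE", "kubeflow-system")


# Without the variable, the lookup falls back to the SDK default.
os.environ.pop("KUBEFLOW_SYSTEM_NAMESPACE", None)
default_namespace = resolve_system_namespace()

# Assumption: the control plane lives in "kubeflow", as the pod query suggests.
os.environ["KUBEFLOW_SYSTEM_NAMESPACE"] = "kubeflow"
overridden_namespace = resolve_system_namespace()

print(default_namespace, overridden_namespace)
```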
Impacted by this bug?
Give it a 👍 We prioritize the issues with most 👍