kubeflow trainer shouldn't expect to live in kubeflow-system #529

Description

@christian-heusel

What happened?

When following the example at https://www.kubeflow.org/docs/components/trainer/getting-started on an installation of the Kubeflow Community Distribution, the following error appears (the runtimes are still listed afterwards):

from kubeflow.trainer import TrainerClient, KubernetesBackendConfig

client = TrainerClient(backend_config=KubernetesBackendConfig(namespace="example"))

for r in client.list_runtimes():
    print(f"Runtime: {r.name}")
Trainer control-plane version info is not available: unable to read 'kubeflow_trainer_version' from ConfigMap 'kubeflow-trainer-public' in namespace 'kubeflow-system': (403)
Reason: Forbidden
HTTP response headers: HTTPHeaderDict({'Audit-Id': 'ede3d47e-0593-40c7-84a7-065651479f8f', 'Cache-Control': 'no-cache, private', 'Content-Type': 'application/json', 'X-Content-Type-Options': 'nosniff', 'X-Kubernetes-Pf-Flowschema-Uid': 'd8d1b44e-a124-4309-9e68-01915f929a6f', 'X-Kubernetes-Pf-Prioritylevel-Uid': 'ce7ac3cf-9053-4526-b182-8f3346ff2ffe', 'Date': 'Thu, 18 Jun 2026 12:43:21 GMT', 'Content-Length': '427'})
HTTP response body: {"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"configmaps \"kubeflow-trainer-public\" is forbidden: User \"system:serviceaccount:example:default-editor\" cannot get resource \"configmaps\" in API group \"\" in the namespace \"kubeflow-system\": Azure does not have opinion for this user.","reason":"Forbidden","details":{"name":"kubeflow-trainer-public","kind":"configmaps"},"code":403}
Runtime: deepspeed-distributed
Runtime: jax-distributed
Runtime: mlx-distributed
Runtime: torch-distributed
Runtime: torchtune-llama3.2-1b
Runtime: torchtune-llama3.2-3b
Runtime: torchtune-qwen2.5-1.5b
Runtime: xgboost-distributed
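
The 403 above says that the service account default-editor in namespace example may not get configmaps in kubeflow-system. One way to express the missing permission is a Role plus RoleBinding limited to that single ConfigMap. The following is only a sketch: the helper name configmap_reader_rbac and the role name trainer-public-configmap-reader are hypothetical, and it assumes the ConfigMap actually exists in the namespace passed in, which may not hold for every distribution.

```python
import json


def configmap_reader_rbac(
    service_account: str,
    sa_namespace: str,
    configmap_namespace: str,
    configmap_name: str = "kubeflow-trainer-public",
    role_name: str = "trainer-public-configmap-reader",  # hypothetical name
):
    """Build a Role and RoleBinding that allow `get` on one ConfigMap."""
    role = {
        "apiVersion": "rbac.authorization.k8s.io/v1",
        "kind": "Role",
        "metadata": {"name": role_name, "namespace": configmap_namespace},
        "rules": [
            {
                "apiGroups": [""],  # core API group, as named in the 403 body
                "resources": ["configmaps"],
                "resourceNames": [configmap_name],
                "verbs": ["get"],
            }
        ],
    }
    binding = {
        "apiVersion": "rbac.authorization.k8s.io/v1",
        "kind": "RoleBinding",
        "metadata": {"name": role_name, "namespace": configmap_namespace},
        "roleRef": {
            "apiGroup": "rbac.authorization.k8s.io",
            "kind": "Role",
            "name": role_name,
        },
        "subjects": [
            {
                "kind": "ServiceAccount",
                "name": service_account,
                "namespace": sa_namespace,
            }
        ],
    }
    return role, binding


# Values taken from the error message above; adjust the ConfigMap
# namespace to wherever the ConfigMap really lives in your cluster.
role, binding = configmap_reader_rbac("default-editor", "example", "kubeflow-system")
manifest = {"apiVersion": "v1", "kind": "List", "items": [role, binding]}
print(json.dumps(manifest, indent=2))
```

The printed JSON can be saved to a file and applied with kubectl apply -f by someone with sufficient cluster permissions. This only silences the warning if the ConfigMap is present in that namespace.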

This is due to the code snippet here:

def verify_backend(self) -> None:
    """Verify that the Trainer control plane exposes version metadata.
    This check only ensures that the public control-plane ConfigMap exists
    and contains a ``kubeflow_trainer_version`` field. It does not
    enforce version compatibility and never raises.
    """
    system_namespace = os.getenv("KUBEFLOW_SYSTEM_NAMESPACE", "kubeflow-system")
    config_map_name = "kubeflow-trainer-public"
    try:
        _ = self.core_api.read_namespaced_config_map(
            name=config_map_name,
            namespace=system_namespace,
        ).data["kubeflow_trainer_version"]
    except Exception as e:  # noqa: BLE001
        logger.warning(
            "Trainer control-plane version info is not available: "
            f"unable to read 'kubeflow_trainer_version' from ConfigMap "
            f"'{config_map_name}' in namespace '{system_namespace}': {e}"
        )
        return
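
Since the snippet above reads the namespace from the KUBEFLOW_SYSTEM_NAMESPACE environment variable and only falls back to kubeflow-system, a possible workaround is to set that variable before the SDK runs the check. The sketch below reproduces just the lookup; resolve_system_namespace is a hypothetical stand-in, not an SDK function, and the value "kubeflow" is an assumption based on the namespace where the controller pods run in this installation.

```python
import os

DEFAULT_SYSTEM_NAMESPACE = "kubeflow-system"


def resolve_system_namespace() -> str:
    """Hypothetical helper mirroring the os.getenv lookup in verify_backend."""
    return os.getenv("KUBEFLOW_SYSTEM_NAMESPACE", DEFAULT_SYSTEM_NAMESPACE)


# Without the variable set, the lookup falls back to the hard-coded default.
os.environ.pop("KUBEFLOW_SYSTEM_NAMESPACE", None)
default_ns = resolve_system_namespace()

# Assumption: the control plane (and its public ConfigMap) lives in
# "kubeflow" on this cluster. Set this before creating the TrainerClient.
os.environ["KUBEFLOW_SYSTEM_NAMESPACE"] = "kubeflow"
overridden_ns = resolve_system_namespace()

print(default_ns, overridden_ns)  # prints: kubeflow-system kubeflow
```

Even with the override, the calling service account still needs permission to get the ConfigMap in the chosen namespace, so the 403 may persist until that access is granted.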

What did you expect to happen?

Following the getting started tutorial should not produce any error.

Environment

Kubernetes version:

$ kubectl version
Client Version: v1.31.6
Kustomize Version: v5.4.2
Server Version: v1.34.2
WARNING: version difference between client (1.31) and server (1.34) exceeds the supported minor version skew of +/-1

Kubeflow Trainer version:

$ kubectl get pods -n kubeflow -l app.kubernetes.io/name=trainer -o jsonpath="{.items[*].spec.containers[*].image}"
ghcr.io/kubeflow/trainer/trainer-controller-manager:v2.2.0

Kubeflow Python SDK version:

$ pip show kubeflow
Name: kubeflow
Version: 0.4.0
Summary: Kubeflow Python SDK to manage ML workloads and to interact with Kubeflow APIs.
Home-page: https://github.com/kubeflow/sdk
Author: 
Author-email: The Kubeflow Authors <kubeflow-discuss@googlegroups.com>
License-Expression: Apache-2.0
Location: /opt/conda/lib/python3.11/site-packages
Requires: kubeflow-katib-api, kubeflow-trainer-api, kubernetes, pydantic
Required-by:

Impacted by this bug?

Give it a 👍. We prioritize the issues with the most 👍.
