Conversation

harche
@harche harche commented Oct 1, 2025

Tested with Claude Code:

> what is the version of cgroup the node harpatilcluster000a7-m9vgd-worker-a-6jz6f is using? 

⏺ I'll check the cgroup version for the specified node.

⏺ kubernetes-mcp-server - Nodes: Debug Exec (MCP)(node: "harpatilcluster000a7-m9vgd-worker-a-6jz6f",
                                                 command: ["stat","-fc","%T","/sys/fs/cgroup/"])
  ⎿  cgroup2fs                                                                                       

⏺ cgroup v2
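
For reference, the mapping the agent applied to the `stat -fc %T /sys/fs/cgroup/` output can be sketched in Go. This is a minimal illustration, not code from this PR: the function name `cgroupVersion` is hypothetical, and it assumes the usual convention that `cgroup2fs` indicates cgroup v2 and `tmpfs` indicates cgroup v1.

```go
package main

import (
	"fmt"
	"strings"
)

// cgroupVersion (hypothetical helper) maps the filesystem type printed by
// `stat -fc %T /sys/fs/cgroup/` to a cgroup version label.
func cgroupVersion(statOutput string) string {
	switch strings.TrimSpace(statOutput) {
	case "cgroup2fs":
		return "v2"
	case "tmpfs":
		// On cgroup v1 hosts /sys/fs/cgroup is typically a tmpfs mount.
		return "v1"
	default:
		return "unknown"
	}
}

func main() {
	// The tool output above was "cgroup2fs".
	fmt.Println(cgroupVersion("cgroup2fs\n"))
}
```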

@openshift-ci openshift-ci bot requested a review from ardaguclu October 1, 2025 19:24
openshift-ci bot commented Oct 1, 2025

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: harche
Once this PR has been reviewed and has the lgtm label, please assign ardaguclu for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@harche
Author

harche commented Oct 1, 2025

/hold

@openshift-ci openshift-ci bot added the do-not-merge/hold label Oct 1, 2025
@harche
Author

harche commented Oct 1, 2025

/hold cancel

@openshift-ci openshift-ci bot removed the do-not-merge/hold label Oct 1, 2025
@harche
Author

harche commented Oct 1, 2025

/hold for fixing CI issues.

@openshift-ci openshift-ci bot added the do-not-merge/hold label Oct 1, 2025
@ardaguclu
Member

I'll defer the review to:

@manusa @Cali0707 @matzew

@harche harche changed the title from "Add node debug tool with tests" to "WIP: Add node debug tool with tests" Oct 2, 2025
@openshift-ci openshift-ci bot added the do-not-merge/work-in-progress label Oct 2, 2025
@harche
Author

harche commented Oct 2, 2025

Failures in the linter and security jobs are not related to the changes in this PR. The linter failures are being addressed in #39, and the fixes for the security failures are in #40.

@harche harche changed the title from "WIP: Add node debug tool with tests" to "Add node debug tool with tests" Oct 2, 2025
@openshift-ci openshift-ci bot removed the do-not-merge/work-in-progress label Oct 2, 2025
@Cali0707 Cali0707 left a comment

Thanks for working on this @harche

The code looks good overall, left a few comments throughout

I'll test this on an OpenShift cluster tmrw

internalk8s "github.com/containers/kubernetes-mcp-server/pkg/kubernetes"
)

func initNodes(_ internalk8s.Openshift) []api.ServerTool {

Nit: maybe we can drop the internalk8s.Openshift parameter here since we don't need it? For the initXYZ methods, there's no requirement on this parameter existing - it seems to be present in only some of them

Suggested change
func initNodes(_ internalk8s.Openshift) []api.ServerTool {
func initNodes() []api.ServerTool {

Author

Addressed, thanks.

//
// When namespace is empty, the configured namespace (or "default" if none) is used. When image is empty the
// default debug image is used. Timeout controls how long we wait for the pod to complete.
func (k *Kubernetes) NodesDebugExec(

Would it be possible to split this function up a little bit? IMO it is getting quite large and is responsible for too much

Maybe we can create functions that:

  1. create the debug pod
  2. poll for debug completion
  3. Retrieve the logs
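
A minimal sketch of what that three-way split could look like. Everything here is illustrative rather than the PR's actual code: the `podRunner` interface, the `fakeRunner` stand-in, the helper names, and the fixed poll interval are all assumptions made so the example runs without a cluster.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// podRunner is a hypothetical abstraction over the pod operations the tool
// needs; it is not an interface from this PR or from client-go.
type podRunner interface {
	Create(ctx context.Context, node string, command []string) (string, error)
	Phase(ctx context.Context, pod string) (string, error)
	Logs(ctx context.Context, pod string) (string, error)
	Delete(ctx context.Context, pod string) error
}

// Step 1: create the debug pod and return its name.
func createDebugPod(ctx context.Context, r podRunner, node string, command []string) (string, error) {
	return r.Create(ctx, node, command)
}

// Step 2: poll until the pod reaches a terminal phase or ctx expires.
func waitForCompletion(ctx context.Context, r podRunner, pod string, interval time.Duration) (string, error) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		phase, err := r.Phase(ctx, pod)
		if err != nil {
			return "", err
		}
		if phase == "Succeeded" || phase == "Failed" {
			return phase, nil
		}
		select {
		case <-ctx.Done():
			return "", ctx.Err()
		case <-ticker.C:
		}
	}
}

// Step 3: retrieve the logs of the finished pod.
func retrieveLogs(ctx context.Context, r podRunner, pod string) (string, error) {
	return r.Logs(ctx, pod)
}

// debugExec composes the three steps and always cleans up the pod.
func debugExec(ctx context.Context, r podRunner, node string, command []string) (string, error) {
	pod, err := createDebugPod(ctx, r, node, command)
	if err != nil {
		return "", fmt.Errorf("create debug pod: %w", err)
	}
	// Use a fresh context so cleanup still runs after ctx is cancelled.
	defer func() { _ = r.Delete(context.Background(), pod) }()

	// The 10ms interval is an arbitrary value chosen for this sketch.
	phase, err := waitForCompletion(ctx, r, pod, 10*time.Millisecond)
	if err != nil {
		return "", fmt.Errorf("wait for debug pod: %w", err)
	}
	logs, err := retrieveLogs(ctx, r, pod)
	if err != nil {
		return "", fmt.Errorf("retrieve logs: %w", err)
	}
	if phase == "Failed" {
		return logs, fmt.Errorf("debug pod %s failed", pod)
	}
	return logs, nil
}

// fakeRunner is an in-memory stand-in used only to exercise the sketch.
type fakeRunner struct {
	polls   int
	deleted bool
}

func (f *fakeRunner) Create(context.Context, string, []string) (string, error) {
	return "node-debug-example", nil
}
func (f *fakeRunner) Phase(context.Context, string) (string, error) {
	f.polls++
	if f.polls < 3 {
		return "Running", nil
	}
	return "Succeeded", nil
}
func (f *fakeRunner) Logs(context.Context, string) (string, error) { return "cgroup2fs\n", nil }
func (f *fakeRunner) Delete(context.Context, string) error          { f.deleted = true; return nil }

// runFake drives debugExec against the fake and reports whether cleanup ran.
func runFake() (string, bool, error) {
	f := &fakeRunner{}
	out, err := debugExec(context.Background(), f, "worker-a", []string{"stat", "-fc", "%T", "/sys/fs/cgroup/"})
	return out, f.deleted, err
}

func main() {
	out, deleted, err := runFake()
	fmt.Printf("out=%q deleted=%v err=%v\n", out, deleted, err)
}
```

Keeping each step behind its own function also makes it possible to test polling and log retrieval separately with a fake, as `runFake` does here.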

Author

Addressed, thanks.

// nodeDebugContainerName is the name used for the debug container, matching oc debug defaults.
nodeDebugContainerName = "debug"
// defaultNodeDebugTimeout is the maximum time to wait for the debug pod to finish executing.
defaultNodeDebugTimeout = 5 * time.Minute

We may want to lower this timeout as by default this will significantly exceed the client tool call connection timeout: https://github.com/modelcontextprotocol/typescript-sdk/blob/e0de0829019a4eab7af29c05f9a7ec13364f121e/src/shared/protocol.ts#L60

We probably also want to add support for progress notifications to be sent to the client for longer running tool calls like this one (cc @mrunalp @ardaguclu @matzew @manusa )
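
One way to keep the server-side wait under the client's limit is to clamp whatever timeout is requested. The sketch below is an assumption-laden illustration: `clientRequestTimeout` (60 seconds) and `safetyMargin` are hypothetical constants chosen for the example, and `effectiveTimeout` is not a function from this PR.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// Assumed values for illustration only: the client-side request timeout and
// a margin left for creating the pod and returning the response.
const (
	clientRequestTimeout = 60 * time.Second
	safetyMargin         = 5 * time.Second
)

// effectiveTimeout (hypothetical helper) caps the requested timeout so the
// tool call finishes before the client gives up on the request.
func effectiveTimeout(requested time.Duration) time.Duration {
	limit := clientRequestTimeout - safetyMargin
	if requested <= 0 || requested > limit {
		return limit
	}
	return requested
}

func main() {
	// A 5 minute request is clamped to the limit before the context is built.
	ctx, cancel := context.WithTimeout(context.Background(), effectiveTimeout(5*time.Minute))
	defer cancel()

	_, hasDeadline := ctx.Deadline()
	fmt.Println(effectiveTimeout(5*time.Minute), hasDeadline)
}
```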

Author

Reduced to 1 min, thanks.

Comment on lines 127 to 128
grace := int64(0)
_ = podsClient.Delete(deleteCtx, created.Name, metav1.DeleteOptions{GracePeriodSeconds: &grace})

nit: let's use ptr.To here like we do elsewhere
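
For context, `ptr.To` from `k8s.io/utils/ptr` is a small generic helper that returns a pointer to its argument, which removes the need for a temporary variable such as `grace`. The sketch below defines a local `To` with the same shape so it runs without external modules; `deleteOptions` is a hypothetical stand-in for `metav1.DeleteOptions`.

```go
package main

import "fmt"

// To is a local stand-in mirroring the shape of ptr.To in k8s.io/utils/ptr:
// it returns a pointer to a copy of v.
func To[T any](v T) *T {
	return &v
}

// deleteOptions is a hypothetical stand-in for metav1.DeleteOptions,
// reduced to the single field used in the snippet under review.
type deleteOptions struct {
	GracePeriodSeconds *int64
}

func main() {
	// Before: grace := int64(0); opts := deleteOptions{GracePeriodSeconds: &grace}
	// After: no temporary variable is needed.
	opts := deleteOptions{GracePeriodSeconds: To(int64(0))}
	fmt.Println(*opts.GracePeriodSeconds)
}
```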

Author

The refactored file at pkg/ocp/nodes_debug.go now uses ptr.To. Thanks.

defaultNodeDebugTimeout = 5 * time.Minute
)

// NodesDebugExec mimics `oc debug node/<name> -- <command...>` by creating a privileged pod on the target
Member

Since this is OCP-specific, would it make sense to group this functionality into a pkg/ocp package?

Author

The file pkg/ocp/nodes_debug.go is part of the ocp package.

Author

@harche harche Oct 7, 2025

Hold on, I will move everything this PR adds from pkg/kubernetes to pkg/ocp so that future rebases avoid conflicts.

@openshift-merge-robot openshift-merge-robot added the needs-rebase label Oct 7, 2025
@openshift-merge-robot openshift-merge-robot removed the needs-rebase label Oct 7, 2025
@Cali0707 Cali0707 left a comment

@harche would it be possible to improve the error messages we provide?

When running this server with Claude Code, Claude was able to call the tool, but it frequently got errors such as:

⏺ ocp-debug - Nodes: Debug Exec (MCP)(node: "ip-10-0-112-253.us-east-2.compute.internal", command: ["systemctl","status","kubelet"])
  ⎿  Error: command exited with code 1 (Error)

I'm not sure if there is a way to get more info about what went wrong, but with the current lack of error messages it was hard for the agent to figure out what went wrong and how to fix it
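
One possible direction, sketched under assumptions: if the tool can read the container's termination details and the tail of its logs, it could fold them into the returned error. The `execResult` type, its fields, and `describeFailure` are hypothetical names for this illustration, and the sample log line is made up.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// execResult is a hypothetical summary of a finished debug container.
type execResult struct {
	ExitCode int32
	Reason   string // e.g. the container's termination reason
	Message  string // e.g. the container's termination message, if any
	Logs     string // combined output captured from the container
}

// tailLines returns at most the last n lines of s.
func tailLines(s string, n int) string {
	lines := strings.Split(strings.TrimRight(s, "\n"), "\n")
	if len(lines) > n {
		lines = lines[len(lines)-n:]
	}
	return strings.Join(lines, "\n")
}

// describeFailure builds an error that names the command, exit code, reason,
// and the last few lines of output, so the caller has something to act on.
func describeFailure(command []string, r execResult) error {
	var b strings.Builder
	fmt.Fprintf(&b, "command %q exited with code %d", strings.Join(command, " "), r.ExitCode)
	if r.Reason != "" {
		fmt.Fprintf(&b, " (%s)", r.Reason)
	}
	if r.Message != "" {
		fmt.Fprintf(&b, ": %s", r.Message)
	}
	if out := tailLines(r.Logs, 20); out != "" {
		fmt.Fprintf(&b, "\nlast output:\n%s", out)
	}
	return errors.New(b.String())
}

func main() {
	err := describeFailure(
		[]string{"systemctl", "status", "kubelet"},
		// The log text below is invented sample data for the sketch.
		execResult{ExitCode: 1, Reason: "Error", Logs: "example: sample output line\n"},
	)
	fmt.Println(err)
}
```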

type Kubernetes struct {
	manager          *Manager
	podClientFactory func(namespace string) (corev1client.PodInterface, error)
Member

Wouldn't these changes cause divergence from upstream? I think that in this repository we shouldn't allow any changes that touch packages such as kubernetes/, mcp/, etc.

My suggestion is to add this upstream first and then simply use it downstream. The downstream repository should touch only the ocp/ directory.

Author

Yes, working on it; see #38 (comment).

Author

/hold

openshift-ci bot commented Oct 8, 2025

@harche: all tests passed!

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

Labels

do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command.

5 participants