alibaba · ninan-nn · Mar 27, 2026 · Mar 25, 2026 · Mar 25, 2026 · Mar 26, 2026
diff --git a/.claude/skills/troubleshoot-sandbox/SKILL.md b/.claude/skills/troubleshoot-sandbox/SKILL.md
@@ -0,0 +1,112 @@
+---
+name: troubleshoot-sandbox
+description: Troubleshoot OpenSandbox issues by running diagnostics (logs, inspect, events, summary) via CLI or HTTP API to diagnose sandbox failures like OOM, crash, image pull errors, network problems, etc.
+user-invocable: true
+argument-hint: "[sandbox-id]"
+---
+
+# OpenSandbox Troubleshooting
+
+Troubleshoot sandbox $ARGUMENTS using the OpenSandbox diagnostics API.
+
+There are two ways to interact with the diagnostics API: **CLI** (if the `opensandbox` CLI is installed) or **HTTP** (curl against the server directly). Use whichever is available. The HTTP approach works regardless of how the sandbox was created (SDK, API, CLI).
+
+## Workflow
+
+### Step 1: Confirm sandbox state
+
+**CLI:**
+```bash
+opensandbox sandbox get <sandbox-id>
+```
+
+**HTTP:**
+```bash
+curl http://<server-domain>/v1/sandboxes/<sandbox-id>
+```
+
+If the server requires authentication, add `-H "OPEN-SANDBOX-API-KEY: <your-key>"` to all curl commands.
+
+Check the sandbox status (Running, Pending, Paused, Failed, etc.). If the sandbox is not found, it may have been deleted or expired.
+
+### Step 2: Get diagnostics summary (recommended first action)
+
+**CLI:**
+```bash
+opensandbox devops summary <sandbox-id>
+```
+
+**HTTP:**
+```bash
+curl http://<server-domain>/v1/sandboxes/<sandbox-id>/diagnostics/summary
+```
+
+This returns a combined plain-text view of:
+- **Inspect**: container/pod details (status, resources, network, labels)
+- **Events**: state transitions, OOM kills, errors
+- **Logs**: recent container output
+
+Read the output carefully and look for common failure patterns listed below.
+
+### Step 3: Drill down if needed
+
+If the summary is not enough, use individual endpoints for more detail:
+
+**CLI:**
+```bash
+opensandbox devops logs <sandbox-id> --tail 500
+opensandbox devops logs <sandbox-id> --since 30m
+opensandbox devops inspect <sandbox-id>
+opensandbox devops events <sandbox-id> --limit 100
+```
+
+**HTTP:**
+```bash
+# Get more log lines
+curl "http://<server-domain>/v1/sandboxes/<sandbox-id>/diagnostics/logs?tail=500"
+
+# Get logs from recent time window
+curl "http://<server-domain>/v1/sandboxes/<sandbox-id>/diagnostics/logs?since=30m"
+
+# Detailed container/pod inspection
+curl "http://<server-domain>/v1/sandboxes/<sandbox-id>/diagnostics/inspect"
+
+# More events
+curl "http://<server-domain>/v1/sandboxes/<sandbox-id>/diagnostics/events?limit=100"
+```
+
+### Step 4: Diagnose common problems
+
+| Symptom | What to check | Likely cause |
+|---------|---------------|--------------|
+| Status=Pending, no IP | inspect - look for Waiting containers | Image pull failure, insufficient resources, node scheduling |
+| OOMKilled=true | inspect - check memory limits | Container exceeded its memory limit; increase the memory resource |
+| Exit Code 137 | events + logs | OOM kill or external SIGKILL |
+| Exit Code 1 | logs - check application output | Application error; check the entrypoint and env vars |
+| Exit Code 126/127 | logs | Entrypoint command not executable (126) or not found (127) |
+| Connection refused to sandbox | inspect - check ports and network | Service not started inside sandbox, wrong port, network policy blocking |
+| Sandbox stuck in Running but unresponsive | logs (tail=200) | Application hung; check for deadlocks or resource exhaustion |
+| execd health check failing | logs - look for execd errors | execd daemon crashed or port conflict |
+| ImagePullBackOff (K8s) | events | Wrong image name, missing registry credentials |
+| CrashLoopBackOff (K8s) | events + logs | Application keeps crashing; check the exit code and logs |
+
+### Step 5: Suggest resolution
+
+Based on the diagnosis, suggest one of:
+- **Image issue**: Verify image name, check registry access
+- **OOM**: Increase the memory limit when creating the sandbox (e.g. `memory=4Gi`)
+- **Application error**: Fix the entrypoint or application code
+- **Network**: Check network policy, verify port configuration
+- **Scheduling (K8s)**: Check node resources, check pool availability
+- **execd**: Update execd image version, check port conflicts
+
+## API Reference
+
+All diagnostics endpoints return `text/plain`:
+
+| Endpoint | Query Params | Description |
+|----------|-------------|-------------|
+| `GET /v1/sandboxes/{id}/diagnostics/summary` | `tail` (default 50), `event_limit` (default 20) | Combined inspect + events + logs |
+| `GET /v1/sandboxes/{id}/diagnostics/logs` | `tail` (default 100), `since` (e.g. 10m, 1h) | Container/pod logs |
+| `GET /v1/sandboxes/{id}/diagnostics/inspect` | - | Container/pod detailed state |
+| `GET /v1/sandboxes/{id}/diagnostics/events` | `limit` (default 50) | Container/pod events |
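The Step 4 table in SKILL.md can also be applied mechanically to the plain-text summary. The following is a minimal sketch, not part of this PR: `match_symptoms` and `SYMPTOMS` are hypothetical names, and because the PR does not show the exact output format of the summary endpoint, the case-insensitive regex matching is an assumption. The symptom strings and causes are copied from the table.

```python
from __future__ import annotations

import re

# (regex, likely cause) pairs taken from the Step 4 table in SKILL.md.
# The regexes are an assumption about how the symptoms appear in the text.
SYMPTOMS: list[tuple[str, str]] = [
    (r"oomkilled\W*true", "Container exceeded memory limit"),
    (r"exit code 137\b", "OOM kill or external SIGKILL"),
    (r"exit code 12[67]\b", "Entrypoint command not found or not executable"),
    (r"exit code 1\b", "Application error"),
    (r"imagepullbackoff", "Wrong image name, missing registry credentials"),
    (r"crashloopbackoff", "Application keeps crashing"),
]


def match_symptoms(summary_text: str) -> list[str]:
    """Return the likely causes whose symptom pattern occurs in the text."""
    causes: list[str] = []
    for pattern, cause in SYMPTOMS:
        # \b after the exit code keeps "Exit Code 1" from matching "Exit Code 137".
        if re.search(pattern, summary_text, flags=re.IGNORECASE):
            causes.append(cause)
    return causes
```

A match is only a hint for where to drill down next (Step 3); it does not replace reading the logs and events.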
@@ -190,6 +190,38 @@ Shortcut for `osb command run`. Everything after `--` is passed as the command.
 | `context`   | Manage code execution contexts            |
 | `interrupt` | Interrupt a running code execution        |
 
+### `osb devops` — DevOps Diagnostics
+
+| Command   | Description                                            |
+| --------- | ------------------------------------------------------ |
+| `logs`    | Retrieve container/pod logs                            |
+| `inspect` | Retrieve detailed container/pod inspection info        |
+| `events`  | Retrieve events related to a sandbox                   |
+| `summary` | One-shot diagnostics: inspect + events + logs combined |
+
+```bash
+# Quick diagnostics summary
+osb devops summary <sandbox-id>
+
+# Get last 500 log lines
+osb devops logs <sandbox-id> --tail 500
+
+# Get logs from the last 30 minutes
+osb devops logs <sandbox-id> --since 30m
+
+# Detailed container/pod inspection
+osb devops inspect <sandbox-id>
+
+# View events (up to 100)
+osb devops events <sandbox-id> --limit 100
+```
+
+All devops commands return plain text, so the output can be read directly by humans or consumed by AI agents.
+
+![DevOps Summary 1](assets/cli_devops_summary_1.png)
+
+![DevOps Summary 2](assets/cli_devops_summary_2.png)
+
 ### `osb config` — Configuration
 
 | Command | Description                                |

@@ -0,0 +1,122 @@
+# Copyright 2026 Alibaba Group Holding Ltd.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+"""DevOps diagnostics commands: logs, inspect, events, summary."""
+
+from __future__ import annotations
+
+import click
+import httpx
+
+from opensandbox_cli.client import ClientContext
+from opensandbox_cli.utils import handle_errors
+
+
+def _devops_url(obj: ClientContext, sandbox_id: str, endpoint: str) -> str:
+    """Build the full URL for a DevOps diagnostics endpoint."""
+    cfg = obj.resolved_config
+    protocol = cfg.get("protocol", "http")
+    domain = cfg.get("domain", "localhost:8080")
+    return f"{protocol}://{domain}/v1/sandboxes/{sandbox_id}/diagnostics/{endpoint}"
+
+
+def _devops_headers(obj: ClientContext) -> dict[str, str]:
+    """Build request headers with optional API key."""
+    cfg = obj.resolved_config
+    headers: dict[str, str] = {"Accept": "text/plain"}
+    api_key = cfg.get("api_key")
+    if api_key:
+        headers["OPEN-SANDBOX-API-KEY"] = api_key
+    return headers
+
+
+def _fetch_plain_text(obj: ClientContext, sandbox_id: str, endpoint: str, params: dict | None = None) -> str:
+    """Fetch a diagnostics endpoint and return the plain-text body."""
+    sandbox_id = obj.resolve_sandbox_id(sandbox_id)
+    url = _devops_url(obj, sandbox_id, endpoint)
+    headers = _devops_headers(obj)
+    cfg = obj.resolved_config
+    timeout = cfg.get("request_timeout", 30)
+
+    resp = httpx.get(url, headers=headers, params=params, timeout=timeout)
+    if resp.status_code == 404:
+        raise click.ClickException(f"Sandbox '{sandbox_id}' not found.")
+    resp.raise_for_status()
+    return resp.text
+
+
+@click.group("devops", invoke_without_command=True)
+@click.pass_context
+def devops_group(ctx: click.Context) -> None:
+    """DevOps diagnostics for sandbox troubleshooting."""
+    if ctx.invoked_subcommand is None:
+        click.echo(ctx.get_help())
+
+
+# ---- logs ----------------------------------------------------------------
+
+@devops_group.command("logs")
+@click.argument("sandbox_id")
+@click.option("--tail", "-n", type=int, default=100, show_default=True, help="Number of trailing log lines.")
+@click.option("--since", "-s", default=None, help="Only logs newer than this duration (e.g. 10m, 1h).")
+@click.pass_obj
+@handle_errors
+def devops_logs(obj: ClientContext, sandbox_id: str, tail: int, since: str | None) -> None:
+    """Retrieve container/pod logs for a sandbox."""
+    params: dict = {"tail": tail}
+    if since:
+        params["since"] = since
+    text = _fetch_plain_text(obj, sandbox_id, "logs", params=params)
+    click.echo(text)
+
+
+# ---- inspect -------------------------------------------------------------
+
+@devops_group.command("inspect")
+@click.argument("sandbox_id")
+@click.pass_obj
+@handle_errors
+def devops_inspect(obj: ClientContext, sandbox_id: str) -> None:
+    """Retrieve detailed container/pod inspection info."""
+    text = _fetch_plain_text(obj, sandbox_id, "inspect")
+    click.echo(text)
+
+
+# ---- events --------------------------------------------------------------
+
+@devops_group.command("events")
+@click.argument("sandbox_id")
+@click.option("--limit", "-l", type=int, default=50, show_default=True, help="Maximum number of events.")
+@click.pass_obj
+@handle_errors
+def devops_events(obj: ClientContext, sandbox_id: str, limit: int) -> None:
+    """Retrieve events related to a sandbox."""
+    params: dict = {"limit": limit}
+    text = _fetch_plain_text(obj, sandbox_id, "events", params=params)
+    click.echo(text)
+
+
+# ---- summary -------------------------------------------------------------
+
+@devops_group.command("summary")
+@click.argument("sandbox_id")
+@click.option("--tail", "-n", type=int, default=50, show_default=True, help="Number of trailing log lines.")
+@click.option("--event-limit", type=int, default=20, show_default=True, help="Maximum number of events.")
+@click.pass_obj
+@handle_errors
+def devops_summary(obj: ClientContext, sandbox_id: str, tail: int, event_limit: int) -> None:
+    """One-shot diagnostics: inspect + events + logs combined."""
+    params: dict = {"tail": tail, "event_limit": event_limit}
+    text = _fetch_plain_text(obj, sandbox_id, "summary", params=params)
+    click.echo(text)
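To make the request shape concrete, here is a stdlib-only sketch of the URL that results from combining `_devops_url` with the query parameters each command passes. `build_diagnostics_url` is a hypothetical helper, not part of this PR. Percent-encoding the sandbox ID is an extra precaution that `_devops_url` does not take as written, and dropping unset options mirrors the `if since:` guard in `devops_logs`.

```python
from __future__ import annotations

from urllib.parse import quote, urlencode


def build_diagnostics_url(
    protocol: str,
    domain: str,
    sandbox_id: str,
    endpoint: str,
    params: dict[str, object] | None = None,
) -> str:
    """Hypothetical helper: same path layout as `_devops_url`, plus a query string."""
    # Percent-encode the ID so characters such as "/" cannot alter the path.
    safe_id = quote(sandbox_id, safe="")
    url = f"{protocol}://{domain}/v1/sandboxes/{safe_id}/diagnostics/{endpoint}"
    # Skip options that were not set, as `devops_logs` does for `--since`.
    query = {key: value for key, value in (params or {}).items() if value is not None}
    return f"{url}?{urlencode(query)}" if query else url


if __name__ == "__main__":
    # Equivalent of: devops logs <sandbox-id> --tail 500 --since 30m
    print(build_diagnostics_url("http", "localhost:8080", "sb-1", "logs",
                                {"tail": 500, "since": "30m"}))
```

In the PR itself, `httpx.get(url, params=params)` builds the query string, so this helper only illustrates the resulting URL and is not a suggested replacement.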
@@ -26,6 +26,7 @@
 from opensandbox_cli.commands.code import code_group
 from opensandbox_cli.commands.command import command_group, exec_cmd
 from opensandbox_cli.commands.config_cmd import config_group
+from opensandbox_cli.commands.devops import devops_group
 from opensandbox_cli.commands.file import file_group
 from opensandbox_cli.commands.sandbox import sandbox_group
 from opensandbox_cli.config import resolve_config
@@ -109,3 +110,4 @@ def cli(
 cli.add_command(file_group)
 cli.add_command(code_group)
 cli.add_command(config_group)
+cli.add_command(devops_group)