112 changes: 112 additions & 0 deletions .claude/skills/troubleshoot-sandbox/SKILL.md
@@ -0,0 +1,112 @@
---
name: troubleshoot-sandbox
description: Troubleshoot OpenSandbox issues by running diagnostics (logs, inspect, events, summary) via CLI or HTTP API to diagnose sandbox failures like OOM, crash, image pull errors, network problems, etc.
user-invocable: true
argument-hint: "[sandbox-id]"
---

# OpenSandbox Troubleshooting

Troubleshoot sandbox $ARGUMENTS using the OpenSandbox diagnostics.

There are two ways to interact with the diagnostics API: **CLI** (if the `opensandbox` CLI is installed) or **HTTP** (curl against the server directly). Use whichever is available. The HTTP approach works regardless of how the sandbox was created (SDK, API, or CLI).

## Workflow

### Step 1: Confirm sandbox state

**CLI:**
```bash
opensandbox sandbox get <sandbox-id>
```

**HTTP:**
```bash
curl http://<server-domain>/v1/sandboxes/<sandbox-id>
```

If the server requires authentication, add `-H "OPEN-SANDBOX-API-KEY: <your-key>"` to all curl commands.

Check the sandbox status (Running, Pending, Paused, Failed, etc.). If the sandbox is not found, it may have been deleted or expired.
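
For scripted checks, the same request can be assembled in Python. The following is a minimal sketch using only the standard library; `build_sandbox_request` and `fetch_sandbox_state` are hypothetical helper names, and the response body is returned as raw text because its schema is not described here.

```python
from __future__ import annotations

import urllib.request
from urllib.parse import quote


def build_sandbox_request(
    server: str, sandbox_id: str, api_key: str | None = None
) -> urllib.request.Request:
    """Build a GET request for /v1/sandboxes/<sandbox-id> (hypothetical helper)."""
    url = f"http://{server}/v1/sandboxes/{quote(sandbox_id, safe='')}"
    headers: dict[str, str] = {}
    if api_key:
        # Header name taken from the note above; sent only when a key is given.
        headers["OPEN-SANDBOX-API-KEY"] = api_key
    return urllib.request.Request(url, headers=headers, method="GET")


def fetch_sandbox_state(
    server: str, sandbox_id: str, api_key: str | None = None, timeout: float = 30.0
) -> str:
    """Send the request and return the raw response body as text."""
    req = build_sandbox_request(server, sandbox_id, api_key)
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return resp.read().decode("utf-8", errors="replace")
```

Keeping request construction separate from the network call makes it possible to inspect the URL and headers without a running server.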

### Step 2: Get diagnostics summary (recommended first action)

**CLI:**
```bash
opensandbox devops summary <sandbox-id>
```

**HTTP:**
```bash
curl http://<server-domain>/v1/sandboxes/<sandbox-id>/diagnostics/summary
```

This returns a combined plain-text view of:
- **Inspect**: container/pod details (status, resources, network, labels)
- **Events**: state transitions, OOM kills, errors
- **Logs**: recent container output

Read the output carefully and look for common failure patterns listed below.

### Step 3: Drill down if needed

If the summary is not enough, use individual endpoints for more detail:

**CLI:**
```bash
opensandbox devops logs <sandbox-id> --tail 500
opensandbox devops logs <sandbox-id> --since 30m
opensandbox devops inspect <sandbox-id>
opensandbox devops events <sandbox-id> --limit 100
```

**HTTP:**
```bash
# Get more log lines
curl "http://<server-domain>/v1/sandboxes/<sandbox-id>/diagnostics/logs?tail=500"

# Get logs from recent time window
curl "http://<server-domain>/v1/sandboxes/<sandbox-id>/diagnostics/logs?since=30m"

# Detailed container/pod inspection
curl "http://<server-domain>/v1/sandboxes/<sandbox-id>/diagnostics/inspect"

# More events
curl "http://<server-domain>/v1/sandboxes/<sandbox-id>/diagnostics/events?limit=100"
```

### Step 4: Diagnose common problems

| Symptom | What to check | Likely cause |
|---------|---------------|--------------|
| Status=Pending, no IP | inspect - look for Waiting containers | Image pull failure, insufficient resources, or node scheduling problems |
| OOMKilled=true | inspect - check memory limits | Container exceeded its memory limit; increase the memory resource |
| Exit Code 137 | events + logs | OOM kill or external SIGKILL |
| Exit Code 1 | logs - check application output | Application error; check the entrypoint and env vars |
| Exit Code 126/127 | logs | Entrypoint command not executable (126) or not found (127) |
| Connection refused to sandbox | inspect - check ports and network | Service not started inside the sandbox, wrong port, or network policy blocking traffic |
| Sandbox stuck in Running but unresponsive | logs (tail=200) | Application hung; check for deadlocks or resource exhaustion |
| execd health check failing | logs - look for execd errors | execd daemon crashed or port conflict |
| ImagePullBackOff (K8s) | events | Wrong image name or missing registry credentials |
| CrashLoopBackOff (K8s) | events + logs | Application keeps crashing; check the exit code and logs |
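
A first pass over the diagnostics text can be automated by scanning for the symptoms in the table. The sketch below is illustrative only: `FAILURE_PATTERNS` and `match_failure_patterns` are hypothetical names, and the regular expressions are assumptions about how the plain-text output spells each symptom, so they may need adjusting to the server's actual wording.

```python
from __future__ import annotations

import re

# (regex, what to check, likely cause), mirroring rows of the table above.
# The regexes are assumptions about the wording of the diagnostics text.
FAILURE_PATTERNS: list[tuple[str, str, str]] = [
    (r"OOMKilled\s*[=:]\s*true", "inspect: memory limits", "Container exceeded its memory limit"),
    (r"Exit Code:?\s*137\b", "events + logs", "OOM kill or external SIGKILL"),
    (r"Exit Code:?\s*12[67]\b", "logs", "Entrypoint command not found or not executable"),
    (r"Exit Code:?\s*1\b", "logs: application output", "Application error"),
    (r"ImagePullBackOff", "events", "Wrong image name or missing registry credentials"),
    (r"CrashLoopBackOff", "events + logs", "Application keeps crashing"),
]


def match_failure_patterns(text: str) -> list[tuple[str, str]]:
    """Return (what to check, likely cause) for every pattern found in text."""
    findings: list[tuple[str, str]] = []
    for pattern, check, cause in FAILURE_PATTERNS:
        if re.search(pattern, text, flags=re.IGNORECASE):
            findings.append((check, cause))
    return findings
```

The `\b` word boundary keeps the `Exit Code 1` pattern from also matching `Exit Code 137` or `Exit Code 126`. Matches are hints for where to look next, not a verdict; the logs and events remain the source of truth.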

### Step 5: Suggest resolution

Based on the diagnosis, suggest one of:
- **Image issue**: Verify image name, check registry access
- **OOM**: Increase memory limit in sandbox creation (e.g. `memory=4Gi`)
- **Application error**: Fix the entrypoint or application code
- **Network**: Check network policy, verify port configuration
- **Scheduling (K8s)**: Check node resources, check pool availability
- **execd**: Update execd image version, check port conflicts

## API Reference

All diagnostics endpoints return `text/plain` and are available at:

| Endpoint | Query Params | Description |
|----------|-------------|-------------|
| `GET /v1/sandboxes/{id}/diagnostics/summary` | `tail` (default 50), `event_limit` (default 20) | Combined inspect + events + logs |
| `GET /v1/sandboxes/{id}/diagnostics/logs` | `tail` (default 100), `since` (e.g. 10m, 1h) | Container/pod logs |
| `GET /v1/sandboxes/{id}/diagnostics/inspect` | - | Container/pod detailed state |
| `GET /v1/sandboxes/{id}/diagnostics/events` | `limit` (default 50) | Container/pod events |
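
The endpoint table can be turned into a small URL builder that rejects query parameters an endpoint does not list. This is a sketch under stated assumptions: `ALLOWED_PARAMS` and `diagnostics_url` are hypothetical names, and the server may accept or ignore parameters differently from what this client-side check enforces.

```python
from __future__ import annotations

from urllib.parse import quote, urlencode

# Query parameters per endpoint, copied from the table above.
ALLOWED_PARAMS: dict[str, set[str]] = {
    "summary": {"tail", "event_limit"},
    "logs": {"tail", "since"},
    "inspect": set(),
    "events": {"limit"},
}


def diagnostics_url(
    server: str, sandbox_id: str, endpoint: str, protocol: str = "http", **params: object
) -> str:
    """Build a diagnostics URL, validating the endpoint and its query params."""
    if endpoint not in ALLOWED_PARAMS:
        raise ValueError(f"unknown diagnostics endpoint: {endpoint!r}")
    unknown = set(params) - ALLOWED_PARAMS[endpoint]
    if unknown:
        raise ValueError(f"{endpoint} does not accept: {sorted(unknown)}")
    # Omit unset parameters so the server-side defaults apply.
    query = urlencode({k: v for k, v in sorted(params.items()) if v is not None})
    base = (
        f"{protocol}://{server}/v1/sandboxes/"
        f"{quote(sandbox_id, safe='')}/diagnostics/{endpoint}"
    )
    return f"{base}?{query}" if query else base
```

For example, `diagnostics_url("localhost:8080", "sb-1", "logs", tail=500, since="30m")` produces a URL equivalent to the `logs?tail=500` and `logs?since=30m` curl calls in Step 3 combined into one request.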
32 changes: 32 additions & 0 deletions cli/README.md
@@ -190,6 +190,38 @@ Shortcut for `osb command run`. Everything after `--` is passed as the command.
| `context` | Manage code execution contexts |
| `interrupt` | Interrupt a running code execution |

### `osb devops` — DevOps Diagnostics

| Command | Description |
| --------- | ---------------------------------------------------- |
| `logs` | Retrieve container/pod logs |
| `inspect` | Retrieve detailed container/pod inspection info |
| `events` | Retrieve events related to a sandbox |
| `summary` | One-shot diagnostics: inspect + events + logs combined |

```bash
# Quick diagnostics summary
osb devops summary <sandbox-id>

# Get last 500 log lines
osb devops logs <sandbox-id> --tail 500

# Get logs from the last 30 minutes
osb devops logs <sandbox-id> --since 30m

# Detailed container/pod inspection
osb devops inspect <sandbox-id>

# View events (up to 100)
osb devops events <sandbox-id> --limit 100
```

All devops commands return plain text output, making them ideal for both human reading and AI agent consumption.
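
Because the output is plain text, a script or agent wrapper can capture it directly from the CLI. The sketch below assumes `osb` is on `PATH`; `devops_argv` and `run_devops` are hypothetical helper names, and only the flags shown in the examples above (`--tail`, `--since`, `--limit`, `--event-limit`) are expected to be passed.

```python
from __future__ import annotations

import subprocess


def devops_argv(subcommand: str, sandbox_id: str, **options: object) -> list[str]:
    """Build the argument list for `osb devops <subcommand> <sandbox-id>`."""
    argv = ["osb", "devops", subcommand, sandbox_id]
    for name, value in options.items():
        if value is not None:
            # event_limit=20 becomes `--event-limit 20`
            argv += [f"--{name.replace('_', '-')}", str(value)]
    return argv


def run_devops(subcommand: str, sandbox_id: str, **options: object) -> str:
    """Run the command and return its plain-text stdout."""
    result = subprocess.run(
        devops_argv(subcommand, sandbox_id, **options),
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError on a non-zero exit status
    )
    return result.stdout
```

Passing the arguments as a list rather than a shell string avoids quoting problems if a sandbox ID ever contains unusual characters.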

![DevOps Summary 1](assets/cli_devops_summary_1.png)

![DevOps Summary 2](assets/cli_devops_summary_2.png)

### `osb config` — Configuration

| Command | Description |
Binary file added cli/assets/cli_devops_summary_1.png
Binary file added cli/assets/cli_devops_summary_2.png
122 changes: 122 additions & 0 deletions cli/src/opensandbox_cli/commands/devops.py
@@ -0,0 +1,122 @@
# Copyright 2026 Alibaba Group Holding Ltd.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

"""DevOps diagnostics commands: logs, inspect, events, summary."""

from __future__ import annotations

import click
import httpx

from opensandbox_cli.client import ClientContext
from opensandbox_cli.utils import handle_errors


def _devops_url(obj: ClientContext, sandbox_id: str, endpoint: str) -> str:
    """Build the full URL for a DevOps diagnostics endpoint."""
    cfg = obj.resolved_config
    protocol = cfg.get("protocol", "http")
    domain = cfg.get("domain", "localhost:8080")
    return f"{protocol}://{domain}/v1/sandboxes/{sandbox_id}/diagnostics/{endpoint}"


def _devops_headers(obj: ClientContext) -> dict[str, str]:
    """Build request headers with optional API key."""
    cfg = obj.resolved_config
    headers: dict[str, str] = {"Accept": "text/plain"}
    api_key = cfg.get("api_key")
    if api_key:
        headers["OPEN-SANDBOX-API-KEY"] = api_key
    return headers


def _fetch_plain_text(obj: ClientContext, sandbox_id: str, endpoint: str, params: dict | None = None) -> str:
    """Fetch a diagnostics endpoint and return the plain-text body."""
    sandbox_id = obj.resolve_sandbox_id(sandbox_id)
    url = _devops_url(obj, sandbox_id, endpoint)
    headers = _devops_headers(obj)
    cfg = obj.resolved_config
    timeout = cfg.get("request_timeout", 30)

    resp = httpx.get(url, headers=headers, params=params, timeout=timeout)
    if resp.status_code == 404:
        raise click.ClickException(f"Sandbox '{sandbox_id}' not found.")
    resp.raise_for_status()
    return resp.text


@click.group("devops", invoke_without_command=True)
@click.pass_context
def devops_group(ctx: click.Context) -> None:
    """DevOps diagnostics for sandbox troubleshooting."""
    if ctx.invoked_subcommand is None:
        click.echo(ctx.get_help())


# ---- logs ----------------------------------------------------------------

@devops_group.command("logs")
@click.argument("sandbox_id")
@click.option("--tail", "-n", type=int, default=100, show_default=True, help="Number of trailing log lines.")
@click.option("--since", "-s", default=None, help="Only logs newer than this duration (e.g. 10m, 1h).")
@click.pass_obj
@handle_errors
def devops_logs(obj: ClientContext, sandbox_id: str, tail: int, since: str | None) -> None:
    """Retrieve container logs for a sandbox."""
    params: dict = {"tail": tail}
    if since:
        params["since"] = since
    text = _fetch_plain_text(obj, sandbox_id, "logs", params=params)
    click.echo(text)


# ---- inspect -------------------------------------------------------------

@devops_group.command("inspect")
@click.argument("sandbox_id")
@click.pass_obj
@handle_errors
def devops_inspect(obj: ClientContext, sandbox_id: str) -> None:
    """Retrieve detailed container/pod inspection info."""
    text = _fetch_plain_text(obj, sandbox_id, "inspect")
    click.echo(text)


# ---- events --------------------------------------------------------------

@devops_group.command("events")
@click.argument("sandbox_id")
@click.option("--limit", "-l", type=int, default=50, show_default=True, help="Maximum number of events.")
@click.pass_obj
@handle_errors
def devops_events(obj: ClientContext, sandbox_id: str, limit: int) -> None:
    """Retrieve events related to a sandbox."""
    params: dict = {"limit": limit}
    text = _fetch_plain_text(obj, sandbox_id, "events", params=params)
    click.echo(text)


# ---- summary -------------------------------------------------------------

@devops_group.command("summary")
@click.argument("sandbox_id")
@click.option("--tail", "-n", type=int, default=50, show_default=True, help="Number of trailing log lines.")
@click.option("--event-limit", type=int, default=20, show_default=True, help="Maximum number of events.")
@click.pass_obj
@handle_errors
def devops_summary(obj: ClientContext, sandbox_id: str, tail: int, event_limit: int) -> None:
    """One-shot diagnostics: inspect + events + logs combined."""
    params: dict = {"tail": tail, "event_limit": event_limit}
    text = _fetch_plain_text(obj, sandbox_id, "summary", params=params)
    click.echo(text)
2 changes: 2 additions & 0 deletions cli/src/opensandbox_cli/main.py
@@ -26,6 +26,7 @@
from opensandbox_cli.commands.code import code_group
from opensandbox_cli.commands.command import command_group, exec_cmd
from opensandbox_cli.commands.config_cmd import config_group
from opensandbox_cli.commands.devops import devops_group
from opensandbox_cli.commands.file import file_group
from opensandbox_cli.commands.sandbox import sandbox_group
from opensandbox_cli.config import resolve_config
@@ -109,3 +110,4 @@ def cli(
cli.add_command(file_group)
cli.add_command(code_group)
cli.add_command(config_group)
cli.add_command(devops_group)
Binary file added docs/assets/ai-troubleshoot-1.png
Binary file added docs/assets/ai-troubleshoot-2.png