Skip to content

Enhance sandbox pause/resume#1149

Draft
ryanzhang-oss wants to merge 2 commits into
opensandbox-group:mainfrom
ryanzhang-oss:copilot/worktree-2026-06-27T20-56-53
Draft

Enhance sandbox pause/resume#1149
ryanzhang-oss wants to merge 2 commits into
opensandbox-group:mainfrom
ryanzhang-oss:copilot/worktree-2026-06-27T20-56-53

Conversation

@ryanzhang-oss

@ryanzhang-oss ryanzhang-oss commented Jun 29, 2026

Copy link
Copy Markdown
Contributor

Summary

This pull request added OSEP-0015 for "Spec-Driven Pod Snapshot for Pause and Resume" with status "draft" and date "2026-06-27".

Testing

  • Not run (OSEP proposal)
  • Unit tests
  • Integration tests
  • e2e / manual verification

Breaking Changes

  • None
  • Yes (describe impact and migration path)

Checklist

  • Linked Issue or clearly described motivation
  • Added/updated docs (if needed)
  • Added/updated tests (if needed)
  • Security impact considered
  • Backward compatibility considered

- Added OSEP-0013: Isolated Execution API (implementing, due 2026-06-23)
- Added OSEP-0014: Multi-Tenancy Support for Kubernetes Runtime (draft, due 2026-05-07)
- Added OSEP-0015: Spec-Driven Pod Snapshot for Pause and Resume (draft, due 2026-06-27)
Copilot AI review requested due to automatic review settings June 29, 2026 18:58
@github-actions

Copy link
Copy Markdown
Contributor

⚠️ This PR has no labels. Please add one based on the changes.

Changed directories: oseps.

📋 Recommended labels (based on changed files):

  • documentation ⬅️

Other available labels:

  • bug - Something isn't working
  • dependencies - Pull requests that update a dependency file
  • feature - New feature or request
  • packages - Changes for package, image and configuration

💡 Tip: Use feature for new functionality or improvements, bug for fixes.

cc @ryanzhang-oss

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR updates the OSEP index and introduces a new enhancement proposal documenting a spec-driven Kubernetes pause/resume design using inline snapshot state on BatchSandbox plus new SnapshotClass/SnapshotClaim CRDs.

Changes:

  • Add OSEP-0013/0014/0015 entries to oseps/README.md.
  • Add new proposal document oseps/0015-pod-snapshot.md describing pod snapshot-based pause/resume semantics and APIs.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 4 comments.

File Description
oseps/README.md Extends the OSEP index table with entries for OSEP-0013 through OSEP-0015.
oseps/0015-pod-snapshot.md Adds the full OSEP-0015 draft describing the proposed pause/resume and snapshot CRD model.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +301 to +304
routes in `server/opensandbox_server/api/lifecycle.py` currently accept only the
path parameter and `X-Request-ID`. This OSEP updates those definitions to accept
an optional `PauseSandboxRequest` body. Omitting the body, sending `{}`, or using
old SDKs remains valid:
|---|---|---|---|---|
| `Stop` | Delete the Pod and release compute. | Lost unless on a retained PVC (OSEP-0003). | Lost | Supported (no snapshot, no `SnapshotClass` required) |
| `Freeze` | Keep the Pod, freeze container cgroups. | Kept | Kept (in node RAM) | Reserved (Kubernetes freeze is future work) |
| `Hibernate` | Capture full pod state via the snapshot Job, then delete the Pod. | Persisted as checkpoint artifact | Persisted as checkpoint artifact | **Implemented** (replaces the OSEP-0008 path) |
Comment thread oseps/0015-pod-snapshot.md Outdated
Comment on lines +980 to +982
This revision implements one snapshot mechanism: a same-node full-pod-state
snapshot Job. The `type` enum is the extension point for future backends without
changing the `BatchSandbox`

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 2e3285b20c

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment thread oseps/0015-pod-snapshot.md Outdated
Comment thread oseps/0015-pod-snapshot.md
Comment thread oseps/0015-pod-snapshot.md
Comment thread oseps/0015-pod-snapshot.md Outdated
Comment thread oseps/0015-pod-snapshot.md Outdated
Comment thread oseps/0015-pod-snapshot.md
Comment on lines +1117 to +1119
9. Ctrl status.snapshot.phase: Pending -> Committing -> Ready,
artifacts[] populated
10. Ctrl Delete the Pod, status.phase = Paused.

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Recheck the source Pod UID before deleting

In clusters where the sandbox pod is evicted or recreated while the checkpoint job is running, this flow can record an artifact for one pod and then delete/report Paused for a replacement pod with the same name. Since the proposal records sourcePodUID, require the controller to compare that UID with the current live pod before deletion and fail closed if it changed, otherwise resume can restore stale state and lose writes from the replacement pod.

Useful? React with 👍 / 👎.

Comment thread oseps/0015-pod-snapshot.md Outdated
Comment on lines +1697 to +1700
When no `SnapshotClass` is referenced by that claim and none is annotated
default, the controller synthesizes an implicit default class from the legacy
startup flags (`--snapshot-registry`,
`--snapshot-registry-insecure`) so existing clusters keep working with zero

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Migrate registry secret settings too

Existing OSEP-0008 deployments commonly require --snapshot-push-secret and --resume-pull-secret (the controller flags and pause-resume guide both document them), but this zero-config migration only synthesizes defaults from --snapshot-registry and --snapshot-registry-insecure while the new design forbids Kubernetes Secrets. Authenticated registries will stop working after rollout unless the migration preserves those credentials somehow or the proposal removes the zero-config compatibility claim.

Useful? React with 👍 / 👎.

Comment thread oseps/0015-pod-snapshot.md Outdated
Comment on lines +523 to +534
metadata:
name: sandbox-abc123-pause-20260627
namespace: default
labels:
opensandbox.io/sandbox-id: sandbox-abc123
opensandbox.io/generated-from-template: tenant-s3
spec:
snapshotClassName: s3
parameters:
region: us-west-2
bucket: opensandbox-snapshots
prefix: tenants/default/sandbox-abc123/pauses/20260627

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Make generated claim names unique per pause

When the same sandbox pauses from the same template more than once on the same day, this generated claim name and prefix collide with the previous pause. That conflicts with the design's rule that per-pause values must create a new claim instead of mutating a shared one, and with the e2e expectation that two template pauses produce distinct claims/prefixes; include a generation, timestamp with sufficient granularity, or nonce in both the claim name and default prefix.

Useful? React with 👍 / 👎.

Comment thread oseps/0015-pod-snapshot.md Outdated
Comment on lines +1198 to +1209
The implementation updates these source definitions:

- `specs/sandbox-lifecycle.yml`: add optional `requestBody` for
`/sandboxes/{sandboxId}/pause` that references `PauseSandboxRequest`; keep
`/resume` bodyless in this revision.
- `server/opensandbox_server/api/schema.py`: add Pydantic models for
`PauseSandboxRequest`, `SnapshotPauseRequest`, and
inline/generated claim options.
- `server/opensandbox_server/api/lifecycle.py`: accept
`body: PauseSandboxRequest | None = Body(None)` on `pause_sandbox`.
- Kubernetes provider/runtime code: materialize the selected/generated
`SnapshotClaim`, then patch the `BatchSandbox`.

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Include SDK updates for the new pause body

This implementation plan updates the OpenAPI spec, FastAPI schema, route, and Kubernetes provider, but omits SDK interfaces. Current supported SDKs expose PauseSandbox(id) and send no request body, so advanced callers still cannot choose Stop, a claim, or a template through the public clients even though the proposal says new clients can opt in; add generated/handwritten SDK model and method updates to the required source changes.

Useful? React with 👍 / 👎.

@ryanzhang-oss ryanzhang-oss marked this pull request as draft June 29, 2026 19:56
@ryanzhang-oss ryanzhang-oss force-pushed the copilot/worktree-2026-06-27T20-56-53 branch from 2e3285b to 496889b Compare July 1, 2026 00:36
@github-actions github-actions Bot added the documentation Improvements or additions to documentation label Jul 1, 2026
@ryanzhang-oss ryanzhang-oss force-pushed the copilot/worktree-2026-06-27T20-56-53 branch from 496889b to e91cd5a Compare July 1, 2026 00:49
@hittyt hittyt added the OSEP label Jul 1, 2026
@fengcone

fengcone commented Jul 1, 2026

Copy link
Copy Markdown
Collaborator

This is useful as a future Hibernate/checkpoint + pluggable snapshot backend design, but the motivation should not frame it as fixing a current server-side SandboxSnapshot orchestration problem. The current pause/resume guide documents the server as patching BatchSandbox.spec.pause=true/false; the BatchSandbox controller then creates/reads the internal SandboxSnapshot, and registry plus push/pull secrets are controller-level configuration. Please update the motivation and OSEP-0008 comparison to use that current flow and implementation as the baseline, then justify why replacing the controller-owned internal snapshot path with inline BatchSandbox snapshot state is necessary.

Please also preserve the current pool-mode pause/resume behavior. Sandboxes created with extensions.poolRef are paused by detaching from the pool: the controller solidifies the allocated pool Pod into a concrete spec.template, clears spec.poolRef, releases the pooled allocation, and resumes later from that solidified template. If the new inline path becomes the default, it should either keep this detach-from-pool flow or explicitly leave pool-backed sandboxes on the legacy path; otherwise existing pool-backed pause/resume support regresses.

@ryanzhang-oss ryanzhang-oss force-pushed the copilot/worktree-2026-06-27T20-56-53 branch from e91cd5a to 1233df6 Compare July 2, 2026 00:32
// Snapshot references the current SandboxSnapshot result for the latest pause.
// The full per-container artifact detail lives on that object.
// +optional
Snapshot *SnapshotStatusRef `json:"snapshot,omitempty"`

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The description of SnapshotStatusRef here is inconsistent with the description in the Kubernetes Resource Overview section.


| Operating mode | What pause does | Filesystem | Memory | Status this revision |
|---|---|---|---|---|
| `Stop` | Delete the Pod and release compute. | Lost unless on a retained PVC (OSEP-0003). | Lost | Supported (no snapshot, no `SandboxSnapshotClass` required) |

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we need a Stop mode? What is the difference between Stop and Delete BatchSandbox? The client can rebuild from the original spec.

|---|---|---|---|---|
| `Stop` | Delete the Pod and release compute. | Lost unless on a retained PVC (OSEP-0003). | Lost | Supported (no snapshot, no `SandboxSnapshotClass` required) |
| `Freeze` | Keep the Pod, freeze container cgroups. | Kept | Kept (in node RAM) | Reserved (Kubernetes freeze is future work) |
| `Hibernate` | Capture full pod state via the snapshot Job, then delete the Pod. | Persisted as checkpoint artifact | Persisted as checkpoint artifact | **Implemented** (replaces the OSEP-0008 path) |

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe I should add a mode that only preserves the filesystem. In current AI Agent scenarios, recovery can be done solely through the filesystem, without needing memory?

condition.
- `Hibernate` deletes the Pod and reports public `Paused` only after the snapshot
artifact is durable.
- `snapshotClaimTemplateName` is a server materialization input. After generating

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The wording here — "snapshotClaimTemplateName is a server materialization input" — feels a bit awkward. Since there's no need for snapshotClaimTemplate at the controller level, why is this field still retained in spec.snapshotStrategy? Is it there for future extensibility?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

documentation Improvements or additions to documentation OSEP

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants