Enhance sandbox pause/resume #1149
- Added OSEP-0013: Isolated Execution API (implementing, due 2026-06-23)
- Added OSEP-0014: Multi-Tenancy Support for Kubernetes Runtime (draft, due 2026-05-07)
- Added OSEP-0015: Spec-Driven Pod Snapshot for Pause and Resume (draft, due 2026-06-27)
Pull request overview
This PR updates the OSEP index and introduces a new enhancement proposal documenting a spec-driven Kubernetes pause/resume design using inline snapshot state on BatchSandbox plus new SnapshotClass/SnapshotClaim CRDs.
Changes:
- Add OSEP-0013/0014/0015 entries to `oseps/README.md`.
- Add new proposal document `oseps/0015-pod-snapshot.md` describing pod snapshot-based pause/resume semantics and APIs.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| oseps/README.md | Extends the OSEP index table with entries for OSEP-0013 through OSEP-0015. |
| oseps/0015-pod-snapshot.md | Adds the full OSEP-0015 draft describing the proposed pause/resume and snapshot CRD model. |
routes in `server/opensandbox_server/api/lifecycle.py` currently accept only the
path parameter and `X-Request-ID`. This OSEP updates those definitions to accept
an optional `PauseSandboxRequest` body. Omitting the body, sending `{}`, or using
old SDKs remains valid:

| Operating mode | What pause does | Filesystem | Memory | Status this revision |
|---|---|---|---|---|
| `Stop` | Delete the Pod and release compute. | Lost unless on a retained PVC (OSEP-0003). | Lost | Supported (no snapshot, no `SnapshotClass` required) |
| `Freeze` | Keep the Pod, freeze container cgroups. | Kept | Kept (in node RAM) | Reserved (Kubernetes freeze is future work) |
| `Hibernate` | Capture full pod state via the snapshot Job, then delete the Pod. | Persisted as checkpoint artifact | Persisted as checkpoint artifact | **Implemented** (replaces the OSEP-0008 path) |

This revision implements one snapshot mechanism: a same-node full-pod-state
snapshot Job. The `type` enum is the extension point for future backends without
changing the `BatchSandbox`
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 2e3285b20c
9. Ctrl status.snapshot.phase: Pending -> Committing -> Ready,
   artifacts[] populated
10. Ctrl Delete the Pod, status.phase = Paused.
Recheck the source Pod UID before deleting
In clusters where the sandbox pod is evicted or recreated while the checkpoint job is running, this flow can record an artifact for one pod and then delete/report Paused for a replacement pod with the same name. Since the proposal records sourcePodUID, require the controller to compare that UID with the current live pod before deletion and fail closed if it changed, otherwise resume can restore stale state and lose writes from the replacement pod.
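The guard suggested in this comment could look roughly like the following. It is a minimal Python sketch, not the controller's actual code: `PodRef`, `SourcePodChanged`, and `check_source_pod` are hypothetical names, and only the `sourcePodUID` field comes from the proposal. Treating an already-missing Pod as "nothing to delete" is also an assumption.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class PodRef:
    """Hypothetical minimal view of a live Pod: just its name and UID."""
    name: str
    uid: str


class SourcePodChanged(Exception):
    """Raised when the live Pod is not the Pod that was checkpointed."""


def check_source_pod(source_pod_uid: str, live_pod: Optional[PodRef]) -> bool:
    """Return True if the live Pod may be deleted, False if it is already gone.

    Fails closed: a Pod with the same name but a different UID is a
    replacement, so it must not be deleted or reported as Paused.
    """
    if live_pod is None:
        # Assumption: nothing to delete; the caller decides what to report.
        return False
    if live_pod.uid != source_pod_uid:
        raise SourcePodChanged(
            f"pod {live_pod.name!r} has UID {live_pod.uid!r}, "
            f"but the snapshot was taken from UID {source_pod_uid!r}"
        )
    return True
```

A controller would call this immediately before the delete in step 10 and surface `SourcePodChanged` as a failed pause rather than retrying the delete.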
When no `SnapshotClass` is referenced by that claim and none is annotated
default, the controller synthesizes an implicit default class from the legacy
startup flags (`--snapshot-registry`,
`--snapshot-registry-insecure`) so existing clusters keep working with zero
Migrate registry secret settings too
Existing OSEP-0008 deployments commonly require --snapshot-push-secret and --resume-pull-secret (the controller flags and pause-resume guide both document them), but this zero-config migration only synthesizes defaults from --snapshot-registry and --snapshot-registry-insecure while the new design forbids Kubernetes Secrets. Authenticated registries will stop working after rollout unless the migration preserves those credentials somehow or the proposal removes the zero-config compatibility claim.
metadata:
  name: sandbox-abc123-pause-20260627
  namespace: default
  labels:
    opensandbox.io/sandbox-id: sandbox-abc123
    opensandbox.io/generated-from-template: tenant-s3
spec:
  snapshotClassName: s3
  parameters:
    region: us-west-2
    bucket: opensandbox-snapshots
    prefix: tenants/default/sandbox-abc123/pauses/20260627
Make generated claim names unique per pause
When the same sandbox pauses from the same template more than once on the same day, this generated claim name and prefix collide with the previous pause. That conflicts with the design's rule that per-pause values must create a new claim instead of mutating a shared one, and with the e2e expectation that two template pauses produce distinct claims/prefixes; include a generation, timestamp with sufficient granularity, or nonce in both the claim name and default prefix.
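One way to get per-pause uniqueness is to combine a second-granularity UTC timestamp with a short random nonce and reuse the same suffix in both the claim name and the default prefix. The sketch below is illustrative only: `pause_claim_identity` is a hypothetical helper, and the prefix layout is inferred from the example in the quoted hunk rather than taken from the proposal's rules.

```python
import secrets
from datetime import datetime, timezone
from typing import Optional, Tuple


def pause_claim_identity(
    namespace: str,
    sandbox_id: str,
    now: Optional[datetime] = None,
) -> Tuple[str, str]:
    """Return a (claim_name, prefix) pair that is unique per pause.

    The suffix uses only lowercase letters, digits, and '-' so the claim
    name stays a valid Kubernetes object name. Length limits are not
    enforced here.
    """
    now = now or datetime.now(timezone.utc)
    stamp = now.strftime("%Y%m%dt%H%M%Sz")  # e.g. 20260627t101500z
    nonce = secrets.token_hex(4)            # 8 hex chars; guards same-second pauses
    suffix = f"{stamp}-{nonce}"
    claim_name = f"{sandbox_id}-pause-{suffix}"
    prefix = f"tenants/{namespace}/{sandbox_id}/pauses/{suffix}"
    return claim_name, prefix
```

Sharing one suffix keeps the claim and its artifact prefix easy to correlate. A monotonically increasing pause generation stored on the sandbox would work as well and would avoid relying on randomness.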
The implementation updates these source definitions:

- `specs/sandbox-lifecycle.yml`: add optional `requestBody` for
  `/sandboxes/{sandboxId}/pause` that references `PauseSandboxRequest`; keep
  `/resume` bodyless in this revision.
- `server/opensandbox_server/api/schema.py`: add Pydantic models for
  `PauseSandboxRequest`, `SnapshotPauseRequest`, and
  inline/generated claim options.
- `server/opensandbox_server/api/lifecycle.py`: accept
  `body: PauseSandboxRequest | None = Body(None)` on `pause_sandbox`.
- Kubernetes provider/runtime code: materialize the selected/generated
  `SnapshotClaim`, then patch the `BatchSandbox`.
Include SDK updates for the new pause body
This implementation plan updates the OpenAPI spec, FastAPI schema, route, and Kubernetes provider, but omits SDK interfaces. Current supported SDKs expose PauseSandbox(id) and send no request body, so advanced callers still cannot choose Stop, a claim, or a template through the public clients even though the proposal says new clients can opt in; add generated/handwritten SDK model and method updates to the required source changes.
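A backward-compatible SDK surface could keep the existing no-argument call and add an optional request object. The Python sketch below only builds the path and body so it stays independent of any HTTP client. The class mirrors the `PauseSandboxRequest` name from the proposal, but its fields (`mode`, `snapshot_claim_name`, `snapshot_claim_template_name`), their JSON keys, and the `build_pause_call` helper are assumptions for illustration, not the proposal's schema.

```python
import json
from dataclasses import dataclass
from typing import Dict, Optional, Tuple


@dataclass
class PauseSandboxRequest:
    """Hypothetical client-side model; field names are illustrative only."""
    mode: Optional[str] = None                          # e.g. "Stop" or "Hibernate"
    snapshot_claim_name: Optional[str] = None           # reference an existing claim
    snapshot_claim_template_name: Optional[str] = None  # let the server generate one

    def to_json(self) -> Dict[str, str]:
        """Serialize only the fields the caller actually set."""
        fields = {
            "mode": self.mode,
            "snapshotClaimName": self.snapshot_claim_name,
            "snapshotClaimTemplateName": self.snapshot_claim_template_name,
        }
        return {key: value for key, value in fields.items() if value is not None}


def build_pause_call(
    sandbox_id: str,
    request: Optional[PauseSandboxRequest] = None,
) -> Tuple[str, Optional[bytes]]:
    """Return (path, body) for a pause call.

    With no request object the body is None, which matches what existing
    SDKs already send, so old call sites keep working unchanged.
    """
    path = f"/sandboxes/{sandbox_id}/pause"
    if request is None:
        return path, None
    return path, json.dumps(request.to_json()).encode("utf-8")
```

For example, `build_pause_call("sandbox-abc123")` sends no body, while passing `PauseSandboxRequest(mode="Stop")` opts in to the new behavior.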
2e3285b to 496889b
496889b to e91cd5a
This is useful as a future Hibernate/checkpoint + pluggable snapshot backend design, but the motivation should not frame it as fixing a current server-side

Please also preserve the current pool-mode pause/resume behavior. Sandboxes created with
e91cd5a to 1233df6
// Snapshot references the current SandboxSnapshot result for the latest pause.
// The full per-container artifact detail lives on that object.
// +optional
Snapshot *SnapshotStatusRef `json:"snapshot,omitempty"`
The description of SnapshotStatusRef here is inconsistent with the description in the Kubernetes Resource Overview section.
| Operating mode | What pause does | Filesystem | Memory | Status this revision |
|---|---|---|---|---|
| `Stop` | Delete the Pod and release compute. | Lost unless on a retained PVC (OSEP-0003). | Lost | Supported (no snapshot, no `SandboxSnapshotClass` required) |
Why do we need a Stop mode? What is the difference between Stop and Delete BatchSandbox? The client can rebuild from the original spec.
| Operating mode | What pause does | Filesystem | Memory | Status this revision |
|---|---|---|---|---|
| `Stop` | Delete the Pod and release compute. | Lost unless on a retained PVC (OSEP-0003). | Lost | Supported (no snapshot, no `SandboxSnapshotClass` required) |
| `Freeze` | Keep the Pod, freeze container cgroups. | Kept | Kept (in node RAM) | Reserved (Kubernetes freeze is future work) |
| `Hibernate` | Capture full pod state via the snapshot Job, then delete the Pod. | Persisted as checkpoint artifact | Persisted as checkpoint artifact | **Implemented** (replaces the OSEP-0008 path) |
Maybe I should add a mode that only preserves the filesystem. In current AI Agent scenarios, recovery can be done solely through the filesystem, without needing memory?
condition.
- `Hibernate` deletes the Pod and reports public `Paused` only after the snapshot
  artifact is durable.
- `snapshotClaimTemplateName` is a server materialization input. After generating
The wording here — "snapshotClaimTemplateName is a server materialization input" — feels a bit awkward. Since there's no need for snapshotClaimTemplate at the controller level, why is this field still retained in spec.snapshotStrategy? Is it there for future extensibility?
Summary
This pull request added OSEP-0015 for "Spec-Driven Pod Snapshot for Pause and Resume" with status "draft" and date "2026-06-27".
Testing
Breaking Changes
Checklist