feat(workspace): add snapshot/hydration for disaster recovery (#167) by reidliu41 · Pull Request #1198 · nearai/ironclaw

reidliu41 · 2026-03-15T07:19:57Z

Summary

Add periodic export of identity docs + context/** to length-prefixed Markdown file on disk
Add startup hydration: restore from snapshot when workspace database is empty
Use byte-length framing so embedded markers/HTML in content round-trip exactly
Validate user_id has no whitespace before writing metadata (split_whitespace safety)
Validate marker values reject control chars and --> to prevent format injection
Detect malformed begin markers with non-numeric length as explicit format errors
Skip content regions during parsing so fake markers inside documents are never matched
Guard concurrent snapshot passes with AtomicBool; update cadence state only on success
Wire snapshot config through Agent, HeartbeatRunner, and spawn_heartbeat
Insert hydration step in app.rs startup chain: hydrate → import → seed → backfill
Each recovery step is best-effort — failure warns but never blocks subsequent steps
Add 31 unit tests covering round-trip fidelity, parse strictness, marker safety, and concurrency
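
The byte-length framing described in the list above can be illustrated with a small, std-only sketch. The marker layout (`<!-- begin: PATH length:N -->`) is taken from the regexes quoted later in the review; everything else (`FrameError`, `encode`, `decode`, the trailing newline after each body) is a hypothetical simplification, not the PR's actual code, which uses `Regex` and `SnapshotError`.

```rust
/// Hypothetical error type for this sketch (the PR uses `SnapshotError::Format`).
#[derive(Debug, PartialEq)]
pub enum FrameError {
    MalformedMarker(String),
    Truncated { path: String },
}

const BEGIN: &str = "<!-- begin: ";
const LEN: &str = " length:";
const END: &str = " -->\n";

/// Serialize (path, content) pairs, prefixing each body with its byte length.
pub fn encode(docs: &[(&str, &str)]) -> String {
    let mut out = String::new();
    for (path, content) in docs {
        out.push_str(&format!("{BEGIN}{path}{LEN}{}{END}", content.len()));
        out.push_str(content);
        out.push('\n');
    }
    out
}

/// Parse documents by jumping over each body using its declared byte length,
/// so marker-like text inside a body is never interpreted as a marker.
pub fn decode(text: &str) -> Result<Vec<(String, String)>, FrameError> {
    let mut docs = Vec::new();
    let mut pos = 0;
    while let Some(rel) = text[pos..].find(BEGIN) {
        let head_start = pos + rel + BEGIN.len();
        let head_len = text[head_start..]
            .find(END)
            .ok_or_else(|| FrameError::MalformedMarker("unterminated marker".to_string()))?;
        let head = &text[head_start..head_start + head_len];
        // Split on the last " length:" so the path itself may contain spaces.
        let (path, len_str) = head
            .rsplit_once(LEN)
            .ok_or_else(|| FrameError::MalformedMarker(head.to_string()))?;
        // A non-numeric length is an explicit format error, not a silent skip.
        let len: usize = len_str
            .parse()
            .map_err(|_| FrameError::MalformedMarker(head.to_string()))?;
        let body_start = head_start + head_len + END.len();
        // `get` returns None if the range is out of bounds or splits a UTF-8 char.
        let body = body_start
            .checked_add(len)
            .and_then(|end| text.get(body_start..end))
            .ok_or_else(|| FrameError::Truncated { path: path.to_string() })?;
        docs.push((path.to_string(), body.to_string()));
        pos = body_start + len; // skip the content region entirely
    }
    Ok(docs)
}
```

Because the parser advances by the declared length rather than scanning for an end delimiter, a document that itself contains a begin marker round-trips unchanged.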

Change Type

Linked Issue

Closes #167

Validation

cargo fmt
cargo clippy --all --benches --tests --examples --all-features
[x] Relevant tests pass:
Manual testing:

Security Impact

Security Impact: Writes sensitive workspace content to local disk for recovery. Files remain plaintext; Unix permissions are
restricted to 0600. No encryption/signing/attestation in this PR.

Database Impact

None

Blast Radius

Rollback Plan

Review track:


gemini-code-assist · 2026-03-15T07:20:24Z

Summary of Changes


This pull request introduces a critical disaster recovery mechanism for the workspace by enabling periodic snapshots of essential documents and an automatic hydration process upon application startup. This ensures that even if the primary workspace database is lost or empty, core user data and context can be restored from a file-based backup. The implementation focuses on data integrity through a robust, byte-length-prefixed format and includes comprehensive validation to prevent data corruption or injection, making the system more resilient and reliable.

Highlights

Disaster Recovery via Workspace Snapshot: Implemented a new feature for periodic export of core workspace documents (identity files, MEMORY.md, and context/**) to a length-prefixed Markdown snapshot file on disk.
Startup Hydration: Added a startup hydration mechanism that restores the workspace from a snapshot file if the database is found to be empty, ensuring data persistence across restarts.
Robust Snapshot Format: Designed the snapshot format using byte-length framing to guarantee exact round-trip fidelity for document content, including embedded markers or HTML.
Security and Strictness: Incorporated validation for user IDs to prevent whitespace issues and for marker values to reject control characters or HTML comment terminators, mitigating format injection risks. Malformed markers are explicitly detected as errors.
Concurrency and Cadence Management: Introduced an AtomicBool guard to prevent concurrent snapshot passes and updated cadence state only upon successful snapshot completion, ensuring reliability.
Integration and Configuration: Integrated snapshot configuration through the Agent, HeartbeatRunner, and spawn_heartbeat functions, and added a hydration step to the application's startup chain (hydrate → import → seed → backfill), with each recovery step designed to be best-effort.
Comprehensive Testing: Included 31 new unit tests covering round-trip fidelity, parse strictness, marker safety, and concurrency aspects of the snapshot and hydration functionality.
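
The best-effort startup chain mentioned in the highlights (hydrate → import → seed → backfill) could be sketched roughly as follows. This is an illustration only: `run_recovery_chain` and the boxed-closure step type are hypothetical, and the real code in app.rs is async and presumably logs a warning rather than printing to stderr.

```rust
/// Run each recovery step in order. A failing step is reported and recorded,
/// but never prevents the following steps from running.
/// Returns the names of the steps that failed.
pub fn run_recovery_chain(
    steps: Vec<(&'static str, Box<dyn FnOnce() -> Result<(), String>>)>,
) -> Vec<&'static str> {
    let mut failed = Vec::new();
    for (name, step) in steps {
        if let Err(e) = step() {
            // stderr keeps this sketch std-only; a logger would be used in practice.
            eprintln!("startup recovery: {name} failed: {e}");
            failed.push(name);
        }
    }
    failed
}
```

Returning the failed step names (rather than a single `Result`) keeps the "warn but continue" behaviour visible to the caller.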

Changelog

src/agent/agent_loop.rs
- Added snapshot_config field to the Agent struct.
- Updated Agent::new constructor to accept snapshot_config.
- Modified spawn_heartbeat call to pass the resolved snapshot_config.
src/agent/commands.rs
- Updated HeartbeatRunner::new instantiation to include a default, disabled SnapshotConfig for manual heartbeats.
src/agent/dispatcher.rs
- Updated test Agent::new calls to accommodate the new snapshot_config parameter with None.
src/agent/heartbeat.rs
- Added snapshot_config field to the HeartbeatRunner struct.
- Updated HeartbeatRunner::new constructor to accept snapshot_config.
- Integrated snapshot_if_due into the run_heartbeat_loop to perform snapshots after hygiene.
- Modified spawn_heartbeat function signature and call to accept and pass snapshot_config.
src/app.rs
- Introduced a 'Startup recovery chain' including a hydration step from a snapshot if the workspace is empty and a snapshot file exists.
src/config/mod.rs
- Added snapshot module to mod.rs.
- Exported SnapshotConfig from the config module.
- Added snapshot: SnapshotConfig field to the main Config struct.
- Initialized and resolved SnapshotConfig within Config::default() and Config::resolve().
src/config/snapshot.rs
- Added new module defining SnapshotConfig for application-level settings.
- Implemented Default trait for SnapshotConfig with reasonable defaults.
- Provided resolve() method to load configuration from environment variables.
- Added to_workspace_config() method to convert application config to workspace-specific config, including user ID sanitization for file paths.
- Included unit tests for default config, user ID sanitization, and path resolution.
src/main.rs
- Updated Agent::new call in async_main to pass the config.snapshot.
src/workspace/README.md
- Updated the spawn_heartbeat example to reflect the new snapshot_config parameter.
src/workspace/mod.rs
- Added pub mod snapshot; to expose the new snapshot module.
src/workspace/snapshot.rs
- Added new module implementing workspace snapshot and hydration logic.
- Defined SnapshotConfig (workspace-side), SnapshotReport, HydrationReport, and SnapshotError.
- Implemented snapshot_if_due for periodic document export with cadence and concurrency control.
- Implemented hydrate_from_snapshot for restoring documents from a snapshot file, including version and user ID validation.
- Defined document allowlists (SNAPSHOT_DOCS, SNAPSHOT_PREFIXES) for inclusion in snapshots.
- Introduced byte-length framing for robust content round-tripping.
- Added validate_marker_safe to prevent format injection and sanitize_user_id for safe file paths.
- Included extensive unit tests for format fidelity, parsing strictness, marker safety, allowlisting, state persistence, and concurrency.
tests/e2e_routine_heartbeat.rs
- Updated HeartbeatRunner::new calls in end-to-end tests to include a default SnapshotConfig.
tests/heartbeat_integration.rs
- Updated HeartbeatRunner::new call in integration tests to include a default SnapshotConfig.
tests/support/gateway_workflow_harness.rs
- Updated Agent::new call in test harness to include None for the new snapshot_config parameter.
tests/support/test_rig.rs
- Updated Agent::new call in test rig to include None for the new snapshot_config parameter.
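
For readers unfamiliar with the "cadence and concurrency control" attributed to snapshot_if_due above, here is a minimal std-only sketch of the general pattern: an `AtomicBool` flag released by an RAII guard, with the cadence timestamp updated only after a successful export. The names `RunningGuard`, `Cadence`, and `snapshot_if_due_sketch` are hypothetical and the function is synchronous for brevity; the PR's real function is async and has a different signature.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::time::{Duration, Instant};

static SNAPSHOT_RUNNING: AtomicBool = AtomicBool::new(false);

/// RAII guard: clears the flag when dropped, including on early return.
pub struct RunningGuard;

impl RunningGuard {
    /// Returns `None` if another pass already holds the flag.
    pub fn try_acquire() -> Option<Self> {
        SNAPSHOT_RUNNING
            .compare_exchange(false, true, Ordering::AcqRel, Ordering::Acquire)
            .ok()
            .map(|_| RunningGuard)
    }
}

impl Drop for RunningGuard {
    fn drop(&mut self) {
        SNAPSHOT_RUNNING.store(false, Ordering::Release);
    }
}

pub struct Cadence {
    pub interval: Duration,
    pub last_success: Option<Instant>,
}

/// Returns Ok(true) if an export ran, Ok(false) if the pass was skipped
/// (another pass is running, or the interval has not elapsed yet).
pub fn snapshot_if_due_sketch(
    cadence: &mut Cadence,
    now: Instant,
    export: impl FnOnce() -> Result<(), String>,
) -> Result<bool, String> {
    let Some(_guard) = RunningGuard::try_acquire() else {
        return Ok(false);
    };
    let due = cadence
        .last_success
        .map_or(true, |t| now.saturating_duration_since(t) >= cadence.interval);
    if !due {
        return Ok(false);
    }
    export()?; // on failure the cadence state is left untouched
    cadence.last_success = Some(now);
    Ok(true)
}
```

Leaving `last_success` untouched on failure means the next heartbeat retries immediately instead of waiting a full interval.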

Activity

The pull request introduces a new feature for workspace snapshot and hydration.
The PR description provides a detailed summary of changes and validation steps.


gemini-code-assist

Code Review

The pull request introduces a new workspace snapshot feature, enabling periodic export of core workspace documents to a structured Markdown file and hydration from this snapshot on startup. This involves adding a SnapshotConfig struct, a dedicated src/workspace/snapshot.rs module for snapshot/hydration logic, and integrating this functionality into the agent's heartbeat process and application startup. Review comments suggest several improvements: simplifying snapshot configuration creation in agent_loop.rs, adding error logging for JSON serialization failures in snapshot.rs, optimizing regex compilation in snapshot.rs for performance, and completing the spawn_heartbeat example in README.md for clarity.

gemini-code-assist · 2026-03-15T07:23:35Z

src/agent/agent_loop.rs

+                    let snapshot = self
+                        .snapshot_config
+                        .as_ref()
+                        .map(|s| s.to_workspace_config(workspace.user_id()))
+                        .unwrap_or_else(|| {
+                            crate::config::SnapshotConfig::default()
+                                .to_workspace_config(workspace.user_id())
+                        });


This logic for creating the snapshot configuration can be simplified to be more concise and readable, reducing redundancy.

let snapshot = self
    .snapshot_config
    .as_ref()
    .cloned()
    .unwrap_or_default()
    .to_workspace_config(workspace.user_id());
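
To check that the suggested chain is behaviourally equivalent to the original `map` / `unwrap_or_else` form, a self-contained sketch with stand-in types may help. `AppSnapshotConfig`, `WorkspaceSnapshotConfig`, their fields, and both `resolve_*` functions are hypothetical; only the method chain mirrors the suggestion.

```rust
/// Stand-in for the application-level config (fields are illustrative).
#[derive(Clone, Debug, Default, PartialEq)]
pub struct AppSnapshotConfig {
    pub enabled: bool,
    pub interval_secs: u64,
}

/// Stand-in for the workspace-side config.
#[derive(Debug, PartialEq)]
pub struct WorkspaceSnapshotConfig {
    pub enabled: bool,
    pub interval_secs: u64,
    pub file_name: String,
}

impl AppSnapshotConfig {
    pub fn to_workspace_config(&self, user_id: &str) -> WorkspaceSnapshotConfig {
        WorkspaceSnapshotConfig {
            enabled: self.enabled,
            interval_secs: self.interval_secs,
            file_name: format!("snapshot-{user_id}.md"),
        }
    }
}

/// Original shape: map the present value, fall back to a default otherwise.
pub fn resolve_verbose(cfg: Option<&AppSnapshotConfig>, user_id: &str) -> WorkspaceSnapshotConfig {
    cfg.map(|s| s.to_workspace_config(user_id))
        .unwrap_or_else(|| AppSnapshotConfig::default().to_workspace_config(user_id))
}

/// Suggested shape: clone-or-default first, then convert once.
pub fn resolve_concise(cfg: Option<&AppSnapshotConfig>, user_id: &str) -> WorkspaceSnapshotConfig {
    cfg.cloned().unwrap_or_default().to_workspace_config(user_id)
}
```

The concise form costs one clone of the config when it is present, which is usually negligible for a small settings struct.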

gemini-code-assist · 2026-03-15T07:23:35Z

src/workspace/snapshot.rs

+    let Ok(json) = serde_json::to_string_pretty(&state) else {
+        return;
+    };


The function returns silently if JSON serialization fails. It would be better to log a warning to aid in debugging, consistent with the error handling for I/O operations in this function.

let json = match serde_json::to_string_pretty(&state) {
    Ok(j) => j,
    Err(e) => {
        tracing::warn!("snapshot: failed to serialize state: {e}");
        return;
    }
};

References

Create specific error variants for different failure modes (e.g., DownloadFailed with a URL string vs. ManifestRead with a file path) to provide semantically correct and clear error messages. Logging a warning for serialization failure aligns with the goal of providing clear error messages for debugging.

gemini-code-assist · 2026-03-15T07:23:36Z

src/workspace/snapshot.rs

+    let re =
+        Regex::new(r"<!-- begin: (.+) length:(\d+) -->").map_err(|e| SnapshotError::Format {
+            reason: format!("regex compile error: {e}"),
+        })?;
+
+    let mut results = Vec::new();
+    let mut pos = 0;
+
+    let malformed_re =
+        Regex::new(r"<!-- begin: .+ length:\S+ -->").map_err(|e| SnapshotError::Format {
+            reason: format!("regex compile error: {e}"),
+        })?;


These regular expressions are compiled on every function call, which can impact performance. For constant patterns, it's more efficient to compile them once using std::sync::LazyLock.

use std::sync::LazyLock;

static BEGIN_MARKER_RE: LazyLock<Regex> = LazyLock::new(|| {
    Regex::new(r"<!-- begin: (.+) length:(\d+) -->").expect("BEGIN_MARKER_RE regex is valid")
});

static MALFORMED_BEGIN_MARKER_RE: LazyLock<Regex> = LazyLock::new(|| {
    Regex::new(r"<!-- begin: .+ length:\S+ -->").expect("MALFORMED_BEGIN_MARKER_RE regex is valid")
});

/// Parse documents by byte-length framing.
///
/// Strict: all format errors produce `SnapshotError::Format`.
fn parse_snapshot_documents(text: &str) -> Result<Vec<(String, String)>, SnapshotError> {
    let mut results = Vec::new();
    let mut pos = 0;
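
The one-time initialization behaviour that motivates this suggestion can be observed without the `regex` crate. In the sketch below a plain `String` stands in for the compiled pattern, and `INIT_COUNT`, `BEGIN_PREFIX`, and `is_begin_marker` are hypothetical names used only for illustration. It assumes a toolchain where `std::sync::LazyLock` is stable (Rust 1.80 or later).

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::LazyLock;

/// Counts how many times the initializer below actually runs.
static INIT_COUNT: AtomicUsize = AtomicUsize::new(0);

/// Stand-in for an expensive-to-build value such as a compiled regex.
static BEGIN_PREFIX: LazyLock<String> = LazyLock::new(|| {
    INIT_COUNT.fetch_add(1, Ordering::SeqCst);
    String::from("<!-- begin: ")
});

/// Every call reuses the same lazily built value.
pub fn is_begin_marker(line: &str) -> bool {
    line.starts_with(BEGIN_PREFIX.as_str())
}
```

The initializer runs on first access only, so repeated hydration calls would pay the construction cost once per process.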

gemini-code-assist · 2026-03-15T07:23:36Z

src/workspace/README.md

+    snapshot_path: PathBuf::new(),
+    state_path: PathBuf::new(),
+};
+spawn_heartbeat(config, hygiene_config, snapshot, workspace, llm, response_tx, store);


The code example for spawn_heartbeat is incomplete as hygiene_config and store are used without being defined. This could be confusing for developers trying to use the example. Please define these variables to make the example self-contained and runnable.

Suggested change

-spawn_heartbeat(config, hygiene_config, snapshot, workspace, llm, response_tx, store);
+use std::sync::Arc;
+use crate::workspace::hygiene::HygieneConfig;
+let hygiene_config = HygieneConfig::default();
+let store: Option<Arc<dyn crate::db::Database>> = None;
+spawn_heartbeat(config, hygiene_config, snapshot, workspace, llm, response_tx, store);

zmanian

Review: REQUEST CHANGES

Solid design -- periodic export of workspace docs to a length-prefixed Markdown file with startup hydration when the DB is empty. Fits naturally into the heartbeat lifecycle. But two blocking issues need fixing.

Critical (blocking)

C1: Synchronous filesystem I/O in async context
snapshot_if_due and hydrate_from_snapshot are async but use std::fs::write, std::fs::read_to_string, std::fs::rename, std::fs::create_dir_all. These block the tokio executor thread. For snapshot files with many documents, this can starve other async tasks. Use tokio::fs equivalents or wrap in tokio::task::spawn_blocking.

C2: Regex compiled on every call
Two Regex::new() calls inside parse_snapshot_documents, which runs on every hydration. Should use std::sync::LazyLock for one-time compilation.

Important (should fix)

I1: Snapshot enabled by default
SnapshotConfig::default() sets enabled: true. Every existing deployment will start writing snapshot files to ~/.ironclaw/ after upgrade without opt-in. For a new feature that writes to disk, defaulting to enabled: false would be more conservative.

I2: Snapshot file is plaintext -- security impact understated
The PR says "Security Impact: None", but secrets or sensitive content in MEMORY.md, IDENTITY.md, or identity docs will be written to disk in cleartext with default file permissions. At minimum: (a) document that snapshot files may contain sensitive workspace content, (b) consider setting restrictive permissions (0600) on the snapshot file.

I3: Missing HEARTBEAT.md and TOOLS.md from snapshot allowlist
These are identity/configuration files per the workspace spec. Losing them after a database wipe is a significant disaster recovery gap. Intentional exclusion?

I4: Greedy regex ambiguity with space-containing paths
The begin-marker regex with greedy .+ breaks if a path contains the literal string length:. The test marker_allows_space_in_path explicitly allows spaces. Reject paths containing length: in validate_marker_safe to prevent parsing ambiguity.
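
A small std-only sketch of the ambiguity and the proposed rejection rule follows. The function here is a hypothetical stand-in that only borrows the name `validate_marker_safe`; the checks mirror what the PR description and this review list (control characters, `-->`, and ` length:`), but the real implementation and its error type may differ. `split_first` and `split_last` are illustrative helpers that show how two reasonable parsers can disagree on the same header.

```rust
/// Hypothetical stand-in for the PR's marker validation.
pub fn validate_marker_safe(value: &str) -> Result<(), String> {
    if value.chars().any(char::is_control) {
        return Err("marker value contains a control character".to_string());
    }
    if value.contains("-->") {
        return Err("marker value contains '-->'".to_string());
    }
    if value.contains(" length:") {
        return Err("marker value contains ' length:'".to_string());
    }
    Ok(())
}

/// Split a header of the form "PATH length:N" at the first " length:".
pub fn split_first(header: &str) -> Option<(&str, &str)> {
    header.split_once(" length:")
}

/// Split the same header at the last " length:" (what a greedy match picks).
pub fn split_last(header: &str) -> Option<(&str, &str)> {
    header.rsplit_once(" length:")
}
```

Once paths containing ` length:` are rejected at write time, a valid header has exactly one such separator and both splits agree.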

What's good

Byte-length framing (not delimiter-based) -- content round-trips exactly
Atomic writes via tmp-file + rename for crash safety
Path traversal protection via sanitize_user_id, marker injection prevention, hydration allowlist
Concurrency guard via AtomicBool with RAII drop matches existing hygiene pattern
31 unit tests covering round-trip fidelity, parse strictness, marker safety
No .unwrap() in production code, CI all green
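
As an illustration of the tmp-file + rename pattern praised above, combined with the 0600 permissions requested in I2, here is a minimal synchronous sketch. `write_snapshot_atomically` and the `.tmp` naming are assumptions, not the PR's code; per C1, inside an async function the same steps would go through tokio::fs or tokio::task::spawn_blocking rather than std::fs directly.

```rust
use std::fs;
use std::io::{self, Write};
use std::path::Path;

/// Write `contents` to a sibling temp file, then rename it over `path`.
/// Readers see either the old snapshot or the new one, never a partial file.
pub fn write_snapshot_atomically(path: &Path, contents: &str) -> io::Result<()> {
    if let Some(parent) = path.parent() {
        fs::create_dir_all(parent)?;
    }
    let tmp = path.with_extension("tmp");
    let mut opts = fs::OpenOptions::new();
    opts.write(true).create(true).truncate(true);
    #[cfg(unix)]
    {
        use std::os::unix::fs::OpenOptionsExt;
        // Owner read/write only. The mode applies when the file is created
        // and is still subject to the process umask.
        opts.mode(0o600);
    }
    let mut file = opts.open(&tmp)?;
    file.write_all(contents.as_bytes())?;
    file.sync_all()?; // flush data to disk before the rename publishes it
    drop(file);
    fs::rename(&tmp, path)
}
```

Setting the mode at creation time avoids a window in which the temp file exists with default permissions.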

Suggestions

Add SnapshotConfig::disabled() constructor for test ergonomics (currently duplicated across 5 test files)
Add an integration test for the full hydrate flow with a real Workspace instance
Consider SnapshotOutcome::Skipped | Exported { path, count } enum instead of Option<PathBuf> + skipped: bool
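
The last suggestion might look roughly like this. The enum variants follow the names given in the suggestion; `outcome_from_parts` and `describe` are hypothetical helpers added only to show why the enum is harder to misuse than an `Option<PathBuf>` paired with a `skipped` flag.

```rust
use std::path::PathBuf;

/// Outcome of one snapshot pass, with no contradictory states.
#[derive(Debug, PartialEq)]
pub enum SnapshotOutcome {
    Skipped,
    Exported { path: PathBuf, count: usize },
}

/// Hypothetical adapter from the flag-plus-option shape. The two `None`
/// cases are exactly the states the enum makes unrepresentable.
pub fn outcome_from_parts(
    path: Option<PathBuf>,
    skipped: bool,
    count: usize,
) -> Option<SnapshotOutcome> {
    match (skipped, path) {
        (true, None) => Some(SnapshotOutcome::Skipped),
        (false, Some(path)) => Some(SnapshotOutcome::Exported { path, count }),
        _ => None, // skipped with a path, or exported without one
    }
}

/// Callers match exhaustively instead of checking two fields.
pub fn describe(outcome: &SnapshotOutcome) -> String {
    match outcome {
        SnapshotOutcome::Skipped => "snapshot skipped".to_string(),
        SnapshotOutcome::Exported { path, count } => {
            format!("exported {count} documents to {}", path.display())
        }
    }
}
```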

With the sync I/O fix (#C1) and regex compilation fix (#C2), this would be ready to merge.

zmanian · 2026-03-15T20:06:11Z

Maintainer note: confidential VM use case

This snapshot/hydration architecture is potentially valuable for confidential VM (TEE) deployments, where all in-memory state is lost on shutdown. The serialize-restore lifecycle is exactly what's needed for VM upgrades (snapshot -> terminate -> launch new attested instance -> hydrate) and crash recovery.

However, the current implementation is not yet suitable for TEE use cases without additional layers:

No encryption at rest -- snapshots are plaintext. Workspace content (memory, identity, potentially secrets) would be exposed outside the TEE boundary when written to persistent storage.
No integrity verification -- no HMAC or signature on the snapshot file. A tampered snapshot could inject content on hydration.
No attestation gating -- hydration runs unconditionally at startup. In a TEE context, the new VM instance should prove its integrity before receiving snapshot data.

For TEE support, we'd need to layer on: (a) encrypt snapshots with a key sealed to a KMS with attestation-gated access, (b) HMAC/sign the snapshot for tamper detection, (c) gate hydration on remote attestation of the new instance.

The snapshot format and lifecycle design here is the right foundation -- these crypto/attestation layers could be added on top without changing the core serialize/parse logic. Worth keeping this use case in mind as the feature evolves.

- Use tokio::fs instead of std::fs in async functions (C1)
- Hoist Regex to static LazyLock (C2)
- Default snapshot enabled to false (I1)
- Set 0600 file permissions on Unix, add security docs (I2)
- Add HEARTBEAT.md and TOOLS.md to snapshot allowlist (I3)
- Reject paths containing ' length:' in marker validation (I4)

reidliu41 · 2026-03-16T04:58:42Z

Thanks for the note on confidential VM / TEE applicability. Agreed that the current snapshot format and lifecycle are the right foundation, and that crypto / attestation layers can be added on top without changing the core serialize / parse flow.
I've opened #1235 to track the hardening that IronClaw can do standalone: snapshot encryption at rest plus integrity verification / tamper detection. Attestation-gated hydration is intentionally out of scope there, since that depends on external platform / infrastructure decisions.

gemini-code-assist bot reviewed Mar 15, 2026

zmanian requested changes Mar 15, 2026

github-actions bot added contributor: core 20+ merged PRs and removed contributor: experienced 6-19 merged PRs labels Mar 16, 2026

reidliu41 added 2 commits March 16, 2026 10:59

fix the code style check

f8e0cca

fix the code style check - 1

962088c

reidliu41 mentioned this pull request Mar 16, 2026

Snapshot encryption at rest with integrity verification #1235

Open

reidliu41 wants to merge 4 commits into nearai:staging from reidliu41:feat/167-workspace-snapshot
