
feat(workspace): add snapshot/hydration for disaster recovery (#167) #1198

Open
reidliu41 wants to merge 4 commits into nearai:staging from reidliu41:feat/167-workspace-snapshot

Contributor

@reidliu41 reidliu41 commented Mar 15, 2026

Summary

  • Add periodic export of identity docs + context/** to length-prefixed Markdown file on disk
  • Add startup hydration: restore from snapshot when workspace database is empty
  • Use byte-length framing so embedded markers/HTML in content round-trip exactly
  • Validate user_id has no whitespace before writing metadata (split_whitespace safety)
  • Validate marker values reject control chars and --> to prevent format injection
  • Detect malformed begin markers with non-numeric length as explicit format errors
  • Skip content regions during parsing so fake markers inside documents are never matched
  • Guard concurrent snapshot passes with AtomicBool; update cadence state only on success
  • Wire snapshot config through Agent, HeartbeatRunner, and spawn_heartbeat
  • Insert hydration step in app.rs startup chain: hydrate → import → seed → backfill
  • Each recovery step is best-effort — failure warns but never blocks subsequent steps
  • Add 31 unit tests covering round-trip fidelity, parse strictness, marker safety, and concurrency
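The byte-length framing described in the bullets above could look roughly like the following. This is a minimal sketch, not the PR's code: the marker shape `<!-- begin: PATH length:N -->` is taken from the regex quoted in the review, while the function names `frame_document` and `parse_framed`, the newline after the marker, and the `String` error type are assumptions made for illustration.

```rust
/// Frame one document: a begin marker carrying the byte length,
/// then exactly that many bytes of content.
/// (Hypothetical helper; layout details are assumed.)
fn frame_document(path: &str, content: &str) -> String {
    // `content.len()` is the length in bytes, not in characters.
    format!("<!-- begin: {path} length:{} -->\n{content}\n", content.len())
}

/// Parse framed documents. After each marker the parser jumps over the
/// content region by byte count, so marker-like text inside a document
/// is never examined.
fn parse_framed(text: &str) -> Result<Vec<(String, String)>, String> {
    const PREFIX: &str = "<!-- begin: ";
    const SUFFIX: &str = " -->\n";
    let mut docs = Vec::new();
    let mut pos = 0;
    while let Some(found) = text[pos..].find(PREFIX) {
        let header_start = pos + found + PREFIX.len();
        let header_len = text[header_start..]
            .find(SUFFIX)
            .ok_or("unterminated begin marker")?;
        let header = &text[header_start..header_start + header_len];
        // Split on the last " length:" so the path part may contain spaces.
        let (path, len_str) = header
            .rsplit_once(" length:")
            .ok_or("begin marker has no length field")?;
        let len: usize = len_str
            .parse()
            .map_err(|_| format!("non-numeric length: {len_str:?}"))?;
        let body_start = header_start + header_len + SUFFIX.len();
        let body_end = body_start.checked_add(len).ok_or("length overflow")?;
        // `get` returns None if the range runs past the end of the text
        // or would split a multi-byte character.
        let body = text
            .get(body_start..body_end)
            .ok_or("declared length does not fit the content")?;
        docs.push((path.to_string(), body.to_string()));
        pos = body_end;
    }
    Ok(docs)
}
```

Because the parser resumes scanning only after the declared content length, a document that itself contains `<!-- begin: ... -->` text round-trips unchanged, and a non-numeric length surfaces as an explicit error rather than a silently skipped document.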

Change Type

  • Bug fix
  • New feature
  • Refactor
  • Documentation
  • CI/Infrastructure
  • Security
  • Dependencies

Linked Issue

Closes #167

Validation

  • cargo fmt
  • cargo clippy --all --benches --tests --examples --all-features
  • [x] Relevant tests pass:
  • Manual testing:

Security Impact

Security Impact: Writes sensitive workspace content to local disk for recovery. Files remain plaintext; Unix permissions are
restricted to 0600. No encryption/signing/attestation in this PR.
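For readers unfamiliar with how a file can be restricted to 0600 from Rust, here is a minimal sketch using only the standard library. The function name `write_private` is hypothetical, and this is not necessarily how the PR implements it (the PR's async code may go through `tokio::fs` instead).

```rust
use std::fs::OpenOptions;
use std::io::{self, Write};
use std::path::Path;

/// Write `bytes` to `path`, readable and writable by the owner only on Unix.
/// (Hypothetical helper for illustration.)
fn write_private(path: &Path, bytes: &[u8]) -> io::Result<()> {
    let mut opts = OpenOptions::new();
    opts.write(true).create(true).truncate(true);
    #[cfg(unix)]
    {
        use std::os::unix::fs::OpenOptionsExt;
        // Applies only when the file is newly created.
        opts.mode(0o600);
    }
    let mut file = opts.open(path)?;
    #[cfg(unix)]
    {
        use std::os::unix::fs::PermissionsExt;
        // Also tighten a pre-existing file that had broader permissions.
        file.set_permissions(std::fs::Permissions::from_mode(0o600))?;
    }
    file.write_all(bytes)?;
    file.sync_all()
}
```

Setting the mode at open time avoids a window in which a freshly created file is briefly readable by other users; the explicit `set_permissions` call covers the case where the file already existed.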

Database Impact

None

Blast Radius

Rollback Plan


Review track:

@github-actions bot added labels on Mar 15, 2026: scope: agent (Agent core: agent loop, router, scheduler); scope: workspace (Persistent memory / workspace); scope: docs (Documentation); size: XL (500+ changed lines); risk: medium (Business logic, config, or moderate-risk modules); contributor: experienced (6-19 merged PRs)
@gemini-code-assist
Contributor

Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces a critical disaster recovery mechanism for the workspace by enabling periodic snapshots of essential documents and an automatic hydration process upon application startup. This ensures that even if the primary workspace database is lost or empty, core user data and context can be restored from a file-based backup. The implementation focuses on data integrity through a robust, byte-length-prefixed format and includes comprehensive validation to prevent data corruption or injection, making the system more resilient and reliable.

Highlights

  • Disaster Recovery via Workspace Snapshot: Implemented a new feature for periodic export of core workspace documents (identity files, MEMORY.md, and context/**) to a length-prefixed Markdown snapshot file on disk.
  • Startup Hydration: Added a startup hydration mechanism that restores the workspace from a snapshot file if the database is found to be empty, ensuring data persistence across restarts.
  • Robust Snapshot Format: Designed the snapshot format using byte-length framing to guarantee exact round-trip fidelity for document content, including embedded markers or HTML.
  • Security and Strictness: Incorporated validation for user IDs to prevent whitespace issues and for marker values to reject control characters or HTML comment terminators, mitigating format injection risks. Malformed markers are explicitly detected as errors.
  • Concurrency and Cadence Management: Introduced an AtomicBool guard to prevent concurrent snapshot passes and updated cadence state only upon successful snapshot completion, ensuring reliability.
  • Integration and Configuration: Integrated snapshot configuration through the Agent, HeartbeatRunner, and spawn_heartbeat functions, and added a hydration step to the application's startup chain (hydrate → import → seed → backfill), with each recovery step designed to be best-effort.
  • Comprehensive Testing: Included 31 new unit tests covering round-trip fidelity, parse strictness, marker safety, and concurrency aspects of the snapshot and hydration functionality.
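The concurrency and cadence behaviour described in the highlights can be sketched as follows. This is an illustrative sketch only: `RunGuard`, `snapshot_pass`, and the use of a plain `u64` timestamp for cadence state are assumptions, not names or types from the PR.

```rust
use std::sync::atomic::{AtomicBool, Ordering};

/// RAII guard: holds the "snapshot running" flag and clears it on drop,
/// including on early return or error. (Hypothetical type.)
struct RunGuard<'a>(&'a AtomicBool);

impl<'a> RunGuard<'a> {
    /// Returns `None` if another pass already holds the flag.
    fn try_acquire(flag: &'a AtomicBool) -> Option<Self> {
        flag.compare_exchange(false, true, Ordering::AcqRel, Ordering::Acquire)
            .ok()
            .map(|_| RunGuard(flag))
    }
}

impl Drop for RunGuard<'_> {
    fn drop(&mut self) {
        self.0.store(false, Ordering::Release);
    }
}

/// Run one snapshot pass. Returns Ok(false) if a pass is already running.
/// The cadence state (`last_success`) is updated only after `work` succeeds.
fn snapshot_pass(
    flag: &AtomicBool,
    last_success: &mut Option<u64>,
    now: u64,
    work: impl FnOnce() -> Result<(), String>,
) -> Result<bool, String> {
    let Some(_guard) = RunGuard::try_acquire(flag) else {
        return Ok(false);
    };
    work()?; // on error the guard still drops and releases the flag
    *last_success = Some(now);
    Ok(true)
}
```

Updating the timestamp only on success means a failed export is retried on the next heartbeat instead of waiting out a full interval.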
Changelog
  • src/agent/agent_loop.rs
    • Added snapshot_config field to the Agent struct.
    • Updated Agent::new constructor to accept snapshot_config.
    • Modified spawn_heartbeat call to pass the resolved snapshot_config.
  • src/agent/commands.rs
    • Updated HeartbeatRunner::new instantiation to include a default, disabled SnapshotConfig for manual heartbeats.
  • src/agent/dispatcher.rs
    • Updated test Agent::new calls to accommodate the new snapshot_config parameter with None.
  • src/agent/heartbeat.rs
    • Added snapshot_config field to the HeartbeatRunner struct.
    • Updated HeartbeatRunner::new constructor to accept snapshot_config.
    • Integrated snapshot_if_due into the run_heartbeat_loop to perform snapshots after hygiene.
    • Modified spawn_heartbeat function signature and call to accept and pass snapshot_config.
  • src/app.rs
    • Introduced a 'Startup recovery chain' including a hydration step from a snapshot if the workspace is empty and a snapshot file exists.
  • src/config/mod.rs
    • Added snapshot module to mod.rs.
    • Exported SnapshotConfig from the config module.
    • Added snapshot: SnapshotConfig field to the main Config struct.
    • Initialized and resolved SnapshotConfig within Config::default() and Config::resolve().
  • src/config/snapshot.rs
    • Added new module defining SnapshotConfig for application-level settings.
    • Implemented Default trait for SnapshotConfig with reasonable defaults.
    • Provided resolve() method to load configuration from environment variables.
    • Added to_workspace_config() method to convert application config to workspace-specific config, including user ID sanitization for file paths.
    • Included unit tests for default config, user ID sanitization, and path resolution.
  • src/main.rs
    • Updated Agent::new call in async_main to pass the config.snapshot.
  • src/workspace/README.md
    • Updated the spawn_heartbeat example to reflect the new snapshot_config parameter.
  • src/workspace/mod.rs
    • Added pub mod snapshot; to expose the new snapshot module.
  • src/workspace/snapshot.rs
    • Added new module implementing workspace snapshot and hydration logic.
    • Defined SnapshotConfig (workspace-side), SnapshotReport, HydrationReport, and SnapshotError.
    • Implemented snapshot_if_due for periodic document export with cadence and concurrency control.
    • Implemented hydrate_from_snapshot for restoring documents from a snapshot file, including version and user ID validation.
    • Defined document allowlists (SNAPSHOT_DOCS, SNAPSHOT_PREFIXES) for inclusion in snapshots.
    • Introduced byte-length framing for robust content round-tripping.
    • Added validate_marker_safe to prevent format injection and sanitize_user_id for safe file paths.
    • Included extensive unit tests for format fidelity, parsing strictness, marker safety, allowlisting, state persistence, and concurrency.
  • tests/e2e_routine_heartbeat.rs
    • Updated HeartbeatRunner::new calls in end-to-end tests to include a default SnapshotConfig.
  • tests/heartbeat_integration.rs
    • Updated HeartbeatRunner::new call in integration tests to include a default SnapshotConfig.
  • tests/support/gateway_workflow_harness.rs
    • Updated Agent::new call in test harness to include None for the new snapshot_config parameter.
  • tests/support/test_rig.rs
    • Updated Agent::new call in test rig to include None for the new snapshot_config parameter.
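The best-effort startup recovery chain listed for src/app.rs (hydrate → import → seed → backfill) can be pictured with the sketch below. It assumes each step is a fallible closure and collects warnings in a `Vec` instead of logging them; `Step`, `step`, and `run_recovery_chain` are hypothetical names, not the PR's API.

```rust
/// One recovery step: a fallible, run-once action. (Hypothetical alias.)
type Step = Box<dyn FnOnce() -> Result<(), String>>;

/// Convenience constructor so callers do not have to spell out the boxing.
fn step(f: impl FnOnce() -> Result<(), String> + 'static) -> Step {
    Box::new(f)
}

/// Run every step in order. A failing step produces a warning but never
/// prevents the steps after it from running.
fn run_recovery_chain(steps: Vec<(&'static str, Step)>) -> Vec<String> {
    let mut warnings = Vec::new();
    for (name, run) in steps {
        if let Err(e) = run() {
            // In a real application this would be a log call rather
            // than a collected string.
            warnings.push(format!("{name} failed: {e}"));
        }
    }
    warnings
}
```

The point of the design is that a missing or corrupt snapshot file degrades to a warning: import, seed, and backfill still get their chance to populate the workspace.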
Activity
  • The pull request introduces a new feature for workspace snapshot and hydration.
  • The PR description provides a detailed summary of changes and validation steps.

Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

The pull request introduces a new workspace snapshot feature, enabling periodic export of core workspace documents to a structured Markdown file and hydration from this snapshot on startup. This involves adding a SnapshotConfig struct, a dedicated src/workspace/snapshot.rs module for snapshot/hydration logic, and integrating this functionality into the agent's heartbeat process and application startup. Review comments suggest several improvements: simplifying snapshot configuration creation in agent_loop.rs, adding error logging for JSON serialization failures in snapshot.rs, optimizing regex compilation in snapshot.rs for performance, and completing the spawn_heartbeat example in README.md for clarity.

Comment on lines +418 to +425
let snapshot = self
    .snapshot_config
    .as_ref()
    .map(|s| s.to_workspace_config(workspace.user_id()))
    .unwrap_or_else(|| {
        crate::config::SnapshotConfig::default()
            .to_workspace_config(workspace.user_id())
    });
medium

This logic for creating the snapshot configuration can be simplified to be more concise and readable, reducing redundancy.

                    let snapshot = self
                        .snapshot_config
                        .as_ref()
                        .cloned()
                        .unwrap_or_default()
                        .to_workspace_config(workspace.user_id());

Comment on lines +371 to +373
let Ok(json) = serde_json::to_string_pretty(&state) else {
    return;
};
medium

The function returns silently if JSON serialization fails. It would be better to log a warning to aid in debugging, consistent with the error handling for I/O operations in this function.

    let json = match serde_json::to_string_pretty(&state) {
        Ok(j) => j,
        Err(e) => {
            tracing::warn!("snapshot: failed to serialize state: {e}");
            return;
        }
    };
References
  1. Create specific error variants for different failure modes (e.g., DownloadFailed with a URL string vs. ManifestRead with a file path) to provide semantically correct and clear error messages. Logging a warning for serialization failure aligns with the goal of providing clear error messages for debugging.

Comment on lines +451 to +462
let re =
    Regex::new(r"<!-- begin: (.+) length:(\d+) -->").map_err(|e| SnapshotError::Format {
        reason: format!("regex compile error: {e}"),
    })?;

let mut results = Vec::new();
let mut pos = 0;

let malformed_re =
    Regex::new(r"<!-- begin: .+ length:\S+ -->").map_err(|e| SnapshotError::Format {
        reason: format!("regex compile error: {e}"),
    })?;
medium

These regular expressions are compiled on every function call, which can impact performance. For constant patterns, it's more efficient to compile them once using std::sync::LazyLock.

use std::sync::LazyLock;

static BEGIN_MARKER_RE: LazyLock<Regex> = LazyLock::new(|| {
    Regex::new(r"<!-- begin: (.+) length:(\d+) -->").expect("BEGIN_MARKER_RE regex is valid")
});

static MALFORMED_BEGIN_MARKER_RE: LazyLock<Regex> = LazyLock::new(|| {
    Regex::new(r"<!-- begin: .+ length:\S+ -->").expect("MALFORMED_BEGIN_MARKER_RE regex is valid")
});

/// Parse documents by byte-length framing.
///
/// Strict: all format errors produce `SnapshotError::Format`.
fn parse_snapshot_documents(text: &str) -> Result<Vec<(String, String)>, SnapshotError> {
    let mut results = Vec::new();
    let mut pos = 0;

    snapshot_path: PathBuf::new(),
    state_path: PathBuf::new(),
};
spawn_heartbeat(config, hygiene_config, snapshot, workspace, llm, response_tx, store);
medium

The code example for spawn_heartbeat is incomplete as hygiene_config and store are used without being defined. This could be confusing for developers trying to use the example. Please define these variables to make the example self-contained and runnable.

Suggested change
- spawn_heartbeat(config, hygiene_config, snapshot, workspace, llm, response_tx, store);
+ use std::sync::Arc;
+ use crate::workspace::hygiene::HygieneConfig;
+ let hygiene_config = HygieneConfig::default();
+ let store: Option<Arc<dyn crate::db::Database>> = None;
+ spawn_heartbeat(config, hygiene_config, snapshot, workspace, llm, response_tx, store);

Collaborator

@zmanian zmanian left a comment

Review: REQUEST CHANGES

Solid design -- periodic export of workspace docs to a length-prefixed Markdown file with startup hydration when the DB is empty. Fits naturally into the heartbeat lifecycle. But two blocking issues need fixing.

Critical (blocking)

C1: Synchronous filesystem I/O in async context
snapshot_if_due and hydrate_from_snapshot are async but use std::fs::write, std::fs::read_to_string, std::fs::rename, std::fs::create_dir_all. These block the tokio executor thread. For snapshot files with many documents, this can starve other async tasks. Use tokio::fs equivalents or wrap in tokio::task::spawn_blocking.

C2: Regex compiled on every call
Two Regex::new() calls inside parse_snapshot_documents, which runs on every hydration. Should use std::sync::LazyLock for one-time compilation.

Important (should fix)

I1: Snapshot enabled by default
SnapshotConfig::default() sets enabled: true. Every existing deployment will start writing snapshot files to ~/.ironclaw/ after upgrade without opt-in. For a new feature that writes to disk, defaulting to enabled: false would be more conservative.

I2: Snapshot file is plaintext -- security impact understated
The PR says "Security Impact: None", but secrets or sensitive content in MEMORY.md, IDENTITY.md, or identity docs will be written to disk in cleartext with default file permissions. At minimum: (a) document that snapshot files may contain sensitive workspace content, (b) consider setting restrictive permissions (0600) on the snapshot file.

I3: Missing HEARTBEAT.md and TOOLS.md from snapshot allowlist
These are identity/configuration files per the workspace spec. Losing them after a database wipe is a significant disaster recovery gap. Intentional exclusion?

I4: Greedy regex ambiguity with space-containing paths
<!-- begin: (.+) length:(\d+) --> with greedy .+ breaks if a path contains the literal string length:. The test marker_allows_space_in_path explicitly allows spaces. Reject paths containing length: in validate_marker_safe to prevent parsing ambiguity.
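A marker check along the lines requested in I4 might look like the sketch below. The function name `validate_marker_safe` comes from the PR, but this body is an assumption reconstructed from the rules stated in this thread (reject control characters, `-->`, and the literal ` length:`); the real implementation and its error type may differ.

```rust
/// Reject values that could corrupt or confuse the begin marker
/// `<!-- begin: PATH length:N -->`. (Sketch; error type is assumed.)
fn validate_marker_safe(value: &str) -> Result<(), String> {
    if value.chars().any(char::is_control) {
        // Newlines and other control characters could break the
        // single-line marker.
        return Err("marker value contains a control character".into());
    }
    if value.contains("-->") {
        // Would terminate the HTML comment early.
        return Err("marker value contains '-->'".into());
    }
    if value.contains(" length:") {
        // Would make the path/length split ambiguous for a greedy `.+`.
        return Err("marker value contains ' length:'".into());
    }
    Ok(())
}
```

Spaces alone stay legal, so a path such as `context/my notes.md` passes, while `odd length:12 name.md` is rejected before it can ever be written into a marker.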

What's good

  • Byte-length framing (not delimiter-based) -- content round-trips exactly
  • Atomic writes via tmp-file + rename for crash safety
  • Path traversal protection via sanitize_user_id, marker injection prevention, hydration allowlist
  • Concurrency guard via AtomicBool with RAII drop matches existing hygiene pattern
  • 31 unit tests covering round-trip fidelity, parse strictness, marker safety
  • No .unwrap() in production code, CI all green
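The "tmp-file + rename" pattern praised above can be sketched with the standard library as follows. `write_atomic` and the `.tmp` suffix are hypothetical choices for illustration; the sketch is synchronous, so in the PR's async functions the equivalent calls would need `tokio::fs` or `spawn_blocking`, as C1 points out.

```rust
use std::fs;
use std::io::{self, Write};
use std::path::{Path, PathBuf};

/// Write `bytes` to a sibling temp file, flush it to disk, then rename it
/// over `dest`. Readers see either the old file or the new one, never a
/// half-written snapshot. (Hypothetical helper.)
fn write_atomic(dest: &Path, bytes: &[u8]) -> io::Result<()> {
    // Build "<dest>.tmp" in the same directory, so the rename stays on
    // one filesystem.
    let mut tmp_name = dest.as_os_str().to_owned();
    tmp_name.push(".tmp");
    let tmp = PathBuf::from(tmp_name);
    {
        let mut file = fs::File::create(&tmp)?;
        file.write_all(bytes)?;
        file.sync_all()?; // make sure the data is on disk before the rename
    }
    fs::rename(&tmp, dest)
}
```

If the process crashes before the rename, the previous snapshot is left intact and only a stray `.tmp` file remains.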

Suggestions

  • Add SnapshotConfig::disabled() constructor for test ergonomics (currently duplicated across 5 test files)
  • Add an integration test for the full hydrate flow with a real Workspace instance
  • Consider SnapshotOutcome::Skipped | Exported { path, count } enum instead of Option<PathBuf> + skipped: bool
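The `SnapshotOutcome` suggestion could take a shape like the one below. Only the variant names and the `path` / `count` fields come from the review comment; the derives, field types, and the `describe` helper are assumptions added to make the sketch self-contained.

```rust
use std::path::PathBuf;

/// Result of one snapshot pass. An enum rules out contradictory states
/// such as "skipped" together with a path.
#[derive(Debug, Clone, PartialEq, Eq)]
enum SnapshotOutcome {
    /// Nothing was written (not due yet, disabled, or another pass running).
    Skipped,
    /// `count` documents were written to `path`.
    Exported { path: PathBuf, count: usize },
}

/// Hypothetical consumer: the compiler forces both cases to be handled.
fn describe(outcome: &SnapshotOutcome) -> String {
    match outcome {
        SnapshotOutcome::Skipped => "snapshot skipped".to_string(),
        SnapshotOutcome::Exported { path, count } => {
            format!("exported {count} documents to {}", path.display())
        }
    }
}
```

Compared with `Option<PathBuf>` plus `skipped: bool`, callers no longer have to decide what `None` with `skipped == false` is supposed to mean.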

With the sync I/O fix (#C1) and regex compilation fix (#C2), this would be ready to merge.

Collaborator

zmanian commented Mar 15, 2026

Maintainer note: confidential VM use case

This snapshot/hydration architecture is potentially valuable for confidential VM (TEE) deployments, where all in-memory state is lost on shutdown. The serialize-restore lifecycle is exactly what's needed for VM upgrades (snapshot -> terminate -> launch new attested instance -> hydrate) and crash recovery.

However, the current implementation is not yet suitable for TEE use cases without additional layers:

  1. No encryption at rest -- snapshots are plaintext. Workspace content (memory, identity, potentially secrets) would be exposed outside the TEE boundary when written to persistent storage.
  2. No integrity verification -- no HMAC or signature on the snapshot file. A tampered snapshot could inject content on hydration.
  3. No attestation gating -- hydration runs unconditionally at startup. In a TEE context, the new VM instance should prove its integrity before receiving snapshot data.

For TEE support, we'd need to layer on: (a) encrypt snapshots with a key sealed to a KMS with attestation-gated access, (b) HMAC/sign the snapshot for tamper detection, (c) gate hydration on remote attestation of the new instance.

The snapshot format and lifecycle design here is the right foundation -- these crypto/attestation layers could be added on top without changing the core serialize/parse logic. Worth keeping this use case in mind as the feature evolves.

  - Use tokio::fs instead of std::fs in async functions (C1)
  - Hoist Regex to static LazyLock (C2)
  - Default snapshot enabled to false (I1)
  - Set 0600 file permissions on Unix, add security docs (I2)
  - Add HEARTBEAT.md and TOOLS.md to snapshot allowlist (I3)
  - Reject paths containing ' length:' in marker validation (I4)
@github-actions bot changed labels on Mar 16, 2026: added contributor: core (20+ merged PRs); removed contributor: experienced (6-19 merged PRs)
@reidliu41
Contributor Author

Thanks for the note on confidential VM / TEE applicability. Agreed that the current snapshot format and lifecycle are the right foundation, and that crypto / attestation layers can be added on top without changing the core serialize / parse flow.
I've opened #1235 to track the hardening that IronClaw can do standalone: snapshot encryption at rest plus integrity verification / tamper detection. Attestation-gated hydration is intentionally out of scope there, since that depends on external platform / infrastructure decisions.

Labels

contributor: core 20+ merged PRs risk: medium Business logic, config, or moderate-risk modules scope: agent Agent core (agent loop, router, scheduler) scope: docs Documentation scope: workspace Persistent memory / workspace size: XL 500+ changed lines

Development

Successfully merging this pull request may close these issues.

Add workspace snapshot/hydration for disaster recovery

2 participants