feat(workspace): add snapshot/hydration for disaster recovery (#167)#1198
feat(workspace): add snapshot/hydration for disaster recovery (#167)#1198reidliu41 wants to merge 4 commits intonearai:stagingfrom
Conversation
…#167) - Add periodic export of identity docs + context/** to length-prefixed Markdown file on disk - Add startup hydration: restore from snapshot when workspace database is empty - Use byte-length framing so embedded markers/HTML in content round-trip exactly - Validate user_id has no whitespace before writing metadata (split_whitespace safety) - Validate marker values reject control chars and --> to prevent format injection - Detect malformed begin markers with non-numeric length as explicit format errors - Skip content regions during parsing so fake markers inside documents are never matched - Guard concurrent snapshot passes with AtomicBool; update cadence state only on success - Wire snapshot config through Agent, HeartbeatRunner, and spawn_heartbeat - Insert hydration step in app.rs startup chain: hydrate → import → seed → backfill - Each recovery step is best-effort — failure warns but never blocks subsequent steps - Add 31 unit tests covering round-trip fidelity, parse strictness, marker safety, and concurrency
Summary of ChangesHello, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request introduces a critical disaster recovery mechanism for the workspace by enabling periodic snapshots of essential documents and an automatic hydration process upon application startup. This ensures that even if the primary workspace database is lost or empty, core user data and context can be restored from a file-based backup. The implementation focuses on data integrity through a robust, byte-length-prefixed format and includes comprehensive validation to prevent data corruption or injection, making the system more resilient and reliable. Highlights
Changelog
Activity
Using Gemini Code AssistThe full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips. Invoking Gemini You can request assistance from Gemini at any point by creating a comment using either
Customization To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a Limitations & Feedback Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here. You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension. Footnotes
|
There was a problem hiding this comment.
Code Review
The pull request introduces a new workspace snapshot feature, enabling periodic export of core workspace documents to a structured Markdown file and hydration from this snapshot on startup. This involves adding a SnapshotConfig struct, a dedicated src/workspace/snapshot.rs module for snapshot/hydration logic, and integrating this functionality into the agent's heartbeat process and application startup. Review comments suggest several improvements: simplifying snapshot configuration creation in agent_loop.rs, adding error logging for JSON serialization failures in snapshot.rs, optimizing regex compilation in snapshot.rs for performance, and completing the spawn_heartbeat example in README.md for clarity.
| let snapshot = self | ||
| .snapshot_config | ||
| .as_ref() | ||
| .map(|s| s.to_workspace_config(workspace.user_id())) | ||
| .unwrap_or_else(|| { | ||
| crate::config::SnapshotConfig::default() | ||
| .to_workspace_config(workspace.user_id()) | ||
| }); |
| let Ok(json) = serde_json::to_string_pretty(&state) else { | ||
| return; | ||
| }; |
There was a problem hiding this comment.
The function returns silently if JSON serialization fails. It would be better to log a warning to aid in debugging, consistent with the error handling for I/O operations in this function.
let json = match serde_json::to_string_pretty(&state) {
Ok(j) => j,
Err(e) => {
tracing::warn!("snapshot: failed to serialize state: {e}");
return;
}
};References
- Create specific error variants for different failure modes (e.g., DownloadFailed with a URL string vs. ManifestRead with a file path) to provide semantically correct and clear error messages. Logging a warning for serialization failure aligns with the goal of providing clear error messages for debugging.
src/workspace/snapshot.rs
Outdated
| let re = | ||
| Regex::new(r"<!-- begin: (.+) length:(\d+) -->").map_err(|e| SnapshotError::Format { | ||
| reason: format!("regex compile error: {e}"), | ||
| })?; | ||
|
|
||
| let mut results = Vec::new(); | ||
| let mut pos = 0; | ||
|
|
||
| let malformed_re = | ||
| Regex::new(r"<!-- begin: .+ length:\S+ -->").map_err(|e| SnapshotError::Format { | ||
| reason: format!("regex compile error: {e}"), | ||
| })?; |
There was a problem hiding this comment.
These regular expressions are compiled on every function call, which can impact performance. For constant patterns, it's more efficient to compile them once using std::sync::LazyLock.
use std::sync::LazyLock;
static BEGIN_MARKER_RE: LazyLock<Regex> = LazyLock::new(|| {
Regex::new(r"<!-- begin: (.+) length:(\d+) -->").expect("BEGIN_MARKER_RE regex is valid")
});
static MALFORMED_BEGIN_MARKER_RE: LazyLock<Regex> = LazyLock::new(|| {
Regex::new(r"<!-- begin: .+ length:\S+ -->").expect("MALFORMED_BEGIN_MARKER_RE regex is valid")
});
/// Parse documents by byte-length framing.
///
/// Strict: all format errors produce `SnapshotError::Format`.
fn parse_snapshot_documents(text: &str) -> Result<Vec<(String, String)>, SnapshotError> {
let mut results = Vec::new();
let mut pos = 0;| snapshot_path: PathBuf::new(), | ||
| state_path: PathBuf::new(), | ||
| }; | ||
| spawn_heartbeat(config, hygiene_config, snapshot, workspace, llm, response_tx, store); |
There was a problem hiding this comment.
The code example for spawn_heartbeat is incomplete as hygiene_config and store are used without being defined. This could be confusing for developers trying to use the example. Please define these variables to make the example self-contained and runnable.
| spawn_heartbeat(config, hygiene_config, snapshot, workspace, llm, response_tx, store); | |
| use std::sync::Arc; | |
| use crate::workspace::hygiene::HygieneConfig; | |
| let hygiene_config = HygieneConfig::default(); | |
| let store: Option<Arc<dyn crate::db::Database>> = None; | |
| spawn_heartbeat(config, hygiene_config, snapshot, workspace, llm, response_tx, store); |
zmanian
left a comment
There was a problem hiding this comment.
Review: REQUEST CHANGES
Solid design -- periodic export of workspace docs to a length-prefixed Markdown file with startup hydration when the DB is empty. Fits naturally into the heartbeat lifecycle. But two blocking issues need fixing.
Critical (blocking)
C1: Synchronous filesystem I/O in async context
snapshot_if_due and hydrate_from_snapshot are async but use std::fs::write, std::fs::read_to_string, std::fs::rename, std::fs::create_dir_all. These block the tokio executor thread. For snapshot files with many documents, this can starve other async tasks. Use tokio::fs equivalents or wrap in tokio::task::spawn_blocking.
C2: Regex compiled on every call
Two Regex::new() calls inside parse_snapshot_documents, which runs on every hydration. Should use std::sync::LazyLock for one-time compilation.
Important (should fix)
I1: Snapshot enabled by default
SnapshotConfig::default() sets enabled: true. Every existing deployment will start writing snapshot files to ~/.ironclaw/ after upgrade without opt-in. For a new feature that writes to disk, defaulting to enabled: false would be more conservative.
I2: Snapshot file is plaintext -- security impact understated
The PR says "Security Impact: None", but secrets or sensitive content in MEMORY.md, IDENTITY.md, or identity docs will be written to disk in cleartext with default file permissions. At minimum: (a) document that snapshot files may contain sensitive workspace content, (b) consider setting restrictive permissions (0600) on the snapshot file.
I3: Missing HEARTBEAT.md and TOOLS.md from snapshot allowlist
These are identity/configuration files per the workspace spec. Losing them after a database wipe is a significant disaster recovery gap. Intentional exclusion?
I4: Greedy regex ambiguity with space-containing paths
<!-- begin: (.+) length:(\d+) --> with greedy .+ breaks if a path contains the literal string length:. The test marker_allows_space_in_path explicitly allows spaces. Reject paths containing length: in validate_marker_safe to prevent parsing ambiguity.
What's good
- Byte-length framing (not delimiter-based) -- content round-trips exactly
- Atomic writes via tmp-file + rename for crash safety
- Path traversal protection via
sanitize_user_id, marker injection prevention, hydration allowlist - Concurrency guard via
AtomicBoolwith RAII drop matches existing hygiene pattern - 31 unit tests covering round-trip fidelity, parse strictness, marker safety
- No
.unwrap()in production code, CI all green
Suggestions
- Add
SnapshotConfig::disabled()constructor for test ergonomics (currently duplicated across 5 test files) - Add an integration test for the full hydrate flow with a real
Workspaceinstance - Consider
SnapshotOutcome::Skipped | Exported { path, count }enum instead ofOption<PathBuf>+skipped: bool
With the sync I/O fix (#C1) and regex compilation fix (#C2), this would be ready to merge.
Maintainer note: confidential VM use caseThis snapshot/hydration architecture is potentially valuable for confidential VM (TEE) deployments, where all in-memory state is lost on shutdown. The serialize-restore lifecycle is exactly what's needed for VM upgrades (snapshot -> terminate -> launch new attested instance -> hydrate) and crash recovery. However, the current implementation is not yet suitable for TEE use cases without additional layers:
For TEE support, we'd need to layer on: (a) encrypt snapshots with a key sealed to a KMS with attestation-gated access, (b) HMAC/sign the snapshot for tamper detection, (c) gate hydration on remote attestation of the new instance. The snapshot format and lifecycle design here is the right foundation -- these crypto/attestation layers could be added on top without changing the core serialize/parse logic. Worth keeping this use case in mind as the feature evolves. |
- Use tokio::fs instead of std::fs in async functions (C1) - Hoist Regex to static LazyLock (C2) - Default snapshot enabled to false (I1) - Set 0600 file permissions on Unix, add security docs (I2) - Add HEARTBEAT.md and TOOLS.md to snapshot allowlist (I3) - Reject paths containing ' length:' in marker validation (I4)
|
Thanks for the note on confidential VM / TEE applicability. Agreed that the current snapshot format and lifecycle are the right foundation, and that crypto / attestation layers can be added on top without changing the core serialize / parse flow. |
Summary
Change Type
Linked Issue
Closes #167
Validation
cargo fmtcargo clippy --all --benches --tests --examples --all-featuresSecurity Impact
Security Impact: Writes sensitive workspace content to local disk for recovery. Files remain plaintext; Unix permissions are
restricted to
0600. No encryption/signing/attestation in this PR.Database Impact
None
Blast Radius
Rollback Plan
Review track: