Problem Statement
Hive's core value proposition is a self-healing, self-improving agent loop. The README describes it as:
"When things break, the framework captures failure data, evolves the agent through the coding agent, and redeploys."
However, the Failure Recording mechanism is explicitly unbuilt (docs/roadmap.md, Eval System section):
- Failure capture mechanism
- Failure analysis tools
- Historical failure tracking
- Continuous improvement loop
Without this, the evolution cycle has no structured input to learn from. The Queen/coding agent triggers graph evolution reactively but has no programmatic access to historical failure patterns. Each evolution cycle essentially starts blind, with no memory of what failed before or what fixes were already attempted.
This means the "self-improving" loop described in the README is currently dependent on ad-hoc recovery rather than structured feedback, which undermines the framework's core reliability guarantee.
Proposed Solution
Implement a FailureStore that integrates with the existing storage patterns (session_store.py, checkpoint_store.py) to capture structured failure data at the point of node failure.
A minimal FailureRecord could look like:
```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional

@dataclass
class FailureRecord:
    session_id: str
    agent_id: str
    node_id: str
    timestamp: datetime
    error_type: str
    error_message: str
    graph_snapshot: dict        # state of the graph at the time of failure
    input_context: dict         # what the node received
    evolution_triggered: bool   # did this failure cause a graph evolution?
    resolution: Optional[str]   # what fix was applied, if any
```
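As a sketch of how the executor hook might build such a record when a node raises, here is a small helper. The helper name `record_from_exception` is hypothetical, and the dataclass is repeated (with defaults added for the optional fields) so the snippet runs standalone:

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional

@dataclass
class FailureRecord:
    # Repeated from the proposal above so this sketch is self-contained;
    # defaults are added here because these fields are unknown at capture time.
    session_id: str
    agent_id: str
    node_id: str
    timestamp: datetime
    error_type: str
    error_message: str
    graph_snapshot: dict
    input_context: dict
    evolution_triggered: bool = False
    resolution: Optional[str] = None

def record_from_exception(session_id: str, agent_id: str, node_id: str,
                          exc: BaseException, graph_snapshot: dict,
                          input_context: dict) -> FailureRecord:
    """Capture failure semantics at the moment a node raises (hypothetical helper)."""
    return FailureRecord(
        session_id=session_id,
        agent_id=agent_id,
        node_id=node_id,
        timestamp=datetime.now(timezone.utc),
        error_type=type(exc).__name__,   # structured, queryable error class
        error_message=str(exc),
        graph_snapshot=graph_snapshot,
        input_context=input_context,
    )
```

The point is that the record is built from the live exception and graph state, so no information has to be reconstructed from text logs later.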
The Queen/coding agent would query FailureStore before triggering graph evolution, enabling pattern-aware improvements rather than blind retries. This also directly unblocks the roadmap items for Custom Failure Conditions SDK and the Continuous Improvement Loop.
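To make the pre-evolution check concrete: at its simplest it could count repeated (node, error-type) pairs in recent records. The function name and dict shape below are illustrative, not an existing API:

```python
from collections import Counter

def repeated_failure_patterns(records: list[dict], threshold: int = 2) -> dict:
    """Return (node_id, error_type) pairs seen at least `threshold` times.

    A repeated pattern suggests the Queen should apply a targeted fix
    rather than trigger another blind retry of the same graph.
    """
    counts = Counter((r["node_id"], r["error_type"]) for r in records)
    return {pair: n for pair, n in counts.items() if n >= threshold}
```

For example, two `TimeoutError` failures on the same node would surface as one pattern, while a one-off error elsewhere would be ignored.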
Alternatives Considered
- Extending the existing runtime logger (runtime/runtime_logger.py) - rejected because logs are text-based and not queryable by the Queen agent in a structured way. Logs are for humans while failure records are for the evolution engine.
- Using the checkpoint system alone (storage/checkpoint_store.py) - rejected because checkpoints capture state snapshots but not failure semantics (error type, resolution, whether evolution was triggered). Both can coexist.
- External observability tools (e.g. Sentry, Datadog) - rejected as a primary solution because Hive is self-hostable and should not depend on third-party services for its core self-healing loop.
Additional Context
This gap affects both product integrity and developer trust:
- Developers building production agents have no way to audit why an agent evolved the way it did
- Repeated failures of the same type cannot be detected or prevented without historical tracking
- The roadmap's Continuous Improvement Loop item is directly blocked by this missing foundation
Relevant roadmap items this would unblock:
- Failure capture mechanism
- Failure analysis tools
- Historical failure tracking
- Continuous improvement loop
- Custom Failure Conditions SDK
Implementation Ideas
- Create storage/failure_store.py following the same async patterns as storage/checkpoint_store.py
- Hook into graph/executor.py at the point where ExecutionResult returns an error status, and capture the FailureRecord there
- Expose a query interface so the Queen/coding agent (agents/hive_coder/) can retrieve recent failures for a given agent_id before deciding on evolution strategy
- Add failure_count and last_failure_at fields to the existing SharedState so nodes can self-check before retrying
- Wire into the existing L2/L3 runtime logging levels so failure records are also reflected in human-readable logs
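The ideas above could start as small as the following sketch. Everything here is an assumption: the class name, the JSONL layout, and the method names are illustrative, and the real conventions in storage/checkpoint_store.py may differ (only async method signatures are assumed to match the pattern):

```python
import asyncio
import json
from pathlib import Path

class FailureStore:
    """Append-only JSONL store for failure records (sketch, not the real API)."""

    def __init__(self, path: str):
        self._path = Path(path)

    async def append(self, record: dict) -> None:
        # Offload blocking file I/O so the executor's event loop stays responsive.
        await asyncio.to_thread(self._append_sync, record)

    def _append_sync(self, record: dict) -> None:
        with self._path.open("a", encoding="utf-8") as f:
            f.write(json.dumps(record) + "\n")

    async def recent_failures(self, agent_id: str, limit: int = 20) -> list[dict]:
        # Query interface for the Queen: latest failures for one agent.
        records = await asyncio.to_thread(self._read_sync)
        matching = [r for r in records if r.get("agent_id") == agent_id]
        return matching[-limit:]

    def _read_sync(self) -> list[dict]:
        if not self._path.exists():
            return []
        with self._path.open(encoding="utf-8") as f:
            return [json.loads(line) for line in f if line.strip()]
```

An append-only file keeps the write path on the failure hot path trivially cheap; richer indexing or analysis can be layered on later without changing the capture side.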
This is intentionally minimal — the goal is to close the feedback loop first, with richer analysis tools built on top.