[Feature]: Failure recording mechanism missing, self-healing loop is incomplete without structured failure capture #5942

@hetmmehta

Description

@hetmmehta

Problem Statement

Hive's core value proposition is a self-healing, self-improving agent loop. The README describes it as:

"When things break, the framework captures failure data, evolves the agent through the coding agent, and redeploys."

However, the Failure Recording mechanism is explicitly unbuilt (docs/roadmap.md, Eval System section):

  • Failure capture mechanism
  • Failure analysis tools
  • Historical failure tracking
  • Continuous improvement loop

Without this, the evolution cycle has no structured input to learn from. The Queen/coding agent triggers graph evolution reactively but has no programmatic access to historical failure patterns. Each evolution cycle essentially starts blind, with no memory of what failed before or what fixes were already attempted.

This means the "self-improving" loop described in the README is currently dependent on ad-hoc recovery rather than structured feedback, which undermines the framework's core reliability guarantee.

Proposed Solution

Implement a FailureStore that integrates with the existing storage patterns (session_store.py, checkpoint_store.py) to capture structured failure data at the point of node failure.

A minimal FailureRecord could look like:

    from dataclasses import dataclass
    from datetime import datetime
    from typing import Optional

    @dataclass
    class FailureRecord:
        session_id: str
        agent_id: str
        node_id: str
        timestamp: datetime
        error_type: str
        error_message: str
        graph_snapshot: dict        # state of the graph at the time of failure
        input_context: dict         # what the node received
        evolution_triggered: bool   # did this failure cause a graph evolution?
        resolution: Optional[str]   # what fix was applied, if any

The Queen/coding agent would query FailureStore before triggering graph evolution, enabling pattern-aware improvements rather than blind retries. This also directly unblocks the roadmap items for Custom Failure Conditions SDK and the Continuous Improvement Loop.
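As a sketch of what that query path could look like, here is a minimal in-memory FailureStore. Everything here is illustrative, not an existing Hive API: the class name, the `record`/`recent_failures`/`repeated_error_types` methods, and the trimmed-down FailureRecord (only the fields needed for querying) are all assumptions; a real implementation would follow the async patterns in storage/checkpoint_store.py and persist to disk.

```python
# Hypothetical sketch of a FailureStore query interface. Class and method
# names are illustrative, not existing Hive symbols; the record type is a
# trimmed version of the FailureRecord proposed above.
from collections import defaultdict
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional

@dataclass
class FailureRecord:
    session_id: str
    agent_id: str
    node_id: str
    timestamp: datetime
    error_type: str
    error_message: str
    evolution_triggered: bool = False
    resolution: Optional[str] = None

class FailureStore:
    """Keeps structured failure records, queryable by agent_id."""

    def __init__(self) -> None:
        self._by_agent: dict[str, list[FailureRecord]] = defaultdict(list)

    def record(self, rec: FailureRecord) -> None:
        self._by_agent[rec.agent_id].append(rec)

    def recent_failures(self, agent_id: str,
                        limit: int = 10) -> list[FailureRecord]:
        """Most recent failures first, so the Queen sees fresh context."""
        records = sorted(self._by_agent[agent_id],
                         key=lambda r: r.timestamp, reverse=True)
        return records[:limit]

    def repeated_error_types(self, agent_id: str,
                             threshold: int = 2) -> set[str]:
        """Error types seen at least `threshold` times: candidates for a
        pattern-aware fix rather than a blind retry."""
        counts: dict[str, int] = defaultdict(int)
        for rec in self._by_agent[agent_id]:
            counts[rec.error_type] += 1
        return {etype for etype, n in counts.items() if n >= threshold}
```

The point of `repeated_error_types` is exactly the "pattern-aware improvements" case: before evolving the graph, the Queen can check whether the current error is a known repeat and bias the fix accordingly.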

Alternatives Considered

  1. Extending the existing runtime logger (runtime/runtime_logger.py) - rejected because logs are text-based and not queryable by the Queen agent in a structured way. Logs are for humans while failure records are for the evolution engine.

  2. Using the checkpoint system alone (storage/checkpoint_store.py) - rejected because checkpoints capture state snapshots but not failure semantics (error type, resolution, whether evolution was triggered). Both can coexist.

  3. External observability tools (e.g. Sentry, Datadog) - rejected as a primary solution because Hive is self-hostable and should not depend on third-party services for its core self-healing loop.

Additional Context

This gap affects both product integrity and developer trust:

  • Developers building production agents have no way to audit why an agent evolved the way it did
  • Repeated failures of the same type cannot be detected or prevented without historical tracking
  • The roadmap's Continuous Improvement Loop item is directly blocked by this missing foundation

Relevant roadmap items this would unblock:

  • Failure capture mechanism
  • Failure analysis tools
  • Historical failure tracking
  • Continuous improvement loop
  • Custom Failure Conditions SDK

Implementation Ideas

  1. Create storage/failure_store.py following the same async patterns as storage/checkpoint_store.py
  2. Hook into graph/executor.py at the point where ExecutionResult returns an error status, capture the FailureRecord there
  3. Expose a query interface so the Queen/coding agent (agents/hive_coder/) can retrieve recent failures for a given agent_id before deciding on evolution strategy
  4. Add a failure_count and last_failure_at field to the existing SharedState so nodes can self-check before retrying
  5. Wire into the existing L2/L3 runtime logging levels so failure records are also reflected in human-readable logs

This is intentionally minimal — the goal is to close the feedback loop first, with richer analysis tools built on top.
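Ideas 2 and 4 above can be sketched together as a single hook at the point where the executor sees an error result. All names here are assumptions for illustration: `ExecutionResult`, `SharedState`, `on_node_result`, and the `MAX_RETRIES` policy are not existing Hive symbols, and the real hook would live in graph/executor.py.

```python
# Hypothetical sketch of the failure-capture hook (idea 2) plus the
# SharedState self-check before retrying (idea 4). All names are
# illustrative, not existing Hive symbols.
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Callable, Optional

MAX_RETRIES = 3  # assumed retry policy, not from the Hive codebase

@dataclass
class ExecutionResult:
    node_id: str
    status: str                       # "ok" or "error"
    error_type: Optional[str] = None
    error_message: Optional[str] = None

@dataclass
class SharedState:
    # Idea 4: counters nodes can self-check before retrying.
    failure_count: int = 0
    last_failure_at: Optional[datetime] = None

def on_node_result(result: ExecutionResult, state: SharedState,
                   record_failure: Callable[[dict], None]) -> bool:
    """Called wherever the executor receives an ExecutionResult.

    Returns True if a retry should be attempted, False if the failure
    should be escalated to the Queen for graph evolution instead.
    """
    if result.status != "error":
        return False  # nothing to capture on success

    state.failure_count += 1
    state.last_failure_at = datetime.now(timezone.utc)

    # Idea 2: capture a structured record at the point of failure.
    record_failure({
        "node_id": result.node_id,
        "timestamp": state.last_failure_at,
        "error_type": result.error_type,
        "error_message": result.error_message,
    })

    # Idea 4: self-check so repeated failures stop blind retries and
    # escalate to evolution instead.
    return state.failure_count < MAX_RETRIES
```

In this sketch `record_failure` would be the FailureStore's write path, and the boolean return is where the executor decides between retrying and handing control to the Queen.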

Metadata

Labels: duplicate