Problem Statement
Hive's core value proposition is a self-healing, self-improving agent loop. The README describes it as:
"When things break, the framework captures failure data, evolves the agent through the coding agent, and redeploys."
However, the Failure Recording mechanism is explicitly unbuilt (docs/roadmap.md, Eval System section):
- Failure capture mechanism
- Failure analysis tools
- Historical failure tracking
- Continuous improvement loop
Without this, the evolution cycle has no structured input to learn from. The Queen/coding agent triggers graph evolution reactively but has no programmatic access to historical failure patterns. Each evolution cycle essentially starts blind, with no memory of what failed before or what fixes were already attempted.
This means the "self-improving" loop described in the README is currently dependent on ad-hoc recovery rather than structured feedback, which undermines the framework's core reliability guarantee.
Proposed Solution
Implement a FailureStore that integrates with the existing storage patterns (session_store.py, checkpoint_store.py) to capture structured failure data at the point of node failure.
A minimal FailureRecord could look like:
```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional

@dataclass
class FailureRecord:
    session_id: str
    agent_id: str
    node_id: str
    timestamp: datetime
    error_type: str
    error_message: str
    graph_snapshot: dict        # state of the graph at the time of failure
    input_context: dict         # what the node received
    evolution_triggered: bool   # did this failure cause a graph evolution?
    resolution: Optional[str]   # what fix was applied, if any
```
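As a sketch of how the executor hook might build such a record when a node raises, here is a small helper. The helper name `record_from_exception` is hypothetical, and the dataclass is repeated (with defaults added for the optional fields) so the snippet runs standalone:

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional

@dataclass
class FailureRecord:
    # Repeated from the proposal above so this sketch is self-contained;
    # defaults are added here because these fields are unknown at capture time.
    session_id: str
    agent_id: str
    node_id: str
    timestamp: datetime
    error_type: str
    error_message: str
    graph_snapshot: dict
    input_context: dict
    evolution_triggered: bool = False
    resolution: Optional[str] = None

def record_from_exception(session_id: str, agent_id: str, node_id: str,
                          exc: BaseException, graph_snapshot: dict,
                          input_context: dict) -> FailureRecord:
    """Capture failure semantics at the moment a node raises (hypothetical helper)."""
    return FailureRecord(
        session_id=session_id,
        agent_id=agent_id,
        node_id=node_id,
        timestamp=datetime.now(timezone.utc),
        error_type=type(exc).__name__,   # structured, queryable error class
        error_message=str(exc),
        graph_snapshot=graph_snapshot,
        input_context=input_context,
    )
```

The point is that the record is built from the live exception and graph state, so no information has to be reconstructed from text logs later.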
The Queen/coding agent would query FailureStore before triggering graph evolution, enabling pattern-aware improvements rather than blind retries. This also directly unblocks the roadmap items for Custom Failure Conditions SDK and the Continuous Improvement Loop.
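To make the pre-evolution check concrete: at its simplest it could count repeated (node, error-type) pairs in recent records. The function name and dict shape below are illustrative, not an existing API:

```python
from collections import Counter

def repeated_failure_patterns(records: list[dict], threshold: int = 2) -> dict:
    """Return (node_id, error_type) pairs seen at least `threshold` times.

    A repeated pattern suggests the Queen should apply a targeted fix
    rather than trigger another blind retry of the same graph.
    """
    counts = Counter((r["node_id"], r["error_type"]) for r in records)
    return {pair: n for pair, n in counts.items() if n >= threshold}
```

For example, two `TimeoutError` failures on the same node would surface as one pattern, while a one-off error elsewhere would be ignored.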
Alternatives Considered
- Extending the existing runtime logger (runtime/runtime_logger.py) - rejected because logs are text-based and not queryable by the Queen agent in a structured way. Logs are for humans while failure records are for the evolution engine.
- Using the checkpoint system alone (storage/checkpoint_store.py) - rejected because checkpoints capture state snapshots but not failure semantics (error type, resolution, whether evolution was triggered). Both can coexist.
- External observability tools (e.g. Sentry, Datadog) - rejected as a primary solution because Hive is self-hostable and should not depend on third-party services for its core self-healing loop.
Additional Context
This gap affects both product integrity and developer trust:
- Developers building production agents have no way to audit why an agent evolved the way it did
- Repeated failures of the same type cannot be detected or prevented without historical tracking
- The roadmap's Continuous Improvement Loop item is directly blocked by this missing foundation
Relevant roadmap items this would unblock:
- Failure capture mechanism
- Failure analysis tools
- Historical failure tracking
- Continuous improvement loop
- Custom Failure Conditions SDK
Implementation Ideas
- Create storage/failure_store.py following the same async patterns as storage/checkpoint_store.py
- Hook into graph/executor.py at the point where ExecutionResult returns an error status, and capture the FailureRecord there
- Expose a query interface so the Queen/coding agent (agents/hive_coder/) can retrieve recent failures for a given agent_id before deciding on evolution strategy
- Add failure_count and last_failure_at fields to the existing SharedState so nodes can self-check before retrying
- Wire into the existing L2/L3 runtime logging levels so failure records are also reflected in human-readable logs
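The ideas above could start as small as the following sketch. Everything here is an assumption: the class name, the JSONL layout, and the method names are illustrative, and the real conventions in storage/checkpoint_store.py may differ (only async method signatures are assumed to match the pattern):

```python
import asyncio
import json
from pathlib import Path

class FailureStore:
    """Append-only JSONL store for failure records (sketch, not the real API)."""

    def __init__(self, path: str):
        self._path = Path(path)

    async def append(self, record: dict) -> None:
        # Offload blocking file I/O so the executor's event loop stays responsive.
        await asyncio.to_thread(self._append_sync, record)

    def _append_sync(self, record: dict) -> None:
        with self._path.open("a", encoding="utf-8") as f:
            f.write(json.dumps(record) + "\n")

    async def recent_failures(self, agent_id: str, limit: int = 20) -> list[dict]:
        # Query interface for the Queen: latest failures for one agent.
        records = await asyncio.to_thread(self._read_sync)
        matching = [r for r in records if r.get("agent_id") == agent_id]
        return matching[-limit:]

    def _read_sync(self) -> list[dict]:
        if not self._path.exists():
            return []
        with self._path.open(encoding="utf-8") as f:
            return [json.loads(line) for line in f if line.strip()]
```

An append-only file keeps the write path on the failure hot path trivially cheap; richer indexing or analysis can be layered on later without changing the capture side.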
This is intentionally minimal — the goal is to close the feedback loop first, with richer analysis tools built on top.