epic: platform foundation roadmap — P0 through P3 #7

@mdproctor

Overview

Platform-level epic tracking all foundation and application work identified in the cross-repo coherence audit and Gastown gap analysis. Organised by phase: each phase has a gate condition that must be met before later phases can begin. P0 and P1 address correctness and scale; P2 addresses production quality; P3 expands capability.

Effort: S = 2–5 days · M = 1 week · L = 2–3 weeks · XL = months
Priority within phase: items are listed in recommended execution order.


P0 — Wiring (do first — breaks immediately with multiple agents)

Gate: Normative layer functional end-to-end. Trust accumulates from real agent behaviour. Same actor identity across all repos.

| # | Item | Repo | Complexity | Effort | Issue | Untracked deps |
|---|------|------|------------|--------|-------|----------------|
| 1 | Commitment outcomes → LedgerAttestation (FULFILLED→SOUND, FAILED→FLAGGED) | quarkus-qhorus | Low | S | qhorus#123 | |
| 2 | ActorTypeResolver utility: unified ActorType derivation across all consumers | quarkus-ledger | Low | S | ledger#47 | |
| 3 | InstanceActorIdProvider SPI: map Qhorus instanceId → ledger actorId (persona format) | quarkus-qhorus | Medium | S | qhorus#124 | ledger#47 done first |
| 4 | Normative→prescriptive wiring: CaseHub work assignments send Qhorus COMMAND | casehub-engine | High | L | engine#186 | qhorus#123 done first |

Why this order: 1 and 2 are standalone with no inter-dependencies and highest leverage. 3 depends conceptually on the identity model from 2. 4 is last because it depends on the commitment lifecycle being functional (1) and is the most invasive change — it touches CaseContextChangedEventHandler, WorkOrchestrator, and requires a WorkerResponseHandler.
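As a rough illustration of item 1, the outcome-to-attestation mapping could look like the sketch below. Only the two mappings (FULFILLED→SOUND, FAILED→FLAGGED) come from this roadmap; the type names `CommitmentOutcome`, `AttestationVerdict` and `CommitmentAttestationMapper`, and the non-terminal `OPEN` state, are hypothetical and not taken from quarkus-qhorus or quarkus-ledger.

```java
import java.util.Optional;

// Hypothetical types for illustration; not the real qhorus/ledger API.
enum CommitmentOutcome { OPEN, FULFILLED, FAILED }

enum AttestationVerdict { SOUND, FLAGGED }

final class CommitmentAttestationMapper {
    private CommitmentAttestationMapper() {}

    /**
     * Maps a terminal commitment outcome to an attestation verdict.
     * Non-terminal outcomes (here the assumed OPEN state) produce no attestation.
     */
    static Optional<AttestationVerdict> verdictFor(CommitmentOutcome outcome) {
        switch (outcome) {
            case FULFILLED:
                return Optional.of(AttestationVerdict.SOUND);
            case FAILED:
                return Optional.of(AttestationVerdict.FLAGGED);
            default:
                return Optional.empty();
        }
    }
}
```

Returning an empty `Optional` for non-terminal states is one possible design; it keeps the caller from writing an attestation before the commitment has actually resolved.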

Risk in P0: Item 4 (engine#186) requires CaseLedgerEntry to be on main for the ledger side to work. If the branch merge (P1.1) is not done first, item 4 must be implemented without ledger integration and revisited. Consider pulling P1.1 into P0 if the branch is close to mergeable.


P1 — Scale (breaks at 10+ concurrent cases/agents)

Gate: Can run 10+ simultaneous cases without manual intervention, API exhaustion, or stuck agents. Trust scores drive routing decisions.

| # | Item | Repo | Complexity | Effort | Issue | Notes |
|---|------|------|------------|--------|-------|-------|
| 1 | Merge CaseLedgerEntry branch (feat/casehub-ledger-integration) | casehub-engine | Medium | M | (not tracked) | Resolve merge conflicts; verify OTel propagation via @EntityListeners inheritance; add invariant test |
| 2 | Agent concurrency throttling: SpawnThrottle in ClaudonyConfig (global + per-case ceiling, back-pressure queue) | claudony | Medium | M | (not tracked) | No inter-dependencies; pure Claudony addition |
| 3 | RecoveryPolicy SPI: detect stalled workers and take action (REPROVISION / ESCALATE / CANCEL / WAIT) | casehub-engine + claudony | Medium | M | (not tracked) | SPI in engine api/spi/; ReprovisioningRecoveryPolicy in claudony-casehub |
| 4 | Trust routing wired: WorkerSelectionStrategy injectable in CaseContextChangedEventHandler + TrustWeightedSelectionStrategy | casehub-engine | Medium | M | (not tracked) | Depends on P0.1 (trust scores must be computed before routing them) |

Why this order: 1 unblocks the compliance story and should be done first. 2 and 3 are independent and can run in parallel. 4 depends on P0.1 being complete — routing by trust is pointless if trust scores are never updated from behaviour.
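To make item 2 more concrete, here is a minimal sketch of a throttle with a global ceiling, a per-case ceiling and a back-pressure queue. The class name `SpawnThrottleSketch` and its methods are assumptions for illustration; the actual SpawnThrottle design in ClaudonyConfig is not specified in this issue.

```java
import java.util.ArrayDeque;
import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;
import java.util.Queue;

// Hypothetical sketch; not the Claudony implementation.
final class SpawnThrottleSketch {
    private final int globalCeiling;
    private final int perCaseCeiling;
    private int globalActive = 0;
    private final Map<String, Integer> activeByCase = new HashMap<>();
    // Case ids of spawn requests that were deferred (back-pressure queue).
    private final Queue<String> waiting = new ArrayDeque<>();

    SpawnThrottleSketch(int globalCeiling, int perCaseCeiling) {
        this.globalCeiling = globalCeiling;
        this.perCaseCeiling = perCaseCeiling;
    }

    /** Returns true if the spawn may proceed now; otherwise queues it and returns false. */
    synchronized boolean requestSpawn(String caseId) {
        if (hasCapacity(caseId)) {
            admit(caseId);
            return true;
        }
        waiting.add(caseId);
        return false;
    }

    /** Frees one slot for caseId; returns the case id of a queued request admitted as a result, or null. */
    synchronized String release(String caseId) {
        Integer active = activeByCase.get(caseId);
        if (active == null) {
            throw new IllegalStateException("no active agent for case " + caseId);
        }
        globalActive--;
        if (active == 1) {
            activeByCase.remove(caseId);
        } else {
            activeByCase.put(caseId, active - 1);
        }
        // Admit the oldest queued request that fits under both ceilings.
        for (Iterator<String> it = waiting.iterator(); it.hasNext();) {
            String next = it.next();
            if (hasCapacity(next)) {
                it.remove();
                admit(next);
                return next;
            }
        }
        return null;
    }

    private boolean hasCapacity(String caseId) {
        return globalActive < globalCeiling
                && activeByCase.getOrDefault(caseId, 0) < perCaseCeiling;
    }

    private void admit(String caseId) {
        globalActive++;
        activeByCase.merge(caseId, 1, Integer::sum);
    }
}
```

Scanning the queue for the first request that fits, rather than strictly the head, avoids one saturated case blocking every other case behind it. Whether that fairness trade-off is wanted is a design decision for the real implementation.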

Risk in P1: Item 3 (RecoveryPolicy) requires careful design — what constitutes "stalled" at the casehub-engine level vs the qhorus Watchdog level needs to be clearly defined to avoid double-recovery. The three tiers (qhorus Watchdog → casehub-engine WorkerStatusListener → claudony fleet health) need coordinated stall detection thresholds.
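One way to reason about the double-recovery risk is to require strictly increasing stall thresholds across the three tiers, so that an inner tier always gets to act before an outer one. The sketch below assumes that ordering (watchdog < engine < fleet); the interface shape, the `StallThresholds` and `EngineTierRecoveryPolicy` names, and the choice of action per band are hypothetical. Only the four action names come from this roadmap.

```java
import java.time.Duration;

// Action names are from the roadmap; everything else here is an assumed shape.
enum RecoveryAction { REPROVISION, ESCALATE, CANCEL, WAIT }

interface RecoveryPolicy {
    RecoveryAction onStalled(String workerId, Duration silentFor);
}

/** Holds one stall threshold per tier and rejects configurations where tiers overlap. */
final class StallThresholds {
    final Duration watchdog;
    final Duration engine;
    final Duration fleet;

    StallThresholds(Duration watchdog, Duration engine, Duration fleet) {
        if (watchdog.compareTo(engine) >= 0 || engine.compareTo(fleet) >= 0) {
            throw new IllegalArgumentException("expected watchdog < engine < fleet");
        }
        this.watchdog = watchdog;
        this.engine = engine;
        this.fleet = fleet;
    }
}

/** Engine-tier policy: defers to the watchdog below its threshold, hands off above the fleet threshold. */
final class EngineTierRecoveryPolicy implements RecoveryPolicy {
    private final StallThresholds thresholds;

    EngineTierRecoveryPolicy(StallThresholds thresholds) {
        this.thresholds = thresholds;
    }

    @Override
    public RecoveryAction onStalled(String workerId, Duration silentFor) {
        if (silentFor.compareTo(thresholds.engine) < 0) {
            return RecoveryAction.WAIT;        // still within the watchdog tier's window
        }
        if (silentFor.compareTo(thresholds.fleet) < 0) {
            return RecoveryAction.REPROVISION; // engine tier owns this window
        }
        return RecoveryAction.ESCALATE;        // beyond the engine tier's window
    }
}
```

Validating the ordering in one place means a misconfiguration fails at startup instead of surfacing as two tiers recovering the same worker.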


P2 — Production quality (full observability, audit trail, cross-deployment trust)

Gate: Full audit trail complete. Case spans correlatable in Jaeger/Grafana. Compliance story holds end-to-end.

| # | Item | Repo | Complexity | Effort | Issue | Notes |
|---|------|------|------------|--------|-------|-------|
| 1 | OTel trace alignment: PropagationContext.traceId from LedgerTraceIdProvider at case creation | casehub-engine | Low | S | engine#185 | One-line change + fallback; quick win |
| 2 | Cross-deployment trust federation: TrustExportService / TrustImportService SPIs | quarkus-ledger | Medium | L | (not tracked) | Canonical format design is the hard part; transport (webhook/Kafka) is pluggable |
| 3 | Cross-repo causal chain: causedByEntryId at provisioning; CaseLineageQuery JPA implementation | claudony | High | L | claudony#94 | CaseLineageQuery JPA is non-trivial; requires casehub datasource configured in claudony |

Why this order: 1 is a quick win with no dependencies. 2 and 3 are both high-value but complex — run in parallel if capacity allows. 3 depends on CaseLedgerEntry being merged (P1.1).
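The "one-line change + fallback" for item 1 might look roughly like the following. The real `LedgerTraceIdProvider` signature is not shown in this issue, so it is modelled here as a plain `Supplier<Optional<String>>`; the `TraceIdResolver` name and the choice of a random 32-hex-character fallback (the length of a W3C trace id) are assumptions.

```java
import java.util.Optional;
import java.util.UUID;
import java.util.function.Supplier;

// Hypothetical helper; the provider is modelled as a Supplier because its real signature is unknown.
final class TraceIdResolver {
    private TraceIdResolver() {}

    /** Uses the ledger-provided trace id when present and non-blank, otherwise generates one. */
    static String resolve(Supplier<Optional<String>> ledgerTraceId) {
        return ledgerTraceId.get()
                .filter(id -> !id.isBlank())
                // Fallback: 32 lowercase hex characters derived from a random UUID.
                .orElseGet(() -> UUID.randomUUID().toString().replace("-", ""));
    }
}
```

At case creation the result would be assigned to `PropagationContext.traceId`, so that ledger entries and case spans share one id whenever the ledger already has a trace in flight.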


P3 — Capability expansion (new capabilities on a solid foundation)

Gate: P0 and P1 complete. Foundation is solid. Team has capacity for new work.

| # | Item | Repo | Complexity | Effort | Issue | Notes |
|---|------|------|------------|--------|-------|-------|
| 1 | Notification consolidation: quarkus-work-notifications delegates Slack/Teams to casehub-connectors | quarkus-work + casehub-connectors | Medium | M | parent#5 | Unblocks P3.3 and P3.5 |
| 2 | SLA propagation: case budget bounds child WorkItem and Commitment deadlines | casehub-engine + quarkus-work | Medium | M | parent#6 | Adapter-level change; no foundation changes needed |
| 3 | Critical event notifications: stalled obligations, case faults, escalations → casehub-connectors | qhorus + engine + work + connectors | Medium | M | (not tracked) | Depends on P3.1 (unified delivery pipeline first) |
| 4 | Human-in-the-loop end-to-end: casehub-work-adapter: WorkItem COMPLETED → CaseHubReactor.signal() → case continues | casehub-engine + quarkus-work | High | L | (not tracked) | Most important HITL integration; currently blocked on engine stability |
| 5 | casehub-assisteddev: AI-assisted development application (merge queue, code review orchestration) | new repo | Very High | XL | (not tracked; needs its own epic) | Separate repo; uses foundation primitives; needs domain design first |
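For item 2 (SLA propagation), "case budget bounds child deadlines" can be read as clamping each child deadline to the case deadline. The sketch below shows that reading only; the `SlaBounds` name and the treatment of a missing child deadline are assumptions, not the planned adapter code.

```java
import java.time.Instant;

// Hypothetical helper illustrating deadline clamping; not the casehub adapter API.
final class SlaBounds {
    private SlaBounds() {}

    /**
     * Returns the deadline a child WorkItem or Commitment should carry:
     * the requested deadline, unless it is missing or falls after the case deadline.
     */
    static Instant boundChildDeadline(Instant caseDeadline, Instant requestedChildDeadline) {
        if (requestedChildDeadline == null) {
            return caseDeadline; // assumption: children without a deadline inherit the case's
        }
        return requestedChildDeadline.isAfter(caseDeadline) ? caseDeadline : requestedChildDeadline;
    }
}
```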

Hypothesis test (parallel track — not a blocker for P0-P3)

| # | Item | Repo | Complexity | Effort | Issue | Notes |
|---|------|------|------------|--------|-------|-------|
| | Normative layer interoperability experiment: LangChain4j vs CaseHub on production incident scenario | casehub-engine | High | L | engine#189 | Can proceed once P0.1 (qhorus#123) is done; generates external evidence for normative layer claims |

Untracked issues to create (P1–P3)

The following items are specified in the roadmap but not yet tracked as GitHub issues:

| Item | Recommended repo | Notes |
|------|------------------|-------|
| Merge CaseLedgerEntry branch | casehubio/engine | May warrant a PR not an issue |
| Agent concurrency throttling (SpawnThrottle) | casehubio/claudony | |
| RecoveryPolicy SPI | casehubio/engine | |
| Trust routing wired (injectable WorkerSelectionStrategy) | casehubio/engine | |
| Cross-deployment trust federation | casehubio/quarkus-ledger | |
| Critical event notifications | casehubio/casehub-parent (cross-repo) | |
| HITL end-to-end (casehub-work-adapter completion) | casehubio/engine | |
| casehub-assisteddev | new repo | Needs its own epic |

Summary

| Phase | Items | Estimated total effort | Gate condition |
|-------|-------|------------------------|----------------|
| P0: Wiring | 4 items | ~3–4 weeks | Normative layer functional; trust accumulates |
| P1: Scale | 4 items | ~4–5 weeks | 10+ agents; no manual intervention needed |
| P2: Quality | 3 items | ~4–6 weeks | Full audit trail; Jaeger correlation |
| P3: Expand | 5 items | ~3 months + XL | New capabilities; casehub-assisteddev is a separate product epic |

Total to production-quality foundation: ~3–4 months of focused engineering.
casehub-assisteddev is a separate product investment beyond the foundation.

