hiclaw-controller refactoring: Reconciler lacks step-level idempotency, and partial failures produce orphan resources #555

@Jing-ze

Description


Problem description

WorkerReconciler.handleCreate is currently a linear 8-step process (Matrix registration → Room creation → Higress Consumer → AI Route → configuration generation → OSS push → Skills push → container creation) with no step-level checkpoint mechanism. If any intermediate step fails, the next reconcile starts over from step 1.

This problem exists in both the current bash script version and the upcoming Go refactor version (docs/design/hiclaw-controller-refactor.md Section 3.3).

Specific scenario

Take creating a Worker as an example:

  1. ✅ Step 1: register Matrix user alice (succeeded)
  2. ✅ Step 2: create the Matrix Room (succeeded)
  3. ✅ Step 3: create the Higress Consumer (succeeded)
  4. ❌ Step 4: configure the AI Route (failed, e.g. because Higress is temporarily unavailable)
  5. The Controller re-enqueues the Worker, and the next reconcile starts from step 1.
  6. Step 1 runs again: registering alice may return a 409 conflict or may create a duplicate account, depending on the idempotency guarantees of the Matrix Server.
  7. Step 3 runs again: creating the Higress Consumer may produce a duplicate consumer.

Each partial failure followed by a retry can leave orphan resources in the Matrix Server and Higress: users, Rooms, Consumers, and Routes that were created but never fully linked together.

Affected scope

  • WorkerReconciler.handleCreate: the 8-step linear process
  • TeamReconciler.handleCreate: the same problem (Leader creation → Worker creation → Team Room → permission configuration)
  • HumanReconciler.handleCreate: the same problem (Matrix registration → permission configuration → Room invitation → email notification)

Why it must be solved now

  • In embedded mode the scale is small (usually a handful of Workers), so cleaning up orphan resources by hand is still acceptable.
  • In K8s incluster mode, reconciles run more frequently and with more concurrency, so orphan resources accumulate quickly, and there is no simple way to clean them up manually.
  • The Go version of handleCreate in the refactoring design document (Section 3.3) keeps the same linear structure. If this is not fixed during the refactor, the problem will be carried into the new architecture.

Current code locations

  • hiclaw-controller/internal/controller/worker_controller.go, handleCreate method
  • hiclaw-controller/internal/controller/team_controller.go, handleCreate method
  • hiclaw-controller/internal/controller/human_controller.go, handleCreate method
  • Refactor design: docs/design/hiclaw-controller-refactor.md Section 3.3

Related
