Problem description
WorkerReconciler.handleCreate is currently a linear 8-step flow (Matrix registration → Room creation → Higress Consumer → AI Route → configuration generation → OSS push → Skills push → container creation) with no step-level checkpoint mechanism. If any intermediate step fails, the next reconcile starts over from step 1.
This problem exists in both the current bash script version and the upcoming Go refactor (docs/design/hiclaw-controller-refactor.md Section 3.3).
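To make the restart behaviour concrete, here is a minimal sketch of a linear flow that keeps no state between reconciles. It is not the actual handleCreate code: the `step` type, `runLinear`, `simulateLinear`, and the step names are hypothetical, and only four of the eight steps are modelled.

```go
package main

import (
	"errors"
	"fmt"
)

// step is a hypothetical stand-in for one stage of handleCreate.
type step struct {
	name string
	run  func() error
}

// runLinear mimics a flow without checkpoints: it always begins at the
// first step, stops at the first error, and remembers nothing between calls.
func runLinear(steps []step) error {
	for _, s := range steps {
		if err := s.run(); err != nil {
			return fmt.Errorf("%s: %w", s.name, err)
		}
	}
	return nil
}

// simulateLinear builds a four-step flow whose last step fails `failures`
// times, reconciles until it succeeds, and returns how often each step ran.
func simulateLinear(failures int) map[string]int {
	calls := map[string]int{}
	remaining := failures
	ok := func(name string) step {
		return step{name: name, run: func() error { calls[name]++; return nil }}
	}
	steps := []step{
		ok("matrix-register"),
		ok("matrix-room"),
		ok("higress-consumer"),
		{name: "ai-route", run: func() error {
			calls["ai-route"]++
			if remaining > 0 {
				remaining--
				return errors.New("higress temporarily unavailable")
			}
			return nil
		}},
	}
	for runLinear(steps) != nil {
		// A controller would requeue here; the next pass restarts at step 1.
	}
	return calls
}

func main() {
	calls := simulateLinear(2)
	fmt.Println(calls["matrix-register"], calls["higress-consumer"], calls["ai-route"])
}
```

With two transient failures in the fourth step, each of the first three steps is invoked three times, even though each had already succeeded on the first pass.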
Specific scenario
Take creating a Worker as an example:
- ✅ Step 1: Register Matrix user alice (succeeded)
- ✅ Step 2: Create Matrix Room (succeeded)
- ✅ Step 3: Create Higress Consumer (succeeded)
- ❌ Step 4: Configure AI Route (failed, e.g. Higress is temporarily unavailable)
- The Controller re-enqueues the Worker, and the next reconcile starts again from step 1.
- Step 1 runs again: registering alice may return a 409 conflict or may create a duplicate account, depending on the Matrix Server's idempotency guarantees.
- Step 3 runs again: creating the Higress Consumer may produce a duplicate consumer.
Every partial failure followed by a retry can leave orphan resources (users, Rooms, Consumers, and Routes that were created but never fully linked) in Matrix Server and Higress.
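For contrast, a step-level checkpoint could look roughly like the sketch below. This is an illustration under assumptions, not a proposed implementation: the `checkpoint` type, `runWithCheckpoint`, and the step names are hypothetical, and the checkpoint is an in-memory map, whereas a real controller would have to persist it somewhere that survives restarts.

```go
package main

import (
	"errors"
	"fmt"
)

// checkpoint records which steps have completed. Keeping it in memory is an
// assumption of this sketch; a controller would need durable storage.
type checkpoint map[string]bool

// step is a hypothetical stand-in for one stage of handleCreate.
type step struct {
	name string
	run  func() error
}

// runWithCheckpoint skips steps already marked done and marks a step done
// only after it returns nil.
func runWithCheckpoint(cp checkpoint, steps []step) error {
	for _, s := range steps {
		if cp[s.name] {
			continue
		}
		if err := s.run(); err != nil {
			return fmt.Errorf("%s: %w", s.name, err)
		}
		cp[s.name] = true
	}
	return nil
}

// simulateCheckpointed builds a four-step flow whose last step fails
// `failures` times, reconciles until it succeeds with one shared checkpoint,
// and returns how often each step ran.
func simulateCheckpointed(failures int) map[string]int {
	calls := map[string]int{}
	remaining := failures
	ok := func(name string) step {
		return step{name: name, run: func() error { calls[name]++; return nil }}
	}
	steps := []step{
		ok("matrix-register"),
		ok("matrix-room"),
		ok("higress-consumer"),
		{name: "ai-route", run: func() error {
			calls["ai-route"]++
			if remaining > 0 {
				remaining--
				return errors.New("higress temporarily unavailable")
			}
			return nil
		}},
	}
	cp := checkpoint{}
	for runWithCheckpoint(cp, steps) != nil {
		// Requeue: the next pass resumes at the first step not yet marked done.
	}
	return calls
}

func main() {
	calls := simulateCheckpointed(2)
	fmt.Println(calls["matrix-register"], calls["higress-consumer"], calls["ai-route"])
}
```

In this sketch only the failing step is retried, and the three steps that already succeeded run once. A checkpoint alone does not close every gap: if the process dies after a step succeeds but before the checkpoint is saved, that step still runs again, so the individual steps would likely need to be idempotent as well.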
Scope of impact
- WorkerReconciler.handleCreate: 8-step linear flow
- TeamReconciler.handleCreate: similar problem (Leader creation → Worker creation → Team Room → permission configuration)
- HumanReconciler.handleCreate: similar problem (Matrix registration → permission configuration → Room invitation → email notification)
Why it must be solved now
- In embedded mode the scale is small (usually a handful of Workers), so cleaning up orphan resources by hand is still acceptable.
- In K8s incluster mode reconciles run more frequently and with more concurrency, so orphan resources accumulate quickly and there is no simple way to clean them up manually.
- The Go version of handleCreate in the refactor design document (Section 3.3) keeps the same linear structure. If this is not addressed during the refactor, the problem will carry over into the new architecture.
Current code locations
- hiclaw-controller/internal/controller/worker_controller.go: handleCreate method
- hiclaw-controller/internal/controller/team_controller.go: handleCreate method
- hiclaw-controller/internal/controller/human_controller.go: handleCreate method
- Refactor design: docs/design/hiclaw-controller-refactor.md Section 3.3
Related