hiclaw-controller refactoring: WorkerBackend interface lacks Start/Stop/Update and cannot support Worker life cycle management #556

@Jing-ze

Problem description

The WorkerBackend interface defined in the refactoring design document (docs/design/hiclaw-controller-refactor.md, Section 3.2) has only five methods: Create, Delete, Status, Exec, and Logs.

However, other sections of the design document describe core features that depend on Start, Stop, and Update capabilities. These features cannot be implemented against the current interface definition.

Missing capabilities and functionality that relies on them

Missing Start/Stop

Sections 5.2-5.3 of the design document define the Worker sleep/wake life cycle:

Running ──idle timeout──> Sleeping ──wake──> Running

The Team Leader starts and stops Worker containers through hiclaw worker wake and hiclaw worker sleep. The CLI command table in Section 5.3 lists both operations explicitly:

Operation | CLI command                  | Controller behavior
Wake      | hiclaw worker wake --name W  | Backend.Start(W)
Sleep     | hiclaw worker sleep --name W | Backend.Stop(W)

However, the WorkerBackend interface has no Start or Stop method.

The current docker-proxy (docker-proxy/security.go) already supports container start/stop/restart operations, so the capability exists in the current system. It simply has not been abstracted into the interface layer.
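To make the gap concrete, here is a minimal sketch of what the interface could look like with Start and Stop added. The method names come from the design document as quoted in this issue, but every signature, the WorkerStatus type, the ErrNotFound error, and the in-memory memBackend are assumptions made for illustration. They are not the project's actual definitions.

```go
package main

import (
	"errors"
	"fmt"
)

// WorkerStatus is a hypothetical status type for this sketch.
type WorkerStatus string

const (
	StatusRunning WorkerStatus = "Running"
	StatusStopped WorkerStatus = "Stopped"
)

// ErrNotFound is returned when a worker name is unknown (hypothetical).
var ErrNotFound = errors.New("worker not found")

// WorkerBackend holds the five methods named in Section 3.2 plus the
// proposed Start and Stop. Every signature here is an assumption.
type WorkerBackend interface {
	Create(name, image string) error
	Delete(name string) error
	Status(name string) (WorkerStatus, error)
	Exec(name string, cmd []string) (string, error)
	Logs(name string) (string, error)
	Start(name string) error // proposed addition
	Stop(name string) error  // proposed addition
}

// memBackend is an in-memory stand-in for DockerBackend or K8sBackend.
type memBackend struct {
	workers map[string]WorkerStatus
}

func newMemBackend() *memBackend {
	return &memBackend{workers: map[string]WorkerStatus{}}
}

// Create registers the worker as running; the image is ignored in this sketch.
func (b *memBackend) Create(name, image string) error {
	b.workers[name] = StatusRunning
	return nil
}

func (b *memBackend) Delete(name string) error {
	delete(b.workers, name)
	return nil
}

func (b *memBackend) Status(name string) (WorkerStatus, error) {
	s, ok := b.workers[name]
	if !ok {
		return "", ErrNotFound
	}
	return s, nil
}

// Exec and Logs are not modeled in this sketch and return empty output.
func (b *memBackend) Exec(name string, cmd []string) (string, error) { return "", nil }
func (b *memBackend) Logs(name string) (string, error)               { return "", nil }

// setStatus changes the status of an existing worker.
func (b *memBackend) setStatus(name string, s WorkerStatus) error {
	if _, ok := b.workers[name]; !ok {
		return ErrNotFound
	}
	b.workers[name] = s
	return nil
}

func (b *memBackend) Start(name string) error { return b.setStatus(name, StatusRunning) }
func (b *memBackend) Stop(name string) error  { return b.setStatus(name, StatusStopped) }

// Wake and Sleep mirror the controller behavior listed in the Section 5.3 table.
func Wake(b WorkerBackend, name string) error  { return b.Start(name) }
func Sleep(b WorkerBackend, name string) error { return b.Stop(name) }

func main() {
	var b WorkerBackend = newMemBackend()
	_ = b.Create("w1", "example/worker:v1")
	_ = Sleep(b, "w1")
	s, _ := b.Status("w1")
	fmt.Println("after sleep:", s)
	_ = Wake(b, "w1")
	s, _ = b.Status("w1")
	fmt.Println("after wake:", s)
}
```

With Start and Stop on the interface, the wake and sleep commands reduce to one-line calls into whichever backend is configured, which is the behavior the Section 5.3 table already assumes.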

Missing Update

Section 8.4 of the design document describes the rolling upgrade process for the Runtime engine:

WorkerReconciler detects the image change → creates a new version of the Worker instance → waits for the new instance to be ready → deletes the old instance

This "create new, then delete old" pattern is reasonable in Docker mode, but in K8s mode the more natural approach is to patch the Pod spec directly (an in-place upgrade). Without an Update method, K8sBackend cannot take advantage of K8s-native update capabilities and is forced onto the same "delete old, create new" path.
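The difference between the two upgrade paths can be sketched as two implementations of one hypothetical Update method. Nothing below calls Docker or Kubernetes: the Updater interface, the instance type, and both backends are illustrative assumptions, and the generation counter exists only to show whether an instance was replaced or modified in place.

```go
package main

import "fmt"

// instance models one Worker. generation increments whenever the instance
// is replaced rather than modified in place (a device of this sketch only).
type instance struct {
	image      string
	generation int
}

// Updater is a hypothetical shape for the missing Update capability.
type Updater interface {
	Update(name, image string) error
}

// recreateBackend models the Docker-style path: create new, then delete old.
type recreateBackend struct {
	workers map[string]instance
}

func (b *recreateBackend) Update(name, image string) error {
	old, ok := b.workers[name]
	if !ok {
		return fmt.Errorf("worker %q not found", name)
	}
	// Create the replacement; waiting for readiness is omitted here.
	next := instance{image: image, generation: old.generation + 1}
	// Deleting the old instance is modeled by overwriting the map entry.
	b.workers[name] = next
	return nil
}

// inPlaceBackend models the K8s-style path: patch the existing instance.
type inPlaceBackend struct {
	workers map[string]instance
}

func (b *inPlaceBackend) Update(name, image string) error {
	cur, ok := b.workers[name]
	if !ok {
		return fmt.Errorf("worker %q not found", name)
	}
	// Stands in for patching the image in the Pod spec; the instance
	// itself, and therefore its generation, is kept.
	cur.image = image
	b.workers[name] = cur
	return nil
}

func main() {
	d := &recreateBackend{workers: map[string]instance{"w1": {image: "worker:v1"}}}
	k := &inPlaceBackend{workers: map[string]instance{"w1": {image: "worker:v1"}}}
	for _, u := range []Updater{d, k} {
		if err := u.Update("w1", "worker:v2"); err != nil {
			fmt.Println("update failed:", err)
		}
	}
	fmt.Printf("recreate: %+v\n", d.workers["w1"])
	fmt.Printf("in-place: %+v\n", k.workers["w1"])
}
```

The reconciler would call Update the same way in both modes, and each backend would choose its own strategy behind the interface.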

Why is it an interface design issue rather than an implementation issue?

The WorkerBackend interface is defined in Phase 1, and Phase 2 (K8sBackend), Phase 3 (Team Leader life cycle management), and Phase 4 (rolling upgrade) all depend on it. If the interface defined in Phase 1 is incomplete, adding methods in later phases will mean:

  • Every Backend already implemented (DockerBackend, K8sBackend) must be extended to implement the new methods
  • The state machine logic in Reconciler needs to be rewritten (from only two actions Create/Delete to five actions Create/Start/Stop/Update/Delete)
  • Worker's Phase state machine (Pending → Running → Sleeping → Updating → Running) needs to be redesigned

The interface is the backbone of the architecture and must be designed completely, in one pass, in Phase 1.
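The Phase state machine mentioned above can be sketched as a transition table. The four phases are the ones named in this issue. The event names and the exact set of allowed transitions are assumptions chosen to reproduce the Pending → Running → Sleeping → Updating → Running path, and deletion is left out for brevity.

```go
package main

import "fmt"

// Phase is a Worker phase as named in this issue.
type Phase string

const (
	Pending  Phase = "Pending"
	Running  Phase = "Running"
	Sleeping Phase = "Sleeping"
	Updating Phase = "Updating"
)

// Event is a hypothetical trigger that moves a Worker between phases.
type Event string

const (
	EvReady        Event = "ready"         // instance became ready
	EvSleep        Event = "sleep"         // idle timeout or explicit sleep
	EvWake         Event = "wake"          // explicit wake
	EvImageChanged Event = "image-changed" // image change detected
)

// transitions lists the allowed moves; anything absent is rejected.
var transitions = map[Phase]map[Event]Phase{
	Pending:  {EvReady: Running},
	Running:  {EvSleep: Sleeping, EvImageChanged: Updating},
	Sleeping: {EvWake: Running, EvImageChanged: Updating},
	Updating: {EvReady: Running},
}

// Next returns the phase reached from p on event e, and whether the
// transition is allowed.
func Next(p Phase, e Event) (Phase, bool) {
	next, ok := transitions[p][e]
	return next, ok
}

func main() {
	p := Pending
	for _, e := range []Event{EvReady, EvSleep, EvImageChanged, EvReady} {
		next, ok := Next(p, e)
		if !ok {
			fmt.Printf("invalid: %s while %s\n", e, p)
			return
		}
		fmt.Printf("%s --%s--> %s\n", p, e, next)
		p = next
	}
}
```

Keeping the table in one place makes it easier to see what changes if Start, Stop, or Update is added later: each new method adds edges that every Backend and the Reconciler must then honor.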

Current code location

  • Refactoring design: docs/design/hiclaw-controller-refactor.md Section 3.2 (WorkerBackend interface definition)
  • Depends on Start/Stop: Sections 5.2-5.3 (Team Leader Worker life cycle management)
  • Depends on Update: Section 8.4 (Runtime engine upgrade)
  • Existing start/stop capability: docker-proxy/security.go (container start/stop/restart is already supported)
