etcd: HorizontalScaling scale-out leaves cluster stuck in Updating (memberJoin false positive) #2541

## Summary

After a horizontal scale-out (3 → 4 replicas), the KubeBlocks component controller enters an infinite loop reporting that the new member has not joined, even though `etcdctl member list` shows the new pod is fully started and part of the cluster.

## Environment

- KubeBlocks: v1.0.2
- etcd addon: v1.0.2
- Kubernetes: EKS (ap-southeast-1)

## Steps to Reproduce

1. Create a 3-replica etcd cluster
2. Apply a HorizontalScaling OpsRequest with `scaleOut.replicaChanges: 1`

```yaml
apiVersion: operations.kubeblocks.io/v1alpha1
kind: OpsRequest
metadata:
  name: etcd-scale-out
  namespace: demo
spec:
  clusterName: etcd-cluster
  type: HorizontalScaling
  horizontalScaling:
  - componentName: etcd
    scaleOut:
      replicaChanges: 1
```

3. OpsRequest completes with `Succeed 1/1` (~78 seconds)
4. Observe the component controller logs

## Observed Behavior

The OpsRequest reports `Succeed`, but the cluster remains in the `Updating` phase indefinitely. The component controller loops on the `memberJoin` lifecycle action:

```
action failed some replicas have not joined: [etcd-cluster-etcd-3]
```

This message repeats continuously. Meanwhile, `etcdctl member list` from inside the pod shows etcd-3 is a healthy, started member:

```
f3e7454e8da39e57, started, etcd-cluster-etcd-3, http://etcd-cluster-etcd-3...:2380, ...
```

The cluster never transitions back to `Running`.

## Root Cause (confirmed by code analysis)

Two bugs interact to produce the infinite loop:

### Bug 1 — Annotation not persisted on transient failure (KubeBlocks core)

File: `controllers/apps/component/transformer_component_workload.go`, `handleUpdate()`

When `joinMember4ScaleOut()` returns a `RequeueError` (e.g., because kbagent is not yet ready — `dial tcp …:3501: connection refused`), `handleUpdate()` returns early **before** `cli.Update()` is called:

```go
if err := t.handleWorkloadUpdate(...); err != nil {
    return err          // ← RequeueError exits here
}
objCopy := copyAndMergeITS(runningITS, protoITS, ...)
if objCopy != nil {
    cli.Update(dag, nil, objCopy, ...)   // ← NEVER REACHED on error
}
```

Because `cli.Update()` is never called, the `MemberJoined=true` annotation is never written to the running InstanceSet on the API server. On the next reconcile, `BuildReplicasStatus(runningITS, protoITS)` copies `MemberJoined=false` from the unchanged running InstanceSet, resetting the state. This creates a retry loop.

### Bug 2 — `member-join.sh` is not idempotent (etcd addon)

File: `addons/etcd/scripts/member-join.sh`, `add_member()`

```bash
exec_etcdctl "$leader_endpoint:2379" member add "$KB_JOIN_MEMBER_POD_NAME" \
  --peer-urls="$peer_protocol://$join_member_endpoint:2380" || error_exit "Failed to join member"
```

There is no guard against calling `member add` when the member already exists. The exact sequence that creates the infinite loop:

1. Reconcile R1: kbagent not yet ready → `connection refused` → `RequeueError` → `cli.Update()` skipped → `MemberJoined=false` stays in running InstanceSet.
2. ...repeated N times until kbagent starts...
3. Reconcile RN: kbagent ready → `etcdctl member add etcd-3` **SUCCEEDS** → etcd-3 is now in the member list → `joinMemberForPod()` returns nil → function returns nil → `cli.Update()` called with `MemberJoined=true`.
4. If `cli.Update()` succeeds → **done, cluster recovers**.
5. If `cli.Update()` fails (API server conflict, transient error) → reconcile retried.
6. Reconcile RN+1: `BuildReplicasStatus` copies `MemberJoined=false` from unchanged running InstanceSet → `member add etcd-3` → **"member already exists"** → `error_exit` → non-zero exit → `joinMemberForPod()` returns error → `RequeueError` → `cli.Update()` skipped → **infinite loop**.
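The sequence above can be reproduced in a small simulation. This is a minimal sketch, not KubeBlocks code: `fakeEtcd`, `join`, and `reconcile` are hypothetical names, the "connection refused" rounds are omitted, and the transient `cli.Update()` failure is modeled as a single round (`failUpdateAt`) in which the annotation is not persisted.

```go
package main

import (
	"errors"
	"fmt"
)

// fakeEtcd is a hypothetical stand-in for the etcd membership API.
type fakeEtcd struct{ members map[string]bool }

var errAlreadyExists = errors.New("member already exists")

// addMember mimics `etcdctl member add`: it fails if the member is registered.
func (e *fakeEtcd) addMember(name string) error {
	if e.members[name] {
		return errAlreadyExists
	}
	e.members[name] = true
	return nil
}

// join models member-join.sh. With idempotent=true it checks membership first.
func join(e *fakeEtcd, name string, idempotent bool) error {
	if idempotent && e.members[name] {
		return nil
	}
	return e.addMember(name)
}

// reconcile runs up to maxRounds reconciles and returns the round in which
// MemberJoined=true was persisted, or 0 if it never was. The update to the
// API server fails once, in round failUpdateAt (0 means it never fails).
func reconcile(idempotent bool, failUpdateAt, maxRounds int) int {
	etcd := &fakeEtcd{members: map[string]bool{}}
	for round := 1; round <= maxRounds; round++ {
		// State is rebuilt from the persisted object each round, so a join
		// that was not persisted is attempted again.
		if err := join(etcd, "etcd-cluster-etcd-3", idempotent); err != nil {
			continue // models RequeueError: the update is skipped
		}
		if round == failUpdateAt {
			continue // transient update failure: annotation not persisted
		}
		return round
	}
	return 0
}

func main() {
	fmt.Println(reconcile(false, 1, 10)) // 0: never persisted within 10 rounds
	fmt.Println(reconcile(true, 1, 10))  // 2: recovers on the next reconcile
}
```

With the non-idempotent join, a single failed update in the round where `member add` first succeeds is enough to keep every later round failing. With the idempotent join, the very next round persists the annotation.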

### Confirmed with live test

```
17:51:49 INFO  ... connection refused (attempt 1)
17:51:50 INFO  ... connection refused (attempt 2)
17:51:51 INFO  ... connection refused (attempts 3-5)
17:51:52 INFO  ... connection refused (attempt 6)
17:51:53 INFO  succeed to join member for pod: etcd-cluster-etcd-3
```

In most cases the cluster recovers (step 4 above). In the reporter's case, steps 5 and 6 triggered the infinite loop.

## Fix

### Addon fix (recommended, immediate): Make `member-join.sh` idempotent

Check whether the member is already registered before calling `member add`. If it is, return success immediately:

```bash
add_member() {
  ...
  # Idempotency: skip if member already exists
  if exec_etcdctl "$leader_endpoint:2379" member list | grep -qw "$KB_JOIN_MEMBER_POD_NAME"; then
    log "Member $KB_JOIN_MEMBER_POD_NAME already exists in cluster, skipping"
    return 0
  fi

  exec_etcdctl "$leader_endpoint:2379" member add "$KB_JOIN_MEMBER_POD_NAME" \
    --peer-urls="$peer_protocol://$join_member_endpoint:2380" || error_exit "Failed to join member"
  log "Member $KB_JOIN_MEMBER_POD_NAME joined cluster via leader $leader_endpoint"
}
```

This ensures that even if `MemberJoined=true` fails to persist, subsequent reconciles still succeed: `member-join.sh` finds the member already registered and returns 0 without calling `member add`, `joinMemberForPod()` returns nil, and a later `cli.Update()` writes `MemberJoined=true`.
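The same membership check can be expressed in Go, for example in a test or diagnostic helper. The sketch below is an illustration only: `findMember` is a hypothetical function, and it assumes the default simple output of `etcdctl member list` (`ID, status, name, peer URLs, ...`, as in the line shown under Observed Behavior). It also assumes that a member which was added but has not started yet reports an empty name, so the peer URL host is used as a fallback.

```go
package main

import (
	"fmt"
	"strings"
)

// findMember scans `etcdctl member list` output and returns the status of the
// member that belongs to podName. It matches the name field exactly; if the
// name is empty (assumed to mean "added but not started"), it falls back to
// checking whether the peer URL host begins with podName.
func findMember(output, podName string) (status string, found bool) {
	for _, line := range strings.Split(output, "\n") {
		fields := strings.Split(line, ",")
		if len(fields) < 4 {
			continue // not a member line
		}
		for i := range fields {
			fields[i] = strings.TrimSpace(fields[i])
		}
		name, peerURL := fields[2], fields[3]
		hostMatch := strings.Contains(peerURL, "//"+podName+".") ||
			strings.Contains(peerURL, "//"+podName+":")
		if name == podName || (name == "" && hostMatch) {
			return fields[1], true
		}
	}
	return "", false
}

func main() {
	// The ".example" host suffix is made up for this illustration.
	out := "f3e7454e8da39e57, started, etcd-cluster-etcd-3, " +
		"http://etcd-cluster-etcd-3.example:2380, " +
		"http://etcd-cluster-etcd-3.example:2379, false"
	status, found := findMember(out, "etcd-cluster-etcd-3")
	fmt.Println(status, found) // started true
}
```

Matching on parsed fields rather than on the whole line avoids treating a name that appears only in another member's client URL as a match.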

### Core fix (long-term): Persist `MemberJoined=true` independently

In `handleUpdate()`, the `MemberJoined=true` annotation should be written to the API server immediately after `joinMemberForPod()` succeeds, **regardless of whether the InstanceSet spec update succeeds**. This decouples cluster membership tracking from the workload spec update.
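One possible shape for this decoupling is sketched below. It is not the KubeBlocks implementation: `store`, `patchMemberJoined`, `updateSpec`, and `handleScaleOut` are hypothetical names standing in for the API server object, an annotation-only write, the InstanceSet spec update, and the relevant part of `handleUpdate()`.

```go
package main

import (
	"errors"
	"fmt"
)

// store is a hypothetical stand-in for the API server's copy of the InstanceSet.
type store struct {
	memberJoined   map[string]bool // persisted MemberJoined state per pod
	failSpecUpdate bool            // simulates a conflict on the spec update
}

// patchMemberJoined models an annotation-only write.
func (s *store) patchMemberJoined(pod string) error {
	s.memberJoined[pod] = true
	return nil
}

// updateSpec models the full workload update that may fail transiently.
func (s *store) updateSpec() error {
	if s.failSpecUpdate {
		return errors.New("conflict")
	}
	return nil
}

// handleScaleOut records a successful join before attempting the spec update,
// so a failed spec update does not cause the join to be repeated.
func handleScaleOut(s *store, pod string, join func(string) error) error {
	if !s.memberJoined[pod] {
		if err := join(pod); err != nil {
			return fmt.Errorf("join %s: %w", pod, err)
		}
		if err := s.patchMemberJoined(pod); err != nil {
			return fmt.Errorf("persist MemberJoined for %s: %w", pod, err)
		}
	}
	return s.updateSpec()
}

func main() {
	s := &store{memberJoined: map[string]bool{}, failSpecUpdate: true}
	calls := 0
	join := func(string) error { calls++; return nil }

	fmt.Println(handleScaleOut(s, "etcd-cluster-etcd-3", join)) // conflict
	s.failSpecUpdate = false
	fmt.Println(handleScaleOut(s, "etcd-cluster-etcd-3", join)) // <nil>
	fmt.Println(calls)                                          // 1
}
```

In this sketch the join runs once even though the first spec update fails. If the annotation write itself fails after a successful join, the join is still retried, so an idempotent `member-join.sh` remains necessary alongside this change.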

There is already a `// TODO: should wait for the data to be loaded before joining the member?` comment in `joinMember4ScaleOut()` (`transformer_component_workload_ops.go:351`) indicating awareness of sequencing issues in this code path.

## Impact

- Cluster stuck in `Updating` indefinitely
- All subsequent OpsRequests (including scale-in) are rejected
- Workaround: delete and recreate the cluster

## Expected Behavior

After scale-out completes and the new member is fully started, the `memberJoin` action should succeed (idempotently), allowing the cluster to transition to `Running`.
