FATAL Check failed: bucket_set == db.prime.CVCUponInsert(key) during test_cluster_migrations_sequence

**Context & Metadata:**
* **Test:** `test_cluster_migrations_sequence[df_seeder_factory0-df_factory0]`
* **File:** `dragonfly/cluster_test.py`
* **Environment:** CI (Ubuntu 24.04, Epoll, Debug build)
* **Commit:** `b6a1e21e875d164c71b7590bb3ba48bee2e367a7`

---

### **Executive Summary**
The CI run failed because the master Dragonfly process (`pid: 19309`, port `30099`) crashed with `SIGABRT` (exit code -6). The Python test framework surfaced this as a `ConnectionRefusedError: [Errno 111]`, but this is merely a symptom of the underlying C++ process aborting. 

The crash occurred during concurrent `ZADD` operations pushed by `df_seeder` while a cluster slot migration was taking place, triggering a `DCHECK` failure in the DashTable logic.
In short: a fiber performing a write command yielded control during a cluster migration callback, allowing a concurrent fiber to grow the internal DashTable and invalidate the first fiber's bucket calculations before it could finish the insert.
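
For readers unfamiliar with the `-6` exit code: on POSIX, Python's `subprocess` module reports a child killed by a signal as the negated signal number, and `SIGABRT` is signal 6. The sketch below illustrates this with a stand-in child process that calls `os.abort()`; it is not the Dragonfly test harness, and `run_and_describe` is a hypothetical helper name. It assumes a POSIX host.

```python
import signal
import subprocess
import sys


def run_and_describe(argv):
    """Run a child process and describe how it terminated (hypothetical helper)."""
    proc = subprocess.run(argv, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
    code = proc.returncode
    if code < 0:
        # On POSIX, death by signal N is reported as returncode -N.
        return code, signal.Signals(-code).name
    return code, None


# Stand-in child that aborts itself, roughly the way a failed CHECK ends a process.
code, sig = run_and_describe([sys.executable, "-c", "import os; os.abort()"])
print(code, sig)
```

Any client still talking to such a process would then see connection errors, which matches the `ConnectionRefusedError` reported by pytest.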

---

### **The Fatal Error**
From the `dragonfly.FATAL` log:
```text
F20260429 21:37:30.615155 19311 db_slice.cc:691] Check failed: bucket_set == db.prime.CVCUponInsert(key)
```

### **Stacktrace Context**
The crash originates from `ZADD` hitting a state violation in `DbSlice::AddOrFindInternal` (captured from pytest stdout via the C++ symbolizer):
```text
30099➜    @           0xf098ee  dfly::DbSlice::AddOrFindInternal()
30099➜    @           0xf092cd  dfly::DbSlice::AddOrFind()
30099➜    @           0xcb765c  dfly::(anonymous namespace)::PrepareZEntry()
30099➜    @           0xcc54da  dfly::ZSetFamily::OpAdd()
30099➜    @           0xcc49aa  dfly::ZSetFamily::ZAddGeneric()
```

---

### **Root Cause Analysis (Fiber Preemption)**
This is a fiber preemption bug, not memory corruption. The DashTable itself is functioning correctly; the check fails because the bucket set computed via `CVCUponInsert` before the change callbacks no longer matches the value computed afterwards when a key is inserted during a slot migration.

**The Mechanism:**
1. `CallChangeCallbacks()` in `db_slice.cc` can yield (via `stream_mu_` or `BucketDependencies::Wait` in the `RestoreStreamer` callback). 
2. While yielded, another seeder fiber inserts keys, which triggers a DashTable segment split.
3. When the original fiber resumes, the bucket topology has legitimately changed, causing the `DCHECK(bucket_set == db.prime.CVCUponInsert(key))` to fail.
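
The three steps above can be reproduced in a toy model. The sketch below is not Dragonfly code: `ToyTable`, `writer`, and `run` are hypothetical names, a Python generator stands in for a fiber, and "bucket" is simply `key % num_buckets`. It only illustrates how a yield between the bucket computation and the insert lets a concurrent insert change the topology.

```python
class ToyTable:
    """Toy hash table whose bucket count doubles when full (not Dragonfly's DashTable)."""

    def __init__(self, buckets=2, capacity_per_bucket=2):
        self.num_buckets = buckets
        self.cap = capacity_per_bucket
        self.keys = set()

    def bucket_of(self, key):
        # Stand-in for the bucket computation done before the insert.
        return key % self.num_buckets

    def insert(self, key):
        self.keys.add(key)
        if len(self.keys) > self.num_buckets * self.cap:
            self.num_buckets *= 2  # "split": the topology changes


def writer(table, key, may_yield):
    """Fiber-like generator: compute bucket, run callback, re-check, insert."""
    bucket_before = table.bucket_of(key)
    if may_yield:
        yield "callback blocked"  # models the change callback suspending the fiber
    check_ok = bucket_before == table.bucket_of(key)
    table.insert(key)
    return check_ok


def run(may_yield):
    table = ToyTable()
    for k in (0, 1, 2, 3):
        table.insert(k)  # table is exactly full: 4 keys in 2 buckets of 2
    fiber_a = writer(table, key=7, may_yield=may_yield)
    try:
        next(fiber_a)  # fiber A runs until it yields (or finishes)
    except StopIteration as done:
        return done.value  # A never yielded, so nothing could interleave
    table.insert(100)  # fiber B inserts while A is suspended: growth
    try:
        next(fiber_a)  # resume fiber A
    except StopIteration as done:
        return done.value


print("check holds with yielding callback:", run(may_yield=True))
print("check holds with non-yielding callback:", run(may_yield=False))
```

In this model the check fails only when the callback yields, which mirrors the claim that the table itself is behaving correctly.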

In production (release builds without `DCHECK`), this preemption means the `RestoreStreamer` is notified about the wrong (pre-split) buckets. Keys bumped out of a post-split bucket could be missed, potentially causing **data loss** during slot migration.

**Suggested Fix Direction:**
Re-computing `CVCUponInsert` after the callback returns would only silence the `DCHECK` crash; it would **not** remove the data loss risk, because the streamer has already received notifications for the wrong buckets.
A real fix must do one of the following:
1. Prevent the `RestoreStreamer` callback from yielding at all (e.g., using `FiberAtomicGuard`, though careful attention is needed to avoid deadlocks with `stream_mu_`).
2. Make the entire `CVCUponInsert → callbacks → insert` sequence fully atomic with respect to table growth.
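
To make the difference between the two approaches concrete, here is a toy comparison. It is a sketch under heavy assumptions, not a proposed patch: `ToyTable`, `insert_with_recompute`, and `insert_atomic` are hypothetical names, the "streamer" is a plain list of notified bucket ids, and the concurrent fiber is modeled as a function call placed where the yield would occur. It does not use `FiberAtomicGuard` or any Dragonfly API.

```python
class ToyTable:
    """Toy hash table whose bucket count doubles when full (not Dragonfly's DashTable)."""

    def __init__(self, buckets=2, capacity_per_bucket=2):
        self.num_buckets = buckets
        self.cap = capacity_per_bucket
        self.keys = set()

    def bucket_of(self, key):
        return key % self.num_buckets

    def insert(self, key):
        self.keys.add(key)
        if len(self.keys) > self.num_buckets * self.cap:
            self.num_buckets *= 2  # growth changes the key-to-bucket mapping


def insert_with_recompute(table, key, notified, other_fiber):
    """Recompute-only variant: the check would pass, the notification is stale."""
    notified.append(table.bucket_of(key))  # streamer is told about this bucket
    other_fiber()                          # callback yields; another fiber grows the table
    bucket = table.bucket_of(key)          # recomputed after the callback
    table.insert(key)
    return bucket


def insert_atomic(table, key, notified, other_fiber):
    """Atomic variant: nothing interleaves between computation and insert."""
    bucket = table.bucket_of(key)
    notified.append(bucket)                # callback completes without yielding
    table.insert(key)
    other_fiber()                          # the other fiber runs only afterwards
    return bucket


def scenario(insert_fn):
    table = ToyTable()
    for k in (0, 1, 2):
        table.insert(k)
    notified = []

    def other_fiber():
        table.insert(100)
        table.insert(101)

    used = insert_fn(table, 7, notified, other_fiber)
    return notified[0], used


print("recompute only (notified, used):", scenario(insert_with_recompute))
print("atomic sequence (notified, used):", scenario(insert_atomic))
```

In the recompute-only variant the bucket reported to the streamer differs from the bucket actually used for the insert, which is the toy analogue of the data loss risk described above; in the atomic variant the two agree.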

---

### **Evidence & Attached Files**
I have uploaded the relevant artifacts from the CI failure, including the replica logs, to assist with debugging:
* `dragonfly.FATAL`
* `dragonfly.ERROR`
* `dragonfly.WARNING`
* `dragonfly.INFO`
* Replica logs extracted from the CI run
* [dragonfly.10696580b72b.root.log.WARNING.20260429-213716.19310.log](https://github.com/user-attachments/files/27242127/dragonfly.10696580b72b.root.log.WARNING.20260429-213716.19310.log)
* [dragonfly.10696580b72b.root.log.WARNING.20260429-213716.19309.log](https://github.com/user-attachments/files/27242132/dragonfly.10696580b72b.root.log.WARNING.20260429-213716.19309.log)
* [dragonfly.10696580b72b.root.log.INFO.20260429-213716.19310.log](https://github.com/user-attachments/files/27242128/dragonfly.10696580b72b.root.log.INFO.20260429-213716.19310.log)
* [dragonfly.10696580b72b.root.log.INFO.20260429-213716.19309.log](https://github.com/user-attachments/files/27242130/dragonfly.10696580b72b.root.log.INFO.20260429-213716.19309.log)
* [dragonfly.10696580b72b.root.log.FATAL.20260429-213730.19309.log](https://github.com/user-attachments/files/27242131/dragonfly.10696580b72b.root.log.FATAL.20260429-213730.19309.log)
* [dragonfly.10696580b72b.root.log.ERROR.20260429-213730.19309.log](https://github.com/user-attachments/files/27242129/dragonfly.10696580b72b.root.log.ERROR.20260429-213730.19309.log)
