Context & Metadata:
- Test: test_cluster_migrations_sequence[df_seeder_factory0-df_factory0]
- File: dragonfly/cluster_test.py
- Environment: CI (Ubuntu 24.04, Epoll, Debug build)
- Commit: b6a1e21e875d164c71b7590bb3ba48bee2e367a7
Executive Summary
The CI run failed because the master Dragonfly process (pid: 19309, port 30099) crashed with SIGABRT (exit code -6). The Python test framework surfaced this as a ConnectionRefusedError: [Errno 111], but this is merely a symptom of the underlying C++ process aborting.
The crash occurred during concurrent ZADD operations pushed by df_seeder while a cluster slot migration was taking place, triggering a DCHECK failure in the DashTable logic.
(In simple words: a fiber performing a write command yielded control during a cluster migration callback, which allowed a concurrent fiber to grow the internal DashTable and invalidate the first fiber’s bucket calculations before it could finish the insert.)
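For anyone triaging similar failures from the Python side, the exit code is the quickest way to tell a crash from a network problem. The helper below is a minimal sketch, not part of the test framework: `describe_exit` is a hypothetical name, and it relies only on the documented `subprocess` convention that a process killed by signal N reports a return code of -N on POSIX.

```python
import signal


def describe_exit(returncode):
    """Describe a subprocess return code using Python's POSIX convention."""
    if returncode is None:
        return "still running"
    if returncode < 0:
        # subprocess reports death by signal N as return code -N
        return f"killed by {signal.Signals(-returncode).name}"
    return f"exited with status {returncode}"
```

On Linux, `describe_exit(-6)` names SIGABRT, which matches the abort reported for the master process here.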
The Fatal Error
From the dragonfly.FATAL log:
F20260429 21:37:30.615155 19311 db_slice.cc:691] Check failed: bucket_set == db.prime.CVCUponInsert(key)
Stacktrace Context
The crash originates from ZADD hitting a state violation in DbSlice::AddOrFindInternal (captured from pytest stdout via the C++ symbolizer):
30099➜ @ 0xf098ee dfly::DbSlice::AddOrFindInternal()
30099➜ @ 0xf092cd dfly::DbSlice::AddOrFind()
30099➜ @ 0xcb765c dfly::(anonymous namespace)::PrepareZEntry()
30099➜ @ 0xcc54da dfly::ZSetFamily::OpAdd()
30099➜ @ 0xcc49aa dfly::ZSetFamily::ZAddGeneric()
Root Cause Analysis (Fiber Preemption)
This is a fiber preemption bug, not memory corruption. The DashTable is functioning correctly, but the check that compares the previously computed bucket set against the result of CVCUponInsert for the key fails when the key is inserted while a slot migration is in progress.
The Mechanism:
- CallChangeCallbacks() in db_slice.cc can yield (via stream_mu_ or BucketDependencies::Wait in the RestoreStreamer callback).
- While yielded, another seeder fiber inserts keys, which triggers a DashTable segment split.
- When the original fiber resumes, the bucket topology has legitimately changed, causing the DCHECK(bucket_set == db.prime.CVCUponInsert(key)) to fail.
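The interleaving above can be reproduced in a toy model. The sketch below is not Dragonfly code: `ToyTable`, `writer`, and `run_race` are hypothetical names, generators stand in for fibers, and a single modulo bucket index stands in for the bucket set. It only illustrates the compute, yield, re-check pattern under those assumptions.

```python
# Toy model only: none of these names are Dragonfly APIs.

class ToyTable:
    """Hash table that doubles its bucket count once it exceeds a load limit."""

    def __init__(self, buckets=2, per_bucket=2):
        self.buckets = buckets
        self.per_bucket = per_bucket
        self.keys = set()

    def bucket_of(self, key):
        # Stand-in for the bucket set that an insert of `key` would touch.
        return sum(key.encode()) % self.buckets

    def insert(self, key):
        self.keys.add(key)
        if len(self.keys) > self.buckets * self.per_bucket:
            self.buckets *= 2  # "split": the bucket topology changes


def writer(table, key, log):
    """Fiber A: compute the bucket, run a callback that yields, then insert."""
    before = table.bucket_of(key)
    yield  # the change callback blocks here, so the fiber is suspended
    after = table.bucket_of(key)  # what the post-callback check would see
    log.append((before, after))
    table.insert(key)


def run_race():
    table = ToyTable()
    for k in ("a", "b", "c", "d"):
        table.insert(k)  # 4 keys in 2 buckets: exactly at the load limit
    log = []
    fiber_a = writer(table, "k3", log)
    next(fiber_a)        # A computes its bucket, then yields in the callback
    table.insert("e")    # fiber B inserts one more key and triggers a split
    for _ in fiber_a:    # resume A until it finishes
        pass
    return log[0]
```

In this model `run_race()` returns two different bucket indices for the same key, which is the toy equivalent of the failed equality check: nothing is corrupted, the table simply grew while the first fiber was suspended.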
In production (release builds without DCHECK), this preemption means the RestoreStreamer is notified about the wrong (pre-split) buckets. Keys bumped out of a post-split bucket could be missed, potentially causing data loss during slot migration.
Suggested Fix Direction:
Simply re-computing CVCUponInsert after the callback returns will only fix the DCHECK crash, but it will not fix the data loss risk (because the streamer already received notifications for the wrong buckets).
The real fix must:
- Prevent the RestoreStreamer callback from yielding entirely (e.g., using FiberAtomicGuard, though careful attention is needed to avoid deadlocks with stream_mu_).
- OR, make the entire CVCUponInsert → callbacks → insert sequence fully atomic with respect to table growth.
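To make the first option concrete, here is a toy sketch of a no-yield section, again with generators standing in for fibers. All names (`NoYieldGuard`, `resume`, `guarded_insert`, `demo`) are hypothetical, and the guard only detects a suspension inside the guarded section; it makes no claim about how FiberAtomicGuard is implemented and does not address the stream_mu_ deadlock concern.

```python
# Toy model: every name below is hypothetical, not a Dragonfly or helio API.

class NoYieldGuard:
    """Marks a section in which the running fiber must not suspend."""

    def __init__(self):
        self.depth = 0

    def __enter__(self):
        self.depth += 1
        return self

    def __exit__(self, *exc):
        self.depth -= 1
        return False


def resume(fiber, guard):
    """Resume a generator-based fiber once; reject a suspension inside the guard."""
    try:
        next(fiber)
    except StopIteration:
        return False  # the fiber ran to completion
    if guard.depth:
        raise RuntimeError("fiber suspended inside a guarded section")
    return True


def guarded_insert(table, key, guard, callback_blocks, notified):
    """Compute the bucket, notify the callback, and insert, all under the guard."""
    with guard:
        bucket = sum(key.encode()) % table["buckets"]
        notified.append(bucket)  # the callback learns which bucket will change
        if callback_blocks:
            yield  # a blocking callback would suspend the fiber here
        table["keys"].append(key)
    yield  # suspending after the sequence is complete is fine


def demo():
    table = {"buckets": 2, "keys": []}
    notified = []

    ok_guard = NoYieldGuard()
    resume(guarded_insert(table, "k3", ok_guard, False, notified), ok_guard)

    bad_guard = NoYieldGuard()
    try:
        resume(guarded_insert(table, "k4", bad_guard, True, notified), bad_guard)
        rejected = False
    except RuntimeError:
        rejected = True
    return table["keys"], rejected
```

In the model, the non-blocking callback path completes the insert with the notified bucket still valid, while the blocking path is rejected before the insert happens. A guard of this kind does not make blocking code non-blocking; it only turns an unintended suspension into an immediate, visible failure, so the callback itself would still need to be restructured.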
Evidence & Attached Files
I have uploaded the relevant artifacts from the CI failure, including the replica logs, to assist with debugging:
dragonfly.FATAL
dragonfly.ERROR
dragonfly.WARNING
dragonfly.INFO
dragonfly.10696580b72b.root.log.WARNING.20260429-213716.19310.log
dragonfly.10696580b72b.root.log.WARNING.20260429-213716.19309.log
dragonfly.10696580b72b.root.log.INFO.20260429-213716.19310.log
dragonfly.10696580b72b.root.log.INFO.20260429-213716.19309.log
dragonfly.10696580b72b.root.log.FATAL.20260429-213730.19309.log
dragonfly.10696580b72b.root.log.ERROR.20260429-213730.19309.log