
Conversation

Contributor

@prathyushpv prathyushpv commented Oct 9, 2025

What changed?

Add a lock in DLQWriter that will serialize writes to a queue.

Why?

The DLQ uses CAS operations in the persistence layer. When multiple tasks start failing with terminal errors, all shards can write tasks to the DLQ concurrently, which can cause a large number of writes to fail with CAS errors. We can prevent this by serializing writes to the queue.

CAS errors are still possible, since multiple history hosts can write to the same queue, but fewer operations will fail with this change.

How did you test it?

  • built
  • run locally and tested manually in a test cluster
  • covered by existing tests
  • added new unit test(s)
  • added new functional test(s)

@prathyushpv prathyushpv requested review from a team as code owners October 9, 2025 17:39
Contributor

@carlydf carlydf left a comment


approved with comment

// Acquire a process-level lock for this specific DLQ to prevent concurrent writes
// from multiple shards causing CAS conflicts in the persistence layer.
mu := q.getQueueMutex(queueKey)
mu.Lock()
defer mu.Unlock()

Looking at all the stuff that happens in this function, it seems to me that it would be slightly more efficient to release the lock after line 98, or even right after line 95 (basically as soon as q.dlqWriter.EnqueueTask completes).
The rest of the function does metrics and logging, which don't need the lock to be correct. If you still want to keep the benefit of being able to defer mu.Unlock(), you could move the write part out into its own function, so that you can release the lock and then update metrics. Maybe the metrics stuff is so fast that it doesn't even affect lock contention, but idk, everything helps 🤷‍♀️


like

func doLockedWrite(...) error {
	// Acquire a process-level lock for this specific DLQ to prevent concurrent writes
	// from multiple shards causing CAS conflicts in the persistence layer.
	mu := q.getQueueMutex(queueKey)
	mu.Lock()
	defer mu.Unlock()

	_, err := q.dlqWriter.CreateQueue(ctx, &persistence.CreateQueueRequest{
		QueueKey: queueKey,
	})
	if err != nil && !errors.Is(err, persistence.ErrQueueAlreadyExists) {
		return fmt.Errorf("%w: %v", ErrCreateDLQ, err)
	}
	_, err = q.dlqWriter.EnqueueTask(ctx, &persistence.EnqueueTaskRequest{
		QueueType:     queueKey.QueueType,
		SourceCluster: queueKey.SourceCluster,
		TargetCluster: queueKey.TargetCluster,
		Task:          task,
		SourceShardID: sourceShardID,
	})
	if err != nil {
		return fmt.Errorf("%w: %v", ErrSendTaskToDLQ, err)
	}
	return nil
}

Contributor Author


That makes sense! Thanks! Let me make the change

@prathyushpv prathyushpv enabled auto-merge (squash) October 15, 2025 16:45
@prathyushpv prathyushpv merged commit b5382c3 into main Oct 15, 2025
57 checks passed
@prathyushpv prathyushpv deleted the ppv/dlqCAS branch October 15, 2025 17:08
