Serialize writes to DLQ #8460
Conversation
approved with comment
service/history/queues/dlq_writer.go
```go
// Acquire a process-level lock for this specific DLQ to prevent concurrent writes
// from multiple shards causing CAS conflicts in the persistence layer.
mu := q.getQueueMutex(queueKey)
mu.Lock()
defer mu.Unlock()
```
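getQueueMutex itself isn't shown in this excerpt; a minimal sketch of what such a per-queue lock table could look like, assuming the writer holds a sync.Map and that persistence.QueueKey is a comparable struct (the actual implementation in dlq_writer.go may differ):

```go
// Hypothetical sketch of a per-queue lock table. One mutex per DLQ key
// serializes writers to the same queue while leaving different queues
// free to write concurrently.
type DLQWriter struct {
	queueMutexes sync.Map // assumed field: map[persistence.QueueKey]*sync.Mutex
	// ... other fields elided
}

func (q *DLQWriter) getQueueMutex(queueKey persistence.QueueKey) *sync.Mutex {
	// LoadOrStore returns the existing mutex if another goroutine already
	// created one for this key, so all writers to the queue share it.
	mu, _ := q.queueMutexes.LoadOrStore(queueKey, &sync.Mutex{})
	return mu.(*sync.Mutex)
}
```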
Looking at all the stuff that happens in this function, it seems to me that it would be slightly more efficient to release the lock after line 98, or even right after line 95 (basically as soon as q.dlqWriter.EnqueueTask completes).
The rest of the function does metrics and logging, which don't need the lock to be correct. If you still want to keep the benefit of being able to defer mu.Unlock(), you could move the write part out into its own function, so that you can release the lock and then update metrics. Maybe the metrics stuff is so fast that it doesn't even affect lock contention, but idk, everything helps 🤷‍♀️
like
```go
func doLockedWrite(...) error {
	// Acquire a process-level lock for this specific DLQ to prevent concurrent writes
	// from multiple shards causing CAS conflicts in the persistence layer.
	mu := q.getQueueMutex(queueKey)
	mu.Lock()
	defer mu.Unlock()
	_, err := q.dlqWriter.CreateQueue(ctx, &persistence.CreateQueueRequest{
		QueueKey: queueKey,
	})
	if err != nil && !errors.Is(err, persistence.ErrQueueAlreadyExists) {
		return fmt.Errorf("%w: %v", ErrCreateDLQ, err)
	}
	_, err = q.dlqWriter.EnqueueTask(ctx, &persistence.EnqueueTaskRequest{
		QueueType:     queueKey.QueueType,
		SourceCluster: queueKey.SourceCluster,
		TargetCluster: queueKey.TargetCluster,
		Task:          task,
		SourceShardID: sourceShardID,
	})
	if err != nil {
		return fmt.Errorf("%w: %v", ErrSendTaskToDLQ, err)
	}
	return nil
}
```
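For completeness, the calling side the reviewer is describing might look like the sketch below; the method name WriteTaskToDLQ, the parameter list, and recordDLQWriteMetrics are illustrative assumptions, not the actual code:

```go
// Hypothetical caller sketch: the lock lives entirely inside doLockedWrite,
// so the metrics and logging below never extend the critical section.
func (q *DLQWriter) WriteTaskToDLQ(ctx context.Context, queueKey persistence.QueueKey, task tasks.Task, sourceShardID int32) error {
	if err := q.doLockedWrite(ctx, queueKey, task, sourceShardID); err != nil {
		return err
	}
	// mu.Unlock has already run by this point; other goroutines can write
	// to the same queue while we emit metrics and logs.
	q.recordDLQWriteMetrics(queueKey, sourceShardID) // hypothetical helper standing in for the metrics/logging tail
	return nil
}
```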
That makes sense! Thanks! Let me make the change.
What changed?
Add a lock in DLQWriter that serializes writes to each queue.
Why?
The DLQ uses CAS operations in the persistence layer. If multiple tasks start failing with terminal errors, all shards can write tasks to the DLQ at once, which can cause a large number of writes to fail with CAS errors. We can prevent this by serializing writes to the queue.
CAS errors are still possible, since multiple history hosts can write to the same queue, but fewer operations will fail with this change.
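For illustration, a minimal, self-contained sketch of the kind of CAS (compare-and-swap) conflict being avoided; all names here are hypothetical, and the mutex merely stands in for the database's atomic conditional update:

```go
package main

import (
	"fmt"
	"sync"
)

// casQueue mimics a versioned DLQ row: a write succeeds only if the writer's
// expected version still matches, like a conditional UPDATE in persistence.
type casQueue struct {
	mu      sync.Mutex // stands in for the database's atomicity
	version int64
	tasks   [][]byte
}

func (q *casQueue) enqueue(expectedVersion int64, task []byte) error {
	q.mu.Lock()
	defer q.mu.Unlock()
	if q.version != expectedVersion {
		return fmt.Errorf("CAS conflict: expected version %d, found %d", expectedVersion, q.version)
	}
	q.tasks = append(q.tasks, task)
	q.version++
	return nil
}

func main() {
	q := &casQueue{}
	// Two "shards" both read version 0 before writing.
	fmt.Println(q.enqueue(0, []byte("task from shard 1"))) // <nil>: wins the race
	fmt.Println(q.enqueue(0, []byte("task from shard 2"))) // CAS conflict: must re-read and retry
}
```

Serializing the writers (per queue, within a process) means the second writer reads the bumped version before writing, instead of failing and retrying.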
How did you test it?