Replies: 5 comments 2 replies
-
The existing design is to ensure that once there is corruption, no requests are served until the corruption is resolved. I agree that it isn't perfect, but it's a prudent solution to avoid possibly worsening the situation. Corruption is not something that happens often in production; it should be rare. FYI, we are working on etcd-operator, and one of its goals is to resolve such situations automatically. https://github.com/etcd-io/etcd-operator/blob/main/docs/roadmap.md
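For reference, operators can already observe and clear that state through the maintenance API. Below is a minimal sketch (not the official workflow), assuming a reachable cluster and using the clientv3 `AlarmList`/`AlarmDisarm` calls; the actual repair of the corrupted member is out of scope here:

```go
package main

import (
	"context"
	"fmt"
	"time"

	pb "go.etcd.io/etcd/api/v3/etcdserverpb"
	clientv3 "go.etcd.io/etcd/client/v3"
)

func main() {
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"127.0.0.1:2379"},
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		panic(err)
	}
	defer cli.Close()

	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()

	// List active alarms; while a CORRUPT alarm is raised the cluster
	// refuses KV traffic until the alarm is resolved and disarmed.
	resp, err := cli.AlarmList(ctx)
	if err != nil {
		panic(err)
	}
	for _, m := range resp.Alarms {
		if m.Alarm == pb.AlarmType_CORRUPT {
			fmt.Printf("member %x has a CORRUPT alarm\n", m.MemberID)
			// After the corrupted member has been repaired or replaced,
			// the alarm can be cleared:
			// cli.AlarmDisarm(ctx, &clientv3.AlarmMember{MemberID: m.MemberID, Alarm: m.Alarm})
		}
	}
}
```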
-
It's a clear gap that was identified some time ago; we just need a design, a pretty detailed test plan, and someone to implement it for v3.6. Contributions are welcome.
-
Thanks @ahrtr, etcd is a CP store, so from an engineering perspective data consistency takes a much higher priority than availability. However, from a business perspective, the existing alarm-activation behavior is not acceptable: the corruption may affect key values the user doesn't care about, or only service metadata. The alternative approach is to not use the default etcd corruption checker, build our own (administrative) checker, and block peer and client traffic once corruption is detected. If the community encourages this approach instead, it makes me feel the default etcd corruption checker is still experimental and won't be able to graduate to production readiness. So the question is: would we spend the time to make the corruption checker do the right thing, meaning this feature request is valid? Or do we encourage users to build their own detection and traffic blocking via etcd-operator (open source) or something else (in-house)?
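For context, such an out-of-band administrative checker could be built on the maintenance `HashKV` call, which computes the KV store hash per endpoint at a given revision. A minimal sketch, assuming all members are reachable and that the actual traffic blocking (load balancer, network policy, etc.) happens elsewhere; the function name and majority-vote heuristic are illustrative assumptions, not an existing API:

```go
package corruptioncheck

import (
	"context"

	clientv3 "go.etcd.io/etcd/client/v3"
)

// detectCorruption compares the HashKV result of every endpoint at a common
// revision and reports endpoints whose hash disagrees with the majority.
// Sketch only: retries, compact-revision handling and alerting are omitted.
func detectCorruption(ctx context.Context, cli *clientv3.Client, endpoints []string, rev int64) ([]string, error) {
	hashes := make(map[string]uint32, len(endpoints))
	counts := make(map[uint32]int)
	for _, ep := range endpoints {
		resp, err := cli.HashKV(ctx, ep, rev)
		if err != nil {
			return nil, err
		}
		hashes[ep] = resp.Hash
		counts[resp.Hash]++
	}

	// Treat the most common hash as the reference value.
	var majority uint32
	best := 0
	for h, n := range counts {
		if n > best {
			majority, best = h, n
		}
	}

	// Any endpoint that disagrees is a candidate for traffic blocking.
	var suspect []string
	for ep, h := range hashes {
		if h != majority {
			suspect = append(suspect, ep)
		}
	}
	return suspect, nil
}
```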
-
cc @shyamjvs
-
@chaochn47 How often do you see data corruption issues in production (EKS)? Do you have a rough idea of how many times it has happened in the past few years, and roughly how many times per year? The (corruption) alarm relies on corruption detection, which depends on hash computation. Hash computation is closely tied to compaction, and compaction also affects the watch process. It was painful for @fuweid and me to resolve #18089 (comment) in #18274; the reason is the code's low readability and high complexity. I am not against improving the alarm system, but we need to make sure of the two points below; otherwise, we're just asking for trouble.
We also need to try to resolve the data corruption issue at the storage layer (bbolt); see etcd-io/bbolt#789. Currently only @tjungblu and I are working on it. More contributors are welcome.
-
We learnt from #14828 that we are able to identify corrupted members, which is great.
However, the desired behavior of alarm activation is that only the corrupted members fail to serve KV APIs, rather than the whole etcd cluster's availability being impacted, as happens today.
The existing alarm activation behavior makes it difficult to adopt the corruption checker feature in production and to convince users to enable it.
See the code references below, where the alarm activation request, once agreed upon via raft, is materialized on each member in the apply stage (a rough sketch of a per-member alternative follows them):
etcd/server/etcdserver/apply/uber_applier.go
Lines 218 to 226 in 6ea81c1
etcd/server/etcdserver/apply/uber_applier.go
Lines 99 to 109 in 6ea81c1
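For illustration only, the change being asked for could be scoped at exactly that point: instead of wrapping the applier with the corrupt variant whenever any CORRUPT alarm exists in the cluster, only do so when the alarm targets the local member. This is a rough sketch based on the referenced uber_applier.go logic, not the actual code; the `localMemberID` parameter and the alarm-store accessors are assumptions:

```go
// Sketch of per-member corrupt-alarm handling in the apply layer.
// Assumption: restoreAlarms is roughly where a CORRUPT alarm currently
// swaps in newApplierV3Corrupt for the whole applier chain.
func (a *uberApplier) restoreAlarms(localMemberID uint64) {
	noSpaceAlarms := len(a.alarmStore.Get(pb.AlarmType_NOSPACE)) > 0
	a.applyV3 = a.applyV3base
	if noSpaceAlarms {
		a.applyV3 = newApplierV3Capped(a.applyV3)
	}

	// Only degrade this member if a CORRUPT alarm points at it;
	// healthy members keep serving KV requests.
	for _, alarm := range a.alarmStore.Get(pb.AlarmType_CORRUPT) {
		if alarm.MemberID == localMemberID {
			a.applyV3 = newApplierV3Corrupt(a.applyV3)
			break
		}
	}
}
```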
@ahrtr @serathius @jmhbnz