Replies: 5 comments 2 replies
-
The existing design is to ensure that once there is corruption, no requests are served until the corruption is resolved. I agree that it isn't perfect, but it's a prudent solution to avoid possibly worsening the situation. Corruption is not something that happens often in production; it should be rare. FYI, we are working on etcd-operator, and one of its goals is to resolve such situations automatically. https://github.com/etcd-io/etcd-operator/blob/main/docs/roadmap.md
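For reference, operators can already observe and clear that state through the maintenance API. Below is a minimal sketch (not the official workflow), assuming a reachable cluster and using the clientv3 `AlarmList`/`AlarmDisarm` calls; the actual repair of the corrupted member is out of scope here:

```go
package main

import (
	"context"
	"fmt"
	"time"

	pb "go.etcd.io/etcd/api/v3/etcdserverpb"
	clientv3 "go.etcd.io/etcd/client/v3"
)

func main() {
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"127.0.0.1:2379"},
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		panic(err)
	}
	defer cli.Close()

	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()

	// List active alarms; while a CORRUPT alarm is raised the cluster
	// refuses KV traffic until the alarm is resolved and disarmed.
	resp, err := cli.AlarmList(ctx)
	if err != nil {
		panic(err)
	}
	for _, m := range resp.Alarms {
		if m.Alarm == pb.AlarmType_CORRUPT {
			fmt.Printf("member %x has a CORRUPT alarm\n", m.MemberID)
			// After the corrupted member has been repaired or replaced,
			// the alarm can be cleared:
			// cli.AlarmDisarm(ctx, &clientv3.AlarmMember{MemberID: m.MemberID, Alarm: m.Alarm})
		}
	}
}
```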
-
It's a clear gap that was identified some time ago; we just need a design, a pretty detailed test plan, and someone to implement it for v3.6. Contributions are welcome.
-
Thanks @ahrtr, etcd is a CP store, so from an engineering perspective data consistency takes a much higher priority than availability. However, from a business perspective, the existing alarm-activation behavior is not acceptable: the corruption may affect key values the user doesn't care about, or only service metadata. The alternative approach is to not use the default etcd corruption checker, build our own (administrative) checker, and block peer and client traffic once corruption is detected. If the community encourages this approach instead, it makes me feel the default etcd corruption checker is still experimental and won't be able to graduate to production readiness. So the question is: would we spend the time to make the corruption checker do the right thing, meaning this feature request is valid? Or do we encourage users to build their own detection and traffic blocking via etcd-operator (open source) or something else (in-house)?
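For context, such an out-of-band administrative checker could be built on the maintenance `HashKV` call, which computes the KV store hash per endpoint at a given revision. A minimal sketch, assuming all members are reachable and that the actual traffic blocking (load balancer, network policy, etc.) happens elsewhere; the function name and majority-vote heuristic are illustrative assumptions, not an existing API:

```go
package corruptioncheck

import (
	"context"

	clientv3 "go.etcd.io/etcd/client/v3"
)

// detectCorruption compares the HashKV result of every endpoint at a common
// revision and reports endpoints whose hash disagrees with the majority.
// Sketch only: retries, compact-revision handling and alerting are omitted.
func detectCorruption(ctx context.Context, cli *clientv3.Client, endpoints []string, rev int64) ([]string, error) {
	hashes := make(map[string]uint32, len(endpoints))
	counts := make(map[uint32]int)
	for _, ep := range endpoints {
		resp, err := cli.HashKV(ctx, ep, rev)
		if err != nil {
			return nil, err
		}
		hashes[ep] = resp.Hash
		counts[resp.Hash]++
	}

	// Treat the most common hash as the reference value.
	var majority uint32
	best := 0
	for h, n := range counts {
		if n > best {
			majority, best = h, n
		}
	}

	// Any endpoint that disagrees is a candidate for traffic blocking.
	var suspect []string
	for ep, h := range hashes {
		if h != majority {
			suspect = append(suspect, ep)
		}
	}
	return suspect, nil
}
```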
-
cc @shyamjvs
-
@chaochn47 How often do you see data corruption issues in production (EKS)? Do you have a rough idea of how many times it has happened in the past few years, and roughly how many times per year? The (corruption) alarm relies on corruption detection, which depends on hash computation. Hash computation is closely tied to compaction, and compaction also affects the watch process. It was painful for @fuweid and me to resolve #18089 (comment) in #18274; the reason is the code's low readability and high complexity. I am not against improving the alarm system, but we need to make sure of the two points below; otherwise, we're just asking for trouble.
We also need to try to resolve the data corruption issue at the storage layer (bbolt); see etcd-io/bbolt#789. Currently only @tjungblu and I are working on it. More contributors are welcome.
-
We learnt from #14828 that we are able to identify corrupted members, which is great.
However, the desired behavior of alarm activation is that only the corrupted members fail to serve KV APIs, rather than the whole etcd cluster's availability being impacted, as happens today.
The existing alarm activation behavior makes it difficult to adopt the corruption checker feature in production and to convince users to enable it.
See the code references below, where the alarm activation request, once agreed upon via raft, is materialized on each member in the apply stage (a rough sketch of a per-member alternative follows them):
etcd/server/etcdserver/apply/uber_applier.go
Lines 218 to 226 in 6ea81c1
etcd/server/etcdserver/apply/uber_applier.go
Lines 99 to 109 in 6ea81c1
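For illustration only, the change being asked for could be scoped at exactly that point: instead of wrapping the applier with the corrupt variant whenever any CORRUPT alarm exists in the cluster, only do so when the alarm targets the local member. This is a rough sketch based on the referenced uber_applier.go logic, not the actual code; the `localMemberID` parameter and the alarm-store accessors are assumptions:

```go
// Sketch of per-member corrupt-alarm handling in the apply layer.
// Assumption: restoreAlarms is roughly where a CORRUPT alarm currently
// swaps in newApplierV3Corrupt for the whole applier chain.
func (a *uberApplier) restoreAlarms(localMemberID uint64) {
	noSpaceAlarms := len(a.alarmStore.Get(pb.AlarmType_NOSPACE)) > 0
	a.applyV3 = a.applyV3base
	if noSpaceAlarms {
		a.applyV3 = newApplierV3Capped(a.applyV3)
	}

	// Only degrade this member if a CORRUPT alarm points at it;
	// healthy members keep serving KV requests.
	for _, alarm := range a.alarmStore.Get(pb.AlarmType_CORRUPT) {
		if alarm.MemberID == localMemberID {
			a.applyV3 = newApplierV3Corrupt(a.applyV3)
			break
		}
	}
}
```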
@ahrtr @serathius @jmhbnz