
Watch Request Blocked When Member Cluster Offline #5672

Open
xigang opened this issue Oct 11, 2024 · 13 comments · May be fixed by #5732
Labels
kind/bug Categorizes issue or PR as related to a bug.

Comments

@xigang (Member) commented Oct 11, 2024

What happened:

When the member cluster goes offline, there is a scenario where the client's Watch request gets blocked and does not receive pod events.

What you expected to happen:

Should we set a timeout? For example, bound the cache.Watch() call with context.WithTimeout or context.WithDeadline so that it cannot block indefinitely when a member cluster is unreachable.

https://github.com/karmada-io/karmada/blob/master/pkg/search/proxy/store/multi_cluster_cache.go#L354
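
For illustration, a minimal sketch of that idea, assuming a hypothetical helper (`watcherFunc` and `watchWithDeadline` are not existing Karmada code, and the timeout value is only an example):

```go
import (
	"context"
	"time"

	"k8s.io/apimachinery/pkg/watch"
)

// watcherFunc stands in for the real cache.Watch call.
type watcherFunc func(ctx context.Context) (watch.Interface, error)

// watchWithDeadline starts the watch under a context that expires after
// `timeout`, so an unreachable member cluster cannot block the caller forever.
// The consumer must call the returned CancelFunc when it stops using the
// watch; once the deadline fires, the watch is closed and has to be
// re-established by the client.
func watchWithDeadline(parent context.Context, timeout time.Duration, start watcherFunc) (watch.Interface, context.CancelFunc, error) {
	ctx, cancel := context.WithTimeout(parent, timeout)
	w, err := start(ctx)
	if err != nil {
		cancel()
		return nil, nil, err
	}
	return w, cancel, nil
}
```

The trade-off, discussed further down in the thread, is that the deadline also tears down healthy watch connections, so clients have to rewatch periodically.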

How to reproduce it (as minimally and precisely as possible):

Anything else we need to know?:

Environment:

  • Karmada version:
  • kubectl-karmada or karmadactl version (the result of kubectl-karmada version or karmadactl version):
  • Others:
@xigang added the kind/bug label on Oct 11, 2024
@xigang (Member Author) commented Oct 11, 2024

/cc @RainbowMango @XiShanYongYe-Chang @ikaven1024 Let's take a look at this issue together.

@XiShanYongYe-Chang (Member)

> When the member cluster goes offline, there is a scenario where the client's Watch request gets blocked and does not receive pod events.

Are events in other normal clusters affected?

@xigang (Member Author) commented Oct 11, 2024

@XiShanYongYe-Chang Events from all member clusters can no longer be received through the aggregated apiserver; I suspect the watch is blocked in cache.Watch().

	clusters := c.getClusterNames()
	for i := range clusters {
		cluster := clusters[i]
		options.ResourceVersion = resourceVersion.get(cluster)
		cache := c.cacheForClusterResource(cluster, gvr)
		if cache == nil {
			continue
		}
		// cache.Watch() is executed serially for each cluster; one blocked cluster stalls the whole loop
		w, err := cache.Watch(ctx, options)
		if err != nil {
			return nil, err
		}

		mux.AddSource(w, func(e watch.Event) {
			setObjectResourceVersionFunc(cluster, e.Object)
			addCacheSourceAnnotation(e.Object, cluster)
		})
	}
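
Because the loop above calls cache.Watch() serially, a single unresponsive cluster stalls the watches for all the others. One possible shape of a mitigation, sketched here for illustration only (it is not necessarily what the fix PR does, and `clusterWatch`/`startWatches` are hypothetical names), is to establish the per-cluster watches concurrently and stop waiting for clusters that do not answer within a bound:

```go
import (
	"context"
	"time"

	"k8s.io/apimachinery/pkg/watch"
)

// clusterWatch pairs an established watcher (or the error from starting it)
// with its cluster name.
type clusterWatch struct {
	cluster string
	w       watch.Interface
	err     error
}

// startWatches starts each cluster's watch in its own goroutine and waits at
// most setupTimeout for stragglers, so an offline member cannot block the
// aggregated watch. Watchers that show up after the timeout are dropped here,
// which leaks them until ctx is cancelled; a real implementation would need
// to stop them explicitly.
func startWatches(ctx context.Context, clusters []string,
	start func(ctx context.Context, cluster string) (watch.Interface, error)) []clusterWatch {

	const setupTimeout = 10 * time.Second // example value only

	ch := make(chan clusterWatch, len(clusters)) // buffered: late goroutines never block
	for _, cluster := range clusters {
		go func(cluster string) {
			w, err := start(ctx, cluster)
			ch <- clusterWatch{cluster: cluster, w: w, err: err}
		}(cluster)
	}

	timer := time.NewTimer(setupTimeout)
	defer timer.Stop()

	var established []clusterWatch
	for range clusters {
		select {
		case cw := <-ch:
			established = append(established, cw)
		case <-timer.C:
			return established // give up on clusters that have not answered yet
		}
	}
	return established
}
```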

@XiShanYongYe-Chang (Member)

Thank you for your reply.

According to your method, the watch connection will be disconnected after a certain period of time, and then the client needs to initiate a watch request again. Do I understand it correctly?

@xigang (Member Author) commented Oct 11, 2024

> Thank you for your reply.
>
> According to your method, the watch connection will be disconnected after a certain period of time, and then the client needs to initiate a watch request again. Do I understand it correctly?

Yes, but my scenario is quite special. The member cluster has gone offline, but since it hasn't been removed from the ResourceRegistry, the rewatch requests still go to the offline cluster, causing the watch requests to get stuck. I suspect this is the reason.

I will conduct a test to verify.
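
For what it is worth, a hypothetical sketch of such a verification (this is not a test from the Karmada repository; `blockedWatch` just mimics a member cluster whose apiserver never answers):

```go
import (
	"context"
	"testing"
	"time"

	"k8s.io/apimachinery/pkg/watch"
)

func TestWatchBlocksWhenClusterOffline(t *testing.T) {
	// Mimic cache.Watch against an offline cluster: the call only returns
	// once its context is cancelled.
	blockedWatch := func(ctx context.Context) (watch.Interface, error) {
		<-ctx.Done()
		return nil, ctx.Err()
	}

	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()

	done := make(chan struct{})
	go func() {
		_, _ = blockedWatch(ctx) // stands in for the watch against the offline cluster
		close(done)
	}()

	select {
	case <-done:
		t.Fatal("expected the watch call to block while the cluster is offline")
	case <-time.After(2 * time.Second):
		// Still blocked, as suspected: with the offline cluster left in the
		// ResourceRegistry, the serial loop would be stuck on this call.
	}
}
```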

@xigang (Member Author) commented Oct 12, 2024

/close

@karmada-bot (Collaborator)

@xigang: Closing this issue.

In response to this:

/close


@XiShanYongYe-Chang (Member)

Hi @xigang, why close this issue?

@xigang (Member Author) commented Oct 14, 2024

/reopen

@karmada-bot reopened this on Oct 14, 2024
@karmada-bot (Collaborator)

@xigang: Reopened this issue.

In response to this:

/reopen


@xigang (Member Author) commented Oct 14, 2024

> Hi @xigang, why close this issue?

@XiShanYongYe-Chang I will submit a fix PR later.

@xigang (Member Author) commented Oct 23, 2024

> Hi @xigang, why close this issue?

@XiShanYongYe-Chang PR submitted. PTAL.

@RainbowMango (Member)

> Yes, but my scenario is quite special. The member cluster has gone offline, but since it hasn't been removed from the ResourceRegistry, the rewatch requests still go to the offline cluster, causing the watch requests to get stuck. I suspect this is the reason.
>
> I will conduct a test to verify.

Has this been confirmed? If so, this case can be used to reproduce the issue.
