
Watch Request Blocked When Member Cluster Offline #5672

Open
xigang opened this issue Oct 11, 2024 · 13 comments · May be fixed by #5732
Labels
kind/bug Categorizes issue or PR as related to a bug.

Comments

@xigang (Member) commented Oct 11, 2024

What happened:

When the member cluster goes offline, there is a scenario where the client's Watch request gets blocked and does not receive pod events.

What you expected to happen:

Should we set a timeout? For example, bound the cache.Watch() call with context.WithTimeout or context.WithDeadline so that it cannot block indefinitely when a member cluster is unreachable.

https://github.com/karmada-io/karmada/blob/master/pkg/search/proxy/store/multi_cluster_cache.go#L354
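
For illustration, a minimal sketch of that idea, assuming a hypothetical helper (`watcherFunc` and `watchWithDeadline` are not existing Karmada code, and the timeout value is only an example):

```go
import (
	"context"
	"time"

	"k8s.io/apimachinery/pkg/watch"
)

// watcherFunc stands in for the real cache.Watch call.
type watcherFunc func(ctx context.Context) (watch.Interface, error)

// watchWithDeadline starts the watch under a context that expires after
// `timeout`, so an unreachable member cluster cannot block the caller forever.
// The consumer must call the returned CancelFunc when it stops using the
// watch; once the deadline fires, the watch is closed and has to be
// re-established by the client.
func watchWithDeadline(parent context.Context, timeout time.Duration, start watcherFunc) (watch.Interface, context.CancelFunc, error) {
	ctx, cancel := context.WithTimeout(parent, timeout)
	w, err := start(ctx)
	if err != nil {
		cancel()
		return nil, nil, err
	}
	return w, cancel, nil
}
```

The trade-off, discussed further down in the thread, is that the deadline also tears down healthy watch connections, so clients have to rewatch periodically.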

How to reproduce it (as minimally and precisely as possible):

Anything else we need to know?:

Environment:

  • Karmada version:
  • kubectl-karmada or karmadactl version (the result of kubectl-karmada version or karmadactl version):
  • Others:
@xigang added the kind/bug label on Oct 11, 2024
@xigang (Member Author) commented Oct 11, 2024

/cc @RainbowMango @XiShanYongYe-Chang @ikaven1024 Let's take a look at this issue together.

@XiShanYongYe-Chang (Member)

> When the member cluster goes offline, there is a scenario where the client's Watch request gets blocked and does not receive pod events.

Are events in other normal clusters affected?

@xigang (Member Author) commented Oct 11, 2024

@XiShanYongYe-Chang Events from all member clusters can no longer be received through the aggregated apiserver; I suspect the watch is blocked in cache.Watch().

	clusters := c.getClusterNames()
	for i := range clusters {
		cluster := clusters[i]
		options.ResourceVersion = resourceVersion.get(cluster)
		cache := c.cacheForClusterResource(cluster, gvr)
		if cache == nil {
			continue
		}
		// cache.Watch() is executed serially for each cluster; one blocked cluster stalls the whole loop
		w, err := cache.Watch(ctx, options)
		if err != nil {
			return nil, err
		}

		mux.AddSource(w, func(e watch.Event) {
			setObjectResourceVersionFunc(cluster, e.Object)
			addCacheSourceAnnotation(e.Object, cluster)
		})
	}
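
Because the loop above calls cache.Watch() serially, a single unresponsive cluster stalls the watches for all the others. One possible shape of a mitigation, sketched here for illustration only (it is not necessarily what the fix PR does, and `clusterWatch`/`startWatches` are hypothetical names), is to establish the per-cluster watches concurrently and stop waiting for clusters that do not answer within a bound:

```go
import (
	"context"
	"time"

	"k8s.io/apimachinery/pkg/watch"
)

// clusterWatch pairs an established watcher (or the error from starting it)
// with its cluster name.
type clusterWatch struct {
	cluster string
	w       watch.Interface
	err     error
}

// startWatches starts each cluster's watch in its own goroutine and waits at
// most setupTimeout for stragglers, so an offline member cannot block the
// aggregated watch. Watchers that show up after the timeout are dropped here,
// which leaks them until ctx is cancelled; a real implementation would need
// to stop them explicitly.
func startWatches(ctx context.Context, clusters []string,
	start func(ctx context.Context, cluster string) (watch.Interface, error)) []clusterWatch {

	const setupTimeout = 10 * time.Second // example value only

	ch := make(chan clusterWatch, len(clusters)) // buffered: late goroutines never block
	for _, cluster := range clusters {
		go func(cluster string) {
			w, err := start(ctx, cluster)
			ch <- clusterWatch{cluster: cluster, w: w, err: err}
		}(cluster)
	}

	timer := time.NewTimer(setupTimeout)
	defer timer.Stop()

	var established []clusterWatch
	for range clusters {
		select {
		case cw := <-ch:
			established = append(established, cw)
		case <-timer.C:
			return established // give up on clusters that have not answered yet
		}
	}
	return established
}
```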

@XiShanYongYe-Chang (Member)

Thank you for your reply.

According to your method, the watch connection will be disconnected after a certain period of time, and then the client needs to initiate a watch request again. Do I understand it correctly?

@xigang (Member Author) commented Oct 11, 2024

> Thank you for your reply.
>
> According to your method, the watch connection will be disconnected after a certain period of time, and then the client needs to initiate a watch request again. Do I understand it correctly?

Yes, but my scenario is quite special. The member cluster has gone offline, but since it hasn't been removed from the ResourceRegistry, the rewatch requests still go to the offline cluster, causing the watch requests to get stuck. I suspect this is the reason.

I will conduct a test to verify.
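
For what it is worth, a hypothetical sketch of such a verification (this is not a test from the Karmada repository; `blockedWatch` just mimics a member cluster whose apiserver never answers):

```go
import (
	"context"
	"testing"
	"time"

	"k8s.io/apimachinery/pkg/watch"
)

func TestWatchBlocksWhenClusterOffline(t *testing.T) {
	// Mimic cache.Watch against an offline cluster: the call only returns
	// once its context is cancelled.
	blockedWatch := func(ctx context.Context) (watch.Interface, error) {
		<-ctx.Done()
		return nil, ctx.Err()
	}

	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()

	done := make(chan struct{})
	go func() {
		_, _ = blockedWatch(ctx) // stands in for the watch against the offline cluster
		close(done)
	}()

	select {
	case <-done:
		t.Fatal("expected the watch call to block while the cluster is offline")
	case <-time.After(2 * time.Second):
		// Still blocked, as suspected: with the offline cluster left in the
		// ResourceRegistry, the serial loop would be stuck on this call.
	}
}
```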

@xigang (Member Author) commented Oct 12, 2024

/close

@karmada-bot (Collaborator)

@xigang: Closing this issue.

In response to this:

/close


@XiShanYongYe-Chang (Member)

Hi @xigang, why close this issue?

@xigang (Member Author) commented Oct 14, 2024

/reopen

@karmada-bot reopened this on Oct 14, 2024
@karmada-bot (Collaborator)

@xigang: Reopened this issue.

In response to this:

/reopen


@xigang (Member Author) commented Oct 14, 2024

> Hi @xigang, why close this issue?

@XiShanYongYe-Chang I will submit a fix PR later.

@xigang (Member Author) commented Oct 23, 2024

> Hi @xigang, why close this issue?

@XiShanYongYe-Chang PR submitted. PTAL.

@RainbowMango (Member)

> Yes, but my scenario is quite special. The member cluster has gone offline, but since it hasn't been removed from the ResourceRegistry, the rewatch requests still go to the offline cluster, causing the watch requests to get stuck. I suspect this is the reason.
>
> I will conduct a test to verify.

Has this been confirmed? If so, this case can be used to reproduce the issue.
