Some image pulls fail and some pulls take a very long time #551
Comments
It seems the latency is sometimes very large when it should be within 1 minute, and the handler is sometimes empty, which should not be possible for v0.0.23, yet it happens. It may be a Kusto log issue, but the latency is really high.
Without having access to your environment it would be difficult to debug this. What you are hitting is most likely Containerd's image pull progress deadline. Something is, however, not working with Spegel if it is not serving data, or is serving data very slowly. Setting timeouts is difficult because there will be large images that could take a long time to serve. The best thing I can do right now is to try to reproduce this case locally somehow.
Thanks for your response @phillebaba
It is a completely different question when we are talking about 1000 node clusters. It introduces a lot of potential bottlenecks that are hard to theorize about. In that situation we would definitely see a bottleneck if 250 nodes are all hitting the same node for an image. I could see a couple of solutions to this problem. The easy one would be rate limiting HTTP requests so that a single node could not be overrun. The other, more complicated solution would be for the router to be more aware of the load on nodes. I do think this is solvable, but I don't think it will be easy. A problem is that I do not have access to a 1000 node cluster. Right now I am paying to run the benchmarks on a 100 node cluster myself. I have been trying to figure out a way to get credits to run benchmarks but have not found one yet.
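Not part of the original discussion, but as an illustration of the first option above (rate limiting so a single node cannot be overrun), here is a minimal Go sketch that wraps an http.Handler with a semaphore. The name `limitConcurrency` and the limit of 64 are assumptions made up for this example, not Spegel's actual code.

```go
package mirror

import "net/http"

// maxConcurrentBlobServes is a hypothetical cap on how many blob requests a
// single node will serve at once before it starts shedding load.
const maxConcurrentBlobServes = 64

// limitConcurrency wraps an http.Handler with a semaphore. When the node is
// already serving the maximum number of requests it answers 503 immediately,
// so the client (containerd) can try another mirror or fall back to the
// origin registry instead of queueing behind an overloaded peer.
func limitConcurrency(next http.Handler) http.Handler {
	sem := make(chan struct{}, maxConcurrentBlobServes)
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		select {
		case sem <- struct{}{}:
			defer func() { <-sem }()
			next.ServeHTTP(w, r)
		default:
			http.Error(w, "mirror is serving too many concurrent requests", http.StatusServiceUnavailable)
		}
	})
}
```

Shedding load with a fast 503 rather than queueing matters here because containerd only falls back to the next host once the mirror request actually fails.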
I have more findings. From the logs, it seems some nodes sometimes have high latency when copying blobs between nodes.
(Kusto log excerpt with columns PreciseTimeStamp, message, RoleInstance)
I suspect that when a node serving images is also pulling images, the blob downloads may be slow, since I found that at that time this node was also downloading other images. Our scenario is a distributed job with 6 containers per replica, so a node may have cached one image, be pulling the other 5, and at the same time be serving the cached image as a registry server.
I have a proposal for this issue: add a status reporter like containerd's, and if Spegel makes no progress within 10s, fail early so that containerd will fall back. @phillebaba what do you think? If you have no time, I would like to contribute this if it makes sense.
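As an illustration only (not Spegel's actual implementation, and not necessarily how the proposal would land), here is a rough Go sketch of the fail-early idea, assuming the mirror streams blob bytes from an upstream peer to the requesting client: a watchdog cancels the upstream request when no bytes have arrived for a window such as 10 seconds, so the copy fails quickly and containerd falls back to the next host. All names and the window are made up for this sketch.

```go
package mirror

import (
	"context"
	"errors"
	"io"
	"sync/atomic"
	"time"
)

// errNoProgress signals that a transfer stalled for longer than the allowed window.
var errNoProgress = errors.New("no transfer progress within deadline")

// progressReader records when the last byte was read so a watchdog can detect a stall.
type progressReader struct {
	r    io.Reader
	last atomic.Int64 // unix nanoseconds of the last read that returned data
}

func (p *progressReader) Read(b []byte) (int, error) {
	n, err := p.r.Read(b)
	if n > 0 {
		p.last.Store(time.Now().UnixNano())
	}
	return n, err
}

// copyWithProgressDeadline streams src to dst and calls cancel (for example the
// cancel func of the upstream peer request's context) if no bytes arrive for the
// given window. Cancelling the upstream request is what unblocks the stalled read,
// so the copy returns quickly instead of hanging until containerd gives up.
func copyWithProgressDeadline(ctx context.Context, cancel context.CancelFunc, dst io.Writer, src io.Reader, window time.Duration) error {
	pr := &progressReader{r: src}
	pr.last.Store(time.Now().UnixNano())

	var stalled atomic.Bool
	watchdogDone := make(chan struct{})
	go func() {
		defer close(watchdogDone)
		ticker := time.NewTicker(time.Second)
		defer ticker.Stop()
		for {
			select {
			case <-ctx.Done():
				return
			case <-ticker.C:
				if time.Since(time.Unix(0, pr.last.Load())) > window {
					stalled.Store(true)
					cancel() // abort the upstream request; the blocked Read returns an error
					return
				}
			}
		}
	}()

	_, err := io.Copy(dst, pr)
	cancel()
	<-watchdogDone
	if stalled.Load() {
		return errNoProgress
	}
	return err
}
```

With a 10 second window this would surface the failure within seconds, instead of waiting for the 5 minute no-progress cancellation visible in the containerd logs below.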
This sounds like we could optimize the configuration in the Spegel proxy to time out faster in these types of scenarios. Adding an artificial delay to verify that Containerd will still wait was a good idea. I am going to do the same to see if I can tweak some timeout settings to get the connection to close when this happens. Long term we obviously want to know why this happens in the first place. Regarding the status reporter feature, how are you planning on implementing it? I think it would be fine as long as I understand the design.
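For context on the "tweak some timeout settings" part, a hedged sketch of the standard net/http knobs involved; the function name and values are assumptions for illustration, not Spegel's real configuration. The trade-off mentioned earlier applies: a fixed write timeout would also cut off legitimately large blob transfers, which is why a progress-based deadline is the more attractive option.

```go
package mirror

import (
	"net/http"
	"time"
)

// newMirrorServer returns an http.Server with explicit connection timeouts so a
// wedged connection closes on its own instead of holding the client until
// containerd's own no-progress cancellation fires. Values are illustrative.
func newMirrorServer(addr string, handler http.Handler) *http.Server {
	return &http.Server{
		Addr:              addr,
		Handler:           handler,
		ReadHeaderTimeout: 10 * time.Second, // clients that never finish sending headers
		IdleTimeout:       60 * time.Second, // keep-alive connections with no requests
		// WriteTimeout is deliberately left unset: a fixed value would abort large,
		// slow-but-progressing blob transfers, which is the trade-off noted above.
	}
}
```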
@Calotte I have finally had some time to sit down and do some proper testing in controlled environments. Through these tests I have gained some insight into why you are seeing these problems, especially in large clusters. Instead of running a large cluster, I ran a lot of concurrent requests for the same layer from a couple of nodes. By working backwards from expected network and disk performance I have identified three different factors that are playing a part in this.
From these discoveries there are three separate issues that need to be worked on to improve the experience for users, especially those running Spegel in very large clusters. I will go ahead and create them and we can have further discussions in the respective issues.
Spegel version
v0.0.23
Kubernetes distribution
AKS
Kubernetes version
v1.28.5
CNI
Cilium
Describe the bug
Hi, recently we observed some image download issues with Spegel.
Kubelet image pulls failed due to connection refused:
2024-07-31 21:36:50.0000000 | Pulling image "myofficial.azurecr.io/system/base/job/awsome-sidecar:20240607T065901154"
2024-07-31 21:41:51.0000000 | Failed to pull image "myofficial.azurecr.io/system/base/job/awsome-sidecar:20240607T065901154": failed to pull and unpack image "myofficial.azurecr.io/system/base/job/awsome-sidecar:20240607T065901154": failed to copy: httpReadSeeker: failed open: failed to do request: Get "http://192.168.64.18:30020/v2/system/base/job/awsome-sidecar/manifests/sha256:25e9665de2c2bec5aa7aa3a38b0ad258abb30016ef973695b66afce127ae1ec7?ns=myofficial.azurecr.io": dial tcp 192.168.64.18:30020: connect: connection refused
I dug into this further from the containerd logs and found the following:
2024-07-31 21:36:50.0000000 time="2024-07-31T21:36:50.461891485Z" level=info msg="PullImage "myofficial.azurecr.io/system/base/job/awsome-sidecar:20240607T065901154""
2024-07-31 21:36:50.0000000 time="2024-07-31T21:36:50.462857165Z" level=info msg="trying next host" error="failed to do request: Head "http://192.168.64.18:30020/v2/system/base/job/awsome-sidecar/manifests/20240607T065901154?ns=myofficial.azurecr.io\": dial tcp 192.168.64.18:30020: connect: connection refused" host="192.168.64.18:30020"
2024-07-31 21:41:50.0000000 time="2024-07-31T21:41:50.463003321Z" level=error msg="cancel pulling image myofficial.azurecr.io/system/base/job/awsome-sidecar:20240607T065901154 because of no progress in 5m0s"
2024-07-31 21:41:50.0000000 time="2024-07-31T21:41:50.464570509Z" level=error msg="PullImage "myofficial.azurecr.io/system/base/job/awsome-sidecar:20240607T065901154" failed" error="failed to pull and unpack image "myofficial.azurecr.io/system/base/job/awsome-sidecar:20240607T065901154": failed to copy: httpReadSeeker: failed open: failed to do request: Get "http://192.168.64.18:30020/v2/system/base/job/awsome-sidecar/manifests/sha256:25e9665de2c2bec5aa7aa3a38b0ad258abb30016ef973695b66afce127ae1ec7?ns=myofficial.azurecr.io\": dial tcp 192.168.64.18:30020: connect: connection refused"
Another issue is that sometimes the image pull is very slow:
2024-08-01 00:52:17.0000000 | Pulling image "myofficial.azurecr.io/system/base/edgeproxy/nginx:20240103T113448224"
2024-08-01 01:24:12.0000000 | Successfully pulled image "myofficial.azurecr.io/system/base/edgeproxy/nginx:20240103T113448224" in 31m54.616s (31m54.616s including waiting)
This is a very small image, which I think should complete within 1 minute.
Could you take a look when you get a chance? Thanks.