
Some images pull failed and some pull takes a very long time #551

Open
Calotte opened this issue Aug 2, 2024 · 9 comments
Labels
bug Something isn't working

Comments

@Calotte

Calotte commented Aug 2, 2024

Spegel version

v0.0.23

Kubernetes distribution

AKS

Kubernetes version

v1.28.5

CNI

Cilium

Describe the bug

Hi, recently we have observed some image download issues with Spegel.
The kubelet failed to pull an image due to connection refused:

2024-07-31 21:36:50.0000000 | Pulling image "myofficial.azurecr.io/system/base/job/awsome-sidecar:20240607T065901154"
2024-07-31 21:41:51.0000000 | Failed to pull image "myofficial.azurecr.io/system/base/job/awsome-sidecar:20240607T065901154": failed to pull and unpack image "myofficial.azurecr.io/system/base/job/awsome-sidecar:20240607T065901154": failed to copy: httpReadSeeker: failed open: failed to do request: Get "http://192.168.64.18:30020/v2/system/base/job/awsome-sidecar/manifests/sha256:25e9665de2c2bec5aa7aa3a38b0ad258abb30016ef973695b66afce127ae1ec7?ns=myofficial.azurecr.io": dial tcp 192.168.64.18:30020: connect: connection refused

I dug further into the containerd logs and found the following:
2024-07-31 21:36:50.0000000 time="2024-07-31T21:36:50.461891485Z" level=info msg="PullImage "myofficial.azurecr.io/system/base/job/awsome-sidecar:20240607T065901154""
2024-07-31 21:36:50.0000000 time="2024-07-31T21:36:50.462857165Z" level=info msg="trying next host" error="failed to do request: Head "http://192.168.64.18:30020/v2/system/base/job/awsome-sidecar/manifests/20240607T065901154?ns=myofficial.azurecr.io\": dial tcp 192.168.64.18:30020: connect: connection refused" host="192.168.64.18:30020"
2024-07-31 21:41:50.0000000 time="2024-07-31T21:41:50.463003321Z" level=error msg="cancel pulling image myofficial.azurecr.io/system/base/job/awsome-sidecar:20240607T065901154 because of no progress in 5m0s"
2024-07-31 21:41:50.0000000 time="2024-07-31T21:41:50.464570509Z" level=error msg="PullImage "myofficial.azurecr.io/system/base/job/awsome-sidecar:20240607T065901154" failed" error="failed to pull and unpack image "myofficial.azurecr.io/system/base/job/awsome-sidecar:20240607T065901154": failed to copy: httpReadSeeker: failed open: failed to do request: Get "http://192.168.64.18:30020/v2/system/base/job/awsome-sidecar/manifests/sha256:25e9665de2c2bec5aa7aa3a38b0ad258abb30016ef973695b66afce127ae1ec7?ns=myofficial.azurecr.io\": dial tcp 192.168.64.18:30020: connect: connection refused"

Another issue is that sometimes the image pull is very slow:
2024-08-01 00:52:17.0000000 | Pulling image "myofficial.azurecr.io/system/base/edgeproxy/nginx:20240103T113448224"
2024-08-01 01:24:12.0000000 | Successfully pulled image "myofficial.azurecr.io/system/base/edgeproxy/nginx:20240103T113448224" in 31m54.616s (31m54.616s including waiting)

This is a very small image, so I think the pull should complete within 1 minute.

Could you take a look when you get a chance? Thanks.

@Calotte Calotte added the bug Something isn't working label Aug 2, 2024
@Calotte
Author

Calotte commented Aug 12, 2024

{
  "time": "2024-07-23T01:58:05.6412250Z",
  "level": "INFO",
  "source": {
    "function": "github.com/spegel-org/spegel/pkg/registry.(*Registry).handle.func1",
    "file": "/build/pkg/registry/registry.go",
    "line": 137
  },
  "msg": "",
  "path": "/v2/system/base/job/awsome-sidecar/blobs/sha256:725aac633332a1caecb201331077e8f3891b63fe937c342cb6940aa390c2e81f",
  "status": 200,
  "method": "GET",
  "latency": "15m38.932650294s",
  "ip": "100.64.87.42",
  "handler": ""
}

It seems the latency is sometimes very large when it should be within about a minute, and the handler field is sometimes empty, which should not be possible for v0.0.23; that might be a Kusto logging issue, but either way the latency is really high.

@phillebaba
Member

Without having access to your environment it will be difficult to debug this. What you are hitting is most likely Containerd's image pull progress deadline. Something is, however, not working with Spegel if it is not serving data, or is serving data very slowly.

Setting timeouts is difficult because there will be large images that could take a long time to serve.

Best thing I can do right now is to try to reproduce this case locally somehow.
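For reference, the "no progress in 5m0s" cancellation in your containerd log matches containerd's image pull progress timeout, which can be tuned in the CRI plugin config. A minimal sketch, assuming a containerd version that supports the setting (on AKS the config file is managed by the platform, so treat this as illustration only):

# /etc/containerd/config.toml (illustrative excerpt)
[plugins."io.containerd.grpc.v1.cri"]
  # Cancel an image pull when no progress has been made for this long.
  image_pull_progress_timeout = "5m"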

@Calotte
Author

Calotte commented Sep 3, 2024

Thanks for your response @phillebaba.
One concern for me: in a 1k-node cluster, if only 1-2 nodes have a large image cached and we submit a distributed job across 256 nodes, will those 1-2 nodes be overloaded?

@phillebaba
Member

It is a completely different question when we are talking about 1000-node clusters. That scale introduces a lot of potential bottlenecks that are hard to theorize about. In that situation we would definitely see a bottleneck if 250 nodes all hit the same node for an image.

I can see a couple of solutions to this problem. The easy one would be rate limiting HTTP requests so that a single node cannot be overrun. The other, more complicated, solution would be for the router to be more aware of the load on nodes. I do think this is solvable, but I don't think it will be easy.
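To make the first option concrete, here is a minimal sketch of what per-node request limiting could look like. This is not Spegel's actual code; the handler wiring, the port, and the limit of 20 are made up for illustration.

package main

import (
	"net/http"
)

// limitConcurrency allows at most `limit` requests to be served at once;
// excess requests get a 503 so the client can fall back to another mirror
// or the upstream registry instead of queueing behind an overloaded node.
func limitConcurrency(limit int, next http.Handler) http.Handler {
	sem := make(chan struct{}, limit)
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		select {
		case sem <- struct{}{}:
			defer func() { <-sem }()
			next.ServeHTTP(w, r)
		default:
			http.Error(w, "mirror busy", http.StatusServiceUnavailable)
		}
	})
}

func main() {
	// Placeholder for the real blob-serving handler.
	blobs := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		w.WriteHeader(http.StatusOK)
	})
	http.ListenAndServe(":30020", limitConcurrency(20, blobs))
}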

A problem is that I do not have access to a 1000-node cluster. Right now I am paying to run the benchmarks on a 100-node cluster myself. I have been trying to figure out a way to get credits to run benchmarks but have not found one yet.

@Calotte
Author

Calotte commented Sep 8, 2024

I have more findings. From the logs, it seems some nodes sometimes have high latency when copying blobs between nodes.
Those nodes are exactly the same, so I am not sure why this would happen or how to avoid it.
This is a cluster of roughly 300 nodes, and submitting a job with 128 or 32 nodes triggers this issue. The cluster is also hard for me to access, but I can get the Spegel logs and syslog; let me know if you need more info.

PreciseTimeStamp message RoleInstance
2024-08-20 16:50:57.0000000 {"time":"2024-08-20T16:50:57.397539363Z","level":"INFO","source":{"function":"github.com/spegel-org/spegel/pkg/registry.(*Registry).handle.func1","file":"/build/pkg/registry/registry.go","line":137},"msg":"","path":"/v2/system/base/job/official-sidecar/blobs/sha256:389ad29125f007347455213dc87c734dd4bbde54ec48f879a7f894ee89d1ff97","status":200,"method":"GET","latency":"1.061328343s","ip":"100.64.113.210","handler":"blob"} nd96hsrv50100000t
2024-08-20 16:50:58.0000000 {"time":"2024-08-20T16:50:58.39101305Z","level":"INFO","source":{"function":"github.com/spegel-org/spegel/pkg/registry.(*Registry).handle.func1","file":"/build/pkg/registry/registry.go","line":137},"msg":"","path":"/v2/system/base/job/official-sidecar/blobs/sha256:389ad29125f007347455213dc87c734dd4bbde54ec48f879a7f894ee89d1ff97","status":200,"method":"GET","latency":"7.52914922s","ip":"100.64.147.114","handler":"mirror"} nd96hsrv50100002j
2024-08-20 16:51:00.0000000 {"time":"2024-08-20T16:51:00.070434299Z","level":"INFO","source":{"function":"github.com/spegel-org/spegel/pkg/registry.(*Registry).handle.func1","file":"/build/pkg/registry/registry.go","line":137},"msg":"","path":"/v2/system/base/job/official-sidecar/blobs/sha256:389ad29125f007347455213dc87c734dd4bbde54ec48f879a7f894ee89d1ff97","status":200,"method":"GET","latency":"829.286542ms","ip":"100.64.43.133","handler":"mirror"} nd96hsrv50100002v
2024-08-20 16:51:00.0000000 {"time":"2024-08-20T16:51:00.496573138Z","level":"INFO","source":{"function":"github.com/spegel-org/spegel/pkg/registry.(*Registry).handle.func1","file":"/build/pkg/registry/registry.go","line":137},"msg":"","path":"/v2/system/base/job/official-sidecar/blobs/sha256:389ad29125f007347455213dc87c734dd4bbde54ec48f879a7f894ee89d1ff97","status":200,"method":"GET","latency":"46.098711588s","ip":"100.64.161.93","handler":"blob"} nd96hsrv50100001n
2024-08-20 16:56:54.0000000 {"time":"2024-08-20T16:56:54.952562713Z","level":"INFO","source":{"function":"github.com/spegel-org/spegel/pkg/registry.(*Registry).handle.func1","file":"/build/pkg/registry/registry.go","line":137},"msg":"","path":"/v2/system/base/job/official-sidecar/blobs/sha256:389ad29125f007347455213dc87c734dd4bbde54ec48f879a7f894ee89d1ff97","status":200,"method":"GET","latency":"5m22.184722277s","ip":"100.64.166.15","handler":"blob"} nd96hsrv50100001n
2024-08-20 16:56:54.0000000 {"time":"2024-08-20T16:56:54.984723083Z","level":"INFO","source":{"function":"github.com/spegel-org/spegel/pkg/registry.(*Registry).handle.func1","file":"/build/pkg/registry/registry.go","line":137},"msg":"","path":"/v2/system/base/job/official-sidecar/blobs/sha256:389ad29125f007347455213dc87c734dd4bbde54ec48f879a7f894ee89d1ff97","status":200,"method":"GET","latency":"5m22.226544212s","ip":"100.64.166.15","handler":"mirror"} nd96hsrv501000021
2024-08-20 16:56:58.0000000 {"time":"2024-08-20T16:56:58.731936579Z","level":"INFO","source":{"function":"github.com/spegel-org/spegel/pkg/registry.(*Registry).handle.func1","file":"/build/pkg/registry/registry.go","line":137},"msg":"","path":"/v2/system/base/job/official-sidecar/blobs/sha256:389ad29125f007347455213dc87c734dd4bbde54ec48f879a7f894ee89d1ff97","status":200,"method":"GET","latency":"5m25.641542833s","ip":"100.64.20.46","handler":"mirror"} nd96hsrv50100003y
2024-08-20 16:56:58.0000000 {"time":"2024-08-20T16:56:58.726275247Z","level":"INFO","source":{"function":"github.com/spegel-org/spegel/pkg/registry.(*Registry).handle.func1","file":"/build/pkg/registry/registry.go","line":137},"msg":"","path":"/v2/system/base/job/official-sidecar/blobs/sha256:389ad29125f007347455213dc87c734dd4bbde54ec48f879a7f894ee89d1ff97","status":200,"method":"GET","latency":"5m25.627588497s","ip":"100.64.20.46","handler":"blob"} nd96hsrv50100001n
2024-08-20 16:57:01.0000000 {"time":"2024-08-20T16:57:01.362757614Z","level":"INFO","source":{"function":"github.com/spegel-org/spegel/pkg/registry.(*Registry).handle.func1","file":"/build/pkg/registry/registry.go","line":137},"msg":"","path":"/v2/system/base/job/official-sidecar/blobs/sha256:389ad29125f007347455213dc87c734dd4bbde54ec48f879a7f894ee89d1ff97","status":200,"method":"GET","latency":"4m49.159767256s","ip":"100.64.25.136","handler":"mirror"} nd96hsrv501000053
2024-08-20 16:57:01.0000000 {"time":"2024-08-20T16:57:01.356938532Z","level":"INFO","source":{"function":"github.com/spegel-org/spegel/pkg/registry.(*Registry).handle.func1","file":"/build/pkg/registry/registry.go","line":137},"msg":"","path":"/v2/system/base/job/official-sidecar/blobs/sha256:389ad29125f007347455213dc87c734dd4bbde54ec48f879a7f894ee89d1ff97","status":200,"method":"GET","latency":"4m49.135293963s","ip":"100.64.25.136","handler":"blob"} nd96hsrv50100001n
2024-08-20 17:32:39.0000000 {"time":"2024-08-20T17:32:39.674657097Z","level":"ERROR","source":{"function":"github.com/spegel-org/spegel/pkg/registry.(*Registry).handle.func1","file":"/build/pkg/registry/registry.go","line":140},"msg":"","err":"mirror with image component sha256:389ad29125f007347455213dc87c734dd4bbde54ec48f879a7f894ee89d1ff97 could not be found","path":"/v2/system/base/job/official-sidecar/blobs/sha256:389ad29125f007347455213dc87c734dd4bbde54ec48f879a7f894ee89d1ff97","status":404,"method":"GET","latency":"4.943022ms","ip":"100.64.160.138","handler":"mirror"} nd96hsrv50100000d
2024-08-20 17:39:18.0000000 {"time":"2024-08-20T17:39:18.028329709Z","level":"ERROR","source":{"function":"github.com/spegel-org/spegel/pkg/registry.(*Registry).handle.func1","file":"/build/pkg/registry/registry.go","line":140},"msg":"","err":"mirror with image component sha256:389ad29125f007347455213dc87c734dd4bbde54ec48f879a7f894ee89d1ff97 could not be found","path":"/v2/system/base/job/official-sidecar/blobs/sha256:389ad29125f007347455213dc87c734dd4bbde54ec48f879a7f894ee89d1ff97","status":404,"method":"GET","latency":"8.623893ms","ip":"100.64.131.128","handler":"mirror"} nd96hsrv501000032

@Calotte
Author

Calotte commented Sep 10, 2024

I suspect that when a node serving images is also pulling images, the blob downloads it serves may be slow, since I found that at that time this node was also downloading other images. Our scenario is a distributed job with 6 containers per replica, so a node may have one image cached, pull the other 5, and at the same time serve the cached image to other nodes as a registry server.
Those images are 2.32G, 133MB, 1.07GB, 140G, and 124MB, plus one main container whose size ranges from about 5G to 50G.
BTW, is the hostPort critical? In our cluster Cilium does not provide hostPort, so I only use a NodePort.
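For context, with the NodePort setup containerd is pointed at the node's Spegel registry through mirror configuration roughly like the following. This is an illustrative sketch only; the IP and port are taken from the containerd log above, and the exact file contents depend on how Spegel is configured.

# /etc/containerd/certs.d/myofficial.azurecr.io/hosts.toml (illustrative)
server = "https://myofficial.azurecr.io"

# The node's own IP and the registry NodePort from the logs above.
[host."http://192.168.64.18:30020"]
  capabilities = ["pull", "resolve"]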

@Calotte
Author

Calotte commented Sep 23, 2024

I have a proposal for this issue.
The key issue preventing us from enabling Spegel is that containerd image pulls fail with Spegel because the progress timeout is reached and the pull is cancelled.
It is hard to reproduce, so I injected an error into Spegel that sleeps for 5m10s and confirmed that the containerd image pull fails in that case, which breaks the promise that "Spegel acts as a best-effort cache and the worst-case scenario is always that images are pulled from the upstream registry". I am not sure why Spegel sometimes gets stuck while sharing a blob, but I think we need the following refinement:

Add a progress reporter similar to containerd's: if Spegel makes no progress within 10s, fail early so that containerd falls back to the upstream registry.
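As a rough sketch of the idea (the function name, buffer size, and deadline are illustrative, and this is not how Spegel serves blobs today), the blob handler could enforce a per-chunk write deadline so a stalled transfer is aborted early instead of hanging until containerd's 5m no-progress cancel:

package registry

import (
	"io"
	"net/http"
	"time"
)

// copyWithProgressDeadline streams a blob to the client but aborts if a single
// chunk cannot be written within progressDeadline. Dropping the connection early
// lets containerd give up on the mirror and fall back to the upstream registry.
// Note this only catches write stalls, not a stalled source reader.
func copyWithProgressDeadline(w http.ResponseWriter, src io.Reader, progressDeadline time.Duration) error {
	rc := http.NewResponseController(w)
	buf := make([]byte, 32*1024)
	for {
		if err := rc.SetWriteDeadline(time.Now().Add(progressDeadline)); err != nil {
			return err
		}
		n, rerr := src.Read(buf)
		if n > 0 {
			if _, werr := w.Write(buf[:n]); werr != nil {
				return werr // write stalled past the deadline, connection is dropped
			}
		}
		if rerr == io.EOF {
			return nil
		}
		if rerr != nil {
			return rerr
		}
	}
}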

@phillebaba what do you think? If you do not have time, I would like to contribute this if it makes sense.

@phillebaba
Member

This sounds like we could optimize the configuration in the Spegel proxy to time out faster in these types of scenarios. Adding an artificial delay to verify that Containerd will still wait was a good idea. I am going to do the same to see if I can tweak some timeout settings to get the connection to close when this happens. Long term we obviously want to know why this happens in the first place.

Regarding the status reporter feature, how are you planning on implementing this? I think it would be fine as long as I understand the design.

@phillebaba
Member

@Calotte I have finally had some time to sit down and do some proper testing in controlled environments. Through these tests I have gained some insight into why you are seeing these problems, especially in large clusters. Instead of running a large cluster I ran a lot of concurrent requests for the same layer from a couple of nodes. By working backwards from expected network and disk performance I have identified three different factors that play a part in this.

  1. Cross-zone requests are noticeably slower, which leads to worse than expected performance. Currently the default deployment of Spegel is not zone aware, leading to suboptimal node selection as it is mostly random. In a perfect scenario we would always choose a node in the same zone before any other node. There is some interesting research on adding topology as a factor in lookups which should be explored further.

  2. The proxy component of Spegel is a larger bottleneck than I expected it to be. This is because the copying of data between response and request is done in user space. More research into how other proxies do this will be needed to better understand how it can be optimized.

  3. The last but probably most significant factor is the limited bandwidth that a node has. Both network and disk bandwidth can be a factor here, but more commonly it will be the network bandwidth. During my tests I observed downloads slowing down as I added more concurrent requests. Doing the calculations afterwards shows that the total bandwidth served is the same either way. When you think about it, it is pretty logical that more concurrent requests will make each download slower.
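To put illustrative numbers on the last point (made-up figures, not measurements from my tests): with roughly 1 Gbps (~125 MB/s) of usable node egress, one client pulling a 500 MB layer finishes in about 4 seconds, while 50 concurrent clients share the same 125 MB/s, so each sees about 2.5 MB/s and the same layer takes around 200 seconds.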

From these discoveries there are three separate issues that need to be worked on to improve the experience for users, especially those running Spegel in very large clusters. I will go ahead and create them and we can have further discussions in the respective issues.
