[Bug] mirrorAttempts does not exit out of resolveCtx early when exceeded #607

Open
craig-seeman opened this issue Oct 14, 2024 · 3 comments
Labels
bug Something isn't working

Comments

@craig-seeman

craig-seeman commented Oct 14, 2024

Spegel version

v0.0.27

Kubernetes distribution

N/A

Kubernetes version

N/A

CNI

N/A

Describe the bug

Pretty benign issue, but I wanted to bring it up for investigation. While looking through logs, I noticed a lot of 404 responses landing right at the 20.x ms mark, i.e. misses were taking the full timeout.

From some looking around, I think there may be a bug where, if mirrorAttempts is met or exceeded, the function still hangs around until the timeout is reached. In other words, when an image exists nowhere, the resolve time for the 404 could have been, for example, 5 ms to exhaust all 3 attempts on other nodes instead of the full 20 ms.
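To make the expectation concrete, here is a rough sketch of the early exit I have in mind (hypothetical names and structure, not the actual resolveCtx code):

```go
// Hypothetical sketch of the early exit described above; names and structure
// are assumptions, not Spegel's actual resolve code.
package mirror

import (
	"context"
	"errors"
	"fmt"
)

// resolveWithLimit gives up as soon as maxAttempts peers have been tried,
// instead of always waiting out the resolve timeout carried by ctx.
func resolveWithLimit(ctx context.Context, peerCh <-chan string, maxAttempts int, tryPeer func(context.Context, string) error) (string, error) {
	attempts := 0
	for {
		select {
		case <-ctx.Done():
			// Timeout hit before enough peers could be tried.
			return "", ctx.Err()
		case peer, ok := <-peerCh:
			if !ok {
				return "", errors.New("peer channel closed")
			}
			if err := tryPeer(ctx, peer); err == nil {
				return peer, nil
			}
			attempts++
			if attempts >= maxAttempts {
				// Early exit: all allowed attempts failed, so there is no
				// reason to keep waiting for the timeout before answering 404.
				return "", fmt.Errorf("exhausted %d mirror attempts", maxAttempts)
			}
		}
	}
}
```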

craig-seeman added the bug label Oct 14, 2024
@phillebaba
Member

I have had a look at the code and I am not sure this is a bug. There are three situations that can lead to a 404 response.

The first is that we are not able to resolve enough peers to meet the peer count within the timeout; by default this is 3 peers within 20 ms. In this situation the lookup context times out and the peer channel is closed.

The second is that we resolve three peers but every request to those peers returns a non-200 response, because the layer has since been deleted from the peer but the TTL has not expired. I think this is a less likely scenario.

The third is that libp2p has searched the whole network and concluded that the key cannot be found on any peer before the timeout happens. This does happen at times but is less likely in large clusters.
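To make the first situation concrete, the lookup roughly has this shape (a simplified sketch on my part, not the actual implementation): the resolve timeout lives on the lookup context, and a miss is only reported once that context expires or discovery gives up, which is why most 404s land right at the timeout.

```go
// Simplified sketch of the first situation; names are assumed, not Spegel's code.
package mirror

import (
	"context"
	"time"
)

// lookupPeers waits up to resolveTimeout for discovery to yield peerCount
// peers. A miss is reported when the context times out (the common case) or
// when discovery has exhausted the network and closes the channel.
func lookupPeers(parent context.Context, resolveTimeout time.Duration, peerCount int, discover func(context.Context) <-chan string) []string {
	ctx, cancel := context.WithTimeout(parent, resolveTimeout)
	defer cancel()

	peerCh := discover(ctx)
	peers := []string{}
	for {
		select {
		case <-ctx.Done():
			// Timed out before enough peers were found; the caller answers
			// 404 here, which is why misses cluster at the timeout duration.
			return peers
		case peer, ok := <-peerCh:
			if !ok {
				// Discovery searched the whole network before the timeout.
				return peers
			}
			peers = append(peers, peer)
			if len(peers) >= peerCount {
				return peers
			}
		}
	}
}
```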

In my head it is pretty reasonable that most 404 responses happen at the timeout duration. This is one reason why I have set it so low by default. In clusters with 100 nodes, keys that exist are resolved in 1-3 ms, so setting the timeout to 20 ms gives a pretty good buffer for most people without adding too much latency.

Could you share where you think the code would continue running past all attempts?

@craig-seeman
Author

craig-seeman commented Oct 15, 2024

The third is that libp2p has searched the whole network and concluded that the key cannot be found on any peer before the timeout happens. This does happen at times but is less likely in large clusters.

Ah, yep! Now that you bring this up (which I should have mentioned), this is exactly what is happening in my/our use case, and I think why I see it so regularly. Right now, as we're trialing it out, we're on a fairly decent-sized development cluster (~50 nodes, ~3,000-5,000 digests advertised) that gets deploys nearly constantly, so there are quite a few new layers being pulled all the time.

In my head it is pretty reasonable that most 404 responses happen at the timeout duration. This is one reason why I have set it so low by default. In clusters with 100 nodes, keys that exist are resolved in 1-3 ms, so setting the timeout to 20 ms gives a pretty good buffer for most people without adding too much latency.

This is really useful information to know :D Thank you! I completely agree that the current default of 20 ms is fine; in fact, I felt a bit silly opening this issue for this exact reason. However, I was curious whether people who had set a higher timeout would run into this (and, let's be honest, it's fun to tune things to the max at times 😄).
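For anyone who does want to tune it, the knobs look roughly like this in the Helm values (the key names here are my assumption from this thread, so check the chart's values.yaml for the exact ones):

```yaml
# Assumed value names based on this thread; verify against the chart's values.yaml.
spegel:
  # How long to wait for peers advertising a layer before giving up (default discussed above: 20ms).
  mirrorResolveTimeout: "20ms"
  # How many resolved peers to attempt before falling back to the upstream registry.
  mirrorResolveRetries: 3
```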

Thanks as always for the awesome information and quick replies!

@phillebaba
Member

The timeout was a lot higher initially. I had to set a default, so I pulled 5 seconds out of nowhere without really thinking about it. In small test clusters this worked, because the lookup would exhaust all hosts before the timeout actually happened. It wasn't until I started looking at larger clusters, and eventually the benchmarks, that I discovered the timeout actually mattered and started looking for a better default.

Fine-tuning the timeout could be useful at times, but you are really only saving a couple of milliseconds at the risk of more cache misses.

All good. I will take some of the information from this issue and add it to the FAQ, as it could be useful for future users to understand the significance of the timeout.
