[Bug] mirrorAttempts does not exit out of resolveCtx early when exceeded #607

Open
craig-seeman opened this issue Oct 14, 2024 · 3 comments
Labels
bug Something isn't working

Comments

@craig-seeman

craig-seeman commented Oct 14, 2024

Spegel version

v0.0.27

Kubernetes distribution

N/A

Kubernetes version

N/A

CNI

N/A

Describe the bug

Pretty benign issue, but I wanted to bring it up for investigation. While looking through logs, I noticed a lot of 404 responses landing right at the 20.x ms mark, i.e. misses were taking the full timeout.

From some looking around, I think there may be a bug where, if mirrorAttempts is met or exceeded, the function still hangs around until the timeout is reached. In other words, when an image exists nowhere, the resolve time for the 404 could have been, for example, 5 ms to exhaust all 3 attempts on other nodes instead of the full 20 ms.
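To make the expectation concrete, here is a rough sketch of the early exit I have in mind (hypothetical names and structure, not the actual resolveCtx code):

```go
// Hypothetical sketch of the early exit described above; names and structure
// are assumptions, not Spegel's actual resolve code.
package mirror

import (
	"context"
	"errors"
	"fmt"
)

// resolveWithLimit gives up as soon as maxAttempts peers have been tried,
// instead of always waiting out the resolve timeout carried by ctx.
func resolveWithLimit(ctx context.Context, peerCh <-chan string, maxAttempts int, tryPeer func(context.Context, string) error) (string, error) {
	attempts := 0
	for {
		select {
		case <-ctx.Done():
			// Timeout hit before enough peers could be tried.
			return "", ctx.Err()
		case peer, ok := <-peerCh:
			if !ok {
				return "", errors.New("peer channel closed")
			}
			if err := tryPeer(ctx, peer); err == nil {
				return peer, nil
			}
			attempts++
			if attempts >= maxAttempts {
				// Early exit: all allowed attempts failed, so there is no
				// reason to keep waiting for the timeout before answering 404.
				return "", fmt.Errorf("exhausted %d mirror attempts", maxAttempts)
			}
		}
	}
}
```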

craig-seeman added the bug label Oct 14, 2024
@phillebaba
Member

I have had a look at the code and I am not sure this is a bug. There are three situations that can lead to a 404 response.

The first is that we are not able to resolve enough peers to meet the peer count within the timeout; by default this is 3 peers within 20 ms. In this situation the lookup context times out and the peer channel is closed.

The second is that we resolve three peers but every request to those peers returns a non-200 response, because the layer has since been deleted from the peer but the TTL has not expired. I think this is a less likely scenario.

The third is that libp2p has searched the whole network and concluded that the key cannot be found on any peer before the timeout happens. This does happen at times but is less likely in large clusters.
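To make the first situation concrete, the lookup roughly has this shape (a simplified sketch on my part, not the actual implementation): the resolve timeout lives on the lookup context, and a miss is only reported once that context expires or discovery gives up, which is why most 404s land right at the timeout.

```go
// Simplified sketch of the first situation; names are assumed, not Spegel's code.
package mirror

import (
	"context"
	"time"
)

// lookupPeers waits up to resolveTimeout for discovery to yield peerCount
// peers. A miss is reported when the context times out (the common case) or
// when discovery has exhausted the network and closes the channel.
func lookupPeers(parent context.Context, resolveTimeout time.Duration, peerCount int, discover func(context.Context) <-chan string) []string {
	ctx, cancel := context.WithTimeout(parent, resolveTimeout)
	defer cancel()

	peerCh := discover(ctx)
	peers := []string{}
	for {
		select {
		case <-ctx.Done():
			// Timed out before enough peers were found; the caller answers
			// 404 here, which is why misses cluster at the timeout duration.
			return peers
		case peer, ok := <-peerCh:
			if !ok {
				// Discovery searched the whole network before the timeout.
				return peers
			}
			peers = append(peers, peer)
			if len(peers) >= peerCount {
				return peers
			}
		}
	}
}
```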

In my head it is pretty reasonable that most 404 responses happen at the timeout duration. This is one reason why I have set it so low by default. In clusters with 100 nodes, keys that exist are resolved in 1-3 ms, so setting the timeout to 20 ms gives a pretty good buffer for most people without adding too much latency.

Could you share where you think the code would continue running past all attempts?

@craig-seeman
Author

craig-seeman commented Oct 15, 2024

The third is that libp2p has searched the whole network and concluded that the key cannot be found on any peer before the timeout happens. This does happen at times but is less likely in large clusters.

Ah, yep! Now that you bring this up (which I should have mentioned), this is exactly what is happening in my/our use case, and I think why I see it so regularly. Right now, as we're trialing it out, we're on a fairly decent-sized development cluster (~50 nodes, ~3,000-5,000 digests advertised) that gets deploys nearly constantly, so there are quite a few new layers being pulled all the time.

In my head it is pretty reasonable that most 404 responses happen at the timeout duration. This is one reason why I have set it so low by default. In clusters with 100 nodes, keys that exist are resolved in 1-3 ms, so setting the timeout to 20 ms gives a pretty good buffer for most people without adding too much latency.

This is really useful information to know :D Thank you! I completely agree that the current default of 20 ms is fine; in fact, I felt a bit silly opening this issue for this exact reason. However, I was curious whether people who had set a higher timeout would run into this (and, let's be honest, it's fun to tune things to the max at times 😄).
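For anyone who does want to tune it, the knobs look roughly like this in the Helm values (the key names here are my assumption from this thread, so check the chart's values.yaml for the exact ones):

```yaml
# Assumed value names based on this thread; verify against the chart's values.yaml.
spegel:
  # How long to wait for peers advertising a layer before giving up (default discussed above: 20ms).
  mirrorResolveTimeout: "20ms"
  # How many resolved peers to attempt before falling back to the upstream registry.
  mirrorResolveRetries: 3
```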

Thanks as always for the awesome information and quick replies!

@phillebaba
Member

The timeout was a lot higher initially. I had to set a default, so I pulled 5 seconds out of nowhere without really thinking about it. In small test clusters this worked, because the lookup would exhaust all hosts before the timeout actually happened. It wasn't until I started looking at larger clusters, and eventually the benchmarks, that I discovered the timeout actually mattered and started looking for a better default.

Fine-tuning the timeout could be useful at times, but you are really only saving a couple of milliseconds at the risk of more cache misses.

All good. I will take some of the information from this issue and add it to the FAQ, as it could be useful for future users to understand the significance of the timeout.
