Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

allow verifier log size to reach max to prevent infinite loop #1679

Closed

Conversation

brycekahle
Copy link
Contributor

In the case of:

  • needing the max verifier log size
  • kernel does not support log_true_size yet

ebpf/prog.go

Lines 453 to 455 in 29fedc5

const factor = 2
logSize = internal.Between(logSize, minVerifierLogSize, maxVerifierLogSize/factor)
logSize *= factor

maxVerifierLogSize is 1073741823, so integer divided by 2 = 536870911. Multiplying by 2 equals 1073741822. Thus we never reach maxVerifierLogSize in the loop. The kernel keeps returning ENOSPC but the size never changes and you get an infinite loop.

@brycekahle brycekahle requested a review from a team as a code owner February 12, 2025 21:07
@brycekahle brycekahle force-pushed the bryce.kahle/verifier-log-max branch from 4fccb08 to bc35014 Compare February 12, 2025 21:09
@brycekahle
Copy link
Contributor Author

If it wasn't clear, we are seeing the infinite loop in one of our tools.

prog.go Outdated
const factor = 2
logSize = internal.Between(logSize, minVerifierLogSize, maxVerifierLogSize/factor)
logSize *= factor
logSize = internal.Between(logSize*2, minVerifierLogSize, maxVerifierLogSize)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you bring back the overflow check that used to be here?

e8b05c5#diff-348ab86651d6b4babd4e62cbe504d85db37f9aea172e2b72a570b8d5d73b8bc5L443-L445

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think it is possible to overflow? maxVerifierLogSize is 2^30, so 2^31 is the max it can be, which fits in a uint32.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point. There are two problems with the current code:

  • The maxVerifierLogSize may change at some point, as the comment points out. It'd be nice to not have this problem again.
  • There is no protection against an infinite loop, as you've discovered.

As I wrote the code originally the intention was that:

  1. Passing a too large number for the buffer would give EINVAL, aborting the loop.
  2. The buffer exceeding uint32 would also abort it.

(1) doesn't happen anymore since we clamp to the maximum, (2) was removed. IMO I'd prefer bringing back the original code, which doesn't clamp to a max value (because hardcoding other system's limit makes code brittle) and checks for overflow. We can keep the LogSizeStart logic, although that might need adjustment because with LogLevel == 0 we actually use LogSizeStart * 2 for the first call.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated, please take a look.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, I thought about this some more. Having the knowledge about the kernel max allows us to work up to that value before erroring with EINVAL. If we just keep doubling, then we could blow right by a value that would actually function.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we should do that only for older kernels where log_true_size is not available?

Signed-off-by: Bryce Kahle <[email protected]>
@brycekahle brycekahle force-pushed the bryce.kahle/verifier-log-max branch from d3e0ac3 to 544a72a Compare February 14, 2025 22:25
@ti-mo
Copy link
Collaborator

ti-mo commented Feb 18, 2025

Ugh, the amount of time I've spent on these 30 lines of code over the years is getting ridiculous.. Discussed this offline with Lorenz, we came up with the following steps:

  • Roll back the LogSizeStart field (again..) in favor of a LogSizeOverride field that disables the retry logic and does a one-shot load. Downside is that we don't have VerifierError.Truncated anymore, so it's not as clear for the user if the log is incomplete. We need an escape hatch anyway, though.
  • Extract the retry decision-making logic into a function that can be tested independently without making the buffer allocations to catch silly mistakes like this rounding error.
  • Limit the loop to 32 retries (1 for each bit in the LogSize field) to prevent this from going nuclear.

That should, hopefully, make this a bit more robust. I'll come up with something this week.

@ti-mo
Copy link
Collaborator

ti-mo commented Feb 19, 2025

@brycekahle Something I'm wondering about, though. How is 1073741822 too small but 1073741823 big enough? I'm not sure if reaching the max log size is important here, it's that the lib doesn't currently give up if the log doesn't fit in the kernel's max log size at all. Interesting case!

@brycekahle
Copy link
Contributor Author

I'm not sure if reaching the max log size is important here

I believe it is, because if you use the max size the kernel will not error, even if it actually needed more space.

@ti-mo
Copy link
Collaborator

ti-mo commented Feb 20, 2025

@brycekahle I just put up #1693, please give it a try. Closing this one for now.

@ti-mo ti-mo closed this Feb 20, 2025
@ti-mo
Copy link
Collaborator

ti-mo commented Feb 20, 2025

I'm not sure if reaching the max log size is important here

I believe it is, because if you use the max size the kernel will not error, even if it actually needed more space.

Hmm, interesting. It's a shame this is so hard to test due to the memory requirements and having to produce a program large and complex enough to trigger the limit.. Anyway, I tried covering as many cases as possible in my PR, please give it a try.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants