
CRAM/CRRL should depend on base and length #96

Open
vmurali opened this issue Jan 30, 2025 · 14 comments
Comments

@vmurali
Collaborator

vmurali commented Jan 30, 2025

Given how CRAM and CRRL are used in https://github.com/CHERIoT-Platform/cheriot-rtos/blob/c994de3d1a476d824ffb7ab8711a52307ac7a9f0/sdk/core/loader/boot.cc#L502-L513, the base is also needed to get a valid roundedBase and roundedSize that encompass the heap.

Moreover there are a bunch of related issues in that code:

All these problems could be solved (and you can get rid of the invariants) if CRAM and CRRL take a base as an argument and return a valid size/mask to manipulate the input base/size.

@nwf @rmn30 @davidchisnall

@rmn30
Collaborator

rmn30 commented Jan 31, 2025

Hi @vmurali and thanks for raising this.

CRRL and CRAM are part of the (draft) upstream riscv-cheri spec and are designed for allocators, so that they can increase the size of a requested allocation to meet alignment requirements. I believe the cheriot-rtos allocator uses them in that way, so we can't just redefine them. The code you're looking at does look a bit suspicious and I'm looking into it. The situation is slightly different from the one the allocator faces, so it's possible we're using the instructions incorrectly. We already defined #74 to handle another specific situation, but I don't think that applies here. If the loader is the only place where this situation arises, I doubt a new instruction is appropriate as it's not a fast path, but we do need a way of doing it correctly, and would ideally avoid embedding too much knowledge of the capability encoding (e.g. the precision) into the loader. Therefore, I think this is probably an RTOS issue, not an ISA one, but we'll see.

Some observations:

  • __export_mem_heap_end comes from the board configuration and in all current configurations is highly aligned because it is the end of a memory block (e.g. 128 / 256 / 512 KiB aligned). Therefore it should be safe to assume it is representable for any given heap size that is less than the entire memory block containing the heap, given that the address range for the block itself is representable. I'd be interested to know if you have a counterexample to this.
  • heap_start / __export_mem_heap is set to the point after all other data so is arbitrarily aligned. The loader code attempts to round this up such that the heap range is representable while preserving the same heap_end. I'm not sure using CRAM is quite right here as it is designed for allocations that round up in size whereas here we want to round down. There may be an edge case where it fails but I need to think about it.
  • We don't yet support heaps spread across multiple ranges although it might be useful and could be used here.

@vmurali
Collaborator Author

vmurali commented Jan 31, 2025

We already defined #74 to handle another specific situation but I don't think that applies here.

I did see CSetBoundsRoundDown, but the semantics for that seem more straightforward:

getBoundsRoundDown(inputBase, inputLength) = Cap with base = inputBase, length = y such that y <= inputLength

The semantics in the heap (boot.cc) seems to be (based on my chat with Wes offline):

getHeapAllocatorBounds(inputBase, inputLength) =
     Cap with base = x such that x >= inputBase, length = y such that x + y <= inputBase + inputLength
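To make the contrast concrete, here is a small Python sketch of the two semantics. Representability is approximated with an assumed CHERIoT-style 9-bit mantissa and brute-force search; this is a toy model for illustration, not the real encoding.

```python
# Toy model: a region is representable if base and top are aligned to 2^e,
# where e is the smallest exponent such that length >> e fits in a 9-bit
# mantissa. This approximates CHERIoT-style bounds, not the exact encoding.
MANTISSA_BITS = 9

def _exp(length):
    e = 0
    while (length >> e) >= (1 << MANTISSA_BITS):
        e += 1
    return e

def representable(base, length):
    mask = (1 << _exp(length)) - 1
    return (base & mask) == 0 and ((base + length) & mask) == 0

def bounds_round_down(base, length):
    """CSetBoundsRoundDown-style: keep base, shrink the length."""
    y = length
    while y > 0 and not representable(base, y):
        y -= 1
    return base, y

def heap_bounds(base, length):
    """boot.cc-style: round the base up, keeping base + length as the limit."""
    top = base + length
    x = base
    while x < top and not representable(x, top - x):
        x += 1
    return x, top - x
```

For example, with base 3 and length 600 the first keeps base 3 and shrinks the length, while the second moves the base up so that base + length never exceeds 603.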

@vmurali
Collaborator Author

vmurali commented Jan 31, 2025

CRRL and CRAM are part of the (draft) upstream riscv-cheri spec. and are designed for allocators so they can increase the size of a requested allocation to meet alignment requirements. I believe the cheriot-rtos allocator uses them in that way, so we can't just redefine them.

I should read their spec before it's too late :( .

If the goal is to use it for malloc-like allocators, then the semantics for that seems to be:

getMallocBounds(inputBase, inputLength) =
     Cap with base = x such that x >= inputBase, length = y such that y >= inputLength

Here inputBase would be useful to ensure that the allocated region is outside all previous allocations (by setting inputBase to be the end of the last allocation). Alternatively, you will have to explicitly compute a new base that is outside all previous allocations.
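That malloc-style contract can be sketched as follows, again under an assumed toy 9-bit-mantissa model; crrl and cram_alignment here are simplified stand-ins for the real instructions, not the spec's exact definitions.

```python
MANTISSA_BITS = 9  # toy CHERIoT-like precision, for illustration only

def _exp(length):
    e = 0
    while (length >> e) >= (1 << MANTISSA_BITS):
        e += 1
    return e

def crrl(length):
    """Length rounded up to a representable value (rounding may bump e once)."""
    e = _exp(length)
    r = -(-length >> e) << e   # round up to a multiple of 2^e
    e = _exp(r)                # the rounded value may itself need a larger e
    return -(-length >> e) << e

def cram_alignment(length):
    """Required power-of-two alignment for a region of this length.
    (The real CRAM returns a mask; an alignment value reads better here.)"""
    return 1 << _exp(crrl(length))

def malloc_bounds(input_base, input_length):
    y = crrl(input_length)            # length y such that y >= inputLength
    a = cram_alignment(input_length)
    x = (input_base + a - 1) & -a     # base x such that x >= inputBase, aligned
    return x, y
```

For example, malloc_bounds(7, 1000) rounds the base up to the next suitably aligned address at or above 7 while keeping the length representable.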

@vmurali
Collaborator Author

vmurali commented Jan 31, 2025

Regarding __export_mem_heap and __export_mem_heap_end: the code seems to ignore __export_mem_heap_end and just use the difference (entry.size() = __export_mem_heap_end - __export_mem_heap), i.e. the length, to obtain a new output length. So even if __export_mem_heap_end is aligned, that information is lost.

I am not entirely sure about a counter example to the heap bounds but here's what I discussed with Wes offline:

if entry.size() is a "good" value such that roundedSize = entry.size(), but __export_mem_heap is a "bad" value (consistent with your assumption) such that it has to be aligned explicitly, i.e. roundedBase > __export_mem_heap, then you will break the invariant that roundedBase + roundedSize <= __export_mem_heap_end. For this to happen, assuming __export_mem_heap_end is a "good" value and __export_mem_heap is a "bad" value, the difference must still be a "good" value for a length. I think that situation is impossible with the current bounds calculation algorithm, but it rests on shaky assumptions:

  • __export_mem_heap_end is always appropriately aligned
  • the bounds algorithm won't change in the future in a way that produces a size consistent with a non-aligned input size

@vmurali
Collaborator Author

vmurali commented Jan 31, 2025

It's interesting to note that only the following cases seem to be useful:

  • allocate memory for an object of at least the given size beyond all previous allocations, i.e. at or after start
  • allocate the biggest region within start and end
  • allocate the biggest region exactly at start which is less than or equal to a given size

The current version of setCapBounds, which allocates the smallest region encompassing start and end, is actually less useful because you need to check that the output base is disjoint from all previous allocations.

@vmurali
Collaborator Author

vmurali commented Feb 1, 2025

I see the current version of setCapBounds used in code in the following places. Where do you check disjointness of the cap from previously allocated caps in these places? It looks like disjointness is being assumed in all this code.

@nwf
Member

nwf commented Feb 1, 2025

then you will break the invariant that roundedBase + roundedSize <= __export_mem_heap_end

I don't think that's possible with the current code? (Note what it calls end is, thanks to the test in the conjunct on lines 495-496, just a different name for __export_mem_heap_end, even if that's a little obfuscated):

We check, with Debug::Invariant on line 514, that the __export_mem_heap_end address is suitably aligned for roundedSize to be representable. While there's probably some encoding in which representability is directional, I think, in all the ones proposed so far, that if an address is suitably aligned to represent a given length, then that address can be either endpoint of a representable capability of that length.

It's interesting to note that only the following cases seem to be useful

I'd add "allocate precisely size bytes at start" (i.e., CSetBoundsExact) as useful as a way of conveying intention.

The "smaller than size" variants are indeed useful but seem... niche:

  • "biggest region exactly at start and up to size" (i.e., CSetBoundsRoundDown) seems useful only for slicing up (ring) buffers, and maybe buffers of bytes at that. (Things get awkward if the buffer is holding larger objects and we can't represent even one of them, but that's probably a sign that some alignment constraints are being violated already.)

  • "biggest region within" seems useful only for setting up allocation arenas and almost surely can be built out of other primitives, even if CRAM/CRRL are slightly misshapen for the task at hand.

Do you have other uses in mind?

Where do you check disjointness of the cap from previously allocated caps in these places?

I believe (but @davidchisnall is more authoritative) that the loader presumes that the compiler has generally laid objects out such that the inexact bounds computation does not violate disjointness.

@vmurali
Collaborator Author

vmurali commented Feb 1, 2025

then you will break the invariant that roundedBase + roundedSize <= __export_mem_heap_end

I don't think that's possible with the current code? (Note what it calls end is, thanks to the test in the conjunct on lines 495-496, just a different name for __export_mem_heap_end, even if that's a little obfuscated):

We check, with Debug::Invariant on line 514, that the __export_mem_heap_end address is suitably aligned for roundedSize to be representable. While there's probably some encoding in which representability is directional, I think, in all the ones proposed so far, that if an address is suitably aligned to represent a given length, then that address can be either endpoint of a representable capability of that length.

Right, it relies on end being aligned. If not, when roundedSize = entry.size(), then end - roundedSize gives an unaligned base.
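A hypothetical worked example of that failure shape, under the same assumed toy 9-bit-mantissa alignment rule (the end address here is made up, not taken from any board configuration):

```python
MANTISSA_BITS = 9  # toy precision, for illustration

def required_alignment(length):
    # alignment demanded of base/top for a region of this length (toy model)
    e = 0
    while (length >> e) >= (1 << MANTISSA_BITS):
        e += 1
    return 1 << e

rounded_size = 1024   # a representable length needing 4-byte alignment here
end = 0x10003         # hypothetical unaligned __export_mem_heap_end
base = end - rounded_size
assert required_alignment(rounded_size) == 4
assert base % 4 != 0  # the derived base is unaligned: bounds would be inexact
```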

I'd add "allocate precisely size bytes at start" (i.e., CSetBoundsExact) as useful as a way of conveying intention.

Where do you actually use CSetBoundsExact? This was my search result: https://github.com/search?q=repo%3ACHERIoT-Platform%2Fcheriot-rtos+set_exact&type=code

Do you have other uses in mind?

No, those were the uses:

  • Object allocation in heap and stack: allocate memory for an object of at least the given size beyond all previous allocations, i.e. at or after start
  • Create a cap for a mmapped region: allocate the biggest region within start and end
  • Create arbitrary slices of a region, size no bar (note that this is computationally easy to implement in software, especially if one has to tighten "size no bar"): allocate the biggest region exactly at start which is less than or equal to a given size

@nwf
Member

nwf commented Feb 1, 2025

Where do you actually use CSetBoundsExact? This was my search result: https://github.com/search?q=repo%3ACHERIoT-Platform%2Fcheriot-rtos+set_exact&type=code

We sure do like our C++ wrappers. Unfortunately, they make code more readable at the expense of being essentially unsearchable without compiler support, and AFAIK nobody's DTRT for tooling here. This search finds many (but probably not all) of the uses of assignment to BoundsProxy, which is syntactic sugar for __builtin_cheri_bounds_set_exact().

@rmn30
Collaborator

rmn30 commented Feb 3, 2025

It's interesting to note that only the following cases seem to be useful:

  • allocate memory for an object of at least the given size beyond all previous allocations, i.e. at or after start
  • allocate the biggest region within start and end
  • allocate the biggest region exactly at start which is less than or equal to a given size

The current version of setCapBounds, which allocates the smallest region encompassing start and end is actually less useful because you need to check if the output base is disjoint from all previous allocations

I think this is an interesting analysis and would add to it a bit. The ISA doesn't have explicit support for the first two at the moment, and it's interesting to ask why we don't seem to have needed them.

  • The first one might be useful for a bump allocator but I guess we don't have one or have done it using existing ISA instructions?
  • The second one sounds interesting and is a kind of dual of the existing set bounds (we could call it MaximumSubset and the current set bounds MinimumSuperset). It is probably at least as complex for hardware as MinimumSuperset (i.e. quite complex).
  • The third one is the existing CSetBoundsRoundDown (which would be more precisely called SetBoundsRoundTopDown). We invented that for handing out an existing buffer in chunks. It's slightly easier in hardware than MinimumSuperset.
  • A fourth one, which would apply when calculating the capability for the entire heap, would be SetBoundsRoundBaseUp. This would be like RoundTopDown but would round up the base instead of rounding down the top. We could also use SetBoundsMaximumSubset but could get away with the simpler RoundBaseUp because we know the top is at least as aligned as the desired base.
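For concreteness, the four flavours could be sketched like this under an assumed toy representability model (9-bit mantissa, brute force for clarity; these are not hardware-faithful algorithms, and the names are the ones proposed in this thread):

```python
MANTISSA_BITS = 9  # toy CHERIoT-like precision

def _exp(length):
    e = 0
    while (length >> e) >= (1 << MANTISSA_BITS):
        e += 1
    return e

def representable(base, length):
    mask = (1 << _exp(length)) - 1
    return (base & mask) == 0 and ((base + length) & mask) == 0

def minimum_superset(base, length):
    """Existing csetbounds: smallest representable region covering the request."""
    y = length
    while True:
        # candidate bases x with x <= base and x + y >= base + length
        for x in range(base, max(base + length - y, 0) - 1, -1):
            if representable(x, y):
                return x, y
        y += 1

def maximum_subset(base, length):
    """Largest representable region inside [base, base + length)."""
    for y in range(length, -1, -1):
        for x in range(base, base + length - y + 1):
            if representable(x, y):
                return x, y

def round_top_down(base, length):
    """CSetBoundsRoundDown: keep base, shrink the length."""
    y = length
    while y > 0 and not representable(base, y):
        y -= 1
    return base, y

def round_base_up(base, length):
    """Keep the top, round the base up."""
    top = base + length
    x = base
    while x < top and not representable(x, top - x):
        x += 1
    return x, top - x
```

In this toy model, base 3 with length 600 gives (2, 602), (4, 598), (3, 511) and (92, 511) respectively, which also illustrates the point above: RoundBaseUp only matches MaximumSubset when the top is already suitably aligned.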

I analysed an objdump of the test suite and made some notes on current usage of set bounds instructions:

  • The original MinimumSuperset version of csetbounds is currently used only in the loader. We should audit these uses and check they are safe, or whether they can be replaced with CSetBoundsExact. @davidchisnall points out that MinimumSuperset is useful when we want "best effort" bounds that mustn't break existing code but can tolerate potential overlapping at the cost of some safety. For example the compiler will use it for sub-object bounds (currently disabled by default).
  • The compiler frequently uses set bounds with small (immediate) lengths that are guaranteed to be exact (<=511 bytes). I'm not sure what it will do for larger lengths that might not be exact?
  • The allocator uses crrl, cram and SetBoundsExact to guarantee alignment. Likewise other explicit set bounds operations use the exact variant and (presumably) have reasons for expecting this to work.

@vmurali
Collaborator Author

vmurali commented Feb 3, 2025

  • The first one might be useful for a bump allocator but I guess we don't have one or have done it using existing ISA instructions?
  • The second one sounds interesting and is a kind of dual of the existing set bounds (we could call it MaximumSubset and the current set bounds MinimumSuperset). It is probably at least as complex for hardware as MinimumSuperset (i.e. quite complex).

I think MinimumSuperset and MaximumSubset can share a lot of hardware. So it may not be that bad to implement this.

  • The third one is the existing CSetBoundsRoundDown (which would be more precisely called SetBoundsRoundTopDown). We invented that for handing out an existing buffer in chunks. It's slightly easier in hardware than MinimumSuperset.
  • A fourth one, which would apply when calculating the capability for the entire heap, would be SetBoundsRoundBaseUp. This would be like RoundTopDown but would round up the base instead of rounding down the top. We could also use SetBoundsMaximumSubset but could get away with the simpler RoundBaseUp because we know the top is at least as aligned as the desired base.

Assuming MaximumSubset shares code with MinimumSuperset and hence is not complex, we can skip this. But I think this can share hardware with RoundBaseUp: essentially, check the largest exponent for the size against the alignment of the top instead of the alignment of the base.

I analysed an objdump of the test suite and made some notes on current usage of set bounds instructions:

  • The original MinimumSuperset version of csetbounds is currently used only in the loader. We should audit these uses and check they are safe, or whether they can be replaced with CSetBoundsExact. @davidchisnall points out that MinimumSuperset is useful when we want "best effort" bounds that mustn't break existing code but can tolerate potential overlapping at the cost of some safety. For example the compiler will use it for sub-object bounds (currently disabled by default).

Okay, I understand the need to use this for sub-object bounds, especially with array slicing.

  • The compiler frequently uses set bounds with small (immediate) lengths that are guaranteed to be exact (<=511 bytes). I'm not sure what it will do for larger lengths that might not be exact?

Yeah I saw .bounds()= having really small values. I guess there are no use cases where it's not <= 511 bytes?

  • The allocator uses crrl, cram and SetBoundsExact to guarantee alignment. Likewise other explicit set bounds operations use the exact variant and (presumably) have reasons for expecting this to work.

I haven't seen the allocator code; I am guessing it's a bumping allocator (may not be linear bumping, but you still need to know the bounds of the surrounding object to allocate correctly). How do you guarantee the base is correctly aligned when using crrl and cram?

@nwf
Member

nwf commented Feb 3, 2025

I haven't seen the allocator code; I am guessing it's a bumping allocator (may not be linear bumping, but you still need to know the bounds of the surrounding object to allocate correctly). How do you guarantee the base is correctly aligned when using crrl and cram?

First, an apology: the allocator code is (and has been for a long while) sort of half way through a refactoring from a very C-style codebase into a C++ codebase and it shows.

The allocator is essentially dlmalloc -- it's a linked list of chunks of the heap arena, with free chunks being additionally (invasively, with the cons cells within the free space) indexed into "small bins" for the very smallest sizes of objects and a tree of linked lists for larger sizes. The invasive indexing is safe because after an object is freed back to the allocator, and before the invasive node for quarantine tracking is emplaced, the object's backing memory is marked as revoked, ensuring that any outstanding references to said memory are invalidated. (When objects come out of quarantine, we replace the initial invasive node for quarantine tracking with the appropriate free-space indexing node). Additionally, adjacent free chunks coalesce. When allocating, large free chunks sufficiently larger than the requested allocation size are split and sufficiently large small chunks, too (similar splits occur below and above allocation requests that had to be padded to ensure alignment).

When servicing an allocation request, we round up the size using CRRL and ensure that it is aligned according to CRAM.
After we've done all that, we bound the capability to allocated objects, either to the entire object or to a sufficiently large subset of the underlying allocation to satisfy the request.

@vmurali
Collaborator Author

vmurali commented Feb 4, 2025

I tried to understand the allocator code, but I am being too stupid today to fully understand it. But the gist seems to be as follows: malloc tries to find an object in the heap with alignment given by CRAM, and minimum size given by CRRL. If such an object doesn't exist, https://github.com/CHERIoT-Platform/cheriot-rtos/blob/main/sdk/core/allocator/alloc.h#L2708 will fail. If so, the heap allocator does not need anything other than the current CRAM and CRRL.

Regarding the distinction between MaximumSubset and RoundBaseUp, the implementation of MaximumSubset seems to be simple at least with the bounds invariant in https://github.com/CHERIoT-Platform/cheriot-sail/pulls. Given a base: Bit 33 and length: Bit 33, the pseudocode is:

let lenTrunc: Bit 24 = RemoveLsbBits(length)
let e: Bit 5 = lg2 lenTrunc
let diff: Bit 9 = RemoveMsbBits(length >> e)
let mask_for_e: Bit 33 = ~(ones << e)
let baseShouldIncrement: Bool = (base & mask_for_e) != 0
Return {B: Bit 9 = RemoveMsbBits((base >> e) + baseShouldIncrement), M: Bit 9 = diff - baseShouldIncrement, T: Bit 9 = B+M, E = e}

You don't have to worry about top being aligned with what was requested (i.e. final top may not be the same as base + length)
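That pseudocode could be rendered as runnable Python roughly as follows. This is my reading of the sketch for a 33-bit address space and 9-bit mantissa (interpreting lg2 of lenTrunc as the number of length bits above the mantissa), and the decode helper only reconstructs base/length for small addresses where base >> e fits in the mantissa; it has not been checked against the cheriot-sail PR.

```python
MANTISSA_BITS = 9

def maximum_subset_fields(base, length):
    # e = number of bits of length above the 9-bit mantissa (my reading of
    # "lg2 lenTrunc" in the pseudocode above)
    e = max(0, length.bit_length() - MANTISSA_BITS)
    mask_for_e = (1 << e) - 1                      # ~(ones << e): low e bits
    diff = (length >> e) & ((1 << MANTISSA_BITS) - 1)
    base_should_increment = 1 if (base & mask_for_e) != 0 else 0
    B = ((base >> e) + base_should_increment) & ((1 << MANTISSA_BITS) - 1)
    M = diff - base_should_increment
    return {"B": B, "M": M, "T": B + M, "E": e}

def decode(fields):
    """Concrete base/length implied by the fields (valid for small bases only)."""
    return fields["B"] << fields["E"], fields["M"] << fields["E"]
```

For base 3 and length 600 this yields base 4, length 598, which matches a brute-force MaximumSubset under the same toy model; when the request is already representable (e.g. base 8, length 1000) it returns it exactly.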

@nwf
Member

nwf commented Feb 4, 2025

I tried to understand the allocator code, but I am being too stupid today to fully understand it. But the gist seems to be as follows: malloc tries to find an object in the heap with alignment given by CRAM, and minimum size given by CRRL. If such an object doesn't exist, https://github.com/CHERIoT-Platform/cheriot-rtos/blob/main/sdk/core/allocator/alloc.h#L2708 will fail. If so, the heap allocator does not need anything other than the current CRAM and CRRL.

No, not stupid. The code is... dense and could be significantly more readable.

Your comment about mspace_memalign(size, align) is right, I think, but to rephrase just to be sure we're on the same page: it tries to find a suitably align-ed and size-d region within a free chunk of the heap (with a slightly greedy heuristic that might fail when it didn't have to). It's fine if no existing chunk is suitably align-ed so long as there's one sufficiently large to certainly have a suitable sub-span. (The heuristic is that we won't go testing ones that might happen to have suitable sub-spans.) Its caller ensures that CRAM constraints have been applied, it just follows along with generic code.
