[Wasm RyuJIT] Initial writeup on the calling convention #122988
base: main
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -31,6 +31,8 @@ The LoongArch64 ABI documentation is [here](https://github.com/loongson/LoongArc | |
|
|
||
| The RISC-V ABIs Specification: [latest release](https://github.com/riscv-non-isa/riscv-elf-psabi-doc/releases/latest), [latest draft](https://github.com/riscv-non-isa/riscv-elf-psabi-doc/releases), [document source repo](https://github.com/riscv-non-isa/riscv-elf-psabi-doc). | ||
|
|
||
| WebAssembly Basic C ABI: [Basic C ABI](https://github.com/WebAssembly/tool-conventions/blob/main/BasicCABI.md) | ||
|
|
||
| # General Unwind/Frame Layout | ||
|
|
||
| For all non-x86 platforms, all methods must have unwind information so the garbage collector (GC) can unwind them (unlike native code in which a leaf method may be omitted). | ||
|
|
@@ -698,3 +700,178 @@ The stack elements are always aligned to at least `INTERP_STACK_SLOT_SIZE` and n | |
| Primitive types smaller than 4 bytes are always zero or sign extended to 4 bytes when on the stack. | ||
|
|
||
| When a function is async it will have a continuation return. This return is not done using the data stack, but instead is done by setting the Continuation field in the `InterpreterFrame`. Thunks are responsible for setting/resetting this value as we enter/leave code compiled by the JIT. | ||
|
|
||
| # WebAssembly ABI (R2R and JIT) | ||
|
|
||
| For managed methods compiled to WebAssembly (hereafter "managed code") the CLR generally follows the [Wasm Basic C ABI](https://github.com/WebAssembly/tool-conventions/blob/main/BasicCABI.md). | ||
|
|
||
| Managed code uses the same linear stack as C code. The stack grows down. | ||
|
|
||
| ## Incoming argument ABI | ||
|
|
||
| The linear stack pointer is the first argument to all methods. At a native->managed transition it is the value of the `$__stack_pointer` global. This global is not updated within managed code, but is updated at managed->native boundaries. Within the method the stack pointer always points at the bottom (lowest address) of the stack; generally this is a fixed offset from the value the stack pointer held on entry, except in methods that can do dynamic allocation. | ||
|
Member
For PInvokes, I expect that this update will be done around the callsite in the managed code. Where do you expect it to be done for FCalls? We assume that FCalls have the same managed calling convention. Can this update be done inside the FCall macro somehow, so that FCalls can continue to have managed calling convention? If we are not able to do that, I guess we will need to create some sort of FCall wrappers. It is doable, but it is not pretty - we have been there in the past. For reference, what does native AOT / LLVM do for FCalls currently?
Member
Author
For NAOT/LLVM it looks like FCalls have an extra initial arg that they ignore: I don't see where the stack pointer global is updated; maybe NAOT/LLVM doesn't need this?
Contributor
NAOT-LLVM shadow stack is allocated separately from the
I don't know if it is possible with |
||
|
|
||
| A frame pointer, if used, points at the bottom of the "fixed" portion of the stack to facilitate use of Wasm addressing modes, which only allow positive offsets. | ||
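The relationship between the incoming `sp`, the current stack pointer, and the frame pointer can be sketched with a small model. This is only an illustration: the `Frame` class and the helper names are hypothetical, alignment is ignored, and because the stack grows down the sketch reserves frame space by subtracting from the incoming `sp`.

```python
from dataclasses import dataclass


@dataclass
class Frame:
    entry_sp: int  # value of the sp argument on entry
    sp: int        # current bottom (lowest address) of the stack
    fp: int        # bottom of the fixed portion of the frame


def enter(entry_sp: int, fixed_size: int) -> Frame:
    """Model frame setup: the stack grows down, so reserve space by subtracting."""
    sp = entry_sp - fixed_size
    return Frame(entry_sp=entry_sp, sp=sp, fp=sp)


def dynamic_alloc(frame: Frame, size: int) -> int:
    """Model a dynamic allocation: only sp moves, fp stays where it was."""
    frame.sp -= size
    return frame.sp


def fixed_slot_address(frame: Frame, offset: int) -> int:
    """Address of a slot in the fixed area, always at a non-negative offset from fp."""
    if offset < 0:
        raise ValueError("Wasm memory instructions only accept non-negative offsets")
    return frame.fp + offset
```

In this model a method without dynamic allocation never needs `fp`, since `sp` stays at a fixed offset from `entry_sp`. Once `dynamic_alloc` has run, fixed slots are still reachable at the same non-negative offsets from `fp`.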
|
|
||
| Structs are generally passed by reference, unless the struct consists of exactly one primitive field (or exactly one field that is itself such a struct). The linear stack provides the backing storage for the by-reference structs. | ||
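A minimal sketch of this classification rule follows. The `Struct` type, the set of primitive names, and the function names are assumptions made for illustration; the sketch also ignores size, padding, and the vector types that are still TBD.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Union

PRIMITIVES = {"i32", "i64", "f32", "f64"}


@dataclass
class Struct:
    name: str
    fields: List[Union[str, "Struct"]] = field(default_factory=list)


def unwrapped_primitive(t) -> Optional[str]:
    """Return the primitive a type reduces to, or None if it must go by reference."""
    while isinstance(t, Struct):
        if len(t.fields) != 1:
            return None  # zero or several fields: not a single-primitive wrapper
        t = t.fields[0]
    return t if t in PRIMITIVES else None


def passing_mode(t) -> str:
    """Describe how a value of type t would be passed under the rule above."""
    prim = unwrapped_primitive(t)
    if prim is not None:
        return f"by-value as {prim}"
    return "by-reference (i32 address on the linear stack)"
```

For example, a struct wrapping a single `f64`, or a struct wrapping that struct, is classified as by-value, while a struct with two `i32` fields is classified as by-reference.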
|
|
||
| Structs are generally returned via hidden buffers, whose address is supplied by the caller and passed just after the `sp` argument. In such cases the Wasm-level return value of the method is the address of that buffer. But if the struct can be passed on the Wasm stack it is returned on the Wasm stack. | ||
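The hidden return buffer convention can be modeled as below. This is a sketch under assumptions: `memory` stands in for Wasm linear memory, `make_point` and `caller` are hypothetical methods, the frame size is arbitrary, and the trailing `pe` argument is omitted for brevity.

```python
import struct

memory = bytearray(64 * 1024)  # stand-in for Wasm linear memory


def make_point(sp: int, ret_buf: int, x: int, y: int) -> int:
    """Hypothetical callee returning a two-field struct via the hidden buffer."""
    struct.pack_into("<ii", memory, ret_buf, x, y)
    return ret_buf  # the Wasm-level result is the address of the buffer


def caller(sp: int):
    """Hypothetical caller that owns the storage for the returned struct."""
    frame_size = 16
    sp -= frame_size       # reserve the caller's frame (stack grows down)
    ret_buf = sp + 0       # the buffer lives in the caller's frame
    addr = make_point(sp, ret_buf, 3, 4)
    return struct.unpack_from("<ii", memory, addr)
```

The buffer address is passed just after `sp`, and the callee hands the same address back as its result.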
|
|
||
| (TBD: ABI for vector types) | ||
|
|
||
| ### Prolog | ||
|
|
||
| The prolog will decrement the stack pointer to allocate the frame (the stack grows down), home any arguments that are stored on the linear stack, and zero initialize slots on the linear stack as appropriate. It will establish a frame pointer if one is needed. | ||
|
||
|
|
||
| It will also save a frame descriptor onto the stack, for use during GC and EH. For methods with EH or GC, a slot on the linear stack will be reserved for a "virtual IP" that will index into the EH and GC info to provide within-method information and allow external code to walk the managed stack frames. | ||
|
Member
Frame descriptor is GCInfo, yes?
Member
Author
It will also refer to the EH info, so a bit more general.
Member
Don't we need the stack frames to be linked together in order to provide support for walking managed frames? So, from the
Member
Author
Given the last frame's base address, the plan is that the frames are self-descriptive. To get the base address (using the scheme above where The frame descriptor will be at a known offset from this saved For dynamic-sized frames a copy of the prior If we follow Katelyn's proposal of keeping
Member
It is not clear what a "method with GC" means. Should this say for methods with calls or EH (ie non-leaf methods)?
Member
Author
Yes, methods with GC safe points (which will always be at calls) or EH. |
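One way to picture how the virtual IP slot and the frame descriptor cooperate is the following sketch. The `FrameDescriptor` layout, its field names, and the idea of passing it in directly (rather than reading it from the stack) are assumptions for illustration only; the real encoding of the GC and EH info is not specified here.

```python
import struct
from dataclasses import dataclass
from typing import Dict, List

memory = bytearray(4096)  # stand-in for Wasm linear memory


@dataclass
class FrameDescriptor:
    virtual_ip_offset: int               # where the frame keeps its virtual IP
    live_gc_slots: Dict[int, List[int]]  # virtual IP -> offsets of live GC slots


def report_live_refs(frame_base: int, desc: FrameDescriptor) -> List[int]:
    """Read the frame's virtual IP, then read the GC slots it marks as live."""
    (vip,) = struct.unpack_from("<I", memory, frame_base + desc.virtual_ip_offset)
    refs = []
    for off in desc.live_gc_slots.get(vip, []):
        (ref,) = struct.unpack_from("<I", memory, frame_base + off)
        refs.append(ref)
    return refs
```

External code such as the GC only needs the frame's base address and the descriptor to find the live references; nothing has to be read from Wasm locals.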
||
|
|
||
| ### Epilog | ||
|
|
||
| Generally the epilogs will be empty. There is no notion of callee-save registers in Wasm, and no other global state to update. | ||
|
|
||
| ## Outgoing call ABI | ||
|
|
||
| For direct managed calls, Wasm uses the Portable Entry Point feature to facilitate smooth interop with interpreted code. This means all managed calls are made indirectly, and the portable entry point is also passed as the last argument. | ||
|
Member
In the long run will we optimize this for cases where we know both the caller and callee were crossgen'd? I'm fine with not specifying that yet though.
Member
Author
Yes we can optimize that case. Though in general there is no guarantee the runtime will use R2R compiled callee code. On Wasm this may be less of an issue because the cases where R2R method bodies end up being disqualified may not be possible.
Member
Yes.
The fixups for the caller would have to verify that the directly called method is going to use R2R too. If the fixup fails, R2R code for the caller would have to be rejected as well. (We do something similar for ReJIT. If there is a ReJIT request for a method that got inlined, all methods that inlined it must be invalidated as well.) |
||
|
|
||
| The call sequence will then be | ||
| ``` | ||
| local.get sp | ||
| push arg 0 | ||
| ... | ||
| push arg N-1 | ||
| load PortableEntryPointPtr | ||
| dup | ||
| load CellIndex (from pe) | ||
|
||
| call_indirect <tableIndex> <sigIndex> (sig is: int32 (sp) arg0... argN-1 int32 (pe)) | ||
| ``` | ||
| Initially the cell will contain code to determine if the target method has R2R code or must be interpreted. If there is R2R code for the method it is fixed up as needed. Once the target is resolved the cell can be updated to just refer to the R2R code directly, if there is any, or to a thunk for invoking the interpreter. | ||
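The lazy resolution of a cell can be sketched as follows. This is a model under assumptions: the Python list `table` stands in for the Wasm function table, and `PortableEntryPoint`, `install_lazy_cell`, and `call_managed` are hypothetical names, not runtime APIs.

```python
from typing import Callable, List, Optional

table: List[Callable] = []  # stand-in for the Wasm function table


class PortableEntryPoint:
    def __init__(self, cell_index: int):
        self.cell_index = cell_index


def call_managed(pe: PortableEntryPoint, sp: int, *args):
    """Model of: push sp, push args, push pe, load CellIndex, call_indirect."""
    return table[pe.cell_index](sp, *args, pe)


def install_lazy_cell(r2r_body: Optional[Callable],
                      interpreter_thunk: Optional[Callable]) -> PortableEntryPoint:
    """Create a cell whose first call resolves the target and patches the cell."""
    index = len(table)

    def resolver(sp, *rest):
        pe = rest[-1]  # pe is always the last argument
        target = r2r_body if r2r_body is not None else interpreter_thunk
        table[pe.cell_index] = target  # later calls skip the resolver
        return target(sp, *rest)

    table.append(resolver)
    return PortableEntryPoint(index)
```

The first call through the cell pays for resolution; after the patch, the same `call_indirect` sequence at the call site reaches the resolved target directly, without any change to the caller's code.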
|
|
||
| For indirect managed calls the sequence is similar, but the portable entry point is obtained by calling a resolve helper: | ||
|
Member
I think this should say "virtual managed calls". Indirect calls ( Also, there is a potential optimization for vtable-based virtual calls to just fetch the entrypoint by indexing into vtable like we do everywhere else. |
||
| ``` | ||
| local.get sp | ||
| push arg 0 | ||
| ... | ||
| push arg N-1 | ||
| ... push args for resolution ... | ||
| call resolve | ||
| dup | ||
| load CellIndex (from pe) | ||
| call_indirect <tableIndex> <sigIndex> (sig is: int32 (sp) arg0... argN-1 int32 (pe)) | ||
| ``` | ||
| Because the `pe` arg must be passed to the portable entrypoint, all method signatures must reflect the extra final argument (even though it will be unused). Thus for example a managed method like `int F(int x)` will have a Wasm signature `(func (param i32 i32 i32) (result i32))`. | ||
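The signature rule can be written down as a small helper, using the Wasm value-type spelling `i32`. The function name is hypothetical, and the sketch assumes the parameter types have already been lowered to Wasm value types (it does not model hidden return buffers or by-reference structs).

```python
from typing import List, Optional


def wasm_signature(params: List[str], result: Optional[str]) -> str:
    """Wrap already-lowered parameter types with the leading sp and trailing pe."""
    lowered = ["i32"] + params + ["i32"]  # sp comes first, pe comes last
    sig = "(func (param " + " ".join(lowered) + ")"
    if result is not None:
        sig += " (result " + result + ")"
    return sig + ")"
```

For `int F(int x)`, `wasm_signature(["i32"], "i32")` yields `(func (param i32 i32 i32) (result i32))`.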
|
|
||
| Alternatively we may choose to pass the `pe` via a Wasm global. | ||
|
||
|
|
||
| Helper calls known to be native code can be called directly. The calling sequence must re-establish the `$__stack_pointer` global: | ||
|
|
||
| ``` | ||
| local.get sp | ||
| global.set $__stack_pointer | ||
|
|
||
| push arg 0 | ||
| ... | ||
| push arg N-1 | ||
| call <funcIndex> (sig is: arg0... argN-1) | ||
| ``` | ||
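The reason the global must be re-established can be illustrated with a small model. Everything here is an assumption for illustration: `Globals.stack_pointer` stands in for `$__stack_pointer`, and `native_helper` is a hypothetical helper that, like compiled C code, places its own frame below the global.

```python
class Globals:
    stack_pointer = 0  # stand-in for the $__stack_pointer global


def native_helper(a: int, b: int):
    """Hypothetical native helper: its scratch space sits below the global sp."""
    scratch = Globals.stack_pointer - 16
    return a + b, scratch


def call_native_helper(sp: int, helper, *args):
    """Model of the sequence above: publish sp, then call without sp or pe."""
    Globals.stack_pointer = sp  # local.get sp; global.set $__stack_pointer
    return helper(*args)
```

If the global were left stale (for instance still holding the value from the last native->managed transition), the helper's scratch space could overlap live managed frames; publishing the current `sp` first places it below them.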
|
|
||
| Helper calls that are managed use the managed calling sequence. | ||
|
|
||
| ## GC References at Call Sites | ||
|
|
||
| Wasm does not allow for outside access to the Wasm stack. So, before call sites that may trigger GC, all GC references live after the call (and all untracked GC references, which are effectively always live) must be saved to the linear stack. These GC references will be reported as pinned to the GC so that if they normally live in Wasm locals they do not need to be updated after the call. The live GC slots on the linear stack will be identified by the virtual IP (also stored on the linear stack) and the GC info (accessible from the frame descriptor, also on the linear stack). | ||
|
|
||
| So for example if we have code like `x(a, y(b)); ... a; ... b;` where `a` and `b` are gc refs that initially are in wasm locals, this fragment would compile into something like | ||
| ``` | ||
| ;; sp for call to x | ||
| local.get sp | ||
|
|
||
| ;; spill a to linear memory | ||
| local.get sp | ||
| local.get a | ||
| i32.store offset=(a's offset in gc area of stack) | ||
|
|
||
| ;; arg a for call to x | ||
| local.get a | ||
|
|
||
| ;; sp for call to y | ||
| local.get sp | ||
|
|
||
| ;; spill b to linear memory | ||
| local.get sp | ||
| local.get b | ||
| i32.store offset=(b's offset in gc area of stack) | ||
|
|
||
| ;; arg b for call to y | ||
| local.get b | ||
|
|
||
| ;; update virtual IP for call to y with live gc refs | ||
| local.get sp | ||
| i32.const virtual-ip-for-call-to-y (gc info : a and b slots live) | ||
| i32.store offset=(virtual-ip offset) | ||
|
|
||
| ;; fetch PE for y and cell index from PE, call y | ||
| load PortableEntryPointPtr for y | ||
| dup | ||
| load CellIndex (from pe) | ||
| call_indirect <tableIndex> <sigIndex> (sig is: int32 (sp) int32 int32 (pe) : returns int32) | ||
|
|
||
| ;; update virtual IP for call to x with live gc refs [can be optimized out] | ||
| local.get sp | ||
| i32.const virtual-ip-for-call-to-x (gc info : a and b slots live) | ||
| i32.store offset=(virtual-ip offset) | ||
|
|
||
| ;; fetch PE for x and cell index from PE, call x | ||
| load PortableEntryPointPtr for x | ||
| dup | ||
| load CellIndex (from pe) | ||
| call_indirect <tableIndex> <sigIndex> (sig is: int32 (sp) int32 int32 int32 (pe) : returns int32) | ||
| ``` | ||
| Notes: | ||
| * As an optimization, we can avoid updating the virtual IP when the GC/EH info it refers to is unchanged from the last update. | ||
| * We may want to un-nest calls, relying on a Wasm local instead of the Wasm stack to convey nested call results to the parent call. | ||
| * As an optimization, we will try to minimize storing GC refs to the linear stack (e.g., if the value already there hasn't changed since the last update). | ||
| * As an optimization, we may try to keep some GC refs primarily on the linear stack rather than in Wasm locals. | ||
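The first note, skipping redundant virtual IP updates, can be sketched as a tiny emitter. This is a simplification under stated assumptions: it handles straight-line code only, it assumes call sites with identical GC/EH info share one virtual IP value, and the emitted strings are pseudo-instructions in the style of the listing above rather than valid Wasm text.

```python
from typing import List, Optional, Tuple


def emit_calls(calls: List[Tuple[str, int]]) -> List[str]:
    """calls: (callee name, virtual IP) pairs in execution order."""
    out: List[str] = []
    last_vip: Optional[int] = None
    for callee, vip in calls:
        if vip != last_vip:
            # Only store the virtual IP when it differs from the last one stored.
            out += [
                "local.get sp",
                f"i32.const {vip}",
                "i32.store offset=<virtual-ip offset>",
            ]
            last_vip = vip
        out.append(f"call_indirect ;; {callee}")
    return out
```

Applied to the example above, the calls to `y` and `x` have the same live set, so a single virtual IP store before the call to `y` covers both.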
|
|
||
| ## Tail Calls | ||
|
|
||
| For tail calls the only differences are the use of `return_call_indirect` for the call, and passing the original `sp` value to the callee: | ||
| ``` | ||
| local.get sp | ||
| i32.const <frameSize> | ||
| i32.add | ||
|
|
||
| push arg 0 | ||
| ... | ||
| push arg N-1 | ||
| load PortableEntryPointPtr | ||
| dup | ||
| load CellIndex (from pe) | ||
| return_call_indirect <tableIndex> <sigIndex> (sig is: int32 (sp) arg0... argN-1 int32 (pe)) | ||
| ``` | ||
| and similarly for indirect managed calls. | ||
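The effect of passing the original `sp` can be checked with a short model. The function names are hypothetical, and the sketch assumes every method in the chain has the same fixed frame size and does no dynamic allocation.

```python
def prolog(entry_sp: int, frame_size: int) -> int:
    """Reserve a frame below the incoming sp (the stack grows down)."""
    return entry_sp - frame_size


def sp_for_tail_call(sp: int, frame_size: int) -> int:
    """Model of: local.get sp; i32.const <frameSize>; i32.add."""
    return sp + frame_size


def run_tail_chain(entry_sp: int, frame_size: int, depth: int) -> int:
    """Return the lowest sp reached over a chain of `depth` tail calls."""
    lowest = entry_sp
    sp_in = entry_sp
    for _ in range(depth):
        sp = prolog(sp_in, frame_size)
        lowest = min(lowest, sp)
        sp_in = sp_for_tail_call(sp, frame_size)  # callee reuses the caller's space
    return lowest
```

However long the chain is, the linear stack never extends more than one frame below the original entry `sp`, which mirrors what `return_call_indirect` does for the Wasm call stack.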
|
|
||
| ## PInvoke | ||
|
|
||
| TBD | ||
|
|
||
| ## Reverse PInvoke | ||
|
|
||
| TBD | ||
|
|
||
| ## Async | ||
|
|
||
| TBD | ||
|
|
||
| ## Interpreter Stubs | ||
|
|
||
| There will be stubs involved in both managed code->interpreter and interpreter->managed code calls. For R2R these will be per signature, generated by crossgen2. | ||
|
|
||
| ### Interpreted -> Managed | ||
|
|
||
| The interpreter->managed stub will load the global `$__stack_pointer`, then the method arguments from the interpreter stack, and finally `i32.const 0` for the final `pe` argument, which managed code ignores, and then call the managed method. The final `i32.const 0` can be omitted if the `pe` is passed via a Wasm global instead. | ||
|
|
||
| On return the global `$__stack_pointer` is reset to the value it had on stub entry. | ||
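The shape of such a stub can be sketched as below. This is a model, not the crossgen2 output: `Globals.stack_pointer` stands in for `$__stack_pointer`, the interpreter stack is a plain Python list with the arguments at its end, and the stub name is hypothetical.

```python
from typing import Callable, List


class Globals:
    stack_pointer = 0x8000  # stand-in for the $__stack_pointer global


def interp_to_managed_stub(managed_method: Callable,
                           interp_stack: List[int],
                           arg_count: int):
    """Load sp from the global, pass the args and a dummy pe, restore the global."""
    saved_sp = Globals.stack_pointer
    args = interp_stack[-arg_count:] if arg_count else []
    try:
        return managed_method(saved_sp, *args, 0)  # 0 is the ignored pe argument
    finally:
        Globals.stack_pointer = saved_sp  # reset to the value on stub entry
```

The restore matters because managed code may cross further managed->native boundaries during the call, each of which writes the global.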
|
|
||
| ### Managed -> Interpreted | ||
|
|
||
| This stub will be passed the current managed `sp` and must store it into the global `$__stack_pointer`. The interpreter stack (see above) will be extended with a new `InterpMethodContextFrame` frame, and arguments will be moved from Wasm locals to the frame. The `pe` argument will then be used to invoke the interpreter on the proper IL method body. | ||
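The opposite direction can be modeled in the same style. All names here are illustrative assumptions: `interp_frames` stands in for the interpreter stack, `interpret` for the interpreter entry point, and the `pe` is represented as a dictionary carrying a callable in place of an IL method body.

```python
from dataclasses import dataclass, field
from typing import List


class Globals:
    stack_pointer = 0  # stand-in for the $__stack_pointer global


@dataclass
class InterpMethodContextFrame:
    args: List[int] = field(default_factory=list)


interp_frames: List[InterpMethodContextFrame] = []  # stand-in interpreter stack


def interpret(pe, frame: InterpMethodContextFrame):
    """Hypothetical interpreter entry: run the IL body identified by pe."""
    return pe["il_body"](frame.args)


def managed_to_interp_stub(sp: int, *args_and_pe):
    """Publish sp, move the args into a new interpreter frame, then interpret."""
    *args, pe = args_and_pe             # pe is the last argument
    Globals.stack_pointer = sp          # the interpreter is native code
    frame = InterpMethodContextFrame(args=list(args))
    interp_frames.append(frame)
    try:
        return interpret(pe, frame)
    finally:
        interp_frames.pop()
```

Here the `pe` argument is no longer ignored: it is what tells the stub which method body to interpret.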
This means helper calls all become managed->native boundaries that require a stack pointer update at start and end, right? Is that a problem?
There is a code size cost for each helper call... it could be amortized with a custom wrapper for each helper that just does the `$__stack_pointer` maintenance.