Cranelift: implement "precise store traps" in presence of store-tearing hardware. #8221
Conversation
@afonso360, if you still have access to your RISC-V hardware and a bit of time to test, it would be useful to confirm that this changes the result of the "torn stores" test case from earlier! @candymate, it would be interesting to confirm that this resolves your fuzzbug as well, and this could be useful for further fuzzing for you.
```rust
    offset,
} => {
    let ty = covering_type_for_value(&pos.func, data);
    let _ = pos.ins().stack_load(ty, stack_slot, offset);
```
Stack stores are guaranteed to succeed when stack probing (or some other kind of stack size check) is used.
Subscribe to Label Action

cc @peterhuene

This issue or pull request has been labeled: "cranelift", "cranelift:meta", "wasmtime:api".
Thus the following users have been cc'd because of the following labels.
To subscribe or unsubscribe from this label, edit the
How does this only have a 2% perf impact? Is some optimization pass removing part of the loads, or is it the M2 chip being able to exploit a lot of instruction-level parallelism?
Probably the latter -- the loads are not removed, and make it all the way to machine code.

(More specifically, a load inserted by this pass could later be removed by alias analysis, but only if subsumed by another earlier load to the same address, so there is still always some load before the store.)

One other performance-related note: this will possibly have a worse impact when the pooling allocator with copy-on-write is enabled in Wasmtime, as the "first touch" to a page matters: if demanded in by a load first, a read-only mapping is created, which might immediately fault again on the store (doing the actual CoW). We've seen such "mapping change" operations become bottlenecks in multithreaded processes due to IPIs. I don't expect we would turn this option on in any production setting, though; rather, we'd ensure (as is already the case on x86) that any production hardware does not have the torn-write behavior. But that's something to consider if we ever discuss having this on by default in all cases.
I can confirm that this fixes the torn writes, at least on QEMU. 🎉 Interestingly, QEMU RISC-V only exhibits the torn write in this test when using the vector extensions to perform the write. I'm going to test this on real hardware over the weekend, and also see if I can run some benchmarks. For completeness, here's the WAST test that I used.

WAST Test
|
```rust
    let _ = pos.ins().load(ty, flags, addr, offset);
}
&InstructionData::StoreNoOffset {
    opcode: Opcode::AtomicStore,
```
I think atomics can be exempted here because they are already required to be aligned? At least at the wasm layer that's validated; I'm not sure about the Cranelift layer. This isn't needed for the wasm-level problem, though.
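To see why alignment matters here, a small sketch (assuming a 4 KiB page size; `straddles_page` is a hypothetical helper for illustration, not part of Cranelift): a naturally aligned access can never cross a page boundary, so it either traps entirely or succeeds entirely, and tearing cannot be observed.

```rust
// Hypothetical helper for illustration; not Cranelift code.
const PAGE_SIZE: u64 = 4096;

/// A `size`-byte access straddles a page boundary iff its first and
/// last bytes fall on different pages.
fn straddles_page(addr: u64, size: u64) -> bool {
    addr / PAGE_SIZE != (addr + size - 1) / PAGE_SIZE
}

fn main() {
    // An unaligned 8-byte store ending one byte past a page boundary
    // touches two pages, so it can tear.
    assert!(straddles_page(4093, 8));
    // A naturally aligned access (addr % size == 0) always stays within
    // one page, because PAGE_SIZE is a multiple of the access size.
    for size in [1u64, 2, 4, 8, 16] {
        for addr in (0..2 * PAGE_SIZE).step_by(size as usize) {
            assert!(!straddles_page(addr, size));
        }
    }
    println!("aligned accesses never straddle a page");
}
```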
```rust
    Opcode::Istore32 => I32,
    _ => unreachable!(),
};
let _ = pos.ins().load(ty, flags, addr, offset);
```
Can this be skipped if `flags.trap_code().is_none()`, or if `flags.aligned()`? Only possibly-trapping, non-aligned loads would need it (which is all wasm loads, but not anything to a vmctx, for example).
I fully assume that I'm missing something and the answer is no, but just in case: would it not also work to first attempt the part of the store operation touching the highest page, and then do the rest of the write? ISTM that if any writes would fail, that'd be guaranteed to include the highest page, and if that one succeeds, a write to all lower pages for the rest of the store operation is certain to succeed as well?
The problem I see with that approach is that we only know that dynamically: consider if we have a 64-bit store (allowed to be unaligned) to address |
Hmm, I see, I think. Could we then instead write a single byte (of whatever value) at the highest address, and do the actual store if that succeeded? |
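The single-byte-probe idea relies on a Wasm linear memory's accessible bytes forming a prefix [0, len): if the highest-addressed byte of a store is in bounds, every lower byte is too. A minimal sketch with hypothetical helper names (not Wasmtime code):

```rust
/// For a Wasm linear memory, accessible bytes form a prefix [0, len),
/// so a `size`-byte store at `addr` is fully in bounds iff its
/// highest-addressed byte, `addr + size - 1`, is in bounds.
/// (Assumes size >= 1.)
fn last_byte(addr: u64, size: u64) -> Option<u64> {
    addr.checked_add(size - 1)
}

fn in_bounds(addr: u64, size: u64, mem_len: u64) -> bool {
    match last_byte(addr, size) {
        Some(last) => last < mem_len,
        None => false, // address computation overflows: out of bounds
    }
}

fn main() {
    let mem_len = 65536; // one Wasm page
    // Fully in bounds: last byte is 65535 < 65536.
    assert!(in_bounds(65528, 8, mem_len));
    // Partially out of bounds: the first byte is valid but the last is
    // not, so probing only the last byte suffices to detect the trap.
    assert!(!in_bounds(65529, 8, mem_len));
}
```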
I think that would work, yeah. There are some other interesting performance tradeoffs here. Two that come to mind are "partial-store forwarding" (partially overlapping loads/stores can be problematic for uarch memory performance) and the fact that stores are in general more expensive than loads (store buffers typically have fewer entries than load buffers, so they can stall the OoO window, and stores take extra effort at retire). Balanced against all the other considerations here (we want something we could turn on by default on tier-2 architecture(s) if needed, and we're less concerned about interactions with IPIs in the pooling allocator, where we're more likely to be very careful to select the right configuration and avoid this mechanism), I think I'd still prefer the load+store over narrow-store+store.
That all makes sense to me. I was mainly concerned about the impact on the pooling allocator, and the missing piece was the ability to select a configuration that'd allow us to avoid any of this. |
Ah yes, indeed. This is only a problem on aarch64 and riscv64, and AFAICT all aarch64 hardware one would want to run a high-performance server on (Apple implementations, AWS machines, Ampere CPUs, even the RPi4) does not exhibit store-tearing, so on aarch64 it's mostly theoretical and/or limited to very small (low-power / embedded) cores, I think.
I can confirm that this fix also works on real hardware. I ran a few benchmarks at random, and it does have a bit of impact on smaller cores (as expected). Most benchmarks are 5-10% slower, but there is one outlier at 40%. The results below are on a dual-issue, in-order core (SiFive U74).

Results
|
I'm sorry that I saw this just now - I'll check this in a day or two and let you know. Tiny update: I'm sorry to say this, but I'm not available until 5/1; I'll look at this again then. I'm not sure about this, but I think I observed that this problem persists with QEMU 8.1.5 - I need to recheck.
As discussed in #7237 and WebAssembly/design#1490, some instruction-set architectures do not guarantee that a store that "partially traps" (overlaps multiple pages, only one of which disallows the store) does not also have some side effects. In particular, the part of the store that is legal might succeed.
This has fascinating implications for virtual memory-based WebAssembly heap implementations: when a store is partially out-of-bounds, it should trap (there is no "partially" here: if the last byte is out of bounds, the store is out of bounds). A trapping store should not alter the final memory state, which is observable by the outside world after the trap. Yet, the simple lowering of a Wasm store to a machine store instruction could violate this expectation, in the presence of "store tearing" as described above.
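As an illustration of the semantic difference (a toy model, not Wasmtime's actual heap implementation): a torn store can leave a partially written prefix behind after the trap, while a store that traps precisely leaves memory untouched.

```rust
/// Toy model of a machine store into a linear memory of `mem.len()`
/// bytes. A "torn" store writes the in-bounds prefix of the bytes
/// before trapping.
fn torn_store(mem: &mut [u8], addr: usize, bytes: &[u8]) -> Result<(), ()> {
    for (i, b) in bytes.iter().enumerate() {
        match mem.get_mut(addr + i) {
            Some(slot) => *slot = *b,
            None => return Err(()), // trap, but earlier bytes already landed
        }
    }
    Ok(())
}

/// A "precise" store either writes everything or nothing: the bounds of
/// the whole access are established before any byte is mutated.
fn precise_store(mem: &mut [u8], addr: usize, bytes: &[u8]) -> Result<(), ()> {
    if addr.checked_add(bytes.len()).map_or(true, |end| end > mem.len()) {
        return Err(()); // trap with no side effects
    }
    mem[addr..addr + bytes.len()].copy_from_slice(bytes);
    Ok(())
}

fn main() {
    let mut m1 = vec![0u8; 4];
    let mut m2 = vec![0u8; 4];
    let data = [0xAA; 4];
    // A store at addr 2 overlaps the end of memory: both must trap...
    assert!(torn_store(&mut m1, 2, &data).is_err());
    assert!(precise_store(&mut m2, 2, &data).is_err());
    // ...but only the torn store leaves an observable partial write.
    assert_eq!(m1, [0, 0, 0xAA, 0xAA]); // tearing visible after the trap
    assert_eq!(m2, [0, 0, 0, 0]);       // no side effect
}
```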
Neither ARMv8 (aarch64) nor RISC-V guarantee lack of store-tearing, and we have observed it in tests on RISC-V.
This PR implements the idea first proposed here, namely to prepend a load of the same size to every store. The idea is that if the store will trap, the load will as well; and precise exception handling, a well-established principle in all modern ISAs, guarantees that instructions beyond a trapping instruction have no effect.
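The shape of the transformation can be sketched on a toy IR (hypothetical types for illustration, not Cranelift's actual `InstructionData`): before every store, insert a load of the same size from the same address, so any trap fires on the side-effect-free load first.

```rust
// Toy IR, for illustration only.
#[derive(Debug, Clone, PartialEq)]
enum Inst {
    Load { addr: u32, size: u8 },
    Store { addr: u32, size: u8 },
    Other,
}

/// Prepend, before every store, a load of the same size from the same
/// address. If the store would trap, the load traps first, and precise
/// exception handling guarantees the store then has no effect.
fn ensure_precise_store_traps(body: &[Inst]) -> Vec<Inst> {
    let mut out = Vec::with_capacity(body.len());
    for inst in body {
        if let Inst::Store { addr, size } = inst {
            out.push(Inst::Load { addr: *addr, size: *size });
        }
        out.push(inst.clone());
    }
    out
}

fn main() {
    let body = [Inst::Other, Inst::Store { addr: 16, size: 8 }];
    let out = ensure_precise_store_traps(&body);
    assert_eq!(
        out,
        vec![
            Inst::Other,
            Inst::Load { addr: 16, size: 8 },
            Inst::Store { addr: 16, size: 8 },
        ]
    );
}
```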
This is off by default, and is mainly meant as an option to study the impact of this idea and to allow for precise trap execution semantics on affected machines unless/until the spec is clarified.
On an Apple M2 Pro machine (aarch64), this was measured to have a 2% performance impact when running `spidermonkey.wasm` with a simple recursive Fibonacci program. It can be used via the `-Ccranelift-ensure_precise_store_traps=true` flag to Wasmtime.