Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Cranelift: implement "precise store traps" in presence of store-tearing hardware. #8221

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

cfallin
Copy link
Member

@cfallin cfallin commented Mar 22, 2024

As discussed at in #7237 and WebAssembly/design#1490, some instruction-set architectures do not guarantee that a store that "partially traps" (overlaps multiple pages, only one of which disallows the store) does not also have some side-effects. In particular, the part of the store that is legal might succeed.

This has fascinating implications for virtual memory-based WebAssembly heap implementations: when a store is partially out-of-bounds, it should trap (there is no "partially" here: if the last byte is out of bounds, the store is out of bounds). A trapping store should not alter the final memory state, which is observable by the outside world after the trap. Yet, the simple lowering of a Wasm store to a machine store instruction could violate this expectation, in the presence of "store tearing" as described above.

Neither ARMv8 (aarch64) nor RISC-V guarantee lack of store-tearing, and we have observed it in tests on RISC-V.

This PR implements the idea first proposed here, namely to prepend a load of the same size to every store. The idea is that if the store will trap, the load will as well; and precise exception handling, a well-established principle in all modern ISAs, guarantees that instructions beyond a trapping instruction have no effect.

This is off by default, and is mainly meant as an option to study the impact of this idea and to allow for precise trap execution semantics on affected machines unless/until the spec is clarified.

On an Apple M2 Pro machine (aarch64), this was measured to have a 2% performance impact when running spidermonkey.wasm with a simple recursive Fibonacci program. It can be used via the -Ccranelift-ensure_precise_store_traps=true flag to Wasmtime.

@cfallin cfallin requested review from a team as code owners March 22, 2024 17:31
@cfallin cfallin requested review from elliottt and fitzgen and removed request for a team March 22, 2024 17:31
@cfallin
Copy link
Member Author

cfallin commented Mar 22, 2024

@afonso360, if you still have access to your RISC-V hardware and a bit of time to test, it would be useful to confirm that this changes the result of the "torn stores" test case from earlier! @candymate, it would be interesting to confirm that this resolves your fuzzbug as well, and this could be useful for further fuzzing for you.

…ng hardware.

As discussed at in bytecodealliance#7237 and WebAssembly/design#1490, some
instruction-set architectures do not guarantee that a store that
"partially traps" (overlaps multiple pages, only one of which disallows
the store) does not also have some side-effects. In particular, the part
of the store that *is* legal might succeed.

This has fascinating implications for virtual memory-based WebAssembly
heap implementations: when a store is partially out-of-bounds, it should
trap (there is no "partially" here: if the last byte is out of bounds,
the store is out of bounds). A trapping store should not alter the final
memory state, which is observable by the outside world after the trap.
Yet, the simple lowering of a Wasm store to a machine store instruction
could violate this expectation, in the presence of "store tearing" as
described above.

Neither ARMv8 (aarch64) nor RISC-V guarantee lack of store-tearing, and
we have observed it in tests on RISC-V.

This PR implements the idea first proposed [here], namely to prepend a
load of the same size to every store. The idea is that if the store will
trap, the load will as well; and precise exception handling, a
well-established principle in all modern ISAs, guarantees that
instructions beyond a trapping instruction have no effect.

This is off by default, and is mainly meant as an option to study the
impact of this idea and to allow for precise trap execution semantics on
affected machines unless/until the spec is clarified.

On an Apple M2 Max machine (aarch64), this was measured to have a 2%
performance impact when running `spidermonkey.wasm` with a simple
recursive Fibonacci program. It can be used via the
`-Ccranelift-ensure_precise_store_traps=true` flag to Wasmtime.

[here]:
WebAssembly/design#1490 (comment)
@cfallin cfallin force-pushed the a-load-a-day-keeps-the-torn-stores-away branch from 7d6193e to 18e2a57 Compare March 22, 2024 17:40
offset,
} => {
let ty = covering_type_for_value(&pos.func, data);
let _ = pos.ins().stack_load(ty, stack_slot, offset);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Stack stores are guaranteed to succeed when stack probing (or some other kind of stack size check) is used.

@github-actions github-actions bot added cranelift Issues related to the Cranelift code generator cranelift:meta Everything related to the meta-language. wasmtime:api Related to the API of the `wasmtime` crate itself labels Mar 22, 2024
Copy link

Subscribe to Label Action

cc @peterhuene

This issue or pull request has been labeled: "cranelift", "cranelift:meta", "wasmtime:api"

Thus the following users have been cc'd because of the following labels:

  • peterhuene: wasmtime:api

To subscribe or unsubscribe from this label, edit the .github/subscribe-to-label.json configuration file.

Learn more.

@bjorn3
Copy link
Contributor

bjorn3 commented Mar 22, 2024

this was measured to have a 2% performance impact

How does this only have a 2% perf impact? Is some optimization pass removing part of the loads or is it the M2 chip being able to exploit a lot of instruction level parallelism?

@cfallin
Copy link
Member Author

cfallin commented Mar 22, 2024

Probably the latter -- the loads are not removed, and make it all the way to machine code.

@cfallin
Copy link
Member Author

cfallin commented Mar 22, 2024

(More specifically, a load inserted by this pass could later be removed by alias analysis, but only if subsumed by another earlier load to the same address, so there is still always some load before the store.)

@cfallin
Copy link
Member Author

cfallin commented Mar 22, 2024

One other performance-related note: this will possibly have worse impact when the pooling allocator with copy-on-write is enabled in Wasmtime, as the "first touch" to a page matters: if demanded in by a load first, a read-only mapping is created, which might immediately fault again on the store (doing the actual CoW). We've seen such "mapping change" operations become bottlenecks in multithreaded processes due to IPIs. I don't expect we would turn this option on in any production setting though; rather we'd ensure (as is already the case on x86) any production hardware does not have the torn-write behavior. But that's something to consider if we ever discuss having this on by default in all cases.

@afonso360
Copy link
Contributor

I can confirm that this fixes the torn writes at least on QEMU. 🎉

Interestingly QEMU RISC-V only shows this torn write test when using the vector extensions to perform the write (v128), but not using the normal i64.store.

I'm going to test this on real hardware over the weekend, and also see if I can run some benchmarks.

For completeness here's the WAST test that I used.

WAST Test
(module
  (memory 1)
  (func (export "i128.store") (param i32 i64)
    local.get 0
    v128.const i32x4 0xffffffff 0xffffffff 0xffffffff 0xffffffff
    v128.store align=1)
  (func (export "i32.load8_u") (param i32) (result i32)
    local.get 0
    i32.load8_u))

(assert_trap (invoke "i128.store"
                     (i32.const 65529)
                     (i64.const 0xffffffffffffffff))
             "out of bounds memory access")

;; Partial bytes were not written.
(assert_return (invoke "i32.load8_u" (i32.const 65529))
               (i32.const 0))
(assert_return (invoke "i32.load8_u" (i32.const 65530))
               (i32.const 0))
(assert_return (invoke "i32.load8_u" (i32.const 65531))
               (i32.const 0))
(assert_return (invoke "i32.load8_u" (i32.const 65532))
               (i32.const 0))
(assert_return (invoke "i32.load8_u" (i32.const 65533))
               (i32.const 0))
(assert_return (invoke "i32.load8_u" (i32.const 65534))
               (i32.const 0))
(assert_return (invoke "i32.load8_u" (i32.const 65535))
               (i32.const 0))

let _ = pos.ins().load(ty, flags, addr, offset);
}
&InstructionData::StoreNoOffset {
opcode: Opcode::AtomicStore,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think atomics can be exempted here because they are already required to be aligned? At least at the wasm layer that's validated, I'm not sure about the Cranelift layer. This isn't needed for the wasm-level problem though.

Opcode::Istore32 => I32,
_ => unreachable!(),
};
let _ = pos.ins().load(ty, flags, addr, offset);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can this be skipped if flags.trap_code().is_none() or if flags.aligned()? Only possibly-trapping non-aligned loads (which is all wasm loads, but not anything to a vmctx for example)

@tschneidereit
Copy link
Member

I fully assume that I'm missing something and the answer is no, but just in case: would it not also work to first attempt to do the part of the store operation touching the highest page, and then do the rest of the write? ISTM that if any writes would fail that'd be guaranteed to include the highest page, and if that one succeeds, a write to all lower pages for the rest of the store operation is certain to succeed as well?

@cfallin
Copy link
Member Author

cfallin commented Mar 22, 2024

The problem I see with that approach is that we only know that dynamically: consider if we have a 64-bit store (allowed to be unaligned) to address X, we would have to decompose it into bytes and perform them in reverse order (store X+7, then X+6, then ...) always, because we know nothing about X's alignment. A static analysis could sometimes know X's alignment, but usually not, e.g. if it is a value loaded from memory itself. So that is a store-amplification factor of 8 for 64-bit ops, 4 for 32-bit ops, etc, likely to have quite a big perf impact, I think.

@tschneidereit
Copy link
Member

Hmm, I see, I think. Could we then instead write a single byte (of whatever value) at the highest address, and do the actual store if that succeeded?

@cfallin
Copy link
Member Author

cfallin commented Mar 22, 2024

I think that would work, yeah. There are some other interesting performance tradeoffs here -- two that come to mind are "partial-store forwarding" (partially overlapping loads/stores can be problematic for uarch memory perf) and the fact that stores in general are more expensive than loads (store buffers typically fewer entries than load buffers so can stall the OoO window, stores take extra effort at retire) that, balanced against all the other considerations here (we want something we could turn on by default on tier-2 architecture(s) if needed, and aren't as concerned with interactions with IPIs in pooling allocator where we're more likely to be very careful to select the right configuration and avoid this mechanism), I think I'd still prefer the load+store over narrow-store+store.

@tschneidereit
Copy link
Member

That all makes sense to me. I was mainly concerned about the impact on the pooling allocator, and the missing piece was the ability to select a configuration that'd allow us to avoid any of this.

@cfallin
Copy link
Member Author

cfallin commented Mar 22, 2024

Ah yes, indeed. This is only a problem on aarch64 and riscv64, and AFAICT all aarch64 hardware one would want to run a high-performance server on (Apple implementations, AWS machines, Ampere CPUs, even RPi4) do not exhibit store-tearing, so on aarch64 it's mostly theoretical and/or on very small (low-power / embedded) cores I think.

@afonso360
Copy link
Contributor

I can confirm that this fix also works in real hardware. I ran a few benchmarks at random, and it does have a bit of impact on smaller cores (as expected). Most benchmarks are 5-10% slower, but there is one outlier at 40%

The results below are on a Dual-Issue In Order Core (Sifive u74).

Results
execution :: cycles :: benchmarks/spidermonkey/benchmark.wasm

  Δ = 1459784.00 ± 60785.14 (confidence = 99%)

  noload.so is 1.07x to 1.07x faster than load.so!

  [22663159 22821360.40 22862178] load.so
  [21326450 21361576.40 21405304] noload.so

compilation :: cycles :: benchmarks/spidermonkey/benchmark.wasm

  Δ = 2560349.00 ± 1538031.16 (confidence = 99%)

  noload.so is 1.01x to 1.04x faster than load.so!

  [110537308 111942491.90 113530980] load.so
  [108818829 109382142.90 111609052] noload.so

instantiation :: cycles :: benchmarks/spidermonkey/benchmark.wasm

  No difference in performance.

  [3465 3954.40 6169] load.so
  [3444 4122.00 6204] noload.so

  execution :: cycles :: benchmarks/bz2/benchmark.wasm

  Δ = 92094.60 ± 29972.88 (confidence = 99%)

  noload.so is 1.06x to 1.12x faster than load.so!

  [1100973 1120327.60 1169522] load.so
  [1019878 1028233.00 1083605] noload.so

compilation :: cycles :: benchmarks/bz2/benchmark.wasm

  No difference in performance.

  [2281318 2391173.30 2505976] load.so
  [2212191 2312207.60 2376594] noload.so

instantiation :: cycles :: benchmarks/bz2/benchmark.wasm

  No difference in performance.

  [1028 1098.00 1448] load.so
  [1050 1103.10 1170] noload.so

  execution :: cycles :: benchmarks/pulldown-cmark/benchmark.wasm

  Δ = 66417.80 ± 543.64 (confidence = 99%)

  noload.so is 1.39x to 1.40x faster than load.so!

  [233706 234547.90 235270] load.so
  [167648 168130.10 168742] noload.so

compilation :: cycles :: benchmarks/pulldown-cmark/benchmark.wasm

  No difference in performance.

  [4541616 4965704.40 5992445] load.so
  [4423989 4751356.50 5875859] noload.so

instantiation :: cycles :: benchmarks/pulldown-cmark/benchmark.wasm

  No difference in performance.

  [1714 2113.50 2415] load.so
  [1798 2135.50 2299] noload.so

  
compilation :: cycles :: benchmarks/blake3-scalar/benchmark.wasm

  Δ = 44240.30 ± 40968.63 (confidence = 99%)

  noload.so is 1.00x to 1.03x faster than load.so!

  [2848804 2880309.90 2923861] load.so
  [2803879 2836069.60 2912357] noload.so

instantiation :: cycles :: benchmarks/blake3-scalar/benchmark.wasm

  No difference in performance.

  [1186 1391.10 1495] load.so
  [1185 1445.70 1604] noload.so

execution :: cycles :: benchmarks/blake3-scalar/benchmark.wasm

  No difference in performance.

  [9675 9782.10 10005] load.so
  [9572 9637.80 9814] noload.so

  instantiation :: cycles :: benchmarks/regex/benchmark.wasm

  Δ = 340.10 ± 243.11 (confidence = 99%)

  noload.so is 1.03x to 1.19x faster than load.so!

  [2939 3406.10 3681] load.so
  [2992 3066.00 3281] noload.so

execution :: cycles :: benchmarks/regex/benchmark.wasm

  Δ = 245010.10 ± 1279.07 (confidence = 99%)

  noload.so is 1.06x to 1.06x faster than load.so!

  [4076050 4077873.90 4079406] load.so
  [3831298 3832863.80 3834736] noload.so

compilation :: cycles :: benchmarks/regex/benchmark.wasm

  Δ = 369482.20 ± 93448.03 (confidence = 99%)

  noload.so is 1.03x to 1.04x faster than load.so!

  [10706461 10872794.20 10932079] load.so
  [10360994 10503312.00 10590922] noload.so

@elliottt elliottt removed their request for review March 27, 2024 00:08
@candymate
Copy link

candymate commented Apr 4, 2024

@afonso360, if you still have access to your RISC-V hardware and a bit of time to test, it would be useful to confirm that this changes the result of the "torn stores" test case from earlier! @candymate, it would be interesting to confirm that this resolves your fuzzbug as well, and this could be useful for further fuzzing for you.

I'm sorry that I saw this just now - I'll check this in a day or two from now and let you know.

Tiny update: I'm sorry to say this, but I'm not available till 5/1. I'll see this again at that time.

I'm not sure about this, but I think I observed that this problem persists with QEMU 8.1.5 - need recheck...

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cranelift:meta Everything related to the meta-language. cranelift Issues related to the Cranelift code generator wasmtime:api Related to the API of the `wasmtime` crate itself
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants