Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Partially out-of-bounds writes on ARM and riscv #7237

Open
fitzgen opened this issue Oct 13, 2023 · 17 comments
Open

Partially out-of-bounds writes on ARM and riscv #7237

fitzgen opened this issue Oct 13, 2023 · 17 comments

Comments

@fitzgen
Copy link
Member

fitzgen commented Oct 13, 2023

Apparently on ARM (edit: and riscv) when

  • a store is unaligned
  • and crosses a page boundary
  • and one page faults but the other doesn't

then there is no guarantee that the part of the store to the page that didn't fault will not succeed, despite the other part raising a signal.

This means that partially out-of-bounds Wasm stores that trigger a trap can potentially mutate memory for the in-bounds portion of the write, which is not spec compliant.

Apparently it is implementation-defined behavior, so it may or may not be an issue on any given ARM machine.

Thus far, @cfallin tested on the ARM machines he has access to and none of the following have failed the attached test case:

  • Apple M2
  • RPi4
  • the BA ARM server ("Neoverse N1 core, I think, in an Ampere Altra CPU")

The test case does fail on the following machines we have tested:

Test Case

(module
  (memory 1)
  (func (export "i64.store") (param i32 i64)
    local.get 0
    local.get 1
    i64.store)
  (func (export "i32.load8_u") (param i32) (result i32)
    local.get 0
    i32.load8_u))

(assert_trap (invoke "i64.store"
                     (i32.const 65529)
                     (i64.const 0xffffffffffffffff))
             "out of bounds memory access")

;; Partial bytes were not written.
(assert_return (invoke "i32.load8_u" (i32.const 65529))
               (i32.const 0))
(assert_return (invoke "i32.load8_u" (i32.const 65530))
               (i32.const 0))
(assert_return (invoke "i32.load8_u" (i32.const 65531))
               (i32.const 0))
(assert_return (invoke "i32.load8_u" (i32.const 65532))
               (i32.const 0))
(assert_return (invoke "i32.load8_u" (i32.const 65533))
               (i32.const 0))
(assert_return (invoke "i32.load8_u" (i32.const 65534))
               (i32.const 0))
(assert_return (invoke "i32.load8_u" (i32.const 65535))
               (i32.const 0))

See Also

@fitzgen
Copy link
Member Author

fitzgen commented Oct 13, 2023

Modulo relaxing the spec, the only 100% sure way to fix this that I am aware of would be to use explicit bounds checks for all stores on ARM.

@fitzgen
Copy link
Member Author

fitzgen commented Oct 13, 2023

Oh also, we should double check whether risc-v and s390x give us the stronger guarantees we need to match wasm semantics.

cc @uweigand: happen to know the answer for s390x?

@cfallin
Copy link
Member

cfallin commented Oct 13, 2023

I've been trying to figure out the state of RISC-V's specs on this matter. The privileged-mode spec includes details on page translation, but the translation algorithm and surrounding text (section 4.3.2) talk about a single address being translated, and don't seem to mention what happens when a single usermode instruction requires multiple translations. The user-mode spec is pretty high-level and only talks about "invisible traps", including page-faults, that can happen during execution of an instruction but are unobservable. So it seems to be a bit underspecified.

@cfallin
Copy link
Member

cfallin commented Oct 13, 2023

Two other possible fixes:

  • Somehow detect if the underlying platform actually tears partially-faulting writes, and only enable bounds-checks if so;
  • Disallow access to the memory on which the fault occurred, after-the-fact. This would still then be spec-compliant during Wasm execution, we just don't provide the final memory state as an output on affected systems with the "fast" (guard-page) configuration. I'm not sure exactly how to enforce this (maybe unmap the whole thing at trap-termination?) but perhaps there's a way. Or just document it.

@uweigand
Copy link
Member

Oh also, we should double check whether risc-v and s390x give us the stronger guarantees we need to match wasm semantics.

cc @uweigand: happen to know the answer for s390x?

For most instructions on s390x, the architecture guarantees that they either complete successfully, or else are "nullified" or "suppressed", which means none of the output operands (either in memory or in registers) are changed at all. This also applies to the case of an unaligned store crossing a page boundary. [ The one side effect allowed by the architecture is that a nullified or suppressed instruction may cause the hardware change bit of a page containing (part of) a memory output operand to be set, even if the output actually is not modified. This doesn't have any observable effect on a wasm program, though. ]

The exception to this rule are so-called "interruptible" instructions; these are long-running operations (e.g. copying a large block of memory). Those are allowed to "partially complete", e.g. after copying only part of the total memory block. They will indicate this outcome by modifying the output registers and condition code accordingly. If such an instruction incurs a page fault halfway through, partial completion will usually result. But those instructions are not generated by the wasmtime s390x backend (certainly not to implement a single wasm store).

@afonso360
Copy link
Contributor

afonso360 commented Oct 15, 2023

For RISC-V it looks like this was recently clarified in the ISA (see riscv/riscv-isa-manual#1119). The ISA now states:

On some implementations, misaligned loads, stores, and instruction
fetches may also be decomposed into multiple accesses, some of which may
succeed before a page-fault exception occurs. In particular, a
portion of a misaligned store that passes the exception check may become
visible, even if another portion fails the exception check. The same behavior
may manifest for stores wider than XLEN bits (e.g., the FSD instruction
in RV32D), even when the store address is naturally aligned.

Edit: By the way, I ran the above test on a JH7110 (SiFive u74 core) and it failed with:

Error: failed to run script file './test.wast'

Caused by:
    0: failed directive on ./test.wast:17:1
    1: result 0 didn't match
    2: expected                  0 / 0x0000000000000000
       actual                  255 / 0x00000000000000ff

So it looks like this actually happens for RISC-V cores.

@sparker-arm
Copy link
Contributor

Somehow detect if the underlying platform actually tears partially-faulting writes, and only enable bounds-checks if so.

Just FYI, this is what V8 does, which is on arm64 MacOS.

@cfallin
Copy link
Member

cfallin commented Oct 16, 2023

Somehow detect if the underlying platform actually tears partially-faulting writes, and only enable bounds-checks if so.

Just FYI, this is what V8 does, which is on arm64 MacOS.

It seems like given @akirilov-arm's reply here, V8's behavior is also technically non-conforming then: if the platform doesn't guarantee consistency in its behavior, then one can't test and choose a strategy at startup and rely on it to remain working.

@fitzgen
Copy link
Member Author

fitzgen commented Oct 16, 2023

So it looks like this actually happens for RISC-V cores.

Thanks for testing @afonso360!

Will update the issue title to reflect that this is both an ARM and riscv issue.

@fitzgen
Copy link
Member Author

fitzgen commented Oct 16, 2023

For most instructions on s390x, the architecture guarantees that they either complete successfully, or else are "nullified" or "suppressed",

Thanks @uweigand!

@fitzgen fitzgen changed the title Partially out-of-bounds writes on ARM Partially out-of-bounds writes on ARM and riscv Oct 16, 2023
@sparker-arm
Copy link
Contributor

if the platform doesn't guarantee consistency in its behavior

The Arm architecture doesn't give you any guarantees, but if a specific OS only runs on specific microarchitectures, that do provide the semantics, then it will be conforming. But, of course, there's no guarantee that future microarchitectures on that platform will continue to provide the necessary semantics.

@bjorn3
Copy link
Contributor

bjorn3 commented Oct 17, 2023

It also has the issue of migrating between cpu's. Whether for live vm migration, or for something like rr.

@alexcrichton
Copy link
Member

This was rediscovered recently with fuzzing

@cfallin
Copy link
Member

cfallin commented Mar 22, 2024

I'll copy here the idea I posted in the Wasm design thread, which AFAIK is the lowest-overhead potential solution today if the spec doesn't change: insert a load of the same size before every store. This is correct in the sense that if we're about to store to an address, a load to that same address should be legal, and if it traps the store would have trapped; and prevents side-effects because if one instruction occurs before another and traps, the side-effects of the latter must not occur at all, architecturally at least (here we're not thinking about speculation; also there was some discussion in the other thread about memory models and atomics, but this is fully within a single thread and relies only on instruction order in that thread). I think I will prototype this idea just to see what the overhead is, and to allow folks with write-tearing machines to test.

@sparker-arm
Copy link
Contributor

I will be very interested in hearing the results! I've currently got a prototype patch for V8 which enables trap handling for byte stores and for any store for which we know is aligned, and that gives a 2% uplift in some benchmarks.

cfallin added a commit to cfallin/wasmtime that referenced this issue Mar 22, 2024
…ng hardware.

As discussed at in bytecodealliance#7237 and WebAssembly/design#1490, some
instruction-set architectures do not guarantee that a store that
"partially traps" (overlaps multiple pages, only one of which disallows
the store) does not also have some side-effects. In particular, the part
of the store that *is* legal might succeed.

This has fascinating implications for virtual memory-based WebAssembly
heap implementations: when a store is partially out-of-bounds, it should
trap (there is no "partially" here: if the last byte is out of bounds,
the store is out of bounds). A trapping store should not alter the final
memory state, which is observable by the outside world after the trap.
Yet, the simple lowering of a Wasm store to a machine store instruction
could violate this expectation, in the presence of "store tearing" as
described above.

Neither ARMv8 (aarch64) nor RISC-V guarantee lack of store-tearing, and
we have observed it in tests on RISC-V.

This PR implements the idea first proposed [here], namely to prepend a
load of the same size to every store. The idea is that if the store will
trap, the load will as well; and precise exception handling, a
well-established principle in all modern ISAs, guarantees that
instructions beyond a trapping instruction have no effect.

This is off by default, and is mainly meant as an option to study the
impact of this idea and to allow for precise trap execution semantics on
affected machines unless/until the spec is clarified.

On an Apple M2 Max machine (aarch64), this was measured to have a 2%
performance impact when running `spidermonkey.wasm` with a simple
recursive Fibonacci program. It can be used via the
`-Ccranelift-ensure_precise_store_traps=true` flag to Wasmtime.

[here]:
WebAssembly/design#1490 (comment)
cfallin added a commit to cfallin/wasmtime that referenced this issue Mar 22, 2024
…ng hardware.

As discussed at in bytecodealliance#7237 and WebAssembly/design#1490, some
instruction-set architectures do not guarantee that a store that
"partially traps" (overlaps multiple pages, only one of which disallows
the store) does not also have some side-effects. In particular, the part
of the store that *is* legal might succeed.

This has fascinating implications for virtual memory-based WebAssembly
heap implementations: when a store is partially out-of-bounds, it should
trap (there is no "partially" here: if the last byte is out of bounds,
the store is out of bounds). A trapping store should not alter the final
memory state, which is observable by the outside world after the trap.
Yet, the simple lowering of a Wasm store to a machine store instruction
could violate this expectation, in the presence of "store tearing" as
described above.

Neither ARMv8 (aarch64) nor RISC-V guarantee lack of store-tearing, and
we have observed it in tests on RISC-V.

This PR implements the idea first proposed [here], namely to prepend a
load of the same size to every store. The idea is that if the store will
trap, the load will as well; and precise exception handling, a
well-established principle in all modern ISAs, guarantees that
instructions beyond a trapping instruction have no effect.

This is off by default, and is mainly meant as an option to study the
impact of this idea and to allow for precise trap execution semantics on
affected machines unless/until the spec is clarified.

On an Apple M2 Max machine (aarch64), this was measured to have a 2%
performance impact when running `spidermonkey.wasm` with a simple
recursive Fibonacci program. It can be used via the
`-Ccranelift-ensure_precise_store_traps=true` flag to Wasmtime.

[here]:
WebAssembly/design#1490 (comment)
cfallin added a commit to cfallin/wasmtime that referenced this issue Mar 22, 2024
…ng hardware.

As discussed at in bytecodealliance#7237 and WebAssembly/design#1490, some
instruction-set architectures do not guarantee that a store that
"partially traps" (overlaps multiple pages, only one of which disallows
the store) does not also have some side-effects. In particular, the part
of the store that *is* legal might succeed.

This has fascinating implications for virtual memory-based WebAssembly
heap implementations: when a store is partially out-of-bounds, it should
trap (there is no "partially" here: if the last byte is out of bounds,
the store is out of bounds). A trapping store should not alter the final
memory state, which is observable by the outside world after the trap.
Yet, the simple lowering of a Wasm store to a machine store instruction
could violate this expectation, in the presence of "store tearing" as
described above.

Neither ARMv8 (aarch64) nor RISC-V guarantee lack of store-tearing, and
we have observed it in tests on RISC-V.

This PR implements the idea first proposed [here], namely to prepend a
load of the same size to every store. The idea is that if the store will
trap, the load will as well; and precise exception handling, a
well-established principle in all modern ISAs, guarantees that
instructions beyond a trapping instruction have no effect.

This is off by default, and is mainly meant as an option to study the
impact of this idea and to allow for precise trap execution semantics on
affected machines unless/until the spec is clarified.

On an Apple M2 Max machine (aarch64), this was measured to have a 2%
performance impact when running `spidermonkey.wasm` with a simple
recursive Fibonacci program. It can be used via the
`-Ccranelift-ensure_precise_store_traps=true` flag to Wasmtime.

[here]:
WebAssembly/design#1490 (comment)
@cfallin
Copy link
Member

cfallin commented Mar 22, 2024

I've implemented the idea in #8221 and I see a 2% perf degradation (on an Apple M2 Max machine when running spidermonkey.wasm in Wasmtime). That seems reasonable to me to enable under some conditions -- not universally -- if the spec remains the same. It's still a hard question exactly what those conditions are: require a non-default setting for precise post-trap memory state? Perform a torn-store test at startup and fail if option was not enabled during compilation (or turn it on, the same way we infer native flags otherwise -- "non-tearing" hardware is then like an ISA extension in that sense)?

@pervognsen
Copy link

pervognsen commented Dec 7, 2024

Alex Bradbury pointed me to this thread. I was working on the specification of a low-level sandbox ISA and realized that partially visible page-crossing stores would be a potential issue if I overspecified it relative to what guarantees existing ISAs could offer. In my case I have a page-based memory model with both read and read/write permissions per page, so load probing does not suffice Edit: on second thought, load probing would suffice if you kept all loadable bytes in a temp register so the fault handler could undo a partial store by restoring the original bytes. But this requires you to split the unaligned load probe into two aligned load probes which also adds some alignment math.

However, if you specify maximal bytewise store semantics (i.e. store every byte that can be stored given the page permissions) I believe you should be able to implement it with software emulation in the fault handler via idempotent stores: the architectural state at the faulting instruction still has the address and the source operand accessible (in registers in my case or on the stack in Wasm's case). This would have zero performance impact on the non-faulting code path. In Wasm's case I don't know if this would be inconsistent with the specification in other areas (e.g. concurrency) but I thought I'd mention it.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

8 participants