[a64] Implement an ARM64 backend #2259

Separates the `Windows` platform into `Windows-x86_64` and `Windows-ARM64`. Adds `--arch` argument to `build`. Removes x64 backend on non-x64 targets.

Marked as TODO for now

Uses intrinsics from https://learn.microsoft.com/en-us/cpp/intrinsics/arm64-intrinsics?view=msvc-170

Adding the `a64` backend will be a different PR. For now it's stubbed to the null backend to allow the main executable to open without failing initialization.

This value is currently returning `0` on ARM machines and throws an exception.

Addresses a build issue that seems to occur now that xenia-app is not getting SDL2 through one of its submodules.

Adds the new `xenia-cpu-backend-a64` build-target with linkage following the x64 backend.

Header-only library for emitting arm64v8 instructions. Enables C++20 only for the a64 backend for now

Mostly element-accessors

First pass framework that gets emitted ARM code executing. Based on the x64 backend, implements an ARM64 JIT backend.

This just reverses the bytes of each 32-bit value rather than reversing the whole vector.

Wrong register index and vector-register size

These calls need to preserve and restore the `lr` register. Unit tests all run now!

These are stomping over X0 and Q0, which returns the input-argument registers as return values. Fixes some guest-to-host calls.

Vector registers are passed as pointers rather than directly in the `Qn` registers. So these functions should be taking pointer-type arguments rather than vector-register types directly. Fixes `OPCODE_VECTOR_SHL` and passes unit tests.

We don't load it back, so there is no need to store it.

Passes all unit tests

Uses the emulated fallback for now. Will have to come back to this later. Passes unit tests.

Passes unit tests

There is quite literally an instruction for each and every one of these cases. Passes unit tests

Arguments need to be pointers stored in X0, X1, X2, ... rather than passed directly in Q0, Q1, etc. There are no unit tests for these functions in particular.

Fails the unit tests due to subtle rounding errors

Fails unit tests due to subtle rounding errors. `SHORT_4` unit-test is missing but implementation is the same as `SHORT_4`.

Adds support for HIR labels to create actual oaknut labels

Implements control sequences such as conditional branching, breaking, and trapping

Register was getting stomped over

On the x64 side, this is the same as the `reset()` function resetting the label-manager

Resolving the function puts it into X0 and should be called immediately after. We were just calling ResolveFunction on ResolveFunction recursively

Things still get weird at the thunks, but this allows for callstacks between-to-guest calls

Also changes the register to X3 by default

Should be `GUEST_RET_ADDR` not `GUEST_CALL_RET_ADDR`.

Let the register type determine the reverse-size. REV32 was also the wrong instruction to use.

`W1` is a possible HIR register allocation and using W1 here was stomping over it. Don't use W1, use the provided "scratch" register.

Derive the reversal-size from the register-size. REV32 is also the wrong one to be using here since it will reverse the bytes of upper and lower 32-bit words.
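To illustrate why `REV32` is the wrong choice for a 64-bit register, here is a minimal host-side sketch that models the two instructions in plain C++. The function names are hypothetical and exist only for this illustration; they are not part of the backend.

```cpp
#include <cassert>
#include <cstdint>

// Models `REV Xd, Xn`: reverses all eight bytes of a 64-bit value.
constexpr uint64_t rev64(uint64_t v) {
  uint64_t out = 0;
  for (int i = 0; i < 8; ++i) {
    out = (out << 8) | (v & 0xFF);
    v >>= 8;
  }
  return out;
}

// Models `REV Wd, Wn`: reverses the four bytes of a 32-bit value.
constexpr uint32_t rev32_word(uint32_t v) {
  return (v >> 24) | ((v >> 8) & 0x0000FF00u) | ((v << 8) & 0x00FF0000u) |
         (v << 24);
}

// Models `REV32 Xd, Xn`: reverses the bytes inside each 32-bit half
// independently, so the two halves never swap places.
constexpr uint64_t rev32_on_x(uint64_t v) {
  return (static_cast<uint64_t>(rev32_word(static_cast<uint32_t>(v >> 32)))
          << 32) |
         rev32_word(static_cast<uint32_t>(v));
}

static_assert(rev64(0x0102030405060708ull) == 0x0807060504030201ull,
              "full 64-bit byte swap");
static_assert(rev32_on_x(0x0102030405060708ull) == 0x0403020108070605ull,
              "halves are swapped in place, not exchanged");
```

Picking `REV` with a `W` or `X` operand lets the operand width select the swap size, which is the behaviour the commit describes.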

Share a somewhat similar calling convention as ARM64

Fixes callstacks!!!!

16-bit word rather than 8-bit

These instructions need to use an extra register to generate their constants if they are too large

`x0` was loading the thunk rather than using `xip`. Fixes lots of init bugs!

Additionally fixes some instruction forms to use the more general `STR` instruction with an offset

You wouldn't believe how much time this bug cost me

Guest-function calls will use W17 for indirect calls

Fixes some offset generation as well

Fixes indirect branches

Was picking up `W0` rather than src1

Operand order is wrong.

Writing to the wrong register!

Potential input-register stomping and operand order is seemingly wrong. Passes generated unit tests.

Passes generated unit tests

Much more explicit arguments while trying to debug a deadlock

Was not handling constant arguments properly

Values should be modulo-element-size

😳

```
4.2.2.4 Floating-Point Rounding and Conversion Instructions
...
Floating-point conversions to integers (vctuxs, vctsxs) use round-toward-zero (truncate).
...
```

This passes all of the `vctuxs` and `vctsxs` unit tests.
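As a rough sketch of round-toward-zero conversion with saturation, the helpers below model how AArch64 `FCVTZU` and `FCVTZS` treat out-of-range inputs and NaN. This is an assumption about the host-side behaviour, not a statement of what the commit emits; the helper names are hypothetical and the scale immediate of `vctuxs`/`vctsxs` is ignored.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <limits>

// Truncating float -> uint32 conversion that saturates instead of wrapping.
inline uint32_t float_to_u32_sat_trunc(float v) {
  if (std::isnan(v) || v <= 0.0f) return 0;  // NaN and negatives clamp to 0
  if (v >= 4294967296.0f) return std::numeric_limits<uint32_t>::max();
  return static_cast<uint32_t>(v);  // in range: C++ truncates toward zero
}

// Truncating float -> int32 conversion that saturates at both ends.
inline int32_t float_to_s32_sat_trunc(float v) {
  if (std::isnan(v)) return 0;
  if (v >= 2147483648.0f) return std::numeric_limits<int32_t>::max();
  if (v <= -2147483648.0f) return std::numeric_limits<int32_t>::min();
  return static_cast<int32_t>(v);  // in range: C++ truncates toward zero
}
```

The range checks come first because a plain `static_cast` of an out-of-range float is undefined behaviour in C++.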

Passes `vmrghh` and `vmrglh` unit-tests

Use `FMADD` and `FMLA`. Tests are the same, though now it should run a bit faster. The tests that fail are primarily denormals and other subtle precision issues, it seems. Ex:

```
i> 00002358 - vmaddfp_7298_GEN
!> 00002358 Register v4 assert failed:
!> 00002358 Expected: v4 == [00000000, 00000000, 00000000, 00000000]
!> 00002358 Actual: v4 == [000D000E, 00138014, 000E4CDC, 0018B34D]
!> 00002358 TEST FAILED
```

Host-To-Guest and Guest-To-Host thunks should probably restore/preserve the FPCR to maintain these roundings.

8 and 16 bit CNTLZ needs its bit-count fixed to its original element-type
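A minimal sketch of the adjustment, assuming the narrow value is zero-extended and counted in a 32-bit register first: the extra leading zeros contributed by the extension are subtracted afterwards. `clz32` stands in for the `CLZ Wd, Wn` instruction, and all names are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// Stand-in for `CLZ Wd, Wn`: counts leading zero bits of a 32-bit value.
constexpr unsigned clz32(uint32_t v) {
  unsigned n = 0;
  for (uint32_t mask = 0x80000000u; mask != 0 && (v & mask) == 0; mask >>= 1) {
    ++n;
  }
  return n;  // 32 when v == 0
}

// The zero-extended 8-bit value always has 24 extra leading zeros.
constexpr unsigned clz8(uint8_t v) { return clz32(v) - 24; }

// The zero-extended 16-bit value always has 16 extra leading zeros.
constexpr unsigned clz16(uint16_t v) { return clz32(v) - 16; }

static_assert(clz8(0) == 8, "zero input yields the element bit-count");
static_assert(clz16(0x0100) == 7, "counted relative to 16 bits");
```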

…onCoverage` Relies on armv8.1-a atomic features

This fixes 32-bit atomic-compare-exchanges. The upper-half of the input register _must_ be clipped off. This fixes a deadlock in some games.
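One way this kind of failure can arise, sketched on the host with `std::atomic`: if the expected value sits in a 64-bit register with stale upper bits and is compared at full width, the comparison never succeeds and a retry loop spins forever. The helpers below are hypothetical and only illustrate the clipping; they do not reproduce the emitted code.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Models the faulty comparison: the zero-extended 32-bit memory value is
// compared against the whole 64-bit register, stale upper half included.
constexpr bool matches_unclipped(uint32_t loaded, uint64_t expected_reg) {
  return static_cast<uint64_t>(loaded) == expected_reg;
}

// 32-bit compare-exchange that clips both register operands to 32 bits.
inline bool cas32_clipped(std::atomic<uint32_t>& target, uint64_t expected_reg,
                          uint64_t desired_reg) {
  uint32_t expected = static_cast<uint32_t>(expected_reg);
  return target.compare_exchange_strong(expected,
                                        static_cast<uint32_t>(desired_reg));
}

// Demonstration wrapper: returns the value left in memory after the attempt.
inline uint32_t cas32_demo(uint32_t initial, uint64_t expected_reg,
                           uint64_t desired_reg) {
  std::atomic<uint32_t> target{initial};
  cas32_clipped(target, expected_reg, desired_reg);
  return target.load();
}
```

With a register value of `0xDEADBEEF00000005`, the unclipped comparison against a memory value of `5` fails, while the clipped exchange succeeds.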

Just need to store `fp` and `lr`

Uses `0x0000'dead` as an instruction-stepping sentinel value. Support for basic jumping instructions like `b`, `bl`, `br`, and `blr`.

Uses MOVI to optimize some cases of constants rather than EOR. MOVI is a register-renaming idiom on many architectures.

The LSL can be embedded into the ADD to remove an additional instruction. What was `cset`+`lsl`+`add` should now just be `cset`+`add ... LSL 12`

Use pair-stores rather than singular-stores to write 32-bytes of data at a time.

Uses the `CNTVCT_EL0`-register and applies frequency scaling

Passes cpu-ppc-tests

This is a very literal translation from the x64 code into ARM and may not be very optimized. Passes unit tests save for a couple of off-by-one errors.

Adds two new flags for allowing the use of LSE and FP16C

Narrow-saturation instructions cause off-by-one rounding errors. Using the min+max+shuffle passes more unit tests.

Load the pointer to the VConst table once, and use offsets from this base address from the underlying enum value. Reduces the amount of instructions for each VConst memory load.

Detect when all bytes are repeating and use `MOVI` when applicable
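A minimal sketch of the detection, assuming the 128-bit constant is available as two 64-bit halves. The function name is hypothetical. Multiplying the low byte by `0x0101010101010101` replicates it into every byte lane, which gives a cheap comparison target.

```cpp
#include <cassert>
#include <cstdint>

// Returns true when every byte of the 128-bit constant (lo, hi) is identical,
// meaning a single byte-splat `MOVI Vd.16B, #imm8` could materialize it.
constexpr bool is_byte_splat(uint64_t lo, uint64_t hi) {
  const uint64_t replicated = (lo & 0xFF) * 0x0101010101010101ull;
  return lo == hi && lo == replicated;
}

static_assert(is_byte_splat(0, 0), "all-zero vector is a splat of 0x00");
static_assert(!is_byte_splat(0x4242424242424243ull, 0x4242424242424243ull),
              "one differing byte disqualifies the constant");
```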

Indices and non-const tables were using the same scratch-register

Uses `CNTFRQ` and `CNTVCT` system-registers as a raw clock source. On my ThinkPad x13s, the raw clock source returns a tick-frequency of 19,200,000 while the platform clock source (QueryPerformanceFrequency) returns 10,000,000. Almost double the resolution of the platform clock!
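Converting raw counter ticks to another tick rate is a ratio of the two frequencies. The sketch below is one overflow-conscious way to do it, splitting the input into whole seconds and a remainder; the function name is hypothetical and the frequencies are passed in rather than read from system registers.

```cpp
#include <cassert>
#include <cstdint>

// Rescales `ticks` counted at `src_hz` into ticks at `dst_hz`.
// Splitting off whole seconds keeps the intermediate product small; the
// remainder term can still overflow if src_hz * dst_hz exceeds 64 bits.
constexpr uint64_t scale_ticks(uint64_t ticks, uint64_t src_hz,
                               uint64_t dst_hz) {
  const uint64_t whole_seconds = ticks / src_hz;
  const uint64_t remainder = ticks % src_hz;
  return whole_seconds * dst_hz + (remainder * dst_hz) / src_hz;
}

static_assert(scale_ticks(19200000, 19200000, 10000000) == 10000000,
              "one second at 19.2 MHz is one second at 10 MHz");
```

For example, 96 ticks at 19,200,000 Hz correspond to 50 ticks at 10,000,000 Hz.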

Missed some during the first pass. Now the config files will mention a64 differences.

Read directly from the ZR in the case that we are just storing a 64 or 32 bit zero

This directly maps to the QC bit in the FPSR. Just have to make sure that the saturated instruction is the very last instruction(which is currently the case for stuff like VECTOR_ADD and such).

The 64-bit case uses a particular replicated 8-bit immediate, so something else will have to handle that. This catches a lot of cases without having to touch memory. Does not catch cases of `1.0` (0x3f800000).

`FMOV` encodes an 8-bit floating point immediate that can be used to accelerate the loading of certain constant floating point values with magnitudes between 0.125 and 31.0. A lot of immediates such as -1.0, 1.0, 0.5, etc. fall within this range, and this code gets lots of hits in my testing. This is much cheaper than loading a 32/64-bit value into W0/X0 and moving it into an FP register.
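A sketch of how encodability could be tested for single-precision values, assuming IEEE-754 binary32 `float`. The 8-bit immediate holds a sign, a 3-bit exponent and a 4-bit fraction, so a value qualifies when its low 19 fraction bits are zero and its unbiased exponent lies in [-3, 4]. The function names are hypothetical, not the ones used in `a64_util`.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Checks a raw binary32 bit pattern against the FMOV imm8 constraints.
constexpr bool is_fmov_imm8_bits(uint32_t bits) {
  const uint32_t fraction = bits & 0x007FFFFFu;
  const uint32_t biased_exp = (bits >> 23) & 0xFFu;
  // Only the top 4 fraction bits may be set; exponent must be 2^-3..2^4.
  return (fraction & 0x0007FFFFu) == 0 && biased_exp >= 124 &&
         biased_exp <= 131;
}

// Convenience overload taking a float; memcpy avoids aliasing problems.
inline bool is_fmov_imm8(float value) {
  uint32_t bits = 0;
  std::memcpy(&bits, &value, sizeof(bits));
  return is_fmov_imm8_bits(bits);
}

static_assert(is_fmov_imm8_bits(0x3F800000u), "1.0f is encodable");
static_assert(!is_fmov_imm8_bits(0x00000000u), "0.0f is not encodable");
```

Zero is outside the encodable set, so it still needs a different path such as `MOVI` or a move from the zero register.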

Uses LSE when available, but provides an armv8.0 baseline implementation.

Removes all comments relating to x64 implementation details

`dc civac` causes an illegal-instruction exception on Windows-ARM. This is likely a security measure against cache attacks. On Linux this instruction is trapped into an EL1 kernel function. Windows does not seem to have any user-mode cache-maintenance instructions available for the data cache (only the instruction cache via `FlushInstructionCache`). The closest thing we can do for now is a full data memory-barrier with `dsb ish`. Prefetches are implemented using `prfm pldl1keep, ...`.

Out-of-bound shift-values are handled as modulo-element-size
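A per-lane sketch of the modulo behaviour, with the element width taken from the lane type. The template name is hypothetical and models only a single lane, not the vector helper itself.

```cpp
#include <cassert>
#include <cstdint>

// Left-shifts one lane, wrapping the shift amount at the lane's bit width
// so that out-of-range amounts behave as `shift % bits`.
template <typename T>
constexpr T shl_modulo(T value, unsigned shift) {
  constexpr unsigned bits = sizeof(T) * 8;
  return static_cast<T>(static_cast<uint64_t>(value) << (shift % bits));
}

static_assert(shl_modulo<uint8_t>(1, 9) == 2, "9 % 8 == 1");
static_assert(shl_modulo<uint16_t>(1, 16) == 1, "16 % 16 == 0");
```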

The emitter doesn't actually hold onto executable code, but just generates the assembly-data into a buffer for the currently-resolving function before placing it into a code-cache. When code gets pushed into the code-cache, it can just be copied from an `std::vector` and reset. The code-cache itself maintains the actual executable memory and stack-unwinding code and such. This also fixes a bunch of erroneous relative-addressing glitches where relative addresses were calculated based on the address of the unused CodeBlock rather than being position-independent. `MOVP2R` in particular was generating different instructions depending on its distance from the code block when it should always just use `MOV` and not do any relative-address calculations, since we can't predict where the actual instruction's offset will be (we cannot predict what the program counter will be). Oaknut probably needs a "position independent" policy or mode or something so that it avoids PC-relative instructions.

These `MOV`->`DUP` splats can just be a singular `MOVI` instruction

Byte-sized constants can utilize the `MOVI` instructions. This makes many cases such as zero-splats much faster since this encodes as just a register-rename(similar to `xor` on x64).

Moves the `FMOV` constant functions into `a64_util` so it is available to other translation units. Optimize constant-splats with conditional use of `MOVI` and `FMOV`.

The last `FADDP` writes into an `S` register, which automatically masks all the other lanes to zero.

The `SUB` instruction can only encode immediates in the form of `0xFFF` or `0xFFF000`. In the case that the stack size is greater than `0xFFF`, then just align the stack-size by `0x1000` to keep the bottom 12 bits clear.
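The rule can be sketched as two small helpers: one that tests whether a value fits the 12-bit immediate, optionally shifted left by 12, and one that rounds larger stack sizes up to a `0x1000` multiple. Both names are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// True when `value` fits a 12-bit immediate, either unshifted (0xFFF mask)
// or shifted left by 12 (0xFFF000 mask).
constexpr bool fits_addsub_imm(uint64_t value) {
  return (value & ~uint64_t{0xFFF}) == 0 || (value & ~uint64_t{0xFFF000}) == 0;
}

// Sizes above 0xFFF are rounded up to a multiple of 0x1000 so the low
// 12 bits are clear and the shifted form applies.
constexpr uint64_t align_stack_for_sub(uint64_t size) {
  return size > 0xFFF ? (size + 0xFFF) & ~uint64_t{0xFFF} : size;
}

static_assert(align_stack_for_sub(0x1234) == 0x2000, "rounded up to 0x2000");
static_assert(fits_addsub_imm(align_stack_for_sub(0x1234)), "now encodable");
```

Sizes above `0xFFF000` remain unencodable in a single `SUB` even after alignment, so this sketch only covers the range the commit message describes.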

Commits on Apr 27, 2024

Commits on Apr 28, 2024

Commits on Apr 29, 2024

Commits on Jun 23, 2024
