Optimize instruction fetch and decoding #226

Open — edubart wants to merge 2 commits into main from feature/optim-fetch
Conversation

@edubart edubart (Contributor) commented Apr 6, 2024

This optimizes our RISC-V instruction decoder by using big jump tables, through token threading, to the point that decoding takes just 2 host instructions on x86_64 (3 on arm64) for most RISC-V instructions, even compressed ones. Overall, the FENCE instruction executes in just 12 host instructions, with both GCC on AMD64 and Clang on ARM64.
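As a rough sketch of the technique (assumed names, not the PR's literal code), token threading replaces the decode compare tree with a table of handler addresses indexed directly by instruction bits, so decoding collapses to one mask and one indirect jump:

// Minimal token-threading sketch in GNU C (jump_table, pc and
// fetch_vh_offset are assumed names): the low 16 bits of an instruction
// select a handler directly, for compressed and uncompressed encodings.
uint32_t insn = *(uint32_t *)(pc + fetch_vh_offset); // fetch
goto *jump_table[insn & 0xffff];                     // decode + dispatch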

Here is the GCC x86_64 trace as proof:

//// FENCE GCC x86_64 (2/12 instructions)
// increment mcycle (3 instructions)
=> 0x7ffff7a2e98c <loop+28108>:   add    $0x1,%r15                     // mcycle += 1
=> 0x7ffff7a2e990 <loop+28112>:   cmp    %r13,%r15                     // mcycle < mcycle_tick_end
=> 0x7ffff7a2e993 <loop+28115>:   jae    0x7ffff7a2f230 <loop+30320>   // -> break loop
// fetch (5 instructions)
=> 0x7ffff7a2e999 <loop+28121>:   mov    %r10,%rbx                     // pc
=> 0x7ffff7a2e99c <loop+28124>:   xor    %rbp,%rbx                     // pc ^ fetch_vaddr_page
=> 0x7ffff7a2e99f <loop+28127>:   cmp    $0xffd,%rbx                   // check fetch page
=> 0x7ffff7a2e9a6 <loop+28134>:   ja     0x7ffff7a27d00 <loop+320>     // -> miss fetch
=> 0x7ffff7a2e9ac <loop+28140>:   mov    (%r14,%rbp,1),%ebx            // insn = *(uint32_t*)(pc + fetch_vh_offset)
// decode (2 instructions)
=> 0x7ffff7a2e9b0 <loop+28144>:   movzwl %bx,%ecx                      // insn & 0b1111111111111111
=> 0x7ffff7a2e9b3 <loop+28147>:   jmp    *(%r11,%rcx,8)                // -> jump to instruction
// execute (2 instructions)
=> 0x7ffff7a2ea3b <loop+28283>:   add    $0x4,%rbp                     // pc += 4
=> 0x7ffff7a2ea3f <loop+28287>:   jmp    0x7ffff7a2e98c <loop+28108>   // -> jump to loop begin

And the Clang arm64 trace:

//// FENCE Clang arm64 (3/12 instructions)
// increment mcycle
=> 0xfffff7b8a328 <loop+4568>:    add x25, x25, $0x1
=> 0xfffff7b8a32c <loop+4572>:    cmp x25, x27
=> 0xfffff7b8a330 <loop+4576>:    b.cs    0xfffff7b8e7a8 <loop+22104>
// fetch
=> 0xfffff7b8a334 <loop+4580>:    eor x19, x20, x28
=> 0xfffff7b8a338 <loop+4584>:    cmp x19, $0xffd
=> 0xfffff7b8a33c <loop+4588>:    b.hi    0xfffff7b89264 <loop+276>
=> 0xfffff7b8a340 <loop+4592>:    ldr w19, [x20, x22]
// decode
=> 0xfffff7b8a344 <loop+4596>:    and w10, w19, $0xffff
=> 0xfffff7b8a348 <loop+4600>:    ldr x16, [x24, x10, lsl $3]
=> 0xfffff7b8a34c <loop+4604>:    br  x16
// execute
=> 0xfffff7b8dde8 <loop+19608>:   add x20, x20, $0x4
=> 0xfffff7b8ddec <loop+19612>:   b   0xfffff7b8a328 <loop+4568>

In emulator v0.18.x, the trace for the same FENCE RISC-V instruction took about 40 x86_64 instructions.

Overall the performance varies between a 1.2x and a 2x speedup across many benchmarks relative to emulator v0.18.1. Here are the results for many benchmarks with stress-ng:

Benchmarks

Times faster Benchmark
2.56 ± 0.03 stress-ng --no-rand-seed --syscall 1 --syscall-ops 4000
2.15 ± 0.02 stress-ng --no-rand-seed --cpu 1 --cpu-method loop --cpu-ops 400
1.95 ± 0.00 stress-ng --no-rand-seed --cpu 1 --cpu-method fibonacci --cpu-ops 400
1.94 ± 0.01 stress-ng --no-rand-seed --cpu 1 --cpu-method int64 --cpu-ops 400
1.90 ± 0.01 stress-ng --no-rand-seed --memcpy 1 --memcpy-ops 50
1.88 ± 0.02 stress-ng --no-rand-seed --crypt 1 --crypt-method SHA-256 --crypt-ops 400000
1.87 ± 0.01 stress-ng --no-rand-seed --qsort 1 --qsort-ops 5
1.83 ± 0.01 stress-ng --no-rand-seed --memrate 1 --memrate-bytes 2M --memrate-ops 200
1.82 ± 0.03 stress-ng --no-rand-seed --hash 1 --hash-ops 40000
1.75 ± 0.00 stress-ng --no-rand-seed --heapsort 1 --heapsort-ops 3
1.72 ± 0.01 stress-ng --no-rand-seed --zlib 1 --zlib-ops 20
1.66 ± 0.00 stress-ng --no-rand-seed --matrix 1 --matrix-method mult --matrix-ops 20000
1.49 ± 0.02 stress-ng --no-rand-seed --hdd 1 --hdd-ops 2000
1.41 ± 0.00 stress-ng --no-rand-seed --fp 1 --fp-method floatadd --fp-ops 1000
1.33 ± 0.01 stress-ng --no-rand-seed --fma 1 --fma-ops 40000
1.24 ± 0.01 stress-ng --no-rand-seed --trig 1 --trig-ops 50
1.16 ± 0.01 stress-ng --no-rand-seed --fork 1 --fork-ops 1000
1.14 ± 0.01 stress-ng --no-rand-seed --malloc 1 --malloc-ops 40000

You can see a 1.94x speedup for integer operations. Notably, I was able to reach GHz speeds for some simple integer arithmetic benchmarks, with the interpreter being only 10~20x slower than native host execution.

The benchmark table was created by running hyperfine with stress-ng, for example:

$ hyperfine -w 1 'cartesi-machine -- stress-ng --no-rand-seed --cpu 1 --cpu-method loop --cpu-ops 400' '/usr/bin/cartesi-machine -- stress-ng --no-rand-seed --cpu 1 --cpu-method loop --cpu-ops 400'
Benchmark 1: cartesi-machine -- stress-ng --no-rand-seed --cpu 1 --cpu-method loop --cpu-ops 400
  Time (mean ± σ):      2.225 s ±  0.021 s    [User: 2.213 s, System: 0.010 s]
  Range (min … max):    2.197 s …  2.257 s    10 runs
 
Benchmark 2: /usr/bin/cartesi-machine -- stress-ng --no-rand-seed --cpu 1 --cpu-method loop --cpu-ops 400
  Time (mean ± σ):      4.615 s ±  0.041 s    [User: 4.602 s, System: 0.009 s]
  Range (min … max):    4.561 s …  4.682 s    10 runs
 
Summary
  cartesi-machine -- stress-ng --no-rand-seed --cpu 1 --cpu-method loop --cpu-ops 400 ran
    2.07 ± 0.03 times faster than /usr/bin/cartesi-machine -- stress-ng --no-rand-seed --cpu 1 --cpu-method loop --cpu-ops 400

PREVIOUS PR ITERATION COMMENTS

This is a micro-optimization, at the x86_64 assembly level, of the instruction fetch+decode hot path. In summary, this PR saves about 22 x86_64 instructions in every interpreter hot loop iteration. The optimization is not specific to x86_64; all architectures should benefit from it.

Baseline

First I generated a hot trace of consecutive FENCE.I instructions. I chose this instruction because it is the simplest one: it basically does nothing, making it ideal for measuring instruction fetch overhead. This was the trace for one iteration:

// mcycle check
0x7ffff7b88330 <interpret_loop+432>    add    $0x1,%r14                │ ++mcycle
0x7ffff7b88334 <interpret_loop+436>    cmp    %r11,%r14                │ mcycle < mcycle_tick_end
0x7ffff7b88337 <interpret_loop+439>    jae    0x7ffff7b88420           │ -> break interpret hot loop
// fetch
0x7ffff7b8833d <interpret_loop+445>    mov    %r15,%rbx                │ pc
0x7ffff7b88340 <interpret_loop+448>    and    $0xfffffffffffff000,%rbx │ vaddr_page = pc & ~PAGE_OFFSET_MASK
0x7ffff7b88347 <interpret_loop+455>    cmp    %r12,%rbx                │ vaddr_page == fetch_vaddr_page
0x7ffff7b8834a <interpret_loop+458>    jne    0x7ffff7b88728           │ -> miss fetch cache
0x7ffff7b88350 <interpret_loop+464>    lea    0x0(%r13,%r15,1),%rax    │ hptr = pc + fetch_vh_offset
0x7ffff7b88355 <interpret_loop+469>    mov    %r15,%rdx                │ pc
0x7ffff7b88358 <interpret_loop+472>    not    %rdx                     │ ~pc
0x7ffff7b8835b <interpret_loop+475>    test   $0xffe,%edx              │ ((~pc & PAGE_OFFSET_MASK) >> 1) == 0
0x7ffff7b88361 <interpret_loop+481>    je     0x7ffff7b88760           │ -> cross page boundary
0x7ffff7b88367 <interpret_loop+487>    mov    (%rax),%r9d              │ insn = *(uint32_t*)(hptr)
0x7ffff7b8836a <interpret_loop+490>    mov    %rbx,%r12                │ fetch_vaddr_page = vaddr_page
// decoding: check if it is a compressed instruction
0x7ffff7b8836d <interpret_loop+493>    mov    %r9d,%eax                │ insn
0x7ffff7b88370 <interpret_loop+496>    not    %eax                     │ ~insn
0x7ffff7b88372 <interpret_loop+498>    test   $0x3,%al                 │ (~insn & 3) > 0
0x7ffff7b88374 <interpret_loop+500>    jne    0x7ffff7b882b0           │ -> decode compressed instruction
// decoding: decode fence.i uncompressed instruction
0x7ffff7b8837a <interpret_loop+506>    mov    %r9d,%eax
0x7ffff7b8837d <interpret_loop+509>    and    $0x707f,%eax
0x7ffff7b88382 <interpret_loop+514>    cmp    $0x3023,%eax
0x7ffff7b88387 <interpret_loop+519>    je     0x7ffff7b8a868
0x7ffff7b8838d <interpret_loop+525>    ja     0x7ffff7b88500
0x7ffff7b88393 <interpret_loop+531>    cmp    $0x101b,%eax
0x7ffff7b88398 <interpret_loop+536>    je     0x7ffff7b8a820
0x7ffff7b8839e <interpret_loop+542>    ja     0x7ffff7b887d0
0x7ffff7b883a4 <interpret_loop+548>    cmp    $0x3b,%eax
0x7ffff7b883a7 <interpret_loop+551>    je     0x7ffff7b8a638
0x7ffff7b883ad <interpret_loop+557>    ja     0x7ffff7b88b68
0x7ffff7b88b68 <interpret_loop+2536>   cmp    $0x1003,%eax
0x7ffff7b88b6d <interpret_loop+2541>   je     0x7ffff7b8a494
0x7ffff7b88b73 <interpret_loop+2547>   ja     0x7ffff7b89450
0x7ffff7b89450 <interpret_loop+4816>   cmp    $0x1013,%eax
0x7ffff7b89455 <interpret_loop+4821>   je     0x7ffff7b8a6d0
0x7ffff7b8945b <interpret_loop+4827>   cmp    $0x1017,%eax
0x7ffff7b89460 <interpret_loop+4832>   je     0x7ffff7b8a71c
0x7ffff7b89466 <interpret_loop+4838>   cmp    $0x100f,%eax
0x7ffff7b8946b <interpret_loop+4843>   jne    0x7ffff7b8ab45
// execute
0x7ffff7b89471 <interpret_loop+4849>   add    $0x4,%r15                │ pc += 4
0x7ffff7b89475 <interpret_loop+4853>   jmp    0x7ffff7b88330           │ -> jump to begin

This trace keeps looping on x86_64. We can see that in optimal conditions it takes exactly 40 x86_64 instructions to execute one FENCE.I, where:

  • mcycle check: 3 instructions
  • fetch: 11 instructions
  • decoding: 24 instructions
  • execution: 2 instructions

I usually say that the Cartesi machine is about 30~40 times slower than native, and the 40:1 ratio in this trace is very close to that. If we can get this trace to execute in fewer x86_64 instructions, we can make the Cartesi machine interpreter faster for all instructions (not only this one).

If we look closely at the fetch, there are these two branches:

0x7ffff7b8834a <interpret_loop+458>    jne    0x7ffff7b88728           │ -> miss fetch cache
0x7ffff7b88361 <interpret_loop+481>    je     0x7ffff7b88760           │ -> cross page boundary

My idea was to come up with a single branch that tests both conditions in the fetch path, collapsing the two branches into one and saving some instructions.
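A minimal sketch of the combined check in C, assuming the field names from the trace annotations (fetch_vaddr_page caches the base address of the last fetched page, and PMA_PAGE_SIZE is 0x1000):

// Hedged sketch: on a hit, pc ^ fetch_vaddr_page is just the page offset
// of pc; on a miss it has bits set above the page offset, making the
// value huge. One unsigned compare thus rejects both a wrong page and a
// pc too close to the page end to hold a whole instruction.
static inline bool fetch_fast_path(uint64_t pc, uint64_t fetch_vaddr_page) {
    return (pc ^ fetch_vaddr_page) < PMA_PAGE_SIZE - 2;
}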

These are the baseline benchmark numbers:

$ lua bench-insns.lua
RISC-V Privileged Memory-management             4.205 MIPS   3250.0 ucycles
RISC-V Privileged Interrupt-management        495.199 MIPS    257.0 ucycles
RV64I - Base integer instruction set          574.503 MIPS    263.8 ucycles
RV64M - Integer multiplication and division   575.371 MIPS    312.2 ucycles
RV64A - Atomic instructions                   431.893 MIPS    304.2 ucycles
RV64F - Single-precision floating-point       218.556 MIPS    489.3 ucycles
RV64D - Double-precision floating-point       214.964 MIPS   1773.7 ucycles
RV64Zicsr - Control and status registers      289.061 MIPS    328.3 ucycles
RV64Zicntr - Base counters and timers         343.063 MIPS    284.3 ucycles
RV64Zifence - Instruction fetch fence         792.319 MIPS    246.0 ucycles

$ hyperfine -w 10 -m 32 "cartesi-machine dhrystone 2000000"
Benchmark 1: cartesi-machine dhrystone 2000000
  Time (mean ± σ):      1.333 s ±  0.007 s    [User: 1.326 s, System: 0.006 s]
  Range (min … max):    1.321 s …  1.356 s    32 runs

Round 1 - Optimize fetch

After some thinking I came up with the changes presented in this PR to optimize instruction fetch, which generate the following new trace:

// mcycle check
0x7ffff7b873e0 <interpret_loop+432>    add    $0x1,%r15               │ ++mcycle
0x7ffff7b873e4 <interpret_loop+436>    cmp    %r10,%r15               │ mcycle < mcycle_tick_end
0x7ffff7b873e7 <interpret_loop+439>    jae    0x7ffff7b874b0          │ -> break interpret hot loop
// fetch
0x7ffff7b873ed <interpret_loop+445>    mov    %rbp,%rax               │ pc
0x7ffff7b873f0 <interpret_loop+448>    xor    %r14,%rax               │ pc ^ fetch_vaddr_page
0x7ffff7b873f3 <interpret_loop+451>    cmp    $0xffd,%rax             │ (pc ^ fetch_vaddr_page) < (PMA_PAGE_SIZE - 2)
0x7ffff7b873f9 <interpret_loop+457>    ja     0x7ffff7b877d8          │ -> miss fetch cache
0x7ffff7b873ff <interpret_loop+463>    mov    0x0(%rbp,%r13,1),%ebx   │ insn = *(uint32_t*)(pc + fetch_vh_offset)
// decoding: check if it is a compressed instruction
0x7ffff7b87404 <interpret_loop+468>    mov    %ebx,%eax               │ insn
0x7ffff7b87406 <interpret_loop+470>    not    %eax                    │ ~insn
0x7ffff7b87408 <interpret_loop+472>    test   $0x3,%al                │ (~insn & 3) > 0
0x7ffff7b8740a <interpret_loop+474>    jne    0x7ffff7b87360          │ -> decode compressed instruction
// decoding: decode fence.i uncompressed instruction
0x7ffff7b87410 <interpret_loop+480>    mov    %ebx,%eax
0x7ffff7b87412 <interpret_loop+482>    and    $0x707f,%eax
0x7ffff7b87417 <interpret_loop+487>    cmp    $0x3023,%eax
0x7ffff7b8741c <interpret_loop+492>    je     0x7ffff7b891e8
0x7ffff7b87422 <interpret_loop+498>    ja     0x7ffff7b87590
0x7ffff7b87428 <interpret_loop+504>    cmp    $0x101b,%eax
0x7ffff7b8742d <interpret_loop+509>    je     0x7ffff7b899f4
0x7ffff7b87433 <interpret_loop+515>    ja     0x7ffff7b87838
0x7ffff7b87439 <interpret_loop+521>    cmp    $0x3b,%eax
0x7ffff7b8743c <interpret_loop+524>    je     0x7ffff7b892b9
0x7ffff7b87442 <interpret_loop+530>    ja     0x7ffff7b87bc0
0x7ffff7b87bc0 <interpret_loop+2448>   cmp    $0x1003,%eax
0x7ffff7b87bc5 <interpret_loop+2453>   je     0x7ffff7b89030
0x7ffff7b87bcb <interpret_loop+2459>   ja     0x7ffff7b88680
0x7ffff7b88680 <interpret_loop+5200>   cmp    $0x1013,%eax
0x7ffff7b88685 <interpret_loop+5205>   je     0x7ffff7b897b0
0x7ffff7b8868b <interpret_loop+5211>   cmp    $0x1017,%eax
0x7ffff7b88690 <interpret_loop+5216>   je     0x7ffff7b89317
0x7ffff7b88696 <interpret_loop+5222>   cmp    $0x100f,%eax
0x7ffff7b8869b <interpret_loop+5227>   jne    0x7ffff7b89b7a
// execute
0x7ffff7b886a1 <interpret_loop+5233>   add    $0x4,%rbp               │ pc += 4
0x7ffff7b886a5 <interpret_loop+5237>   jmp    0x7ffff7b873e0          │ -> jump to begin

We can see that in optimal conditions it takes exactly 34 x86_64 instructions to execute one FENCE.I in this trace, where:

  • mcycle check: 3 instructions (same as before)
  • fetch: 5 instructions (-6 instructions)
  • decoding: 24 instructions (same as before)
  • execution: 2 instructions (same as before)

So in summary, 6 instructions were optimized out of the very hot path. These are the new benchmark numbers:

$ lua bench-insns.lua
RISC-V Privileged Memory-management             4.227 MIPS   3244.0 ucycles
RISC-V Privileged Interrupt-management        509.389 MIPS    257.0 ucycles
RV64I - Base integer instruction set          611.176 MIPS    264.3 ucycles
RV64M - Integer multiplication and division   606.949 MIPS    312.9 ucycles
RV64A - Atomic instructions                   449.197 MIPS    304.6 ucycles
RV64F - Single-precision floating-point       227.302 MIPS    489.7 ucycles
RV64D - Double-precision floating-point       225.613 MIPS   1774.1 ucycles
RV64Zicsr - Control and status registers      300.756 MIPS    329.0 ucycles
RV64Zicntr - Base counters and timers         354.946 MIPS    283.3 ucycles
RV64Zifence - Instruction fetch fence         847.884 MIPS    246.0 ucycles

$ hyperfine -w 10 -m 32 "cartesi-machine dhrystone 2000000"
Benchmark 1: cartesi-machine dhrystone 2000000
  Time (mean ± σ):      1.269 s ±  0.011 s    [User: 1.261 s, System: 0.007 s]
  Range (min … max):    1.257 s …  1.317 s    32 runs

We can see improvements in all benchmarks, where:

  • Dhrystone took about 1.269 / 1.333 = 95% of the original time to execute, i.e. 5% less.
  • 611.176 / 574.503 ≈ 1.06, a ~6% improvement in execution speed for the RV64I instruction set.

Round 2 - Optimize decoding for uncompressed instruction

Decoding uses 24 of the 34 instructions in the trace, about 70%! It dominates the hot loop trace. Imagine if we could cut it in half; maybe we can, with jump tables.
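A sketch of the idea (assumed helper names, simplified): pack the opcode and funct3 fields into one dense key and switch on it, so GCC can emit a single bounded jump table instead of a chain of compares:

// Hedged sketch: a 10-bit switch key built from opcode (insn bits 6:0)
// and funct3 (insn bits 14:12, moved to key bits 9:7), as in the trace.
static inline uint32_t decode_key(uint32_t insn) {
    return (insn & 0x7f) | ((insn >> 5) & 0x380);
}

static void decode_and_execute(uint32_t insn) {
    switch (decode_key(insn)) {
        case 0x00f: execute_fence(); break; // opcode 0001111, funct3 000
        // ... one case per opcode/funct3 pair, with dense key values ...
        default: raise_illegal_insn(); break;
    }
}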

EDIT: I decided to give it a try and restructured the decoding code so that the GCC compiler can optimize it into jump tables. After some thinking and research I added a new commit to this PR, and this is the new trace:

// mcycle check
0x7ffff7b82990 <interpret_loop+560>  add    $0x1,%r14          │ ++mcycle
0x7ffff7b82994 <interpret_loop+564>  cmp    %r11,%r14          │ mcycle < mcycle_tick_end
0x7ffff7b82997 <interpret_loop+567>  jb     0x7ffff7b828a0     │ -> continue to fetch (fall through breaks the hot loop)
// fetch
0x7ffff7b828a0 <interpret_loop+320>  mov    %r15,%rax          │ pc
0x7ffff7b828a3 <interpret_loop+323>  xor    %r13,%rax          │ pc ^ fetch_vaddr_page
0x7ffff7b828a6 <interpret_loop+326>  cmp    $0xffd,%rax        │ (pc ^ fetch_vaddr_page) < (PMA_PAGE_SIZE - 2)
0x7ffff7b828ac <interpret_loop+332>  ja     0x7ffff7b844f0     │ -> miss fetch cache
0x7ffff7b828b2 <interpret_loop+338>  mov    (%r15,%r12,1),%ebx │ insn = *(uint32_t*)(pc + fetch_vh_offset)
// decoding: check if it is a compressed instruction
0x7ffff7b828b6 <interpret_loop+342>  mov    %ebx,%ecx          │ insn
0x7ffff7b828b8 <interpret_loop+344>  and    $0x3,%ecx          │ insn & 3
0x7ffff7b828bb <interpret_loop+347>  cmp    $0x3,%ecx          │ (insn & 3) == 3
0x7ffff7b828be <interpret_loop+350>  je     0x7ffff7b83100     │ -> decode uncompressed instruction
// decoding: decode fence.i uncompressed instruction
0x7ffff7b83100 <interpret_loop+2464> mov    %ebx,%eax          │ insn
0x7ffff7b83102 <interpret_loop+2466> mov    %ebx,%edx          │ insn
0x7ffff7b83104 <interpret_loop+2468> shr    $0x5,%eax          │ insn >> 5
0x7ffff7b83107 <interpret_loop+2471> and    $0x7f,%edx         │ insn & 0b1111111
0x7ffff7b8310a <interpret_loop+2474> and    $0x380,%eax        │ (insn >> 5) & 0b1110000000
0x7ffff7b8310f <interpret_loop+2479> or     %edx,%eax          │ ((insn >> 5) & 0b1110000000) | (insn & 0b1111111)
0x7ffff7b83111 <interpret_loop+2481> lea    -0x3(%rax),%edx    │ compute index into jump table
0x7ffff7b83114 <interpret_loop+2484> cmp    $0x3f0,%edx        │ check if index is valid
0x7ffff7b8311a <interpret_loop+2490> ja     0x7ffff7b83130     │ -> illegal instruction
0x7ffff7b8311c <interpret_loop+2492> lea    0x3b711(%rip),%rdi │ load jump table base address
0x7ffff7b83123 <interpret_loop+2499> movslq (%rdi,%rdx,4),%rdx │ load jump offset for given index
0x7ffff7b83127 <interpret_loop+2503> add    %rdi,%rdx          │ compute instruction jump address
0x7ffff7b8312a <interpret_loop+2506> jmp    *%rdx              │ -> jump to instruction
// execute
0x7ffff7b83590 <interpret_loop+3632> add    $0x4,%r15          │ pc += 4
0x7ffff7b83594 <interpret_loop+3636> jmp    0x7ffff7b82990     │ -> jump to begin

We can see that it now takes exactly 27 x86_64 instructions to execute one FENCE.I in this trace, where:

  • mcycle check: 3 instructions (same as before)
  • fetch: 5 instructions (-6 instructions from baseline)
  • decoding: 17 instructions (-7 instructions from baseline)
  • execution: 2 instructions (same as before)

However, this adds one memory indirection to look up the jump table. This is fine: the lookup will most likely hit the L1 CPU cache.

These are the new benchmark numbers:

$ lua bench-insns.lua
-- Average instruction set speed --
RISC-V Privileged Memory-management             4.215 MIPS   3234.0 ucycles
RISC-V Privileged Interrupt-management        835.057 MIPS    257.0 ucycles
RV64I - Base integer instruction set          780.803 MIPS    259.1 ucycles
RV64M - Integer multiplication and division   673.788 MIPS    306.8 ucycles
RV64A - Atomic instructions                   523.084 MIPS    298.6 ucycles
RV64F - Single-precision floating-point       347.813 MIPS    482.4 ucycles
RV64D - Double-precision floating-point       374.271 MIPS   1310.9 ucycles
RV64Zicsr - Control and status registers      321.237 MIPS    321.7 ucycles
RV64Zicntr - Base counters and timers         488.825 MIPS    279.3 ucycles
RV64Zifence - Instruction fetch fence        1475.150 MIPS    241.0 ucycles

$ hyperfine -w 10 -m 32 "cartesi-machine dhrystone 2000000"
Benchmark 1: cartesi-machine dhrystone 2000000
  Time (mean ± σ):      1.055 s ±  0.023 s    [User: 1.048 s, System: 0.006 s]
  Range (min … max):    1.041 s …  1.167 s    32 runs

Whoa, that is:

  • -21% execution time for dhrystone benchmark
  • FENCE.I instruction is +86% faster
  • RV64I instructions are +35% faster

Also, some instructions now run at over 1 GHz (more than 1000 MIPS)!

lui                                          1019.273 MIPS      248 ucycles
auipc                                        1014.344 MIPS      249 ucycles
beq                                          1009.463 MIPS      250 ucycles
bne                                          1021.258 MIPS      247 ucycles
blt                                          1021.258 MIPS      247 ucycles
bge                                          1018.283 MIPS      243 ucycles
bltu                                         1021.258 MIPS      244 ucycles
bgeu                                         1013.364 MIPS      248 ucycles

Round 3 - Single jump with computed gotos

Wasting 4 instructions every iteration just to check whether an instruction is compressed is not ideal. We could try to compile the compressed instruction switch and the uncompressed instruction switch into a single switch.

After I tried a very large switch (2048 entries), GCC refused to compile it into a single big jump table, so I built my own: a 2048-entry array generated by a Lua script, dispatched with GCC's computed goto extension.
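A minimal sketch of this dispatch in GNU C (machine_state, fetch_insn, raise_illegal_insn and the labels are assumed names; the real 2048-entry table in the PR is generated by a Lua script):

// Hedged sketch: computed gotos with one table for compressed and
// uncompressed instructions. The index uses insn bits [6:0] and [15:12],
// which distinguish the compressed quadrants and every uncompressed
// opcode/funct3 combination, so no separate compressed check is needed.
static void interpret_loop(machine_state *s) {
    static const void *const dispatch[2048] = {
        [0x00f] = &&lbl_fence, // opcode 0001111, funct3 000
        // ... all other entries generated; invalid ones -> &&lbl_illegal
    };
    uint32_t insn;
#define DISPATCH() goto *dispatch[(insn & 0x7f) | ((insn >> 5) & 0x780)]
    insn = fetch_insn(s);
    DISPATCH();
lbl_fence:
    s->pc += 4;
    insn = fetch_insn(s);
    DISPATCH(); // no bounds check: every index maps to some label
lbl_illegal:
    raise_illegal_insn(s);
}

This is the new trace: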

// mcycle check
0x7ffff7b7e9d8 <interpret_loop+424>  add    $0x1,%r12             │ ++mcycle
0x7ffff7b7e9dc <interpret_loop+428>  cmp    %r10,%r12             │ mcycle < mcycle_tick_end
0x7ffff7b7e9df <interpret_loop+431>  jb     0x7ffff7b7e950        │ -> continue to fetch (fall through breaks the hot loop)
// fetch
0x7ffff7b7e950 <interpret_loop+288>  mov    %rbp,%rax             │ pc
0x7ffff7b7e953 <interpret_loop+291>  xor    %r15,%rax             │ pc ^ fetch_vaddr_page
0x7ffff7b7e956 <interpret_loop+294>  cmp    $0xffd,%rax           │ (pc ^ fetch_vaddr_page) < (PMA_PAGE_SIZE - 2)
0x7ffff7b7e95c <interpret_loop+300>  ja     0x7ffff7b80d30        │ -> miss fetch cache
0x7ffff7b7e962 <interpret_loop+306>  mov    0x0(%rbp,%r14,1),%ebx │ insn = *(uint32_t*)(pc + fetch_vh_offset)
// decode
0x7ffff7b7e967 <interpret_loop+311>  mov    %ebx,%eax             │ insn
0x7ffff7b7e969 <interpret_loop+313>  mov    %ebx,%edx             │ insn
0x7ffff7b7e96b <interpret_loop+315>  lea    0x622ae(%rip),%rdi    │ load jump table base address
0x7ffff7b7e972 <interpret_loop+322>  shr    $0x5,%eax             │ insn >> 5
0x7ffff7b7e975 <interpret_loop+325>  and    $0x7f,%edx            │ insn & 0b1111111
0x7ffff7b7e978 <interpret_loop+328>  and    $0x780,%eax           │ (insn >> 5) & 0b11110000000
0x7ffff7b7e97d <interpret_loop+333>  or     %edx,%eax             │ ((insn >> 5) & 0b11110000000) | (insn & 0b1111111)
0x7ffff7b7e97f <interpret_loop+335>  jmp    *(%rdi,%rax,8)        │ -> jump to instruction
// execute
0x7ffff7b7fe55 <interpret_loop+5669> add    $0x4,%rbp             │ pc += 4
0x7ffff7b7fe59 <interpret_loop+5673> jmp    0x7ffff7b7e9d8        │ -> jump to begin

We can see that it now takes exactly 18 x86_64 instructions to execute one FENCE.I in this trace, where:

  • mcycle check: 3 instructions (same as before)
  • fetch: 5 instructions (-6 instructions from baseline)
  • decoding: 8 instructions (-16 instructions from baseline)
  • execution: 2 instructions (same as before)

So we went from 40 instructions at baseline down to 18. This should improve performance for all instructions, because every instruction goes through fetch and decode.

Let's see the benchmarks:

$ lua bench-insns.lua
-- Average instruction set speed --
RISC-V Privileged Memory-management             4.267 MIPS   3203.0 ucycles
RISC-V Privileged Interrupt-management        968.451 MIPS    234.0 ucycles
RV64I - Base integer instruction set          895.744 MIPS    237.5 ucycles
RV64M - Integer multiplication and division   747.252 MIPS    286.2 ucycles
RV64A - Atomic instructions                   567.218 MIPS    278.4 ucycles
RV64F - Single-precision floating-point       331.036 MIPS    445.3 ucycles
RV64D - Double-precision floating-point       342.071 MIPS   1273.0 ucycles
RV64Zicsr - Control and status registers      352.496 MIPS    303.0 ucycles
RV64Zicntr - Base counters and timers         503.979 MIPS    263.3 ucycles
RV64Zifence - Instruction fetch fence        1949.502 MIPS    218.0 ucycles

$ hyperfine -w 10 -m 32 "cartesi-machine dhrystone 2000000"
Benchmark 1: cartesi-machine dhrystone 2000000
  Time (mean ± σ):     986.4 ms ±  26.5 ms    [User: 979.7 ms, System: 7.1 ms]
  Range (min … max):   968.3 ms … 1083.9 ms    32 runs

Whoa, that is:

  • -27% execution time for dhrystone benchmark
  • FENCE.I instruction is +146% faster
  • RV64I instructions are +55% faster

Also, many instructions now run at over 1 GHz:

lui                                          1314.326 MIPS      225 ucycles
auipc                                        1309.403 MIPS      226 ucycles
beq                                          1239.754 MIPS      228 ucycles
bne                                          1220.992 MIPS      228 ucycles
blt                                          1223.841 MIPS      228 ucycles
bge                                          1296.455 MIPS      228 ucycles
bltu                                         1306.142 MIPS      228 ucycles
bgeu                                         1311.040 MIPS      228 ucycles
addi                                         1155.101 MIPS      229 ucycles
addiw                                        1157.651 MIPS      229 ucycles
xori                                         1156.375 MIPS      229 ucycles
ori                                          1162.785 MIPS      229 ucycles
andi                                         1186.462 MIPS      229 ucycles
slli                                         1011.410 MIPS      231 ucycles
fence                                        1949.502 MIPS      218 ucycles
fence.i                                      1949.502 MIPS      218 ucycles

arm64 trace

I also made a trace for this PR on arm64:

// mcycle check
0xfffff7c1f860 <interpret_loop+352>  add  x24, x24, #0x1        │ ++mcycle
0xfffff7c1f864 <interpret_loop+356>  cmp  x24, x26              │ mcycle < mcycle_tick_end
0xfffff7c1f868 <interpret_loop+360>  b.cc 0xfffff7c1f818        │ -> continue to fetch (fall through breaks the hot loop)
// fetch
0xfffff7c1f818 <interpret_loop+280>  eor  x1, x20, x27          │ pc ^ fetch_vaddr_page
0xfffff7c1f81c <interpret_loop+284>  cmp  x1, #0xffd            │ (pc ^ fetch_vaddr_page) < (PMA_PAGE_SIZE - 2)
0xfffff7c1f820 <interpret_loop+288>  b.hi 0xfffff7c21474        │ -> miss fetch cache
0xfffff7c1f824 <interpret_loop+292>  ldr  w19, [x20, x28]       │ insn = *(uint32_t*)(pc + fetch_vh_offset)
// decode
0xfffff7c1f828 <interpret_loop+296>  and  w1, w19, #0x7f        │ insn & 0b1111111
0xfffff7c1f82c <interpret_loop+300>  lsr  w3, w19, #5           │ insn >> 5
0xfffff7c1f830 <interpret_loop+304>  and  w3, w3, #0x780        │ (insn >> 5) & 0b11110000000
0xfffff7c1f834 <interpret_loop+308>  orr  w3, w3, w1            │ ((insn >> 5) & 0b11110000000) | (insn & 0b1111111)
0xfffff7c1f838 <interpret_loop+312>  ldr  x0, [x23, x3, lsl #3] │ load jump target from table
0xfffff7c1f83c <interpret_loop+316>  br   x0                    │ -> jump to instruction
// execute
0xfffff7c20dfc <interpret_loop+5884> add  x20, x20, #0x4        │ pc += 4
0xfffff7c20e00 <interpret_loop+5888> b    0xfffff7c1f860        │ -> jump to begin

In short:

  • mcycle check: 3 instructions
  • fetch: 4 instructions (5 on x86_64)
  • decode: 6 instructions (8 on x86_64)
  • execute: 2 instructions

It looks like arm64 is more instruction-efficient than x86_64.

@edubart edubart self-assigned this Apr 6, 2024
@edubart edubart added the optimization and enhancement labels Apr 6, 2024
@edubart edubart force-pushed the feature/optim-fetch branch 3 times, most recently from 5271d12 to a6511aa on April 7, 2024 22:24
@edubart edubart changed the title from "feat: optimize instruction fetch" to "Optimize instruction fetch and decoding" Apr 7, 2024
@edubart edubart force-pushed the feature/optim-fetch branch 5 times, most recently from e2f145c to 5f3cf35 on April 9, 2024 02:38
@edubart edubart requested a review from diegonehab April 11, 2024 14:36
@edubart edubart force-pushed the feature/optim-fetch branch 2 times, most recently from 0dfdd1a to f0ef931 on April 15, 2024 16:02
@edubart edubart force-pushed the feature/optim-fetch branch from f0ef931 to 507d7bc on April 24, 2024 21:52
@edubart edubart force-pushed the feature/optim-fetch branch from 507d7bc to 588d6ca on August 10, 2024 15:03
@edubart edubart force-pushed the feature/optim-fetch branch 4 times, most recently from 9d3098b to d847ace on September 4, 2024 18:46
@edubart edubart force-pushed the feature/optim-fetch branch from d847ace to 158681d on October 14, 2024 16:12
@edubart edubart mentioned this pull request Oct 29, 2024
@vfusco vfusco added this to the v0.19.0 milestone Dec 12, 2024
@edubart edubart force-pushed the feature/optim-fetch branch 5 times, most recently from 71ddca6 to d0c601a on December 17, 2024 23:13
@edubart edubart (Contributor, Author) commented Dec 17, 2024

I rebased my optimization PR on top of Perna's PR and the tests passed, but to my surprise the make test-uarch-compare test went from 12 min to 1 hour in CI. I dug into why, and the culprit was that the uarch pristine RAM became much larger with my big jump table, making the reset_uarch operation heavier. So I had to optimize the uarch reset operation to avoid touching unnecessary pages, and now this test takes about 5 min.
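A rough sketch of that reset optimization, with assumed names (the actual implementation may differ): compare each uarch RAM page against the pristine image and rewrite only the pages that actually changed:

// Hedged sketch: skip pages that are still pristine instead of
// unconditionally rewriting the whole uarch RAM on reset.
for (uint64_t off = 0; off < uarch_ram_length; off += PMA_PAGE_SIZE) {
    if (memcmp(ram + off, pristine_ram + off, PMA_PAGE_SIZE) != 0) {
        memcpy(ram + off, pristine_ram + off, PMA_PAGE_SIZE);
    }
}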

@edubart edubart force-pushed the feature/optim-fetch branch 3 times, most recently from b27e578 to ccd366b on December 18, 2024 19:06
@edubart edubart force-pushed the feature/optim-fetch branch from ccd366b to 0688024 on December 20, 2024 12:50
@edubart edubart force-pushed the feature/optim-fetch branch from 0688024 to 50bf4e7 on December 20, 2024 12:53
@edubart edubart changed the base branch from main to feature/sha256 December 20, 2024 18:47
@edubart edubart changed the base branch from feature/sha256 to main December 20, 2024 18:47