Optimize instruction fetch and decoding #226

Open — edubart wants to merge 2 commits into main from feature/optim-fetch
Conversation

@edubart edubart (Contributor) commented Apr 6, 2024

This optimizes our RISC-V instruction decoder by using big jump tables, through token threading, to the point that decoding takes just 2 host instructions on x86_64 (3 on arm64) for most RISC-V instructions, even compressed ones. Overall, the FENCE instruction executes in just 12 host instructions, with both GCC on AMD64 and Clang on ARM64.
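As a rough sketch of the technique (assumed names, not the PR's literal code), token threading replaces the decode compare tree with a table of handler addresses indexed directly by instruction bits, so decoding collapses to one mask and one indirect jump:

// Minimal token-threading sketch in GNU C (jump_table, pc and
// fetch_vh_offset are assumed names): the low 16 bits of an instruction
// select a handler directly, for compressed and uncompressed encodings.
uint32_t insn = *(uint32_t *)(pc + fetch_vh_offset); // fetch
goto *jump_table[insn & 0xffff];                     // decode + dispatch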

Here is the GCC x86_64 trace as proof:

//// FENCE GCC x86_64 (2/12 instructions)
// increment mcycle (3 instructions)
=> 0x7ffff7a2e98c <loop+28108>:   add    $0x1,%r15                     // mcycle += 1
=> 0x7ffff7a2e990 <loop+28112>:   cmp    %r13,%r15                     // mcycle < mcycle_tick_end
=> 0x7ffff7a2e993 <loop+28115>:   jae    0x7ffff7a2f230 <loop+30320>   // -> break loop
// fetch (5 instructions)
=> 0x7ffff7a2e999 <loop+28121>:   mov    %r10,%rbx                     // pc
=> 0x7ffff7a2e99c <loop+28124>:   xor    %rbp,%rbx                     // pc ^ fetch_vaddr_page
=> 0x7ffff7a2e99f <loop+28127>:   cmp    $0xffd,%rbx                   // check fetch page
=> 0x7ffff7a2e9a6 <loop+28134>:   ja     0x7ffff7a27d00 <loop+320>     // -> miss fetch
=> 0x7ffff7a2e9ac <loop+28140>:   mov    (%r14,%rbp,1),%ebx            // insn = *(uint32_t*)(pc + fetch_vh_offset)
// decode (2 instructions)
=> 0x7ffff7a2e9b0 <loop+28144>:   movzwl %bx,%ecx                      // insn & 0b1111111111111111
=> 0x7ffff7a2e9b3 <loop+28147>:   jmp    *(%r11,%rcx,8)                // -> jump to instruction
// execute (2 instructions)
=> 0x7ffff7a2ea3b <loop+28283>:   add    $0x4,%rbp                     // pc += 4
=> 0x7ffff7a2ea3f <loop+28287>:   jmp    0x7ffff7a2e98c <loop+28108>   // -> jump to loop begin

And the Clang arm64 trace:

//// FENCE Clang arm64 (3/12 instructions)
// increment mcycle
=> 0xfffff7b8a328 <loop+4568>:    add x25, x25, $0x1
=> 0xfffff7b8a32c <loop+4572>:    cmp x25, x27
=> 0xfffff7b8a330 <loop+4576>:    b.cs    0xfffff7b8e7a8 <loop+22104>
// fetch
=> 0xfffff7b8a334 <loop+4580>:    eor x19, x20, x28
=> 0xfffff7b8a338 <loop+4584>:    cmp x19, $0xffd
=> 0xfffff7b8a33c <loop+4588>:    b.hi    0xfffff7b89264 <loop+276>
=> 0xfffff7b8a340 <loop+4592>:    ldr w19, [x20, x22]
// decode
=> 0xfffff7b8a344 <loop+4596>:    and w10, w19, $0xffff
=> 0xfffff7b8a348 <loop+4600>:    ldr x16, [x24, x10, lsl $3]
=> 0xfffff7b8a34c <loop+4604>:    br  x16
// execute
=> 0xfffff7b8dde8 <loop+19608>:   add x20, x20, $0x4
=> 0xfffff7b8ddec <loop+19612>:   b   0xfffff7b8a328 <loop+4568>

In emulator v0.18.x, the trace for the same FENCE RISC-V instruction took about 40 x86_64 instructions.

Overall the performance varies between a 1.2x and a 2x speedup across many benchmarks relative to emulator v0.18.1. Here are the results for many benchmarks with stress-ng:

Benchmarks

Times faster Benchmark
2.56 ± 0.03 stress-ng --no-rand-seed --syscall 1 --syscall-ops 4000
2.15 ± 0.02 stress-ng --no-rand-seed --cpu 1 --cpu-method loop --cpu-ops 400
1.95 ± 0.00 stress-ng --no-rand-seed --cpu 1 --cpu-method fibonacci --cpu-ops 400
1.94 ± 0.01 stress-ng --no-rand-seed --cpu 1 --cpu-method int64 --cpu-ops 400
1.90 ± 0.01 stress-ng --no-rand-seed --memcpy 1 --memcpy-ops 50
1.88 ± 0.02 stress-ng --no-rand-seed --crypt 1 --crypt-method SHA-256 --crypt-ops 400000
1.87 ± 0.01 stress-ng --no-rand-seed --qsort 1 --qsort-ops 5
1.83 ± 0.01 stress-ng --no-rand-seed --memrate 1 --memrate-bytes 2M --memrate-ops 200
1.82 ± 0.03 stress-ng --no-rand-seed --hash 1 --hash-ops 40000
1.75 ± 0.00 stress-ng --no-rand-seed --heapsort 1 --heapsort-ops 3
1.72 ± 0.01 stress-ng --no-rand-seed --zlib 1 --zlib-ops 20
1.66 ± 0.00 stress-ng --no-rand-seed --matrix 1 --matrix-method mult --matrix-ops 20000
1.49 ± 0.02 stress-ng --no-rand-seed --hdd 1 --hdd-ops 2000
1.41 ± 0.00 stress-ng --no-rand-seed --fp 1 --fp-method floatadd --fp-ops 1000
1.33 ± 0.01 stress-ng --no-rand-seed --fma 1 --fma-ops 40000
1.24 ± 0.01 stress-ng --no-rand-seed --trig 1 --trig-ops 50
1.16 ± 0.01 stress-ng --no-rand-seed --fork 1 --fork-ops 1000
1.14 ± 0.01 stress-ng --no-rand-seed --malloc 1 --malloc-ops 40000

You can see a 1.94x speedup for integer operations. Notably, I was able to reach GHz speeds for some simple integer arithmetic benchmarks, with the interpreter being only 10~20x slower than native host execution.

The benchmark table was created by running hyperfine with stress-ng, for example:

$ hyperfine -w 1 'cartesi-machine -- stress-ng --no-rand-seed --cpu 1 --cpu-method loop --cpu-ops 400' '/usr/bin/cartesi-machine -- stress-ng --no-rand-seed --cpu 1 --cpu-method loop --cpu-ops 400'
Benchmark 1: cartesi-machine -- stress-ng --no-rand-seed --cpu 1 --cpu-method loop --cpu-ops 400
  Time (mean ± σ):      2.225 s ±  0.021 s    [User: 2.213 s, System: 0.010 s]
  Range (min … max):    2.197 s …  2.257 s    10 runs
 
Benchmark 2: /usr/bin/cartesi-machine -- stress-ng --no-rand-seed --cpu 1 --cpu-method loop --cpu-ops 400
  Time (mean ± σ):      4.615 s ±  0.041 s    [User: 4.602 s, System: 0.009 s]
  Range (min … max):    4.561 s …  4.682 s    10 runs
 
Summary
  cartesi-machine -- stress-ng --no-rand-seed --cpu 1 --cpu-method loop --cpu-ops 400 ran
    2.07 ± 0.03 times faster than /usr/bin/cartesi-machine -- stress-ng --no-rand-seed --cpu 1 --cpu-method loop --cpu-ops 400

PREVIOUS PR ITERATION COMMENTS

This is a micro-optimization, at the x86_64 assembly level, of the instruction fetch+decode hot path. In summary, this PR saves about 22 x86_64 instructions in every interpreter hot loop iteration. The optimization is not specific to x86_64; all architectures should benefit from it.

Baseline

First I generated a hot trace of consecutive FENCE.I instructions. I chose this instruction because it is the simplest one: it basically does nothing, making it ideal for measuring instruction fetch overhead. This was the trace for one iteration:

// mcycle check
0x7ffff7b88330 <interpret_loop+432>    add    $0x1,%r14                │ ++mcycle
0x7ffff7b88334 <interpret_loop+436>    cmp    %r11,%r14                │ mcycle < mcycle_tick_end
0x7ffff7b88337 <interpret_loop+439>    jae    0x7ffff7b88420           │ -> break interpret hot loop
// fetch
0x7ffff7b8833d <interpret_loop+445>    mov    %r15,%rbx                │ pc
0x7ffff7b88340 <interpret_loop+448>    and    $0xfffffffffffff000,%rbx │ vaddr_page = pc & ~PAGE_OFFSET_MASK
0x7ffff7b88347 <interpret_loop+455>    cmp    %r12,%rbx                │ vaddr_page == fetch_vaddr_page
0x7ffff7b8834a <interpret_loop+458>    jne    0x7ffff7b88728           │ -> miss fetch cache
0x7ffff7b88350 <interpret_loop+464>    lea    0x0(%r13,%r15,1),%rax    │ hptr = pc + fetch_vh_offset
0x7ffff7b88355 <interpret_loop+469>    mov    %r15,%rdx                │ pc
0x7ffff7b88358 <interpret_loop+472>    not    %rdx                     │ ~pc
0x7ffff7b8835b <interpret_loop+475>    test   $0xffe,%edx              │ ((~pc & PAGE_OFFSET_MASK) >> 1) == 0
0x7ffff7b88361 <interpret_loop+481>    je     0x7ffff7b88760           │ -> cross page boundary
0x7ffff7b88367 <interpret_loop+487>    mov    (%rax),%r9d              │ insn = *(uint32_t*)(hptr)
0x7ffff7b8836a <interpret_loop+490>    mov    %rbx,%r12                │ fetch_vaddr_page = vaddr_page
// decoding: check if it is a compressed instruction
0x7ffff7b8836d <interpret_loop+493>    mov    %r9d,%eax                │ insn
0x7ffff7b88370 <interpret_loop+496>    not    %eax                     │ ~insn
0x7ffff7b88372 <interpret_loop+498>    test   $0x3,%al                 │ (~insn & 3) > 0
0x7ffff7b88374 <interpret_loop+500>    jne    0x7ffff7b882b0           │ -> decode compressed instruction
// decoding: decode fence.i uncompressed instruction
0x7ffff7b8837a <interpret_loop+506>    mov    %r9d,%eax
0x7ffff7b8837d <interpret_loop+509>    and    $0x707f,%eax
0x7ffff7b88382 <interpret_loop+514>    cmp    $0x3023,%eax
0x7ffff7b88387 <interpret_loop+519>    je     0x7ffff7b8a868
0x7ffff7b8838d <interpret_loop+525>    ja     0x7ffff7b88500
0x7ffff7b88393 <interpret_loop+531>    cmp    $0x101b,%eax
0x7ffff7b88398 <interpret_loop+536>    je     0x7ffff7b8a820
0x7ffff7b8839e <interpret_loop+542>    ja     0x7ffff7b887d0
0x7ffff7b883a4 <interpret_loop+548>    cmp    $0x3b,%eax
0x7ffff7b883a7 <interpret_loop+551>    je     0x7ffff7b8a638
0x7ffff7b883ad <interpret_loop+557>    ja     0x7ffff7b88b68
0x7ffff7b88b68 <interpret_loop+2536>   cmp    $0x1003,%eax
0x7ffff7b88b6d <interpret_loop+2541>   je     0x7ffff7b8a494
0x7ffff7b88b73 <interpret_loop+2547>   ja     0x7ffff7b89450
0x7ffff7b89450 <interpret_loop+4816>   cmp    $0x1013,%eax
0x7ffff7b89455 <interpret_loop+4821>   je     0x7ffff7b8a6d0
0x7ffff7b8945b <interpret_loop+4827>   cmp    $0x1017,%eax
0x7ffff7b89460 <interpret_loop+4832>   je     0x7ffff7b8a71c
0x7ffff7b89466 <interpret_loop+4838>   cmp    $0x100f,%eax
0x7ffff7b8946b <interpret_loop+4843>   jne    0x7ffff7b8ab45
// execute
0x7ffff7b89471 <interpret_loop+4849>   add    $0x4,%r15                │ pc += 4
0x7ffff7b89475 <interpret_loop+4853>   jmp    0x7ffff7b88330           │ -> jump to begin

This trace keeps looping on x86_64. We can see that in optimal conditions it takes exactly 40 x86_64 instructions to execute one FENCE.I, where:

  • mcycle check: 3 instructions
  • fetch: 11 instructions
  • decoding: 24 instructions
  • execution: 2 instructions

I usually say that the Cartesi machine is about 30~40 times slower than native, and the 40:1 ratio in this trace is very close to that. If we can get this trace to execute in fewer x86_64 instructions, we can make the Cartesi machine interpreter faster for all instructions (not only this one).

If we look closely at the fetch, there are these two branches:

0x7ffff7b8834a <interpret_loop+458>    jne    0x7ffff7b88728           │ -> miss fetch cache
0x7ffff7b88361 <interpret_loop+481>    je     0x7ffff7b88760           │ -> cross page boundary

My idea was to come up with a single branch that tests both conditions in the fetch path, collapsing the two branches into one and saving some instructions.
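A minimal sketch of the combined check in C, assuming the field names from the trace annotations (fetch_vaddr_page caches the base address of the last fetched page, and PMA_PAGE_SIZE is 0x1000):

// Hedged sketch: on a hit, pc ^ fetch_vaddr_page is just the page offset
// of pc; on a miss it has bits set above the page offset, making the
// value huge. One unsigned compare thus rejects both a wrong page and a
// pc too close to the page end to hold a whole instruction.
static inline bool fetch_fast_path(uint64_t pc, uint64_t fetch_vaddr_page) {
    return (pc ^ fetch_vaddr_page) < PMA_PAGE_SIZE - 2;
}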

These are the baseline benchmark numbers:

$ lua bench-insns.lua
RISC-V Privileged Memory-management             4.205 MIPS   3250.0 ucycles
RISC-V Privileged Interrupt-management        495.199 MIPS    257.0 ucycles
RV64I - Base integer instruction set          574.503 MIPS    263.8 ucycles
RV64M - Integer multiplication and division   575.371 MIPS    312.2 ucycles
RV64A - Atomic instructions                   431.893 MIPS    304.2 ucycles
RV64F - Single-precision floating-point       218.556 MIPS    489.3 ucycles
RV64D - Double-precision floating-point       214.964 MIPS   1773.7 ucycles
RV64Zicsr - Control and status registers      289.061 MIPS    328.3 ucycles
RV64Zicntr - Base counters and timers         343.063 MIPS    284.3 ucycles
RV64Zifence - Instruction fetch fence         792.319 MIPS    246.0 ucycles

$ hyperfine -w 10 -m 32 "cartesi-machine dhrystone 2000000"
Benchmark 1: cartesi-machine dhrystone 2000000
  Time (mean ± σ):      1.333 s ±  0.007 s    [User: 1.326 s, System: 0.006 s]
  Range (min … max):    1.321 s …  1.356 s    32 runs

Round 1 - Optimize fetch

After some thinking I came up with the changes presented in this PR to optimize instruction fetch, which generate the following new trace:

// mcycle check
0x7ffff7b873e0 <interpret_loop+432>    add    $0x1,%r15               │ ++mcycle
0x7ffff7b873e4 <interpret_loop+436>    cmp    %r10,%r15               │ mcycle < mcycle_tick_end
0x7ffff7b873e7 <interpret_loop+439>    jae    0x7ffff7b874b0          │ -> break interpret hot loop
// fetch
0x7ffff7b873ed <interpret_loop+445>    mov    %rbp,%rax               │ pc
0x7ffff7b873f0 <interpret_loop+448>    xor    %r14,%rax               │ pc ^ fetch_vaddr_page
0x7ffff7b873f3 <interpret_loop+451>    cmp    $0xffd,%rax             │ (pc ^ fetch_vaddr_page) < (PMA_PAGE_SIZE - 2)
0x7ffff7b873f9 <interpret_loop+457>    ja     0x7ffff7b877d8          │ -> miss fetch cache
0x7ffff7b873ff <interpret_loop+463>    mov    0x0(%rbp,%r13,1),%ebx   │ insn = *(uint32_t*)(pc + fetch_vh_offset)
// decoding: check if it is a compressed instruction
0x7ffff7b87404 <interpret_loop+468>    mov    %ebx,%eax               │ insn
0x7ffff7b87406 <interpret_loop+470>    not    %eax                    │ ~insn
0x7ffff7b87408 <interpret_loop+472>    test   $0x3,%al                │ (~insn & 3) > 0
0x7ffff7b8740a <interpret_loop+474>    jne    0x7ffff7b87360          │ -> decode compressed instruction
// decoding: decode fence.i uncompressed instruction
0x7ffff7b87410 <interpret_loop+480>    mov    %ebx,%eax
0x7ffff7b87412 <interpret_loop+482>    and    $0x707f,%eax
0x7ffff7b87417 <interpret_loop+487>    cmp    $0x3023,%eax
0x7ffff7b8741c <interpret_loop+492>    je     0x7ffff7b891e8
0x7ffff7b87422 <interpret_loop+498>    ja     0x7ffff7b87590
0x7ffff7b87428 <interpret_loop+504>    cmp    $0x101b,%eax
0x7ffff7b8742d <interpret_loop+509>    je     0x7ffff7b899f4
0x7ffff7b87433 <interpret_loop+515>    ja     0x7ffff7b87838
0x7ffff7b87439 <interpret_loop+521>    cmp    $0x3b,%eax
0x7ffff7b8743c <interpret_loop+524>    je     0x7ffff7b892b9
0x7ffff7b87442 <interpret_loop+530>    ja     0x7ffff7b87bc0
0x7ffff7b87bc0 <interpret_loop+2448>   cmp    $0x1003,%eax
0x7ffff7b87bc5 <interpret_loop+2453>   je     0x7ffff7b89030
0x7ffff7b87bcb <interpret_loop+2459>   ja     0x7ffff7b88680
0x7ffff7b88680 <interpret_loop+5200>   cmp    $0x1013,%eax
0x7ffff7b88685 <interpret_loop+5205>   je     0x7ffff7b897b0
0x7ffff7b8868b <interpret_loop+5211>   cmp    $0x1017,%eax
0x7ffff7b88690 <interpret_loop+5216>   je     0x7ffff7b89317
0x7ffff7b88696 <interpret_loop+5222>   cmp    $0x100f,%eax
0x7ffff7b8869b <interpret_loop+5227>   jne    0x7ffff7b89b7a
// execute
0x7ffff7b886a1 <interpret_loop+5233>   add    $0x4,%rbp               │ pc += 4
0x7ffff7b886a5 <interpret_loop+5237>   jmp    0x7ffff7b873e0          │ -> jump to begin

We can see that in optimal conditions it takes exactly 34 x86_64 instructions to execute one FENCE.I in this trace, where:

  • mcycle check: 3 instructions (same as before)
  • fetch: 5 instructions (-6 instructions)
  • decoding: 24 instructions (same as before)
  • execution: 2 instructions (same as before)

So in summary, 6 instructions were optimized out of the very hot path. These are the new benchmark numbers:

$ lua bench-insns.lua
RISC-V Privileged Memory-management             4.227 MIPS   3244.0 ucycles
RISC-V Privileged Interrupt-management        509.389 MIPS    257.0 ucycles
RV64I - Base integer instruction set          611.176 MIPS    264.3 ucycles
RV64M - Integer multiplication and division   606.949 MIPS    312.9 ucycles
RV64A - Atomic instructions                   449.197 MIPS    304.6 ucycles
RV64F - Single-precision floating-point       227.302 MIPS    489.7 ucycles
RV64D - Double-precision floating-point       225.613 MIPS   1774.1 ucycles
RV64Zicsr - Control and status registers      300.756 MIPS    329.0 ucycles
RV64Zicntr - Base counters and timers         354.946 MIPS    283.3 ucycles
RV64Zifence - Instruction fetch fence         847.884 MIPS    246.0 ucycles

$ hyperfine -w 10 -m 32 "cartesi-machine dhrystone 2000000"
Benchmark 1: cartesi-machine dhrystone 2000000
  Time (mean ± σ):      1.269 s ±  0.011 s    [User: 1.261 s, System: 0.007 s]
  Range (min … max):    1.257 s …  1.317 s    32 runs

We can see improvements in all benchmarks, where:

  • Dhrystone took about 1.269 / 1.333 = 95% of the original time to execute, i.e. 5% less.
  • 611.176 / 574.503 ≈ 1.06, a ~6% improvement in execution speed for the RV64I instruction set.

Round 2 - Optimize decoding for uncompressed instruction

Decoding uses 24 of the 34 instructions in the trace, about 70%! It dominates the hot loop trace. Imagine if we could cut it in half; maybe we can, with jump tables.
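A sketch of the idea (assumed helper names, simplified): pack the opcode and funct3 fields into one dense key and switch on it, so GCC can emit a single bounded jump table instead of a chain of compares:

// Hedged sketch: a 10-bit switch key built from opcode (insn bits 6:0)
// and funct3 (insn bits 14:12, moved to key bits 9:7), as in the trace.
static inline uint32_t decode_key(uint32_t insn) {
    return (insn & 0x7f) | ((insn >> 5) & 0x380);
}

static void decode_and_execute(uint32_t insn) {
    switch (decode_key(insn)) {
        case 0x00f: execute_fence(); break; // opcode 0001111, funct3 000
        // ... one case per opcode/funct3 pair, with dense key values ...
        default: raise_illegal_insn(); break;
    }
}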

EDIT: I decided to give it a try and restructured the decoding code so that the GCC compiler can optimize it into jump tables. After some thinking and research I added a new commit to this PR, and this is the new trace:

// mcycle check
0x7ffff7b82990 <interpret_loop+560>  add    $0x1,%r14          │ ++mcycle
0x7ffff7b82994 <interpret_loop+564>  cmp    %r11,%r14          │ mcycle < mcycle_tick_end
0x7ffff7b82997 <interpret_loop+567>  jb     0x7ffff7b828a0     │ -> continue to fetch (fall through breaks the hot loop)
// fetch
0x7ffff7b828a0 <interpret_loop+320>  mov    %r15,%rax          │ pc
0x7ffff7b828a3 <interpret_loop+323>  xor    %r13,%rax          │ pc ^ fetch_vaddr_page
0x7ffff7b828a6 <interpret_loop+326>  cmp    $0xffd,%rax        │ (pc ^ fetch_vaddr_page) < (PMA_PAGE_SIZE - 2)
0x7ffff7b828ac <interpret_loop+332>  ja     0x7ffff7b844f0     │ -> miss fetch cache
0x7ffff7b828b2 <interpret_loop+338>  mov    (%r15,%r12,1),%ebx │ insn = *(uint32_t*)(pc + fetch_vh_offset)
// decoding: check if it is a compressed instruction
0x7ffff7b828b6 <interpret_loop+342>  mov    %ebx,%ecx          │ insn
0x7ffff7b828b8 <interpret_loop+344>  and    $0x3,%ecx          │ insn & 3
0x7ffff7b828bb <interpret_loop+347>  cmp    $0x3,%ecx          │ (insn & 3) == 3
0x7ffff7b828be <interpret_loop+350>  je     0x7ffff7b83100     │ -> decode uncompressed instruction
// decoding: decode fence.i uncompressed instruction
0x7ffff7b83100 <interpret_loop+2464> mov    %ebx,%eax          │ insn
0x7ffff7b83102 <interpret_loop+2466> mov    %ebx,%edx          │ insn
0x7ffff7b83104 <interpret_loop+2468> shr    $0x5,%eax          │ insn >> 5
0x7ffff7b83107 <interpret_loop+2471> and    $0x7f,%edx         │ insn & 0b1111111
0x7ffff7b8310a <interpret_loop+2474> and    $0x380,%eax        │ (insn >> 5) & 0b1110000000
0x7ffff7b8310f <interpret_loop+2479> or     %edx,%eax          │ ((insn >> 5) & 0b1110000000) | (insn & 0b1111111)
0x7ffff7b83111 <interpret_loop+2481> lea    -0x3(%rax),%edx    │ compute index into jump table
0x7ffff7b83114 <interpret_loop+2484> cmp    $0x3f0,%edx        │ check if index is valid
0x7ffff7b8311a <interpret_loop+2490> ja     0x7ffff7b83130     │ -> illegal instruction
0x7ffff7b8311c <interpret_loop+2492> lea    0x3b711(%rip),%rdi │ load jump table base address
0x7ffff7b83123 <interpret_loop+2499> movslq (%rdi,%rdx,4),%rdx │ load jump offset for given index
0x7ffff7b83127 <interpret_loop+2503> add    %rdi,%rdx          │ compute instruction jump address
0x7ffff7b8312a <interpret_loop+2506> jmp    *%rdx              │ -> jump to instruction
// execute
0x7ffff7b83590 <interpret_loop+3632> add    $0x4,%r15          │ pc += 4
0x7ffff7b83594 <interpret_loop+3636> jmp    0x7ffff7b82990     │ -> jump to begin

We can see that it now takes exactly 27 x86_64 instructions to execute one FENCE.I in this trace, where:

  • mcycle check: 3 instructions (same as before)
  • fetch: 5 instructions (-6 instructions from baseline)
  • decoding: 17 instructions (-7 instructions from baseline)
  • execution: 2 instructions (same as before)

However, this adds one memory indirection to look up the jump table. This is fine: the lookup will most likely hit the L1 CPU cache.

These are the new benchmark numbers:

$ lua bench-insns.lua
-- Average instruction set speed --
RISC-V Privileged Memory-management             4.215 MIPS   3234.0 ucycles
RISC-V Privileged Interrupt-management        835.057 MIPS    257.0 ucycles
RV64I - Base integer instruction set          780.803 MIPS    259.1 ucycles
RV64M - Integer multiplication and division   673.788 MIPS    306.8 ucycles
RV64A - Atomic instructions                   523.084 MIPS    298.6 ucycles
RV64F - Single-precision floating-point       347.813 MIPS    482.4 ucycles
RV64D - Double-precision floating-point       374.271 MIPS   1310.9 ucycles
RV64Zicsr - Control and status registers      321.237 MIPS    321.7 ucycles
RV64Zicntr - Base counters and timers         488.825 MIPS    279.3 ucycles
RV64Zifence - Instruction fetch fence        1475.150 MIPS    241.0 ucycles

$ hyperfine -w 10 -m 32 "cartesi-machine dhrystone 2000000"
Benchmark 1: cartesi-machine dhrystone 2000000
  Time (mean ± σ):      1.055 s ±  0.023 s    [User: 1.048 s, System: 0.006 s]
  Range (min … max):    1.041 s …  1.167 s    32 runs

Whoa, that is:

  • -21% execution time for dhrystone benchmark
  • FENCE.I instruction is +86% faster
  • RV64I instructions are +35% faster

Also, some instructions now run at over 1 GHz (more than 1000 MIPS)!

lui                                          1019.273 MIPS      248 ucycles
auipc                                        1014.344 MIPS      249 ucycles
beq                                          1009.463 MIPS      250 ucycles
bne                                          1021.258 MIPS      247 ucycles
blt                                          1021.258 MIPS      247 ucycles
bge                                          1018.283 MIPS      243 ucycles
bltu                                         1021.258 MIPS      244 ucycles
bgeu                                         1013.364 MIPS      248 ucycles

Round 3 - Single jump with computed gotos

Wasting 4 instructions every iteration just to check whether an instruction is compressed is not ideal. We could try to compile the compressed instruction switch and the uncompressed instruction switch into a single switch.

After I tried a very large switch (2048 entries), GCC refused to compile it into a single big jump table, so I built my own: a 2048-entry array generated by a Lua script, dispatched with GCC's computed goto extension.
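A minimal sketch of this dispatch in GNU C (machine_state, fetch_insn, raise_illegal_insn and the labels are assumed names; the real 2048-entry table in the PR is generated by a Lua script):

// Hedged sketch: computed gotos with one table for compressed and
// uncompressed instructions. The index uses insn bits [6:0] and [15:12],
// which distinguish the compressed quadrants and every uncompressed
// opcode/funct3 combination, so no separate compressed check is needed.
static void interpret_loop(machine_state *s) {
    static const void *const dispatch[2048] = {
        [0x00f] = &&lbl_fence, // opcode 0001111, funct3 000
        // ... all other entries generated; invalid ones -> &&lbl_illegal
    };
    uint32_t insn;
#define DISPATCH() goto *dispatch[(insn & 0x7f) | ((insn >> 5) & 0x780)]
    insn = fetch_insn(s);
    DISPATCH();
lbl_fence:
    s->pc += 4;
    insn = fetch_insn(s);
    DISPATCH(); // no bounds check: every index maps to some label
lbl_illegal:
    raise_illegal_insn(s);
}

This is the new trace: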

// mcycle check
0x7ffff7b7e9d8 <interpret_loop+424>  add    $0x1,%r12             │ ++mcycle
0x7ffff7b7e9dc <interpret_loop+428>  cmp    %r10,%r12             │ mcycle < mcycle_tick_end
0x7ffff7b7e9df <interpret_loop+431>  jb     0x7ffff7b7e950        │ -> continue to fetch (fall through breaks the hot loop)
// fetch
0x7ffff7b7e950 <interpret_loop+288>  mov    %rbp,%rax             │ pc
0x7ffff7b7e953 <interpret_loop+291>  xor    %r15,%rax             │ pc ^ fetch_vaddr_page
0x7ffff7b7e956 <interpret_loop+294>  cmp    $0xffd,%rax           │ (pc ^ fetch_vaddr_page) < (PMA_PAGE_SIZE - 2)
0x7ffff7b7e95c <interpret_loop+300>  ja     0x7ffff7b80d30        │ -> miss fetch cache
0x7ffff7b7e962 <interpret_loop+306>  mov    0x0(%rbp,%r14,1),%ebx │ insn = *(uint32_t*)(pc + fetch_vh_offset)
// decode
0x7ffff7b7e967 <interpret_loop+311>  mov    %ebx,%eax             │ insn
0x7ffff7b7e969 <interpret_loop+313>  mov    %ebx,%edx             │ insn
0x7ffff7b7e96b <interpret_loop+315>  lea    0x622ae(%rip),%rdi    │ load jump table base address
0x7ffff7b7e972 <interpret_loop+322>  shr    $0x5,%eax             │ insn >> 5
0x7ffff7b7e975 <interpret_loop+325>  and    $0x7f,%edx            │ insn & 0b1111111
0x7ffff7b7e978 <interpret_loop+328>  and    $0x780,%eax           │ (insn >> 5) & 0b11110000000
0x7ffff7b7e97d <interpret_loop+333>  or     %edx,%eax             │ ((insn >> 5) & 0b11110000000) | (insn & 0b1111111)
0x7ffff7b7e97f <interpret_loop+335>  jmp    *(%rdi,%rax,8)        │ -> jump to instruction
// execute
0x7ffff7b7fe55 <interpret_loop+5669> add    $0x4,%rbp             │ pc += 4
0x7ffff7b7fe59 <interpret_loop+5673> jmp    0x7ffff7b7e9d8        │ -> jump to begin

We can see that it now takes exactly 18 x86_64 instructions to execute one FENCE.I in this trace, where:

  • mcycle check: 3 instructions (same as before)
  • fetch: 5 instructions (-6 instructions from baseline)
  • decoding: 8 instructions (-16 instructions from baseline)
  • execution: 2 instructions (same as before)

So we went from 40 instructions at baseline down to 18. This should improve performance for all instructions, because every instruction goes through fetch and decode.

Let's see the benchmarks:

$ lua bench-insns.lua
-- Average instruction set speed --
RISC-V Privileged Memory-management             4.267 MIPS   3203.0 ucycles
RISC-V Privileged Interrupt-management        968.451 MIPS    234.0 ucycles
RV64I - Base integer instruction set          895.744 MIPS    237.5 ucycles
RV64M - Integer multiplication and division   747.252 MIPS    286.2 ucycles
RV64A - Atomic instructions                   567.218 MIPS    278.4 ucycles
RV64F - Single-precision floating-point       331.036 MIPS    445.3 ucycles
RV64D - Double-precision floating-point       342.071 MIPS   1273.0 ucycles
RV64Zicsr - Control and status registers      352.496 MIPS    303.0 ucycles
RV64Zicntr - Base counters and timers         503.979 MIPS    263.3 ucycles
RV64Zifence - Instruction fetch fence        1949.502 MIPS    218.0 ucycles

$ hyperfine -w 10 -m 32 "cartesi-machine dhrystone 2000000"
Benchmark 1: cartesi-machine dhrystone 2000000
  Time (mean ± σ):     986.4 ms ±  26.5 ms    [User: 979.7 ms, System: 7.1 ms]
  Range (min … max):   968.3 ms … 1083.9 ms    32 runs

Whoa, that is:

  • -27% execution time for dhrystone benchmark
  • FENCE.I instruction is +146% faster
  • RV64I instructions are +55% faster

Also, many instructions now run at over 1 GHz:

lui                                          1314.326 MIPS      225 ucycles
auipc                                        1309.403 MIPS      226 ucycles
beq                                          1239.754 MIPS      228 ucycles
bne                                          1220.992 MIPS      228 ucycles
blt                                          1223.841 MIPS      228 ucycles
bge                                          1296.455 MIPS      228 ucycles
bltu                                         1306.142 MIPS      228 ucycles
bgeu                                         1311.040 MIPS      228 ucycles
addi                                         1155.101 MIPS      229 ucycles
addiw                                        1157.651 MIPS      229 ucycles
xori                                         1156.375 MIPS      229 ucycles
ori                                          1162.785 MIPS      229 ucycles
andi                                         1186.462 MIPS      229 ucycles
slli                                         1011.410 MIPS      231 ucycles
fence                                        1949.502 MIPS      218 ucycles
fence.i                                      1949.502 MIPS      218 ucycles

arm64 trace

I also made a trace for this PR on arm64:

// mcycle check
0xfffff7c1f860 <interpret_loop+352>  add  x24, x24, #0x1        │ ++mcycle
0xfffff7c1f864 <interpret_loop+356>  cmp  x24, x26              │ mcycle < mcycle_tick_end
0xfffff7c1f868 <interpret_loop+360>  b.cc 0xfffff7c1f818        │ -> continue to fetch (fall through breaks the hot loop)
// fetch
0xfffff7c1f818 <interpret_loop+280>  eor  x1, x20, x27          │ pc ^ fetch_vaddr_page
0xfffff7c1f81c <interpret_loop+284>  cmp  x1, #0xffd            │ (pc ^ fetch_vaddr_page) < (PMA_PAGE_SIZE - 2)
0xfffff7c1f820 <interpret_loop+288>  b.hi 0xfffff7c21474        │ -> miss fetch cache
0xfffff7c1f824 <interpret_loop+292>  ldr  w19, [x20, x28]       │ insn = *(uint32_t*)(pc + fetch_vh_offset)
// decode
0xfffff7c1f828 <interpret_loop+296>  and  w1, w19, #0x7f        │ insn & 0b1111111
0xfffff7c1f82c <interpret_loop+300>  lsr  w3, w19, #5           │ insn >> 5
0xfffff7c1f830 <interpret_loop+304>  and  w3, w3, #0x780        │ (insn >> 5) & 0b11110000000
0xfffff7c1f834 <interpret_loop+308>  orr  w3, w3, w1            │ ((insn >> 5) & 0b11110000000) | (insn & 0b1111111)
0xfffff7c1f838 <interpret_loop+312>  ldr  x0, [x23, x3, lsl #3] │ load jump target from table
0xfffff7c1f83c <interpret_loop+316>  br   x0                    │ -> jump to instruction
// execute
0xfffff7c20dfc <interpret_loop+5884> add  x20, x20, #0x4        │ pc += 4
0xfffff7c20e00 <interpret_loop+5888> b    0xfffff7c1f860        │ -> jump to begin

In short:

  • mcycle check: 3 instructions
  • fetch: 4 instructions (5 on x86_64)
  • decode: 6 instructions (8 on x86_64)
  • execute: 2 instructions

It looks like arm64 is more instruction-efficient than x86_64.

@edubart edubart self-assigned this Apr 6, 2024
@edubart edubart added the optimization and enhancement labels Apr 6, 2024
@edubart edubart force-pushed the feature/optim-fetch branch 3 times, most recently from 5271d12 to a6511aa on April 7, 2024 22:24
@edubart edubart changed the title from "feat: optimize instruction fetch" to "Optimize instruction fetch and decoding" Apr 7, 2024
@edubart edubart force-pushed the feature/optim-fetch branch 5 times, most recently from e2f145c to 5f3cf35 on April 9, 2024 02:38
@edubart edubart requested a review from diegonehab April 11, 2024 14:36
@edubart edubart force-pushed the feature/optim-fetch branch 2 times, most recently from 0dfdd1a to f0ef931 on April 15, 2024 16:02
@edubart edubart force-pushed the feature/optim-fetch branch from f0ef931 to 507d7bc on April 24, 2024 21:52
@edubart edubart force-pushed the feature/optim-fetch branch from 507d7bc to 588d6ca on August 10, 2024 15:03
@edubart edubart force-pushed the feature/optim-fetch branch 4 times, most recently from 9d3098b to d847ace on September 4, 2024 18:46
@edubart edubart force-pushed the feature/optim-fetch branch from d847ace to 158681d on October 14, 2024 16:12
@edubart edubart mentioned this pull request Oct 29, 2024
@vfusco vfusco added this to the v0.19.0 milestone Dec 12, 2024
@edubart edubart force-pushed the feature/optim-fetch branch 5 times, most recently from 71ddca6 to d0c601a on December 17, 2024 23:13
@edubart edubart (Contributor, Author) commented Dec 17, 2024

I rebased my optimization PR on top of Perna's PR and the tests passed, but to my surprise the make test-uarch-compare test went from 12 min to 1 hour in CI. I dug into why, and the culprit was that the uarch pristine RAM became much larger with my big jump table, making the reset_uarch operation heavier. So I had to optimize the uarch reset operation to avoid touching unnecessary pages, and now this test takes about 5 min.
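A rough sketch of that reset optimization, with assumed names (the actual implementation may differ): compare each uarch RAM page against the pristine image and rewrite only the pages that actually changed:

// Hedged sketch: skip pages that are still pristine instead of
// unconditionally rewriting the whole uarch RAM on reset.
for (uint64_t off = 0; off < uarch_ram_length; off += PMA_PAGE_SIZE) {
    if (memcmp(ram + off, pristine_ram + off, PMA_PAGE_SIZE) != 0) {
        memcpy(ram + off, pristine_ram + off, PMA_PAGE_SIZE);
    }
}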

@edubart edubart force-pushed the feature/optim-fetch branch 3 times, most recently from b27e578 to ccd366b on December 18, 2024 19:06
@edubart edubart force-pushed the feature/optim-fetch branch from ccd366b to 0688024 on December 20, 2024 12:50
@edubart edubart force-pushed the feature/optim-fetch branch from 0688024 to 50bf4e7 on December 20, 2024 12:53
@edubart edubart changed the base branch from main to feature/sha256 December 20, 2024 18:47
@edubart edubart changed the base branch from feature/sha256 to main December 20, 2024 18:47