Benchmarks

Eduardo Bart edited this page Dec 19, 2024 · 22 revisions

| Stress benchmark | CM 0.19 | CM 0.18 | QEMU 9.1 | stress-ng command |
|---|---|---|---|---|
| Sieve of Eratosthenes | 22.30 ± 0.58 | 41.48 ± 1.15 | 5.59 ± 0.15 | --cpu 1 --cpu-method sieve --cpu-ops 600 |
| Fibonacci sequence | 15.52 ± 0.16 | 30.27 ± 0.30 | 4.82 ± 0.33 | --cpu 1 --cpu-method fibonacci --cpu-ops 500 |
| Heapsort | 12.81 ± 0.12 | 22.66 ± 0.36 | 6.67 ± 0.07 | --heapsort 1 --heapsort-ops 5 |
| JPEG compression | 50.53 ± 1.89 | 76.74 ± 2.87 | 25.03 ± 0.94 | --jpeg 1 --jpeg-ops 15 |
| Zlib compression | 7.82 ± 0.26 | 13.60 ± 0.45 | 4.01 ± 0.14 | --zlib 1 --zlib-ops 30 |
| Quicksort | 18.38 ± 0.24 | 30.54 ± 0.20 | 10.29 ± 0.08 | --qsort 1 --qsort-ops 8 |
| Hashing functions | 16.35 ± 0.38 | 29.98 ± 0.64 | 6.45 ± 0.14 | --hash 1 --hash-ops 70000 |
| SHA-256 hashing | 41.48 ± 1.50 | 79.25 ± 3.26 | 19.87 ± 0.92 | --crypt 1 --crypt-method SHA-256 --crypt-ops 500000 |
| CPU int8 arithmetic | 29.95 ± 0.92 | 59.24 ± 1.93 | 14.48 ± 0.45 | --cpu 1 --cpu-method int8 --cpu-ops 700 |
| CPU int32 arithmetic | 28.46 ± 0.88 | 54.60 ± 1.64 | 13.95 ± 0.42 | --cpu 1 --cpu-method int32 --cpu-ops 700 |
| CPU int64 arithmetic | 27.78 ± 0.36 | 54.35 ± 0.74 | 13.41 ± 0.17 | --cpu 1 --cpu-method int64 --cpu-ops 700 |
| CPU int128 arithmetic | 20.34 ± 0.33 | 43.47 ± 0.78 | 6.99 ± 0.11 | --cpu 1 --cpu-method int128 --cpu-ops 800 |
| CPU float32 arithmetic | 23.54 ± 0.34 | 35.03 ± 0.50 | 13.69 ± 0.20 | --cpu 1 --cpu-method float --cpu-ops 600 |
| CPU float64 arithmetic | 35.81 ± 0.83 | 53.57 ± 1.24 | 19.18 ± 0.46 | --cpu 1 --cpu-method double --cpu-ops 300 |
| CPU looping | 11.14 ± 0.12 | 24.17 ± 0.28 | 3.08 ± 0.03 | --cpu 1 --cpu-method loop --cpu-ops 800 |
| CPU NO-OP instruction | 16.77 ± 0.22 | 37.46 ± 0.49 | 2.13 ± 0.04 | --nop 1 --nop-ops 100000 |
| CPU atomic operations | 4.79 ± 0.22 | 4.90 ± 0.23 | 1.64 ± 0.08 | --atomic 1 --atomic-ops 500 |
| CPU branch prediction | 2.99 ± 0.01 | 5.40 ± 0.05 | 10.37 ± 0.13 | --branch 1 --branch-ops 300000 |
| CPU cache thrashing | 14.63 ± 0.07 | 21.42 ± 0.08 | 8.42 ± 0.06 | --cache 1 --cache-ops 150000 |
| CPU cache line | 47.42 ± 2.10 | 79.50 ± 3.53 | 12.22 ± 0.55 | --cacheline 1 --cacheline-ops 125 |
| CPU read cycle | 96.83 ± 10.5 | 162.35 ± 17.7 | 14.91 ± 1.64 | --clock 1 --clock-ops 2000 |
| CPU pipeline execution | 18.74 ± 0.32 | 22.23 ± 0.38 | 83.88 ± 2.18 | --goto 1 --goto-ops 500000 |
| CPU instruction cache thrashing | 26.87 ± 0.86 | 22.22 ± 0.70 | 37.19 ± 1.22 | --icache 1 --icache-ops 200 |
| CPU branching instruction cache | 1.16 ± 0.01 | 1.22 ± 0.03 | 8.23 ± 0.09 | --far-branch 1 --far-branch-ops 200 |
| CPU registers read/write | 5.34 ± 0.05 | 12.57 ± 0.59 | 1.09 ± 0.01 | --regs 1 --regs-ops 15000 |
| CPU function call | 24.83 ± 0.50 | 44.69 ± 0.76 | 17.29 ± 0.31 | --funccall 1 --funccall-ops 400 |
| CPU bitwise arithmetic | 38.40 ± 1.37 | 78.63 ± 2.83 | 15.22 ± 0.55 | --cpu 1 --cpu-method bitops --cpu-ops 400 |
| CPU page table and TLB | 39.22 ± 0.96 | 39.56 ± 0.98 | 23.72 ± 0.66 | --pagemove 1 --pagemove-ops 30 |
| CPU TLB shootdown | 18.48 ± 1.36 | 24.40 ± 1.80 | 11.11 ± 0.96 | --tlb-shootdown 1 --tlb-shootdown-ops 2000 |
| Memory copy | 33.76 ± 0.84 | 64.19 ± 1.59 | 10.77 ± 0.27 | --memcpy 1 --memcpy-ops 80 |
| Memory read/write | 12.14 ± 1.02 | 21.58 ± 0.79 | 4.19 ± 0.17 | --memrate 1 --memrate-bytes 2M --memrate-ops 400 |
| Memory mapping | 21.45 ± 0.42 | 23.17 ± 0.49 | 14.35 ± 0.29 | --mmap 1 --mmap-bytes 96M --mmap-ops 4 |
| Memory and cache thrashing | 6.54 ± 0.13 | 9.55 ± 0.19 | 9.76 ± 0.15 | --randlist 1 --randlist-ops 250 |
| Virtual memory page fault | 19.76 ± 1.08 | 26.53 ± 1.46 | 19.71 ± 1.08 | --fault 1 --fault-ops 10000 |
| Virtual memory read/write | 21.46 ± 0.86 | 37.28 ± 1.49 | 12.46 ± 0.50 | --vm 1 --vm-bytes 96M --vm-ops 20000 |
| Virtual memory addressing | 20.63 ± 0.29 | 37.71 ± 0.50 | 6.67 ± 0.38 | --vm-addr 1 --vm-addr-ops 20 |
| Process forking | 9.61 ± 0.23 | 11.14 ± 0.21 | 7.99 ± 0.23 | --fork 1 --fork-ops 2000 |
| Process context switching | 9.42 ± 1.23 | 6.40 ± 0.83 | 12.32 ± 1.61 | --switch 1 --switch-ops 200000 |
| File read/write | 20.31 ± 0.43 | 34.64 ± 0.73 | 6.14 ± 0.15 | --hdd 1 --hdd-ops 6000 |
| Threading | 101.03 ± 4.24 | 115.29 ± 4.83 | 19.06 ± 0.82 | --pthread 1 --pthread-ops 1500 |
| Linux system calls | 14.94 ± 0.95 | 14.30 ± 0.90 | 1.54 ± 0.10 | --syscall 1 --syscall-ops 4000 |
| Integer vector arithmetic | 118.93 ± 8.70 | 126.12 ± 9.22 | 19.95 ± 1.47 | --vecmath 1 --vecmath-ops 100 |
| Integer wide vector arithmetic | 165.38 ± 16.0 | 207.88 ± 20.1 | 45.03 ± 4.39 | --vecwide 1 --vecwide-ops 600 |
| Multi-precision floating-point | 38.58 ± 0.83 | 54.01 ± 1.16 | 13.90 ± 0.33 | --mpfr 1 --mpfr-ops 200 |
| Floating-point square root | 183.36 ± 29.42 | 348.78 ± 55.96 | 63.78 ± 10.30 | --cpu 1 --cpu-method sqrt --cpu-ops 20 |
| Floating-point FMA | 101.68 ± 5.88 | 135.14 ± 7.61 | 36.92 ± 2.09 | --fma 1 --fma-ops 100000 |
| Floating-point math | 110.51 ± 8.23 | 222.73 ± 16.7 | 52.28 ± 3.88 | --fp 1 --fp-ops 150 |
| Floating-point matmul | 38.12 ± 0.92 | 49.40 ± 1.78 | 15.53 ± 0.38 | --matrix 1 --matrix-method prod --matrix-ops 150 |
| Floating-point trigonometric | 97.03 ± 8.68 | 122.02 ± 10.9 | 43.83 ± 4.31 | --trig 1 --trig-ops 80 |
| Floating-point vector math | 61.55 ± 5.05 | 107.38 ± 8.81 | 26.70 ± 2.24 | --vecfp 1 --vecfp-ops 200 |

How to read: lower is better. Each number is the slowdown relative to the same benchmark run on the host. For example, 5.30 ± 0.13 means the benchmark ran 5.30 times faster on the host than in the guest, with a standard deviation of 0.13.
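As a rough illustration of how a figure in this form could be derived, the sketch below divides each guest run time by a host run time and reports the mean and sample standard deviation of the ratios. The page does not describe the actual measurement script, so the function name `relative_slowdown`, the use of the sample standard deviation, and the timings are all assumptions made for illustration.

```python
from statistics import mean, stdev

def relative_slowdown(host_seconds, guest_runs_seconds):
    """Return (mean, sample standard deviation) of guest time / host time."""
    ratios = [guest / host_seconds for guest in guest_runs_seconds]
    return mean(ratios), stdev(ratios)

# Hypothetical wall-clock timings in seconds, NOT taken from the table.
host = 2.0
guest_runs = [10.4, 10.6, 10.9]

avg, sd = relative_slowdown(host, guest_runs)
print(f"{avg:.2f} ± {sd:.2f}")
```

A result above 1 means the guest was slower than the host, and a result below 1 would mean it was faster.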

Benchmark Notes

  • All benchmarks were limited to 1 CPU core.
  • Guest benchmark times include the time to boot and shut down Linux, while host runs do not.
  • Both QEMU and Cartesi Machine used the same guest kernel and guest rootfs.
  • QEMU is generally faster because it uses a JIT (just-in-time) compiler.
  • Floating-point benchmarks are slower because of software emulation.
  • The vector math benchmarks are slower because the guest CPU has no SIMD instructions, while the host does.
  • The square root benchmark is the slowest because square root is the heaviest instruction in the Cartesi Machine.
  • You can find more information about what each benchmark does in the stress-ng man pages.
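To reproduce a single host-side measurement, the flags in the table's last column can be passed directly to stress-ng. The sketch below is one possible way to do that from Python. The helper names `stress_ng_argv` and `time_host_run` are hypothetical, and the harness actually used to produce these results is not shown on this page. It assumes `stress-ng` is installed and on `PATH`.

```python
import shlex
import subprocess
import time

def stress_ng_argv(flags):
    """Build the argument list for one stress-ng run from a table entry."""
    return ["stress-ng", *shlex.split(flags)]

def time_host_run(flags):
    """Run one benchmark on the host and return its wall-clock time in seconds."""
    start = time.perf_counter()
    subprocess.run(stress_ng_argv(flags), check=True)
    return time.perf_counter() - start

# Flags copied from the sieve row of the table.
SIEVE_FLAGS = "--cpu 1 --cpu-method sieve --cpu-ops 600"
print(stress_ng_argv(SIEVE_FLAGS))

# Uncomment to actually run the benchmark (requires stress-ng):
# print(f"host run took {time_host_run(SIEVE_FLAGS):.2f} s")
```

Timing the same command inside the guest, including boot and shutdown as described in the notes, would give the other half of the ratio.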

Conclusions

  • The Cartesi Machine can be very fast on some workloads but slow on others; performance depends on the workload.
  • On average, the Cartesi Machine is ~2x slower than QEMU, which is pretty good considering it has no JIT.
  • The CPU registers benchmark is among the fastest, meaning reads and writes of RISC-V general-purpose registers are fast.
  • Square root of floating-point numbers is the slowest benchmark, because it's the only instruction in the RISC-V interpreter that performs a loop.
  • SHA-256 and integer vector arithmetic are noticeably slower because there is no support for SIMD instructions.
  • Floating-point benchmarks are noticeably slower because of the deterministic software emulation of floating-point instructions.
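The comparison with QEMU can be checked per benchmark by dividing the CM 0.19 column by the QEMU 9.1 column. The sketch below does this for four rows copied from the table and summarizes them with a geometric mean. The page does not say how the ~2x average was computed, so the geometric mean is an assumption, and a four-row subset is not expected to reproduce the whole-table figure.

```python
from math import exp, log

# (CM 0.19, QEMU 9.1) slowdown factors copied from four rows of the table.
ROWS = {
    "Heapsort": (12.81, 6.67),
    "Quicksort": (18.38, 10.29),
    "Zlib compression": (7.82, 4.01),
    "CPU pipeline execution": (18.74, 83.88),
}

def cm_over_qemu(rows):
    """Return how many times slower CM 0.19 is than QEMU for each row."""
    return {name: cm / qemu for name, (cm, qemu) in rows.items()}

def geometric_mean(values):
    """Geometric mean, a common way to average ratios."""
    values = list(values)
    return exp(sum(log(v) for v in values) / len(values))

ratios = cm_over_qemu(ROWS)
for name, ratio in ratios.items():
    print(f"{name}: {ratio:.2f}x")
print(f"geometric mean over this subset: {geometric_mean(ratios.values()):.2f}x")
```

A ratio below 1, as for the pipeline execution row, means the Cartesi Machine was faster than QEMU on that workload.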

Benchmark Environment

  • Host CPU x86_64 Intel Core i9-14900K
  • Host Linux 6.6.65-1-lts
  • Guest CPU rv64imafdc_zicntr_zicsr_zifencei_zihpm
  • Guest Linux 6.5.13-ctsi-1
  • Guest machine with 128MB of RAM
  • QEMU 9.1.2
  • Cartesi Machine Emulator 0.19.0 (not yet released)
  • GCC 14.2.1 20240910
  • stress-ng 0.17.06
  • Run on 19 December 2024