Benchmarks

Stress Benchmark	CM 0.19	CM 0.18	QEMU 9.1	`stress-ng` command
Sieve of Eratosthenes	22.30 ± 0.58	41.48 ± 1.15	5.59 ± 0.15	`--cpu 1 --cpu-method sieve --cpu-ops 600`
Fibonacci sequence	15.52 ± 0.16	30.27 ± 0.30	4.82 ± 0.33	`--cpu 1 --cpu-method fibonacci --cpu-ops 500`
Heapsort	12.81 ± 0.12	22.66 ± 0.36	6.67 ± 0.07	`--heapsort 1 --heapsort-ops 5`
JPEG compression	50.53 ± 1.89	76.74 ± 2.87	25.03 ± 0.94	`--jpeg 1 --jpeg-ops 15`
Zlib compression	7.82 ± 0.26	13.60 ± 0.45	4.01 ± 0.14	`--zlib 1 --zlib-ops 30`
Quicksort	18.38 ± 0.24	30.54 ± 0.20	10.29 ± 0.08	`--qsort 1 --qsort-ops 8`
Hashing functions	16.35 ± 0.38	29.98 ± 0.64	6.45 ± 0.14	`--hash 1 --hash-ops 70000`
SHA-256 hashing	41.48 ± 1.50	79.25 ± 3.26	19.87 ± 0.92	`--crypt 1 --crypt-method SHA-256 --crypt-ops 500000`
CPU `int8` arithmetic	29.95 ± 0.92	59.24 ± 1.93	14.48 ± 0.45	`--cpu 1 --cpu-method int8 --cpu-ops 700`
CPU `int32` arithmetic	28.46 ± 0.88	54.60 ± 1.64	13.95 ± 0.42	`--cpu 1 --cpu-method int32 --cpu-ops 700`
CPU `int64` arithmetic	27.78 ± 0.36	54.35 ± 0.74	13.41 ± 0.17	`--cpu 1 --cpu-method int64 --cpu-ops 700`
CPU `int128` arithmetic	20.34 ± 0.33	43.47 ± 0.78	6.99 ± 0.11	`--cpu 1 --cpu-method int128 --cpu-ops 800`
CPU `float32` arithmetic	23.54 ± 0.34	35.03 ± 0.50	13.69 ± 0.20	`--cpu 1 --cpu-method float --cpu-ops 600`
CPU `float64` arithmetic	35.81 ± 0.83	53.57 ± 1.24	19.18 ± 0.46	`--cpu 1 --cpu-method double --cpu-ops 300`
CPU looping	11.14 ± 0.12	24.17 ± 0.28	3.08 ± 0.03	`--cpu 1 --cpu-method loop --cpu-ops 800`
CPU NO-OP instruction	16.77 ± 0.22	37.46 ± 0.49	2.13 ± 0.04	`--nop 1 --nop-ops 100000`
CPU atomic operations	4.79 ± 0.22	4.90 ± 0.23	1.64 ± 0.08	`--atomic 1 --atomic-ops 500`
CPU branch prediction	2.99 ± 0.01	5.40 ± 0.05	10.37 ± 0.13	`--branch 1 --branch-ops 300000`
CPU cache thrashing	14.63 ± 0.07	21.42 ± 0.08	8.42 ± 0.06	`--cache 1 --cache-ops 150000`
CPU cache line	47.42 ± 2.10	79.50 ± 3.53	12.22 ± 0.55	`--cacheline 1 --cacheline-ops 125`
CPU read cycle	96.83 ± 10.5	162.35 ± 17.7	14.91 ± 1.64	`--clock 1 --clock-ops 2000`
CPU pipeline execution	18.74 ± 0.32	22.23 ± 0.38	83.88 ± 2.18	`--goto 1 --goto-ops 500000`
CPU instruction cache thrashing	26.87 ± 0.86	22.22 ± 0.70	37.19 ± 1.22	`--icache 1 --icache-ops 200`
CPU branching instruction cache	1.16 ± 0.01	1.22 ± 0.03	8.23 ± 0.09	`--far-branch 1 --far-branch-ops 200`
CPU registers read/write	5.34 ± 0.05	12.57 ± 0.59	1.09 ± 0.01	`--regs 1 --regs-ops 15000`
CPU function call	24.83 ± 0.50	44.69 ± 0.76	17.29 ± 0.31	`--funccall 1 --funccall-ops 400`
CPU bitwise arithmetic	38.40 ± 1.37	78.63 ± 2.83	15.22 ± 0.55	`--cpu 1 --cpu-method bitops --cpu-ops 400`
CPU page table and TLB	39.22 ± 0.96	39.56 ± 0.98	23.72 ± 0.66	`--pagemove 1 --pagemove-ops 30`
CPU TLB shootdown	18.48 ± 1.36	24.40 ± 1.80	11.11 ± 0.96	`--tlb-shootdown 1 --tlb-shootdown-ops 2000`
Memory copy	33.76 ± 0.84	64.19 ± 1.59	10.77 ± 0.27	`--memcpy 1 --memcpy-ops 80`
Memory read/write	12.14 ± 1.02	21.58 ± 0.79	4.19 ± 0.17	`--memrate 1 --memrate-bytes 2M --memrate-ops 400`
Memory mapping	21.45 ± 0.42	23.17 ± 0.49	14.35 ± 0.29	`--mmap 1 --mmap-bytes 96M --mmap-ops 4`
Memory and cache thrashing	6.54 ± 0.13	9.55 ± 0.19	9.76 ± 0.15	`--randlist 1 --randlist-ops 250`
Virtual memory page fault	19.76 ± 1.08	26.53 ± 1.46	19.71 ± 1.08	`--fault 1 --fault-ops 10000`
Virtual memory read/write	21.46 ± 0.86	37.28 ± 1.49	12.46 ± 0.50	`--vm 1 --vm-bytes 96M --vm-ops 20000`
Virtual memory addressing	20.63 ± 0.29	37.71 ± 0.50	6.67 ± 0.38	`--vm-addr 1 --vm-addr-ops 20`
Process forking	9.61 ± 0.23	11.14 ± 0.21	7.99 ± 0.23	`--fork 1 --fork-ops 2000`
Process context switching	9.42 ± 1.23	6.40 ± 0.83	12.32 ± 1.61	`--switch 1 --switch-ops 200000`
File read/write	20.31 ± 0.43	34.64 ± 0.73	6.14 ± 0.15	`--hdd 1 --hdd-ops 6000`
Threading	101.03 ± 4.24	115.29 ± 4.83	19.06 ± 0.82	`--pthread 1 --pthread-ops 1500`
Linux system calls	14.94 ± 0.95	14.30 ± 0.90	1.54 ± 0.10	`--syscall 1 --syscall-ops 4000`
Integer vector arithmetic	118.93 ± 8.70	126.12 ± 9.22	19.95 ± 1.47	`--vecmath 1 --vecmath-ops 100`
Integer wide vector arithmetic	165.38 ± 16.0	207.88 ± 20.1	45.03 ± 4.39	`--vecwide 1 --vecwide-ops 600`
Multi-precision floating-point	38.58 ± 0.83	54.01 ± 1.16	13.90 ± 0.33	`--mpfr 1 --mpfr-ops 200`
Floating-point square root	183.36 ± 29.42	348.78 ± 55.96	63.78 ± 10.30	`--cpu 1 --cpu-method sqrt --cpu-ops 20`
Floating-point FMA	101.68 ± 5.88	135.14 ± 7.61	36.92 ± 2.09	`--fma 1 --fma-ops 100000`
Floating-point math	110.51 ± 8.23	222.73 ± 16.7	52.28 ± 3.88	`--fp 1 --fp-ops 150`
Floating-point matmul	38.12 ± 0.92	49.40 ± 1.78	15.53 ± 0.38	`--matrix 1 --matrix-method prod --matrix-ops 150`
Floating-point trigonometric	97.03 ± 8.68	122.02 ± 10.9	43.83 ± 4.31	`--trig 1 --trig-ops 80`
Floating-point vector math	61.55 ± 5.05	107.38 ± 8.81	26.70 ± 2.24	`--vecfp 1 --vecfp-ops 200`

How to read: Lower is better. Each number is the slowdown relative to the same benchmark run natively on the host. For example, 5.30 ± 0.13 means the benchmark ran 5.30 times faster on the host than in the guest, with a standard deviation of 0.13.
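
The relative figure described above can be sketched as follows. This is a minimal illustration, assuming each benchmark was timed several times on both host and guest. The function name `slowdown` and the sample timings are hypothetical, and the first-order error propagation used for the ± term is an assumption, since this page does not state how the deviation was computed.

```python
import math
import statistics


def slowdown(guest_times, host_times):
    """Return (ratio, deviation) of guest time relative to host time.

    Assumption: the deviation is propagated to first order from the
    relative standard deviations of both samples. The benchmark page
    does not say how its ± values were derived.
    """
    guest_mean = statistics.mean(guest_times)
    host_mean = statistics.mean(host_times)
    ratio = guest_mean / host_mean
    # Relative standard deviation of each sample.
    guest_rel = statistics.stdev(guest_times) / guest_mean
    host_rel = statistics.stdev(host_times) / host_mean
    # Combine the relative errors in quadrature and scale by the ratio.
    deviation = ratio * math.sqrt(guest_rel**2 + host_rel**2)
    return ratio, deviation


# Hypothetical wall-clock timings in seconds, for illustration only.
guest = [10.4, 10.6, 10.8]
host = [1.9, 2.0, 2.1]
ratio, deviation = slowdown(guest, host)
print(f"{ratio:.2f} ± {deviation:.2f}")
```

A ratio of 1.00 would mean the guest runs the benchmark as fast as the host; larger values mean a slower guest.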

Benchmark Notes

All benchmarks were limited to 1 CPU core.
Guest measurements include the time to boot and shut down Linux, while host measurements do not.
Both QEMU and the Cartesi Machine used the same guest kernel and guest rootfs.
QEMU is faster in most benchmarks because it uses a JIT (just-in-time) compiler.
Floating-point benchmarks are slower because floating-point instructions are emulated in software.
The vector math benchmarks are slower because the guest CPU has no SIMD instructions, while the host does.
The square root benchmark is the slowest because square root is the heaviest instruction in the Cartesi Machine.
More information about what each benchmark does is available in the stress-ng man pages.
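
To try one of the table rows on a host, the arguments from the last column can be appended to the `stress-ng` executable. The sketch below is only an illustration of that, not the harness used to produce the numbers: the helper names `build_command` and `time_command` are hypothetical, and it measures plain wall-clock time of a single run without the guest boot and shutdown that the guest measurements include.

```python
import shlex
import shutil
import subprocess
import time


def build_command(stressor_args):
    """Build a full stress-ng argv from the arguments listed in a table row."""
    return ["stress-ng", *shlex.split(stressor_args)]


def time_command(argv):
    """Run argv once and return the elapsed wall-clock time in seconds."""
    start = time.perf_counter()
    subprocess.run(
        argv,
        check=True,
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return time.perf_counter() - start


# Arguments copied from the heapsort row of the table.
argv = build_command("--heapsort 1 --heapsort-ops 5")
print(shlex.join(argv))

# Only run the benchmark when stress-ng is actually installed.
if shutil.which("stress-ng"):
    print(f"elapsed: {time_command(argv):.2f} s")
```

Repeating the run several times and averaging gives a more stable host baseline than a single measurement.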

Conclusions

The Cartesi Machine can be very fast on some workloads and slow on others; performance depends heavily on the workload.
The Cartesi Machine is on average ~2x slower than QEMU, which is pretty good considering it has no JIT.
The CPU registers benchmark is among the fastest, meaning reads and writes of RISC-V general-purpose registers are fast.
Square root of floating-point numbers is the slowest benchmark because it is the only instruction in the RISC-V interpreter that performs a loop.
SHA-256 and integer vector arithmetic are noticeably slower because there is no support for SIMD instructions.
Floating-point benchmarks are noticeably slower because of the deterministic software emulation of floating-point instructions.
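
The comparison between emulators can be explored per benchmark by dividing the table columns. The sketch below uses only a handful of rows copied from the table, keyed by their stress-ng stressor or method name, so its average is not the figure quoted above. Using a geometric mean to average the ratios is an assumption; this page does not say how its average was computed.

```python
import math

# (stressor, CM 0.19, CM 0.18, QEMU 9.1): mean slowdowns copied from the table.
ROWS = [
    ("heapsort", 12.81, 22.66, 6.67),
    ("jpeg", 50.53, 76.74, 25.03),
    ("zlib", 7.82, 13.60, 4.01),
    ("qsort", 18.38, 30.54, 10.29),
    ("branch", 2.99, 5.40, 10.37),
    ("sqrt", 183.36, 348.78, 63.78),
]


def geometric_mean(values):
    """Geometric mean, a common way to average ratios."""
    values = list(values)
    return math.exp(sum(math.log(v) for v in values) / len(values))


# How many times slower CM 0.19 is than QEMU (below 1 means CM is faster).
cm_vs_qemu = {name: cm19 / qemu for name, cm19, _, qemu in ROWS}
# How many times faster CM 0.19 is than CM 0.18.
cm19_vs_cm18 = {name: cm18 / cm19 for name, cm19, cm18, _ in ROWS}

for name in cm_vs_qemu:
    print(
        f"{name}: {cm_vs_qemu[name]:.2f}x vs QEMU, "
        f"{cm19_vs_cm18[name]:.2f}x faster than CM 0.18"
    )

sample_mean = geometric_mean(cm_vs_qemu.values())
print(f"geometric mean vs QEMU (sample only): {sample_mean:.2f}x")
```

A ratio below 1 in the first column, as for the `branch` row, marks a benchmark where the Cartesi Machine outperforms QEMU.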

Benchmark Environment

Host CPU x86_64 Intel Core i9-14900K
Host Linux 6.6.65-1-lts
Guest CPU rv64imafdc_zicntr_zicsr_zifencei_zihpm
Guest Linux 6.5.13-ctsi-1
Guest machine with 128MB of RAM
QEMU 9.1.2
Cartesi Machine Emulator 0.19.0 (not yet released)
GCC 14.2.1 20240910
stress-ng 0.17.06
Run on 19/December/2024
