Benchmarks

Eduardo Bart edited this page Dec 19, 2024 · 22 revisions

Cartesi Machine Benchmarks

Overview

This document presents a performance analysis of the Cartesi Machine Emulator (CM), versions 0.18 and 0.19, compared against QEMU version 9.1. The benchmarks evaluate various computational aspects of the emulator, highlighting its strengths, areas for improvement, and adherence to its core design principles.

Design philosophy

Before reading the benchmark results, it is essential to understand the design philosophy of the Cartesi Machine Emulator. The Cartesi Machine Emulator is architected with the following fundamental goals:

  1. Low complexity, for simplifying auditing processes and minimizing the potential for errors.
  2. Determinism, for bit-perfect reproducibility of computations across different platforms.
  3. Portability, for compatibility with various architectures (e.g., zkVMs, RISC-V RV32I).
  4. Security, for providing strong guarantees of safe, sandboxed, and correct execution of applications.

To achieve these objectives, the emulator adopts specific architectural decisions:

  • No Just-In-Time (JIT) compilation, avoiding the complexity and security issues associated with JIT while promoting simplicity and portability.
  • No multi-core interpretation, ensuring determinism and security by avoiding multi-threaded execution paths.

Benchmark methodology

  • Tool Used: All benchmarks were conducted using the stress-ng tool, a stress workload generator designed to exercise various system components.
  • Normalization: Results are normalized against native host execution; lower values indicate better performance relative to native execution.
  • Comparative Analysis: Benchmarks compare CM versions 0.18, 0.19, and QEMU 8.0, providing insights into performance evolution and relative standing.
  • CPU Configuration: Single-core limitation enforced on both host and guest environments to ensure consistency.
  • Kernel and Root Filesystem: Identical guest kernel and root filesystem images used across all emulator tests for fair comparison.
  • Benchmark Execution: Each test includes the overhead of booting and shutting down the guest Linux OS.
  • Repetition and Averaging: Benchmarks were repeated multiple times, and results were averaged to account for variability and ensure statistical significance.
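
The normalization and averaging steps above can be sketched as follows. This is a minimal illustration of the arithmetic only, not the actual benchmark harness; the function name and the timing values are hypothetical:

```python
import statistics

def slowdown_factor(emulated_times, native_times):
    """Normalize repeated emulated wall-clock times against native runs.

    Every run executes the same fixed-ops stress-ng workload, so the
    ratio of wall times is the slowdown factor; repeating the runs
    yields the mean and standard deviation reported in the table.
    """
    ratios = [e / n for e, n in zip(emulated_times, native_times)]
    return statistics.mean(ratios), statistics.stdev(ratios)

# Hypothetical timings in seconds: three emulated runs vs. three native runs.
mean, dev = slowdown_factor([22.1, 22.6, 22.4], [1.0, 1.0, 1.0])
```

Because each benchmark fixes the number of stress-ng operations rather than the duration, dividing wall-clock times gives the slowdown factor directly.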

Benchmark results

The table below summarizes the performance of CM versions 0.18, 0.19, and QEMU 9.1 across various computational tasks. Each entry shows the slowdown factor relative to native execution, along with the specific stress-ng command used.

| Stress Benchmark | CM 0.19 | CM 0.18 | QEMU 9.1 | stress-ng command |
| --- | --- | --- | --- | --- |
| Sieve of Eratosthenes | 22.30 ± 0.58 | 41.48 ± 1.15 | **5.59 ± 0.15** | `--cpu 1 --cpu-method sieve --cpu-ops 600` |
| Fibonacci sequence | 15.52 ± 0.16 | 30.27 ± 0.30 | **4.82 ± 0.33** | `--cpu 1 --cpu-method fibonacci --cpu-ops 500` |
| Heapsort | 12.81 ± 0.12 | 22.66 ± 0.36 | **6.67 ± 0.07** | `--heapsort 1 --heapsort-ops 5` |
| JPEG compression | 50.53 ± 1.89 | 76.74 ± 2.87 | **25.03 ± 0.94** | `--jpeg 1 --jpeg-ops 15` |
| Zlib compression | 7.82 ± 0.26 | 13.60 ± 0.45 | **4.01 ± 0.14** | `--zlib 1 --zlib-ops 30` |
| Quicksort | 18.38 ± 0.24 | 30.54 ± 0.20 | **10.29 ± 0.08** | `--qsort 1 --qsort-ops 8` |
| Hashing functions | 16.35 ± 0.38 | 29.98 ± 0.64 | **6.45 ± 0.14** | `--hash 1 --hash-ops 70000` |
| SHA-256 hashing | 41.48 ± 1.50 | 79.25 ± 3.26 | **19.87 ± 0.92** | `--crypt 1 --crypt-method SHA-256 --crypt-ops 500000` |
| CPU int8 arithmetic | 29.95 ± 0.92 | 59.24 ± 1.93 | **14.48 ± 0.45** | `--cpu 1 --cpu-method int8 --cpu-ops 700` |
| CPU int32 arithmetic | 28.46 ± 0.88 | 54.60 ± 1.64 | **13.95 ± 0.42** | `--cpu 1 --cpu-method int32 --cpu-ops 700` |
| CPU int64 arithmetic | 27.78 ± 0.36 | 54.35 ± 0.74 | **13.41 ± 0.17** | `--cpu 1 --cpu-method int64 --cpu-ops 700` |
| CPU int128 arithmetic | 20.34 ± 0.33 | 43.47 ± 0.78 | **6.99 ± 0.11** | `--cpu 1 --cpu-method int128 --cpu-ops 800` |
| CPU float32 arithmetic | 23.54 ± 0.34 | 35.03 ± 0.50 | **13.69 ± 0.20** | `--cpu 1 --cpu-method float --cpu-ops 600` |
| CPU float64 arithmetic | 35.81 ± 0.83 | 53.57 ± 1.24 | **19.18 ± 0.46** | `--cpu 1 --cpu-method double --cpu-ops 300` |
| CPU looping | 11.14 ± 0.12 | 24.17 ± 0.28 | **3.08 ± 0.03** | `--cpu 1 --cpu-method loop --cpu-ops 800` |
| CPU NO-OP instruction | 16.77 ± 0.22 | 37.46 ± 0.49 | **2.13 ± 0.04** | `--nop 1 --nop-ops 100000` |
| CPU atomic operations | 4.79 ± 0.22 | 4.90 ± 0.23 | **1.64 ± 0.08** | `--atomic 1 --atomic-ops 500` |
| CPU branch prediction | **2.99 ± 0.01** | 5.40 ± 0.05 | 10.37 ± 0.13 | `--branch 1 --branch-ops 300000` |
| CPU cache thrashing | 14.63 ± 0.07 | 21.42 ± 0.08 | **8.42 ± 0.06** | `--cache 1 --cache-ops 150000` |
| CPU cache line | 47.42 ± 2.10 | 79.50 ± 3.53 | **12.22 ± 0.55** | `--cacheline 1 --cacheline-ops 125` |
| CPU read cycle | 96.83 ± 10.5 | 162.35 ± 17.7 | **14.91 ± 1.64** | `--clock 1 --clock-ops 2000` |
| CPU pipeline execution | **18.74 ± 0.32** | 22.23 ± 0.38 | 83.88 ± 2.18 | `--goto 1 --goto-ops 500000` |
| CPU icache thrashing | 26.87 ± 0.86 | **22.22 ± 0.70** | 37.19 ± 1.22 | `--icache 1 --icache-ops 200` |
| CPU icache branching | **1.16 ± 0.01** | 1.22 ± 0.03 | 8.23 ± 0.09 | `--far-branch 1 --far-branch-ops 200` |
| CPU registers read/write | 5.34 ± 0.05 | 12.57 ± 0.59 | **1.09 ± 0.01** | `--regs 1 --regs-ops 15000` |
| CPU function call | 24.83 ± 0.50 | 44.69 ± 0.76 | **17.29 ± 0.31** | `--funccall 1 --funccall-ops 400` |
| CPU bitwise arithmetic | 38.40 ± 1.37 | 78.63 ± 2.83 | **15.22 ± 0.55** | `--cpu 1 --cpu-method bitops --cpu-ops 400` |
| CPU page table and TLB | 39.22 ± 0.96 | 39.56 ± 0.98 | **23.72 ± 0.66** | `--pagemove 1 --pagemove-ops 30` |
| CPU TLB shootdown | 18.48 ± 1.36 | 24.40 ± 1.80 | **11.11 ± 0.96** | `--tlb-shootdown 1 --tlb-shootdown-ops 2000` |
| Memory copy | 33.76 ± 0.84 | 64.19 ± 1.59 | **10.77 ± 0.27** | `--memcpy 1 --memcpy-ops 80` |
| Memory read/write | 12.14 ± 1.02 | 21.58 ± 0.79 | **4.19 ± 0.17** | `--memrate 1 --memrate-bytes 2M --memrate-ops 400` |
| Memory mapping | 21.45 ± 0.42 | 23.17 ± 0.49 | **14.35 ± 0.29** | `--mmap 1 --mmap-bytes 96M --mmap-ops 4` |
| Memory and cache thrashing | **6.54 ± 0.13** | 9.55 ± 0.19 | 9.76 ± 0.15 | `--randlist 1 --randlist-ops 250` |
| Virtual memory page fault | 19.76 ± 1.08 | 26.53 ± 1.46 | **19.71 ± 1.08** | `--fault 1 --fault-ops 10000` |
| Virtual memory read/write | 21.46 ± 0.86 | 37.28 ± 1.49 | **12.46 ± 0.50** | `--vm 1 --vm-bytes 96M --vm-ops 20000` |
| Virtual memory addressing | 20.63 ± 0.29 | 37.71 ± 0.50 | **6.67 ± 0.38** | `--vm-addr 1 --vm-addr-ops 20` |
| Process forking | 9.61 ± 0.23 | 11.14 ± 0.21 | **7.99 ± 0.23** | `--fork 1 --fork-ops 2000` |
| Process context switching | 9.42 ± 1.23 | **6.40 ± 0.83** | 12.32 ± 1.61 | `--switch 1 --switch-ops 200000` |
| File read/write | 20.31 ± 0.43 | 34.64 ± 0.73 | **6.14 ± 0.15** | `--hdd 1 --hdd-ops 6000` |
| Threading | 101.03 ± 4.24 | 115.29 ± 4.83 | **19.06 ± 0.82** | `--pthread 1 --pthread-ops 1500` |
| Linux system calls | 14.94 ± 0.95 | 14.30 ± 0.90 | **1.54 ± 0.10** | `--syscall 1 --syscall-ops 4000` |
| Integer vector arithmetic | 118.93 ± 8.70 | 126.12 ± 9.22 | **19.95 ± 1.47** | `--vecmath 1 --vecmath-ops 100` |
| Integer wide vector arithmetic | 165.38 ± 16.0 | 207.88 ± 20.1 | **45.03 ± 4.39** | `--vecwide 1 --vecwide-ops 600` |
| Multi-precision floating-point | 38.58 ± 0.83 | 54.01 ± 1.16 | **13.90 ± 0.33** | `--mpfr 1 --mpfr-ops 200` |
| Floating-point square root | 183.36 ± 29.42 | 348.78 ± 55.96 | **63.78 ± 10.30** | `--cpu 1 --cpu-method sqrt --cpu-ops 20` |
| Floating-point FMA | 101.68 ± 5.88 | 135.14 ± 7.61 | **36.92 ± 2.09** | `--fma 1 --fma-ops 100000` |
| Floating-point math | 110.51 ± 8.23 | 222.73 ± 16.7 | **52.28 ± 3.88** | `--fp 1 --fp-ops 150` |
| Floating-point matmul | 38.12 ± 0.92 | 49.40 ± 1.78 | **15.53 ± 0.38** | `--matrix 1 --matrix-method prod --matrix-ops 150` |
| Floating-point trigonometric | 97.03 ± 8.68 | 122.02 ± 10.9 | **43.83 ± 4.31** | `--trig 1 --trig-ops 80` |
| Floating-point vector math | 61.55 ± 5.05 | 107.38 ± 8.81 | **26.70 ± 2.24** | `--vecfp 1 --vecfp-ops 200` |
  • Note: Each result is presented in the format mean ± standard deviation, representing the slowdown factor relative to native execution.
  • Note: Bold entries in the results indicate the best performance among the compared emulators.

Performance analysis

By analyzing the benchmarks, we can extract the following insights for the Cartesi Machine:

1. Areas where CM outperforms QEMU

Despite generally slower performance due to its design choices prioritizing determinism and security, CM 0.19 notably outperforms QEMU 9.1 in certain benchmarks:

  • CPU branch prediction:
    • The simpler execution model of CM may result in more predictable branch behavior, leading to better performance in branch prediction tasks.
  • CPU pipeline execution:
    • CM's straightforward interpretation without the overhead of dynamic optimizations could allow more efficient pipeline execution in this context.
  • CPU icache branching:
    • The emulator's deterministic instruction flow might lead to better instruction cache utilization in branching scenarios.
  • Memory and cache thrashing:
    • CM's emulation layer may handle cache thrashing more efficiently due to its consistent memory access patterns.

2. Floating-point operations

  • Observations:
    • Significantly higher latency in floating-point computations.
    • Particularly high slowdown in square roots, due to the iterative software implementation of the instruction.
    • The lack of hardware acceleration impacts performance in these tasks.
  • Implications:
    • Applications heavily reliant on floating-point calculations may experience reduced performance.
    • Necessary trade-off due to software-based emulation ensuring determinism and portability.
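
To illustrate why an iterative software square root is expensive, here is a minimal Newton–Raphson sketch. This is not CM's actual FSQRT implementation, only an example of the iterative approach such routines typically take:

```python
def iterative_sqrt(x, tol=1e-12):
    """Newton-Raphson square root: y_{n+1} = (y_n + x / y_n) / 2.

    Each iteration costs a divide, an add, and a multiply. When every
    one of those is itself emulated in software, the per-iteration cost
    compounds, which is consistent with the large slowdown observed for
    sqrt-heavy workloads.
    """
    if x < 0:
        raise ValueError("negative input")
    if x == 0:
        return 0.0
    y = x if x >= 1 else 1.0  # crude initial guess
    while abs(y * y - x) > tol * x:
        y = (y + x / y) / 2.0
    return y
```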

3. Vector and SIMD operations

  • Observations:
    • Notable performance degradation in benchmarks involving vector arithmetic and SIMD operations.
  • Implications:
    • Workloads that utilize vectorized computations may not achieve optimal performance.
    • Consistent with the emulator's design priorities favoring portability and simplicity over platform-specific optimizations.
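
One way to see where the vector overhead comes from: an interpreter that cannot map guest vectors onto host SIMD must process each lane as a separate scalar operation. The sketch below is schematic and does not reflect CM's actual vector handling:

```python
def emulated_vector_add(a, b, lane_bits=8, vlen_bits=128):
    """Add two vectors lane by lane, as a scalar interpreter must.

    A 128-bit vector of 8-bit lanes takes 16 scalar adds plus masking
    where native SIMD needs a single instruction, so vector-heavy
    benchmarks amplify the interpreter's per-instruction cost.
    """
    lanes = vlen_bits // lane_bits
    mask = (1 << lane_bits) - 1
    assert len(a) == len(b) == lanes
    return [(x + y) & mask for x, y in zip(a, b)]
```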

4. Threading and parallelism

  • Observations:
    • Higher overhead in threading benchmarks, indicating less efficient context switching and thread management.
    • Single-core interpretation design impacts the ability to leverage multi-threaded workloads.
  • Implications:
    • Multi-threaded applications may not perform as efficiently.
    • Necessary trade-off due to the prioritized deterministic execution model.

Comparative insights

  • CM 0.19 vs. CM 0.18:
    • CM 0.19 shows significant performance improvements over CM 0.18 across most benchmarks.
    • The update reflects optimizations and enhancements in the emulator's core execution engine.
  • CM 0.19 vs. QEMU:
    • QEMU generally outperforms CM due to its use of JIT compilation and allowance for non-deterministic optimizations.
    • CM maintains respectable performance, often within a 2x slowdown factor relative to QEMU, despite its stricter design constraints.

Test environment

Hardware configuration

  • Host CPU: Intel Core i9-14900K (x86_64)
  • Host RAM: 64GB
  • Guest CPU: rv64imafdc_zicntr_zicsr_zifencei_zihpm
  • Guest RAM: 128MB (guest machine)

Software stack

  • Host OS: Linux 6.6.65-1-lts
  • Guest OS: Linux 6.5.13-ctsi-1
  • QEMU: 9.1.2
  • Cartesi Machine Emulator: 0.18 and 0.19 (pre-release)
  • Compiler: GCC 14.2.1 20240910
  • Benchmark tool: stress-ng 0.17.06
  • Test date: December 19, 2024

Conclusion

The Cartesi Machine Emulator, adhering to its core principles of low complexity, determinism, portability, and security, demonstrates solid performance across a range of computational tasks. While it does not outperform emulators like QEMU that leverage JIT compilation and multi-threading, CM provides a dependable and consistent environment suitable for applications where determinism and reproducibility are paramount. Version 0.19 exhibits notable performance enhancements over version 0.18, reflecting ongoing optimization efforts. Developers targeting the Cartesi Machine can expect reliable execution, with the understanding that certain computationally intensive tasks, particularly those involving floating-point and vector operations, may experience performance trade-offs inherent to the emulator's design philosophy.

Recommendations

  • Application profiling: Developers should profile their applications to identify performance-critical sections that may be impacted by the emulator's limitations.
  • Optimization strategies:
    • Minimize heavy floating-point computations where possible.
    • Explore algorithmic optimizations that reduce reliance on vector operations.
  • Parallelism considerations: Given the single-core execution model, applications designed for parallel execution may need adjustments to align with the emulator's capabilities.
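
As an example of the floating-point recommendation, some workloads can switch to fixed-point arithmetic, which stays on the emulator's comparatively faster integer path. The Q16.16 format and helper names below are illustrative only, not part of any Cartesi API:

```python
SCALE_BITS = 16  # Q16.16 fixed-point: 16 integer bits, 16 fractional bits
SCALE = 1 << SCALE_BITS

def to_fixed(x: float) -> int:
    return int(round(x * SCALE))

def from_fixed(f: int) -> float:
    return f / SCALE

def fixed_mul(a: int, b: int) -> int:
    # An integer multiply plus a shift replaces a floating-point
    # multiply, trading range and precision for cheaper integer ops.
    return (a * b) >> SCALE_BITS

# 1.5 * 2.25 computed entirely with integer arithmetic:
product = from_fixed(fixed_mul(to_fixed(1.5), to_fixed(2.25)))
```

Whether this pays off depends on the workload's required range and precision; fixed-point overflows silently where floats saturate gracefully, so bounds analysis is needed before converting.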

By aligning application design with the strengths of the Cartesi Machine Emulator, developers can effectively leverage its secure and deterministic environment for a wide range of computational tasks.
