Description
After the PR introducing SIMD packing for the Merkle tree construction (#21) did not improve performance significantly, and my investigation into precomputing the domain separator showed no performance improvement either (though I did not expect much there), I decided to profile the benchmark to get a better idea of where our current hotspots actually lie.
I assume that the benchmarks are a roughly realistic representation of the key generation process.
Setup
I commented out bench_function_poseidon in benches/benchmark.rs and all benches other than the last in benches/benchmark_poseidon_top_level.rs:
```rust
// benchmarking lifetime 2^32 - size optimized
benchmark_signature_scheme::<SIGTopLevelTargetSumLifetime32Dim32Base26>(
    c,
    &format!(
        "Top Level TS, Lifetime 2^32, Activation 2^{MAX_LOG_ACTIVATION_DURATION}, Dimension 32, Base 26 (Size Optimized)"
    ),
);
```

For the profiling, I defined the following Cargo profile:
```toml
[profile.profiling]
inherits = "release"
debug = true
```

I then compiled the benchmark using:

```shell
RUSTFLAGS="-C target-cpu=native" cargo build --profile profiling --bench benchmark --features with-gen-benches-poseidon-top-level
```
Profiling
With the binary built, I profiled using perf:
```shell
perf record -g --call-graph dwarf target/profiling/deps/benchmark-fc7b9741a242b2be --bench --profile-time 30
```
and stopped the execution after the key generation step was done.
Results
I had a look at the results with perf and hotspot.
Looking at a bottom up overview with perf:
```
# Total Lost Samples: 1139045
#
# Samples: 3M of event 'cycles:Pu'
# Event count (approx.): 2477540120255
#
# Overhead  Command          Shared Object               Symbol
# ........  ...............  ..........................  ......
#
34.16% benchmark-fc7b9 benchmark-fc7b9741a242b2be [.] _ZN12p3_poseidon28external31external_terminal_permute_state17hb890ae4a73de274dE.llvm.4266101522582905067
25.91% benchmark-fc7b9 benchmark-fc7b9741a242b2be [.] <p3_monty_31::x86_64_avx2::poseidon2::Poseidon2InternalLayerMonty31<FP,16_usize,ILP> as p3_poseidon2::internal::InternalLayer<p3_monty_31::x86_64_avx2::packing::PackedMontyField31AVX2<FP>,16_usize,_>>::permute_state
15.08% benchmark-fc7b9 benchmark-fc7b9741a242b2be [.] keccak::keccak_p
2.67% benchmark-fc7b9 benchmark-fc7b9741a242b2be [.] core::ops::function::impls::<impl core::ops::function::FnMut<A> for &F>::call_mut
1.52% benchmark-fc7b9 benchmark-fc7b9741a242b2be [.] <p3_monty_31::x86_64_avx2::poseidon2::Poseidon2ExternalLayerMonty31<FP,_> as p3_poseidon2::external::ExternalLayer<p3_monty_31::x86_64_avx2::packing::PackedMontyField31AVX2<FP>,_,_>>::permute_state_initial
1.27% benchmark-fc7b9 benchmark-fc7b9741a242b2be [.] _ZN12p3_poseidon28external31external_terminal_permute_state17hd986374ef873cd47E.llvm.4266101522582905067
1.20% benchmark-fc7b9 benchmark-fc7b9741a242b2be [.] <p3_monty_31::x86_64_avx2::poseidon2::Poseidon2InternalLayerMonty31<FP,24_usize,ILP> as p3_poseidon2::internal::InternalLayer<p3_monty_31::x86_64_avx2::packing::PackedMontyField31AVX2<FP>,24_usize,_>>::permute_state
1.10% benchmark-fc7b9 benchmark-fc7b9741a242b2be [.] _ZN12p3_poseidon28external31external_terminal_permute_state17h7ee852a958250337E.llvm.4266101522582905067
1.02% benchmark-fc7b9 benchmark-fc7b9741a242b2be [.] crossbeam_epoch::default::with_handle
0.92% benchmark-fc7b9 libc.so.6 [.] __memmove_avx_unaligned_erms
0.91% benchmark-fc7b9 libc.so.6 [.] _int_malloc
0.84% benchmark-fc7b9 [unknown] [k] 0xffffffff9ee001c8
0.80% benchmark-fc7b9 benchmark-fc7b9741a242b2be [.] <leansig::symmetric::prf::shake_to_field::ShakePRFtoF<_,_> as leansig::symmetric::prf::Pseudorandom>::get_domain_element
0.65% benchmark-fc7b9 benchmark-fc7b9741a242b2be [.] p3_monty_31::poseidon2::<impl p3_poseidon2::internal::InternalLayer<p3_monty_31::monty_31::MontyField31<FP>,_,_> for p3_monty_31::x86_64_avx2::poseidon2::Poseidon2InternalLayerMonty31<FP,_,P2P>>::permute_state
0.63% benchmark-fc7b9 benchmark-fc7b9741a242b2be [.] crossbeam_epoch::internal::Global::try_advance
0.56% benchmark-fc7b9 benchmark-fc7b9741a242b2be [.] <digest::core_api::wrapper::CoreWrapper<T> as digest::ExtendableOutput>::finalize_xof
0.56% benchmark-fc7b9 benchmark-fc7b9741a242b2be [.] <p3_koala_bear::poseidon2::KoalaBearInternalLayerParameters as p3_monty_31::poseidon2::InternalLayerBaseParameters<p3_koala_bear::koala_bear::KoalaBearParameters,16_usize>>::internal_layer_mat_mul
0.54% benchmark-fc7b9 libc.so.6 [.] cfree@GLIBC_2.2.5
0.53% benchmark-fc7b9 libc.so.6 [.] malloc
```
We see that about 75% of the runtime is already spent in Poseidon2's external permute-state call, the internal Poseidon2 layers, and the Keccak permutation. The entries further down individually account for little time, and many of them are in any case also related to cryptographic operations. malloc and SIMD memmove calls make up 2-3%, but it is not clear that optimizing them is a) worthwhile at the moment or b) easily doable. Some heap profiling could be worth it to see whether part of these allocations can be avoided (e.g. preallocating all Merkle tree layers at once rather than allocating each layer separately).
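To make the preallocation idea concrete, here is a minimal sketch (all names are illustrative and not taken from the leansig codebase): the total node count of a full binary tree is known up front, so all layers can be written into one buffer instead of allocating a fresh vector per layer.

```rust
/// Number of nodes in a full binary Merkle tree with `num_leaves` leaves
/// (assumes `num_leaves` is a power of two): leaves + leaves/2 + ... + 1.
fn total_nodes(num_leaves: usize) -> usize {
    2 * num_leaves - 1
}

/// Build all tree layers into a single preallocated buffer.
/// `hash` combines two child hashes into a parent hash.
fn build_tree<H: Copy + Default>(leaves: &[H], hash: impl Fn(H, H) -> H) -> Vec<H> {
    // One allocation for the whole tree instead of one per layer.
    let mut nodes = vec![H::default(); total_nodes(leaves.len())];
    nodes[..leaves.len()].copy_from_slice(leaves);
    let (mut layer_start, mut layer_len) = (0, leaves.len());
    while layer_len > 1 {
        for i in 0..layer_len / 2 {
            // Parent layer starts right after the current layer.
            nodes[layer_start + layer_len + i] =
                hash(nodes[layer_start + 2 * i], nodes[layer_start + 2 * i + 1]);
        }
        layer_start += layer_len;
        layer_len /= 2;
    }
    nodes
}
```

Whether this actually helps would have to be confirmed by heap profiling; the allocator traffic may well be dominated by other temporaries.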
hotspot also shows the exact SIMD intrinsics that are expensive within each call (the overview screenshot is not reproduced here).
Summary
At the moment it is not obvious that there is much low-hanging fruit left in terms of optimizations. I assume that the Poseidon2 and Keccak implementations are already heavily optimized.