Skip to content

Benchmark profiling for hotspots #27

@Vindaar

Description

@Vindaar

After the PR to introduce SIMD packing for the Merkle tree construction #21 did not improve performance significantly and my investigation into precomputing the domain separator did not show any performance improvement (though I did not expect much here), I decided to profile the benchmark to get a better idea about where our current hotspots actually lie.

I assume that the benchmarks represent a roughly realistic representation of the key generation process.

Setup

I commented out bench_function_poseidon in benches/benchmark.rs and all benches other than the last in benches/benchmark_poseidon_top_level.rs:

    // benchmarking lifetime 2^32 - size optimized
    benchmark_signature_scheme::<SIGTopLevelTargetSumLifetime32Dim32Base26>(
        c,
        &format!(
            "Top Level TS, Lifetime 2^32, Activation 2^{MAX_LOG_ACTIVATION_DURATION}, Dimension 32, Base 26 (Size Optimized)"
        ),
    );

For the profiling, I defined the following Cargo setup:

[profile.profiling]
inherits = "release"
debug = true

I then compiled the benchmark using:

RUSTFLAGS="-C target-cpu=native" cargo build --profile profiling --bench benchmark --features with-gen-benches-poseidon-top-level

Profiling

With the binary built, I profiled using perf:

perf record -g --call-graph dwarf target/profiling/deps/benchmark-fc7b9741a242b2be --bench --profile-time 30

and stopped the execution after the key generation step was done.

Results

I had a look at the results with perf and hotspot.

Looking at a bottom up overview with perf:

# Total Lost Samples: 1139045
#
# Samples: 3M of event 'cycles:Pu'
# Event count (approx.): 2477540120255
#
# Overhead  Command          Shared Object               Symbol                                                                                                                                                                                                                                                               >
# ........  ...............  ..........................  .....................................................................................................................................................................................................................................................................>
#
    34.16%  benchmark-fc7b9  benchmark-fc7b9741a242b2be  [.] _ZN12p3_poseidon28external31external_terminal_permute_state17hb890ae4a73de274dE.llvm.4266101522582905067
    25.91%  benchmark-fc7b9  benchmark-fc7b9741a242b2be  [.] <p3_monty_31::x86_64_avx2::poseidon2::Poseidon2InternalLayerMonty31<FP,16_usize,ILP> as p3_poseidon2::internal::InternalLayer<p3_monty_31::x86_64_avx2::packing::PackedMontyField31AVX2<FP>,16_usize,_>>::permute_state
    15.08%  benchmark-fc7b9  benchmark-fc7b9741a242b2be  [.] keccak::keccak_p
     2.67%  benchmark-fc7b9  benchmark-fc7b9741a242b2be  [.] core::ops::function::impls::<impl core::ops::function::FnMut<A> for &F>::call_mut
     1.52%  benchmark-fc7b9  benchmark-fc7b9741a242b2be  [.] <p3_monty_31::x86_64_avx2::poseidon2::Poseidon2ExternalLayerMonty31<FP,_> as p3_poseidon2::external::ExternalLayer<p3_monty_31::x86_64_avx2::packing::PackedMontyField31AVX2<FP>,_,_>>::permute_state_initial
     1.27%  benchmark-fc7b9  benchmark-fc7b9741a242b2be  [.] _ZN12p3_poseidon28external31external_terminal_permute_state17hd986374ef873cd47E.llvm.4266101522582905067
     1.20%  benchmark-fc7b9  benchmark-fc7b9741a242b2be  [.] <p3_monty_31::x86_64_avx2::poseidon2::Poseidon2InternalLayerMonty31<FP,24_usize,ILP> as p3_poseidon2::internal::InternalLayer<p3_monty_31::x86_64_avx2::packing::PackedMontyField31AVX2<FP>,24_usize,_>>::permute_state
     1.10%  benchmark-fc7b9  benchmark-fc7b9741a242b2be  [.] _ZN12p3_poseidon28external31external_terminal_permute_state17h7ee852a958250337E.llvm.4266101522582905067
     1.02%  benchmark-fc7b9  benchmark-fc7b9741a242b2be  [.] crossbeam_epoch::default::with_handle
     0.92%  benchmark-fc7b9  libc.so.6                   [.] __memmove_avx_unaligned_erms
     0.91%  benchmark-fc7b9  libc.so.6                   [.] _int_malloc
     0.84%  benchmark-fc7b9  [unknown]                   [k] 0xffffffff9ee001c8
     0.80%  benchmark-fc7b9  benchmark-fc7b9741a242b2be  [.] <leansig::symmetric::prf::shake_to_field::ShakePRFtoF<_,_> as leansig::symmetric::prf::Pseudorandom>::get_domain_element
     0.65%  benchmark-fc7b9  benchmark-fc7b9741a242b2be  [.] p3_monty_31::poseidon2::<impl p3_poseidon2::internal::InternalLayer<p3_monty_31::monty_31::MontyField31<FP>,_,_> for p3_monty_31::x86_64_avx2::poseidon2::Poseidon2InternalLayerMonty31<FP,_,P2P>>::permute_state
     0.63%  benchmark-fc7b9  benchmark-fc7b9741a242b2be  [.] crossbeam_epoch::internal::Global::try_advance
     0.56%  benchmark-fc7b9  benchmark-fc7b9741a242b2be  [.] <digest::core_api::wrapper::CoreWrapper<T> as digest::ExtendableOutput>::finalize_xof
     0.56%  benchmark-fc7b9  benchmark-fc7b9741a242b2be  [.] <p3_koala_bear::poseidon2::KoalaBearInternalLayerParameters as p3_monty_31::poseidon2::InternalLayerBaseParameters<p3_koala_bear::koala_bear::KoalaBearParameters,16_usize>>::internal_layer_mat_mul
     0.54%  benchmark-fc7b9  libc.so.6                   [.] cfree@GLIBC_2.2.5
     0.53%  benchmark-fc7b9  libc.so.6                   [.] malloc

We see that about 75% of the runtime is already spent in the external permute state call of Poseidon2, the internal Poseidon2 layers and the Keccak permutation. The elements further down the line individually don't make up that much time and many of them anyhow are also related to cryptographic operations. malloc and SIMD memmove instructions do make up 2-3%, but it's not clear it is a) worthwhile trying to optimize that at the moment and b) easily doable. Maybe some heap profiling could be worth it to see if some of that could be avoided (e.g. don't allocate the Merkle tree layers for each layer, but rather preallocate all).

hotspot also shows the exact SIMD intrinsics that are expensive for each call. Here the overview looks as follows:

Image

Summary

At the moment it is not obvious that there is still much low hanging fruit in terms of optimizations. I assume that the Poseidon2 and Keccak implementations are already heavily optimized.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions