Description
After the PR introducing SIMD packing for the Merkle tree construction (#21) did not improve performance significantly, and my investigation into precomputing the domain separator showed no performance improvement either (though I did not expect much there), I decided to profile the benchmark to get a better idea of where our current hotspots actually lie.
I assume that the benchmarks are a roughly realistic representation of the key generation process.
Setup
I commented out bench_function_poseidon in benches/benchmark.rs and all benches other than the last in benches/benchmark_poseidon_top_level.rs:
```rust
// benchmarking lifetime 2^32 - size optimized
benchmark_signature_scheme::<SIGTopLevelTargetSumLifetime32Dim32Base26>(
    c,
    &format!(
        "Top Level TS, Lifetime 2^32, Activation 2^{MAX_LOG_ACTIVATION_DURATION}, Dimension 32, Base 26 (Size Optimized)"
    ),
);
```

For the profiling, I defined the following Cargo profile:
```toml
[profile.profiling]
inherits = "release"
debug = true
```

I then compiled the benchmark using:

```shell
RUSTFLAGS="-C target-cpu=native" cargo build --profile profiling --bench benchmark --features with-gen-benches-poseidon-top-level
```
Profiling
With the binary built, I profiled using perf:
```shell
perf record -g --call-graph dwarf target/profiling/deps/benchmark-fc7b9741a242b2be --bench --profile-time 30
```
and stopped the execution after the key generation step was done.
Results
I had a look at the results with perf and hotspot.
Looking at a bottom up overview with perf:
```
# Total Lost Samples: 1139045
#
# Samples: 3M of event 'cycles:Pu'
# Event count (approx.): 2477540120255
#
# Overhead  Command          Shared Object               Symbol
# ........  ...............  ..........................  ......
#
34.16% benchmark-fc7b9 benchmark-fc7b9741a242b2be [.] _ZN12p3_poseidon28external31external_terminal_permute_state17hb890ae4a73de274dE.llvm.4266101522582905067
25.91% benchmark-fc7b9 benchmark-fc7b9741a242b2be [.] <p3_monty_31::x86_64_avx2::poseidon2::Poseidon2InternalLayerMonty31<FP,16_usize,ILP> as p3_poseidon2::internal::InternalLayer<p3_monty_31::x86_64_avx2::packing::PackedMontyField31AVX2<FP>,16_usize,_>>::permute_state
15.08% benchmark-fc7b9 benchmark-fc7b9741a242b2be [.] keccak::keccak_p
2.67% benchmark-fc7b9 benchmark-fc7b9741a242b2be [.] core::ops::function::impls::<impl core::ops::function::FnMut<A> for &F>::call_mut
1.52% benchmark-fc7b9 benchmark-fc7b9741a242b2be [.] <p3_monty_31::x86_64_avx2::poseidon2::Poseidon2ExternalLayerMonty31<FP,_> as p3_poseidon2::external::ExternalLayer<p3_monty_31::x86_64_avx2::packing::PackedMontyField31AVX2<FP>,_,_>>::permute_state_initial
1.27% benchmark-fc7b9 benchmark-fc7b9741a242b2be [.] _ZN12p3_poseidon28external31external_terminal_permute_state17hd986374ef873cd47E.llvm.4266101522582905067
1.20% benchmark-fc7b9 benchmark-fc7b9741a242b2be [.] <p3_monty_31::x86_64_avx2::poseidon2::Poseidon2InternalLayerMonty31<FP,24_usize,ILP> as p3_poseidon2::internal::InternalLayer<p3_monty_31::x86_64_avx2::packing::PackedMontyField31AVX2<FP>,24_usize,_>>::permute_state
1.10% benchmark-fc7b9 benchmark-fc7b9741a242b2be [.] _ZN12p3_poseidon28external31external_terminal_permute_state17h7ee852a958250337E.llvm.4266101522582905067
1.02% benchmark-fc7b9 benchmark-fc7b9741a242b2be [.] crossbeam_epoch::default::with_handle
0.92% benchmark-fc7b9 libc.so.6 [.] __memmove_avx_unaligned_erms
0.91% benchmark-fc7b9 libc.so.6 [.] _int_malloc
0.84% benchmark-fc7b9 [unknown] [k] 0xffffffff9ee001c8
0.80% benchmark-fc7b9 benchmark-fc7b9741a242b2be [.] <leansig::symmetric::prf::shake_to_field::ShakePRFtoF<_,_> as leansig::symmetric::prf::Pseudorandom>::get_domain_element
0.65% benchmark-fc7b9 benchmark-fc7b9741a242b2be [.] p3_monty_31::poseidon2::<impl p3_poseidon2::internal::InternalLayer<p3_monty_31::monty_31::MontyField31<FP>,_,_> for p3_monty_31::x86_64_avx2::poseidon2::Poseidon2InternalLayerMonty31<FP,_,P2P>>::permute_state
0.63% benchmark-fc7b9 benchmark-fc7b9741a242b2be [.] crossbeam_epoch::internal::Global::try_advance
0.56% benchmark-fc7b9 benchmark-fc7b9741a242b2be [.] <digest::core_api::wrapper::CoreWrapper<T> as digest::ExtendableOutput>::finalize_xof
0.56% benchmark-fc7b9 benchmark-fc7b9741a242b2be [.] <p3_koala_bear::poseidon2::KoalaBearInternalLayerParameters as p3_monty_31::poseidon2::InternalLayerBaseParameters<p3_koala_bear::koala_bear::KoalaBearParameters,16_usize>>::internal_layer_mat_mul
0.54% benchmark-fc7b9 libc.so.6 [.] cfree@GLIBC_2.2.5
0.53% benchmark-fc7b9 libc.so.6 [.] malloc
```
We see that about 75% of the runtime is already spent in Poseidon2's external permute-state call, the internal Poseidon2 layers, and the Keccak permutation. The entries further down individually account for little time, and many of them are in any case also related to cryptographic operations. malloc and SIMD memmove calls make up 2-3%, but it is not clear that optimizing them is a) worthwhile at the moment or b) easily doable. Some heap profiling could be worth it to see whether part of these allocations can be avoided (e.g. preallocating all Merkle tree layers at once rather than allocating each layer separately).
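To make the preallocation idea concrete, here is a minimal sketch (all names are illustrative and not taken from the leansig codebase): the total node count of a full binary tree is known up front, so all layers can be written into one buffer instead of allocating a fresh vector per layer.

```rust
/// Number of nodes in a full binary Merkle tree with `num_leaves` leaves
/// (assumes `num_leaves` is a power of two): leaves + leaves/2 + ... + 1.
fn total_nodes(num_leaves: usize) -> usize {
    2 * num_leaves - 1
}

/// Build all tree layers into a single preallocated buffer.
/// `hash` combines two child hashes into a parent hash.
fn build_tree<H: Copy + Default>(leaves: &[H], hash: impl Fn(H, H) -> H) -> Vec<H> {
    // One allocation for the whole tree instead of one per layer.
    let mut nodes = vec![H::default(); total_nodes(leaves.len())];
    nodes[..leaves.len()].copy_from_slice(leaves);
    let (mut layer_start, mut layer_len) = (0, leaves.len());
    while layer_len > 1 {
        for i in 0..layer_len / 2 {
            // Parent layer starts right after the current layer.
            nodes[layer_start + layer_len + i] =
                hash(nodes[layer_start + 2 * i], nodes[layer_start + 2 * i + 1]);
        }
        layer_start += layer_len;
        layer_len /= 2;
    }
    nodes
}
```

Whether this actually helps would have to be confirmed by heap profiling; the allocator traffic may well be dominated by other temporaries.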
hotspot also shows the exact SIMD intrinsics that are expensive within each call (the overview screenshot is not reproduced here).
Summary
At the moment it is not obvious that there is much low-hanging fruit left in terms of optimizations. I assume that the Poseidon2 and Keccak implementations are already heavily optimized.