Skip to content

Conversation

@RainBoltz
Copy link

Summary

This PR introduces a comprehensive set of performance optimizations across multiple TON components:

  • Add thread-local memory pools for hot allocation paths (CellBuilder, RLDP2 packets)
  • Enable compiler optimizations (vectorization, loop unrolling) for Release builds
  • Tune RocksDB settings for higher throughput (larger cache, more background threads, compression)
  • Add optional LZ4 compression for large ADNL packets
  • Optimize cell serializer and deserializer with batch operations and lookup tables
  • Optimize LRUCache with std::map replacing std::set

Changes

Memory Pools

  • crypto/vm/cells/CellBuilderPool: Thread-local pool for CellBuilder objects, reducing allocation overhead during cell
    construction
  • rldp2/PacketPool: Generic ObjectPool template and BufferSlicePool for high-throughput packet handling
  • PoolMonitor: Statistics tracking for pool usage (debugging/monitoring)

Cell Serialization Optimizations (crypto/vm/cells/)

  • CellSlice: Add batch bit-reading methods (prefetch_bits_to(), optimized fetch_bytes())
  • CellBuilder: Optimize store_bytes() with word-aligned writes
  • bitstring.cpp: Add SIMD-friendly byte manipulation utilities
  • tlb_tags.hpp: Add compile-time TLB tag lookup tables for faster deserialization

ADNL Packet Compression (adnl/adnl-packet-compression.{h,cpp})

  • LZ4 compression for packets >4KB with magic header identification
  • Transparent compress/decompress with fallback for uncompressed data
  • Improved error handling and edge case coverage

RocksDB Tuning (tddb/td/db/RocksDb.cpp)

  • Increase default block cache from 1GB to 4GB
  • Cache index and filter blocks in memory
  • Pin L0 filter/index blocks
  • Increase background compaction threads (4→8) and flush threads (2→4)
  • Add LZ4 compression (ZSTD for bottommost level)
  • Tune memtable and compaction settings

Data Structure Optimizations

  • LRUCache: Replace std::set with std::map for cleaner implementation; add likely/unlikely hints
  • ObjectPool: Improved thread-local pooling with reserved capacity
  • Bitset: Optimized bit operations with cross-platform intrinsics
  • ChainBuffer/CyclicBuffer: Minor improvements

Notes

  • RocksDB cache increase (1GB→4GB) assumes validators have sufficient RAM; consider making configurable
  • ADNL compression adds ~1-2% CPU overhead but can significantly reduce bandwidth for large packets
  • JeMalloc can be optionally enabled with -DTON_USE_JEMALLOC=ON for improved memory allocation
  • Cell serialization optimizations provide significant speedup for TLB parsing workloads

…ng, and ADNL compression

  This PR introduces a comprehensive set of performance optimizations across multiple TON components:

  - Add thread-local memory pools for hot allocation paths (CellBuilder, RLDP2 packets)
  - Enable aggressive compiler optimizations (LTO, vectorization, loop unrolling) for Release builds
  - Tune RocksDB settings for higher throughput (larger cache, more background threads, compression)
  - Add optional LZ4 compression for large ADNL packets
  - Optimize LRUCache with unordered_map replacing std::set
  - Enable JeMalloc by default for better memory allocation performance

  - Enable JeMalloc by default for non-tonlib builds
  - Add -O3, -flto, -ffast-math, -funroll-loops for Release/RelWithDebInfo
  - Enable auto-vectorization (-fvectorize, -fslp-vectorize) on Clang
  - Add -mtune=native when targeting native architecture

  - crypto/vm/cells/CellBuilderPool: Thread-local pool for CellBuilder objects, reducing allocation overhead during cell
  construction
  - rldp2/PacketPool: Generic ObjectPool<T> template and BufferSlicePool for high-throughput packet handling
  - PoolMonitor: Statistics tracking for pool usage (debugging/monitoring)

  - LZ4 compression for packets >4KB with magic header identification
  - Transparent compress/decompress with fallback for uncompressed data

  - Increase default block cache from 1GB to 4GB
  - Cache index and filter blocks in memory
  - Pin L0 filter/index blocks
  - Increase background compaction threads (4→8) and flush threads (2→4)
  - Add LZ4 compression (ZSTD for bottommost level)
  - Tune memtable and compaction settings

  - LRUCache: Replace std::set with std::unordered_map for O(1) lookups; add likely/unlikely hints
  - ObjectPool: Improved thread-local pooling with reserved capacity
  - Bitset: Optimized bit operations
  - ChainBuffer/CyclicBuffer: Minor improvements

  - tdutils/test/LRUCache.cpp - LRUCache unit tests
  - tdutils/test/ObjectPool.cpp - ObjectPool unit tests
  - tdutils/test/OptimizationBenchmarks.cpp - Microbenchmarks
  - tdutils/test/Phase5Benchmarks.cpp - Integration benchmarks
  - storage/test/bitset_optimization.cpp - Bitset benchmarks
  - test/test-memory-pools.cpp - Memory pool tests

  - All existing tests pass (ctest)
  - New unit tests pass for LRUCache, ObjectPool, memory pools
  - Benchmark results show improvement (run Phase5Benchmarks)
  - No memory leaks under ASan
  - Build succeeds on Ubuntu 22.04/24.04 and macOS

  - -ffast-math may affect floating-point precision in edge cases; TON's core logic uses integer arithmetic
  - RocksDB cache increase (1GB→4GB) assumes validators have sufficient RAM; consider making configurable
  - ADNL compression adds ~1-2% CPU overhead but can significantly reduce bandwidth for large packets
@DanShaders
Copy link
Collaborator

DanShaders commented Dec 3, 2025

If you want this to ever be merged, please split changes into separate commits/PRs and provide before/after benchmark results for each optimization individually (ideally, for metrics that we actually care about: blocks per second, transactions per second, latencies).

At first glance, most of the touched code here do not lie on any hotpath and thus do not need to be any more compilated than it is now.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants