
network: More efficient caching for Envoy socket addresses #37832

Open · wants to merge 3 commits into base: main

Conversation

@abeyad (Contributor) commented Dec 27, 2024

An LRU cache was introduced to cache Envoy::Network::Address instances because they are expensive to create. These addresses are cached for reading source and destination addresses from recvmsg and recvmmsg calls on QUIC UDP sockets. The current size of the cache is 4 entries for each IoHandle (i.e. each socket).

A locally run CPU profile of Envoy Mobile showed about 1.75% of CPU cycles going towards querying and inserting into the quic::QuicLRUCache.

Given the small number of elements in the cache, this change uses a std::vector instead of QuicLRUCache. QuicLRUCache, std::deque, and std::vector were compared using newly added benchmark tests, with the following results:

QuicLRUCache:

-------------------------------------------------------------------------------------------------------------------------
Benchmark                                                                               Time             CPU   Iterations
-------------------------------------------------------------------------------------------------------------------------
BM_GetOrCreateEnvoyAddressInstanceNoCache/iterations:1000                           31595 ns        31494 ns         1000
BM_GetOrCreateEnvoyAddressInstanceConnectedSocket/iterations:1000                    5538 ns         5538 ns         1000
BM_GetOrCreateEnvoyAddressInstanceUnconnectedSocket/iterations:1000                 38918 ns        38814 ns         1000
BM_GetOrCreateEnvoyAddressInstanceUnconnectedSocketLargerCache/iterations:1000      52969 ns        52846 ns         1000

std::deque:

-------------------------------------------------------------------------------------------------------------------------
Benchmark                                                                               Time             CPU   Iterations
-------------------------------------------------------------------------------------------------------------------------
BM_GetOrCreateEnvoyAddressInstanceNoCache/iterations:1000                           31805 ns        31716 ns         1000
BM_GetOrCreateEnvoyAddressInstanceConnectedSocket/iterations:1000                    1553 ns         1550 ns         1000
BM_GetOrCreateEnvoyAddressInstanceUnconnectedSocket/iterations:1000                 27243 ns        27189 ns         1000
BM_GetOrCreateEnvoyAddressInstanceUnconnectedSocketLargerCache/iterations:1000      39335 ns        39235 ns         1000

std::vector:

-------------------------------------------------------------------------------------------------------------------------
Benchmark                                                                               Time             CPU   Iterations
-------------------------------------------------------------------------------------------------------------------------
BM_GetOrCreateEnvoyAddressInstanceNoCache/iterations:1000                           31960 ns        31892 ns         1000
BM_GetOrCreateEnvoyAddressInstanceConnectedSocket/iterations:1000                    1514 ns         1514 ns         1000
BM_GetOrCreateEnvoyAddressInstanceUnconnectedSocket/iterations:1000                 26361 ns        26261 ns         1000
BM_GetOrCreateEnvoyAddressInstanceUnconnectedSocketLargerCache/iterations:1000      43987 ns        43738 ns         1000

std::vector uses about 3.5x fewer CPU cycles than quic::QuicLRUCache and performs slightly better than std::deque at these small cache sizes. If a larger cache (e.g. >= 50 entries) is ever considered, std::deque may perform better and would be worth profiling, though at that size the benchmarks suggest no cache at all outperforms having a cache.
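To illustrate why a linear scan wins at this size, here is a minimal sketch of a tiny fixed-capacity cache backed by std::vector. This is not the actual Envoy/QUICHE implementation; the class and member names are illustrative. The point is that for ~4 entries, scanning a contiguous array avoids the hashing and bookkeeping overhead of a hash-map-based LRU structure.

```cpp
#include <string>
#include <utility>
#include <vector>

// Hypothetical sketch: a small fixed-capacity cache with linear-scan
// lookup. At capacities around 4, comparing a handful of contiguous
// entries is typically cheaper than a hash lookup plus LRU list updates.
template <typename Key, typename Value> class SmallVectorCache {
public:
  explicit SmallVectorCache(size_t capacity) : capacity_(capacity) {}

  // Returns a pointer to the cached value, or nullptr on a miss.
  const Value* lookup(const Key& key) const {
    for (const auto& entry : entries_) {
      if (entry.first == key) {
        return &entry.second;
      }
    }
    return nullptr;
  }

  // Inserts a new entry; once full, overwrites entries in FIFO order.
  void insert(Key key, Value value) {
    if (entries_.size() < capacity_) {
      entries_.emplace_back(std::move(key), std::move(value));
      return;
    }
    entries_[next_evict_] = {std::move(key), std::move(value)};
    next_evict_ = (next_evict_ + 1) % capacity_;
  }

private:
  size_t capacity_;
  size_t next_evict_{0};
  std::vector<std::pair<Key, Value>> entries_;
};
```

Note the sketch evicts in FIFO rather than strict LRU order; with a 4-entry cache fed by a small set of recurring peer addresses, the difference in hit rate is negligible while the lookup path stays branch-light and cache-friendly.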

Risk Level: low
Testing: unit and benchmark tests
Docs Changes: n/a
Release Notes: n/a
Platform Specific Features: n/a

@abeyad (Contributor, Author) commented Dec 27, 2024

/assign @alyssawilk

@abeyad (Contributor, Author) commented Dec 27, 2024

cc @RenjieTang

@abeyad (Contributor, Author) commented Dec 28, 2024

/retest
