
network: More efficient caching for Envoy socket addresses #37832

Open · wants to merge 3 commits into base: main

Conversation

@abeyad (Contributor) commented Dec 27, 2024

An LRU cache was introduced to cache Envoy::Network::Address instances because they are expensive to create. These addresses are cached for reading source and destination addresses from recvmsg and recvmmsg calls on QUIC UDP sockets. The current size of the cache is 4 entries for each IoHandle (i.e. each socket).

A locally run CPU profile of Envoy Mobile showed about 1.75% of CPU cycles going towards querying and inserting into the quic::QuicLRUCache.

Given the small number of elements in the cache, this change uses a std::vector instead of QuicLRUCache. QuicLRUCache, std::deque, and std::vector were compared using newly added benchmark tests, with the following results:

QuicLRUCache:

-------------------------------------------------------------------------------------------------------------------------
Benchmark                                                                               Time             CPU   Iterations
-------------------------------------------------------------------------------------------------------------------------
BM_GetOrCreateEnvoyAddressInstanceNoCache/iterations:1000                           31595 ns        31494 ns         1000
BM_GetOrCreateEnvoyAddressInstanceConnectedSocket/iterations:1000                    5538 ns         5538 ns         1000
BM_GetOrCreateEnvoyAddressInstanceUnconnectedSocket/iterations:1000                 38918 ns        38814 ns         1000
BM_GetOrCreateEnvoyAddressInstanceUnconnectedSocketLargerCache/iterations:1000      52969 ns        52846 ns         1000

std::deque:

-------------------------------------------------------------------------------------------------------------------------
Benchmark                                                                               Time             CPU   Iterations
-------------------------------------------------------------------------------------------------------------------------
BM_GetOrCreateEnvoyAddressInstanceNoCache/iterations:1000                           31805 ns        31716 ns         1000
BM_GetOrCreateEnvoyAddressInstanceConnectedSocket/iterations:1000                    1553 ns         1550 ns         1000
BM_GetOrCreateEnvoyAddressInstanceUnconnectedSocket/iterations:1000                 27243 ns        27189 ns         1000
BM_GetOrCreateEnvoyAddressInstanceUnconnectedSocketLargerCache/iterations:1000      39335 ns        39235 ns         1000

std::vector:

-------------------------------------------------------------------------------------------------------------------------
Benchmark                                                                               Time             CPU   Iterations
-------------------------------------------------------------------------------------------------------------------------
BM_GetOrCreateEnvoyAddressInstanceNoCache/iterations:1000                           31960 ns        31892 ns         1000
BM_GetOrCreateEnvoyAddressInstanceConnectedSocket/iterations:1000                    1514 ns         1514 ns         1000
BM_GetOrCreateEnvoyAddressInstanceUnconnectedSocket/iterations:1000                 26361 ns        26261 ns         1000
BM_GetOrCreateEnvoyAddressInstanceUnconnectedSocketLargerCache/iterations:1000      43987 ns        43738 ns         1000

std::vector uses about 3.5x fewer CPU cycles than quic::QuicLRUCache and performs slightly better than std::deque at these small cache sizes. If a larger cache (e.g. >= 50 entries) is ever considered, std::deque may perform better and would be worth profiling, though at that size the benchmarks suggest no cache at all outperforms having a cache.
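To illustrate why a linear scan wins at this size, here is a minimal sketch of a tiny fixed-capacity cache backed by std::vector. This is not the actual Envoy/QUICHE implementation; the class and member names are illustrative. The point is that for ~4 entries, scanning a contiguous array avoids the hashing and bookkeeping overhead of a hash-map-based LRU structure.

```cpp
#include <string>
#include <utility>
#include <vector>

// Hypothetical sketch: a small fixed-capacity cache with linear-scan
// lookup. At capacities around 4, comparing a handful of contiguous
// entries is typically cheaper than a hash lookup plus LRU list updates.
template <typename Key, typename Value> class SmallVectorCache {
public:
  explicit SmallVectorCache(size_t capacity) : capacity_(capacity) {}

  // Returns a pointer to the cached value, or nullptr on a miss.
  const Value* lookup(const Key& key) const {
    for (const auto& entry : entries_) {
      if (entry.first == key) {
        return &entry.second;
      }
    }
    return nullptr;
  }

  // Inserts a new entry; once full, overwrites entries in FIFO order.
  void insert(Key key, Value value) {
    if (entries_.size() < capacity_) {
      entries_.emplace_back(std::move(key), std::move(value));
      return;
    }
    entries_[next_evict_] = {std::move(key), std::move(value)};
    next_evict_ = (next_evict_ + 1) % capacity_;
  }

private:
  size_t capacity_;
  size_t next_evict_{0};
  std::vector<std::pair<Key, Value>> entries_;
};
```

Note the sketch evicts in FIFO rather than strict LRU order; with a 4-entry cache fed by a small set of recurring peer addresses, the difference in hit rate is negligible while the lookup path stays branch-light and cache-friendly.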

Risk Level: low
Testing: unit and benchmark tests
Docs Changes: n/a
Release Notes: n/a
Platform Specific Features: n/a

@abeyad (Contributor, Author) commented Dec 27, 2024

/assign @alyssawilk

@abeyad (Contributor, Author) commented Dec 27, 2024

cc @RenjieTang

@abeyad (Contributor, Author) commented Dec 28, 2024

/retest
