Performance: Channel Congestion #4273

bw-solana · 2025-01-03T20:31:26Z

A lot of channel.send()/channel.recv() calls become contended when the number of txs is high (conceptually this is the same problem as poh.record). This happens with the prio cache, with accountsdb.write_accounts_to, etc.

This is also easy to fix: increase batching and sprinkle in some spin looping. The goal is to stop threads from cycling through "I go to sleep; [other thread] wakes up the sleeping thread" in a tight loop, which causes a storm of sleep/wake syscalls.
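To make the batching and spin-looping idea concrete, here is a minimal sketch using only `std::sync::mpsc`. It is not Agave code: `send_batched`, `recv_spin_then_block`, `run_demo` and the `SPIN_LIMIT` value are hypothetical names and numbers chosen for illustration, and the real channels in question may be a different implementation.

```rust
use std::hint;
use std::sync::mpsc::{self, Receiver, Sender, TryRecvError};
use std::thread;

/// Hypothetical spin budget; a real value would need tuning under load.
const SPIN_LIMIT: u32 = 1_000;

/// Sends `items` in chunks of `batch_size`, so the receiver can be woken
/// at most once per batch instead of once per item.
fn send_batched<T>(tx: &Sender<Vec<T>>, items: Vec<T>, batch_size: usize) {
    assert!(batch_size > 0, "batch_size must be non-zero");
    let mut batch = Vec::with_capacity(batch_size);
    for item in items {
        batch.push(item);
        if batch.len() == batch_size {
            let full = std::mem::replace(&mut batch, Vec::with_capacity(batch_size));
            if tx.send(full).is_err() {
                return; // receiver is gone, nothing left to do
            }
        }
    }
    if !batch.is_empty() {
        let _ = tx.send(batch); // flush the final partial batch
    }
}

/// Polls the channel for a short while before falling back to a blocking
/// `recv`, so a busy channel rarely pays for a sleep/wake round trip.
fn recv_spin_then_block<T>(rx: &Receiver<T>) -> Option<T> {
    for _ in 0..SPIN_LIMIT {
        match rx.try_recv() {
            Ok(value) => return Some(value),
            Err(TryRecvError::Empty) => hint::spin_loop(),
            Err(TryRecvError::Disconnected) => return None,
        }
    }
    rx.recv().ok() // spin budget exhausted: block until data or disconnect
}

/// Drives both helpers; returns (number of batches received, sum of items).
fn run_demo(n: u64, batch_size: usize) -> (usize, u64) {
    let (tx, rx) = mpsc::channel::<Vec<u64>>();
    let producer = thread::spawn(move || send_batched(&tx, (1..=n).collect(), batch_size));
    let mut batches = 0;
    let mut sum = 0;
    while let Some(batch) = recv_spin_then_block(&rx) {
        batches += 1;
        sum += batch.iter().sum::<u64>();
    }
    producer.join().unwrap();
    (batches, sum)
}
```

For example, `run_demo(1000, 64)` moves 1000 items through the channel in 16 sends. The trade-off is the usual one: larger batches and longer spins cut syscalls but add latency and burn CPU when the channel is idle.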

Another "fun" contention pattern we have is channel.send(item) called from N threads while the receiver is sleeping, so it must be woken up. To wake it, a mutex on the channel must be acquired. Multiple producer threads try to acquire it at the same time, only one succeeds, and the others go to sleep. The one that succeeds issues the syscall to wake the receiver and releases the mutex. All the other threads now race to acquire the mutex to wake the receiver... which has already been woken up.

So each of them acquires the lock, sees the receiver is awake, and does nothing. They contend on the mutex for absolutely no reason. This happens with the prio cache, accountsdb, and pretty much anything that gets executed with many txs from replay stage.
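One way to avoid the redundant wake-ups described above is to let producers check an atomic "receiver is sleeping" flag, so only the producer that flips it performs the wake. The sketch below is an illustration under assumptions, not the channel implementation Agave uses: `WakeOnceQueue` and `run_producers` are hypothetical, the queue supports a single consumer, and the data itself still sits behind a plain `Mutex<VecDeque>` for brevity. Only the wake path is the point here.

```rust
use std::collections::VecDeque;
use std::sync::atomic::{AtomicBool, AtomicUsize, Ordering};
use std::sync::{Arc, Condvar, Mutex};
use std::thread;

/// Hypothetical single-consumer queue where only one producer pays for a wake-up.
struct WakeOnceQueue<T> {
    items: Mutex<VecDeque<T>>,
    ready: Condvar,
    /// Set by the receiver (while holding the lock) just before it sleeps.
    receiver_sleeping: AtomicBool,
    /// Counts wake-ups actually issued, for inspection only.
    wakeups: AtomicUsize,
}

impl<T> WakeOnceQueue<T> {
    fn new() -> Self {
        Self {
            items: Mutex::new(VecDeque::new()),
            ready: Condvar::new(),
            receiver_sleeping: AtomicBool::new(false),
            wakeups: AtomicUsize::new(0),
        }
    }

    fn send(&self, item: T) {
        self.items.lock().unwrap().push_back(item); // guard dropped at end of statement
        // Only the producer that flips the flag from true to false issues the
        // wake-up; every other producer skips the wake path entirely.
        if self.receiver_sleeping.swap(false, Ordering::SeqCst) {
            self.wakeups.fetch_add(1, Ordering::Relaxed);
            self.ready.notify_one();
        }
    }

    fn recv(&self) -> T {
        let mut items = self.items.lock().unwrap();
        loop {
            if let Some(item) = items.pop_front() {
                return item;
            }
            // Announce the sleep while still holding the lock, so a producer
            // that pushes afterwards is guaranteed to observe the flag.
            self.receiver_sleeping.store(true, Ordering::SeqCst);
            items = self.ready.wait(items).unwrap();
        }
    }
}

/// Spawns `producers` threads that each send 1..=per_producer and returns
/// (sum of everything received, number of wake-ups issued).
fn run_producers(producers: usize, per_producer: u64) -> (u64, usize) {
    let queue = Arc::new(WakeOnceQueue::new());
    let handles: Vec<_> = (0..producers)
        .map(|_| {
            let q = Arc::clone(&queue);
            thread::spawn(move || {
                for i in 1..=per_producer {
                    q.send(i);
                }
            })
        })
        .collect();
    let total = producers as u64 * per_producer;
    let mut sum = 0u64;
    for _ in 0..total {
        sum += queue.recv();
    }
    for handle in handles {
        handle.join().unwrap();
    }
    (sum, queue.wakeups.load(Ordering::Relaxed))
}
```

Calling `run_producers(4, 1000)` and comparing the returned wake-up count with the 4000 sends gives a rough feel for how many notifications the flag avoids; the exact count depends on scheduling.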

We could also consider moving to this: https://github.com/temporalxyz/que

bw-solana added this to Agave Performance Jan 3, 2025

bw-solana moved this to In progress in Agave Performance Jan 3, 2025

bw-solana assigned alessandrod Jan 3, 2025
