Naïve question about performance bottlenecks #3160
-
hi @peterbourgon, good question! For context, remember that what we are trying to do is not build a Kafka++ but rather evolve the conversation around streaming, e.g. embedding computational engines (V8, Wasmer, WAVM, etc.) into redpanda for things like customizable compaction strategies, partition placement strategies, etc. Anything that you think should have an API, at some point will.

For raft, there are a few networking bottlenecks (heartbeats + data RPCs). For heartbeats, we built a custom "lossy" compressor to reduce the heartbeat size on the wire. The second one is data. It turns out this is not as trivial as saying the network is the bottleneck (which it can be); it's more nuanced. Take the i3en.12xlarge (50 Gbps) vs the i3en.6xlarge (25 Gbps) instances: the former is IOPS/disk + CPU (compression) bound, the latter is network bound. Assuming we can go as fast as fio, say 1.1 GB/s on an xfs raid0 (software), at the 12xlarge you really have a lot of network wiggle room (which is quickly consumed by other things like tiered storage, etc.), but for the raft part specifically it ultimately depends on the hardware it is running on. The bottleneck will shift from subsystem to subsystem.

The TpC design is all about giving us tools to saturate the underlying devices by ultimately reducing coordination. Reducing coordination is not just at the filesystem level or the CPU level; we spend a lot of time thinking about coalescing, debouncing, batching, pipelining, removing barriers, etc. No panacea, but for systems like redpanda, TpC is a good foundation to help us build for the future we intend to see in streaming (see the initial sentence on context). hope this helps
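To make the instance comparison above concrete, here is a back-of-envelope sketch. The only number taken from the discussion is the ~1.1 GB/s fio result; the replication fan-out and the per-byte consumer/tiered-storage costs are illustrative assumptions, not measurements of redpanda:

```python
# Back-of-envelope: does the NIC or the disk saturate first?
# Assumptions (illustrative): replication factor 3 (leader ships each byte
# to 2 followers), 1x consumer egress, 1x tiered-storage upload.

def gbps_to_gb_per_s(gbps: float) -> float:
    """Convert network line rate (gigabits/s) to gigabytes/s."""
    return gbps / 8.0

disk_gb_s = 1.1                    # fio, xfs raid0 of local NVMe (from the thread)
net_12xl = gbps_to_gb_per_s(50)    # i3en.12xlarge NIC -> 6.25 GB/s
net_6xl = gbps_to_gb_per_s(25)     # i3en.6xlarge NIC  -> 3.125 GB/s

def network_cost(ingress_gb_s: float, fanout=2, consumers=1, tiered=1) -> float:
    """Approximate NIC bandwidth a leader needs to sustain a given
    producer ingress rate, under the assumptions above."""
    return ingress_gb_s * (fanout + consumers + tiered)

# Writing at full disk speed needs roughly 4.4 GB/s of network:
needed = network_cost(disk_gb_s)
print(needed)              # 4.4
print(needed < net_12xl)   # True  -> 12xlarge: disk/CPU bound first
print(needed > net_6xl)    # True  -> 6xlarge: network bound first
```

Under these (assumed) multipliers, the 12xlarge has NIC headroom left when the disk is saturated, while the 6xlarge runs out of NIC before the disk, matching the "bottleneck shifts from subsystem to subsystem" point.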
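On the heartbeat side, one way to picture the "reduce heartbeat bytes on the wire" idea is coalescing: with thousands of raft groups (partitions), sending one RPC per group per interval is wasteful, so per-group heartbeats destined for the same peer can be batched into a single message. This is only a sketch of the general technique, not redpanda's actual wire format or compressor:

```python
# Sketch: coalesce per-raft-group heartbeats by destination node, so each
# node pair exchanges one batched message per interval instead of one RPC
# per partition. Names and fields here are hypothetical.
from collections import defaultdict
from dataclasses import dataclass

@dataclass(frozen=True)
class Heartbeat:
    group_id: int     # raft group (one per partition)
    term: int
    target_node: int  # peer that should receive this heartbeat

def coalesce(heartbeats):
    """Return {target_node: [(group_id, term), ...]} — one batch per peer."""
    batches = defaultdict(list)
    for hb in heartbeats:
        batches[hb.target_node].append((hb.group_id, hb.term))
    return dict(batches)

hbs = [Heartbeat(1, 5, 2), Heartbeat(2, 7, 2), Heartbeat(3, 5, 3)]
batches = coalesce(hbs)
# node 2 gets one message carrying heartbeats for groups 1 and 2
```

A "lossy" compressor can then shrink each batch further, e.g. by eliding fields that match the common case and only encoding deltas, at the cost of occasionally resending full state.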
-
I watched Alex's presentation Co-Designing Raft + Thread-per-Core Execution Model with some interest and learned a lot. But I'm left with kind of a high-level and probably naïve question. My intuition is that the bottleneck in any Raft-like (i.e. CP) system is always going to be the inter-node communication required for consensus, and that this would dominate by orders of magnitude any gains you could get by speeding up local I/O performance. What about this system am I missing, such that the optimizations discussed in the presentation have meaningful impact? I'm sure it's something! :)