To run these benchmarks yourself, simply run `make benchmark`.
This repo tests different Go ASM implementations of `Union2By2`.
The goal of these benchmarks is to quantitatively determine if we can improve the performance of `Union2By2` in the Roaring repo.
For context, this method is one of the foremost bottlenecks in Roaring today, so any optimizations should be felt immediately by end users.
Below is a profile of Roaring under load, exhibiting this bottleneck (note that we only provide the Roaring snippet for privacy reasons). Note that `union2by2` accounts for roughly 25% of our execution time.
For reference, we have provided the Go implementation of `union2by2` below. This implementation can also be seen in `union2by2.go`.
```go
func union2by2_generic(set1, set2, buffer []uint16) []uint16 {
	pos := 0
	k1 := 0
	k2 := 0
	if len(set2) == 0 {
		buffer = buffer[:len(set1)]
		copy(buffer, set1[:])
		return buffer
	}
	if len(set1) == 0 {
		buffer = buffer[:len(set2)]
		copy(buffer, set2[:])
		return buffer
	}
	s1 := set1[k1]
	s2 := set2[k2]
	buffer = buffer[:cap(buffer)]
	for {
		if s1 < s2 {
			buffer[pos] = s1
			pos++
			k1++
			if k1 >= len(set1) {
				copy(buffer[pos:], set2[k2:])
				pos += len(set2) - k2
				break
			}
			s1 = set1[k1]
		} else if s1 == s2 {
			buffer[pos] = s1
			pos++
			k1++
			k2++
			if k1 >= len(set1) {
				copy(buffer[pos:], set2[k2:])
				pos += len(set2) - k2
				break
			}
			if k2 >= len(set2) {
				copy(buffer[pos:], set1[k1:])
				pos += len(set1) - k1
				break
			}
			s1 = set1[k1]
			s2 = set2[k2]
		} else { // if (set1[k1] > set2[k2])
			buffer[pos] = s2
			pos++
			k2++
			if k2 >= len(set2) {
				copy(buffer[pos:], set1[k1:])
				pos += len(set1) - k1
				break
			}
			s2 = set2[k2]
		}
	}
	return buffer[:pos]
}
```
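For clarity, here is a brief usage sketch (our own illustration, assuming `union2by2_generic` above is defined in the same package): the caller is responsible for passing a buffer whose capacity covers the full union, since the merge writes into `buffer[:cap(buffer)]` before trimming the result.

```go
package main

import "fmt"

func main() {
	set1 := []uint16{1, 3, 5}
	set2 := []uint16{2, 3, 6}
	// The buffer needs capacity for len(set1)+len(set2) elements,
	// the worst case when the two sets are disjoint.
	buffer := make([]uint16, 0, len(set1)+len(set2))
	fmt.Println(union2by2_generic(set1, set2, buffer)) // [1 2 3 5 6]
}
```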
While the existing assembly solution is quite streamlined, our optimization strategy aims to make the control flow more CPU-friendly. Specifically, we target two vectors for optimization:
- Minimizing branch mispredictions, and
- Minimizing the number of instructions.
Our operating hypothesis for these optimizations is that branch misprediction is the largest inefficiency in the existing implementation. CPUs typically pipeline their instructions to increase their "instructions per cycle" throughput. Hence, when a conditional branch is encountered, the machine predicts which way the branch will go and continues to pipeline instructions. If the CPU is ever wrong, all of those pipelined instructions need to be purged, thereby wasting clock cycles. The original ASM implementation does not appear to be optimized against branch misprediction.
We make two attempts to address this: one minimizes branching outright, and the other maximizes the branch predictor's accuracy. First, we attempt to replace all branches with conditional moves. These instructions select a value from two possible source registers based on the CPU's condition flags (see the CSEL documentation). Hence, we are now able to circumvent branching in favor of a single conditional instruction. Note that while we do minimize branching, these instructions are not free: since we are waiting on the resolution of two registers and a condition flag, we have three data dependencies. The implied tradeoff is that these three data dependencies have less expected latency than a branch misprediction. This assumption seems fair, since each of the three existing branches is equally plausible (and, hence, the probability of correctly predicting the branch taken on any given iteration is only about 33%).
A Go model of the above changes is shown below. Note that, in the corresponding assembly, the only branches that remain are the loop-termination checks; the value selection and index updates are implemented with conditional moves.
```go
func union2by2_cond_moves(set1, set2, buffer []uint16) []uint16 {
	pos := 0
	k1 := 0
	k2 := 0
	buffer = buffer[:cap(buffer)]
	for {
		if k1 >= len(set1) {
			break
		}
		if k2 >= len(set2) {
			break
		}
		s1 := set1[k1]
		s2 := set2[k2]
		// In the assembly, each of the selections below is intended to be a
		// conditional (CSEL-style) select rather than a branch.
		out := s1
		if s2 < s1 {
			out = s2
		}
		buffer[pos] = out
		pos++
		if s1 <= s2 {
			k1++
		}
		if s1 >= s2 {
			k2++
		}
	}
	// Flush whichever input still has elements remaining.
	if k1 < len(set1) {
		copy(buffer[pos:], set1[k1:])
		pos += len(set1) - k1
		return buffer[:pos]
	}
	if k2 < len(set2) {
		copy(buffer[pos:], set2[k2:])
		pos += len(set2) - k2
		return buffer[:pos]
	}
	return buffer[:pos]
}
```
While implementing this solution, we were hamstrung by Go's contrived flavor of ASM. We were unable to use the native arm64 instructions for conditional moves, and we had to resort to recreating some of the missing functionality ourselves. The associated cost was the addition of three (otherwise unnecessary) instructions. Refer to the results snippet below for a peek at performance on randomly generated sets:
```
BenchmarkUnion2By2_Random/65535_random_elements_Original#03-10      4441    907552 ns/op
BenchmarkUnion2By2_Random/65535_random_elements_Version_1#03-10    10000    331182 ns/op
```
We see roughly a 64% reduction in time per operation here (907552 ns/op down to 331182 ns/op, about a 2.7x speedup).
The dual approach to minimizing branch mispredictions is to maximize correct branch predictions. This approach attempts to leverage the branch predictor to give the program a much higher prediction success rate. Recall that the original codebase had three branches at the top level of the for-loop. The branch predictor would have to choose one of these three branches on every iteration of the loop, and the probability of choosing correctly was roughly 33%. However, we can see performance gains if we move the loop inside the conditional blocks (a practice we refer to here as loop unrolling). Refer to the pseudo-Go below:
```go
func union2by2_loop_unrolling(set1, set2, buffer []uint16) []uint16 {
	// same setup logic ...
lt_loop:
	for {
		if s1 > s2 {
			goto gt_loop
		}
		if s1 == s2 {
			goto eq_loop
		}
		buffer[pos] = s1
		pos++
		k1++
		if k1 >= len(set1) {
			// flush and return
		}
		s1 = set1[k1]
	}
eq_loop:
	for {
		if s1 < s2 {
			goto lt_loop
		}
		if s1 > s2 {
			goto gt_loop
		}
		buffer[pos] = s1
		pos++
		k1++
		k2++
		if k1 >= len(set1) {
			// flush and return
		}
		if k2 >= len(set2) {
			// flush and return
		}
		s1 = set1[k1]
		s2 = set2[k2]
	}
gt_loop:
	for {
		if s1 < s2 {
			goto lt_loop
		}
		if s1 == s2 {
			goto eq_loop
		}
		buffer[pos] = s2
		pos++
		k2++
		if k2 >= len(set2) {
			// flush and return
		}
		s2 = set2[k2]
	}
	return buffer[:pos]
}
```
Notice that we have three loops here. This structure is more amenable to the branch predictor, since the predictor will likely keep repeating the current loop (rather than jumping to another loop). In the event that one of the conditions inside the loop trips, we jump to and start executing the associated loop. This structure offers a sizable performance boost on inputs that have runs of elements to add, such as:
set1={1, 2, 3, 7, 8, 9}; set2={4, 5, 6, 10, 11, 12}
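As an illustration (our own walk-through, not part of the repo), tracing the example above through the three-loop structure shows why runs are cheap: each run of three elements stays inside a single loop, and only the transition between runs costs a jump. The result is the same for every implementation; the comments describe how the loop-unrolled version would traverse these inputs.

```go
package main

import "fmt"

func main() {
	// Runs of three elements alternate between the two sets, so the merge
	// stays in lt_loop for {1,2,3}, jumps once to gt_loop for {4,5,6},
	// jumps back to lt_loop for {7,8,9}, and then flushes {10,11,12}.
	set1 := []uint16{1, 2, 3, 7, 8, 9}
	set2 := []uint16{4, 5, 6, 10, 11, 12}
	buffer := make([]uint16, 0, len(set1)+len(set2))
	fmt.Println(union2by2_generic(set1, set2, buffer))
	// Output: [1 2 3 4 5 6 7 8 9 10 11 12]
}
```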
Refer to the result snippet below for a peek at performance on sets with runs:
```
BenchmarkUnion2By2_FixedAlternating/65535_elements_fixed_alternating_on_1024_Original-10     29239    41246 ns/op
BenchmarkUnion2By2_FixedAlternating/65535_elements_fixed_alternating_on_1024_Version_5-10    32748    35839 ns/op
BenchmarkUnion2By2_FixedAlternating/65535_elements_fixed_alternating_on_1024_Version_7-10    39901    29964 ns/op
BenchmarkUnion2By2_VariableAlternating/65535_elements_alternating_on_16_Original-10          18892    62461 ns/op
BenchmarkUnion2By2_VariableAlternating/65535_elements_alternating_on_16_Version_5-10         27440    42865 ns/op
BenchmarkUnion2By2_VariableAlternating/65535_elements_alternating_on_16_Version_7-10         26617    43098 ns/op
```
Notice that our flavor of loop unrolling yields roughly a 27% (fixed runs) and 31% (variable runs) reduction in time per operation.
For more, please refer to the results section.
There are some other minor upgrades that were made to the algorithm. These upgrades were added to minimize the number of instructions in the loop (since those instructions scale with the number of iterations). For example, one upgrade we made was to pre-sort the sets so that registers `R0`/`R3` are guaranteed to correspond to the set that will be exhausted first. Hence, we can drop the `R1`/`R4` comparison and branch from the loop, reducing the total number of instructions by 2n, where n is the number of loop iterations.
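To illustrate the pre-sort idea (a minimal Go sketch of our own; the actual swap happens in the ASM prologue and its details may differ): since both inputs are sorted ascending, the set whose last element is smaller must run out no later than the other, so the two operands can simply be swapped up front.

```go
// presortForMerge is a hypothetical helper that returns the inputs ordered so
// that the first return value is guaranteed to be exhausted no later than the
// second during the merge.
func presortForMerge(set1, set2 []uint16) (first, second []uint16) {
	if len(set1) == 0 || len(set2) == 0 {
		return set1, set2
	}
	// Both sets are sorted ascending, so comparing the last elements is
	// enough to know which set the merge will drain first.
	if set1[len(set1)-1] <= set2[len(set2)-1] {
		return set1, set2
	}
	return set2, set1
}
```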
We have 7 implementations/variations of `union2by2` for ASM. These implementations are described below:
1. Remove all conditional branches (except for loop-termination checks) in favor of conditional moves,
2. Build off of implementation (1), but add pre-sorting of the sets,
3. Add some branches back to implementation (1) in order to reduce the total number of instructions (i.e., explore the instruction/branch-prediction tradeoff),
4. Implementation (3) with a pre-sort on the sets,
5. The original implementation, but with loop unrolling,
6. The original implementation, but with pre-sorting of the sets, and
7. The original implementation with both loop unrolling and pre-sorting.
We also have 3 different types of benchmarks, as listed below:
- Fixed-size, alternating runs (e.g., `step=2, set1={0,1,4,5}, set2={2,3,6,7}`; a generator sketch follows this list),
- Variable-size, alternating runs (e.g., `max_step=4, set1={0,2,3,6,7,8,9,14,16}, set2={1,4,5,10,11,12,13,15}`), and
- Random sets.
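As a rough illustration of the fixed-size alternating input shape (a hypothetical generator of our own; the benchmark code is the authoritative source for how the inputs are actually built):

```go
// makeFixedAlternating is a hypothetical helper that splits 0..n-1 into two
// sets by alternating runs of length step. With n=8 and step=2 it yields
// set1={0,1,4,5} and set2={2,3,6,7}, matching the example above.
func makeFixedAlternating(n, step int) (set1, set2 []uint16) {
	for i := 0; i < n; i++ {
		if (i/step)%2 == 0 {
			set1 = append(set1, uint16(i))
		} else {
			set2 = append(set2, uint16(i))
		}
	}
	return set1, set2
}
```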
All results are in the `results/` folder. These benchmarks were run on a 2021 MacBook Pro (M1 Pro).
- For the fixed-run benchmarks, implementations (5) and (7) show the benefit of loop unrolling as the number of elements (and, hence, the number of fixed runs) increases.
- Conversely, we see loop unrolling struggle mightily when the step size is 1 (i.e., no runs, which is the worst case for loop unrolling).
- We also see that the conditional moves are 2-3x slower than the original implementation. This makes sense for two reasons: first, since we have runs, branch prediction guesses correctly for some proportion of the branches; second, the additional instructions are evidently costly here. Hence, the conditional moves are not pulling their weight in these tests.
- For the variable-run benchmarks, results are fairly consistent with the above, but we do see loop unrolling struggle a little more here.
- Loop unrolling seems to reach parity at `maxSteps=4`, and then overtakes the original implementation afterwards.
- The conditional moves are still bad.
- Conditional moves shine on the random benchmarks: versions 1-3 pull far ahead here. This makes sense, as random data are not friendly to branch prediction.
- Notice that at low cardinalities (< 10k elements), version 3 stays stable at roughly 8k ns/op, which is 4x faster than the original.
- At max cardinality, though, versions 1 and 2 seem to be faster than version 3. These implementations are both about 2x faster than the original.
- While loop unrolling does still seem to be effective (read: faster than baseline), it pales in comparison to the performance gains of the first 3 implementations.
These results indicate that each of our implementations does something better than the baseline implementation; however, each can also perform worse than the baseline under the wrong conditions. We are most concerned with the performance of this function at larger loads, though, so we can use these results to inform which implementation to use under which conditions. At sufficiently high loads, loop unrolling is generally faster than the baseline implementation. Under random load (which is likely the best indicator of real-world performance), however, we see substantial performance gains from implementations 1-3.