Union2By2 Optimizations

To run these benchmarks yourself, simply run make benchmark.

Background

This repo tests different Go ASM implementations of Union2By2. The goal of these benchmarks is to quantitatively determine whether we can improve the performance of Union2By2 in the Roaring repo. For context, this method is one of the foremost bottlenecks in Roaring today, so any optimizations should immediately be felt by end users. Below is a profile of Roaring under load exhibiting this bottleneck (note that we only provide the Roaring snippet for privacy reasons):

Roaring Profile

Note that union2by2 accounts for 25% of our execution time.

Original Algorithm

For reference, we have provided the Go implementation of union2by2. This implementation can also be seen in union2by2.go.

func union2by2_generic(set1, set2, buffer []uint16) []uint16 {
	pos := 0
	k1 := 0
	k2 := 0

	if len(set2) == 0 {
		buffer = buffer[:len(set1)]
		copy(buffer, set1[:])
		return buffer
	}

	if len(set1) == 0 {
		buffer = buffer[:len(set2)]
		copy(buffer, set2[:])
		return buffer
	}

	s1 := set1[k1]
	s2 := set2[k2]
	buffer = buffer[:cap(buffer)]
	for {
		if s1 < s2 {
			buffer[pos] = s1
			pos++
			k1++
			if k1 >= len(set1) {
				copy(buffer[pos:], set2[k2:])
				pos += len(set2) - k2
				break
			}
			s1 = set1[k1]
		} else if s1 == s2 {
			buffer[pos] = s1
			pos++
			k1++
			k2++
			if k1 >= len(set1) {
				copy(buffer[pos:], set2[k2:])
				pos += len(set2) - k2
				break
			}
			if k2 >= len(set2) {
				copy(buffer[pos:], set1[k1:])
				pos += len(set1) - k1
				break
			}
			s1 = set1[k1]
			s2 = set2[k2]
		} else { // if (set1[k1]>set2[k2])
			buffer[pos] = s2
			pos++
			k2++
			if k2 >= len(set2) {
				copy(buffer[pos:], set1[k1:])
				pos += len(set1) - k1
				break
			}
			s2 = set2[k2]
		}
	}
	return buffer[:pos]
}
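As a quick sanity check (our own example, not from the repo), the function can be exercised as shown below. The only contract it relies on is that both inputs are sorted and that the destination buffer has capacity for len(set1)+len(set2) elements:

package main

import "fmt"

func main() {
	set1 := []uint16{1, 3, 5, 7}
	set2 := []uint16{2, 3, 8}
	// Worst case: the union contains every element of both sets.
	buffer := make([]uint16, 0, len(set1)+len(set2))
	fmt.Println(union2by2_generic(set1, set2, buffer)) // [1 2 3 5 7 8]
}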

Optimization Strategy

While the existing assembly solution is quite streamlined, our optimization strategy aims to make the control flow more CPU-friendly. Specifically, we target two vectors for optimization:

  1. Minimizing branch mispredictions, and
  2. Minimizing the number of instructions

Minimizing branch mispredictions

Our operating hypothesis for these optimizations is that branch misprediction is the largest inefficiency in the existing implementation. CPUs typically pipeline their instructions to increase instructions-per-cycle throughput. When a conditional branch is encountered, the machine predicts which way it will go and keeps pipelining instructions along that path. If the prediction turns out to be wrong, all of those pipelined instructions must be flushed, wasting clock cycles. The original ASM implementation does not appear to guard against branch misprediction.

We make two attempts to reduce the cost of branching. First, we replace all branches with conditional moves. These instructions select a value from two possible source registers based on the CPU's condition flags (see the CSEL documentation). Hence, we can circumvent branching in favor of a single conditional instruction. Note that while we do minimize branching, these instructions are not free: since we wait for the resolution of two registers and a condition flag, we have 3 data dependencies. The implied tradeoff is that these 3 data dependencies have less expected latency than a branch misprediction. This is a fair assumption, though, since each of the three existing branches is roughly equally plausible (and, hence, the predictor is right on any given iteration only about a third of the time).
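As a pure-Go illustration of the conditional-move idea (our own sketch; the repo's versions are written in arm64 assembly), a comparison can be turned into a mask and used to select a value with no data-dependent branch:

// branchFreeMin16 selects the smaller of a and b without a branch, which is
// roughly what a CSEL-style conditional move achieves in hardware. Illustrative
// sketch only; the actual optimization lives in the Go ASM implementations.
func branchFreeMin16(a, b uint16) uint16 {
	diff := int32(a) - int32(b)
	mask := uint16(diff >> 31) // 0xFFFF when a < b, 0x0000 otherwise (arithmetic shift)
	return (a & mask) | (b &^ mask)
}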

A pseudo-Go implementation of the above changes is shown below. Note that the implementation only branches to break out of the loop.

func union2by2_cond_moves(set1, set2, buffer []uint16) []uint16 {
    pos := 0
    k1 := 0
    k2 := 0 

    s1 := 0
    s2 := 0
    buffer = buffer[:cap(buffer)]
    for {
        if k1 >= len(set1) {
            break
        }

        if k2 >= len(set2) {
            break
        }

        s1 = set1[k1]
        s2 = set2[k2]

        buffer[pos++] = s1 < s2 ? s1 : s2
        k1 = s1 <= s2 ? k1 + 1 : k1
        k2 = s1 >= s2 ? k2 + 1 : k2
    }

    if k1 < len(set1) {
        copy(buffer[pos:], set1[k1:])
        pos += len(set1) - k1
        return buffer[:pos]
    }

    if k2 < len(set2) {
        copy(buffer[pos:], set2[k2:])
        pos += len(set2) - k2
        return buffer[:pos]
    }

    return buffer[:pos]
}
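For readers who want something compilable, below is a minimal Go sketch of the same loop (our own names and structure, not the repo's ASM). Depending on the compiler and target, the per-element ifs may or may not be lowered to conditional selects; the actual optimization is implemented directly in arm64 assembly.

func union2by2_cond_moves_sketch(set1, set2, buffer []uint16) []uint16 {
	pos, k1, k2 := 0, 0, 0
	buffer = buffer[:cap(buffer)]
	for k1 < len(set1) && k2 < len(set2) {
		s1, s2 := set1[k1], set2[k2]
		// Write the smaller value (ties write the shared value once).
		v := s1
		if s1 > s2 {
			v = s2
		}
		buffer[pos] = v
		pos++
		// Advance each cursor based on the comparison; both advance on a tie.
		if s1 <= s2 {
			k1++
		}
		if s1 >= s2 {
			k2++
		}
	}
	// At most one of the two copies below is non-empty.
	pos += copy(buffer[pos:], set1[k1:])
	pos += copy(buffer[pos:], set2[k2:])
	return buffer[:pos]
}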

While implementing this solution, we were hamstrung by Go's contrived flavor of ASM. We were unable to use the native arm64 conditional-move instructions and had to resort to recreating some of the missing functionality. The associated cost was the addition of three (otherwise unnecessary) instructions.

Refer to the below results snippet for a peek at performance on randomly generated sets:

BenchmarkUnion2By2_Random/65535_random_elements_Original#03-10    	    4441	    907552 ns/op
BenchmarkUnion2By2_Random/65535_random_elements_Version_1#03-10   	   10000	    331182 ns/op

This amounts to roughly a 2.7x speedup (about 63% less time per operation).

Maximizing branch predictions

The dual approach to minimizing branch mispredictions is to maximize correct branch predictions. Rather than avoiding the branch predictor, this approach arranges the code so that the predictor succeeds far more often. Recall that the original code had three branches at the top level of the for-loop. The branch predictor would have to choose one of these three branches on every iteration of the loop, and the probability of choosing correctly was roughly 33%. However, we can see performance gains if we move the loop inside each conditional block, effectively replicating the loop once per branch (we refer to this as loop unrolling throughout). Refer to the below pseudo-Go:

func union2by2_loop_unrolling(set1, set2, buffer []uint16) []uint16 {
	// same setup logic ...
lt_loop:
    for {
        if s1 > s2 {
            goto gt_loop
        } 

        if s1 == s2 {
            goto eq_loop
        }

        buffer[pos] = s1
        pos++
        k1++

        if k1 >= len(set1) {
            // flush and return
        }
        s1 = set1[k1]
    }

eq_loop:
    for {
        if s1 < s2 {
            goto lt_loop
        } 

        if s1 > s2 {
            goto gt_loop
        }

        buffer[pos] = s1
        pos++
        k1++
        k2++
        if k1 >= len(set1) {
            // flush and return
        }
        if k2 >= len(set2) {
            // flush and return
        }
        s1 = set1[k1]
        s2 = set2[k2]
    }

gt_loop:
    for {
        if s1 < s2 {
            goto lt_loop
        } 

        if s1 == s2 {
            goto eq_loop
        }

        buffer[pos] = s2
        pos++
        k2++
        if k2 >= len(set2) {
            // flush and return
        }
        s2 = set2[k2]
    }

    return buffer[:pos]
}

Notice how we have three loops here. This structure is more amenable to the branch predictor, since the predictor will likely keep predicting that the current loop repeats (rather than jumping to another loop). When one of the conditions inside a loop is tripped, we jump to the associated loop and continue executing there. This structure offers a large performance boost on inputs with runs of consecutive elements drawn from the same set, such as:

set1={1, 2, 3, 7, 8, 9}; set2={4, 5, 6, 10, 11, 12}

Refer to the result snippet below for a peek at performance on sets with runs:

BenchmarkUnion2By2_FixedAlternating/65535_elements_fixed_alternating_on_1024_Original-10   	   29239	     41246 ns/op
BenchmarkUnion2By2_FixedAlternating/65535_elements_fixed_alternating_on_1024_Version_5-10  	   32748	     35839 ns/op
BenchmarkUnion2By2_FixedAlternating/65535_elements_fixed_alternating_on_1024_Version_7-10  	   39901	     29964 ns/op

BenchmarkUnion2By2_VariableAlternating/65535_elements_alternating_on_16_Original-10     	   18892	     62461 ns/op
BenchmarkUnion2By2_VariableAlternating/65535_elements_alternating_on_16_Version_5-10    	   27440	     42865 ns/op
BenchmarkUnion2By2_VariableAlternating/65535_elements_alternating_on_16_Version_7-10    	   26617	     43098 ns/op

Notice that we observe roughly 27% and 31% reductions in time per operation by using this flavor of loop unrolling for fixed and variable runs, respectively. For more, please refer to the results section.

Minimizing instructions

Some other minor upgrades were made to the algorithm to minimize the number of instructions inside the loop (since their cost scales with the number of iterations). For example, one upgrade pre-sorts the two sets so that registers R0/R3 are guaranteed to correspond to the set that will be exhausted first. This lets us drop the R1/R4 comparison and branch inside the loop, reducing the total instruction count by 2n, where n is the number of loop iterations. A sketch of this pre-ordering is shown below.
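As an illustration (a hypothetical Go helper, not the repo's ASM), the set whose largest element is smaller is the one that will be exhausted first, so the inputs can be re-ordered before the merge:

// orderByExhaustion is a hypothetical helper illustrating the pre-sort: it
// orders the inputs so that the first return value is the set that will be
// exhausted first (the one with the smaller maximum element). Inputs are
// assumed non-empty and sorted.
func orderByExhaustion(set1, set2 []uint16) (first, second []uint16) {
	if set1[len(set1)-1] <= set2[len(set2)-1] {
		return set1, set2
	}
	return set2, set1
}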

Benchmarks

We have 7 implementations/variations of union2by2 for ASM. These implementations are described below:

  1. Remove all conditional branches (except for loop termination checks) in favor of conditional moves,
  2. Build off of implementation (1), but add pre-sorting of sets,
  3. Add some branches to implementation (1) in order to reduce the total number of instructions (i.e. explore instruction/branch prediction tradeoff),
  4. Implementation (2) with a pre-sort on the sets,
  5. The original implementation, but with loop unrolling,
  6. The original implementation, but with pre-sorting the sets, and
  7. The original implementation with both loop unrolling and pre-sorting

We also have 3 different types of benchmarks, as listed below:

  1. Fixed-size, alternating runs (i.e., step=2, set1={0,1,4,5}, set2={2,3,6,7}; a generator sketch follows this list),
  2. Variable-size, alternating runs (i.e., max_step=4, set1={0,2,3,6,7,8,9,14,16}, set2={1,4,5,10,11,12,13,15}), and
  3. Random sets
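To make the fixed-run shape concrete, here is a hypothetical generator for inputs like those in (1) above (our own sketch, not the repo's benchmark code):

// makeAlternatingRuns deals the values 0..n-1 into set1 and set2 in
// alternating blocks of `step` consecutive elements, producing inputs like
// step=2 -> set1={0,1,4,5,...}, set2={2,3,6,7,...}. Hypothetical sketch only.
func makeAlternatingRuns(n, step int) (set1, set2 []uint16) {
	for i := 0; i < n; i++ {
		if (i/step)%2 == 0 {
			set1 = append(set1, uint16(i))
		} else {
			set2 = append(set2, uint16(i))
		}
	}
	return set1, set2
}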

Results

All results are in the results/ folder. These benchmarks were run on a 2021 MacBook Pro (M1 Pro).

Fixed-size, alternating runs

  1. Implementations (5) and (7) show the benefit of loop unrolling as the number of elements (and, hence, the number of fixed runs) increases.
  2. Conversely, we see loop unrolling struggle mightily when the step size is 1 (i.e., no runs, which is the worst case for loop unrolling).
  3. We see that the conditional moves are 2-3x slower than the original implementation. This makes sense for two reasons: first, since we have runs, branch prediction guesses correctly for a large proportion of the branches; second, the additional instructions are evidently costly here. Hence, the conditional moves are not pulling their weight in these tests.

Variable-size, alternating runs

  1. Results are fairly consistent with above, but we do see loop unrolling struggle a little more here.
  2. Loop unrolling seems to reach parity at maxSteps=4, and then overtake the original implementation afterwards.
  3. The conditional moves are still bad.

Random sets

  1. Conditional moves shine here: Versions 1-3 pull far ahead in these benchmarks. This makes sense, as random data is not friendly to branch prediction.
  2. Notice that at low cardinalities (< 10k elements) version 3 stays stable at 8k ns/op, which is 4x faster than the original.
  3. At max cardinality, though, versions 1 and 2 seem to be faster than version 3. These implementations are both about 2x faster than the original.
  4. While loop unrolling still seems to be effective (read: faster than the baseline), it pales in comparison to the performance gains of the first 3 implementations.

Recap

These results indicate that each of our implementations does something better than the baseline implementation; however, each can also perform worse than the baseline under the wrong workload. Since we mostly care about the performance of this function at larger loads, this can inform which implementation to use under which conditions. At sufficiently high loads, loop unrolling is generally faster than the baseline implementation. Under random load (which is likely the best indicator of real-world performance), however, we see incredible performance gains from implementations 1-3.
