This is a specialized family of algorithms that requires the list s to be partitioned (see also: set partition) by an unknown, arbitrary predicate, such that the following conditions hold:
- The predicate function must not discriminate between identical values; that is, it must be deterministic (not necessarily pure).
- All occurrences of each value must be grouped into runs of consecutive copies, like this:
  - Good example: [0,0,1,1] (2 distinct runs)
  - Bad example: [0,1,1,0] (3 runs; 0 is duplicated across 2 separate runs)
There are many ways to reword this:
- Each run must "consolidate" its representative value
- There cannot be more than 1 partition with the same mode
- There's a bijection between modes and partitions.
Note
A single partition with N modes (or N arbitrary-length runs) is "the same" as N mono-modal partitions. So by asserting injectivity, we get bijectivity for free!
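To make the condition concrete, here is a minimal Python sketch (the helper name is_partitioned is mine, purely for illustration) of the check it implies: no value may start a new run after its run has already ended.

```python
def is_partitioned(s) -> bool:
    """Return True if every value in s occurs as exactly one contiguous run.

    This only verifies the precondition; the search algorithms themselves
    never need to scan the whole list like this.
    """
    closed = set()          # values whose run has already ended
    previous = object()     # sentinel that compares unequal to everything
    for value in s:
        if value != previous:       # a new run starts here...
            if value in closed:     # ...but we already closed a run of it
                return False
            closed.add(previous)
            previous = value
    return True

assert is_partitioned([0, 0, 1, 1])         # good example: 2 distinct runs
assert not is_partitioned([0, 1, 1, 0])     # bad example: 0 spans 2 separate runs
```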
Unlike most bisection methods, this algorithm doesn't need to know the predicate, as long as equality is total. IOW, it is predicate-agnostic.
Note
There is an alternative formulation that uses approximate-equality, but since ≈ isn't transitive, the list must be sorted (or at least, partitioned by approximate comparison, which groups "similar" partitions).
The set of all sorted lists is a strict subset of the set of all partitioned lists, assuming the comparison-function is standard numeric (scalar, not vectorial) comparison.
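Continuing the illustrative is_partitioned sketch from above, the inclusion is easy to see on a concrete pair of lists:

```python
# Sorted implies partitioned, but not the other way around.
assert is_partitioned([0, 0, 1, 2, 2])   # sorted, therefore partitioned
assert is_partitioned([2, 2, 0, 0, 1])   # partitioned, yet not sorted
```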
These algorithms exploit the following lemmas (theorems?):
- Multiplication is faster than repeated addition
- Bin exponentiation is faster than repeated multiplication
- Iterating over all elements of a list is O(n), but finding a value in a sorted list can be as fast as O(log(n)) (bin-search)
See reference implementations here. Those are designed to "complement each other", because I want them to be examples of various use-cases.
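For a taste of the core idea, here is my own minimal Python sketch (not one of the reference implementations; it also assumes a fully sorted list so that the standard bisect module can stand in for a predicate-agnostic boundary search). It counts the distinct runs by jumping straight past each one:

```python
from bisect import bisect_right

def count_runs_sorted(s) -> int:
    """Count distinct runs in a *sorted* list by jumping over each run.

    Each iteration handles one whole partition, so the loop runs
    part_count times, and each jump costs at most O(lb(n)).
    """
    runs = 0
    i = 0
    while i < len(s):
        # Jump to the first index past the current run,
        # instead of stepping through every copy.
        i = bisect_right(s, s[i], lo=i)
        runs += 1
    return runs

print(count_runs_sorted([0, 0, 1, 1, 1, 1, 1]))  # -> 2
```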
I hope that this proof-of-concept can help optimize many other programs that deal with grouped values of any type, such as a specialized compressor.
Most discussion about partition-jumping was on this disbloat guild, in the #cs-theory channel. I don't like disbloat, but it was my only choice at the time.
TLDR: average time is O(part_count * lb(n)), but it's more nuanced than that. Best-case is O(1). Worst-case is O(n * lb(n)) (bin-search), or O(n) (exp-search).

The runtime of these algorithms is dominated by the partition count. So for the overhead to be worthwhile, the number of unique values should be low (relative to the length of s).
It's worth noting that the lb(n) factor above (lb = bin logarithm) is somewhat misleading. The 1st bisection is O(lb(n)), but the next is O(lb(n - part_len_0)), then O(lb(n - part_len_0 - part_len_1)), and so on, until it becomes O(lb(part_len_last)) (if we ignore the target == s[-1] check, which simplifies the last bisection into O(1)). That's only true for bin-search; for exp-search it's O(lb(part_len_i)) (worst-case), or O(1) (best-case) with the target == s[-1] check.
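For reference, this is roughly what an exp-search boundary jump can look like (my own sketch; the reference implementations may differ). Galloping forward before bisecting keeps the probes near the current run, which is where the O(lb(part_len_i)) per-partition cost comes from:

```python
def run_end_exponential(s, lo: int) -> int:
    """Return the index one past the run of s[lo], via galloping + bisection.

    Costs O(lb(run_length)) probes, independent of len(s),
    because the gallop never strays far beyond the run's boundary.
    """
    value = s[lo]
    step = 1
    # Gallop: double the step until we overshoot the run (or hit the end).
    while lo + step < len(s) and s[lo + step] == value:
        step *= 2
    # The boundary now lies in (lo + step//2, lo + step]; bisect that window.
    left, right = lo + step // 2 + 1, min(lo + step, len(s))
    while left < right:
        mid = (left + right) // 2
        if s[mid] == value:
            left = mid + 1
        else:
            right = mid
    return left

s = [0, 0, 1, 1, 1, 1, 1]
print(run_end_exponential(s, 0))  # -> 2 (the run of 0 ends at index 2)
print(run_end_exponential(s, 2))  # -> 7 (the run of 1 ends at the list's end)
```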
As for the space complexity, it's O(1) for fixed-precision numbers. The basic implementation doesn't allocate auxiliary memory, but I'm working on one that does (see below).
The impl with aux-mem will use a data-structure to track the known "sub-partitions" that it encounters while bisecting (I call those "witnesses" or "bystanders"). For example:
s := [0,0,1,1,1,1,1]
target := 0, and a probe overshoots to index 3 (which holds 1). After finding the partition-point ("boundary", as I call them) at index 2, we can remember that there are at least 2 instances of 1, even before we set it as our target, just because we happen to visit it while searching for 0.
If you only want to track one bystander at a time, your aux-mem will be O(1). But my plan is to remember all bystanders, so I need something like a hash-map. I'm not sure if it's worth it, as maps have overhead.
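As a rough illustration of what that could look like (speculative on my part; the names bystanders and record_bystander are mine, and the planned implementation may differ), a plain hash-map keyed by value is enough:

```python
# Speculative sketch of bystander tracking, not the planned implementation.
# bystanders maps a value to the minimum number of copies proven to exist,
# based solely on indices visited while searching for *other* targets.
bystanders: dict[int, int] = {}

def record_bystander(value, proven_count: int) -> None:
    """Remember that at least proven_count copies of value exist."""
    if proven_count > bystanders.get(value, 0):
        bystanders[value] = proven_count

# While searching for target 0 in s = [0,0,1,1,1,1,1]:
# a probe at index 3 sees 1, and the boundary is later found at index 2,
# so indices 2..3 both hold 1 -> at least 2 copies of 1 are proven.
record_bystander(1, 2)
print(bystanders)  # {1: 2}
```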