Skip to content

Commit

Permalink
[Milestone-3] Serokell: New OrderedSet.mo & OrderedMap fixup (#662)
Browse files Browse the repository at this point in the history
This is an MR for the 3rd Milestone of the Serokell's grant about
improving Motoko's base library.

The main goal of the PR is to introduce a new functional implementation
of the set data structure to the' base' library. Also, it brings a few
changes to the new functional map that was added in #664 , #654 .

# General changes:

* rename `PersistentOrderedMap` to `OrderedMap` (same for the
`OrderedSet`)
* improve docs

# Functional Map changes:

## New functionality:
+ add `any`/`all` functions
+ add `contains` function
+ add `minEntry`/`maxEntry`

## Optimizations:
+ Store `size` in the Map, [benchmark
results](serokell#35)

## Fixup: 
+ add `entriesRev()`, remove `iter()`

# NEW functional Set:

The new data structure implements an ordered set interface using
Red-Black trees as well as the new functional map from the 1-2
Milestones.

## API implemented:
* Basic operations (based on the map): `put`, `delete`, `contains`,
`fromIter`, etc
* Maps and folds: `map`, `mapFilter`, `foldLeft`, `foldRight`
* Set operations: `union` , `intersect`, `diff`, `isSubset`, `equal`
* Additional operations (as for the `OrderedMap`): `min`/`max`,
`all`/`some`

## Maintainance support:
* Unit, property tests
* Documentation

## Applied optimizations:

* Same optimizations that were useful for the functional map:
   * inline node color
   * float-out exceeded matching in iteration
   * `map`/`filterMap` through `foldLeft`
   * direct recursion in `foldLeft`
* [Benchmark results for all four optimizations
together](serokell#27)
* store size in the root of the tree, [benchmark
results](serokell#36 (comment))
* Pattern matching order optimization, [benchmark
results](serokell#36 (comment))
 * Other optimizations:
* Inline code of `OrderedMap` instead of sharing it, [benchmark
results](serokell#25)
* `intersect` optimization: use order of output values to build the
resulting tree faster, see
serokell#39
* `isSubset`, `equal` optimization: use early exit and use order of
subtrees to reduce intermediate tree height, see
serokell#37

## Rejected optimizations:

* Nipkow's implementation of set operation [Tobias Nipkow's "Functional
Data Structures and Algorithms", 117].
Initially, we were planning to use an implementation of set operations
(`intersect`, `union`, `diff`) from Nipkow's book. However, the
experiment shows that naive implementation with a simple size heuristic
performs better. [The benchmark
results](serokell#33) are comparing
3 versions:
* persistentset_baseline -- original implementation that uses Nipkow's
algorithms. However, the black height is calculated before each set
operation (the book assumes it's stored).
* persistentset_bh -- the same as the baseline but the black height is
stored in each node.
* persistentset -- naive implementation that looks up in a smaller set
and modifies a bigger one (it gives us `O(min(n,m)log((max(n,m))` which
is very close to Nipkow's version). Sizes of sets are also stored but
only in the root.
The last one outperforms others and keeps a tree slim in terms of byte
size. Thus, we have picked it.

## Final benchmark results:

### Collection benchmarks

| |binary_size|generate|max mem|batch_get 50|batch_put 50|batch_remove
50|upgrade|
|--:|--:|--:|--:|--:|--:|--:|--:|
|orderedset+100|218_168|186_441|37_916|53_044|121_237|127_460|346_108|
|trieset+100|211_245|574_022|47_652|131_218|288_429|268_499|729_696|

|orderedset+1000|218_168|2_561_296|520_364|69_883|158_349|170_418|3_186_579|

|trieset+1000|211_245|7_374_045|633_440|162_806|383_594|375_264|9_178_466|

|orderedset+10000|218_168|40_015_301|320_532|84_660|192_931|215_592|31_522_120|

|trieset+10000|211_245|105_695_670|682_792|192_931|457_923|462_594|129_453_045|

|orderedset+100000|218_168|476_278_087|3_200_532|98_553|230_123|258_372|409_032_232|

|trieset+100000|211_245|1_234_038_235|6_826_516|222_247|560_440|549_813|1_525_692_388|

|orderedset+1000000|218_168|5_514_198_432|32_000_532|115_836|268_236|306_896|4_090_302_778|

|trieset+1000000|211_245|13_990_048_548|68_228_312|252_211|650_405|642_099|17_455_845_492|

### set API

| |size|intersect|union|diff|equals|isSubset|
|--:|--:|--:|--:|--:|--:|--:|
|orderedset+100|100|146_264|157_544|215_871|28_117|27_726|
|trieset+100|100|352_496|411_306|350_935|201_896|201_456|
|orderedset+1000|1000|162_428|194_198|286_747|242_329|241_938|
|trieset+1000|1000|731_650|1_079_906|912_629|2_589_090|4_023_673|
|orderedset+10000|10000|177_080|231_070|345_529|2_383_587|2_383_591|

|trieset+10000|10000|3_986_854|21_412_306|5_984_106|46_174_710|31_885_381|
|orderedset+100000|100000|190_727|267_008|402_081|91_300_348|91_300_393|

|trieset+100000|100000|178_863_894|209_889_623|199_028_396|521_399_350|521_399_346|

|orderedset+1000000|1000000|205_022|304_937|464_859|912_901_595|912_901_558|

|trieset+1000000|1000000|1_782_977_198|2_092_850_787|1_984_818_266|5_813_335_155|5_813_335_151|

### new set API

| |size|foldLeft|foldRight|mapfilter|map|
|--:|--:|--:|--:|--:|--:|
|orderedset|100|16_487|16_463|88_028|224_597|
|orderedset|1000|133_685|131_953|1_526_510|4_035_782|
|orderedset|10000|1_305_120|1_287_495|28_455_361|51_527_733|
|orderedset|100000|13_041_665|12_849_418|344_132_505|630_692_463|
|orderedset|1000000|130_428_573|803_454_777|4_019_592_041|7_453_944_902|

---------

Co-authored-by: Andrei Borzenkov <[email protected]>
Co-authored-by: Andrei Borzenkov <[email protected]>
Co-authored-by: Sergey Gulin <[email protected]>
Co-authored-by: Claudio Russo <[email protected]>
  • Loading branch information
5 people authored Nov 13, 2024
1 parent 7b5d31e commit 1961fab
Show file tree
Hide file tree
Showing 6 changed files with 4,236 additions and 0 deletions.
Loading

0 comments on commit 1961fab

Please sign in to comment.