[DMS-75] Specialize PersistentOrderedSet #25

Sereja313 · 2024-10-15T17:38:06Z

Profiling branch: https://github.com/serokell/canister-profiling/tree/sereja/set-profiling
Baseline: https://github.com/serokell/motoko-base/tree/milestone-3 (6f4f2d5)

Collection benchmarks

	binary_size	generate	max mem	batch_get 50	batch_put 50	batch_remove 50	upgrade
trieset+100	211_008	574_022	47_652	131_218	288_429	268_499	352_101
persistentset_baseline+100	202_189	219_058	44_336	52_801	141_117	137_603	227_632
persistentset+100	196_855	196_106	37_784	49_615	128_635	130_668	210_758
trieset+1000	211_008	7_374_045	633_440	162_806	383_594	375_264	731_650
persistentset_baseline+1000	202_189	3_056_930	597_144	70_272	186_123	185_682	325_848
persistentset+1000	196_855	2_804_787	532_992	66_718	172_961	177_698	302_292
trieset+10000	211_008	105_695_670	682_792	192_931	457_923	462_594	3_986_772
persistentset_baseline+10000	202_189	50_563_749	520_660	86_221	229_300	236_072	404_402
persistentset+10000	196_855	44_279_117	360_508	82_339	214_958	227_419	374_743
trieset+100000	211_008	1_234_038_235	6_826_516	222_247	560_440	549_813	178_864_274
persistentset_baseline+100000	202_189	599_355_743	5_200_660	100_768	275_603	282_971	522_728
persistentset+100000	196_855	534_152_351	3_600_508	96_540	260_077	273_659	486_063
trieset+1000000	211_008	13_990_048_548	68_228_312	252_211	650_405	642_099	1_782_978_890
persistentset_baseline+1000000	202_189	6_917_106_711	52_000_696	119_173	322_324	337_316	595_061
persistentset+1000000	196_855	6_241_486_671	36_000_508	114_567	305_638	327_353	551_652

set API

	size	intersect	union	diff	equals	isSubset
trieset+100	100	352_496	411_306	350_935	201_896	201_456
persistentset_baseline+100	100	172_285	184_195	232_360	174_780	174_777
persistentset+100	100	157_677	170_099	219_483	156_755	157_147
trieset+1000	1000	731_650	1_079_906	912_629	2_589_090	4_023_673
persistentset_baseline+1000	1000	360_364	540_644	964_480	2_037_592	2_037_589
persistentset+1000	1000	339_879	516_679	946_277	1_851_541	1_851_538
trieset+10000	10000	3_986_854	21_412_306	5_984_106	46_174_710	31_885_381
persistentset_baseline+10000	10000	477_164	1_091_340	2_178_119	34_034_115	23_452_085
persistentset+10000	10000	450_437	1_056_800	2_154_884	28_585_151	28_585_107
trieset+100000	100000	178_863_894	209_889_623	199_028_396	521_399_350	521_399_346
persistentset_baseline+100000	100000	622_884	1_969_023	3_798_832	372_185_796	372_185_998
persistentset+100000	100000	589_221	1_922_651	3_769_853	317_013_739	317_013_654
trieset+1000000	1000000	1_782_977_198	2_092_850_787	1_984_818_266	5_813_335_155	5_813_335_151
persistentset_baseline+1000000	1000000	716_211	3_090_602	6_099_423	4_032_065_747	4_032_064_965
persistentset+1000000	1000000	675_690	3_032_485	6_064_414	3_473_642_146	3_473_641_487

new set API

	size	foldLeft	foldRight	mapfilter	map
persistentset_baseline	100	91_182	92_853	178_648	350_288
persistentset	100	85_458	86_734	158_197	315_166
persistentset_baseline	1000	878_985	890_689	3_644_203	4_853_625
persistentset	1000	821_920	833_624	3_050_847	4_468_816
persistentset_baseline	10000	19_334_740	19_447_074	44_145_477	72_915_858
persistentset	10000	15_241_830	15_354_246	38_010_778	65_203_581
persistentset_baseline	100000	193_217_301	194_325_505	514_083_407	870_274_303
persistentset	100000	152_314_883	153_422_513	450_944_413	789_670_737
persistentset_baseline	1000000	1_932_086_919	1_943_111_009	5_846_426_752	10_106_787_639
persistentset	1000000	1_523_084_378	1_534_107_566	5_197_944_386	9_265_810_970

Currently, a persistent ordered set is implemented by storing `()` as the value in `PersistentOrderedMap`. This PR manually specializes the set type instead.
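To make the difference concrete, here is a minimal Python sketch (not the Motoko code from this PR). It assumes a plain unbalanced binary search tree instead of the Red-Black tree used by the library, and the names `MapNode`, `SetNode`, `wrapped_set_put` and `set_put` are hypothetical. The wrapped variant carries a unit value in every node, while the specialized variant has no value slot at all.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class MapNode:
    """Generic map node: every entry has a value slot, even if it is unused."""
    key: int
    value: object
    left: Optional["MapNode"] = None
    right: Optional["MapNode"] = None


def map_put(node, key, value):
    """Persistent insert into an unbalanced BST map; returns the new root."""
    if node is None:
        return MapNode(key, value)
    if key < node.key:
        return MapNode(node.key, node.value, map_put(node.left, key, value), node.right)
    if key > node.key:
        return MapNode(node.key, node.value, node.left, map_put(node.right, key, value))
    return MapNode(key, value, node.left, node.right)


def wrapped_set_put(root, key):
    """Set built on top of the map: stores the unit value () for each element."""
    return map_put(root, key, ())


@dataclass(frozen=True)
class SetNode:
    """Specialized set node: no value field to allocate or copy."""
    key: int
    left: Optional["SetNode"] = None
    right: Optional["SetNode"] = None


def set_put(node, key):
    """Persistent insert into the specialized set; returns the new root."""
    if node is None:
        return SetNode(key)
    if key < node.key:
        return SetNode(node.key, set_put(node.left, key), node.right)
    if key > node.key:
        return SetNode(node.key, node.left, set_put(node.right, key))
    return node  # element already present: reuse the existing subtree


def contains(node, key):
    """Membership test; works for both node types since it only reads keys."""
    while node is not None:
        if key == node.key:
            return True
        node = node.left if key < node.key else node.right
    return False
```

Both variants answer membership queries identically; the specialized one simply does less work per node, which is the kind of saving the benchmarks above measure.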

GoPavel · 2024-10-17T10:47:07Z

Looks like a significant speed up: 7-10%
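As a quick check of that range, the relative improvement can be computed directly from the tables above. The sketch below is only an illustration; the helper name `improvement` is hypothetical, and the numbers are copied from the `persistentset_baseline+100` and `persistentset+100` rows of the "Collection benchmarks" table.

```python
def improvement(baseline: int, specialized: int) -> float:
    """Relative reduction of the specialized cost versus the baseline."""
    return (baseline - specialized) / baseline


# (baseline, specialized) pairs for sets of 100 elements.
rows_100 = {
    "generate": (219_058, 196_106),
    "batch_put 50": (141_117, 128_635),
    "upgrade": (227_632, 210_758),
}

for name, (base, spec) in rows_100.items():
    print(f"{name}: {improvement(base, spec):.1%}")
```

For these three columns the reduction comes out at roughly 10.5%, 8.8% and 7.4% respectively.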

GoPavel · 2024-10-31T15:33:53Z

Merged into #36

This is an MR for the 3rd Milestone of Serokell's grant for improving Motoko's base library.

The main goal of the PR is to introduce a new functional implementation of the set data structure to the `base` library. It also brings a few changes to the new functional map that was added in #664 and #654.

# General changes:

* rename `PersistentOrderedMap` to `OrderedMap` (same for the `OrderedSet`)
* improve docs

# Functional Map changes:

## New functionality:

+ add `any`/`all` functions
+ add `contains` function
+ add `minEntry`/`maxEntry`

## Optimizations:

+ Store `size` in the Map, [benchmark results](serokell#35)

## Fixup:

+ add `entriesRev()`, remove `iter()`

# NEW functional Set:

The new data structure implements an ordered set interface using Red-Black trees, as does the new functional map from Milestones 1-2.

## API implemented:

* Basic operations (based on the map): `put`, `delete`, `contains`, `fromIter`, etc.
* Maps and folds: `map`, `mapFilter`, `foldLeft`, `foldRight`
* Set operations: `union`, `intersect`, `diff`, `isSubset`, `equal`
* Additional operations (as for the `OrderedMap`): `min`/`max`, `all`/`some`

## Maintenance support:

* Unit, property tests
* Documentation

## Applied optimizations:

* Same optimizations that were useful for the functional map:
  * inline node color
  * float-out exceeded matching in iteration
  * `map`/`filterMap` through `foldLeft`
  * direct recursion in `foldLeft`
  * [Benchmark results for all four optimizations together](serokell#27)
* store size in the root of the tree, [benchmark results](serokell#36 (comment))
* Pattern matching order optimization, [benchmark results](serokell#36 (comment))
* Other optimizations:
  * Inline code of `OrderedMap` instead of sharing it, [benchmark results](serokell#25)
  * `intersect` optimization: use order of output values to build the resulting tree faster, see serokell#39
  * `isSubset`, `equal` optimization: use early exit and use order of subtrees to reduce intermediate tree height, see serokell#37

## Rejected optimizations:

* Nipkow's implementation of set operations [Tobias Nipkow's "Functional Data Structures and Algorithms", 117].

  Initially, we were planning to use an implementation of set operations (`intersect`, `union`, `diff`) from Nipkow's book. However, the experiment shows that a naive implementation with a simple size heuristic performs better. [The benchmark results](serokell#33) compare 3 versions:
  * persistentset_baseline: the original implementation that uses Nipkow's algorithms. However, the black height is calculated before each set operation (the book assumes it is stored).
  * persistentset_bh: the same as the baseline, but the black height is stored in each node.
  * persistentset: a naive implementation that looks up in the smaller set and modifies the bigger one (it gives us `O(min(n,m) log(max(n,m)))`, which is very close to Nipkow's version). Sizes of sets are also stored, but only in the root.

  The last one outperforms the others and keeps the tree slim in terms of byte size. Thus, we have picked it.
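The size heuristic behind the naive implementation described above can be sketched as follows. This is a hedged Python illustration, not the Motoko code: the names `Node`, `OrdSet`, `put`, `union` and `intersect` are hypothetical, and the tree is a plain unbalanced BST, so the logarithmic bound only holds once a balanced tree such as a Red-Black tree is substituted. The sketch shows only the choice of which set to iterate and which to modify, with the size stored in the root wrapper.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator, Optional


@dataclass(frozen=True)
class Node:
    key: int
    left: Optional["Node"] = None
    right: Optional["Node"] = None


@dataclass(frozen=True)
class OrdSet:
    """Hypothetical root wrapper: the size is stored here, not in every node."""
    size: int = 0
    root: Optional[Node] = None


def contains(s: OrdSet, key: int) -> bool:
    node = s.root
    while node is not None:
        if key == node.key:
            return True
        node = node.left if key < node.key else node.right
    return False


def _insert(node: Optional[Node], key: int) -> Node:
    """Persistent insert; the caller guarantees that key is not present."""
    if node is None:
        return Node(key)
    if key < node.key:
        return Node(node.key, _insert(node.left, key), node.right)
    return Node(node.key, node.left, _insert(node.right, key))


def put(s: OrdSet, key: int) -> OrdSet:
    if contains(s, key):
        return s  # nothing changes, so the old set is shared as-is
    return OrdSet(s.size + 1, _insert(s.root, key))


def keys(s: OrdSet) -> Iterator[int]:
    """In-order traversal: yields the elements in ascending order."""
    stack, node = [], s.root
    while stack or node is not None:
        while node is not None:
            stack.append(node)
            node = node.left
        node = stack.pop()
        yield node.key
        node = node.right


def from_iter(xs: Iterable[int]) -> OrdSet:
    s = OrdSet()
    for x in xs:
        s = put(s, x)
    return s


def union(a: OrdSet, b: OrdSet) -> OrdSet:
    # Size heuristic: walk the smaller set and insert into the bigger one.
    small, big = (a, b) if a.size <= b.size else (b, a)
    for k in keys(small):
        big = put(big, k)
    return big


def intersect(a: OrdSet, b: OrdSet) -> OrdSet:
    # Walk the smaller set and keep only the elements found in the bigger one.
    small, big = (a, b) if a.size <= b.size else (b, a)
    return from_iter(k for k in keys(small) if contains(big, k))
```

Because the size is available at the root in constant time, picking the smaller operand costs nothing extra, and the number of tree operations is bounded by the size of the smaller set.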
## Final benchmark results:

### Collection benchmarks

| |binary_size|generate|max mem|batch_get 50|batch_put 50|batch_remove 50|upgrade|
|--:|--:|--:|--:|--:|--:|--:|--:|
|orderedset+100|218_168|186_441|37_916|53_044|121_237|127_460|346_108|
|trieset+100|211_245|574_022|47_652|131_218|288_429|268_499|729_696|
|orderedset+1000|218_168|2_561_296|520_364|69_883|158_349|170_418|3_186_579|
|trieset+1000|211_245|7_374_045|633_440|162_806|383_594|375_264|9_178_466|
|orderedset+10000|218_168|40_015_301|320_532|84_660|192_931|215_592|31_522_120|
|trieset+10000|211_245|105_695_670|682_792|192_931|457_923|462_594|129_453_045|
|orderedset+100000|218_168|476_278_087|3_200_532|98_553|230_123|258_372|409_032_232|
|trieset+100000|211_245|1_234_038_235|6_826_516|222_247|560_440|549_813|1_525_692_388|
|orderedset+1000000|218_168|5_514_198_432|32_000_532|115_836|268_236|306_896|4_090_302_778|
|trieset+1000000|211_245|13_990_048_548|68_228_312|252_211|650_405|642_099|17_455_845_492|

### set API

| |size|intersect|union|diff|equals|isSubset|
|--:|--:|--:|--:|--:|--:|--:|
|orderedset+100|100|146_264|157_544|215_871|28_117|27_726|
|trieset+100|100|352_496|411_306|350_935|201_896|201_456|
|orderedset+1000|1000|162_428|194_198|286_747|242_329|241_938|
|trieset+1000|1000|731_650|1_079_906|912_629|2_589_090|4_023_673|
|orderedset+10000|10000|177_080|231_070|345_529|2_383_587|2_383_591|
|trieset+10000|10000|3_986_854|21_412_306|5_984_106|46_174_710|31_885_381|
|orderedset+100000|100000|190_727|267_008|402_081|91_300_348|91_300_393|
|trieset+100000|100000|178_863_894|209_889_623|199_028_396|521_399_350|521_399_346|
|orderedset+1000000|1000000|205_022|304_937|464_859|912_901_595|912_901_558|
|trieset+1000000|1000000|1_782_977_198|2_092_850_787|1_984_818_266|5_813_335_155|5_813_335_151|

### new set API

| |size|foldLeft|foldRight|mapfilter|map|
|--:|--:|--:|--:|--:|--:|
|orderedset|100|16_487|16_463|88_028|224_597|
|orderedset|1000|133_685|131_953|1_526_510|4_035_782|
|orderedset|10000|1_305_120|1_287_495|28_455_361|51_527_733|
|orderedset|100000|13_041_665|12_849_418|344_132_505|630_692_463|
|orderedset|1000000|130_428_573|803_454_777|4_019_592_041|7_453_944_902|

---------

Co-authored-by: Andrei Borzenkov <[email protected]>
Co-authored-by: Andrei Borzenkov <[email protected]>
Co-authored-by: Sergey Gulin <[email protected]>
Co-authored-by: Claudio Russo <[email protected]>

Commit 9ebd08f: [DMS-75] Specialize PersistentOrderedSet

GoPavel added the experiment and experiment:to-merge (successful optimization that is intended to be moved to another PR) labels and removed the experiment label on Oct 21, 2024

GoPavel closed this Oct 31, 2024

GoPavel mentioned this pull request Nov 5, 2024

[Milestone-3] Serokell: New OrderedSet.mo & OrderedMap fixup dfinity/motoko-base#662

Closed
