
Conversation

@a10y (Contributor) commented Aug 5, 2025

I was glancing through FSST+; the meat is the cleaving algorithm in https://github.com/cwida/fsst_plus/blob/main/src/cleaving/cleaving.h#L75-L159, which determines the optimal points at which to split the sorted blocks by maximum shared prefix.

This is far from complete. Also, the test case is just for development ease; in actuality it seems that the cleaving happens after FSST encoding, not before, so you split the encoded codes by prefix rather than the source strings. That probably makes sense, since the LCP is faster to compute over the shorter compressed strings than over the full strings.
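
To make the shape of that concrete, here's a minimal sketch (not the FSST+ code; the `lcp` helper and the examples are made up for illustration):

```rust
/// Byte-wise longest common prefix of two (encoded) strings. Over the
/// FSST codes this scans fewer bytes than over the raw strings, since
/// the encoded form is shorter.
fn lcp(a: &[u8], b: &[u8]) -> usize {
    a.iter().zip(b).take_while(|(x, y)| x == y).count()
}

fn main() {
    // Adjacent entries in a sorted block tend to share a long prefix.
    assert_eq!(lcp(b"https://example.com/a", b"https://example.com/b"), 20);
    // A short shared prefix is a candidate cleave point.
    assert_eq!(lcp(b"https://example.com/a", b"ipfs://abc"), 0);
}
```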

An alternative to this is the much simpler splitting in FSSTView from LiquidCache: https://github.com/XiangpengHao/liquid-cache/blob/main/dev/design/00-fsst-view.md

Signed-off-by: Andrew Duffy <[email protected]>
@a10y force-pushed the aduffy/fsstplus branch from 2adb639 to 63a8367 on August 5, 2025 at 22:23
@robert3005 (Contributor)

Does fsst guarantee that no code is a prefix of another? I was always unsure about that part, and computing the LCP would be hard if that weren't the case.

@robert3005 (Contributor)

Should we maybe put this in the fsst repo?

@coveralls

Coverage Status: coverage 84.054% (+0.03%) from 84.027% when pulling 63a8367 on aduffy/fsstplus into b2a318b on develop.

@a10y (Contributor, Author) commented Aug 5, 2025

The algorithm isn't really FSST-dependent AFAICT; it treats the input as opaque byte strings. The surrounding code seems to run this step after FSST-encoding all of the values.

So I actually think this makes more sense in Vortex than in the core fsst crate.
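
Roughly the ordering I mean, as a sketch with a stand-in encoder (not the actual fsst crate API; the 4-byte threshold is arbitrary):

```rust
/// Byte-wise longest common prefix.
fn lcp(a: &[u8], b: &[u8]) -> usize {
    a.iter().zip(b).take_while(|(x, y)| x == y).count()
}

/// Sketch of the pipeline ordering: encode first, then cleave the
/// sorted, encoded values into prefix-sharing groups. `encode` is a
/// stand-in for the FSST encoder.
fn encode_then_cleave(
    values: &[Vec<u8>],
    encode: impl Fn(&[u8]) -> Vec<u8>,
) -> Vec<Vec<Vec<u8>>> {
    let mut encoded: Vec<Vec<u8>> = values.iter().map(|v| encode(v)).collect();
    encoded.sort();

    let mut groups: Vec<Vec<Vec<u8>>> = Vec::new();
    for s in encoded {
        // Extend the current group only while the shared prefix with its
        // last member stays above the threshold; otherwise cleave here.
        let extends = groups
            .last()
            .map_or(false, |g| lcp(g.last().unwrap(), &s) >= 4);
        if extends {
            groups.last_mut().unwrap().push(s);
        } else {
            groups.push(vec![s]);
        }
    }
    groups
}
```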

}

/// Maximum shared prefix length.
pub const MAX_PREFIX: usize = 128;
@yan-alex commented Aug 15, 2025

The maximum prefix size should be 255 here :)
128 is the batch size, i.e. how many rows at a time the chunk_by_similarity() function would process.

But nice start, you guys are fast!
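
For what it's worth, the fix might just be splitting that into two constants, something like this (a sketch; `BATCH_SIZE` is a name I made up):

```rust
/// Maximum shared prefix length; 255 presumably so a prefix length
/// fits in a single u8.
pub const MAX_PREFIX: usize = 255;

/// How many rows `chunk_by_similarity()` processes at a time.
/// (Hypothetical name for the 128 currently living in MAX_PREFIX.)
pub const BATCH_SIZE: usize = 128;
```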

@robert3005 mentioned this pull request on Sep 25, 2025
a10y added a commit that referenced this pull request Oct 2, 2025
[FineWeb](https://huggingface.co/datasets/HuggingFaceFW/fineweb) is a
string-focused ML training dataset provided by HuggingFace.

This PR adds some data from the FineWeb sample as a new benchmark. I
hand-crafted some queries. The benchmarks are also quite fast: I'm only
pulling a 2GB subset of the data, so we might want to pull the full ~25GB
sample.

FineWeb contains data extracted from CommonCrawl and has a number of
string fields. We don't really have many string-filtering-heavy
benchmarks, and I think this one can be useful for evaluating things like
#4548 or #4132.

---------

Signed-off-by: Andrew Duffy <[email protected]>