
Conversation

@a10y (Contributor) commented Aug 5, 2025

I was glancing through FSST+; the meat is the cleaving algorithm in https://github.com/cwida/fsst_plus/blob/main/src/cleaving/cleaving.h#L75-L159, which determines the optimal points at which to split the sorted blocks by maximum shared prefix.

This is far from complete. Also, the test case is just for development ease; in actuality it seems that the cleaving happens after FSST encoding, not before, so you split the encoded codes by prefix rather than the source strings. That probably makes sense, since the LCP is faster to compute over the shorter compressed strings than over the full strings.
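
To make the shape of that concrete, here's a minimal sketch (not the FSST+ code; the `lcp` helper and the examples are made up for illustration):

```rust
/// Byte-wise longest common prefix of two (encoded) strings. Over the
/// FSST codes this scans fewer bytes than over the raw strings, since
/// the encoded form is shorter.
fn lcp(a: &[u8], b: &[u8]) -> usize {
    a.iter().zip(b).take_while(|(x, y)| x == y).count()
}

fn main() {
    // Adjacent entries in a sorted block tend to share a long prefix.
    assert_eq!(lcp(b"https://example.com/a", b"https://example.com/b"), 20);
    // A short shared prefix is a candidate cleave point.
    assert_eq!(lcp(b"https://example.com/a", b"ipfs://abc"), 0);
}
```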

An alternative to this is the much simpler splitting in FSSTView from LiquidCache: https://github.com/XiangpengHao/liquid-cache/blob/main/dev/design/00-fsst-view.md

Signed-off-by: Andrew Duffy <[email protected]>
@a10y force-pushed the aduffy/fsstplus branch from 2adb639 to 63a8367 on August 5, 2025 at 22:23
@robert3005 (Contributor)

Does fsst guarantee that no code is a prefix of another? I was always unsure about that part, and computing the LCP would be hard if that weren't the case.

@robert3005 (Contributor)

Should we maybe put this in the fsst repo?

@coveralls

Coverage Status: coverage 84.054% (+0.03%) from 84.027% when pulling 63a8367 on aduffy/fsstplus into b2a318b on develop.

@a10y (Contributor, Author) commented Aug 5, 2025

The algorithm isn't really FSST-dependent AFAICT; it treats the input as opaque byte strings. The surrounding code seems to run this step after FSST-encoding all of the values.

So I actually think this makes more sense in Vortex than in the core fsst crate.
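
Roughly the ordering I mean, as a sketch with a stand-in encoder (not the actual fsst crate API; the 4-byte threshold is arbitrary):

```rust
/// Byte-wise longest common prefix.
fn lcp(a: &[u8], b: &[u8]) -> usize {
    a.iter().zip(b).take_while(|(x, y)| x == y).count()
}

/// Sketch of the pipeline ordering: encode first, then cleave the
/// sorted, encoded values into prefix-sharing groups. `encode` is a
/// stand-in for the FSST encoder.
fn encode_then_cleave(
    values: &[Vec<u8>],
    encode: impl Fn(&[u8]) -> Vec<u8>,
) -> Vec<Vec<Vec<u8>>> {
    let mut encoded: Vec<Vec<u8>> = values.iter().map(|v| encode(v)).collect();
    encoded.sort();

    let mut groups: Vec<Vec<Vec<u8>>> = Vec::new();
    for s in encoded {
        // Extend the current group only while the shared prefix with its
        // last member stays above the threshold; otherwise cleave here.
        let extends = groups
            .last()
            .map_or(false, |g| lcp(g.last().unwrap(), &s) >= 4);
        if extends {
            groups.last_mut().unwrap().push(s);
        } else {
            groups.push(vec![s]);
        }
    }
    groups
}
```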

}

/// Maximum shared prefix length.
pub const MAX_PREFIX: usize = 128;
@yan-alex commented Aug 15, 2025

The maximum prefix size should be 255 here :)
128 is the batch size, i.e. how many rows at a time the chunk_by_similarity() function would process.

But nice start, you guys are fast!
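
For what it's worth, the fix might just be splitting that into two constants, something like this (a sketch; `BATCH_SIZE` is a name I made up):

```rust
/// Maximum shared prefix length; 255 presumably so a prefix length
/// fits in a single u8.
pub const MAX_PREFIX: usize = 255;

/// How many rows `chunk_by_similarity()` processes at a time.
/// (Hypothetical name for the 128 currently living in MAX_PREFIX.)
pub const BATCH_SIZE: usize = 128;
```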

@robert3005 mentioned this pull request on Sep 25, 2025
a10y added a commit that referenced this pull request Oct 2, 2025
[FineWeb](https://huggingface.co/datasets/HuggingFaceFW/fineweb) is a
string-focused ML training dataset provided by HuggingFace.

This PR adds some data from the FineWeb sample as a new benchmark. I
hand-crafted some queries. The benchmarks are also quite fast: I'm only
pulling a 2GB subset of the data, so we might want to pull the full ~25GB
sample.

FineWeb contains data extracted from CommonCrawl and has a number of
string fields. We don't really have many string-filtering-heavy
benchmarks, and I think this one can be useful for evaluating things like
#4548 or #4132.

---------

Signed-off-by: Andrew Duffy <[email protected]>