WIP fsst+ #4132
Conversation
Signed-off-by: Andrew Duffy <[email protected]>
Does FSST guarantee that no code is a prefix of another? I was always unsure about that part, and LCP would be hard if it wasn't the case.

Should we maybe put this in the fsst repo?

The algorithm isn't really FSST-dependent AFAICT; it treats the input as opaque byte strings. The handling code seems to do this step after FSST-encoding all of the values, so I actually think this makes more sense in Vortex than in the core fsst crate.
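For what it's worth, a minimal sketch of an LCP over opaque byte slices, assuming the codes are compared purely byte-by-byte; the `lcp_len` helper and the example values are hypothetical, not code from this PR:

```rust
/// Longest common prefix of two opaque byte strings. Hypothetical helper,
/// not code from this PR; it only illustrates that the step needs no
/// knowledge of FSST symbol boundaries.
fn lcp_len(a: &[u8], b: &[u8]) -> usize {
    a.iter().zip(b).take_while(|(x, y)| x == y).count()
}

fn main() {
    // Two FSST code sequences, treated purely as bytes (values are made up).
    let a: &[u8] = &[0x01, 0x02, 0x03, 0x7f];
    let b: &[u8] = &[0x01, 0x02, 0x04];
    assert_eq!(lcp_len(a, b), 2);
}
```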
/// Maximum shared prefix length.
pub const MAX_PREFIX: usize = 128;
The maximum prefix size should be 255 here :)
128 is the batch size, i.e. how many rows at a time the chunk_by_similarity() function would process.
But nice start, you guys are fast!
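For concreteness, a hedged sketch of what the comment above suggests, splitting the two roles into separate constants; `BATCH_SIZE` is a hypothetical name for the 128-row batch, not the PR's actual identifier:

```rust
/// Maximum shared prefix length (255, per the comment above, not 128).
/// Illustrative only; the actual constant lives in the PR.
pub const MAX_PREFIX: usize = 255;

/// How many rows at a time chunk_by_similarity() processes in one batch.
/// `BATCH_SIZE` is a hypothetical label for the value 128 mentioned above.
pub const BATCH_SIZE: usize = 128;
```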
[FineWeb](https://huggingface.co/datasets/HuggingFaceFW/fineweb) is a string-focused ML training dataset provided by HuggingFace. This PR adds some data from the FineWeb sample as a new benchmark. I hand-crafted some queries. The benchmarks are also quite fast because I'm only pulling a 2GB subset of the data, so we might want to pull the full ~25GB sample. FineWeb contains data extracted from CommonCrawl and has a number of string fields. We don't really have many string-filtering-heavy benchmarks, and I think this can be a useful one to evaluate things like #4548 or #4132.

---------

Signed-off-by: Andrew Duffy <[email protected]>
I was glancing through FSST+; the meat is the cleaving algorithm in https://github.com/cwida/fsst_plus/blob/main/src/cleaving/cleaving.h#L75-L159, which dictates the optimal point to split the sorted blocks by maximum prefix.
This is far from complete. Also, the test case is just for development ease; in actuality it seems that the cleaving happens after FSST encoding, not before, so you split the codes by prefix rather than the source strings. This probably makes sense given that the LCP is faster to compute over compressed strings than over the full strings.
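To make the split-after-encoding shape concrete, here is a rough, hedged sketch, not the FSST+ cleaving code linked above: it sorts the already-encoded values and greedily groups adjacent ones by a shared prefix, storing the prefix once per group and the suffixes per value. The names (`cleave`, `flush`, `lcp`), the greedy criterion, and the ASCII example inputs are illustrative assumptions; FSST+ picks its split points optimally rather than greedily.

```rust
/// Longest common prefix length of two byte slices.
fn lcp(a: &[u8], b: &[u8]) -> usize {
    a.iter().zip(b).take_while(|(x, y)| x == y).count()
}

/// Close the current group, storing its shared prefix once and the
/// per-value suffixes separately.
fn flush(out: &mut Vec<(Vec<u8>, Vec<Vec<u8>>)>, group: &mut Vec<Vec<u8>>, prefix_len: usize) {
    if group.is_empty() {
        return;
    }
    let prefix = group[0][..prefix_len].to_vec();
    let suffixes: Vec<Vec<u8>> = group.drain(..).map(|c| c[prefix_len..].to_vec()).collect();
    out.push((prefix, suffixes));
}

/// Greedy stand-in for cleaving: sort the (already FSST-encoded) values,
/// then group adjacent ones that keep a shared prefix of at least
/// `min_prefix` bytes. Only illustrates the shape of the transform
/// (prefix stored once, suffixes stored per value), not the FSST+ algorithm.
fn cleave(mut codes: Vec<Vec<u8>>, min_prefix: usize) -> Vec<(Vec<u8>, Vec<Vec<u8>>)> {
    codes.sort();
    let mut out = Vec::new();
    let mut group: Vec<Vec<u8>> = Vec::new();
    let mut prefix_len = 0;

    for code in codes {
        if group.is_empty() {
            prefix_len = code.len();
            group.push(code);
            continue;
        }
        // Shrink the group's shared prefix to accommodate the new value.
        let shared = lcp(&group[0], &code).min(prefix_len);
        if shared >= min_prefix {
            prefix_len = shared;
            group.push(code);
        } else {
            flush(&mut out, &mut group, prefix_len);
            prefix_len = code.len();
            group.push(code);
        }
    }
    flush(&mut out, &mut group, prefix_len);
    out
}

fn main() {
    // Readable ASCII stand-ins for what would really be FSST code bytes.
    let codes = vec![
        b"http://a/x".to_vec(),
        b"http://a/y".to_vec(),
        b"ftp://z".to_vec(),
    ];
    for (prefix, suffixes) in cleave(codes, 4) {
        println!("prefix {:?}: {} suffixes", String::from_utf8_lossy(&prefix), suffixes.len());
    }
}
```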
An alternative to this is the very simplistic splitting in FSSTView from LiquidCache: https://github.com/XiangpengHao/liquid-cache/blob/main/dev/design/00-fsst-view.md