CNDB-13639: Adds optional prefetching for known sequential reads by pcmanus · Pull Request #1692 · datastax/cassandra

pcmanus · 2025-04-11T13:36:34Z

This introduces a `PrefetchingRebufferer` that, when enabled, prefetches (reads from the underlying reader) a configurable number of chunks in advance. It therefore only works on top of rebufferer factories that work with chunks, i.e. not uncompressed mmap.

Prefetching must first be enabled globally by setting `-Dcassandra.read_prefetching_size_kb` to the desired value. With that set, and assuming the disk access mode allows it (meaning it's not uncompressed mmap), prefetching is applied to reads that are "clearly" sequential. As of this patch, this mostly means the sstable "scanners" and `SortedStringTableCursor`, so compaction, SAI index building and tools that read sstables fully (scrub, verifier) will benefit from it.

Since rebufferers are synchronous, prefetching is done through a fixed thread pool. The number of threads in that pool can be set with `-Dcassandra.read_prefetching_threads` (it defaults to the number of "processors").
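To illustrate the idea of prefetching synchronous chunk reads through a fixed thread pool, here is a minimal sketch. It is not the `PrefetchingRebufferer` from the patch: the class `PrefetchSketch`, its constructor arguments and the `LongFunction<byte[]>` reader are all hypothetical. Only the property name and the "number of processors" default come from the description above.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.LongFunction;

/** Hypothetical sketch of read-ahead over a synchronous chunk reader. */
final class PrefetchSketch implements AutoCloseable {
    private final LongFunction<byte[]> reader; // blocking read of the chunk at an offset (assumed shape)
    private final int chunkSize;               // bytes per chunk
    private final int chunksAhead;             // how many chunks to keep in flight (>= 1)
    private final ExecutorService pool;
    private final Deque<Future<byte[]>> inFlight = new ArrayDeque<>();
    private long nextOffset = 0;               // offset of the next chunk to submit

    PrefetchSketch(LongFunction<byte[]> reader, int chunkSize, int chunksAhead, int threads) {
        this.reader = reader;
        this.chunkSize = chunkSize;
        this.chunksAhead = Math.max(1, chunksAhead);
        this.pool = Executors.newFixedThreadPool(threads);
    }

    /** Pool size: the system property if set, otherwise the number of processors. */
    static int defaultThreads() {
        return Integer.getInteger("cassandra.read_prefetching_threads",
                                  Runtime.getRuntime().availableProcessors());
    }

    /** Returns the next sequential chunk, topping up the read-ahead queue first. */
    byte[] nextChunk() {
        while (inFlight.size() < chunksAhead) {
            final long offset = nextOffset;
            Callable<byte[]> task = () -> reader.apply(offset);
            inFlight.add(pool.submit(task)); // each chunk is read independently, no batching
            nextOffset += chunkSize;
        }
        try {
            return inFlight.poll().get();    // blocks only if the prefetch has not completed yet
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } catch (ExecutionException e) {
            throw new IllegalStateException(e.getCause());
        }
    }

    @Override
    public void close() {
        pool.shutdownNow();
    }
}
```

The caller stays synchronous: it only blocks in `nextChunk()` when the chunk it needs has not finished loading, while later chunks are read in parallel on the pool threads.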

`-Dcassandra.read_prefetching_window` can also be set to define how often prefetching is re-triggered. By default, prefetching is triggered when less than half of the value of `-Dcassandra.read_prefetching_size_kb` is currently prefetched.
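The re-trigger rule can be sketched as a simple threshold check. This assumes the window is expressed as a fraction of the prefetch size, with 0.5 matching the "less than half" default described above. The class and method names are hypothetical and the actual patch may represent the window differently.

```java
/** Hypothetical sketch of the prefetch re-trigger rule. */
final class PrefetchWindow {
    /**
     * @param prefetchedBytes bytes currently prefetched and not yet consumed
     * @param targetBytes     configured prefetch size (read_prefetching_size_kb * 1024)
     * @param window          assumed to be a fraction in (0, 1]; 0.5 mirrors the described default
     * @return true when the prefetched amount has dropped below window * target
     */
    static boolean shouldTrigger(long prefetchedBytes, long targetBytes, double window) {
        return prefetchedBytes < (long) (targetBytes * window);
    }
}
```

Under this assumption a smaller window re-triggers less often but in larger bursts, while a window of 1.0 re-triggers as soon as any prefetched chunk is consumed.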

See https://github.com/riptano/cndb/issues/13639 for additional motivation.

I'll note that this patch is adapted from the similar DSE behavior. However, there is a fair bit of modification, since the DSE version relied on the asynchronous nature of rebufferers there (it also had a few behaviors I didn't fully understand, and I didn't dig into them when they felt a bit specific to DSE).

One point worth mentioning is that the DSE version relied on the ability of the underlying channel to create batches when multiple chunks are prefetched. This could have advantages and we could try to add it back over time, but it's a lot harder to see how to do that in the non-async world of C*. Typically, in DSE, since all the prefetches were initiated on the current thread (without blocking it), building the "batch" was relatively easy, but that doesn't translate to this patch. Not to mention that the underlying channels currently have no batching options, and those would have to be added (in a way that trickles down to the channel implementation within CNDB). Anyway, as of this patch, chunks are prefetched in parallel but with no batching optimization. This does mean the `window` parameter is probably not as useful as it is for DSE, but I've kept it nonetheless as it's not exactly adding complexity.

In the context of compaction in CNDB though, I think this means it would be ideal if we could use a relatively large "chunk size" for the readers: we read the full file anyway, so there is no waste in using a large-ish chunk size, and it would provide batching for prefetching "for free" (in the sense that if we want to prefetch, say, 128kb in advance, then we only need 2 chunks with a 64kb chunk size, but need a lot more with a 4kb chunk size).
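To put numbers on the chunk-size argument, the count of chunks needed is just the prefetch size divided by the chunk size, rounded up. The helper below is illustrative only (the name `ChunkMath.chunksFor` is made up for this example):

```java
/** Illustrative arithmetic only; not part of the patch. */
final class ChunkMath {
    /** Number of chunks needed to cover prefetchKb, rounding up. */
    static int chunksFor(int prefetchKb, int chunkKb) {
        return (prefetchKb + chunkKb - 1) / chunkKb;
    }
}
```

With a 128kb prefetch size, `chunksFor(128, 64)` is 2 while `chunksFor(128, 4)` is 32, so the small chunk size means 16 times as many separate reads submitted to the pool for the same amount of data.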

github-actions · 2025-04-11T13:36:52Z

blambov

Looks good to me.

Pushed a patch with a couple of typo corrections.

sonarqubecloud · 2025-04-22T16:00:29Z

Quality Gate passed

Issues
1 New issue
0 Accepted issues

Measures
0 Security Hotspots
82.3% Coverage on New Code
1.9% Duplication on New Code

See analysis details on SonarQube Cloud

cassci-bot · 2025-04-22T16:06:35Z

❌ Build ds-cassandra-pr-gate/PR-1692 rejected by Butler

1 new test failure(s) in 4 builds
See build details here

Found 1 new test failures

Test	Explanation	Branch history	Upstream history
...gLegacyIndex.test_sstableloader_with_failing_2i	regression	🔴🔵🔵🔵	🔵🔵🔵🔵🔵🔵🔵

Found 6 known test failures

Co-authored-by: Branimir Lambov <branimir.lambov@datastax.com> (Rebase of commit c57974a)

pcmanus force-pushed the CNDB-13639-prefetching branch from 3ca9092 to 18c9e57 Compare April 11, 2025 14:33

eolivelli requested a review from blambov April 11, 2025 15:30

blambov approved these changes Apr 14, 2025

pcmanus and others added 2 commits April 22, 2025 16:44

Typo and formatting corrections

070433b

pcmanus force-pushed the CNDB-13639-prefetching branch from a14e75e to 070433b Compare April 22, 2025 14:44

pcmanus merged commit 9fd4c4c into main Apr 22, 2025
465 of 479 checks passed

pcmanus deleted the CNDB-13639-prefetching branch April 22, 2025 16:15
