CNDB-13639: Adds optional prefetching for known sequential reads #1692

Merged
pcmanus merged 2 commits into main from CNDB-13639-prefetching on Apr 22, 2025

@pcmanus pcmanus commented Apr 11, 2025

This introduces a `PrefetchingRebufferer` that, when enabled, prefetches (that is, reads from the underlying reader) a configurable number of chunks in advance. It therefore only works on top of rebufferer factories that work with chunks, i.e. not uncompressed mmap.

Prefetching must first be enabled globally by setting `-Dcassandra.read_prefetching_size_kb` to the desired value. With that set, and assuming the disk access mode allows it (meaning it is not uncompressed mmap), prefetching is applied to reads that are "clearly" sequential. As of this patch, this mostly means the sstable "scanners" and `SortedStringTableCursor`, so compaction, SAI index building and tools that read sstables fully (scrub, verifier) will benefit from it.
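To make the three conditions above concrete, here is a minimal sketch of the decision. The class, enum and method names are hypothetical (they are not the patch's API), and it assumes that an unset or non-positive `read_prefetching_size_kb` means prefetching is disabled.

```java
// Illustrative sketch only; names are hypothetical, not the patch's API.
final class PrefetchEligibility {
    /** Hypothetical stand-in for how the file is accessed. */
    enum AccessMode { CHUNKED, UNCOMPRESSED_MMAP }

    /** Reads the global setting; assumes unset or <= 0 means "disabled". */
    static int configuredSizeKb() {
        return Integer.getInteger("cassandra.read_prefetching_size_kb", 0);
    }

    /** Prefetch only if enabled globally, the access mode is chunk-based, and the read is known to be sequential. */
    static boolean shouldPrefetch(int sizeKb, AccessMode mode, boolean knownSequential) {
        return sizeKb > 0 && mode != AccessMode.UNCOMPRESSED_MMAP && knownSequential;
    }
}
```

A scanner-style caller would pass `knownSequential = true`, while a point read would pass `false` and get a plain rebufferer.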

Since rebufferers are synchronous, prefetching is done through a fixed thread pool. The number of threads in that pool can be set with `-Dcassandra.read_prefetching_threads` (it defaults to the number of "processors").
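As a rough illustration of the approach (not the actual `PrefetchingRebufferer` code), the sketch below keeps a few chunk reads in flight on a fixed pool while the caller consumes chunks synchronously and in order. `SequentialPrefetchSketch`, `readChunkAt` and `chunksAhead` are hypothetical names, and the sketch tops up on every call rather than modelling the window parameter.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.LongFunction;

// Illustrative sketch only; not the patch's implementation.
final class SequentialPrefetchSketch implements AutoCloseable {
    private final ExecutorService pool;
    private final LongFunction<byte[]> readChunkAt; // hypothetical blocking read of the chunk at an offset
    private final int chunkSize;
    private final int chunksAhead;
    private final Deque<Future<byte[]>> inFlight = new ArrayDeque<>();
    private long nextOffset = 0;

    SequentialPrefetchSketch(int threads, LongFunction<byte[]> readChunkAt, int chunkSize, int chunksAhead) {
        this.pool = Executors.newFixedThreadPool(threads);
        this.readChunkAt = readChunkAt;
        this.chunkSize = chunkSize;
        this.chunksAhead = chunksAhead;
    }

    /** Pool size from the system property, defaulting to the number of processors. */
    static int configuredThreads() {
        return Integer.getInteger("cassandra.read_prefetching_threads",
                                  Runtime.getRuntime().availableProcessors());
    }

    /** Returns the next chunk in file order, blocking only if its read has not finished yet. */
    byte[] nextChunk() {
        // Keep the requested chunk plus chunksAhead further reads submitted to the pool.
        while (inFlight.size() < chunksAhead + 1) {
            final long offset = nextOffset;
            Callable<byte[]> read = () -> readChunkAt.apply(offset);
            inFlight.add(pool.submit(read));
            nextOffset += chunkSize;
        }
        try {
            return inFlight.remove().get();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("Interrupted while waiting for a prefetched chunk", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("Prefetch read failed", e.getCause());
        }
    }

    @Override
    public void close() {
        pool.shutdownNow();
    }
}
```

The reads run in parallel on the pool threads, but each one is still an individual blocking read, so nothing here batches requests to the underlying channel.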

`-Dcassandra.read_prefetching_window` can also be set to define how often prefetching is re-triggered. By default, prefetching is triggered when less than half of the value of `-Dcassandra.read_prefetching_size_kb` is currently prefetched.
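A small sketch of that trigger rule follows. It assumes the window is expressed as a fraction of the prefetch size with a default of 0.5; the description above only states the default behavior, so the unit and the names used here are assumptions.

```java
// Illustrative sketch only; assumes the window is a fraction of the prefetch size.
final class PrefetchWindowSketch {
    static final double DEFAULT_WINDOW = 0.5; // "less than half" by default

    /** True when the amount currently prefetched has dropped below window * target. */
    static boolean shouldTrigger(long prefetchedBytes, long targetBytes, double window) {
        return prefetchedBytes < targetBytes * window;
    }
}
```

With a 128kb target and the default window, a new round of prefetching would start once fewer than 64kb remain prefetched.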

See https://github.com/riptano/cndb/issues/13639 for additional motivation.

I'll note that this patch is adapted from the similar DSE behavior. However, there is a fair bit of modification since the DSE version relied on the asynchronous nature of rebufferers there (it also had a few behaviors I didn't fully understand, and I didn't dig into them when they felt a bit specific to DSE).

One point worth mentioning is that the DSE version relied on the ability of the underlying channel to create batches when multiple chunks are prefetched. This could have advantages and we could try to add it back over time, but it's a lot harder to see how to do that in the non-async world of C*. Typically, in DSE, since all the prefetches were initiated on the current thread (without blocking it), building the "batch" was relatively easy, but that doesn't translate to this patch. Not to mention that the underlying channels currently have no batching options, and those would have to be added (in a way that trickles down to the channel implementation within CNDB). Anyway, as of this patch, chunks are prefetched in parallel but with no batching optimization. This does mean the `window` parameter is probably not as useful as it was for DSE, but I've kept it nonetheless as it's not exactly adding complexity.

In the context of compaction in CNDB though, I think this means it would be ideal if we could use a relatively large "chunk size" for the readers: we read the full file anyway, so a large-ish chunk size wastes nothing, and it would provide batching for prefetching "for free" (in the sense that if we want to prefetch, say, 128kb in advance, then we only need 2 chunks with a 64kb chunk size, but a lot more with a 4kb chunk size).
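The arithmetic behind that last point can be written out as a ceiling division. `ChunkMath` is a hypothetical helper for illustration, not part of the patch.

```java
// Illustrative sketch only; hypothetical helper, not part of the patch.
final class ChunkMath {
    /** Number of chunk reads needed to cover prefetchKb, rounding up to whole chunks. */
    static int chunksToPrefetch(int prefetchKb, int chunkKb) {
        if (prefetchKb < 0 || chunkKb <= 0) {
            throw new IllegalArgumentException("prefetchKb must be >= 0 and chunkKb must be > 0");
        }
        return (prefetchKb + chunkKb - 1) / chunkKb;
    }
}
```

Here `chunksToPrefetch(128, 64)` is 2 while `chunksToPrefetch(128, 4)` is 32, so the smaller chunk size needs 16 times as many individual reads for the same prefetch distance.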

@github-actions

Checklist before you submit for review

  • Make sure there is a PR in the CNDB project updating the Converged Cassandra version
  • Use NoSpamLogger for log lines that may appear frequently in the logs
  • Verify test results on Butler
  • Test coverage for new/modified code is > 80%
  • Proper code formatting
  • Proper title for each commit starting with the project-issue number, like CNDB-1234
  • Each commit has a meaningful description
  • Each commit is not very long and contains related changes
  • Renames, moves and reformatting are in distinct commits
  • All new files should contain the DataStax copyright header instead of the Apache License one

@pcmanus pcmanus force-pushed the CNDB-13639-prefetching branch from 3ca9092 to 18c9e57 Compare April 11, 2025 14:33
@eolivelli eolivelli requested a review from blambov April 11, 2025 15:30

@blambov blambov left a comment

Looks good to me.

Pushed a patch with a couple of typo corrections.

pcmanus and others added 2 commits April 22, 2025 16:44
@pcmanus pcmanus force-pushed the CNDB-13639-prefetching branch from a14e75e to 070433b Compare April 22, 2025 14:44
@cassci-bot

❌ Build ds-cassandra-pr-gate/PR-1692 rejected by Butler


1 new test failure(s) in 4 builds
See build details here


Found 1 new test failures

| Test | Explanation | Branch history | Upstream history |
|---|---|---|---|
| ...gLegacyIndex.test_sstableloader_with_failing_2i | regression | 🔴🔵🔵🔵 | 🔵🔵🔵🔵🔵🔵🔵 |

Found 6 known test failures

@pcmanus pcmanus merged commit 9fd4c4c into main Apr 22, 2025
465 of 479 checks passed
@pcmanus pcmanus deleted the CNDB-13639-prefetching branch April 22, 2025 16:15
djatnieks pushed a commit that referenced this pull request May 13, 2025
Co-authored-by: Branimir Lambov <branimir.lambov@datastax.com>
djatnieks pushed a commit that referenced this pull request May 18, 2025
michaelsembwever pushed a commit that referenced this pull request Feb 6, 2026
michaelsembwever pushed a commit that referenced this pull request Feb 10, 2026
 (Rebase of commit c57974a)
michaelsembwever pushed a commit that referenced this pull request Feb 11, 2026
 (Rebase of commit c57974a)
michaelsembwever pushed a commit that referenced this pull request Feb 12, 2026
 (Rebase of commit c57974a)
michaelsembwever pushed a commit that referenced this pull request Feb 14, 2026
 (Rebase of commit c57974a)
michaelsembwever pushed a commit that referenced this pull request Feb 16, 2026
 (Rebase of commit c57974a)
michaelsembwever pushed a commit that referenced this pull request Feb 27, 2026
This introduces a `PrefetchingRebufferer` that, when enabled, prefetches
(read from the underlying reader) a configurable number of chunks in
advance (so only work on top of rebufferer factories that work with
chunks, i.e. not uncompressed mmap).

Prefetching must first be enabled globally by setting
`-Dcassandra.read_prefetching_size_kb` to the desired value. With that
set, and assuming the disk access mode allows it (meaning, it's not
uncompressed mmap), prefetching is applied to reads that are "clearly"
sequential. As of this patch, this mostly means the sstable "scanners"
and `SortedStringTableCursor`, so compaction, SAI index building and
tools that read sstables fully (scrub, verifier) will benefit from it.

Since rebufferers are synchronous, prefetching is done through a fixed
thread pool. The number of threads in that pool can be set with
`-Dcassandra.read_prefetching_threads` (it defaults to the number of
"processors").
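As a rough illustration of the thread pool approach described above, here is a minimal sketch, not the code from this patch. The class and method names are hypothetical, the "read" is a stand-in that just returns the chunk offset, and only the system property name and its default (number of processors) come from the description.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Executor;

// Hypothetical sketch: names are illustrative, not taken from the patch.
final class PrefetchPoolSketch
{
    // Pool size from the system property, defaulting to the number of processors.
    static int poolSize()
    {
        return Integer.getInteger("cassandra.read_prefetching_threads",
                                  Runtime.getRuntime().availableProcessors());
    }

    // Schedules one task per chunk on the pool and returns the futures in chunk order.
    // Each task stands in for a blocking read of the chunk at the given offset.
    static List<CompletableFuture<Long>> prefetch(Executor pool, long firstOffset, int chunkSize, int chunks)
    {
        List<CompletableFuture<Long>> pending = new ArrayList<>();
        for (int i = 0; i < chunks; i++)
        {
            final long offset = firstOffset + (long) i * chunkSize;
            pending.add(CompletableFuture.supplyAsync(() -> offset, pool));
        }
        return pending;
    }
}
```

The reading thread would then consume the futures in order, blocking only when the next chunk has not finished loading yet.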

`-Dcassandra.read_prefetching_window` can also be set to define how
often prefetching is re-triggered. By default, prefetching is triggered
when less than half of the value of
`-Dcassandra.read_prefetching_size_kb` is currently prefetched.
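The re-trigger rule can be sketched as a simple threshold check. This assumes the window is expressed as a fraction of the prefetch size (0.5 matching the "half" default described above); the actual unit and parsing of the option are not stated here, and the class and method names are hypothetical.

```java
// Hypothetical sketch of the re-trigger condition; not the patch's code.
final class PrefetchWindowSketch
{
    // Assumption: 'window' is a fraction of the configured prefetch size.
    static boolean shouldPrefetch(int prefetchedKb, int prefetchSizeKb, double window)
    {
        return prefetchedKb < prefetchSizeKb * window;
    }
}
```

With a 128kb prefetch size and a window of 0.5, prefetching would be re-triggered once fewer than 64kb remain prefetched.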

See riptano/cndb#13639 for additional
motivation.

I'll note that this patch is adapted from the similar DSE behavior.
However, there is a fair bit of modification since the DSE version
relied on the asynchronous nature of rebufferers there (it also had a
few behaviors I didn't fully understand and didn't dig into when they
felt a bit specific to DSE).

One point worth mentioning is that the DSE version relied on the
ability of the underlying channel to create batches when multiple chunks
are prefetched. Batching could have advantages and we could try to add
it back over time, but it's a lot harder to see how to do that in the
non-async world of C*. In DSE, since all the prefetches were initiated
on the current thread (without blocking it), building the "batch" was
relatively easy, but that doesn't translate to this patch. Moreover, the
underlying channels currently have no batching options, so those would
have to be added (in a way that trickles down to the channel
implementation within CNDB). As of this patch, chunks are prefetched in
parallel but with no batching optimization. This does mean the `window`
parameter is probably not as useful as it is in DSE, but I've kept it
since it adds no real complexity.

In the context of compaction in CNDB though, I think this means it
would be ideal to use a relatively large "chunk size" for the readers:
we read the full file anyway, so a large-ish chunk size wastes nothing,
and it would provide batching for prefetching "for free" (if we want to
prefetch, say, 128kb in advance, we only need 2 chunks with a 64kb
chunk size, but 32 chunks with a 4kb chunk size).
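The chunk-count arithmetic behind that remark can be written out as a small helper. This is only an illustration of the numbers above; the class and method names are hypothetical, and rounding up for sizes that don't divide evenly is an assumption.

```java
// Hypothetical helper illustrating the arithmetic; not part of the patch.
final class PrefetchChunkMath
{
    // Number of chunks needed to cover the prefetch size, rounding up (assumed).
    static int chunksToPrefetch(int prefetchSizeKb, int chunkSizeKb)
    {
        return (prefetchSizeKb + chunkSizeKb - 1) / chunkSizeKb;
    }
}
```

For a 128kb prefetch size this gives 2 reads with 64kb chunks versus 32 reads with 4kb chunks, which is why a larger chunk size acts as a form of batching.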

---------

Co-authored-by: Branimir Lambov <branimir.lambov@datastax.com>

 (Rebase of commit c57974a)
michaelsembwever pushed a commit that referenced this pull request Mar 2, 2026
michaelsembwever pushed a commit that referenced this pull request Mar 4, 2026
michaelsembwever pushed a commit that referenced this pull request Mar 5, 2026
michaelsembwever pushed a commit that referenced this pull request Mar 25, 2026
michaelsembwever pushed a commit that referenced this pull request Mar 27, 2026
michaelsembwever pushed a commit that referenced this pull request Apr 14, 2026
michaelsembwever pushed a commit that referenced this pull request Apr 15, 2026