Skip to content

buffer: more copy optimizations#63892

Open
ronag wants to merge 2 commits into
nodejs:mainfrom
ronag:buffer-copy-viewbytes
Open

buffer: more copy optimizations#63892
ronag wants to merge 2 commits into
nodejs:mainfrom
ronag:buffer-copy-viewbytes

Conversation

@ronag

@ronag ronag commented Jun 13, 2026

Copy link
Copy Markdown
Member
config baseline view-level API median Δ
partial, 8 B 42.0 M/s 62.8 M/s +49%
full, 8 B 42.2 M/s 62.4 M/s +48%
partial, 128 B 42.0 M/s 63.1 M/s +50%
full, 128 B 42.2 M/s 61.7 M/s +46%
partial, 1024 B 37.8 M/s 55.1 M/s +46%
full, 1024 B 35.1 M/s 47.3 M/s +35%

Add v8::ArrayBufferView::CopyArrayBufferViewBytes, which copies a byte
range from one ArrayBufferView to another. Unlike the existing
v8::ArrayBuffer::CopyArrayBufferBytes, it operates on the views: it
resolves each view's data pointer directly (JSTypedArray::DataPtr /
DataView::data_pointer) and reads the backing buffer's shared/immutable/
detached flags as plain field loads, without ever materializing or
fetching the views' ArrayBuffers.

This exists because materializing the ArrayBuffer is expensive. Profiling
Buffer.prototype.copy (perf, AMD EPYC 9135, x86-64) showed that for small
copies the dominant native cost is ArrayBufferView::Buffer() ->
JSTypedArray::GetBuffer() -- ~25% of total runtime, paid on every call
(not just first materialization) and incurred twice (source and target).
Add ByteOffset() (~7%) and IsSharedArrayBuffer() (~6%) and roughly 38% of
a small copy is spent turning two views into ArrayBuffers and querying
them piecemeal, while the actual memmove is ~4%. Routing the node binding
through CopyArrayBufferBytes forces all of that onto the embedder side; a
view-level entry point folds it into a single call of cheap field reads.

Byte ranges are clamped to both views' byte lengths. Nothing is copied
when the source is detached/out-of-bounds or the target is detached/
out-of-bounds or backed by an immutable ArrayBuffer; the number of bytes
actually copied is returned. When both views are backed by a
SharedArrayBuffer a relaxed-atomic memmove is used, honoring the
SharedArrayBuffer memory model; any other combination performs a plain
memmove on the backing store (matching CopyArrayBufferBytes).

Carried as a floating patch; v8_embedder_string is bumped to -node.22
accordingly. It is the natural sibling of the CopyArrayBufferBytes API
added in the preceding floating patch and touches nothing but the public
ArrayBuffer/ArrayBufferView API.

Add v8::ArrayBufferView::CopyArrayBufferViewBytes, which copies a byte
range from one ArrayBufferView to another. Unlike the existing
v8::ArrayBuffer::CopyArrayBufferBytes, it operates on the *views*: it
resolves each view's data pointer directly (JSTypedArray::DataPtr /
DataView::data_pointer) and reads the backing buffer's shared/immutable/
detached flags as plain field loads, without ever materializing or
fetching the views' ArrayBuffers.

This exists because materializing the ArrayBuffer is expensive. Profiling
Buffer.prototype.copy (perf, AMD EPYC 9135, x86-64) showed that for small
copies the dominant native cost is ArrayBufferView::Buffer() ->
JSTypedArray::GetBuffer() -- ~25% of total runtime, paid on every call
(not just first materialization) and incurred twice (source and target).
Add ByteOffset() (~7%) and IsSharedArrayBuffer() (~6%) and roughly 38% of
a small copy is spent turning two views into ArrayBuffers and querying
them piecemeal, while the actual memmove is ~4%. Routing the node binding
through CopyArrayBufferBytes forces all of that onto the embedder side; a
view-level entry point folds it into a single call of cheap field reads.

Byte ranges are clamped to both views' byte lengths. Nothing is copied
when the source is detached/out-of-bounds or the target is detached/
out-of-bounds or backed by an immutable ArrayBuffer; the number of bytes
actually copied is returned. When both views are backed by a
SharedArrayBuffer a relaxed-atomic memmove is used, honoring the
SharedArrayBuffer memory model; any other combination performs a plain
memmove on the backing store (matching CopyArrayBufferBytes).

Carried as a floating patch; v8_embedder_string is bumped to -node.22
accordingly. It is the natural sibling of the CopyArrayBufferBytes API
added in the preceding floating patch and touches nothing but the public
ArrayBuffer/ArrayBufferView API.

Refs: nodejs#55422
Signed-off-by: Robert Nagy <ronagy@icloud.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@nodejs-github-bot

Copy link
Copy Markdown
Collaborator

Review requested:

  • @nodejs/gyp
  • @nodejs/security-wg
  • @nodejs/v8-update

@nodejs-github-bot nodejs-github-bot added lib / src Issues and PRs related to general changes in the lib or src directory. needs-ci PRs that need a full CI run. labels Jun 13, 2026
@ronag ronag requested a review from bnoordhuis June 13, 2026 17:03
@ronag

ronag commented Jun 13, 2026

Copy link
Copy Markdown
Member Author

@targos @addaleax @joyeecheung This is a relatively trivial v8 patch. Would you be comfortable with floating this? I've already spent money on funding work to upstream CopyArrayBufferBytes to V8 and not sure I can spend much more.

@ronag

ronag commented Jun 13, 2026

Copy link
Copy Markdown
Member Author

I've tried the CopyArrayBufferBytes implementation with LTO enabled and the compiler is not able to inline things properly.

Route the native backing of Buffer.prototype.copy (CopyImpl, the `_copy`
binding) through the new v8::ArrayBufferView::CopyArrayBufferViewBytes API
instead of v8::ArrayBuffer::CopyArrayBufferBytes. The previous binding had
to convert both views to ArrayBuffers (ArrayBufferView::Buffer()), read
their byte offsets (ByteOffset()) and test shared-ness
(IsSharedArrayBuffer()) before the copy -- around half a dozen separate V8
API calls per copy. The view-level API does all of that internally from
the views' own fields in a single call, so the binding now just forwards
the two views, the view-relative offsets and the length.

Profiling on AMD EPYC 9135 (x86-64) attributed the small-copy cost almost
entirely to that view->buffer conversion: ArrayBufferView::Buffer() /
JSTypedArray::GetBuffer() alone was ~25% of runtime, paid every call and
twice per copy. Resolving the buffer in JS instead (passing
source.buffer/target.buffer to the binding) was measured and is worse:
the typed-array `.buffer` getter is not JIT-inlined and dispatches through
the CEntry trampoline to a C++ builtin, costing ~36%.

The view-level copy keeps all existing semantics: byte-range clamping,
no-op (0 bytes) on a detached or immutable target, relaxed-atomic memmove
when both sides are SharedArrayBuffer-backed, plain memmove otherwise. The
JS-side view clamping in copyImpl is retained: V8 clamps to the underlying
backing store, which for pooled Buffers is the whole shared pool rather
than the individual view.

buffer-copy.js, median of 30 interleaved runs, AMD EPYC 9135 x86-64
(all changes p < 0.001, Welch t = 19-50):

  partial=false bytes=8:    42.2 -> 62.4 Mops/s  (+48%)
  partial=true  bytes=8:    42.0 -> 62.8 Mops/s  (+49%)
  partial=false bytes=128:  42.2 -> 61.7 Mops/s  (+46%)
  partial=true  bytes=128:  42.0 -> 63.1 Mops/s  (+50%)
  partial=false bytes=1024: 35.1 -> 47.3 Mops/s  (+35%)
  partial=true  bytes=1024: 37.8 -> 55.1 Mops/s  (+46%)

The gain is largest for small/medium copies, where per-call overhead
dominates, and tapers for 1024-byte copies as the memmove itself grows.

Also inlines the former _copyActual helper (only caller was copyImpl) into
copyImpl, folding a redundant target.byteLength read.

Refs: nodejs#55422
Signed-off-by: Robert Nagy <ronagy@icloud.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@ronag ronag force-pushed the buffer-copy-viewbytes branch from f511275 to afe87da Compare June 13, 2026 20:09
@ronag ronag marked this pull request as ready for review June 13, 2026 20:09
@ronag ronag requested review from joyeecheung, mcollina and targos June 13, 2026 20:11
@ronag

ronag commented Jun 13, 2026

Copy link
Copy Markdown
Member Author

So this is basically what we tried to upstream to V8 but was rejected and turned into the slower CopyArrayBufferBytes.

@ronag ronag added the buffer Issues and PRs related to the buffer subsystem. label Jun 13, 2026
@joyeecheung

Copy link
Copy Markdown
Member

Can you explain why the upstream rejected this patch and why the reason wouldn't apply here?

@ronag

ronag commented Jun 13, 2026

Copy link
Copy Markdown
Member Author

https://chromium-review.googlesource.com/c/v8/v8/+/7735151

They preferred an implementation that matched the spec.

@joyeecheung

Copy link
Copy Markdown
Member

From the thread it doesn't look like the point about not materialising views provides better performance was ever communicated during the code review in the upstream? Have you tried upstreaming something citing performance numbers of the two different versions?

@ronag

ronag commented Jun 14, 2026

Copy link
Copy Markdown
Member Author

I can't spend more money on this. So either someone will have to help with upstreaming (I don't have anyone that can do V8 stuff), land this as is or close this. I can always land it on a fork.

I think the value of this is quite high.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

buffer Issues and PRs related to the buffer subsystem. lib / src Issues and PRs related to general changes in the lib or src directory. needs-ci PRs that need a full CI run.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants