buffer: more copy optimizations#63892
Conversation
Add v8::ArrayBufferView::CopyArrayBufferViewBytes, which copies a byte range from one ArrayBufferView to another. Unlike the existing v8::ArrayBuffer::CopyArrayBufferBytes, it operates on the *views*: it resolves each view's data pointer directly (JSTypedArray::DataPtr / DataView::data_pointer) and reads the backing buffer's shared/immutable/ detached flags as plain field loads, without ever materializing or fetching the views' ArrayBuffers. This exists because materializing the ArrayBuffer is expensive. Profiling Buffer.prototype.copy (perf, AMD EPYC 9135, x86-64) showed that for small copies the dominant native cost is ArrayBufferView::Buffer() -> JSTypedArray::GetBuffer() -- ~25% of total runtime, paid on every call (not just first materialization) and incurred twice (source and target). Add ByteOffset() (~7%) and IsSharedArrayBuffer() (~6%) and roughly 38% of a small copy is spent turning two views into ArrayBuffers and querying them piecemeal, while the actual memmove is ~4%. Routing the node binding through CopyArrayBufferBytes forces all of that onto the embedder side; a view-level entry point folds it into a single call of cheap field reads. Byte ranges are clamped to both views' byte lengths. Nothing is copied when the source is detached/out-of-bounds or the target is detached/ out-of-bounds or backed by an immutable ArrayBuffer; the number of bytes actually copied is returned. When both views are backed by a SharedArrayBuffer a relaxed-atomic memmove is used, honoring the SharedArrayBuffer memory model; any other combination performs a plain memmove on the backing store (matching CopyArrayBufferBytes). Carried as a floating patch; v8_embedder_string is bumped to -node.22 accordingly. It is the natural sibling of the CopyArrayBufferBytes API added in the preceding floating patch and touches nothing but the public ArrayBuffer/ArrayBufferView API. Refs: nodejs#55422 Signed-off-by: Robert Nagy <ronagy@icloud.com> Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
|
Review requested:
|
|
@targos @addaleax @joyeecheung This is a relatively trivial v8 patch. Would you be comfortable with floating this? I've already spent money on funding work to upstream |
|
I've tried the |
Route the native backing of Buffer.prototype.copy (CopyImpl, the `_copy` binding) through the new v8::ArrayBufferView::CopyArrayBufferViewBytes API instead of v8::ArrayBuffer::CopyArrayBufferBytes. The previous binding had to convert both views to ArrayBuffers (ArrayBufferView::Buffer()), read their byte offsets (ByteOffset()) and test shared-ness (IsSharedArrayBuffer()) before the copy -- around half a dozen separate V8 API calls per copy. The view-level API does all of that internally from the views' own fields in a single call, so the binding now just forwards the two views, the view-relative offsets and the length. Profiling on AMD EPYC 9135 (x86-64) attributed the small-copy cost almost entirely to that view->buffer conversion: ArrayBufferView::Buffer() / JSTypedArray::GetBuffer() alone was ~25% of runtime, paid every call and twice per copy. Resolving the buffer in JS instead (passing source.buffer/target.buffer to the binding) was measured and is worse: the typed-array `.buffer` getter is not JIT-inlined and dispatches through the CEntry trampoline to a C++ builtin, costing ~36%. The view-level copy keeps all existing semantics: byte-range clamping, no-op (0 bytes) on a detached or immutable target, relaxed-atomic memmove when both sides are SharedArrayBuffer-backed, plain memmove otherwise. The JS-side view clamping in copyImpl is retained: V8 clamps to the underlying backing store, which for pooled Buffers is the whole shared pool rather than the individual view. buffer-copy.js, median of 30 interleaved runs, AMD EPYC 9135 x86-64 (all changes p < 0.001, Welch t = 19-50): partial=false bytes=8: 42.2 -> 62.4 Mops/s (+48%) partial=true bytes=8: 42.0 -> 62.8 Mops/s (+49%) partial=false bytes=128: 42.2 -> 61.7 Mops/s (+46%) partial=true bytes=128: 42.0 -> 63.1 Mops/s (+50%) partial=false bytes=1024: 35.1 -> 47.3 Mops/s (+35%) partial=true bytes=1024: 37.8 -> 55.1 Mops/s (+46%) The gain is largest for small/medium copies, where per-call overhead dominates, and tapers for 1024-byte copies as the memmove itself grows. Also inlines the former _copyActual helper (only caller was copyImpl) into copyImpl, folding a redundant target.byteLength read. Refs: nodejs#55422 Signed-off-by: Robert Nagy <ronagy@icloud.com> Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
f511275 to
afe87da
Compare
|
So this is basically what we tried to upstream to V8 but was rejected and turned into the slower CopyArrayBufferBytes. |
|
Can you explain why the upstream rejected this patch and why the reason wouldn't apply here? |
|
https://chromium-review.googlesource.com/c/v8/v8/+/7735151 They preferred an implementation that matched the spec. |
|
From the thread it doesn't look like the point about not materialising views provides better performance was ever communicated during the code review in the upstream? Have you tried upstreaming something citing performance numbers of the two different versions? |
|
I can't spend more money on this. So either someone will have to help with upstreaming (I don't have anyone that can do V8 stuff), land this as is or close this. I can always land it on a fork. I think the value of this is quite high. |
Add v8::ArrayBufferView::CopyArrayBufferViewBytes, which copies a byte
range from one ArrayBufferView to another. Unlike the existing
v8::ArrayBuffer::CopyArrayBufferBytes, it operates on the views: it
resolves each view's data pointer directly (JSTypedArray::DataPtr /
DataView::data_pointer) and reads the backing buffer's shared/immutable/
detached flags as plain field loads, without ever materializing or
fetching the views' ArrayBuffers.
This exists because materializing the ArrayBuffer is expensive. Profiling
Buffer.prototype.copy (perf, AMD EPYC 9135, x86-64) showed that for small
copies the dominant native cost is ArrayBufferView::Buffer() ->
JSTypedArray::GetBuffer() -- ~25% of total runtime, paid on every call
(not just first materialization) and incurred twice (source and target).
Add ByteOffset() (~7%) and IsSharedArrayBuffer() (~6%) and roughly 38% of
a small copy is spent turning two views into ArrayBuffers and querying
them piecemeal, while the actual memmove is ~4%. Routing the node binding
through CopyArrayBufferBytes forces all of that onto the embedder side; a
view-level entry point folds it into a single call of cheap field reads.
Byte ranges are clamped to both views' byte lengths. Nothing is copied
when the source is detached/out-of-bounds or the target is detached/
out-of-bounds or backed by an immutable ArrayBuffer; the number of bytes
actually copied is returned. When both views are backed by a
SharedArrayBuffer a relaxed-atomic memmove is used, honoring the
SharedArrayBuffer memory model; any other combination performs a plain
memmove on the backing store (matching CopyArrayBufferBytes).
Carried as a floating patch; v8_embedder_string is bumped to -node.22
accordingly. It is the natural sibling of the CopyArrayBufferBytes API
added in the preceding floating patch and touches nothing but the public
ArrayBuffer/ArrayBufferView API.