optimized memcpy #18912

dweiller · 2024-02-13T09:10:37Z

This PR does a few things to improve memcpy performance:

the implementation is generic, whereas master (since f38d7a9) is adapted from an implementation designed for x86_64 avx2 machines
reduce the number of branches to handle small copy lengths
reduce the size of generated machine code

Here is a graph or the performance of master (commit 5cfcb01) and b7a887f in a microbenchmark. The benchmark times @memcpy for a specific length and alignment (mod 32) of source and destination 100 000 times in a loop; this is done for all 256 combinations of source and destination alignment and the average time per iteration across all alignment combinations is the reported result.

Note that this benchmark is not going to be particularly realistic for most circumstances as branch predictors will be perfectly trained.

In general we want to focus on optimizing small length copies as small lengths are the most frequent—some reported distributions can be found at here and this google paper (LLVM uses these for benchmarking). For small lengths, the above graph indicates between a modest and significant (up to around 3x) improvement, depending on the length.

Here is a graph showing performance across the distributions from the linked paper linked above (taken from the LLVM repo).

While this PR is not focussed on memmove I have also changed the ReleaseSmall implementation of memmove, as the new (since f38d7a9) memmove implementation is not handled well by LLVM in ReleaseSmall mode.

For code size of memcpy:

optimize mode	cpu	master `5cfcb01` (B)	`b7a887f` (B)
ReleaseFast	x86_64	1402	351
ReleaseFast	x86_64_v2	1402	351
ReleaseFast	x86_64_v3	1348	470
ReleaseFast	x86_64_v4	1348	470
ReleaseSmall	x86_64	1962	29
ReleaseSmall	x86_64_v2	1962	29
ReleaseSmall	x86_64_v3	1400	29
ReleaseSmall	x86_64_v4	1400	29

Note that the above column for master contains the size of the memmove implementation, the memcpy function on master compiles to 10 bytes of stack shuffling and a jump to memmove.

The ReleaseSmall size of memmove for b7a887f is 58 bytes for the cpus above.

I have only checked the performance of this change on my (x86_64, avx2, znver1) machine—it would be good to check on other architectures as well if someone with one would like to test it. The benchmarks used can be found here—the average and distrib ones were used to produce the above charts (check out the tools directory for data generation scripts).

A few other notes:

memcpy and memmove no longer share implementations
the copy loops are not manually unrolled (LLVM will unroll them on various targets).
testing manual unrolling on my machine has not indicated a significant difference in the distribution benchmark, but does show regressions in the hot loop benchmark
I have not benchmarked ReleaseSmall performance
For x86_64 machines that have ERMSB it is likely best to use rep movsb for copies over several hundred bytes (I don't have a suitable machine to test this)
For x86_64 machines that have FSRM it may be better to rep movsb for all/most cases (I don't have a suitable machine to test this)
When taking advantage of ERMSB and FSRM there would be the option of runtime detection—detection should have negligible performance impact as the branch should always be predicted correctly after the first few calls
recent arm (e.g. armv8.8) machines may want to use the instructions from the mops feature (similar to rep movsb from x86_64)
there may exist targets that would prefer to have a branch to select different instruction when aligned loads can be done in the copy loop

Other stuff that can be done (either before merging or as follow up issues):

check performance on other architectures and CPUs (would need other people to run benchmarks on their machines, or put some new benchmarks in gotta-go-fast)
investigate memmove performance and whether it makes sense to merge/share code with memcpy again
~~see if there is a reasonable way to do aligned vector moves for misaligned case (doubtful without inline assembly)~~ pretty sure this can't be done without @select taking a runtime mask or using inline assembly
benchmark copies with random sizes

Rexicon226 · 2024-02-13T09:23:56Z

Nice work! Could you share the benchmark script? I am really curious to see how those aligned loads/stores are emitted.

dweiller · 2024-02-13T09:35:45Z

Nice work! Could you share the benchmark script? I am really curious to see how those aligned loads/stores are emitted.

The benchmark script is just a pretty simple loop copying from one buffer to the other a bunch of times for each source alignment offset. I haven't done anything clever in the micro benchmark. The reason aligned vector moves are emitted is the use of @ptrCast to get pointers of type [*]CopyType/[*]const CopyType in lib/compiler_rt/memcpy.zig.

Here's the micro benchmark (doesn't look like I can colour it inside the fold):

pub fn main() !void {
    const allocator = std.heap.page_allocator;
const args = try std.process.argsAlloc(allocator);
defer allocator.free(args);
if (args.len > 3 or args.len < 2) return error.BadArgs;

const iterations = try std.fmt.parseInt(usize, args[1], 10);

const copy_len = if (args.len >= 3)
    try std.fmt.parseInt(usize, args[2], 10)
else
    std.mem.page_size;

std.debug.print("copying blocks of size {d}\n", .{std.fmt.fmtIntSizeBin(copy_len)});

const alignment = if (std.simd.suggestVectorLength(u8)) |vec_size|
    @alignOf(@Type(.{ .Vector = .{ .child = u8, .len = vec_size } }))
else
    @alignOf(usize);

var times: [alignment]u64 = undefined;

// preheat
try iteration(allocator, alignment, copy_len, alignment - 2, &times[0], iterations);

for (0..alignment) |s_offset| {
    try iteration(allocator, alignment, copy_len, s_offset, &times[s_offset], iterations);
    std.debug.print(
        "s_offset: {d}, time: {}\n",
        .{
            s_offset,
            std.fmt.fmtDuration(times[s_offset]),
        },
    );
}

var avg: u128 = 0;
for (times) |t| {
    avg += t;
}
avg /= times.len;

const final: u64 = @intCast(avg);

const throughput: u64 = @intCast((@as(u128, copy_len) * iterations * std.time.ns_per_s) / final);

std.debug.print("average time: {}, throughput: {d:.2}\n", .{
    std.fmt.fmtDuration(final),
    std.fmt.fmtIntSizeBin(throughput),
});

}
inline fn iteration(

allocator: Allocator,

comptime alignment: comptime_int,

copy_len: usize,

s_offset: usize,

p_time: *u64,

iterations: usize,

) !void {

const src = try allocator.alignedAlloc(u8, alignment, copy_len + s_offset);

defer allocator.free(src);
const dest = try allocator.alignedAlloc(u8, alignment, copy_len);
defer allocator.free(dest);

var timer = try std.time.Timer.start();

var i: usize = 0;
while (i < iterations) : (i += 1) {
    @memcpy(dest[0..copy_len], src[s_offset..][0..copy_len]);
}

const time = timer.read();

p_time.* = time;

}
const std = @import("std");

const Allocator = std.mem.Allocator;

Rexicon226 · 2024-02-13T10:20:14Z

Interesting. I have attempted to re-produce this and I see 2, maybe 3 issues.

This generates only like 1 pair of aligned loads, when you could be fitting in way more. See: https://zig.godbolt.org/z/YxEPvssMa
I simply cannot replicate your results using the benchmark script you provided. On my AVX2 machine, I see a consistent 30% worse perf on your function. I did these measurements by simply compiling a version of LLVM Zig with both compiler_rt versions. Furthermore, I wrote my own benchmark script, and also cannot replicate it. However there I was measuring how many cycles it takes to copy X amount of bytes. Very comparable results to the benchmark script your provided.
Note: all data collected was normalized and the mean was found with a rolling average of 50 data points over 10kb total range.
If you remember our discussion about this before, my personal memcpy function was faster, and that still seems to be the case. Please do not interpret the before sentence as criticism or containing any mal-intent, I am simply stating the facts I observe. I believe it is because my version contains more aligned vector loads squeezed in and does a better job at managing pre-fetching.

I recommend you use something like:

dest[0..v].* = @as(*const @Vector(v, u8), @alignCast(@ptrCast(aligned_src[offset..][0..v]))).*;

to generate your aligned vector loads, 😉

dweiller · 2024-02-13T12:41:08Z

1. This generates only like 1 pair of aligned loads, when you could be fitting in way more. See: https://zig.godbolt.org/z/YxEPvssMa

3. ... I believe it is because my version contains more aligned vector loads squeezed in and does a better job at managing pre-fetching.

I am not convinced this is the case. It is trivial to produce a longer sequence of aligned move operations by unrolling the loop - though this does neccesitate extra cleanup code - and when I did this in the past it did not have much impact on performance. Unless I get benchmark results clearly showing a win that seems generalizable I wouldn't want to do it; forcing this sort of unrolling seems more likely to trip up the optimizer and may only positively impact performance on the machine you run the benchmark on while degrading performance on other machines.

2. I simply cannot replicate your results using the benchmark script you provided.

If you let me know precisely how you compiled/ran it I will try on my machine as well. Particular things of interest would be how many iterations you did, the copy length, target cpu features, and optimize mode.

2. I see a consistent 30% worse perf on your function. I did these measurements by simply compiling a version of LLVM Zig with both compiler_rt versions.

Can you be more specific about what you did when you saw worse performance relative to master so I can investigate? I have also used compiling the zig compiler itself as a test and saw a minor (not really sure it was outside uncertainty) improvement in compile time.

2. Furthermore, I wrote my own benchmark script, and also cannot replicate it. However there I was measuring how many cycles it takes to copy X amount of bytes. Very comparable results to the benchmark script your provided.

Can you post your benchmark? I'm not sure what you mean by "...and also cannot replicate it" along with "Very comparable results to the benchmark script your provided." - those read as contradictory statements to me, so I must be misinterpretting something.

dweiller · 2024-02-13T12:45:32Z

Looks like this somehow broke a bunch of non-x86_64 targets in the linux CI. I wonder if on those targets LLVM is making memcpy produce a call to itself...

dweiller · 2024-02-13T13:42:48Z

I recommend you use something like:
dest[0..v].* = @as(*const @Vector(v, u8), @alignCast(@ptrCast(aligned_src[offset..][0..v]))).*;
to generate your aligned vector loads, 😉

I assume you mean @as(*Vector(V, u8), @alignCast(@ptrCast(dest[0..v]))).* = ...? Otherwise that is going to produce an unaligned store. Using this particular snippet to perform the cast (casting the dest as well) doesn't make a difference.

dweiller · 2024-02-13T13:59:11Z

I wonder if on those targets LLVM is making memcpy produce a call to itself...

It looks like this is the issue, at least for wasm32-wasi. I checked a test.wasm file produced by zig build test with wasm2wat and memcpy was indeed making a recursive call to itself. It's going to be pretty tedious if we have to rewrite memcpy for each target according to whether LLVM does this, or find a way to write it so it doesn't do this on all targets simultaneously.

If anyone knows of a way to trace wasm code produced back to source lines, similar to objdump --disassemble -S that would be helpful for working around this.

Jarred-Sumner · 2024-02-13T15:19:26Z

I suspect zig’s memcpy could get a whole lot faster than 30%-ish. I think it’s hard to rule out measurement noise without also having a complete benchmark snippet for others to run and on larger amounts of data. There must already be comprehensive memcpy benchmarks online somewhere you could copy from

dweiller · 2024-02-14T03:37:47Z

I suspect zig’s memcpy could get a whole lot faster than 30%-ish.

For small sizes this is certainly true, but for large copies I'm a bit skeptical, at least not without using inline asm with something like rep movsb, but that opens up a bit of a can of worms.

dweiller · 2024-02-14T06:41:35Z

I've made a bunch of improvements yielding both better performance and smaller code size. The table in the top post has been updated.

Edit: Not sure what the problem with riscv64-linux is in the CI - I've disassembled it and memcpy and is not doing a recursive call.

dweiller · 2024-02-15T04:25:50Z

Edit: Not sure what the problem with riscv64-linux is in the CI - I've disassembled it and memcpy and is not doing a recursive call.

This is wrong, recursive jalr instructions are generated for the case 16 <= copy_len < 32, to try and do a copy of length 16.

dweiller · 2024-02-19T05:36:31Z

The windows CI failure is CMake Error: CMake was unable to find a build program corresponding to "Ninja". CMAKE_MAKE_PROGRAM is not set. You probably need to select a different build tool. Not sure why that it happening.

RossComputerGuy · 2024-02-19T05:55:06Z

@dweiller I just ran into that in a different PR as well, seems to be a fluke.

dweiller · 2024-03-15T23:37:56Z

The windows CI failure seems to be the compiler running out of memory - not sure why this would happen, it shouldn't be caused by this PR.

kprotty · 2024-05-10T00:47:18Z

A decent amount of the code could be simplified using comptime into something like:

// shared (dst, src)

copy(blk, offsets):
  for i in offsets: 
    dst[i..][0..blk].* = src[i..][0..blk].*

memcpy(len):
  if len == 0: return
  if len < 4: return copy(1, {0, len/2, len-1})

  v = max(4, suggestVectorSize(u8) or 0)
  inline for n in ctz(4)..ctz(v):
    if len <= 1 << (n+1): // @expect(false)
      return copy(1<<n, {0, len - (1<<n)})
      
  for i in 0..(len / v): copy(v, .{ i * v }) // 4-way auto-vec
  copy(v, .{ len - v })

But I think (at a higher level) we should go all the way: Facebook's folly memcpy is heavily optimized but also serves as their memmove since it's so similar. For sizes which cant be handled by a single copy above (anything after the inline for), they detect if src/dst overlap and switch to something similar but copied backwards.

Interestingly. they never use aligned loads since it doesn't seem worth it. Copy-forward does both load/store unaligned, but copy-backward uses aligned stores with @prefetch(dst_to_write, .{ .rw = .write, .locality = 3 }). They also switch to rep movsb (it's like builtin CPU memcpy) if the conditions align for it to be fast (len >= 1024, pointers dont overlap, ERMSB cpuid enabled).

The general strategy of having memcpy & memmove be the same thing would be nice (perf & maintenance wise). Fast memset is also the same as fast memcpy but replace loads of src with a pre-@splat block/vector/register of the same byte.

dweiller · 2024-05-10T01:11:15Z

A decent amount of the code could be simplified using comptime into something like:

This looks pretty nice, I'll try it out.

But I think (at a higher level) we should go all the way: Facebook's folly memcpy is heavily optimized but also serves as their memmove since it's so similar. For sizes which cant be handled by a single copy above (anything after the inline for), they detect if src/dst overlap and switch to something similar but copied backwards.

Our memcpy assumes that src/dst don't overlap, so a change like that would be a (non breaking, but significant) change in semantics that would affect performance.

Interestingly. they never use aligned loads since it doesn't seem worth it. Copy-forward does both load/store unaligned, but copy-backward uses aligned stores with @prefetch(dst_to_write, .{ .rw = .write, .locality = 3 }).

On my machine, IIRC, the aligned ops did affect performance, but this would also be something machine dependent. I have seen that the current wisdom seems to be that modern x86_64 doesn't really care (but for some reason haven't seen when this started being the case), but what about other platforms? At least for a general algorithm, I would think we should use aligned ops, and if/when we want to individually optimise different platforms we could use unaligned ops. I did also try it unrolling with prefetch, but didn't want to over-optimise for my machine - I can't recall how much difference it made for me.

They also switch to rep movsb (it's like builtin CPU memcpy) if the conditions align for it to be fast (len >= 1024, pointers dont overlap, ERMSB cpuid enabled).

I did some research into rep movsb and read that the benefit and breakpoint for when it's beneficial is highly machine dependent, so I thought I'd leave using it for a future enhancement.

Fast memset is also the same as fast memcpy but replace loads of src with a pre-@splat block/vector/register of the same byte.

I do actually have a memset branch locally as well - I could include it in this PR, but I haven't put as much work into it yet.

kprotty · 2024-05-10T02:48:16Z

Our memcpy assumes that src/dst don't overlap, so a change like that would be a (non breaking, but significant) change in semantics that would affect performance.

I rather mean to keep memcpy, the function with noalias and such, but have it just call memmove internally. This should keep the noalias optimizations applied by compiler replacing memcpy, but keep the vector-based / branch-optimized version at runtime.

On my machine, IIRC, the aligned ops did affect performance

Was this in relation to the results above? I think the benchmark could report throughput in cycles/byte rather than ns. This is something used in benchmarks like smhasher; hash functions do similar reading/branching opts and take a small amount of time for short sizes. Combine this with not clflushing dst/src and not cpu-core pinning, I'd assume ns-based time fluctuates more even with high iterations.

I only mention this as aligned loads didn't seem to have an effect on other benchmarks so trying to somehow discover/rationalize the difference in the results. TBF, also haven't tested on anything outside avx2, avx512f, and apple_m1.

rep movsb and read that the benefit and breakpoint for when it's beneficial is highly machine dependent

Indeed looks like it depends on micro-architecture specifically (being finnicky on zen2+). Seemed to be same speed as the normal vectorized loop at least on 5600x and 6900HS.

I do actually have a memset branch locally as well - I could include it in this PR

Yea a separate PR would be best. Just wanted to mention their similarity.

dweiller · 2024-05-10T03:40:58Z

I rather mean to keep memcpy, the function with noalias and such, but have it just call memmove internally. This should keep the noalias optimizations applied by compiler replacing memcpy, but keep the vector-based / branch-optimized version at runtime.

Ah, okay - I understand now.

On my machine, IIRC, the aligned ops did affect performance

Was this in relation to the results above? I think the benchmark could report throughput in cycles/byte rather than ns. This is something used in benchmarks like smhasher; hash functions do similar reading/branching opts and take a small amount of time for short sizes. Combine this with not clflushing dst/src and not cpu-core pinning, I'd assume ns-based time fluctuates more even with high iterations.

I think the benchmark has changed between when I tested (and the memcpy function has quite a bit too) - I'll re-check. Unless I'm missing something, I would be biased to keep the aligned strategy until/unless we diverge implementation based on whether a target has slow unaligned access or not, because I expect the impact on systems that don't care to be minimal (it costs one branch and one vector copy but this is only on large copies which are rare, and the effects of the increased code size), but systems that do have slower unaligned accesses will pay that cost proportional to the size of a long copy.

I can have the benchmark report cycles/byte, and do things like core-pinning and disabling frequency scaling next time I work on improving the benchmarking, but I think adding other benchmarks will probably take priority.

I only mention this as aligned loads didn't seem to have an effect on other benchmarks so trying to somehow discover/rationalize the difference in the results. TBF, also haven't tested on anything outside avx2, avx512f, and apple_m1.

One reason, could be the architecture - looking at Agner Fog's tables my machine (Zen 1) has worse latencies/throughputs on unaligned intstructions (at least for integer ones, can't remember at the moment if integer or float instructions were generated).

andrewrk · 2024-05-10T20:57:49Z

Before merging it would be good to:

It looks like this is not ready for merging then? Can you close it and open a new one when it's ready for review & merging? Otherwise we can use the issue tracker to discuss ideas, plans, strategies, etc.

dweiller · 2024-05-11T05:00:37Z

Before merging it would be good to:

It looks like this is not ready for merging then? Can you close it and open a new one when it's ready for review & merging? Otherwise we can use the issue tracker to discuss ideas, plans, strategies, etc.

I've just done some cleanup of commits and assuming it passes CI, I'm happy for it to be merged. The unchecked boxes are all things that are potential future work and could be broken out into new issues.

I would say the real question is what level of effort you want to see put into benchmarking before merging. I can certainly do more on my own - write more synthetic benchmarks, do some proper statistics, check some real-world impacts more carefully (I did see a small benefit in the zig compiler using the c backend, but that was a while ago and I don't remember the details) - but I think any serious benchmarking effort would require help from others (in particular to check on more machines/architectures) and may be better done as part of gotta-go-fast.

dweiller · 2024-05-11T08:58:58Z

Looking at the table above again - would it be better to not include the optimisations done in this PR on ReleaseSmall? I'll double check the number are still accurate since the llvm-18 upgrade, but assuming they are is that too many extra bytes for ReleaseSmall?

dweiller · 2024-11-30T07:06:15Z

I've just run the distribution benchmark to compare the current PR with https://github.com/Rexicon226/zig/tree/memcpy-go-brr (commit 4e0ce4b) which I assume is @Rexicon226's latest memcpy version and got these results on my machine:

I'm not sure what benchmark Rexicon was running in the past when he was comparing his version to old versions of this PR, so I can't check those results.

dweiller · 2024-12-15T14:50:32Z

A few other notes:

* the copy loops are not manually unrolled (LLVM will unroll them on various targets).

I think that manually unrolling is not a good idea at present as LLVM further unrolls the copy loops on various targets and does so too aggressively. It would be reasonable to do manual unrolling if we have @loopHint since we can do better unrolling via branchless tail cleanup.

andrewrk

I'm tired of waiting for @Rexicon226 so if you address this feedback I'll merge it. Thanks for working on this.

lib/compiler_rt/memcpy.zig

dweiller · 2025-01-17T15:10:41Z

I'm tired of waiting for @Rexicon226 so if you address this feedback I'll merge it. Thanks for working on this.

Should I also remove all the inline annotations from helper functions? They were originally added because without them (at least in some cases) LLVM would optimise into doing infinite recursive calls, but the new -fno-builtin behaviour should prevent this from happening. There shouldn't be any semantic reason to use inline fn, so the only consideration would be to ensure they don't get compiled to function calls. I assume release modes should always inline them, however, they do not get inlined if you compile in debug mode - is this an issue? I've got it in my head that compiler-rt is always compiled with -OReleaseFast or -OReleaseSmall, is that correct?

andrewrk · 2025-01-17T21:08:16Z

For compiler_rt it's important that all the intrinsics are lowered to exactly one function. This helps codegen be better since the calls are known leaf calls, and it interacts better with -ffunction-sections.

andrewrk · 2025-01-18T22:39:04Z

The commit f38d7a9 makes the memcpy implementation much more load-bearing, so it's important to rebase on top of latest master.

Can you update your perf measurements against latest master?

Can you take data points on aarch64 as well as x86_64?

dweiller · 2025-01-19T01:36:18Z

The commit f38d7a9 makes the memcpy implementation much more load-bearing, so it's important to rebase on top of latest master.

Can you update your perf measurements against latest master?

Can you take data points on aarch64 as well as x86_64?

I'll update and re-measure, but I don't have a aarch64 machine to benchmark on, I'll need someone else willing to run the benchmarks.

RossComputerGuy · 2025-01-19T01:55:33Z

but I don't have a aarch64 machine to benchmark on, I'll need someone else willing to run the benchmarks.

I volunteer, I've got 2 aarch64 systems and even a RISC-V system that can be used.

The new memcpy function aims to be more generic than the previous implementation which was adapted from an implementation optimized for x86_64 avx2 machines. Even on x86_64 avx2 machines this implementation should be generally be faster due to fewer branches in the small length cases and generating less machine code. Note that the new memcpy function no longer acts as a memmove.

dweiller · 2025-01-20T13:44:03Z

I have rebased and updated the OP to reflect the current state of the PR. Note that this PR now (i.e. commit b7a887f) splits memmove and memcpy implementations again, and also adds a memmove implementation for ReleaseSmall - if this PR doesn't get merged before the 0.14 release we will want to break that out into a separate change and merge that for 0.14.

but I don't have a aarch64 machine to benchmark on, I'll need someone else willing to run the benchmarks.

I volunteer, I've got 2 aarch64 systems and even a RISC-V system that can be used.

The benchmark I use in available at https://github.com/dweiller/zig-lib-bench if you can run the 'distrib' benchmark that would be great. If you have any issues feel free to raise issues there. It should be usable as is, but if you give me a few days I can try and make it a bit easier to use.

jacobly0 · 2025-01-20T19:15:18Z

For x86_64 machines that have FSRM it may be better to rep movsb for all/most cases (I don't have a suitable machine to test this)

Note that FSRM does not imply ERMSB. FSRM just means that you can use rep movsb for a single preceding or trailing unaligned segment.

jacobly0 · 2025-01-20T19:49:12Z

I'll update and re-measure, but I don't have a aarch64 machine to benchmark on, I'll need someone else willing to run the benchmarks.

Between ZSF and I, we have at least skylake, zen2, zen3, zen4, neoverse_n1, and apple_a15. If you provide specific instructions for producing a file for you to interpret and what cpus you need, I can do so.

andrewrk · 2025-01-20T20:00:03Z

Thanks for the rebase and the re-writeup. I'm now happy with this as-is. ARM perf can be a followup investigation.

Appreciate your patience with this one, @dweiller.

dweiller · 2025-01-21T04:14:37Z

I'm not sure what the CI failure on aarch64-windows is - as far as I see in the CI log there is just a exit 5 in the f32_passed_to_variadic_fn test case with no indicated cause. I don't know how variadic functions work under the hood - would it be using memcpy to pass the arguments?

andrewrk · 2025-01-21T04:19:17Z

I don't have any working explanation for that failure at the moment, although unfortunately almost anything could be explained by the bug that #22541 is addressing.

With 216e0f3 landed, once that PR is also landed I think we should be looking at a significantly more reliable CI.

dweiller · 2025-01-21T04:20:33Z

Ok - I'll just wait for that PR and rebase when it lands.

jacobly0 · 2025-01-25T06:08:19Z

lib/compiler_rt/memmove.zig

+        const first  = @as(*align(1) const @Vector(32, u8), @ptrCast(s - 32)).*;
+        const second = @as(*align(1) const @Vector(32, u8), @ptrCast(s - 64)).*;
+        const third  = @as(*align(1) const @Vector(32, u8), @ptrCast(s - 96)).*;
+        const fourth = @as(*align(1) const @Vector(32, u8), @ptrCast(s - 128)).*;
+
+        @as(*align(32) @Vector(32, u8), @alignCast(@ptrCast(d - 32))).*  = first;
+        @as(*align(32) @Vector(32, u8), @alignCast(@ptrCast(d - 64))).*  = second;
+        @as(*align(32) @Vector(32, u8), @alignCast(@ptrCast(d - 96))).*  = third;
+        @as(*align(32) @Vector(32, u8), @alignCast(@ptrCast(d - 128))).* = fourth;


This code is not valid, as there is no guarantee that @sizeOf(@Vector(32, u8)) == 32 for every target. Perhaps you meant [32]u8 instead of @Vector(32, u8) since @sizeOf([32]u8) == 32 is guaranteed for every target.

That was the pre-existing memmove - I just moved it. I've got a branch with a memmove implementation based on the memcpy from this PR which wouldn't have that issue, but haven't had time to do benchmarking for it yet.

I suppose the memcpy implementation in this PR may have a similar issue if std.simd.suggestVectorLength does not return a suitable length.

jacobly0 · 2025-01-25T14:01:30Z

lib/compiler_rt/memcpy.zig

+    assert(@sizeOf(Element) >= @alignOf(Element));
+    assert(std.math.isPowerOfTwo(@sizeOf(Element)));


All types have a size that is a multiple of its alignment, therefore the first condition isn't checking for anything. If you need a power of two, just use std.math.floorPowerOfTwo or something.

jacobly0 · 2025-01-25T14:04:54Z

lib/compiler_rt/memcpy.zig

+    @Type(.{ .vector = .{
+        .child = u8,
+        .len = vec_size,
+    } })


Suggested change

@Type(.{ .vector = .{

.child = u8,

.len = vec_size,

} })

@Vector(vec_size, u8)

lib/compiler_rt/memcpy.zig

dweiller force-pushed the memcpy-opt branch from 16f1c02 to 4cd6690 Compare February 13, 2024 14:37

dweiller force-pushed the memcpy-opt branch 3 times, most recently from 81a4c1c to 076afc6 Compare February 19, 2024 03:08

andrewrk mentioned this pull request Feb 28, 2024

Improve inflate performance for repetitive data #19113

Merged

dweiller force-pushed the memcpy-opt branch from b98565b to 9a82138 Compare March 15, 2024 12:14

andrewrk force-pushed the memcpy-opt branch from 9a82138 to a50599d Compare May 9, 2024 23:46

dweiller force-pushed the memcpy-opt branch from a50599d to 8121f4d Compare May 11, 2024 04:35

dweiller force-pushed the memcpy-opt branch from 8121f4d to 445438a Compare May 12, 2024 14:54

dweiller mentioned this pull request Jan 17, 2025

AArch64: LLVM auto-vectorization of memcpy causes alignment faults with 8-byte aligned addresses #22491

Open

andrewrk requested changes Jan 17, 2025

View reviewed changes

lib/compiler_rt/memcpy.zig Outdated Show resolved Hide resolved

lib/compiler_rt/memcpy.zig Outdated Show resolved Hide resolved

lib/compiler_rt/memcpy.zig Outdated Show resolved Hide resolved

lib/compiler_rt/memcpy.zig Outdated Show resolved Hide resolved

dweiller force-pushed the memcpy-opt branch from 9b8c38e to c3a3c30 Compare January 17, 2025 15:11

dweiller added 3 commits January 20, 2025 17:16

compiler-rt: move memmove back to memmove.zig

bf89482

compiler-rt: reduce memmove and memcpy size for ReleaseSmall

d6e1166

dweiller force-pushed the memcpy-opt branch from c3a3c30 to b7a887f Compare January 20, 2025 13:24

andrewrk enabled auto-merge January 20, 2025 20:00

andrewrk added the release notes This PR should be mentioned in the release notes. label Jan 20, 2025

andrewrk merged commit 18fcb3b into ziglang:master Jan 21, 2025
10 checks passed

dweiller deleted the memcpy-opt branch January 22, 2025 02:17

jacobly0 reviewed Jan 25, 2025

View reviewed changes

dweiller mentioned this pull request Jan 25, 2025

compiler-rt: memmove optimisation #22606

Open

2 tasks

marnix mentioned this pull request Jan 25, 2025

STM32 embedded debug binaries much larger with 0.14.0-dev.2851+b074fb7dd #22603

Open

jacobly0 reviewed Jan 25, 2025

View reviewed changes

dweiller mentioned this pull request Jan 26, 2025

compiler-rt memcpy followup #22615

Merged

		assert(@sizeOf(Element) >= @alignOf(Element));
		assert(std.math.isPowerOfTwo(@sizeOf(Element)));

optimized memcpy #18912

optimized memcpy #18912

Conversation

dweiller commented Feb 13, 2024 • edited Loading

Rexicon226 commented Feb 13, 2024

dweiller commented Feb 13, 2024

Rexicon226 commented Feb 13, 2024

dweiller commented Feb 13, 2024 • edited Loading

dweiller commented Feb 13, 2024

dweiller commented Feb 13, 2024 • edited Loading

dweiller commented Feb 13, 2024 • edited Loading

Jarred-Sumner commented Feb 13, 2024 • edited Loading

dweiller commented Feb 14, 2024

dweiller commented Feb 14, 2024 • edited Loading

dweiller commented Feb 15, 2024 • edited Loading

dweiller commented Feb 19, 2024 • edited Loading

RossComputerGuy commented Feb 19, 2024

dweiller commented Mar 15, 2024 • edited Loading

kprotty commented May 10, 2024 • edited Loading

dweiller commented May 10, 2024 • edited Loading

kprotty commented May 10, 2024

dweiller commented May 10, 2024 • edited Loading

andrewrk commented May 10, 2024

dweiller commented May 11, 2024 • edited Loading

dweiller commented May 11, 2024

dweiller commented Nov 30, 2024 • edited Loading

dweiller commented Dec 15, 2024 • edited Loading

andrewrk left a comment

Choose a reason for hiding this comment

dweiller commented Jan 17, 2025

andrewrk commented Jan 17, 2025

andrewrk commented Jan 18, 2025

dweiller commented Jan 19, 2025

RossComputerGuy commented Jan 19, 2025

dweiller commented Jan 20, 2025 • edited Loading

jacobly0 commented Jan 20, 2025 • edited Loading

jacobly0 commented Jan 20, 2025

andrewrk commented Jan 20, 2025

dweiller commented Jan 21, 2025 • edited Loading

andrewrk commented Jan 21, 2025

dweiller commented Jan 21, 2025

jacobly0 Jan 25, 2025

Choose a reason for hiding this comment

dweiller Jan 25, 2025

Choose a reason for hiding this comment

dweiller Jan 25, 2025

Choose a reason for hiding this comment

jacobly0 Jan 25, 2025

Choose a reason for hiding this comment

jacobly0 Jan 25, 2025

Choose a reason for hiding this comment

dweiller commented Feb 13, 2024 •

edited

Loading

dweiller commented Feb 13, 2024 •

edited

Loading

dweiller commented Feb 13, 2024 •

edited

Loading

dweiller commented Feb 13, 2024 •

edited

Loading

Jarred-Sumner commented Feb 13, 2024 •

edited

Loading

dweiller commented Feb 14, 2024 •

edited

Loading

dweiller commented Feb 15, 2024 •

edited

Loading

dweiller commented Feb 19, 2024 •

edited

Loading

dweiller commented Mar 15, 2024 •

edited

Loading

kprotty commented May 10, 2024 •

edited

Loading

dweiller commented May 10, 2024 •

edited

Loading

dweiller commented May 10, 2024 •

edited

Loading

dweiller commented May 11, 2024 •

edited

Loading

dweiller commented Nov 30, 2024 •

edited

Loading

dweiller commented Dec 15, 2024 •

edited

Loading

dweiller commented Jan 20, 2025 •

edited

Loading

jacobly0 commented Jan 20, 2025 •

edited

Loading

dweiller commented Jan 21, 2025 •

edited

Loading