Feature/permute rebase13 #18
base: master
…, and single prec permute in avx.h
Guarded permute in simd_common.h, as it won't work for GPU architectures.
FIXME: need better guards for Clang + GPU targets.
Fixed: any_of / all_of in HIP backend.
Fixed: potential mask overflow in cuda_warp.
I am wondering about the constructors that take a subset of elements. Since these exist specifically for permute vectors, maybe a non-member function to create a permute vector would be better? I would actually prefer an interface where the permute takes a permute integer simd type of the same length as the simd value type, but I take it from you that that is too expensive, since you have to convert it internally? We need to think about what kind of interface would be acceptable to the C++ standard.
I can make a make_permute() function call. That is a nice way of solving that oddball constructor issue.
I do actually use simd<int, simd_abi::whatever> for the control, primarily for efficiency on GPUs where you
want to spend maybe one register per mask, not 32. So it is filled ‘via’ simd_storage on the host where you give
the full vector length permute. But the actual mask is a simd<int,…> like you desire.
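
As a rough illustration of the non-member factory idea, here is a minimal scalar sketch. The names `toy_simd` and `toy_permute_control`, and the signatures of `make_permute` and `permute` shown here, are hypothetical stand-ins and not the library's actual interface. It also assumes gather-style semantics, where output lane i reads the input lane named by index i.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Hypothetical stand-in for a simd value type with N lanes.
template <class T, std::size_t N>
struct toy_simd {
  std::array<T, N> lanes;
};

// Hypothetical stand-in for the permute control (one int per lane here;
// a GPU backend could store a single int per thread instead).
template <std::size_t N>
struct toy_permute_control {
  std::array<int, N> index;
};

// Non-member factory: builds the control from a full-length index list,
// so the simd type itself needs no special "subset" constructor.
template <std::size_t N>
toy_permute_control<N> make_permute(const std::array<int, N>& source_lane) {
  toy_permute_control<N> control;
  for (std::size_t i = 0; i < N; ++i) {
    // Every index must name a valid lane.
    assert(source_lane[i] >= 0 &&
           static_cast<std::size_t>(source_lane[i]) < N);
    control.index[i] = source_lane[i];
  }
  return control;
}

// Assumed gather-style permute: output lane i reads input lane index[i].
template <class T, std::size_t N>
toy_simd<T, N> permute(const toy_permute_control<N>& control,
                       const toy_simd<T, N>& value) {
  toy_simd<T, N> result;
  for (std::size_t i = 0; i < N; ++i) {
    result.lanes[i] = value.lanes[static_cast<std::size_t>(control.index[i])];
  }
  return result;
}
```

With an index list of {3, 2, 1, 0}, this sketch reverses a four-lane value. A backend-specific version would replace the loops with the matching intrinsic or shuffle call.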
> We need to think about what kind of interface would be acceptable to the C++ standard.
Indeed, I never actually dared to think that far ahead.
Have a look here:
for AVX512:
https://github.com/bjoo/simd-math-testing/blob/master/test/avx512-tests/test_simd_avx512_permute.cpp
and for e.g. CUDA:
https://github.com/bjoo/simd-math-testing/blob/master/test/cuda-tests/test_simd_cuda_permute.cpp
I’ll go add the make_permute() function…
Best,
B
-----------------------------------------------------------------------------
Dr Balint Joo High Performance Computational Scientist
Jefferson Lab
12000 Jefferson Ave Suite 3, MS 12B2, Room F217
Newport News VA 23606, USA
Tel: +1-757-269-5339 email: bjoo AT jlab.org
-----------------------------------------------------------------------------
…X512 Removed weird simd<int,> initialization for AVX512 -- essentially the zeros get put into a simd_t::storage_type by make_permute now.
Hi, as suggested by @crtrott, I have modified the interface by adding
Actually, another thing: now that this is kind of traits-y, so we have
there is nothing in principle to prevent making the 'int mask' a compile-time template parameter, i.e.:
where, however, for warp sizes of 64 that could get unwieldy.
NB: SYCL has its own way of doing shuffles with its
switch( control_value ) {
This gets pretty tedious to write. Ditto for AVX2 double-precision shuffles.
Let me know if you need anything else added to this.
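
To make the compile-time template parameter idea concrete, here is a small sketch that works on a plain `std::array` rather than the library's simd types. The function name `permute_static` and its signature are hypothetical and are not taken from this PR.

```cpp
#include <array>
#include <cstddef>

// Hypothetical sketch: the source lane for each output lane is a template
// argument, so the whole permutation is known at compile time.
template <int... SourceLane, class T, std::size_t N>
std::array<T, N> permute_static(const std::array<T, N>& value) {
  static_assert(sizeof...(SourceLane) == N,
                "exactly one source index is required per lane");
  // Pack expansion: output lane i reads value[SourceLane_i].
  return {{value[static_cast<std::size_t>(SourceLane)]...}};
}
```

A call such as `permute_static<3, 2, 1, 0>(v)` reverses four lanes. The same call for a 64-wide warp would need 64 template arguments, which is the unwieldiness noted above.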
Hi All, Here are the additions for vector lane permute. Intrinsics for AVX (Single Prec), AVX512 (single prec and double prec), Generic for other CPUs, __shfl_sync for CUDA, __shfl for HIP.
I also came across a possible overflow in the CUDA masks when N=32 (I got a compiler warning), so I made that go away.
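
For readers unfamiliar with this class of bug, the sketch below shows one common way a 32-lane mask can overflow and one way to avoid it. It assumes the mask was built with a shift of the form `(1u << N) - 1`; the PR text does not show the original expression, and the helper name `lane_mask` is hypothetical.

```cpp
#include <cstdint>

// Hypothetical helper: returns a mask with the low `lanes` bits set.
// Shifting a 32-bit value by 32 is undefined behaviour in C++, so the
// full-width case is handled explicitly instead of computing (1u << 32) - 1.
constexpr std::uint32_t lane_mask(unsigned lanes) {
  return lanes >= 32u ? 0xFFFFFFFFu
                      : ((std::uint32_t{1} << lanes) - 1u);
}

// The full-warp case is checked at compile time.
static_assert(lane_mask(32u) == 0xFFFFFFFFu, "full mask for 32 lanes");
```

An alternative with the same result is to do the shift in a 64-bit type and truncate afterwards.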
A couple of FIXMEs are still left in there, as well as an oddity for AVX512: the permutexvar intrinsic looks at every second element of the permutation index table register, and the intervening elements are zeroed out, hence the funny 8-element constructor for the simd<int,> type. It may be worth renaming these to something like simd_permute_control_t in the future.
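
The "every second element" layout can be pictured without any intrinsics. In the double-precision case the index register holds eight 64-bit elements, so when it is viewed as sixteen 32-bit ints, each index sits in an even slot with a zero next to it. The sketch below assumes a little-endian target such as x86; the helper names `widen_permute_indices` and `as_int64_lanes` are hypothetical.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical helper: spread 8 lane indices over 16 int32 slots, with
// each index in an even slot and a zero in the following odd slot.
inline std::array<std::int32_t, 16> widen_permute_indices(
    const std::array<std::int32_t, 8>& source_lane) {
  std::array<std::int32_t, 16> slots{};  // value-initialised to all zeros
  for (std::size_t i = 0; i < 8; ++i) {
    slots[2 * i] = source_lane[i];
  }
  return slots;
}

// Reinterpret the 16 int32 slots as 8 int64 elements, which is how a
// 64-bit-element permute would read them (little-endian assumed).
inline std::array<std::int64_t, 8> as_int64_lanes(
    const std::array<std::int32_t, 16>& slots) {
  std::array<std::int64_t, 8> lanes{};
  static_assert(sizeof(lanes) == sizeof(slots), "same total size");
  std::memcpy(lanes.data(), slots.data(), sizeof(lanes));
  return lanes;
}
```

Read back as 64-bit elements, the widened table yields the original eight indices, which is why an 8-element initializer is enough to fill a 16-int register in this layout.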
This was rebased onto master after #13 was merged.
The testing harness is at: https://github.com/bjoo/simd-math-testing.git and shows how the permutes are being called.