Horner's rule for polynomial evaluation with symmetry idea discussed in discussion #461 #477

Merged
merged 20 commits on Jul 17, 2024

Conversation

lu1and10
Member

@lu1and10 lu1and10 commented Jul 8, 2024

Following @mreineck's idea in discussion #461, this PR uses the symmetry trick together with SIMD evaluation for kernel widths greater than the SIMD vector length.

For widths less than or equal to the SIMD vector length, one would instead need to batch non-uniform points and evaluate several kernels at once, as in PR #467: this PR does 1 x w polynomial kernel evaluations, whereas in PR #467 one may do 2 x w, 4 x w, etc. (2 and 4 being the number of non-uniform points batched per polynomial evaluation).
Interleaving 2D/3D polynomial evaluations to save register loads of coefficients (discussed in #461) is not done in this PR.

This PR only tries out the symmetry trick on the current master branch code.
There are two versions to test:

  1. eval_kernel_vec_Horner uses aligned stores, which needs some SIMD shuffles of two SIMD vectors before the aligned store.
  2. eval_kernel_vec_Horner_unaligned_store comes from @mreineck's code; it does the SIMD vector calculations and then loops over the SIMD vector elements to store them.

In my limited tests, versions 1 and 2 are of similar speed in most cases, while on the AMD Genoa CPU nodes version 1 seems to be faster than version 2. @mreineck, do you have any idea about the speed of aligned stores from when you implemented the symmetry trick?
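
A minimal scalar sketch of the symmetry idea (illustrative only; the actual implementation works on SIMD vectors, runs Horner in z², and handles the odd-width tail separately):

```cpp
// Illustrative sketch (not FINUFFT code). Assumptions: coeffs[j][i] holds the
// coefficient of z^(nc-1-j) for kernel node i (highest degree first), the
// mirrored node w-1-i is the same polynomial evaluated at -z, and w is even
// (odd w needs the middle node handled separately, like the tail logic in the PR).
#include <vector>

void eval_kernel_symmetric(double z, double *ker, int w, int nc,
                           const std::vector<std::vector<double>> &coeffs) {
  for (int i = 0; i < w / 2; ++i) {
    // accumulate even-degree and odd-degree terms separately
    double even_part = 0.0, odd_part = 0.0, zpow = 1.0; // zpow = z^deg
    for (int deg = 0; deg < nc; ++deg) {
      const double c = coeffs[nc - 1 - deg][i];
      if (deg % 2 == 0) even_part += c * zpow;
      else              odd_part  += c * zpow;
      zpow *= z;
    }
    ker[i]         = even_part + odd_part; // p_i(z):  left half
    ker[w - 1 - i] = even_part - odd_part; // p_i(-z): mirrored right half
  }
}
```

Each left/right pair shares one pass over the coefficients and only half of the coefficient table is needed, which is where the saving comes from.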

@mreineck
Contributor

mreineck commented Jul 8, 2024

I wouldn't go to the trouble of exploiting this when kernel support is less than the SIMD vector length.

Overall I haven't benchmarked this component much, since it is almost always completely dominated by the application of the kernel to the uniform data (kernel generation is O(support*ndim), spreading/interpolation is O(support**ndim)). If it is noticeable at all, it will be for 1D transforms. The only real reason I implemented this is that it reduces the size of the Horner coefficient array.
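
For scale, plugging typical numbers into that estimate (a rough illustrative count, not a measurement): with support w = 8 in d = 3 dimensions, kernel generation costs on the order of w*d = 24 polynomial evaluations per non-uniform point, while spreading/interpolation touches w^d = 512 uniform grid values per point, so kernel evaluation is a small fraction of the work except in 1D.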

@lu1and10
Member Author

lu1and10 commented Jul 8, 2024

> I wouldn't go to the trouble of exploiting this when kernel support is less than the SIMD vector length.
>
> Overall I haven't benchmarked this component much, since it is almost always completely dominated by the application of the kernel to the uniform data (kernel generation is O(support*ndim), spreading/interpolation is O(support**ndim)). If it is noticeable at all, it will be for 1D transforms. The only real reason I implemented this is that it reduces the size of the Horner coefficient array.

Yes, it mostly matters in 1D, so I mostly test the speed in 1D. Without the symmetry trick, the original code on AMD Genoa takes about 2 s total for 1D spreading at width 15 in my test; method 2 improves this to 1.8 s and method 1 to 1.6 s. That is roughly a 10% difference, which is why I'm interested in understanding the aligned store and wondering whether @mreineck sees the same 10% difference with aligned stores on AMD Genoa. I don't see this difference on my old Intel Xeon CPU. It seems that unaligned and aligned stores only differ in speed when the unaligned store ends up touching two cache lines? The measurements use a single core with a single thread. I like eval_kernel_vec_Horner_unaligned_store more, it's neat, but I'm a bit bothered by the 10% difference in 1D on the AMD Genoa CPU.
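
As a rough illustration of the cache-line point (a hypothetical standalone check, not part of the PR; it assumes 64-byte cache lines and a full 64-byte AVX-512 store):

```cpp
#include <cstddef>
#include <cstdint>
#include <cstdio>

// An unaligned store spans two cache lines exactly when its first and last
// byte fall into different 64-byte lines.
constexpr std::size_t kCacheLine = 64; // bytes, typical for x86-64
constexpr std::size_t kVecBytes  = 64; // one AVX-512 vector

bool crosses_cache_line(std::uintptr_t addr) {
  return (addr / kCacheLine) != ((addr + kVecBytes - 1) / kCacheLine);
}

int main() {
  for (std::uintptr_t off = 0; off < 128; off += 8)
    std::printf("offset %3u: %s\n", static_cast<unsigned>(off),
                crosses_cache_line(off) ? "two cache lines" : "one cache line");
}
```

With a full-width vector, every store that is not 64-byte aligned straddles two lines, so the split-store penalty would show up far more often than with narrower (AVX2) vectors.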

@mreineck
Contributor

mreineck commented Jul 8, 2024

Sorry, I only have access to slightly older hardware here and have never tested the code on AVX512, which is what you have on Genoa, as far as I can see.

I fear that doing really exhaustive tests (Intel/AMD, AVX, AVX2, AVX512, single-thread/parallel, etc.) will not show a clear winner in all cases, and having many separate code paths for different architectures is bad for maintainability, so I personally try not to spend too much time on this kind of detail.

@lu1and10
Member Author

lu1and10 commented Jul 8, 2024

> and having many separate code paths for different architectures is bad for maintainability

Yes, that's why I want to choose only one of the two methods to merge into master.

@lu1and10
Member Author

lu1and10 commented Jul 9, 2024

Genoa (AVX-512), gcc11, M=1e7, N=1e6:
[benchmark plots: genoa-gcc11_nthr1_spread_M1.0e7_N1.0e6, genoa-gcc11_nthr1_interp_M1.0e7_N1.0e6]

Rome (AVX2), gcc11, M=1e7, N=1e6:
[benchmark plots: rome-gcc11_nthr1_spread_M1.0e7_N1.0e6, rome-gcc11_nthr1_interp_M1.0e7_N1.0e6]

@DiamonDinoia DiamonDinoia self-requested a review July 9, 2024 19:48

// process simd vecs
for (uint8_t i = 0; i < n_eval; i += simd_size) {
  auto k_odd = if_odd_degree ? simd_type::load_aligned(padded_coeffs[0].data() + i) : zerov;
Collaborator

this should be a constexpr lambda.

@ahbarnett
Collaborator

ahbarnett commented Jul 11, 2024

@lu1and10 Libin, are we ready to merge this?

@lu1and10
Member Author

> @lu1and10 Libin, are we ready to merge this?

I'm rerunning your performance script since the master branch has changed. Do you want to double-check by running your Julia performance script to make sure everything is OK?

@lu1and10
Member Author

lu1and10 commented Jul 11, 2024

@ahbarnett below is the new benchmark with the Julia bench script, run after master merged the interp PR and the new compiler flags.

Compared to the plots above,

For AVX2, the spreader speed in both master and this PR is no longer linear in w, possibly because of the new flags (the master spreader code has not changed). With this PR there is a slowdown at w == 8 on Rome (AVX2, i.e. with the new flags the kernel symmetry does not help the 4-double SIMD width at w == 8); it seems hurt by the new flags, since without them w == 8 is not slower than master. 1D interp shows more speedup at large w with kernel symmetry after master merged the interp PR.

For AVX-512, the results seem less affected by the compiler flags.

Genoa (AVX-512), gcc11, M=1e7, N=1e6:
[benchmark plots: genoa-gcc11_nthr1_spread_M1.0e7_N1.0e6, genoa-gcc11_nthr1_interp_M1.0e7_N1.0e6]

Rome (AVX2), gcc11, M=1e7, N=1e6:
[benchmark plots: rome-gcc11_nthr1_spread_M1.0e7_N1.0e6, rome-gcc11_nthr1_interp_M1.0e7_N1.0e6]

simd_type k_odd, k_even, k_prev, k_sym = zerov;
for (uint8_t i = 0, offset = w - tail; i < (w + 1) / 2;
     i += simd_size, offset -= simd_size) {
  k_odd = if_odd_degree ? simd_type::load_aligned(padded_coeffs[0].data() + i)

@DiamonDinoia DiamonDinoia Jul 12, 2024


this should be:

k_odd = [i]() constexpr noexcept {
    if constexpr (if_odd_degree) {
        return simd_type::load_aligned(padded_coeffs[0].data() + i);
    } else {
        return simd_type{0};
    }
}();

const auto cji_odd = simd_type::load_aligned(padded_coeffs[j].data() + i);
k_odd = xsimd::fma(k_odd, z2v, cji_odd);
const auto cji_even =
    simd_type::load_aligned(padded_coeffs[j + 1].data() + i);
Collaborator

Is [j + 1] safe here? The loop goes up to nc; shouldn't it go up to (nc - 1) or (nc & ~1)?
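
(For reference, a small standalone check of just the index arithmetic, under the assumption that if_odd_degree = (nc + 1) % 2 as in the quoted code: the start index 1 + if_odd_degree has the same parity as nc, so the largest j visited is nc - 2 and padded_coeffs[j + 1] stays in range.)

```cpp
#include <cassert>

// Hypothetical check mirroring only the loop's index pattern, not the data.
int main() {
  for (int nc = 2; nc <= 64; ++nc) {
    const int if_odd_degree = (nc + 1) % 2;
    for (int j = 1 + if_odd_degree; j < nc; j += 2)
      assert(j + 1 <= nc - 1); // padded_coeffs[j + 1] would be a valid row
  }
  return 0;
}
```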

src/spreadinterp.cpp (outdated review comment, resolved)
if constexpr (use_ker_sym) {
  static constexpr uint8_t tail          = w % simd_size;
  static constexpr uint8_t if_odd_degree = ((nc + 1) % 2);
  static const simd_type zerov(0.0);
Collaborator

In my experiments, returning {0} is often faster than saving it as a static const.

Comment on lines 834 to 871
static constexpr auto reverse_batch =
    xsimd::make_batch_constant<xsimd::as_unsigned_integer_t<FLT>, arch_t,
                               reverse_index<simd_size>>();

// process simd vecs
for (uint8_t i = 0, offset = w - simd_size; i < w / 2;
     i += simd_size, offset -= simd_size) {
  auto k_odd  = if_odd_degree
                    ? simd_type::load_aligned(padded_coeffs[0].data() + i)
                    : zerov;
  auto k_even = simd_type::load_aligned(padded_coeffs[if_odd_degree].data() + i);
  for (uint8_t j = 1 + if_odd_degree; j < nc; j += 2) {
    const auto cji_odd = simd_type::load_aligned(padded_coeffs[j].data() + i);
    k_odd = xsimd::fma(k_odd, z2v, cji_odd);
    const auto cji_even =
        simd_type::load_aligned(padded_coeffs[j + 1].data() + i);
    k_even = xsimd::fma(k_even, z2v, cji_even);
  }
  // left part
  xsimd::fma(k_odd, zv, k_even).store_aligned(ker + i);
  // right part symmetric to the left part
  if (offset >= w / 2) {
    // reverse the order for symmetric part
    xsimd::swizzle(xsimd::fma(k_odd, -zv, k_even), reverse_batch)
        .store_aligned(ker + offset);
  }
}
}
} else {
  const simd_type zv(z);

  for (uint8_t i = 0; i < w; i += simd_size) {
    auto k = simd_type::load_aligned(padded_coeffs[0].data() + i);
    for (uint8_t j = 1; j < nc; ++j) {
      const auto cji = simd_type::load_aligned(padded_coeffs[j].data() + i);
      k = xsimd::fma(k, zv, cji);
    }
    k.store_aligned(ker + i);
Collaborator

There is quite a lot of repetition; I think you could do something like w/2 + (tail > 0) to merge most of the code.

Collaborator

Could you give more of a hint about what you're suggesting here? Or maybe Libin sees it...

@ahbarnett
Collaborator

@lu1and10 it would be great to incorporate Marco's review and have this merged for the meeting tomorrow...

@lu1and10
Member Author

> @lu1and10 it would be great to incorporate Marco's review and have this merged for the meeting tomorrow...

I made the changes I understand; there are some review comments I don't understand, especially the possibly unsafe loop @DiamonDinoia mentioned (#477 (comment)). We should discuss it tomorrow; if it is unsafe then there is a bug.

@ahbarnett
Collaborator

ahbarnett commented Jul 16, 2024 via email

@DiamonDinoia DiamonDinoia mentioned this pull request Jul 17, 2024
@DiamonDinoia DiamonDinoia added this to the 3.0 milestone Jul 17, 2024
@DiamonDinoia DiamonDinoia merged commit 4ea0096 into flatironinstitute:master Jul 17, 2024
34 checks passed
@lu1and10
Member Author

lu1and10 commented Jul 18, 2024

Oh no. @DiamonDinoia, could you please keep the commit history when merging this PR into master? There are some commits I want to try later and keep in master (for example Martin's unaligned-store version). Something similar happened with PR #429, where more commits could have been merged but did not make it into master. I have been careful to keep the commits as few and clean as possible so that they would not need to be squashed.

@lu1and10
Member Author

> Oh no. @DiamonDinoia, could you please keep the commit history when merging this PR into master? There are some commits I want to try later and keep in master (for example Martin's unaligned-store version). Something similar happened with PR #429, where more commits could have been merged but did not make it into master. I have been careful to keep the commits as few and clean as possible so that they would not need to be squashed.

OK @DiamonDinoia, done in #492 and #493; no need to worry about this, thanks.
