
Lapack matrices #329

Closed · wants to merge 36 commits into from

Conversation

@blegouix (Collaborator) commented Mar 6, 2024

Add new matrices necessary for the LAPACK backend available in #274

Matrix classes (including matrix_sparse) support strided multiple-rhs.

@blegouix self-assigned this Mar 6, 2024
@tpadioleau (Member) left a comment

Review of matrix.hpp

Resolved review threads (4) on include/ddc/kernels/splines/matrix.hpp
@tpadioleau (Member) left a comment

Review of matrix_banded.hpp

Resolved review thread on include/ddc/kernels/splines/matrix_banded.hpp
Member

From what I understand of this matrix, it only uses the GPU as a storage place. No computation is done on the GPU, right? If that is the case, there is no need to store data on the GPU. Keep the template to indicate where the input data of solve_inplace_method will be located, but use host data directly for the LAPACK solver.

To summarize, the diff with the original implementation should be small. I suggest (see the sketch after this list):

  • replace the unique_ptr with a host View or a DualView
  • if you use a DualView, get_element and set_element should work only on the host view
  • if you use a DualView, factorize should also synchronize the device view
  • after calling factorize, get_element and set_element should not be called anymore
  • in solve_inplace_method, make a host copy of b and use it in the LAPACK function
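
A minimal sketch of the DualView variant, to make the suggestion concrete (illustrative only, not the PR's code; the class name and member set are hypothetical, only Kokkos::DualView and its modify/sync API are real):

#include <Kokkos_DualView.hpp>

template <class ExecSpace>
class MatrixSketch
{
    // One DualView instead of a unique_ptr: paired host and device copies of q.
    Kokkos::DualView<double*, typename ExecSpace::memory_space> m_q;

public:
    explicit MatrixSketch(int const size) : m_q("q", size) {}

    // get_element/set_element work only on the host view.
    double get_element(int const i) const
    {
        return m_q.h_view(i);
    }

    void set_element(int const i, double const aij)
    {
        m_q.h_view(i) = aij;
        m_q.modify_host(); // mark the host copy as the most recent one
    }

    // factorize also synchronizes the device view; after this call,
    // get_element/set_element should not be used anymore.
    void factorize()
    {
        // ... LAPACK factorization on m_q.h_view.data() ...
        m_q.sync_device();
    }
};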

@blegouix (Collaborator, Author) commented Mar 8, 2024

But I wanted the Kokkos Kernels branch to be as small as possible to facilitate Yuuichi's work, so this means I would need yet another MR to move the matrix to the GPU...? Can't we just keep it simple with the three MRs I have already opened? The LAPACK solver calls on CPU are just placeholders.

I don't like DualView very much, as in the end we need all the data on the GPU anyway, because that's where we will perform the operations (except factorization, for all matrices but sparse).

Abdelhadi also has his matrix on the GPU and fills it directly on the GPU (because his matrix changes and he needs performance), so using DualViews or a host view would definitely break the alignment between his class and mine.

Member

For now, the simpler approach should also be the more performant one. I strongly advise we take it; here is the code:

template <class ExecSpace>
class Matrix_Banded : public Matrix
{
protected:
    int const m_kl; // no. of subdiagonals
    int const m_ku; // no. of superdiagonals
    int const m_c; // no. of columns in q
    Kokkos::View<int*, Kokkos::HostSpace> m_ipiv; // pivot indices
    // TODO: double**
    Kokkos::View<double*, Kokkos::HostSpace> m_q; // banded matrix representation

public:
    Matrix_Banded(int const mat_size, int const kl, int const ku)
        : Matrix(mat_size)
        , m_kl(kl)
        , m_ku(ku)
        , m_c(2 * kl + ku + 1)
        , m_ipiv("ipiv", mat_size)
        , m_q("q", m_c * mat_size)
    {
        assert(mat_size > 0);
        assert(m_kl >= 0);
        assert(m_ku >= 0);
        assert(m_kl <= mat_size);
        assert(m_ku <= mat_size);

        /*
         * Given the linear system A*x=b, we assume that A is a square (n by n)
         * with ku super-diagonals and kl sub-diagonals.
         * All non-zero elements of A are stored in the rectangular matrix q, using
         * the format required by DGBTRF (LAPACK): diagonals of A are rows of q.
         * q has 2*kl rows for the subdiagonals, 1 row for the diagonal, and ku rows
         * for the superdiagonals. (The kl additional rows are needed for pivoting.)
         * The term A(i,j) of the full matrix is stored in q(i-j+2*kl+1,j).
         */

        Kokkos::deep_copy(m_q, 0.);
    }

    double get_element(int const i, int const j) const override
    {
        if (i >= std::max(0, j - m_ku) && i < std::min(get_size(), j + m_kl + 1)) {
            return m_q(j * m_c + m_kl + m_ku + i - j);
        } else {
            return 0.0;
        }
    }

    void set_element(int const i, int const j, double const aij) override
    {
        if (i >= std::max(0, j - m_ku) && i < std::min(get_size(), j + m_kl + 1)) {
            m_q(j * m_c + m_kl + m_ku + i - j) = aij;
        } else {
            assert(std::fabs(aij) < 1e-20);
        }
    }

protected:
    int factorize_method() override
    {
        int info;
        int const n = get_size();
        dgbtrf_(&n, &n, &m_kl, &m_ku, m_q.data(), &m_c, m_ipiv.data(), &info);
        return info;
    }

    int solve_inplace_method(ddc::DSpan2D_stride b, char const transpose) const override
    {
        assert(b.stride(0) == 1);
        int const n_equations = b.extent(1);
        int const stride = b.stride(1);

        // Wrap the (possibly device-resident) rhs in an unmanaged strided view,
        // then copy each rhs column to a host mirror for the LAPACK call.
        Kokkos::View<double**, Kokkos::LayoutStride, typename ExecSpace::memory_space>
                b_view(b.data_handle(), Kokkos::LayoutStride(get_size(), 1, n_equations, stride));
        auto b_host = create_mirror_view(Kokkos::DefaultHostExecutionSpace(), b_view);
        for (int i = 0; i < n_equations; ++i) {
            Kokkos::deep_copy(
                    Kokkos::subview(b_host, Kokkos::ALL, i),
                    Kokkos::subview(b_view, Kokkos::ALL, i));
        }
        int info;
        int const n = get_size();
        dgbtrs_(&transpose,
                &n,
                &m_kl,
                &m_ku,
                &n_equations,
                m_q.data(),
                &m_c,
                m_ipiv.data(),
                b_host.data(),
                &stride,
                &info);
        // Copy the solution back from the host mirror to the original rhs.
        for (int i = 0; i < n_equations; ++i) {
            Kokkos::deep_copy(
                    Kokkos::subview(b_view, Kokkos::ALL, i),
                    Kokkos::subview(b_host, Kokkos::ALL, i));
        }
        return info;
    }
};
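
A hypothetical usage sketch of the class above (assuming the Matrix base class exposes factorize and solve_inplace, as the diff later in this PR suggests; n and the rhs span b_2d are placeholders):

// Tridiagonal system (kl = ku = 1) filled on the host, then solved.
Matrix_Banded<Kokkos::DefaultExecutionSpace> mat(n, 1, 1);
for (int i = 0; i < n; ++i) {
    mat.set_element(i, i, 2.0);
    if (i > 0) {
        mat.set_element(i, i - 1, -1.0);
    }
    if (i < n - 1) {
        mat.set_element(i, i + 1, -1.0);
    }
}
mat.factorize();         // LU factorization through dgbtrf_
mat.solve_inplace(b_2d); // calls solve_inplace_method, i.e. dgbtrs_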

Collaborator (Author)

I made the change, but I am still wondering why we would want the matrix on the host: it contradicts the "put everything on ExecSpace" tendency I tried to follow with splines, and I think I will have to revert the change in a month or two when the solvers move to the GPU. I am OK with doing small PRs when possible, but undoing what has been done only to redo it later looks like a waste of time to me.

Member

The goal is to have a robust implementation with a fast solve_inplace_method. Therefore, if solving the matrix is done on the CPU, it is better to keep the matrix on the CPU rather than going back and forth between the CPU and the GPU.

When we are able to solve on the GPU, filling the matrix on the GPU will still be questionable. It might be faster to fill it in parallel on the CPU, factorize on the CPU again, and then push it to the GPU. Moreover, the Kokkos documentation does not mention whether calling Kokkos::deep_copy in parallel is allowed.

The size of the matrix is quite small in the end; having a copy of it on the CPU should not be a problem.

Resolved review threads on:
  • include/ddc/kernels/splines/matrix.hpp (2)
  • include/ddc/kernels/splines/matrix_center_block.hpp (2)
  • include/ddc/kernels/splines/matrix_corner_block.hpp (4)
@@ -48,14 +46,14 @@ void check_inverse(ddc::DSpan2D matrix, ddc::DSpan2D inv)
     for (std::size_t j(0); j < N; ++j) {
         double id_val = 0.0;
         for (std::size_t k(0); k < N; ++k) {
-            id_val += matrix(i, k) * inv(j, k);
+            id_val += matrix(i, k) * inv(k, j);
Member

Was it a silent bug?

Collaborator (Author)

No. Emily's original matrix classes are written using a non-conventional index notation: the first index is the column number and the second is the row number. I changed this in the current MR, so this is aligned with that change:

[image: screenshot of the corresponding change]

@blegouix (Collaborator, Author) commented Mar 13, 2024

For example, if you look there, this is unconventional:

assert(int(bx.extent(1)) == m_n);
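
To make the convention change concrete (a hypothetical sketch inferred from the comment above, not code from the MR):

// Non-conventional (original): first index = column, so the number of rows
// of the system is checked against the second extent.
assert(int(bx.extent(1)) == m_n);
// Conventional (after this MR): first index = row, so the same check becomes:
assert(int(bx.extent(0)) == m_n);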

Resolved review thread on CMakeLists.txt
Comment on lines 132 to 154
Kokkos::parallel_for(
        "solve_lambda_section",
        Kokkos::TeamPolicy<ExecSpace>(v.extent(1), Kokkos::AUTO),
        KOKKOS_LAMBDA(
                const typename Kokkos::TeamPolicy<ExecSpace>::member_type& teamMember) {
            const int j = teamMember.league_rank();

            Kokkos::parallel_for(
                    Kokkos::TeamThreadRange(teamMember, k_proxy),
                    [&](const int i) {
                        // Upper diagonals in lambda
                        for (int l = 0; l <= i; ++l) {
                            Kokkos::atomic_sub(&v(i, j), lambda_device(i, l) * u(l, j));
                        }
                        // Lower diagonals in lambda
                        for (int l = i + 1; l < k_proxy + 1; ++l) {
                            Kokkos::atomic_sub(
                                    &v(i, j),
                                    lambda_device(i, l) * u(nb_proxy - 1 - k_proxy + l, j));
                        }
                    });
        });
Member

You seem to have changed your approach to parallelisation compared to calculate_delta_to_factorize. The approach in calculate_delta_to_factorize seems simpler; do you think the hierarchical parallelisation is worth it?

@blegouix (Collaborator, Author) commented Mar 13, 2024

calculate_delta_to_factorize is performed once, on the CPU, during factorization. Here we need a performant matrix manipulation performed at each solver call.

Actually, the time spent in this kernel does not seem to be negligible in the first profiling tests I did, even with this optimization.

@tpadioleau (Member) commented Mar 13, 2024

I agree it needs to be efficient.

How far have you gone in the performance analysis?

  • We need to know whether we have coalesced accesses on the GPU. Are u and v really non-contiguous? If not, left or right layout?
  • Atomics have a cost; are they really necessary here?
  • What about the transfer to the device at the beginning of the function?

@blegouix (Collaborator, Author) commented Mar 15, 2024

  1. Yes, u and v are really non-contiguous, because the rhs structure is something like this:

uuuuuuuuuu
uuuuuuuuuu
uuuuuuuuuu
vvvvvvvvvvvv
vvvvvvvvvvvv

in Left layout.

  2. Looking at it again, I don't think atomics are necessary as long as we keep the loops on l sequential (but if we use TeamThread/ThreadVector ranges we can still use a scalar buffer to compute the sum on l before doing the subtraction). I will remove them; see the sketch after this list.

  3. In my original implementation, lambda was on the device. So no, I have not looked at the performance impact of that since the change.
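
A minimal sketch of the atomic-free variant described in point 2 (illustrative only, reusing the names from the kernel quoted above; not the PR's actual code):

Kokkos::parallel_for(
        "solve_lambda_section",
        Kokkos::TeamPolicy<ExecSpace>(v.extent(1), Kokkos::AUTO),
        KOKKOS_LAMBDA(
                const typename Kokkos::TeamPolicy<ExecSpace>::member_type& teamMember) {
            const int j = teamMember.league_rank();
            Kokkos::parallel_for(
                    Kokkos::TeamThreadRange(teamMember, k_proxy),
                    [&](const int i) {
                        // Scalar buffer accumulating the sum over l.
                        double sum = 0.0;
                        // Upper diagonals in lambda
                        for (int l = 0; l <= i; ++l) {
                            sum += lambda_device(i, l) * u(l, j);
                        }
                        // Lower diagonals in lambda
                        for (int l = i + 1; l < k_proxy + 1; ++l) {
                            sum += lambda_device(i, l) * u(nb_proxy - 1 - k_proxy + l, j);
                        }
                        // Each (i, j) is written by exactly one thread, so the
                        // Kokkos::atomic_sub calls can become a plain update.
                        v(i, j) -= sum;
                    });
        });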

Collaborator (Author)

Should I actually replace the sequential for loops on l with a ThreadVector range?

Member

No, I think this work would require its own PR, to study which execution policy better suits this loop.

Co-authored-by: Thomas Padioleau <[email protected]>
@blegouix mentioned this pull request Apr 8, 2024
Comment on lines +58 to +63
         ddc::DSpan2D_left b_2d(b.data_handle(), b.extent(0), 1);
         solve_transpose_inplace(b_2d);
         return b;
     }

-    virtual ddc::DSpan2D solve_multiple_inplace(ddc::DSpan2D const bx) const
+    virtual ddc::DSpan2D_stride solve_inplace(ddc::DSpan2D_stride const bx) const
Collaborator

What is the difference between DSpan2D_left and DSpan2D_stride?

@blegouix (Collaborator, Author) commented Apr 9, 2024

The same as between Kokkos::LayoutLeft and Kokkos::LayoutStride: a 2D LayoutLeft view is necessarily contiguous in the left dimension and enforces extent_left == stride_right, whereas LayoutStride allows arbitrary strides.
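
A small self-contained sketch of that difference (hypothetical names, not from the PR):

#include <cassert>

#include <Kokkos_Core.hpp>

int main(int argc, char** argv)
{
    Kokkos::ScopeGuard const guard(argc, argv);
    int const n = 4;
    int const m = 3;
    int const padded = 10;

    // 2D LayoutLeft: contiguous in the left dimension, stride(1) == extent(0).
    Kokkos::View<double**, Kokkos::LayoutLeft, Kokkos::HostSpace> a("a", n, m);
    assert(a.stride(0) == 1 && a.stride(1) == static_cast<size_t>(n));

    // LayoutStride: both strides are free, e.g. columns padded to length 10,
    // which LayoutLeft cannot express.
    Kokkos::View<double*, Kokkos::HostSpace> buffer("buffer", padded * m);
    Kokkos::View<double**, Kokkos::LayoutStride, Kokkos::HostSpace>
            b(buffer.data(), Kokkos::LayoutStride(n, 1, m, padded));
    assert(b.stride(1) == static_cast<size_t>(padded));
}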

Comment on lines +108 to +109
Kokkos::View<double**, Kokkos::LayoutStride, typename ExecSpace::memory_space>
b_view(b.data_handle(), Kokkos::LayoutStride(get_size(), 1, n_equations, stride));
Collaborator

Doesn't "ExecSpace" stand for "execution space", i.e. the place (CPU/GPU) where the calculation will be performed? For LAPACK functions you don't have a choice about this, so I'm not clear on why you parametrise by it. Is it simply a roundabout way of passing the memory space?

Collaborator (Author)

Because LAPACK is a placeholder for Kokkos Kernels. In the current branch we create a mirror on the CPU to call the solver, but in #276 we expect to solve on the GPU.

@blegouix closed this Jun 5, 2024
@tpadioleau (Member) commented:
@blegouix Same question here about the branch

@blegouix deleted the lapack_matrices branch June 5, 2024 17:26