
Lapack matrices #329

Closed · wants to merge 36 commits into from

Conversation

@blegouix (Collaborator) commented Mar 6, 2024

Add new matrices necessary for the LAPACK backend available in #274

Matrix classes (including matrix_sparse) support strided multiple-rhs.

@blegouix self-assigned this Mar 6, 2024
@tpadioleau (Member) left a comment

Review of matrix.hpp

Resolved review threads (4) on include/ddc/kernels/splines/matrix.hpp
@tpadioleau (Member) left a comment

Review of matrix_banded.hpp

Resolved review thread on include/ddc/kernels/splines/matrix_banded.hpp
Member

From what I understand of this matrix, it only uses the GPU as a storage place. No computation is done on the GPU, right? If that is the case, there is no need to store data on the GPU. Keep the template to indicate where the input data of solve_inplace_method will be located, but use host data directly for the LAPACK solver.

To summarize, the diff with the original implementation should be small. I suggest (see the sketch after this list):

  • replace the unique_ptr with a host View or a DualView
  • if you use a DualView, get_element and set_element should work only on the host view
  • if you use a DualView, factorize should also synchronize the device view
  • after calling factorize, get_element and set_element should not be called anymore
  • in solve_inplace_method, make a host copy of b and use it in the LAPACK function
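
A minimal sketch of the DualView variant, to make the suggestion concrete (illustrative only, not the PR's code; the class name and member set are hypothetical, only Kokkos::DualView and its modify/sync API are real):

#include <Kokkos_DualView.hpp>

template <class ExecSpace>
class MatrixSketch
{
    // One DualView instead of a unique_ptr: paired host and device copies of q.
    Kokkos::DualView<double*, typename ExecSpace::memory_space> m_q;

public:
    explicit MatrixSketch(int const size) : m_q("q", size) {}

    // get_element/set_element work only on the host view.
    double get_element(int const i) const
    {
        return m_q.h_view(i);
    }

    void set_element(int const i, double const aij)
    {
        m_q.h_view(i) = aij;
        m_q.modify_host(); // mark the host copy as the most recent one
    }

    // factorize also synchronizes the device view; after this call,
    // get_element/set_element should not be used anymore.
    void factorize()
    {
        // ... LAPACK factorization on m_q.h_view.data() ...
        m_q.sync_device();
    }
};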

@blegouix (Collaborator, Author) commented Mar 8, 2024

But I wanted the Kokkos Kernels branch to be as small as possible to facilitate Yuuichi's work, so this means I would need yet another MR to move the matrix to the GPU...? Can't we just keep it simple with the three MRs I have already opened? The LAPACK solver calls on CPU are just placeholders.

I don't like DualView very much, as in the end we need all the data on the GPU anyway, because that's where we will perform the operations (except factorization, for all matrices but sparse).

Abdelhadi also has his matrix on the GPU and fills it directly on the GPU (because his matrix changes and he needs performance), so using DualViews or a host view would definitely break the alignment between his class and mine.

Member

For now, the simpler approach should also be the more performant one. I strongly advise we take it; here is the code:

template <class ExecSpace>
class Matrix_Banded : public Matrix
{
protected:
    int const m_kl; // no. of subdiagonals
    int const m_ku; // no. of superdiagonals
    int const m_c; // no. of columns in q
    Kokkos::View<int*, Kokkos::HostSpace> m_ipiv; // pivot indices
    // TODO: double**
    Kokkos::View<double*, Kokkos::HostSpace> m_q; // banded matrix representation

public:
    Matrix_Banded(int const mat_size, int const kl, int const ku)
        : Matrix(mat_size)
        , m_kl(kl)
        , m_ku(ku)
        , m_c(2 * kl + ku + 1)
        , m_ipiv("ipiv", mat_size)
        , m_q("q", m_c * mat_size)
    {
        assert(mat_size > 0);
        assert(m_kl >= 0);
        assert(m_ku >= 0);
        assert(m_kl <= mat_size);
        assert(m_ku <= mat_size);

        /*
         * Given the linear system A*x=b, we assume that A is a square (n by n)
         * with ku super-diagonals and kl sub-diagonals.
         * All non-zero elements of A are stored in the rectangular matrix q, using
         * the format required by DGBTRF (LAPACK): diagonals of A are rows of q.
         * q has 2*kl rows for the subdiagonals, 1 row for the diagonal, and ku rows
         * for the superdiagonals. (The kl additional rows are needed for pivoting.)
         * The term A(i,j) of the full matrix is stored in q(i-j+2*kl+1,j).
         */

        Kokkos::deep_copy(m_q, 0.);
    }

    double get_element(int const i, int const j) const override
    {
        if (i >= std::max(0, j - m_ku) && i < std::min(get_size(), j + m_kl + 1)) {
            return m_q(j * m_c + m_kl + m_ku + i - j);
        } else {
            return 0.0;
        }
    }

    void set_element(int const i, int const j, double const aij) override
    {
        if (i >= std::max(0, j - m_ku) && i < std::min(get_size(), j + m_kl + 1)) {
            m_q(j * m_c + m_kl + m_ku + i - j) = aij;
        } else {
            assert(std::fabs(aij) < 1e-20);
        }
    }

protected:
    int factorize_method() override
    {
        int info;
        int const n = get_size();
        dgbtrf_(&n, &n, &m_kl, &m_ku, m_q.data(), &m_c, m_ipiv.data(), &info);
        return info;
    }

    int solve_inplace_method(ddc::DSpan2D_stride b, char const transpose) const override
    {
        assert(b.stride(0) == 1);
        int const n_equations = b.extent(1);
        int const stride = b.stride(1);

        // Wrap the (possibly device-resident) rhs in an unmanaged strided view,
        // then copy each rhs column to a host mirror for the LAPACK call.
        Kokkos::View<double**, Kokkos::LayoutStride, typename ExecSpace::memory_space>
                b_view(b.data_handle(), Kokkos::LayoutStride(get_size(), 1, n_equations, stride));
        auto b_host = create_mirror_view(Kokkos::DefaultHostExecutionSpace(), b_view);
        for (int i = 0; i < n_equations; ++i) {
            Kokkos::deep_copy(
                    Kokkos::subview(b_host, Kokkos::ALL, i),
                    Kokkos::subview(b_view, Kokkos::ALL, i));
        }
        int info;
        int const n = get_size();
        dgbtrs_(&transpose,
                &n,
                &m_kl,
                &m_ku,
                &n_equations,
                m_q.data(),
                &m_c,
                m_ipiv.data(),
                b_host.data(),
                &stride,
                &info);
        // Copy the solution back from the host mirror to the original rhs.
        for (int i = 0; i < n_equations; ++i) {
            Kokkos::deep_copy(
                    Kokkos::subview(b_view, Kokkos::ALL, i),
                    Kokkos::subview(b_host, Kokkos::ALL, i));
        }
        return info;
    }
};
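
A hypothetical usage sketch of the class above (assuming the Matrix base class exposes factorize and solve_inplace, as the diff later in this PR suggests; n and the rhs span b_2d are placeholders):

// Tridiagonal system (kl = ku = 1) filled on the host, then solved.
Matrix_Banded<Kokkos::DefaultExecutionSpace> mat(n, 1, 1);
for (int i = 0; i < n; ++i) {
    mat.set_element(i, i, 2.0);
    if (i > 0) {
        mat.set_element(i, i - 1, -1.0);
    }
    if (i < n - 1) {
        mat.set_element(i, i + 1, -1.0);
    }
}
mat.factorize();         // LU factorization through dgbtrf_
mat.solve_inplace(b_2d); // calls solve_inplace_method, i.e. dgbtrs_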

Collaborator (Author)

I made the change, but I am still wondering why we would want the matrix on the host: it contradicts the "put everything on ExecSpace" tendency I tried to follow with splines, and I think I will have to revert the change in a month or two when the solvers move to the GPU. I am OK with doing small PRs when possible, but undoing what has been done only to redo it later looks like a waste of time to me.

Member

The goal is to have a robust implementation with a fast solve_inplace_method. Therefore, if solving the matrix is done on the CPU, it is better to keep the matrix on the CPU rather than going back and forth between the CPU and the GPU.

When we are able to solve on the GPU, filling the matrix on the GPU will still be questionable. It might be faster to fill it in parallel on the CPU, factorize on the CPU again, and then push it to the GPU. Moreover, the Kokkos documentation does not mention whether calling Kokkos::deep_copy in parallel is allowed.

The size of the matrix is quite small in the end; having a copy of it on the CPU should not be a problem.

Resolved review threads on:
  • include/ddc/kernels/splines/matrix.hpp (2)
  • include/ddc/kernels/splines/matrix_center_block.hpp (2)
  • include/ddc/kernels/splines/matrix_corner_block.hpp (4)
@@ -48,14 +46,14 @@ void check_inverse(ddc::DSpan2D matrix, ddc::DSpan2D inv)
     for (std::size_t j(0); j < N; ++j) {
         double id_val = 0.0;
         for (std::size_t k(0); k < N; ++k) {
-            id_val += matrix(i, k) * inv(j, k);
+            id_val += matrix(i, k) * inv(k, j);
Member

Was it a silent bug?

Collaborator (Author)

No. Emily's original matrix classes are written using a non-conventional index notation: the first index is the column number and the second is the row number. I changed this in the current MR, so this is aligned with that change:

[image: screenshot of the corresponding change]

@blegouix (Collaborator, Author) commented Mar 13, 2024

For example, if you look there, this is unconventional:

assert(int(bx.extent(1)) == m_n);
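
To make the convention change concrete (a hypothetical sketch inferred from the comment above, not code from the MR):

// Non-conventional (original): first index = column, so the number of rows
// of the system is checked against the second extent.
assert(int(bx.extent(1)) == m_n);
// Conventional (after this MR): first index = row, so the same check becomes:
assert(int(bx.extent(0)) == m_n);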

Resolved review thread on CMakeLists.txt
Comment on lines 132 to 154
Kokkos::parallel_for(
        "solve_lambda_section",
        Kokkos::TeamPolicy<ExecSpace>(v.extent(1), Kokkos::AUTO),
        KOKKOS_LAMBDA(
                const typename Kokkos::TeamPolicy<ExecSpace>::member_type& teamMember) {
            const int j = teamMember.league_rank();

            Kokkos::parallel_for(
                    Kokkos::TeamThreadRange(teamMember, k_proxy),
                    [&](const int i) {
                        // Upper diagonals in lambda
                        for (int l = 0; l <= i; ++l) {
                            Kokkos::atomic_sub(&v(i, j), lambda_device(i, l) * u(l, j));
                        }
                        // Lower diagonals in lambda
                        for (int l = i + 1; l < k_proxy + 1; ++l) {
                            Kokkos::atomic_sub(
                                    &v(i, j),
                                    lambda_device(i, l) * u(nb_proxy - 1 - k_proxy + l, j));
                        }
                    });
        });
Member

You seem to have changed your approach to parallelisation compared to calculate_delta_to_factorize. The approach in calculate_delta_to_factorize seems simpler; do you think the hierarchical parallelisation is worth it?

@blegouix (Collaborator, Author) commented Mar 13, 2024

calculate_delta_to_factorize is performed once, on the CPU, during factorization. Here we need a performant matrix manipulation performed at each solver call.

Actually, the time spent in this kernel does not seem to be negligible in the first profiling tests I did, even with this optimization.

@tpadioleau (Member) commented Mar 13, 2024

I agree it needs to be efficient.

How far have you gone in the performance analysis?

  • We need to know whether we have coalesced accesses on the GPU. Are u and v really non-contiguous? If not, left or right layout?
  • Atomics have a cost; are they really necessary here?
  • What about the transfer to the device at the beginning of the function?

@blegouix (Collaborator, Author) commented Mar 15, 2024

  1. Yes, u and v are really non-contiguous, because the rhs structure is something like this:

uuuuuuuuuu
uuuuuuuuuu
uuuuuuuuuu
vvvvvvvvvvvv
vvvvvvvvvvvv

in Left layout.

  2. Looking at it again, I don't think atomics are necessary as long as we keep the loops on l sequential (but if we use TeamThread/ThreadVector ranges we can still use a scalar buffer to compute the sum on l before doing the subtraction). I will remove them; see the sketch after this list.

  3. In my original implementation, lambda was on the device. So no, I have not looked at the performance impact of that since the change.
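
A minimal sketch of the atomic-free variant described in point 2 (illustrative only, reusing the names from the kernel quoted above; not the PR's actual code):

Kokkos::parallel_for(
        "solve_lambda_section",
        Kokkos::TeamPolicy<ExecSpace>(v.extent(1), Kokkos::AUTO),
        KOKKOS_LAMBDA(
                const typename Kokkos::TeamPolicy<ExecSpace>::member_type& teamMember) {
            const int j = teamMember.league_rank();
            Kokkos::parallel_for(
                    Kokkos::TeamThreadRange(teamMember, k_proxy),
                    [&](const int i) {
                        // Scalar buffer accumulating the sum over l.
                        double sum = 0.0;
                        // Upper diagonals in lambda
                        for (int l = 0; l <= i; ++l) {
                            sum += lambda_device(i, l) * u(l, j);
                        }
                        // Lower diagonals in lambda
                        for (int l = i + 1; l < k_proxy + 1; ++l) {
                            sum += lambda_device(i, l) * u(nb_proxy - 1 - k_proxy + l, j);
                        }
                        // Each (i, j) is written by exactly one thread, so the
                        // Kokkos::atomic_sub calls can become a plain update.
                        v(i, j) -= sum;
                    });
        });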

Collaborator (Author)

Should I actually replace the sequential for loops on l with a ThreadVector range?

Member

No, I think this work would require its own PR, to study which execution policy better suits this loop.

Co-authored-by: Thomas Padioleau <[email protected]>
@blegouix mentioned this pull request Apr 8, 2024
Comment on lines +58 to +63
         ddc::DSpan2D_left b_2d(b.data_handle(), b.extent(0), 1);
         solve_transpose_inplace(b_2d);
         return b;
     }

-    virtual ddc::DSpan2D solve_multiple_inplace(ddc::DSpan2D const bx) const
+    virtual ddc::DSpan2D_stride solve_inplace(ddc::DSpan2D_stride const bx) const
Collaborator

What is the difference between DSpan2D_left and DSpan2D_stride?

@blegouix (Collaborator, Author) commented Apr 9, 2024

The same as between Kokkos::LayoutLeft and Kokkos::LayoutStride: a 2D LayoutLeft view is necessarily contiguous in the left dimension and enforces extent_left == stride_right, whereas LayoutStride allows arbitrary strides.
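
A small self-contained sketch of that difference (hypothetical names, not from the PR):

#include <cassert>

#include <Kokkos_Core.hpp>

int main(int argc, char** argv)
{
    Kokkos::ScopeGuard const guard(argc, argv);
    int const n = 4;
    int const m = 3;
    int const padded = 10;

    // 2D LayoutLeft: contiguous in the left dimension, stride(1) == extent(0).
    Kokkos::View<double**, Kokkos::LayoutLeft, Kokkos::HostSpace> a("a", n, m);
    assert(a.stride(0) == 1 && a.stride(1) == static_cast<size_t>(n));

    // LayoutStride: both strides are free, e.g. columns padded to length 10,
    // which LayoutLeft cannot express.
    Kokkos::View<double*, Kokkos::HostSpace> buffer("buffer", padded * m);
    Kokkos::View<double**, Kokkos::LayoutStride, Kokkos::HostSpace>
            b(buffer.data(), Kokkos::LayoutStride(n, 1, m, padded));
    assert(b.stride(1) == static_cast<size_t>(padded));
}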

Comment on lines +108 to +109
Kokkos::View<double**, Kokkos::LayoutStride, typename ExecSpace::memory_space>
b_view(b.data_handle(), Kokkos::LayoutStride(get_size(), 1, n_equations, stride));
Collaborator

Doesn't "ExecSpace" stand for "execution space", i.e. the place (CPU/GPU) where the calculation will be performed? For LAPACK functions you don't have a choice about this, so I'm not clear on why you parametrise by it. Is it simply a roundabout way of passing the memory space?

Collaborator (Author)

Because LAPACK is a placeholder for Kokkos Kernels. In the current branch we create a mirror on the CPU to call the solver, but in #276 we expect to solve on the GPU.

@blegouix closed this Jun 5, 2024
@tpadioleau (Member) commented:
@blegouix Same question here about the branch

@blegouix deleted the lapack_matrices branch June 5, 2024 17:26