Lapack matrices #329
Conversation
Review of matrix.hpp
Review of matrix_banded.hpp
From what I understand of this matrix, it only uses the GPU as a storage place; no computation is done on the GPU, right? If that is the case, there is no need to store the data on the GPU. Keep the template to indicate where the input data of `solve_inplace_method` will be located, but use host data directly for the LAPACK solver.
To summarize, the diff with the original implementation should be small. I suggest:
- replace `unique_ptr` with a host View or DualView
- if you use a DualView, `get_element` and `set_element` should work only on the host view
- if you use a DualView, `factorize` will also synchronize the device view
- after calling `factorize`, `get_element` and `set_element` should not be called anymore
- in `solve_inplace_method`, make a host copy of `b` and use it in the LAPACK function
But I wanted the Kokkos Kernels branch to be as small as possible to facilitate Yuuichi's work, so this means I would need yet another MR to move the matrix onto the GPU...? Can't we just keep it simple with the three MRs I have already opened? The LAPACK solver calls on CPU are just placeholders.
I don't like DualView very much, because in the end we need all the data on the GPU, since that is where we will perform the operations (except the factorization, for all matrices but `sparse`).
Abdelhadi also has his matrix on the GPU and fills it directly on the GPU (because his matrix changes and he needs performance), so using DualViews or a host view would definitely break the alignment between his class and mine.
For now, the simpler approach should also be the most performant. I strongly advise we take it; here is the code:
```cpp
template <class ExecSpace>
class Matrix_Banded : public Matrix
{
protected:
    int const m_kl; // no. of subdiagonals
    int const m_ku; // no. of superdiagonals
    int const m_c; // no. of columns in q
    Kokkos::View<int*, Kokkos::HostSpace> m_ipiv; // pivot indices
    // TODO: double**
    Kokkos::View<double*, Kokkos::HostSpace> m_q; // banded matrix representation

public:
    Matrix_Banded(int const mat_size, int const kl, int const ku)
        : Matrix(mat_size)
        , m_kl(kl)
        , m_ku(ku)
        , m_c(2 * kl + ku + 1)
        , m_ipiv("ipiv", mat_size)
        , m_q("q", m_c * mat_size)
    {
        assert(mat_size > 0);
        assert(m_kl >= 0);
        assert(m_ku >= 0);
        assert(m_kl <= mat_size);
        assert(m_ku <= mat_size);
        /*
         * Given the linear system A*x=b, we assume that A is a square (n by n)
         * matrix with ku super-diagonals and kl sub-diagonals.
         * All non-zero elements of A are stored in the rectangular matrix q, using
         * the format required by DGBTRF (LAPACK): diagonals of A are rows of q.
         * q has 2*kl rows for the subdiagonals, 1 row for the diagonal, and ku rows
         * for the superdiagonals. (The kl additional rows are needed for pivoting.)
         * The term A(i,j) of the full matrix is stored in q(i-j+2*kl+1,j).
         */
        Kokkos::deep_copy(m_q, 0.);
    }

    double get_element(int const i, int const j) const override
    {
        if (i >= std::max(0, j - m_ku) && i < std::min(get_size(), j + m_kl + 1)) {
            return m_q(j * m_c + m_kl + m_ku + i - j);
        } else {
            return 0.0;
        }
    }

    void set_element(int const i, int const j, double const aij) override
    {
        if (i >= std::max(0, j - m_ku) && i < std::min(get_size(), j + m_kl + 1)) {
            m_q(j * m_c + m_kl + m_ku + i - j) = aij;
        } else {
            assert(std::fabs(aij) < 1e-20);
        }
    }

protected:
    int factorize_method() override
    {
        int info;
        int const n = get_size();
        dgbtrf_(&n, &n, &m_kl, &m_ku, m_q.data(), &m_c, m_ipiv.data(), &info);
        return info;
    }

    int solve_inplace_method(ddc::DSpan2D_stride b, char const transpose) const override
    {
        assert(b.stride(0) == 1);
        int const n_equations = b.extent(1);
        int const stride = b.stride(1);
        Kokkos::View<double**, Kokkos::LayoutStride, typename ExecSpace::memory_space>
                b_view(b.data_handle(), Kokkos::LayoutStride(get_size(), 1, n_equations, stride));
        auto b_host = Kokkos::create_mirror_view(Kokkos::DefaultHostExecutionSpace(), b_view);
        for (int i = 0; i < n_equations; ++i) {
            Kokkos::deep_copy(
                    Kokkos::subview(b_host, Kokkos::ALL, i),
                    Kokkos::subview(b_view, Kokkos::ALL, i));
        }
        int info;
        int const n = get_size();
        dgbtrs_(&transpose,
                &n,
                &m_kl,
                &m_ku,
                &n_equations,
                m_q.data(),
                &m_c,
                m_ipiv.data(),
                b_host.data(),
                &stride,
                &info);
        for (int i = 0; i < n_equations; ++i) {
            Kokkos::deep_copy(
                    Kokkos::subview(b_view, Kokkos::ALL, i),
                    Kokkos::subview(b_host, Kokkos::ALL, i));
        }
        return info;
    }
};
```
I made the change, but I am still wondering why we would want the matrix on the host. It contradicts the "put everything on ExecSpace" tendency I tried to follow with the splines, and I think I will have to revert the change in a month or two when the solvers are on the GPU. I am fine with doing small PRs where possible, but undoing what has been done only to redo it later looks like a waste of time to me.
The goal is to have a robust implementation with a fast `solve_inplace_method`. Therefore, if solving the matrix is done on the CPU, it is better to keep the matrix on the CPU rather than going back and forth between the CPU and the GPU.
When we are able to solve on the GPU, filling the matrix on the GPU will still be questionable. It might be faster to fill it in parallel on the CPU, factorize on the CPU as well, and then push it to the GPU. Moreover, the Kokkos documentation does not mention whether calling `Kokkos::deep_copy` in parallel is allowed.
The matrix is quite small in the end, so keeping a copy of it on the CPU should not be a problem.
…od and revert the "layout logic" of the tests
Co-authored-by: Thomas Padioleau <[email protected]>
Co-authored-by: Thomas Padioleau <[email protected]>
Co-authored-by: Thomas Padioleau <[email protected]>
…ddc into lapack_matrices
```diff
@@ -48,14 +46,14 @@ void check_inverse(ddc::DSpan2D matrix, ddc::DSpan2D inv)
     for (std::size_t j(0); j < N; ++j) {
         double id_val = 0.0;
         for (std::size_t k(0); k < N; ++k) {
-            id_val += matrix(i, k) * inv(j, k);
+            id_val += matrix(i, k) * inv(k, j);
```
Was it a silent bug?
For example, if you look here, this is unconventional:
```cpp
assert(int(bx.extent(1)) == m_n);
Kokkos::parallel_for(
        "solve_lambda_section",
        Kokkos::TeamPolicy<ExecSpace>(v.extent(1), Kokkos::AUTO),
        KOKKOS_LAMBDA(
                const typename Kokkos::TeamPolicy<ExecSpace>::member_type& teamMember) {
            const int j = teamMember.league_rank();

            Kokkos::parallel_for(
                    Kokkos::TeamThreadRange(teamMember, k_proxy),
                    [&](const int i) {
                        // Upper diagonals in lambda
                        for (int l = 0; l <= i; ++l) {
                            Kokkos::atomic_sub(&v(i, j), lambda_device(i, l) * u(l, j));
                        }
                        // Lower diagonals in lambda
                        for (int l = i + 1; l < k_proxy + 1; ++l) {
                            Kokkos::atomic_sub(
                                    &v(i, j),
                                    lambda_device(i, l) * u(nb_proxy - 1 - k_proxy + l, j));
                        }
                    });
        });
```
You seem to have changed your parallelisation approach compared to `calculate_delta_to_factorize`. The approach in `calculate_delta_to_factorize` seems simpler; do you think the hierarchical parallelisation is worth it?
`calculate_delta_to_factorize` is performed once, on the CPU, during factorization. Here we need a performant matrix manipulation executed at each solver call.
Actually, the time spent in this kernel does not seem to be negligible in the first profiling tests I did, even with such an optimization.
I agree it needs to be efficient. How far have you gone in the performance analysis?
- We need to know if we have coalescent accesses on the GPU. Are `u` and `v` really non-contiguous? If not, left or right layout?
- Atomics have a cost; are they really necessary here?
- What about the transfer to the device at the beginning of the function?
- Yes, `u` and `v` are really non-contiguous, because the rhs structure is something like this (in Left layout):
  uuuuuuuuuu
  uuuuuuuuuu
  uuuuuuuuuu
  vvvvvvvvvvvv
  vvvvvvvvvvvv
- Looking at it again, I don't think atomics are necessary as long as we keep the loops on `l` sequential (and even if we use TeamThreads/ThreadVector, we can still use a scalar buffer to compute the sum over `l` before doing the subtraction). I will remove them.
- In my original implementation, lambda was on the device, so no, I have not looked at the performance impact of the transfer since the change.
Should I actually replace the sequential for loops on `l` with ThreadVector?
No, I think this work would require its own PR, to study which execution policy best suits this loop.
Co-authored-by: Thomas Padioleau <[email protected]>
Co-authored-by: Thomas Padioleau <[email protected]>
Co-authored-by: Thomas Padioleau <[email protected]>
…avoid race condition after atomics removal
```diff
     ddc::DSpan2D_left b_2d(b.data_handle(), b.extent(0), 1);
     solve_transpose_inplace(b_2d);
     return b;
 }

-virtual ddc::DSpan2D solve_multiple_inplace(ddc::DSpan2D const bx) const
+virtual ddc::DSpan2D_stride solve_inplace(ddc::DSpan2D_stride const bx) const
```
What is the difference between `DSpan2D_left` and `DSpan2D_stride`?
The same as between `Kokkos::LayoutLeft` and `Kokkos::LayoutStride`: a 2D LayoutLeft is necessarily contiguous in the left dimension and enforces `extent_left == stride_right`.
```cpp
Kokkos::View<double**, Kokkos::LayoutStride, typename ExecSpace::memory_space>
        b_view(b.data_handle(), Kokkos::LayoutStride(get_size(), 1, n_equations, stride));
```
Doesn't `ExecSpace` stand for "execution space", i.e. the place (CPU/GPU) where the calculation will be performed? For LAPACK functions you don't have a choice about this, so I'm not clear on why you parametrise by it. Is it simply a roundabout way of passing the memory space?
Because LAPACK is a placeholder for Kokkos Kernels. In the current branch we create a mirror on the CPU to call the solver, but in #276 we expect to solve on the GPU.
…ddc into lapack_matrices
@blegouix Same question here about the branch.
Add new matrices necessary for the LAPACK backend available in #274.
The matrix classes (including `matrix_sparse`) support strided multiple-rhs.